Building an effective Voice Agent for our own docs
Successfully resolving >80% of user inquiries
At ElevenLabs, we recently embedded a Conversational AI agent in our docs to help reduce the support burden for documentation-related questions (test it out here). Our support agent is now successfully handling over 80% of user inquiries across 200 calls per day. These results demonstrate the potential for AI to augment traditional documentation support while highlighting the continued importance of human support for complex queries. In this post, I'll detail the iterative process we followed, which you can use to replicate our success.
Our goals
We set out to build an agent that can:
- Resolve support questions that can be answered from the context of our product and support documentation
- Redirect users to relevant documentation sections
- Forward complex queries to email/discord support when needed
- Have a fluid and natural conversation, with low latency and realistic interruption handling
Results and Impact
We implemented two layers of evaluation:
(1) AI Evaluation Tooling: For each call, our built-in evaluation tooling runs through the finished conversation and evaluates whether the agent was successful. The criteria are fully customizable; we ask whether the agent solved the user's inquiry or was able to redirect them to a relevant support channel.
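The criteria are essentially short natural-language rubrics that an evaluation LLM applies to the finished transcript. As a minimal sketch of how ours could be expressed (the structure and field names below are illustrative, not the exact configuration schema):

```python
# Illustrative only: the real criteria are configured in the Conversational AI
# dashboard, and these field names are assumptions, not the actual schema.
evaluation_criteria = [
    {
        "id": "solved_user_inquiry",
        "prompt": (
            "Mark success if the agent answered the user's questions with "
            "relevant information, or redirected them to a relevant "
            "documentation page or support channel."
        ),
    },
    {
        "id": "hallucination_kb",
        "prompt": (
            "Mark failure if any answer about ElevenLabs products goes beyond "
            "the information available in the attached knowledge base."
        ),
    },
]
```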

We have been able to steadily improve the ability of the LLM to solve or redirect the inquiry successfully, reaching 80% according to our evaluation tooling.
This excludes calls with fewer than one conversational turn, i.e. calls where the caller raised no question or issue.
It's important to keep in mind that not every support query or question can be solved by an LLM, especially at a startup that builds fast and innovates constantly, and whose users are extremely technical and creative. As an additional disclaimer, the evaluation LLM itself will not judge every conversation correctly.
(2) Human Validation: To cross-check the efficacy of our LLM evaluation tooling, we conducted a human validation of 150 conversations, using the same evaluation criteria provided to the LLM tooling:
- solved_user_inquiry: defined as success when the agent answered the user's questions with relevant information or was able to redirect them to the relevant page / support channel.
- The LLM and the Human agreed on 81% of cases
- hallucination_kb: this criterion checks the final transcript and verifies whether the answers given by the LLM about ElevenLabs products adhere to the information in the knowledge base or go beyond it.
- The LLM and the Human agreed on 83% of cases
The human evaluation also revealed that 89% of relevant support questions were answered or redirected correctly by the Documentation agent.
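For clarity, the agreement figures above come down to a per-conversation comparison of the two verdicts. A minimal sketch, assuming one boolean verdict per conversation from the LLM judge and one from the human reviewer:

```python
def agreement_rate(llm_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of conversations on which the LLM judge and human reviewer agree."""
    assert len(llm_verdicts) == len(human_verdicts)
    agreed = sum(l == h for l, h in zip(llm_verdicts, human_verdicts))
    return agreed / len(llm_verdicts)

# Over the 150 reviewed conversations this came out to ~0.81 for
# solved_user_inquiry and ~0.83 for hallucination_kb.
```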
Other findings:
- Several callers just wanted to play around and try talking in different languages without asking a support question.
- Currently, our Conversational AI supports various languages, but the language has to be defined at the start of the conversation (see the sketch after this list).
- Several callers engaged in conversations unrelated to the agent's objective of discussing ElevenLabs, its products, and its documentation. Prompt guardrails helped most of the time, but not always.
- Several callers were looking for coding or debugging support.
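Because the language cannot be switched mid-call, it has to be fixed when the session starts. A minimal sketch of what that start-of-conversation override might look like; the exact schema is an assumption, so check the current Conversational AI docs:

```python
# Hypothetical session-start override: the agent's language is chosen here and
# stays fixed for the whole conversation. Field names are illustrative.
conversation_overrides = {
    "agent": {
        "language": "es",  # serve this caller in Spanish
        "first_message": "Hola, soy el agente de documentación de ElevenLabs.",
    }
}
```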
Strengths and Limitations
Strengths
The LLM-powered agent is adept at resolving clear and specific questions that can be answered with our documentation, pointing callers to the relevant documentation, and providing some initial guidance on more complex queries. In most of these cases, the agent provides quick, straightforward, and correct answers that are immediately helpful.
Questions include:
- Does ElevenLabs have an API endpoint for deleting a voice?
- How can I configure conversation overrides in my agent?
- How do I integrate with telephony?
- Does ElevenLabs support the Spanish language?
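For the first question above, for instance, the answer reduces to a single API call, which the agent can describe directly. A quick sketch (endpoint per the public ElevenLabs API reference; the voice ID and key are placeholders):

```python
import requests

# Delete a voice by its ID via DELETE /v1/voices/{voice_id}.
response = requests.delete(
    "https://api.elevenlabs.io/v1/voices/VOICE_ID",  # placeholder voice ID
    headers={"xi-api-key": "YOUR_API_KEY"},          # placeholder API key
)
response.raise_for_status()
```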
Recommendations:
- Target an audience that will mostly have clear / specific questions that an LLM with documentation and tools is good at answering.
- Leverage redirects to other channels for vague questions or those requiring deeper investigation; this helps a lot (see the tool sketch after this list).
- Add evaluation tooling to capture every question asked, monitor them, and fold the learnings back into the prompt. Also add evaluation criteria for success and for hallucinations/deviations from the knowledge base.
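To make the redirect recommendation concrete, one approach is to give the agent a tool it can call with a destination and hand the caller off. The tool name and parameter schema below are hypothetical, not our exact configuration:

```python
# Hypothetical tool definition; name and parameters are illustrative only.
redirect_tool = {
    "name": "redirect_user",
    "description": (
        "Redirect the caller to a documentation page, email support, or the "
        "Discord community when the question cannot be answered directly."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "destination": {
                "type": "string",
                "enum": ["docs_page", "email_support", "discord"],
            },
            "reason": {"type": "string"},
        },
        "required": ["destination"],
    },
}
```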
Limitations
On the flip side, the agent is less helpful with account issues, pricing/discount questions, or non-specific questions that would benefit from deeper investigation or querying. It also struggles with vague, generic issues: despite being prompted to ask clarifying questions, the LLM usually favours answering with something from the documentation that merely seems relevant.
Questions include:
- The verification step of my PVC is repeatedly failing. Why?
- How much will an AI agent cost? Can I have a discount?