The day I watched the news about OpenAI’s Realtime API, I said to myself, “This new voice agent is really cool; it’s going to change how the AI community interacts with AI chatbots.”
Not long ago, I developed a voice assistant, but the experience was not very good, for two main reasons: the first is the AI’s ability to understand (which has improved greatly since the release of LLMs), and the second is latency.
Voice assistants usually go through three steps: convert the user’s speech to text, process the text with an AI model to generate a reply, and convert the reply back to speech to play on the device.
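To make the flow concrete, here is a minimal Python sketch of that three-step pipeline. The function names and stub bodies are my own illustration, not TEN’s API; in a real agent each stage would call an actual STT, LLM, and TTS service.

import time

def speech_to_text(audio_chunk: bytes) -> str:
    # Stub: a real agent would call an STT service such as Deepgram here.
    return "what is the weather today"

def generate_reply(user_text: str) -> str:
    # Stub: a real agent would call an LLM with the conversation history here.
    return f"You asked: {user_text}"

def text_to_speech(reply_text: str) -> bytes:
    # Stub: a real agent would call a TTS service such as ElevenLabs here.
    return reply_text.encode("utf-8")

def handle_turn(audio_chunk: bytes) -> bytes:
    start = time.perf_counter()
    text = speech_to_text(audio_chunk)   # step 1: speech -> text
    reply = generate_reply(text)         # step 2: text -> reply
    audio = text_to_speech(reply)        # step 3: reply -> speech
    print(f"turn latency: {(time.perf_counter() - start) * 1000:.1f} ms")
    return audio

handle_turn(b"\x00\x01")  # fake audio bytes, just to run the pipeline end to end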
One of the big problems I faced was latency, which leads to a poor user experience. Human-to-human conversation has delays of hundreds or even tens of milliseconds; when an agent’s response time is kept in that range, the bad experience disappears and is replaced by a better user experience and real product value.
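As a rough back-of-the-envelope illustration (every number below is an assumption, not a measurement), you can see why a naive sequential pipeline feels slow:

# Illustrative per-stage delays for a naive, sequential pipeline (all values assumed, in ms).
budget = {
    "end-of-speech detection": 300,  # waiting to confirm the user stopped talking
    "speech-to-text": 300,
    "LLM first token": 500,
    "text-to-speech": 300,
    "network overhead": 100,
}
print(f"total response delay: {sum(budget.values())} ms")  # 1500 ms, far above natural turn-taking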
With the arrival of the Realtime API, I think there will be a new dawn for voice assistants, not to mention that it takes real brainwork to integrate all these technologies seamlessly to create a human-like interactive experience.
However, here’s the good news! With TEN (Transformative Extensions Network), developers no longer have to rack their brains. TEN is a real-time voice agent framework for creating conversational AI agents. It not only reduces development pain points but also lets users naturally interrupt the TEN Agent mid-conversation, making it easy to build the next generation of AI applications from scratch.
So, let me give you a quick demo of a live chatbot to show you what I mean.
I went to the TEN Agent GitHub page and clicked the link to explore the chatbot demo. I selected an agent demo from options like Voice Agent Gemini, a voice agent with Dify (custom STT/TTS), and a voice storyteller with an image generator. After selecting the demo, I kept English, the default conversation language; even with a default language selected, TEN can still understand other languages. Then I clicked a small icon to configure the greeting message and prompt to create an AI tutor assistant. Finally, I clicked ‘Connect’ to start the conversation.
Stay tuned until the end, because I will show you how to build your own voice agent with a local deployment.
When I started chatting with the agent, it used automatic voice detection to determine when I had finished speaking, which made the interaction feel natural. The agent then converted the audio into text, achieving high-quality semantic understanding with fast real-time performance. Next, it used an LLM to understand my intent and generate a natural text answer.
During this process, the agent planned reasoning steps to break the problem down. Finally, it converted the output text into natural speech.
In this article, I’ll quickly go over the documentation so you’re 100% up to speed on what TEN Agent is, what features it offers, and how it works. We’ll also install the application step by step, with everything on screen so you can copy, paste, and adapt it for your own use.
Before we start! 🦸🏻♀️
If you like this topic and want to support me:
Like my article; that will really help me out. 👏
Follow me on my YouTube channel.
Subscribe to get my latest articles.
What is TEN-Agent :
The TEN Framework is an open-source tool that helps developers quickly create real-time multimodal agents. These agents can work with voice, video, data streams, images, and text. It makes it simple for developers to try out ideas, use large language models, and build features they can reuse.
With TEN, you can create voice chatbots, AI tools for meeting notes, language learning apps, and much more.
TEN makes it easy for you! It gives you various AI tools and features to design, test, and launch advanced AI agents that can think, listen, see, and interact in real time.
Main Features :
1. Multimodal interaction :
It supports voice, text, and image interaction, providing more natural human-computer communication.
It suits scenarios such as intelligent customer service and real-time voice assistants.
2. Real-time communication :
It integrates RTC technology to achieve low-latency voice and video interaction, ensuring a smooth user experience.
High-performance real-time communication works without additional configuration.
3. Extensive feature support :
Weather query: You can quickly obtain current or future weather information.
Web search: Helps users quickly find the information they need.
Visual Recognition: Ability to process and analyze image content.
4. Dynamic response and state management :
Provides real-time agent state management, enabling AI agents to respond dynamically to user interactions.
5. Edge computing and cloud support :
It supports both edge computing and cloud deployment to meet the needs of different application scenarios.
6. Prebuilt modules for STT, LLM, and TTS:
Prebuilt modules for Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS) enable quick iteration.
As shown in the diagram below, many ready-to-use extensions are supported, with more on the way.
Developers can easily customize these modules to suit specific needs.
How it works :
The TEN Agent architecture has four core extensions: the ASR Extension, LLM Extension, TTS Extension, and RTC Extension. The TEN framework is built around extensions, which can be written in different programming languages and combined to create apps.
A Graph in the TEN framework describes how data moves between extensions: it controls the data flow and defines the connections. For example, it can route Speech-to-Text (STT) output to a Large Language Model (LLM) for further processing; a simplified sketch follows the directory link below.
Path to directory (all graphs): [link]
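To give a feel for what a graph describes, here is a simplified, hypothetical graph expressed as a Python dict. The real graphs live in property.json and use a richer schema; the node names and addon names below are my own illustration.

# A toy graph: STT output feeds the LLM, and the LLM's reply feeds TTS.
graph = {
    "name": "voice_assistant",
    "nodes": [
        {"type": "extension", "name": "stt", "addon": "deepgram_asr"},   # hypothetical addon names
        {"type": "extension", "name": "llm", "addon": "openai_chatgpt"},
        {"type": "extension", "name": "tts", "addon": "elevenlabs_tts"},
    ],
    "connections": [
        {"from": "stt", "to": "llm"},
        {"from": "llm", "to": "tts"},
    ],
}

for edge in graph["connections"]:
    print(f"{edge['from']} -> {edge['to']}")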
When you start a conversation with an agent, you can send audio, video, or screenshots over a real-time communication (RTC) network. The RTC network acts like a highway, carrying data to the backend server, while Real-Time Messaging (RTM) is an SDK that lets you send instant signaling messages.
Path to Directory: [link]
When you start speaking, the agent needs to know when you’ve finished. It uses automatic voice detection (the interrupt_detector extension) to identify when the audio ends; a toy sketch of the idea follows the directory link below. Once the end of speech is detected, the system collects the message and converts it into text using Deepgram.
Path to Directory: [link]
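TEN’s interrupt_detector is more sophisticated than this, but a toy energy-based endpoint detector shows the basic idea: treat a run of quiet audio frames as the end of the user’s speech. The frame size, threshold, and fake audio below are illustrative assumptions.

import struct

SILENCE_THRESHOLD = 500    # assumed mean amplitude below which a frame counts as quiet
SILENT_FRAMES_TO_STOP = 5  # assumed number of consecutive quiet frames = end of speech

def frame_energy(frame: bytes) -> float:
    # Mean absolute amplitude of 16-bit little-endian PCM samples.
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def find_end_of_speech(frames: list) -> int:
    quiet = 0
    for i, frame in enumerate(frames):
        quiet = quiet + 1 if frame_energy(frame) < SILENCE_THRESHOLD else 0
        if quiet >= SILENT_FRAMES_TO_STOP:
            return i  # index of the frame where we decide the user has finished
    return -1  # no end of speech detected yet

# Fake stream: ten loud frames followed by six quiet ones.
loud = struct.pack("<4h", 4000, -4000, 3000, -3000)
quiet_frame = struct.pack("<4h", 10, -10, 5, -5)
print(find_end_of_speech([loud] * 10 + [quiet_frame] * 6))  # prints 14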
The text is then passed to a large language model to process and generate a response, and the response is converted back into speech using ElevenLabs (TTS).
Path to Directory: [Link]
Finally, when the agent completes its task, the response travels back through the RTC network to the user’s app.
Everything is managed by the TEN Manager, which handles tasks like uploading, sharing, and installing extensions. It automatically manages dependencies and their environments, which makes installing and publishing extensions easy and helps developers work more efficiently within the TEN framework.
Step-by-Step Process:
Let’s run the TEN Agent together. Before we start, make sure you have Docker and Docker Compose installed, obtain an Agora App ID and App Certificate (if the certificate is enabled in the Agora console), and get an OpenAI API key, along with API keys for Deepgram ASR and FishAudio TTS. You’ll fill these into the .env file:
AGORA_APP_ID=
AGORA_APP_CERTIFICATE=
DEEPGRAM_API_KEY=
FISH_AUDIO_TTS_KEY=
OPENAI_API_KEY=
Once you have prepared all the API keys, open a terminal and clone the TEN Agent repository.
git clone https://github.com/TEN-framework/TEN-Agent.git
In the project root directory, create the .env file from the example:
cp .env.example .env
Then open the .env file and fill in the required API keys and configuration.
Start the containers. In the project root directory, run:
docker compose up
Or start them in detached mode:
docker compose up -d
After that, open another terminal window, enter the container, and build the agent:
docker exec -it ten_agent_dev bash
Once you are inside the container, build the agent:
task use
Once the build is complete, start the server, which runs on port 8080:
task run
Then open http://localhost:3000/ in your browser to start using TEN Agent.
Once the browser opens, we select a graph. Different types are available: voice_assistant, voice assistant real-time, and storyteller. What draws my attention is the storyteller, a use case for developers who want to quickly build a storyteller with voice interaction; they can use it directly and customize it.
Our roadmap also has more use cases, such as computer use, virtual beings with avatars, and PSTN for phones.
First, we customize our extension modules. Here, we will use Deepgram for speech-to-text and a large language model to process the data. We also have function calling, where we can use tools like Bing Search or a vision tool. Finally, we will use ElevenLabs for text-to-speech, but feel free to customize these choices based on your preferences.
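For context on how that function-calling piece usually works: the agent describes its tools to the LLM in a JSON schema, and the model decides when to call one. The sketch below uses the OpenAI-style tools format; the tool name and parameters are my own illustration, not the actual configuration of TEN’s Bing Search extension.

# An OpenAI-style tool definition the agent could pass along with the chat request.
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",  # hypothetical tool name, for illustration only
            "description": "Search the web and return the top results.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query."}
                },
                "required": ["query"],
            },
        },
    }
]
# When the model responds with a tool call, the agent runs the search and
# feeds the result back to the model so it can compose the final answer.
print(tools[0]["function"]["name"])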
Once you save the changes, let’s connect the agent and test our chatbot.
But let me tell you what surprised me when I played with TEN Agent. I think many of us will like this: there is a cool feature where you can have a conversation with a customized avatar. If you’re as curious as I was, give me a few seconds to show you what I mean.
In this demo, you’ll see one of TEN Agent’s team members having a conversation with an avatar, asking her to dance and even change the music and background. Not only that, they also made another demo with a dog avatar, where they asked it to change its accent. I’ll let you listen to the conversation.
But here’s the kicker: TEN’s Graph Designer is an excellent tool for designing conversation flows with a drag-and-drop interface. It connects to the property.json and manifest.json files in the backend, where demuxers, muxers, and FFmpeg are used to manage and work with video and audio files.
Demuxer: If you only want the audio from a video file, you need a demuxer.
Muxer: If you edit a video and add new audio, you use a muxer to package them together.
We also use FFmpeg to extract audio (demux), combine streams (mux), or even convert the file to a different format. This feature will be launched soon.
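If you want to try the underlying idea yourself, FFmpeg already covers both directions. Here is a minimal sketch via Python’s subprocess, assuming ffmpeg is installed and using placeholder file names (the stream copy also assumes the MP4’s audio track is AAC):

import subprocess

# Demux: extract the audio track from a video without re-encoding.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-vn", "-c:a", "copy", "audio.m4a"],
    check=True,
)

# Mux: package an edited video stream together with a new audio track.
subprocess.run(
    ["ffmpeg", "-i", "video.mp4", "-i", "new_audio.m4a",
     "-map", "0:v:0", "-map", "1:a:0", "-c", "copy", "output.mp4"],
    check=True,
)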
Guys, this project is really awesome! I’ve always thought about creating something like this, and when I found TEN Agent, it helped me a lot and saved me time.
I can mimic the project and adapt it for my own use case. It’s definitely worth a GitHub star to support open source and useful content. In fact, after a short time with it, I really want to shout:
“All Agents Need TEN Agent Real-time.”
Conclusion :
TEN-Agent is not only a powerful multimodal AI agent framework but also gives developers the tools to build efficient, real-time interactive applications. Its rich functionality and flexible application scenarios make it an ideal choice for enterprises and individual developers building the next generation of voice AI applications.
🧙♂️ I am a Generative AI expert! If you want to collaborate on a project, drop an inquiry here or book a 1-on-1 consulting call with me.