Anatomy of a Real-Time Voice Agent

Introduction

In the rapidly evolving landscape of conversational AI, it is easy to get lost in the specifics of its applications — whether it’s a voice agent for healthcare or an appointment scheduler. For this case study, I chose to pivot from the application to the abstraction.

Key insight

The voice interface is not the product — the orchestration pipeline behind it is. Understanding this architecture is what separates a prototype from a production-grade voice agent.

Rather than detailing a single use case, I am focusing on the fundamental architecture that powers the entire ecosystem. Why? Because beneath the surface, nearly every modern voice-agent platform relies on the same core structural patterns. The distinction between a ‘therapist bot’ and a ‘sales bot’ often comes down to just three variable components: the system prompt, the tool-execution layer, and the SaaS wrapper.

Fig. 01 · High-level architecture of a real-time voice agent, from microphone to speaker

The data journey

Every turn is a relay race. Audio leaves the user, fans out through the orchestrator to each specialist service, and comes back as speech. Here’s a single packet making the full round-trip:

Fig. 02 · Data journey: one full turn in 12 steps, starting with User → Orchestrator. The audio stream passes between the User (you), the Orchestrator (the conductor), STT (the ears), the LLM (the brain), RAG (the memory), and TTS (the mouth).

Streaming voicebots vs. turn-based chatbots

While voicebots and chatbots share the same “brain” (the LLM), their nervous systems are entirely different. To the end-user, the difference is just the medium. To an engineer, it is the difference between a discrete request-response cycle and a continuous full-duplex stream.

Feature	Chatbot (discrete)	Voicebot (streaming)
Interaction	Turn-based	Full-duplex (simultaneous)
Protocol	HTTP / REST	WebSockets / gRPC / SIP
Latency	1–3s (accepted)	< 500ms (critical)
Data flow	User types → wait → reply	Continuous audio stream
State	Stateless (per request)	Persistent & stateful

Latency target

For a voice agent to feel natural, the time to first byte (TTFB, the delay before the first audio of the reply reaches the user) must be under 500ms. Human conversations have a natural turn-taking gap of ~200–300ms. Anything above 800ms feels like “thinking”; above 1.5s feels broken.

Core components

Every real-time voice agent is built from six core modules. Think of them as LEGO blocks — interchangeable, independently upgradable, but all essential to the final product.

01 · The ears
Speech-to-Text (STT)
Streams raw audio into text, returning interim results within milliseconds of the user speaking.
Deepgram · Whisper · AssemblyAI

02 · The brain
Large Language Model (LLM)
Reads the transcript and streams its reply token-by-token, so the mouth can start before it finishes thinking.
GPT-4 · Llama-3 · Claude

03 · The mouth
Text-to-Speech (TTS)
Turns streaming text into natural audio, speaking sentence one while sentence two is still being written.
ElevenLabs · Deepgram · Sarvam

04 · The traffic cop
Voice Activity Detection (VAD)
Separates speech from silence and noise, and decides whether a pause is a comma or a full stop.
Turn-taking

05 · The conductor
The orchestrator
The backend that wires every module together, holds conversation state, and handles interrupts.
Node · Python · Go

06 · The memory
Retrieval (RAG)
Pulls relevant context from a vector database and injects it into the prompt before the LLM answers.
Pinecone · Qdrant

01 · Speech-to-Text (STT)

The ears. The STT engine (e.g. Deepgram, Whisper, AssemblyAI) converts raw audio streams into text. In a real-time architecture, standard transcription is insufficient — the engine must support streaming transcription via WebSockets, returning “interim results” (partial text) milliseconds after the user speaks. The final transcript is then ready almost instantly after they stop talking.
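To make the interim-versus-final distinction concrete, here is a minimal sketch of how an orchestrator might consume streaming results. The `SttEvent` shape and the `TranscriptBuffer` class are hypothetical; real engines each define their own payload fields, so treat the names as assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SttEvent:
    text: str
    is_final: bool  # hypothetical flag; real engines name and shape this differently


@dataclass
class TranscriptBuffer:
    preview: str = ""  # latest interim guess, safe to overwrite
    finals: List[str] = field(default_factory=list)

    def on_event(self, event: SttEvent) -> None:
        if event.is_final:
            self.finals.append(event.text)  # commit: this is what the LLM sees
            self.preview = ""
        else:
            self.preview = event.text  # interim results replace, never append


buffer = TranscriptBuffer()
for event in [
    SttEvent("where", False),
    SttEvent("where is my", False),
    SttEvent("Where is my order?", True),
]:
    buffer.on_event(event)

print(buffer.finals)  # ['Where is my order?']
```

Interim text is useful for live captions or early intent guesses, but only the final transcript is committed to the conversation.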

02 · Large Language Model (LLM)

The brain. The LLM receives the STT transcript and generates a response. For voice agents, the LLM must support streaming token output: emitting tokens one at a time rather than waiting for the full response. This is what lets the TTS start speaking before the LLM has finished “thinking.”

03 · Text-to-Speech (TTS)

The mouth. TTS engines like ElevenLabs, Deepgram, or Sarvam AI convert text into natural-sounding audio. Modern systems accept streaming text input and produce streaming audio output — meaning they can start speaking the first sentence while the LLM is still generating the second.

04 · Voice Activity Detection (VAD)

The traffic cop. VAD is the unsung hero of voice interfaces. Its sole job is to distinguish human speech from silence, background noise, or non-speech vocalizations (like heavy breathing). A sophisticated VAD doesn’t just detect sound; it manages the turn-taking logic — deciding whether a pause is a comma (keep listening) or a full stop (start processing), so the agent doesn’t constantly interrupt the user.
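The comma-versus-full-stop decision can be sketched with a simple energy threshold plus a silence timer. This is only an illustration: production VADs are usually model-based, and the function name, threshold, frame size, and 600ms cutoff below are all assumed values.

```python
from typing import Iterable


def detect_turn_end(
    frame_energies: Iterable[float],
    speech_threshold: float = 0.02,  # assumed energy level that counts as speech
    frame_ms: int = 20,              # assumed frame duration
    end_of_turn_ms: int = 600,       # assumed pause length that ends a turn
) -> int:
    """Return the index of the frame where the turn ends, or -1 if it never does."""
    silence_ms = 0
    heard_speech = False
    for index, energy in enumerate(frame_energies):
        if energy >= speech_threshold:
            heard_speech = True
            silence_ms = 0  # any speech resets the pause timer (a "comma")
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= end_of_turn_ms:
                return index  # long enough pause: a "full stop"
    return -1


# 100ms lead-in silence, speech, a 200ms pause, more speech, then a long silence
frames = [0.0] * 5 + [0.1] * 10 + [0.0] * 10 + [0.1] * 10 + [0.0] * 40
print(detect_turn_end(frames))  # 64
```

The short 200ms pause is ignored, while the trailing silence ends the turn once it reaches 600ms.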

05 · The orchestrator

The conductor. The orchestrator is the central nervous system that binds the components together — usually a custom backend service (Node.js / Python / Go) that manages the state of the conversation.

Pipeline management: it pipes the output of the STT directly into the LLM, and the output of the LLM into the TTS.
State management: it handles conversation history, executes external tool calls (APIs), and manages authentication.
Interrupt handling: it listens for “barge-in” events and instantly kills the audio stream if the user speaks.

python

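# Simplified orchestrator flow. `stt`, `llm`, and `tts` are placeholder
# clients for whichever streaming engines are plugged in.
async def handle_audio_stream(websocket):
    async for audio_chunk in websocket:
        # 1. Feed each chunk to STT; interim (partial) results are skipped
        transcript = await stt.transcribe(audio_chunk)
        if not transcript.is_final:
            continue

        # 2. Feed the final transcript to the LLM (streaming)
        sentence = ""
        async for token in llm.stream(transcript.text):
            sentence += token
            # 3. Feed whole sentences to TTS (streaming) for natural intonation
            if sentence.rstrip().endswith((".", "!", "?")):
                await websocket.send(await tts.synthesize(sentence))
                sentence = ""

        # Flush a trailing fragment that lacks closing punctuation
        if sentence.strip():
            await websocket.send(await tts.synthesize(sentence))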

06 · Retrieval-Augmented Generation (RAG)

The memory. The LLM provides reasoning and language, but it lacks specific knowledge about your business or your user’s data. RAG bridges that gap by retrieving relevant context from a vector database (Pinecone, Qdrant, etc.) in real-time. Before the LLM answers, the system searches the knowledge base for semantic matches to the query and injects that context into the prompt — improving accuracy and reducing hallucinations.
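The retrieve-then-inject step can be sketched in a few lines. To stay self-contained, this example swaps the embedding model and vector database for a toy bag-of-words vector and a linear scan, so it shows the flow rather than a realistic retriever. The function names and the knowledge-base entries are made up for illustration.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Rank documents by similarity to the query (linear scan, not a vector DB)."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, documents: List[str]) -> str:
    """Inject the retrieved context ahead of the user's question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nUser: {query}"


# Made-up knowledge base entries, for illustration only
docs = [
    "Refunds are processed within 5 business days.",
    "Our support line is open from 9am to 6pm.",
]
print(build_prompt("how long do refunds take", docs))
```

In a voice pipeline the retrieval call sits on the critical path, so its latency counts against the same budget as STT and the LLM.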

The barge-in problem

Barge-in is what happens when the user interrupts the agent mid-sentence. In human conversation this is natural — we do it constantly. For a voice agent, it’s an engineering nightmare.

When the user barges in, the system must do four things at once: stop TTS playback, flush any queued audio from the buffer, capture the new utterance via STT, and route the new transcript to the LLM with the correct context. All of this has to happen within ~100ms, or the user hears overlapping audio.

python

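# Barge-in handling. `tts`, `llm`, `stt`, `audio_buffer`, and `conversation`
# are placeholder objects owned by the orchestrator.
async def on_user_speech_detected():
    # 1. Immediately stop TTS playback and drop any queued audio
    tts.stop()
    audio_buffer.flush()

    # 2. Cancel pending LLM generation so no new audio is produced
    llm.cancel_current_stream()

    # 3. Log only what the user actually heard, so the history stays accurate
    partial = tts.get_spoken_text_so_far()
    conversation.add_partial_response(partial)

    # 4. Resume listening for the new utterance
    await stt.resume()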
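The handler above uses placeholder objects, so here is a runnable sketch of the same idea using only `asyncio`: playback runs as a task, and barge-in cancels it. The `speak` and `demo` functions and the timings are assumptions chosen for the demonstration, with `asyncio.sleep` standing in for real audio playback.

```python
import asyncio
from typing import List


async def speak(sentences: List[str], spoken: List[str]) -> None:
    """Pretend to play sentences one by one; each takes 100ms."""
    for sentence in sentences:
        await asyncio.sleep(0.1)  # stands in for audio playback time
        spoken.append(sentence)   # record what the user actually heard


async def demo() -> List[str]:
    spoken: List[str] = []
    playback = asyncio.create_task(speak(["One.", "Two.", "Three."], spoken))
    await asyncio.sleep(0.25)     # the user starts talking mid-response
    playback.cancel()             # barge-in: stop the agent right away
    try:
        await playback
    except asyncio.CancelledError:
        pass                      # cancellation is the expected outcome here
    return spoken                 # partial response to log in the history


result = asyncio.run(demo())
print(result)
```

Because the third sentence never finishes playing, it never reaches the conversation history, which mirrors step 3 of the handler.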

Echo cancellation adds another layer. The user’s microphone picks up the agent’s own voice from the speaker, creating a feedback loop. Without acoustic echo cancellation (AEC), the STT engine would transcribe the agent’s output as user speech and loop forever.

Echo cancellation

Most modern browsers have built-in Acoustic Echo Cancellation (AEC) via WebRTC, which automatically filters out audio coming from the speakers. For server-side processing, reference-signal-based cancellation can subtract the TTS output from the incoming microphone stream.
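For the server-side case, one common way to subtract a reference signal is a normalized least-mean-squares (NLMS) adaptive filter. The sketch below is an illustration of that technique, not a claim about what any browser or vendor ships. The function name, tap count, and step size are assumptions, and the "echo" is a synthetic delayed copy of a random signal.

```python
import random
from typing import List, Sequence


def nlms_cancel(
    mic: Sequence[float],
    ref: Sequence[float],
    taps: int = 4,       # assumed filter length
    mu: float = 0.5,     # assumed step size
    eps: float = 1e-8,   # avoids division by zero on silent input
) -> List[float]:
    """Subtract an adaptively filtered copy of `ref` (TTS output) from `mic`."""
    weights = [0.0] * taps
    residual: List[float] = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, newest first (zero-padded at start)
        window = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(w * x for w, x in zip(weights, window))
        error = mic[n] - echo_estimate  # what is left after cancellation
        power = sum(x * x for x in window) + eps
        weights = [w + mu * error * x / power for w, x in zip(weights, window)]
        residual.append(error)
    return residual


def energy(samples: Sequence[float]) -> float:
    return sum(x * x for x in samples)


rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(400)]  # synthetic TTS signal
mic = [0.0] + [0.6 * s for s in ref[:-1]]           # echo: delayed, quieter copy
cleaned = nlms_cancel(mic, ref)
print(energy(cleaned[-100:]) < 0.01 * energy(mic[-100:]))
```

In this synthetic setup the microphone contains only echo, so the residual shrinks toward zero once the filter adapts. With real near-end speech present, the residual is the user's voice with most of the echo removed.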

The secret sauce: optimizations

Latency is the single most important metric for a voice agent. Every millisecond counts. These are the techniques that earn them back in production:

01 · Streaming tokens

Instead of waiting for a complete response, tokens are relayed to TTS in real-time. To keep intonation natural, they’re grouped into whole-sentence chunks before being handed to the TTS engine — which can synthesize sentence one while the LLM is still writing sentence two. This alone can cut TTFB by 60–70%, because audio generation runs in parallel with text generation.
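A minimal sketch of the sentence-grouping step, assuming tokens arrive as plain strings. The `chunk_sentences` name is hypothetical, and the punctuation check is deliberately naive: it would split on abbreviations and decimals such as "3.5", which a production chunker needs to handle.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into whole sentences for the TTS engine."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()  # a full sentence is ready to synthesize
            buffer = ""
    if buffer.strip():
        yield buffer.strip()      # flush a trailing fragment with no punctuation


tokens = ["Your", " order", " shipped", ".", " It", " arrives", " Friday", "."]
print(list(chunk_sentences(tokens)))
# ['Your order shipped.', 'It arrives Friday.']
```

Because this is a generator, the first sentence is available to the TTS engine as soon as its final token arrives, without waiting for the rest of the reply.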

02 · The latency budget

To feel conversational, the whole pipeline has to fit inside a strict budget. Human conversation accepts a gap of ~200–500ms. Above 800ms feels like “thinking,” above 1.5s feels disjointed. Here’s the breakdown:

Voice agent latency budget
Component	Target latency
VAD	50–100ms
STT	100–200ms (interim)
LLM TTFB	150–300ms
TTS TTFB	80–150ms
Network (2-way)	20–80ms (variable)
Total TTFB	400–830ms
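The total row is just the sum of the per-component ranges, which is easy to check and to re-run when a component changes. The dictionary below copies the table values; the variable names are illustrative.

```python
# Per-component (best, worst) latency in milliseconds, copied from the table
budget_ms = {
    "VAD": (50, 100),
    "STT (interim)": (100, 200),
    "LLM TTFB": (150, 300),
    "TTS TTFB": (80, 150),
    "Network (2-way)": (20, 80),
}

best = sum(low for low, _ in budget_ms.values())
worst = sum(high for _, high in budget_ms.values())
print(f"Total TTFB: {best}-{worst}ms")  # Total TTFB: 400-830ms

# Compare the worst case against the 500ms target mentioned earlier
print(f"Worst case exceeds 500ms target by {worst - 500}ms")
```

The sum assumes the stages run strictly back to back. Any overlap between stages, such as starting the LLM on interim transcripts, would shrink the total.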

03 · Latency masking

The cleverest trick: use filler audio (“Let me check that for you…”, “Hmm…”) or strategic pauses to mask backend processing time. The user perceives engagement instead of silence. Some systems even have the LLM generate contextually appropriate filler based on intent, buying valuable seconds for the main model to compute.
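One way to sketch latency masking is a timeout race: if the real reply is not ready within a short patience window, play a filler phrase first. The `respond_with_filler` helper, the 50ms window, and the fake backend are assumptions for the demonstration. A real system would tune the window and play audio rather than append strings.

```python
import asyncio
from typing import Any, Awaitable, Callable, Coroutine, List


async def respond_with_filler(
    reply: Coroutine[Any, Any, str],
    play: Callable[[str], Awaitable[None]],
    filler: str = "Let me check that for you...",
    patience_s: float = 0.05,  # assumed wait before masking the silence
) -> None:
    task = asyncio.create_task(reply)
    done, _pending = await asyncio.wait({task}, timeout=patience_s)
    if not done:
        await play(filler)     # backend is slow: fill the gap
    await play(await task)     # then play the real answer


async def demo(delay_s: float) -> List[str]:
    played: List[str] = []

    async def play(text: str) -> None:
        played.append(text)    # stands in for sending audio to the user

    async def backend() -> str:
        await asyncio.sleep(delay_s)
        return "Your order arrives Friday."

    await respond_with_filler(backend(), play)
    return played


slow = asyncio.run(demo(0.2))  # slow backend: filler plays first
fast = asyncio.run(demo(0.0))  # fast backend: no filler needed
print(slow)
print(fast)
```

The fast path stays untouched, so the filler only costs anything on turns that were going to feel slow anyway.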

04 · Modular architecture

A monolith fails at scale. Instead, use a modular design where no specific model is hardcoded. By standardizing a common input/output schema for each component (STT, LLM, TTS), the pipeline runs consistently regardless of the engine underneath. That abstraction lets you swap GPT-4 for Llama-3, or Deepgram for Whisper — keeping the system scalable and future-proof without rebuilding the architecture.
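A common way to express such a schema in Python is `typing.Protocol`: the orchestrator depends on the interface, and any engine that matches it can be dropped in. The interface shapes and the `EchoLLM` and `FakeTTS` stand-ins below are hypothetical and exist only to make the sketch runnable.

```python
import asyncio
from typing import AsyncIterator, Protocol


class LLM(Protocol):
    def stream(self, prompt: str) -> AsyncIterator[str]: ...


class TTS(Protocol):
    async def synthesize(self, text: str) -> bytes: ...


class EchoLLM:
    """Stand-in engine: streams the prompt back word by word."""

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        for word in prompt.split():
            yield word + " "


class FakeTTS:
    """Stand-in engine: 'audio' is just the UTF-8 bytes of the text."""

    async def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


async def run_turn(llm: LLM, tts: TTS, prompt: str) -> bytes:
    """Orchestrator code that only knows the interfaces, not the vendors."""
    text = ""
    async for token in llm.stream(prompt):
        text += token
    return await tts.synthesize(text.strip())


audio = asyncio.run(run_turn(EchoLLM(), FakeTTS(), "hello there"))
print(audio)  # b'hello there'
```

Swapping engines then means writing one small adapter class per vendor, while `run_turn` stays unchanged.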

05 · Smart caching (semantic router)

The fastest inference is the one you don’t run. In customer support, roughly 40% of queries are identical (“Reset my password”, “Where is my order?”). A semantic cache embeds the incoming transcript and searches a vector database for similar past queries. On a high-confidence match (>0.95), we skip the LLM entirely and serve a pre-computed response, cutting latency from ~800ms to ~50ms.
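The lookup logic can be sketched as a cosine-similarity search with a 0.95 threshold. To keep the example self-contained, the embedding is a toy word-count vector over a tiny fixed vocabulary and the search is a linear scan, both of which stand in for a real embedding model and vector database. The class and function names and the cached response are made up for illustration.

```python
import math
import re
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
VOCAB = ["reset", "my", "password", "where", "is", "order"]  # toy vocabulary


def toy_embed(text: str) -> Vector:
    """Toy stand-in for an embedding model: word counts over VOCAB."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return a cached response on a high-confidence match, else None."""
        query_vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(query_vec, e[0]), default=None)
        if best is not None and cosine(query_vec, best[0]) > self.threshold:
            return best[1]
        return None  # cache miss: fall through to the LLM


cache = SemanticCache(toy_embed)
cache.put("Reset my password", "I've sent a reset link to your email.")
print(cache.get("reset my password!"))  # hit
print(cache.get("Where is my order?"))  # None
```

A miss costs one embedding call and one search, so the cache only pays off when the hit rate is high enough to cover that overhead.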

Use cases

Architecturally, a voice agent is just a generic “audio-in, audio-out” pipeline. The core infrastructure — WebSockets, VAD, STT, LLM, TTS — does not care whether it’s a doctor, a dungeon master, or a flight attendant.

Once the core engine exists, you can pivot to entirely new industries by changing just three configuration files:

The system prompt

The personality and rules injected into the LLM context.

The tools

The specific APIs the agent is allowed to call (e.g. calendar_api).

The SaaS wrapper

The UI/UX layer that surrounds the voice interaction.
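As a rough sketch, the three variables can live in one small configuration object that the engine loads at startup. The `AgentProfile` class, its field names, and the wrapper labels are hypothetical. The prompts and tool names are borrowed from the examples in this post.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class AgentProfile:
    system_prompt: str                              # personality and rules
    tools: List[str] = field(default_factory=list)  # APIs the agent may call
    wrapper: str = "web-widget"                     # hypothetical UI label


support = AgentProfile(
    system_prompt="You are a customer-support bot. Acknowledge frustration immediately.",
    tools=["get_order_status", "initiate_refund"],
    wrapper="web-widget",
)

tutor = AgentProfile(
    system_prompt="You are a grumpy waiter at a café in Madrid. Speak only in Spanish.",
    tools=["grammar_check", "hint_generator"],
    wrapper="gamified-ui",
)


def allowed(profile: AgentProfile, tool_name: str) -> bool:
    """Gate tool calls so each agent can only use its own allow-list."""
    return tool_name in profile.tools


print(allowed(support, "initiate_refund"))  # True
print(allowed(tutor, "initiate_refund"))    # False
```

The engine itself never changes between the two agents. Only the profile passed to it does.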

By decoupling the engine from the implementation, you can build high-value, highly specific tools without rewriting the backend. Three configurations that go well beyond simple scheduling:

The customer-support automator

SaaS / Telco

“A frustrated user whose delivery is late wants to know where it is — angry, impatient, and wanting an update immediately.”

The vibe

Empathetic, decisive, efficient.

System prompt

Instruction · “You are a customer-support bot. Acknowledge frustration immediately. Pivot to providing the update the user needs. If confidence is low, escalate to a human.”

Style · “Short sentences. Confirm understanding after every step.”

Tools provided

➜ get_order_status(order_id)
➜ initiate_refund(order_id)
➜ schedule_technician()

The wrapper

A web widget that shows a real-time progress bar of the diagnostic checks while the voice speaks (“I’m checking on your order now…”).

The language roleplay partner

EdTech

“A learner wants to practice ordering food in a busy Madrid café — without the embarrassment of failing in front of a real person.”

The vibe

Immersive, educational, patient.

System prompt

Instruction · “You are a grumpy waiter at a café in Madrid. Speak only in Spanish. If the user makes a grammar mistake, ignore it unless it changes the meaning of the order.”

Style · “Speak at 0.9× speed. Use simple vocabulary.”

Tools provided

➜ grammar_check(user_input)
➜ hint_generator()

The wrapper

A gamified UI with a “confidence meter” and live subtitles that blur out each word until the user hears it.

The field technician’s copilot

Industrial / Blue-collar

“A wind-turbine technician 300 feet up, wearing heavy gloves. They can’t type on a tablet, but they need technical data immediately.”

The vibe

Ultra-efficient, terse, safety-critical.

System prompt

Instruction · “You are a senior technical field assistant. No pleasantries. Be extremely concise. If a value is out of safety range, alert the user immediately.”

Style · “Output structured data first, then explanations.”

Tools provided

➜ query_technical_manuals(component_id)
➜ log_incident_report(severity, description)

The wrapper

A “push-to-talk” (walkie-talkie style) interface to prevent accidental triggering by loud machinery.

Want to see this in action?

I built Velox AI — a platform that implements this exact architecture: low-latency voice, modular tools, and real-time streaming, just as described in this post.


Thanks for reading. If you enjoyed this deep-dive, let’s connect and build something.


Next in the series: Technical Implementation of a Real-Time Voice Agent
