Introduction
In the rapidly evolving landscape of conversational AI, it is easy to get lost in the specifics of its applications — whether it’s a voice agent for healthcare or an appointment scheduler. For this case study, I chose to pivot from the application to the abstraction.
Key insight
Rather than detailing a single use case, I am focusing on the fundamental architecture that powers the entire ecosystem. Why? Because beneath the surface, nearly every modern voice-agent platform relies on the same core structural patterns. The distinction between a ‘therapist bot’ and a ‘sales bot’ often comes down to just three variable components: the system prompt, the tool-execution layer, and the SaaS wrapper.

The data journey
Every turn is a relay race. Audio leaves the user, fans out through the orchestrator to each specialist service, and comes back as speech.
Streaming voicebots vs. turn-based chatbots
While voicebots and chatbots share the same “brain” (the LLM), their nervous systems are entirely different. To the end-user, the difference is just the medium. To an engineer, it is the difference between a discrete request-response cycle and a continuous full-duplex stream.
| Feature | Chatbot (discrete) | Voicebot (streaming) |
|---|---|---|
| Interaction | Turn-based | Full-duplex (simultaneous) |
| Protocol | HTTP / REST | WebSockets / gRPC / SIP |
| Latency | 1–3s (accepted) | < 500ms (critical) |
| Data flow | User types → wait → reply | Continuous audio stream |
| State | Stateless (per request) | Persistent & stateful |
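The contrast in the table can be sketched with `asyncio`. This is a minimal illustration, not a production transport: the queues stand in for a WebSocket connection, and `request_response`, `full_duplex`, and the `audio-for:` prefix are hypothetical names chosen for the example.

```python
import asyncio


async def request_response(user_text: str) -> str:
    """Chatbot style: one complete request in, one complete reply out."""
    await asyncio.sleep(0)  # stand-in for the HTTP round trip
    return f"reply to: {user_text}"


async def full_duplex(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Voicebot style: keep receiving while replies are still being sent."""
    pending: asyncio.Queue = asyncio.Queue()

    async def receive() -> None:
        # Runs for the life of the connection; None marks hang-up.
        while (chunk := await inbound.get()) is not None:
            await pending.put(chunk)
        await pending.put(None)

    async def send() -> None:
        # Runs concurrently with receive(), so output never blocks input.
        while (chunk := await pending.get()) is not None:
            await outbound.put(f"audio-for:{chunk}")

    await asyncio.gather(receive(), send())
```

The point of the second function is that input and output are two long-lived tasks sharing state, which is why a voicebot connection has to be persistent and stateful.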
Core components
Every real-time voice agent is built from six core modules. Think of them as LEGO blocks — interchangeable, independently upgradable, but all essential to the final product.
Speech-to-Text (STT)
Streams raw audio into text, returning interim results within milliseconds of the user speaking.
Large Language Model (LLM)
Reads the transcript and streams its reply token-by-token, so the mouth can start before it finishes thinking.
Text-to-Speech (TTS)
Turns streaming text into natural audio — speaking sentence one while sentence two is still being written.
Voice Activity Detection (VAD)
Separates speech from silence and noise, and decides whether a pause is a comma or a full stop.
The orchestrator
The backend that wires every module together, holds conversation state, and handles interrupts.
Retrieval (RAG)
Pulls relevant context from a vector database and injects it into the prompt before the LLM answers.
01 · Speech-to-Text (STT)
The ears. The STT engine (e.g. Deepgram, Whisper, AssemblyAI) converts raw audio streams into text. In a real-time architecture, standard transcription is insufficient — the engine must support streaming transcription via WebSockets, returning “interim results” (partial text) milliseconds after the user speaks. The final transcript is then ready almost instantly after they stop talking.
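One way to consume interim results is to treat each one as a replacement for the previous guess and only commit text once a result is marked final. The sketch below assumes an event shape with `text` and `is_final` fields; real STT providers use their own message formats, so `TranscriptEvent` and `TranscriptAssembler` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptEvent:
    text: str
    is_final: bool


@dataclass
class TranscriptAssembler:
    committed: list[str] = field(default_factory=list)
    interim: str = ""

    def apply(self, event: TranscriptEvent) -> str:
        if event.is_final:
            # A final result closes the utterance and clears the guess.
            self.committed.append(event.text)
            self.interim = ""
        else:
            # Interim results replace the previous guess; they never append.
            self.interim = event.text
        return self.current_text()

    def current_text(self) -> str:
        parts = self.committed + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

Feeding it "where is", then "where is my", then a final "where is my order" leaves exactly one committed utterance, with no duplicated words from the partials.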
02 · Large Language Model (LLM)
The brain. The LLM receives the STT transcript and generates a response. For voice agents, the LLM must support streaming token output: emitting tokens one at a time rather than waiting for the full response. This is what lets the TTS start speaking before the LLM has finished “thinking.”
03 · Text-to-Speech (TTS)
The mouth. TTS engines like ElevenLabs, Deepgram, or Sarvam AI convert text into natural-sounding audio. Modern systems accept streaming text input and produce streaming audio output — meaning they can start speaking the first sentence while the LLM is still generating the second.
04 · Voice Activity Detection (VAD)
The traffic cop. VAD is the unsung hero of voice interfaces. Its sole job is to distinguish human speech from silence, background noise, or non-speech vocalizations (like heavy breathing). A sophisticated VAD doesn’t just detect sound; it manages the turn-taking logic — deciding whether a pause is a comma (keep listening) or a full stop (start processing), so the agent doesn’t constantly interrupt the user.
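The "comma or full stop" decision is often implemented as a silence timer. Below is a deliberately simple energy-based sketch; production VADs are usually model-based, and the 20 ms frame size, 0.02 energy threshold, and 500 ms hangover are assumed values picked for illustration.

```python
import math
from typing import Iterable, Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def find_end_of_turn(
    frames: Iterable[Sequence[float]],
    frame_ms: int = 20,
    energy_threshold: float = 0.02,
    hangover_ms: int = 500,
) -> Optional[int]:
    """Return the index of the frame where the turn ends, or None."""
    silence_ms = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if rms(frame) >= energy_threshold:
            heard_speech = True
            silence_ms = 0  # speech resumed: the pause was only a comma
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= hangover_ms:
                return i  # long enough silence: treat it as a full stop
    return None
```

A shorter hangover makes the agent feel snappier but more likely to cut the user off mid-thought, so this one parameter is a direct trade-off between latency and politeness.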
05 · The orchestrator
The conductor. The orchestrator is the central nervous system that binds the components together — usually a custom backend service (Node.js / Python / Go) that manages the state of the conversation.
- Pipeline management: it pipes the output of the STT directly into the LLM, and the output of the LLM into the TTS.
- State management: it handles conversation history, executes external tool calls (APIs), and manages authentication.
- Interrupt handling: it listens for “barge-in” events and instantly kills the audio stream if the user speaks.
```python
# Simplified orchestrator flow
# (stt, llm, and tts are the engine clients described above)
async def handle_audio_stream(websocket):
    async for audio_chunk in websocket:
        # 1. Feed to STT
        transcript = await stt.transcribe(audio_chunk)
        if transcript.is_final:
            # 2. Feed to LLM (streaming)
            async for token in llm.stream(transcript.text):
                # 3. Feed to TTS (streaming)
                audio = await tts.synthesize(token)
                await websocket.send(audio)
```

06 · Retrieval-Augmented Generation (RAG)
The memory. The LLM provides reasoning and language, but it lacks specific knowledge about your business or your user’s data. RAG bridges that gap by retrieving relevant context from a vector database (Pinecone, Qdrant, etc.) in real-time. Before the LLM answers, the system searches the knowledge base for semantic matches to the query and injects that context into the prompt — improving accuracy and reducing hallucinations.
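The retrieve-then-inject step can be sketched in a few lines. This example swaps in a toy bag-of-words "embedding" and an in-memory list for the vector database, so it runs without any service; `embed`, `retrieve`, and `build_prompt` are hypothetical helpers, and a real system would call an embedding model and a store such as Pinecone or Qdrant instead.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Inject the retrieved context ahead of the user's question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Use only this context:\n{context}\n\nUser: {query}"
```

In a voice pipeline this lookup sits on the critical path before the LLM call, so its latency counts against the same budget as every other stage.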
The barge-in problem
Barge-in is what happens when the user interrupts the agent mid-sentence. In human conversation this is natural — we do it constantly. For a voice agent, it’s an engineering nightmare.
When the user barges in, the system must simultaneously stop TTS playback immediately, flush any queued audio from the buffer, capture the new utterance via STT, and route the new transcript to the LLM with the correct context. All of this has to happen within ~100ms, or the user hears overlapping audio.
```python
# Barge-in handling
async def on_user_speech_detected():
    # 1. Immediately stop TTS playback
    tts.stop()
    audio_buffer.flush()
    # 2. Cancel pending LLM generation
    llm.cancel_current_stream()
    # 3. Log partial response for context
    partial = tts.get_spoken_text_so_far()
    conversation.add_partial_response(partial)
    # 4. Resume listening for new utterance
    await stt.resume()
```

Echo cancellation adds another layer. The user’s microphone picks up the agent’s own voice from the speaker, creating a feedback loop. Without acoustic echo cancellation (AEC), the STT engine would transcribe the agent’s output as user speech and loop forever.
The secret sauce: optimizations
Latency is the single most important metric for a voice agent. Every millisecond counts. These are the techniques that earn them back in production:
01 · Streaming tokens
Instead of waiting for a complete response, tokens are relayed to TTS in real time. To keep intonation natural, they’re grouped into whole-sentence chunks before being handed to the TTS engine, which can synthesize sentence one while the LLM is still writing sentence two. This alone can cut time to first byte (TTFB) by 60–70%, because audio generation runs in parallel with text generation.
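The sentence-grouping step might look like the generator below. It is a sketch under simple assumptions: a sentence ends at `.`, `!`, or `?` followed by whitespace, which will wrongly split abbreviations such as "Dr. Smith". `sentences_from_tokens` is a hypothetical helper name, and it is written synchronously for brevity where a real pipeline would use an async generator.

```python
import re
from typing import Iterable, Iterator

# Assumption: terminal punctuation followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed tokens and yield each sentence as soon as it completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (match := _SENTENCE_END.search(buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        # Flush whatever is left when the LLM stream ends.
        yield buffer.strip()
```

Note that a sentence is only released once the first token of the next one arrives (or the stream ends), because the boundary is confirmed by the whitespace that follows the punctuation.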
02 · The latency budget
To feel conversational, the whole pipeline has to fit inside a strict budget. Human conversation accepts a gap of ~200–500ms. Above 800ms feels like “thinking,” above 1.5s feels disjointed. Here’s the breakdown:
| Component | Target latency |
|---|---|
| VAD | 50–100ms |
| STT | 100–200ms (interim) |
| LLM TTFB | 150–300ms |
| TTS TTFB | 80–150ms |
| Network (2-way) | 20–80ms (variable) |
| Total TTFB | 400–830ms |
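The total row is just the sum of the per-stage ranges, assuming the stages run one after another. A small helper like the following (hypothetical names, values copied from the table) makes that arithmetic explicit and can flag which stage blew its target in a measured trace.

```python
# Per-stage targets in milliseconds: (best case, worst case), from the table.
BUDGET_MS = {
    "VAD": (50, 100),
    "STT (interim)": (100, 200),
    "LLM TTFB": (150, 300),
    "TTS TTFB": (80, 150),
    "Network (2-way)": (20, 80),
}


def total_budget(budget: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case targets across sequential stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def over_budget(measured: dict[str, int], budget: dict[str, tuple[int, int]]) -> list[str]:
    """Return the stages whose measured latency exceeds the upper target."""
    return [name for name, ms in measured.items() if ms > budget[name][1]]
```

`total_budget(BUDGET_MS)` returns `(400, 830)`, matching the table: the best case fits the conversational gap, while the worst case already lands in "thinking" territory.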
03 · Latency masking
The cleverest trick: use filler audio (“Let me check that for you…”, “Hmm…”) or strategic pauses to mask backend processing time. The user perceives engagement instead of silence. Some systems even have the LLM generate contextually appropriate filler based on intent, buying valuable seconds for the main model to compute.
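A simple way to implement masking is to race the backend call against a short timer and play filler only if the timer wins. In this sketch `respond_with_filler`, the `play` callback, and the 0.3 s threshold are all assumptions made for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, Coroutine


async def respond_with_filler(
    backend: Coroutine[Any, Any, str],
    play: Callable[[str], Awaitable[None]],
    filler: str = "Let me check that for you...",
    mask_after_s: float = 0.3,  # assumed threshold before silence feels awkward
) -> str:
    """Play filler audio only when the backend is slower than the threshold."""
    task = asyncio.create_task(backend)
    done, _pending = await asyncio.wait({task}, timeout=mask_after_s)
    if not done:
        # Backend is still working: fill the silence while it finishes.
        await play(filler)
    answer = await task
    await play(answer)
    return answer
```

Fast answers go straight out with no filler, so the trick only costs time on turns that were going to be slow anyway.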
04 · Modular architecture
A monolith fails at scale. Instead, use a modular design where no specific model is hardcoded. By standardizing a common input/output schema for each component (STT, LLM, TTS), the pipeline runs consistently regardless of the engine underneath. That abstraction lets you swap GPT-4 for Llama-3, or Deepgram for Whisper — keeping the system scalable and future-proof without rebuilding the architecture.
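One way to express that common schema in Python is with `typing.Protocol`, so any engine with the right method shape can be dropped in. The interface names and the stand-in engines below are hypothetical; real adapters would wrap each vendor's SDK behind the same three methods.

```python
from typing import Iterator, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def stream(self, prompt: str) -> Iterator[str]: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class Pipeline:
    """Depends only on the three interfaces, never on a specific vendor."""

    def __init__(self, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> None:
        self.stt, self.llm, self.tts = stt, llm, tts

    def run_turn(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        reply = "".join(self.llm.stream(text))
        return self.tts.synthesize(reply)


# Stand-in engines for the example; they satisfy the protocols structurally.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class ShoutingLLM:
    def stream(self, prompt: str) -> Iterator[str]:
        yield prompt.upper()


class ReversingLLM:
    def stream(self, prompt: str) -> Iterator[str]:
        yield prompt[::-1]


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")
```

Swapping `ShoutingLLM()` for `ReversingLLM()` in the `Pipeline` constructor changes the output without touching `run_turn`, which is the property that makes model swaps cheap.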
05 · Smart caching (semantic router)
The fastest inference is the one you don’t run. In customer support, roughly 40% of queries are identical (“Reset my password”, “Where is my order?”). A semantic cache embeds the incoming transcript and searches a vector database for similar past queries. On a high-confidence match (>0.95), we skip the LLM entirely and serve a pre-computed response, cutting latency from ~800ms to ~50ms.
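The lookup logic can be sketched with an in-memory list standing in for the vector database. The embedding function is injected so that a real model can replace the toy one; `SemanticCache`, `toy_embed`, and `VOCAB` are hypothetical names, and only the 0.95 threshold comes from the text above.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return a cached response on a high-confidence match, else None."""
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


# Toy embedding over a tiny vocabulary, only so the example runs offline.
VOCAB = ("reset", "password", "order", "where")


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]
```

On a miss the caller falls through to the normal LLM path and can `put` the fresh answer afterwards, so the cache warms up on its own for the repetitive queries.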
Use cases
Architecturally, a voice agent is just a generic “audio-in, audio-out” pipeline. The core infrastructure — WebSockets, VAD, STT, LLM, TTS — does not care whether it’s a doctor, a dungeon master, or a flight attendant.
Once the core engine exists, you can pivot to entirely new industries by changing just three components:
The system prompt
The personality and rules injected into the LLM context.
The tools
The specific APIs the agent is allowed to call (e.g. calendar_api).
The SaaS wrapper
The UI/UX layer that surrounds the voice interaction.
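These three swap points could be captured in a single configuration object, with the tool list doubling as an allow-list. This is a sketch: `AgentConfig`, `call_tool`, and the stubbed `get_order_status` body are hypothetical, and a real tool would call an external API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentConfig:
    system_prompt: str                    # personality and rules for the LLM
    tools: dict[str, Callable[..., str]]  # APIs this agent may call
    wrapper: str                          # which UI layer surrounds the voice


def call_tool(config: AgentConfig, name: str, *args: str) -> str:
    """Execute a tool only if this agent's configuration allows it."""
    if name not in config.tools:
        raise PermissionError(f"tool {name!r} is not enabled for this agent")
    return config.tools[name](*args)


def get_order_status(order_id: str) -> str:
    # Stub for the example; a real tool would query an order system.
    return f"order {order_id}: in transit"


SUPPORT_AGENT = AgentConfig(
    system_prompt="You are a customer-support bot. Acknowledge frustration immediately.",
    tools={"get_order_status": get_order_status},
    wrapper="web-widget",
)
```

The engine stays identical across products; only the `AgentConfig` instance handed to it changes.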
By decoupling the engine from the implementation, you can build high-value, highly specific tools without rewriting the backend. Three configurations that go well beyond simple scheduling:
The customer-support automator
SaaS / Telco
“A frustrated user whose delivery is late wants to know where it is: angry, impatient, and wanting an update immediately.”
Empathetic, decisive, efficient.
System prompt
Instruction · “You are a customer-support bot. Acknowledge frustration immediately. Pivot to providing the update the user needs. If confidence is low, escalate to a human.”
Style · “Short sentences. Confirm understanding after every step.”
Tools provided
- get_order_status(order_id)
- initiate_refund(order_id)
- schedule_technician()
A web widget that shows a real-time progress bar of the diagnostic checks while the voice speaks (“I’m checking on your order now…”).
The language roleplay partner
EdTech
“A learner wants to practice ordering food in a busy Madrid café, without the embarrassment of failing in front of a real person.”
Immersive, educational, patient.
System prompt
Instruction · “You are a grumpy waiter at a café in Madrid. Speak only in Spanish. If the user makes a grammar mistake, ignore it unless it changes the meaning of the order.”
Style · “Speak at 0.9× speed. Use simple vocabulary.”
Tools provided
- grammar_check(user_input)
- hint_generator()
A gamified UI with a “confidence meter” and live subtitles that blur out each word until the user hears it.
The field technician’s copilot
Industrial / Blue-collar
“A wind-turbine technician 300 feet up, wearing heavy gloves. They can’t type on a tablet, but they need technical data immediately.”
Ultra-efficient, terse, safety-critical.
System prompt
Instruction · “You are a senior technical field assistant. No pleasantries. Be extremely concise. If a value is out of safety range, alert the user immediately.”
Style · “Output structured data first, then explanations.”
Tools provided
- query_technical_manuals(component_id)
- log_incident_report(severity, description)
A “push-to-talk” (walkie-talkie style) interface to prevent accidental triggering by loud machinery.
Want to see this in action?
I built Velox AI — a platform that implements this exact architecture: low-latency voice, modular tools, and real-time streaming, just as described in this post.