Introduction#
In the previous post I mapped the abstract anatomy of a real-time voice agent — STT, LLM, TTS, VAD, orchestrator, RAG, and the latency budget that ties them together. That post was the blueprint. This one is the construction site.
Every code snippet, every latency number, every architectural choice here is pulled directly out of a working voice-agent platform I built — the same one I shipped as Velox AI. So instead of whiteboard sketches, you get the production answers: how the pieces are actually wired, what surprised me, what I’d reverse if I started over, and where the milliseconds really go.
The conversation lifecycle
Before we go anywhere, here’s what we’re optimising for. One full conversational turn — from the moment a user finishes speaking to the moment they hear the agent reply — passes through nine stages and crosses three networks. Every millisecond on this rail is a millisecond the user can hear:
| Elapsed | Stage | Detail |
|---|---|---|
| 0 ms | User stops speaking | silence begins |
| ~30 ms | Mic → AudioWorklet → PCM | 48 kHz Float32 → 16 kHz Int16 |
| ~80 ms | PCM frame → STT (WebSocket) | binary frames, ~20 ms each |
| ~250 ms | STT fires UtteranceEnd | final transcript ready |
| ~450 ms | LLM streams first token | stream=true, OpenAI-compat |
| ~550 ms | First sentence boundary hits | flushed onto the TTS queue |
| ~680 ms | TTS first audio byte | streamed back to the server |
| ~750 ms | Server → client (binary WS) | jitter buffer absorbs ~150 ms |
| ~850 ms | User hears the agent | scheduled via AudioContext clock |
Roughly 850 ms end-to-end. Sounds slow on paper. Feels almost-but-not-quite-human in practice — and most of the engineering that follows is about chipping away at that number without losing audio quality or context.
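To see where that budget goes, it helps to turn the cumulative timestamps into per-hop costs. The sketch below only does arithmetic on the approximate figures from the timeline; the names `TIMELINE_MS` and `stage_costs` are illustrative and not part of the Velox codebase.

```python
# Cumulative timestamps in milliseconds, copied from the timeline above.
TIMELINE_MS = [
    ("user stops speaking", 0),
    ("mic -> AudioWorklet -> PCM", 30),
    ("PCM frame -> STT", 80),
    ("STT fires UtteranceEnd", 250),
    ("LLM streams first token", 450),
    ("first sentence boundary", 550),
    ("TTS first audio byte", 680),
    ("server -> client", 750),
    ("user hears the agent", 850),
]


def stage_costs(timeline):
    """Return (stage, ms spent getting there from the previous stage)."""
    return [
        (name, t - prev_t)
        for (_, prev_t), (name, t) in zip(timeline, timeline[1:])
    ]


costs = stage_costs(TIMELINE_MS)
for name, ms in costs:
    print(f"{ms:4d} ms  {name}")
print(f"{sum(ms for _, ms in costs):4d} ms  total")
```

On these numbers the two largest slices are waiting for the final transcript (170 ms) and waiting for the first LLM token (200 ms); every other hop is 130 ms or less.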
One pipeline, many vendors#
Before any code: the single most important architectural decision you’ll make is whether your pipeline is vendor-locked or vendor-agnostic. Every voice-agent tutorial on the internet hard-codes one stack — “OpenAI Whisper + GPT-4 + ElevenLabs” — and ships it. That works for a demo. It does not work in production.
The reason: the optimal stack depends on language, latency budget, voice taste, and price. An English customer-support bot in the US wants Deepgram + Groq + Deepgram Aura. A Hindi receptionist wants Sarvam everywhere. A premium character-voice game NPC wants ElevenLabs. A privacy-paranoid customer wants self-hosted Piper. Lock yourself into one stack and you’ve locked yourself out of those use cases.
The fix is an adapter pattern — one interface per layer, many implementations behind it, chosen per agent at config time:
    from typing import AsyncIterator, Protocol

    # Every STT provider implements the same protocol.
    class STTAdapter(Protocol):
        async def send_audio(self, pcm: bytes) -> None: ...
        # Plain def: implementations are async generators, so calling events()
        # returns an async iterator that `async for` can consume directly.
        def events(self) -> AsyncIterator[STTEvent]: ...
        async def close(self) -> None: ...

    # Picking is a one-liner at call setup:
    stt = build_stt(agent.stt_provider, language=agent.stt_language)
    #   └─► returns DeepgramAdapter | SarvamAdapter | AssemblyAIAdapter | …

    # The orchestrator never imports a specific vendor.
    async for event in stt.events():
        handle(event)

Three layers, three adapters: STT, LLM, TTS. Swap any one without touching the other two, the orchestrator, the audio plumbing, or the UI. Here’s what I shipped, and what you can plug in instead:
| Layer | What I shipped | Why |
|---|---|---|
| Speech → Text | Deepgram Nova-3 | Best English latency + endpointing; bidirectional WebSocket; semantic VAD baked in. |
| Reasoning (LLM) | Groq / Cerebras / NVIDIA NIM | Raw tokens-per-second is the whole game in voice. All three speak OpenAI-compatible streaming. |
| Text → Speech | Deepgram Aura | Cheap, fast, streams chunks as it synthesises. Solid baseline English voice. |
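The build_stt one-liner shown earlier implies some mapping from a provider name to an adapter class. One plausible shape is a small registry, sketched below. Everything here is an assumption for illustration: the registry, the decorator, and the empty adapter bodies are not taken from the Velox codebase, and real adapters would implement the full STTAdapter protocol.

```python
from typing import Callable, Dict

# Hypothetical registry: provider name -> adapter factory.
STT_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_stt(name: str):
    """Class decorator that files an adapter under a provider name."""
    def decorator(factory):
        STT_REGISTRY[name] = factory
        return factory
    return decorator


@register_stt("deepgram")
class DeepgramAdapter:
    def __init__(self, language: str):
        self.language = language  # a real adapter would open its WebSocket here


@register_stt("sarvam")
class SarvamAdapter:
    def __init__(self, language: str):
        self.language = language


def build_stt(provider: str, language: str):
    """Look the provider up at call setup; fail loudly on a typo in the config."""
    try:
        factory = STT_REGISTRY[provider]
    except KeyError:
        known = ", ".join(sorted(STT_REGISTRY))
        raise ValueError(f"unknown STT provider {provider!r} (known: {known})") from None
    return factory(language=language)


stt = build_stt("sarvam", language="hi")
print(type(stt).__name__, stt.language)
```

With this shape, adding a vendor means adding one decorated class; the orchestrator and the agent config schema stay untouched.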
System architecture#
Zoom out. Here’s the whole system on one map — browser, dashboard backend, voice runtime, persistence, and every external provider it talks to. The split that matters is the colour: control plane (where agents are defined) versus data plane (where they run).
System map
- Browser (client · Next.js 15)
  - Dashboard UI: Agent Builder
  - useVoiceAgent(): WebSocket + control msgs
  - AudioWorklet: 48k → 16k PCM
- Next.js API routes (control plane)
  - /api/agents/*: CRUD agent configs
  - /api/auth/*: NextAuth sessions
  - /api/rag (proxy): KB upload → backend
- FastAPI backend (data plane · Python)
  - WebSocket handler: /ws/agent/{id}
  - llm_orchestrator (per-call task)
    - TaskManager: interrupt signal
    - Sentence-buffered LLM → TTS
    - Tool dispatch
  - STT adapter · WS
  - LLM adapter · HTTP
  - TTS adapter · HTTP
- Persistence (stateful)
  - MongoDB Atlas (shared, agent doc): users · agents · calls (recordings)
  - Redis (data plane): agent-config cache · rate-limit counters
  - Qdrant Cloud (data plane): agent_knowledge · 384-dim cosine
- External providers (pluggable): STT via WS · LLM + TTS via HTTP
Control plane vs data plane
The clearest mental model for a voice-agent platform is the same one cloud-infra teams use: control plane vs data plane.
- Control plane = the dashboard. Where agents are defined. A user signs in, edits an agent (system prompt, provider choices, voice, knowledge base), and the result lands in MongoDB. Pure CRUD, no real-time anything, no exotic infrastructure.
- Data plane = the voice runtime. Where agents run. A WebSocket opens, the backend reads the agent config from Mongo (cached in Redis), spins up an asyncio task graph (STT loop, LLM stream, TTS worker, audio sender), and streams audio both directions until the user hangs up.
The two share exactly one thing: the agent document. They can deploy together (one box, two containers — what I actually run today on AWS) or scale independently the moment one gets hot. Because the data plane holds no cross-call in-process state — everything per-call lives on a single asyncio task graph — you can scale it horizontally by adding boxes behind a load balancer the day you need to. That decoupling is worth more than any specific tech choice.
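The one piece of shared state, the agent document, is read on every call setup, which is why it sits behind a cache. The sketch below shows the read-through pattern with an in-memory dict standing in for Redis and an async loader standing in for the Mongo query. The class name, the TTL value, and the invalidate hook are assumptions for illustration, not the platform's actual code.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Tuple


class AgentConfigCache:
    """Read-through cache: serve a fresh copy if one exists, else load and store."""

    def __init__(
        self,
        loader: Callable[[str], Awaitable[dict]],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._loader = loader          # stand-in for the Mongo lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}  # stand-in for Redis

    async def get(self, agent_id: str) -> dict:
        now = self._clock()
        entry = self._entries.get(agent_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                      # cache hit
        config = await self._loader(agent_id)    # cache miss: go to the database
        self._entries[agent_id] = (now, config)
        return config

    def invalidate(self, agent_id: str) -> None:
        """Called by the control plane after an agent is edited."""
        self._entries.pop(agent_id, None)


async def demo():
    async def load_from_db(agent_id: str) -> dict:
        return {"agent_id": agent_id, "stt_provider": "deepgram"}

    cache = AgentConfigCache(load_from_db)
    return await cache.get("agent-1")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The invalidate call is the only coupling the control plane needs with the data plane: edit the document, drop the cached copy, and the next call picks up the new config.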
Architectural style
services/stt/*, services/tts/*, state/, tools/). In-process function calls beat any RPC for latency, and the module boundaries are already shaped so extracting a worker pool later would be a refactor, not a rewrite.
The tech stack#
Concrete choices, with the “why” in one line and the alternatives you can swap in. None of these is the One True Choice — they’re the trade-offs that fit a real-time, B2B, multi-tenant SaaS.
| Layer | My pick | Why | Alternatives |
|---|---|---|---|
| Frontend | Next.js 15 + React 19 + Tailwind | App Router + RSC for the dashboard; Framer Motion for UI animation. | Remix, Astro, SvelteKit, Vite + React |
| Dashboard API | Next.js API routes | Co-located BFF. No second service just for CRUD. | tRPC, Express, Hono, NestJS |
| Voice runtime | Python 3 + FastAPI + asyncio | Every AI SDK ships a Python client first; asyncio handles hundreds of IO-bound tasks per call. | Node + uWebSockets, Go + Gorilla, Elixir Phoenix |
| Primary DB | MongoDB Atlas | Flexible nested agent docs. (Honest: I'd pick Postgres + JSONB if I started over.) | Postgres + JSONB, Supabase, PlanetScale, DynamoDB |
| Vector DB | Qdrant Cloud | Multi-tenant via payload filter; cloud-hosted means I don't run it. | Pinecone, Weaviate, Milvus, pgvector, Chroma |
| Embeddings | all-MiniLM-L6-v2 (local) | Free, sub-50 ms, 384-dim, runs on CPU. Good enough for SMB-sized KBs. | OpenAI text-embedding-3-small, Voyage, BGE, Cohere |
| Cache / KV | Redis | Agent-config cache + per-IP rate limit. Not a queue — everything in-process is asyncio. | Dragonfly, KeyDB, Memcached, in-memory LRU |
| Auth | NextAuth.js 5 | Google/GitHub OAuth + credentials; admin runs a separate token scheme. | Clerk, Supabase Auth, Auth0, custom JWT |
| Hosting | AWS EC2 + Docker + Nginx | One box in ap-south-1 (Mumbai), two containers, TLS via Let's Encrypt. Boring on purpose. | Fly.io, Render, Railway, Hetzner, K8s |
The real-time pipeline#
Audio transport: raw WebSocket, not WebRTC
Most tutorials reach for WebRTC because it’s “the real-time standard.” For a browser-to-server voice agent you almost certainly don’t need it. WebRTC buys you three things — NAT traversal, jitter buffering, and AEC — and none are valuable here:
- NAT traversal: you have a public server, so there is no peer-to-peer hole-punching to do.
- Jitter buffering: you’re going to build your own client-side ~150 ms buffer anyway (TTS chunks arrive bursty).
- AEC (acoustic echo cancellation): getUserMedia({ audio: { echoCancellation: true } }) already does it on the browser side.
Skipping WebRTC means no SDP exchange, no STUN/TURN servers, no SFU. Just one TCP+TLS connection carrying binary WebSocket frames in both directions — lower setup latency, far less infra. The audio format on the wire:
| Hop | Format | Rate |
|---|---|---|
| Mic → AudioWorklet | Float32 mono | 48 kHz |
| AudioWorklet → WebSocket | Int16 PCM (Linear16) | 16 kHz |
| Server → STT | Same Int16 | 16 kHz |
| TTS → server → client | Raw PCM bytes | 16–24 kHz (provider-dependent) |
Downsampling 48 kHz Float32 → 16 kHz Int16 is the single most important client-side detail. Doing it in the AudioWorklet (not on the main thread) keeps the audio path off React’s render loop. Audio processing that shares the main thread stalls whenever React re-renders, and that is what causes the mysterious glitches in toy implementations.
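The conversion itself is simple arithmetic. In the browser it lives in an AudioWorklet processor written in JavaScript; the Python below is only a sketch of the same maths, assuming a plain average of every three samples as the low-pass step (a production resampler would use a proper filter) and 20 ms frames as in the timeline. All names are illustrative.

```python
import struct
from typing import Iterator, List

IN_RATE = 48_000
OUT_RATE = 16_000
FACTOR = IN_RATE // OUT_RATE                    # 3 input samples per output sample
FRAME_MS = 20
FRAME_BYTES = OUT_RATE * FRAME_MS // 1000 * 2   # 320 samples x 2 bytes = 640


def downsample_to_int16(samples: List[float]) -> bytes:
    """Average each group of FACTOR floats, clamp to [-1, 1], pack as Int16 LE."""
    out = []
    usable = len(samples) - len(samples) % FACTOR   # ignore a ragged tail
    for i in range(0, usable, FACTOR):
        mean = sum(samples[i:i + FACTOR]) / FACTOR
        clamped = max(-1.0, min(1.0, mean))
        out.append(int(clamped * 32767))
    return struct.pack(f"<{len(out)}h", *out)


def frames(pcm: bytes, frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Yield complete frames only; the caller keeps any remainder for later."""
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


# 20 ms of 48 kHz input (960 floats) becomes exactly one 640-byte frame.
pcm = downsample_to_int16([0.5] * 960)
print(len(pcm), len(list(frames(pcm))))
```

The clamp matters more than it looks: a Float32 sample slightly outside [-1, 1] would otherwise overflow the 16-bit range and produce a loud click.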
The orchestrator loop
Everything above is plumbing. The brain is one async function that owns the call’s state machine: it consumes STT events, fires the LLM, slices its token stream into sentences, queues those for TTS, and pushes audio bytes back over the WebSocket — while listening for interruptions the whole time.
    # Per-call task graph (simplified from services/llm_orchestrator.py).
    # Runs inside the WebSocket handler coroutine; `tts` is the agent's TTS adapter.
    import asyncio

    task_manager = TaskManager(websocket)  # audio queue, interrupt signal, llm task handle
    history = [{"role": "system", "content": agent.system_prompt}]
    stt = build_stt(agent.stt_provider, language=agent.stt_language)

    # 1) STT loop: listens for transcripts and barge-in
    async def stt_loop():
        async for event in stt.events():
            if event.type == "interim_transcript" and task_manager.is_busy:
                await task_manager.handle_interruption()  # see the Barge-in section
            elif event.type == "utterance_end":
                # Keep the handle so barge-in can cancel it (attribute name assumed).
                task_manager.llm_task = asyncio.create_task(
                    run_llm_and_tts(event.final_text))

    # 2) The pipeline: LLM → sentence buffer → TTS queue
    async def run_llm_and_tts(user_text):
        history.append({"role": "user", "content": user_text})
        sentence_buffer = ""
        async for delta in stream_llm(history[-25:], tools=TOOL_SCHEMAS):
            if delta.tool_call:  # tool calls handled separately
                await dispatch_tool(delta.tool_call)
                return
            sentence_buffer += delta.content or ""
            # extract_sentences returns (finished sentences, unfinished remainder)
            sentences, sentence_buffer = extract_sentences(sentence_buffer)
            for sentence in sentences:
                if task_manager.interrupt_signal.is_set():  # bail on barge-in
                    return
                await task_manager.tts_queue.put(sentence)
        if sentence_buffer.strip():
            await task_manager.tts_queue.put(sentence_buffer)  # flush tail (don't drop it!)

    # 3) TTS worker: drains the sentence queue
    async def tts_worker():
        while True:
            sentence = await task_manager.tts_queue.get()
            async for chunk in tts.synthesize(sentence):
                if task_manager.interrupt_signal.is_set():
                    break
                await task_manager.audio_queue.put(chunk)

    # 4) Audio sender: drains the audio queue to the client
    async def audio_sender():
        while True:
            chunk = await task_manager.audio_queue.get()
            await websocket.send_bytes(chunk)

    await asyncio.gather(stt_loop(), tts_worker(), audio_sender())

Four concurrent tasks. Two queues between them. One shared TaskManager holding the interrupt signal that every task checks before doing anything. That’s the entire architecture; everything else in this post is detail about how each of those four tasks behaves.
Sentence buffering — the magic trick#
Why does Velox feel snappy when most voice bots feel laggy? Sentence buffering.
A naive pipeline waits for the LLM to finish, sends the whole response to TTS, waits for TTS to synthesise all of it, then plays it. Three serial round-trips — users sit through ~2–3 seconds of dead air every turn. The fix is to chunk the token stream by punctuation boundaries and pipe each finished sentence to TTS the instant it lands:
    import re

    # The whole magic, in one regex.
    SENTENCE_BOUNDARY = re.compile(r"([.!?:;])\s+|\n")

    # Stream tokens in; whenever a terminator is seen, slice off the completed
    # sentence and ship it to TTS. Everything after the cut stays in the buffer.
Three sub-decisions matter here:
1. Boundary set: . ! ? : ; and newline. Anything finer (token count, mid-clause) produces glitchy half-words from TTS. Anything coarser (whole paragraphs) starves the user of audio.
2. Hold the tail. If the LLM ends mid-sentence (no terminator), the last buffered chunk has to be flushed at end-of-stream; otherwise the final clause silently disappears. (Ask me how I know.)
3. Drop too-short fragments. Sentences shorter than ~3 chars are held back; otherwise an “OK.” costs a full TTS round-trip to say one syllable.
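The three rules above fit in a short function. This is a minimal sketch built around the regex from the post, not the production extractor: the (sentences, remainder) return shape, the speak_stream driver, and the exact threshold are assumptions. The threshold is set to 4 so that a three-character “OK.” is held back, as the example in the list implies.

```python
import re
from typing import Iterable, List, Tuple

SENTENCE_BOUNDARY = re.compile(r"([.!?:;])\s+|\n")
MIN_SENTENCE_CHARS = 4  # assumption: fragments of 3 chars or fewer are held back


def extract_sentences(buffer: str) -> Tuple[List[str], str]:
    """Split off every completed sentence; return them plus the unfinished tail."""
    sentences = []
    start = 0  # index of the first character not yet emitted
    for match in SENTENCE_BOUNDARY.finditer(buffer):
        candidate = buffer[start:match.end()].strip()
        if len(candidate) < MIN_SENTENCE_CHARS:
            continue  # too short: keep it glued to whatever comes next
        sentences.append(candidate)
        start = match.end()
    return sentences, buffer[start:]


def speak_stream(tokens: Iterable[str]) -> List[str]:
    """Feed streamed tokens through the buffer; return what would reach TTS."""
    buffer, spoken = "", []
    for token in tokens:
        buffer += token
        ready, buffer = extract_sentences(buffer)
        spoken.extend(ready)
    if buffer.strip():          # hold the tail: flush whatever is left
        spoken.append(buffer.strip())
    return spoken


print(speak_stream(["OK. ", "Your order ", "shipped today. It ", "arrives Friday"]))
```

Note that a terminator only counts once whitespace follows it, so a sentence that ends exactly at a token boundary waits for the next token (or the end-of-stream flush) before it ships.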
Net effect: the user hears the first sentence ~600 ms after they stop speaking, not ~2.5 s. TTS for sentence two runs while sentence one is playing, so the stages overlap instead of queuing behind each other.
Barge-in, implemented#
Barge-in — the user talking over the agent mid-sentence — was the bug class that bit the most before it was solved. The hard part isn’t detecting that the user spoke; it’s making the agent shut up within ~100 ms, across three concurrent tasks, two queues, and one TCP stream’s worth of buffered audio.
Two engineering choices made it tractable:
1. Detect barge-in semantically, not acoustically. A naive implementation watches mic energy: anything above a threshold counts as speech. In the real world that fires on keyboard clicks, doors, AC fans, the user’s own breathing. Velox uses the STT engine’s interim transcripts instead: if Deepgram returns a partial that contains a recognisable word while the agent is speaking, that’s a real interrupt. Otherwise it’s noise, so ignore it.
2. Treat interruption as a cascade, not a flag. Once the trigger fires (step 1 below), six things have to happen in order. Get any of them wrong and you hear overlapping voices, ghost audio, or the agent resuming where it left off after a two-second pause:
   1. STT emits an interim transcript with a real word while task_manager.is_busy is true.
   2. interrupt_signal.set() is called; every loop checks this flag before doing anything.
   3. The in-flight LLM stream is cancelled. The task awaiting the httpx stream receives CancelledError and unwinds.
   4. The TTS queue and audio queue are drained; anything synthesised but not yet sent is dropped on the floor.
   5. A {"type":"control","action":"interrupt"} message goes over the WebSocket. The client clears its playback queue and stops every AudioBufferSourceNode currently scheduled.
   6. was_interrupted = True is flagged on the call state; the next LLM call sees [User interrupted you] injected into history so the model recovers gracefully.
   7. The audio-sender task is reset so it doesn’t drain stale futures. STT keeps listening.
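The cascade above can be sketched as one method on the shared TaskManager. This is an illustration under assumptions, not the production class: the attribute names, the "any alphabetic character" test for a real word, and the FakeWebSocket used for the demo are all hypothetical. Resetting the audio sender and clearing the signal for the next turn are left out for brevity.

```python
import asyncio
import json


def is_real_interrupt(interim_text: str) -> bool:
    """Assumption: a partial counts as speech if it contains any letter."""
    return any(ch.isalpha() for ch in interim_text)


def drain(queue: asyncio.Queue) -> int:
    """Drop everything currently queued; return how many items were discarded."""
    dropped = 0
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return dropped
        dropped += 1


class TaskManager:
    def __init__(self, websocket):
        self.websocket = websocket
        self.tts_queue: asyncio.Queue = asyncio.Queue()
        self.audio_queue: asyncio.Queue = asyncio.Queue()
        self.interrupt_signal = asyncio.Event()
        self.llm_task = None
        self.was_interrupted = False

    @property
    def is_busy(self) -> bool:
        llm_running = self.llm_task is not None and not self.llm_task.done()
        return llm_running or not self.tts_queue.empty() or not self.audio_queue.empty()

    async def handle_interruption(self) -> None:
        self.interrupt_signal.set()                        # every loop sees this
        if self.llm_task is not None and not self.llm_task.done():
            self.llm_task.cancel()                         # cancel the LLM stream
            await asyncio.gather(self.llm_task, return_exceptions=True)
        drain(self.tts_queue)                              # drop unsent sentences
        drain(self.audio_queue)                            # drop unsent audio
        await self.websocket.send_text(                    # tell the client to stop
            json.dumps({"type": "control", "action": "interrupt"}))
        self.was_interrupted = True                        # next prompt mentions it


class FakeWebSocket:
    """Test double that records outgoing messages."""
    def __init__(self):
        self.sent = []

    async def send_text(self, data: str) -> None:
        self.sent.append(data)


async def demo():
    ws = FakeWebSocket()
    tm = TaskManager(ws)
    tm.llm_task = asyncio.create_task(asyncio.sleep(60))   # stand-in LLM stream
    await tm.tts_queue.put("stale sentence")
    await tm.audio_queue.put(b"stale audio")
    if is_real_interrupt("wait") and tm.is_busy:
        await tm.handle_interruption()
    return tm, ws


if __name__ == "__main__":
    manager, socket = asyncio.run(demo())
    print(manager.was_interrupted, socket.sent)
```

Awaiting the cancelled task before draining is deliberate: it guarantees the LLM coroutine has stopped producing sentences, so nothing can slip into the queue after the drain.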
Echo cancellation
getUserMedia({ echoCancellation: true })) handles this for free in-browser. Server-side you’d need reference-signal subtraction (subtract the TTS output from the inbound mic stream before it reaches STT), but for browser-first apps you genuinely don’t have to write that code.
The known imperfection: when the user interrupts mid-sentence, the conversation history records the full intended response, not what the user actually heard. If they reference something the agent never finished saying, the model can get confused. The fix is to truncate history to the last word the user heard, which requires knowing which audio chunks actually made it past the jitter buffer. Not solved yet.
Knowledge base & RAG#
A voice agent without business knowledge can’t answer “what’s your return policy?” It needs RAG — retrieval-augmented generation. The pipeline is dead simple, but every parameter is a trade-off:
| Step | What happens | Detail |
|---|---|---|
| Upload | Drag-drop in the dashboard | multipart upload → backend |
| Parse | Extract text by file type | |
| Chunk | Split into overlapping windows | RecursiveCharacterTextSplitter · size 500 · overlap 50 |
| Embed | Vectorise each chunk | all-MiniLM-L6-v2 → 384-dim vectors |
| Upsert | Store in the vector DB | Qdrant · collection = agent_knowledge · payload |
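To make the chunk step concrete, here is a deliberately simplified splitter using the same size and overlap. It is not RecursiveCharacterTextSplitter: it only slides a character window and prefers to cut at a space, whereas the recursive splitter tries paragraph and sentence boundaries first. The function name and the cut rule are assumptions for illustration.

```python
from typing import List


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of at most `size` chars, each sharing
    roughly `overlap` chars with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Prefer the last space in the window so words stay whole.
            # Searching past start + overlap guarantees forward progress.
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end == len(text):
            break
        start = end - overlap   # step back so neighbouring chunks overlap
    return chunks


document = " ".join(f"w{i}" for i in range(400))
chunks = chunk_text(document)
print(len(chunks), max(len(c) for c in chunks))
```

With top-3 retrieval, three such chunks put at most 3 × 500 = 1500 characters of context into the prompt. One limitation of this sketch: the overlap start can land mid-word, which the recursive splitter avoids.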
Choices worth defending:
- 500-char / 50-overlap chunks, character-based, not token-based. Small enough to fit 3–5 in an LLM call without bloating cost; large enough to carry real context. The recursive splitter respects paragraph → sentence → word boundaries before falling back to mid-word.
- Local embeddings (all-MiniLM-L6-v2) over OpenAI’s text-embedding-3-small. 384-dim is plenty for SMB-sized KBs (~50–500 chunks per agent), inference is sub-50 ms on CPU, and there’s no per-token bill on the hot path. The recall gap vs OpenAI isn’t noticeable below a few thousand chunks. For technical / legal / multilingual KBs, swap in something stronger.
- One Qdrant collection, payload-filter isolation. Every point has an agent_id field with a payload index; every query filters by it. Cheap, simple, scales to thousands of agents. If a customer ever demands physical separation, the migration is “one collection per tenant”: a refactor, not a rewrite.
- Top-3 vector-only retrieval. No hybrid keyword search, no re-ranker. Top-3 with 500-char chunks puts ~1500 chars of context in the LLM, which keeps answers concise, factual, and natural when spoken aloud.
    from qdrant_client import models

    # Tenant isolation = one line of Qdrant filter
    query_filter = models.Filter(must=[
        models.FieldCondition(
            key="agent_id",
            match=models.MatchValue(value=agent_id),
        )
    ])
    results = qdrant.search(
        collection_name="agent_knowledge",
        query_vector=embed(user_question),  # 384-dim
        query_filter=query_filter,
        limit=3,
    )

The retrieved chunks aren’t stuffed into the system prompt; they’re injected as the response of a search_knowledge_base tool call. The LLM decides when to ask for them. That’s crucial: the model can choose not to search when the user is just exchanging pleasantries, saving a Qdrant round-trip every turn. Which leads to…
Tool calls inside a streaming loop#
Tool calling on a non-streaming HTTP chat endpoint is easy. Tool calling inside a streaming voice pipeline is genuinely tricky, because tool calls arrive as partial JSON interleaved with regular content tokens, and you have to decide on the fly:
- Suppress any “narration” tokens the model emits alongside the tool call (it shouldn’t say “I’m going to look that up…” out loud while it’s already doing it).
- Mask the tool latency with a filler phrase so the user hears something.
- Execute the tool, append the result to history, then re-call the LLM — which streams the actual spoken answer back through the sentence buffer.
    import random

    async def run_llm_and_tts(user_text=None):
        # user_text is None on the follow-up call that answers a tool result
        if user_text is not None:
            history.append({"role": "user", "content": user_text})
        sentence_buffer = ""
        tool_call_buffer = None
        async for delta in stream_llm(history[-25:], tools=TOOL_SCHEMAS):
            # 1) Accumulate partial tool-call JSON across stream chunks
            if delta.tool_call:
                tool_call_buffer = accumulate(tool_call_buffer, delta.tool_call)
                continue  # don't TTS this delta
            # 2) Regular content tokens flow into the sentence buffer
            sentence_buffer += delta.content or ""
            sentences, sentence_buffer = extract_sentences(sentence_buffer)
            for sentence in sentences:
                if task_manager.interrupt_signal.is_set():
                    return
                await task_manager.tts_queue.put(sentence)
        # 3) End of stream: if a tool was called, execute and recurse
        if tool_call_buffer:
            # Latency mask: queue a filler while the tool runs
            await task_manager.tts_queue.put(random.choice(FILLER_PHRASES))
            result = await execute_tool(tool_call_buffer)  # search KB, end call, etc.
            # OpenAI-compatible APIs also expect the assistant message that
            # carried the tool call to precede this tool message in history.
            history.append({"role": "tool", "content": result,
                            "tool_call_id": tool_call_buffer.id})
            await run_llm_and_tts()  # second LLM call answers
        elif sentence_buffer.strip():
            await task_manager.tts_queue.put(sentence_buffer)  # flush tail

The filler phrase (“Let me check the knowledge base for that…”) masks ~400–700 ms of dead air on knowledge searches. Small thing, but it’s the difference between an agent that feels thoughtful and one that feels stuck. The same trick applies to slow LLMs in general: queue a generic “thinking” phrase before the call, and the user hears engagement instead of silence.
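The accumulate step is where the partial JSON gets stitched together. A minimal sketch follows, assuming each delta is a plain dict shaped like an OpenAI-compatible streaming tool-call fragment (id and function name in the first fragment, argument text spread over later ones) and that the model calls one tool per turn. ToolCallBuffer and its fields are illustrative names, not the production types.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCallBuffer:
    id: str = ""
    name: str = ""
    arguments: str = ""  # raw JSON text, concatenated across deltas

    def parsed_arguments(self) -> dict:
        """Only safe to call once the stream has ended and the JSON is complete."""
        return json.loads(self.arguments or "{}")


def accumulate(buffer: Optional[ToolCallBuffer], delta: dict) -> ToolCallBuffer:
    """Merge one streamed fragment into the buffer, creating it on first use."""
    buffer = buffer or ToolCallBuffer()
    function = delta.get("function") or {}
    buffer.id = buffer.id or delta.get("id") or ""
    buffer.name = buffer.name or function.get("name") or ""
    buffer.arguments += function.get("arguments") or ""
    return buffer


# Three fragments as they might arrive over the stream.
deltas = [
    {"id": "call_1", "function": {"name": "search_knowledge_base", "arguments": ""}},
    {"function": {"arguments": '{"query": "ret'}},
    {"function": {"arguments": 'urn policy"}'}},
]

call = None
for delta in deltas:
    call = accumulate(call, delta)

print(call.name, call.parsed_arguments())
```

Parsing is deferred to end-of-stream on purpose: any intermediate state of the argument string is almost never valid JSON.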
Built-in tools + obvious additions
search_knowledge_base(query) and end_call(reason). The obvious next additions for customer care: transfer_to_human, send_sms, send_email, schedule_callback, book_appointment, lookup_order(id). For user-defined tools, the clearly-right design is per-agent encrypted credentials in the DB plus server-side execution; credentials never reach the client. Not built yet, but the adapter shape is there.
Latency in practice#
Real numbers from a production pipeline on a good connection (English path: Deepgram STT + NVIDIA NIM LLM + Deepgram Aura TTS, client in ap-south-1). These are estimates from log-stamped TTS first-byte events. Honest disclaimer: there’s no P50/P95 dashboard yet, which is genuinely the biggest operational gap left in the system.
Two observations worth taking away:
1. The LLM is rarely the bottleneck anymore. Going in, I assumed “AI is slow” and built around hiding model latency. With Groq / Cerebras / NVIDIA NIM returning first tokens in 150–250 ms, the LLM is faster than the network round-trip to TTS. The real bottlenecks are STT endpointing waiting for silence, and TTS providers that don’t stream.
2. Users notice silence after they speak, not delay before they’re answered. 200 ms of silence between “Hi” and the agent reacting feels broken. 500 ms of audio delay once the agent is talking feels totally fine. That asymmetry should shape where you spend your optimisation budget.
What’s left to attack: Sarvam’s TTS doesn’t stream; it returns full utterances in one HTTP response. For a one-sentence reply that’s 300–600 ms of dead air on Indian-language agents. The day Sarvam ships a streaming endpoint, that’s a one-line config change for a 30% latency win. STT endpointing (utterance_end_ms=1000) is the runner-up: dropping it to 500 ms shaves half a second off perceived latency at the cost of more occasional early triggers. A tunable knob, not a code change.
What surprised me#
Three things I didn’t expect going in — worth sharing because every voice-agent post on the internet was wrong about at least one of them:
1. The LLM is the fastest part. See above. Most posts treat the LLM as the slow, expensive thing you wrap with caching and prompt compression. With modern inference providers it’s faster than your network round-trip. Spend your budget on STT endpointing and TTS streaming; that’s where the seconds live.
2. Semantic caching isn’t worth it for voice. Theory: 40% of support queries are repeats; cache them. Reality: voice queries are wildly variable in phrasing (“how do I…” / “can you tell me…” / “what’s the way to…”), hit rates are low, and the operational complexity isn’t worth it when the uncached path is already ~700 ms.
3. You need fewer queues than you think. Early prototypes had a queue between every stage. All you actually need is a sentence queue (LLM → TTS) and an audio queue (TTS → client). Everything else is in-process async/await; queues add latency and a place for state to drift out of sync with the interrupt signal.
Want to talk to the thing in this post?
Everything above is live at Velox AI — the platform I built that lets businesses configure their own voice agents with this exact pipeline. Try the public demo, build your own agent, or browse the library.