Technical Implementation of a Real-Time Voice Agent

Introduction

In the previous post I mapped the abstract anatomy of a real-time voice agent — STT, LLM, TTS, VAD, orchestrator, RAG, and the latency budget that ties them together. That post was the blueprint. This one is the construction site.

Every code snippet, every latency number, every architectural choice here is pulled directly out of a working voice-agent platform I built — the same one I shipped as Velox AI. So instead of whiteboard sketches, you get the production answers: how the pieces are actually wired, what surprised me, what I’d reverse if I started over, and where the milliseconds really go.

What you'll take away

A concrete blueprint you can re-implement on your own stack. Every vendor I picked is named alongside 2–3 drop-in alternatives, because the architecture is what matters — the providers are just plug-ins.

The conversation lifecycle

Before we go anywhere, here’s what we’re optimising for. One full conversational turn — from the moment a user finishes speaking to the moment they hear the agent reply — passes through nine stages and crosses three networks. Every millisecond on this rail is a millisecond the user can hear:

Elapsed	Stage	Detail
0 ms	User stops speaking	silence begins
~30 ms	Mic → AudioWorklet → PCM	48 kHz Float32 → 16 kHz Int16
~80 ms	PCM frame → STT (WebSocket)	binary frames, ~20 ms each
~250 ms	STT fires UtteranceEnd	final transcript ready
~450 ms	LLM streams first token	stream=true, OpenAI-compat
~550 ms	First sentence boundary hits	flushed onto the TTS queue
~680 ms	TTS first audio byte	streamed back to the server
~750 ms	Server → client (binary WS)	jitter buffer absorbs ~150 ms
~850 ms	User hears the agent	scheduled via AudioContext clock

Roughly 850 ms end-to-end. Sounds slow on paper. Feels almost-but-not-quite-human in practice — and most of the engineering that follows is about chipping away at that number without losing audio quality or context.
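To see where that budget goes, the cumulative timestamps in the timeline can be turned into per-stage costs. This is a small arithmetic sketch: the numbers are copied from the timeline, while the `stage_costs` helper and the variable names are illustrative assumptions, not part of the platform's code.

```python
# Cumulative timestamps (ms) copied from the conversation-lifecycle timeline.
TIMELINE_MS = [
    ("User stops speaking", 0),
    ("Mic -> AudioWorklet -> PCM", 30),
    ("PCM frame -> STT", 80),
    ("STT fires UtteranceEnd", 250),
    ("LLM streams first token", 450),
    ("First sentence boundary", 550),
    ("TTS first audio byte", 680),
    ("Server -> client", 750),
    ("User hears the agent", 850),
]


def stage_costs(timeline):
    """Return (stage, ms spent reaching it) for every stage after the first."""
    return [
        (name, t - prev_t)
        for (_, prev_t), (name, t) in zip(timeline, timeline[1:])
    ]


costs = stage_costs(TIMELINE_MS)

# Print the most expensive stages first.
for name, ms in sorted(costs, key=lambda c: c[1], reverse=True):
    print(f"{ms:>4} ms  {name}")
```

Sorted this way, the two largest single steps are the wait for the LLM's first token and the wait for STT to declare the utterance finished, which is where most of the later sections spend their effort.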

One pipeline, many vendors

Before any code: the single most important architectural decision you’ll make is whether your pipeline is vendor-locked or vendor-agnostic. Every voice-agent tutorial on the internet hard-codes one stack — “OpenAI Whisper + GPT-4 + ElevenLabs” — and ships it. That works for a demo. It does not work in production.

The reason: the optimal stack depends on language, latency budget, voice taste, and price. An English customer-support bot in the US wants Deepgram + Groq + Deepgram Aura. A Hindi receptionist wants Sarvam everywhere. A premium character-voice game NPC wants ElevenLabs. A privacy-paranoid customer wants self-hosted Piper. Lock yourself into one stack and you’ve locked yourself out of those use cases.

The fix is an adapter pattern — one interface per layer, many implementations behind it, chosen per agent at config time:

python

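from typing import AsyncIterator, Protocol

# Every STT provider implements the same protocol.
# STTEvent is the project's own event type (interim transcripts, utterance end, ...).
class STTAdapter(Protocol):
    async def send_audio(self, pcm: bytes) -> None: ...
    # Declared with plain `def`: implementations are async generators, so calling
    # events() returns the iterator directly and `async for` can consume it.
    def events(self) -> AsyncIterator[STTEvent]: ...
    async def close(self) -> None: ...

# Picking is a one-liner at call setup:
stt = build_stt(agent.stt_provider, language=agent.stt_language)
#       └─► returns DeepgramAdapter | SarvamAdapter | AssemblyAIAdapter | …

# The orchestrator never imports a specific vendor.
async for event in stt.events():
    handle(event)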

Three layers, three adapters. STT, LLM, TTS. Swap any one without touching the other two, the orchestrator, the audio plumbing, or the UI. Here’s what I shipped — and what you can plug in instead:

Speech → Text

Deepgram Nova-3

Best English latency + endpointing; bidirectional WebSocket; semantic VAD baked in.

Drop-in alternatives

Sarvam, AssemblyAI Universal-2, Gladia, Soniox, Whisper (self-host), Google STT

Reasoning (LLM)

Groq / Cerebras / NVIDIA NIM

Raw tokens-per-second is the whole game in voice. All three speak OpenAI-compatible streaming.

Drop-in alternatives

OpenAI gpt-4o-mini, Claude Haiku, Together AI, Fireworks, vLLM (self-host)

Text → Speech

Deepgram Aura

Cheap, fast, streams chunks as it synthesises. Solid baseline English voice.

Drop-in alternatives

ElevenLabs (premium), Sarvam, Cartesia Sonic, PlayHT, OpenAI tts-1, Piper (local ONNX)

Key insight

The provider-agnostic design isn’t just future-proofing — it’s the only honest answer to “which is the best STT/LLM/TTS?” The answer is it depends, and your architecture has to make that answer cheap to act on.
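One way to make that answer cheap to act on is a small registry behind `build_stt`. The sketch below is an assumption about how such a factory could look, not the platform's actual implementation: `STT_REGISTRY`, `register_stt`, and the `EchoSTT` stand-in adapter are hypothetical names used only to keep the example runnable.

```python
from typing import AsyncIterator, Callable, Dict

# Hypothetical registry: provider name -> adapter factory.
STT_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_stt(name: str):
    """Class decorator that records an adapter under a provider name."""
    def decorator(factory):
        STT_REGISTRY[name] = factory
        return factory
    return decorator


@register_stt("echo")
class EchoSTT:
    """Stand-in adapter (not a real vendor) that satisfies the STT protocol."""

    def __init__(self, language: str) -> None:
        self.language = language

    async def send_audio(self, pcm: bytes) -> None:
        pass  # a real adapter would forward the frame to the vendor socket

    async def events(self) -> AsyncIterator[dict]:
        yield {"type": "utterance_end", "final_text": "hello"}

    async def close(self) -> None:
        pass


def build_stt(provider: str, **options):
    """Look the provider up at call setup; fail loudly on unknown names."""
    try:
        factory = STT_REGISTRY[provider]
    except KeyError:
        known = ", ".join(sorted(STT_REGISTRY))
        raise ValueError(f"unknown STT provider {provider!r} (known: {known})") from None
    return factory(**options)
```

Adding a vendor then means writing one adapter class and decorating it; the orchestrator keeps calling `build_stt(agent.stt_provider, ...)` and never changes.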

System architecture

Zoom out. Here’s the whole system on one map: browser, dashboard backend, voice runtime, persistence, and every external provider it talks to. The split that matters is control plane (where agents are defined) versus data plane (where they run).

System map (legend: control plane vs data plane)

Browser (Client · Next.js 15)
• Dashboard UI: Agent Builder
• useVoiceAgent(): WebSocket + control msgs
• AudioWorklet: 48k → 16k PCM
• Links: HTTPS (NextAuth, BFF) to the Next.js API routes; WebSocket (binary PCM + JSON) to the FastAPI backend; Web Audio API for gapless playback

Next.js API routes (Control plane)
• /api/agents/*: CRUD agent configs
• /api/auth/*: NextAuth sessions
• /api/rag (proxy): KB upload → backend

FastAPI backend (Data plane · Python)
• WebSocket handler: /ws/agent/{id}
• llm_orchestrator (per-call task): TaskManager with interrupt signal; sentence-buffered LLM → TTS; tool dispatch
• STT adapter (WS), LLM adapter (HTTP), TTS adapter (HTTP)
• Reads / writes agent docs and sessions; talks to state, vectors and providers

Persistence (Stateful)
• MongoDB Atlas (shared, holds the agent doc): users · agents · calls (recordings)
• Redis (data plane): agent-config cache · rate-limit counters
• Qdrant Cloud (data plane): agent_knowledge · 384-dim cosine

External providers (Pluggable)
• Deepgram, Groq, Sarvam, Cerebras, NVIDIA NIM, ElevenLabs
• STT via WS · LLM + TTS via HTTP

Control plane defines agents · data plane runs them · one shared agent document

Control plane vs data plane

The clearest mental model for a voice-agent platform is the same one cloud-infra teams use: control plane vs data plane.

Control plane = the dashboard. Where agents are defined. A user signs in, edits an agent (system prompt, provider choices, voice, knowledge base), and the result lands in MongoDB. Pure CRUD, no real-time anything, no exotic infrastructure.
Data plane = the voice runtime. Where agents run. A WebSocket opens, the backend reads the agent config from Mongo (cached in Redis), spins up an asyncio task graph (STT loop, LLM stream, TTS worker, audio sender), and streams audio both directions until the user hangs up.

The two share exactly one thing: the agent document. They can deploy together (one box, two containers — what I actually run today on AWS) or scale independently the moment one gets hot. Because the data plane holds no cross-call in-process state — everything per-call lives on a single asyncio task graph — you can scale it horizontally by adding boxes behind a load balancer the day you need to. That decoupling is worth more than any specific tech choice.
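The "read the agent config from Mongo, cached in Redis" step is a classic cache-aside lookup. The sketch below illustrates the pattern only: it uses an in-memory dict where the real system uses Redis, and `AgentConfigCache`, `load_from_db`, and the 60-second TTL are assumptions rather than details from the platform.

```python
import asyncio
import json
import time


class AgentConfigCache:
    """Cache-aside lookup: serve from cache, fall back to the database on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60.0):
        self._load = load_from_db          # async callable: agent_id -> dict
        self._ttl = ttl_seconds
        self._entries = {}                 # agent_id -> (expires_at, json_text)

    async def get(self, agent_id):
        entry = self._entries.get(agent_id)
        if entry and entry[0] > time.monotonic():
            return json.loads(entry[1])    # hit: no database round-trip
        config = await self._load(agent_id)
        # Store serialised, as a network cache would, so callers get fresh copies.
        self._entries[agent_id] = (time.monotonic() + self._ttl, json.dumps(config))
        return config

    def invalidate(self, agent_id):
        """Call this from the control plane whenever an agent is edited."""
        self._entries.pop(agent_id, None)


async def demo():
    async def load_from_db(agent_id):      # stand-in for the MongoDB query
        return {"agent_id": agent_id, "stt_provider": "deepgram"}

    cache = AgentConfigCache(load_from_db)
    return await cache.get("agent-1")
```

Because the cached value is the only thing a call needs from the control plane, any data-plane box can serve any call, which is what makes the horizontal scaling described above straightforward.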

Architectural style

Modular monolith, not microservices. Single FastAPI process, but cleanly partitioned modules (services/stt/*, services/tts/*, state/, tools/). In-process function calls beat any RPC for latency, and the module boundaries are already shaped so extracting a worker pool later would be a refactor, not a rewrite.

The tech stack

Concrete choices, with the “why” in one line and the alternatives you can swap in. None of these is the One True Choice — they’re the trade-offs that fit a real-time, B2B, multi-tenant SaaS.

Layer	My pick	Why	Alternatives
Frontend	Next.js 15 + React 19 + Tailwind	App Router + RSC for the dashboard; Framer Motion for UI animation.	Remix, Astro, SvelteKit, Vite + React
Dashboard API	Next.js API routes	Co-located BFF. No second service just for CRUD.	tRPC, Express, Hono, NestJS
Voice runtime	Python 3 + FastAPI + asyncio	Every AI SDK ships a Python client first; asyncio handles hundreds of IO-bound tasks per call.	Node + uWebSockets, Go + Gorilla, Elixir Phoenix
Primary DB	MongoDB Atlas	Flexible nested agent docs. (Honest: I'd pick Postgres + JSONB if I started over.)	Postgres + JSONB, Supabase, PlanetScale, DynamoDB
Vector DB	Qdrant Cloud	Multi-tenant via payload filter; cloud-hosted means I don't run it.	Pinecone, Weaviate, Milvus, pgvector, Chroma
Embeddings	all-MiniLM-L6-v2 (local)	Free, sub-50 ms, 384-dim, runs on CPU. Good enough for SMB-sized KBs.	OpenAI text-embedding-3-small, Voyage, BGE, Cohere
Cache / KV	Redis	Agent-config cache + per-IP rate limit. Not a queue — everything in-process is asyncio.	Dragonfly, KeyDB, Memcached, in-memory LRU
Auth	NextAuth.js 5	Google/GitHub OAuth + credentials; admin runs a separate token scheme.	Clerk, Supabase Auth, Auth0, custom JWT
Hosting	AWS EC2 + Docker + Nginx	One box in ap-south-1 (Mumbai), two containers, TLS via Let's Encrypt. Boring on purpose.	Fly.io, Render, Railway, Hetzner, K8s

What I'd reverse

MongoDB. The “flexible schema” benefit didn’t pay off — every agent ends up with the same fields, and the relational bits (user → agents, agent → calls, agent → KB files) want a relational store. Postgres + JSONB would have been easier to query, easier to migrate, and cheaper to operate. Starting fresh: start there.

The real-time pipeline

Audio transport: raw WebSocket, not WebRTC

Most tutorials reach for WebRTC because it’s “the real-time standard.” For a browser-to-server voice agent you almost certainly don’t need it. WebRTC buys you three things — NAT traversal, jitter buffering, and AEC — and none are valuable here:

NAT traversal: you have a public server. No peer-to-peer hole-punching.
Jitter buffering: you’re going to build your own client-side ~150 ms buffer anyway (TTS chunks arrive bursty).
AEC: getUserMedia({ echoCancellation: true }) already does it on the browser side.

Skipping WebRTC means no SDP exchange, no STUN/TURN servers, no SFU. Just one TCP+TLS connection carrying binary WebSocket frames in both directions — lower setup latency, far less infra. The audio format on the wire:

Hop	Format	Rate
Mic → AudioWorklet	Float32 mono	48 kHz
AudioWorklet → WebSocket	Int16 PCM (Linear16)	16 kHz
Server → STT	Same Int16	16 kHz
TTS → server → client	Raw PCM bytes	16–24 kHz (provider-dependent)

The 48 kHz Float32 → 16 kHz Int16 downsample is the single most important client-side detail. Doing it in the AudioWorklet (not on the main thread) keeps the audio path off React’s render loop; audio processing that stalls behind a React re-render on the main thread is what causes the mysterious glitches in toy implementations.
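The conversion itself is simple arithmetic. In production it runs in a JavaScript AudioWorklet; the Python sketch below only illustrates the maths, and the function name and the averaging filter are assumptions (a real implementation may use a better low-pass filter before decimating).

```python
import struct


def downsample_48k_to_16k(samples):
    """Float32 samples in [-1, 1] at 48 kHz -> little-endian Int16 PCM bytes at 16 kHz."""
    out = []
    # 48 kHz / 16 kHz = 3, so each output sample is the mean of three inputs.
    # Averaging is a crude low-pass filter that limits aliasing.
    for i in range(0, len(samples) - 2, 3):
        avg = (samples[i] + samples[i + 1] + samples[i + 2]) / 3.0
        clipped = max(-1.0, min(1.0, avg))   # guard against out-of-range input
        out.append(int(clipped * 32767))     # scale to the signed 16-bit range
    return struct.pack(f"<{len(out)}h", *out)


# A 20 ms frame is 960 samples at 48 kHz and 320 samples at 16 kHz.
frame = downsample_48k_to_16k([0.0] * 960)
print(len(frame))  # 320 samples * 2 bytes each
```

That size is also why each binary WebSocket frame in the timeline is small: 20 ms of 16 kHz mono Int16 audio is only 640 bytes.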

The orchestrator loop

Everything above is plumbing. The brain is one async function that owns the call’s state machine: it consumes STT events, fires the LLM, slices its token stream into sentences, queues those for TTS, and pushes audio bytes back over the WebSocket — while listening for interruptions the whole time.

python

# Per-call task graph (simplified from services/llm_orchestrator.py).
# Runs inside the per-call WebSocket handler coroutine; agent, websocket, tts,
# stream_llm, dispatch_tool and TOOL_SCHEMAS come from the surrounding module.
import asyncio

task_manager = TaskManager(websocket)          # audio queue, interrupt signal, llm task handle
history = [{"role": "system", "content": agent.system_prompt}]
stt = build_stt(agent.stt_provider, language=agent.stt_language)

# 1) STT loop: listens for transcripts & barge-in
async def stt_loop():
    async for event in stt.events():
        if event.type == "interim_transcript" and task_manager.is_busy:
            await task_manager.handle_interruption()       # ← see Barge-In section
        elif event.type == "utterance_end":
            # Keep the handle: the task can then be cancelled on barge-in and
            # is not garbage-collected while it is still running.
            task_manager.llm_task = asyncio.create_task(run_llm_and_tts(event.final_text))

# 2) The pipeline: LLM → sentence buffer → TTS queue
async def run_llm_and_tts(user_text):
    history.append({"role": "user", "content": user_text})
    sentence_buffer = ""
    # Keep the system prompt, then the most recent turns (25 messages in total).
    context = history[:1] + history[1:][-24:]
    async for delta in stream_llm(context, tools=TOOL_SCHEMAS):
        if delta.tool_call:                                # tool calls handled separately
            await dispatch_tool(delta.tool_call)
            return
        sentence_buffer += delta.content or ""             # content can be empty on some chunks
        # extract_sentences returns (complete sentences, unfinished remainder)
        sentences, sentence_buffer = extract_sentences(sentence_buffer)
        for sentence in sentences:                         # punctuation-boundary slice
            if task_manager.interrupt_signal.is_set():     # bail on barge-in
                return
            await task_manager.tts_queue.put(sentence)
    if sentence_buffer.strip():
        await task_manager.tts_queue.put(sentence_buffer)  # flush tail (don't drop it!)

# 3) TTS worker: drains the sentence queue
async def tts_worker():
    while True:
        sentence = await task_manager.tts_queue.get()
        async for chunk in tts.synthesize(sentence):
            if task_manager.interrupt_signal.is_set():
                break
            await task_manager.audio_queue.put(chunk)

# 4) Audio sender: drains the audio queue to the client
async def audio_sender():
    while True:
        chunk = await task_manager.audio_queue.get()
        await websocket.send_bytes(chunk)

# run_llm_and_tts is the fourth task; stt_loop spawns it once per utterance.
await asyncio.gather(stt_loop(), tts_worker(), audio_sender())

Four concurrent tasks. Two queues between them. One shared TaskManager holding the interrupt signal that every task checks before doing anything. That’s the entire architecture; everything else in this post is detail about how each of those four tasks behaves.

Sentence buffering: the magic trick

Why does Velox feel snappy when most voice bots feel laggy? Sentence buffering.

A naive pipeline waits for the LLM to finish, sends the whole response to TTS, waits for TTS to synthesise all of it, then plays it. Three serial round-trips — users sit through ~2–3 seconds of dead air every turn. The fix is to chunk the token stream by punctuation boundaries and pipe each finished sentence to TTS the instant it lands:

Figure: LLM tokens arriving → sentence buffer → flushed to TTS. Each finished sentence ships to TTS the instant it lands.

python

# The whole magic, in one regex.
SENTENCE_BOUNDARY = re.compile(r"([.!?:;])\s+|\n")

# Stream tokens in; whenever a terminator is seen, slice off the completed
# sentence and ship it to TTS. Everything after the cut stays in the buffer.

Three sub-decisions matter here:

01 Boundary set: . ! ? : ; \n. Anything finer (token count, mid-clause) produces glitchy half-words from TTS. Anything coarser (whole paragraphs) starves the user of audio.
02 Hold the tail. If the LLM ends mid-sentence (no terminator), the last buffered chunk has to be flushed at end-of-stream; otherwise the final clause silently disappears. (Ask me how I know.)
03 Drop too-short fragments. Sentences shorter than ~3 chars are held back; otherwise an “OK.” costs a full TTS round-trip to say one syllable.

Net effect: the user hears the first sentence roughly 850 ms after they stop speaking, not ~2.5 s. TTS for sentence two runs while sentence one is playing, so synthesis and playback overlap instead of queueing behind each other.
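The three sub-decisions fit in a small class. This is a minimal sketch under stated assumptions, not the production code: `SentenceBuffer`, `MIN_CHARS`, and the lookahead variant of the boundary regex are illustrative choices, and the threshold of 4 is picked so that a bare “OK.” is held back.

```python
import re

# A terminator counts only once whitespace follows it, so "3." in a
# half-streamed "3.14" is not mistaken for the end of a sentence.
SENTENCE_END = re.compile(r"[.!?:;](?=\s)|\n")
MIN_CHARS = 4  # assumed threshold: shorter fragments wait for the next sentence


class SentenceBuffer:
    def __init__(self):
        self._buf = ""

    def feed(self, token):
        """Add streamed text; return the sentences that are now complete."""
        self._buf += token
        ready, start = [], 0
        for match in SENTENCE_END.finditer(self._buf):
            candidate = self._buf[start:match.end()].strip()
            if len(candidate) < MIN_CHARS:
                continue  # too short: keep it and merge with what follows
            ready.append(candidate)
            start = match.end()
        self._buf = self._buf[start:]   # everything after the cut stays buffered
        return ready

    def flush(self):
        """Call at end-of-stream so an unterminated tail is not lost."""
        tail, self._buf = self._buf.strip(), ""
        return [tail] if tail else []


buf = SentenceBuffer()
spoken = []
for token in ["Hello", " there.", " How", " are you"]:
    spoken.extend(buf.feed(token))
spoken.extend(buf.flush())
print(spoken)
```

Each string returned by `feed` would go straight onto the TTS queue; `flush` runs once when the LLM stream ends.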

Barge-in, implemented

Barge-in — the user talking over the agent mid-sentence — was the bug class that bit the most before it was solved. The hard part isn’t detecting that the user spoke; it’s making the agent shut up within ~100 ms, across three concurrent tasks, two queues, and one TCP stream’s worth of buffered audio.

Two engineering choices made it tractable:

1. Detect barge-in semantically, not acoustically. A naive implementation watches mic energy: anything above a threshold counts as speech. In the real world that fires on keyboard clicks, doors, AC fans, the user’s own breathing. Velox uses the STT engine’s interim transcripts instead: if Deepgram returns a partial that contains a recognisable word while the agent is speaking, that’s a real interrupt. Otherwise it’s noise, so ignore it.

2. Treat interruption as a cascade, not a flag. When the interrupt fires, six things have to happen in order. Get any of them wrong and you hear overlapping voices, ghost audio, or the agent resuming where it left off after a two-second pause:

Offset	What happens
0 ms	STT emits an interim transcript with a real word while task_manager.is_busy is True.
+1 ms	interrupt_signal.set() is called; every loop checks this flag before doing anything.
+5 ms	The in-flight LLM stream is cancelled. The task awaiting the httpx stream receives CancelledError and unwinds.
+10 ms	The TTS queue and audio queue are drained; anything synthesised but not yet sent is dropped on the floor.
+12 ms	A {"type":"control","action":"interrupt"} message goes over the WebSocket. The client clears its playback queue and stops every AudioBufferSourceNode currently scheduled.
+15 ms	was_interrupted = True is flagged on the call state; the next LLM call sees [User interrupted you] injected into history so the model recovers gracefully.
+20 ms	The audio-sender task is reset so it doesn’t drain stale futures. STT keeps listening.
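The server side of that cascade can be sketched as one method. This is an illustration under assumptions, not the platform's `TaskManager`: the attribute names mirror the ones used in this post, `_drain` is a hypothetical helper, and the final step (resetting the audio sender and re-arming the signal for the next turn) is left out for brevity.

```python
import asyncio
import json


class TaskManager:
    def __init__(self, websocket):
        self.websocket = websocket
        self.interrupt_signal = asyncio.Event()
        self.tts_queue = asyncio.Queue()
        self.audio_queue = asyncio.Queue()
        self.llm_task = None
        self.was_interrupted = False

    @staticmethod
    def _drain(queue):
        """Discard everything currently queued without blocking."""
        while not queue.empty():
            queue.get_nowait()

    async def handle_interruption(self):
        # 1) Raise the flag first so every loop stops producing new work.
        self.interrupt_signal.set()
        # 2) Cancel the in-flight LLM stream and wait for it to unwind.
        if self.llm_task is not None and not self.llm_task.done():
            self.llm_task.cancel()
            await asyncio.gather(self.llm_task, return_exceptions=True)
        # 3) Drop sentences and audio that were produced but never sent.
        self._drain(self.tts_queue)
        self._drain(self.audio_queue)
        # 4) Tell the client to flush its playback queue.
        await self.websocket.send_text(
            json.dumps({"type": "control", "action": "interrupt"})
        )
        # 5) Let the next LLM call know the previous reply was cut short.
        self.was_interrupted = True
```

The ordering is the point: setting the flag before cancelling and draining means no task can enqueue fresh audio in the gap between the steps.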

Echo cancellation

All of the above assumes the user’s mic isn’t capturing the agent’s own speaker output and triggering false interrupts. The browser’s built-in WebRTC AEC (via getUserMedia({ echoCancellation: true })) handles this for free in-browser. Server-side you’d need reference-signal subtraction — subtract the TTS output from the inbound mic stream before it reaches STT — but for browser-first apps you genuinely don’t have to write that code.

The known imperfection: when the user interrupts mid-sentence, the conversation history records the full intended response, not what the user actually heard. If they reference something the agent never finished saying, the model can get confused. The fix is to truncate history to the last word the user heard — which requires knowing which audio chunks actually made it past the jitter buffer. Not solved yet.

Knowledge base & RAG

A voice agent without business knowledge can’t answer “what’s your return policy?” It needs RAG — retrieval-augmented generation. The pipeline is dead simple, but every parameter is a trade-off:

01 Upload: drag-drop in the dashboard (multipart upload → backend)
02 Parse: extract text by file type (PDF → PyPDF, DOCX → python-docx, TXT → utf-8, Image → OCR)
03 Chunk: split into overlapping windows (RecursiveCharacterTextSplitter · size 500 · overlap 50)
04 Embed: vectorise each chunk (all-MiniLM-L6-v2 → 384-dim vectors)
05 Upsert: store in the vector DB (Qdrant · collection = agent_knowledge · payload: agent_id, file_id, filename, content, source, chunk_index)

Drag-drop to vector: five stages, every parameter a trade-off
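To make the Chunk stage concrete, here is a deliberately simplified splitter. It is an assumption-laden stand-in: it cuts fixed 500-character windows with a 50-character overlap and does not try to respect paragraph or sentence boundaries the way a recursive splitter does. `chunk_text` is a hypothetical name.

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap              # each window starts 450 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end
            break
    return chunks


document = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(document)
print([len(c) for c in chunks])
```

The overlap means a fact that straddles a window edge still appears whole in at least one chunk, at the cost of storing roughly 10% more text.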

Choices worth defending:

500-char / 50-overlap chunks, character-based, not token-based. Small enough to fit 3–5 in an LLM call without bloating cost; large enough to carry real context. The recursive splitter respects paragraph → sentence → word boundaries before falling back to mid-word.
Local embeddings (all-MiniLM-L6-v2) over OpenAI’s text-embedding-3-small. 384-dim is plenty for SMB-sized KBs (~50–500 chunks per agent), inference is sub-50 ms on CPU, and there’s no per-token bill on the hot path. The recall gap vs OpenAI isn’t noticeable below a few thousand chunks. For technical / legal / multilingual KBs, swap in something stronger.
One Qdrant collection, payload-filter isolation. Every point has an agent_id field with a payload index; every query filters by it. Cheap, simple, scales to thousands of agents. If a customer ever demands physical separation, the migration is “one collection per tenant”: a refactor, not a rewrite.
Top-3 vector-only retrieval. No hybrid keyword search, no re-ranker. Top-3 with 500-char chunks puts ~1500 chars of context in the LLM — concise factual answers that read naturally when spoken aloud.

python

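from qdrant_client import models

# `qdrant` is a qdrant_client.QdrantClient instance; `embed` wraps the
# all-MiniLM-L6-v2 encoder. Both are created elsewhere at startup.

# Tenant isolation = one line of Qdrant filter
query_filter = models.Filter(must=[
    models.FieldCondition(
        key="agent_id",
        match=models.MatchValue(value=agent_id),
    )
])

# Newer qdrant-client versions offer query_points() as the replacement for search().
results = qdrant.search(
    collection_name="agent_knowledge",
    query_vector=embed(user_question),     # 384-dim
    query_filter=query_filter,
    limit=3,
)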

The retrieved chunks aren’t stuffed into the system prompt; they’re injected as the response of a search_knowledge_base tool call. The LLM decides when to ask for them. That’s crucial: the model can choose not to search when the user is just exchanging pleasantries, saving a Qdrant round-trip on those turns. Which leads to…

Tool calls inside a streaming loop

Tool calling on a non-streaming HTTP chat endpoint is easy. Tool calling inside a streaming voice pipeline is genuinely tricky, because tool calls arrive as partial JSON interleaved with regular content tokens, and you have to decide on the fly:

Suppress any “narration” tokens the model emits alongside the tool call (it shouldn’t say “I’m going to look that up…” out loud while it’s already doing it).
Mask the tool latency with a filler phrase so the user hears something.
Execute the tool, append the result to history, then re-call the LLM — which streams the actual spoken answer back through the sentence buffer.

python

import random

async def run_llm_and_tts(user_text=None):
    # user_text is None on the follow-up call that answers after a tool result
    if user_text is not None:
        history.append({"role": "user", "content": user_text})
    sentence_buffer = ""
    tool_call_buffer = None

    # Keep the system prompt, then the most recent turns (25 messages in total).
    context = history[:1] + history[1:][-24:]
    async for delta in stream_llm(context, tools=TOOL_SCHEMAS):
        # 1) Accumulate partial tool-call JSON across stream chunks
        if delta.tool_call:
            tool_call_buffer = accumulate(tool_call_buffer, delta.tool_call)
            continue                       # don't TTS this delta

        # 2) Regular content tokens flow into the sentence buffer
        sentence_buffer += delta.content or ""
        # extract_sentences returns (complete sentences, unfinished remainder)
        sentences, sentence_buffer = extract_sentences(sentence_buffer)
        for sentence in sentences:
            if task_manager.interrupt_signal.is_set():
                return
            await task_manager.tts_queue.put(sentence)

    # 3) End of stream: if a tool was called, execute and recurse
    if tool_call_buffer:
        # Latency mask: queue a filler while the tool runs
        await task_manager.tts_queue.put(random.choice(FILLER_PHRASES))
        # OpenAI-compatible APIs expect the assistant's tool-call turn in history
        # before the tool result; to_assistant_message is a hypothetical helper.
        history.append(to_assistant_message(tool_call_buffer))
        result = await execute_tool(tool_call_buffer)        # search KB, end call, etc.
        history.append({"role": "tool", "content": result,
                        "tool_call_id": tool_call_buffer.id})
        await run_llm_and_tts()                              # second LLM call answers
    elif sentence_buffer.strip():
        await task_manager.tts_queue.put(sentence_buffer)    # flush tail

The filler phrase (“Let me check the knowledge base for that…”) masks ~400–700 ms of dead air on knowledge searches. Small thing, but it’s the difference between an agent that feels thoughtful and one that feels stuck. The same trick applies to slow LLMs in general — queue a generic “thinking” phrase before the call, and the user hears engagement instead of silence.
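The `accumulate` step deserves a closer look, because the arguments of a tool call arrive as string fragments that are not valid JSON until the stream ends. The sketch below assumes OpenAI-style streaming fragments (an `id` and function `name` on the first piece, `arguments` text on later pieces) represented as plain dicts, and a single tool call per turn; `ToolCallBuffer` is a hypothetical type.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCallBuffer:
    id: str = ""
    name: str = ""
    arguments: str = ""   # raw JSON text; only parseable once the stream ends

    def parsed_arguments(self):
        return json.loads(self.arguments or "{}")


def accumulate(buffer, fragment):
    """Merge one streamed tool-call fragment (a dict) into the buffer."""
    if buffer is None:
        buffer = ToolCallBuffer()
    function = fragment.get("function") or {}
    # id and name normally arrive once, on the first fragment.
    buffer.id = buffer.id or fragment.get("id") or ""
    buffer.name = buffer.name or function.get("name") or ""
    # Argument text arrives in pieces and is simply concatenated.
    buffer.arguments += function.get("arguments") or ""
    return buffer


fragments = [
    {"id": "call_1", "function": {"name": "search_knowledge_base", "arguments": ""}},
    {"function": {"arguments": '{"query": "ret'}},
    {"function": {"arguments": 'urn policy"}'}},
]
call = None
for fragment in fragments:
    call = accumulate(call, fragment)
print(call.name, call.parsed_arguments())
```

Parsing is deferred to `parsed_arguments()` on purpose: calling `json.loads` on any intermediate state would raise, so the tool must not be executed until the stream has finished.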

Built-in tools + obvious additions

Today: search_knowledge_base(query) and end_call(reason). The obvious next additions for customer care: transfer_to_human, send_sms, send_email, schedule_callback, book_appointment, lookup_order(id). For user-defined tools, the clearly-right design is per-agent encrypted credentials in the DB plus server-side execution — credentials never reach the client. Not built yet, but the adapter shape is there.
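For reference, the two built-in tools could be declared to an OpenAI-compatible endpoint roughly as follows. The tool and parameter names come from this post; the descriptions and the exact wording are assumptions, and the real `TOOL_SCHEMAS` may differ.

```python
import json

# Sketch of TOOL_SCHEMAS in the OpenAI-compatible "tools" format.
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Look up facts in the business's uploaded documents.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "What to search for, phrased as a short question.",
                    }
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "end_call",
            "description": "Hang up once the conversation is finished.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {
                        "type": "string",
                        "description": "Why the call is ending.",
                    }
                },
                "required": ["reason"],
            },
        },
    },
]

print(json.dumps([tool["function"]["name"] for tool in TOOL_SCHEMAS]))
```

New tools such as `transfer_to_human` would be one more entry in this list plus a matching branch in the server-side executor.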

Latency in practice

Real numbers from a production pipeline on a good connection (English path: Deepgram STT + NVIDIA NIM LLM + Deepgram Aura TTS, client in ap-south-1). These are estimates from log-stamped TTS first-byte events. Honest disclaimer: there’s no P50/P95 dashboard yet, which is genuinely the biggest operational gap left in the system.

Stage	Latency
Mic → WS frame	30 ms
Network (in)	50 ms
STT endpointing	200 ms
LLM TTFT	200 ms
Sentence flush	100 ms
TTS TTFB	150 ms
Network (out)	50 ms
Jitter buffer	150 ms
Total TTFB	~850 ms

Two observations worth taking away:

01 The LLM is rarely the bottleneck anymore. Going in, I assumed “AI is slow” and built around hiding model latency. With Groq / Cerebras / NVIDIA NIM returning first tokens in 150–250 ms, the LLM is faster than the network round-trip to TTS. The real bottlenecks are STT endpointing waiting for silence, and TTS providers that don’t stream.
02 Users notice silence after they speak, not delay before they’re answered. 200 ms of silence between “Hi” and the agent reacting feels broken. 500 ms of audio delay once the agent is talking feels totally fine. That asymmetry should shape where you spend your optimisation budget.

The optimisation that mattered most

Not a latency win but a perception win: switching from energy-based VAD to transcript-based barge-in. Before: any background sound (typing, doors, fans) interrupted the agent. After: only real speech does. Users stopped reporting that the AI “sounds skittish.” Sometimes the biggest performance gain isn’t a faster path; it’s removing a path that fires when it shouldn’t.

What’s left to attack: Sarvam’s TTS doesn’t stream; it returns full utterances in one HTTP response. For a one-sentence reply that’s 300–600 ms of dead air on Indian-language agents. The day Sarvam ships a streaming endpoint, that’s a one-line config change for a 30% latency win. STT endpointing (utterance_end_ms=1000) is the runner-up: dropping it to 500 ms shaves half a second off perceived latency at the cost of occasional early triggers. A tunable knob, not a code change.
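Since the endpointing knob is just a connection parameter, it can live in per-agent config. The sketch below builds a Deepgram streaming URL; the parameter names follow Deepgram's streaming API as I understand it, the helper name and defaults are assumptions, and the provider's documentation should be checked for the smallest `utterance_end_ms` value it honours.

```python
from urllib.parse import urlencode


def deepgram_listen_url(utterance_end_ms=1000, language="en"):
    """Build the WebSocket URL for a streaming STT session (illustrative helper)."""
    params = {
        "model": "nova-3",
        "language": language,
        "encoding": "linear16",        # the Int16 PCM sent by the client
        "sample_rate": 16000,
        "channels": 1,
        "interim_results": "true",     # needed for transcript-based barge-in
        "utterance_end_ms": utterance_end_ms,
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)


print(deepgram_listen_url(utterance_end_ms=500))
```

Exposing the value per agent makes it easy to run a patient setting for slow talkers and a snappier one for quick-fire support queues.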

What surprised me

Three things I didn’t expect going in — worth sharing because every voice-agent post on the internet was wrong about at least one of them:

01 The LLM is the fastest part. See above. Most posts treat the LLM as the slow, expensive thing you wrap with caching and prompt compression. With modern inference providers it’s faster than your network round-trip. Spend your budget on STT endpointing and TTS streaming; that’s where the seconds live.
02 Semantic caching isn’t worth it for voice. Theory: 40% of support queries are repeats; cache them. Reality: voice queries are wildly variable in phrasing (“how do I…” / “can you tell me…” / “what’s the way to…”), hit rates are low, and the operational complexity isn’t worth it when the uncached path is already ~700 ms.
03 You need fewer queues than you think. Early prototypes had a queue between every stage. All you actually need is a sentence queue (LLM → TTS) and an audio queue (TTS → client). Everything else is in-process async/await; queues add latency and a place for state to drift out of sync with the interrupt signal.

Want to talk to the thing in this post?

Everything above is live at Velox AI — the platform I built that lets businesses configure their own voice agents with this exact pipeline. Try the public demo, build your own agent, or browse the library.

Thanks for reading. The first half — the conceptual architecture — lives in Part 01. I’d love to hear what you’d build with it.