Real-Time Voice Agent — Latency Budget & Media Path

Twilio-shaped: PSTN/SIP + WebRTC ingress, media streaming, sub-second mouth-to-mouth. Companion to multimodal_fusion_design.md.

The number you defend in design review is mouth-to-mouth latency: from the instant the human stops speaking to the instant they hear the bot. Target <800ms (banks/enterprise), great is ~500ms. Everything below is about spending that budget — and the dominant cost is not a model, it's deciding the human actually finished talking.

1. The end-to-end media path — where the milliseconds live

Diagram legend: solid = inbound media (caller → models) · green = the return leg (streamed TTS audio → playout). The amber band is the codec ceiling: the ingress format caps STT quality before any model choice.

Twilio-specific reality (the part most candidates miss)

PSTN audio is 8kHz μ-law (G.711), narrowband, ~64kbps. You do not control this — the carrier delivers it. It caps STT accuracy vs. the 48kHz Opus you'd get over WebRTC. Mention upsampling/denoise before STT.
Media Streams forks the call audio to your WebSocket in 20ms μ-law frames. ConversationRelay is the newer managed path — Twilio runs STT+TTS, you bring the LLM over WebSocket (removes two hops from your budget, at the cost of model choice).
Regional POPs / media servers — keep the media path close to the caller; a transatlantic RTT can eat 150ms by itself. Place the agent near the POP.
Carrier/PSTN setup + jitter is real and variable; WebRTC gives you a jitter buffer you can tune, PSTN does not.
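The frame arithmetic behind these bullets can be sketched in a few lines. This is a minimal sketch, assuming the Media Streams `media` message is a JSON envelope with a base64 μ-law payload under `media.payload`; the `decode_media_message` helper and the sample message are hypothetical, and the decoder is the standard G.711 μ-law expansion written in pure Python, not a Twilio SDK call.

```python
import base64
import json

SAMPLE_RATE_HZ = 8000                                  # PSTN narrowband
BITS_PER_SAMPLE = 8                                    # mu-law: one byte per sample
FRAME_MS = 20                                          # frame size stated above
BITRATE_BPS = SAMPLE_RATE_HZ * BITS_PER_SAMPLE         # 64000, i.e. ~64kbps
BYTES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000    # 160 bytes per 20ms frame


def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are transmitted bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_message(raw: str) -> list[int]:
    """Hypothetical helper: JSON text frame -> list of PCM16 samples."""
    msg = json.loads(raw)
    payload = base64.b64decode(msg["media"]["payload"])
    return [ulaw_to_pcm16(b) for b in payload]


# Hypothetical 20ms frame of digital silence (0xFF is mu-law zero).
silence = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\xff" * BYTES_PER_FRAME).decode()},
})
samples = decode_media_message(silence)
```

Decoding to linear PCM is the step that has to happen before any upsampling or denoising; it recovers no bandwidth, so the 8kHz ceiling still applies downstream.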

2. The mouth-to-mouth budget — staged

This is the table to draw. Each row is a stage; the trick is that they pipeline, so wall-clock < sum (see §3).

Stage	What happens	Naive	Tuned	Notes
Network in (jitter buf)	frames arrive, de-jittered	40ms	20ms	WebRTC tunable; PSTN fixed
Endpointing (turn end)	decide the human stopped	700ms	300ms	DOMINATES. silence wait + turn model
STT (final)	partials stream live; finalize on endpoint	300ms	~50ms	partials already done; only the tail
LLM TTFT	time to first token	600ms	250ms	prompt cache, small model, co-locate
LLM→enough text	first sentence/clause to start TTS	400ms	~0ms	overlaps TTS — don't wait for full reply
TTS TTFB	first audio chunk out	300ms	120ms	streaming TTS (Cartesia), first chunk only
Network out + playout	encode, send, jitter buf, speaker	60ms	30ms	μ-law re-encode on PSTN
MOUTH-TO-MOUTH	perceived, with pipelining	~1.4s	~550ms	see waterfall §3

The dominant cost is endpointing, not inference. Candidates obsess over LLM speed; the human-perceived lag is mostly how long you wait in silence to be sure they finished. Halving endpointing beats halving the LLM.
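The table's arithmetic can be checked directly. A minimal sketch, assuming the approximate cells (~50ms, ~0ms) are taken at face value; the dictionary keys and helper names are made up for illustration, and "saving from halving" assumes the stage sits on the critical path.

```python
# (naive_ms, tuned_ms) per stage, copied from the table above.
STAGES_MS = {
    "network_in": (40, 20),
    "endpointing": (700, 300),
    "stt_final": (300, 50),
    "llm_ttft": (600, 250),
    "llm_enough_text": (400, 0),
    "tts_ttfb": (300, 120),
    "network_out": (60, 30),
}
NAIVE, TUNED = 0, 1


def serial_sum(col: int) -> int:
    """What the turn would cost if every stage ran strictly one after another."""
    return sum(ms[col] for ms in STAGES_MS.values())


def largest_stage(col: int) -> str:
    """Name of the single most expensive row in a column."""
    return max(STAGES_MS, key=lambda name: STAGES_MS[name][col])


def saving_from_halving(stage: str, col: int) -> float:
    """Milliseconds removed from the critical path by halving one stage."""
    return STAGES_MS[stage][col] / 2


naive_total = serial_sum(NAIVE)   # 2400
tuned_total = serial_sum(TUNED)   # 770
```

Both serial sums sit above the table's perceived figures (~1.4s and ~550ms), which is the pipelining gap. In the tuned column, halving endpointing saves 150ms against 125ms for halving LLM TTFT.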

3. Why you STREAM and PIPELINE — the waterfall

You never run the stages serially. STT emits partials while the user talks; the LLM can be speculatively prefilled on partials; TTS starts on the first clause while the LLM is still generating. The budget is a waterfall, not a sum.

The bars that start before the endpoint line (STT partials, speculative prefill) are work done while the user is still talking — that overlap is why wall-clock is a waterfall, not a sum. Only the endpointing wait, the final STT tail, decode, TTS first-byte, and playout fall to the right of the gate.

Streaming STT → the transcript is basically ready at endpoint; you pay only the finalization tail, not the whole utterance.
Speculative LLM prefill → start prefilling on the partial transcript; if it's stable, TTFT is near-instant at endpoint. Cancel/redo if the tail changes the meaning (cost: wasted prefill — cheap relative to the latency win).
Streaming TTS + sentence chunking → emit audio for clause 1 while the LLM writes clause 2. First-audio is what the user feels.
Prompt caching → the system prompt + conversation history are a cache hit every turn; only the new turn is fresh prefill. (KV-cache reuse — ties to 14_inference_server.)
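Sentence chunking is the easiest of these to show in code. A minimal sketch, assuming the LLM yields text tokens one at a time and that punctuation is a good enough clause boundary; `clauses`, `min_chars`, and the token list are hypothetical, and a real pipeline would use an async generator feeding a streaming TTS client.

```python
from typing import Iterable, Iterator

CLAUSE_END = (".", "!", "?", ",", ";", ":")


def clauses(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield a clause as soon as it is speakable, without waiting for the full reply."""
    buf = ""
    for tok in tokens:
        buf += tok
        text = buf.strip()
        # min_chars avoids sending tiny fragments like "Sure," to TTS on their own.
        if len(text) >= min_chars and text.endswith(CLAUSE_END):
            yield text
            buf = ""
    if buf.strip():
        yield buf.strip()          # flush whatever is left when the stream ends


TOKENS = ["Sure", ",", " I", " can", " help", " with", " that", ".",
          " Which", " account", "?"]

consumed = []


def token_stream():
    """Stand-in for a streaming LLM; records how many tokens have been pulled."""
    for tok in TOKENS:
        consumed.append(tok)
        yield tok


gen = clauses(token_stream())
first_clause = next(gen)           # available after 8 of 11 tokens
tokens_before_first_audio = len(consumed)
remaining = list(gen)
```

The first clause can go to TTS while the remaining tokens are still being generated, which is why first-audio latency tracks the first clause and not the whole reply.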

4. Endpointing & turn-taking — the hard, high-leverage part

The core question: did they STOP, or just PAUSE?

VAD tells you "is there voice energy right now." It does not tell you "is the turn over." A 300ms gap mid-sentence ("send it to… uh… my checking account") must NOT trigger the bot. Telling a pause from a stop is the job of a separate turn-detection / endpointing model (semantic + acoustic), not raw VAD.

VAD (Silero, WebRTC) — fast, cheap, per-frame energy/voice gate. The first filter.
Endpointing — combines silence duration and linguistic completeness. "I want to…" (incomplete → wait longer) vs "transfer fifty dollars." (complete → fire fast).
The tradeoff knob: short silence threshold = snappy but interrupts people who pause to think; long threshold = polite but laggy. This is the latency-vs-correctness knob again, and it's per-deployment (a bank IVR tolerates more wait than a casual assistant).
Adaptive — shorten the threshold after a clearly complete clause; lengthen it after a filler word or rising intonation.
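The adaptive threshold can be sketched as a function of the partial transcript. A toy sketch only: the word list and the three thresholds are illustrative assumptions, the text heuristic stands in for a real semantic + acoustic turn model, and it assumes the STT partials carry punctuation (many do not).

```python
# Toy stand-ins for "the speaker is clearly not done": fillers and dangling function words.
INCOMPLETE_TAILS = {"uh", "um", "er", "hmm", "to", "and", "the", "my", "or", "but"}


def silence_threshold_ms(partial: str, base_ms: int = 500,
                         fast_ms: int = 250, slow_ms: int = 900) -> int:
    """Pick how long to wait in silence before declaring the turn over."""
    text = partial.strip().lower()
    if text.endswith(("...", "…")):
        return slow_ms                       # trailing off: wait longer
    words = text.rstrip(".,!?").split()
    if not words:
        return base_ms
    if words[-1] in INCOMPLETE_TAILS:
        return slow_ms                       # filler or dangling word: wait longer
    if text.endswith((".", "!", "?")):
        return fast_ms                       # clearly complete clause: fire fast
    return base_ms


def should_end_turn(partial: str, silence_ms: int) -> bool:
    """True once the observed silence exceeds the threshold for this partial."""
    return silence_ms >= silence_threshold_ms(partial)
```

With these numbers, a 300ms gap after "transfer fifty dollars." ends the turn, while the same gap after "send it to… uh" does not.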

Barge-in (interrupt the bot)

When the caller talks over the bot, three things must hold at single-frame latency:

Stop TTS instantly — cancel playback + flush the outbound audio buffer (this is 10_audio_pipeline_interruptible.py — one asyncio.Event, race every await against it).
AEC3 must already be running so the bot's own voice in the mic isn't mistaken for barge-in. Without echo cancellation, the bot interrupts itself.
Re-sequence — the caller's interruption is a new high-priority event; discard the bot's half-spoken turn from context or mark it truncated so the LLM knows it wasn't fully heard.
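The "one asyncio.Event, race every await against it" pattern looks roughly like this. A minimal sketch in the spirit of that description, not the contents of the referenced file; `race`, `play_interruptible`, and the fake `send` are hypothetical names, and the demo triggers barge-in from inside `send` only to keep the example deterministic.

```python
import asyncio


async def race(awaitable, cancel: asyncio.Event):
    """Run `awaitable` unless `cancel` fires first. Returns (finished, result)."""
    work = asyncio.ensure_future(awaitable)
    stop = asyncio.ensure_future(cancel.wait())
    done, pending = await asyncio.wait({work, stop},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                      # drop the loser of the race
    if work in done:
        return True, work.result()
    return False, None


async def play_interruptible(chunks, send, barge_in: asyncio.Event):
    """Send TTS chunks until done or interrupted. Returns (played, truncated)."""
    played = []
    for chunk in chunks:
        if barge_in.is_set():
            break
        finished, _ = await race(send(chunk), barge_in)
        if not finished:
            break                          # caller spoke: stop mid-chunk
        played.append(chunk)
    return played, len(played) < len(chunks)


async def demo():
    barge_in = asyncio.Event()

    async def send(chunk):
        if chunk == "c":                   # simulate the caller speaking during "c"
            barge_in.set()
            await asyncio.sleep(10)        # this send never completes in time

    return await play_interruptible(["a", "b", "c", "d"], send, barge_in)


result = asyncio.run(demo())
```

The `truncated` flag is what the re-sequencing step needs: the list of chunks actually played tells the LLM context how much of the bot's turn the caller heard.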

5. Budget under failure & tail latency

The average is a lie; users feel p95/p99. One slow stage blows the turn.

Failure	Effect	Mitigation
LLM TTFT spike	dead air mid-conversation	filler/backchannel ("let me check…") to mask; timeout→fallback model
STT slow/wrong on μ-law	misheard intent	upsample+denoise; confidence gate → reprompt
TTS stall	cut-off speech	buffer 1–2 chunks before playout starts; cache common phrases
Network jitter	choppy audio	jitter buffer (adds latency — the tradeoff); regional POP

Dead air is the worst failure — humans read >~700ms of silence as "the line dropped." A cheap backchannel ("mm-hm", "let me pull that up") buys you LLM time and feels natural. Latency you can't remove, you can mask.
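Masking can be sketched as a deadline on the first LLM output. A minimal sketch, assuming a 400ms filler deadline (an illustrative value chosen to sit under the ~700ms dead-air mark) and hypothetical `reply_with_filler`, `say`, and `llm` names; the LLM call is not cancelled when the deadline passes, it just gets covered.

```python
import asyncio


async def reply_with_filler(llm_call, say, filler_after_s: float = 0.4,
                            filler: str = "Let me check that."):
    """Speak a backchannel if the LLM has produced nothing by the deadline."""
    task = asyncio.ensure_future(llm_call)
    # asyncio.wait with a timeout leaves the task running when the deadline hits.
    done, _ = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await say(filler)
    reply = await task
    await say(reply)
    return reply


async def demo(llm_delay_s: float):
    spoken = []

    async def say(text):                   # stand-in for streaming TTS
        spoken.append(text)

    async def llm():                       # stand-in for a slow or fast model call
        await asyncio.sleep(llm_delay_s)
        return "Your balance is 42 dollars."

    await reply_with_filler(llm(), say, filler_after_s=0.05)
    return spoken


slow_turn = asyncio.run(demo(0.3))
fast_turn = asyncio.run(demo(0.0))
```

The same structure extends to the timeout→fallback mitigation in the table: a second, longer deadline could cancel the task and call a smaller model instead.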

6. Scaling & economics — the resource-scheduler view

A voice session pins a GPU slot for its whole duration (STT+TTS streaming, KV-cache resident) — unlike text, which is bursty and poolable. Concurrency/GPU is the cost driver.
Self-host vs API per hop: NVIDIA Parakeet/Canary on Modal is ~1–2 orders of magnitude cheaper than proprietary STT, at the cost of running infra (see voice_agents_notes.md tables; ~4¢/GPU-min). TTS (Cartesia) and LLM (OpenAI) often stay API for quality/latency.
KV-cache compression (~7×) → more concurrent voice sessions per GPU + longer context at the same cost. Directly raises sessions-per-GPU.
Scheduler — sessions → GPU slots; admit by KV-cache budget, not headcount (the 14_inference_server thesis). Text and voice share the LLM pool but at different tiers/priorities.
Continuous (in-flight) batching, NOT static fill-and-fire: voice can't wait to fill a batch, because the fill-wait is the latency. Admit each request into the running batch immediately and let the GPU interleave it with sequences already decoding; you get batch efficiency with zero fill-wait. This is one engine with per-class policy, NOT two systems: voice and chat ride the same continuously batched GPU pool as bulk work but are admitted immediately (latency-first), while bulk/offline work accumulates ("fill 100 or 40ms") and backfills idle slots at low priority (throughput-first). The cheap Batch API tier is that backfill.
Paged KV (PagedAttention) — KV cache in fixed non-contiguous blocks; admit/evict sessions by block availability. This is the mechanism that makes continuous batching and "admit by KV-cache budget" actually work without fragmentation.
Prefill vs decode — a joining session's prefill is compute-bound and can stall everyone else's decode loop (audible jitter for active calls). Chunk or disaggregate prefill so a new session doesn't add latency to live ones.
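"Admit by KV-cache budget, not headcount" reduces to block accounting. A toy sketch, assuming fixed-size blocks and up-front reservation by expected context length; the `PagedKVPool` class and its numbers are hypothetical, and real paged-KV engines typically allocate blocks on demand as sequences grow, so this shows the admission idea and is not any engine's API.

```python
import math


class PagedKVPool:
    """Toy admission control over a pool of fixed-size KV-cache blocks."""

    def __init__(self, total_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free_blocks = total_blocks
        self.sessions: dict[str, int] = {}     # session id -> blocks held

    def blocks_for(self, tokens: int) -> int:
        """Blocks needed to hold `tokens` of KV cache (last block may be partial)."""
        return math.ceil(tokens / self.block_tokens)

    def admit(self, session_id: str, expected_tokens: int) -> bool:
        """Admit only if the KV budget fits, regardless of how many sessions exist."""
        need = self.blocks_for(expected_tokens)
        if session_id in self.sessions or need > self.free_blocks:
            return False
        self.free_blocks -= need
        self.sessions[session_id] = need
        return True

    def release(self, session_id: str) -> None:
        """Return a finished session's blocks to the pool."""
        self.free_blocks += self.sessions.pop(session_id)


pool = PagedKVPool(total_blocks=100, block_tokens=16)
admitted_a = pool.admit("call-a", expected_tokens=800)    # needs 50 blocks
admitted_b = pool.admit("call-b", expected_tokens=801)    # needs 51 blocks, only 50 free
admitted_c = pool.admit("call-c", expected_tokens=800)    # needs 50 blocks
```

Because blocks need not be contiguous, a released session's blocks are immediately reusable by any newcomer. Under this accounting, a KV compression factor shows up as fewer blocks per session, which is how it raises sessions-per-GPU.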

7. What I'd actually build at the start

Each version adds the single highest-leverage change next: v0 is the measurable baseline, v1 (real endpointing) is the biggest latency win, and cost-at-scale work waits until v5.

Say out loud: "Mouth-to-mouth is a pipelined waterfall, not a sum — stream every stage, speculatively prefill on partials, start TTS on the first clause. The dominant cost is endpointing — deciding the human finished — not the models. And on Twilio the media path is PSTN 8kHz μ-law unless it's WebRTC, which sets the STT quality ceiling before any model choice."