Real-Time Voice Agent — Latency Budget & Media Path
Twilio-shaped: PSTN/SIP + WebRTC ingress, media streaming, sub-second mouth-to-mouth. Companion to multimodal_fusion_design.md.
The number you defend in design review is mouth-to-mouth latency: the time from
the instant the human stops speaking to the instant they hear the bot. The
target is <800ms (banks/enterprise); ~500ms is great. Everything below is
about spending that budget, and the dominant cost is not a model: it is
deciding that the human has actually finished talking.
1. The end-to-end media path — where the milliseconds live
[Figure: end-to-end media path. Solid lines = inbound media (caller → models); green = the return leg (streamed TTS audio → playout). The amber band marks the codec ceiling: the ingress format caps STT quality before any model choice.]
Twilio-specific reality (the part most candidates miss)
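PSTN audio is 8kHz μ-law (G.711): narrowband, 64kbps (8,000 samples/s × 8 bits). You do not control this; the carrier delivers it. It caps STT accuracy relative to the 48kHz Opus you'd get over WebRTC. Mention upsampling and denoising before STT: upsampling only matches the sample rate the STT model expects and does not restore the lost bandwidth.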
Media Streams forks the call audio to your WebSocket in 20ms μ-law frames. ConversationRelay is the newer managed path — Twilio runs STT+TTS, you bring the LLM over WebSocket (removes two hops from your budget, at the cost of model choice).
Regional POPs / media servers — keep the media path close to the caller; a transatlantic RTT can eat 150ms by itself. Place the agent near the POP.
Carrier/PSTN setup + jitter is real and variable; WebRTC gives you a jitter buffer you can tune, PSTN does not.
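A minimal sketch of the ingress handling described above. It assumes a Media Streams `media` message carries a base64 μ-law payload under `media.payload`; verify the field names against the current Twilio docs before relying on them. `ulaw_to_pcm16` is the standard G.711 μ-law expansion. `upsample_2x` is a hypothetical helper using naive linear interpolation, shown only to make the 8kHz → 16kHz step concrete; a real pipeline would use a proper resampler.

```python
import base64
import json

SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
# 20ms at 8kHz = 160 samples; mu-law is 1 byte per sample, so 160 bytes per frame.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000


def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF  # mu-law bytes are transmitted bit-inverted
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
    return (0x84 - magnitude) if (u & 0x80) else (magnitude - 0x84)


def decode_media_message(raw: str) -> list[int]:
    """Return PCM16 samples for a 'media' message, or [] for any other event.

    Assumed message shape: {"event": "media", "media": {"payload": "<base64>"}}
    """
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return []
    payload = base64.b64decode(msg["media"]["payload"])
    return [ulaw_to_pcm16(b) for b in payload]


def upsample_2x(samples: list[int]) -> list[int]:
    """Naive 8kHz -> 16kHz by linear interpolation (adds no information)."""
    out: list[int] = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.extend((s, (s + nxt) // 2))
    return out
```

The point of the sketch is the ceiling: every sample that reaches STT came from one 8-bit byte at 8kHz, and nothing downstream can add bandwidth the carrier never delivered.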
2. The mouth-to-mouth budget — staged
This is the table to draw. Each row is a stage; the trick is that they pipeline, so wall-clock < sum (see §3).
| Stage | What happens | Naive | Tuned | Notes |
|---|---|---|---|---|
| Network in (jitter buf) | frames arrive, de-jittered | 40ms | 20ms | WebRTC tunable; PSTN fixed |
| Endpointing (turn end) | decide the human stopped | 700ms | 300ms | DOMINATES. silence wait + turn model |
| STT (final) | partials stream live; finalize on endpoint | 300ms | ~50ms | partials already done; only the tail |
| LLM TTFT | time to first token | 600ms | 250ms | prompt cache, small model, co-locate |
| LLM→enough text | first sentence/clause to start TTS | 400ms | ~0ms | overlaps TTS; don't wait for full reply |
| TTS TTFB | first audio chunk out | 300ms | 120ms | streaming TTS (Cartesia), first chunk only |
| Network out + playout | encode, send, jitter buf, speaker | 60ms | 30ms | μ-law re-encode on PSTN |
| MOUTH-TO-MOUTH | perceived, with pipelining | ~1.4s | ~550ms | see waterfall §3 |
The dominant cost is endpointing, not inference. Candidates obsess over LLM
speed; the human-perceived lag is mostly how long you wait in silence to be
sure they finished. Halving endpointing beats halving the LLM.
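The table and the claim above can be checked with a little arithmetic. The sketch below copies the per-stage numbers from the budget table (treating "~50ms" and "~0ms" as exact, which is an assumption) and compares the serial sum against the table's pipelined totals. The function names are illustrative only.

```python
# (naive_ms, tuned_ms) per stage, copied from the budget table.
STAGES = {
    "network_in": (40, 20),
    "endpointing": (700, 300),
    "stt_final": (300, 50),
    "llm_ttft": (600, 250),
    "llm_enough_text": (400, 0),
    "tts_ttfb": (300, 120),
    "network_out": (60, 30),
}
NAIVE, TUNED = 0, 1


def serial_sum(column: int) -> int:
    """Wall-clock if every stage ran strictly one after another."""
    return sum(ms[column] for ms in STAGES.values())


def largest_stage(column: int) -> str:
    """Name of the single most expensive stage in a column."""
    return max(STAGES, key=lambda name: STAGES[name][column])


def saving_from_halving(stage: str, column: int) -> float:
    """Milliseconds removed from the serial sum by halving one stage."""
    return STAGES[stage][column] / 2


print(serial_sum(NAIVE), serial_sum(TUNED))        # serial totals in ms
print(largest_stage(NAIVE), largest_stage(TUNED))  # biggest line item per column
```

The serial sums come to 2400ms naive and 770ms tuned, against the table's perceived ~1.4s and ~550ms; the gap is the overlap that pipelining buys. Endpointing is the largest single stage in both columns, and halving it saves 150ms in the tuned column versus 125ms for halving LLM TTFT.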
3. Why you STREAM and PIPELINE — the waterfall
You never run the stages serially. STT emits partials while the user talks;
the LLM can be speculatively prefilled on partials; TTS starts on the
first clause while the LLM is still generating. The budget is a waterfall, not
a sum.
The bars that start before the endpoint line (STT partials, speculative prefill) are work done while the user is still talking — that overlap is why wall-clock is a waterfall, not a sum. Only the endpointing wait, the final STT tail, decode, TTS first-byte, and playout fall to the right of the gate.
Streaming STT → the transcript is basically ready at endpoint; you pay only the finalization tail, not the whole utterance.
Speculative LLM prefill → start prefilling on the partial transcript; if it's stable, TTFT is near-instant at endpoint. Cancel/redo if the tail changes the meaning (cost: wasted prefill — cheap relative to the latency win).
Streaming TTS + sentence chunking → emit audio for clause 1 while the LLM writes clause 2. First-audio is what the user feels.
Prompt caching → the system prompt + conversation history are a cache hit every turn; only the new turn is fresh prefill. (KV-cache reuse — ties to 14_inference_server.)
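Sentence chunking from the list above can be sketched as a small generator that sits between the LLM token stream and the TTS client. This is an illustration under assumptions, not a production splitter: `clause_chunks` and `min_chars` are hypothetical names, and the punctuation rule is deliberately crude (it would split a decimal such as "3.5" if the tokens arrive separately).

```python
import re
from typing import Iterable, Iterator

# A clause ends when the buffer finishes on one of these marks.
_BOUNDARY = re.compile(r"[.!?;:,]\s*$")


def clause_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable clauses as soon as they are complete.

    min_chars avoids sending tiny fragments ("Sure,") to TTS on their own.
    """
    buf = ""
    for tok in tokens:
        buf += tok
        if len(buf.strip()) >= min_chars and _BOUNDARY.search(buf):
            yield buf.strip()
            buf = ""
    if buf.strip():
        # Flush whatever is left when the LLM stream ends.
        yield buf.strip()


llm_tokens = ["Sure", ",", " I", " can", " transfer", " that", ".",
              " One", " moment", " please", "."]
for clause in clause_chunks(llm_tokens):
    print(clause)  # each clause would be handed to streaming TTS here
```

The first clause leaves for TTS while the LLM is still producing the second, which is why the "LLM→enough text" row can approach zero in the tuned column.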
4. Endpointing & turn-taking — the hard, high-leverage part
The core question: did they STOP, or just PAUSE?
VAD tells you "is there voice energy right now." It does not tell you
"is the turn over." A 300ms gap mid-sentence ("send it to… uh… my checking
account") must NOT trigger the bot. That's a separate turn-detection /
endpointing model (semantic + acoustic), not raw VAD.
VAD (Silero, WebRTC) — fast, cheap, per-frame energy/voice gate. The first filter.
Endpointing — combines silence duration and linguistic completeness. "I want to…" (incomplete → wait longer) vs "transfer fifty dollars." (complete → fire fast).
The tradeoff knob: short silence threshold = snappy but interrupts people who pause to think; long threshold = polite but laggy. This is the latency-vs-correctness knob again, and it's per-deployment (a bank IVR tolerates more wait than a casual assistant).
Adaptive — shorten the threshold after a clearly complete clause; lengthen it after a filler word or rising intonation.
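The adaptive threshold above can be sketched as a text-only heuristic. Everything here is an assumption for illustration: the word lists, the three threshold values, and the function names are made up, and a real endpointing model would also use acoustic cues such as intonation, which this sketch ignores.

```python
import re

FILLERS = {"uh", "um", "er", "hmm"}
# Words that rarely end a complete request.
DANGLING = {"to", "and", "or", "but", "the", "a", "my", "of", "for", "with"}


def silence_threshold_ms(partial: str, base_ms: int = 500,
                         fast_ms: int = 250, slow_ms: int = 900) -> int:
    """Pick how long to wait in silence, given the transcript so far."""
    text = partial.strip().lower()
    if not text:
        return base_ms
    words = re.findall(r"[a-z']+", text)
    last = words[-1] if words else ""
    # Check trailing-off first: "..." also ends with ".".
    if text.endswith(("…", "...")) or last in FILLERS or last in DANGLING:
        return slow_ms   # probably a pause, wait longer
    if text[-1] in ".!?":
        return fast_ms   # looks complete, fire fast
    return base_ms


def should_end_turn(partial: str, silence_ms: int) -> bool:
    """True once the observed silence exceeds the adaptive threshold."""
    return silence_ms >= silence_threshold_ms(partial)
```

With these made-up values, a 300ms gap after "send it to… uh…" does not end the turn, while the same gap after "transfer fifty dollars." does. VAD supplies `silence_ms`; the decision itself sits one layer above it.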
Barge-in (interrupt the bot)
When the caller talks over the bot, all of the following must hold, with the
stop and re-sequence steps completing within a single frame of latency:
Stop TTS instantly — cancel playback + flush the outbound audio buffer (this is 10_audio_pipeline_interruptible.py — one asyncio.Event, race every await against it).
AEC3 must already be running so the bot's own voice in the mic isn't mistaken for barge-in. Without echo cancellation, the bot interrupts itself.
Re-sequence — the caller's interruption is a new high-priority event; discard the bot's half-spoken turn from context or mark it truncated so the LLM knows it wasn't fully heard.
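A minimal sketch of the "one asyncio.Event, race every await against it" pattern named above. It is not the contents of 10_audio_pipeline_interruptible.py; `race`, `play_turn`, and the in-memory `outbound` queue are hypothetical stand-ins for a real media sender. The demo simulates a barge-in during the second chunk.

```python
import asyncio


async def race(coro, interrupt: asyncio.Event):
    """Await coro unless interrupt fires first. Returns (finished, result)."""
    task = asyncio.ensure_future(coro)
    waiter = asyncio.ensure_future(interrupt.wait())
    done, _ = await asyncio.wait({task, waiter}, return_when=asyncio.FIRST_COMPLETED)
    if task in done:
        waiter.cancel()
        return True, task.result()
    task.cancel()  # stop the in-flight send immediately
    return False, None


async def play_turn(chunks, send_chunk, interrupt: asyncio.Event, outbound: asyncio.Queue):
    """Play TTS chunks; on barge-in, flush the buffer and report truncation."""
    spoken = []
    for chunk in chunks:
        finished, _ = await race(send_chunk(chunk), interrupt)
        if not finished:
            while not outbound.empty():  # flush audio queued but not yet heard
                outbound.get_nowait()
            return spoken, True          # truncated: tell the LLM context
        spoken.append(chunk)
    return spoken, False


async def demo():
    interrupt = asyncio.Event()
    outbound: asyncio.Queue = asyncio.Queue()

    async def send_chunk(chunk):
        await outbound.put(chunk)
        if chunk == "c2":
            interrupt.set()            # simulate the caller talking over c2
        await asyncio.sleep(0.02)      # one 20ms frame of playout

    result = await play_turn(["c1", "c2", "c3"], send_chunk, interrupt, outbound)
    return result, outbound.qsize()
```

`play_turn` returns the chunks that were fully played plus a truncation flag, which is what the re-sequence step needs in order to mark the bot's turn as only partly heard.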
5. Budget under failure & tail latency
The average is a lie; users feel p95/p99. One slow stage blows the turn.
| Failure | Effect | Mitigation |
|---|---|---|
| LLM TTFT spike | dead air mid-conversation | filler/backchannel ("let me check…") to mask; timeout→fallback model |
| STT slow/wrong on μ-law | misheard intent | upsample+denoise; confidence gate → reprompt |
| TTS stall | cut-off speech | buffer 1–2 chunks before playout starts; cache common phrases |
| Network jitter | choppy audio | jitter buffer (adds latency: the tradeoff); regional POP |
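Why the average misleads, as a worked example. The latency samples below are made up for illustration (90 good turns, 10 slow ones); `percentile` is a simple nearest-rank implementation, not a library call.

```python
import math
from statistics import mean


def percentile(samples, p: int):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical mouth-to-mouth latencies in ms for 100 turns.
turn_ms = [500] * 90 + [2000] * 10

print("mean:", mean(turn_ms))
print("p50: ", percentile(turn_ms, 50))
print("p95: ", percentile(turn_ms, 95))
```

With this data the mean is 650ms and the median is 500ms, both inside the budget, while p95 is 2000ms: one turn in ten is a two-second stall, and that is what callers remember.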
Dead air is the worst failure — humans read >~700ms of silence as "the
line dropped." A cheap backchannel ("mm-hm", "let me pull that up")
buys you LLM time and feels natural. Latency you can't remove, you can mask.
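Masking can be sketched as a timer raced against the first reply audio. This is an illustration under assumptions: `reply_with_mask`, the 0.5s default, and the filler phrase are hypothetical, and `play` stands in for whatever sends audio to the caller.

```python
import asyncio


async def reply_with_mask(first_audio, play, filler: str = "Let me pull that up.",
                          mask_after_s: float = 0.5) -> bool:
    """Play a backchannel if the real reply is not ready in time.

    Returns True if the filler was used.
    """
    task = asyncio.ensure_future(first_audio())
    done, _ = await asyncio.wait({task}, timeout=mask_after_s)
    masked = not done
    if masked:
        await play(filler)   # fill the silence; the reply task keeps running
    await play(await task)   # then play the real reply as soon as it exists
    return masked


async def demo(reply_delay_s: float, mask_after_s: float):
    played = []

    async def first_audio():
        await asyncio.sleep(reply_delay_s)  # stands in for LLM TTFT + TTS TTFB
        return "<reply audio>"

    async def play(item):
        played.append(item)

    masked = await reply_with_mask(first_audio, play, mask_after_s=mask_after_s)
    return masked, played
```

The threshold should sit below the point where silence reads as a dropped line, so the filler starts before the caller gives up. A fast reply never triggers it.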
6. Scaling & economics — the resource-scheduler view
A voice session pins a GPU slot for its whole duration (STT+TTS streaming, KV-cache resident), unlike text, which is bursty and poolable. Concurrent sessions per GPU is the cost driver.
Self-host vs API per hop: NVIDIA Parakeet/Canary on Modal is ~1–2 orders of magnitude cheaper than proprietary STT, at the cost of running the infra (see voice_agents_notes.md tables; ~4¢/GPU-min). TTS (Cartesia) and LLM (OpenAI) often stay on APIs for quality/latency.
KV-cache compression (~7×) → more concurrent voice sessions per GPU + longer context at the same cost. Directly raises sessions-per-GPU.
Scheduler — sessions → GPU slots; admit by KV-cache budget, not headcount (the 14_inference_server thesis). Text and voice share the LLM pool but at different tiers/priorities.
Continuous (in-flight) batching, NOT static fill-and-fire — voice can't wait to fill a batch; the fill-wait is the latency. Admit each request into the running batch immediately and let the GPU interleave it with sequences already decoding — batch efficiency with zero fill-wait. One engine, per-class policy — NOT two systems. Voice/chat ride the same continuously-batched GPU pool as bulk; they're just admitted immediately (latency-first), while bulk/offline accumulates ("fill 100 or 40ms") and backfills the idle slots at low priority (throughput-first). The cheap Batch API tier is that backfill.
Paged KV (PagedAttention) — KV cache in fixed non-contiguous blocks; admit/evict sessions by block availability. This is the mechanism that makes continuous batching and "admit by KV-cache budget" actually work without fragmentation.
Prefill vs decode — a joining session's prefill is compute-bound and can stall everyone else's decode loop (audible jitter for active calls). Chunk or disaggregate prefill so a new session doesn't add latency to live ones.
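"Admit by KV-cache budget, not headcount" can be sketched as a block-counting pool with per-class policy. This is a toy model under assumptions: `PagedKVPool`, the 16-token block size, and `voice_reserve` are hypothetical, and blocks are allocated once at admission, whereas a real paged-KV engine grows a session's allocation as it decodes.

```python
import math
from dataclasses import dataclass, field


@dataclass
class PagedKVPool:
    total_blocks: int
    block_tokens: int = 16                     # tokens per KV block (assumed)
    held: dict[str, int] = field(default_factory=dict)  # session -> blocks

    def free_blocks(self) -> int:
        return self.total_blocks - sum(self.held.values())

    def admit(self, session_id: str, context_tokens: int,
              latency_first: bool, voice_reserve: int = 0) -> bool:
        """Admit if enough blocks are free.

        Latency-first (voice/chat) sessions may use every free block.
        Bulk sessions must leave voice_reserve blocks untouched, so they
        only backfill idle capacity.
        """
        need = math.ceil(context_tokens / self.block_tokens)
        headroom = self.free_blocks() - (0 if latency_first else voice_reserve)
        if need > headroom:
            return False
        self.held[session_id] = need
        return True

    def release(self, session_id: str) -> None:
        self.held.pop(session_id, None)


pool = PagedKVPool(total_blocks=10)
print(pool.admit("voice-1", 64, latency_first=True))                    # needs 4 blocks
print(pool.admit("bulk-1", 64, latency_first=False, voice_reserve=4))   # would eat the reserve
print(pool.free_blocks())
```

Admission depends on blocks, not on the number of sessions: a long-context call costs more slots than a short one, and any reduction in KV size per session raises the number of sessions the same pool can hold.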
7. What I'd actually build at the start
Each version adds the single highest-leverage change next: v0 is the measurable baseline, v1 (real endpointing) is the biggest latency win, and cost-at-scale work waits until v5.
Say out loud: "Mouth-to-mouth is a pipelined waterfall, not a sum —
stream every stage, speculatively prefill on partials, start TTS on the first
clause. The dominant cost is endpointing — deciding the human finished —
not the models. And on Twilio the media path is PSTN 8kHz μ-law unless it's
WebRTC, which sets the STT quality ceiling before any model choice."