Real-Time Voice Agent — Latency Budget & Media Path

Twilio-shaped: PSTN/SIP + WebRTC ingress, media streaming, sub-second mouth-to-mouth. Companion to multimodal_fusion_design.md.
The number you defend in design review is mouth-to-mouth latency: from the instant the human stops speaking to the instant they hear the bot. Target <800ms (banks/enterprise), great is ~500ms. Everything below is about spending that budget — and the dominant cost is not a model, it's deciding the human actually finished talking.

1. The end-to-end media path — where the milliseconds live

[Diagram: end-to-end media path]
caller (phone or browser) → PSTN/SIP → Twilio edge (media server, regional POP) → WS · 20ms frames → your agent (orchestrator: VAD · endpoint · barge-in) → models (STT → LLM → TTS) → streamed audio → playout
Ingress codec: μ-law 8kHz (PSTN) or Opus 48kHz (WebRTC).
PSTN path: narrowband 8kHz G.711 μ-law, fixed by the carrier. Worse STT input.
WebRTC path: wideband 48kHz Opus. Better audio, but needs a browser/app.
solid = inbound media (caller → models) · green = the return leg (streamed TTS audio → playout). The amber band is the codec ceiling: the ingress format caps STT quality before any model choice.
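To make the PSTN codec ceiling concrete: a 20ms frame at 8kHz is 160 samples, and G.711 μ-law spends one byte on each. Below is a minimal pure-Python sketch of expanding such a frame to 16-bit linear PCM before it reaches STT. The function names are illustrative, not from any Twilio SDK, and a real pipeline would likely use a native or vectorized decoder.

```python
SAMPLE_RATE_HZ = 8000  # PSTN narrowband, per the media path above
FRAME_MS = 20          # frame duration, per the media path above
BIAS = 0x84            # G.711 mu-law bias (132)


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      frame_ms: int = FRAME_MS) -> int:
    """Number of samples (and mu-law bytes) in one frame."""
    return sample_rate_hz * frame_ms // 1000


def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    byte = ~byte & 0xFF            # mu-law bytes are transmitted bit-inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_frame(frame: bytes) -> list[int]:
    """Decode a whole mu-law frame to a list of PCM16 samples."""
    return [ulaw_to_pcm16(b) for b in frame]


if __name__ == "__main__":
    silence = bytes([0xFF]) * samples_per_frame()  # 0xFF is mu-law silence
    print(samples_per_frame(), decode_frame(silence)[:3])
```

Decoding recovers the samples, not the bandwidth: the output is still 8kHz narrowband audio, which is why the ingress format caps STT quality.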
Twilio-specific reality (the part most candidates miss)

2. The mouth-to-mouth budget — staged

This is the table to draw. Each row is a stage; the trick is that they pipeline, so wall-clock < sum (see §3).

Stage | What happens | Naive | Tuned | Notes
Network in (jitter buf) | frames arrive, de-jittered | 40ms | 20ms | WebRTC tunable; PSTN fixed
Endpointing (turn end) | decide the human stopped | 700ms | 300ms | DOMINATES. silence wait + turn model
STT (final) | partials stream live; finalize on endpoint | 300ms | ~50ms | partials already done; only the tail
LLM TTFT | time to first token | 600ms | 250ms | prompt cache, small model, co-locate
LLM→enough text | first sentence/clause to start TTS | 400ms | ~0ms | overlaps TTS; don't wait for full reply
TTS TTFB | first audio chunk out | 300ms | 120ms | streaming TTS (Cartesia), first chunk only
Network out + playout | encode, send, jitter buf, speaker | 60ms | 30ms | μ-law re-encode on PSTN
MOUTH-TO-MOUTH | perceived, with pipelining | ~1.4s | ~550ms | see waterfall §3
The dominant cost is endpointing, not inference. Candidates obsess over LLM speed; the human-perceived lag is mostly how long you wait in silence to be sure they finished. Halving endpointing beats halving the LLM.
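The arithmetic behind that claim is worth checking by hand. The sketch below copies the per-stage numbers from the table (treating "~50ms" and "~0ms" as 50 and 0), sums them serially, and compares what halving each stage would save. The dictionary keys and function names are made up for illustration.

```python
# Per-stage budgets in ms, copied from the table above: (naive, tuned).
BUDGET_MS = {
    "network_in": (40, 20),
    "endpointing": (700, 300),
    "stt_final": (300, 50),
    "llm_ttft": (600, 250),
    "llm_to_enough_text": (400, 0),
    "tts_ttfb": (300, 120),
    "network_out": (60, 30),
}
COLUMN = {"naive": 0, "tuned": 1}


def serial_sum_ms(column: str) -> int:
    """Wall-clock if every stage ran strictly one after another."""
    i = COLUMN[column]
    return sum(values[i] for values in BUDGET_MS.values())


def saving_from_halving_ms(stage: str, column: str) -> float:
    """Milliseconds recovered by cutting one stage in half."""
    return BUDGET_MS[stage][COLUMN[column]] / 2


def largest_stage(column: str) -> str:
    """The single most expensive stage in a column."""
    i = COLUMN[column]
    return max(BUDGET_MS, key=lambda stage: BUDGET_MS[stage][i])


if __name__ == "__main__":
    print(serial_sum_ms("naive"), serial_sum_ms("tuned"))   # 2400 770
    print(saving_from_halving_ms("endpointing", "tuned"))   # 150.0
    print(saving_from_halving_ms("llm_ttft", "tuned"))      # 125.0
    print(largest_stage("naive"), largest_stage("tuned"))   # endpointing endpointing
```

The serial sums (2400ms naive, 770ms tuned) sit well above the table's perceived totals (~1.4s, ~550ms); the gap is the overlap that pipelining buys.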

3. Why you STREAM and PIPELINE — the waterfall

You never run the stages serially. STT emits partials while the user talks; the LLM can be speculatively prefilled on partials; TTS starts on the first clause while the LLM is still generating. The budget is a waterfall, not a sum.

[Diagram: latency waterfall]
The gate: user stops speaking → endpoint.
Before the gate (user speaking): STT partials (live); LLM prefill (speculative) on partials → warm KV-cache before final.
After the gate: endpointing wait 300ms ("are they done?") → STT final, ~50ms tail → LLM decode, TTFT fast (prefill already paid) → TTS (first clause), TTFB 120ms → audio out + playout → user hears bot.
Perceived ≈ 550ms.
The bars that start before the endpoint line (STT partials, speculative prefill) are work done while the user is still talking — that overlap is why wall-clock is a waterfall, not a sum. Only the endpointing wait, the final STT tail, decode, TTS first-byte, and playout fall to the right of the gate.
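One piece of the pipelining is easy to show in isolation: starting TTS on the first clause instead of the full reply. A minimal sketch follows, assuming the LLM yields text tokens one at a time. The `clauses` generator, the punctuation set, and the `min_chars` floor are all illustrative choices, not a known library API.

```python
from typing import Iterable, Iterator

# Punctuation treated as a clause boundary (an assumption; tune per language).
CLAUSE_END = (".", "!", "?", ",", ";", ":")


def clauses(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into clauses that can be sent to TTS early.

    A clause is emitted as soon as the buffer ends in boundary punctuation
    and holds at least `min_chars` characters, so very short fragments such
    as "Sure," are not synthesized on their own.
    """
    buf = ""
    for tok in tokens:
        buf += tok
        text = buf.strip()
        if text.endswith(CLAUSE_END) and len(text) >= min_chars:
            yield text
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush whatever is left when the stream ends


if __name__ == "__main__":
    stream = ["Sure", ",", " I", " can", " help", " with", " that", ".",
              " One", " moment"]
    for clause in clauses(stream):
        print(repr(clause))  # each clause would be handed to TTS here
```

Because `clauses` is a generator, the first clause is available the moment its closing punctuation arrives, while the LLM is still decoding the rest.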

4. Endpointing & turn-taking — the hard, high-leverage part

The core question: did they STOP, or just PAUSE?

VAD tells you "is there voice energy right now." It does not tell you "is the turn over." A 300ms gap mid-sentence ("send it to… uh… my checking account") must NOT trigger the bot. That's a separate turn-detection / endpointing model (semantic + acoustic), not raw VAD.
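The stop-versus-pause decision can be sketched as a small gate that combines the two signals. Everything here is an assumption for illustration: `p_turn_complete` stands in for the output of whatever turn-detection model you run, the 0.7 cutoff is arbitrary, and the 300ms/700ms bounds are borrowed from the tuned and naive endpointing rows of the budget table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointConfig:
    min_silence_ms: int = 300    # never end the turn on less silence than this
    max_silence_ms: int = 700    # hard fallback: plain silence timer
    done_threshold: float = 0.7  # hypothetical turn-model cutoff


def turn_is_over(silence_ms: int, p_turn_complete: float,
                 cfg: EndpointConfig = EndpointConfig()) -> bool:
    """Decide whether the caller finished their turn.

    silence_ms:       how long VAD has reported no voice energy
    p_turn_complete:  turn-model estimate that the utterance is complete
    """
    if silence_ms >= cfg.max_silence_ms:
        return True   # long silence: give up waiting regardless of the model
    if silence_ms < cfg.min_silence_ms:
        return False  # too short to call, whatever the model says
    return p_turn_complete >= cfg.done_threshold


if __name__ == "__main__":
    # "send it to... uh..." : 300ms gap, model says the sentence is unfinished
    print(turn_is_over(300, 0.1))
    # "...my checking account." : same gap, model says the sentence is complete
    print(turn_is_over(300, 0.9))
```

The same 300ms gap produces opposite decisions depending on the turn model, which is the point: raw VAD alone cannot tell the two cases apart.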

Barge-in (interrupt the bot)

When the caller talks over the bot, three things must hold, and the bot must react within a single frame (~20ms):

  1. Stop TTS instantly — cancel playback + flush the outbound audio buffer (this is 10_audio_pipeline_interruptible.py — one asyncio.Event, race every await against it).
  2. AEC3 must already be running so the bot's own voice in the mic isn't mistaken for barge-in. Without echo cancellation, the bot interrupts itself.
  3. Re-sequence — the caller's interruption is a new high-priority event; discard the bot's half-spoken turn from context or mark it truncated so the LLM knows it wasn't fully heard.
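Steps 1 and 3 can be sketched with the pattern named above: one asyncio.Event, and every await raced against it. This is not the contents of 10_audio_pipeline_interruptible.py, which isn't shown here. `race`, `play`, and `demo` are hypothetical names, `asyncio.sleep` stands in for sending one 20ms frame, and echo cancellation (step 2) is not modeled.

```python
import asyncio
from collections import deque


async def race(coro, interrupted: asyncio.Event):
    """Await `coro`, but give up as soon as `interrupted` is set.

    Returns (True, result) if the coroutine finished, (False, None) otherwise.
    """
    work = asyncio.ensure_future(coro)
    stop = asyncio.ensure_future(interrupted.wait())
    done, pending = await asyncio.wait(
        {work, stop}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)  # reap cancelled tasks
    if work in done:
        return True, work.result()
    return False, None


async def play(outbound: deque, interrupted: asyncio.Event,
               frame_s: float = 0.02):
    """Play queued frames; on barge-in, flush the queue and report truncation."""
    played = []
    while outbound:
        frame = outbound.popleft()
        # Stand-in for sending one frame to the caller.
        finished, _ = await race(asyncio.sleep(frame_s), interrupted)
        if not finished:
            outbound.clear()      # step 1: flush audio not yet played
            return played, True   # step 3: caller marks the turn truncated
        played.append(frame)
    return played, False


async def demo():
    interrupted = asyncio.Event()
    outbound = deque(f"frame{i}" for i in range(50))  # ~1s of queued TTS audio
    asyncio.get_running_loop().call_later(0.1, interrupted.set)  # barge-in
    played, truncated = await play(outbound, interrupted)
    return played, truncated, len(outbound)


if __name__ == "__main__":
    played, truncated, left = asyncio.run(demo())
    print(len(played), truncated, left)
```

The `truncated` flag is what lets the orchestrator record only the frames actually played as the bot's turn, so the LLM context reflects what the caller really heard.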

5. Budget under failure & tail latency

The average is a lie; users feel p95/p99. One slow stage blows the turn.
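A toy example shows how far the mean can sit from the tail. The latencies below are invented for illustration (90 good turns, 8 slow ones, 2 very slow ones) and the percentile uses the simple nearest-rank method; `percentile` is a hypothetical helper, not a library call.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up mouth-to-mouth latencies in ms for 100 turns.
turns_ms = [500] * 90 + [900] * 8 + [2500] * 2

if __name__ == "__main__":
    mean = sum(turns_ms) / len(turns_ms)
    print(mean)                      # 572.0
    print(percentile(turns_ms, 50))  # 500
    print(percentile(turns_ms, 95))  # 900
    print(percentile(turns_ms, 99))  # 2500
```

A 572ms mean looks comfortably inside an 800ms target, yet one turn in fifty in this sample takes 2.5 seconds, and those are the turns the caller remembers.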

Failure | Effect | Mitigation
LLM TTFT spike | dead air mid-conversation | filler/backchannel ("let me check…") to mask; timeout → fallback model
STT slow/wrong on μ-law | misheard intent | upsample + denoise; confidence gate → reprompt
TTS stall | cut-off speech | buffer 1–2 chunks before playout starts; cache common phrases
Network jitter | choppy audio | jitter buffer (adds latency: that is the tradeoff); regional POP
Dead air is the worst failure — humans read >~700ms of silence as "the line dropped." A cheap backchannel ("mm-hm", "let me pull that up") buys you LLM time and feels natural. Latency you can't remove, you can mask.
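Masking can be sketched as a timeout around the LLM call: if the reply is not ready by a deadline set below the dead-air threshold, speak a filler and keep waiting. This is a sketch under assumptions: `reply_with_filler`, `say`, and the `filler_after_s` parameter are hypothetical names, and the demo uses short sleeps in place of a real model.

```python
import asyncio


async def reply_with_filler(llm_coro, say, filler: str = "Let me pull that up.",
                            filler_after_s: float = 0.5) -> str:
    """Speak the LLM reply; if it is slow, speak a filler first.

    llm_coro:       coroutine that resolves to the reply text
    say:            async callable that sends text to TTS/playout
    filler_after_s: deadline before the filler fires (assumed value; keep it
                    below the silence that callers read as a dropped line)
    """
    task = asyncio.ensure_future(llm_coro)
    try:
        # shield() keeps the LLM call running if the timeout fires.
        text = await asyncio.wait_for(asyncio.shield(task), timeout=filler_after_s)
    except asyncio.TimeoutError:
        await say(filler)  # mask the wait
        text = await task  # then finish waiting for the real reply
    await say(text)
    return text


async def demo(llm_delay_s: float) -> list:
    spoken = []

    async def say(text: str) -> None:
        spoken.append(text)

    async def fake_llm() -> str:
        await asyncio.sleep(llm_delay_s)
        return "Your balance is ready."

    await reply_with_filler(fake_llm(), say, filler_after_s=0.05)
    return spoken


if __name__ == "__main__":
    print(asyncio.run(demo(0.0)))  # fast reply: no filler
    print(asyncio.run(demo(0.2)))  # slow reply: filler first
```

A production version would also bound the second wait and switch to a fallback model on expiry, as the failure table suggests; that is left out here to keep the sketch short.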

6. Scaling & economics — the resource-scheduler view

7. What I'd actually build at the start

[Diagram: ship order]
v0: Twilio Media Streams → WS → [ VAD → streaming STT → LLM → streaming TTS ] → back. Measure mouth-to-mouth end to end FIRST; you can't tune what you don't measure.
v1: + real endpointing model (kill the naive silence timer). Biggest win.
v2: + barge-in (cancel + AEC3). Feels human.
v3: + speculative prefill on partials + prompt cache. Shave TTFT.
v4: + backchannel/filler to mask p99 tail. Hide the spikes.
v5: + self-host STT on Modal, scheduler, KV-cache compression. Cost at scale.
Each version adds the single highest-leverage change next: v0 is the measurable baseline, v1 (real endpointing) is the biggest latency win, and cost-at-scale work waits until v5.
Say out loud: "Mouth-to-mouth is a pipelined waterfall, not a sum — stream every stage, speculatively prefill on partials, start TTS on the first clause. The dominant cost is endpointing — deciding the human finished — not the models. And on Twilio the media path is PSTN 8kHz μ-law unless it's WebRTC, which sets the STT quality ceiling before any model choice."