Multimodal Conversational AI — The Complete Modern Stack

Master reference. Real-time text/voice/video conversation with an LLM (e.g. Claude). Every layer named with swappable alternatives, interfaces, latency, and self-host/API economics. Ingress covers browser WebRTC and PSTN/Twilio. Companions: voice_agents_notes.md, multimodal_fusion_design.md.
0 · Thesis & the three hard problems
1 · The whole stack (diagram)
2 · Transport & ingress
3 · Audio front-end: AEC3 · NS · VAD
4 · Turn detection & endpointing
5 · STT (streaming + batch)
6 · Diarization & speaker ID
7 · Paralinguistics / emotion
8 · Vision channel
9 · Fusion & co-sequencing (diagram)
10 · Context accumulation & memory (diagram)
11 · The LLM core
12 · TTS (streaming)
13 · Orchestration (Pipecat / LiveKit)
14 · Latency budget & waterfall (diagram)
15 · Infra, GPUs & economics
16 · Batch transcription path
17 · Scaling & the scheduler
18 · Escalation tiers
19 · Failure modes & tail latency
20 · Tag schema & interfaces
21 · Build order + interview ammo

0 · Thesis & the three hard problems

Everything — typed text, speech, video — becomes tokens before the model. But the modalities are not redundant transcripts to dedup; they are complementary signals to fuse on a shared event-time timeline. The model calls are the easy 20%. The hard parts: (1) knowing when the human finished (endpointing), (2) aligning streams with different latencies (audio ~80ms vs vision ~200ms), and (3) masking the latency you can't remove.

Two structural reframes: a conversation is one persistent, transport-agnostic session served at different tiers (text = cheap/async, voice = GPU-pinned/realtime); and the whole system is a resource scheduler matching sessions to capacity (KV-cache bytes).

1 · The whole stack at a glance

[Diagram: the whole stack in three lanes. CLIENT: mic capture with AEC/NS at the edge (Opus 48k); playout + jitter buffer to the speaker, cleared on barge-in. CPU bot: WebRTC-light termination via aiortc (P2P, no SFU; Opus ⇄ PCM in-process), aiortc egress on the same peer connection (PCM → Opus, queue flushed on barge-in), Silero VAD for barge-in, smart-turn turn detection, the event-time sequencer/fusion merge (§9), and the durable session store (§10) that admits by KV budget. GPU: STT (Parakeet / Deepgram), perception specialists (prosody · vision · diarization), LLM (Claude / vLLM; KV-cache = working memory), TTS (Cartesia / Orpheus; streams the first clause). Control edges: endpoint → run LLM; barge-in (user speaks during bot TTS) → CANCEL.]
solid = data flow · dashed amber = control (turn-end triggers the LLM — it does not reorder the sequencer) · dashed red = barge-in cancel, flushing the entire output chain at once (LLM decode · TTS · egress queue · client playout buffer) and marking the turn truncated. In full-duplex this red path vanishes — yielding is just the model's next-frame decision.

2 · Transport & ingress — get media in/out, low jitter

Concern | Primary | Alternatives | Why / notes
Realtime media + agent framework | LiveKit | Pipecat transport, Daily, raw WebRTC | SFU + agent SDK; runs the room, handles tracks. Runs on Modal.
Lightweight WebRTC | SmallWebRTCTransport | aiortc, full SFU | Pipecat's minimal P2P transport: single-agent, no SFU overhead.
Browser/app audio | WebRTC (Opus 48kHz) | | Wideband, tunable jitter buffer, gives AEC hooks. The good path.
Phone | Twilio Media Streams | Twilio ConversationRelay, SIP trunk, Vonage | PSTN = 8kHz μ-law (G.711), carrier-fixed, narrowband; caps STT quality before any model choice. Media Streams forks 20ms frames to your WS; ConversationRelay is managed STT+TTS, you bring the LLM.
Regional placement | media POP near caller | | Transatlantic RTT alone ~150ms. Co-locate agent + media server.
Interface
In: RTP/media frames (Opus or μ-law), 20ms each. Out: raw PCM frames to the front-end + a control channel for events (track start/stop, DTMF, mute). Pipecat/LiveKit normalize codecs so downstream sees uniform PCM.

WebSocket vs WebRTC — when each, and why WS is first-class

WebSocket is not a downgrade — it's the actual transport in three places that matter: (1) Twilio Media Streams forks call audio to your server as μ-law frames over a WebSocket, not WebRTC; (2) service-to-service (the Modal ragbot's Tunnels were WS); (3) any server-to-server agent with no browser. WebRTC is for the last mile to a browser/app; WebSocket is for everything behind it.
Axis | WebRTC | WebSocket
Layer | UDP + SRTP (media-grade) | TCP + TLS (one framed duplex stream)
Loss handling | tolerates loss (audio degrades gracefully) | head-of-line blocking: a lost packet stalls the stream (TCP)
Jitter buffer / NAT | built in (ICE/STUN/TURN) | you handle ordering/timing yourself
Built-in AEC/NS | yes (browser media stack) | NS server-side OK (RNNoise on inbound); AEC stays at the edge: you can't align the echo loop over a cloud round-trip, so rely on handset/carrier cancellation
Setup | SDP offer/answer, heavier | one HTTP upgrade, trivial
Best for | last mile to browser/app | telephony bridge, service↔service, server agents
The framing to say (covers WS-first stacks and Twilio either way)
# Twilio Media Streams = a WebSocket your server accepts. Frame shape:
{ "event":"start", "start":{"streamSid":"...","mediaFormat":{"encoding":"audio/x-mulaw","sampleRate":8000}} }
{ "event":"media", "media":{"payload":"<base64 μ-law 20ms>","timestamp":"1234"} }   # inbound audio
{ "event":"mark", "mark":{"name":"tts-clause-3"} }   # playback checkpoint
{ "event":"stop" }
# you SEND back: media frames (your TTS, base64 μ-law) + a "clear" to flush on barge-in
{ "event":"media", "streamSid":"...", "media":{"payload":"<base64 μ-law>"} }
{ "event":"clear", "streamSid":"..." }   # barge-in: drop queued audio
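The frame shapes above can be handled with nothing beyond the standard library. The following is a minimal sketch, assuming the abbreviated field set shown in this section; the helper names (`parse_inbound`, `media_message`, `clear_message`) and the stream id are hypothetical, and the WebSocket server itself is left out.

```python
import base64
import json


def parse_inbound(message: str):
    """Return (event, raw mu-law bytes or None) for one Media Streams text frame."""
    frame = json.loads(message)
    event = frame.get("event")
    if event == "media":
        return event, base64.b64decode(frame["media"]["payload"])
    return event, None


def media_message(stream_sid: str, mulaw: bytes) -> str:
    """Wrap outbound TTS audio (already 8 kHz mu-law) for sending back."""
    payload = base64.b64encode(mulaw).decode("ascii")
    return json.dumps({"event": "media", "streamSid": stream_sid,
                       "media": {"payload": payload}})


def clear_message(stream_sid: str) -> str:
    """Barge-in: ask the far end to drop audio that is queued but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})


# 20 ms at 8 kHz with one byte per mu-law sample is 160 bytes per frame.
# 0xFF is the mu-law encoding of digital silence.
silence = bytes([0xFF]) * 160
inbound = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(silence).decode("ascii"),
              "timestamp": "1234"},
})
event, audio = parse_inbound(inbound)
print(event, len(audio))
print(clear_message("MZ-hypothetical"))  # hypothetical stream id
```

On barge-in the server would send `clear_message(...)` once and stop producing `media_message(...)` frames for the interrupted turn.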

3 · Audio front-end — clean the signal before perception

Stage | Component | Alternatives | Why it must be here
Echo cancellation | WebRTC AEC3 | Speex AEC, hardware AEC | Without it the bot hears itself through speaker→mic → false barge-in / self-interrupt. Must run before VAD.
Noise suppression | RNNoise / WebRTC NS | DeepFilterNet, Krisp | Phone audio is noisy; cleans STT input. Cheap CPU.
Voice activity detection | Silero VAD | WebRTC VAD, kyutai STT (built-in) | Per-frame "is there speech." Its speech segments are reusable as input to diarization (the VAD itself outputs a speech probability, not speaker embeddings). Fast first gate, but not end-of-turn.
Ordering matters
AEC3 → NS → VAD. Echo-cancel first (so suppression/VAD see clean near-end audio), then denoise, then detect voice. Classic bug: VAD fires on the bot's own echo.

4 · Turn detection & endpointing — the highest-leverage problem

VAD answers "is there voice now." Turn detection answers "is the turn over." Different models. A 300ms gap mid-sentence — "send it to… uh… my checking account" — must NOT fire the bot. This dominates mouth-to-mouth latency.
Layer | Component | Signal used | Notes
Acoustic gate | Silero VAD | energy / voiced frames | fast first filter; "silence started"
Semantic end-of-turn | Pipecat Smart Turn | acoustic + linguistic completeness | "I want to…" (incomplete→wait) vs "transfer fifty dollars." (complete→fire). Frameworks expose these as turn events.
Threshold policy | adaptive timeout | completeness + prosody | shorten after a complete clause; lengthen after a filler / rising intonation.

The knob: short silence threshold = snappy but cuts off thinkers; long = polite but laggy. The latency-vs-correctness knob, per-deployment (a bank IVR tolerates more wait than a casual assistant).
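The threshold policy in the table above (shorten after a complete clause, lengthen after a filler or rising intonation) can be sketched as a small function. This is an illustration only: the function name, the base/floor/ceiling values, and the per-signal adjustments are hypothetical tuning numbers, not figures from any framework.

```python
def silence_threshold_ms(complete: bool, filler: bool, rising_pitch: bool,
                         base_ms: int = 500, floor_ms: int = 200,
                         ceil_ms: int = 1200) -> int:
    """How long to wait after "silence started" before declaring end-of-turn."""
    wait = base_ms
    if complete:        # "transfer fifty dollars." -> fire sooner
        wait -= 200
    if filler:          # "send it to... uh..." -> the user is still thinking
        wait += 400
    if rising_pitch:    # rising intonation often signals more to come
        wait += 200
    # Clamp so the bot is never instant (cuts people off) or endlessly patient.
    return max(floor_ms, min(ceil_ms, wait))


print(silence_threshold_ms(complete=True, filler=False, rising_pitch=False))
print(silence_threshold_ms(complete=False, filler=True, rising_pitch=True))
```

A per-deployment profile would then mostly change `base_ms` and `ceil_ms`: higher for a bank IVR, lower for a casual assistant.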

Barge-in (interrupt the bot)

  1. Cancel TTS in single-frame latency — flush outbound buffer. (10_audio_pipeline_interruptible.py: one asyncio.Event, race every await against it.)
  2. AEC3 already running so the bot's voice isn't mistaken for the human.
  3. Re-sequence + record truncation — the interruption is a new high-priority event; mark the bot's half-spoken turn as truncated in the accumulated context (see §10) so the model knows it wasn't fully heard.
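Step 1 ("one asyncio.Event, race every await against it") can be sketched as follows. This is not the contents of 10_audio_pipeline_interruptible.py, which is not shown here; `race`, `speak`, and the 20ms frame pacing are hypothetical stand-ins, with `asyncio.sleep` playing the role of pushing one audio frame.

```python
import asyncio


async def race(awaitable, cancel: asyncio.Event):
    """Await `awaitable`, but return (False, None) as soon as `cancel` is set."""
    work = asyncio.ensure_future(awaitable)
    stop = asyncio.ensure_future(cancel.wait())
    done, _ = await asyncio.wait({work, stop}, return_when=asyncio.FIRST_COMPLETED)
    if work in done:
        stop.cancel()
        return True, work.result()
    work.cancel()  # drop the in-flight step
    return False, None


async def speak(chunks, cancel, played, frame_s=0.02):
    """Play chunks one frame at a time; return False if the turn was truncated."""
    for chunk in chunks:
        finished, _ = await race(asyncio.sleep(frame_s), cancel)
        if not finished:
            return False  # caller marks the turn truncated in context (step 3)
        played.append(chunk)
    return True


async def demo():
    cancel = asyncio.Event()
    played = []
    task = asyncio.create_task(speak(list(range(50)), cancel, played))
    await asyncio.sleep(0.07)  # the user barges in mid-utterance
    cancel.set()
    return await task, played


completed, played = asyncio.run(demo())
print("completed:", completed, "chunks played before cancel:", len(played))
```

Because every await in the output path goes through `race`, the worst-case reaction time to a barge-in is one frame.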

5 · STT — streaming (realtime) and batch (offline) are different problems

Open ASR leaderboard (self-host candidates)

Model | ESB WER En | RTFx | Multiling | Timestamps | VAD | Note
nvidia/parakeet-tdt-0.6b-v2 | 6.05 | 3386 | | | | chosen: ~3× faster than any comparable-WER model
nvidia/canary-1b-flash | 6.35 | 1046 | ✅ En/Fr/Es/De | | | sister model, multilingual
kyutai/stt-2.6b-en | 6.4 | 88 | En | | | built-in VAD

The full STT option space

Use | Option | Host | Notes
Realtime streaming (default) | Deepgram | API | low-latency partials; the realtime workhorse
Realtime, self-host | Parakeet / Canary (NeMo) | Modal GPU | 1–2 orders cheaper at scale; NeMo makes model-swap trivial
Realtime w/ VAD | kyutai stt-2.6b | self-host | VAD bundled, simpler pipeline
Realtime diarized | Soniox rt-4 | API | strong diarization on the realtime path
Multilingual realtime | Groq | API | natural multilingual, ~100 languages
Batch (fastest/cheapest) | Parakeet/Canary on Modal | Modal batch | RTFx 1000s → huge corpora cheap; not for realtime
Batch, high quality | Mistral Transcribe 2 | API | very good + fast but batch only
General baseline | Whisper, voxtral-mini | either | Whisper ubiquitous; voxtral newer
The streaming insight
Realtime STT emits partial hypotheses while the user is still talking. At end-of-turn you pay only the finalization tail (~50ms), not the whole utterance. Batch STT processes a complete file — a different shape (§16).

6 · Diarization & speaker ID — who is speaking, when they overlap

Interface
Annotates each STT token with speaker_id + confidence. Feeds the sequencer as another tag dimension — diarization is a label on the timeline, not a separate stream. Context accumulates per-speaker-attributed (§10).

7 · Paralinguistics / emotion — the signal in the audio but NOT the transcript

"This is fine" + clipped tone + ↑pitch = a complaint. The transcript alone reads as approval. The affective signal lives in prosody, not words. For enterprise/banks this is churn-detection signal.

8 · Vision channel — referential / deictic signal, ~200ms

9 · Fusion & co-sequencing — the merge (the genuinely novel algorithm)

Order by EVENT-TIME, not arrival. Audio ≈ 80ms, vision ≈ 200ms processing latency → simultaneous edge events arrive ~120ms apart. Calibrate by subtracting each stream's latency; hold a reorder window ≥ the slowest modality; commit on a watermark.
[Diagram: one real moment T₀ on the event-time axis. Text arrives instantly, audio arrives at +80ms, vision at +200ms. Calibrate (arrival − latency) to recover event-time; hold a reorder window ≥ the slowest modality (200ms); on the watermark, commit the enriched token (words + prosody + referent, bundled at event-time) to §10.]

The sequencer algorithm

A k-way merge keyed on event-time with a watermark commit — structurally the same "don't finalize while work is in flight" as the crawler's Queue.join(); here the in-flight thing is a stream that hasn't reported up to time T.

on event e from stream s:
    e.event_time = e.arrival_time - LATENCY[s]    # audio −80ms, vision −200ms, text 0
    buffer.push(e)                                # min-heap on event_time
    watermark[s] = e.event_time
    W = min(watermark over all ACTIVE streams)    # slowest reporter gates commit
    while buffer not empty and buffer.peek().event_time <= W:
        emit(buffer.pop())                        # commit in event-time order → §10
# idle stream: heartbeat advances its watermark so silence can't freeze the timeline
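A runnable version of the pseudocode above, as a minimal sketch. Assumptions: times are integer milliseconds (to keep the arithmetic exact), the `Sequencer` class and its method names are hypothetical, every stream passed to the constructor counts as ACTIVE, and the latencies are the ~80ms / ~200ms figures quoted in this section.

```python
import heapq
import itertools

LATENCY_MS = {"text": 0, "audio": 80, "vision": 200}  # per-stream processing latency


class Sequencer:
    """k-way merge keyed on event-time, committing on a watermark."""

    def __init__(self, streams):
        self.watermark = {s: float("-inf") for s in streams}
        self.heap = []                  # min-heap on event_time
        self._tie = itertools.count()   # stable order for equal event times

    def on_event(self, stream, arrival_ms, payload):
        event_ms = arrival_ms - LATENCY_MS[stream]  # calibrate back to event-time
        heapq.heappush(self.heap, (event_ms, next(self._tie), stream, payload))
        self.watermark[stream] = max(self.watermark[stream], event_ms)
        return self._drain()

    def heartbeat(self, stream, now_ms):
        """An idle stream reports 'nothing up to now' so it cannot freeze the timeline."""
        self.watermark[stream] = max(self.watermark[stream],
                                     now_ms - LATENCY_MS[stream])
        return self._drain()

    def _drain(self):
        w = min(self.watermark.values())  # the slowest reporter gates the commit
        out = []
        while self.heap and self.heap[0][0] <= w:
            event_ms, _, stream, payload = heapq.heappop(self.heap)
            out.append((event_ms, stream, payload))
        return out


seq = Sequencer(["audio", "vision"])
a = seq.on_event("audio", 1080, "this is fine")    # held: vision has not reported
b = seq.on_event("audio", 1180, "thanks")          # held
c = seq.on_event("vision", 1200, "points at fee")  # watermark reaches 1000
d = seq.heartbeat("vision", 1400)                  # idle vision releases the rest
print(a, b, c, d, sep="\n")
```

The audio and vision events from the same real moment (event-time 1000) are committed together even though they arrived 120ms apart, which is the point of ordering by event-time instead of arrival.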

Early vs late fusion: use late fusion — each specialist emits tags, models are swappable, degrades gracefully if one drops. Early fusion (one model on raw features) is richer but $$$, tightly coupled, and loses the per-channel swap that lets Deepgram or Parakeet sit behind one interface.

10 · Context accumulation & memory — where state actually lives

The question that separates a toy from a system: where does context accumulate? Five layers, and only one is the real accumulator. The rest are transient (drain) or a working copy (derived). Confusing the sequencer buffer for memory is the classic mistake.
[Diagram: the five layers in order. 1. stream buffers, per-specialist, TRANSIENT (ms) → 2. sequencer window, a sliding heap of ≤200ms that DRAINS and is not memory → 3. committed timeline, the append-only fused log where durability begins → 4. SESSION STORE, full fidelity, transport-agnostic source of truth, ★ THE ACCUMULATOR → 5. LLM KV-cache, the in-GPU working copy, DERIVED (cached). Two views leave the session store: compaction (recent verbatim + old summarized) produces the compacted view that becomes the prompt, and replay / audit reads the full log by event-time for compliance (banks).]
Each → is a LOSSY compression boundary you design (§20): raw audio/video → tags (drop signal) · timeline → session (keep tags) · session → prompt (summarize old turns)
# | Layer | Lifetime | Holds | Bounded by
1 | Stream reorder buffers | ms (transient) | un-ordered annotations per specialist | stream jitter
2 | Sequencer window (heap) | ≤200ms, drains | uncommitted events awaiting watermark | slowest modality latency
3 | Committed timeline | durable, append-only | fused enriched tokens, event-time order | session length
4 | Session store ★ | durable, source of truth | full history + tags + speaker + intent/state | retention policy
5 | LLM KV-cache | per-process, reused via cache | model's working copy for this turn | GPU memory / context window
The architect points (what the doc was missing)

One-liner: context accumulates durably in the session store; the sequencer window is a sliding buffer that drains, not memory; the KV-cache is a cached working copy. Each layer boundary is a lossy compression you design — and the full-fidelity log vs. the compacted prompt are two different views of the same conversation.
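The "two views of the same conversation" idea can be sketched in a few lines: the session log stays full-fidelity, and the prompt is a compacted view derived from it. `compact`, `keep_recent`, and the placeholder summarizer are hypothetical; a real summarizer would typically be another model call.

```python
def compact(turns, keep_recent=4, summarize=None):
    """Build the prompt view: old turns summarized, recent turns verbatim.

    `turns` is the full session log as (speaker, text) pairs, oldest first.
    The log itself is never modified; this returns a derived view.
    """
    if summarize is None:
        # Hypothetical stand-in for a real summarization step.
        summarize = lambda old: f"[summary of {len(old)} earlier turns]"
    split = max(0, len(turns) - keep_recent)
    old, recent = turns[:split], turns[split:]
    prompt = []
    if old:
        prompt.append(("system", summarize(old)))
    prompt.extend(recent)
    return prompt


session_log = [("user", f"turn {i}") for i in range(6)]
prompt_view = compact(session_log, keep_recent=2)
print(prompt_view)
print(len(session_log))  # the source of truth still holds every turn
```

The prompt length is bounded by `keep_recent` plus one summary entry, while replay and audit keep reading `session_log` unchanged.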

11 · The LLM core — the shared brain

Concern | Choice | Notes
Model | Claude | OpenAI, Qwen 2.5 7B (self-host), Groq
First-token latency | prompt cache + speculative prefill | cache system prompt + history (hit every turn); prefill on STT partials so TTFT ≈ 0 at endpoint
Memory / capacity | KV-cache = session working mem | a token is the unit of work, not a request. Admit by token budget. (Derived from §10 layer 4.)
Throughput | continuous batching | admit/evict per token step; voice + text share the pool at different priorities
Cost headroom | KV-cache compression (~7×) | more concurrent sessions/GPU + longer accumulated context, same cost (verify numbers)
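"Admit by token budget" can be made concrete with the usual per-token KV size for a transformer decoder: 2 (K and V) × layers × KV heads × head dimension × bytes per element. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, fp16) and a hypothetical 40 GiB cache budget; `KVAdmission` is an illustrative name, not a vLLM API.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """K and V tensors, per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


class KVAdmission:
    """Admit sessions while their reserved KV-cache bytes fit the budget."""

    def __init__(self, budget_bytes, bytes_per_token):
        self.budget = budget_bytes
        self.per_token = bytes_per_token
        self.reserved = {}  # session_id -> reserved bytes

    def admit(self, session_id, max_tokens):
        need = max_tokens * self.per_token
        if sum(self.reserved.values()) + need > self.budget:
            return False  # queue, downgrade to the text tier, or route elsewhere
        self.reserved[session_id] = need
        return True

    def release(self, session_id):
        self.reserved.pop(session_id, None)


GIB = 1024 ** 3
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # hypothetical shape
sched = KVAdmission(budget_bytes=40 * GIB, bytes_per_token=per_token)
admitted = sum(sched.admit(f"s{i}", max_tokens=8192) for i in range(50))
print(per_token, admitted)
```

Under these assumed numbers each 8192-token session reserves 1 GiB, so the budget holds 40 sessions; a cache-compression factor would scale that count directly, which is why the table lists it under cost headroom.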
Multimodal input to the LLM
The LLM consumes the fused, enriched token stream (§9) compacted into a prompt (§10) — words carrying speaker_id, prosody tags, visual refs. "this is fine" arrives annotated as "said angrily while pointing at the fee." The model reasons over signal, not just text.

Cascade vs native multimodal (speech-to-speech / omni) — the big fork

A frontier lab is likely building its own multimodal model, which can collapse STT→LLM→TTS into one speech-to-speech model. The systems problems don't disappear — they relocate. Know both, and where each wins.
Axis | Cascade (STT→LLM→TTS) | Native speech-to-speech (omni)
Shape | 3 swappable services (most of this doc) | audio in → audio out, one model
Paralinguistics | lost at STT boundary (tone → flat text) unless a separate prosody tag carries it (§7) | preserved end-to-end; no transcript bottleneck
Latency | serialization per hop (STT final → LLM → TTS TTFB) | lower; no STT/TTS round-trips
Debug / swap | each hop inspectable + replaceable | black box; hard to swap parts or trace
Tools / RAG mid-stream | easy; inject between hops | harder; must be designed into the model
Turn-taking / barge-in | handled in the orchestrator (this doc) | inside the model or alongside it
The architect point

Full-duplex & the death of explicit turn-taking — where the harness goes next

The omni fork above still assumes half-duplex: model and user alternate. Full-duplex models (Moshi, NVIDIA PersonaPlex) listen and speak on the same clock — continuous audio+silence tokens both ways. This is the likely long-term substrate, and it relocates turn-taking one more time: from an orchestrator stage (§4) to an emergent behavior inside the model.

Three generations of where "when to talk" lives:

Gen | Where timing lives | Trained by | Failure mode
v1 · turn-based + VAD | silence-timeout outside the model (§4) | heuristic threshold | cuts off thinkers / laggy; one knob, no backchannel concept
v2 · full-duplex SFT | inside the model, token-by-token (Moshi) | cross-entropy per token | over-silent; silence is cheap under CE, so it under-backchannels
v3 · full-duplex + RL | inside the model, as a policy | sequence-level reward | current frontier; timing becomes a spec
[Diagram: across generations the timing locus migrates inward, from OUTSIDE the model (orchestrator) to INSIDE the model, and becomes a reward. v1: VAD + silence-timeout heuristic with one knob; the model handles content only. v2: full-duplex model that can overlap, but timing is implicit in token-level CE and under-weighted, so it is over-silent. v3: full-duplex model with timing as an RL-rewarded policy serving sequence-level goals (backchannel, don't talk over the user).]
Why RL, not more SFT — the load-bearing point
Turn-taking, barge-in, and backchannel are sequence-level properties, not per-token ones. "Backchannel every so often" or "never talk over the user" can't be expressed in a token-level cross-entropy loss — staying silent instead of saying "mhm" costs almost nothing in CE but is glaring to a human. RL post-training is how you express a sequence-level goal. The mechanism (the §4 endpointer) dissolves into the model; the behavior survives as a reward. (09_turn_taking.py models all three gens, plus a barge-in confirmation window: yield fast on a real interrupt, but wait out one frame so the user's own "mhm" doesn't cut the bot off.)
Soundbite: full-duplex doesn't remove turn-taking — it dissolves the turn-taking module into the model and re-poses it as a reward. Barge-in stops being "stop on voice" and becomes a learned endpointing decision with a latency-vs-false-cutoff tradeoff.
The catch: full-duplex breaks the batching economics (→ §17)
Half-duplex demand is bursty — a session generates ≈⅓ of wall-clock, so continuous batching oversubscribes the idle time. Full-duplex makes demand constant-bitrate: every live session must be ticked every frame, even in silence → the multiplexing is gone → roughly 3× the GPU slots. The fix is not VAD-gating the model (that bolts the §4 endpointer back on and kills the silence-timing behavior) — it's two tiers: a tiny always-on duplex controller decides speak/listen/backchannel every frame and wakes the large generative model only when there's content. Suppress silent audio; never suppress the model's vote on the silence.

12 · TTS — streaming; first-chunk latency is what's felt

Use | Component | Host | Notes
Realtime streaming (default) | Cartesia | API | low first-byte; realtime workhorse
Self-host streaming | Orpheus-3b | Modal | ~200ms latency streaming server (ref impl exists)
Self-host, fastest/smallest | Kokoro-82M | Modal | tiny + streaming output → low TTFB; used in the Modal ragbot ref
On-device / edge (no GPU) | Supertonic | CPU / browser | ~99M params, ONNX runtime; RTF ~0.3× on an e-reader, fast on CPU vs larger A100 baselines; studio 44.1kHz out, 31 languages, runs in browser (WebGPU/WASM) → Pi. Edge/privacy play.
Self-host general | XTTS | Modal | voice cloning, multilingual
Why an on-device TTS matters architecturally
Supertonic (and the edge-TTS class) move synthesis off the server entirely: zero network hop for the audio-out leg, complete privacy, no GPU. For a latency budget that removes the network leg of audio-out (local playout buffering remains); for banks it's a data-residency win. Tradeoff: you ship a model to the client and lose centralized voice control / instant updates. A hybrid (server TTS default, on-device for privacy-sensitive or offline) is the real answer.
The streaming insight
Start TTS on the first clause while the LLM is still generating clause two. First-audio-out is what the user perceives — never wait for the full reply. Buffer 1–2 chunks before playout to survive jitter. (Producer side of 10_audio_pipeline.py.)
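A minimal sketch of the producer side: split the LLM token stream into speakable clauses as they complete, so TTS can start on clause one while the model is still decoding clause two. This is not the contents of 10_audio_pipeline.py; `clauses`, the punctuation set, and `min_chars` (which avoids sending tiny fragments such as "Sure," to TTS) are hypothetical choices.

```python
import re

_CLAUSE_END = re.compile(r"[.!?;:,]\s")  # punctuation followed by whitespace


def clauses(token_stream, min_chars=12):
    """Yield each clause as soon as it completes, without awaiting the full reply."""
    buf = ""
    for token in token_stream:
        buf += token
        while True:
            # Only accept a boundary that leaves at least `min_chars` to speak.
            m = _CLAUSE_END.search(buf, max(0, min_chars - 2))
            if m is None:
                break
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains when the stream ends


consumed = []


def llm_tokens():
    """Stand-in for a streaming LLM; records how far the stream has been read."""
    for t in ["Sure, ", "I can ", "do that. ", "Your balance ", "is fifty ", "dollars."]:
        consumed.append(t)
        yield t


stream = clauses(llm_tokens())
first = next(stream)                # TTS can start speaking here
seen_at_first_clause = len(consumed)
rest = list(stream)
print(first, seen_at_first_clause, rest)
```

The first clause is available after three of the six tokens, which is the "first-audio-out" moment the user perceives.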

13 · Orchestration — who wires the graph together

Framework | Role | Notes
Pipecat | pipeline + turn events | frame-based pipeline; hooks for VAD, turn detection, barge-in, interruptions. SmallWebRTCTransport for lightweight WebRTC.
LiveKit Agents | transport + agent runtime | SFU + rooms + agent SDK; plugins for STT/TTS/LLM. Self-host or cloud; runs on Modal.

These own the event loop: media in → VAD/turn → STT → (fuse) → LLM → TTS → media out, plus the interrupt/barge-in control plane. The fusion/sequencer is a stage inside this graph.

Reference implementation — Modal + Pipecat ragbot (open-source, current)

A concrete, runnable build (modal-projects/open-source-av-ragbot + Modal's low-latency-voice-bot blog) that hits a median ~1s voice-to-voice latency. Worth knowing as the "here's what real looks like" anchor.

Role | Reference uses | Why / note
Transport (client↔bot) | SmallWebRTCConnection | Pipecat WebRTC, E2E encrypted, low latency; SDP exchanged via an ephemeral modal.Dict
VAD + turn | Silero VAD + SmartTurn | Pipecat local-smart-turn; emit start/stop frames
STT | parakeet-tdt-0.6b-v3 | "hard to beat on final-transcript time + accuracy"
RAG | ChromaDB + all-MiniLM-L6-v2 | retrieval in "a few tens of ms"; loaded at container start
LLM | Qwen3-4B-Instruct + vLLM | self-hosted via vLLM over a Tunnel
TTS | Kokoro-82M | streaming output → low TTFB
The Modal architecture moves (the latency wins)
# shape of the reference (paraphrased)
@app.cls(image=bot_image, region=SERVICE_REGIONS,
         enable_memory_snapshot=True, max_inputs=1)   # 1 session / container
class ModalVoiceAssistant:
    @modal.enter(snap=True)
    def load(self):
        self.chroma_db = ChromaVectorDB()             # RAG ready at snapshot

    @modal.method()
    async def run_bot(self, d):                       # d = ephemeral modal.Dict
        conn = SmallWebRTCConnection(ice_servers)
        await conn.initialize(sdp=offer["sdp"], type=offer["type"])
        await d.put.aio("answer", answer)             # SDP answer back to frontend
        await run_bot(conn, self.chroma_db)           # Pipecat pipeline: VAD→STT→RAG→LLM→TTS

# frontend POST /offer → ModalVoiceAssistant().run_bot.spawn(d) (one FunctionCall per call)
Note vs. our reference stack
This build self-hosts everything on Modal (Parakeet/Qwen/Kokoro) for cost + control, vs. the API-heavy default (Deepgram/Claude/Cartesia) elsewhere in this doc. Same architecture, different self-host/API dial (§15). It also folds in RAG — the STT transcript queries Chroma before the LLM, adding a "few tens of ms" retrieval hop. The LLM (Qwen3-4B) is a swappable component behind the orchestrator — that's the realistic self-host choice (open, fits one GPU, vLLM-served); swap to a frontier API for quality.

Reference platform — Dograh (open-source voice-AI platform, also Pipecat)

The other build making the rounds (dograh-hq/dograh, YC, ~4.3k stars). Same Pipecat foundation as the Modal ragbot, but it's a platform, not a minimal reference — useful as the "productized" end of the spectrum.

Aspect | Dograh | Contrast w/ Modal ragbot
Shape | platform (drag-drop workflow builder) | ragbot = minimal hand-wired reference
Orchestration | Pipecat (git submodule) | same foundation
Models | BYO LLM/STT/TTS (or bundled) | both treat models as swappable components
Transport | WebRTC + telephony | Twilio, Vonage, Telnyx, Cloudonix: real PSTN ingress
Deploy | Docker-first, self-host or cloud | ragbot = Modal serverless GPU
Extras | QA node, test-mode web calls, human transfer | productization: evals + agent handoff built in
License / stack | BSD-2, Python+TypeScript | no vendor lock-in
The takeaway from two refs
Both serious open voice stacks land on Pipecat as the orchestration layer with models as swappable components behind it — confirming the doc's architecture. They differ on the dial that matters at Twilio: ragbot optimizes raw latency on serverless GPU; Dograh optimizes productization (telephony providers, workflow builder, QA/evals, human transfer). A real deployment is somewhere between: Pipecat core + your model dial + the telephony + ops layer you need.

14 · Latency budget & waterfall — the number you defend

Mouth-to-mouth target <800ms (enterprise), great ~500ms. It's a pipelined waterfall, not a sum — and the dominant cost is endpointing, not the models.
[Diagram: latency waterfall. While the user is speaking, STT partials stream live and the LLM prefills speculatively on them. User stops speaking → endpointing wait 300ms → endpoint → STT final ~50ms, with LLM TTFT ~0 at the endpoint → LLM decode overlapping TTS on the first clause (TTFB 120ms) → audio out + playout → user hears the bot. Perceived ≈ 550ms.]
Stage | Naive | Tuned | How it's tuned
Network in (jitter buf) | 40ms | 20ms | WebRTC tunable; PSTN fixed
Endpointing | 700ms | 300ms | turn model + adaptive threshold; biggest win
STT finalize | 300ms | ~50ms | partials stream live; pay only the tail
LLM TTFT | 600ms | ~50ms | prompt cache + speculative prefill on partials
LLM→first clause | 400ms | ~0 | overlaps TTS; don't await full reply
TTS TTFB | 300ms | 120ms | streaming TTS, first chunk only
Net out + playout | 60ms | 30ms | μ-law re-encode on PSTN
MOUTH-TO-MOUTH | ~1.4s | ~550ms | pipelined, not summed
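A quick arithmetic check on the table above, using its own per-stage figures (approximate values taken at face value). The serial sum of each column is an upper bound on mouth-to-mouth latency; the table's totals are lower because stages overlap. Variable names are illustrative.

```python
# (stage, naive ms, tuned ms), copied from the latency table
STAGES_MS = [
    ("network in (jitter buf)", 40, 20),
    ("endpointing", 700, 300),
    ("STT finalize", 300, 50),
    ("LLM TTFT", 600, 50),
    ("LLM -> first clause", 400, 0),
    ("TTS TTFB", 300, 120),
    ("net out + playout", 60, 30),
]

naive_sum = sum(naive for _, naive, _ in STAGES_MS)   # strictly serial upper bound
tuned_sum = sum(tuned for _, _, tuned in STAGES_MS)
biggest_tuned = max(STAGES_MS, key=lambda s: s[2])[0]
endpointing_share = 300 / tuned_sum

print(f"naive serial sum: {naive_sum} ms")
print(f"tuned serial sum: {tuned_sum} ms")
print(f"largest tuned stage: {biggest_tuned} ({endpointing_share:.0%} of the tuned sum)")
```

The tuned serial sum (570ms) sits close to the table's ~550ms, and endpointing alone is more than half of it, which is the basis for calling it the dominant cost. The naive serial sum (2400ms) is well above the table's ~1.4s, consistent with the "pipelined, not summed" note.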

15 · Infra, GPUs & economics

vs proprietary API | Speed | Cost
Parakeet, fastest | 112× | 60×
Parakeet, cheapest | 25× | 200×
Canary, fastest | 80× | 55×
Canary, cheapest | 12× | 152×
Proprietary API | |

Orange = API (buy quality/latency, pay per-use). Teal = self-host (1–2 orders of magnitude cheaper, you run infra). Typical split: self-host STT (volume), API for TTS + frontier LLM (quality/latency).

16 · Batch transcription path — the OTHER system (offline, throughput)

Call recordings / analytics: a batch job optimizing throughput over a corpus you own — inverse of the realtime path. Don't conflate them.

17 · Scaling & the scheduler

Full-duplex serving — when the session never goes idle (the v3 cost, §11)

Full-duplex converts multiplexable bursty demand into non-multiplexable steady demand — the same reason constant-bitrate traffic is harder to pack than bursty. The cost shows up as ≈3× the fleet, and it's a batching problem, not a FLOPs problem.
Why batching chokes on full-duplex | Mechanism
Lost multiplexing | can't release the slot during silence → effective cost ≈ 1/duty-cycle (~3× at a ⅓ talk ratio)
Frame deadline | each tick must finish within ~80ms; can't wait to fill a batch, so you run undersized batches on schedule
Pure decode | one token/frame autoregressive → memory-bandwidth bound, low arithmetic intensity (worst GPU regime; no prefill density to amortize)
KV-cache pinned | cache stays resident the whole conversation; can't evict/swap during silence as you might between turns → memory-capacity bound on concurrency
[Diagram: three sessions A, B, C on a shared timeline. Half-duplex: demand is bursty, so silent spans need no slot and one GPU slot packs all three by reusing idle time. Full-duplex: demand is constant-bitrate, every frame is ticked whether talking or silent, the slot never frees, and three slots are needed (≈3×).]
same 3 sessions, same conversation — only the serving model differs. ⅓ duty cycle → 1 slot vs 3.
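The 1-slot-versus-3 comparison is just duty-cycle arithmetic. A minimal sketch, assuming ideal packing in the half-duplex case (real schedulers leave headroom for bursts that coincide); `slots_needed` is a hypothetical helper, and `Fraction` keeps the ⅓ exact.

```python
import math
from fractions import Fraction


def slots_needed(sessions: int, duty_cycle: Fraction, full_duplex: bool) -> int:
    """GPU slots required for a given number of live sessions."""
    if full_duplex:
        return sessions  # every session is ticked every frame, talking or silent
    return math.ceil(sessions * duty_cycle)  # idle time is reused by other sessions


duty = Fraction(1, 3)  # a session generates for about a third of wall-clock
for n in (3, 300):
    half = slots_needed(n, duty, full_duplex=False)
    full = slots_needed(n, duty, full_duplex=True)
    print(f"{n} sessions: half-duplex {half} slots, full-duplex {full} slots")
```

The ratio between the two is 1/duty-cycle, so a product with a lower talk ratio pays an even larger multiple for going full-duplex.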
The two-tier escape — duplex behavior and good economics
[Diagram: two-tier serving on a frame clock of ~80ms. User audio frames feed a tiny duplex controller that ticks EVERY frame, stays always resident, batches uniformly, and decides speak / listen / backchannel / yield. On listen or backchannel it emits silence or an "mhm" acknowledgement with no big-model call. Only when there is content does it wake the large generative model, which is woken ON DEMAND, so its load is bursty and poolable. Tiny FLOPs to DECIDE every frame, big FLOPs only to SPEAK → duplex behavior at multiplexable cost.]
When to just eat the cost (gate the model on VAD anyway)
A product call, not a correctness one. Reactive products (command, Q&A, IVR) get ≈zero value from the model acting during silence → gate aggressively; the behavioral loss is noise and the 3× saving is real. Relational products (companion, tutor, interviewer) — the silence behavior is the product → don't gate. Cheap hybrid: gate on VAD but a single wake-timer re-triggers after N seconds of mutual silence (one scalar, not a full endpointing module).

18 · Escalation tiers — one session, upgraded

Tier | Transport + stack | Latency | Cost
Text | WS/HTTP → LLM | seconds | cheap, poolable
Voice | WebRTC/PSTN → AEC3·VAD·turn·STT·LLM·TTS | sub-second | GPU slot pinned
Video | + vision specialist (200ms) | sub-second + vision | GPU slot ++

Escalation = bind an existing session to a richer tier without losing state — the session store (§10) is transport-agnostic (decide day one). Triggers: explicit ask · detected frustration (measurable from prosody/visual tags) · complexity · verification · human handoff. Downgrade symmetric — the accumulated context carries across.

19 · Failure modes & tail latency — p99 is what users feel

Failure | Effect | Mitigation
LLM TTFT spike | dead air | backchannel ("let me check…") to mask; timeout→fallback model
STT wrong on μ-law | misheard intent | upsample+denoise; confidence gate → reprompt
TTS stall | cut-off speech | buffer 1–2 chunks; cache common phrases
Specialist down (vision/prosody) | lost signal | degrade gracefully to remaining channels (late fusion enables this)
Watermark starvation | frozen timeline | idle-stream heartbeat advances watermark
Session store unavailable | amnesia / lost turn | write-ahead the committed timeline; rebuild KV-cache from store on reconnect
Network jitter | choppy audio | jitter buffer (adds latency, which is the tradeoff); regional POP
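The first mitigation in the table (mask an LLM TTFT spike with a backchannel) can be sketched as a race between the first token and a timer. `first_token_or_filler`, the `say` callback, and the `mask_after_s` default are hypothetical; the fallback-model timeout from the same row is not shown.

```python
import asyncio


async def first_token_or_filler(llm_first_token, say, mask_after_s=0.4,
                                filler="let me check…"):
    """Wait for the LLM's first token; if it is late, speak a filler and keep waiting."""
    task = asyncio.ensure_future(llm_first_token)
    try:
        # shield() keeps the LLM call alive when the timer fires first.
        return await asyncio.wait_for(asyncio.shield(task), timeout=mask_after_s)
    except asyncio.TimeoutError:
        await say(filler)   # buys time without leaving dead air
        return await task   # the real answer still arrives


async def demo(llm_delay_s):
    said = []

    async def say(text):
        said.append(text)

    async def llm():
        await asyncio.sleep(llm_delay_s)  # stand-in for time-to-first-token
        return "Your"

    token = await first_token_or_filler(llm(), say, mask_after_s=0.05)
    return token, said


slow = asyncio.run(demo(0.2))   # a TTFT spike: the filler is spoken
fast = asyncio.run(demo(0.0))   # normal case: no filler
print(slow, fast)
```

The mask timer should be set below the point where silence starts to feel like a dropped line, leaving room for the latency already spent on endpointing and STT.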
Dead air is the worst failure
>~700ms of silence reads as "the line dropped." Latency you can't remove, you mask — a backchannel buys LLM time and feels natural. Senior instinct: design the masking, not just the speed.

20 · Tag schema & interface contracts — the data that flows between stages

Late fusion only works if every specialist emits a common annotation envelope on the shared timeline. The envelope is the contract.

Annotation = {                  # what every specialist emits onto the timeline
  event_time:   float,          # edge time, AFTER latency calibration (the join key)
  arrival_time: float,          # when WE got it (for watermark + debugging)
  source:       "stt"|"prosody"|"vision"|"text"|"diar",
  speaker_id:   str|None,       # from diarization; None for non-attributable
  payload:      {...},          # source-specific, below
  confidence:   float,
  final:        bool            # partial (revisable) vs final hypothesis
}

payload by source:
  stt     → { text, word_timings[] }                    # the words
  prosody → { anger, joy, stress, certainty, pitch }    # NOT in transcript
  vision  → { ref, gesture, facial_affect }             # referents/expression
  text    → { text }                                    # typed; event_time == arrival
  diar    → { speaker_id, embedding }                   # attribution

EnrichedToken = {               # what the sequencer emits after bundling co-temporal annotations
  event_time, speaker_id, text,
  prosody: {...}, visual: {...},  # merged from all sources within ε of event_time
  truncated: bool                 # set on barge-in (§4) so the model knows
}
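The envelope above maps directly onto Python dataclasses. The sketch below mirrors the field names from the schema; the `bundle` function, its anchoring on the first stt/text annotation, and the default ε of 50ms are assumptions made for illustration, since the schema only says "merged from all sources within ε of event_time".

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

Source = Literal["stt", "prosody", "vision", "text", "diar"]


@dataclass
class Annotation:
    event_time: float      # seconds, after latency calibration (the join key)
    arrival_time: float
    source: Source
    payload: dict
    confidence: float
    final: bool
    speaker_id: Optional[str] = None


@dataclass
class EnrichedToken:
    event_time: float
    speaker_id: Optional[str]
    text: str
    prosody: dict = field(default_factory=dict)
    visual: dict = field(default_factory=dict)
    truncated: bool = False  # set on barge-in


def bundle(annotations, epsilon=0.05):
    """Merge annotations within `epsilon` seconds of the first stt/text annotation."""
    anchor = next(a for a in annotations if a.source in ("stt", "text"))
    token = EnrichedToken(anchor.event_time, anchor.speaker_id, anchor.payload["text"])
    for a in annotations:
        if abs(a.event_time - anchor.event_time) > epsilon:
            continue  # belongs to a different moment on the timeline
        if a.source == "prosody":
            token.prosody.update(a.payload)
        elif a.source == "vision":
            token.visual.update(a.payload)
        elif a.source == "diar" and token.speaker_id is None:
            token.speaker_id = a.payload["speaker_id"]
    return token


token = bundle([
    Annotation(1.00, 1.08, "stt", {"text": "this is fine", "word_timings": []}, 0.93, True),
    Annotation(1.02, 1.10, "prosody", {"anger": 0.8, "pitch": "rising"}, 0.70, True),
    Annotation(1.01, 1.21, "vision", {"ref": "fee line item", "gesture": "pointing"}, 0.60, True),
    Annotation(1.00, 1.09, "diar", {"speaker_id": "spk_0"}, 0.90, True),
    Annotation(1.50, 1.58, "prosody", {"anger": 0.1}, 0.70, True),  # outside epsilon
])
print(token)
```

The resulting token carries the words together with the anger score and the pointing referent, which is the "said angrily while pointing at the fee" annotation the LLM section describes.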
Stage boundary | In | Out
transport → front-end | RTP frames (Opus/μ-law) | uniform PCM, 20ms
front-end → specialists | clean PCM + VAD flags | (audio) to STT/prosody/diar
specialists → sequencer | raw media | Annotation (envelope above)
sequencer → fusion | Annotations | event-time-ordered, watermark-committed
fusion → LLM | Annotations bundle | EnrichedToken → compacted prompt (§10)
LLM → TTS | token stream | text clauses (stream on first clause)

21 · Build order + interview ammo

  1. Text chat → session store + stateless handlers + Claude. Baseline.
  2. Persistent transport-agnostic session (§10) — the load-bearing decision.
  3. Voice loop (Pipecat/LiveKit): VAD → streaming STT (Deepgram) → LLM → streaming TTS (Cartesia). Measure mouth-to-mouth first.
  4. Real endpointing model (Pipecat smart-turn) — kill the naive silence timer. Biggest win.
  5. Barge-in + AEC3 (+ truncation recorded in context).
  6. Speculative prefill on partials + prompt cache — shave TTFT.
  7. Sequencer (event-time + watermark) — the moment a 2nd stream appears.
  8. Fusion + specialists — diarization (Soniox), prosody, then vision (200ms widens the window; nothing else changes — payoff of step 7).
  9. Context compaction (§10) — recent verbatim + old summarized; keeps prompt bounded.
  10. Backchannel/filler — hide the p99 tail.
  11. Self-host STT on Modal + scheduler + KV-cache compression — cost at scale.
  12. Batch transcription path — offline analytics, separate system.
Say out loud
This stack | Reuses pattern from
Sequencer watermark commit | 09_web_crawler Queue.join()
Watermark = min over streams; straggler | 15_training_pipeline all-reduce barrier
Commit on watermark OR deadline | 14_inference_server size-OR-timer
Barge-in cancel | 10_audio_pipeline_interruptible
Full-duplex gens (v1/v2/v3) + barge-in confirm window | 09_turn_taking
Jitter buffer: reorder / loss-conceal / late-drop | 19_jitter_buffer
Streaming TTS producer | 10_audio_pipeline
KV=session, token=unit of work, continuous batching | 14_inference_server
Tiers = work→capacity; admit by budget | README resource-scheduler thesis
Event-time k-way merge | algorithms_cheatsheet heap merge
Batch ASR = throughput over owned corpus | 07_gpu_data_loader / inference-vs-training

References

Verify-before-quoting: KVarN ~7× numbers, exact RTFx/WER figures, and cost multipliers are as-captured from source material — confirm before citing live. The Modal ragbot details (~1s v2v, component list) are from the Modal blog + repo as of capture.