Multimodal Conversational AI — The Complete Modern Stack

Master reference. Real-time text/voice/video conversation with an LLM (e.g. Claude). Every layer named with swappable alternatives, interfaces, latency, and self-host/API economics. Ingress covers browser WebRTC and PSTN/Twilio. Companions: voice_agents_notes.md, multimodal_fusion_design.md.

0 · Thesis & the three hard problems

Everything — typed text, speech, video — becomes tokens before the model. But the modalities are not redundant transcripts to dedup; they are complementary signals to fuse on a shared event-time timeline. The model calls are the easy 20%. The hard parts: (1) knowing when the human finished (endpointing), (2) aligning streams with different latencies (audio ~80ms vs vision ~200ms), and (3) masking the latency you can't remove.

Hard problem 1 — Turn-taking. "Did they stop, or just pause?" VAD ≠ end-of-turn. Dominates perceived latency.
Hard problem 2 — Co-sequencing. Streams arrive out of order because each modality has its own processing latency. Order by event-time, not arrival, or the model sees the conversation backwards.
Hard problem 3 — Tail latency / dead air. p99, not mean, is what the user feels. >700ms silence reads as a dropped connection.

Two structural reframes: a conversation is one persistent, transport-agnostic session served at different tiers (text = cheap/async, voice = GPU-pinned/realtime); and the whole system is a resource scheduler matching sessions to capacity (KV-cache bytes).

1 · The whole stack at a glance

Diagram (not reproduced here). Legend: solid = data flow · dashed amber = control (turn-end triggers the LLM; it does not reorder the sequencer) · dashed red = barge-in cancel, flushing the entire output chain at once (LLM decode · TTS · egress queue · client playout buffer) and marking the turn truncated. In full-duplex this red path vanishes: yielding is just the model's next-frame decision.

2 · Transport & ingress — get media in/out, low jitter

Concern	Primary	Alternatives	Why / notes
Realtime media + agent framework	LiveKit	Pipecat transport, Daily, raw WebRTC	SFU + agent SDK; runs the room, handles tracks. Runs on Modal.
Lightweight WebRTC	SmallWebRTCTransport	aiortc, full SFU	Pipecat's minimal P2P transport — single-agent, no SFU overhead.
Browser/app audio	WebRTC (Opus 48kHz)	—	Wideband, tunable jitter buffer, gives AEC hooks. The good path.
Phone	Twilio Media Streams	Twilio ConversationRelay, SIP trunk, Vonage	PSTN = 8kHz μ-law (G.711), carrier-fixed, narrowband — caps STT quality before any model choice. Media Streams forks 20ms frames to your WS; ConversationRelay is managed STT+TTS, you bring the LLM.
Regional placement	media POP near caller	—	Transatlantic RTT alone ~150ms. Co-locate agent + media server.

Interface

In: RTP/media frames (Opus or μ-law), 20ms each. Out: raw PCM frames to the front-end + a control channel for events (track start/stop, DTMF, mute). Pipecat/LiveKit normalize codecs so downstream sees uniform PCM.
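As a quick sanity check on the 20ms framing, the arithmetic below derives bytes per frame for the two formats named above. It assumes mono audio, 1 byte per sample for μ-law (G.711) and 2 bytes per sample for 16-bit PCM; Opus itself is compressed, so the 48kHz figure applies after decoding. The function name is illustrative.

```python
def frame_bytes(sample_rate_hz: int, frame_ms: int, bytes_per_sample: int, channels: int = 1) -> int:
    """Bytes in one fixed-duration frame of uncompressed or companded audio."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * bytes_per_sample * channels

MULAW_8K_20MS = frame_bytes(8000, 20, 1)    # PSTN leg: G.711 mu-law, 1 byte/sample
PCM16_48K_20MS = frame_bytes(48000, 20, 2)  # wideband PCM after Opus decode, mono
```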

WebSocket vs WebRTC — when each, and why WS is first-class

WebSocket is not a downgrade — it's the actual transport in three places that matter: (1) Twilio Media Streams forks call audio to your server as μ-law frames over a WebSocket, not WebRTC; (2) service-to-service (the Modal ragbot's Tunnels were WS); (3) any server-to-server agent with no browser. WebRTC is for the last mile to a browser/app; WebSocket is for everything behind it.

Axis	WebRTC	WebSocket
Layer	UDP + SRTP (media-grade)	TCP + TLS (one framed duplex stream)
Loss handling	tolerates loss (audio degrades gracefully)	head-of-line blocking — a lost packet stalls the stream (TCP)
Jitter buffer / NAT	built in (ICE/STUN/TURN)	you handle ordering/timing yourself
Built-in AEC/NS	yes (browser media stack)	NS server-side OK (RNNoise on inbound); AEC stays at the edge — can't align the echo loop over a cloud round-trip, so rely on handset/carrier cancellation
Setup	SDP offer/answer, heavier	one HTTP upgrade, trivial
Best for	last mile to browser/app	telephony bridge, service↔service, server agents

The framing to say (covers WS-first stacks and Twilio either way)

Twilio path is WebSocket end-to-end on your side. Carrier → Twilio → your WS (Media Streams) → agent. You receive base64 μ-law 8kHz in JSON frames (media, start, stop, mark events), and you send audio back as media frames. No WebRTC in your code at all.
WS gives you a single ordered duplex pipe — simplest possible transport for a server agent: one connection carries inbound audio frames and your outbound TTS audio and control events (interrupt, mark, clear). That's all the realtime loop needs.
The tax you accept with WS: TCP head-of-line blocking. On a lossy network a dropped packet stalls everything behind it (WebRTC would just drop+conceal). Mitigate with small frames, a tight network path (regional pinning), and treating the WS as reliable-but-bursty. For server↔service and Twilio (already past the carrier) this is fine; for last-mile-to-a-phone-on-bad-wifi, WebRTC wins.
Barge-in over WS = a control message, not a media signal. You send a clear/interrupt frame to flush buffered outbound audio (Twilio: the clear message; your own protocol: an event). Same single-Event cancel as 10_audio_pipeline_interruptible, just delivered as a WS control frame.

# Twilio Media Streams = a WebSocket your server accepts. Frame shape:
{ "event":"start", "start":{"streamSid":"...","mediaFormat":{"encoding":"audio/x-mulaw","sampleRate":8000}} }
{ "event":"media", "media":{"payload":"<base64 μ-law 20ms>","timestamp":"1234"} }   # inbound audio
{ "event":"mark", "mark":{"name":"tts-clause-3"} }                                  # playback checkpoint
{ "event":"stop" }

# you SEND back: media frames (your TTS, base64 μ-law) + a "clear" to flush on barge-in
{ "event":"media", "streamSid":"...", "media":{"payload":"<base64 μ-law>"} }
{ "event":"clear", "streamSid":"..." }   # barge-in: drop queued audio
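A minimal sketch of handling those frames on the server side, using only the standard library. It covers just the fields shown above; the helper names (`parse_inbound`, `media_frame`, `clear_frame`) are hypothetical, and the WebSocket server itself is left out.

```python
import base64
import json

def parse_inbound(message: str):
    """Return (event, payload_bytes); payload is None for non-media events."""
    frame = json.loads(message)
    event = frame.get("event")
    if event == "media":
        # payload is base64-encoded mu-law audio
        return event, base64.b64decode(frame["media"]["payload"])
    return event, None

def media_frame(stream_sid: str, mulaw: bytes) -> str:
    """Outbound TTS audio, already encoded as 8kHz mu-law."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw).decode("ascii")},
    })

def clear_frame(stream_sid: str) -> str:
    """Barge-in: ask the far side to drop audio it has queued but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

On barge-in the agent would send `clear_frame(...)` once and stop producing `media_frame(...)` messages for the interrupted turn.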

3 · Audio front-end — clean the signal before perception

Stage	Component	Alternatives	Why it must be here
Echo cancellation	WebRTC AEC3	Speex AEC, hardware AEC	Without it the bot hears itself through speaker→mic → false barge-in / self-interrupt. Must run before VAD.
Noise suppression	RNNoise / WebRTC NS	DeepFilterNet, Krisp	Phone audio is noisy; cleans STT input. Cheap CPU.
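Voice activity detection	Silero VAD	WebRTC VAD, kyutai STT (built-in)	Per-frame "is there speech." Fast first gate, but not end-of-turn. Speaker embeddings reusable for diarization (§6) are assumed to come from this stage (verify: Silero VAD's documented output is a speech probability, so embeddings may need a separate model).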

Ordering matters

AEC3 → NS → VAD. Echo-cancel first (so suppression/VAD see clean near-end audio), then denoise, then detect voice. Classic bug: VAD fires on the bot's own echo.

4 · Turn detection & endpointing — the highest-leverage problem

VAD answers "is there voice now." Turn detection answers "is the turn over." Different models. A 300ms gap mid-sentence — "send it to… uh… my checking account" — must NOT fire the bot. This dominates mouth-to-mouth latency.

Layer	Component	Signal used	Notes
Acoustic gate	Silero VAD	energy / voiced frames	fast first filter; "silence started"
Semantic end-of-turn	Pipecat Smart Turn	acoustic + linguistic completeness	"I want to…" (incomplete→wait) vs "transfer fifty dollars." (complete→fire). Frameworks expose these as turn events.
Threshold policy	adaptive timeout	completeness + prosody	shorten after a complete clause; lengthen after a filler / rising intonation.

The knob: short silence threshold = snappy but cuts off thinkers; long = polite but laggy. The latency-vs-correctness knob, per-deployment (a bank IVR tolerates more wait than a casual assistant).
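One way to picture the adaptive threshold: interpolate the silence timeout from a completeness score, and refuse to go short after a filler or rising intonation. This is a sketch under assumptions; the function name, the 200/1200ms bounds, and the 0.75 factor are illustrative values, not tuned numbers from any framework.

```python
def silence_timeout_ms(completeness: float, ended_on_filler: bool, rising_intonation: bool,
                       min_ms: int = 200, max_ms: int = 1200) -> int:
    """Complete clause -> short wait; incomplete -> long wait; fillers force patience."""
    c = min(max(completeness, 0.0), 1.0)          # clamp the turn model's score to [0, 1]
    timeout = max_ms - c * (max_ms - min_ms)      # linear interpolation
    if ended_on_filler or rising_intonation:
        timeout = max(timeout, 0.75 * max_ms)     # "uh..." or a rising pitch: keep waiting
    return int(round(timeout))
```

A bank IVR would raise `min_ms`; a casual assistant would lower it. That per-deployment choice is the knob described above.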

Barge-in (interrupt the bot)

Cancel TTS in single-frame latency — flush outbound buffer. (10_audio_pipeline_interruptible.py: one asyncio.Event, race every await against it.)
AEC3 already running so the bot's voice isn't mistaken for the human.
Re-sequence + record truncation — the interruption is a new high-priority event; mark the bot's half-spoken turn as truncated in the accumulated context (see §10) so the model knows it wasn't fully heard.
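The "one asyncio.Event, race every await against it" idea can be sketched as below. This is not the contents of 10_audio_pipeline_interruptible.py; `race`, `speak`, and `Interrupted` are hypothetical names, and `asyncio.sleep(0.02)` stands in for sending one 20ms frame.

```python
import asyncio

class Interrupted(Exception):
    """Raised when the cancel event fires before the awaited work finishes."""

async def race(awaitable, cancel: asyncio.Event):
    """Await `awaitable`, but raise Interrupted as soon as `cancel` is set."""
    work = asyncio.ensure_future(awaitable)
    stop = asyncio.ensure_future(cancel.wait())
    done, _ = await asyncio.wait({work, stop}, return_when=asyncio.FIRST_COMPLETED)
    if work in done:
        stop.cancel()
        return work.result()
    work.cancel()
    raise Interrupted()

async def speak(chunks, cancel: asyncio.Event, played: list) -> bool:
    """Play chunks until done or barge-in; return True if the turn was truncated."""
    try:
        for chunk in chunks:
            await race(asyncio.sleep(0.02), cancel)  # stand-in for one outbound frame
            played.append(chunk)
    except Interrupted:
        return True   # caller records truncated=True in the accumulated context
    return False

async def demo():
    cancel, played = asyncio.Event(), []
    task = asyncio.create_task(speak(list(range(50)), cancel, played))
    await asyncio.sleep(0.07)
    cancel.set()      # barge-in detected
    truncated = await task
    return truncated, len(played)
```

Because every await in the output path goes through `race`, cancellation latency is bounded by one frame rather than by the length of the queued reply.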

5 · STT — streaming (realtime) and batch (offline) are different problems

Open ASR leaderboard (self-host candidates)

Model	ESB WER En	RTFx	Multiling	Timestamps	VAD	Note
nvidia/parakeet-tdt-0.6b-v2	6.05	3386	En	✅	❌	chosen: ~3× faster than any comparable-WER model
nvidia/canary-1b-flash	6.35	1046	✅ En/Fr/Es/De	✅	❌	sister model, multilingual
kyutai/stt-2.6b-en	6.4	88	En	✅	✅	built-in VAD

The full STT option space

Use	Option	Host	Notes
Realtime streaming (default)	Deepgram	API	low-latency partials; the realtime workhorse
Realtime, self-host	Parakeet / Canary (NeMo)	Modal GPU	1–2 orders cheaper at scale; NeMo makes model-swap trivial
Realtime w/ VAD	kyutai stt-2.6b	self-host	VAD bundled, simpler pipeline
Realtime diarized	Soniox rt-4	API	strong diarization on the realtime path
Multilingual realtime	Groq	API	natural multilingual, ~100 languages
Batch (fastest/cheapest)	Parakeet/Canary on Modal	Modal batch	RTFx 1000s → huge corpora cheap; not for realtime
Batch, high quality	Mistral Transcribe 2	API	very good + fast but batch only
General baseline	Whisper, voxtral-mini	either	Whisper ubiquitous; voxtral newer

The streaming insight

Realtime STT emits partial hypotheses while the user is still talking. At end-of-turn you pay only the finalization tail (~50ms), not the whole utterance. Batch STT processes a complete file — a different shape (§16).

6 · Diarization & speaker ID — who is speaking, when they overlap

Soniox rt-4 — realtime diarization, primary for overlapping speakers.
Silero embeddings — speaker vectors from the VAD stage; cluster to attribute turns.
Overlap is the hard case — two people at once isn't "pick one"; needs separation + per-speaker attribution.

Interface

Annotates each STT token with speaker_id + confidence. Feeds the sequencer as another tag dimension — diarization is a label on the timeline, not a separate stream. Context accumulates per-speaker-attributed (§10).

7 · Paralinguistics / emotion — the signal in the audio but NOT the transcript

"This is fine" + clipped tone + ↑pitch = a complaint. The transcript alone reads as approval. The affective signal lives in prosody, not words. For enterprise/banks this is churn-detection signal.

Prosody/emotion model on the audio stream → tags like {anger:0.8, certainty:low, stress:high}.
Continuous (per-frame) vs discrete words → a windowed join on event-time binds "this anger spike → that word" (§9).
Design the tag schema around channel-unique signal (§20): tone from audio, referents/expression from video, words from text.
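The windowed join between continuous prosody frames and discrete words might look like the sketch below. It assumes both streams are already on calibrated event-time, that prosody frames are sorted, and that taking the per-tag maximum inside the window is an acceptable aggregation; the 150ms window and the function name are illustrative.

```python
from bisect import bisect_left, bisect_right

def bind_prosody_to_words(words, prosody_frames, window_s=0.15):
    """words: [(event_time, text)]; prosody_frames: [(event_time, {tag: score})], time-sorted.
    Returns each word with the per-tag max over frames within +/- window_s."""
    times = [t for t, _ in prosody_frames]
    out = []
    for t, text in words:
        lo = bisect_left(times, t - window_s)    # first frame inside the window
        hi = bisect_right(times, t + window_s)   # one past the last frame inside
        tags = {}
        for _, frame in prosody_frames[lo:hi]:
            for name, score in frame.items():
                tags[name] = max(tags.get(name, 0.0), score)
        out.append({"event_time": t, "text": text, "prosody": tags})
    return out

words = [(1.00, "this"), (1.20, "is"), (1.45, "fine")]
frames = [(0.90, {"anger": 0.1}), (1.40, {"anger": 0.8}), (1.50, {"anger": 0.7})]
joined = bind_prosody_to_words(words, frames)
```

Here the anger spike at 1.40s binds to "fine", not to "this", which is the "this anger spike → that word" association described above.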

8 · Vision channel — referential / deictic signal, ~200ms

Carries what can't be said: what they point at / show, facial expression. "Send it to them" + [points] — the referent only exists in video.
Sparse events (per-gesture), ~200ms latency — the slowest modality, which sets the fusion window (§9).
Emits tags: {ref:"fee_line_item", affect:"frustrated", gesture:"point"}.

9 · Fusion & co-sequencing — the merge (the genuinely novel algorithm)

Order by EVENT-TIME, not arrival. Audio ≈ 80ms, vision ≈ 200ms processing latency → simultaneous edge events arrive ~120ms apart. Calibrate by subtracting each stream's latency; hold a reorder window ≥ the slowest modality; commit on a watermark.

Diagram (not reproduced here). Legend: teal = latency calibration · amber = the hold window · red = the commit decision

The sequencer algorithm

A k-way merge keyed on event-time with a watermark commit — structurally the same "don't finalize while work is in flight" as the crawler's Queue.join(); here the in-flight thing is a stream that hasn't reported up to time T.

on event e from stream s:
    e.event_time = e.arrival_time - LATENCY[s]    # audio −80ms, vision −200ms, text 0
    buffer.push(e)                                # min-heap on event_time
    watermark[s] = e.event_time
    W = min(watermark over all ACTIVE streams)    # slowest reporter gates commit
    while buffer.peek().event_time <= W:
        emit(buffer.pop())                        # commit in event-time order → §10
# idle stream: heartbeat advances its watermark so silence can't freeze the timeline
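A runnable version of that pseudocode, as a sketch. Two choices here are assumptions rather than part of the original: the watermark is kept monotonic with `max(...)`, and a counter breaks ties in the heap so payloads are never compared. The class and method names are illustrative.

```python
import heapq
import itertools

LATENCY = {"audio": 0.080, "vision": 0.200, "text": 0.0}  # seconds, figures from this section

class Sequencer:
    def __init__(self, streams):
        self.heap = []                                      # min-heap keyed on event_time
        self.watermark = {s: float("-inf") for s in streams}
        self._tie = itertools.count()                       # tie-breaker for equal event_times

    def _drain(self):
        w = min(self.watermark.values())                    # slowest reporter gates commit
        out = []
        while self.heap and self.heap[0][0] <= w:
            event_time, _, stream, payload = heapq.heappop(self.heap)
            out.append((event_time, stream, payload))
        return out

    def on_event(self, stream, arrival_time, payload):
        event_time = arrival_time - LATENCY[stream]         # calibrate to edge time
        heapq.heappush(self.heap, (event_time, next(self._tie), stream, payload))
        self.watermark[stream] = max(self.watermark[stream], event_time)
        return self._drain()

    def heartbeat(self, stream, now):
        """Idle stream reports 'nothing before now - latency' so silence cannot stall commits."""
        self.watermark[stream] = max(self.watermark[stream], now - LATENCY[stream])
        return self._drain()

seq = Sequencer(["audio", "vision"])
first = seq.on_event("audio", 1.180, "them")     # event_time 1.10, held: vision has not reported
second = seq.on_event("vision", 1.250, "point")  # event_time 1.05, commits ahead of the word
third = seq.heartbeat("vision", 1.400)           # vision watermark 1.20 releases the word
```

The gesture arrived after the word but is committed before it, which is the event-time ordering the section argues for.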

Early vs late fusion: use late fusion — each specialist emits tags, models are swappable, degrades gracefully if one drops. Early fusion (one model on raw features) is richer but $$$, tightly coupled, and loses the per-channel swap that lets Deepgram or Parakeet sit behind one interface.

10 · Context accumulation & memory — where state actually lives

The question that separates a toy from a system: where does context accumulate? Five layers, and only one is the real accumulator. The rest are transient (drain) or a working copy (derived). Confusing the sequencer buffer for memory is the classic mistake.

#	Layer	Lifetime	Holds	Bounded by
1	Stream reorder buffers	ms (transient)	un-ordered annotations per specialist	stream jitter
2	Sequencer window (heap)	≤200ms, drains	uncommitted events awaiting watermark	slowest modality latency
3	Committed timeline	durable, append-only	fused enriched tokens, event-time order	session length
4	Session store ★	durable, source of truth	full history + tags + speaker + intent/state	retention policy
5	LLM KV-cache	per-process, reused via cache	model's working copy for this turn	GPU memory / context window

The architect points (what the doc was missing)

Two representations, not one. Session store = full fidelity (audit/replay/compliance — non-negotiable for banks). LLM context = a compacted view (recent turns verbatim, older turns summarized). You can't grow the prompt forever.
Every arrow is a lossy compression you design. specialist→tags (drop raw media) · timeline→session (keep tags) · session→prompt (summarize). The thing that survives each boundary is a design decision — same theme as fusion's "what survives compression."
The KV-cache is the accumulation that feels real but isn't durable. It's derived from session history; prompt-caching reuses it across turns so you don't re-prefill. Lose the GPU, rebuild it from the session store. (token = unit of work; KV-cache compression ~7× = longer accumulated context per GPU — §11/§17.)
Truncation must be recorded. On barge-in the bot's turn was only partially heard — the accumulated context marks it truncated, or the model assumes full delivery and the conversation desyncs from reality.
Store by event-time, per-speaker. Replay/audit must reconstruct what actually happened, in order, attributed — not a flattened arrival-order mush.

One-liner: context accumulates durably in the session store; the sequencer window is a sliding buffer that drains, not memory; the KV-cache is a cached working copy. Each layer boundary is a lossy compression you design — and the full-fidelity log vs. the compacted prompt are two different views of the same conversation.
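The "two representations" point can be sketched as a function that derives the compacted prompt view from the full-fidelity history. Everything here is illustrative: `Turn`, `compact_view`, and the placeholder summarizer are hypothetical names, and a real system would summarize older turns with a model rather than a count. It assumes `keep_recent >= 1`.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    text: str
    truncated: bool = False   # set on barge-in

def naive_summary(turns):
    """Placeholder summarizer; a real system would call a model here."""
    return f"[summary of {len(turns)} earlier turns]"

def compact_view(history, keep_recent=4, summarize=naive_summary):
    """Recent turns verbatim, older turns summarized. The session store itself is untouched."""
    older, recent = history[:-keep_recent], history[-keep_recent:]
    lines = [summarize(older)] if older else []
    for t in recent:
        suffix = " [truncated: interrupted before fully heard]" if t.truncated else ""
        lines.append(f"{t.speaker}: {t.text}{suffix}")
    return lines
```

Because the view is derived, it can be rebuilt at any time from the session store, which is also how a lost KV-cache would be reconstructed.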

11 · The LLM core — the shared brain

Concern	Choice	Notes
Model	Claude	alternatives: OpenAI, Qwen 2.5 7B (self-host), Groq
First-token latency	prompt cache + speculative prefill	cache system prompt + history (hit every turn); prefill on STT partials so TTFT ≈ 0 at endpoint
Memory / capacity	KV-cache = session working mem	a token is the unit of work, not a request. Admit by token budget. (Derived from §10 layer 4.)
Throughput	continuous batching	admit/evict per token step; voice + text share the pool at different priorities
Cost headroom	KV-cache compression (~7×)	more concurrent sessions/GPU + longer accumulated context, same cost (verify numbers)

Multimodal input to the LLM

The LLM consumes the fused, enriched token stream (§9) compacted into a prompt (§10) — words carrying speaker_id, prosody tags, visual refs. "this is fine" arrives annotated as "said angrily while pointing at the fee." The model reasons over signal, not just text.

Cascade vs native multimodal (speech-to-speech / omni) — the big fork

A frontier lab is likely building its own multimodal model, which can collapse STT→LLM→TTS into one speech-to-speech model. The systems problems don't disappear — they relocate. Know both, and where each wins.

	Cascade (STT→LLM→TTS)	Native speech-to-speech (omni)
Shape	3 swappable services (most of this doc)	audio in → audio out, one model
Paralinguistics	lost at STT boundary (tone → flat text) unless a separate prosody tag carries it (§7)	preserved end-to-end — no transcript bottleneck
Latency	serialization per hop (STT final → LLM → TTS TTFB)	lower — no STT/TTS round-trips
Debug / swap	each hop inspectable + replaceable	black box — hard to swap parts or trace
Tools / RAG mid-stream	easy — inject between hops	harder — must be designed into the model
Turn-taking / barge-in	handled in the orchestrator (this doc)	inside the model or alongside it

The architect point

The front of the pipeline survives either way. Multi-stream fusion + co-sequencing (§9) — audio + video + text arriving at different latencies, aligned on event-time — is unchanged. An omni model fuses perception + generation; it does not solve input-stream alignment.
The harness survives too. Turn-taking, barge-in, session/context accumulation (§10), scheduling, transport (§2) are all still yours — they become the harness around the model, not parts of a cascade.
So the likely trajectory: cascade today → native multimodal for the core, with the cascade's modular concerns kept as the surrounding system. The systems problems relocate; they don't vanish. That's the line — it shows you see past "just call the model."

Full-duplex & the death of explicit turn-taking — where the harness goes next

The omni fork above still assumes half-duplex: model and user alternate. Full-duplex models (Moshi, NVIDIA PersonaPlex) listen and speak on the same clock — continuous audio+silence tokens both ways. This is the likely long-term substrate, and it relocates turn-taking one more time: from an orchestrator stage (§4) to an emergent behavior inside the model.

Three generations of where "when to talk" lives:

Gen	Where timing lives	Trained by	Failure mode
v1 · turn-based + VAD	silence-timeout outside the model (§4)	heuristic threshold	cuts off thinkers / laggy — one knob, no backchannel concept
v2 · full-duplex SFT	inside the model, token-by-token (Moshi)	cross-entropy per token	over-silent — silence is cheap under CE, so it under-backchannels
v3 · full-duplex + RL	inside the model, as a policy	sequence-level reward	current frontier — timing becomes a spec

Diagram (not reproduced here). Legend: orange = heuristic / under-weighted timing · green = timing as a first-class learned policy · dashed line = the model boundary

Why RL, not more SFT — the load-bearing point

Turn-taking, barge-in, and backchannel are sequence-level properties, not per-token ones. "Backchannel every so often" or "never talk over the user" can't be expressed in a token-level cross-entropy loss — staying silent instead of saying "mhm" costs almost nothing in CE but is glaring to a human. RL post-training is how you express a sequence-level goal. The mechanism (the §4 endpointer) dissolves into the model; the behavior survives as a reward. (09_turn_taking.py models all three gens, plus a barge-in confirmation window: yield fast on a real interrupt, but wait out one frame so the user's own "mhm" doesn't cut the bot off.)

Soundbite: full-duplex doesn't remove turn-taking — it dissolves the turn-taking module into the model and re-poses it as a reward. Barge-in stops being "stop on voice" and becomes a learned endpointing decision with a latency-vs-false-cutoff tradeoff.

The catch: full-duplex breaks the batching economics (→ §17)

Half-duplex demand is bursty — a session generates ≈⅓ of wall-clock, so continuous batching oversubscribes the idle time. Full-duplex makes demand constant-bitrate: every live session must be ticked every frame, even in silence → the multiplexing is gone → roughly 3× the GPU slots. The fix is not VAD-gating the model (that bolts the §4 endpointer back on and kills the silence-timing behavior) — it's two tiers: a tiny always-on duplex controller decides speak/listen/backchannel every frame and wakes the large generative model only when there's content. Suppress silent audio; never suppress the model's vote on the silence.

12 · TTS — streaming; first-chunk latency is what's felt

Use	Component	Host	Notes
Realtime streaming (default)	Cartesia	API	low first-byte; realtime workhorse
Self-host streaming	Orpheus-3b	Modal	~200ms latency streaming server (ref impl exists)
Self-host, fastest/smallest	Kokoro-82M	Modal	tiny + streaming output → low TTFB; used in the Modal ragbot ref
On-device / edge (no GPU)	Supertonic	CPU / browser	~99M params, ONNX runtime; RTF ~0.3× on an e-reader, fast on CPU vs larger A100 baselines; studio 44.1kHz out, 31 languages, runs in browser (WebGPU/WASM) → Pi. Edge/privacy play.
Self-host general	XTTS	Modal	voice cloning, multilingual

Why an on-device TTS matters architecturally

Supertonic (and the edge-TTS class) move synthesis off the server entirely — zero network hop for the audio-out leg, complete privacy, no GPU. For a latency budget that's the last-mile playout latency gone; for banks it's a data-residency win. Tradeoff: you ship a model to the client and lose centralized voice control / instant updates. A hybrid (server TTS default, on-device for privacy-sensitive or offline) is the real answer.

The streaming insight

Start TTS on the first clause while the LLM is still generating clause two. First-audio-out is what the user perceives — never wait for the full reply. Buffer 1–2 chunks before playout to survive jitter. (Producer side of 10_audio_pipeline.py.)
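A sketch of the producer side: split the LLM token stream into clauses so TTS can start on the first one. This is not the contents of 10_audio_pipeline.py; the generator name and the boundary set are assumptions, and splitting on punctuation is a simplification (a token such as "3." followed by "5" would be cut in the wrong place).

```python
def clauses(token_stream, boundaries=".,;:!?"):
    """Yield a clause as soon as a token ends in boundary punctuation."""
    buf = []
    for tok in token_stream:
        buf.append(tok)
        stripped = tok.rstrip()
        if stripped and stripped[-1] in boundaries:
            yield "".join(buf).strip()   # hand this clause to TTS immediately
            buf = []
    tail = "".join(buf).strip()
    if tail:
        yield tail                       # flush whatever is left at end of stream

tokens = ["Sure", ",", " I", " can", " do", " that", ".", " One", " moment"]
out = list(clauses(tokens))
```

With this shape, synthesis of "Sure," can begin while the model is still decoding the second clause.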

13 · Orchestration — who wires the graph together

Framework	Role	Notes
Pipecat	pipeline + turn events	frame-based pipeline; hooks for VAD, turn detection, barge-in, interruptions. `SmallWebRTCTransport` for lightweight WebRTC.
LiveKit Agents	transport + agent runtime	SFU + rooms + agent SDK; plugins for STT/TTS/LLM. Self-host or cloud; runs on Modal.

These own the event loop: media in → VAD/turn → STT → (fuse) → LLM → TTS → media out, plus the interrupt/barge-in control plane. The fusion/sequencer is a stage inside this graph.

Reference implementation — Modal + Pipecat ragbot (open-source, current)

A concrete, runnable build (modal-projects/open-source-av-ragbot + Modal's low-latency-voice-bot blog) that hits a median ~1s voice-to-voice latency. Worth knowing as the "here's what real looks like" anchor.

Role	Reference uses	Why / note
Transport (client↔bot)	SmallWebRTCConnection	Pipecat WebRTC, E2E encrypted, low latency; SDP exchanged via an ephemeral `modal.Dict`
VAD + turn	Silero VAD + SmartTurn	Pipecat `local-smart-turn`; emit start/stop frames
STT	parakeet-tdt-0.6b-v3	"hard to beat on final-transcript time + accuracy"
RAG	ChromaDB + all-MiniLM-L6-v2	retrieval in "a few tens of ms"; loaded at container start
LLM	Qwen3-4B-Instruct + vLLM	self-hosted via vLLM over a Tunnel
TTS	Kokoro-82M	streaming output → low TTFB

The Modal architecture moves (the latency wins)

CPU bot, GPU services, split apart. The Pipecat orchestrator runs CPU-only (it's just wiring frames); STT/LLM/TTS are independent Modal services that autoscale on GPUs separately. Services hold no bot-specific logic → swappable + independently scalable.
Modal Tunnels bypass the input plane. The bot talks directly to services over a Tunnel (WebSocket for STT/TTS, HTTP for vLLM) — persistent, bidirectional, skips the routing hop. This is the main latency lever beyond model choice.
Regional pinning is critical. ~1s only holds when bot + services are co-located (one metro: us-east Virginia or us-west Bay Area). Cross-region kills it.
Memory snapshots for cold start. @modal.enter(snap=True) + enable_memory_snapshot; warm the snapshot with ~50 ping calls so a new container is ready fast. max_inputs=1 = one session per container.
Session lifecycle = a spawned FunctionCall. run_bot.spawn() per conversation, cancel at end; modal.Dict for sessions. Spawn/cancel is how you reclaim autoscaling while holding a Tunnel open.

# shape of the reference (paraphrased)
@app.cls(image=bot_image, region=SERVICE_REGIONS,
         enable_memory_snapshot=True, max_inputs=1)      # 1 session / container
class ModalVoiceAssistant:
    @modal.enter(snap=True)
    def load(self):
        self.chroma_db = ChromaVectorDB()                # RAG ready at snapshot

    @modal.method()
    async def run_bot(self, d):                          # d = ephemeral modal.Dict
        conn = SmallWebRTCConnection(ice_servers)
        await conn.initialize(sdp=offer["sdp"], type=offer["type"])
        await d.put.aio("answer", answer)                # SDP answer back to frontend
        await run_bot(conn, self.chroma_db)              # Pipecat pipeline: VAD→STT→RAG→LLM→TTS

# frontend POST /offer → ModalVoiceAssistant().run_bot.spawn(d)   (one FunctionCall per call)

Note vs. our reference stack

This build self-hosts everything on Modal (Parakeet/Qwen/Kokoro) for cost + control, vs. the API-heavy default (Deepgram/Claude/Cartesia) elsewhere in this doc. Same architecture, different self-host/API dial (§15). It also folds in RAG — the STT transcript queries Chroma before the LLM, adding a "few tens of ms" retrieval hop. The LLM (Qwen3-4B) is a swappable component behind the orchestrator — that's the realistic self-host choice (open, fits one GPU, vLLM-served); swap to a frontier API for quality.

Reference platform — Dograh (open-source voice-AI platform, also Pipecat)

The other build making the rounds (dograh-hq/dograh, YC, ~4.3k stars). Same Pipecat foundation as the Modal ragbot, but it's a platform, not a minimal reference — useful as the "productized" end of the spectrum.

Aspect	Dograh	Contrast w/ Modal ragbot
Shape	platform (drag-drop workflow builder)	ragbot = minimal hand-wired reference
Orchestration	Pipecat (git submodule)	same foundation
Models	BYO LLM/STT/TTS (or bundled)	both treat models as swappable components
Transport	WebRTC + telephony	Twilio, Vonage, Telnyx, Cloudonix — real PSTN ingress
Deploy	Docker-first, self-host or cloud	ragbot = Modal serverless GPU
Extras	QA node, test-mode web calls, human transfer	productization: evals + agent handoff built in
License / stack	BSD-2, Python+TypeScript	no vendor lock-in

The takeaway from two refs

Both serious open voice stacks land on Pipecat as the orchestration layer with models as swappable components behind it — confirming the doc's architecture. They differ on the dial that matters at Twilio: ragbot optimizes raw latency on serverless GPU; Dograh optimizes productization (telephony providers, workflow builder, QA/evals, human transfer). A real deployment is somewhere between: Pipecat core + your model dial + the telephony + ops layer you need.

14 · Latency budget & waterfall — the number you defend

Mouth-to-mouth target <800ms (enterprise), great ~500ms. It's a pipelined waterfall, not a sum — and the dominant cost is endpointing, not the models.

Stage	Naive	Tuned	How it's tuned
Network in (jitter buf)	40ms	20ms	WebRTC tunable; PSTN fixed
Endpointing	700ms	300ms	turn model + adaptive threshold — biggest win
STT finalize	300ms	~50ms	partials stream live; pay only the tail
LLM TTFT	600ms	~50ms	prompt cache + speculative prefill on partials
LLM→first clause	400ms	~0	overlaps TTS; don't await full reply
TTS TTFB	300ms	120ms	streaming TTS, first chunk only
Net out + playout	60ms	30ms	μ-law re-encode on PSTN
MOUTH-TO-MOUTH	~1.4s	~550ms	pipelined, not summed
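For reference, the strictly serial sums of the two columns can be computed as below. The serial sum is an upper bound: the headline figures in the last row are lower wherever stages overlap. The stage keys are illustrative labels for the table rows.

```python
BUDGET_MS = {  # stage: (naive, tuned), copied from the table above
    "network_in": (40, 20),
    "endpointing": (700, 300),
    "stt_finalize": (300, 50),
    "llm_ttft": (600, 50),
    "llm_first_clause": (400, 0),
    "tts_ttfb": (300, 120),
    "net_out_playout": (60, 30),
}

naive_serial = sum(naive for naive, _ in BUDGET_MS.values())
tuned_serial = sum(tuned for _, tuned in BUDGET_MS.values())
dominant_tuned = max(BUDGET_MS, key=lambda stage: BUDGET_MS[stage][1])
```

Even after tuning, endpointing is the largest single term, which matches the claim that turn detection, not the models, dominates the budget.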

15 · Infra, GPUs & economics

Modal — serverless GPU; provision per job/session. A100/L40S ≈ 4¢/GPU-min. Runs LiveKit, NeMo STT, TTS endpoints. Endpoints seen: Whisper ASR, Nemotron Embed, Qwen 2.5 7B, XTTS.
NeMo — NVIDIA framework; HF Open-ASR code → distributed batch service is a small lift; trivial swap between Parakeet/Canary.

vs proprietary API	Speed	Cost
Parakeet, fastest	112×	60×
Parakeet, cheapest	25×	200×
Canary, fastest	80×	55×
Canary, cheapest	12×	152×
Proprietary API	1×	1×

API = buy quality/latency, pay per-use. Self-host = 1–2 orders cheaper, you run infra. Typical split: self-host STT (volume), API for TTS + frontier LLM (quality/latency).

16 · Batch transcription path — the OTHER system (offline, throughput)

Call recordings / analytics: a batch job optimizing throughput over a corpus you own — inverse of the realtime path. Don't conflate them.

Audio segments as WAV on a Modal Volume; spin up GPUs per job, fan out.
Mental model: MapReduce with an ordering step: chunk → transcribe in parallel (map) → gather/order (shuffle/sort).
Parakeet/Canary RTFx in the thousands → transcribe enormous archives cheaply; Mistral Transcribe 2 for higher-quality batch.
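A back-of-envelope for what "RTFx in the thousands" means in GPU time. It takes RTFx as seconds of audio transcribed per second of compute and reuses the ≈4¢/GPU-min figure from §15. It ignores cold starts, I/O, and chunking overhead, and leaderboard RTFx is measured under specific conditions, so treat the result as a floor rather than a quote.

```python
def batch_cost_usd(audio_hours: float, rtfx: float, usd_per_gpu_min: float = 0.04):
    """Return (gpu_minutes, cost_usd) for transcribing `audio_hours` at throughput `rtfx`."""
    gpu_minutes = audio_hours * 60.0 / rtfx
    return gpu_minutes, gpu_minutes * usd_per_gpu_min

# 10,000 hours of call recordings at the Parakeet leaderboard RTFx of 3386
gpu_min, usd = batch_cost_usd(10_000, 3386)
```

Under these assumptions the archive costs single-digit dollars of GPU time, which is why the batch path is a separate system optimized for throughput.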

17 · Scaling & the scheduler

A voice session pins a GPU slot for its whole duration (STT+TTS streaming, KV-cache resident) — unlike text, which is bursty + poolable. Sessions-per-GPU is the cost driver.
Scheduler = sessions → capacity, admit by KV-cache budget not headcount. Text and voice share the LLM pool at different priorities. (README resource-scheduler thesis.)
KV-cache compression ~7× → directly raises concurrent sessions/GPU and how much §10 context you can keep hot.
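Admission by KV-cache budget rather than headcount can be sketched as follows. The class name, the byte figures, and the idea of reserving a fixed token count per session are assumptions for illustration; a production scheduler would also handle priorities and eviction.

```python
class KVAdmission:
    def __init__(self, budget_bytes: int, bytes_per_token: int, compression: float = 1.0):
        self.budget = budget_bytes
        self.per_token = bytes_per_token / compression   # compression shrinks each token's footprint
        self.sessions = {}                               # session_id -> reserved tokens

    def used_bytes(self) -> float:
        return sum(self.sessions.values()) * self.per_token

    def admit(self, session_id: str, reserved_tokens: int) -> bool:
        """Admit only if the session's reserved context still fits in the budget."""
        need = reserved_tokens * self.per_token
        if self.used_bytes() + need > self.budget:
            return False
        self.sessions[session_id] = reserved_tokens
        return True

    def release(self, session_id: str) -> None:
        self.sessions.pop(session_id, None)

def capacity(compression: float) -> int:
    """How many 4000-token sessions fit in a toy 1 MB budget at 100 bytes/token."""
    pool = KVAdmission(1_000_000, 100, compression)
    n = 0
    while pool.admit(f"s{n}", 4000):
        n += 1
    return n
```

In this toy setup `capacity(1.0)` admits 2 sessions and `capacity(7.0)` admits 17, showing how compression translates directly into concurrent sessions per GPU.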

Full-duplex serving — when the session never goes idle (the v3 cost, §11)

Full-duplex converts multiplexable bursty demand into non-multiplexable steady demand — the same reason constant-bitrate traffic is harder to pack than bursty. The cost shows up as ≈3× the fleet, and it's a batching problem, not a FLOPs problem.

Why batching chokes on full-duplex	Mechanism
Lost multiplexing	can't release the slot during silence → effective cost ≈ 1/duty-cycle (~3× at a ⅓ talk ratio)
Frame deadline	each tick must finish within ~80ms — can't wait to fill a batch, so you run undersized batches on schedule
Pure decode	one token/frame autoregressive → memory-bandwidth bound, low arithmetic intensity (worst GPU regime; no prefill density to amortize)
KV-cache pinned	cache stays resident the whole conversation — can't evict/swap during silence as you might between turns → memory-capacity bound on concurrency

Diagram (not reproduced here): the same 3 sessions and the same conversation, differing only in the serving model. At a ⅓ duty cycle, half-duplex needs 1 slot and full-duplex needs 3.
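The 1/duty-cycle arithmetic, as an idealized sketch. It assumes perfect packing in the half-duplex case (sessions never need the slot at the same instant), so real fleets need some headroom above this. The function name is illustrative.

```python
from fractions import Fraction
from math import ceil

def slots_needed(sessions: int, duty_cycle: Fraction, full_duplex: bool) -> int:
    """Half-duplex: a session holds a slot only while generating. Full-duplex: always."""
    load = sessions if full_duplex else sessions * duty_cycle
    return ceil(load)

half = slots_needed(3, Fraction(1, 3), full_duplex=False)
full = slots_needed(3, Fraction(1, 3), full_duplex=True)
```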

The two-tier escape — duplex behavior and good economics

Small always-resident duplex model runs every frame: decode-light, uniform → batches beautifully. Decides speak / listen / backchannel / yield.
Large generative model invoked only when there's content to produce → back to bursty, back to multiplexable.
Pay tiny-model FLOPs to decide, big-model FLOPs only to speak. The serving-grade version of "suppress silent audio, never the model's vote."
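The two-tier split can be sketched as a per-frame loop in which a cheap controller votes on every frame and the expensive model is called only on a speak decision. The rule-based `tiny_controller` below is a stand-in for a small learned model, and the frame fields and the 10-frame backchannel threshold are assumptions for illustration.

```python
from enum import Enum

class Action(Enum):
    LISTEN = "listen"
    BACKCHANNEL = "backchannel"
    SPEAK = "speak"

def tiny_controller(frame: dict) -> Action:
    """Stand-in for the always-on small model: one cheap decision per frame."""
    if frame["user_speaking"]:
        return Action.LISTEN
    if frame["turn_complete"]:
        return Action.SPEAK
    if frame["silence_frames"] >= 10:
        return Action.BACKCHANNEL
    return Action.LISTEN

def run(frames, big_model):
    """Tick every frame; wake the large model only when there is content to produce."""
    out, big_calls = [], 0
    for f in frames:
        action = tiny_controller(f)
        if action is Action.SPEAK:
            big_calls += 1
            out.append(big_model(f))
        elif action is Action.BACKCHANNEL:
            out.append("mhm")        # cheap, no large-model call
    return out, big_calls

frames = (
    [{"user_speaking": True, "turn_complete": False, "silence_frames": 0}] * 5
    + [{"user_speaking": False, "turn_complete": False, "silence_frames": 12}]
    + [{"user_speaking": False, "turn_complete": True, "silence_frames": 3}]
)
out, big_calls = run(frames, big_model=lambda f: "<reply>")
```

Seven frames are ticked, but the large model runs once, so its demand stays bursty and multiplexable while the controller still gets a vote on every silent frame.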

Diagram (not reproduced here). Legend: green = always-on cheap tier · purple = on-demand expensive tier · amber = the silent-frame path that never wakes the big model

When to just eat the cost (gate the model on VAD anyway)

A product call, not a correctness one. Reactive products (command, Q&A, IVR) get ≈zero value from the model acting during silence → gate aggressively; the behavioral loss is noise and the 3× saving is real. Relational products (companion, tutor, interviewer) — the silence behavior is the product → don't gate. Cheap hybrid: gate on VAD but a single wake-timer re-triggers after N seconds of mutual silence (one scalar, not a full endpointing module).

18 · Escalation tiers — one session, upgraded

Tier	Transport + stack	Latency	Cost
Text	WS/HTTP → LLM	seconds	cheap, poolable
Voice	WebRTC/PSTN → AEC3·VAD·turn·STT·LLM·TTS	sub-second	GPU slot pinned
Video	+ vision specialist (200ms)	sub-second + vision	GPU slot ++

Escalation = bind an existing session to a richer tier without losing state — the session store (§10) is transport-agnostic (decide day one). Triggers: explicit ask · detected frustration (measurable from prosody/visual tags) · complexity · verification · human handoff. Downgrade symmetric — the accumulated context carries across.

19 · Failure modes & tail latency — p99 is what users feel

Failure	Effect	Mitigation
LLM TTFT spike	dead air	backchannel ("let me check…") to mask; timeout→fallback model
STT wrong on μ-law	misheard intent	upsample+denoise; confidence gate → reprompt
TTS stall	cut-off speech	buffer 1–2 chunks; cache common phrases
Specialist down (vision/prosody)	lost signal	degrade gracefully to remaining channels (late fusion enables this)
Watermark starvation	frozen timeline	idle-stream heartbeat advances watermark
Session store unavailable	amnesia / lost turn	write-ahead the committed timeline; rebuild KV-cache from store on reconnect
Network jitter	choppy audio	jitter buffer (adds latency — the tradeoff); regional POP
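The first row's mitigation (mask a TTFT spike with a backchannel) can be sketched with `asyncio`. This is a minimal illustration under stated assumptions: `first_token` is any awaitable that resolves to the first chunk of LLM output, `speak` is a hypothetical coroutine that hands text to TTS, and the 0.5 s budget is an assumed value chosen to sit below the ~700ms dead-air threshold.

```python
import asyncio

FILLER_BUDGET_S = 0.5   # assumed; must fire before silence reads as a dropped line


async def respond_with_masking(first_token, speak,
                               budget_s=FILLER_BUDGET_S,
                               filler="Let me check..."):
    """Wait for the LLM's first token; if it misses the budget, speak a filler first."""
    task = asyncio.ensure_future(first_token)
    done, _pending = await asyncio.wait({task}, timeout=budget_s)
    if not done:
        await speak(filler)      # mask the tail; the LLM call keeps running
    token = await task           # the real answer, whenever it lands
    await speak(token)
    return token
```

The LLM task is never cancelled when the budget expires; the filler only covers the gap. A timeout-to-fallback-model path, as the table suggests, would be a second and longer deadline on the same task.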

Dead air is the worst failure

>~700ms of silence reads as "the line dropped." Latency you can't remove, you mask — a backchannel buys LLM time and feels natural. Senior instinct: design the masking, not just the speed.

20 · Tag schema & interface contracts — the data that flows between stages

Late fusion only works if every specialist emits a common annotation envelope on the shared timeline. The envelope is the contract.

Annotation = {                 # what every specialist emits onto the timeline
  event_time:   float,         # edge time, AFTER latency calibration (the join key)
  arrival_time: float,         # when WE got it (for watermark + debugging)
  source:       "stt"|"prosody"|"vision"|"text"|"diar",
  speaker_id:   str|None,      # from diarization; None for non-attributable
  payload:      {...},         # source-specific, below
  confidence:   float,
  final:        bool           # partial (revisable) vs final hypothesis
}

payload by source:
  stt     → { text, word_timings[] }                   # the words
  prosody → { anger, joy, stress, certainty, pitch }   # NOT in transcript
  vision  → { ref, gesture, facial_affect }            # referents/expression
  text    → { text }                                   # typed; event_time == arrival
  diar    → { speaker_id, embedding }                  # attribution

EnrichedToken = {              # what the sequencer emits after bundling co-temporal annotations
  event_time, speaker_id, text,
  prosody: {...}, visual: {...},   # merged from all sources within ε of event_time
  truncated: bool                  # set on barge-in (§4) so the model knows
}
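The envelope maps naturally onto typed records. Below is a minimal Python sketch, assuming dataclasses for the two shapes and a naive bundling pass. The `bundle` function, its `eps` default of 0.25 s, and the rule that only final `stt`/`text` annotations seed a token are illustrative assumptions, not part of the schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    event_time: float            # calibrated edge time (the join key)
    arrival_time: float          # when we received it
    source: str                  # "stt" | "prosody" | "vision" | "text" | "diar"
    payload: dict                # source-specific fields
    confidence: float = 1.0
    final: bool = True
    speaker_id: Optional[str] = None


@dataclass
class EnrichedToken:
    event_time: float
    speaker_id: Optional[str]
    text: str
    prosody: dict = field(default_factory=dict)
    visual: dict = field(default_factory=dict)
    truncated: bool = False


def bundle(anns: list, eps: float = 0.25) -> list:
    """Attach prosody/vision/diar annotations within eps seconds of each final utterance."""
    tokens = []
    for a in sorted(anns, key=lambda x: x.event_time):
        if a.source not in ("stt", "text") or not a.final:
            continue                       # only final words seed a token
        tok = EnrichedToken(a.event_time, a.speaker_id, a.payload["text"])
        for b in anns:
            if abs(b.event_time - a.event_time) > eps:
                continue                   # not co-temporal
            if b.source == "prosody":
                tok.prosody.update(b.payload)
            elif b.source == "vision":
                tok.visual.update(b.payload)
            elif b.source == "diar" and tok.speaker_id is None:
                tok.speaker_id = b.payload["speaker_id"]
        tokens.append(tok)
    return tokens
```

Note that the join uses `event_time` only; `arrival_time` never influences which annotations land in the same token, which is what keeps a slow vision result from being attached to the wrong words.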

Stage boundary	In	Out
transport → front-end	RTP frames (Opus/μ-law)	uniform PCM, 20ms
front-end → specialists	clean PCM + VAD flags	(audio) to STT/prosody/diar
specialists → sequencer	raw media	`Annotation` (envelope above)
sequencer → fusion	Annotations	event-time-ordered, watermark-committed
fusion → LLM	Annotations bundle	`EnrichedToken` → compacted prompt (§10)
LLM → TTS	token stream	text clauses (stream on first clause)
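The "event-time-ordered, watermark-committed" output of the sequencer row can be sketched as a heap plus a per-stream watermark. This is an illustration under two assumptions: each stream delivers its own events in event-time order, so its watermark is simply its latest event time, and the deadline half of "watermark OR deadline" is left out. The class and method names are hypothetical.

```python
import heapq


class Sequencer:
    """Release events in event-time order once every stream's watermark has passed them."""

    def __init__(self, streams):
        self.watermarks = {s: float("-inf") for s in streams}
        self._heap = []   # entries: (event_time, seq, source, payload)
        self._seq = 0     # tie-breaker so payloads are never compared

    def push(self, source, event_time, payload):
        heapq.heappush(self._heap, (event_time, self._seq, source, payload))
        self._seq += 1
        self.watermarks[source] = max(self.watermarks[source], event_time)
        return self._drain()

    def heartbeat(self, source, event_time):
        """An idle stream promises it holds nothing older than event_time."""
        self.watermarks[source] = max(self.watermarks[source], event_time)
        return self._drain()

    def _drain(self):
        low = min(self.watermarks.values())   # the slowest stream bounds the commit
        out = []
        while self._heap and self._heap[0][0] <= low:
            t, _, src, payload = heapq.heappop(self._heap)
            out.append((t, src, payload))
        return out
```

Without `heartbeat`, a silent vision stream would hold the minimum at its last event and freeze the timeline, which is the watermark-starvation failure from §19.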

21 · Build order + interview ammo

1. Text chat → session store + stateless handlers + Claude. Baseline.
2. Persistent transport-agnostic session (§10) — the load-bearing decision.
3. Voice loop (Pipecat/LiveKit): VAD → streaming STT (Deepgram) → LLM → streaming TTS (Cartesia). Measure mouth-to-mouth first.
4. Real endpointing model (Pipecat smart-turn) — kill the naive silence timer. Biggest win.
5. Barge-in + AEC3 (+ truncation recorded in context).
6. Speculative prefill on partials + prompt cache — shave TTFT.
7. Sequencer (event-time + watermark) — the moment a 2nd stream appears.
8. Fusion + specialists — diarization (Soniox), prosody, then vision (200ms widens the window; nothing else changes — payoff of step 7).
9. Context compaction (§10) — recent verbatim + old summarized; keeps prompt bounded.
10. Backchannel/filler — hide the p99 tail.
11. Self-host STT on Modal + scheduler + KV-cache compression — cost at scale.
12. Batch transcription path — offline analytics, separate system.

Say out loud

"Modalities are complementary signals to fuse, not transcripts to dedup — late fusion, structured tags, a schema built around each channel's unique signal."
"Co-sequencing is a stream-join on event-time; audio ~80ms vs vision ~200ms → calibrate per-stream latency, size the reorder window to the slowest modality. Correctness is bounded by your slowest channel."
"Context accumulates durably in the session store; the sequencer window is a sliding buffer that drains, not memory; the KV-cache is a derived working copy. Each boundary is a lossy compression I design — full-fidelity log for audit vs. compacted prompt for the model."
"Mouth-to-mouth is a pipelined waterfall, not a sum; the dominant cost is endpointing, not the models."
"A voice session pins a GPU slot; the whole thing is a resource scheduler admitting by KV-cache budget, not request count."
"Latency you can't remove, you mask with a backchannel; dead air is the real failure."
"Full-duplex is the long-term substrate. It dissolves the turn-taking module into the model and re-poses timing as a sequence-level reward — that's why it's RL, not more SFT. But it turns bursty demand into constant-bitrate (≈3× the slots), so you split a tiny always-on duplex controller from the on-demand generator."

This stack	Reuses pattern from
Sequencer watermark commit	`09_web_crawler` Queue.join()
Watermark = min over streams; straggler	`15_training_pipeline` all-reduce barrier
Commit on watermark OR deadline	`14_inference_server` size-OR-timer
Barge-in cancel	`10_audio_pipeline_interruptible`
Full-duplex gens (v1/v2/v3) + barge-in confirm window	`09_turn_taking`
Jitter buffer — reorder / loss-conceal / late-drop	`19_jitter_buffer`
Streaming TTS producer	`10_audio_pipeline`
KV=session, token=unit of work, continuous batching	`14_inference_server`
Tiers = work→capacity; admit by budget	README resource-scheduler thesis
Event-time k-way merge	`algorithms_cheatsheet` heap merge
Batch ASR = throughput over owned corpus	`07_gpu_data_loader` / inference-vs-training
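The "admit by budget" row can be made concrete with the standard per-token KV-cache size: 2 (K and V) × layers × KV heads × head dimension × bytes per element. The sketch below is illustrative only; the model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the `KVAdmission` class are assumptions, not figures from this stack.

```python
def kv_bytes(max_tokens, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV-cache bytes for one session: K and V, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * max_tokens


class KVAdmission:
    """Admit sessions while their reserved KV-cache bytes fit the budget."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.reserved = {}                  # session_id -> reserved bytes

    def used(self):
        return sum(self.reserved.values())

    def admit(self, session_id, max_tokens):
        need = kv_bytes(max_tokens)
        if self.used() + need > self.budget_bytes:
            return False                    # reject on bytes, not request count
        self.reserved[session_id] = need
        return True

    def release(self, session_id):
        self.reserved.pop(session_id, None)
```

With the assumed shape, one token costs 128 KiB and an 8192-token session reserves 1 GiB, so a 4 GiB budget admits four such sessions and rejects the fifth regardless of how many requests each one sends.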

References

Modal — Low-latency voice bot (Pipecat, ~1s v2v): modal.com/blog/low-latency-voice-bot
Modal — Fast/cheap batch transcription: modal.com/blog/fast-cheap-batch-transcription
Reference code — open-source AV ragbot: github.com/modal-projects/open-source-av-ragbot
Dograh — open-source voice-AI platform (Pipecat): github.com/dograh-hq/dograh
Supertonic — on-device TTS (ONNX, ~99M): github.com/supertone-inc/supertonic
Orpheus streaming (200ms TTS): github.com/Edward-Zion-Saji/orpheus-streaming-modal
Open ASR leaderboard: huggingface.co/spaces/hf-audio/open_asr_leaderboard

Verify-before-quoting: KVarN ~7× numbers, exact RTFx/WER figures, and cost multipliers are as-captured from source material — confirm before citing live. The Modal ragbot details (~1s v2v, component list) are from the Modal blog + repo as of capture.