Multimodal Conversational AI — The Complete Modern Stack
Master reference. Real-time text/voice/video conversation with an LLM (e.g. Claude). Every layer named with swappable alternatives, interfaces, latency, and self-host/API economics. Ingress covers browser WebRTC and PSTN/Twilio. Companions: voice_agents_notes.md, multimodal_fusion_design.md.
Everything — typed text, speech, video — becomes tokens before the model.
But the modalities are not redundant transcripts to dedup; they are
complementary signals to fuse on a shared event-time timeline.
The model calls are the easy 20%. The hard parts: (1) knowing when the human
finished (endpointing), (2) aligning streams with different latencies
(audio ~80ms vs vision ~200ms), and (3) masking the latency you can't
remove.
Hard problem 1 — Turn-taking. "Did they stop, or just pause?" VAD ≠ end-of-turn. Dominates perceived latency.
Hard problem 2 — Co-sequencing. Streams arrive out of order because each modality has its own processing latency. Order by event-time, not arrival, or the model sees the conversation backwards.
Hard problem 3 — Tail latency / dead air. p99, not mean, is what the user feels. >700ms silence reads as a dropped connection.
Two structural reframes: a conversation is one persistent, transport-agnostic
session served at different tiers (text = cheap/async, voice =
GPU-pinned/realtime); and the whole system is a resource scheduler matching
sessions to capacity (KV-cache bytes).
1 · The whole stack at a glance
solid = data flow · dashed amber = control (turn-end triggers the LLM — it does not reorder the sequencer) · dashed red = barge-in cancel, flushing the entire output chain at once (LLM decode · TTS · egress queue · client playout buffer) and marking the turn truncated. In full-duplex this red path vanishes — yielding is just the model's next-frame decision.
2 · Transport & ingress — get media in/out, low jitter
| Concern | Primary | Alternatives | Why / notes |
|---|---|---|---|
| Realtime media + agent framework | LiveKit | Pipecat transport, Daily, raw WebRTC | SFU + agent SDK; runs the room, handles tracks. Runs on Modal. |
| Lightweight WebRTC | SmallWebRTCTransport | aiortc, full SFU | Pipecat's minimal P2P transport: single-agent, no SFU overhead. |
| Browser/app audio | WebRTC (Opus 48kHz) | none | Wideband, tunable jitter buffer, gives AEC hooks. The good path. |
| Phone | Twilio Media Streams | Twilio ConversationRelay, SIP trunk, Vonage | PSTN = 8kHz μ-law (G.711), carrier-fixed, narrowband; caps STT quality before any model choice. Media Streams forks 20ms frames to your WS; ConversationRelay is managed STT+TTS, you bring the LLM. |
| Regional placement | media POP near caller | none | Transatlantic RTT alone ~150ms. Co-locate agent + media server. |
Interface
In: RTP/media frames (Opus or μ-law), 20ms each. Out: raw PCM frames to the front-end + a control channel for events (track start/stop, DTMF, mute). Pipecat/LiveKit normalize codecs so downstream sees uniform PCM.
WebSocket vs WebRTC — when each, and why WS is first-class
WebSocket is not a downgrade — it's the actual transport in three places that
matter: (1) Twilio Media Streams forks call audio to your server as μ-law
frames over a WebSocket, not WebRTC; (2) service-to-service (the Modal
ragbot's Tunnels were WS); (3) any server-to-server agent with no browser. WebRTC
is for the last mile to a browser/app; WebSocket is for everything behind it.
| Axis | WebRTC | WebSocket |
|---|---|---|
| Layer | UDP + SRTP (media-grade) | TCP + TLS (one framed duplex stream) |
| Loss handling | tolerates loss (audio degrades gracefully) | head-of-line blocking: a lost packet stalls the stream (TCP) |
| Jitter buffer / NAT | built in (ICE/STUN/TURN) | you handle ordering/timing yourself |
| Built-in AEC/NS | yes (browser media stack) | NS server-side OK (RNNoise on inbound); AEC stays at the edge: the echo loop can't be aligned over a cloud round-trip, so rely on handset/carrier cancellation |
| Setup | SDP offer/answer, heavier | one HTTP upgrade, trivial |
| Best for | last mile to browser/app | telephony bridge, service↔service, server agents |
The framing to say (covers WS-first stacks and Twilio either way)
Twilio path is WebSocket end-to-end on your side. Carrier → Twilio → your WS (Media Streams) → agent. You receive base64 μ-law 8kHz in JSON frames (media, start, stop, mark events), and you send audio back as media frames. No WebRTC in your code at all.
WS gives you a single ordered duplex pipe — simplest possible transport for a server agent: one connection carries inbound audio frames and your outbound TTS audio and control events (interrupt, mark, clear). That's all the realtime loop needs.
The tax you accept with WS: TCP head-of-line blocking. On a lossy network a dropped packet stalls everything behind it (WebRTC would just drop+conceal). Mitigate with small frames, a tight network path (regional pinning), and treating the WS as reliable-but-bursty. For server↔service and Twilio (already past the carrier) this is fine; for last-mile-to-a-phone-on-bad-wifi, WebRTC wins.
Barge-in over WS = a control message, not a media signal. You send a clear/interrupt frame to flush buffered outbound audio (Twilio: the clear message; your own protocol: an event). Same single-Event cancel as 10_audio_pipeline_interruptible, just delivered as a WS control frame.
# Twilio Media Streams = a WebSocket your server accepts. Frame shape:
{ "event":"start", "start":{"streamSid":"...","mediaFormat":{"encoding":"audio/x-mulaw","sampleRate":8000}} }
{ "event":"media", "media":{"payload":"<base64 μ-law 20ms>","timestamp":"1234"} } # inbound audio
{ "event":"mark", "mark":{"name":"tts-clause-3"} } # playback checkpoint
{ "event":"stop" }
# you SEND back: media frames (your TTS, base64 μ-law) + a "clear" to flush on barge-in
{ "event":"media", "streamSid":"...", "media":{"payload":"<base64 μ-law>"} }
{ "event":"clear", "streamSid":"..." } # barge-in: drop queued audio
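A minimal sketch of the receive side of the frames above, assuming a Python server. The helper names (`ulaw_byte_to_pcm16`, `decode_media_frame`, `clear_message`) are hypothetical; only the JSON field names come from the frame shapes shown. The μ-law expansion is the standard G.711 formula written out by hand, so the sketch does not depend on the `audioop` module, which newer Python versions no longer ship.

```python
import base64
import json
from typing import List, Optional

BIAS = 0x84  # G.711 mu-law bias (132)

def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte into a signed 16-bit linear sample."""
    u = ~u & 0xFF                      # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude

def decode_media_frame(raw: str) -> Optional[List[int]]:
    """Return PCM16 samples for a 'media' frame, or None for other events."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None
    payload = base64.b64decode(msg["media"]["payload"])
    return [ulaw_byte_to_pcm16(b) for b in payload]

def clear_message(stream_sid: str) -> str:
    """Build the barge-in control frame that drops queued outbound audio."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

In a real handler the decoded samples would feed the front-end in §3, and `clear_message` would be sent on the same socket the moment barge-in is detected.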
3 · Audio front-end — clean the signal before perception
| Stage | Component | Alternatives | Why it must be here |
|---|---|---|---|
| Echo cancellation | WebRTC AEC3 | Speex AEC, hardware AEC | Without it the bot hears itself through speaker→mic → false barge-in / self-interrupt. Must run before VAD. |
| Noise suppression | RNNoise / WebRTC NS | DeepFilterNet, Krisp | Phone audio is noisy; cleans STT input. Cheap CPU. |
| Voice activity detection | Silero VAD | WebRTC VAD, kyutai STT (built-in) | Per-frame "is there speech." Also creates speaker embeddings reusable for diarization. Fast first gate, but not end-of-turn. |
Ordering matters
AEC3 → NS → VAD. Echo-cancel first (so suppression/VAD see clean near-end audio), then denoise, then detect voice. Classic bug: VAD fires on the bot's own echo.
4 · Turn detection & endpointing — the highest-leverage problem
VAD answers "is there voice now." Turn detection answers "is the turn over."
Different models. A 300ms gap mid-sentence — "send it to… uh… my checking
account" — must NOT fire the bot. This dominates mouth-to-mouth latency.
| Layer | Component | Signal used | Notes |
|---|---|---|---|
| Acoustic gate | Silero VAD | energy / voiced frames | fast first filter; "silence started" |
| Semantic end-of-turn | Pipecat Smart Turn | acoustic + linguistic completeness | "I want to…" (incomplete→wait) vs "transfer fifty dollars." (complete→fire). Frameworks expose these as turn events. |
| Threshold policy | adaptive timeout | completeness + prosody | shorten after a complete clause; lengthen after a filler / rising intonation. |
The knob: short silence threshold = snappy but cuts off thinkers; long =
polite but laggy. The latency-vs-correctness knob, per-deployment (a bank
IVR tolerates more wait than a casual assistant).
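One way to picture the threshold policy is a function that maps a completeness score to a silence timeout. This is a sketch only: `endpoint_timeout_ms`, its parameters, and every number in it are hypothetical tuning values, not figures from any framework.

```python
def endpoint_timeout_ms(completeness: float, ended_on_filler: bool,
                        base_ms: int = 600, min_ms: int = 200,
                        max_ms: int = 1500) -> int:
    """Silence (ms) to wait after VAD reports 'speech stopped' before firing the bot.

    completeness: 0.0 (clearly unfinished, e.g. "I want to") to
                  1.0 (clearly finished, e.g. "transfer fifty dollars.").
    ended_on_filler: True if the last word was a filler such as "uh".
    """
    completeness = min(max(completeness, 0.0), 1.0)
    timeout = base_ms * (1.5 - completeness)   # complete clause -> shorter wait
    if ended_on_filler:
        timeout *= 1.5                         # filler -> the speaker is still thinking
    return int(min(max(timeout, min_ms), max_ms))
```

A bank IVR would raise `base_ms`; a casual assistant would lower it. The per-deployment knob is that single parameter.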
Barge-in (interrupt the bot)
Cancel TTS in single-frame latency — flush outbound buffer. (10_audio_pipeline_interruptible.py: one asyncio.Event, race every await against it.)
AEC3 already running so the bot's voice isn't mistaken for the human.
Re-sequence + record truncation — the interruption is a new high-priority event; mark the bot's half-spoken turn as truncated in the accumulated context (see §10) so the model knows it wasn't fully heard.
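The "one asyncio.Event, race every await against it" pattern can be sketched as follows. This is not the contents of the referenced file, which is not shown here; `race`, `speak`, `Interrupted`, and the sleep standing in for sending one audio frame are all assumptions made for illustration.

```python
import asyncio

class Interrupted(Exception):
    """Raised when barge-in fires while the bot is mid-output."""

async def race(awaitable, cancel: asyncio.Event):
    """Await `awaitable`, but abandon it as soon as `cancel` is set."""
    task = asyncio.ensure_future(awaitable)
    waiter = asyncio.ensure_future(cancel.wait())
    done, _ = await asyncio.wait({task, waiter},
                                 return_when=asyncio.FIRST_COMPLETED)
    if task in done:
        waiter.cancel()
        return task.result()
    task.cancel()                      # flush: stop the in-flight work
    raise Interrupted

async def speak(chunks, cancel: asyncio.Event, played: list):
    for chunk in chunks:
        await race(asyncio.sleep(0.02), cancel)   # stand-in for sending one frame
        played.append(chunk)

async def demo():
    cancel, played = asyncio.Event(), []

    async def barge_in():
        await asyncio.sleep(0.05)      # the user starts talking
        cancel.set()

    interrupter = asyncio.ensure_future(barge_in())
    try:
        await speak(list(range(10)), cancel, played)
        truncated = False
    except Interrupted:
        truncated = True               # record this on the turn (see §10)
    await interrupter
    return played, truncated
```

Running `asyncio.run(demo())` returns the chunks that actually played plus a `truncated` flag, which is the fact the accumulated context needs to record.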
5 · STT — streaming (realtime) and batch (offline) are different problems
Open ASR leaderboard (self-host candidates)
| Model | ESB WER En | RTFx | Multiling | Timestamps | VAD | Note |
|---|---|---|---|---|---|---|
| nvidia/parakeet-tdt-0.6b-v2 | 6.05 | 3386 | ✅ | ✅ | ❌ | chosen: ~3× faster than any comparable-WER model |
| nvidia/canary-1b-flash | 6.35 | 1046 | ✅ En/Fr/Es/De | ✅ | ❌ | sister model, multilingual |
| kyutai/stt-2.6b-en | 6.4 | 88 | En | ✅ | ✅ | built-in VAD |
The full STT option space
| Use | Option | Host | Notes |
|---|---|---|---|
| Realtime streaming (default) | Deepgram | API | low-latency partials; the realtime workhorse |
| Realtime, self-host | Parakeet / Canary (NeMo) | Modal GPU | 1–2 orders cheaper at scale; NeMo makes model-swap trivial |
| Realtime w/ VAD | kyutai stt-2.6b | self-host | VAD bundled, simpler pipeline |
| Realtime diarized | Soniox rt-4 | API | strong diarization on the realtime path |
| Multilingual realtime | Groq | API | natural multilingual, ~100 languages |
| Batch (fastest/cheapest) | Parakeet/Canary on Modal | Modal batch | RTFx 1000s → huge corpora cheap; not for realtime |
| Batch, high quality | Mistral Transcribe 2 | API | very good + fast but batch only |
| General baseline | Whisper, voxtral-mini | either | Whisper ubiquitous; voxtral newer |
The streaming insight
Realtime STT emits partial hypotheses while the user is still talking. At end-of-turn you pay only the finalization tail (~50ms), not the whole utterance. Batch STT processes a complete file — a different shape (§16).
6 · Diarization & speaker ID — who is speaking, when they overlap
Soniox rt-4 — realtime diarization, primary for overlapping speakers.
Silero embeddings — speaker vectors from the VAD stage; cluster to attribute turns.
Overlap is the hard case — two people at once isn't "pick one"; needs separation + per-speaker attribution.
Interface
Annotates each STT token with speaker_id + confidence. Feeds the sequencer as another tag dimension — diarization is a label on the timeline, not a separate stream. Context accumulates per-speaker-attributed (§10).
7 · Paralinguistics / emotion — the signal in the audio but NOT the transcript
"This is fine" + clipped tone + ↑pitch = a complaint. The transcript alone
reads as approval. The affective signal lives in prosody, not words. For
enterprise/banks this is churn-detection signal.
Prosody/emotion model on the audio stream → tags like {anger:0.8, certainty:low, stress:high}.
Continuous (per-frame) vs discrete words → a windowed join on event-time binds "this anger spike → that word" (§9).
Design the tag schema around channel-unique signal (§20): tone from audio, referents/expression from video, words from text.
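The windowed join between continuous prosody frames and discrete words might look like the sketch below. The function name, the ±150 ms window, and the choice of "peak anger in the window" as the aggregate are assumptions; the only idea taken from the text is joining on event-time.

```python
from bisect import bisect_left, bisect_right

def join_prosody_to_words(words, frames, window_s=0.15):
    """Attach the peak prosody score near each word.

    words:  list of (event_time_s, text)
    frames: list of (event_time_s, anger_score), sorted by time
    Returns a list of (text, peak_anger_or_None).
    """
    times = [t for t, _ in frames]
    joined = []
    for t, text in words:
        lo = bisect_left(times, t - window_s)    # first frame inside the window
        hi = bisect_right(times, t + window_s)   # one past the last frame inside
        scores = [score for _, score in frames[lo:hi]]
        joined.append((text, max(scores) if scores else None))
    return joined

words = [(1.0, "this"), (1.3, "is"), (1.6, "fine")]
frames = [(0.95, 0.1), (1.55, 0.8), (1.62, 0.7)]
print(join_prosody_to_words(words, frames))
```

Here the anger spike lands on "fine", which is the binding the transcript alone cannot express.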
9 · Fusion & co-sequencing — the merge (the genuinely novel algorithm)
Order by EVENT-TIME, not arrival. Audio ≈ 80ms, vision ≈ 200ms processing
latency → simultaneous edge events arrive ~120ms apart. Calibrate by subtracting
each stream's latency; hold a reorder window ≥ the slowest modality; commit on a
watermark.
teal = latency calibration · amber = the hold window · red = the commit decision
The sequencer algorithm
A k-way merge keyed on event-time with a watermark commit — structurally the
same "don't finalize while work is in flight" as the crawler's
Queue.join(); here the in-flight thing is a stream that hasn't
reported up to time T.
on event e from stream s:
    e.event_time = e.arrival_time - LATENCY[s]        # audio −80ms, vision −200ms, text 0
    buffer.push(e)                                    # min-heap on event_time
    watermark[s] = max(watermark[s], e.event_time)    # a watermark never moves backwards
    W = min(watermark over all ACTIVE streams)        # slowest reporter gates commit
    while buffer is not empty and buffer.peek().event_time <= W:
        emit(buffer.pop())                            # commit in event-time order → §10
# idle stream: heartbeat advances its watermark so silence can't freeze the timeline
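A runnable version of the same merge, as a sketch. The `Sequencer` class and its method names are hypothetical; the latencies are the 80 ms / 200 ms figures used in this section, expressed as integer milliseconds to avoid float comparisons. It assumes each stream delivers its own events in order, which is what makes a per-stream watermark meaningful.

```python
import heapq
import itertools

LATENCY_MS = {"audio": 80, "vision": 200, "text": 0}

class Sequencer:
    def __init__(self, streams):
        self.watermark = {s: float("-inf") for s in streams}   # active streams only
        self.heap = []                                          # min-heap on event_time
        self._tie = itertools.count()                           # stable order for equal times

    def _commit(self):
        w = min(self.watermark.values())        # slowest reporter gates commit
        out = []
        while self.heap and self.heap[0][0] <= w:
            event_time, _, stream, payload = heapq.heappop(self.heap)
            out.append((event_time, stream, payload))
        return out

    def on_event(self, stream, arrival_ms, payload):
        event_time = arrival_ms - LATENCY_MS[stream]            # calibrate to edge time
        heapq.heappush(self.heap, (event_time, next(self._tie), stream, payload))
        self.watermark[stream] = max(self.watermark[stream], event_time)
        return self._commit()

    def heartbeat(self, stream, arrival_ms):
        """An idle stream reports 'nothing up to now' so it cannot freeze the timeline."""
        self.watermark[stream] = max(self.watermark[stream],
                                     arrival_ms - LATENCY_MS[stream])
        return self._commit()
```

For example, a word and a pointing gesture that both happened at t=1000 ms arrive at 1080 and 1200; nothing is emitted until the vision event lands, and then both are committed together at event-time 1000.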
Early vs late fusion: use late fusion — each specialist emits tags,
models are swappable, degrades gracefully if one drops. Early fusion (one model on
raw features) is richer but $$$, tightly coupled, and loses the per-channel swap
that lets Deepgram or Parakeet sit behind one interface.
10 · Context accumulation & memory — where state actually lives
The question that separates a toy from a system: where does context
accumulate? Five layers, and only one is the real accumulator. The rest
are transient (drain) or a working copy (derived). Confusing the
sequencer buffer for memory is the classic mistake.
| # | Layer | Lifetime | Holds | Bounded by |
|---|---|---|---|---|
| 1 | Stream reorder buffers | ms (transient) | un-ordered annotations per specialist | stream jitter |
| 2 | Sequencer window (heap) | ≤200ms, drains | uncommitted events awaiting watermark | slowest modality latency |
| 3 | Committed timeline | durable, append-only | fused enriched tokens, event-time order | session length |
| 4 | Session store ★ | durable, source of truth | full history + tags + speaker + intent/state | retention policy |
| 5 | LLM KV-cache | per-process, reused via cache | model's working copy for this turn | GPU memory / context window |
The architect points (what the doc was missing)
Two representations, not one. Session store = full fidelity (audit/replay/compliance — non-negotiable for banks). LLM context = a compacted view (recent turns verbatim, older turns summarized). You can't grow the prompt forever.
Every arrow is a lossy compression you design. specialist→tags (drop raw media) · timeline→session (keep tags) · session→prompt (summarize). The thing that survives each boundary is a design decision — same theme as fusion's "what survives compression."
The KV-cache is the accumulation that feels real but isn't durable. It's derived from session history; prompt-caching reuses it across turns so you don't re-prefill. Lose the GPU, rebuild it from the session store. (token = unit of work; KV-cache compression ~7× = longer accumulated context per GPU — §11/§17.)
Truncation must be recorded. On barge-in the bot's turn was only partially heard — the accumulated context marks it truncated, or the model assumes full delivery and the conversation desyncs from reality.
Store by event-time, per-speaker. Replay/audit must reconstruct what actually happened, in order, attributed — not a flattened arrival-order mush.
One-liner: context accumulates durably in the session store; the
sequencer window is a sliding buffer that drains, not memory; the KV-cache is a
cached working copy. Each layer boundary is a lossy compression you design — and
the full-fidelity log vs. the compacted prompt are two different views of the same
conversation.
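The "two representations" idea can be sketched as a function that derives the compacted prompt view from the full-fidelity history without mutating it. `Turn`, `compact_view`, the `keep_recent` cutoff, and the placeholder summarizer are all hypothetical; a real system would call a summarization model where the lambda is.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    text: str
    truncated: bool = False     # set when barge-in cut the turn short

def compact_view(history, keep_recent=4, summarize=None):
    """Build the prompt-side view: older turns summarized, recent turns verbatim."""
    if summarize is None:
        summarize = lambda turns: f"[summary of {len(turns)} earlier turns]"
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    lines = [summarize(older)] if older else []
    for turn in recent:
        mark = " [truncated by barge-in]" if turn.truncated else ""
        lines.append(f"{turn.speaker}: {turn.text}{mark}")
    return lines
```

The session store keeps `history` intact for audit and replay; only the derived view is lossy, and the truncation marker survives the compaction.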
11 · The LLM core — the shared brain
| Concern | Choice | Notes |
|---|---|---|
| Model | Claude | OpenAI, Qwen 2.5 7B (self-host), Groq |
| First-token latency | prompt cache + speculative prefill | cache system prompt + history (hit every turn); prefill on STT partials so TTFT ≈ 0 at endpoint |
| Memory / capacity | KV-cache = session working mem | a token is the unit of work, not a request. Admit by token budget. (Derived from §10 layer 4.) |
| Throughput | continuous batching | admit/evict per token step; voice + text share the pool at different priorities |
| Cost headroom | KV-cache compression (~7×) | more concurrent sessions/GPU + longer accumulated context, same cost (verify numbers) |
Multimodal input to the LLM
The LLM consumes the fused, enriched token stream (§9) compacted into a
prompt (§10) — words carrying speaker_id, prosody tags, visual refs.
"this is fine" arrives annotated as "said angrily while pointing at the fee." The
model reasons over signal, not just text.
Cascade vs native multimodal (speech-to-speech / omni) — the big fork
A frontier lab is likely building its own multimodal model, which can collapse
STT→LLM→TTS into one speech-to-speech model. The systems problems don't
disappear — they relocate. Know both, and where each wins.
|  | Cascade (STT→LLM→TTS) | Native speech-to-speech (omni) |
|---|---|---|
| Shape | 3 swappable services (most of this doc) | audio in → audio out, one model |
| Paralinguistics | lost at STT boundary (tone → flat text) unless a separate prosody tag carries it (§7) | preserved end-to-end; no transcript bottleneck |
| Latency | serialization per hop (STT final → LLM → TTS TTFB) | lower: no STT/TTS round-trips |
| Debug / swap | each hop inspectable + replaceable | black box: hard to swap parts or trace |
| Tools / RAG mid-stream | easy: inject between hops | harder: must be designed into the model |
| Turn-taking / barge-in | handled in the orchestrator (this doc) | inside the model or alongside it |
The architect point
The front of the pipeline survives either way. Multi-stream fusion + co-sequencing (§9) — audio + video + text arriving at different latencies, aligned on event-time — is unchanged. An omni model fuses perception + generation; it does not solve input-stream alignment.
The harness survives too. Turn-taking, barge-in, session/context accumulation (§10), scheduling, transport (§2) are all still yours — they become the harness around the model, not parts of a cascade.
So the likely trajectory: cascade today → native multimodal for the core, with the cascade's modular concerns kept as the surrounding system. The systems problems relocate; they don't vanish. That's the line — it shows you see past "just call the model."
Full-duplex & the death of explicit turn-taking — where the harness goes next
The omni fork above still assumes half-duplex: model and user alternate. Full-duplex
models (Moshi, NVIDIA PersonaPlex) listen and speak on the same clock — continuous audio+silence
tokens both ways. This is the likely long-term substrate, and it relocates turn-taking one more
time: from an orchestrator stage (§4) to an emergent behavior inside the model.
Three generations of where "when to talk" lives:
| Gen | Where timing lives | Trained by | Failure mode |
|---|---|---|---|
| v1 · turn-based + VAD | silence-timeout outside the model (§4) | heuristic threshold | cuts off thinkers / laggy; one knob, no backchannel concept |
| v2 · full-duplex SFT | inside the model, token-by-token (Moshi) | cross-entropy per token | over-silent: silence is cheap under CE, so it under-backchannels |
| v3 · full-duplex + RL | inside the model, as a policy | sequence-level reward | current frontier; timing becomes a spec |
orange = heuristic / under-weighted timing · green = timing as a first-class learned policy · dashed line = the model boundary
Why RL, not more SFT — the load-bearing point
Turn-taking, barge-in, and backchannel are sequence-level properties, not per-token ones.
"Backchannel every so often" or "never talk over the user" can't be expressed in a token-level
cross-entropy loss — staying silent instead of saying "mhm" costs almost nothing in CE but is
glaring to a human. RL post-training is how you express a sequence-level goal. The mechanism
(the §4 endpointer) dissolves into the model; the behavior survives as a reward.
(09_turn_taking.py models all three gens, plus a barge-in confirmation window:
yield fast on a real interrupt, but wait out one frame so the user's own "mhm" doesn't cut the bot off.)
Soundbite: full-duplex doesn't remove turn-taking — it dissolves the
turn-taking module into the model and re-poses it as a reward. Barge-in stops being
"stop on voice" and becomes a learned endpointing decision with a latency-vs-false-cutoff tradeoff.
The catch: full-duplex breaks the batching economics (→ §17)
Half-duplex demand is bursty — a session generates ≈⅓ of wall-clock, so continuous batching
oversubscribes the idle time. Full-duplex makes demand constant-bitrate: every live session
must be ticked every frame, even in silence → the multiplexing is gone → roughly 3× the
GPU slots. The fix is not VAD-gating the model (that bolts the §4 endpointer back on and
kills the silence-timing behavior) — it's two tiers: a tiny always-on duplex controller decides
speak/listen/backchannel every frame and wakes the large generative model only when there's content.
Suppress silent audio; never suppress the model's vote on the silence.
12 · TTS — streaming; first-chunk latency is what's felt
| Use | Component | Host | Notes |
|---|---|---|---|
| Realtime streaming (default) | Cartesia | API | low first-byte; realtime workhorse |
| Self-host streaming | Orpheus-3b | Modal | ~200ms latency streaming server (ref impl exists) |
| Self-host, fastest/smallest | Kokoro-82M | Modal | tiny + streaming output → low TTFB; used in the Modal ragbot ref |
| On-device / edge (no GPU) | Supertonic | CPU / browser | ~99M params, ONNX runtime; RTF ~0.3× on an e-reader, fast on CPU vs larger A100 baselines; studio 44.1kHz out, 31 languages, runs in browser (WebGPU/WASM) → Pi. Edge/privacy play. |
| Self-host general | XTTS | Modal | voice cloning, multilingual |
Why an on-device TTS matters architecturally
Supertonic (and the edge-TTS class) move synthesis off the server entirely — zero
network hop for the audio-out leg, complete privacy, no GPU. For a latency budget that's
the last-mile playout latency gone; for banks it's a data-residency win. Tradeoff:
you ship a model to the client and lose centralized voice control / instant updates. A
hybrid (server TTS default, on-device for privacy-sensitive or offline) is the real answer.
The streaming insight
Start TTS on the first clause while the LLM is still generating clause two.
First-audio-out is what the user perceives — never wait for the full reply. Buffer
1–2 chunks before playout to survive jitter. (Producer side of 10_audio_pipeline.py.)
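A sketch of the producer side of that idea: turn the LLM token stream into clause-sized chunks that can be handed to TTS as soon as each one closes. `clauses` and its punctuation set are hypothetical, and the splitter is deliberately naive (it would also split inside "3.5").

```python
def clauses(token_stream, boundaries=".,;:!?"):
    """Yield a clause as soon as a boundary character closes it.

    Being a generator, it hands the first clause downstream while the
    upstream model is still producing the rest of the reply.
    """
    buf = ""
    for tok in token_stream:
        buf += tok
        stripped = buf.rstrip()
        if stripped and stripped[-1] in boundaries:
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()            # flush whatever is left at end of stream

tokens = ["Sure", ",", " I", " can", " help", ".", " One", " moment"]
for clause in clauses(tokens):
    print(clause)                    # each clause would go to TTS here
```

Splitting on commas as well as full stops trades slightly choppier prosody for earlier first audio; which set of boundaries to use is a per-voice tuning decision.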
13 · Orchestration — who wires the graph together
| Framework | Role | Notes |
|---|---|---|
| Pipecat | pipeline + turn events | frame-based pipeline; hooks for VAD, turn detection, barge-in, interruptions. SmallWebRTCTransport for lightweight WebRTC. |
| LiveKit Agents | transport + agent runtime | SFU + rooms + agent SDK; plugins for STT/TTS/LLM. Self-host or cloud; runs on Modal. |
These own the event loop: media in → VAD/turn → STT → (fuse) → LLM → TTS
→ media out, plus the interrupt/barge-in control plane. The fusion/sequencer is a
stage inside this graph.
Reference build: the Modal ragbot (modal-projects/open-source-av-ragbot +
Modal's low-latency-voice-bot blog) is a concrete, runnable stack that hits a median
~1s voice-to-voice latency. Worth knowing as the "here's what real looks like" anchor.
| Role | Reference uses | Why / note |
|---|---|---|
| Transport (client↔bot) | SmallWebRTCConnection | Pipecat WebRTC, E2E encrypted, low latency; SDP exchanged via an ephemeral modal.Dict |
| VAD + turn | Silero VAD + SmartTurn | Pipecat local-smart-turn; emit start/stop frames |
| STT | parakeet-tdt-0.6b-v3 | "hard to beat on final-transcript time + accuracy" |
| RAG | ChromaDB + all-MiniLM-L6-v2 | retrieval in "a few tens of ms"; loaded at container start |
| LLM | Qwen3-4B-Instruct + vLLM | self-hosted via vLLM over a Tunnel |
| TTS | Kokoro-82M | streaming output → low TTFB |
The Modal architecture moves (the latency wins)
CPU bot, GPU services, split apart. The Pipecat orchestrator runs CPU-only (it's just wiring frames); STT/LLM/TTS are independent Modal services that autoscale on GPUs separately. Services hold no bot-specific logic → swappable + independently scalable.
Modal Tunnels bypass the input plane. The bot talks directly to services over a Tunnel (WebSocket for STT/TTS, HTTP for vLLM) — persistent, bidirectional, skips the routing hop. This is the main latency lever beyond model choice.
Regional pinning is critical. ~1s only holds when bot + services are co-located (one metro: us-east Virginia or us-west Bay Area). Cross-region kills it.
Memory snapshots for cold start. @modal.enter(snap=True) + enable_memory_snapshot; warm the snapshot with ~50 ping calls so a new container is ready fast. max_inputs=1 = one session per container.
Session lifecycle = a spawned FunctionCall. run_bot.spawn() per conversation, cancel at end; modal.Dict for sessions. Spawn/cancel is how you reclaim autoscaling while holding a Tunnel open.
# shape of the reference (paraphrased)
@app.cls(image=bot_image, region=SERVICE_REGIONS,
         enable_memory_snapshot=True, max_inputs=1)   # 1 session / container
class ModalVoiceAssistant:
    @modal.enter(snap=True)
    def load(self):
        self.chroma_db = ChromaVectorDB()             # RAG ready at snapshot

    @modal.method()
    async def run_bot(self, d):                       # d = ephemeral modal.Dict
        conn = SmallWebRTCConnection(ice_servers)
        await conn.initialize(sdp=offer["sdp"], type=offer["type"])
        await d.put.aio("answer", answer)             # SDP answer back to frontend
        await run_bot(conn, self.chroma_db)           # Pipecat pipeline: VAD→STT→RAG→LLM→TTS

# frontend POST /offer → ModalVoiceAssistant().run_bot.spawn(d)   (one FunctionCall per call)
Note vs. our reference stack
This build self-hosts everything on Modal (Parakeet/Qwen/Kokoro) for cost
+ control, vs. the API-heavy default (Deepgram/Claude/Cartesia) elsewhere in this
doc. Same architecture, different self-host/API dial (§15). It also folds in RAG
— the STT transcript queries Chroma before the LLM, adding a "few tens of ms" retrieval hop.
The LLM (Qwen3-4B) is a swappable component behind the orchestrator — that's the
realistic self-host choice (open, fits one GPU, vLLM-served); swap to a frontier API for quality.
Reference platform — Dograh (open-source voice-AI platform, also Pipecat)
The other build making the rounds (dograh-hq/dograh, YC, ~4.3k stars).
Same Pipecat foundation as the Modal ragbot, but it's a platform, not a
minimal reference — useful as the "productized" end of the spectrum.
| Aspect | Dograh | Contrast w/ Modal ragbot |
|---|---|---|
| Shape | platform (drag-drop workflow builder) | ragbot = minimal hand-wired reference |
| Orchestration | Pipecat (git submodule) | same foundation |
| Models | BYO LLM/STT/TTS (or bundled) | both treat models as swappable components |
| Transport | WebRTC + telephony | Twilio, Vonage, Telnyx, Cloudonix: real PSTN ingress |
| Deploy | Docker-first, self-host or cloud | ragbot = Modal serverless GPU |
| Extras | QA node, test-mode web calls, human transfer | productization: evals + agent handoff built in |
| License / stack | BSD-2, Python+TypeScript | no vendor lock-in |
The takeaway from two refs
Both serious open voice stacks land on Pipecat as the orchestration layer with
models as swappable components behind it — confirming the doc's architecture.
They differ on the dial that matters at Twilio: ragbot optimizes raw latency on
serverless GPU; Dograh optimizes productization (telephony providers,
workflow builder, QA/evals, human transfer). A real deployment is somewhere between:
Pipecat core + your model dial + the telephony + ops layer you need.
14 · Latency budget & waterfall — the number you defend
Mouth-to-mouth target <800ms (enterprise), great ~500ms. It's a
pipelined waterfall, not a sum — and the dominant cost is endpointing,
not the models.
NeMo — NVIDIA framework; HF Open-ASR code → distributed batch service is a small lift; trivial swap between Parakeet/Canary.
| Config (vs proprietary API) | Speed | Cost |
|---|---|---|
| Parakeet, fastest | 112× | 60× |
| Parakeet, cheapest | 25× | 200× |
| Canary, fastest | 80× | 55× |
| Canary, cheapest | 12× | 152× |
| Proprietary API | 1× | 1× |
Orange = API (buy quality/latency, pay per-use). Teal = self-host (1–2 orders cheaper, you run infra). Typical split: self-host STT (volume), API for TTS + frontier LLM (quality/latency).
16 · Batch transcription path — the OTHER system (offline, throughput)
Call recordings / analytics: a batch job optimizing throughput over a
corpus you own — inverse of the realtime path. Don't conflate them.
Audio segments as WAV on a Modal Volume; spin up GPUs per job, fan out.
Mental model: the sort phase of MapReduce — chunk → transcribe in parallel → gather/order.
Parakeet/Canary RTFx in the thousands → transcribe enormous archives cheaply; Mistral Transcribe 2 for higher-quality batch.
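The chunk → transcribe in parallel → gather/order shape, sketched with a local thread pool standing in for the GPU fan-out. `transcribe_corpus` and `fake_transcribe` are hypothetical; on Modal the pool would be replaced by whatever fan-out primitive the job uses, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def transcribe_corpus(segments, transcribe, workers=4):
    """Map: transcribe segments in parallel. Gather: restore input order by index."""
    def job(index, segment):
        return index, transcribe(segment)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(job, i, seg) for i, seg in enumerate(segments)]
        # as_completed yields in finish order, which is NOT input order
        results = [f.result() for f in as_completed(futures)]
    return [text for _, text in sorted(results, key=lambda r: r[0])]

def fake_transcribe(path):
    """Stand-in for a real ASR call on one WAV segment."""
    return f"<text of {path}>"

print(transcribe_corpus(["a.wav", "b.wav", "c.wav"], fake_transcribe, workers=2))
```

Carrying the segment index through the map step is what makes the gather step trivial; without it, ordering has to be recovered from filenames or timestamps.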
17 · Scaling & the scheduler
A voice session pins a GPU slot for its whole duration (STT+TTS streaming, KV-cache resident) — unlike text, which is bursty + poolable. Sessions-per-GPU is the cost driver.
Scheduler = sessions → capacity, admit by KV-cache budget not headcount. Text and voice share the LLM pool at different priorities. (README resource-scheduler thesis.)
KV-cache compression ~7× → directly raises concurrent sessions/GPU and how much §10 context you can keep hot.
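"Admit by KV-cache budget, not headcount" can be sketched as an admission check over a per-GPU budget. `KVScheduler` and `sessions_per_gpu` are hypothetical names, the budget is counted in tokens as a proxy for bytes, and the compression factor is just a multiplier on that budget (the ~7× figure is the unverified one quoted above).

```python
class KVScheduler:
    """Admit sessions while their combined context fits the KV-cache budget."""

    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.used = {}                       # session_id -> resident context tokens

    def admit(self, session_id: str, context_tokens: int) -> bool:
        if sum(self.used.values()) + context_tokens > self.budget:
            return False                     # over budget: queue, shed, or route elsewhere
        self.used[session_id] = context_tokens
        return True

    def release(self, session_id: str) -> None:
        self.used.pop(session_id, None)

def sessions_per_gpu(budget_tokens: int, tokens_per_session: int,
                     compression: float = 1.0) -> int:
    """Concurrent sessions that fit, given a KV compression factor."""
    return int(budget_tokens * compression // tokens_per_session)
```

Two sessions of 6,000 tokens do not both fit a 10,000-token budget even though the headcount is only two, which is the point of budgeting by tokens.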
Full-duplex serving — when the session never goes idle (the v3 cost, §11)
Full-duplex converts multiplexable bursty demand into non-multiplexable steady
demand — the same reason constant-bitrate traffic is harder to pack than bursty. The cost shows
up as ≈3× the fleet, and it's a batching problem, not a FLOPs problem.
| Why batching chokes on full-duplex | Mechanism |
|---|---|
| Lost multiplexing | can't release the slot during silence → effective cost ≈ 1/duty-cycle (~3× at a ⅓ talk ratio) |
| Frame deadline | each tick must finish within ~80ms; can't wait to fill a batch, so you run undersized batches on schedule |
| Pure decode | one token/frame autoregressive → memory-bandwidth bound, low arithmetic intensity (worst GPU regime; no prefill density to amortize) |
| KV-cache pinned | cache stays resident the whole conversation; can't evict/swap during silence as you might between turns → memory-capacity bound on concurrency |
same 3 sessions, same conversation — only the serving model differs. ⅓ duty cycle → 1 slot vs 3.
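The slot arithmetic behind that comparison, as a small sketch. `slots_needed` is a hypothetical helper and assumes ideal multiplexing in the half-duplex case (sessions never generate at the same moment), so it is a lower bound rather than a capacity plan.

```python
import math
from fractions import Fraction

def slots_needed(sessions: int, duty_cycle: Fraction, full_duplex: bool) -> int:
    """GPU slots required to serve `sessions` concurrent conversations.

    Half-duplex: a session only generates during its duty cycle, so slots
    can be shared. Full-duplex: every session is ticked every frame.
    """
    if full_duplex:
        return sessions
    return math.ceil(sessions * duty_cycle)

third = Fraction(1, 3)
print(slots_needed(3, third, full_duplex=False))   # 1
print(slots_needed(3, third, full_duplex=True))    # 3
```

The ratio between the two is 1/duty-cycle, which is where the roughly 3× fleet figure comes from at a ⅓ talk ratio.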
The two-tier escape — duplex behavior and good economics
Small always-resident duplex model runs every frame: decode-light, uniform → batches beautifully. Decides speak / listen / backchannel / yield.
Large generative model invoked only when there's content to produce → back to bursty, back to multiplexable.
Pay tiny-model FLOPs to decide, big-model FLOPs only to speak. The serving-grade version of "suppress silent audio, never the model's vote."
green = always-on cheap tier · purple = on-demand expensive tier · amber = the silent-frame path that never wakes the big model
When to just eat the cost (gate the model on VAD anyway)
A product call, not a correctness one. Reactive products (command, Q&A, IVR) get ≈zero value
from the model acting during silence → gate aggressively; the behavioral loss is noise and the 3×
saving is real. Relational products (companion, tutor, interviewer) — the silence behavior
is the product → don't gate. Cheap hybrid: gate on VAD but a single wake-timer re-triggers
after N seconds of mutual silence (one scalar, not a full endpointing module).
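The cheap hybrid reduces to one counter. In this sketch `WakeGate`, the 80 ms frame size, and the 3-second wake interval are hypothetical; `speech` means either party is audible in the current frame.

```python
class WakeGate:
    """VAD-gate the model, but wake it once after N ms of mutual silence."""

    def __init__(self, wake_after_ms: int = 3000, frame_ms: int = 80):
        self.wake_after_ms = wake_after_ms
        self.frame_ms = frame_ms
        self.silent_ms = 0

    def step(self, speech: bool) -> bool:
        """Return True if the model should run on this frame."""
        if speech:
            self.silent_ms = 0
            return True
        self.silent_ms += self.frame_ms
        if self.silent_ms >= self.wake_after_ms:
            self.silent_ms = 0          # one wake-up, then start counting again
            return True
        return False
```

During a long silence the model runs once per wake interval instead of once per frame, so it still gets a chance to speak first without paying the constant-bitrate cost.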
18 · Escalation tiers — one session, upgraded
| Tier | Transport + stack | Latency | Cost |
|---|---|---|---|
| Text | WS/HTTP → LLM | seconds | cheap, poolable |
| Voice | WebRTC/PSTN → AEC3·VAD·turn·STT·LLM·TTS | sub-second | GPU slot pinned |
| Video | + vision specialist (200ms) | sub-second + vision | GPU slot ++ |
Escalation = bind an existing session to a richer tier without losing
state — the session store (§10) is transport-agnostic (decide day one). Triggers:
explicit ask · detected frustration (measurable from prosody/visual tags) ·
complexity · verification · human handoff. Downgrade symmetric — the accumulated
context carries across.
19 · Failure modes & tail latency — p99 is what users feel
| Failure | Effect | Mitigation |
|---|---|---|
| LLM TTFT spike | dead air | backchannel ("let me check…") to mask; timeout→fallback model |
| STT wrong on μ-law | misheard intent | upsample+denoise; confidence gate → reprompt |
| TTS stall | cut-off speech | buffer 1–2 chunks; cache common phrases |
| Specialist down (vision/prosody) | lost signal | degrade gracefully to remaining channels (late fusion enables this) |
| Watermark starvation | frozen timeline | idle-stream heartbeat advances watermark |
| Session store unavailable | amnesia / lost turn | write-ahead the committed timeline; rebuild KV-cache from store on reconnect |
| Network jitter | choppy audio | jitter buffer (adds latency, the tradeoff); regional POP |
Dead air is the worst failure
>~700ms of silence reads as "the line dropped." Latency you can't remove, you
mask — a backchannel buys LLM time and feels natural. Senior instinct:
design the masking, not just the speed.
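Masking can be sketched as a timeout around the model call: if no reply has arrived by the deadline, speak a backchannel and keep waiting. `reply_with_mask`, the 0.5 s default, and the filler text are hypothetical; the only constraint taken from the text is that the deadline should sit below the roughly 700 ms dead-air threshold.

```python
import asyncio

async def reply_with_mask(generate, say, mask_after_s=0.5, filler="let me check"):
    """Speak a backchannel if the model has not answered by the deadline.

    generate: zero-argument coroutine function returning the reply text
    say:      coroutine function that speaks one string
    """
    task = asyncio.ensure_future(generate())
    done, _pending = await asyncio.wait({task}, timeout=mask_after_s)
    if not done:
        await say(filler)          # buy time; the model call keeps running
    await say(await task)

async def demo(delay_s, mask_after_s):
    spoken = []

    async def slow_model():
        await asyncio.sleep(delay_s)
        return "answer"

    async def say(text):
        spoken.append(text)

    await reply_with_mask(slow_model, say, mask_after_s=mask_after_s)
    return spoken
```

A production version would also cap the total wait and fall back to another model, as the failure table suggests; this sketch only covers the masking step.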
20 · Tag schema & interface contracts — the data that flows between stages
Late fusion only works if every specialist emits a common annotation
envelope on the shared timeline. The envelope is the contract.
Annotation = {                 # what every specialist emits onto the timeline
    event_time:   float,       # edge time, AFTER latency calibration (the join key)
    arrival_time: float,       # when WE got it (for watermark + debugging)
    source:       "stt"|"prosody"|"vision"|"text"|"diar",
    speaker_id:   str|None,    # from diarization; None for non-attributable
    payload:      {...},       # source-specific, below
    confidence:   float,
    final:        bool         # partial (revisable) vs final hypothesis
}

payload by source:
    stt     → { text, word_timings[] }                  # the words
    prosody → { anger, joy, stress, certainty, pitch }  # NOT in transcript
    vision  → { ref, gesture, facial_affect }           # referents/expression
    text    → { text }                                  # typed; event_time == arrival
    diar    → { speaker_id, embedding }                 # attribution

EnrichedToken = {              # what the sequencer emits after bundling co-temporal annotations
    event_time, speaker_id,
    text, prosody:{...}, visual:{...},   # merged from all sources within ε of event_time
    truncated: bool                      # set on barge-in (§4) so the model knows
}
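The envelope translates directly into typed records. The sketch below mirrors the field names above; the `bundle` function, its ε of 50 ms, its "anchor on the first annotation of each group" rule, and its choice to skip non-final hypotheses are assumptions, since the schema does not fix them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class Annotation:
    event_time: float
    arrival_time: float
    source: str                      # "stt" | "prosody" | "vision" | "text" | "diar"
    payload: Dict[str, Any]
    confidence: float
    final: bool
    speaker_id: Optional[str] = None

@dataclass
class EnrichedToken:
    event_time: float
    speaker_id: Optional[str] = None
    text: str = ""
    prosody: Dict[str, Any] = field(default_factory=dict)
    visual: Dict[str, Any] = field(default_factory=dict)
    truncated: bool = False

def bundle(annotations: List[Annotation], eps: float = 0.05) -> List[EnrichedToken]:
    """Merge final annotations that fall within eps seconds of a group's first event."""
    tokens: List[EnrichedToken] = []
    anchor = None
    finals = sorted((a for a in annotations if a.final), key=lambda a: a.event_time)
    for a in finals:
        if anchor is None or a.event_time - anchor > eps:
            anchor = a.event_time
            tokens.append(EnrichedToken(event_time=anchor))
        tok = tokens[-1]
        tok.speaker_id = tok.speaker_id or a.speaker_id
        if a.source in ("stt", "text"):
            tok.text = (tok.text + " " + a.payload["text"]).strip()
        elif a.source == "prosody":
            tok.prosody.update(a.payload)
        elif a.source == "vision":
            tok.visual.update(a.payload)
    return tokens
```

Because every specialist writes the same envelope, swapping one STT or prosody model for another changes nothing downstream of `bundle`.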
| Stage boundary | In | Out |
|---|---|---|
| transport → front-end | RTP frames (Opus/μ-law) | uniform PCM, 20ms |
| front-end → specialists | clean PCM + VAD flags | (audio) to STT/prosody/diar |
| specialists → sequencer | raw media | Annotation (envelope above) |
| sequencer → fusion | Annotations | event-time-ordered, watermark-committed |
| fusion → LLM | Annotations bundle | EnrichedToken → compacted prompt (§10) |
| LLM → TTS | token stream | text clauses (stream on first clause) |
21 · Build order + interview ammo
Text chat → session store + stateless handlers + Claude. Baseline.
Persistent transport-agnostic session (§10) — the load-bearing decision.
Self-host STT on Modal + scheduler + KV-cache compression — cost at scale.
Batch transcription path — offline analytics, separate system.
Say out loud
"Modalities are complementary signals to fuse, not transcripts to dedup — late fusion, structured tags, a schema built around each channel's unique signal."
"Co-sequencing is a stream-join on event-time; audio ~80ms vs vision ~200ms → calibrate per-stream latency, size the reorder window to the slowest modality. Correctness is bounded by your slowest channel."
"Context accumulates durably in the session store; the sequencer window is a sliding buffer that drains, not memory; the KV-cache is a derived working copy. Each boundary is a lossy compression I design — full-fidelity log for audit vs. compacted prompt for the model."
"Mouth-to-mouth is a pipelined waterfall, not a sum; the dominant cost is endpointing, not the models."
"A voice session pins a GPU slot; the whole thing is a resource scheduler admitting by KV-cache budget, not request count."
"Latency you can't remove, you mask with a backchannel; dead air is the real failure."
"Full-duplex is the long-term substrate. It dissolves the turn-taking module into the model and re-poses timing as a sequence-level reward — that's why it's RL, not more SFT. But it turns bursty demand into constant-bitrate (≈3× the slots), so you split a tiny always-on duplex controller from the on-demand generator."
Verify-before-quoting: KVarN ~7× numbers, exact RTFx/WER figures, and cost multipliers are as-captured from source material — confirm before citing live. The Modal ragbot details (~1s v2v, component list) are from the Modal blog + repo as of capture.