Multimodal Conversational Agent — Fusion & Co-Sequencing Design

A design doc for a messaging service that handles text, voice, and video at once and can escalate between them, where the streams carry complementary information (words, tone, what's being shown) that must be fused on a shared event-time timeline before reaching the LLM. Centerpiece: multimodal fusion + co-sequencing. Escalation tiers and batch transcription are supporting sections.

0 · The thesis (say this first)

Everything becomes tokens at the end — but the modalities are not redundant transcripts to dedup. They are complementary signals to fuse. Each specialist model compresses its channel into structured annotations on a shared event-time timeline; a sequencer aligns and bundles co-temporal annotations into enriched tokens; the LLM consumes one fused stream. The two hard parts are (a) aligning streams that have different latencies and rates on event-time and (b) choosing what survives each channel's compression — because the whole point is to keep the signal that lives in the audio/video but not the words.

Two reframes that drive everything below:

  1. Co-sequencing is a stream-join on event-time, not processing-time. (Flink/Dataflow watermark model.)
  2. Escalation is a resource-tier upgrade of one persistent session. Text = cheap async tier; voice/video = expensive real-time tier. Same brain, different transport + latency budget + cost.

1 · Why naive ordering is wrong

The modalities don't carry the same thing:

Channel       | Carries                                                          | Example signal
text / STT    | propositional: the literal words                                 | "this is fine"
audio prosody | affective / pragmatic: tone, stress, hesitation                  | clipped, ↑pitch, anger 0.8
video         | referential / deictic: what's shown / pointed at, expression     | points at a fee, frustrated face

"this is fine" + angry tone + frustrated face is a complaint. The transcript alone reads as approval. For a bank, that's the difference between catching a churning customer and missing them. The non-verbal signal is exactly the thing absent from the words — so you cannot flatten to text and order by arrival.

2 · The core problem: event-time vs processing-time

Streams arrive on separate async paths with different, structural latencies:

Figure: event-time vs processing-time.
  Event-time (one real moment at the edge):
    user types "actually no—": arrives ~instantly
    user speaks "send it to...": +80ms STT → arrives late
    user gestures [points at X]: +200ms vision → arrives later
  Processing-time (arrival order): text → speech → gesture
  Order by EVENT-time, NOT by which pipeline finished first.
Caption: purple text (instant) · green speech (+80ms) · orange gesture (+200ms). The three fire at one real moment (amber line) but reach the sequencer spread across processing-time.

Order by arrival and the instant text "actually no—" lands before the slow-transcribed speech it was correcting → the LLM sees the conversation backwards. Order by event-time and it's correct.
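The difference between the two orderings can be seen in a few lines. This is a minimal sketch, not the service's data model: the `Event` record, the millisecond timestamps, and the tie-break on stream name are all made up for illustration. The speech is assumed to start 50 ms before the typed correction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    stream: str
    payload: str
    event_ms: int    # when it happened at the edge (hypothetical values)
    arrival_ms: int  # when it reached the sequencer


# The speech and gesture happen at t=1000; the typed correction follows at t=1050.
events = [
    Event("speech", "send it to...", event_ms=1000, arrival_ms=1080),   # +80ms STT
    Event("text", "actually no", event_ms=1050, arrival_ms=1050),       # no lag
    Event("gesture", "[points at X]", event_ms=1000, arrival_ms=1200),  # +200ms vision
]

# Arrival order puts the correction BEFORE the speech it corrects.
by_arrival = [e.stream for e in sorted(events, key=lambda e: e.arrival_ms)]

# Event-time order restores the real sequence (stream name only breaks ties).
by_event_time = [e.stream for e in sorted(events, key=lambda e: (e.event_ms, e.stream))]

print(by_arrival)
print(by_event_time)
```

Sorting the whole list only works offline; a live service has to decide when it is safe to commit a prefix, which is the sequencer's job.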

The concrete timing model (the crux)

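Processing latency is per-modality and roughly constant:

Stream | Specialist     | Latency
text   | none           | ~0 ms
audio  | STT + prosody  | ~80 ms
visual | vision model   | ~200 ms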

So a sound and a sight that were simultaneous at the edge arrive at the sequencer ~120 ms apart. This is a known structural offset, not random jitter — which makes it correctable by construction (same idea as A/V lip-sync in a video player).

Two consequences that define the design:

  1. Latency calibration → recover event-time. Stamp at capture where the edge can; where only the arrival time is known, subtract the channel's known latency:
    event_time = arrival_time − stream_latency # audio −80ms, visual −200ms
    Now a co-temporal sound + gesture land on the same event-time axis and fuse correctly.
  2. Reorder window ≥ max inter-stream skew. Hold the timeline open at least ~120 ms (visual − audio, plus a jitter margin) before committing, or the slower visual event arrives after you flushed the audio it belonged with. With zero-lag text also in play, the skew grows to the full ~200 ms, so the commit delay is set by your SLOWEST modality (visual, 200ms), not the fastest. Your fusion window can never be tighter than your slowest channel.

This is the latency-vs-correctness knob, made physical.
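The arithmetic behind both consequences can be sketched as follows. The latency figures are the ones used in this doc; the function names and the `jitter_margin_ms` parameter are hypothetical, chosen only for this illustration.

```python
# Per-stream processing latency in milliseconds (figures from this doc).
LATENCY_MS = {"text": 0, "audio": 80, "visual": 200}


def event_time(arrival_ms, stream):
    """Recover event-time by subtracting the stream's known latency."""
    return arrival_ms - LATENCY_MS[stream]


def min_reorder_window(active_streams, jitter_margin_ms=0):
    """Smallest hold time that covers the skew between the fastest and
    slowest active stream, plus an optional jitter margin."""
    latencies = [LATENCY_MS[s] for s in active_streams]
    return max(latencies) - min(latencies) + jitter_margin_ms


# A sound and a gesture from the same instant arrive 120 ms apart...
sound = event_time(1080, "audio")
sight = event_time(1200, "visual")

# ...and the window needed depends on which streams are live.
window_av = min_reorder_window(["audio", "visual"])
window_all = min_reorder_window(["text", "audio", "visual"])

print(sound, sight, window_av, window_all)
```

Both calibrated timestamps come out as 1000, and the window is 120 ms for audio + visual but 200 ms once zero-lag text joins the session.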

3 · Architecture

Figure: architecture, top to bottom.
  EDGE: stamp event-time at capture (mic, camera, keyboard)
  SPECIALISTS: each compresses its channel to structured tags (annotations)
    STT + prosody, ~80ms: words + affect
    vision model, ~200ms: refs + affect
    text (0 lag): tokens
  SEQUENCER (event-time join)
    calibrate: arrival − latency · buffer in a reorder window
    WATERMARK: commit prefix when "seen everything ≤ T"
    bundle co-temporal annotations into ONE enriched token
  FUSION (annotations → context)
    late fusion: merge the tags into LLM-readable context per event
  SHARED LLM (batched): KV-cache = session working memory
  SESSION STORE: id · identity · history · state; transport-agnostic, survives modality switch
  response → TTS / text
edge → specialists (each compresses its channel to tags) → sequencer (event-time merge) → fusion → shared batched LLM, bound two-way to the transport-agnostic session store

4 · The Sequencer (the heart)

A k-way merge keyed on event-time, with a watermark commit. Structurally the same "don't finalize while work is in flight" as the web-crawler's Queue.join() — here the in-flight thing is a stream that hasn't reported up to time T yet.

    on event e from stream s:
        e.event_time = e.arrival_time - LATENCY[s]        # calibrate
        buffer.push(e)                                    # min-heap on event_time
        watermark[s] = max(watermark[s], e.event_time)    # this stream is current to here; never moves backwards
        W = min(watermark over all active streams)        # global safe time
        while buffer not empty and buffer.peek().event_time <= W:   # everything ≤ W has arrived
            emit(buffer.pop())                            # commit in event-time order
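The pseudocode above can be turned into a small runnable sketch. The class name, method names, and tuple layout here are assumptions for illustration, not an existing API. It also assumes each stream delivers its own events in event-time order, which is what makes the per-stream watermark meaningful.

```python
import heapq
from itertools import count


class Sequencer:
    """k-way merge on event-time with a min-over-streams watermark."""

    def __init__(self, latency_ms):
        self.latency_ms = dict(latency_ms)
        # A stream that has not reported yet holds the watermark at -inf.
        self.watermark = {s: float("-inf") for s in self.latency_ms}
        self._heap = []
        self._tie = count()  # tie-breaker so the heap never compares payloads

    def on_event(self, stream, arrival_ms, payload):
        """Buffer one event; return the events that are now safe to commit."""
        t = arrival_ms - self.latency_ms[stream]  # calibrate to event-time
        heapq.heappush(self._heap, (t, next(self._tie), stream, payload))
        self.watermark[stream] = max(self.watermark[stream], t)
        safe = min(self.watermark.values())  # global safe time W
        committed = []
        while self._heap and self._heap[0][0] <= safe:
            t, _, s, p = heapq.heappop(self._heap)
            committed.append((t, s, p))
        return committed


seq = Sequencer({"text": 0, "audio": 80, "visual": 200})
first = seq.on_event("text", 1050, "actually no")       # held: others silent
second = seq.on_event("audio", 1080, "send it to...")   # held: visual silent
third = seq.on_event("visual", 1200, "[points at X]")   # watermark reaches 1000
print(first, second, third)
```

In this run the first two calls commit nothing, and the third releases the audio and visual events at event-time 1000. The text event at 1050 stays buffered until every stream has reported past it. One consequence of the plain `min`: a stream that never reports pins the watermark at minus infinity, so a real service needs an idle or deadline rule on top.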

5 · Fusion: early vs late

Figure: early vs late fusion.
  EARLY fusion (left, amber): raw features → one big multimodal model. Richest, $$$, tightly coupled; one model must do everything.
  LATE fusion ✅ (right, green; build this): each stream → its specialist → structured tags → fuse the tags. Cheap, modular, swappable models; degrades gracefully if one drops.

Late fusion is the practical choice. Each specialist emits typed annotations; the token becomes a multi-channel record:

    [t=4.10 words="this is fine"]
    [t=4.10 prosody={anger:0.8, certainty:low}]
    [t=4.10 visual={ref:"fee_line_item", affect:"frustrated"}]

    → fused LLM context:
    User said "this is fine" but tone=angry and pointing at the fee
    → likely sarcastic; treat as a complaint about the fee.
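Bundling the three annotations above into one multi-channel record might look like the sketch below. The `(t, channel, value)` tuple shape and the `bundle` / `render` helpers are hypothetical. It also assumes the sequencer has already snapped co-temporal annotations to the same event-time, so an exact-match group-by is enough here.

```python
from collections import defaultdict


def bundle(annotations):
    """Group (t, channel, value) annotations that share an event-time
    into one record per timestamp, in event-time order."""
    records = defaultdict(dict)
    for t, channel, value in annotations:
        records[t][channel] = value
    return [{"t": t, **records[t]} for t in sorted(records)]


def render(record):
    """Flatten one record into a single bracketed line for the LLM context."""
    channels = ("words", "prosody", "visual")
    fields = [f"{c}={record[c]!r}" for c in channels if c in record]
    return f"[t={record['t']:.2f} " + " ".join(fields) + "]"


annotations = [
    (4.10, "words", "this is fine"),
    (4.10, "prosody", {"anger": 0.8, "certainty": "low"}),
    (4.10, "visual", {"ref": "fee_line_item", "affect": "frustrated"}),
]
fused = bundle(annotations)
print(render(fused[0]))
```

If a specialist drops out, its channel is simply absent from the record and `render` skips it, which is the graceful degradation late fusion is chosen for.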

The two hard parts late fusion exposes

  1. Alignment across rates. Prosody is continuous (per-frame), words are discrete (per-utterance), gestures are sparse (per-event). Binding "this anger spike → that word" is a windowed join on event-time within a tolerance — the sequencer joining features to tokens, not just ordering tokens.
  2. Compression: what survives. You can't send raw audio/video to the LLM — it's tokens at the end. Each specialist compresses its channel to a few tags, and the design question is which signal survives. The emotion/referent is precisely what's in the media but not the transcript — lose it in compression and you've discarded the reason the channel exists. Design the tag schema around the signal that's unique to each channel.
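The two points above can be illustrated together with a windowed join that binds continuous prosody frames to discrete words. This is a sketch under stated assumptions: the 0.25 s tolerance, the frame values, and the choice of "peak anger in the window" as the surviving tag are all invented for the example, not a recommendation.

```python
from bisect import bisect_left, bisect_right


def attach_prosody(words, frames, tolerance_s=0.25):
    """words: list of (t, text). frames: list of (t, anger), sorted by t.
    For each word, keep only the peak anger within +/- tolerance_s of it."""
    times = [t for t, _ in frames]
    out = []
    for t, text in words:
        lo = bisect_left(times, t - tolerance_s)
        hi = bisect_right(times, t + tolerance_s)
        window = [anger for _, anger in frames[lo:hi]]
        out.append({
            "t": t,
            "words": text,
            # Compression step: many frames collapse to one tag (or none).
            "anger_peak": max(window) if window else None,
        })
    return out


frames = [(3.0, 0.1), (4.0, 0.6), (4.1, 0.8), (4.2, 0.7), (5.0, 0.2)]
words = [(4.10, "this is fine"), (6.0, "ok")]
joined = attach_prosody(words, frames)
print(joined)
```

Here "this is fine" picks up the 0.8 spike from the three frames around it, while "ok" has no frame within tolerance and gets no prosody tag. A mean instead of a peak would flatten the spike, which is exactly the kind of signal loss the tag schema has to be designed against.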

6 · Escalation: tiers of one persistent session

The session store is the source of truth and is transport-agnostic — so moving text→voice→video is binding a new transport to existing state, not a rewrite. This is the load-bearing decision (make it in step 2 below).

Tier  | Transport                  | Latency budget               | Cost        | Concurrency/GPU
Text  | WS/HTTP                    | seconds                      | cheap       | many
Voice | WebRTC: VAD→STT→LLM→TTS    | sub-second (mouth-to-mouth)  | GPU slot    | few
Video | + vision specialist        | sub-second + 200ms vision    | GPU slot ++ | fewer
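"Binding a new transport to existing state" can be sketched as below. The field names follow the session{id, identity, history, state} shape used in this doc; the `transport` field, the `TIERS` tuple, and the `rebind` helper are hypothetical, and a real store would live outside the worker process.

```python
from dataclasses import dataclass, field

TIERS = ("text", "voice", "video")  # cheap async tier first, real-time tiers after


@dataclass
class Session:
    id: str
    identity: str
    history: list = field(default_factory=list)
    state: dict = field(default_factory=dict)
    transport: str = "text"


def rebind(session, transport):
    """Escalate or downgrade by swapping the transport only.
    History and state are deliberately left untouched."""
    if transport not in TIERS:
        raise ValueError(f"unknown transport: {transport}")
    session.transport = transport
    return session


session = Session(id="s1", identity="user-42")
session.history.append("user: this is fine")
history_before = session.history

rebind(session, "voice")   # text → voice
rebind(session, "video")   # voice → video
print(session.transport, session.history)
```

The same history list object survives both switches, so the LLM context carries over without a rewrite. Downgrading is the same call with a cheaper tier.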

7 · Build order (the "layer a constraint" rhythm)

Each step runs before the next exists — present it this way; let the interviewer "add a requirement" and each requirement is the next step.

  1. Text messaging — session store + stateless handlers + LLM. Baseline.
  2. Persistent, transport-agnostic session{id, identity, history, state}; workers stateless. ← the decision that makes everything else cheap.
  3. Voice channel — WebRTC + VAD→STT→LLM→TTS, reusing the same store + LLM.
  4. Sequencer — event-time calibration + watermark commit; needed the moment two streams coexist.
  5. Fusion — specialists emit tags; bundle co-temporal annotations; design the tag schema around channel-unique signal.
  6. Video channel — add the 200ms vision specialist; the sequencer's window widens to its latency; nothing else changes (that's the payoff of step 4).
  7. Escalation controller — triggers + handoff + downgrade.
  8. Real-time hardening — turn-taking/endpointing, barge-in re-sequencing, AEC3.
  9. Scale & economics — scheduler (text tier vs GPU voice/video slots), batch the shared LLM, persist KV-cache across escalation (~7× headroom from KV-cache compression — see voice_agents_notes.md), cost per session per tier.

8 · Connections to the rest of the prep set

This design                                  | Reuses pattern from
Sequencer watermark commit                   | 09_web_crawler Queue.join() (don't finalize in-flight)
Watermark = min over streams; idle straggler | 15_training_pipeline all-reduce barrier
Commit on watermark OR deadline              | 14_inference_server size-OR-timer trigger
Barge-in cancel                              | 10_audio_pipeline_interruptible (single Event)
Shared LLM, KV-cache = session memory        | 14_inference_server (token = unit of work)
Tiers = work units → capacity units          | README resource scheduler thesis
Event-time merge (heap)                      | algorithms_cheatsheet heap / k-way merge
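The "commit on watermark OR deadline" row can be read as a two-condition predicate, sketched below. The function name, its parameters, and the 250 ms default deadline are assumptions for illustration; the doc does not fix a deadline value.

```python
def ready_to_commit(event_ms, buffered_since_ms, now_ms, watermark_ms, deadline_ms=250):
    """An event may be committed when the watermark has passed it,
    or when it has waited in the buffer for the full deadline."""
    covered = event_ms <= watermark_ms
    timed_out = now_ms - buffered_since_ms >= deadline_ms
    return covered or timed_out


# Watermark still behind and only 20 ms buffered: keep holding.
hold = ready_to_commit(1000, buffered_since_ms=1080, now_ms=1100, watermark_ms=900)

# Watermark has caught up: commit normally.
by_watermark = ready_to_commit(1000, buffered_since_ms=1080, now_ms=1100, watermark_ms=1000)

# An idle stream stalls the watermark; the deadline forces the commit.
by_deadline = ready_to_commit(1000, buffered_since_ms=1080, now_ms=1330, watermark_ms=900)

print(hold, by_watermark, by_deadline)
```

The deadline trades correctness for latency: an annotation that shows up after a forced commit has missed its bundle, so the deadline should sit comfortably above the slowest stream's latency.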

9 · Things to say out loud (interview ammo)

Open questions for the next pass