Multimodal Fusion & Co-Sequencing
Multimodal Conversational Agent — Fusion & Co-Sequencing Design
A design doc for a messaging service that handles text, voice, and video at once and can escalate between them, where the streams carry complementary information (words, tone, what’s being shown) that must be fused on a shared event-time timeline before reaching the LLM.
Centerpiece: multimodal fusion + co-sequencing. Escalation tiers and batch transcription are supporting sections.
0. The thesis (say this first)
Everything becomes tokens at the end — but the modalities are not redundant transcripts to dedup. They are complementary signals to fuse. Each specialist model compresses its channel into structured annotations on a shared event-time timeline; a sequencer aligns and bundles co-temporal annotations into enriched tokens; the LLM consumes one fused stream. The two hard parts are (a) aligning streams that have different latencies and rates on event-time and (b) choosing what survives each channel’s compression — because the whole point is to keep the signal that lives in the audio/video but not the words.
Two reframes that drive everything below:
- Co-sequencing is a stream-join on event-time, not processing-time. (Flink/ Dataflow watermark model.)
- Escalation is a resource-tier upgrade of one persistent session. Text = cheap async tier; voice/video = expensive real-time tier. Same brain, different transport + latency budget + cost.
1. Why naive ordering is wrong
The modalities don’t carry the same thing:
| Channel | Carries | Example signal |
|---|---|---|
| text / STT | propositional — the literal words | “this is fine” |
| audio prosody | affective / pragmatic — tone, stress, hesitation | clipped, ↑pitch, anger 0.8 |
| video | referential / deictic — what’s shown / pointed at, expression | points at a fee, frustrated face |
"this is fine" + angry tone + frustrated face is a complaint. The
transcript alone reads as approval. For a bank, that’s the difference
between catching a churning customer and missing them. The non-verbal signal is
exactly the thing absent from the words — so you cannot flatten to text and
order by arrival.
2. The core problem: event-time vs processing-time
Streams arrive on separate async paths with different, structural latencies:
user types "actually no—" ───────────────●──────────▶ ~instant
user speaks "send it to..." ──────●·······▶ +80ms STT ────▶ arrives late
user gestures [points at X] ──●·······▶ +200ms vision ─────▶ arrives later
│ │ │
EVENT EVENT EVENT time it actually happened
└─────┴───────┘
order by EVENT time, NOT by which pipeline finished first
Order by arrival and the instant text “actually no—” lands before the slow-transcribed speech it was correcting → the LLM sees the conversation backwards. Order by event-time and it’s correct.
The concrete timing model (the crux)
Processing latency is per-modality and roughly constant:
- audio integration ≈ 80 ms
- visual integration ≈ 200 ms
So a sound and a sight that were simultaneous at the edge arrive at the sequencer ~120 ms apart. This is a known structural offset, not random jitter — which makes it correctable by construction (same idea as A/V lip-sync in a video player).
Two consequences that define the design:
- Latency calibration → recover event-time. Stamp at capture, then
subtract the channel’s known latency:
event_time = arrival_time − stream_latency # audio −80ms, visual −200msNow a co-temporal sound + gesture land on the same event-time axis and fuse correctly.
- Reorder window ≥ max inter-stream skew. You must hold the timeline open at least ~120 ms (visual − audio, + jitter margin) before committing, or the slower visual event arrives after you flushed the audio it belonged with. → The commit delay is set by your SLOWEST modality (visual, 200ms), not the fastest. Your fusion window can never be tighter than your slowest channel.
This is the latency-vs-correctness knob, made physical.
3. Architecture
EDGE (stamp event-time at capture)
mic ──┐ camera ──┐ keyboard ──┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ STT + prosody│ │ vision model │ │ text (0 lag)│ ← SPECIALISTS
│ ~80ms │ │ ~200ms │ │ │ each compresses
│ words+affect │ │ refs+affect │ │ tokens │ its channel to
└──────┬───────┘ └──────┬───────┘ └──────┬──────┘ structured tags
│ annotations │ annotations │
└─────────────────┼────────────────────┘
▼
┌────────────────────────┐
│ SEQUENCER │ • calibrate: arrival − latency
│ (event-time join) │ • buffer in a reorder window
│ │ • WATERMARK: commit prefix when
│ │ "seen everything ≤ T"
│ │ • bundle co-temporal annotations
└───────────┬─────────────┘ into ONE enriched token
▼
┌────────────────────────┐
│ FUSION │ late fusion: merge the tags into
│ (annotations→context) │ LLM-readable context per event
└───────────┬─────────────┘
▼
┌────────────────────────┐ ┌──────────────────────┐
│ SHARED LLM (batched) │◀────▶│ SESSION STORE │
│ KV-cache = session │ │ id·identity·history· │
│ working memory │ │ state (transport- │
└───────────┬─────────────┘ │ agnostic, survives │
▼ │ modality switch) │
response → TTS / text └──────────────────────┘
4. The Sequencer (the heart)
A k-way merge keyed on event-time, with a watermark commit. Structurally the
same “don’t finalize while work is in flight” as the web-crawler’s Queue.join()
— here the in-flight thing is a stream that hasn’t reported up to time T yet.
on event e from stream s:
e.event_time = e.arrival_time - LATENCY[s] # calibrate
buffer.push(e) # min-heap on event_time
watermark[s] = e.event_time # this stream is current to here
W = min(watermark over all active streams) # global safe time
while buffer.peek().event_time <= W: # everything ≤ W has arrived
emit(buffer.pop()) # commit in event-time order
- Watermark
W = min over streams— you can only safely commit up to the point your slowest-reporting stream has reached. One quiet stream holds the line (same straggler dynamic as the training all-reduce barrier). - Idle-stream timeout. If a stream goes silent (user stops gesturing), its watermark would freeze the timeline forever. Advance it with a heartbeat / max-wait so a silent modality can’t stall the conversation. (Mirrors the batcher’s size-OR-timer: commit on watermark OR deadline.)
- Co-temporal bundling. Events within a small ε window at the same event-time are fused into one enriched token rather than emitted separately (the prosody spike belongs to that word).
5. Fusion: early vs late
EARLY fusion LATE fusion ✅ (build this)
raw features → one big each stream → its specialist →
multimodal model structured tags → fuse the tags
richest, $$$, tightly coupled, cheap, modular, swappable models,
one model must do everything degrades gracefully if one drops
Late fusion is the practical choice. Each specialist emits typed annotations; the token becomes a multi-channel record:
[t=4.10 words="this is fine"]
[t=4.10 prosody={anger:0.8, certainty:low}]
[t=4.10 visual={ref:"fee_line_item", affect:"frustrated"}]
→ fused LLM context:
User said "this is fine" but tone=angry and pointing at the fee
→ likely sarcastic; treat as a complaint about the fee.
The two hard parts late fusion exposes
- Alignment across rates. Prosody is continuous (per-frame), words are discrete (per-utterance), gestures are sparse (per-event). Binding “this anger spike → that word” is a windowed join on event-time within a tolerance — the sequencer joining features to tokens, not just ordering tokens.
- Compression: what survives. You can’t send raw audio/video to the LLM — it’s tokens at the end. Each specialist compresses its channel to a few tags, and the design question is which signal survives. The emotion/referent is precisely what’s in the media but not the transcript — lose it in compression and you’ve discarded the reason the channel exists. Design the tag schema around the signal that’s unique to each channel.
6. Escalation: tiers of one persistent session
The session store is the source of truth and is transport-agnostic — so moving text→voice→video is binding a new transport to existing state, not a rewrite. This is the load-bearing decision (make it in step 2 below).
| Tier | Transport | Latency budget | Cost | Concurrency/GPU |
|---|---|---|---|---|
| Text | WS/HTTP | seconds | cheap | many |
| Voice | WebRTC: VAD→STT→LLM→TTS | sub-second (mouth-to-mouth) | GPU slot | few |
| Video | + vision specialist | sub-second + 200ms vision | GPU slot ++ | fewer |
- Escalation triggers: explicit ask · detected frustration (now measurable from the prosody/visual tags!) · task complexity · identity verification · human handoff.
- Handoff: provision the tier’s GPU slot, attach the existing session (history + intent + fused affect state) — the voice tier starts already knowing the user is frustrated about the fee. Downgrade path symmetric.
- Barge-in re-sequences, not just cancels. When the human interrupts the bot,
you (1) cancel TTS instantly (
10_audio_pipeline_interruptible— single Event, single-chunk latency) AND (2) treat the interruption as a new high-priority event that may reorder/discard buffered tokens in the sequencer.
7. Build order (the “layer a constraint” rhythm)
Each step runs before the next exists — present it this way; let the interviewer “add a requirement” and each requirement is the next step.
- Text messaging — session store + stateless handlers + LLM. Baseline.
- Persistent, transport-agnostic session —
{id, identity, history, state}; workers stateless. ← the decision that makes everything else cheap. - Voice channel — WebRTC + VAD→STT→LLM→TTS, reusing the same store + LLM.
- Sequencer — event-time calibration + watermark commit; needed the moment two streams coexist.
- Fusion — specialists emit tags; bundle co-temporal annotations; design the tag schema around channel-unique signal.
- Video channel — add the 200ms vision specialist; the sequencer’s window widens to its latency; nothing else changes (that’s the payoff of step 4).
- Escalation controller — triggers + handoff + downgrade.
- Real-time hardening — turn-taking/endpointing, barge-in re-sequencing, AEC3.
- Scale & economics — scheduler (text tier vs GPU voice/video slots), batch the shared LLM, persist KV-cache across escalation (~7× headroom from KV-cache compression — see voice_agents_notes.md), cost per session per tier.
8. Connections to the rest of the prep set
| This design | Reuses pattern from |
|---|---|
| Sequencer watermark commit | 09_web_crawler Queue.join() (don’t finalize in-flight) |
| Watermark = min over streams; idle straggler | 15_training_pipeline all-reduce barrier |
| Commit on watermark OR deadline | 14_inference_server size-OR-timer trigger |
| Barge-in cancel | 10_audio_pipeline_interruptible (single Event) |
| Shared LLM, KV-cache = session memory | 14_inference_server (token = unit of work) |
| Tiers = work units → capacity units | README resource scheduler thesis |
| Event-time merge (heap) | algorithms_cheatsheet heap / k-way merge |
9. Things to say out loud (interview ammo)
- “This is a stream-join on event-time — the modalities are sources; the hard part is the reorder window and the watermark that decides when the timeline is safe to commit.”
- “Audio is ~80ms, visual ~200ms, so simultaneous edge events arrive ~120ms apart. I calibrate by subtracting per-stream latency and size the reorder window to the slowest modality — correctness is bounded by your slowest channel.”
- “I use late fusion — specialists emit structured tags — so models are swappable and it degrades gracefully. The schema is designed around the signal unique to each channel: tone from audio, referents from video.”
- “Escalation is a tier upgrade of one persistent session, not a new conversation — which is why I make the session store transport-agnostic on day one.”
- “Barge-in doesn’t just cancel TTS — it re-sequences, because the interruption is a new high-priority event in the timeline.”
Open questions for the next pass
- Mouth-to-mouth latency target (banks usually want <800ms) → budget breakdown across VAD/STT/LLM/TTS + network + the 200ms vision tax.
- Self-host vs API per hop (see voice_agents_notes.md cost tables).
- Tag schema: exact fields per modality; how much affect detail the LLM can actually use.
- Failure modes: one specialist down (degrade to remaining channels), watermark starvation, clock skew across edge devices.