From the attention math up to block allocation, swap-vs-recompute, and why barge-in cancel is cheap. Companion to voice_stack_master.html (§10 memory · §11 LLM core · §17 scheduler) and 14_inference_server.py.
The KV-cache does not store your tokens. It stores the per-token work the model already did —
each past token's key and value vectors, at every layer. Generation reuses that work
instead of redoing it. Everything else — paging, swapping, prefix reuse, barge-in — is just
memory management over those vectors, and it behaves exactly like an OS managing virtual memory.
Why it exists: without it, generating each new token re-runs the full forward pass over the whole sequence — O(N) per token, O(N²) total. The cache makes it O(1) new compute per token.
Why it's reusable: attention is causal — a token only looks backward — so a past token's K/V never change when you append more tokens. They're frozen the moment they're computed.
Why it's a systems problem: the cache is big and lives in finite GPU memory, so something must allocate, evict, swap, and rebuild it — that "something" is the inference server's scheduler.
1 · What's actually in the cache — keys and values, per token, per layer
When a token is processed, each layer projects it into three vectors: a query ("what I'm
looking for"), a key ("what I'm about"), and a value ("what I contribute"). The cache
stores the key and value of every past token — not the text, the vectors. To generate
the next token, the model computes only its query and attends over every cached key/value.
blue = cached past tokens · green = the token being generated (its k/v gets appended) · amber = the attention weighting
# one decode step — what "using the already-computed tokens" means
def decode_step(new_token, cache):
x = embed(new_token)
for layer in layers:
q, k, v = layer.project(x) # ONLY the new token's q,k,v
cache[layer].append(k, v) # store this token's k,v
K, V = cache[layer].all() # all prior tokens — read, not recomputed
attn = softmax(q @ K.T) @ V # <-- the new token USES every past token here
x = layer.feedforward(attn)
return sample(x)
2 · Why reuse is valid — causality + determinism
Attention is causal: a token only ever attends backward. So appending new tokens
cannot change any earlier token's key/value. Token 5's k/v depend on tokens 1–5, are computed
once, and are frozen forever after. That single property gives you append-only growth and exact reuse.
Property
Consequence
Append-only
decode just adds k/v at the end; nothing earlier moves or is invalidated.
Exact reuse
an identical prefix yields bit-identical k/v (same tokens → same projections), so reusing stored vectors is not an approximation — it's the same numbers.
Prefix, not arbitrary
change one early token and every k/v after it changes (they all attended to it). So a cache match is a prefix match — it breaks at the first divergence.
3 · Prefill vs decode — the same loop, batched differently
prefill fills the cache in one dense pass · decode appends one entry per generated token, reading the whole cache each step
4 · PagedAttention — virtual memory for attention
KV is stored in fixed-size blocks (e.g. 16 tokens), non-contiguous in HBM, addressed
through a per-sequence block table — exactly like an OS maps virtual pages to physical frames.
This kills fragmentation (any free block works) and lets sequences share a prefix by pointing
their block tables at the same physical blocks.
red = a physical block shared by two sequences' prefixes · grey = free blocks in the pool · the block table is the page table
5 · Loading cached tokens — prefix reuse, not recompute
In steady state a turn's prompt is mostly cached prefix + a small new suffix (the latest user
message). The runtime hashes the prompt in blocks, matches the leading blocks against resident KV, and
points the new request's block table at those existing physical blocks. Prefill then runs only
on the new suffix, attending to the reused prefix. "Loading the cached tokens" is a metadata
operation — assuming they're still resident.
Where the cached blocks are
"Load" cost
Resident in HBM
free — point the block table at them (the common turn-to-turn case)
Swapped to CPU DRAM
PCIe copy back into HBM (swap_in)
Evicted entirely
recompute — re-prefill from the token IDs (the only case that's truly "new tokens")
Why per-turn voice latency stays low
Each turn, the prior turn's prompt+reply becomes the cached prefix, so you only prefill the newest
user utterance on top of a reused prefix — not the whole conversation. That's the low-TTFT trick
(voice_stack §11/§14). On the Claude/API path it's the same idea via cache_control
breakpoints: cached tokens return as a cheap cache-read, you never see the blocks.
6 · The memory hierarchy & who copies what — it bottoms out in a memcpy
solid = a real byte copy over a bus · dashed red = no KV copy, regenerate by computation · the session store is the floor: lose everything and re-prefill from text
The chain of custody for a token's KV
allocate a block on prefill/decode → hold it while resident → under pressure evict / swap / free
→ rehydrate from CPU DRAM (PCIe), a remote tier (RDMA), or recompute from the session-store tokens.
The KV is always derived; only the text is truly persisted.
7 · Eviction & preemption — swap and recompute are alternatives, not steps
When the pool fills, the scheduler frees the cheapest-to-lose blocks. Preemption picks
one policy per sequence: swap (copy KV out to DRAM, copy back later) orrecompute
(drop the blocks, re-prefill from tokens when resumed). You never recompute something you swapped — if you
preserved the bytes, you copy them back.
Trigger (in order of cheapness)
What happens
A sequence finishes
its KV is freed immediately → blocks return to the pool. The constant churn; usually enough on its own.
Evictable prefix-cache blocks
dropped LRU — safe because recomputable from tokens.
Preempt a running sequence
under real pressure: recompute (default) or swap, resume later. Can even happen mid-generation.
The usual flow is recompute — i.e. don't swap at all
vLLM's V1 makes recomputation the default preemption mode: drop the blocks, re-prefill on resume.
For typical context lengths, re-running prefill on fast GPU compute beats dragging gigabytes of KV across
a ~32 GB/s PCIe link. Swap wins only when contexts are very long (re-prefill gets expensive), the
interconnect is fast (NVLink / Grace-Hopper C2C ~900 GB/s), or a high-reuse offload tier amortizes the copy.
8 · Who manages it, where — and self-host vs API changes the answer
Actor
Lane
Role re: KV
Serving runtime scheduler + block manager (vLLM)
GPU
the only thing that touches KV tensors: allocate/free blocks, evict, swap, preempt, prefix-cache reuse. HBM ↔ DRAM.
Orchestrator (e.g. Pipecat)
CPU
manages KV indirectly, at request granularity — submit (prefill), stream, abort (free), and choose context (what gets prefilled/cached). Never sees a tensor; talks over the network.
Session store
CPU / DB
holds no KV — the source it's rebuilt from (re-prefill on loss).
Self-host vs API — the fork that changes "who"
Self-host (vLLM):you own the KV scheduler — preemption, eviction, swap, prefix cache are yours to run and tune.
Claude / OpenAI API: KV is the provider's, fully server-side. Your only lever is prompt-cache breakpoints (cache_control) plus what you send. Abort = close the stream. You cannot swap, evict, inspect, or rewind. "Who manages the KV" = the provider.
Two schedulers, two altitudes (the resource-scheduler thesis twice): the orchestrator schedules
sessions (which conversations exist, what context each sends); the serving runtime schedules
tokens/blocks across all sessions sharing the GPU. Neither sees the other's units.
9 · Barge-in — cancel is just a free
Everything expensive here — swap, recompute, PCIe copies — is about preserving and restoring
KV. Barge-in discards it, which is the cheap direction. There is no cache surgery, no rewind,
no swap.
On barge-in
Cost
Abort the decode
remove the sequence from the running batch — same routine path as hitting end-of-sentence
Free the KV blocks
metadata op — return blocks to the pool, no copy, no recompute
Flush pending output
drop in-flight audio (TTS buffer · egress queue · client playout) — CPU/network, for responsiveness
Commit the turn
record the heard prefix (what actually played), not the generated text
The deflated model
Barge-in = abort decode (free KV) + flush pending output + commit the heard prefix. The cache is
never edited — it's freed and naturally rebuilt next turn by normal prefill (prefix-cache accelerated).
The only bookkeeping you can't skip: commit what was delivered, not generated, or the next
turn's context desyncs from what the user heard. If you commit-on-delivery rather than
commit-on-generation, even that is automatic. The only sunk cost is the FLOPs already spent on the unheard
tail — aborting promptly stops further waste, it doesn't add any.
Full-duplex footnote: there isn't even an abort — yielding is the model choosing silence on the
next frame. No batch removal, no free. Cheaper still (voice_stack §11).
10 · It's a memory allocator + scheduler — say it out loud
Say out loud
"The KV-cache stores each past token's key/value vectors, not the tokens. A new token's query attends over them — that's how prior tokens are used without recompute."
"Reuse is valid because attention is causal: appending tokens never changes earlier k/v. Append-only, and an identical prefix gives bit-identical k/v — so prefix caching is exact."
"PagedAttention is virtual memory for attention — fixed blocks, a block-table page-table, non-contiguous to avoid fragmentation, shared prefixes via copy-on-write."
"Getting a block back is swap (copy KV bytes over PCIe) or recompute (regenerate from token IDs) — and recompute usually wins because the bus is narrower than the GPU is fast."
"It's a resource scheduler: fragmentation (paging), fairness (don't starve sessions), reclamation (free/evict/swap), online (admit by token budget, not request count)."
"Barge-in cancel is a free, not a rewind — discarding KV is the cheap direction; you only pay copies/recompute when you try to keep a cache alive."
This maps to
Pattern
Admit by KV/token budget · continuous batching
14_inference_server.py
Block allocator avoiding fragmentation
README resource-scheduler thesis
KV is derived; rebuild from the store
voice_stack_master.html §10
Session pins a GPU slot = KV residency
voice_stack_master.html §17
Weights loading (safetensors/mmap) — the other GPU load
gpu_data_movement.html · 11_safetensors_mmap.py
Verify-before-quoting: PCIe/NVLink bandwidths, block sizes, and the vLLM-V1 default-to-recompute behavior are as-captured — confirm exact figures before citing live.