How the Low-Level GPU KV-Cache Works

From the attention math up to block allocation, swap-vs-recompute, and why barge-in cancel is cheap. Companion to voice_stack_master.html (§10 memory · §11 LLM core · §17 scheduler) and 14_inference_server.py.
0 · Thesis: the cache stores work, not tokens 1 · What's actually in the cache 2 · Why reuse is valid: causality + determinism 3 · Prefill vs decode 4 · PagedAttention: virtual memory for attention 5 · Loading cached tokens (prefix reuse) 6 · The memory hierarchy & who copies what 7 · Eviction & preemption: swap vs recompute 8 · Who manages it, where 9 · Barge-in: cancel is just a free 10 · It's a memory allocator + scheduler

0 · Thesis — the cache stores work, not tokens

The KV-cache does not store your tokens. It stores the per-token work the model already did — each past token's key and value vectors, at every layer. Generation reuses that work instead of redoing it. Everything else — paging, swapping, prefix reuse, barge-in — is just memory management over those vectors, and it behaves exactly like an OS managing virtual memory.

1 · What's actually in the cache — keys and values, per token, per layer

When a token is processed, each layer projects it into three vectors: a query ("what I'm looking for"), a key ("what I'm about"), and a value ("what I contribute"). The cache stores the key and value of every past token — not the text, the vectors. To generate the next token, the model computes only its query and attends over every cached key/value.

KV-cache (one layer): every past token's key + value k₁ · v₁tok 1 k₂ · v₂tok 2 k₃ · v₃tok 3 k₄ · v₄tok 4 k₅ · v₅tok 5 k₆ · v₆appended now new token → query q₆ scoreᵢ = q₆ · kᵢ softmax(q₆·K) → weights how much to pull from each vᵢ Σ weightᵢ · vᵢ → next token output Only q₆/k₆/v₆ are computed fresh. k₁..k₅ and v₁..v₅ are read from cache, never recomputed — that is the entire savings.
blue = cached past tokens · green = the token being generated (its k/v gets appended) · amber = the attention weighting
# one decode step — what "using the already-computed tokens" means def decode_step(new_token, cache): x = embed(new_token) for layer in layers: q, k, v = layer.project(x) # ONLY the new token's q,k,v cache[layer].append(k, v) # store this token's k,v K, V = cache[layer].all() # all prior tokens — read, not recomputed attn = softmax(q @ K.T) @ V # <-- the new token USES every past token here x = layer.feedforward(attn) return sample(x)

2 · Why reuse is valid — causality + determinism

Attention is causal: a token only ever attends backward. So appending new tokens cannot change any earlier token's key/value. Token 5's k/v depend on tokens 1–5, are computed once, and are frozen forever after. That single property gives you append-only growth and exact reuse.
PropertyConsequence
Append-onlydecode just adds k/v at the end; nothing earlier moves or is invalidated.
Exact reusean identical prefix yields bit-identical k/v (same tokens → same projections), so reusing stored vectors is not an approximation — it's the same numbers.
Prefix, not arbitrarychange one early token and every k/v after it changes (they all attended to it). So a cache match is a prefix match — it breaks at the first divergence.

3 · Prefill vs decode — the same loop, batched differently

PREFILL — input tokens t1 t2 t3 tN one parallel forward pass over all N N key/value pairs filled at once compute-bound matrix × matrix · high arithmetic intensity fast; this is what prompt-caching reuses DECODE — output tokens 1 new token attend to all cached k/v append this token's k/v loop ×1 per output token memory-bandwidth bound matrix × vector · low arithmetic intensity the "pure decode" regime — worst GPU utilization
prefill fills the cache in one dense pass · decode appends one entry per generated token, reading the whole cache each step

4 · PagedAttention — virtual memory for attention

KV is stored in fixed-size blocks (e.g. 16 tokens), non-contiguous in HBM, addressed through a per-sequence block table — exactly like an OS maps virtual pages to physical frames. This kills fragmentation (any free block works) and lets sequences share a prefix by pointing their block tables at the same physical blocks.
seq A · block table B0 → phys 7 B1 → phys 2 B2 → phys 9 B3 → phys 4 seq B · block table B0 → phys 7 (shared) B1 → phys 2 (shared) B2 → phys 5 physical KV blocks in HBM (non-contiguous) phys 7A·B0 / B·B0 free phys 2A·B1 / B·B1 free phys 4A·B3 phys 5B·B2 phys 9A·B2 free free free phys 7 & 2 are shared — seq A and seq B have the same prefix → reuse by reference (copy-on-write) non-contiguous mapping ⇒ any free block fits ⇒ no fragmentation, the way OS paging avoids it
red = a physical block shared by two sequences' prefixes · grey = free blocks in the pool · the block table is the page table

5 · Loading cached tokens — prefix reuse, not recompute

In steady state a turn's prompt is mostly cached prefix + a small new suffix (the latest user message). The runtime hashes the prompt in blocks, matches the leading blocks against resident KV, and points the new request's block table at those existing physical blocks. Prefill then runs only on the new suffix, attending to the reused prefix. "Loading the cached tokens" is a metadata operation — assuming they're still resident.

Where the cached blocks are"Load" cost
Resident in HBMfree — point the block table at them (the common turn-to-turn case)
Swapped to CPU DRAMPCIe copy back into HBM (swap_in)
Evicted entirelyrecompute — re-prefill from the token IDs (the only case that's truly "new tokens")
Why per-turn voice latency stays low
Each turn, the prior turn's prompt+reply becomes the cached prefix, so you only prefill the newest user utterance on top of a reused prefix — not the whole conversation. That's the low-TTFT trick (voice_stack §11/§14). On the Claude/API path it's the same idea via cache_control breakpoints: cached tokens return as a cheap cache-read, you never see the blocks.

6 · The memory hierarchy & who copies what — it bottoms out in a memcpy

GPU HBM active KV · ~2–3 TB/s attention reads here CPU host DRAM swapped-out blocks SSD / remote KV offload tier · LMCache/Mooncake tokens session store source of truth swap_in swap_out PCIe ~32 GB/s · cudaMemcpyAsync RDMA / NVMe recompute: copy tiny token IDs → re-run prefill → regenerate KV in HBM (FLOPs, not a copy) two ways to get a block back: SWAP — copy the KV bytes cost = KV_size / PCIe bandwidth KV is big · narrow bus → 10s–100s ms RECOMPUTE — regenerate from tokens cost = prefill FLOPs on fast compute usually wins for typical lengths
solid = a real byte copy over a bus · dashed red = no KV copy, regenerate by computation · the session store is the floor: lose everything and re-prefill from text
The chain of custody for a token's KV
allocate a block on prefill/decode → hold it while resident → under pressure evict / swap / free → rehydrate from CPU DRAM (PCIe), a remote tier (RDMA), or recompute from the session-store tokens. The KV is always derived; only the text is truly persisted.

7 · Eviction & preemption — swap and recompute are alternatives, not steps

When the pool fills, the scheduler frees the cheapest-to-lose blocks. Preemption picks one policy per sequence: swap (copy KV out to DRAM, copy back later) or recompute (drop the blocks, re-prefill from tokens when resumed). You never recompute something you swapped — if you preserved the bytes, you copy them back.
Trigger (in order of cheapness)What happens
A sequence finishesits KV is freed immediately → blocks return to the pool. The constant churn; usually enough on its own.
Evictable prefix-cache blocksdropped LRU — safe because recomputable from tokens.
Preempt a running sequenceunder real pressure: recompute (default) or swap, resume later. Can even happen mid-generation.
The usual flow is recompute — i.e. don't swap at all
vLLM's V1 makes recomputation the default preemption mode: drop the blocks, re-prefill on resume. For typical context lengths, re-running prefill on fast GPU compute beats dragging gigabytes of KV across a ~32 GB/s PCIe link. Swap wins only when contexts are very long (re-prefill gets expensive), the interconnect is fast (NVLink / Grace-Hopper C2C ~900 GB/s), or a high-reuse offload tier amortizes the copy.

8 · Who manages it, where — and self-host vs API changes the answer

ActorLaneRole re: KV
Serving runtime scheduler + block manager (vLLM)GPUthe only thing that touches KV tensors: allocate/free blocks, evict, swap, preempt, prefix-cache reuse. HBM ↔ DRAM.
Orchestrator (e.g. Pipecat)CPUmanages KV indirectly, at request granularity — submit (prefill), stream, abort (free), and choose context (what gets prefilled/cached). Never sees a tensor; talks over the network.
Session storeCPU / DBholds no KV — the source it's rebuilt from (re-prefill on loss).
Self-host vs API — the fork that changes "who"

Two schedulers, two altitudes (the resource-scheduler thesis twice): the orchestrator schedules sessions (which conversations exist, what context each sends); the serving runtime schedules tokens/blocks across all sessions sharing the GPU. Neither sees the other's units.

9 · Barge-in — cancel is just a free

Everything expensive here — swap, recompute, PCIe copies — is about preserving and restoring KV. Barge-in discards it, which is the cheap direction. There is no cache surgery, no rewind, no swap.
On barge-inCost
Abort the decoderemove the sequence from the running batch — same routine path as hitting end-of-sentence
Free the KV blocksmetadata op — return blocks to the pool, no copy, no recompute
Flush pending outputdrop in-flight audio (TTS buffer · egress queue · client playout) — CPU/network, for responsiveness
Commit the turnrecord the heard prefix (what actually played), not the generated text
The deflated model
Barge-in = abort decode (free KV) + flush pending output + commit the heard prefix. The cache is never edited — it's freed and naturally rebuilt next turn by normal prefill (prefix-cache accelerated). The only bookkeeping you can't skip: commit what was delivered, not generated, or the next turn's context desyncs from what the user heard. If you commit-on-delivery rather than commit-on-generation, even that is automatic. The only sunk cost is the FLOPs already spent on the unheard tail — aborting promptly stops further waste, it doesn't add any.

Full-duplex footnote: there isn't even an abort — yielding is the model choosing silence on the next frame. No batch removal, no free. Cheaper still (voice_stack §11).

10 · It's a memory allocator + scheduler — say it out loud

Say out loud
This maps toPattern
Admit by KV/token budget · continuous batching14_inference_server.py
Block allocator avoiding fragmentationREADME resource-scheduler thesis
KV is derived; rebuild from the storevoice_stack_master.html §10
Session pins a GPU slot = KV residencyvoice_stack_master.html §17
Weights loading (safetensors/mmap) — the other GPU loadgpu_data_movement.html · 11_safetensors_mmap.py

Verify-before-quoting: PCIe/NVLink bandwidths, block sizes, and the vLLM-V1 default-to-recompute behavior are as-captured — confirm exact figures before citing live.