How the Low-Level GPU KV-Cache Works

From the attention math up to block allocation, swap-vs-recompute, and why barge-in cancel is cheap. Companion to voice_stack_master.html (§10 memory · §11 LLM core · §17 scheduler) and 14_inference_server.py.

0 · Thesis: the cache stores work, not tokens
1 · What's actually in the cache
2 · Why reuse is valid: causality + determinism
3 · Prefill vs decode
4 · PagedAttention: virtual memory for attention
5 · Loading cached tokens (prefix reuse)
6 · The memory hierarchy & who copies what
7 · Eviction & preemption: swap vs recompute
8 · Who manages it, where
9 · Barge-in: cancel is just a free
10 · It's a memory allocator + scheduler

0 · Thesis — the cache stores work, not tokens

The KV-cache does not store your tokens. It stores the per-token work the model already did — each past token's key and value vectors, at every layer. Generation reuses that work instead of redoing it. Everything else — paging, swapping, prefix reuse, barge-in — is just memory management over those vectors, and it behaves exactly like an OS managing virtual memory.

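Why it exists: without it, generating each new token re-runs the full forward pass over the whole sequence: O(N) token positions per generated token, O(N²) in total. With the cache, each step runs the projections and feed-forward for the one new token only (O(1) new per-token work); attention still reads all N cached entries, but none of them is recomputed.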
Why it's reusable: attention is causal — a token only looks backward — so a past token's K/V never change when you append more tokens. They're frozen the moment they're computed.
Why it's a systems problem: the cache is big and lives in finite GPU memory, so something must allocate, evict, swap, and rebuild it — that "something" is the inference server's scheduler.
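To make the O(N²) versus O(N) claim concrete, the sketch below counts how many token positions go through the model's per-token layers with and without a cache. The function names and the example sizes (a 1000-token prompt, 200 generated tokens) are illustrative assumptions, and the count deliberately ignores the attention reads over cached entries, which happen in both cases.

```python
def positions_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed when every step re-runs the whole sequence."""
    # Step i (0-based) re-processes the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


def positions_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed when past k/v are cached (new_tokens >= 1)."""
    # One prefill pass over the prompt yields the first new token; each later
    # step processes only the single token produced by the step before it.
    return prompt_len + (new_tokens - 1)


uncached = positions_without_cache(1000, 200)   # grows quadratically with length
cached = positions_with_cache(1000, 200)        # grows linearly with length
print(uncached, cached)
```

With these example sizes the uncached path processes 219,900 positions and the cached path 1,199.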

1 · What's actually in the cache — keys and values, per token, per layer

When a token is processed, each layer projects it into three vectors: a query ("what I'm looking for"), a key ("what I'm about"), and a value ("what I contribute"). The cache stores the key and value of every past token — not the text, the vectors. To generate the next token, the model computes only its query and attends over every cached key/value.

Figure legend: blue = cached past tokens · green = the token being generated (its k/v gets appended) · amber = the attention weighting

# one decode step: what "using the already-computed tokens" means
# (pseudocode: embed, layers, softmax, sqrt and sample are assumed helpers;
#  residual connections and normalization are omitted)
def decode_step(new_token, cache):
    x = embed(new_token)
    for layer in layers:
        q, k, v = layer.project(x)          # ONLY the new token's q, k, v
        cache[layer].append(k, v)           # store this token's k, v
        K, V = cache[layer].all()           # all tokens so far: read, not recomputed
        scale = 1.0 / sqrt(q.shape[-1])     # standard scaled dot-product factor
        attn = softmax(scale * (q @ K.T)) @ V   # <-- the new token USES every past token here
        x = layer.feedforward(attn)
    return sample(x)
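The decode step above can be checked end to end on a toy scale. The following is a minimal runnable sketch, assuming a single attention head, random weights, and random vectors standing in for token embeddings (all hypothetical, pure standard library). It shows that attending over cached k/v gives the same output as re-projecting every token from scratch.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


random.seed(0)
D = 4
Wq = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
Wk = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
Wv = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

# Cached path: project each token once, append its k/v, attend over the cache.
cache_k, cache_v, cached_out = [], [], []
for x in tokens:
    cache_k.append(matvec(Wk, x))
    cache_v.append(matvec(Wv, x))
    cached_out.append(attend(matvec(Wq, x), cache_k, cache_v))

# Uncached path: for the last position, re-project every token from scratch.
all_k = [matvec(Wk, x) for x in tokens]
all_v = [matvec(Wv, x) for x in tokens]
recomputed_last = attend(matvec(Wq, tokens[-1]), all_k, all_v)
```

The cached path performs one key and one value projection per token, while the uncached path repeats all six for a single output.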

2 · Why reuse is valid — causality + determinism

Attention is causal: a token only ever attends backward. So appending new tokens cannot change any earlier token's key/value. Token 5's k/v depend on tokens 1–5, are computed once, and are frozen forever after. That single property gives you append-only growth and exact reuse.

Property	Consequence
Append-only	decode just adds k/v at the end; nothing earlier moves or is invalidated.
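Exact reuse	an identical prefix yields identical k/v (same tokens at the same positions → same projections), so reusing stored vectors is not an approximation. The numbers are bit-identical as long as they are produced by the same kernels at the same numeric precision.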
Prefix, not arbitrary	change one early token and every k/v after it changes (they all attended to it). So a cache match is a prefix match — it breaks at the first divergence.
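The three properties in the table can be observed in a toy two-layer causal model. This is a sketch under stated assumptions: the model, its random weights, and the helper names are hypothetical, and layer 1 attends directly over raw embeddings to keep the code short. Appending a token leaves every earlier layer-2 key unchanged, while editing token 1 changes the keys at position 1 and at every later position.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def layer2_keys(embeddings, Wq1, Wk2):
    """Layer-2 keys for every position of the toy causal model."""
    keys = []
    for t, x in enumerate(embeddings):
        visible = embeddings[: t + 1]                       # causal: positions 0..t only
        hidden = attend(matvec(Wq1, x), visible, visible)   # layer-1 output at position t
        keys.append(matvec(Wk2, hidden))                    # layer-2 key at position t
    return keys


def first_divergence(a, b):
    """Index of the first position whose keys differ, or None if none do."""
    for t, (ka, kb) in enumerate(zip(a, b)):
        if any(abs(p - q) > 1e-9 for p, q in zip(ka, kb)):
            return t
    return None


random.seed(1)
D = 4
Wq1 = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
Wk2 = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
base = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(5)]

appended = base + [[0.5] * D]                               # add a token at the end
edited = [base[0], [v + 1.0 for v in base[1]]] + base[2:]   # change token 1 only

keys_base = layer2_keys(base, Wq1, Wk2)
keys_appended = layer2_keys(appended, Wq1, Wk2)
keys_edited = layer2_keys(edited, Wq1, Wk2)
```

Here `first_divergence(keys_base, keys_edited)` returns 1, the position of the edit, which is where a prefix match would stop.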

3 · Prefill vs decode — the same loop, batched differently

prefill fills the cache in one dense pass · decode appends one entry per generated token, reading the whole cache each step
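As a rough illustration of that asymmetry, the sketch below traces one sequence through both phases. The function name and tuple layout are assumptions made for this example; it tracks only how many tokens each model pass processes and how long the cache is afterwards.

```python
def cache_trace(prompt_len, new_tokens):
    """Return (phase, tokens_processed, cache_len_after) for each model pass."""
    trace = [("prefill", prompt_len, prompt_len)]   # one dense pass writes every prompt entry
    cache_len = prompt_len
    for _ in range(new_tokens):
        cache_len += 1                              # decode appends exactly one entry
        trace.append(("decode", 1, cache_len))      # and attends over the whole cache
    return trace


for phase, processed, cache_len in cache_trace(prompt_len=8, new_tokens=3):
    print(f"{phase:8s} processed={processed} cache_len={cache_len}")
```

Prefill processes many tokens in one pass; every decode pass processes one token but reads a cache that keeps growing.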

4 · PagedAttention — virtual memory for attention

KV is stored in fixed-size blocks (e.g. 16 tokens), non-contiguous in HBM, addressed through a per-sequence block table — exactly like an OS maps virtual pages to physical frames. This kills fragmentation (any free block works) and lets sequences share a prefix by pointing their block tables at the same physical blocks.

Figure legend: red = a physical block shared by two sequences' prefixes · grey = free blocks in the pool · the block table is the page table
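The block-table idea can be sketched in a few lines. This is a minimal illustration, not vLLM's implementation: `BlockPool`, `physical_slot`, and the reference-counting scheme are hypothetical names and choices made for this example, with the 16-token block size taken from the text.

```python
BLOCK_SIZE = 16  # tokens per block (the example size from the text)


class BlockPool:
    """Fixed pool of physical KV blocks with reference counts for sharing."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # any free block will do: no fragmentation
        self.refcount = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("KV pool exhausted")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        """Let another sequence point at an existing physical block."""
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:          # last user gone: back to the pool
            del self.refcount[block]
            self.free.append(block)


def physical_slot(block_table, token_index):
    """Translate a logical token position into (physical block, offset in block)."""
    return block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE


pool = BlockPool(num_blocks=8)
seq_a = [pool.allocate(), pool.allocate()]            # block table of sequence A
seq_b = [pool.share(seq_a[0]), pool.allocate()]       # B reuses A's first block
```

Token 3 of both sequences resolves to the same physical block, while token 20 resolves to a different block for each; releasing A's blocks returns only the unshared one to the pool.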

5 · Loading cached tokens — prefix reuse, not recompute

In steady state a turn's prompt is mostly cached prefix + a small new suffix (the latest user message). The runtime hashes the prompt in blocks, matches the leading blocks against resident KV, and points the new request's block table at those existing physical blocks. Prefill then runs only on the new suffix, attending to the reused prefix. "Loading the cached tokens" is a metadata operation — assuming they're still resident.

Where the cached blocks are	"Load" cost
Resident in HBM	free — point the block table at them (the common turn-to-turn case)
Swapped to CPU DRAM	PCIe copy back into HBM (`swap_in`)
Evicted entirely	recompute — re-prefill from the token IDs (the only case that's truly "new tokens")
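The matching step for the resident case might look like the sketch below. It assumes a chained hash per full block (each hash covers the block and everything before it) and a plain dict from hash to physical block id; both are illustrative choices, as are the function names and the tiny block size.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately small so the example is easy to follow


def block_hashes(token_ids):
    """Chain-hash each full block so one hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE   # ignore the partial tail block
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


def match_prefix(token_ids, resident):
    """Return (reused physical blocks, suffix tokens that still need prefill)."""
    reused = []
    for h in block_hashes(token_ids):
        if h not in resident:
            break                          # the match stops at the first divergence
        reused.append(resident[h])
    return reused, token_ids[len(reused) * BLOCK_SIZE:]


previous_turn = list(range(100, 110))                       # 10 tokens: 2 full blocks
resident = {h: i for i, h in enumerate(block_hashes(previous_turn))}

next_turn = previous_turn + [7, 8, 9]                       # cached prefix + new suffix
reused, suffix = match_prefix(next_turn, resident)

changed_start = [999] + next_turn[1:]                       # first token differs
reused_none, suffix_all = match_prefix(changed_start, resident)
```

For `next_turn`, the two leading blocks are reused and only the last five tokens are prefilled; for `changed_start`, nothing matches and the whole prompt is prefilled.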

Why per-turn voice latency stays low

Each turn, the prior turn's prompt+reply becomes the cached prefix, so you only prefill the newest user utterance on top of a reused prefix, not the whole conversation. That's the low-TTFT trick (voice_stack §11/§14). On the Claude/API path it's the same idea via cache_control breakpoints: cached tokens come back as a cheap cache read, and you never see the blocks.

6 · The memory hierarchy & who copies what — it bottoms out in a memcpy

Figure legend: solid = a real byte copy over a bus · dashed red = no KV copy, regenerate by computation · the session store is the floor: lose everything and re-prefill from text

The chain of custody for a token's KV

allocate a block on prefill/decode → hold it while resident → under pressure evict / swap / free → rehydrate from CPU DRAM (PCIe), a remote tier (RDMA), or recompute from the session-store tokens. The KV is always derived; only the text is truly persisted.
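The rehydration choice at the end of that chain reduces to a lookup on where the blocks currently live. The sketch below is one way to write it down; the `Tier` enum and `rehydrate_plan` function are hypothetical names, and the second tuple element records whether KV bytes cross a bus.

```python
from enum import Enum


class Tier(Enum):
    HBM = "resident in HBM"
    CPU_DRAM = "swapped to CPU DRAM"
    REMOTE = "held in a remote tier"
    GONE = "evicted entirely"


def rehydrate_plan(tier):
    """Return (action, copies_kv_bytes) for getting a sequence's KV back into HBM."""
    plans = {
        Tier.HBM: ("point the block table at the resident blocks", False),
        Tier.CPU_DRAM: ("swap_in: copy blocks back over PCIe", True),
        Tier.REMOTE: ("fetch blocks over RDMA", True),
        Tier.GONE: ("re-prefill from the session-store tokens", False),
    }
    return plans[tier]


for tier in Tier:
    action, copies = rehydrate_plan(tier)
    print(f"{tier.value:24s} -> {action} (byte copy: {copies})")
```

Only the two middle tiers move KV bytes; the first is a metadata update and the last regenerates the KV by computation.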

7 · Eviction & preemption — swap and recompute are alternatives, not steps

When the pool fills, the scheduler frees the cheapest-to-lose blocks. Preemption picks one policy per sequence: swap (copy KV out to DRAM, copy back later) or recompute (drop the blocks, re-prefill from tokens when resumed). You never recompute something you swapped — if you preserved the bytes, you copy them back.

Trigger (in order of cheapness)	What happens
A sequence finishes	its KV is freed immediately → blocks return to the pool. The constant churn; usually enough on its own.
Evictable prefix-cache blocks	dropped LRU — safe because recomputable from tokens.
Preempt a running sequence	under real pressure: recompute (default) or swap, resume later. Can even happen mid-generation.
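The ordering in the table can be expressed as a small reclamation routine. This is a sketch with assumed data structures, not a real scheduler: finished sequences are a dict of block counts, the prefix cache is an `OrderedDict` kept in least-recently-used order, and the choice to preempt the last running sequence first is an arbitrary policy picked for the example.

```python
from collections import OrderedDict


def reclaim(needed, finished, prefix_cache, running):
    """Free at least `needed` blocks if possible, trying the cheapest source first.

    finished:     dict seq_id -> blocks held by sequences that already completed
    prefix_cache: OrderedDict block_id -> None, least recently used first
    running:      list of (seq_id, blocks); the last entry is preempted first
    """
    freed, log = 0, []
    for seq_id in list(finished):                 # 1. finished sequences: always freed
        freed += finished.pop(seq_id)
        log.append(("free_finished", seq_id))
    while freed < needed and prefix_cache:        # 2. evictable prefix blocks, LRU first
        block_id, _ = prefix_cache.popitem(last=False)
        freed += 1
        log.append(("evict_prefix_block", block_id))
    while freed < needed and running:             # 3. real pressure: preempt a sequence
        seq_id, blocks = running.pop()
        freed += blocks
        log.append(("preempt_recompute", seq_id))
    return freed, log


freed, log = reclaim(
    needed=5,
    finished={"a": 2},
    prefix_cache=OrderedDict.fromkeys([10, 11]),
    running=[("r1", 4), ("r2", 3)],
)
```

In this run the finished sequence and both prefix-cache blocks free four blocks, so one running sequence ("r2") is preempted to reach the target, and "r1" keeps running.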

The usual flow is recompute — i.e. don't swap at all

vLLM's V1 makes recomputation the default preemption mode: drop the blocks, re-prefill on resume. For typical context lengths, re-running prefill on fast GPU compute beats dragging gigabytes of KV across a ~32 GB/s PCIe link (roughly PCIe 4.0 x16, per direction). Swap wins only when contexts are very long (re-prefill gets expensive), the interconnect is fast (NVLink / Grace-Hopper NVLink-C2C, up to ~900 GB/s aggregate), or a high-reuse offload tier amortizes the copy.
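For the copy side of that trade-off, a back-of-the-envelope estimate is easy to write down. The sketch below covers only KV size and raw transfer time; it does not model re-prefill cost, per-block transfer overhead, or bus contention. The model dimensions (32 layers, 32 KV heads, head size 128, 2-byte values) are hypothetical, and the bandwidths are the approximate figures quoted above.

```python
def kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV for one sequence: a key and a value vector per token, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def copy_seconds(num_bytes, link_gb_per_s):
    """Ideal transfer time at the given link bandwidth (GB/s, decimal gigabytes)."""
    return num_bytes / (link_gb_per_s * 1e9)


# Hypothetical model shape and context length, for illustration only.
size = kv_bytes(tokens=8192, layers=32, kv_heads=32, head_dim=128)
over_pcie = copy_seconds(size, link_gb_per_s=32)     # ~32 GB/s PCIe figure from the text
over_c2c = copy_seconds(size, link_gb_per_s=900)     # ~900 GB/s fast-interconnect figure

print(f"{size / 2**30:.1f} GiB, PCIe {over_pcie * 1e3:.0f} ms, C2C {over_c2c * 1e3:.1f} ms")
```

Under these assumptions one 8192-token sequence holds 4 GiB of KV. Models that use fewer KV heads than query heads shrink this proportionally, which is one reason the exact figures need checking per model.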

8 · Who manages it, where — and self-host vs API changes the answer

Actor	Lane	Role re: KV
Serving runtime scheduler + block manager (vLLM)	GPU	the only thing that touches KV tensors: allocate/free blocks, evict, swap, preempt, prefix-cache reuse. HBM ↔ DRAM.
Orchestrator (e.g. Pipecat)	CPU	manages KV indirectly, at request granularity — submit (prefill), stream, abort (free), and choose context (what gets prefilled/cached). Never sees a tensor; talks over the network.
Session store	CPU / DB	holds no KV — the source it's rebuilt from (re-prefill on loss).

Self-host vs API — the fork that changes "who"

Self-host (vLLM): you own the KV scheduler — preemption, eviction, swap, prefix cache are yours to run and tune.
Claude / OpenAI API: KV is the provider's, fully server-side. Your only lever is prompt-cache breakpoints (cache_control) plus what you send. Abort = close the stream. You cannot swap, evict, inspect, or rewind. "Who manages the KV" = the provider.

Two schedulers, two altitudes (the resource-scheduler thesis twice): the orchestrator schedules sessions (which conversations exist, what context each sends); the serving runtime schedules tokens/blocks across all sessions sharing the GPU. Neither sees the other's units.

9 · Barge-in — cancel is just a free

Everything expensive here — swap, recompute, PCIe copies — is about preserving and restoring KV. Barge-in discards it, which is the cheap direction. There is no cache surgery, no rewind, no swap.

On barge-in	Cost
Abort the decode	remove the sequence from the running batch: the same routine path as hitting end-of-sequence (EOS)
Free the KV blocks	metadata op — return blocks to the pool, no copy, no recompute
Flush pending output	drop in-flight audio (TTS buffer · egress queue · client playout) — CPU/network, for responsiveness
Commit the turn	record the heard prefix (what actually played), not the generated text

The deflated model

Barge-in = abort decode (free KV) + flush pending output + commit the heard prefix. The cache is never edited — it's freed and naturally rebuilt next turn by normal prefill (prefix-cache accelerated). The only bookkeeping you can't skip: commit what was delivered, not generated, or the next turn's context desyncs from what the user heard. If you commit-on-delivery rather than commit-on-generation, even that is automatic. The only sunk cost is the FLOPs already spent on the unheard tail — aborting promptly stops further waste, it doesn't add any.
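The deflated model fits in one short handler. The sketch below is illustrative only: `Turn`, `barge_in`, and the way playback progress is tracked (a character count of what was actually played) are assumptions for this example, not the orchestrator's or the serving runtime's real interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    seq_id: str
    block_table: list                 # physical KV blocks held by this sequence
    generated_text: str               # everything the model produced so far
    played_chars: int                 # how much of it the user actually heard
    pending_audio: list = field(default_factory=list)


def barge_in(turn, running, free_blocks, transcript):
    """Abort decode, free KV, flush pending output, commit only the heard prefix."""
    running.discard(turn.seq_id)             # abort: drop from the running batch
    free_blocks.extend(turn.block_table)     # free: metadata only, no copy, no recompute
    turn.block_table = []
    turn.pending_audio.clear()               # flush: drop audio that has not played
    heard = turn.generated_text[:turn.played_chars]
    transcript.append(("assistant", heard))  # commit what was delivered, not generated
    return heard


running, free_blocks, transcript = {"s1", "s2"}, [0, 1], []
turn = Turn("s1", [3, 4], "Sure, the forecast for tomorrow is sunny", 5, [b"a", b"b"])
heard = barge_in(turn, running, free_blocks, transcript)
```

After the call the transcript holds only "Sure," so the next turn's prompt matches what the user heard, and the freed blocks are immediately available to other sequences.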

Full-duplex footnote: there isn't even an abort — yielding is the model choosing silence on the next frame. No batch removal, no free. Cheaper still (voice_stack §11).

10 · It's a memory allocator + scheduler — say it out loud

Say out loud

"The KV-cache stores each past token's key/value vectors, not the tokens. A new token's query attends over them — that's how prior tokens are used without recompute."
"Reuse is valid because attention is causal: appending tokens never changes earlier k/v. Append-only, and an identical prefix gives bit-identical k/v — so prefix caching is exact."
"PagedAttention is virtual memory for attention — fixed blocks, a block-table page-table, non-contiguous to avoid fragmentation, shared prefixes via copy-on-write."
"Getting a block back is swap (copy KV bytes over PCIe) or recompute (regenerate from token IDs) — and recompute usually wins because the bus is narrower than the GPU is fast."
"It's a resource scheduler: fragmentation (paging), fairness (don't starve sessions), reclamation (free/evict/swap), online (admit by token budget, not request count)."
"Barge-in cancel is a free, not a rewind — discarding KV is the cheap direction; you only pay copies/recompute when you try to keep a cache alive."

This maps to	Pattern
Admit by KV/token budget · continuous batching	`14_inference_server.py`
Block allocator avoiding fragmentation	README resource-scheduler thesis
KV is derived; rebuild from the store	`voice_stack_master.html` §10
Session pins a GPU slot = KV residency	`voice_stack_master.html` §17
Weights loading (safetensors/mmap) — the other GPU load	`gpu_data_movement.html` · `11_safetensors_mmap.py`
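The "admit by token budget, not request count" pattern from the list and table above can be sketched as follows. This is a simplified illustration, not the logic in `14_inference_server.py`: the `admit` function is hypothetical, it reserves blocks for the full prompt plus the maximum reply up front (a conservative assumption), and it stops at the first request that does not fit so that large requests are not starved by smaller ones behind them.

```python
def admit(waiting, free_blocks, block_size=16):
    """Admit requests in arrival order while their KV fits the free block budget.

    waiting: list of (request_id, prompt_tokens, max_new_tokens), oldest first
    Returns (admitted request ids, blocks still free).
    """
    admitted = []
    for request_id, prompt_tokens, max_new_tokens in waiting:
        total = prompt_tokens + max_new_tokens
        blocks_needed = -(-total // block_size)      # ceiling division
        if blocks_needed > free_blocks:
            break                                    # FIFO: do not skip ahead
        free_blocks -= blocks_needed
        admitted.append(request_id)
    return admitted, free_blocks


queue = [("a", 100, 28), ("b", 1000, 200), ("c", 10, 6)]
admitted, remaining = admit(queue, free_blocks=20)
```

With 20 free blocks, request "a" (8 blocks) is admitted, "b" (75 blocks) does not fit, and "c" waits behind it even though it would fit. Three requests in the queue say little on their own; the block budget decides.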

Verify-before-quoting: PCIe/NVLink bandwidths, block sizes, and the vLLM-V1 default-to-recompute behavior are as-captured — confirm exact figures before citing live.