LLM Inference Server — Batching & Disaggregation

The on-domain problem: thousands of single-line queries arrive concurrently; batch them onto a GPU efficiently. Code: 14_inference_server.py

The whole interview in one line: a request is not the unit of work — a token is. Everything below follows from that. When to batch, how big, how to split the pipeline, and why fixed batches waste the GPU all reduce to "account for tokens, not requests."

The Pipeline — stages with different resource profiles

The key architectural move is recognizing that the path from text → text is made of stages that want different hardware and scale on different curves. Each one wants its own pool.

[Figure: the pipeline drawn as one box per stage.] Each box is its own pool, sized independently. The GPU node receives token ids, not strings: a compact payload, with no per-request CPU spike on the hot machine.

Why split. Tokenize is pure CPU (BPE merges, regex). Run it on the GPU box and it steals cycles and fights the GIL exactly when you need to keep the GPU fed — classic noisy neighbor. Split off, it scales independently: N cheap CPU workers fan into M expensive GPU workers, sized separately.

The cost. Each split adds a network hop and a serialization boundary. Token IDs are small, so the hop is cheap, but it is still one more thing that can fail or time out. Rule: only disaggregate a stage big enough to amortize the hop. Worth it on a 70B model, not on a tiny one.

Detok is special. Detokenization is harder to fully disaggregate because of streaming: with byte-level BPE, a single token can end partway through a multi-byte UTF-8 character, so you can't always render token→text in isolation. You need the running decode state. Detok often stays near the decode loop or ships incremental state.
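
A minimal sketch of why streaming detok needs running state, assuming a byte-level tokenizer whose tokens map to raw byte strings. The token pieces are made up for illustration, and stream_text is a hypothetical helper, not a function from 14_inference_server.py. Python's standard incremental UTF-8 decoder stands in for the running decode state.

```python
import codecs

def stream_text(token_bytes):
    """Yield printable text chunks from a stream of per-token byte strings."""
    # The incremental decoder buffers bytes of an unfinished character.
    decoder = codecs.getincrementaldecoder("utf-8")()
    for piece in token_bytes:
        text = decoder.decode(piece)  # "" while a character is still incomplete
        if text:
            yield text

# "é" is two bytes in UTF-8 (0xC3 0xA9); assume the tokenizer split them
# across two tokens.
pieces = [b"caf", b"\xc3", b"\xa9", b"!"]
chunks = list(stream_text(pieces))

# Decoding the lone first byte in isolation fails, which is the whole problem.
try:
    b"\xc3".decode("utf-8")
    isolated_ok = True
except UnicodeDecodeError:
    isolated_ok = False
```

The decoder object is the state that has to live next to the token stream. Moving detok to another pool means either keeping one decoder per sequence there or shipping its buffered bytes along with each token.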

Stage 1 — When To Fire a Batch — the trigger

Fire on size ≥ cap OR timer expired — whichever first.

Timer caps tail latency when traffic is thin — a lone request doesn't wait forever for batch-mates that aren't coming.
Size cap maximizes throughput when traffic is heavy — fill the batch and go.
One background loop owns the queue + timer. Submitters just put a future and await it — they never race each other to assemble batches.

This is the part most candidates fumble: they try to make each submitter decide when to flush. Don't. A single consumer loop assembling batches is the clean pattern (_collect() in the code: block for the first item, then pull more until size cap, token budget, or deadline).
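
One way the single-consumer trigger could look. This is a sketch under assumptions, not the contents of 14_inference_server.py: the Batcher class, its parameter names, and the toy run_batch callable are illustrative, and the token budget is left out to keep the size-or-timer logic visible.

```python
import asyncio

class Batcher:
    def __init__(self, run_batch, max_size=4, max_wait_s=0.02):
        self.run_batch = run_batch    # callable: list of items -> list of results
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        # Submitters only enqueue a future and await it; they never flush.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def _collect(self):
        batch = [await self.queue.get()]           # block for the first item
        loop = asyncio.get_running_loop()
        deadline = loop.time() + self.max_wait_s   # timer starts at first item
        while len(batch) < self.max_size:          # size cap
            remaining = deadline - loop.time()
            if remaining <= 0:
                break                              # timer expired
            try:
                batch.append(await asyncio.wait_for(self.queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        return batch

    async def run(self):
        # The one background loop that owns the queue and the timer.
        while True:
            batch = await self._collect()
            results = self.run_batch([item for item, _ in batch])
            for (_, fut), res in zip(batch, results):
                fut.set_result(res)

async def demo():
    sizes = []

    def run_batch(items):             # stand-in for the GPU call
        sizes.append(len(items))
        return [s.upper() for s in items]

    batcher = Batcher(run_batch, max_size=4, max_wait_s=0.02)
    worker = asyncio.create_task(batcher.run())
    out = await asyncio.gather(*(batcher.submit(f"q{i}") for i in range(5)))
    worker.cancel()
    return out, sizes

out, sizes = asyncio.run(demo())
```

With five concurrent submitters and a cap of four, the first batch fires on size and the leftover request fires on the timer. A real server would also need error handling so that a failing run_batch resolves the waiting futures with an exception.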

Stage 2 — How Big a Batch Can Be — capacity = memory

max_batch_size = 100 is a guess. The real limit isn't a request count — it's KV-cache bytes, and inputs are variable-length.

A batch of 100 short prompts and a batch of 100 long prompts use wildly different amounts of memory. So you admit by a token budget, not headcount:

    capacity_tokens = total_kv_bytes / bytes_per_token
    admit request r while sum(b.n_tokens for b in batch) + r.n_tokens ≤ capacity_tokens
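
A worked example of the capacity arithmetic. For a standard transformer KV cache, each token stores one key and one value vector per layer per KV head. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 16 GiB KV budget are illustrative assumptions, not figures from the demo.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Factor 2 = one key vector plus one value vector per layer per head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def capacity_tokens(total_kv_bytes, bytes_per_token):
    return total_kv_bytes // bytes_per_token

# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
bpt = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2)

# Assumed budget: 16 GiB of GPU memory left over for KV cache.
cap = capacity_tokens(16 * 1024**3, bpt)
```

Under these assumptions each token costs 512 KiB of KV cache and the budget holds 32768 tokens. That could be 32 requests of 1024 tokens or 4 requests of 8192 tokens, which is why a request count says little.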

In the code, _collect() tracks running tokens and, when the next request would blow max_tokens, puts it back for the following batch (deferred_for_budget). The demo's peak batch lands at 4080 tokens under a 4096 budget — full but never over.
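
The admission rule can be sketched as below. This is not the _collect() from 14_inference_server.py. The Request type, the collect function, and the request sizes are hypothetical, and the sketch assumes FIFO order, so it stops at the first request that does not fit instead of searching for a smaller one.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    n_tokens: int

def collect(pending, max_tokens, max_size):
    """Pop requests off the queue until the size cap or token budget is hit."""
    batch, used = [], 0
    while pending and len(batch) < max_size:
        nxt = pending[0]                      # peek, do not pop yet
        if nxt.n_tokens > max_tokens:
            # It could never be admitted; a real server would reject it upstream.
            raise ValueError(f"request {nxt.rid} exceeds the token budget")
        if used + nxt.n_tokens > max_tokens:
            break                             # stays at the head: deferred for budget
        batch.append(pending.popleft())
        used += nxt.n_tokens
    return batch, used

pending = deque(Request(i, n) for i, n in enumerate([1500, 1500, 1000, 200, 96]))
batch1, used1 = collect(pending, max_tokens=4096, max_size=100)
batch2, used2 = collect(pending, max_tokens=4096, max_size=100)
```

The first batch stops at 4000 tokens because the 200-token request would push it to 4200, and that request leads the next batch instead. Strict FIFO is simple and fair. Skipping ahead to a smaller request would pack tighter at the risk of starving large ones.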

How you actually find that number

Method	What you do	Reality
Static / offline	Sweep batch size, watch where latency knees up or you OOM, pin just under.	Where most interview answers stop. Fine for fixed-length work.
Memory-budgeted	Admit by summed sequence length vs KV-cache budget (above).	The right answer for LLMs — variable-length inputs make headcount meaningless.
Continuous	Measure free KV blocks live; admit/evict every token step.	What vLLM / TGI actually do. See Stage 4.

Stage 3 — Prefill / Decode Disaggregation — inside the GPU

The same "different profile → different pool" logic applies within the GPU work:

Aspect	Prefill	Decode
What	Process the whole prompt at once	Generate one token at a time, autoregressively
Bound by	Compute (FLOPs) — bursty, short	Memory bandwidth — sustained, long
Shape	Runs hot for a moment	Streams for the whole response
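
A back-of-envelope estimate of why the two phases are bound by different resources. It assumes the weight matmuls dominate, about 2 FLOPs per parameter per token, and that every weight is read once per forward pass. It ignores attention and KV-cache reads, so treat the numbers as rough orders of magnitude. matmul_intensity is a hypothetical helper.

```python
def matmul_intensity(tokens_per_pass, dtype_bytes=2):
    """Approximate FLOPs per weight byte read in one forward pass.

    FLOPs      ~= 2 * params * tokens_per_pass
    bytes read ~= params * dtype_bytes
    The parameter count cancels out.
    """
    return 2 * tokens_per_pass / dtype_bytes

prefill = matmul_intensity(2048)        # one 2048-token prompt in a single pass
decode_single = matmul_intensity(1)     # one sequence, one new token per step
decode_batched = matmul_intensity(64)   # 64 sequences decoding together
```

A modern GPU typically needs on the order of a hundred or more FLOPs per byte before compute becomes the limit. Prefill clears that bar easily. Single-sequence decode is far below it, so it waits on memory, and batching more sequences into each decode step is what moves it back toward the compute limit.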

SOTA stacks (vLLM, TensorRT-LLM, DeepSeek's serving system) can put these on separate GPU pools and ship the KV-cache between them.

[Figure: prefill pool → KV-cache transfer (drawn in purple) → decode pool.] Prefill is compute-bound and bursty; decode is memory-bandwidth-bound and sustained. The curves are distinct, so each phase gets its own GPU pool.
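
Shipping the KV-cache is the hop cost of this split, and it is easy to size. The figures below are assumptions for illustration: 512 KiB of KV cache per token (a 32-layer, 32-KV-head, head-dimension-128 fp16 shape) and a 50 GB/s link between pools. kv_transfer is a hypothetical helper.

```python
def kv_transfer(prompt_tokens, bytes_per_token, link_bytes_per_s):
    """Return (payload bytes, ideal transfer seconds) for one prompt's KV cache."""
    payload = prompt_tokens * bytes_per_token
    return payload, payload / link_bytes_per_s

# Assumed: 2048-token prompt, 512 KiB of KV per token, 50 GB/s interconnect.
payload, seconds = kv_transfer(2048, 512 * 1024, 50e9)
```

Under these assumptions a single 2048-token prompt ships 1 GiB and takes roughly 21 ms on an ideal link. That is tolerable next to a long decode, but it is far heavier than the token-id hop after tokenize.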

Stage 4 — Continuous Batching — why fixed batches waste the GPU

In autoregressive decode, requests finish at different times. A fixed batch waits for the SLOWEST sequence — short replies sit idle, GPU slots wasted.

Fixed: take a batch, run until all done, then take the next. The one 500-token reply holds 99 finished 5-token replies hostage.
Continuous: each token step, evict finished sequences and admit waiting ones into the freed KV slots. Batch composition changes every step; the GPU stays full.

[Figure: slot occupancy over time, fixed vs continuous batching. Green = active sequences; dashed = idle slots wasted under fixed batching; indigo = waiting requests admitted into freed KV slots the moment a sequence finishes.] The GPU stays full instead of waiting on the straggler.

The code's continuous_batching_demo() demonstrates it on slot occupancy with mixed output lengths.
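
A toy simulation of the same comparison. It is a sketch, not the continuous_batching_demo() from 14_inference_server.py. It assumes one token per active sequence per step, a fixed number of slots, FIFO admission, and a made-up workload of output lengths.

```python
from collections import deque

def fixed_steps(lengths, slots):
    """Fixed batching: each batch runs until its longest sequence finishes."""
    steps, busy = 0, 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        steps += max(batch)       # everyone waits for the straggler
        busy += sum(batch)        # slot-steps that produced a token
    return steps, busy / (steps * slots)

def continuous_steps(lengths, slots):
    """Continuous batching: evict finished and admit waiting at every step."""
    waiting = deque(lengths)
    active = []                   # remaining tokens per running sequence
    steps, busy = 0, 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())       # fill freed slots
        busy += len(active)
        steps += 1
        active = [r - 1 for r in active if r > 1]  # drop sequences that finished
    return steps, busy / (steps * slots)

# Hypothetical workload: one 50-token reply for every three 5-token replies.
lengths = [50, 5, 5, 5] * 10
fixed_n, fixed_util = fixed_steps(lengths, slots=4)
cont_n, cont_util = continuous_steps(lengths, slots=4)
```

Both schedules generate the same 650 tokens. The fixed schedule needs 500 steps at 32.5% slot occupancy because every batch is held for its 50-token member, while the continuous schedule finishes in well under half that time because freed slots are refilled immediately.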

The Progression To Say Out Loud

Four moves, in order

Start: size + timer batcher — throughput vs tail-latency knob.
Then: capacity is memory, not headcount → admit by token budget, because inputs are variable-length.
Then: disaggregate stages with distinct profiles → CPU tokenize pool, GPU prefill pool, GPU decode pool, CPU detok. Each scales on its own curve; cost is one hop per split.
Then: continuous batching — fixed batches waste the GPU waiting on the slowest sequence; admit/evict per token step instead.

One-liner per axis: tokenize scales with request rate · prefill scales with prompt length · decode scales with concurrent sequences × output length. Different curves → different pools. And the binding capacity everywhere downstream of tokenize is KV-cache bytes, which is why a token — not a request — is the unit of work.