The on-domain problem: thousands of single-line queries arrive concurrently; batch them onto a GPU efficiently. Code: 14_inference_server.py
The whole interview in one line: a request is not the unit of work — a token is.
Everything below follows from that. When to batch, how big, how to split the
pipeline, and why fixed batches waste the GPU all reduce to "account for tokens, not requests."
The Pipeline — stages with different resource profiles
The key architectural move is recognizing that the path from text → text is made of stages that want different hardware and scale on different curves. Each one wants its own pool.
Each box is its own pool, sized independently — N cheap CPU tokenizers fan into M expensive GPU workers. The GPU node receives token ids, not strings: a compact payload with no per-request CPU spike on the hot machine.
Why split. Tokenize is pure CPU (BPE merges, regex). Run it on the GPU box and it steals cycles and fights the GIL exactly when you need to keep the GPU fed — classic noisy neighbor. Split off, it scales independently: N cheap CPU workers fan into M expensive GPU workers, sized separately.
The cost. Each split adds a network hop and a serialization boundary. Token IDs are small, so the hop itself is cheap, but it is still one more thing that can fail. Rule: only disaggregate a stage big enough to amortize the hop. It is worth it on a 70B model, not on a tiny one.
Detok is special. Detokenization is harder to fully disaggregate because of streaming: with byte-level BPE, a single token can end partway through a multi-byte UTF-8 character, so you can't always render token→text in isolation. You need the running decode state. Detok often stays near the decode loop or ships incremental state.
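A minimal sketch of that "running decode state", assuming the tokenizer hands back raw bytes per token. The class name and the token pieces are hypothetical, and this is not taken from 14_inference_server.py. It leans on the standard library's incremental UTF-8 decoder, which buffers a trailing partial character until the rest arrives.

```python
import codecs


class StreamingDetokenizer:
    """Hypothetical helper: turns per-token byte pieces into printable text."""

    def __init__(self) -> None:
        # The incremental decoder keeps incomplete trailing bytes as internal state.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def push(self, token_bytes: bytes) -> str:
        # Returns only the characters that are complete so far (possibly "").
        return self._decoder.decode(token_bytes, final=False)

    def finish(self) -> str:
        # Flush at end of stream; dangling bytes become U+FFFD under errors="replace".
        return self._decoder.decode(b"", final=True)


detok = StreamingDetokenizer()
# Assumed token pieces: "é" and the emoji are each split across two tokens.
pieces = [b"caf", b"\xc3", b"\xa9 ", b"\xf0\x9f", b"\x98\x80"]
streamed = [detok.push(p) for p in pieces]
print(streamed)  # ['caf', '', 'é ', '', '😀']
```

The empty strings are the point: a stateless per-token decode would have emitted garbage there, which is why the state has to live next to the decode loop or travel with the tokens.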
Stage 1 — When To Fire a Batch — the trigger
Fire on size ≥ cap OR timer expired — whichever first.
Timer caps tail latency when traffic is thin — a lone request doesn't wait forever for batch-mates that aren't coming.
Size cap maximizes throughput when traffic is heavy — fill the batch and go.
One background loop owns the queue + timer. Submitters just put a future and await it — they never race each other to assemble batches.
This is the part most candidates fumble: they try to make each submitter decide when to flush. Don't. A single consumer loop assembling batches is the clean pattern (_collect() in the code: block for the first item, then pull more until size cap, token budget, or deadline).
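A minimal sketch of the single-consumer pattern, assuming asyncio. The Batcher class, its parameter names, and the run_batch callable are hypothetical stand-ins rather than the code in 14_inference_server.py, and this version checks only the size cap and the deadline (the token budget comes in Stage 2).

```python
import asyncio


class Batcher:
    """Hypothetical size-or-timer batcher with one consumer loop."""

    def __init__(self, run_batch, max_batch_size=8, max_wait_s=0.01):
        self._run_batch = run_batch          # assumed: list of inputs -> list of outputs
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = asyncio.Queue()

    async def submit(self, item):
        # Submitters only enqueue a future and await it; they never build batches.
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    async def _collect(self):
        # Block for the first item, then pull more until the cap or the deadline.
        batch = [await self._queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + self._max_wait_s
        while len(batch) < self._max_batch_size:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(self._queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        return batch

    async def run_forever(self):
        while True:
            batch = await self._collect()
            try:
                outputs = self._run_batch([item for item, _ in batch])
            except Exception as exc:
                # Fail the whole batch rather than leave submitters hanging.
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for (_, fut), out in zip(batch, outputs):
                if not fut.done():
                    fut.set_result(out)
```

Typical use would be to start run_forever() as a background task once, then call await batcher.submit(x) from any number of request handlers. The timer starts when the first item of a batch arrives, so a lone request waits at most max_wait_s.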
Stage 2 — How Big a Batch Can Be — capacity = memory
max_batch_size = 100 is a guess. The real limit isn't a request count — it's KV-cache bytes, and inputs are variable-length.
A batch of 100 short prompts and a batch of 100 long prompts use wildly different amounts of memory. So you admit by a token budget, not headcount:
capacity_tokens = total_kv_bytes / bytes_per_token
admit next request r while sum(b.n_tokens for b in batch) + r.n_tokens ≤ capacity_tokens
In the code, _collect() tracks running tokens and, when the next request would blow max_tokens, puts it back for the following batch (deferred_for_budget). The demo's peak batch lands at 4080 tokens under a 4096 budget — full but never over.
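The two formulas above can be sketched as follows. The function names are hypothetical and the model shape (32 layers, 32 KV heads, head dimension 128, fp16) is an illustrative assumption, not a figure from the source. The per-token size uses the standard accounting of one K and one V vector per layer per KV head.

```python
from collections import deque


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Factor 2 = one key vector plus one value vector, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def capacity_tokens(total_kv_bytes, bytes_per_token):
    return total_kv_bytes // bytes_per_token


def collect_by_budget(waiting, max_tokens, max_batch_size):
    """Admit FIFO while the summed token count fits; `waiting` holds token counts."""
    batch, used = [], 0
    while waiting and len(batch) < max_batch_size:
        n_tokens = waiting[0]
        if n_tokens > max_tokens:
            raise ValueError("single request exceeds the whole token budget")
        if used + n_tokens > max_tokens:
            break                      # leave it at the head for the next batch
        batch.append(waiting.popleft())
        used += n_tokens
    return batch, used


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
bpt = kv_bytes_per_token(32, 32, 128, 2)
print(bpt)                                  # 524288 bytes, i.e. 0.5 MiB per token
print(capacity_tokens(2 * 2**30, bpt))      # 4096 tokens in a 2 GiB KV budget

queue = deque([1000, 2000, 1000, 500, 300])
print(collect_by_budget(queue, 4096, 100))  # ([1000, 2000, 1000], 4000)
```

Stopping at the first request that does not fit keeps strict FIFO order. Skipping ahead to smaller requests would pack tighter but can starve long prompts, so that is a policy choice rather than a given.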
How you actually find that number
| Method | What you do | Reality |
| --- | --- | --- |
| Static / offline | Sweep batch size, watch where latency knees up or you OOM, pin just under. | Where most interview answers stop. Fine for fixed-length work. |
| Memory-budgeted | Admit by summed sequence length vs KV-cache budget (above). | The right answer for LLMs: variable-length inputs make headcount meaningless. |
| Continuous | Measure free KV blocks live; admit/evict every token step. | |
The same "different profile → different pool" logic applies within the GPU work:
| | Prefill | Decode |
| --- | --- | --- |
| What | Process the whole prompt at once | Generate one token at a time, autoregressively |
| Bound by | Compute (FLOPs); bursty, short | Memory bandwidth; sustained, long |
| Shape | Runs hot for a moment | Streams for the whole response |
State-of-the-art serving stacks (vLLM, TensorRT-LLM, DeepSeek's inference system) can run these on separate GPU pools and ship the KV-cache between them. The full disaggregation story:
purple = the KV-cache shipped between pools. Prefill is compute-bound and bursty; decode is memory-bandwidth-bound and sustained — distinct curves, so SOTA stacks give each its own GPU pool.
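A back-of-envelope roofline check of why the two phases land on different sides. It assumes roughly 2 FLOPs per parameter per token and one read of each weight per forward pass, and it ignores attention FLOPs and KV-cache reads. The hardware numbers are illustrative assumptions, not the specs of any particular GPU.

```python
def arithmetic_intensity(tokens_per_pass, dtype_bytes=2):
    # FLOPs per byte of weights read: (2 * params * tokens) / (params * dtype_bytes).
    return 2 * tokens_per_pass / dtype_bytes


def bound_by(tokens_per_pass, peak_flops, mem_bandwidth, dtype_bytes=2):
    # Ridge point: the intensity at which compute and memory take equal time.
    ridge = peak_flops / mem_bandwidth
    if arithmetic_intensity(tokens_per_pass, dtype_bytes) >= ridge:
        return "compute"
    return "memory bandwidth"


# Assumed accelerator: 300 TFLOP/s peak, 2 TB/s memory bandwidth (ridge = 150 FLOPs/byte).
PEAK_FLOPS = 300e12
MEM_BW = 2e12

print(bound_by(1024, PEAK_FLOPS, MEM_BW))  # prefill, 1024-token prompt -> compute
print(bound_by(8, PEAK_FLOPS, MEM_BW))     # decode, 8 sequences per step -> memory bandwidth
```

Under these assumptions a decode step only becomes compute-bound once the batch reaches the ridge point, which is one reason decode pools chase large concurrent batches while prefill pools do not need them.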
In autoregressive decode, requests finish at different times. A fixed batch waits for the SLOWEST sequence — short replies sit idle, GPU slots wasted.
Fixed: take a batch, run until all done, then take the next. The one 500-token reply holds 99 finished 5-token replies hostage.
Continuous: each token step, evict finished sequences and admit waiting ones into the freed KV slots. Batch composition changes every step; the GPU stays full.
green = active sequences · dashed = wasted idle slots under fixed batching · indigo = waiting requests admitted into freed KV slots the moment a sequence finishes. The GPU stays full instead of waiting on the straggler.
The code's continuous_batching_demo() shows this by measuring slot occupancy with mixed output lengths.
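A minimal simulation of the same comparison, counting decode steps and slot occupancy. This is a sketch, not the file's continuous_batching_demo(): the function names and the workload are hypothetical, and every sequence is assumed to produce at least one token.

```python
from collections import deque


def fixed_batching(output_lens, slots):
    """Take `slots` requests, run until all finish, then take the next group."""
    steps = busy = 0
    for i in range(0, len(output_lens), slots):
        batch = output_lens[i:i + slots]
        steps += max(batch)            # the batch waits for its slowest sequence
        busy += sum(batch)             # slot-steps that actually produced a token
    return steps, busy / (steps * slots)


def continuous_batching(output_lens, slots):
    """Each token step: admit waiting requests into free slots, evict finished ones."""
    waiting = deque(output_lens)
    active = []                        # remaining tokens for each running sequence
    steps = busy = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        busy += len(active)
        # Every active sequence emits one token; those on their last token leave.
        active = [r - 1 for r in active if r > 1]
    return steps, busy / (steps * slots)


# Assumed workload: two long replies mixed in with six short ones, 4 KV slots.
lens = [500, 5, 5, 5, 500, 5, 5, 5]
print(fixed_batching(lens, 4))       # (1000, 0.2575)
print(continuous_batching(lens, 4))  # 505 steps at roughly 51% occupancy
```

Both schedules do the same 1030 token-steps of useful work. The continuous one finishes in about half the wall-clock steps because the second long reply starts as soon as a short one frees a slot, instead of waiting for the first long reply to end.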
Start: size + timer batcher — throughput vs tail-latency knob.
Then: capacity is memory, not headcount → admit by token budget, because inputs are variable-length.
Then: disaggregate stages with distinct profiles → CPU tokenize pool, GPU prefill pool, GPU decode pool, CPU detok. Each scales on its own curve; cost is one hop per split.
Then: continuous batching — fixed batches waste the GPU waiting on the slowest sequence; admit/evict per token step instead.
One-liner per axis: tokenize scales with request rate · prefill scales with prompt length · decode scales with concurrent sequences × output length. Different curves → different pools. And the binding capacity everywhere downstream of tokenize is KV-cache bytes, which is why a token — not a request — is the unit of work.