LLM Inference Server — Batching & Disaggregation

The on-domain problem: thousands of single-line queries arrive concurrently; batch them onto a GPU efficiently. Code: 14_inference_server.py
The whole interview in one line: a request is not the unit of work — a token is. Everything below follows from that. When to batch, how big, how to split the pipeline, and why fixed batches waste the GPU all reduce to "account for tokens, not requests."

The Pipeline — stages with different resource profiles

The key architectural move is recognizing that the path from text → text is made of stages that want different hardware and scale on different curves. Each one wants its own pool.

[Figure: pipeline diagram. queries → TOKENIZE (CPU; BPE merges, regex; N cheap workers; scales with request RATE) → token ids → BATCH SCHEDULER (ONE loop owns queue + timer; admits by token budget; fires on size OR timer) → batch → GPU: prefill + decode (KV-cache bound; few expensive workers; scales with TOKENS, not request count) → token ids → DETOKENIZE (CPU; scales with output RATE) → stream out.]
Each box is its own pool, sized independently — N cheap CPU tokenizers fan into M expensive GPU workers. The GPU node receives token ids, not strings: a compact payload with no per-request CPU spike on the hot machine.

Why split. Tokenize is pure CPU (BPE merges, regex). Run it on the GPU box and it steals cycles and fights the GIL exactly when you need to keep the GPU fed — classic noisy neighbor. Split off, it scales independently: N cheap CPU workers fan into M expensive GPU workers, sized separately.

The cost. Each split adds a network hop + serialization boundary. Token IDs are small, so the hop itself is cheap, but every boundary is a new failure mode: the hop can time out or drop. Rule: only disaggregate a stage big enough to amortize the hop. That is worth it on a 70B model, not on a tiny one.

Detok is special. Detokenization is harder to fully disaggregate because of streaming. With byte-level BPE, a single token can decode to only part of a multi-byte UTF-8 character, so you can't always render token→text in isolation; you need the running decode state. Detok therefore often stays near the decode loop or ships incremental state.
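A minimal sketch of the "running decode state" idea, assuming the tokenizer exposes each token as raw bytes (the byte strings below are hypothetical, not taken from a real vocabulary). It uses the standard library's incremental UTF-8 decoder, which holds back bytes until a character is complete:

```python
import codecs

def stream_text(byte_chunks):
    """Yield printable text as token bytes arrive, holding back incomplete UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        text = decoder.decode(chunk)   # "" while a character is still incomplete
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # raises if the stream ends mid-character
    if tail:
        yield tail

# Hypothetical token bytes: a 4-byte emoji is split across two tokens.
token_bytes = [b"hi ", b"\xf0\x9f", b"\x98\x80", b"!"]
pieces = list(stream_text(token_bytes))
```

The decoder object is the state that has to live next to the stream: decoding the second chunk on its own fails, because it is not valid UTF-8 in isolation.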

Stage 1 — When To Fire a Batch — the trigger

Fire on size ≥ cap OR timer expired — whichever first.

This is the part most candidates fumble: they try to make each submitter decide when to flush. Don't. A single consumer loop assembling batches is the clean pattern (_collect() in the code: block for the first item, then pull more until size cap, token budget, or deadline).
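The single-consumer pattern can be sketched as follows. This is an illustration, not the `_collect()` from 14_inference_server.py: the name `collect_batch` and its parameters are assumptions, and it covers only the size cap and the timer (the token budget comes in Stage 2).

```python
import queue
import time

def collect_batch(q, max_size, max_wait_s):
    """Block for the first item, then fill until the size cap or the deadline."""
    batch = [q.get()]                          # block: nothing to do on an empty queue
    deadline = time.monotonic() + max_wait_s   # timer starts at the first arrival
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                              # timer fired first
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                              # deadline passed while waiting
    return batch

q = queue.Queue()
for i in range(5):
    q.put(i)
first = collect_batch(q, max_size=3, max_wait_s=0.05)   # fires on size
second = collect_batch(q, max_size=3, max_wait_s=0.05)  # fires on timer
```

Starting the timer at the first arrival bounds the extra latency any request can pay for batching at `max_wait_s`, which is the tail-latency side of the knob.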

Stage 2 — How Big a Batch Can Be — capacity = memory

max_batch_size = 100 is a guess. The real limit isn't a request count — it's KV-cache bytes, and inputs are variable-length.

A batch of 100 short prompts and a batch of 100 long prompts use wildly different amounts of memory. So you admit by a token budget, not headcount:

capacity_tokens = total_kv_bytes / bytes_per_token
admit next request r while sum(x.n_tokens for x in batch) + r.n_tokens ≤ capacity_tokens
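As a worked example of where bytes_per_token comes from: for a standard transformer, each token stores one key and one value vector per layer per KV head. The model shape and memory budget below are hypothetical, chosen only to make the arithmetic concrete.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # One K and one V vector per layer per KV head, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def capacity_tokens(total_kv_bytes, bytes_per_token):
    return total_kv_bytes // bytes_per_token

# Hypothetical model shape and KV memory budget, for illustration only.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
budget = capacity_tokens(total_kv_bytes=16 * 1024**3, bytes_per_token=per_token)
```

With these assumed numbers each token costs 128 KiB of KV-cache, so a 16 GiB budget holds 131072 tokens in total across all sequences in flight.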

In the code, _collect() tracks running tokens and, when the next request would blow max_tokens, puts it back for the following batch (deferred_for_budget). The demo's peak batch lands at 4080 tokens under a 4096 budget — full but never over.
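The defer-on-overflow behaviour might look roughly like this. It is a sketch under assumptions: the `Request` type and `admit_by_token_budget` are hypothetical names, and requests larger than the whole budget are assumed to be rejected rather than queued.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    n_tokens: int

def admit_by_token_budget(pending, max_tokens, max_size):
    """Pop requests FIFO until the next one would exceed the token budget."""
    batch, used = [], 0
    while pending and len(batch) < max_size:
        nxt = pending[0]
        if nxt.n_tokens > max_tokens:
            raise ValueError(f"{nxt.rid} can never fit in a {max_tokens}-token batch")
        if used + nxt.n_tokens > max_tokens:
            break                      # stays at the head for the following batch
        batch.append(pending.popleft())
        used += nxt.n_tokens
    return batch, used

pending = deque(Request(f"r{i}", n) for i, n in enumerate([1500, 1500, 1000, 800, 200]))
batch1, used1 = admit_by_token_budget(pending, max_tokens=4096, max_size=100)
batch2, used2 = admit_by_token_budget(pending, max_tokens=4096, max_size=100)
```

Here the first batch stops at 4000 tokens because the 800-token request would overflow, and that request leads the second batch instead. Peeking at the head before popping keeps FIFO order without a separate put-back step.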

How you actually find that number

Method | What you do | Reality
Static / offline | Sweep batch size, watch where latency knees up or you OOM, pin just under. | Where most interview answers stop. Fine for fixed-length work.
Memory-budgeted | Admit by summed sequence length vs KV-cache budget (above). | The right answer for LLMs: variable-length inputs make headcount meaningless.
Continuous | Measure free KV blocks live; admit/evict every token step. | What vLLM / TGI actually do. See Stage 4.

Stage 3 — Prefill / Decode Disaggregation — inside the GPU

The same "different profile → different pool" logic applies within the GPU work:

Property | Prefill | Decode
What | Process the whole prompt at once | Generate one token at a time, autoregressively
Bound by | Compute (FLOPs): bursty, short | Memory bandwidth: sustained, long
Shape | Runs hot for a moment | Streams for the whole response
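A back-of-envelope roofline check shows why the two phases land on opposite sides. The sketch assumes roughly 2 FLOPs per parameter per token and that each forward pass reads every weight once; it ignores attention FLOPs and KV-cache reads. The accelerator numbers are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    # ~2 FLOPs per parameter per token; each pass reads every weight once,
    # so the parameter count cancels out of the FLOPs/byte ratio.
    return 2 * tokens_per_pass / bytes_per_param

# Hypothetical accelerator numbers, for illustration only.
PEAK_FLOPS = 300e12      # FLOP/s
MEM_BANDWIDTH = 2e12     # bytes/s
ridge = PEAK_FLOPS / MEM_BANDWIDTH   # FLOPs/byte needed to become compute-bound

def bound_by(tokens_per_pass):
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute"
    return "memory bandwidth"

prefill = bound_by(tokens_per_pass=2048)  # one 2048-token prompt in a single pass
decode = bound_by(tokens_per_pass=8)      # 8 sequences, one new token each
```

Under these assumptions prefill does about 2048 FLOPs per weight byte read and decode about 8, against a ridge of 150, which is why batching more sequences helps decode so much and helps prefill very little.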

SOTA stacks (vLLM, TensorRT-LLM, DeepSeek's serving deployment) can run these on separate GPU pools and ship the KV-cache between them. The full disaggregation story:

[Figure: fully disaggregated pipeline. CPU: tokenize (scales with request rate) → GPU pool A: prefill (compute-bound; scales with prompt length) → KV transfer → GPU pool B: decode (memory-bandwidth-bound; scales with concurrent seqs × output len) → CPU: detokenize / stream (scales with output rate).]
The KV-cache is shipped between the two GPU pools. Prefill is compute-bound and bursty; decode is memory-bandwidth-bound and sustained. The curves are distinct, so SOTA stacks give each its own GPU pool.

Stage 4 — Continuous Batching — why fixed batches waste the GPU

In autoregressive decode, requests finish at different times. A fixed batch waits for the SLOWEST sequence — short replies sit idle, GPU slots wasted.
[Figure: slot occupancy over token steps. Fixed batch: seq A (500 tokens) is the straggler; slots whose short sequences have finished sit idle, held hostage, and the next batch can't start until A finishes. Continuous batch: seq A still runs for 500 tokens, but freed slots are refilled each step (E, F, G, H, I, J, … admitted as slots open).]
Under fixed batching, finished slots are wasted until the straggler completes. Under continuous batching, waiting requests are admitted into freed KV slots the moment a sequence finishes, so the GPU stays full instead of waiting on the straggler.

The code's continuous_batching_demo() proves it on slot occupancy with mixed output lengths:

fixed-batch: 17 steps, 45% slot utilization
continuous-batch: 11 steps, 70% slot utilization
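The effect is easy to reproduce with a toy slot simulation. This is an independent sketch, not `continuous_batching_demo()` itself: the function names and output lengths are made up, so its numbers differ from the demo output above.

```python
from collections import deque

def fixed_batch_steps(lengths, slots):
    # Each batch holds all its slots until its longest sequence finishes.
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batch_steps(lengths, slots):
    waiting, active, steps = deque(lengths), [], 0
    while waiting or active:
        while waiting and len(active) < slots:     # refill freed slots before each step
            active.append(waiting.popleft())
        active = [n - 1 for n in active if n > 1]  # one token each; finished ones leave
        steps += 1
    return steps

def utilization(lengths, slots, steps):
    return sum(lengths) / (steps * slots)

lengths = [8, 2, 3, 1, 2, 4, 1, 3]   # hypothetical output lengths, in tokens
fixed = fixed_batch_steps(lengths, slots=4)
cont = continuous_batch_steps(lengths, slots=4)
fixed_util = utilization(lengths, 4, fixed)
cont_util = utilization(lengths, 4, cont)
```

For this input the fixed scheduler needs 12 steps at 50% slot utilization, while the continuous one needs 8 steps at 75%. Eight is also the floor here, since the longest sequence alone takes 8 steps.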

The Progression To Say Out Loud

Four moves, in order
  1. Start: size + timer batcher — throughput vs tail-latency knob.
  2. Then: capacity is memory, not headcount → admit by token budget, because inputs are variable-length.
  3. Then: disaggregate stages with distinct profiles → CPU tokenize pool, GPU prefill pool, GPU decode pool, CPU detok. Each scales on its own curve; cost is one hop per split.
  4. Then: continuous batching — fixed batches waste the GPU waiting on the slowest sequence; admit/evict per token step instead.
One-liner per axis: tokenize scales with request rate · prefill scales with prompt length · decode scales with concurrent sequences × output length. Different curves → different pools. And the binding capacity everywhere downstream of tokenize is KV-cache bytes, which is why a token — not a request — is the unit of work.