Worked Example — The One Billion Row Challenge

1BRC in Swift: parse a ~13 GB / 1,000,000,000-line file (station;temperature), compute min/mean/max per station, as fast as possible. The parsing is trivial; the memory and concurrency are everything — so it's a live-fire test of both primers (concurrency · memory models). Grounded in a real Swift implementation (brennanMKE/BillionRowChallenge), with an honest principal-level critique.

The one lesson: the bottleneck is never the CPU doing the math — it's getting bytes to the CPU without copying or scheduling overhead. Every 10× here is a memory or concurrency win. Naive readLine() + Dictionary + Double ≈ minutes; a tuned version ≈ 1–2 s. ~100× — entirely from the systems techniques in the two primers.

1 · What the reference code gets right — the memory primer in action

The implementation memory-maps the file and parses it in place — exactly the zero-copy story (memory primer §1.7):

    // ✓ raw mmap: zero-copy, paged in on demand, parsed in place
    dataPointer = mmap(nil, fileSize, PROT_READ, MAP_PRIVATE, fd, 0)
        .assumingMemoryBound(to: CChar.self)
    close(fd)   // the mapping keeps its own reference, so fd can close
    segmentSize = fileSize / numberOfSegments

mmap, not read(): the 13 GB never gets copied into a buffer; the OS pages it from SSD lazily and evicts under pressure. A buffered read loop loses because it copies every byte.
Correct segment-boundary handling: the file is split into N segments; each worker's range is adjusted with findNextNewline — segment 0 starts at byte 0, every other segment skips its partial first line, and each extends its end to the next newline so no line is split across workers. This is the classic chunk-boundary off-by-a-line bug, handled properly.
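The boundary rule can be sketched as follows. This is a minimal illustration, not the reference's code: `segmentRanges` is a hypothetical name, the `findNextNewline` here is a stand-in written for this example (the reference's own signature is not shown above), and the buffer is typed as UInt8 rather than CChar for brevity. Each cut is computed once and used as both the end of one segment and the start of the next, so no line can be split or counted twice.

```swift
/// Stand-in helper: index of the first '\n' at or after `start`, or `bytes.count` if none.
func findNextNewline(in bytes: UnsafeBufferPointer<UInt8>, from start: Int) -> Int {
    var i = start
    while i < bytes.count && bytes[i] != 0x0A { i += 1 }
    return i
}

/// Splits `bytes` into at most `count` ranges, each starting at a line start
/// and ending just past a newline (or at end of file).
func segmentRanges(of bytes: UnsafeBufferPointer<UInt8>, count: Int) -> [Range<Int>] {
    let total = bytes.count
    guard total > 0, count > 0 else { return [] }
    let nominal = total / count
    var cuts = [0]                                   // segment 0 starts at byte 0
    for i in 1..<count {
        let target = i * nominal
        // Probe from target - 1 so a target that already is a line start is kept as is.
        let cut = target == 0 ? 0 : min(findNextNewline(in: bytes, from: target - 1) + 1, total)
        cuts.append(cut)
    }
    cuts.append(total)                               // the last segment runs to end of file
    // Adjacent cuts form the segments; drop empty ones (a line longer than a segment).
    return zip(cuts, cuts.dropFirst()).filter { $0 < $1 }.map { $0..<$1 }
}

// Usage: three requested segments over three short lines.
let sample = Array("a;1.0\nbb;2.0\nccc;3.0\n".utf8)
let ranges = sample.withUnsafeBufferPointer { segmentRanges(of: $0, count: 3) }
```

Fewer ranges than requested can come back when lines are long relative to the nominal segment size; a caller would simply run one worker per returned range.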

Memory-primer concepts exercised

mmap / zero-copy (§1.7) · sequential read = perfect prefetch (§1.5 CPU path) · addressable ≠ resident (§0.5): 13 GB mmap'd is fine on a 16 GB Mac, but on 8 GB it pages to SSD and craters — the cliff in reverse.

2 · Where the win is thrown away — the two traps

You mmap to avoid copies… then copy every line into a String one line later. The reference falls into both of the traps below, which makes it the perfect teaching contrast.

Trap 1 — a `String` (two allocations) per line

    // ✗ called 1,000,000,000 times
    extension String {
        init?(cString: UnsafePointer<CChar>, length: Int) {
            let buffer = UnsafeBufferPointer(start: cString, count: length)
            self.init(bytes: buffer.map { UInt8(bitPattern: $0) },  // alloc #1: Array<UInt8>
                      encoding: .utf8)                              // alloc #2: String storage + copy
        }
    }

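Up to two heap allocations + two copies per line, ×1B. The .map builds a throwaway Array<UInt8>; String(bytes:encoding:) then copies the bytes again into String storage. (On 64-bit platforms Swift stores strings of up to 15 UTF-8 bytes inline, so the shortest lines may skip the second allocation, but never the copies.) This is the heap (memory primer §1.8) on the hottest possible path, so the mmap zero-copy win is undone immediately.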
It also defeats cache locality: each String is a fresh heap object, scattered, cache-cold — the opposite of the in-place, prefetch-friendly mmap bytes.
Latent correctness bug: String(cString:length:) is failable → returns nil on any invalid UTF-8 byte → the iterator returns nil → the segment ends early, silently dropping the rest of its data.
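The failure mode is easy to reproduce in isolation. The snippet below is a small sketch with made-up byte arrays standing in for lines; it does not use the reference's iterator, only the same Foundation initializer, and the `lines` / `parsed` names are illustrative.

```swift
import Foundation

let valid = Array("Oslo;4.2".utf8)
let invalid: [UInt8] = [0x4F, 0x73, 0xFF, 0x6F]        // 0xFF never occurs in well-formed UTF-8

let strict = String(bytes: invalid, encoding: .utf8)   // failable: nil for this input
let lossy = String(decoding: invalid, as: UTF8.self)   // non-failable: repairs with U+FFFD

// A loop that treats nil as "end of data" cannot tell a bad line from the end:
// it stops at the first nil and never reaches the valid line after it.
let lines: [[UInt8]] = [valid, invalid, valid]
var iterator = lines.makeIterator()
var parsed = 0
while let bytes = iterator.next(), String(bytes: bytes, encoding: .utf8) != nil {
    parsed += 1
}
// parsed is 1 here, not 2
```

Parsing the raw bytes directly sidesteps the question, because no UTF-8 validation step sits between the data and the aggregation.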

Trap 2 — `AsyncSequence` for in-memory parsing

    // ✗ a suspension point per line, for zero I/O benefit
    struct LineIterator: AsyncIteratorProtocol {
        mutating func next() async -> String? { ... }   // bytes are already resident!
    }

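There is no I/O to await: the bytes are mmap'd, and even a page fault stalls the thread rather than suspending the task. Marking the per-line read async adds a potential suspension point, plus the async calling overhead, a billion times for nothing.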
This is the forward-progress lesson inverted (concurrency primer §6.5): async earns its keep when a task suspends on real I/O and frees a cooperative-pool thread. CPU-bound in-memory parsing never suspends — so async here is pure overhead, not a win.

3 · The fix — both primers applied

Never make a String. Parse the bytes in place, key the map on the raw name bytes, parse the temperature as an integer, and fan out one synchronous worker per core. Every change maps to a primer concept.

Fix	Why it's faster	Primer concept
Parse bytes in place (no String)	kills 2 allocs + 2 copies per line; stays on the cache-hot mmap pages	memory §1.7 zero-copy / §1.8 heap-avoidance
Key map on raw name bytes	~400 distinct stations → the only allocations are bounded (one per new station), not per line	memory §1.8 (heap = the bounded, shared thing)
Integer temp parse (xx.x → int*10)	integer ALU, no float parse/round; the NN-engine-style "avoid the slow unit" move	memory §1.5 (execution unit)
Open-addressing hash map, sized to fit cache	1B lookups → a cache miss per lookup is the whole runtime; keep it in L1/L2	memory §1.5 ("cache misses are the enemy")
`concurrentPerform`, 1 worker/core, private maps	CPU-bound, never blocks; no locks during parse; merge once	concurrency §6.5 (≈core-count, no exhaustion) + §7 (private state → no race)
Synchronous, not `async`	no suspension points on work that never suspends	concurrency §6.5 (async only pays off on real I/O)
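The first four rows of the table can be sketched together. This is an illustrative sketch under stated assumptions, not the reference's code and not a tuned implementation: `parseTenths`, `StationTable`, and `scan` are hypothetical names, FNV-1a is used only because it is short, the default slot count is arbitrary, the table never resizes, and the input is assumed well-formed (non-empty name, one ';', one fractional digit).

```swift
/// Parses bytes like "-12.3" or "4.5" in [start, end) into tenths of a degree (-123, 45).
func parseTenths(_ bytes: UnsafeBufferPointer<UInt8>, from start: Int, to end: Int) -> Int {
    var i = start
    let negative = bytes[i] == 0x2D                           // '-'
    if negative { i += 1 }
    var value = 0
    while i < end {
        let b = bytes[i]
        if b != 0x2E { value = value * 10 + Int(b) - 0x30 }   // skip '.', accumulate digits
        i += 1
    }
    return negative ? -value : value
}

/// Fixed-capacity open-addressing table keyed on the name bytes themselves.
/// Assumption: far fewer distinct stations than slots, so probing always terminates.
struct StationTable {
    struct Slot {
        var hash: UInt64 = 0
        var nameStart = 0          // offset of the name inside the mapped buffer
        var nameLength = 0         // 0 marks an empty slot
        var min = Int.max, max = Int.min, sum = 0, count = 0
    }
    private(set) var slots: [Slot]
    private let mask: Int

    init(slotCount: Int = 2048) {
        precondition(slotCount > 0 && slotCount & (slotCount - 1) == 0, "power of two required")
        slots = Array(repeating: Slot(), count: slotCount)
        mask = slotCount - 1
    }

    mutating func record(_ bytes: UnsafeBufferPointer<UInt8>, name: Range<Int>, tenths: Int) {
        var hash: UInt64 = 0xcbf2_9ce4_8422_2325              // FNV-1a offset basis
        for i in name { hash = (hash ^ UInt64(bytes[i])) &* 0x0000_0100_0000_01b3 }
        var index = Int(truncatingIfNeeded: hash) & mask
        while slots[index].nameLength != 0 {
            let s = slots[index]
            if s.hash == hash && s.nameLength == name.count
                && bytes[s.nameStart..<s.nameStart + s.nameLength].elementsEqual(bytes[name]) { break }
            index = (index + 1) & mask                        // linear probing
        }
        if slots[index].nameLength == 0 {                     // first sighting: claim the slot
            slots[index].hash = hash
            slots[index].nameStart = name.lowerBound
            slots[index].nameLength = name.count
        }
        slots[index].min = Swift.min(slots[index].min, tenths)
        slots[index].max = Swift.max(slots[index].max, tenths)
        slots[index].sum += tenths
        slots[index].count += 1
    }
}

/// Walks every line in `range` (which must begin at a line start); no String is created.
func scan(_ bytes: UnsafeBufferPointer<UInt8>, range: Range<Int>, into table: inout StationTable) {
    var lineStart = range.lowerBound
    while lineStart < range.upperBound {
        var semicolon = lineStart
        while bytes[semicolon] != 0x3B { semicolon += 1 }     // ';'
        var lineEnd = semicolon + 1
        while lineEnd < range.upperBound && bytes[lineEnd] != 0x0A { lineEnd += 1 }
        table.record(bytes, name: lineStart..<semicolon,
                     tenths: parseTenths(bytes, from: semicolon + 1, to: lineEnd))
        lineStart = lineEnd + 1
    }
}
```

Because a slot stores only an offset and length into the mapped buffer, even the first sighting of a station allocates nothing; the table's single allocation happens in `init`. A station name is turned into a String, if at all, once per station when results are printed.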

Why GCD here, not Swift Concurrency

This is pure CPU-bound compute that never blocks — DispatchQueue.concurrentPerform(iterations: cores) (GCD's parallel-for) is the natural fit: it runs exactly core-count iterations of uninterrupted work and joins. A TaskGroup works too, but you'd gain nothing and must be sure no parse step blocks a cooperative-pool thread (the forward-progress rule, §6.5). When the work is "saturate every core with non-blocking compute and reduce," GCD's parallel-for is the right tool — a case where newer isn't automatically better.
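A minimal sketch of that shape, using a toy workload (summing slices of an array) in place of the real segment parse. `parallelPartials` is a hypothetical helper name; the only API relied on is DispatchQueue.concurrentPerform(iterations:execute:), which returns once every iteration has finished. Each iteration writes only its own slot, so the workers share no mutable state.

```swift
import Foundation

/// Runs `work(index)` for each index on GCD's parallel-for and returns results in index order.
/// `work` must be synchronous; each iteration writes only its own slot, so no lock is needed.
func parallelPartials<T>(workers: Int, _ work: (Int) -> T) -> [T] {
    var slots = [T?](repeating: nil, count: workers)
    slots.withUnsafeMutableBufferPointer { buffer in
        let out = buffer                       // plain pointer copy for the closure to capture
        DispatchQueue.concurrentPerform(iterations: workers) { index in
            out[index] = work(index)
        }                                      // acts as the join: all iterations are done here
    }
    return slots.map { $0! }
}

// Usage: one worker per core, each summing its own slice.
let values = Array(1...1_000)
let workers = max(1, ProcessInfo.processInfo.activeProcessorCount)
let chunk = (values.count + workers - 1) / workers
let partialSums = parallelPartials(workers: workers) { worker -> Int in
    let lo = min(worker * chunk, values.count)
    let hi = min(lo + chunk, values.count)
    return values[lo..<hi].reduce(0, +)
}
let total = partialSums.reduce(0, +)
```

In the real program each worker would parse its own segment into a private station map and return that map instead of an Int.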

4 · The reduce step — the all-reduce, in miniature

Each worker produces a private station map; merge folds them into one. This is the same shape as the distributed-training all-reduce (the inference/training prep): parallel, independent local work → one synchronization barrier at the end. No locks during the parse (the expensive 99.99%); a single merge over ~400 stations × N workers (trivial) at the close. Contend only where contention is cheap.
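The merge itself is small enough to show whole. This sketch assumes each worker hands back a dictionary keyed by station name, with temperatures held as integer tenths of a degree; `Stats` and `mergeAll` are hypothetical names, and the String keys are an assumption made for readability (they would be created once per station at the end of a worker's run, not once per line).

```swift
struct Stats {
    var min: Int, max: Int, sum: Int, count: Int     // temperatures in tenths of a degree

    /// Folds another worker's stats for the same station into this one.
    mutating func merge(_ other: Stats) {
        min = Swift.min(min, other.min)
        max = Swift.max(max, other.max)
        sum += other.sum
        count += other.count
    }

    /// Mean in degrees; the division by 10 undoes the tenths scaling.
    var mean: Double { Double(sum) / Double(count) / 10 }
}

/// Single-threaded reduce over the per-worker maps: stations × workers merge calls in total.
func mergeAll(_ partials: [[String: Stats]]) -> [String: Stats] {
    var merged: [String: Stats] = [:]
    for partial in partials {
        merged.merge(partial) { existing, incoming in
            var combined = existing
            combined.merge(incoming)
            return combined
        }
    }
    return merged
}
```

Min, max, sum, and count all combine associatively, which is why the workers never need to coordinate until this final step.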

5 · The takeaways

mmap is necessary but not sufficient. The reference does the hard part right, then copies every line into a String — so it's slow despite the zero-copy read. Zero-copy in, zero-copy through.
Heap allocation in a hot loop is the silent killer. 2 allocs/line × 1B dominates everything. Move allocation off the per-item path (bounded per-station, not per-line).
Don't pay for async you don't use. AsyncSequence over resident memory adds suspension overhead with no suspension benefit — the forward-progress rule, read the other way.
Parallelism is the easy 2×–8×; memory layout is the 50×. One-worker-per-core matters, but cache-resident map + no-copy parse + integer math is where the order of magnitude lives.
Newer isn't automatically better. GCD's concurrentPerform beats a TaskGroup here precisely because the work never blocks — match the tool to the work shape, not the calendar.

Reference implementation: brennanMKE/BillionRowChallenge (Swift). Critique and the improved approach are this author's; the "fix" code is illustrative (pointer/byte-parsing sketch), not a drop-in. Companion primers: Apple Systems & Concurrency · Apple Memory Models.