Worked Example — The One Billion Row Challenge

1BRC in Swift: parse a ~13 GB / 1,000,000,000-line file (station;temperature), compute min/mean/max per station, as fast as possible. The parsing is trivial; the memory and concurrency are everything — so it's a live-fire test of both primers (concurrency · memory models). Grounded in a real Swift implementation (brennanMKE/BillionRowChallenge), with an honest principal-level critique.
The one lesson: the bottleneck is never the CPU doing the math — it's getting bytes to the CPU without copying or scheduling overhead. Every 10× here is a memory or concurrency win. Naive readLine() + Dictionary + Double ≈ minutes; a tuned version ≈ 1–2 s. ~100× — entirely from the systems techniques in the two primers.

1 · What the reference code gets right — the memory primer in action

The implementation memory-maps the file and parses it in place — exactly the zero-copy story (memory primer §1.7):

// ✓ raw mmap: zero-copy, paged in on demand, parsed in place
dataPointer = mmap(nil, fileSize, PROT_READ, MAP_PRIVATE, fd, 0)
    .assumingMemoryBound(to: CChar.self)
close(fd)                                   // mapping keeps its own reference, so fd can close
segmentSize = fileSize / numberOfSegments
Memory-primer concepts exercised
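mmap / zero-copy (§1.7) · sequential read = perfect prefetch (§1.5 CPU path) · addressable ≠ resident (§0.5): a 13 GB mapping is fine on a 16 GB Mac, where most of the file can stay resident between runs; on 8 GB the file-backed pages are evicted and re-read from SSD, and throughput craters (the cliff in reverse).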
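The mapping step can be sketched with the error handling the excerpt above omits. This is a minimal sketch, not the reference code: `withMappedFile` and `MapError` are hypothetical names, and the `MAP_FAILED` check, the `madvise` hint, and the `munmap` on exit are additions assumed here for illustration.

```swift
import Foundation

/// Hypothetical error type for this sketch; carries the errno of the failed call.
struct MapError: Error { let code: Int32 }

/// Maps a whole file read-only and hands the bytes to `body` without copying.
/// The buffer is only valid inside `body`; the mapping is torn down on return.
func withMappedFile<R>(atPath path: String,
                       _ body: (UnsafeRawBufferPointer) throws -> R) throws -> R {
    let fd = open(path, O_RDONLY)
    guard fd >= 0 else { throw MapError(code: errno) }
    defer { close(fd) }

    var info = stat()
    guard fstat(fd, &info) == 0 else { throw MapError(code: errno) }
    let size = Int(info.st_size)
    // mmap rejects zero-length mappings, so hand back an empty buffer instead.
    guard size > 0 else { return try body(UnsafeRawBufferPointer(start: nil, count: 0)) }

    let mapped = mmap(nil, size, PROT_READ, MAP_PRIVATE, fd, 0)
    guard let base = mapped, mapped != MAP_FAILED else { throw MapError(code: errno) }
    defer { munmap(base, size) }

    // Advisory only: tells the kernel the access pattern is one front-to-back pass.
    _ = madvise(base, size, MADV_SEQUENTIAL)

    return try body(UnsafeRawBufferPointer(start: base, count: size))
}
```

A caller would do all parsing inside the closure, for example `try withMappedFile(atPath: "measurements.txt") { bytes in ... }`, so that no pointer into the mapping outlives it.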

2 · Where the win is thrown away — the two traps

You mmap to avoid copies… then copy every line into a String one line later. The reference does both of these — which makes it the perfect teaching contrast.

Trap 1 — a String (two allocations) per line

// ✗ called 1,000,000,000 times
extension String {
    init?(cString: UnsafePointer<CChar>, length: Int) {
        let buffer = UnsafeBufferPointer(start: cString, count: length)
        self.init(bytes: buffer.map { UInt8(bitPattern: $0) },   // alloc #1: Array<UInt8>
                  encoding: .utf8)                                 // alloc #2: String storage + copy
    }
}

Trap 2 — AsyncSequence for in-memory parsing

// ✗ a suspension point per line, for zero I/O benefit
struct LineIterator: AsyncIteratorProtocol {
    mutating func next() async -> String? { ... }   // bytes are already resident!
}
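For contrast, a synchronous line walk over already-resident bytes needs neither `async` nor a String. The sketch below is an illustration under assumptions, not the reference code: `forEachLine` is a hypothetical helper, and using `memchr` to find the newline is a choice made here.

```swift
import Foundation

/// Calls `body` once per line, passing a view into `bytes` (newline excluded).
/// No copies, no allocations, no suspension points.
func forEachLine(in bytes: UnsafeRawBufferPointer,
                 _ body: (UnsafeRawBufferPointer) -> Void) {
    guard let base = bytes.baseAddress else { return }
    var offset = 0
    while offset < bytes.count {
        // Search only the unread tail for the next "\n" (0x0A).
        let hit = memchr(base + offset, 0x0A, bytes.count - offset)
        // A final line without a trailing newline ends at the end of the buffer.
        let lineEnd = hit.map { base.distance(to: UnsafeRawPointer($0)) } ?? bytes.count
        body(UnsafeRawBufferPointer(rebasing: bytes[offset..<lineEnd]))
        offset = lineEnd + 1
    }
}
```

Each view points straight into the caller's buffer, so it must not be stored past the call; that restriction is the price of skipping the copy.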

3 · The fix — both primers applied

Never make a String. Parse the bytes in place, key the map on the raw name bytes, parse the temperature as an integer, and fan out one synchronous worker per core. Every change maps to a primer concept.
let cores = ProcessInfo.processInfo.activeProcessorCount        // ~1 worker / core
var partials = [StationMap](repeating: .init(), count: cores)

// embarrassingly parallel + reduce: no shared state, no locks.
// Workers write through a buffer pointer, each to its own slot only;
// mutating the captured array variable from several threads would be a data race.
partials.withUnsafeMutableBufferPointer { out in
    DispatchQueue.concurrentPerform(iterations: cores) { i in
        let (start, end) = newlineAdjustedBounds(i)             // same boundary logic as the ref
        var local = StationMap()                                // PRIVATE map, no contention
        var p = start
        while p < end {
            let nameStart = p                                   // station: scan to ';'
            while p.pointee != semicolon { p += 1 }
            let nameLen = p - nameStart
            p += 1
            var neg = false
            if p.pointee == minus { neg = true; p += 1 }
            var val = 0                                         // temp: parse xx.x as int*10, no Double
            while p.pointee != newline {
                let c = p.pointee
                if c != dot { val = val * 10 + Int(c - zero) }
                p += 1
            }
            if neg { val = -val }
            p += 1
            local.update(nameStart, nameLen, val)               // hash the raw bytes in place
        }
        out[i] = local
    }
}
let result = merge(partials)                                    // the reduce barrier, the only sync point
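The loop above relies on `newlineAdjustedBounds`, which the source names but does not show. One way it could work is sketched here, under assumptions: this version takes the worker count and the mapped bytes as explicit parameters and returns byte offsets rather than pointers, which is a signature chosen for illustration and not the reference implementation's.

```swift
/// Splits `bytes` into `count` roughly equal chunks, then moves every interior
/// boundary forward to just past the next newline so no line straddles two workers.
func newlineAdjustedBounds(_ index: Int, of count: Int,
                           in bytes: UnsafeRawBufferPointer) -> Range<Int> {
    // Snap a rough offset to the start of a line (offset 0, or right after "\n").
    func snap(_ rough: Int) -> Int {
        if rough <= 0 { return 0 }
        if rough >= bytes.count { return bytes.count }
        var i = rough
        while i < bytes.count && bytes[i - 1] != 0x0A { i += 1 }
        return i
    }
    let chunk = bytes.count / count
    let start = snap(index * chunk)
    // The last worker always runs to the end, absorbing the division remainder.
    let end = index == count - 1 ? bytes.count : snap((index + 1) * chunk)
    return start..<end
}
```

Because worker i's end and worker i+1's start are computed by snapping the same rough offset, the ranges tile the file exactly: every line is parsed once, and a worker whose chunk falls inside one long line simply gets an empty range.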
Fix | Why it's faster | Primer concept
Parse bytes in place (no String) | kills 2 allocs + 2 copies per line; stays on the cache-hot mmap pages | memory §1.7 zero-copy / §1.8 heap-avoidance
Key map on raw name bytes | ~400 distinct stations → the only allocations are bounded (one per new station), not per line | memory §1.8 (heap = the bounded, shared thing)
Integer temp parse (xx.x → int*10) | integer ALU, no float parse/round; the NN-engine-style "avoid the slow unit" move | memory §1.5 (execution unit)
Open-addressing hash map, sized to fit cache | 1B lookups → a cache miss per lookup is the whole runtime; keep it in L1/L2 | memory §1.5 ("cache misses are the enemy")
concurrentPerform, 1 worker/core, private maps | CPU-bound, never blocks; no locks during parse; merge once | concurrency §6.5 (≈core-count, no exhaustion) + §7 (private state → no race)
Synchronous, not async | no suspension points on work that never suspends | concurrency §6.5 (async only pays off on real I/O)
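The table's map rows (raw-byte keys, open addressing, cache-sized) can be made concrete with a small sketch. The source does not specify `StationMap`'s layout, so everything here is an assumption made for illustration: the `Stats` and `Slot` types, the FNV-1a hash, linear probing, and the default capacity of 2048 slots (48 bytes each, about 96 KB, chosen on the assumption of a few hundred stations).

```swift
import Foundation

/// Running aggregate for one station; temperatures are tenths of a degree.
struct Stats {
    var min = Int.max
    var max = Int.min
    var sum = 0
    var count = 0
}

/// Fixed-size open-addressing table keyed on raw name bytes (hypothetical layout).
struct StationMap {
    struct Slot {
        var key: UnsafeRawPointer? = nil   // points into the mapped file, never copied
        var length = 0
        var stats = Stats()
    }

    private(set) var slots: [Slot]
    private var used = 0
    private let mask: Int

    init(capacity: Int = 2048) {
        precondition(capacity > 1 && capacity & (capacity - 1) == 0,
                     "capacity must be a power of two")
        slots = Array(repeating: Slot(), count: capacity)
        mask = capacity - 1
    }

    mutating func update(_ name: UnsafeRawPointer, _ length: Int, _ value: Int) {
        // FNV-1a over the name bytes; wrapping multiply is intentional.
        var hash: UInt64 = 0xcbf2_9ce4_8422_2325
        for i in 0..<length {
            hash = (hash ^ UInt64(name.load(fromByteOffset: i, as: UInt8.self)))
                &* 0x0000_0100_0000_01b3
        }
        // Linear probing: step to the next slot until we hit this key or an empty slot.
        var index = Int(truncatingIfNeeded: hash) & mask
        while let key = slots[index].key {
            if slots[index].length == length && memcmp(key, name, length) == 0 { break }
            index = (index + 1) & mask
        }
        if slots[index].key == nil {
            // Keep one slot free so probing always terminates; this sketch never resizes.
            precondition(used < mask, "table full")
            slots[index].key = name
            slots[index].length = length
            used += 1
        }
        slots[index].stats.min = Swift.min(slots[index].stats.min, value)
        slots[index].stats.max = Swift.max(slots[index].stats.max, value)
        slots[index].stats.sum += value
        slots[index].stats.count += 1
    }
}
```

The only per-station cost is claiming a slot; the per-line path is a hash, a probe, and four integer updates. Keys stay valid only while the file remains mapped, so names would be turned into Strings once, at output time.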
Why GCD here, not Swift Concurrency
This is pure CPU-bound compute that never blocks. DispatchQueue.concurrentPerform(iterations: cores) (GCD's parallel-for) is the natural fit: it runs exactly core-count iterations of uninterrupted work and joins. A TaskGroup works too, but you'd gain nothing and must be sure no parse step blocks a cooperative-pool thread (the forward-progress rule, §6.5). When the work is "saturate every core with non-blocking compute and reduce," GCD's parallel-for is the right tool, a case where newer isn't automatically better.

4 · The reduce step — the all-reduce, in miniature

Each worker produces a private station map; merge folds them into one. This is the same shape as the distributed-training all-reduce (the inference/training prep): parallel, independent local work → one synchronization barrier at the end. No locks during the parse (the expensive 99.99%); a single merge over ~400 stations × N workers (trivial) at the close. Contend only where contention is cheap.
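The fold itself is small enough to sketch. The source does not show `merge`, so this is an illustration under assumptions: `Stats` is a hypothetical aggregate type, and each worker's partial is assumed to have been converted to a `[String: Stats]` dictionary once its parse finished (a few hundred String allocations per worker, off the hot path).

```swift
/// Per-station aggregate; temperatures are stored as tenths of a degree.
struct Stats {
    var min: Int
    var max: Int
    var sum: Int
    var count: Int

    /// Mean in degrees, undoing the x10 fixed-point scaling.
    var mean: Double { Double(sum) / Double(count) / 10.0 }

    /// Combining two aggregates is associative, so merge order does not matter.
    func combined(with other: Stats) -> Stats {
        Stats(min: Swift.min(min, other.min),
              max: Swift.max(max, other.max),
              sum: sum + other.sum,
              count: count + other.count)
    }
}

/// Folds the per-worker maps into one; runs on a single thread after the join.
func merge(_ partials: [[String: Stats]]) -> [String: Stats] {
    var total: [String: Stats] = [:]
    for partial in partials {
        total.merge(partial) { $0.combined(with: $1) }
    }
    return total
}
```

Summing `sum` and `count` separately and dividing only at output is what keeps the merged mean exact; averaging the per-worker means would weight a worker with ten readings the same as one with ten million.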

5 · The takeaways

  1. mmap is necessary but not sufficient. The reference does the hard part right, then copies every line into a String — so it's slow despite the zero-copy read. Zero-copy in, zero-copy through.
  2. Heap allocation in a hot loop is the silent killer. 2 allocs/line × 1B dominates everything. Move allocation off the per-item path (bounded per-station, not per-line).
  3. Don't pay for async you don't use. AsyncSequence over resident memory adds suspension overhead with no suspension benefit — the forward-progress rule, read the other way.
  4. Parallelism is the easy 2×–8×; memory layout is the 50×. One-worker-per-core matters, but cache-resident map + no-copy parse + integer math is where the order of magnitude lives.
  5. Newer isn't automatically better. GCD's concurrentPerform is the better fit than a TaskGroup here because the work never suspends, so the async machinery buys nothing. Match the tool to the work shape, not the calendar.

Reference implementation: brennanMKE/BillionRowChallenge (Swift). Critique and the improved approach are this author's; the "fix" code is illustrative (pointer/byte-parsing sketch), not a drop-in. Companion primers: Apple Systems & Concurrency · Apple Memory Models.