GPU Data Movement & System Design

Serialization Formats

Format | Strengths | Weaknesses | When to use
Pickle | Convenient, Python-native | Fragile across Python/library versions, slow, arbitrary code execution on load | Never for cross-process or untrusted data
JSON / JSONL | Human-readable, universal | Slow, no binary support | JSONL for line-oriented streaming/logging
Protobuf | Schema'd, fast, language-agnostic, broad ecosystem | Requires serialize/deserialize step; schema management overhead | Cross-service communication, config, any place you need a stable contract
FlatBuffers | Zero-copy read directly from buffer, no parsing step | Harder to use, smaller ecosystem than Protobuf | Shoving bytes at accelerators; anywhere decode cost matters
Arrow / Parquet | Columnar; zero-copy in memory (Arrow); column pruning & predicate pushdown on disk (Parquet) | Overkill for small payloads. Wrong shape for dense tensors: columnar layout doesn't help N-dim arrays | Arrow = in-memory, Parquet = on-disk. The answer for tabular/analytics data, NOT model weights
Safetensors | 8-byte length prefix + JSON header (names, dtypes, shapes, offsets) + raw contiguous bytes. Mmap-friendly, zero-copy, no parsing of payload | Narrow scope (tensors only) | Model weights. Hugging Face (HF) created it to kill pickle's arbitrary code execution risk on a public model hub
MessagePack / CBOR | Binary JSON, compact | Less tooling than JSON/Protobuf | Middle ground when JSON is too slow but schema is overkill

Safetensors — Why It Exists

Format layout:
1. 8 bytes: header length (little-endian u64)
2. JSON header: tensor names, dtypes, shapes, byte offsets (plain UTF-8, not JSONB)
3. Raw tensor data: contiguous bytes right after the header
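The three-part layout above can be sketched with nothing but the standard library. This is a simplified illustration, not the safetensors library itself: the function names (write_tensor_file, read_header, read_tensor_bytes) are made up for this example, and it ignores details such as the optional __metadata__ header entry and alignment padding. The data_offsets entries are relative to the start of the payload, as in the real format.

```python
import json
import mmap
import os
import struct
import tempfile


def write_tensor_file(path, tensors):
    """Write {name: (dtype, shape, raw_bytes)} as length prefix + JSON header + raw bytes."""
    header, offset = {}, 0
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the first payload byte, not the start of the file.
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte little-endian u64
        f.write(header_bytes)                          # plain UTF-8 JSON
        for _dtype, _shape, raw in tensors.values():   # contiguous payload
            f.write(raw)


def read_header(path):
    """Return (header_dict, payload_start) without reading any tensor bytes."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header, 8 + header_len


def read_tensor_bytes(path, name):
    """Memory-map the file and copy out a single tensor's bytes by offset."""
    header, payload_start = read_header(path)
    begin, end = header[name]["data_offsets"]
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[payload_start + begin:payload_start + end])


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "demo.bin")
    write_tensor_file(demo, {
        "weight": ("F32", [2], struct.pack("<2f", 1.0, 2.0)),
        "bias": ("F32", [1], struct.pack("<f", 0.5)),
    })
    print(read_header(demo)[0])
    print(struct.unpack("<f", read_tensor_bytes(demo, "bias")))
```

Because the header carries every offset, a reader can pull one tensor out of a large file without touching the rest of the payload, which is the property the mmap point relies on.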
Why HF built it:
1. Security: pickle allows arbitrary code execution on load. Unacceptable for a public model hub.
2. Mmap: memory-map the file, load only the tensors you need. Matters at 70B+ params.
3. Simplicity: no executable content to exploit, no payload to decode. Header + raw bytes is all you need for dense numerical arrays.

Parquet vs Safetensors — Know Which to Reach For

Why not just send Parquet for everything?
Parquet is columnar — optimized for tabular data where you read a subset of columns. Tensors are dense N-dim arrays, not rows and columns.
Sending a tensor as Parquet = flatten to columns, pay compression/decompression, reshape back. You lose the contiguous strided layout the GPU expects and gain nothing.
Parquet = "100GB of training records." Safetensors = "100GB of model weights."

"You have 100GB of records to move to 8 GPUs — what format and why?"

The Answer:
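1. Parquet on disk, sharded so that each worker's shard fits in its host memory
2. Decode into Arrow on the host (columnar; the Parquet pages still have to be decompressed and decoded, but once in Arrow the buffers can be shared zero-copy)
3. Transfer to the device from pinned (page-locked) host memory, which lets host-to-device copies run asynchronously and overlap with compute (bonus points)
4. Each GPU worker reads its own subset of shards, so reads need no cross-worker coordination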
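Step 4 is the part that is easy to show in a few lines. The sketch below assumes a simple round-robin split over a sorted shard list; the function name shards_for_worker and the part-00000.parquet naming are hypothetical, and other deterministic schemes (contiguous ranges, hashing) would work equally well.

```python
def shards_for_worker(shard_paths, rank, world_size):
    """Return the shards owned by worker `rank` out of `world_size` workers.

    Every worker sorts the same list and takes every world_size-th entry,
    so the split is deterministic and needs no communication between workers.
    """
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return sorted(shard_paths)[rank::world_size]


if __name__ == "__main__":
    # Hypothetical shard names, 20 shards split over 8 GPU workers.
    shards = [f"part-{i:05d}.parquet" for i in range(20)]
    for rank in range(8):
        print(rank, shards_for_worker(shards, rank, 8))
```

Steps 2 and 3 depend on third-party libraries and are left out of the sketch; with pyarrow and PyTorch they would typically look like pyarrow.parquet.read_table(path) followed by tensor.pin_memory().to(device, non_blocking=True), but treat that pairing as one possible choice rather than the only one.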

Archetypes to Implement

1. Sharded In-Memory Store
Consistent hashing, rebalancing on node add/remove
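A minimal sketch of the consistent-hashing part, assuming a hash ring with virtual nodes. The class name HashRing, the vnodes parameter, and the choice of SHA-256 are illustrative assumptions; moving the actual stored values during rebalancing is left out.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: each node owns several points on a 64-bit circle."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node):
        # Virtual nodes spread each physical node around the ring for better balance.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(point, n) for point, n in self._ring if n != node]

    def node_for(self, key):
        """Return the node owning the first ring point at or after hash(key)."""
        if not self._ring:
            raise LookupError("ring is empty")
        # (h,) sorts before any (h, node), so this finds the first point >= h.
        i = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[i % len(self._ring)][1]  # wrap around the circle
```

The property worth testing in an implementation like this: removing a node only reassigns the keys that node owned, and adding a node only moves keys onto the new node.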
2. Streaming Aggregator
Chunked processing, partial result emission
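One way to sketch chunked processing with partial result emission is a generator that folds each chunk into running totals and yields a snapshot after every chunk. The names chunked and running_counts are hypothetical, and counting occurrences stands in for whatever aggregation the real stream needs.

```python
from collections import Counter
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def running_counts(events, chunk_size=1000):
    """Yield a snapshot of the cumulative counts after each processed chunk."""
    totals = Counter()
    for chunk in chunked(events, chunk_size):
        totals.update(chunk)
        yield dict(totals)  # copy, so later chunks do not mutate earlier snapshots


if __name__ == "__main__":
    for snapshot in running_counts(["a", "a", "b", "b", "c"], chunk_size=2):
        print(snapshot)
```

Memory use is bounded by the chunk size plus the size of the aggregate, not by the length of the stream, and a consumer can stop after any snapshot.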
3. Work Queue
N workers, retry on failure
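A thread-based sketch of N workers with retry on failure. The function name run_work_queue and the max_attempts parameter are assumptions made for this example; it retries immediately with no backoff, and tasks must be hashable because they are used as dictionary keys.

```python
import queue
import threading


def run_work_queue(tasks, handler, num_workers=4, max_attempts=3):
    """Run handler(task) on a thread pool; retry failures up to max_attempts times.

    Returns (results, failures): task -> value, and task -> last exception.
    """
    q = queue.Queue()
    for task in tasks:
        q.put((task, 1))
    results, failures = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: no more work
                q.task_done()
                return
            task, attempt = item
            try:
                value = handler(task)
                with lock:
                    results[task] = value
            except Exception as exc:
                if attempt < max_attempts:
                    # Re-enqueue before task_done() so q.join() keeps waiting.
                    q.put((task, attempt + 1))
                else:
                    with lock:
                        failures[task] = exc
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()              # all tasks, including retries, are finished
    for _ in threads:
        q.put(None)       # one sentinel per worker
    for t in threads:
        t.join()
    return results, failures
```

A production version would usually add backoff between attempts and distinguish retryable errors from permanent ones; both are omitted here to keep the control flow visible.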
4. Dataset Loader
Read shards, decode, batch, yield — mini DataLoader with prefetching + worker pool
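The read, decode, batch, yield loop with prefetching might be sketched as two small generators. This is a simplification under stated assumptions: shards are plain iterables of records, decode is a caller-supplied function, and a single background thread stands in for the worker pool. The names batches and prefetch are hypothetical.

```python
import queue
import threading


def batches(shards, decode, batch_size):
    """Read each shard in order, decode every record, and yield fixed-size batches."""
    batch = []
    for shard in shards:
        for record in shard:
            batch.append(decode(record))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final partial batch
        yield batch


def prefetch(iterator, depth=2):
    """Produce items on a background thread, buffering up to `depth` of them."""
    q = queue.Queue(maxsize=depth)

    def producer():
        try:
            for item in iterator:
                q.put(("item", item))  # blocks when the buffer is full
            q.put(("done", None))
        except Exception as exc:
            q.put(("error", exc))      # surface producer errors to the consumer

    # Daemon thread: if the consumer stops early, the blocked producer
    # does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()
    while True:
        kind, value = q.get()
        if kind == "done":
            return
        if kind == "error":
            raise value
        yield value


if __name__ == "__main__":
    shards = [[1, 2, 3], [4, 5]]
    for batch in prefetch(batches(shards, decode=lambda r: r * 10, batch_size=2)):
        print(batch)
```

The bounded queue is what gives backpressure: the producer stays at most depth batches ahead of the training loop instead of decoding the whole dataset into memory.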
5. Log/Event Parser
Resilient to malformed lines, never crashes the stream
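A small sketch of a parser that skips malformed lines instead of crashing, assuming the log is JSONL with one object per line. The function name parse_jsonl and the on_error callback are illustrative; other log formats would swap json.loads for their own line parser.

```python
import json


def parse_jsonl(lines, on_error=None):
    """Yield one dict per valid line; report bad lines through on_error and move on."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are not errors
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            if on_error is not None:
                on_error(lineno, line, exc)
            continue
        yield record


if __name__ == "__main__":
    raw = ['{"level": "info"}', "not json at all", "", "[1, 2]", '{"level": "warn"}']
    bad = []
    good = list(parse_jsonl(raw, on_error=lambda n, text, exc: bad.append(n)))
    print(good, bad)
```

Reporting the line number alongside the error keeps bad input observable: the stream survives, but the malformed lines can still be counted or sent to a dead-letter file.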

System Design Prompts (Their Domain)

Training Data Pipeline
Object store → shuffling → sharding → workers → GPUs
Distributed Inference Service
Request routing, batching, KV cache, model sharding
Feature / Embedding Store
Write path from training, read path with low-latency lookup
Checkpoint / Artifact System
Large blobs, versioning, fast restore