GPU Data Movement & System Design

Serialization Formats

Format	Strengths	Weaknesses	When to use
Pickle	Convenient, Python-native	Arbitrary code execution on load, fragile across Python/library versions, slow	Never for untrusted data or as a cross-service contract; only between trusted processes running the same code
JSON / JSONL	Human-readable, universal	Slow, no binary support	JSONL for line-oriented streaming/logging
Protobuf	Schema'd, fast, language-agnostic, broad ecosystem	Requires serialize/deserialize step; schema management overhead	Cross-service communication, config, any place you need a stable contract
FlatBuffers	Zero-copy read directly from buffer — no parsing step	Harder to use, smaller ecosystem than Protobuf	Shoving bytes at accelerators; anywhere decode cost matters
Arrow / Parquet	Columnar, zero-copy, column pruning & predicate pushdown	Overkill for small payloads. Wrong shape for dense tensors — columnar layout doesn't help N-dim arrays	Arrow = in-memory, Parquet = on-disk. The answer for tabular/analytics data, NOT model weights
Safetensors	8-byte length prefix + JSON header (names, dtypes, shapes, offsets) + raw contiguous bytes. Mmap-friendly, zero-copy, no parsing of payload	Narrow scope (tensors only)	Model weights. HF created it to kill pickle's arbitrary code execution risk on a public model hub
MessagePack / CBOR	Binary JSON, compact	Less tooling than JSON/Protobuf	Middle ground when JSON is too slow but schema is overkill

Safetensors — Why It Exists

Format layout:

1. 8 bytes: header length (little-endian u64)

2. JSON header: tensor names, dtypes, shapes, byte offsets (plain UTF-8, not JSONB)

3. Raw tensor data: contiguous bytes right after the header
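The three-part layout above can be exercised with nothing but the standard library. The following is a minimal sketch, not the `safetensors` library itself: `build_safetensors_bytes` and `read_header` are hypothetical helper names, and the sketch assumes the per-tensor header keys `dtype`, `shape`, and `data_offsets`, with offsets measured from the start of the data section. The real library handles details omitted here, such as an optional `__metadata__` entry.

```python
import json
import struct


def build_safetensors_bytes(tensors):
    """Serialize {name: (dtype_str, shape, raw_bytes)} using the layout above."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            # Offsets are relative to the start of the data section.
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian u64 length, then the JSON header, then raw bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob):
    """Return (header_dict, data_start) without decoding any tensor bytes."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8 : 8 + header_len].decode("utf-8"))
    return header, 8 + header_len


weights = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # 2x2 float32
bias = struct.pack("<2f", 0.5, -0.5)              # 2 float32
blob = build_safetensors_bytes({
    "layer.weight": ("F32", [2, 2], weights),
    "layer.bias": ("F32", [2], bias),
})

header, data_start = read_header(blob)
begin, end = header["layer.bias"]["data_offsets"]
bias_values = struct.unpack("<2f", blob[data_start + begin : data_start + end])
```

Reading `layer.bias` only required the header and one byte range; nothing in the payload had to be parsed.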

Why HF built it:

1. Security: pickle allows arbitrary code execution on load. Unacceptable for a public model hub.

2. Mmap: memory-map the file, load only the tensors you need. Matters at 70B+ params.

3. Simplicity: nothing to exploit, nothing to decode. Header + raw bytes is all you need for dense numerical arrays.
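The mmap point can be illustrated with a rough, standard-library-only sketch. It writes a tiny file in the header-length + JSON + raw-bytes layout, then maps it and decodes a single tensor by offset. `write_demo_file` and `load_one_tensor` are hypothetical names, the sketch assumes float32 data only, and slicing the map copies the selected bytes (a true zero-copy view would wrap the map in a `memoryview` or an array library instead).

```python
import json
import mmap
import os
import struct
import tempfile


def write_demo_file(path):
    """Write a tiny file in the header-length + JSON + raw-bytes layout."""
    a = struct.pack("<3f", 1.0, 2.0, 3.0)
    b = struct.pack("<2f", 10.0, 20.0)
    header = {
        "a": {"dtype": "F32", "shape": [3], "data_offsets": [0, len(a)]},
        "b": {"dtype": "F32", "shape": [2],
              "data_offsets": [len(a), len(a) + len(b)]},
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)) + header_bytes + a + b)


def load_one_tensor(path, name):
    """Memory-map the file and decode only the named float32 tensor."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_len,) = struct.unpack_from("<Q", mm, 0)
        header = json.loads(mm[8 : 8 + header_len])
        begin, end = header[name]["data_offsets"]
        start = 8 + header_len
        # Only the pages backing this byte range need to be read from disk.
        raw = mm[start + begin : start + end]
    return struct.unpack(f"<{len(raw) // 4}f", raw)


demo_path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
write_demo_file(demo_path)
tensor_b = load_one_tensor(demo_path, "b")
```

Tensor `a` is never decoded, which is the property that matters when the file holds hundreds of tensors and a worker needs only its own slice of them.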

Parquet vs Safetensors — Know Which to Reach For

Why not just send Parquet for everything?

• Parquet is columnar: optimized for tabular data where you read a subset of columns. Tensors are dense N-dim arrays, not rows and columns.

• Sending a tensor as Parquet = flatten to columns, pay compression/decompression, reshape back. You lose the contiguous strided layout the GPU expects and gain nothing.

• Parquet = "100GB of training records." Safetensors = "100GB of model weights."

"You have 100GB of records to move to 8 GPUs — what format and why?"

The Answer:

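1. Parquet on disk, sharded so each shard fits in host memory per worker

2. Read into Arrow on host (columnar; Parquet pages are decompressed and decoded once, after which Arrow buffers can be shared zero-copy)

3. Transfer to device using pinned (page-locked) host memory, which lets host-to-device copies run asynchronously and overlap with compute (bonus points)

4. Each GPU worker reads its own shard subset, so no cross-worker coordination is needed for reads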
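Step 4 works without coordination as long as every worker derives the same assignment from the same inputs. A minimal sketch, assuming a round-robin policy and hypothetical shard file names (`assign_shards` is an illustrative helper, not an API from any library):

```python
def assign_shards(shard_paths, world_size):
    """Round-robin shard paths across workers; returns one list per rank."""
    # Sorting makes the plan deterministic, so each worker can compute it
    # locally and still agree with every other worker.
    ordered = sorted(shard_paths)
    return [ordered[rank::world_size] for rank in range(world_size)]


shards = [f"records-{i:05d}.parquet" for i in range(20)]  # hypothetical names
plan = assign_shards(shards, world_size=8)
my_shards = plan[0]  # what the worker with rank 0 would read
```

With 20 shards and 8 workers the split is uneven (some ranks get 3 shards, others 2), which is one reason to cut many more shards than there are workers.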

Archetypes to Implement

1. Sharded In-Memory Store

Consistent hashing, rebalancing on node add/remove
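One possible starting point for this archetype is a hash ring with virtual nodes. The sketch below is an illustration under stated assumptions: the class name `HashRing`, the choice of SHA-256 truncated to 64 bits, and 64 virtual nodes per node are all arbitrary, and it only routes keys (actually moving stored values during a rebalance is left out).

```python
import bisect
import hashlib


def _hash(value):
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        # First virtual node clockwise from the key's position, wrapping.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("node-b")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The property to check is that `moved` contains only keys that lived on the removed node; keys owned by the surviving nodes stay put, which is the whole point compared with `hash(key) % N`.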

2. Streaming Aggregator

Chunked processing, partial result emission
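A minimal sketch of the idea, assuming the aggregation is a simple count per event key: input is consumed in fixed-size chunks from any iterator, and a snapshot of the running totals is emitted after each chunk. `chunked` and `running_counts` are illustrative names.

```python
from collections import Counter
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def running_counts(events, chunk_size):
    """Fold events chunk by chunk, emitting a snapshot after every chunk."""
    totals = Counter()
    for chunk in chunked(events, chunk_size):
        totals.update(chunk)
        yield dict(totals)  # a copy, so callers can keep each partial result


stream = iter(["a", "b", "a", "c", "a", "b", "c"])
snapshots = list(running_counts(stream, chunk_size=3))
```

Memory use is bounded by the chunk size plus the number of distinct keys, not by the length of the stream.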

3. Work Queue

N workers, retry on failure
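A rough thread-based sketch of this archetype follows. It assumes tasks are hashable, retries are immediate (no backoff), and a task that still fails after `max_attempts` is recorded rather than raised. `run_work_queue` and `flaky_square` are hypothetical names used only for illustration.

```python
import queue
import threading


def run_work_queue(tasks, handler, num_workers=4, max_attempts=3):
    """Run handler(task) on N threads; re-enqueue failures up to max_attempts."""
    q = queue.Queue()
    for task in tasks:
        q.put((task, 1))
    results, failed = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: shut down
                q.task_done()
                return
            task, attempt = item
            try:
                value = handler(task)
                with lock:
                    results[task] = value
            except Exception as exc:
                if attempt < max_attempts:
                    q.put((task, attempt + 1))  # retry before task_done()
                else:
                    with lock:
                        failed[task] = repr(exc)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()  # returns once every task, including retries, is finished
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return results, failed


attempts = {}
attempts_lock = threading.Lock()


def flaky_square(n):
    """Demo handler: odd inputs fail once; 7 fails every time."""
    with attempts_lock:
        attempts[n] = attempts.get(n, 0) + 1
        seen = attempts[n]
    if n == 7 or (n % 2 == 1 and seen == 1):
        raise RuntimeError(f"transient failure on {n}")
    return n * n


results, failed = run_work_queue(range(10), flaky_square)
```

The retry is enqueued before `task_done()` is called for the failed attempt, so `q.join()` cannot return while a retry is still pending.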

4. Dataset Loader

Read shards, decode, batch, yield — mini DataLoader with prefetching + worker pool
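The read, decode, batch, yield loop with prefetching might be sketched as below. This is a simplification under stated assumptions: shards are in-memory JSONL strings standing in for files, and a single background thread stands in for a worker pool. `iter_batches` is a hypothetical name. A decode error in the producer ends the stream early in this sketch rather than propagating to the consumer.

```python
import json
import queue
import threading


def iter_batches(shards, batch_size, prefetch=2):
    """Decode JSONL shards on a background thread and yield batches."""
    buf = queue.Queue(maxsize=prefetch)  # bounded: producer blocks when full
    done = object()                      # unique end-of-stream marker

    def producer():
        batch = []
        try:
            for shard in shards:                    # read
                for line in shard.splitlines():
                    batch.append(json.loads(line))  # decode
                    if len(batch) == batch_size:    # batch
                        buf.put(batch)
                        batch = []
            if batch:
                buf.put(batch)                      # final short batch
        finally:
            buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not done:
        yield item                                  # yield


shards = ['{"x": 0}\n{"x": 1}\n{"x": 2}', '{"x": 3}\n{"x": 4}']
batches = list(iter_batches(shards, batch_size=2))
```

The bounded queue is what makes this prefetching rather than eager loading: the producer stays at most `prefetch` batches ahead of the consumer.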

5. Log/Event Parser

Resilient to malformed lines, never crashes the stream
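A minimal sketch of a parser with this property, assuming the log is JSONL with one object per line. `parse_jsonl` and its `on_error` callback are illustrative names; the key design choice is that a bad line is reported and skipped, never raised into the consumer.

```python
import json


def parse_jsonl(lines, on_error=None):
    """Yield one dict per valid line; report bad lines instead of raising."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped, not treated as errors
        try:
            event = json.loads(line)
            if not isinstance(event, dict):
                raise ValueError("expected a JSON object")
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            if on_error is not None:
                on_error(lineno, line, exc)
            continue
        yield event


raw = [
    '{"level": "info", "msg": "start"}',
    '{"level": "warn", "msg": "trunc',  # truncated write
    "",
    "[1, 2, 3]",                        # valid JSON, wrong shape
    '{"level": "error", "msg": "boom"}',
]
bad_lines = []
events = list(parse_jsonl(raw, on_error=lambda n, line, exc: bad_lines.append(n)))
```

Keeping the line number in the error callback makes it possible to count or sample malformed input without stopping the stream.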

System Design Prompts (Their Domain)

Training Data Pipeline

Object store → shuffling → sharding → workers → GPUs

Distributed Inference Service

Request routing, batching, KV cache, model sharding

Feature / Embedding Store

Write path from training, read path with low-latency lookup

Checkpoint / Artifact System

Large blobs, versioning, fast restore
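As one possible way to make the versioning and restore requirements concrete, here is a small local-filesystem sketch. It assumes a single writer, a JSON manifest keyed by training step, and SHA-256 checksums; `save_checkpoint`, `restore_latest`, and the file naming scheme are hypothetical. A real system would target an object store and stream large blobs rather than hold them in memory.

```python
import hashlib
import json
import os
import tempfile


def _atomic_write(root, final_path, data):
    """Write to a temp file in the same directory, then rename into place."""
    fd, tmp_path = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)  # atomic on the same filesystem


def save_checkpoint(root, step, blob):
    """Store blob as a new version and record it in the manifest."""
    os.makedirs(root, exist_ok=True)
    name = f"step-{step:08d}.bin"
    _atomic_write(root, os.path.join(root, name), blob)
    manifest_path = os.path.join(root, "manifest.json")
    manifest = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            manifest = json.load(f)
    manifest[str(step)] = {"file": name,
                           "sha256": hashlib.sha256(blob).hexdigest()}
    # The manifest is updated last, so it never points at a partial blob.
    _atomic_write(root, manifest_path, json.dumps(manifest).encode("utf-8"))


def restore_latest(root):
    """Return (step, blob) for the newest version, verifying its checksum."""
    with open(os.path.join(root, "manifest.json")) as f:
        manifest = json.load(f)
    step = max(manifest, key=int)
    entry = manifest[step]
    with open(os.path.join(root, entry["file"]), "rb") as f:
        blob = f.read()
    if hashlib.sha256(blob).hexdigest() != entry["sha256"]:
        raise IOError(f"checkpoint {step} is corrupt")
    return int(step), blob


ckpt_dir = tempfile.mkdtemp()
save_checkpoint(ckpt_dir, 100, b"weights-v1")
save_checkpoint(ckpt_dir, 200, b"weights-v2")
latest_step, latest_blob = restore_latest(ckpt_dir)
```

Writing the blob first and the manifest second means a crash mid-save leaves the previous version restorable.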