| Format | Strengths | Weaknesses | When to use |
|---|---|---|---|
| Pickle | Convenient, Python-native | Unsafe across versions, slow, arbitrary code execution | Never for cross-process or untrusted data |
| JSON / JSONL | Human-readable, universal | Slow, no binary support | JSONL for line-oriented streaming/logging |
| Protobuf | Schema'd, fast, language-agnostic, broad ecosystem | Requires serialize/deserialize step; schema management overhead | Cross-service communication, config, any place you need a stable contract |
| FlatBuffers | Zero-copy read directly from buffer, no parsing step | Harder to use, smaller ecosystem than Protobuf | Shoving bytes at accelerators; anywhere decode cost matters |
| Arrow / Parquet | Columnar, zero-copy, column pruning & predicate pushdown | Overkill for small payloads. Wrong shape for dense tensors: columnar layout doesn't help N-dim arrays | Arrow = in-memory, Parquet = on-disk. The answer for tabular/analytics data, NOT model weights |
| Safetensors | 8-byte length prefix + JSON header (names, dtypes, shapes, offsets) + raw contiguous bytes. Mmap-friendly, zero-copy, no parsing of payload | Narrow scope (tensors only) | Model weights. HF created it to kill pickle's arbitrary code execution risk on a public model hub |
| MessagePack / CBOR | Binary JSON, compact | Less tooling than JSON/Protobuf | Middle ground when JSON is too slow but schema is overkill |

Safetensors format layout:
1. 8 bytes: header length (little-endian u64)
2. JSON header: tensor names, dtypes, shapes, byte offsets (plain UTF-8, not JSONB)
3. Raw tensor data: contiguous bytes right after the header
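
The three-part layout above can be sketched with nothing but `struct` and `json`. This is an illustration, not the `safetensors` library: `build_blob` and `parse_header` are hypothetical helpers, and the header field names (`dtype`, `shape`, `data_offsets`) and the `F32` dtype label follow my understanding of the published format and should be treated as assumptions.

```python
import json
import struct


def build_blob(tensors):
    """tensors: dict of name -> (dtype_label, shape, raw_bytes). Returns file bytes."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the data section, not the file.
        header[name] = {
            "dtype": dtype,
            "shape": list(shape),
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian u64 length prefix, then JSON header, then raw bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def parse_header(blob):
    """Return (header_dict, byte position where the tensor data starts)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len


weights = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
bias = struct.pack("<2f", 0.5, -0.5)
blob = build_blob({"w": ("F32", (2, 2), weights), "b": ("F32", (2,), bias)})

header, data_start = parse_header(blob)
begin, end = header["b"]["data_offsets"]
bias_values = struct.unpack("<2f", blob[data_start + begin:data_start + end])
```

Reading `b` touches only the header and the byte range that `data_offsets` points at; the payload itself is never parsed.
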
Why HF built it:
1. Security: pickle allows arbitrary code execution on load. Unacceptable for a public model hub.
2. Mmap: memory-map the file, load only the tensors you need. Matters at 70B+ params.
3. Simplicity: nothing to exploit, nothing to decode. Header + raw bytes is all you need for dense numerical arrays.
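
The mmap point can be sketched as follows, assuming a file laid out as length prefix + JSON header + raw bytes. `write_demo_file` and `load_tensor_bytes` are hypothetical helpers, and the header field names are assumptions for illustration; real code would use the `safetensors` library.

```python
import json
import mmap
import os
import struct
import tempfile


def write_demo_file(path):
    """Write two tiny float32 tensors in a length-prefix + JSON + raw-bytes layout."""
    a = struct.pack("<3f", 1.0, 2.0, 3.0)
    b = struct.pack("<2f", 9.0, 8.0)
    header = {
        "a": {"dtype": "F32", "shape": [3], "data_offsets": [0, len(a)]},
        "b": {"dtype": "F32", "shape": [2], "data_offsets": [len(a), len(a) + len(b)]},
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)) + header_bytes + a + b)


def load_tensor_bytes(path, name):
    """Memory-map the file and copy out only the bytes of one named tensor."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            (header_len,) = struct.unpack_from("<Q", mm, 0)
            header = json.loads(mm[8:8 + header_len])
            begin, end = header[name]["data_offsets"]
            start = 8 + header_len
            # Slicing copies just this tensor's bytes; pages belonging to
            # other tensors are never read from disk.
            return mm[start + begin:start + end]
        finally:
            mm.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    write_demo_file(path)
    b_values = struct.unpack("<2f", load_tensor_bytes(path, "b"))
```

The slice still makes one copy of the requested tensor. A loader that wants to avoid even that would keep the mapping open and hand out a `memoryview` over it.
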
Why not just send Parquet for everything?
- Parquet is columnar: optimized for tabular data where you read a subset of columns. Tensors are dense N-dim arrays, not rows and columns.
- Sending a tensor as Parquet = flatten to columns, pay compression/decompression, reshape back. You lose the contiguous strided layout the GPU expects and gain nothing.
- Parquet = "100GB of training records." Safetensors = "100GB of model weights."
The Answer:
1. Parquet on disk, sharded so each shard fits in host memory per worker
2. Read into Arrow on host (zero-copy, columnar)
3. Transfer to device using pinned memory, which lets host-to-device copies run asynchronously (bonus points)
4. Each GPU worker reads its own shard subset, so no cross-worker coordination is needed for reads
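
One way to get "no cross-worker coordination" is to make shard assignment a pure function of the worker's rank. A minimal sketch, assuming every worker sees the same shard listing; `shards_for_worker` and the file names are hypothetical.

```python
def shards_for_worker(shard_paths, rank, world_size):
    """Deterministically pick the shards owned by one worker.

    Every worker sorts the same listing and takes every world_size-th entry
    starting at its own rank, so the subsets are disjoint and cover all
    shards without any communication.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    return sorted(shard_paths)[rank::world_size]


# Hypothetical shard names, for illustration only.
shards = [f"train-{i:05d}.parquet" for i in range(10)]
assignment = {rank: shards_for_worker(shards, rank, 4) for rank in range(4)}
```

Striding by rank keeps shard counts within one of each other across workers. It does not balance by shard size, which a real pipeline may need to account for.
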
1. Sharded In-Memory Store
Consistent hashing, rebalancing on node add/remove
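
A minimal consistent-hash ring, as one possible starting point for this exercise. The `HashRing` class, the virtual-node count, and the choice of SHA-256 are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (a sketch, not production code)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        # Each node owns several points so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring has no nodes")
        # First point at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


keys = [f"key-{i}" for i in range(500)]
ring = HashRing(["a", "b", "c"])
before = {k: ring.node_for(k) for k in keys}

ring.add("d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

ring.remove("d")
restored = {k: ring.node_for(k) for k in keys}
```

The property worth checking is that adding a node only moves keys onto the new node, and removing it puts them back. Keys never shuffle between the nodes that stayed.
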
2. Streaming Aggregator
Chunked processing, partial result emission
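
A sketch of chunked processing with partial result emission, assuming the input is an iterable of `(key, value)` pairs and the aggregate is a per-key sum. `chunked` and `running_totals` are hypothetical names.

```python
from collections import defaultdict
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def running_totals(events, chunk_size):
    """Consume (key, value) pairs chunk by chunk, emitting a snapshot after each."""
    totals = defaultdict(int)
    for chunk in chunked(events, chunk_size):
        for key, value in chunk:
            totals[key] += value
        # Emit a copy so callers cannot mutate the aggregator's state.
        yield dict(totals)


events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
partials = list(running_totals(events, chunk_size=2))
```

Because both functions are generators, memory use is bounded by the chunk size plus the number of distinct keys, not by the length of the stream.
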
3. Work Queue
N workers, retry on failure
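
One way to sketch this with the standard library is a `queue.Queue` drained by N threads, where a failed job is re-enqueued until it runs out of attempts. `run_jobs` and its parameters are hypothetical, and jobs are assumed to be hashable so they can serve as dictionary keys.

```python
import queue
import threading


def run_jobs(jobs, handler, num_workers=4, max_attempts=3):
    """Run handler(job) on a thread pool, retrying failures up to max_attempts."""
    q = queue.Queue()
    for job in jobs:
        q.put((job, 1))
    results, failures = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            if item is None:  # shutdown sentinel
                q.task_done()
                return
            job, attempt = item
            try:
                value = handler(job)
            except Exception as exc:
                if attempt < max_attempts:
                    # Re-enqueue before task_done so q.join() keeps waiting.
                    q.put((job, attempt + 1))
                else:
                    with lock:
                        failures[job] = exc
            else:
                with lock:
                    results[job] = value
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()  # returns once every job has succeeded or exhausted its retries
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return results, failures


attempts = {}
attempts_lock = threading.Lock()


def flaky(job):
    """Demo handler: job 2 fails once, job 5 always fails."""
    with attempts_lock:
        attempts[job] = attempts.get(job, 0) + 1
        n = attempts[job]
    if job == 5 or (job == 2 and n == 1):
        raise RuntimeError(f"job {job} failed on attempt {n}")
    return job * job


results, failures = run_jobs(range(6), flaky, num_workers=3, max_attempts=3)
```

This version retries immediately. Backoff, dead-letter handling, and idempotency of `handler` are left out and would matter in a real system.
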
4. Dataset Loader
Read shards, decode, batch, yield — mini DataLoader with prefetching + worker pool
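
A sketch of the read, decode, batch, yield loop using a thread pool for decoding and a bounded queue for prefetching. It does not use PyTorch; `load_batches`, `decode`, and the in-memory "shards" are hypothetical stand-ins for real shard files.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of the stream


def load_batches(shards, decode, batch_size, num_workers=2, prefetch=4):
    """Decode shards on a worker pool and yield fixed-size batches in order."""
    out = queue.Queue(maxsize=prefetch)  # bounds how many batches sit ready

    def producer():
        try:
            with ThreadPoolExecutor(max_workers=num_workers) as pool:
                batch = []
                # pool.map preserves shard order but submits every shard up
                # front; a real loader would cap the number in flight.
                for records in pool.map(decode, shards):
                    for record in records:
                        batch.append(record)
                        if len(batch) == batch_size:
                            out.put(batch)  # blocks when the queue is full
                            batch = []
                if batch:
                    out.put(batch)  # final partial batch
            out.put(_DONE)
        except Exception as exc:
            out.put(exc)  # surface decode errors to the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = out.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


def decode(shard):
    """Demo decoder: each 'shard' is a list of numeric strings."""
    return [int(x) for x in shard]


shards = [["1", "2", "3"], ["4", "5"], ["6", "7", "8", "9"]]
batches = list(load_batches(shards, decode, batch_size=4))
```

Threads suit I/O-bound decoding. CPU-bound decoding in Python would usually call for processes instead, and shuffling is left out entirely here.
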
5. Log/Event Parser
Resilient to malformed lines, never crashes the stream
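
A sketch of a parser that skips bad lines instead of raising, assuming the input is JSONL with one JSON object per line. `parse_events` and the `on_error` callback are hypothetical names.

```python
import json


def parse_events(lines, on_error=None):
    """Yield one dict per valid JSONL line; report and skip malformed lines."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are not errors
        try:
            event = json.loads(line)
            if not isinstance(event, dict):
                raise ValueError("expected a JSON object")
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            if on_error is not None:
                on_error(lineno, line, exc)
            continue
        yield event


raw = [
    '{"level": "info", "msg": "start"}',
    "not json at all",
    "",
    "[1, 2]",
    '{"level": "error", "msg": "boom"}',
]
bad_lines = []
events = list(parse_events(raw, on_error=lambda n, text, exc: bad_lines.append(n)))
```

Routing failures through a callback keeps the stream alive while still letting the caller count, log, or quarantine the bad lines.
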
Training Data Pipeline
Object store → shuffling → sharding → workers → GPUs
Distributed Inference Service
Request routing, batching, KV cache, model sharding
Feature / Embedding Store
Write path from training, read path with low-latency lookup
Checkpoint / Artifact System
Large blobs, versioning, fast restore