Apple Memory Models — UMA, and is it better?

Apple Silicon's Unified Memory Architecture: what a buffer crosses to reach the CPU, GPU, or Neural Engine; how UMA differs from the discrete (PCIe) model at every layer; the history (it's mostly 1960s ideas, integrated); a verdict on whether unified is actually better; and the paging cliff that bounds the answer.
0 · Overview & history
0.5 · Is UMA better? (verdict + the paging cliff)
1 · The full stack (one diagram)
1.5 · Three consumers: CPU · NN engine · shaders
1.7 · Apple on-chip memory (threadgroup, tile, registers)
1.8 · Stack vs heap (CPU and GPU)
2 · Virtual vs physical; user vs kernel
3 · Pageable vs pinned (page-locked)
4 · DMA & the discrete copy path
5 · Software-unified (CUDA managed)
6 · Hardware-unified (Apple UMA)
7 · Cache coherence
7.5 · Memory security: ASLR + the Apple stack
8 · How this matters for developers (by scenario)
9 · Side-by-side & gotchas

0 · Overview & history

A pointer in your code is a virtual address in user space; the hardware that consumes it (DMA engine, GPU MMU) works in physical addresses. Bridging that gap — pin the pages, set up translation, then either copy across a bus (discrete) or share the same physical pages (unified) — is the whole story. Apple Silicon's UMA shares one physical pool, so the copy disappears; a discrete GPU pays a PCIe copy into its own VRAM.
History — Apple integrated, it didn't invent
Almost none of UMA is new; the pieces are decades old, and Apple's contribution is putting them on one package and making the whole memory pool fast and shared.
What Apple did that's genuinely distinctive: on-package LPDDR with high bandwidth (~68 GB/s on the M1 base, up to ~800 GB/s on Ultra-class parts), a single coherent address space the CPU, GPU, and Neural Engine all use at full speed, and no second VRAM pool at all. So "unified" stops being a budget compromise (integrated graphics) and becomes the high-performance default. The idea is old; making it not slow is the achievement. (The datacenter is now copying it: AMD MI300A is the true single-physical-pool analog; NVIDIA Grace-Hopper keeps two pools (Grace LPDDR5X + Hopper HBM3) joined by cache-coherent NVLink-C2C, closer to the coherent two-pool model in §5/§7.)

0.5 · Is UMA better? — a verdict, and the cliff that bounds it

For Apple's targets — phones, laptops, on-device inference — yes, decisively. As a universal model — no. UMA isn't strictly "better"; it's a different point on the curve: it optimizes bytes-not-moved (latency, power, copy-free sharing), where discrete + HBM optimizes bytes-per-second (raw bandwidth, scalable capacity).
UMA wins when… | Discrete + HBM wins when…
latency-sensitive, power-constrained (battery devices) | peak bandwidth dominates (HBM ~3+ TB/s, GPU-dedicated)
workload copies data CPU↔GPU↔NPU constantly | training-scale throughput, the largest models
the model fits in RAM: "whole model addressable, no copy" | you need to scale capacity past one package (add VRAM, add GPUs)
efficiency per watt matters (moving bytes over PCIe burns power) | CPU and GPU would otherwise contend for one bus

The reason Macs punch above their weight on local LLMs: a 64/128/192 GB Apple Silicon machine can keep a large model fully resident and addressable without a copy, where a same-priced discrete card is capped by its VRAM. UMA trades peak for efficiency and capacity-at-the-low-end — the right trade for its devices, the wrong one for a training cluster.

The cliff that bounds the verdict: UMA can page to SSD
UMA is still ordinary virtual memory — those LPDDR pages are pageable. Under pressure, macOS/iOS compresses memory, then swaps to SSD. And because everything shares one pool, one greedy allocation squeezes everyone. So the verdict has a hard edge: UMA's "whole model is addressable" win holds only while the hot set stays within physical RAM. Addressable ≠ resident. Exceed the pool and the copy-free advantage is buried under SSD paging — which is exactly why people buy the big-RAM Apple Silicon configs for local models: they're paying to keep the model wired and never touch swap.
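A rough way to reason about the cliff is to compare the hot set against physical RAM before loading anything. The sketch below is only an illustration: it assumes a POSIX system where `os.sysconf` exposes `SC_PHYS_PAGES` and `SC_PAGE_SIZE`, and the names `physical_ram_bytes` and `fits_resident`, along with the 25% headroom reserved for the OS and other apps, are hypothetical choices rather than Apple guidance.

```python
import os

GiB = 1024 ** 3

def physical_ram_bytes() -> int:
    """Total physical RAM via POSIX sysconf (assumed to be available on this OS)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")

def fits_resident(hot_set_bytes: int, ram_bytes: int, headroom: float = 0.25) -> bool:
    """True if the hot set fits in RAM after reserving `headroom` for everything else.

    `headroom` is a hypothetical safety margin, not a measured number.
    """
    budget = ram_bytes * (1.0 - headroom)
    return hot_set_bytes <= budget

# Hypothetical example: a 40 GiB model on a 64 GiB machine vs a 32 GiB machine.
print(fits_resident(40 * GiB, 64 * GiB))  # True: 40 GiB <= 48 GiB budget
print(fits_resident(40 * GiB, 32 * GiB))  # False: 40 GiB > 24 GiB budget
```

On a real machine, `fits_resident(model_bytes, physical_ram_bytes())` gives a first-order answer; it says nothing about what other processes are holding in the shared pool at that moment.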

1 · The full stack — every layer a buffer crosses

Figure: the full stack, layer by layer. Discrete GPU over PCIe (e.g. NVIDIA) vs unified memory (Apple Silicon UMA).

Layer | Discrete GPU (PCIe) | Unified memory (UMA)
User space (virtual addresses, your process) | Application buffer (malloc / new): pageable, in the process virtual address space | Application buffer (malloc / new): same idea, pageable, in process virtual space
User space | CUDA runtime (cudaHostAlloc / cudaMemcpy): allocate pinned host buffer, enqueue transfer | Metal MTLBuffer (storageModeShared): one buffer, addressable by CPU and GPU; no copy API
Kernel / system (driver, VM, page tables) | VM page-lock (pin): mark pages non-swappable; fix them to physical frames so DMA addresses stay valid | VM wire pages: pages still pinned for the GPU, but they are the SAME pages; no copy
Kernel / system | GPU driver: program DMA engine, build GPU page tables, submit command buffer | GPU driver: map the pages into the GPU's tables, submit work (no cross-bus DMA copy)
Hardware (physical DRAM, bus, GPU) | CPU MMU + TLB: virtual → physical for CPU accesses | CPU MMU + TLB: CPU virtual → physical
Hardware | IOMMU / GPU MMU: device virtual → physical; protects/maps DMA | GPU MMU + TLB: GPU virtual → same physical frames as the CPU
Hardware | System DRAM (DDR5), host RAM: pinned staging buffer lives here at a fixed physical address; ~50–100 GB/s | ONE PHYSICAL POOL, LPDDR on-package: CPU, GPU, and Neural Engine all address these same pages; no VRAM, no PCIe, no copy; ~68–800 GB/s
Hardware | PCIe bus, DMA engine copies bytes: THE bottleneck, ~16–64 GB/s, far below either DRAM; async DMA overlaps with compute (needs pinned source) | Shared on-die fabric: the buffer is written once, read in place by either processor; tradeoff: CPU + GPU share one bandwidth budget
Hardware | GPU VRAM (GDDR6 / HBM), device memory: separate physical pool, GPU's own address space; ~400 GB/s – 3+ TB/s (HBM) | (no second pool)
Hardware | GPU compute (SMs): kernels read VRAM in the fast path (zero-copy mapped host is the exception) | CPU cores and GPU compute both read/write in place: shared, no copy

What unified memory DELETES from the discrete column:
✗ the pinned staging buffer (no separate host copy)
✗ the PCIe DMA copy (the bandwidth bottleneck)
✗ the second VRAM pool (no duplicate of the data)
✓ keeps: page tables, pinning, two MMUs, coherence

1.5 · Three consumers, three paths — CPU · NN engine · GPU shaders

"Pass memory to the processor" isn't one flow — what the consumer is changes the path. The CPU reads through its own caches (no DMA at all). A transformer / NN accelerator streams weights from device memory into tiny on-chip SRAM and keeps the KV-cache resident. GPU shaders bind textures and vertex buffers through descriptors and sample through fixed-function units. Same DRAM, three different access shapes.
Figure: three consumers of the same physical DRAM (discrete: VRAM after DMA · UMA: the one shared pool).

① CPU
Path: MMU + TLB (virtual → physical, per access) → L1 / L2 / SLC caches (64-byte lines, prefetch, coherent) → registers → ALU (scalar / SIMD: AVX, NEON)
Key traits: no DMA, direct loads · latency-optimized, big caches · cache misses are the enemy

② NN engine (transformer)
Path: weights resident in device memory (loaded once, reused every token) → stream → on-chip SRAM (tiles staged near the MACs) → matmul / MAC array (thousands of MACs, fixed dataflow); KV-cache ↔ device memory (grows per token)
Key traits: bandwidth-bound (weight streaming) · SRAM reuse is everything (FlashAttention) · KV-cache bytes bind capacity

③ GPU shaders (graphics)
Path: descriptor / argument table (binds buffers + textures to the draw) → vertex / index buffers (geometry fetched by input assembler) → texture units / samplers (tiled/swizzled layout, filtered reads) → shader cores → framebuffer
Key traits: data BOUND via descriptors, not pointers · special tiled/swizzled texture layout · fixed-function sampling hardware

The one-line difference per consumer
① CPU: reads memory directly through its cache hierarchy; no DMA, no binding. Optimize for cache locality; misses stall the core.
② NN engine: streams weights from device DRAM into on-chip SRAM and feeds a matmul array; KV-cache stays resident. Bandwidth + SRAM-reuse bound.
③ Shaders: data is bound via descriptor tables (not raw pointers); textures use a tiled/swizzled layout read by fixed-function samplers.
All three start from the same DRAM bytes; what differs is translation/binding and the on-chip path to the execution units.

Why the three diverge

Axis | CPU | NN engine (transformer) | GPU shaders
How memory is referenced | raw virtual pointer → MMU | device pointer; weights pre-staged | descriptor / argument buffer binds resources
On-chip staging | L1/L2 + SLC (Apple has no per-core L3; SLC is the SoC-wide last level) | SRAM scratchpad (software-managed tiles) | texture cache + threadgroup memory
Execution unit | scalar/SIMD ALU | MAC array, fixed dataflow (systolic-style: the generic NN-accelerator/TPU pattern; Apple's exact ANE microarchitecture is undocumented) | shader cores + fixed-function samplers
Memory layout | linear, cache-line aligned | tiled for matmul reuse | swizzled/tiled textures, interleaved verts
Bound by | latency (cache misses) | bandwidth (weight streaming) + SRAM size | fill rate / bandwidth / sampler throughput
The key trick | cache locality | keep tiles in SRAM (FlashAttention) | minimize binds + exploit texture cache
The transformer-specific memory story (ties to the inference work)
A transformer accelerator's whole game is the memory hierarchy, not FLOPs: weights sit in device DRAM and are streamed in each forward pass (so decode is bandwidth-bound — see 20_kv_cache_decode); the KV-cache grows one token at a time and lives in device memory (so KV bytes bind serving capacity — the 14_inference_server thesis); and the performance win — FlashAttention — is purely a memory move: keep the attention tiles in on-chip SRAM instead of round-tripping the big matrices to DRAM. CPU caches are hardware-managed; the NN engine's SRAM is software-managed, which is why the kernel author controls the tiling.
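The two memory claims above (decode is bandwidth-bound, KV bytes bind capacity) reduce to back-of-envelope arithmetic. The sketch below uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, fp16) and a hypothetical 400 GB/s of memory bandwidth; the function names are illustrative, and the throughput figure is an upper bound that assumes batch size 1 and that every weight byte is streamed once per token.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: K and V each hold layers * kv_heads * head_dim elements per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

def decode_tokens_per_sec_upper_bound(weight_bytes: float,
                                      bandwidth_bytes_per_sec: float) -> float:
    """If each token must stream all weights once, bandwidth / weight size caps tokens/s."""
    return bandwidth_bytes_per_sec / weight_bytes

MiB = 1024 ** 2

# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 4096-token context, fp16.
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=4096)
print(kv // MiB, "MiB of KV-cache")          # 512 MiB of KV-cache

# Hypothetical: 8 GB of weights streamed over 400 GB/s.
print(decode_tokens_per_sec_upper_bound(8e9, 400e9), "tokens/s at most")  # 50.0
```

Doubling the context doubles the KV bytes, and halving the weight size (for example by quantizing) doubles the bound; no amount of extra FLOPs moves either number.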

1.7 · Apple on-chip memory — the fast tiers above the shared pool

UMA is about the big pool. But the speed of a GPU/NPU kernel is decided above it — in small on-chip memories the shared LPDDR feeds. These are the Apple-relevant ones:

Tier | What it is | Apple-specific note
Registers | per-thread, fastest | register pressure caps occupancy: too many regs/thread → fewer threads resident → less latency hiding
Threadgroup memory | programmer-managed scratchpad shared by a threadgroup (CUDA "shared memory") | Metal threadgroup address space; where you stage tiles for reuse. The matmul/attention win lives here.
Tile memory | on-chip framebuffer storage for a screen tile | Apple GPUs are TBDR (tile-based deferred renderers): the framebuffer for a tile stays on-chip; shaders read/write it without round-tripping DRAM. This is the distinctively-Apple tier. imageblock / programmable blending exploit it.
Texture cache + samplers | read-only cache feeding fixed-function sampling | swizzled/tiled texture layout (the shader path in §1.5)
SLC (system level cache) | large last-level cache shared across CPU/GPU/NPU on the SoC | a shared cache in front of the unified pool; part of why UMA stays fast; soaks cross-processor reuse
mmap'd file | a file mapped into virtual memory, paged in on access | how you load weights zero-copy (e.g. safetensors); on UMA the mapped pages are directly GPU-addressable, no staging copy. The OS pages them from SSD on demand (and back out under pressure: the §0.5 cliff).
The pattern: the pool is feedstock; on-chip reuse is the win
TBDR tile memory is the Apple-distinctive piece — keeping a tile's framebuffer on-chip is why mobile GPUs render efficiently (no repeated DRAM framebuffer traffic). For compute, the lever is the same as everywhere: threadgroup memory reuse (the GPU analog of the NN engine's SRAM tiling, §1.5). UMA gets the bytes to the chip without a copy; these tiers decide how fast you actually consume them. storageModeMemoryless is the extreme case — a render target that lives only in tile memory and never gets a DRAM backing at all.
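The mmap'd-file row in the table above is the one tier you can try from any scripting language. The sketch below shows only the CPU side of the idea: a file is mapped, nothing is read until a page is touched, and a `memoryview` slice refers to the mapped bytes without copying them. The temporary file stands in for a weights file and its contents are made up; handing the mapping to a GPU is a separate Metal step that is not shown here.

```python
import mmap
import os
import tempfile

page = mmap.PAGESIZE

# Write a small stand-in "weights" file: four pages of a repeating 0..255 pattern.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * (4 * page // 256))
    path = f.name

with open(path, "rb") as f:
    # Map the whole file read-only; pages are faulted in on first access.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)            # zero-copy view over the mapping
    chunk = view[page:page + 4]      # slicing a memoryview does not copy
    first_bytes = bytes(chunk)       # copy out only the 4 bytes we want
    chunk.release()                  # release views before closing the map
    view.release()
    mm.close()

os.remove(path)
print(first_bytes)  # b'\x00\x01\x02\x03'
```

Only the second page is touched here, so only that page has to come off disk; the same laziness is what lets the OS evict mapped weights again under memory pressure.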

1.8 · Stack vs heap — and why only heap reaches the GPU

Everything above is about heap memory. The stack never enters the GPU story, and that's the rule to internalize: only heap (or an mmap'd region) can be shared with a GPU; you can never share a stack variable's memory. A stack frame vanishes on return and its address isn't something the driver can pin or map. Small values can still be copied in by value (kernel arguments in CUDA, setBytes in Metal), but that is a copy, not a handoff. "Send this to the GPU" always means a heap allocation (malloc / MTLBuffer / cudaMalloc), never a local.

CPU side

Aspect | Stack | Heap
Lifetime | scoped: freed on function return | explicit / refcount / GC; outlives the call
Allocation | bump the stack pointer (~free) | allocator call (slower; fragments)
Paging | hot, so resident in practice; not the cliff | where the §0.5 paging cliff lives: large/long-lived allocations swap
GPU-shareable? | no: vanishes, can't be pinned/mapped | yes: the only thing you can give a GPU

GPU side — there's a stack/heap analog too

CPU concept | GPU analog | Note
Stack / locals | registers + threadgroup memory | fast, scoped to the kernel invocation, gone when it ends: the on-chip "stack"
Heap | device buffers / VRAM (UMA: the shared pool) | persists across launches; explicitly allocated; where data you hand the GPU lives
Stack overflow's cousin | register spilling | a kernel using too many registers spills to slow device memory ("local memory" in CUDA): you fall off the fast tier onto the slow one, tanking occupancy/throughput
The developer rule
Small, short-lived, per-call → stack / registers (free, automatic, fast, nothing to manage). Shared across CPU/GPU, large, or persistent → heap / device buffer (you own the lifetime, and it's where every UMA/pinning/paging concern in this doc applies). On UMA a storageModeShared buffer is heap the GPU can also see — there is no stack equivalent that crosses the CPU↔GPU boundary, which is why zero-copy handoff is always a heap-buffer story. And watch register pressure in kernels: spilling to device memory is the GPU's silent perf cliff, the analog of blowing the cache on the CPU.
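The register-pressure warning can be made concrete with a simple occupancy model. This is a sketch under stated assumptions: the register-file size (65,536) and the cap of 2,048 resident threads are hypothetical round numbers, not published figures for any Apple GPU, and real hardware adds further limits (threadgroup memory, allocation granularity) that this ignores.

```python
def resident_threads(register_file_size: int, regs_per_thread: int,
                     max_threads: int) -> int:
    """Threads that can be resident at once when only the register file limits them."""
    return min(max_threads, register_file_size // regs_per_thread)

# Hypothetical core: 65,536 registers, hardware cap of 2,048 resident threads.
REGISTER_FILE = 65_536
MAX_THREADS = 2_048

for regs in (32, 64, 128):
    threads = resident_threads(REGISTER_FILE, regs, MAX_THREADS)
    occupancy = threads / MAX_THREADS
    print(f"{regs:>3} regs/thread -> {threads:>4} threads ({occupancy:.0%} occupancy)")
# 32 regs/thread -> 2048 threads (100% occupancy)
# 64 regs/thread -> 1024 threads (50% occupancy)
# 128 regs/thread ->  512 threads (25% occupancy)
```

Each doubling of registers per thread halves the threads available to hide memory latency, which is the pressure that eventually pushes a compiler to spill.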

2 · Virtual vs physical; user vs kernel — the two axes

Two independent distinctions get conflated; keep them separate.

Why both matter for GPU handoff
The GPU needs a stable physical target, but your buffer is a movable virtual allocation the OS can swap or relocate at will. Closing that gap — locking the pages (physical stability) from kernel space (privilege) and mapping them for the device (translation) — is the irreducible work, and it's the same in both architectures. Unified memory removes the copy, not the translation + pinning.
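The translation step is easy to see in miniature. The toy below splits a virtual address into a page number and an offset using 16 KiB pages (the page size macOS uses on Apple Silicon) and looks the page up in a dictionary standing in for a page table. The two tables, the frame number, and the function names are all made up for illustration; real page tables are multi-level hardware structures.

```python
PAGE_SIZE = 16 * 1024                      # 16 KiB pages
OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 14 bits of in-page offset

def split(vaddr: int) -> tuple[int, int]:
    """Split a virtual address into (virtual page number, offset within page)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)

def translate(vaddr: int, page_table: dict[int, int]) -> int:
    """Toy MMU: look up the physical frame; a missing entry (KeyError) is a page fault."""
    vpn, offset = split(vaddr)
    frame = page_table[vpn]
    return (frame << OFFSET_BITS) | offset

# Hypothetical UMA case: the CPU and GPU use different virtual pages
# that both map to the same physical frame 0x7A.
cpu_page_table = {0x10: 0x7A}
gpu_page_table = {0x99: 0x7A}

cpu_va = (0x10 << OFFSET_BITS) | 0x123
gpu_va = (0x99 << OFFSET_BITS) | 0x123

print(hex(translate(cpu_va, cpu_page_table)))  # 0x1e8123
print(hex(translate(gpu_va, gpu_page_table)))  # 0x1e8123
```

Two different virtual addresses resolve to one physical byte: that is the "same pages, two MMUs" arrangement, and it only stays valid while the frame is pinned.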

3 · Pageable vs pinned (page-locked) memory

Ordinary malloc memory is pageable: the OS may swap it to disk or move its physical frame. That's fatal for DMA — the device could read a frame that's been remapped out from under it.

4 · DMA & the discrete copy path

DMA (Direct Memory Access) lets a dedicated engine move bytes between host DRAM and device VRAM without the CPU shuttling each word. The discrete path, end to end:

  1. App fills a buffer (virtual, pageable).
  2. Runtime allocates/uses a pinned host buffer; kernel locks the pages.
  3. Driver programs the DMA engine with physical source + device dest, through the IOMMU.
  4. DMA copies over PCIe into VRAM: the slow link (~16–64 GB/s, versus ~50–100 GB/s for host DRAM and ~400 GB/s to 3+ TB/s for VRAM).
  5. GPU kernels read VRAM in the normal/fast case (barring zero-copy mapped host memory, where a kernel can read host DRAM directly over PCIe).
  6. Results DMA back the same way.
PCIe is the bottleneck — so amortize it
The link is slower than both memories it connects, and roughly one to two orders of magnitude slower than VRAM. Practical consequences (and the reason the prep-set's "100GB to 8 GPUs" answer exists): keep data resident on the GPU (don't round-trip), batch transfers, overlap copy with compute using pinned buffers + streams, and prefer formats you can DMA without re-encoding (raw/contiguous, e.g. safetensors). On discrete, moving data is often costlier than computing on it.
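Why batching matters falls out of a two-term cost model: a fixed overhead per transfer plus bytes divided by bandwidth. The sketch below is a simplification with assumed numbers: 32 GB/s sits inside the PCIe range quoted above, and the 10 µs per-transfer overhead is a hypothetical placeholder, not a measured driver latency.

```python
def transfer_seconds(total_bytes: float, bandwidth_bytes_per_sec: float,
                     n_transfers: int = 1,
                     per_transfer_overhead_sec: float = 10e-6) -> float:
    """Time to move total_bytes split across n_transfers over one link."""
    return n_transfers * per_transfer_overhead_sec + total_bytes / bandwidth_bytes_per_sec

GiB = 1024 ** 3
PCIE = 32e9  # assumed 32 GB/s link

# Same 1 GiB payload: one large transfer vs 65,536 transfers of 16 KiB each.
batched = transfer_seconds(1 * GiB, PCIE, n_transfers=1)
chunked = transfer_seconds(1 * GiB, PCIE, n_transfers=65_536)

print(f"batched: {batched:.3f} s")  # batched: 0.034 s
print(f"chunked: {chunked:.3f} s")  # chunked: 0.689 s
```

With these assumptions the chunked version spends most of its time on overhead, not on moving bytes, which is the case for batching and for keeping data resident once it has crossed the link.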
When do you actually use DMA today?
You almost never program a DMA descriptor by hand; it's the mechanism under higher-level APIs, and it's running constantly beneath them.
Apple note: UMA deletes the cross-bus DMA copy (no PCIe hop), but Apple GPUs still have blit/DMA engines for layout conversions and storageModePrivate uploads. DMA the mechanism doesn't disappear; the PCIe round-trip does. The modern trend pushes further: zero-copy everywhere (io_uring, RDMA, GPUDirect, sendfile) and smart-NIC/DPU offload so the CPU never sees the bytes at all.

5 · Software-unified — CUDA Unified Memory — one pointer, two pools

cudaMallocManaged gives a single pointer valid on both CPU and GPU. It feels unified, but on discrete hardware the memory is still physically split — the runtime migrates pages on demand: touch a managed page on the GPU that currently lives in host DRAM and you take a page fault that copies that page over PCIe, and vice-versa.
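The migrate-on-touch behaviour can be modelled in a few lines. This is a toy simulation, not the CUDA runtime: the class `ManagedPages` and the function `count_migrations` are hypothetical names, every page starts on the CPU, and a touch from the other processor counts as one page migration across the bus.

```python
class ManagedPages:
    """Toy model of managed memory: each page lives on exactly one side at a time."""

    def __init__(self, n_pages: int):
        self.location = ["cpu"] * n_pages   # assume pages start in host DRAM
        self.migrations = 0

    def touch(self, page: int, processor: str) -> None:
        # Touching a page that lives on the other side faults and migrates it.
        if self.location[page] != processor:
            self.location[page] = processor
            self.migrations += 1

def count_migrations(access_pattern: list[str], n_pages: int = 4) -> int:
    """Each step in the pattern touches every page from the named processor."""
    mem = ManagedPages(n_pages)
    for processor in access_pattern:
        for page in range(n_pages):
            mem.touch(page, processor)
    return mem.migrations

phased = count_migrations(["cpu", "gpu", "gpu", "gpu"])      # fill once, then GPU-only
ping_pong = count_migrations(["gpu", "cpu", "gpu", "cpu"])   # alternate every step

print(phased)     # 4
print(ping_pong)  # 16
```

The phased pattern pays for each page once; the alternating pattern pays on every step, which is the fault/migration thrash the single pointer hides.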

6 · Hardware-unified — Apple Silicon UMA

On Apple Silicon there is one physical LPDDR pool on the package, and the CPU, GPU, and Neural Engine all address it. There is no VRAM and no PCIe hop — a buffer is written once and read in place by whichever processor needs it.

Metal storage modes (how you express CPU/GPU visibility)

Mode | Who sees it | Use for
storageModeShared | CPU + GPU, same bytes | the default UMA win: zero-copy handoff (CPU fills, GPU reads)
storageModePrivate | GPU only | GPU-produced/consumed data (render targets, intermediates); still in the one pool, but not CPU-mapped, so the GPU can lay it out optimally
storageModeManaged | CPU + GPU, with a synced copy | a coherence shim for Intel/AMD Macs, where the pools are separate and you call didModifyRange / synchronize to keep the two copies in step; not needed on Apple Silicon, which has no second pool to sync
Unified ≠ "no rules"
You still pick a storage mode, still cross two MMUs, still pin pages for the GPU, and still synchronize access ordering (CPU must not read a buffer the GPU is still writing — that's a fence/completion-handler concern, not a copy). Unified memory removes the copy and the duplicate pool; it does not remove translation, pinning, or ordering.
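The ordering rule is the same one any shared buffer needs, and a CPU-only analogy shows its shape. In the sketch below a worker thread stands in for the GPU and a `threading.Event` stands in for a command buffer's completion signal; this is an analogy in plain Python, not Metal code, and the names `gpu_work` and `done` are illustrative.

```python
import threading

buf = bytearray(4)            # the "shared buffer": one set of bytes, no copies
done = threading.Event()      # stands in for the completion signal

def gpu_work() -> None:
    """Stand-in for a GPU kernel writing results into the shared buffer."""
    for i in range(len(buf)):
        buf[i] = i + 1
    done.set()                # signal completion only after the last write

worker = threading.Thread(target=gpu_work)
worker.start()

done.wait()                   # the reader blocks until the writer has finished
result = bytes(buf)           # safe to read now
worker.join()

print(result)  # b'\x01\x02\x03\x04'
```

Nothing is copied between writer and reader; the only thing exchanged is the "finished" signal, which is exactly what remains to be managed once the copy is gone.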

7 · Cache coherence — the subtle layer

CPU and GPU each have their own caches. When both touch the same physical pages (UMA) or a buffer moves between them (discrete), someone must ensure a reader sees the latest writes rather than stale cached lines.

7.5 · Memory security — ASLR and the Apple mitigation stack

The same virtual-memory machinery that does translation (page tables, MMU) is also the enforcement point for exploit mitigation. The headline is ASLR — Apple deliberately loads code, stack, heap, and libraries at randomized addresses every run, so an attacker can't hardcode where to jump. Because Apple owns the whole silicon stack, it layers several mitigations on top that most platforms can't.

ASLR — randomize the layout so addresses can't be hardcoded

The layers Apple stacks with it

Mitigation | What it does | Why it's hard to beat
ASLR / KASLR | randomize user + kernel address layout | need an info-leak to even know where to aim
PAC (Pointer Authentication) | Apple Silicon signs pointers with a truncated-QARMA crypto tag in the high bits (free because real VAs are ~48-bit, not 64; ARM's Top-Byte-Ignore also frees the top byte); tampering fails the check → crash | the distinctively-Apple one (ARMv8.3, shipped first at scale on A12, 2018): raises the bar against generic ROP/JOP even after an ASLR leak, since a forged pointer won't authenticate. Not absolute: signing gadgets, key leaks, and PACMAN-style speculative oracles are known bypasses.
W^X + code signing | a page is writable XOR executable; you can't write shellcode and run it | iOS enforces hard (no JIT without entitlement); hardware APRR/SPRR makes the W↔X flip cheap
KTRR / CTRR | hardware-locked read-only kernel text | kernel code can't be modified even with a kernel write primitive
The connection to the rest of this doc
Memory protection and memory performance ride the same hardware: page tables + MMU do translation (§2) and isolation; PAC reuses the unused high bits of a 64-bit pointer — free because real address spaces use far fewer than 64 bits (ARMv8 VAs are ~48-bit, and ARM's Top-Byte-Ignore frees the top byte). On UMA, where CPU/GPU/NPU share one physical pool, this isolation matters more, not less — the page tables are what keep one processor's (or process's) buffers invisible to another despite the shared silicon. Apple leans on all of it hard precisely because it controls CPU, GPU, OS, and toolchain end-to-end.
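The "unused high bits" idea behind PAC can be shown with ordinary bit arithmetic. The toy below is not the real scheme: it assumes a 48-bit address space, uses a truncated HMAC-SHA256 in place of QARMA, keeps the key in a Python variable rather than in protected hardware registers, and uses all 16 spare bits where real hardware has fewer. The function names are hypothetical.

```python
import hashlib
import hmac

VA_BITS = 48                       # assumed width of a real virtual address
VA_MASK = (1 << VA_BITS) - 1
# The remaining 64 - 48 = 16 high bits are free to carry a tag.

def _tag(addr: int, key: bytes, context: int) -> int:
    """16-bit tag over (address, context); HMAC stands in for the hardware cipher."""
    msg = addr.to_bytes(8, "little") + context.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "little")

def sign(ptr: int, key: bytes, context: int = 0) -> int:
    """Place the tag in the unused high bits of the pointer."""
    addr = ptr & VA_MASK
    return (_tag(addr, key, context) << VA_BITS) | addr

def authenticate(signed: int, key: bytes, context: int = 0) -> int:
    """Return the plain address if the tag matches; otherwise fail loudly."""
    addr = signed & VA_MASK
    if (signed >> VA_BITS) != _tag(addr, key, context):
        raise ValueError("pointer authentication failed")
    return addr

key = b"toy-key-not-a-real-secret"
ptr = 0x0000_7FFF_1234_5678

signed = sign(ptr, key)
assert authenticate(signed, key) == ptr

tampered = signed ^ (1 << 60)      # flip one tag bit
try:
    authenticate(tampered, key)
    outcome = "accepted"
except ValueError:
    outcome = "rejected"
print(outcome)  # rejected
```

Flipping a tag bit is rejected every time. Overwriting the address bits instead is rejected unless the attacker happens to hit the matching tag, which in this 16-bit toy is about a 1 in 65,536 chance per guess; that small tag space is why the real mitigation raises the bar rather than closing the door.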

8 · How this matters for developers — by scenario

The theory above only pays off as decisions you make in code. Here's the "so what" for the situations you'll actually hit on Apple Silicon — and the one case where the discrete habits you learned are wrong here.

① On-device ML / LLM inference

② Graphics / rendering (Metal)

③ General app / Swift & ObjC

④ Cross-process / system

The one habit to unlearn coming from discrete/CUDA
On a discrete GPU, moving data is often costlier than computing on it, so you architect around minimizing PCIe copies (keep data resident, batch transfers, double-buffer). On UMA there is no copy to minimize — the equivalent trap is the opposite: gratuitous explicit copies (allocating a separate "device" buffer and memcpy-ing into it) that throw away UMA's entire advantage. Port the intent (keep the hot set on-chip, reuse tiles/threadgroup memory) but drop the copy choreography. And replace "will it fit in VRAM?" (a hard OOM) with "will the hot set stay resident in the shared pool?" (a soft, silent SSD-paging cliff).

9 · Side-by-side & gotchas

Axis | Discrete (PCIe) | CUDA Unified (managed) | Apple UMA
Physical pools | 2 (DRAM + VRAM) | 2 (migrated) | 1
Pointer | separate host/device | one (virtual) | one (shared buffer)
Copy to use on GPU | explicit DMA | implicit (page fault) | none
Bottleneck | PCIe bandwidth | fault/migration thrash | shared bandwidth budget
Pinning still needed | yes (for DMA) | runtime-managed | yes (page wiring)
Best for | huge models, max VRAM bandwidth | convenience / oversubscription | low-latency, low-power, copy-free
Gotchas worth naming