Apple Memory Models — UMA, and is it better?

Apple Silicon's Unified Memory Architecture: what a buffer crosses to reach the CPU, GPU, or Neural Engine; how UMA differs from the discrete (PCIe) model at every layer; the history (it's mostly 1960s ideas, integrated); a verdict on whether unified is actually better; and the paging cliff that bounds the answer.

0 · Overview & history
0.5 · Is UMA better? (verdict + the paging cliff)
1 · The full stack (one diagram)
1.5 · Three consumers: CPU · NN engine · shaders
1.7 · Apple on-chip memory (threadgroup, tile, registers)
1.8 · Stack vs heap (CPU and GPU)
2 · Virtual vs physical; user vs kernel
3 · Pageable vs pinned (page-locked)
4 · DMA & the discrete copy path
5 · Software-unified (CUDA managed)
6 · Hardware-unified (Apple UMA)
7 · Cache coherence
7.5 · Memory security: ASLR + the Apple stack
8 · How this matters for developers (by scenario)
9 · Side-by-side & gotchas

0 · Overview & history

A pointer in your code is a virtual address in user space; the hardware that consumes it (DMA engine, GPU MMU) works in physical addresses. Bridging that gap — pin the pages, set up translation, then either copy across a bus (discrete) or share the same physical pages (unified) — is the whole story. Apple Silicon's UMA shares one physical pool, so the copy disappears; a discrete GPU pays a PCIe copy into its own VRAM.

Discrete (NVIDIA/PCIe): separate CPU DRAM and GPU VRAM. Data is copied over PCIe via DMA; pinned + staging buffers make that copy correct and fast.
Software-unified (CUDA managed): one virtual pointer, physically still two pools — pages migrate on page faults. Convenience over PCIe, not its elimination.
Hardware-unified (Apple UMA): one physical LPDDR pool on-package; CPU, GPU, and Neural Engine map the same pages. No copy, lower latency/power — but they share one bandwidth budget and one capacity ceiling.

History — Apple integrated, it didn't invent

Almost none of UMA is new; the pieces are decades old and Apple's contribution is putting them on one package and making the whole memory pool fast and shared:

Virtual memory, the MMU, paging — Atlas (1962), Multics. ~60 years old.
DMA — mainframe era, 1960s: offload bulk copies from the CPU.
Cache coherence — SMP research from the 1980s.
Shared/unified memory for GPUs — integrated GPUs (Intel HD, AMD APUs) shared system DRAM for years, but over a slow bus and a small carve-out; consoles (PS4/PS5, Xbox) shipped real unified GDDR before Apple. SGI's older workstations shared memory too.

What Apple did that's genuinely distinctive: on-package LPDDR with high bandwidth (~68 GB/s on the M1 base, up to ~800 GB/s on Ultra-class parts), a single coherent address space the CPU/GPU/Neural-Engine all use at full speed, and no second VRAM pool at all — so "unified" stops being a budget compromise (integrated graphics) and becomes the high-performance default. The idea is old; making it not slow is the achievement. (The datacenter is now copying it — AMD MI300A is the true single-physical-pool analog; NVIDIA Grace-Hopper keeps two pools (Grace LPDDR5X + Hopper HBM3) joined by cache-coherent NVLink-C2C, closer to the coherent two-pool model in §5/§7.)

0.5 · Is UMA better? — a verdict, and the cliff that bounds it

For Apple's targets — phones, laptops, on-device inference — yes, decisively. As a universal model — no. UMA isn't strictly "better"; it's a different point on the curve: it optimizes bytes-not-moved (latency, power, copy-free sharing), where discrete + HBM optimizes bytes-per-second (raw bandwidth, scalable capacity).

UMA wins when…	Discrete + HBM wins when…
latency-sensitive, power-constrained (battery devices)	peak bandwidth dominates (HBM ~3+ TB/s, GPU-dedicated)
workload copies data CPU↔GPU↔NPU constantly	training-scale throughput, the largest models
the model fits in RAM — "whole model addressable, no copy"	you need to scale capacity past one package (add VRAM, add GPUs)
efficiency per watt matters (moving bytes over PCIe burns power)	CPU and GPU would otherwise contend for one bus

The reason Macs punch above their weight on local LLMs: a 64/128/192 GB Apple Silicon machine can keep a large model fully resident and addressable without a copy, where a same-priced discrete card is capped by its VRAM. UMA trades peak for efficiency and capacity-at-the-low-end — the right trade for its devices, the wrong one for a training cluster.

The cliff that bounds the verdict: UMA can page to SSD

UMA is still ordinary virtual memory — those LPDDR pages are pageable. Under pressure, macOS/iOS compresses memory, then swaps to SSD. And because everything shares one pool, one greedy allocation squeezes everyone.

While the GPU is actively using a buffer, Metal wires (pins) its pages — they can't page out mid-command-buffer. Between uses, they're fair game.
The danger case — working set > physical RAM: on discrete you OOM (fail fast, honest). On UMA it "works"… by paging to SSD, and throughput falls off a cliff: from ~68–800 GB/s on-package to single-digit-GB/s SSD with far worse latency. It looks like it runs — tens to hundreds of times slower.
iOS adds a wired-memory cap: pin too much and jetsam kills your app rather than swapping (phones have little/no swap). So over-allocating either thrashes (Mac) or gets you killed (iPhone).

So the verdict has a hard edge: UMA's "whole model is addressable" win holds only while the hot set stays within physical RAM. Addressable ≠ resident. Exceed the pool and the copy-free advantage is buried under SSD paging — which is exactly why people buy the big-RAM Apple Silicon configs for local models: they're paying to keep the model wired and never touch swap.
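A rough way to see how sharp the edge is: treat every byte touched as coming either from RAM or from SSD, and add the times. This is a minimal sketch, not a model of the macOS pager; the function name and the 400 GB/s / 5 GB/s figures are hypothetical, chosen to sit inside the ranges quoted above.

```python
def effective_bandwidth_gbps(resident_fraction, ram_gbps, ssd_gbps):
    """Average bandwidth when only part of the touched bytes is resident.

    Times add, so the result is a harmonic blend of the two speeds,
    which is why a small miss rate costs so much.
    """
    if not 0.0 <= resident_fraction <= 1.0:
        raise ValueError("resident_fraction must be in [0, 1]")
    seconds_per_gb = (resident_fraction / ram_gbps
                      + (1.0 - resident_fraction) / ssd_gbps)
    return 1.0 / seconds_per_gb


# Hypothetical machine: 400 GB/s unified memory, 5 GB/s SSD.
for frac in (1.0, 0.99, 0.9, 0.5):
    bw = effective_bandwidth_gbps(frac, ram_gbps=400, ssd_gbps=5)
    print(f"{frac:.0%} resident -> {bw:.1f} GB/s")
```

In this toy model, having just 1% of accesses go to SSD already costs more than 40% of the bandwidth, and at 90% resident the machine runs at roughly a tenth of its on-package speed. Real paging adds latency and compression effects that this sketch ignores.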

1 · The full stack — every layer a buffer crosses

[Diagram: the full stack a buffer crosses. Color key: blue = user space · purple = kernel/system · green = hardware/physical · orange = the bus / the cost]

1.5 · Three consumers, three paths — CPU · NN engine · GPU shaders

"Pass memory to the processor" isn't one flow — what the consumer is changes the path. The CPU reads through its own caches (no DMA at all). A transformer / NN accelerator streams weights from device memory into tiny on-chip SRAM and keeps the KV-cache resident. GPU shaders bind textures and vertex buffers through descriptors and sample through fixed-function units. Same DRAM, three different access shapes.

[Diagram: the three consumer paths. Color key: blue = CPU path · purple = NN/transformer accelerator · green = GPU graphics/shader path]

Why the three diverge

Axis	CPU	NN engine (transformer)	GPU shaders
How memory is referenced	raw virtual pointer → MMU	device pointer; weights pre-staged	descriptor / argument buffer binds resources
On-chip staging	L1/L2 + SLC (Apple has no per-core L3; SLC is the SoC-wide last level)	SRAM scratchpad (software-managed tiles)	texture cache + threadgroup memory
Execution unit	scalar/SIMD ALU	MAC array, fixed dataflow (systolic-style — the generic NN-accelerator/TPU pattern; Apple's exact ANE microarchitecture is undocumented)	shader cores + fixed-function samplers
Memory layout	linear, cache-line aligned	tiled for matmul reuse	swizzled/tiled textures, interleaved verts
Bound by	latency (cache misses)	bandwidth (weight streaming) + SRAM size	fill rate / bandwidth / sampler throughput
The key trick	cache locality	keep tiles in SRAM (FlashAttention)	minimize binds + exploit texture cache

The transformer-specific memory story (ties to the inference work)

A transformer accelerator's whole game is the memory hierarchy, not FLOPs: weights sit in device DRAM and are streamed in each forward pass (so decode is bandwidth-bound — see 20_kv_cache_decode); the KV-cache grows one token at a time and lives in device memory (so KV bytes bind serving capacity — the 14_inference_server thesis); and the performance win — FlashAttention — is purely a memory move: keep the attention tiles in on-chip SRAM instead of round-tripping the big matrices to DRAM. CPU caches are hardware-managed; the NN engine's SRAM is software-managed, which is why the kernel author controls the tiling.
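The "decode is bandwidth-bound" claim reduces to one division. The sketch below assumes batch size 1 and that every weight byte is read once per generated token; it ignores KV-cache reads and compute, so it is a ceiling, not a prediction. The 7e9-parameter fp16 model is a hypothetical example.

```python
GB = 1e9


def decode_tokens_per_second(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound for batch-1 decode: each token streams all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical model: 7e9 parameters at 2 bytes each (fp16) = 14 GB of weights.
weights = 7e9 * 2
for label, bandwidth in (("~68 GB/s pool", 68 * GB), ("~800 GB/s pool", 800 * GB)):
    ceiling = decode_tokens_per_second(weights, bandwidth)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```

Halving the bytes per weight (quantization) doubles this ceiling, which is why weight format matters more than FLOPs for single-stream decode.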

1.7 · Apple on-chip memory — the fast tiers above the shared pool

UMA is about the big pool. But the speed of a GPU/NPU kernel is decided above it — in small on-chip memories the shared LPDDR feeds. These are the Apple-relevant ones:

Tier	What it is	Apple-specific note
Registers	per-thread, fastest	register pressure caps occupancy — too many regs/thread → fewer threads resident → less latency hiding
Threadgroup memory	programmer-managed scratchpad shared by a threadgroup (CUDA "shared memory")	Metal `threadgroup` address space; where you stage tiles for reuse. The matmul/attention win lives here.
Tile memory	on-chip framebuffer storage for a screen tile	Apple GPUs are TBDR (tile-based deferred renderers) — the framebuffer for a tile stays on-chip; shaders read/write it without round-tripping DRAM. This is the distinctively-Apple tier. `imageblock` / programmable blending exploit it.
Texture cache + samplers	read-only cache feeding fixed-function sampling	swizzled/tiled texture layout (the shader path in §1.5)
SLC (system level cache)	large last-level cache shared across CPU/GPU/NPU on the SoC	a shared cache in front of the unified pool — part of why UMA stays fast; soaks cross-processor reuse
mmap'd file	a file mapped into virtual memory, paged in on access	how you load weights zero-copy (e.g. safetensors) — on UMA the mapped pages are directly GPU-addressable, no staging copy. The OS pages them from SSD on demand (and back out under pressure — §0.5 cliff).

The pattern: the pool is feedstock; on-chip reuse is the win

TBDR tile memory is the Apple-distinctive piece — keeping a tile's framebuffer on-chip is why mobile GPUs render efficiently (no repeated DRAM framebuffer traffic). For compute, the lever is the same as everywhere: threadgroup memory reuse (the GPU analog of the NN engine's SRAM tiling, §1.5). UMA gets the bytes to the chip without a copy; these tiers decide how fast you actually consume them. storageModeMemoryless is the extreme case — a render target that lives only in tile memory and never gets a DRAM backing at all.
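Why threadgroup-memory reuse is the lever can be shown by counting loads from the big pool. The sketch below is plain Python standing in for a GPU kernel: `a_tile` and `b_tile` play the role of threadgroup memory, and `loads` counts elements fetched from "DRAM". The function name and sizes are illustrative only.

```python
def matmul_tiled(a, b, n, tile):
    """Multiply two n x n matrices tile by tile, counting DRAM element loads.

    tile=1 behaves like the naive kernel: every multiply re-fetches its operands.
    """
    if n % tile != 0:
        raise ValueError("n must be a multiple of tile")
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of A and one of B into the fast scratchpad.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += 2 * tile * tile
                # Every staged element is now reused `tile` times from on-chip.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, loads


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float((i * j) % 5) for j in range(n)] for i in range(n)]
_, naive_loads = matmul_tiled(a, b, n, tile=1)
_, tiled_loads = matmul_tiled(a, b, n, tile=4)
print(naive_loads, tiled_loads)
```

The load count falls as 2·n³/tile, so a 4×4 tile cuts pool traffic by 4× for the same arithmetic. The scratchpad size (and register pressure) is what caps how large the tile can get.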

1.8 · Stack vs heap — and why only heap reaches the GPU

Everything above is about heap memory. The stack never enters the GPU story, and that's the rule to internalize: only heap (or an mmap'd region) gets shared with a GPU by reference; a stack variable does not. A stack frame vanishes on return, often before an asynchronous GPU command has run, so its address is not something you want the driver to pin or map. Small values can still be copied in by value (kernel arguments in CUDA, setBytes in Metal), but that is a copy, not a handoff. "Send this to the GPU" means a heap allocation (malloc / MTLBuffer / cudaMalloc), never a local.

CPU side

	Stack	Heap
Lifetime	scoped — freed on function return	explicit / refcount / GC; outlives the call
Allocation	bump the stack pointer (~free)	allocator call (slower; fragments)
Paging	hot, so resident in practice (pageable, but rarely evicted); not the cliff	where the §0.5 paging cliff lives: large/long-lived allocations swap
GPU-shareable?	no — vanishes, can't be pinned/mapped	yes — the only thing you can give a GPU

GPU side — there's a stack/heap analog too

CPU concept	GPU analog	Note
Stack / locals	registers + threadgroup memory	fast, scoped to the kernel invocation, gone when it ends — the on-chip "stack"
Heap	device buffers / VRAM (UMA: the shared pool)	persists across launches; explicitly allocated; where data you hand the GPU lives
Stack overflow's cousin	register spilling	a kernel using too many registers spills to slow device memory ("local memory" in CUDA): you fall off the fast tier onto the slow one, tanking throughput (heavy register use without spilling is what costs occupancy)

The developer rule

Small, short-lived, per-call → stack / registers (free, automatic, fast, nothing to manage). Shared across CPU/GPU, large, or persistent → heap / device buffer (you own the lifetime, and it's where every UMA/pinning/paging concern in this doc applies). On UMA a storageModeShared buffer is heap the GPU can also see — there is no stack equivalent that crosses the CPU↔GPU boundary, which is why zero-copy handoff is always a heap-buffer story. And watch register pressure in kernels: spilling to device memory is the GPU's silent perf cliff, the analog of blowing the cache on the CPU.
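The register-pressure half of that rule is simple arithmetic: a fixed register file divided by registers per thread bounds how many threads can be resident. A minimal sketch follows; the 65,536-register file and 2,048-thread limit are hypothetical round numbers, not Apple GPU specifications (which are not published at this level).

```python
def resident_threads(regs_per_thread, register_file_size, hw_thread_limit):
    """Threads that can be resident at once, given per-thread register use."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    by_registers = register_file_size // regs_per_thread
    return min(by_registers, hw_thread_limit)


# Hypothetical core: 65,536 registers, at most 2,048 resident threads.
for regs in (32, 64, 128):
    threads = resident_threads(regs, register_file_size=65536, hw_thread_limit=2048)
    print(f"{regs} regs/thread -> {threads} resident threads")
```

Doubling registers per thread halves the threads available to hide memory latency, which is the trade the compiler makes when it chooses between keeping values in registers and spilling them.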

2 · Virtual vs physical; user vs kernel — the two axes

Two independent distinctions get conflated; keep them separate.

Virtual vs physical address. Your pointer is a virtual address private to your process. The MMU (with the TLB as its cache) translates it to a physical DRAM address on every access, via page tables. A device's DMA engine has no notion of your virtual space — it needs the physical address (or its own device-virtual address through an IOMMU).
User vs kernel (system) space. Your code runs in user space and cannot touch hardware, set up DMA, or edit page tables. It asks the kernel (via syscalls/ioctls into the GPU driver) to do that. So "give this buffer to the GPU" is always a trip into kernel space to pin pages and program the device.

Why both matter for GPU handoff

The GPU needs a stable physical target, but your buffer is a movable virtual allocation the OS can swap or relocate at will. Closing that gap — locking the pages (physical stability) from kernel space (privilege) and mapping them for the device (translation) — is the irreducible work, and it's the same in both architectures. Unified memory removes the copy, not the translation + pinning.
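The translation step itself is small enough to sketch. This is a toy single-level page table (real MMUs walk multi-level tables and cache results in the TLB); the 16 KiB page size matches what Apple silicon uses, while the table contents and the `translate` helper are hypothetical.

```python
PAGE_SIZE = 16 * 1024                       # 16 KiB pages
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # 14 bits of in-page offset


def translate(vaddr, page_table):
    """Map a virtual address to a physical one via a {VPN: frame} table."""
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    if vpn not in page_table:
        raise LookupError(f"page fault: virtual page {vpn:#x} is not mapped")
    return (page_table[vpn] << OFFSET_BITS) | offset


# Hypothetical mapping: adjacent virtual pages land in unrelated frames.
page_table = {0x10: 0x7A, 0x11: 0x03}

vaddr = (0x10 << OFFSET_BITS) + 0x123
print(hex(vaddr), "->", hex(translate(vaddr, page_table)))
```

If the OS swaps a page out or moves it, the table entry changes and any physical address a device cached is wrong. Pinning is the promise that the entry stays put while the device is using it.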

3 · Pageable vs pinned (page-locked) memory

Ordinary malloc memory is pageable: the OS may swap it to disk or move its physical frame. That's fatal for DMA — the device could read a frame that's been remapped out from under it.

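Pinned (page-locked) memory (cudaHostAlloc / cudaMallocHost in CUDA) is locked to fixed physical frames so the DMA engine's addresses stay valid for the whole transfer. POSIX mlock is the CPU-side cousin: it keeps pages resident, but on its own it sets up nothing for a device.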
Pageable transfers pay a hidden double-copy: the driver first copies your pageable buffer into an internal pinned staging buffer, then DMAs from there. Allocating pinned up front skips that copy.
Async copy requires pinned: overlapping transfer with compute (cudaMemcpyAsync) only works when the host buffer is page-locked, because the engine must run unattended. From pageable memory the call falls back to the staged path and may block the host.
Cost of pinning: it removes pages from the swappable pool — over-pinning starves the rest of the system. Pin the working set, not everything.
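The hidden double-copy can be put in numbers. A minimal sketch, assuming the staging copy and the DMA run back to back (real drivers chunk and partly pipeline them, so this overstates the penalty somewhat); the 20 GB/s host-copy speed and the 8 GB / 32 GB/s example are hypothetical.

```python
def transfer_seconds(size_gb, pcie_gbps, pinned, staging_copy_gbps=20.0):
    """Host-to-device transfer time for a pinned vs pageable source buffer."""
    dma = size_gb / pcie_gbps
    if pinned:
        return dma                          # DMA reads the user buffer directly
    staging = size_gb / staging_copy_gbps   # CPU copy into the driver's pinned buffer
    return staging + dma


size_gb, pcie = 8, 32
print("pinned:  ", transfer_seconds(size_gb, pcie, pinned=True), "s")
print("pageable:", transfer_seconds(size_gb, pcie, pinned=False), "s")
```

With these numbers the pageable path takes well over twice as long for the same bytes, and nothing in the API call tells you which path you got.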

4 · DMA & the discrete copy path

DMA (Direct Memory Access) lets a dedicated engine move bytes between host DRAM and device VRAM without the CPU shuttling each word. The discrete path, end to end:

App fills a buffer (virtual, pageable).
Runtime allocates/uses a pinned host buffer; kernel locks the pages.
Driver programs the DMA engine with physical source + device dest, through the IOMMU.
DMA copies over PCIe into VRAM — the slow link (~16–64 GB/s vs hundreds of GB/s on either side).
GPU kernels read VRAM in the normal/fast case (barring zero-copy mapped host memory, where a kernel can read host DRAM directly over PCIe).
Results DMA back the same way.

PCIe is the bottleneck — so amortize it

The link is roughly an order of magnitude slower than the VRAM it feeds, and usually slower than host DRAM as well. Practical consequences (and the reason the prep-set's "100GB to 8 GPUs" answer exists): keep data resident on the GPU (don't round-trip), batch transfers, overlap copy with compute using pinned buffers + streams, and prefer formats you can DMA without re-encoding (raw/contiguous, e.g. safetensors). On discrete, moving data is often costlier than computing on it.
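What "overlap copy with compute" buys can be estimated with a two-stage pipeline model: while chunk i is being computed, chunk i+1 is being copied. This is a sketch under idealized assumptions (equal-size chunks, one copy engine, no launch overhead); the function names and timings are illustrative.

```python
def serial_seconds(chunks, copy_s, compute_s):
    """Copy a chunk, compute on it, repeat: nothing overlaps."""
    return chunks * (copy_s + compute_s)


def overlapped_seconds(chunks, copy_s, compute_s):
    """Two-stage pipeline: the slower stage sets the steady-state pace."""
    if chunks == 0:
        return 0.0
    return copy_s + (chunks - 1) * max(copy_s, compute_s) + compute_s


# Hypothetical job: 8 chunks, 100 ms to copy and 100 ms to compute each.
print("serial:    ", serial_seconds(8, 0.1, 0.1), "s")
print("overlapped:", overlapped_seconds(8, 0.1, 0.1), "s")
```

When copy and compute take similar time, overlap nearly halves the total. When one stage dominates, the total approaches that stage alone, which is the signal to shrink the transfer rather than tune the overlap.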

When do you actually use DMA today?

You almost never program a DMA descriptor by hand — it's the mechanism under higher-level APIs. But it's running constantly:

Indirectly, all the time: every cudaMemcpy / Metal blit is the DMA engine; NVMe SSDs DMA into RAM; NICs DMA packets into ring buffers; the display controller DMAs the framebuffer to the screen each refresh. The whole point is keeping the CPU off the data path.
Deliberately, in high-perf data planes: zero-copy and kernel-bypass paths. RDMA/RoCE and DPDK DMA data device→app-memory with neither CPU copies nor the kernel on the data path (how trading, storage, and ML fabrics hit line rate); io_uring keeps the kernel involved but cuts syscalls and copies with shared rings and registered buffers. GPUDirect DMAs straight from NIC/NVMe into GPU VRAM, skipping the host hop (the multi-node-training answer).
By hand, only in: driver / kernel code, embedded firmware (program the DMA controller directly to keep a tiny CPU free), or a custom kernel-bypass data plane.

Apple note: UMA deletes the cross-bus DMA copy (no PCIe hop), but Apple GPUs still have blit/DMA engines for layout conversions and storageModePrivate uploads — DMA the mechanism doesn't disappear, the PCIe round-trip does. The modern trend pushes further: zero-copy everywhere (io_uring, RDMA, GPUDirect, sendfile) and smart-NIC/DPU offload so the CPU never sees the bytes at all.

5 · Software-unified — CUDA Unified Memory — one pointer, two pools

cudaMallocManaged gives a single pointer valid on both CPU and GPU. It feels unified, but on discrete hardware the memory is still physically split — the runtime migrates pages on demand: touch a managed page on the GPU that currently lives in host DRAM and you take a page fault that copies that page over PCIe, and vice-versa.

Win: programmer convenience — no explicit cudaMemcpy, oversubscription (use more than VRAM, paged in as needed).
Catch: the PCIe copy didn't vanish — it became implicit and fault-driven. Bad access patterns thrash pages across the bus. Hints (cudaMemPrefetchAsync, cudaMemAdvise) exist to control migration.
Mental model: software-unified = virtual unification (one address) over physical separation. Apple UMA = physical unification.
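The thrash in the "Catch" bullet is easy to reproduce in a toy simulator: each page lives on one side, and touching it from the other side counts as one migration. This is a sketch of the idea only; it is not how the CUDA runtime tracks pages, and the names are hypothetical.

```python
def count_migrations(accesses, initial="cpu"):
    """Count page moves for a sequence of (page_id, processor) touches."""
    location = {}
    migrations = 0
    for page, processor in accesses:
        if location.get(page, initial) != processor:
            migrations += 1          # page fault, then a copy across the bus
        location[page] = processor
    return migrations


pages = range(4)

# CPU and GPU take turns touching every page, three rounds.
ping_pong = [(p, proc) for _ in range(3) for proc in ("gpu", "cpu") for p in pages]

# CPU finishes first, then the GPU makes three passes over the data.
batched = [(p, "cpu") for p in pages] + [(p, "gpu") for _ in range(3) for p in pages]

print(count_migrations(ping_pong), count_migrations(batched))
```

Both patterns touch the same pages the same number of times from the GPU, but the interleaved one moves six times as many pages. Prefetch and advise hints are ways of turning the first pattern into the second.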

6 · Hardware-unified — Apple Silicon UMA

On Apple Silicon there is one physical LPDDR pool on the package, and the CPU, GPU, and Neural Engine all address it. There is no VRAM and no PCIe hop — a buffer is written once and read in place by whichever processor needs it.

Metal storage modes (how you express CPU/GPU visibility)

Mode	Who sees it	Use for
storageModeShared	CPU + GPU, same bytes	the default UMA win — zero-copy handoff (CPU fills, GPU reads)
storageModePrivate	GPU only	GPU-produced/consumed data (render targets, intermediates) — still in the one pool, but not CPU-mapped, so the GPU can lay it out optimally
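storageModeManaged	CPU + GPU, with a synced copy	macOS only (not in iOS); meant for Macs with discrete GPUs, where a coherence shim bridges separate pools and you call `didModifyRange`/`synchronize`. Unnecessary on Apple silicon, where shared already gives both processors the same bytes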

Unified ≠ "no rules"

You still pick a storage mode, still cross two MMUs, still pin pages for the GPU, and still synchronize access ordering (CPU must not read a buffer the GPU is still writing — that's a fence/completion-handler concern, not a copy). Unified memory removes the copy and the duplicate pool; it does not remove translation, pinning, or ordering.

7 · Cache coherence — the subtle layer

CPU and GPU each have their own caches. When both touch the same physical pages (UMA) or a buffer moves between them (discrete), someone must ensure a reader sees the latest writes rather than stale cached lines.

Discrete: coherence is mostly sidestepped by the copy — the DMA delivers a fresh copy to VRAM; within the GPU, its own cache hierarchy applies. Crossing the bus is an explicit sync point.
Apple UMA: the on-die fabric keeps CPU/GPU access coherent at a hardware level for shared buffers, but you still enforce ordering — wait on the command buffer's completion before the CPU reads GPU output. Coherence (do I see fresh bytes?) is handled; ordering (have the writes happened yet?) is yours.
This is the same split as the concurrency primer: coherence ≈ visibility, ordering ≈ the acquire/release / fence discipline — just across processors instead of threads.
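The ordering half can be shown with threads standing in for processors. In this sketch a worker thread plays the GPU, a `threading.Event` plays the command buffer's completion signal, and the main thread plays the CPU reader. It is an analogy in Python, not Metal code.

```python
import threading


def run_kernel_then_read(n=4):
    """Write a buffer on a 'GPU' thread; read it only after completion."""
    buffer = [0] * n
    done = threading.Event()      # stands in for command-buffer completion

    def fake_gpu_kernel():
        for i in range(n):
            buffer[i] = i * i
        done.set()                # signal: all writes have happened

    worker = threading.Thread(target=fake_gpu_kernel)
    worker.start()
    done.wait()                   # without this, the read could see partial data
    result = list(buffer)
    worker.join()
    return result


print(run_kernel_then_read())
```

Both threads always see the same list (that is the coherence part). The `done.wait()` is the only thing that guarantees the writes are finished before the read (that is the ordering part).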

7.5 · Memory security — ASLR and the Apple mitigation stack

The same virtual-memory machinery that does translation (page tables, MMU) is also the enforcement point for exploit mitigation. The headline is ASLR — Apple deliberately loads code, stack, heap, and libraries at randomized addresses every run, so an attacker can't hardcode where to jump. Because Apple owns the whole silicon stack, it layers several mitigations on top that most platforms can't.

ASLR — randomize the layout so addresses can't be hardcoded

What: the main binary, dynamic linker (dyld), and stack are randomized per launch; the heap gets base randomization plus per-allocation entropy from the allocator (not a single slide). The dyld shared cache slide is per-boot — computed once and shared identically across every process until reboot (so one leaked shared-cache pointer in any process defeats that cache's ASLR everywhere until reboot).
Why: classic exploits jump to a known address (a function, a gadget, injected shellcode). Randomize the layout and that address is different every run — the attacker must first leak a pointer to learn the slide before they can aim. It converts "one bug → code execution" into "need an info-leak and a control bug."
Apple specifics: the dyld shared cache (all system libs, prelinked) is randomized as a unit; on iOS it's shared across processes but slid. KASLR randomizes the kernel itself.
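Why one leaked pointer is enough follows from the arithmetic: everything in a randomized image moves by the same slide. The sketch below uses made-up symbol addresses and a made-up slide range; it illustrates the relationship and does not describe dyld's actual layout.

```python
import secrets

PAGE = 0x4000  # slides are page-aligned; 16 KiB pages assumed


def random_slide(max_pages=1 << 16):
    """Pick a page-aligned slide, as a loader would once per launch or boot."""
    return secrets.randbelow(max_pages) * PAGE


# Hypothetical link-time addresses inside one image.
static_symbols = {"func_a": 0x1000_4000, "func_b": 0x1000_9F30}

slide = random_slide()
runtime = {name: addr + slide for name, addr in static_symbols.items()}

# Knowing the runtime address of any one symbol reveals the slide for all.
recovered = runtime["func_a"] - static_symbols["func_a"]
print(recovered == slide)
```

This is the reason a per-boot slide shared by every process is weaker than a per-launch one: the subtraction only has to succeed once, anywhere, until the next reboot.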

The layers Apple stacks with it

Mitigation	What it does	Why it's hard to beat
ASLR / KASLR	randomize user + kernel address layout	need an info-leak to even know where to aim
PAC (Pointer Authentication)	Apple Silicon signs pointers with a truncated cryptographic tag in the unused high bits (free because real VAs are ~48-bit, not 64; where ARM's Top-Byte-Ignore is enabled the top byte is kept for tags and the PAC gets the bits below it). ARM's reference cipher is QARMA; the algorithm in Apple's cores is not publicly documented. Tampering fails the check → crash	the distinctively-Apple one (ARMv8.3, shipped first at scale on A12, 2018): raises the bar against generic ROP/JOP even after an ASLR leak, since a forged pointer won't authenticate. Not absolute: signing gadgets, key leaks, and PACMAN-style speculative oracles are known bypasses.
W^X + code signing	a page is writable XOR executable; you can't write shellcode and run it	iOS enforces hard (no JIT without entitlement); hardware APRR/SPRR makes the W↔X flip cheap
KTRR / CTRR	hardware-locked read-only kernel text	kernel code can't be modified even with a kernel write primitive

The connection to the rest of this doc

Memory protection and memory performance ride the same hardware: page tables + MMU do translation (§2) and isolation; PAC reuses the unused high bits of a 64-bit pointer, free because real address spaces use far fewer than 64 bits (ARMv8 VAs are ~48-bit; with Top-Byte-Ignore enabled the top byte stays reserved for tags and the PAC takes the bits below it). On UMA, where CPU/GPU/NPU share one physical pool, this isolation matters more, not less: the page tables are what keep one processor's (or process's) buffers invisible to another despite the shared silicon. Apple leans on all of it hard precisely because it controls CPU, GPU, OS, and toolchain end-to-end.
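The "tag in the spare high bits" idea can be mimicked in a few lines. This toy uses HMAC-SHA256 truncated to 16 bits as a stand-in for the hardware cipher, a 48-bit address field, and a software key; all three are assumptions for illustration. Real PAC uses per-process keys held in system registers, a different tag width and bit layout, and a hardware algorithm.

```python
import hashlib
import hmac

VA_BITS = 48
VA_MASK = (1 << VA_BITS) - 1


def sign(ptr, key, context):
    """Return ptr with a 16-bit tag over (address, context) in bits 48..63."""
    addr = ptr & VA_MASK
    message = addr.to_bytes(8, "little") + context.to_bytes(8, "little")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    tag = int.from_bytes(digest[:2], "little")   # truncate to 16 bits
    return (tag << VA_BITS) | addr


def authenticate(signed_ptr, key, context):
    """Strip and check the tag; raise instead of returning a forged address."""
    addr = signed_ptr & VA_MASK
    if sign(addr, key, context) != signed_ptr:
        raise ValueError("pointer authentication failed")
    return addr


key = b"demo-key-not-secret"          # hypothetical key
ptr = 0x0000_7FFF_1234_5678           # hypothetical 48-bit address
signed = sign(ptr, key, context=1)
print(hex(authenticate(signed, key, context=1)))
```

An attacker who overwrites the address bits without the key has to guess a matching tag, and a 16-bit tag leaves a 1-in-65,536 chance per try, with a crash on each miss. The context value is what ties a signature to one use, so a valid pointer lifted from elsewhere does not authenticate here.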

8 · How this matters for developers — by scenario

The theory above only pays off as decisions you make in code. Here's the "so what" for the situations you'll actually hit on Apple Silicon — and the one case where the discrete habits you learned are wrong here.

① On-device ML / LLM inference

Use storageModeShared and stop copying. The discrete reflex — allocate host buffer, allocate device buffer, memcpy between them — is pure waste on UMA. Fill the buffer on the CPU, hand it to the GPU, read results back, all in place.
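Load weights with mmap, not read. A memory-mapped safetensors/GGUF file is paged in on demand, and on UMA the mapping can be wrapped as a GPU buffer without a staging copy (in Metal, makeBuffer(bytesNoCopy:length:options:deallocator:) on page-aligned memory), so the load is zero-copy and the OS evicts cold pages for free. (This is why local LLM runtimes mmap their weights.)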
Size RAM to keep the model resident. The §0.5 cliff is the whole ballgame: a model that fits in the unified pool runs at full bandwidth; one that's slightly too big silently swaps to SSD and craters. Buy/target the RAM tier that keeps your hot set (weights + KV-cache) wired. Addressable ≠ resident.
Budget the KV-cache as memory, not an afterthought. It grows per token and shares the one pool — it competes with the weights and the rest of the app. (Ties to 14_inference_server / 20_kv_cache_decode.)
Expect bandwidth contention. A memory-bound GPU kernel and a busy CPU fight for the same bus — don't assume UMA bandwidth is "free" per processor.
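Budgeting the KV-cache is a formula, so it can be checked before buying RAM. The sketch below assumes a standard decoder layout (one K and one V vector per layer, per KV head, per token) and fp16 values; the model shape, the 14 GB of weights, and the 4 GB reserve for the OS and app are hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Bytes of KV-cache for one sequence of `tokens` tokens."""
    # Factor 2: one key vector and one value vector per layer and token.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def fits_resident(weight_bytes, kv_bytes, ram_bytes, reserve_bytes):
    """True if weights + KV-cache + everything else stay inside physical RAM."""
    return weight_bytes + kv_bytes + reserve_bytes <= ram_bytes


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, fp16 cache.
per_token = kv_cache_bytes(32, 8, 128, tokens=1)
full_context = kv_cache_bytes(32, 8, 128, tokens=8192)
print(per_token, "bytes per token;", full_context / 2**30, "GiB at 8192 tokens")

weights = 14e9   # hypothetical fp16 weights
reserve = 4e9    # hypothetical OS + app footprint
for ram_gb in (16, 32):
    print(ram_gb, "GB:", fits_resident(weights, full_context, ram_gb * 1e9, reserve))
```

The cache grows linearly with context length and with the number of concurrent sequences, so a model that fits at a short prompt can cross the §0.5 cliff partway through a long one.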

② Graphics / rendering (Metal)

Exploit TBDR tile memory. Use storageModeMemoryless for transient render targets that are written and consumed within a single render pass (depth, MSAA color/resolve sources) — they live only in on-chip tile memory and never get DRAM backing, saving bandwidth and power. A classic deferred G-buffer qualifies only when you fuse the passes into one via programmable blending / imageblocks (tile shading); a G-buffer written in a geometry pass and read in a later pass forces DRAM backing and can't be memoryless. Porting a desktop renderer that allocates real textures for the genuinely-transient targets leaves the Apple GPU's main advantage on the table.
Pick storage mode by who touches it: shared for CPU-updated data (uniforms, dynamic vertices), private for GPU-only data (let the driver lay it out optimally), memoryless for tile-only transients.
Minimize binds, batch draws — data reaches shaders via descriptor/argument buffers, so argument-buffer reuse beats re-binding per draw.

③ General app / Swift & ObjC

You mostly don't think about this — until memory pressure. ARC + the allocator handle it; the layer that bites is paging. Honor memory-pressure notifications (didReceiveMemoryWarning, DispatchSource memory-pressure) and drop caches before the OS swaps or jetsam kills you.
On iOS, respect the wired/footprint cap. Large pinned allocations (big images, decoded video, model buffers) count against a hard footprint limit — exceed it and you're killed, not swapped. Downsample, stream, release.
mmap big read-only assets instead of loading them resident — they page in lazily and out under pressure.
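The mmap pattern looks like this in outline. The sketch writes a small file of native-endian float32 values, maps it read-only, and reads it through a typed view without copying the bytes into a Python buffer first. The file name and helper names are hypothetical, and a real weights file would carry a header and a defined byte order.

```python
import mmap
import os
import struct
import tempfile


def write_demo_weights(path, values):
    """Write values as raw native-endian float32 (stand-in for a weights file)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}f", *values))


def sum_mapped_floats(path):
    """Map the file read-only and sum its float32 contents in place."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Pages are faulted in from disk only as the view touches them.
            with memoryview(mapped) as raw, raw.cast("f") as floats:
                return sum(floats)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    write_demo_weights(path, [1.0, 2.5, -0.5, 4.0])
    total = sum_mapped_floats(path)
print(total)
```

Because the mapping is read-only and file-backed, the OS can drop its pages under pressure without writing them to swap and fault them back in from the file later.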

④ Cross-process / system

Share with mmap/shared memory, don't serialize-and-copy when two processes need the same big buffer (the CPU analog of UMA's zero-copy handoff).
ASLR/PAC are free wins you shouldn't fight — don't hardcode addresses, don't disable hardened runtime, keep pointer-auth on. They cost you nothing and raise the exploitation bar a lot (§7.5).

The one habit to unlearn coming from discrete/CUDA

On a discrete GPU, moving data is often costlier than computing on it, so you architect around minimizing PCIe copies (keep data resident, batch transfers, double-buffer). On UMA there is no copy to minimize — the equivalent trap is the opposite: gratuitous explicit copies (allocating a separate "device" buffer and memcpy-ing into it) that throw away UMA's entire advantage. Port the intent (keep the hot set on-chip, reuse tiles/threadgroup memory) but drop the copy choreography. And replace "will it fit in VRAM?" (a hard OOM) with "will the hot set stay resident in the shared pool?" (a soft, silent SSD-paging cliff).

9 · Side-by-side & gotchas

Axis	Discrete (PCIe)	CUDA Unified (managed)	Apple UMA
Physical pools	2 (DRAM + VRAM)	2 (migrated)	1
Pointer	separate host/device	one (virtual)	one (shared buffer)
Copy to use on GPU	explicit DMA	implicit (page fault)	none
Bottleneck	PCIe bandwidth	fault/migration thrash	shared bandwidth budget
Pinning still needed	yes (for DMA)	runtime-managed	yes (page wiring)
Best for	huge models, max VRAM bandwidth	convenience / oversubscription	low-latency, low-power, copy-free

Gotchas worth naming

Forgetting to pin → silent fallback to the staging double-copy; transfers look slow for no obvious reason.
Synchronous copy on the critical path → CPU stalls waiting for PCIe. Use async + pinned + streams to overlap.
Treating CUDA managed memory as free unification → page-fault thrash over PCIe; prefetch/advise to fix.
On UMA, reading GPU output before the command buffer completes → you read stale/partial bytes. Coherence is handled; ordering is not — wait on completion.
Over-pinning → starves the swappable pool; the whole system degrades.
Assuming UMA bandwidth is "free" → CPU and GPU contend for one bus; a memory-bound GPU kernel can starve the CPU and vice-versa.