Apple Silicon's Unified Memory Architecture: what a buffer crosses to reach the CPU, GPU, or Neural Engine; how UMA differs from the discrete (PCIe) model at every layer; the history (it's mostly 1960s ideas, integrated); a verdict on whether unified is actually better; and the paging cliff that bounds the answer.
A pointer in your code is a virtual address in user space; the hardware that
consumes it (DMA engine, GPU MMU) works in physical addresses. Bridging that
gap — pin the pages, set up translation, then either copy across a bus
(discrete) or share the same physical pages (unified) — is the whole story.
Apple Silicon's UMA shares one physical pool, so the copy disappears; a
discrete GPU pays a PCIe copy into its own VRAM.
Discrete (NVIDIA/PCIe): separate CPU DRAM and GPU VRAM. Data is copied over PCIe via DMA; pinned + staging buffers make that copy correct and fast.
Software-unified (CUDA managed): one virtual pointer, physically still two pools — pages migrate on page faults. Convenience over PCIe, not its elimination.
Hardware-unified (Apple UMA): one physical LPDDR pool on-package; CPU, GPU, and Neural Engine map the same pages. No copy, lower latency/power — but they share one bandwidth budget and one capacity ceiling.
History — Apple integrated, it didn't invent
Almost none of UMA is new; the pieces are decades old and Apple's contribution is
putting them on one package and making the whole memory pool fast and shared:
Virtual memory, the MMU, paging — Atlas (1962), Multics. ~60 years old.
DMA — mainframe era, 1960s: offload bulk copies from the CPU.
Cache coherence — SMP research from the 1980s.
Shared/unified memory for GPUs — integrated GPUs (Intel HD, AMD APUs) shared system DRAM for years, but over a slow bus and a small carve-out; consoles (PS4/PS5, Xbox) shipped real unified GDDR before Apple. SGI's older workstations shared memory too.
What Apple did that's genuinely distinctive: on-package LPDDR with high bandwidth
(~68 GB/s on the M1 base, up to ~800 GB/s on Ultra-class parts), a single coherent address space the CPU/GPU/Neural-Engine all use
at full speed, and no second VRAM pool at all — so "unified" stops being a
budget compromise (integrated graphics) and becomes the high-performance default.
The idea is old; making it not slow is the achievement. (The datacenter is
now copying it — AMD MI300A is the true single-physical-pool analog; NVIDIA
Grace-Hopper keeps two pools (Grace LPDDR5X + Hopper HBM3) joined by cache-coherent
NVLink-C2C, closer to the coherent two-pool model in §5/§7.)
0.5 · Is UMA better? — a verdict, and the cliff that bounds it
For Apple's targets — phones, laptops, on-device inference — yes, decisively.
As a universal model — no. UMA isn't strictly "better"; it's a different point
on the curve: it optimizes bytes-not-moved (latency, power, copy-free
sharing), where discrete + HBM optimizes bytes-per-second (raw bandwidth,
scalable capacity).
UMA wins when: the model fits in RAM ("whole model addressable, no copy"); efficiency per watt matters (moving bytes over PCIe burns power).
Discrete wins when: you need to scale capacity past one package (add VRAM, add GPUs); CPU and GPU would otherwise contend for one bus.
The reason Macs punch above their weight on local LLMs: a 64/128/192 GB Apple
Silicon machine can keep a large model fully resident and addressable
without a copy, where a same-priced discrete card is capped by its VRAM. UMA trades
peak bandwidth for efficiency and capacity at the low end: the right trade
for its devices, the wrong one for a training cluster.
The cliff that bounds the verdict: UMA can page to SSD
UMA is still ordinary virtual memory — those LPDDR pages are pageable. Under
pressure, macOS/iOS compresses memory, then swaps to SSD. And because
everything shares one pool, one greedy allocation squeezes everyone.
While the GPU is actively using a buffer, Metal wires (pins) its pages — they can't page out mid-command-buffer. Between uses, they're fair game.
The danger case — working set > physical RAM: on discrete you OOM (fail fast, honest). On UMA it "works"… by paging to SSD, and throughput falls off a cliff: from ~68–800 GB/s on-package to single-digit-GB/s SSD with far worse latency. It looks like it runs — tens to hundreds of times slower.
iOS adds a wired-memory cap: pin too much and jetsam kills your app rather than swapping (phones have little/no swap). So over-allocating either thrashes (Mac) or gets you killed (iPhone).
So the verdict has a hard edge: UMA's "whole model is addressable" win holds
only while the hot set stays within physical RAM. Addressable ≠
resident. Exceed the pool and the copy-free advantage is buried under SSD
paging — which is exactly why people buy the big-RAM Apple Silicon configs for local
models: they're paying to keep the model wired and never touch swap.
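The size of the cliff can be estimated with one division. The sketch below assumes decode is purely bandwidth-bound (every generated token streams all weights once) and that, once paging starts, the weights effectively come from SSD. The 40 GB model, the 400 GB/s on-package figure, and the 5 GB/s SSD figure are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode rate when each token must stream all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


GB = 1e9
weights = 40 * GB        # hypothetical model size
on_package = 400 * GB    # assumed unified-memory bandwidth (within the ~68-800 GB/s range)
ssd = 5 * GB             # assumed sustained SSD read rate

fast = decode_tokens_per_second(weights, on_package)
slow = decode_tokens_per_second(weights, ssd)
print(f"resident: {fast:.1f} tok/s, paging: {slow:.3f} tok/s, slowdown: {fast / slow:.0f}x")
```

Real workloads land somewhere between the two numbers, because only part of the hot set is usually evicted, but the ratio shows why "it still runs" is not the same as "it is usable".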
1 · The full stack — every layer a buffer crosses
[Diagram: the full stack, color-coded: blue = user space · purple = kernel/system · green = hardware/physical · orange = the bus / the cost]
1.5 · Three consumers, three paths — CPU · NN engine · GPU shaders
"Pass memory to the processor" isn't one flow — what the consumer is changes
the path. The CPU reads through its own caches (no DMA at all). A
transformer / NN accelerator streams weights from device memory into tiny
on-chip SRAM and keeps the KV-cache resident. GPU shaders bind textures and
vertex buffers through descriptors and sample through fixed-function units. Same
DRAM, three different access shapes.
[Diagram: the three consumer paths, color-coded: blue = CPU path · purple = NN/transformer accelerator · green = GPU graphics/shader path]
Why the three diverge
Axis | CPU | NN engine (transformer) | GPU shaders
How memory is referenced | raw virtual pointer → MMU | device pointer; weights pre-staged | descriptor / argument buffer binds resources
On-chip staging | L1/L2 + SLC (Apple has no per-core L3; SLC is the SoC-wide last level) | SRAM scratchpad (software-managed tiles) | texture cache + threadgroup memory
Execution unit | scalar/SIMD ALU | MAC array, fixed dataflow (systolic-style: the generic NN-accelerator/TPU pattern; Apple's exact ANE microarchitecture is undocumented) | shader cores + fixed-function samplers
Memory layout | linear, cache-line aligned | tiled for matmul reuse | swizzled/tiled textures, interleaved verts
Bound by | latency (cache misses) | bandwidth (weight streaming) + SRAM size | fill rate / bandwidth / sampler throughput
The key trick | cache locality | keep tiles in SRAM (FlashAttention) | minimize binds + exploit texture cache
The transformer-specific memory story (ties to the inference work)
A transformer accelerator's whole game is the memory hierarchy, not FLOPs:
weights sit in device DRAM and are streamed in each forward pass (so decode
is bandwidth-bound — see 20_kv_cache_decode); the
KV-cache grows one token at a time and lives in device memory (so KV bytes
bind serving capacity — the 14_inference_server thesis); and the
performance win — FlashAttention — is purely a memory move: keep the
attention tiles in on-chip SRAM instead of round-tripping the big matrices
to DRAM. CPU caches are hardware-managed; the NN engine's SRAM is
software-managed, which is why the kernel author controls the tiling.
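To make "KV bytes bind serving capacity" concrete, here is a minimal sizing sketch. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical example chosen for round numbers, not a specific model from this document.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Bytes of KV-cache held for `seq_len` tokens of context."""
    # The leading 2 accounts for the separate K and V tensors kept per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical model shape, fp16 cache entries.
per_token = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=1)
full_context = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

print(f"per token: {per_token // 1024} KiB, at 8192 tokens: {full_context / 2**30:.1f} GiB")
```

The cache grows linearly with context length and batch size, so on a shared pool it has to be budgeted next to the weights rather than treated as slack.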
1.7 · Apple on-chip memory — the fast tiers above the shared pool
UMA is about the big pool. But the speed of a GPU/NPU kernel is decided
above it — in small on-chip memories the shared LPDDR feeds. These are the
Apple-relevant ones:
Tier | What it is | Apple-specific note
Registers | per-thread, fastest | register pressure caps occupancy: too many regs/thread → fewer threads resident → less latency hiding
Threadgroup memory | programmer-managed scratchpad shared by a threadgroup (CUDA "shared memory") | Metal threadgroup address space; where you stage tiles for reuse. The matmul/attention win lives here.
Tile memory | on-chip framebuffer storage for a screen tile | Apple GPUs are TBDR (tile-based deferred renderers): the framebuffer for a tile stays on-chip; shaders read/write it without round-tripping DRAM. This is the distinctively-Apple tier. imageblock / programmable blending exploit it.
Texture cache + samplers | read-only cache feeding fixed-function sampling | swizzled/tiled texture layout (the shader path in §1.5)
SLC (system level cache) | large last-level cache shared across CPU/GPU/NPU on the SoC | a shared cache in front of the unified pool; part of why UMA stays fast; soaks cross-processor reuse
mmap'd file | a file mapped into virtual memory, paged in on access | how you load weights zero-copy (e.g. safetensors): on UMA the mapped pages are directly GPU-addressable, no staging copy. The OS pages them from SSD on demand (and back out under pressure; see the §0.5 cliff).
The pattern: the pool is feedstock; on-chip reuse is the win
TBDR tile memory is the Apple-distinctive piece — keeping a tile's
framebuffer on-chip is why mobile GPUs render efficiently (no repeated DRAM
framebuffer traffic). For compute, the lever is the same as everywhere:
threadgroup memory reuse (the GPU analog of the NN engine's SRAM tiling, §1.5).
UMA gets the bytes to the chip without a copy; these tiers decide how fast
you actually consume them. storageModeMemoryless is the extreme case —
a render target that lives only in tile memory and never gets a DRAM
backing at all.
1.8 · Stack vs heap — and why only heap reaches the GPU
Everything above is about heap memory. The stack never enters the
GPU story, and that's the rule to internalize: only heap (or an mmap'd region)
can be shared with a GPU by reference; you can never hand it the address of a stack
variable. A stack frame vanishes on return and its address isn't something the driver
can pin or map. Small values can still travel by value (kernel arguments, Metal's
setBytes), but that is a copy into driver-owned memory, not sharing. "Send this
buffer to the GPU" always means a long-lived allocation (malloc / MTLBuffer /
cudaMalloc), never a local.
CPU side | Stack | Heap
Lifetime | scoped; freed on function return | explicit / refcount / GC; outlives the call
Allocation | bump the stack pointer (~free) | allocator call (slower; fragments)
Paging | hot + wired in practice; not the cliff | where the §0.5 paging cliff lives: large/long-lived allocations swap
GPU-shareable? | no: vanishes, can't be pinned/mapped | yes: the only thing you can give a GPU
GPU side — there's a stack/heap analog too
CPU concept | GPU analog | Note
Stack / locals | registers + threadgroup memory | fast, scoped to the kernel invocation, gone when it ends: the on-chip "stack"
Heap | device buffers / VRAM (UMA: the shared pool) | persists across launches; explicitly allocated; where data you hand the GPU lives
Stack overflow's cousin | register spilling | a kernel using too many registers spills to slow device memory ("local memory" in CUDA): you fall off the fast tier onto the slow one, tanking occupancy/throughput
The developer rule
Small, short-lived, per-call → stack / registers (free, automatic, fast,
nothing to manage). Shared across CPU/GPU, large, or persistent → heap / device
buffer (you own the lifetime, and it's where every UMA/pinning/paging concern in
this doc applies). On UMA a storageModeShared buffer is heap the GPU
can also see — there is no stack equivalent that crosses the CPU↔GPU boundary,
which is why zero-copy handoff is always a heap-buffer story. And watch register
pressure in kernels: spilling to device memory is the GPU's silent perf cliff, the
analog of blowing the cache on the CPU.
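The register-pressure point is simple arithmetic. The sketch below uses hypothetical, CUDA-SM-like limits (a 65,536-entry register file and a 2,048-thread hardware cap per core); Apple does not publish the equivalent figures for its GPUs, so treat the numbers as placeholders for the shape of the trade-off.

```python
def resident_threads(register_file_entries: int, regs_per_thread: int,
                     hw_thread_limit: int) -> int:
    """Threads that can be resident at once, limited by registers or by the hardware cap."""
    return min(hw_thread_limit, register_file_entries // regs_per_thread)


REGISTER_FILE = 65_536   # hypothetical registers per GPU core
THREAD_LIMIT = 2_048     # hypothetical hardware cap on resident threads

for regs in (32, 64, 128, 255):
    threads = resident_threads(REGISTER_FILE, regs, THREAD_LIMIT)
    print(f"{regs:>3} regs/thread -> {threads:>4} resident threads "
          f"({threads / THREAD_LIMIT:.0%} occupancy)")
```

Fewer resident threads means fewer candidates to switch to while a memory access is in flight, which is the latency-hiding loss described above.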
2 · Virtual vs physical; user vs kernel — the two axes
Two independent distinctions get conflated; keep them separate.
Virtual vs physical address. Your pointer is a virtual address private to your process. The MMU (with the TLB as its cache) translates it to a physical DRAM address on every access, via page tables. A device's DMA engine has no notion of your virtual space — it needs the physical address (or its own device-virtual address through an IOMMU).
User vs kernel (system) space. Your code runs in user space and cannot touch hardware, set up DMA, or edit page tables. It asks the kernel (via syscalls/ioctls into the GPU driver) to do that. So "give this buffer to the GPU" is always a trip into kernel space to pin pages and program the device.
Why both matter for GPU handoff
The GPU needs a stable physical target, but your buffer is a movable
virtual allocation the OS can swap or relocate at will. Closing that gap —
locking the pages (physical stability) from kernel space (privilege) and mapping
them for the device (translation) — is the irreducible work, and it's the same in
both architectures. Unified memory removes the copy, not the
translation + pinning.
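The translation step can be sketched as a lookup plus bit arithmetic. This toy uses a single-level dictionary as the page table and made-up frame numbers; real page tables are multi-level and walked by hardware, and the 16 KiB page size matches Apple Silicon's page size.

```python
PAGE_SIZE = 16 * 1024                        # 16 KiB pages
OFFSET_BITS = PAGE_SIZE.bit_length() - 1     # 14 bits of in-page offset

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {0x10: 0x2A3, 0x11: 0x07F}


def translate(vaddr: int) -> int:
    """Map a virtual address to a physical one, or fault if the page is unmapped."""
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    if vpn not in page_table:
        raise LookupError(f"page fault at {vaddr:#x}")
    return (page_table[vpn] << OFFSET_BITS) | offset


print(hex(translate(0x40123)))
```

Pinning, in this picture, is a promise that the entries a device depends on will not change or disappear while it is using them.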
3 · Pageable vs pinned (page-locked) memory
Ordinary malloc memory is pageable: the OS may swap it to disk
or move its physical frame. That's fatal for DMA — the device could read a frame
that's been remapped out from under it.
Pinned (page-locked) memory (cudaHostAlloc, mlock) is locked to fixed physical frames so the DMA engine's addresses stay valid for the whole transfer.
Pageable transfers pay a hidden double-copy: the driver first copies your pageable buffer into an internal pinned staging buffer, then DMAs from there. Allocating pinned up front skips that copy.
Async copy requires pinned: cudaMemcpyAsync accepts a pageable source, but the copy is then staged through the driver and may not overlap with compute. True overlap needs a page-locked source, because the DMA engine must run unattended.
Cost of pinning: it removes pages from the swappable pool — over-pinning starves the rest of the system. Pin the working set, not everything.
4 · DMA & the discrete copy path
DMA (Direct Memory Access) lets a dedicated engine move bytes between host
DRAM and device VRAM without the CPU shuttling each word. The discrete path, end to
end:
App fills a buffer (virtual, pageable).
Runtime allocates/uses a pinned host buffer; kernel locks the pages.
Driver programs the DMA engine with physical source + device dest, through the IOMMU.
DMA copies over PCIe into VRAM — the slow link (~16–64 GB/s vs hundreds of GB/s on either side).
GPU kernels read VRAM in the normal/fast case (barring zero-copy mapped host memory, where a kernel can read host DRAM directly over PCIe).
Results DMA back the same way.
PCIe is the bottleneck — so amortize it
The link is an order of magnitude slower than either memory it connects. Practical
consequences (and the reason the prep-set's "100GB to 8 GPUs" answer exists):
keep data resident on the GPU (don't round-trip), batch transfers,
overlap copy with compute using pinned buffers + streams, and prefer formats
you can DMA without re-encoding (raw/contiguous, e.g. safetensors). On discrete,
moving data is often costlier than computing on it.
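A back-of-the-envelope model shows why overlap matters. The 100 GB payload, the ~25 GB/s sustained link rate, the ten chunks, and the 0.3 s of compute per chunk are all assumed values for illustration.

```python
def transfer_seconds(nbytes: float, link_bytes_per_s: float) -> float:
    """Time to push nbytes across a link at a sustained rate."""
    return nbytes / link_bytes_per_s


def pipeline_seconds(chunks: int, copy_s: float, compute_s: float, overlapped: bool) -> float:
    """Total time to copy and process `chunks` equal pieces."""
    if not overlapped:
        # Synchronous: every chunk pays copy, then compute, back to back.
        return chunks * (copy_s + compute_s)
    # Double-buffered: only the first copy and the last compute are exposed;
    # in between, each step costs whichever stage is slower.
    return copy_s + (chunks - 1) * max(copy_s, compute_s) + compute_s


GB = 1e9
total_copy = transfer_seconds(100 * GB, 25 * GB)   # hypothetical payload and link rate
serial = pipeline_seconds(10, total_copy / 10, 0.3, overlapped=False)
overlap = pipeline_seconds(10, total_copy / 10, 0.3, overlapped=True)
print(f"copy alone: {total_copy:.1f} s, serial: {serial:.1f} s, overlapped: {overlap:.1f} s")
```

With overlap the run is bounded by the slower stage, here the link, which is the quantitative version of "keep data resident and batch transfers".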
When do you actually use DMA today?
You almost never program a DMA descriptor by hand — it's the mechanism
under higher-level APIs. But it's running constantly:
Indirectly, all the time: every cudaMemcpy / Metal blit is the DMA engine; NVMe SSDs DMA into RAM; NICs DMA packets into ring buffers; the display controller DMAs the framebuffer to the screen each refresh. The whole point is keeping the CPU off the data path.
Deliberately, in high-perf data planes: kernel-bypass and zero-copy I/O (RDMA/RoCE, DPDK, and io_uring, which trims syscalls and copies rather than bypassing the kernel outright) DMA data device→app-memory with the CPU off the data path (how trading, storage, and ML fabrics hit line rate). GPUDirect DMAs straight from NIC/NVMe into GPU VRAM, skipping the host hop (the multi-node-training answer).
By hand, only in: driver / kernel code, embedded firmware (program the DMA controller directly to keep a tiny CPU free), or a custom kernel-bypass data plane.
Apple note: UMA deletes the cross-bus DMA copy (no PCIe hop), but
Apple GPUs still have blit/DMA engines for layout conversions and
storageModePrivate uploads — DMA the mechanism doesn't disappear, the
PCIe round-trip does. The modern trend pushes further: zero-copy everywhere
(io_uring, RDMA, GPUDirect, sendfile) and smart-NIC/DPU offload so the
CPU never sees the bytes at all.
5 · Software-unified — CUDA Unified Memory — one pointer, two pools
cudaMallocManaged gives a single pointer valid on both CPU and
GPU. It feels unified, but on discrete hardware the memory is still physically
split — the runtime migrates pages on demand: touch a managed page on the
GPU that currently lives in host DRAM and you take a page fault that copies
that page over PCIe, and vice-versa.
Win: programmer convenience — no explicit cudaMemcpy, oversubscription (use more than VRAM, paged in as needed).
Catch: the PCIe copy didn't vanish — it became implicit and fault-driven. Bad access patterns thrash pages across the bus. Hints (cudaMemPrefetchAsync, cudaMemAdvise) exist to control migration.
Mental model: software-unified = virtual unification (one address) over physical separation. Apple UMA = physical unification.
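The thrash failure mode is easy to see in a toy simulation. This is not how the CUDA runtime is implemented (real migration granularity, prefetching, and heuristics differ); it only counts how often a page would have to cross the bus under two access patterns, assuming pages start on the CPU side.

```python
def count_migrations(accesses, page_size: int = 4096) -> int:
    """Count cross-bus page moves for a sequence of (processor, byte_address) accesses."""
    location = {}      # page number -> "cpu" or "gpu"; unseen pages start on the CPU
    migrations = 0
    for proc, addr in accesses:
        page = addr // page_size
        if location.get(page, "cpu") != proc:
            migrations += 1        # fault: the page is copied to the accessing side
        location[page] = proc
    return migrations


# CPU and GPU alternate on the same page: nearly every access migrates it.
ping_pong = [("gpu" if i % 2 else "cpu", 0) for i in range(1000)]
# Same number of accesses, but grouped into a CPU phase and a GPU phase.
phased = [("cpu", 0)] * 500 + [("gpu", 0)] * 500

print(count_migrations(ping_pong), count_migrations(phased))
```

Grouping accesses into phases, or prefetching ahead of a phase, is what the cudaMemPrefetchAsync / cudaMemAdvise hints are for.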
6 · Hardware-unified — Apple Silicon UMA
On Apple Silicon there is one physical LPDDR pool on the package, and the
CPU, GPU, and Neural Engine all address it. There is no VRAM and no PCIe hop —
a buffer is written once and read in place by whichever processor needs it.
Metal storage modes (how you express CPU/GPU visibility)
Mode | Who sees it | Use for
storageModeShared | CPU + GPU, same bytes | the default UMA win: zero-copy handoff (CPU fills, GPU reads)
storageModePrivate | GPU only | GPU-produced/consumed data (render targets, intermediates); still in the one pool, but not CPU-mapped, so the GPU can lay it out optimally
storageModeManaged | CPU + GPU, with a synced copy | meant for Intel/AMD Macs with discrete GPUs; unnecessary on Apple Silicon and not available on iOS. A coherence shim where the pools are separate, so you call didModifyRange / synchronize to keep the two copies in step
Unified ≠ "no rules"
You still pick a storage mode, still cross two MMUs, still pin pages for the GPU,
and still synchronize access ordering (CPU must not read a buffer the GPU
is still writing — that's a fence/completion-handler concern, not a copy). Unified
memory removes the copy and the duplicate pool; it does not remove
translation, pinning, or ordering.
7 · Cache coherence — the subtle layer
CPU and GPU each have their own caches. When both touch the same physical pages
(UMA) or a buffer moves between them (discrete), someone must ensure a reader sees
the latest writes rather than stale cached lines.
Discrete: coherence is mostly sidestepped by the copy — the DMA delivers a fresh copy to VRAM; within the GPU, its own cache hierarchy applies. Crossing the bus is an explicit sync point.
Apple UMA: the on-die fabric keeps CPU/GPU access coherent at a hardware level for shared buffers, but you still enforce ordering — wait on the command buffer's completion before the CPU reads GPU output. Coherence (do I see fresh bytes?) is handled; ordering (have the writes happened yet?) is yours.
This is the same split as the concurrency primer: coherence ≈ visibility, ordering ≈ the acquire/release / fence discipline — just across processors instead of threads.
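The ordering half can be illustrated with the thread analogy the text draws. In this sketch a Python thread stands in for the GPU and a threading.Event stands in for command-buffer completion; it is an analogy only and does not use Metal.

```python
import threading

output = bytearray(4)          # stands in for a shared buffer
done = threading.Event()       # stands in for command-buffer completion


def fake_gpu_work() -> None:
    output[:] = b"\x01\x02\x03\x04"   # the "GPU" writes its result
    done.set()                          # completion is signalled only after the writes


worker = threading.Thread(target=fake_gpu_work)
worker.start()

done.wait()                    # the CPU side waits before touching the output
result = bytes(output)
worker.join()
print(result.hex())
```

Reading `output` before `done.wait()` returns would be the same bug as reading a shared buffer before the command buffer completes: the memory is visible, but the writes may not have happened yet.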
7.5 · Memory security — ASLR and the Apple mitigation stack
The same virtual-memory machinery that does translation (page tables, MMU) is also
the enforcement point for exploit mitigation. The headline is ASLR —
Apple deliberately loads code, stack, heap, and libraries at randomized
addresses every run, so an attacker can't hardcode where to jump. Because Apple
owns the whole silicon stack, it layers several mitigations on top that most
platforms can't.
ASLR — randomize the layout so addresses can't be hardcoded
What: the main binary, dynamic linker (dyld), and stack are randomized per launch; the heap gets base randomization plus per-allocation entropy from the allocator (not a single slide). The dyld shared cache slide is per-boot — computed once and shared identically across every process until reboot (so one leaked shared-cache pointer in any process defeats that cache's ASLR everywhere until reboot).
Why: classic exploits jump to a known address (a function, a gadget, injected shellcode). Randomize the layout and that address is different every run — the attacker must first leak a pointer to learn the slide before they can aim. It converts "one bug → code execution" into "need an info-leak and a control bug."
Apple specifics: the dyld shared cache (all system libs, prelinked) is randomized as a unit; on iOS it's shared across processes but slid. KASLR randomizes the kernel itself.
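Why one leaked pointer undoes the randomization for a whole image comes down to subtraction. The symbol names, offsets, and slide range below are invented for illustration; the point is only that everything in one image shares one slide.

```python
import secrets

PAGE = 0x4000
# Hypothetical link-time offsets of two symbols inside one library image.
OFFSETS = {"leaky_fn": 0x1A40, "target_fn": 0x9C10}


def load_image() -> dict:
    """Place the image at a random page-aligned slide, as a loader would."""
    slide = secrets.randbelow(1 << 16) * PAGE
    return {name: slide + off for name, off in OFFSETS.items()}


runtime = load_image()

leaked = runtime["leaky_fn"]               # what an info-leak bug reveals
slide = leaked - OFFSETS["leaky_fn"]       # recover the slide from one pointer
predicted = slide + OFFSETS["target_fn"]   # every other address follows

print(predicted == runtime["target_fn"])
```

Without the leak the attacker has to guess the slide; with it, the layout is fully known until the slide changes, which for the shared cache means until reboot.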
The layers Apple stacks with it
Mitigation | What it does | Why it's hard to beat
ASLR / KASLR | randomize user + kernel address layout | need an info-leak to even know where to aim
PAC (Pointer Authentication) | Apple Silicon signs pointers with a truncated-QARMA crypto tag in the high bits (free because real VAs are ~48-bit, not 64; ARM's Top-Byte-Ignore also frees the top byte); tampering fails the check → crash | the distinctively-Apple one (ARMv8.3, shipped first at scale on A12, 2018): raises the bar against generic ROP/JOP even after an ASLR leak, since a forged pointer won't authenticate. Not absolute: signing gadgets, key leaks, and PACMAN-style speculative oracles are known bypasses.
W^X + code signing | a page is writable XOR executable; you can't write shellcode and run it | iOS enforces hard (no JIT without entitlement); hardware APRR/SPRR makes the W↔X flip cheap
KTRR / CTRR | hardware-locked read-only kernel text | kernel code can't be modified even with a kernel write primitive
The connection to the rest of this doc
Memory protection and memory performance ride the same hardware: page
tables + MMU do translation (§2) and isolation; PAC reuses the unused high
bits of a 64-bit pointer — free because real address spaces use far fewer than
64 bits (ARMv8 VAs are ~48-bit, and ARM's Top-Byte-Ignore frees the top byte).
On UMA, where CPU/GPU/NPU share one physical pool, this isolation matters more, not
less — the page tables are what keep one processor's (or process's) buffers
invisible to another despite the shared silicon. Apple leans on all of it hard
precisely because it controls CPU, GPU, OS, and toolchain end-to-end.
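The "tag in the unused high bits" idea can be shown with a toy. This is emphatically not Apple's or ARM's algorithm: it substitutes a truncated HMAC-SHA256 for the hardware cipher, uses a made-up key, and assumes a 48-bit address with a 16-bit tag, purely to show the sign/authenticate shape.

```python
import hashlib
import hmac

VA_BITS = 48
VA_MASK = (1 << VA_BITS) - 1
KEY = b"per-process secret (hypothetical)"   # real keys live in hardware registers


def _tag(addr: int, context: int) -> int:
    """16-bit tag over the address and a context value (toy stand-in for the PAC cipher)."""
    msg = addr.to_bytes(8, "little") + context.to_bytes(8, "little")
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "little")


def sign(ptr: int, context: int) -> int:
    """Store the tag in the bits above the 48-bit address."""
    addr = ptr & VA_MASK
    return (_tag(addr, context) << VA_BITS) | addr


def authenticate(signed: int, context: int) -> int:
    """Return the bare address if the tag matches, otherwise fail loudly."""
    addr = signed & VA_MASK
    if signed >> VA_BITS != _tag(addr, context):
        raise ValueError("pointer authentication failed")
    return addr


ptr = 0x0000_7FFF_1234_5678
signed = sign(ptr, context=0x1000)
assert authenticate(signed, context=0x1000) == ptr
```

An attacker who overwrites the address bits without the key has roughly a 1-in-65,536 chance of a matching tag in this toy, which is the sense in which a forged pointer "won't authenticate".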
8 · How this matters for developers — by scenario
The theory above only pays off as decisions you make in code. Here's the
"so what" for the situations you'll actually hit on Apple Silicon — and the one
case where the discrete habits you learned are wrong here.
① On-device ML / LLM inference
Use storageModeShared and stop copying. The discrete reflex — allocate host buffer, allocate device buffer, memcpy between them — is pure waste on UMA. Fill the buffer on the CPU, hand it to the GPU, read results back, all in place.
Load weights with mmap, not read. A memory-mapped safetensors/GGUF file is paged in on demand and, on UMA, can be exposed to the GPU in place (in Metal, by wrapping the page-aligned mapping with makeBuffer(bytesNoCopy:...)), so the load is zero-copy and the OS evicts cold pages for free. (This is why local LLM runtimes mmap their weights.)
Size RAM to keep the model resident. The §0.5 cliff is the whole ballgame: a model that fits in the unified pool runs at full bandwidth; one that's slightly too big silently swaps to SSD and craters. Buy/target the RAM tier that keeps your hot set (weights + KV-cache) wired. Addressable ≠ resident.
Budget the KV-cache as memory, not an afterthought. It grows per token and shares the one pool — it competes with the weights and the rest of the app. (Ties to 14_inference_server / 20_kv_cache_decode.)
Expect bandwidth contention. A memory-bound GPU kernel and a busy CPU fight for the same bus — don't assume UMA bandwidth is "free" per processor.
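The OS half of the mmap advice looks like this. The sketch writes a small stand-in "weights" file of 1024 native float32 values (not a real safetensors/GGUF file), maps it read-only, and reads through a typed view without copying the file into a Python-owned buffer. Handing the mapping to a GPU is a separate, Metal-side step that is not shown here.

```python
import mmap
import os
import struct
import tempfile

# Write a stand-in weights file: 1024 float32 values in native byte order.
values = [float(i) for i in range(1024)]
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(struct.pack("1024f", *values))
    path = f.name

try:
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Typed views over the mapping; pages are faulted in only when touched.
            with memoryview(mapped) as raw, raw.cast("f") as view:
                count = len(view)
                tenth = view[10]
finally:
    os.remove(path)

print(count, tenth)
```

Because the mapping is file-backed and read-only, the OS can drop cold pages under pressure and re-read them later, which is the "evicts cold pages for free" behavior and also the mechanism behind the §0.5 cliff.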
② Graphics / rendering (Metal)
Exploit TBDR tile memory. Use storageModeMemoryless for transient render targets that are written and consumed within a single render pass (depth, MSAA color/resolve sources) — they live only in on-chip tile memory and never get DRAM backing, saving bandwidth and power. A classic deferred G-buffer qualifies only when you fuse the passes into one via programmable blending / imageblocks (tile shading); a G-buffer written in a geometry pass and read in a later pass forces DRAM backing and can't be memoryless. Porting a desktop renderer that allocates real textures for the genuinely-transient targets leaves the Apple GPU's main advantage on the table.
Pick storage mode by who touches it: shared for CPU-updated data (uniforms, dynamic vertices), private for GPU-only data (let the driver lay it out optimally), memoryless for tile-only transients.
Minimize binds, batch draws — data reaches shaders via descriptor/argument buffers, so argument-buffer reuse beats re-binding per draw.
③ General app / Swift & ObjC
You mostly don't think about this — until memory pressure. ARC + the allocator handle it; the layer that bites is paging. Honor memory-pressure notifications (didReceiveMemoryWarning, DispatchSource memory-pressure) and drop caches before the OS swaps or jetsam kills you.
On iOS, respect the wired/footprint cap. Large pinned allocations (big images, decoded video, model buffers) count against a hard footprint limit — exceed it and you're killed, not swapped. Downsample, stream, release.
mmap big read-only assets instead of loading them resident — they page in lazily and out under pressure.
④ Cross-process / system
Share with mmap/shared memory, don't serialize-and-copy when two processes need the same big buffer (the CPU analog of UMA's zero-copy handoff).
ASLR/PAC are free wins you shouldn't fight — don't hardcode addresses, don't disable hardened runtime, keep pointer-auth on. They cost you nothing and raise the exploitation bar a lot (§7.5).
The one habit to unlearn coming from discrete/CUDA
On a discrete GPU, moving data is often costlier than computing on it, so you
architect around minimizing PCIe copies (keep data resident, batch transfers, double-buffer).
On UMA there is no copy to minimize — the equivalent trap is the opposite:
gratuitous explicit copies (allocating a separate "device" buffer and memcpy-ing into it)
that throw away UMA's entire advantage. Port the intent (keep the hot set on-chip,
reuse tiles/threadgroup memory) but drop the copy choreography. And replace
"will it fit in VRAM?" (a hard OOM) with "will the hot set stay resident in the shared
pool?" (a soft, silent SSD-paging cliff).
9 · Side-by-side & gotchas
Axis | Discrete (PCIe) | CUDA Unified (managed) | Apple UMA
Physical pools | 2 (DRAM + VRAM) | 2 (migrated) | 1
Pointer | separate host/device | one (virtual) | one (shared buffer)
Copy to use on GPU | explicit DMA | implicit (page fault) | none
Bottleneck | PCIe bandwidth | fault/migration thrash | shared bandwidth budget
Pinning still needed | yes (for DMA) | runtime-managed | yes (page wiring)
Best for | huge models, max VRAM bandwidth | convenience / oversubscription | low-latency, low-power, copy-free
Gotchas worth naming
Forgetting to pin → silent fallback to the staging double-copy; transfers look slow for no obvious reason.
Synchronous copy on the critical path → CPU stalls waiting for PCIe. Use async + pinned + streams to overlap.
Treating CUDA managed memory as free unification → page-fault thrash over PCIe; prefetch/advise to fix.
On UMA, reading GPU output before the command buffer completes → you read stale/partial bytes. Coherence is handled; ordering is not — wait on completion.
Over-pinning → starves the swappable pool; the whole system degrades.
Assuming UMA bandwidth is "free" → CPU and GPU contend for one bus; a memory-bound GPU kernel can starve the CPU and vice-versa.