The new M1 chip in the new Macs has 8-16GB of DRAM on the package, just like many mobile phones or single-board computers. But unlike many desktop, laptop or workstation computers (there are exceptions). In the first tranche of Macs using the chip, that’s all the addressable RAM they have (i.e. ignoring caches), just like many mobile phones or single-board computers. But what happens when they move the Apple Silicon chips up the scale, to computers like the iMac or Mac Pro?
It’s possible that these models would have a few GB of memory on-package and access to memory modules connected via a conventional controller, for example DDR4 RAM. They almost certainly would if you could deploy multiple M1 (or successor) packages on a single system. Such a Mac would be a non-uniform memory access architecture (NUMA), which (depending on how it’s configured) has implications for how software can be designed to best make use of the memory.
NUMA computing is of course not new. If you have a computer with a CPU and a discrete graphics processor, you have a NUMA computer: the GPU has access to RAM that the CPU doesn’t, and vice versa. Running GPU code involves copying data from CPU-memory to GPU-memory, doing GPU stuff, then copying the result from GPU-memory to CPU-memory.
A hypothetical NUMA-because-Apple-Silicon Mac would not be like that. The GPU shares access to the integrated RAM with the CPU, a little like an Amiga. The situation on Amiga was that there was “chip RAM” (which both the CPU and graphics and other peripheral chips could access), and “fast RAM” (only available to the CPU). The fast RAM was faster because the CPU didn’t have to wait for the coprocessors to use it, whereas they had to take turns accessing the chip RAM. Nonetheless, the CPU had access to all the RAM, and programmers had to tell `AllocMem` whether they wanted to use chip RAM, fast RAM, or didn’t care.
A NUMA Mac would not be like that, either. It would share the property that there’s a subset of the RAM available for sharing with the GPU, but this memory would be faster than the off-chip memory because of the closer integration and lack of (relatively) long communication bus. Apple has described the integrated RAM as “high bandwidth”, which probably means multiple access channels.
A better and more recently analogy to this setup is Intel’s discontinued supercomputer chip, Knight’s Landing (marketed as Xeon Phi). Like the M1, this chip has 16GB of on-die high bandwidth memory. Like my hypothetical Mac Pro, it can also access external memory modules. Unlike the M1, it has 64 or 72 identical cores rather than 4 big and 4 little cores.
There are three ways to configure a Xeon Phi computer. You can not use any external memory, and the CPU entirely uses its on-package RAM. You can use a cache mode, where the software only “sees” the external memory and the high-bandwidth RAM is used as a cache. Or you can go full NUMA, where programmers have to explicitly request memory in the high-bandwidth region to access it, like with the Amiga allocator.
People rarely go full NUMA. It’s hard to work out what split of allocations between the high-bandwidth and regular RAM yields best performance, so people tend to just run with cached mode and hope that’s faster than not having any on-package memory at all.
And that makes me think that a Mac would either not go full NUMA, or would not have public API for it. Maybe Apple would let the kernel and some OS processes have exclusive access to the on-package RAM, but even that seems overly complex (particularly where you have more than one M1 in a computer, so you need to specify core affinity for your memory allocations in addition to memory type). My guess is that an early workstation Mac with 16GB of M1 RAM and 64GB of DDR4 RAM would look like it has 64GB of RAM, with the on-package memory used for the GPU and as cache. NUMA APIs, if they come at all, would come later.