Structure and Interpretation of Computer Programmers

I make it easier and faster for you to write high-quality software.

Friday, November 13, 2020

Apple Silicon, Xeon Phi, and Amigas

The new M1 chip in the new Macs has 8-16GB of DRAM on the package, just like many mobile phones or single-board computers, but unlike many desktop, laptop or workstation computers (there are exceptions). In the first tranche of Macs using the chip, that’s all the addressable RAM they have (i.e. ignoring caches). But what happens when they move the Apple Silicon chips up the scale, to computers like the iMac or Mac Pro?

It’s possible that these models would have a few GB of memory on-package and access to memory modules connected via a conventional controller, for example DDR4 RAM. They almost certainly would if you could deploy multiple M1 (or successor) packages on a single system. Such a Mac would have a non-uniform memory access (NUMA) architecture, which (depending on how it’s configured) has implications for how software can be designed to make the best use of the memory.

NUMA computing is of course not new. If you have a computer with a CPU and a discrete graphics processor, you have a NUMA computer: the GPU has access to RAM that the CPU doesn’t, and vice versa. Running GPU code involves copying data from CPU-memory to GPU-memory, doing GPU stuff, then copying the result from GPU-memory to CPU-memory.

A hypothetical NUMA-because-Apple-Silicon Mac would not be like that. The GPU shares access to the integrated RAM with the CPU, a little like an Amiga. The situation on Amiga was that there was “chip RAM” (which the CPU, the graphics chips and the other peripheral chips could all access), and “fast RAM” (only available to the CPU). The fast RAM was faster because the CPU never had to wait for the coprocessors when using it, whereas they had to take turns accessing the chip RAM. Nonetheless, the CPU had access to all the RAM, and programmers had to tell `AllocMem` whether they wanted to use chip RAM, fast RAM, or didn’t care.
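To make that three-way choice concrete, here is a toy Python model of an allocator that takes a chip, fast or don’t-care request. It is not the Amiga API: `Kind`, `ToyAllocator` and the pool sizes are invented for illustration, and the rule that don’t-care requests try fast RAM first is an assumption of this sketch.

```python
from enum import Enum


class Kind(Enum):
    CHIP = "chip"   # reachable by the CPU and the coprocessors
    FAST = "fast"   # reachable by the CPU only
    ANY = "any"     # caller does not care


class ToyAllocator:
    """Tracks free bytes in two pools. Hypothetical, not the Amiga API."""

    def __init__(self, chip_bytes, fast_bytes):
        self.free = {Kind.CHIP: chip_bytes, Kind.FAST: fast_bytes}

    def alloc(self, size, kind=Kind.ANY):
        """Return the pool that served the request, or None if none could."""
        # Assumption: "don't care" requests try fast RAM first, keeping
        # chip RAM free for data the coprocessors must be able to reach.
        candidates = [Kind.FAST, Kind.CHIP] if kind is Kind.ANY else [kind]
        for pool in candidates:
            if self.free[pool] >= size:
                self.free[pool] -= size
                return pool
        return None


KIB = 1024
mem = ToyAllocator(chip_bytes=512 * KIB, fast_bytes=512 * KIB)

bitmap = mem.alloc(300 * KIB, Kind.CHIP)   # graphics data has to be in chip RAM
table = mem.alloc(400 * KIB)               # don't care: served from fast RAM
spill = mem.alloc(200 * KIB)               # fast RAM too full, falls back to chip
too_big = mem.alloc(300 * KIB, Kind.CHIP)  # no fallback for an explicit request
```

The point of the sketch is the last line: once the caller names a region, the allocator cannot quietly substitute the other one, so a wrong choice becomes the programmer’s problem.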

A NUMA Mac would not be like that, either. It would share the property that there’s a subset of the RAM available for sharing with the GPU, but this memory would be faster than the off-chip memory because of the closer integration and the lack of a (relatively) long communication bus. Apple has described the integrated RAM as “high bandwidth”, which probably means multiple access channels.

A better and more recent analogy to this setup is Intel’s discontinued supercomputer chip, Knights Landing (marketed as Xeon Phi). Like the M1, this chip has 16GB of on-package high-bandwidth memory. Like my hypothetical Mac Pro, it can also access external memory modules. Unlike the M1, it has 64 or 72 identical cores rather than 4 big and 4 little cores.

There are three ways to configure a Xeon Phi computer. You can do without external memory altogether, so that the CPU uses only its on-package RAM. You can use a cache mode, where the software only “sees” the external memory and the high-bandwidth RAM is used as a cache. Or you can go full NUMA, where programmers have to explicitly request memory in the high-bandwidth region to access it, like with the Amiga allocator.
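The difference between the three configurations is mostly a question of which memory the software can address. The following Python sketch models only that: `Mode`, `visible_memory` and the 64GB external figure are illustrative names and numbers, not anything taken from Intel’s tooling.

```python
from enum import Enum


class Mode(Enum):
    ON_PACKAGE_ONLY = "no external memory"
    CACHE = "on-package RAM acts as a cache"
    NUMA = "both regions addressable"


def visible_memory(mode, on_package_gb, external_gb):
    """Return (addressable regions in GB, GB hidden away as cache)."""
    if mode is Mode.ON_PACKAGE_ONLY:
        return {"on_package": on_package_gb}, 0
    if mode is Mode.CACHE:
        # Software only sees the external modules; the fast RAM is invisible.
        return {"external": external_gb}, on_package_gb
    if mode is Mode.NUMA:
        # Both regions are visible, so every allocation has to pick one.
        return {"on_package": on_package_gb, "external": external_gb}, 0
    raise ValueError(f"unknown mode: {mode!r}")


only = visible_memory(Mode.ON_PACKAGE_ONLY, 16, 0)
cached = visible_memory(Mode.CACHE, 16, 64)
numa = visible_memory(Mode.NUMA, 16, 64)
```

Only the NUMA result has two entries in its region table, which is why it is the only mode that needs an allocator willing to ask “which kind?”.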

People rarely go full NUMA. It’s hard to work out what split of allocations between the high-bandwidth and regular RAM yields the best performance, so people tend to just run in cache mode and hope that’s faster than not having any on-package memory at all.
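A back-of-envelope model hints at why the split is hard to choose. The Python sketch below assumes a workload that streams a fixed amount of data, with made-up round bandwidth figures (400 GB/s and 100 GB/s) and two equally made-up access patterns; none of the numbers are measurements, and capacity limits are ignored.

```python
def serial_time(pct_fast, traffic_gb, bw_fast, bw_slow):
    """Seconds taken if the two memories are read one after the other."""
    fast_gb = traffic_gb * pct_fast / 100
    return fast_gb / bw_fast + (traffic_gb - fast_gb) / bw_slow


def overlapped_time(pct_fast, traffic_gb, bw_fast, bw_slow):
    """Seconds taken if both memories are streamed at the same time."""
    fast_gb = traffic_gb * pct_fast / 100
    return max(fast_gb / bw_fast, (traffic_gb - fast_gb) / bw_slow)


def best_split(model, traffic_gb=100, bw_fast=400, bw_slow=100):
    """Scan splits in 10% steps; return the fast-memory percentage that wins."""
    return min(range(0, 101, 10),
               key=lambda pct: model(pct, traffic_gb, bw_fast, bw_slow))


serial_best = best_split(serial_time)          # everything in fast memory
overlapped_best = best_split(overlapped_time)  # a mix beats either extreme
```

Under the serial assumption the answer is to put everything in the fast region, while under the overlapped assumption an 80/20 mix wins, because the slower memory contributes bandwidth of its own. Even in a toy, the best split depends on how the program touches memory, and a real program adds capacity limits and less regular access patterns on top.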

And that makes me think that a Mac would either not go full NUMA, or would not have a public API for it. Maybe Apple would let the kernel and some OS processes have exclusive access to the on-package RAM, but even that seems overly complex (particularly where you have more than one M1 in a computer, so you need to specify core affinity for your memory allocations in addition to memory type). My guess is that an early workstation Mac with 16GB of M1 RAM and 64GB of DDR4 RAM would look like it has 64GB of RAM, with the on-package memory used for the GPU and as cache. NUMA APIs, if they come at all, would come later.

posted by Graham at 09:35  

5 Comments »

  1. This discussion is over my head. Are you suggesting that Apple’s use of its M1 chip in computers is likely to produce memory management issues down the road? That it will be difficult to write software for these Macs and that the resulting software will be prone to errors?

    Comment by Paul M Dulaney — 2020-11-13 @ 16:24

  2. I’m saying that Apple will likely take a soft approach to introducing a more complex memory model in later Apple Silicon Macs, one that’s enabled by the architecture of the M1. I doubt that most Swift/JS app programmers will ever notice or care about the difference, but it’d be possible to tune performance on the higher-end Macs if they do it the way I describe.

    Comment by Graham — 2020-11-13 @ 17:47

  3. They could go with explicit local / remote allocs, and hide that, initially, behind their Accelerate framework.

    The application programmer requires no knowledge, but doing work via Accelerate will magically be faster.

    Later, explicit APIs could open it up for all.

    It feels a bit like SGI in the olden days with ccNUMA.

    Comment by Bram — 2020-11-14 @ 02:17

  4. Yes, that seems like a good staged approach to take.

    By the way I think I forgot to list another option for how it could work: the two types of memory could both be addressable with the OS unable to distinguish them; some operations are magically faster than others and some are surprisingly slow. It’s possible, even if not a good idea.

    Comment by Graham — 2020-11-16 @ 09:43

  5. […] Graham Lee: […]

    Pingback by Michael Tsai - Blog - M1 Memory and Performance — 2020-11-23 @ 22:04
