Reader beware: This post is 100% serious and contains absolutely no sarcasm whatsoever. The author doesn’t have a background in mathematics or computer engineering, so attempt any arithmetic or manufacturing techniques outlined in this article at home at your own risk.
The Story So Far
So, after nearly a decade of speculation, at WWDC 2020 Apple finally announced that they were moving from Intel processors to their own in-house, ARM-based processors. In the nearly 2 years since that announcement, Apple has released a family of processors based around their M1 chip.
First came the M1, a pretty remarkable achievement in low-power, high-performance computing. Not something that’s going to replace a high-powered desktop anytime soon, but it showed that Apple really had something here.
Then came the M1 Pro and M1 Max. These chips really established what Apple’s plan was: take an existing chip and scale it up to provide more power, then use binning to establish performance tiers within each chip.
Most recently, Apple released the M1 Ultra, the focus of this blog post. Since the M1 Max was already a monstrous chip, increasing its size would be a difficult ask. So instead, Apple decided to just take 2 chips and “glue” them together using a high-speed interconnect called “UltraFusion”.
Hyperbolic naming aside, this is a really impressive achievement. Multi-processor systems have existed for decades, but software running on them usually had to be “NUMA-aware”: programs had to be optimized so that a task and its data were never split across processors. If a process was running on CPU 1, its data needed to live in the memory modules directly connected to CPU 1; if the data sat in the memory connected to CPU 2, the added latency would massively hamper performance. Apple circumvents this by having the interconnect fabric transfer data at 2.5 TB/s.
And that’s just the beginning. There have been rumors for months now that Apple’s not done. The only Mac in their lineup that has yet to get the Apple Silicon treatment is the Mac Pro, and that’s where this story begins.
The Rumor Mill is Boring
The image I keep seeing online goes a little something like this:
With the interconnects looking something like this:
But this orientation poses a problem for Apple. Every interconnect is taken and there is no room for expansion.
What if Apple wanted to bolt together 8 chips?
Or 16?
This simply will not do.
So instead I went in search of a solution. A way for these chips to scale up infinitely.
Math To The Rescue!
My research led me to the world of space-filling curves! What is a space-filling curve? Well, the Wikipedia article was no help to someone who doesn’t already have a math background, so let me try to outline it in simpler terms.
A space-filling curve is any curve that can be, through a process of rotation, translation, and/or reflection, tiled to fill an infinite plane. The simplest example is the Peano Curve, the first space-filling curve ever discovered.
The Peano Curve can be understood pretty intuitively through just its first few iterations. You start with 2 units attached vertically, then tile that same shape outward infinitely until it fills a 2D plane.
If we want to apply this to our chip design, we would simply start with the M1 Ultra, a chip that is already 2 interconnected units, and tile outward until it fills an infinite computer chassis.
But there is a problem here.
No, not the feasibility of manufacturing an infinite processor. A math problem.
The current M1 Ultra processor has an interconnect on only one side. If we wanted to realize this design, we would need to add interconnects on the other 3 sides of every chip. That could easily increase marginal costs once we scale up to infinite processors. Let’s see if we can’t reduce costs a little.
The current M1 XL (Jade 4c die) concept has interconnects on 2 adjacent sides of the chip, so let’s see if we can’t design a space-filling curve around that.
This brings us to the second space-filling curve we’re going to talk about, the Hilbert Curve.
This curve has an additional feature that might make manufacturing a little easier: it’s fractal!
Looking at the first few iterations of the Hilbert Curve should give us a better idea of how this works.
We start out with 4 units of space connected in an upside down “U” shape.
The shape is then rotated clockwise to create the first quarter of this new shape.
The shape is then copied twice, rotated counter-clockwise, and translated up to create the top of the next shape.
One more copy of the shape is then rotated twice and translated to the other side of the new shape.
We now have 4 “U” shapes and the final step is to connect them together with 3 straight lines.
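For the programmatically inclined, the steps above can be sketched as a small recursive function. This is a minimal sketch of my own, not anything Apple ships; note that the textbook construction builds the two bottom quadrants with reflections (transposes) rather than pure rotations, a wrinkle that matters later:

```python
# Build iteration n of the Hilbert curve as an ordered list of (x, y)
# grid cells. Each step assembles four copies of the previous
# iteration, visited in the order bottom-left, top-left, top-right,
# bottom-right.
def hilbert(n):
    if n == 0:
        return [(0, 0)]
    prev = hilbert(n - 1)
    s = 2 ** (n - 1)  # side length of the previous iteration's grid
    bottom_left  = [(y, x) for x, y in prev]                      # mirror across the diagonal
    top_left     = [(x, y + s) for x, y in prev]                  # translate up
    top_right    = [(x + s, y + s) for x, y in prev]              # translate up and right
    bottom_right = [(2 * s - 1 - y, s - 1 - x) for x, y in prev]  # mirror across the anti-diagonal
    return bottom_left + top_left + top_right + bottom_right

print(hilbert(1))  # the upside-down "U": [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Each iteration starts at the bottom-left corner and ends at the bottom-right corner, which is what lets the four copies chain together with those 3 connecting lines.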
Does the resulting shape look familiar?
It should.
The first iteration consisted of a line that passed through every square of a 2×2 grid. The new curve passes through every square of a 4×4 grid.
This is the “Space-Filling” part of the Hilbert curve.
The second attribute of the Hilbert Curve is that it is continuous. This means that if you print it out on a piece of paper, you can place your finger down on one end and, by following the line, reach the other end without ever lifting your finger from the page.
The final piece of the Hilbert Curve is that it’s fractal. We’ve already demonstrated this by taking the first iteration and, through just rotation, translation, and reflection, creating a new curve that has all of the same properties as the original.
This might be more apparent if we build the third iteration of the Hilbert curve. The steps are the same as I’ve outlined above.
Start with the shape and rotate it clockwise.
Make 2 copies, rotate them counter-clockwise, and translate them up.
Make a final copy and rotate it twice.
Connect the shapes together with 3 straight lines.
And voila!
You now have a new shape that still satisfies the properties of the original.
The curve passes through every square in an 8×8 grid. It’s continuous, and it’s fractal.
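The first two of those properties are easy to verify in code. Here is a quick sketch using the classic iterative distance-to-coordinate conversion from the Wikipedia article on Hilbert curves, which maps a chip’s position along the curve straight to its grid cell:

```python
# Map a position d along the Hilbert curve to its (x, y) cell in an
# n x n grid (n must be a power of two).
def d2xy(n, d):
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:          # bottom half of this level is mirrored
            if rx == 1:      # bottom-right quadrant: rotate 180 degrees
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x      # mirror across the diagonal
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

cells = [d2xy(8, d) for d in range(64)]
print(cells[0], cells[-1])  # start and end cells of the 8x8 curve
```

Checking that `cells` hits all 64 squares, with every step moving to an adjacent square, confirms the coverage and continuity claims for the third iteration.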
Other Space-Filling Curves
This is only one of several space-filling curves that have been discovered.
Legend of Zelda fans might recognize the Sierpinski Triangle, and 3D-printing fans might recognize the Hilbert Curve itself, a popular infill pattern.
Thoughts on the M1 Hyper
So, this is my proposal for the M1 Hyper. A theoretically infinite processor.
[image of M1s tiled]
We keep the design of the proposed Jade 4c Die with interconnects on only 2 sides and simply infinitely tile it until we reach the performance level we need.
Roadblocks
Observant readers might have noticed that if we use a processor with interconnects on only 2 sides, the first 3 copies can be connected with translation and rotation alone, but connecting the 4th copy requires a reflection. This means that the reflected cluster of chips would be upside-down!
But don’t worry, I have a solution to that as well. We simply design a new motherboard for this processor the same way that we design unique motherboards for dual-CPU systems.
We design a sandwich motherboard. The total number of incorrectly oriented chips can be calculated with the following formula:
Where n is the current iteration number, starting at 1.
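Since each iteration reflects exactly one of its four copies, and reflecting a copy flips the orientation of every chip inside it, the count follows a short recurrence. This is my own back-of-the-envelope accounting under that assumption, not an official Apple yield figure:

```python
# Count mirrored ("upside-down") chips at iteration n, assuming the
# base U of 4 chips is all right-side-up and each later iteration
# reflects exactly one of its four copies.
def mirrored(n):
    r, total = 0, 4          # iteration 1: 4 chips, 0 mirrored
    for _ in range(n - 1):
        # three copies keep their counts; in the reflected copy,
        # mirrored and unmirrored chips swap roles
        r = 3 * r + (total - r)
        total *= 4
    return r

for n in range(1, 5):
    print(n, mirrored(n), "of", 4 ** n)  # 0, 4, 24, 112 mirrored chips
```

Under this assumption the recurrence works out to the closed form 2^(2n−1) − 2^n mirrored chips at iteration n.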
This way the chips that are oriented incorrectly can be used as efficiency cores while the rest of the chips can be used as performance cores.
Why?
Here’s the thing, dear reader, I designed this processor for completely selfish reasons.
You see, I can bring any computer to its knees within 6 months.
I recently bought an M1 Max 16″ MacBook Pro with 64 GB of RAM and a 32-core GPU, and I’m already hitting some roadblocks.
I usually have dozens of “productivity” apps running. Things like Yabai, Karabiner-Elements, BetterTouchTool, etc. I also have at least 2 web browsers open with 100–300 tabs at any given time, at least 3 terminal windows open (2 for the current machine and one managing my home server), an IDE like VSCode, an email client, a note-taking app, a calendar, 1–2 messaging apps, an RSS reader, and maybe a video player like Jellyfin or a creative app like Affinity Designer.
I figured that an 8-core machine with 64 GB of RAM would be enough to run this stuff, but I’m still struggling with performance. Perhaps I expected too much out of this little machine, or maybe I use computers in a way they weren’t designed to be used. Either way, I just don’t think I can ever get enough performance…
For my home server, I’m running Unraid on an 8-core Ryzen 2700X with 64 GB of RAM and 2 GPUs, and I’m still running out of performance… on a desktop machine! I have ~86 Docker containers running and ~26 VMs configured (with only 2–4 running at any given time), and I’m just itching to upgrade to the new Threadripper Pros once they become available.
I took my frustration and designed the ultimate processor: one that I could finally run my entire workload on and never redline.
Performance Numbers
In the interest of being thorough, I did a little more math and figured out the performance numbers for this proposed processor.
As we can see from these numbers, the single-core performance is about on par with the rest of the M1 lineup.
When we take a look at the multicore numbers though…
This is where we see a massive increase in performance.
So if you are planning on upgrading to the M1 Hyper, please make sure that your workload can actually take advantage of those infinite cores.
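That caveat is just Amdahl’s law, which isn’t mentioned in any of the rumors but is worth keeping in mind: if only a fraction p of your workload is parallelizable, even infinite cores cap your speedup at 1 / (1 − p). A quick sketch (the p = 0.95 figure is an arbitrary example, not a measurement):

```python
# Amdahl's law: speedup on n cores when a fraction p of the work
# is parallelizable.
def speedup(p, n):
    return 1 / ((1 - p) + p / n)

print(round(speedup(0.95, 8), 2))  # 95%-parallel workload on 8 cores
print(1 / (1 - 0.95))              # ceiling with infinite cores: 20x
```

So a workload that is 95% parallel tops out at a 20× speedup no matter how many M1 Maxes you glue together.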
EDIT: It looks like the latest rumors suggest that the chips will actually be stacked in 3 dimensions, not just 2. It’s a good thing the Hilbert curve can also fill any 3-dimensional area with minimal changes!