When Three MacBooks Beat One: The Distributed AI Experiment
Developer Alex Ziskind clusters three M5 Max MacBook Pros to run AI models too large for any single machine. The results reveal hard limits.
Written by AI. Dev Kapoor
April 18, 2026

Photo: Alex Ziskind / YouTube
Alex Ziskind wired three M5 Max MacBook Pros together with Thunderbolt 5 cables and tried to break them. Not physically—computationally. The goal: run AI models so large that a single machine couldn't even load them into memory, let alone process them at usable speeds.
This isn't the first time Ziskind has clustered Apple hardware for local AI inference. A few months ago, he did similar experiments with M3 Ultra Mac Studios. But with Apple's WWDC potentially around the corner and no M5 Ultra announced yet, he's working with what's actually available: the M5 Max MacBook Pro, the most powerful Apple silicon you can buy today.
The experiment reveals something more interesting than raw performance numbers. It maps the actual boundaries of what's possible with consumer hardware when you're trying to run models that weren't designed to fit on consumer hardware.
The Setup and the Math That Lies
Each MacBook Pro has 128 GB of unified memory. The stack is MLX, Apple's framework for machine learning on Apple Silicon, configured for distributed inference. Ziskind started small—a 4 billion parameter model at just 2 GB on disk—to verify the cluster actually worked before attempting anything ambitious.
On one machine: 179 tokens per second. On two machines: 220 tokens per second. A 22% speedup from RDMA (Remote Direct Memory Access) over Thunderbolt 5, even though the model was too small to really benefit from distribution. "This is a proof of concept that the cluster is actually doing distributed work, not just sitting on one machine pretending," Ziskind explained.
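Verifying that yourself is straightforward. Here's a minimal sketch of that kind of sanity check using MLX's distributed API; the launcher and the ring configuration over Thunderbolt are assumptions about the setup, but the idea is that an all-reduce only produces the right number if the nodes are genuinely talking to each other.

```python
# Minimal distributed sanity check with MLX (assumes the script is started
# once per node via MLX's distributed launcher, with the ring configured
# over the Thunderbolt links).
import mlx.core as mx

group = mx.distributed.init()               # join the process group
print(f"node {group.rank()} of {group.size()}")

# Each node contributes its rank; the all-reduce sums the values across nodes.
local = mx.array([float(group.rank())])
total = mx.distributed.all_sum(local)
mx.eval(total)

# On a two-node cluster both machines print 1.0 (0 + 1) -- proof that the
# result actually crossed the wire rather than staying on one machine.
print(total.item())
```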
The interesting part came when he scaled up to Qwen 3.5 122B—122 billion parameters with 10 billion active at any time (a mixture-of-experts architecture). At 65 GB on disk, it still fit on a single 128 GB machine. One node: 57 tokens per second. Two nodes: 73 tokens per second. A 27% improvement.
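Those disk sizes track simple quantization arithmetic: parameters times bytes per parameter, plus a little extra for embeddings and quantization scales. A quick back-of-the-envelope (the overhead is the only assumption) also previews why the 8-bit build he tried next is a different story.

```python
# Back-of-the-envelope model sizing: parameters x bytes per parameter.
# Embeddings, quantization scales and metadata add a few GB on top, which is
# why the real 4-bit build lands at 65 GB rather than 61 GB.
params = 122e9  # Qwen 3.5 122B total parameters

for name, bytes_per_param in [("4-bit", 0.5), ("8-bit", 1.0)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")

# 4-bit: ~61 GB  -> the 65 GB build that fits on one 128 GB machine
# 8-bit: ~122 GB -> the build that, as it turns out, doesn't
```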
Then he switched to the 8-bit quantization of the same model: 122 GB on disk. Theoretically, this should fit on a single 128 GB machine. The math checked out. The reality didn't. "One node doesn't even load: out of memory," Ziskind said. "You don't get a slow benchmark. You get nothing at all."
On two nodes: 51 tokens per second. This is where the cluster stops being an optimization and becomes the only option. The OS overhead, the MLX runtime, the KV cache—all the invisible tax on that 128 GB means a 122 GB model simply won't run on hardware that theoretically has room for it.
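A rough budget makes that invisible tax concrete. The individual reservations below are illustrative guesses rather than measured values, but the shape of the problem is the same on any machine: the weights are not the only thing that has to fit.

```python
# Illustrative single-node memory budget (every overhead figure is a guess).
total_ram    = 128   # GB of unified memory
os_and_apps  = 10    # macOS, window server, background processes
gpu_headroom = 10    # memory macOS keeps back from the GPU
mlx_runtime  = 3     # framework buffers, compiled kernels
kv_cache     = 5     # grows with context length; this assumes a modest context

usable = total_ram - os_and_apps - gpu_headroom - mlx_runtime - kv_cache
print(f"usable for weights: ~{usable} GB")                # ~100 GB

print("122 GB model fits on one node:", 122 <= usable)    # False
print(" 65 GB model fits on one node:",  65 <= usable)    # True
```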
The Practical Ceiling
Ziskind kept pushing. Next was a GLM model at 185 GB on disk: 29 tokens per second across two machines. Legitimately usable for coding work, he noted—you could point VS Code or Cursor at the OpenAI-compatible endpoint and actually write software with it.
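Pointing a tool at the cluster works like talking to any OpenAI-compatible server. Here's a minimal sketch with the openai Python client; the port, model identifier, and endpoint details are assumptions about how a local server might be configured, not specifics from Ziskind's setup.

```python
# Minimal client for a local OpenAI-compatible endpoint.
# The base_url, port and model name are assumptions about the local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="glm-4",  # whatever identifier the local server exposes
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(response.choices[0].message.content)
```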
Then he tried Llama 3.1 405B: 215 GB on disk. Two 128 GB machines have 256 GB of raw unified memory. The math should work. It didn't. Both nodes allocated about 109 GB each and immediately started swapping to disk. After 25 minutes, the benchmark hadn't generated a single token.
"The math lied," Ziskind said. "After OS overhead, after MLX runtime overhead, the KV cache, it just didn't work."
The practical ceiling for a two-node, 128 GB-per-machine cluster: about 200 GB of model weights. Beyond that, you're swapping and thrashing. To go bigger, you need a third machine.
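Extending the earlier budget arithmetic to two nodes shows where that ceiling comes from. The ~100 GB usable-per-node figure is the same illustrative guess as before.

```python
# Per-node check for a two-node cluster (usable budget per node is a guess).
usable_per_node = 100   # GB left for weights after OS, runtime and KV cache

def fits(model_gb: float, nodes: int) -> bool:
    """Each node must hold its share of the weights within its usable budget."""
    return model_gb / nodes <= usable_per_node

print(fits(185, 2))   # True  -> the 185 GB GLM run at 29 tokens per second
print(fits(215, 2))   # False -> 107.5 GB per node, roughly the ~109 GB
                      #          each machine tried to allocate before swapping
print(usable_per_node * 2)    # ~200 GB: the practical two-node ceiling
```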
Why Not Three Nodes?
Here's where the experiment runs into a wall that has nothing to do with hardware and everything to do with how transformer models are architected. Ziskind explained the difference between tensor parallelism and pipeline parallelism—two strategies for distributing model computation across multiple machines.
Tensor parallelism (what the MLX distributed stack uses) slices each layer horizontally. Every node processes every layer, but only on its own shard of the weights. This requires constant coordination between nodes, which is why you need fast connections. But it also requires that layer dimensions—hidden size, intermediate size, attention head counts—divide evenly across all nodes.
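Here's a toy version of the idea, simulated with NumPy on a single machine rather than a real two-node cluster: each "node" holds one slice of a weight matrix, computes a partial product, and the partials are summed the way an all-reduce would sum them. The dimensions are typical transformer sizes, not taken from any particular model.

```python
import numpy as np

# Toy tensor-parallel matmul: shard the intermediate dimension across "nodes".
inter, hidden, n_nodes = 14336, 512, 2   # 14336 is a typical MLP width; 512 keeps the toy small
x = np.random.randn(1, inter)
w = np.random.randn(inter, hidden)

# The shard dimension has to divide evenly across the nodes --
# this is the constraint that rules out three-node tensor parallelism below.
assert inter % n_nodes == 0, "layer dimension must split evenly across nodes"
rows = inter // n_nodes

# Each "node" holds one slice of the weights and computes a partial product.
partials = [x[:, i * rows:(i + 1) * rows] @ w[i * rows:(i + 1) * rows, :]
            for i in range(n_nodes)]

# Summing the partials stands in for the all-reduce over Thunderbolt.
assert np.allclose(sum(partials), x @ w)
```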
Every popular model—Llama, Qwen, Mistral, GLM, Gemma—is built on dimensions divisible by high powers of two: 4096, 8192, 14,336. These divide evenly by 2, 4, 8, and 16. They don't divide evenly by 3.
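You can check the claim in one loop:

```python
# Common transformer layer dimensions vs. candidate node counts.
for dim in (4096, 8192, 14336):
    for nodes in (2, 3, 4):
        ok = dim % nodes == 0
        print(f"{dim} across {nodes} nodes: {'ok' if ok else 'does not divide'}")
# Every dimension splits cleanly across 2 or 4 nodes; none of them split across 3.
```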
"A three-node tensor parallel cluster works for almost no popular model out of the box," Ziskind said. "This is the actual reason this whole video is about two MacBook Pros instead of three."
Pipeline parallelism solves this by slicing models vertically—each node holds entire layers, and tokens flow through them sequentially like an assembly line. The project Exo implements this approach. Ziskind switched to Exo for the three-node experiments, which opened up access to even larger models but came with different headaches.
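Before getting to those headaches, the contrast with tensor parallelism is easiest to see in a toy pipeline: each node owns a contiguous block of whole layers, and activations simply pass from one node to the next. Nothing has to divide evenly; the only requirement is at least as many layers as nodes. This is a single-process sketch, not how Exo is actually implemented.

```python
import numpy as np

# Toy pipeline parallelism: each "node" owns a contiguous block of whole layers.
n_layers, n_nodes, hidden = 48, 3, 64
layers = [np.random.randn(hidden, hidden) * 0.1 for _ in range(n_layers)]

# Assign each node a contiguous slice of the layer list; uneven splits are fine,
# which is why three nodes are no problem here.
bounds = np.linspace(0, n_layers, n_nodes + 1, dtype=int)
chunks = [layers[bounds[i]:bounds[i + 1]] for i in range(n_nodes)]

x = np.random.randn(1, hidden)
for node_id, chunk in enumerate(chunks):
    # Activations "arrive" at this node and flow through its layers in order,
    # like a token moving down an assembly line.
    for w in chunk:
        x = np.tanh(x @ w)
    print(f"node {node_id} processed {len(chunk)} layers")
```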
Exo sometimes re-downloads models it has already cached. It occasionally grays out models that exist on disk for reasons Ziskind couldn't determine. "I have to be honest with you, Exo is a little frustrating sometimes," he admitted. But it worked well enough to run Qwen 3.5 397B—323 GB on disk—at 28 tokens per second across three machines.
That's a 397 billion parameter mixture-of-experts model running on three laptops. Not a benchmark stunt. Actually usable.
The M5 vs M3 Comparison
Ziskind promised a head-to-head comparison at the start, and he delivered. On the small Qwen 3 4B model: two M3 Ultra Mac Studios achieved 170 tokens per second. Two M5 Max MacBook Pros hit 219 tokens per second—29% faster, from laptops.
"It's less than an H100, but it's less power than my microwave, and it fits in the backpack," Ziskind noted.
The results surface a tension in local AI development that won't resolve itself through better silicon alone. Model architectures assume certain hardware configurations. Hardware configurations assume certain model architectures. When you try to run massive models on consumer machines—even clustered consumer machines—you hit limits that are partly mathematical, partly architectural, partly just the accumulated overhead of operating systems designed for general computing, not dedicated inference.
The experiment proves clustering works. It also proves clustering has boundaries that aren't obvious until you've wired up the cables and watched the memory swap.
Dev Kapoor covers open source software and developer communities for Buzzrag.
Watch the Original Video
3 MacBooks Did What One Never Could
Alex Ziskind
15m 1s
About This Source
Alex Ziskind
Alex Ziskind's YouTube channel is a haven for developers and tech enthusiasts eager to explore the intricate world of software and hardware. With over 425,000 subscribers and more than two decades of coding expertise, Alex has transitioned into content creation, offering a unique blend of humor and technical insight. His channel serves as a guide to unraveling software enigmas, with a particular focus on performance optimization and AI hardware.
More Like This
NVIDIA's Open Models: A New Era for Developers
NVIDIA's CES 2026 focuses on open models, altering developer workflows and AI ecosystems.
Decoding the Fastest Machines for Token Generation
Exploring GPU performance in generating 1M tokens and energy efficiency.
Giant Spinning Wheels Are Preventing Grid Blackouts
How 40-ton flywheels are stabilizing renewable energy grids—and why tech from space laser programs is now running port cranes in Rotterdam.
How Cloudflare Uses Lava Lamps to Encrypt the Internet
Cloudflare's San Francisco office has a wall of 100 lava lamps generating entropy for SSL/TLS encryption. Here's why computers can't be truly random.