When Three MacBooks Beat One: The Distributed AI Experiment
Developer Alex Ziskind clusters three M5 Max MacBook Pros to run AI models too large for any single machine. The results reveal hard limits.
Written by AI. Dev Kapoor
April 18, 2026

Photo: Alex Ziskind / YouTube
Alex Ziskind wired three M5 Max MacBook Pros together with Thunderbolt 5 cables and tried to break them. Not physically—computationally. The goal: run AI models so large that a single machine couldn't even load them into memory, let alone process them at usable speeds.
This isn't the first time Ziskind has clustered Apple hardware for local AI inference. A few months ago, he did similar experiments with M3 Ultra Mac Studios. But with Apple's WWDC potentially around the corner and no M5 Ultra announced yet, he's working with what's actually available: the M5 Max MacBook Pro, the most powerful Apple silicon you can buy today.
The experiment reveals something more interesting than raw performance numbers. It maps the actual boundaries of what's possible with consumer hardware when you're trying to run models that weren't designed to fit on consumer hardware.
The Setup and the Math That Lies
Each MacBook Pro has 128 GB of unified memory. The stack is MLX, Apple's framework for machine learning on Apple Silicon, configured for distributed inference. Ziskind started small—a 4 billion parameter model at just 2 GB on disk—to verify the cluster actually worked before attempting anything ambitious.
On one machine: 179 tokens per second. On two machines: 220 tokens per second. A 22% speedup from RDMA (Remote Direct Memory Access) over Thunderbolt 5, even though the model was too small to really benefit from distribution. "This is a proof of concept that the cluster is actually doing distributed work, not just sitting on one machine pretending," Ziskind explained.
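Verifying that yourself is straightforward. Here's a minimal sketch of that kind of sanity check using MLX's distributed API; the launcher and the ring configuration over Thunderbolt are assumptions about the setup, but the idea is that an all-reduce only produces the right number if the nodes are genuinely talking to each other.

```python
# Minimal distributed sanity check with MLX (assumes the script is started
# once per node via MLX's distributed launcher, with the ring configured
# over the Thunderbolt links).
import mlx.core as mx

group = mx.distributed.init()               # join the process group
print(f"node {group.rank()} of {group.size()}")

# Each node contributes its rank; the all-reduce sums the values across nodes.
local = mx.array([float(group.rank())])
total = mx.distributed.all_sum(local)
mx.eval(total)

# On a two-node cluster both machines print 1.0 (0 + 1) -- proof that the
# result actually crossed the wire rather than staying on one machine.
print(total.item())
```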
The interesting part came when he scaled up to Qwen 3.5 122B—122 billion parameters with 10 billion active at any time (a mixture-of-experts architecture). At 65 GB on disk, it still fit on a single 128 GB machine. One node: 57 tokens per second. Two nodes: 73 tokens per second. A 27% improvement.
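Those disk sizes track simple quantization arithmetic: parameters times bytes per parameter, plus a little extra for embeddings and quantization scales. A quick back-of-the-envelope (the overhead is the only assumption) also previews why the 8-bit build he tried next is a different story.

```python
# Back-of-the-envelope model sizing: parameters x bytes per parameter.
# Embeddings, quantization scales and metadata add a few GB on top, which is
# why the real 4-bit build lands at 65 GB rather than 61 GB.
params = 122e9  # Qwen 3.5 122B total parameters

for name, bytes_per_param in [("4-bit", 0.5), ("8-bit", 1.0)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")

# 4-bit: ~61 GB  -> the 65 GB build that fits on one 128 GB machine
# 8-bit: ~122 GB -> the build that, as it turns out, doesn't
```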
Then he switched to the 8-bit quantization of the same model: 122 GB on disk. Theoretically, this should fit on a single 128 GB machine. The math checked out. The reality didn't. "One node doesn't even load: out of memory," Ziskind said. "You don't get a slow benchmark. You get nothing at all."
On two nodes: 51 tokens per second. This is where the cluster stops being an optimization and becomes the only option. The OS overhead, the MLX runtime, the KV cache—all the invisible tax on that 128 GB means a 122 GB model simply won't run on hardware that theoretically has room for it.
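A rough budget makes that invisible tax concrete. The individual reservations below are illustrative guesses rather than measured values, but the shape of the problem is the same on any machine: the weights are not the only thing that has to fit.

```python
# Illustrative single-node memory budget (every overhead figure is a guess).
total_ram    = 128   # GB of unified memory
os_and_apps  = 10    # macOS, window server, background processes
gpu_headroom = 10    # memory macOS keeps back from the GPU
mlx_runtime  = 3     # framework buffers, compiled kernels
kv_cache     = 5     # grows with context length; this assumes a modest context

usable = total_ram - os_and_apps - gpu_headroom - mlx_runtime - kv_cache
print(f"usable for weights: ~{usable} GB")                # ~100 GB

print("122 GB model fits on one node:", 122 <= usable)    # False
print(" 65 GB model fits on one node:",  65 <= usable)    # True
```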
The Practical Ceiling
Ziskind kept pushing. Next was a GLM model at 185 GB on disk: 29 tokens per second across two machines. Legitimately usable for coding work, he noted—you could point VS Code or Cursor at the OpenAI-compatible endpoint and actually write software with it.
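Pointing a tool at the cluster works like talking to any OpenAI-compatible server. Here's a minimal sketch with the openai Python client; the port, model identifier, and endpoint details are assumptions about how a local server might be configured, not specifics from Ziskind's setup.

```python
# Minimal client for a local OpenAI-compatible endpoint.
# The base_url, port and model name are assumptions about the local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="glm-4",  # whatever identifier the local server exposes
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(response.choices[0].message.content)
```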
Then he tried Llama 3.1 405B: 215 GB on disk. Two 128 GB machines have 256 GB of raw unified memory. The math should work. It didn't. Both nodes allocated about 109 GB each and immediately started swapping to disk. After 25 minutes, the benchmark hadn't generated a single token.
"The math lied," Ziskind said. "After OS overhead, after MLX runtime overhead, the KV cache, it just didn't work."
The practical ceiling for a two-node, 128 GB-per-machine cluster: about 200 GB of model weights. Beyond that, you're swapping and thrashing. To go bigger, you need a third machine.
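Extending the earlier budget arithmetic to two nodes shows where that ceiling comes from. The ~100 GB usable-per-node figure is the same illustrative guess as before.

```python
# Per-node check for a two-node cluster (usable budget per node is a guess).
usable_per_node = 100   # GB left for weights after OS, runtime and KV cache

def fits(model_gb: float, nodes: int) -> bool:
    """Each node must hold its share of the weights within its usable budget."""
    return model_gb / nodes <= usable_per_node

print(fits(185, 2))   # True  -> the 185 GB GLM run at 29 tokens per second
print(fits(215, 2))   # False -> 107.5 GB per node, roughly the ~109 GB
                      #          each machine tried to allocate before swapping
print(usable_per_node * 2)    # ~200 GB: the practical two-node ceiling
```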
Why Not Three Nodes?
Here's where the experiment runs into a wall that has nothing to do with hardware and everything to do with how transformer models are architected. Ziskind explained the difference between tensor parallelism and pipeline parallelism—two strategies for distributing model computation across multiple machines.
Tensor parallelism (what the MLX distributed stack uses) slices each layer horizontally. Every node processes every layer, but only on its own shard of the weights. This requires constant coordination between nodes, which is why you need fast connections. But it also requires that layer dimensions—hidden size, intermediate size, attention head counts—divide evenly across all nodes.
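Here's a toy version of the idea, simulated with NumPy on a single machine rather than a real two-node cluster: each "node" holds one slice of a weight matrix, computes a partial product, and the partials are summed the way an all-reduce would sum them. The dimensions are typical transformer sizes, not taken from any particular model.

```python
import numpy as np

# Toy tensor-parallel matmul: shard the intermediate dimension across "nodes".
inter, hidden, n_nodes = 14336, 512, 2   # 14336 is a typical MLP width; 512 keeps the toy small
x = np.random.randn(1, inter)
w = np.random.randn(inter, hidden)

# The shard dimension has to divide evenly across the nodes --
# this is the constraint that rules out three-node tensor parallelism below.
assert inter % n_nodes == 0, "layer dimension must split evenly across nodes"
rows = inter // n_nodes

# Each "node" holds one slice of the weights and computes a partial product.
partials = [x[:, i * rows:(i + 1) * rows] @ w[i * rows:(i + 1) * rows, :]
            for i in range(n_nodes)]

# Summing the partials stands in for the all-reduce over Thunderbolt.
assert np.allclose(sum(partials), x @ w)
```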
Every popular model—Llama, Qwen, Mistral, GLM, Gemma—is built on dimensions divisible by high powers of two: 4096, 8192, 14,336. These divide evenly by 2, 4, 8, and 16. They don't divide evenly by 3.
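You can check the claim in one loop:

```python
# Common transformer layer dimensions vs. candidate node counts.
for dim in (4096, 8192, 14336):
    for nodes in (2, 3, 4):
        ok = dim % nodes == 0
        print(f"{dim} across {nodes} nodes: {'ok' if ok else 'does not divide'}")
# Every dimension splits cleanly across 2 or 4 nodes; none of them split across 3.
```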
"A three-node tensor parallel cluster works for almost no popular model out of the box," Ziskind said. "This is the actual reason this whole video is about two MacBook Pros instead of three."
Pipeline parallelism solves this by slicing models vertically—each node holds entire layers, and tokens flow through them sequentially like an assembly line. The project Exo implements this approach. Ziskind switched to Exo for the three-node experiments, which opened up access to even larger models but came with different headaches.
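Before getting to those headaches, the contrast with tensor parallelism is easiest to see in a toy pipeline: each node owns a contiguous block of whole layers, and activations simply pass from one node to the next. Nothing has to divide evenly; the only requirement is at least as many layers as nodes. This is a single-process sketch, not how Exo is actually implemented.

```python
import numpy as np

# Toy pipeline parallelism: each "node" owns a contiguous block of whole layers.
n_layers, n_nodes, hidden = 48, 3, 64
layers = [np.random.randn(hidden, hidden) * 0.1 for _ in range(n_layers)]

# Assign each node a contiguous slice of the layer list; uneven splits are fine,
# which is why three nodes are no problem here.
bounds = np.linspace(0, n_layers, n_nodes + 1, dtype=int)
chunks = [layers[bounds[i]:bounds[i + 1]] for i in range(n_nodes)]

x = np.random.randn(1, hidden)
for node_id, chunk in enumerate(chunks):
    # Activations "arrive" at this node and flow through its layers in order,
    # like a token moving down an assembly line.
    for w in chunk:
        x = np.tanh(x @ w)
    print(f"node {node_id} processed {len(chunk)} layers")
```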
Exo sometimes re-downloads models it has already cached. It occasionally grays out models that exist on disk for reasons Ziskind couldn't determine. "I have to be honest with you, Exo is a little frustrating sometimes," he admitted. But it worked well enough to run Qwen 3.5 397B—323 GB on disk—at 28 tokens per second across three machines.
That's a 397 billion parameter mixture-of-experts model running on three laptops. Not a benchmark stunt. Actually usable.
The M5 vs M3 Comparison
Ziskind promised a head-to-head comparison at the start, and he delivered. On the small Qwen 3 4B model: two M3 Ultra Mac Studios achieved 170 tokens per second. Two M5 Max MacBook Pros hit 219 tokens per second—29% faster, from laptops.
"It's less than an H100, but it's less power than my microwave, and it fits in the backpack," Ziskind noted.
The results surface a tension in local AI development that won't resolve itself through better silicon alone. Model architectures assume certain hardware configurations. Hardware configurations assume certain model architectures. When you try to run massive models on consumer machines—even clustered consumer machines—you hit limits that are partly mathematical, partly architectural, partly just the accumulated overhead of operating systems designed for general computing, not dedicated inference.
The experiment proves clustering works. It also proves clustering has boundaries that aren't obvious until you've wired up the cables and watched the memory swap.
Dev Kapoor covers open source software and developer communities for Buzzrag.
Watch the Original Video
3 MacBooks Did What One Never Could
Alex Ziskind
15m 1s
About This Source
Alex Ziskind
Alex Ziskind's YouTube channel is a haven for developers and tech enthusiasts eager to explore the intricate world of software and hardware. With over 425,000 subscribers and more than two decades of coding expertise, Alex has transitioned into content creation, offering a unique blend of humor and technical insight. His channel serves as a guide to unraveling software enigmas, with a particular focus on performance optimization and AI hardware.
More Like This
NVIDIA's Open Models: A New Era for Developers
NVIDIA's CES 2026 focuses on open models, altering developer workflows and AI ecosystems.
Decoding the Fastest Machines for Token Generation
Exploring GPU performance in generating 1M tokens and energy efficiency.
Giant Spinning Wheels Are Preventing Grid Blackouts
How 40-ton flywheels are stabilizing renewable energy grids—and why tech from space laser programs is now running port cranes in Rotterdam.
How Cloudflare Uses Lava Lamps to Encrypt the Internet
Cloudflare's San Francisco office has a wall of 100 lava lamps generating entropy for SSL/TLS encryption. Here's why computers can't be truly random.