
Apple M5 Max Crushes Local AI—Even Beats M3 Ultra

The M5 Max's prompt processing destroys Apple's desktop M3 Ultra. Real-world tests show this laptop is rewriting local AI performance expectations.

Written by Tyler Nakamura, an AI editorial voice

March 10, 2026


Photo: Alex Ziskind / YouTube

Apple's M5 Max marketing claimed 4x GPU compute for AI versus the M4 Max. That's the kind of number that makes you squint because it sounds too good, like when phone makers claim "all-day battery" and mean "if you never turn it on." But software developer and YouTuber Alex Ziskind actually tested those claims against both the M4 Max and Apple's desktop beast, the M3 Ultra Mac Studio. One result completely changed how he's thinking about this machine—and honestly, it should change how anyone shopping in this price range thinks about it too.

The thing that matters for local AI isn't just raw power. It's the interplay between GPU compute (for the initial prompt processing) and memory bandwidth (for spitting out those tokens). Apple redesigned the M5 Max's GPU with neural accelerators in every single core and bumped unified memory bandwidth to 614 GB/s. On paper, impressive. In practice? That's what Ziskind wanted to find out.

The Setup That Actually Matters

Ziskind's testing approach cuts through the spec sheet nonsense. He's running real workloads that developers and AI tinkerers actually use: software compilation, JavaScript benchmarks, and multiple local LLM frameworks. Not synthetic benchmarks designed to make press releases look good—actual tools like LM Studio and Llama.cpp that people run daily.

The M5 Max he tested has 40 GPU cores and 128GB of unified memory. That's the top-end M5 Max configuration, but still only a quarter of the M3 Ultra's 512GB memory ceiling. The M3 Ultra also has 32 CPU cores versus the M5 Max's 18. On paper, the desktop should crush the laptop in everything. That's not what happened.

First, the fundamentals. Single-core Speedometer 3.1 hit 60.5 on the M5 Max—"the highest score I've ever seen in this test," Ziskind said. That's up from 56.7 on the M4 Max. For multi-core Python algorithm work (Mandelbrot rendering that hammers all cores), the M5 Max clocked 11.6 seconds versus the M4 Max's 14.6 seconds. The M3 Ultra still won at 8.5 seconds, but the gap is narrower than the core count difference would suggest.

Apple also ditched "efficiency cores" entirely on the M5 Max. Now it's six "super cores" and 12 "performance cores." Ziskind's take: "That's really kind of a marketing thing if you ask me because the performance cores have been renamed to super cores and the new performance cores are kind of different than the old efficiency cores." Fair. But the 18-core setup is faster than the 16-core M4 Max regardless of what Apple calls them.

The SSD Speed Nobody Expected

Before even getting to AI, there's the storage. The M5 Max uses Gen 5 NVMe drives. Sequential read hit 13,647 MB/s and write hit 16,032 MB/s. That's nearly double the M4 Max and M3 Ultra, both around 7,300-8,200 MB/s. For loading large AI models that can be 60GB+, that startup time difference is real. Random read/write improved too—59 MB/s read and 45 MB/s write versus 49/35 on the M4 Max.

This matters because local LLMs don't live in RAM until you load them. Faster storage means less waiting around for models to spin up, especially if you're switching between different models frequently.
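As a back-of-envelope illustration (the 60GB model size is the article's figure, and the speeds are its benchmark results; real load times also add deserialization and memory-mapping overhead):

```python
# Rough model-load-time estimate from sequential read speed alone.
# Drive speeds come from the article's benchmarks; the 60GB model
# size is the illustrative large-model figure mentioned above.

def load_seconds(model_gb: float, read_mb_s: float) -> float:
    """Seconds to stream `model_gb` gigabytes at `read_mb_s` MB/s."""
    return model_gb * 1000 / read_mb_s

for chip, read_mb_s in [("M4 Max", 8200), ("M5 Max", 13647)]:
    print(f"{chip}: ~{load_seconds(60, read_mb_s):.1f} s for a 60GB model")
```

By this crude estimate, the M5 Max shaves a 60GB model load from roughly 7.3 seconds down to about 4.4 seconds of raw streaming time, before any framework overhead.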

Memory Bandwidth: The AI Performance Multiplier

Memory bandwidth determines how fast the system can feed data to the GPU and CPU. For token generation (the second phase of LLM inference, where the model spits out its response), higher bandwidth = more tokens per second. Apple claimed 614 GB/s on the M5 Max versus 546 GB/s on the M4 Max and 819 GB/s on the M3 Ultra.

Ziskind ran the STREAM Triad memory bandwidth test, a long-standing standard for measuring sustained memory throughput. Results: the M4 Max hit 319 GB/s, the M3 Ultra reached 337 GB/s, and the M5 Max topped out at 351 GB/s. That's about 10% faster than the M4 Max and roughly 4% faster than the much larger M3 Ultra chip.
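For reference, the Triad kernel at the heart of STREAM is just a fused multiply-add streamed over arrays too large to fit in cache. A minimal NumPy sketch (illustrative only, not a calibrated benchmark):

```python
# STREAM Triad kernel: a = b + q*c over large arrays.
# Each pass touches three arrays' worth of memory, so dividing
# bytes moved by elapsed time gives a rough sustained bandwidth.
import time
import numpy as np

n = 10_000_000            # ~80MB per float64 array, well beyond cache
b = np.ones(n)
c = np.ones(n)
q = 3.0

t0 = time.perf_counter()
a = b + q * c             # the triad: two streamed reads, one write
dt = time.perf_counter() - t0

bytes_moved = 3 * n * 8   # three float64 arrays touched once
print(f"~{bytes_moved / dt / 1e9:.1f} GB/s (single-threaded, uncalibrated)")
```

A single-threaded NumPy pass won't approach the multi-threaded figures Ziskind measured, but it shows what the test actually exercises: raw streaming through memory, not arithmetic.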

These are CPU-based measurements, so they come in lower than Apple's advertised peak (which combines GPU and CPU), but the ranking holds. More bandwidth should mean faster token generation. Does it?

Using LM Studio with the Qwen 3.5 mixture-of-experts model (35B total parameters, 3B active per token), token generation jumped from 79.1 tokens/second on the M4 Max to 88.49 tokens/second on the M5 Max. The M3 Ultra? Only 69 tokens/second, though it had faster time-to-first-token. That memory bandwidth advantage is showing up in real inference speed.
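Those figures are consistent with a crude bandwidth-bound ceiling: each generated token has to stream the model's active weights through memory at least once, so measured bandwidth divided by active-weight bytes caps tokens per second. (The ~4-bit quantization assumed below is illustrative, not the tested configuration.)

```python
# Bandwidth-bound ceiling on token generation:
# tokens/s <= measured_bandwidth / bytes_of_active_weights.
# Assumes ~4-bit quantized weights (0.5 bytes per parameter),
# an illustrative assumption rather than the tested setup.

def token_ceiling(bandwidth_gb_s: float, active_params_b: float,
                  bytes_per_param: float = 0.5) -> float:
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# Measured STREAM Triad bandwidths from the article, in GB/s
for chip, bw in [("M4 Max", 319), ("M3 Ultra", 337), ("M5 Max", 351)]:
    print(f"{chip}: <= {token_ceiling(bw, 3):.0f} tok/s theoretical ceiling")
```

The measured 88 tokens/second sits well below that ceiling, as expected: real inference also moves KV-cache and activation data and is never perfectly bandwidth-bound. But the ordering of the ceilings matches the ordering of the measured speeds.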

On a larger 120B-parameter model (gpt-oss), results were closer: 61 tokens/second on the M4 Max, 65 tokens/second on the M5 Max, 82 tokens/second on the M3 Ultra. Both the M4 Max and M5 Max pulled around 130 watts during this test; the M3 Ultra spiked to 240 watts. More power, more speed, but diminishing returns.

The Prompt Processing Shocker

Here's where things get wild. Prompt processing (PP) is the first phase of LLM inference—when the model ingests your prompt before generating a response. This stage leans heavily on GPU compute, which is exactly what Apple redesigned with those neural accelerators in every GPU core.

Using Llama.cpp's benchmark with Gemma 3 4B (a smaller dense model), the M4 Max processed prompts at 1,855 tokens per second. The M3 Ultra hit 2,959 tokens/second. And the M5 Max? 4,468 tokens per second.

Read that again. The laptop beat the desktop. By a lot. "Apple was not lying," Ziskind said. "This is for real, folks."

That's not 4x versus the M4 Max—it's about 2.4x—but it's still a massive jump. And crucially, it's 50% faster than the M3 Ultra in prompt processing despite having fewer GPU cores (40 vs the Ultra's 76). Those neural accelerators aren't just marketing. They're doing actual work.
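The ratios are easy to sanity-check from the quoted throughput numbers:

```python
# Prompt-processing throughput from the Llama.cpp benchmark (tok/s)
pp = {"M4 Max": 1855, "M3 Ultra": 2959, "M5 Max": 4468}

print(f"M5 Max vs M4 Max:   {pp['M5 Max'] / pp['M4 Max']:.2f}x")
print(f"M5 Max vs M3 Ultra: {pp['M5 Max'] / pp['M3 Ultra']:.2f}x")
```

That works out to about 2.41x over the M4 Max and 1.51x over the M3 Ultra, matching the "about 2.4x" and "50% faster" figures.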

What This Actually Means

If you're running local LLMs regularly—not just tinkering but actually using them for code generation, writing assistance, or research—the M5 Max is legitimately competitive with Apple's desktop hardware for a lot of workflows. It won't replace the Ultra for massive batch processing or models that need 512GB of RAM. But for the intersection of portability and AI performance? This is a different calculation than previous generations.

The M3 Ultra still wins on sustained multi-core workloads and has way more memory headroom. If you're training models or running simultaneous large contexts, that matters. But for iterative development work where prompt processing speed affects your flow? The M5 Max's faster PP could feel snappier in practice.

And there's the other thing: power draw. The M5 Max pulls roughly half the watts of the M3 Ultra under load. It's doing competitive AI work in a laptop form factor at laptop power budgets. That's not a minor detail if you actually move around with your machine.

Ziskind's obvious next question: "This makes me pretty excited for what might be coming in the M5 Ultra." If the Max gets this kind of prompt processing boost with 40 GPU cores, what happens when Apple fuses two of these together? The math gets interesting really fast.

—Tyler Nakamura

Watch the Original Video

"Apple's New M5 Max Changes the Local AI Story" by Alex Ziskind (13m 59s), available on YouTube.

About This Source

Alex Ziskind

Alex Ziskind is a seasoned software developer turned content creator, captivating an audience of over 425,000 subscribers with his tech-savvy insights and humor-infused reviews. With more than 20 years in the coding realm, Alex's YouTube channel serves as a digital playground for developers eager to explore software enigmas and tech trends.
