Apple's M5 Max Just Changed the Local AI Game

While Anthropic's Claude API was going down mid-recording (again), IndyDevDan was running state-of-the-art language models on his laptop. No cloud dependency. No API bills. Just Apple silicon doing what it apparently does best.

The timing is almost too perfect. Every few weeks, we see another cloud AI provider go dark for a few hours, and every few weeks, the conversation about local inference gets a little louder. But Dan's new benchmarking video isn't just vibes. It's hard numbers comparing Apple's brand-new M5 Max chip against last year's M4 Max, and the performance gap is wider than you might expect.

The Format War Nobody's Talking About

Here's the thing that caught my attention: this isn't really a story about hardware generations. It's a story about software optimization that most people are completely missing.

Dan tested four model configurations: Qwen 3.5 and Gemma 4, each in both GGUF (the llama.cpp file format most people use) and MLX format (weights converted for MLX, Apple's machine learning framework for Apple silicon). The results weren't subtle. The MLX variant of Qwen 3.5 hit 118 tokens per second on the M5 Max. The GGUF version? 60 tokens per second.

That's not a minor difference. That's the MLX version running almost twice as fast as the format most Mac users are probably running right now.

"Prefill speed is almost double using the MLX variant," Dan notes in the benchmark. "If you're running GGUF models on Apple Silicon in 2026, you're leaving 2x performance on the table."

The reason comes down to how deeply MLX integrates with Apple's unified memory architecture and GPU neural accelerators. GGUF is platform-agnostic, which means it runs almost everywhere but is rarely tuned as tightly for any one platform. MLX is purpose-built for Apple silicon, and that specialization shows up in the benchmarks as raw speed.

What "Fast Enough" Actually Means

Dan introduces a useful framework here: wall clock time versus tokens per second. Tokens per second sounds impressive in marketing materials, but wall clock time is what you actually experience—how long you sit waiting for a response, including all the hidden costs of loading models into memory, processing your prompt, and generating output.
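To make the distinction concrete, here is a minimal sketch of how wall clock time could be decomposed. It is not Dan's benchmark code. The `RunProfile` type, the load time, and the prefill speed are hypothetical values chosen for illustration, and treating the 118 and 60 tokens-per-second figures as generation speeds is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class RunProfile:
    """Hypothetical timing inputs for one local inference request."""
    load_seconds: float       # time to load weights into memory (0 if already warm)
    prefill_tok_per_s: float  # prompt-processing speed
    decode_tok_per_s: float   # generation speed, the usual "tokens per second"


def wall_clock_seconds(profile: RunProfile, prompt_tokens: int, output_tokens: int) -> float:
    """Total time the user waits: model load + prompt processing + generation."""
    prefill = prompt_tokens / profile.prefill_tok_per_s
    decode = output_tokens / profile.decode_tok_per_s
    return profile.load_seconds + prefill + decode


# ASSUMPTION: load and prefill values are invented and held equal for both
# formats, purely to isolate the effect of generation speed.
mlx = RunProfile(load_seconds=5.0, prefill_tok_per_s=1000.0, decode_tok_per_s=118.0)
gguf = RunProfile(load_seconds=5.0, prefill_tok_per_s=1000.0, decode_tok_per_s=60.0)

for name, profile in (("MLX", mlx), ("GGUF", gguf)):
    total = wall_clock_seconds(profile, prompt_tokens=2000, output_tokens=300)
    print(f"{name}: {total:.1f} s")
```

With these made-up inputs, the wait comes out at roughly 9.5 seconds versus 12.0 seconds. A near-2x gap in generation speed shrinks to about 1.26x once load and prompt processing are counted, which is why wall clock time is the more honest number.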

On simple prompts ("explain what a hash table is in two sentences"), both chips performed well. The M5 Max showed 15-50% faster wall clock times than the M4 Max across most tests, but both were comfortably in what Dan calls the "fully usable" range, which he defines as anything above 30 tokens per second.

The real stress test came from the context scaling benchmark, where models had to perform breadth-first search across increasingly large graphs. At 200 tokens of context, both machines handled it easily. At 32K tokens, things got interesting. Dan could hear the M4 Max's fans spinning up hard. The M5 Max stayed relatively quiet while maintaining better performance.

"The 32K is what I'm seeing as the proper context limit for these small language models," Dan observes. "I'm talking 35 billion parameters and below."

That's a useful data point. We're used to hearing about models with million-token context windows, but those are cloud-scale models running on data center GPUs. On local hardware, 32K seems to be where the performance cliff arrives: the models can technically accept more context, but past that point both output quality and speed degrade noticeably.
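For readers who want to try a similar context-scaling test, the sketch below shows one way such a benchmark could be built. It is not Dan's harness. The function names, the prompt wording, and the graph sizes are all hypothetical. It generates a connected graph, renders it as a prompt that grows with the node count, and computes the breadth-first order to check a model's answer against.

```python
import random
from collections import deque


def make_graph(num_nodes: int, extra_edges: int, seed: int = 0) -> dict[int, list[int]]:
    """Build a connected undirected graph: a random spanning tree plus extra edges."""
    rng = random.Random(seed)
    adj: dict[int, set[int]] = {n: set() for n in range(num_nodes)}
    for node in range(1, num_nodes):
        parent = rng.randrange(node)  # linking to an earlier node keeps the graph connected
        adj[node].add(parent)
        adj[parent].add(node)
    for _ in range(extra_edges):
        a, b = rng.sample(range(num_nodes), 2)  # requires num_nodes >= 2
        adj[a].add(b)
        adj[b].add(a)
    return {n: sorted(neighbors) for n, neighbors in adj.items()}


def bfs_order(adj: dict[int, list[int]], start: int = 0) -> list[int]:
    """Ground truth: nodes in breadth-first order, neighbors visited in ascending order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


def build_prompt(adj: dict[int, list[int]], start: int = 0) -> str:
    """Render the graph as text; a larger graph means a longer prompt."""
    lines = [f"{node}: {' '.join(map(str, neighbors))}" for node, neighbors in adj.items()]
    return (
        "Adjacency list (node: neighbors):\n" + "\n".join(lines)
        + f"\n\nList the nodes in breadth-first order starting from node {start}, "
        "visiting neighbors in ascending order."
    )


# Hypothetical sizes; prompt length is printed in characters, not tokens.
for size in (10, 100, 1000):
    graph = make_graph(size, extra_edges=size // 2)
    prompt = build_prompt(graph)
    expected = bfs_order(graph)
    print(size, len(prompt), expected[:5])
```

Sending each prompt to a local model and comparing its answer with `expected` would give both a correctness signal and a timing at each context size. Counting actual tokens would need the model's own tokenizer, which this sketch leaves out.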

The Gemma 4 Surprise

Among the models tested, Google's Gemma 4 stood out. It's a 26-billion-parameter model that somehow fits into just 16GB of RAM in its MLX variant, while maintaining competitive performance with the 35-billion-parameter Qwen model.
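The 16GB figure is less mysterious with some back-of-the-envelope arithmetic. The sketch below estimates weight storage only, at a few common precisions. The article does not state which quantization the MLX variant uses, so the bit widths here are assumptions, and the estimate ignores the KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# ASSUMPTION: these bit widths are illustrative, not taken from the benchmark.
for bits in (16, 8, 4):
    print(f"26B at {bits}-bit: {weight_memory_gb(26, bits):.1f} GB")
    print(f"35B at {bits}-bit: {weight_memory_gb(35, bits):.1f} GB")
```

At 4 bits per weight, 26 billion parameters need about 13 GB, which would fit in 16GB with a little headroom. The same model at 16-bit precision would need about 52 GB.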

"It's great to have a model coming out of the US that's truly open and actually competitive with the Qwen series and the other Chinese labs," Dan notes.

That matters more than it might seem. For a while, the open model conversation was dominated by Chinese labs releasing increasingly powerful models while US companies largely focused on closed APIs. Gemma 4 represents Google actually shipping something competitive in the open model space, and it's optimized specifically for the hardware most developers already own.

The efficiency angle is particularly interesting. Smaller models that can do, say, 80% of what larger models do, while running roughly twice as fast in half the memory, are going to win for a huge range of everyday tasks. Not everything needs GPT-4 scale.

Where This Actually Leads

Dan's thesis throughout the video is that we're approaching a tipping point where local models become genuinely preferable to cloud APIs for certain workloads. Not all workloads—he's clear about that—but enough that the calculus changes.

The three factors he emphasizes: privacy, cost, and reliability. Privacy because your data never leaves your machine. Cost because there's no per-token API charge. Reliability because your local model doesn't go down when Anthropic's servers do.

But there's a fourth factor he doesn't explicitly name: control. When you run models locally, you're not subject to sudden pricing changes, model deprecations, or terms of service updates. You own your inference stack.

That ownership comes with tradeoffs, obviously. You need to buy the hardware upfront (a fully-specced M5 Max isn't cheap), you're limited by your local compute power, and you're responsible for managing your own models. For many use cases, cloud APIs are still the obvious choice.

But for developers who are already on Apple silicon, who work with sensitive data, who want predictable costs, or who just got burned one too many times by API outages? The performance numbers Dan's showing suggest the local option is no longer a compromise. It's a legitimate alternative with its own set of advantages.

The interesting question is what happens when Apple ships the M5 Ultra or M6 generation with even more unified memory. Dan mentions the possibility of 500GB of RAM in future Mac hardware. At that point, the models you can run locally start overlapping significantly with what people currently pay cloud providers for.

That doesn't mean cloud AI is going away—obviously not. But it does mean the "you have to use cloud APIs" framing that's dominated the conversation for the past two years might need updating. The hardware is here. The models are here. The tooling is getting better fast. What's missing is just awareness that the local option actually works now.

—Zara Chen, Tech & Politics Correspondent