Edited by humans. Written by AI.

Apple's M5 Max Just Changed the Local AI Game

New benchmarks show Apple's M5 Max running local AI models 15-50% faster than M4, with MLX format delivering double the performance of standard GGUF.

Written by AI. Zara Chen

April 21, 2026 · 6 min read
Hands holding a silver MacBook Pro with Apple logo centered, with "M5 GEMMA4 MLX" text displayed above against a dark…

Photo: IndyDevDan / YouTube

While Claude's APIs were going down mid-recording (again), IndyDevDan was running state-of-the-art language models on his laptop. No cloud dependency. No API bills. Just Apple silicon doing what it apparently does best.

The timing is almost too perfect. Every few weeks, another cloud AI provider goes dark for a few hours, and every few weeks, the conversation about local inference gets a little louder. But Dan's new benchmarking video isn't just vibes. It's hard numbers comparing Apple's brand-new M5 Max chip against last year's M4 Max, and the gap is real: 15-50% faster wall clock times across most of his tests.

The Format War Nobody's Talking About

Here's the thing that caught my attention: this isn't really a story about hardware generations. It's a story about software optimization that most people are completely missing.

Dan tested four model configurations: Qwen 3.5 and Gemma 4, each in both GGUF format (the llama.cpp file format most people use) and MLX format (weights converted for MLX, Apple's machine learning framework for Apple silicon). The results weren't subtle. The MLX variant of Qwen 3.5 hit 118 tokens per second on the M5 Max. The GGUF version? 60 tokens per second.

That's not a minor difference. That's the MLX version running almost twice as fast as the format most Mac users are probably running right now.
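To put "almost twice as fast" into numbers, here is a minimal sketch of the arithmetic. The two throughput figures come from the video. The 1,000-token answer length is an arbitrary illustration, and the estimate covers generation only, not model loading or prompt processing.

```python
def speedup(fast_tps: float, slow_tps: float) -> float:
    """Return how many times higher fast_tps is than slow_tps."""
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


# Throughput reported for Qwen 3.5 on the M5 Max (tokens per second).
mlx_tps = 118.0
gguf_tps = 60.0

ratio = speedup(mlx_tps, gguf_tps)
print(f"MLX throughput is {ratio:.2f}x GGUF")  # 1.97x

# Illustrative only: seconds to generate a 1,000-token answer at each rate.
answer_tokens = 1000  # assumption, not a figure from the video
gguf_seconds = answer_tokens / gguf_tps
mlx_seconds = answer_tokens / mlx_tps
print(f"GGUF: {gguf_seconds:.1f} s, MLX: {mlx_seconds:.1f} s")  # 16.7 s vs 8.5 s
```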

"Prefill speed is almost double using the MLX variant," Dan notes in the benchmark. "If you're running GGUF models on Apple Silicon in 2026, you're leaving 2x performance on the table."

The reason comes down to how deeply MLX integrates with Apple's unified memory architecture and GPU neural accelerators. GGUF is built to run across many platforms and backends, so it isn't tuned for any one of them. MLX is purpose-built for Apple silicon, and that specialization shows up in the benchmarks as raw speed.

What "Fast Enough" Actually Means

Dan introduces a useful framework here: wall clock time versus tokens per second. Tokens per second sounds impressive in marketing materials, but wall clock time is what you actually experience—how long you sit waiting for a response, including all the hidden costs of loading models into memory, processing your prompt, and generating output.
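One way to make the distinction concrete is a rough model of wall clock time as the sum of load, prompt processing, and generation. This is a sketch, not Dan's benchmark harness: the `RunProfile` fields and every number in the example are hypothetical placeholders chosen to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass
class RunProfile:
    load_s: float       # seconds to load the model (0 if already in memory)
    prefill_tps: float  # prompt-processing speed, tokens per second
    decode_tps: float   # generation speed, tokens per second


def wall_clock_seconds(p: RunProfile, prompt_tokens: int, output_tokens: int) -> float:
    """Total time the user waits: load + prefill + decode."""
    prefill_s = prompt_tokens / p.prefill_tps
    decode_s = output_tokens / p.decode_tps
    return p.load_s + prefill_s + decode_s


def effective_tps(p: RunProfile, prompt_tokens: int, output_tokens: int) -> float:
    """Output tokens divided by everything the user actually waited for."""
    return output_tokens / wall_clock_seconds(p, prompt_tokens, output_tokens)


# Hypothetical profile: none of these values are measurements from the video.
profile = RunProfile(load_s=4.0, prefill_tps=800.0, decode_tps=60.0)

total = wall_clock_seconds(profile, prompt_tokens=32_000, output_tokens=500)
felt = effective_tps(profile, prompt_tokens=32_000, output_tokens=500)
print(f"wall clock: {total:.1f} s, effective rate: {felt:.1f} tokens/s")
```

With a long prompt, most of the wait in this toy profile is prompt processing, so the rate the user experiences lands well below the headline generation speed.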

On simple prompts ("explain what a hash table is in two sentences"), both chips performed well. The M5 Max showed 15-50% faster wall clock times than the M4 Max across most tests, but both were comfortably in what Dan calls the "fully usable" range—anything above 30 tokens per second.

The real stress test came from the context scaling benchmark, where models had to perform breadth-first search across increasingly large graphs. At 200 tokens of context, both machines handled it easily. At 32K tokens, things got interesting. Dan could hear the M4 Max's fans spinning up hard. The M5 Max stayed relatively quiet while maintaining better performance.
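A context scaling test of this kind can be sketched as follows: generate a graph of a chosen size, write it into the prompt as an edge list, and compute the correct breadth-first order to grade the model's answer against. This is an assumed reconstruction of the idea, not the harness used in the video; the function names, prompt wording, and graph sizes are all hypothetical.

```python
import random
from collections import deque


def make_graph(n_nodes: int, extra_edges: int, seed: int = 0) -> dict[int, set[int]]:
    """Build a connected undirected graph: a random tree plus extra random edges."""
    rng = random.Random(seed)
    adj: dict[int, set[int]] = {i: set() for i in range(n_nodes)}
    for i in range(1, n_nodes):
        j = rng.randrange(i)  # attach each node to an earlier one, so the graph is connected
        adj[i].add(j)
        adj[j].add(i)
    for _ in range(extra_edges):
        a, b = rng.sample(range(n_nodes), 2)
        adj[a].add(b)
        adj[b].add(a)
    return adj


def bfs_order(adj: dict[int, set[int]], start: int = 0) -> list[int]:
    """Reference answer: breadth-first order, visiting neighbors in ascending order."""
    seen = {start}
    order: list[int] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(adj[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def build_prompt(adj: dict[int, set[int]], start: int = 0) -> str:
    """Serialize the graph as an edge list followed by the task instruction."""
    edges = [f"{a} -- {b}" for a in sorted(adj) for b in sorted(adj[a]) if a < b]
    return (
        "Here is an undirected graph as an edge list:\n"
        + "\n".join(edges)
        + f"\nList the nodes in breadth-first order starting from {start}, "
        "visiting neighbors in ascending order."
    )


# Larger graphs produce longer prompts, which is what stresses the context window.
for n in (10, 100, 1000):
    graph = make_graph(n, extra_edges=n // 2)
    prompt = build_prompt(graph)
    expected = bfs_order(graph)
    print(f"{n} nodes -> prompt of {len(prompt)} characters, {len(expected)} nodes in answer")
```

Prompt size is reported in characters here because the token count depends on each model's tokenizer.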

"The 32K is what I'm seeing as the proper context limit for these small language models," Dan observes. "I'm talking 35 billion parameters and below."

That's a useful data point. We're used to hearing about models with million-token context windows, but those are cloud-scale models burning through data center GPUs. On local hardware, 32K seems to be where the performance cliff arrives—where the models start struggling not because they can't technically handle more context, but because the quality and speed degrade noticeably.

The Gemma 4 Surprise

Among the models tested, Google's Gemma 4 stood out. It's a 26-billion-parameter model that somehow fits into just 16GB of RAM in its MLX variant, while maintaining competitive performance with the 35-billion-parameter Qwen model.
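The "somehow" is most likely quantization. The article does not state which precision the MLX variant uses, so the 4-bit case below is an assumption, and the estimate covers the weights only (no KV cache, activations, or per-group quantization overhead). A back-of-the-envelope sketch:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# 26 billion parameters at common precisions; which one is used is an assumption.
for bits in (16, 8, 4):
    print(f"26B parameters at {bits}-bit: {weight_memory_gb(26e9, bits):.1f} GB")
```

At 4 bits per weight the weights come to roughly 13 GB, which is consistent with the 16GB figure once some working memory is added on top.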

"It's great to have a model coming out of the US that's truly open and actually competitive with the Qwen series and the other Chinese labs," Dan notes.

That matters more than it might seem. For a while, the open model conversation was dominated by Chinese labs releasing increasingly powerful models while US companies largely focused on closed APIs. Gemma 4 represents Google actually shipping something competitive in the open model space, and it's optimized specifically for the hardware most developers already own.

The efficiency angle is particularly interesting. Smaller models that can do 80% of what larger models do, but run twice as fast and use half the memory, are going to win for a huge range of everyday tasks. Not everything needs GPT-4 scale.

Where This Actually Leads

Dan's thesis throughout the video is that we're approaching a tipping point where local models become genuinely preferable to cloud APIs for certain workloads. Not all workloads—he's clear about that—but enough that the calculus changes.

The three factors he emphasizes: privacy, cost, and reliability. Privacy because your data never leaves your machine. Cost because there's no per-token API charge. Reliability because your local model doesn't go down when Anthropic's servers do.

But there's a fourth factor he doesn't explicitly name: control. When you run models locally, you're not subject to sudden pricing changes, model deprecations, or terms of service updates. You own your inference stack.

That ownership comes with tradeoffs, obviously. You need to buy the hardware upfront (a fully-specced M5 Max isn't cheap), you're limited by your local compute power, and you're responsible for managing your own models. For many use cases, cloud APIs are still the obvious choice.

But for developers who are already on Apple Silicon, who work with sensitive data, who want predictable costs, or who just got burned one too many times by API outages? The performance numbers Dan's showing suggest the local option is no longer a compromise. It's a legitimate alternative with its own set of advantages.

The interesting question is what happens when Apple ships the M5 Ultra or M6 generation with even more unified memory. Dan mentions the possibility of 500GB of RAM in future Mac hardware. At that point, the models you can run locally start overlapping significantly with what people currently pay cloud providers for.

That doesn't mean cloud AI is going away—obviously not. But it does mean the "you have to use cloud APIs" framing that's dominated the conversation for the past two years might need updating. The hardware is here. The models are here. The tooling is getting better fast. What's missing is just awareness that the local option actually works now.

—Zara Chen, Tech & Politics Correspondent
