oMLX: A Smarter Local AI Runner for Apple Silicon
oMLX beats LM Studio 47 vs 16 tokens/sec on Apple Silicon—but occasional 400 errors mean it's not plug-and-play. Here's what the tradeoff actually looks like.
Written by AI. Marcus Chen-Ramirez

There's a particular kind of frustration that comes with owning a MacBook Pro and watching a language model eat your entire machine alive. Your fan screams. Your second monitor becomes a slideshow. Your expensive laptop, ostensibly built for professional work, has been reduced to a single-purpose space heater that happens to output Python.
This is the problem oMLX is trying to solve—and according to a recent deep-dive by Andrus from Better Stack, it does so in ways that are technically interesting enough to be worth understanding, not just benchmarking.
The Memory Problem, Properly Framed
To appreciate what oMLX is doing, you first need to understand why running large models on consumer hardware is so painful in the first place. On a traditional PC, your CPU and GPU live in separate memory worlds. Every time a model needs to move data between them, it travels across the PCIe bus—a bottleneck that gets uglier the larger your model gets.
Apple Silicon sidesteps this by design. The unified memory architecture means the CPU and GPU are drawing from the same physical pool. Apple's MLX framework exploits this with what it calls zero-copy arrays: when the GPU finishes a computation, the CPU can read those results immediately, without shuffling a single byte. Add lazy computation—where math operations are deferred until the last possible moment to optimize the entire calculation graph—and you have a foundation that's genuinely friendlier to inference workloads than traditional PC architectures.
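For readers who haven't met lazy computation before, here is a toy sketch of the idea in plain Python. It is not MLX's API or implementation; the `Lazy` class and `constant` helper are hypothetical names invented for this illustration. The point is only that building an expression records a graph, and no arithmetic happens until something asks for the result.

```python
class Lazy:
    """A deferred value: records an operation and its inputs, computes only on eval()."""

    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs
        self._value, self._done = None, False

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)

    def eval(self):
        # Walk the recorded graph the first time a result is requested.
        if not self._done:
            args = [x.eval() if isinstance(x, Lazy) else x for x in self.inputs]
            self._value, self._done = self.fn(*args), True
        return self._value


def constant(v):
    """Wrap a plain value as a leaf node of the graph (hypothetical helper)."""
    return Lazy(lambda: v)


x, y = constant(3), constant(4)
z = (x + y) * x      # builds a graph; nothing is computed yet
print(z._done)       # False
print(z.eval())      # 21
```

Because the whole graph exists before anything runs, a real framework has the chance to reorder or fuse operations first, which is the optimization the paragraph above alludes to.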
oMLX builds on top of MLX, but its real contribution is in how it handles something called the KV cache.
What the KV Cache Actually Is (and Why It Matters Here)
In a typical LLM session, the model keeps a running cache of the attention keys and values it has computed for every token in the conversation: every prompt, every response, every system instruction. This KV cache lives in RAM, which is fast but finite, and it grows with every token added to the session. Eventually, you run out, and things get ugly.
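To put rough numbers on that growth, here is a back-of-the-envelope sketch of how KV cache size scales with context length. The layer count, head count, and head dimension below are hypothetical placeholders chosen for illustration, not the configuration of any model mentioned in this article, and the two-bytes-per-value figure assumes 16-bit cache entries.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: one key vector and one value vector
    per token, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical dimensions, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for tokens in (4_096, 16_384, 32_768):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB")
```

With these made-up dimensions the cache grows linearly from 0.75 GiB at 4,096 tokens to 6 GiB at 32,768, which is the shape of the problem on a 16GB machine that also has to hold the model weights.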
LM Studio's approach is essentially to hold all of this in a "hot state"—everything active, everything in memory, everything available instantly. It's stable. It works. It also means that on a 16GB or 24GB MacBook, you're one long coding session away from grinding your system to a halt.
oMLX introduces what it calls a two-tier KV cache. The immediate, recent context stays in unified memory where it needs to be for fast generation. But older context—big system prompts, lengthy tool definitions, the stuff that accumulates over a long agentic session—gets frozen and offloaded to SSD. When that context becomes relevant again, oMLX reads it back from disk and "hydrates" the model's state.
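The two-tier idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not oMLX's actual implementation: the `TwoTierCache` class, its least-recently-used eviction policy, and the pickle-on-disk format are all choices made for this illustration, and real KV blocks would be tensors rather than short lists.

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TwoTierCache:
    """Toy two-tier cache: a small in-memory hot tier backed by files on disk."""

    def __init__(self, hot_capacity, cold_dir):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # block_id -> data, least recently used first
        self.cold_dir = Path(cold_dir)
        self.cold_dir.mkdir(parents=True, exist_ok=True)

    def _cold_path(self, block_id):
        return self.cold_dir / f"{block_id}.pkl"

    def put(self, block_id, data):
        self.hot[block_id] = data
        self.hot.move_to_end(block_id)
        # When the hot tier is full, spill the least recently used block to disk.
        while len(self.hot) > self.hot_capacity:
            old_id, old_data = self.hot.popitem(last=False)
            self._cold_path(old_id).write_bytes(pickle.dumps(old_data))

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        path = self._cold_path(block_id)
        if path.exists():
            # "Hydrate": read the block back from disk into the hot tier.
            # The cold copy is left in place so it survives a restart.
            data = pickle.loads(path.read_bytes())
            self.put(block_id, data)
            return data
        raise KeyError(block_id)


cache = TwoTierCache(hot_capacity=2, cold_dir=tempfile.mkdtemp())
cache.put("system_prompt", [0.1, 0.2])
cache.put("tool_defs", [0.3])
cache.put("turn_1", [0.4])             # spills "system_prompt" to disk
print(list(cache.hot))                 # ['tool_defs', 'turn_1']
print(cache.get("system_prompt"))      # [0.1, 0.2], read back from disk
```

The tradeoff is visible even in the toy: the hot tier stays bounded no matter how long the session runs, and the price is a disk read whenever old context is needed again.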
As Andrus describes it: "oMLX is more like a modern operating system. It's smart enough to know what data needs to be in your brain right now and what can be paged to disk."
It's a reasonable analogy. Virtual memory has been doing this for CPUs since the 1960s. The interesting question is whether the latency penalty of SSD reads is small enough to be worth the RAM savings. The benchmarks Andrus ran suggest that, at least in this one scenario, the answer is yes.
The Test: A Real Coding Task, Not a Toy Benchmark
Andrus ran a practical agentic task rather than a synthetic speed test: building a web app that searches movies, supports wishlisting, and handles ratings via the MovieDB API. The model was Qwen 3.6, a 35-billion-parameter model quantized to 4 bits, running on an M2 MacBook Pro. The agent harness was Codex CLI rather than Claude Code, a choice worth noting. Claude Code reportedly consumes roughly 16,200 tokens just for its own system prompts and tool definitions on a blank session, which in a 32k context window leaves only about half the available space before you've written a single line of project code.
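The context budget arithmetic is quick to check. The sketch below assumes "32k" means 32,768 tokens; the 16,200-token overhead is the reported figure from the paragraph above.

```python
CONTEXT_WINDOW = 32 * 1024   # assumption: "32k" means 32,768 tokens
HARNESS_OVERHEAD = 16_200    # reported overhead of a blank Claude Code session

remaining = CONTEXT_WINDOW - HARNESS_OVERHEAD
print(f"overhead:  {HARNESS_OVERHEAD / CONTEXT_WINDOW:.1%}")                 # 49.4%
print(f"remaining: {remaining} tokens ({remaining / CONTEXT_WINDOW:.1%})")  # 16568 tokens (50.6%)
```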
The task completed in roughly 20 minutes on oMLX. The same task on LM Studio, with the same model and the same constraints, took approximately 35 minutes. Token generation speed was the primary driver: oMLX averaged around 47 tokens per second, while LM Studio averaged around 16. A gap of nearly three times is too large to write off as noise.
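Putting the two reported comparisons side by side is instructive. The figures below are the approximate ones quoted above, so treat the ratios as rough.

```python
omlx_tps, lmstudio_tps = 47, 16           # tokens per second, as reported
omlx_minutes, lmstudio_minutes = 20, 35   # approximate task durations, as reported

throughput_ratio = omlx_tps / lmstudio_tps
wall_clock_ratio = lmstudio_minutes / omlx_minutes
print(f"generation throughput: {throughput_ratio:.2f}x")   # 2.94x
print(f"end-to-end task time:  {wall_clock_ratio:.2f}x")   # 1.75x
```

The end-to-end speedup is smaller than the raw throughput ratio. One plausible reading, though the source does not break it down, is that part of an agentic run goes to work that token generation speed doesn't touch, such as prompt processing and tool execution.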
The multitasking difference was also notable. While LM Studio was running, Andrus couldn't watch video on a second monitor without lag—the RAM pressure was that severe. oMLX, by offloading older context to SSD, left enough headroom to browse normally. For anyone who actually needs to work while an agent runs in the background, this is a meaningful quality-of-life difference, not just a benchmark curiosity.
The Honest Part: oMLX Isn't Frictionless
The picture isn't uniformly rosy, and Andrus doesn't pretend otherwise.
oMLX threw 400 errors two or three times during the session when prompts exceeded the 30k context limit. LM Studio didn't throw a single one. Context stability is genuinely better in LM Studio, and for workflows where reliability matters more than speed—automated pipelines, unattended runs, anything where you can't babysit the session—that's a real consideration.
The persistent SSD caching does partially compensate for this. When Andrus cleared the Codex session after hitting a 400 error, oMLX retained the computational state on disk. Re-launching with a continuation prompt caused oMLX to recognize the prefix and reload the model's state from the cache, picking up mid-project without the hallucinations that typically follow a context wipe. "Instead of hallucinating or starting from scratch, it picked up right where it left off," he noted. The final session logged 1.78 million tokens processed, with 1.59 million served from cache—an 89% cache efficiency rate.
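How might a runner "recognize the prefix" of a relaunched session? One common approach, sketched here as an assumption rather than a description of oMLX's internals, is to hash the prompt in fixed-size blocks, chaining each hash to the one before it, and then look up how many leading blocks are already cached. The function names and the block size of 4 are hypothetical.

```python
import hashlib


def block_hashes(tokens, block_size=4):
    """Hash each full block of tokens, chaining in the previous hash so a
    block's key depends on everything that came before it."""
    hashes, prev = [], b""
    full_length = len(tokens) - len(tokens) % block_size
    for i in range(0, full_length, block_size):
        block = tokens[i:i + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


def cached_prefix_len(tokens, store, block_size=4):
    """Return how many leading tokens are already covered by cached blocks."""
    covered = 0
    for h in block_hashes(tokens, block_size):
        if h not in store:
            break
        covered += block_size
    return covered


# First session: two full blocks end up in the cache.
store = set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))

# Relaunched session: same opening, new continuation.
new_prompt = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]
print(cached_prefix_len(new_prompt, store))   # 8

# Cache share from the session totals reported above.
print(f"{1.59 / 1.78:.1%}")                   # 89.3%
```

Only the tokens after the matched prefix need fresh computation, which is how a session can report most of its tokens as served from cache.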
That's an impressive number. It also means the SSD is doing a lot of work, which raises questions the video doesn't address: thermal implications over long sessions, SSD write endurance over time, behavior when the SSD itself becomes a bottleneck. These aren't dealbreakers, but they're worth sitting with before treating oMLX as a universal upgrade.
What This Actually Tells Us
The most interesting thing about oMLX isn't the benchmark numbers—it's what those numbers reveal about where the local AI tooling ecosystem currently stands.
LM Studio has been the default answer to "how do I run models locally on a Mac" for a while now. It earned that position through breadth: it supports a huge range of models and hardware configurations, its interface is approachable, and it generally just works. But "generally just works" is a different design goal than "extracts maximum performance from this specific hardware," and oMLX is betting that there's an audience willing to accept a bit more operational overhead in exchange for speed and lower memory pressure.
There's a historical pattern here. Generalist tools dominate early in any platform's lifecycle, then specialists emerge as users' needs sharpen. The question is whether Apple Silicon's installed base—and the growing appetite for running capable local models on it—is large enough to sustain a specialist ecosystem. oMLX appears to be a bet that it is.
"These kinds of projects like oMLX are proving that we don't necessarily need 128 GB of RAM to run powerful agents," Andrus concludes. "We just need a smarter way to manage the memory we already have."
That framing matters. The ambient narrative in AI hardware circles tends toward more—more VRAM, more RAM, bigger machines. oMLX is an argument that the constraint isn't raw capacity, it's architecture. Smarter memory management on modest hardware might be a more democratic path to capable local inference than waiting for everyone to buy a 192GB Mac Studio.
Whether oMLX is mature enough for that argument to hold at scale, outside one developer's M2 MacBook Pro over a single coding session, remains an open question. The results here are promising. They're also a sample size of one.
By Marcus Chen-Ramirez, Senior Technology Correspondent