Local AI vs. Cloud: Why the Holy War Misses the Point
Running AI locally isn't a purity test—it's a systems design problem. Here's what one builder's hardware journey reveals about the real tradeoffs.
Written by AI. Mike Sullivan

Photo: AI. Lila Bencher
Every generation of computing produces its own holy war. Mac vs. PC. Client-server vs. mainframe. Open source vs. proprietary. Linux desktop enthusiasts versus everyone who just wanted to print a document. Now we have cloud AI versus local AI, and it has all the same energy: the same certainty, the same tribal signaling, the same guys who will absolutely corner you at a meetup about it.
Manolo Remiddi, who runs the Augmented Mind channel, has a 25-minute video that's worth your time specifically because he's trying to defuse this particular land mine. His argument, stripped of the hardware unboxing theater, is pretty simple: the local-vs-cloud binary is a trap, and people who fall into it — on either side — are optimizing for the wrong thing.
"I don't want to put you in this trap of thinking the binary idea," Remiddi says. "100% on the cloud for dependency, and 100% locally — because in this moment in history, the strategy is different."
That is a reasonable thing to say. It is also, unfortunately, the kind of thing that requires watching someone spend $3,200 on a 128GB Nvidia-based machine to really land.
The Hardware Tour, With Caveats
Remiddi's setup has accumulated the way these things do — not through a master plan but through a series of problems that each demanded a new box. An M1 MacBook Air (16GB) for daily work. An M4 Mac Mini (32GB) synced to the laptop. A second M4 Mac Mini (16GB, base model) dedicated entirely to running agentic tools. And then the centerpiece: a 128GB machine he describes as sharing hardware architecture with Nvidia's DGX Spark.
That last claim is worth flagging. Remiddi says his machine "has the same hardware of the DGX Spark." The DGX Spark is a real Nvidia product, a compact desktop AI system built around the GB10 Grace Blackwell chip with 128GB of unified memory, but whether Remiddi's specific machine is genuinely built on that platform or is a looser analogy isn't something I can verify from the video alone. Treat it as directionally accurate rather than a spec sheet.
The $3,200 price point he mentions for 128GB of GPU-addressable memory is also eyebrow-raising. That's a specific figure for hardware with serious compute attached, and it's unusually low by current market standards. He notes that he bought it at a discount ("crazy enough," in his words) and that the price has since risen by nearly $1,000, which suggests he got lucky with timing, not that this is a reproducible budget.
The more immediately useful stuff is what he does with the machines, and why each one exists.
The Sandbox Instinct
The most operationally sound thing Remiddi describes — and the thing most people running local agentic tools skip entirely — is isolation. When OpenCode (an open-source agentic coding tool) launched earlier this year, his first instinct wasn't to install it and see what happened. He spun up a virtual machine inside his Mac Mini, sandboxed the agent there, and kept it away from the machine holding his actual client data.
"The idea of running open [Code] on your personal machine would be madness," he says. "We know there is a lot of virus and prompt injection going on."
This is, to be generous, not the median approach in the local-AI hobbyist community. The median approach is to install the shiny new agent, give it filesystem access, and then spend three days figuring out why your Downloads folder looks like that. Prompt injection — where malicious content in an AI's input hijacks its behavior — is a real and underappreciated attack surface, and running agentic tools with broad permissions on your primary machine is how you find out about it the hard way. Remiddi's instinct to isolate first, experiment second, is just good security hygiene dressed up in AI vocabulary. The fact that it reads as unusual advice for this community is the problem, not the advice.
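For a sense of what "isolate first" means at the smallest possible scale, here is a minimal Python sketch of a path-confinement check: before a tool acts on a file path an agent asks for, confirm the path resolves inside a designated sandbox directory. The function name and the `/srv/agent-sandbox` directory are hypothetical, and nothing here comes from Remiddi's setup. A check like this is a complement to the VM-level isolation he describes, not a substitute for it.

```python
from pathlib import Path


def is_inside_sandbox(sandbox_root: str, requested: str) -> bool:
    """Return True only if `requested` resolves to a location under `sandbox_root`."""
    root = Path(sandbox_root).resolve()
    # Join onto the root, then resolve, so "../" segments and symlinks are
    # collapsed before the comparison. An absolute `requested` path replaces
    # the root entirely and is therefore judged on its own location.
    target = (root / requested).resolve()
    return target == root or root in target.parents


if __name__ == "__main__":
    sandbox = "/srv/agent-sandbox"  # hypothetical sandbox directory
    for candidate in ["notes/todo.txt", "../outside.txt", "/etc/passwd"]:
        verdict = "allow" if is_inside_sandbox(sandbox, candidate) else "deny"
        print(f"{candidate}: {verdict}")
```

The check only answers "where does this path point?" It says nothing about network access, shell commands, or what a prompt-injected agent might do inside the sandbox, which is why the VM is the real boundary.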
Token Speed Is the Variable Nobody Talks About
The most technically substantive section of the video is about token generation speed, and it's where Remiddi earns his credibility with the nerds.
His 128GB Nvidia machine has a memory-bandwidth limitation that creates a real usability ceiling: every generated token requires reading all of the model's active weights from memory, so bandwidth, not capacity, sets the speed. Run a dense model (say, a straight 27B-parameter architecture where all parameters are active at inference) and you're looking at roughly 10 tokens per second. That's technically functional. It's also the AI equivalent of watching a progress bar move on a 2003 file transfer. You remember watching progress bars on a 2003 file transfer.
His workaround is a Mixture-of-Experts model. He references what he calls "Qwen 3.6" with 35 billion total parameters but only around 3 billion active at any given inference step — yielding roughly 70 tokens per second on his hardware. This trades some ceiling intelligence for dramatically better throughput, and by his account the subjective experience of using it is genuinely responsive.
A note on the model naming here: Alibaba's Qwen 3 family does include MoE variants — the Qwen3-30B-A3B (30 billion parameters, 3 billion active) is a documented release. Remiddi's "Qwen 3.6 / 35B" naming doesn't precisely match the official lineup as I understand it, so treat these figures as his observed benchmarks rather than manufacturer specs. The broader point — that MoE architecture dramatically changes the usability equation on memory-bandwidth-constrained hardware — is solid regardless.
He also tests Google's Gemma 4, noting the 27B variant with 4 billion active parameters runs at around 50 tokens per second on his machine. That's single-source benchmark data from one hardware configuration, so your mileage will vary, but 50 t/s is genuinely usable in a way that 10 t/s isn't.
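The gap between those numbers falls out of some back-of-envelope arithmetic. If decoding is limited by memory bandwidth, the ceiling on tokens per second is roughly bandwidth divided by the bytes of active weights read per token. The Python sketch below works that through under two loudly labeled assumptions that are not from the video: a memory bandwidth of 273 GB/s (the figure commonly quoted for DGX Spark-class hardware) and 8-bit quantized weights at one byte per parameter. It ignores KV-cache reads, compute time, and runtime overhead, so it gives an upper bound, not a prediction.

```python
ASSUMED_BANDWIDTH_GB_S = 273.0  # assumption: not stated in the video
ASSUMED_BYTES_PER_PARAM = 1.0   # assumption: 8-bit quantized weights


def decode_tokens_per_second(active_params_billion: float,
                             bytes_per_param: float = ASSUMED_BYTES_PER_PARAM,
                             bandwidth_gb_per_s: float = ASSUMED_BANDWIDTH_GB_S) -> float:
    """Upper bound on decode speed when generation is memory-bandwidth bound.

    Each generated token requires reading every active weight once, so the
    bound is (GB readable per second) / (GB of active weights).
    """
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_per_s / gb_read_per_token


if __name__ == "__main__":
    # Active-parameter counts as reported in the article.
    models = {
        "dense 27B (all 27B active)": 27.0,
        "MoE with ~3B active": 3.0,
        "MoE with ~4B active": 4.0,
    }
    for name, active in models.items():
        bound = decode_tokens_per_second(active)
        print(f"{name}: at most ~{bound:.0f} tokens/s")
```

Under those assumptions the dense model's ceiling lands near the 10 tokens per second Remiddi reports, and his observed 70 and 50 tokens per second for the MoE models sit below their respective ceilings, which is what you'd expect once real-world overhead is included.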
The Mac Studio comparison is where he makes his most interesting hardware argument: more RAM isn't automatically better if the model runs too slowly to use. A 512GB unified memory Mac Studio could theoretically load a quantized 250B-parameter model. It would also, by his account, run it slowly enough that you'd stop using it. The sweet spot, in his view, is 128GB with fast enough throughput — and for now, that means Nvidia architecture over Apple Silicon, even though Apple's unified memory is genuinely elegant.
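The "could theoretically load" half of that argument is simple arithmetic, sketched here in Python. Weight size is roughly parameter count times bits per parameter divided by eight. The 4-bit quantization and the 20% headroom reserved for the KV cache, the runtime, and the operating system are illustrative assumptions, not figures from the video.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: float, memory_gb: float,
         headroom_fraction: float = 0.2) -> bool:
    """True if the weights fit after reserving headroom for everything else.

    headroom_fraction is an assumed allowance for KV cache, runtime, and OS.
    """
    usable_gb = memory_gb * (1 - headroom_fraction)
    return weights_gb(params_billion, bits_per_param) <= usable_gb


if __name__ == "__main__":
    size = weights_gb(250, 4)  # 250B parameters at an assumed 4 bits each
    print(f"250B model at 4-bit: ~{size:.0f} GB of weights")
    print("fits in 512 GB:", fits(250, 4, 512))
    print("fits in 128 GB:", fits(250, 4, 128))
```

Fitting is necessary but not sufficient. A model that loads still has to stream its active weights for every token, which is Remiddi's point about capacity without throughput.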
The RTX 5090 (32GB GDDR7 per card, correctly specced) is his preferred consumer GPU recommendation for people with budget, though his extrapolation that four of them give you 128GB of VRAM equivalent to the 128GB unified memory setup isn't architecturally precise. Discrete VRAM split across four cards and a single unified memory pool don't behave identically, even at matching capacity numbers: a model sharded across cards has to pass data between them over the interconnect. Treat that comparison as illustrative rather than literal.
"Sovereignty" and the Zip Disk Problem
Remiddi uses the word "sovereignty" a lot. Building your own stack, running your own models, controlling your own data pipeline — this is "a journey towards sovereignty," which is a phrase that would have sounded completely at home in a 1999 Linux advocacy pamphlet. Or, for that matter, from someone evangelizing Zip disks as the path to true data ownership. (The Zip disk owners were right about the ownership part. They were less right about the Zip disk part.)
The underlying concern is legitimate. Some cloud AI providers do train on user data by default, do profile usage patterns, and do operate with an opacity that's genuinely uncomfortable if you think about it carefully. The business model Remiddi describes (lock users in on cheap pricing, then raise rates) is exactly the playbook from every SaaS cycle since Salesforce figured it out in 2001. Convenience wins until it doesn't, and by the time it doesn't, you've built your workflows around the convenient thing.
His practical prescription for navigating this is, frankly, sensible: use frontier cloud models for high-level architectural work, where their reasoning ceiling matters most. Run local models for the bulk of execution work. Structure your code in modular blocks small enough that a local model's context window can audit each piece independently. Escalate to the cloud model for bug review, stress testing, and the things where capability differential actually changes the outcome.
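That triage can be written down as a routing rule. The Python sketch below is one hypothetical way to encode it: task kinds named in the prescription go to the cloud, everything else runs locally, and any module too large for the local model's context window gets flagged for splitting. The task labels, the 32,000-token context size, and the four-characters-per-token estimate are all assumptions for illustration, not details from the video.

```python
from dataclasses import dataclass

# Task kinds the prescription escalates to a frontier cloud model.
CLOUD_TASKS = {"architecture", "bug_review", "stress_test"}

LOCAL_CONTEXT_TOKENS = 32_000  # hypothetical local context window
CHARS_PER_TOKEN = 4            # rough heuristic, not a real tokenizer


@dataclass
class Task:
    kind: str          # e.g. "architecture", "implement", "bug_review"
    source: str = ""   # code or text the model would need to read


def estimate_tokens(text: str) -> int:
    """Crude token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def route(task: Task) -> str:
    """Return "cloud", "local", or "split" for a given task."""
    if task.kind in CLOUD_TASKS:
        return "cloud"
    if estimate_tokens(task.source) > LOCAL_CONTEXT_TOKENS:
        # Too big for the local model to audit in one pass: break it up first.
        return "split"
    return "local"


if __name__ == "__main__":
    print(route(Task("architecture")))
    print(route(Task("implement", "def add(a, b):\n    return a + b\n")))
    print(route(Task("implement", "x = 1\n" * 50_000)))
```

The "split" outcome is the part that does the work: it turns "structure your code in modular blocks" from a style preference into a check that fails when a module outgrows the local model.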
"The bigger one just works every now and then, but when it does, it does really high-level work," he says. That's just engineering triage applied to AI tooling. It's not a philosophy. It's a workflow.
Where his argument gets more complicated is when the "sovereignty" framing slides from practical risk management into something closer to an ideology. Running a local model because you want privacy and continuity is a clear-eyed decision. Running local models because you've decided cloud AI is inherently illegitimate, and then still using cloud models for the hard parts, as Remiddi himself does, is an ideology with a workaround built in. He's self-aware about this tension, which is to his credit. Most people in this space aren't.
What It Actually Costs
The honest budget breakdown from the video: a $300-400 small PC running Whisper for transcription is the floor. An M4 Mac Mini is the next rung. A 128GB Nvidia-based machine in the $3,200-4,200 range (prices are moving) is where you start getting genuinely productive local inference. A single RTX 5090 plus a capable workstation puts you in the $8,000-10,000 range.
At none of these price points is "total AI sovereignty" the accurate description of what you're buying. What you're actually buying is reduced dependency, better privacy posture, and the ability to keep working when the API is down or the pricing model changes. That's worth something. It's worth something proportional to your actual threat model and your tolerance for maintaining hardware. It is not, with respect, a declaration of independence.
The people who will get the most value from Remiddi's framework are the ones who can hear "systems design problem" and actually think about their system — not the ones who want to opt out of the cloud on principle and then discover what it costs to keep a local inference stack running when a new model drops and everything needs to be updated again.
Owning your intelligence layer is a reasonable goal. Just know that ownership, as anyone who's ever maintained their own mail server can tell you, comes with a support contract. You're it.
Mike Sullivan covers the technology industry for BuzzRAG.
2026-06-25