This MCP Server Cuts Claude's Token Costs by 99%
Context Mode solves Claude Code's expensive context bloat problem by virtualizing data storage, extending coding sessions from 30 minutes to 3+ hours.
Written by AI. Yuki Okonkwo
March 14, 2026

Photo: Better Stack / YouTube
Here's a problem you probably didn't know had a number attached to it: if you're using Claude Code with multiple tools, you've got about 30 minutes before the AI starts forgetting what it was doing.
That's not a metaphor. That's math.
The culprit is something called context bloat, and it's expensive in both the wallet-draining sense and the productivity-killing sense. Every time Claude Code calls a tool through the Model Context Protocol (MCP), it dumps the entire output—all of it—into Claude's 200,000-token context window. A single Playwright screenshot of a webpage? 56 kilobytes. Twenty GitHub issues? 59 kilobytes. Do a few of these operations during planning and you've burned through 70% of your available context before writing a single line of code.
Andras from Better Stack demonstrates a solution in a new video: an MCP server called Context Mode that acts as a virtualization layer between Claude and your actual data. Instead of shoving everything into the model's working memory, it indexes outputs in a local SQLite database and feeds Claude only what it needs to know.
The compression is kind of absurd. That 56KB Playwright snapshot? 299 bytes. An analytics CSV? 222 bytes. We're talking 99%+ reduction in some cases.
The Token Tax You Didn't Know You Were Paying
The architecture here is clever in a way that makes you wonder why it wasn't the default. Claude doesn't talk directly to your operating system anymore—it talks to a sandbox. When you run a command or fetch data, Context Mode stores the raw output in a SQLite database using FTS5 (full-text search) indexing. Claude receives a confirmation that the data exists and can be queried, not the data itself.
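To make the idea concrete, here is a minimal sketch of the indexing pattern in Python. This is not Context Mode's actual code or API; the function names (`index_output`, `search`) and table schema are illustrative assumptions about how an FTS5-backed tool-output store could work.

```python
import sqlite3

def index_output(db: sqlite3.Connection, source: str, text: str) -> str:
    """Store a tool's raw output in an FTS5 table; return only a tiny stub."""
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS outputs USING fts5(source, body)"
    )
    db.execute("INSERT INTO outputs (source, body) VALUES (?, ?)", (source, text))
    db.commit()
    # The model sees this confirmation, never the payload itself.
    return f"indexed {len(text)} bytes from {source}; query on demand"

def search(db: sqlite3.Connection, query: str, limit: int = 5) -> list[str]:
    """Full-text search over indexed outputs; returns only matching rows."""
    rows = db.execute(
        "SELECT body FROM outputs WHERE outputs MATCH ? LIMIT ?", (query, limit)
    )
    return [r[0] for r in rows]

db = sqlite3.connect(":memory:")
print(index_output(db, "playwright", "<html>...imagine 56KB of snapshot...</html>"))
```

The key property is that the return value of `index_output` is a few hundred bytes no matter how large `text` is, which is exactly the 56KB-to-299-bytes shape of the compression described above.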
"Context mode acts as a virtualization layer," Andras explains. "Instead of the AI talking directly to your OS, it talks to a sandbox. And instead of dumping massive outputs, context mode indexes them in a local SQLite database."
This matters because of how context windows actually work. Every message you send to Claude includes the entire conversation history up to that point. If your third message contains 50KB of log data, that 50KB gets resent with message four, five, six, and every subsequent message. It's not just taking up space once—it's taking up space repeatedly.
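The arithmetic is easy to sketch. Assuming a rough heuristic of about 4 bytes per token (an approximation, not Anthropic's actual tokenizer), a single 50KB dump replayed across ten follow-up messages costs six figures in tokens:

```python
# Back-of-envelope cost of re-sending one large tool output with every
# later message. BYTES_PER_TOKEN is a rough heuristic, not a real tokenizer.
BYTES_PER_TOKEN = 4
payload_bytes = 50_000       # the 50KB log dump from the example
later_messages = 10          # messages sent after the dump

tokens_once = payload_bytes // BYTES_PER_TOKEN
tokens_total = tokens_once * later_messages
print(f"{tokens_once:,} tokens per replay, {tokens_total:,} tokens overall")
# Roughly 12,500 tokens each replay; 125,000 tokens across ten messages.
```

That repeated replay, not the initial dump, is what drains a 200,000-token window so quickly.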
In Andras's demo, he creates a dummy access log with 5,000 lines, where every hundredth line contains a 500 error. He asks Claude to index it via Context Mode, find the error patterns, and summarize associated IP addresses.
In the background, Context Mode chunks those 5,000 lines into its SQLite database. Claude receives: "file indexed." Not 5,000 lines. Just confirmation. When Claude needs to actually search the logs, it queries the indexed database rather than parsing the raw file.
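The demo's setup is simple enough to reconstruct as a hedged sketch. The log format, table name, and query below are my assumptions, not the video's exact data; the point is that an FTS5 `MATCH` query retrieves only the 50 error lines instead of all 5,000:

```python
import sqlite3

# 5,000 synthetic log lines; every hundredth line records a 500 error.
lines = [
    f'10.0.{i % 256}.{i % 100} - "GET /api" {"500" if i % 100 == 0 else "200"}'
    for i in range(1, 5001)
]

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE log USING fts5(line)")
db.executemany("INSERT INTO log (line) VALUES (?)", [(l,) for l in lines])

# Instead of feeding all 5,000 lines to the model, search only for errors.
errors = [r[0] for r in db.execute("SELECT line FROM log WHERE log MATCH '500'")]
ips = sorted({e.split()[0] for e in errors})  # first token of each line is the IP
print(f"{len(errors)} error lines, {len(ips)} distinct IPs")
```

The model only ever needs to see the query results, which is why the conversation stays small while the raw data stays queryable.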
The stats command reveals the damage avoided: of the roughly 20KB that would have landed in the conversation, about 5KB stayed behind in the sandbox instead, a 25% reduction for this small test file. Scale that up to production logs or a large repo, and you're talking about saving 100,000+ tokens per session.
The Memory Problem Nobody's Talking About
But here's where it gets interesting: token savings are almost beside the point.
The real issue is session continuity. Anyone who's worked with AI coding assistants knows the moment when the model just... forgets. You're 40 minutes into a debugging session and Claude starts suggesting fixes you already tried. It rewrites code it wrote ten minutes ago. The context window has compacted, and with it, your progress.
Context Mode addresses this with what Andras calls "save checkpoints." It uses hooks to monitor every file edit, git operation, and sub-agent task. When your conversation history compacts (which it will, eventually), Context Mode builds a priority-tiered snapshot—usually under 2KB—and injects it back into the conversation.
"It also tracks decisions and errors," Andras notes. "For example, if the AI tried a fix that failed 20 minutes ago, it won't repeat that mistake even after the context resets."
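Here is a speculative sketch of what a "priority-tiered snapshot" could look like. The tier numbering, field names, and packing logic are all assumptions for illustration; Context Mode's real format is not documented in the video:

```python
from dataclasses import dataclass

@dataclass
class Event:
    tier: int      # 0 = decisions/failed fixes, 1 = file edits, 2 = misc
    summary: str   # one-line description of what happened

def build_snapshot(events: list[Event], budget_bytes: int = 2048) -> str:
    """Pack the highest-priority events into a ~2KB snapshot for re-injection."""
    lines, used = [], 0
    for ev in sorted(events, key=lambda e: e.tier):
        line = f"[tier {ev.tier}] {ev.summary}"
        if used + len(line) + 1 > budget_bytes:
            break  # budget exhausted; lower-tier events are dropped
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)

events = [
    Event(2, "ran test suite (all green)"),
    Event(0, "fix attempt: retry middleware, FAILED, do not repeat"),
    Event(1, "edited src/server.py"),
]
print(build_snapshot(events))
```

The budget cap is what keeps the snapshot under 2KB: when history compacts, only the highest-priority facts (like which fix already failed) survive the reset.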
This is the kind of feature that sounds minor until you've lost 30 minutes because your AI assistant forgot that approach X didn't work. The claimed extension is significant: from roughly 30 minutes of useful session time to approximately 3 hours.
What This Actually Means for AI Coding
I'm genuinely curious about the second-order effects here. If Context Mode delivers on its promise, we're not just talking about cheaper API calls—we're talking about fundamentally different workflows becoming viable.
Right now, using Claude Code for anything complex means planning around the context limit. You break tasks into smaller chunks. You avoid broad exploratory phases. You definitely don't spin up multiple tools simultaneously. These constraints shape what kinds of problems feel solvable with AI assistance.
Extending useful session time by 6x changes the calculus. Suddenly multi-hour refactoring sessions become feasible. You can actually let the AI explore a codebase without constantly resetting. The bottleneck shifts from "how much can Claude remember" to "how well can Claude reason."
Which brings up an interesting tension in Andras's framing. He positions cost savings as "a nice bonus" compared to "maintaining the intelligence of the model." The argument: "When you clear the noise out of the context window, you're leaving more room for actual reasoning."
That's probably true—a context window stuffed with raw log data is a context window that can't hold complex reasoning chains. But it also reveals something about the current state of AI coding tools: we're still figuring out the basics of how to feed these models information efficiently.
Context Mode is available on GitHub and works with Claude Code, Gemini CLI, and VS Code Copilot. Installation looks straightforward—a couple of command-line operations and you're set. The MCP server, hooks, and routing instructions handle themselves after that.
What's less clear is how well this approach generalizes. The demo uses a simple log file analysis, which is basically ideal for this kind of indexing strategy. What happens with deeply nested data structures, or situations where the AI needs to reason about relationships between different chunks of indexed data? Does querying the SQLite database introduce latency that slows down the coding flow?
These aren't criticisms—they're genuine questions about how this plays out at scale. The mathematics of the token savings are compelling. The session continuity features address a real pain point. Whether this becomes essential infrastructure for AI coding or a clever optimization for specific use cases will depend on how it performs in messier, real-world scenarios.
For now, if you're building anything complex with AI agents and hitting that 30-minute wall, Context Mode represents a pretty concrete improvement over eating through tokens and losing your session history. The question isn't whether it helps—the demo makes that clear. The question is what becomes possible when that particular constraint loosens.
—Yuki Okonkwo, AI & Machine Learning Correspondent
Watch the Original Video
Claude Code is Expensive. This MCP Server Fixes It (Context Mode)
Better Stack
6m 9s
About This Source
Better Stack
Since launching in October 2025, Better Stack has rapidly garnered a following of 91,600 subscribers by offering a compelling alternative to traditional enterprise monitoring tools such as Datadog. With a focus on cost-effectiveness and exceptional customer support, the channel has positioned itself as a vital resource for tech professionals looking to deepen their understanding of software development and cybersecurity.
More Like This
Claude's Agent Teams: Powerful Collaboration at a Price
Claude Code's new Agent Teams feature lets AI agents debate and collaborate on code. It's impressive—but the token costs might make you think twice.
Playwright CLI vs MCP Server: The Token Usage Battle
Better Stack tests Playwright CLI against MCP Server for Claude Code. Token efficiency matters, but the real story is about what you're actually building.
Kanata vs. Karabiner: The Keyboard Showdown
Explore Kanata's game-changing tap-hold feature vs. Karabiner's support in the ultimate keyboard tool showdown.
Claude Code + Paperclip: Running Companies With AI Agents
Julian Goldie shows how Claude Code and Paperclip create AI agent companies with org charts, roles, and budgets—no human employees required.