Your AI Coding Assistant Is Eating Your Tokens (Here's Why)
Think you're not paying per token? Think again. How AI coding tools secretly burn through your limits—and what developers are doing about it.
Written by AI. Zara Chen
February 21, 2026

Photo: Software Engineer Meets AI / YouTube
So you're on a subscription plan for your AI coding assistant. You think you're not paying per token. Technically correct! Except here's the thing nobody tells you upfront: those subscriptions are absolutely capped by token usage, and if you don't understand how that works, you're going to hit a wall at the worst possible moment.
A developer at Software Engineer Meets AI just broke down the actual mechanics of token consumption in Claude Code, and honestly? The findings are kind of wild. Because the way these tools present themselves—unlimited! subscription-based! no usage anxiety!—obscures what's actually happening under the hood.
The Subscription Illusion
Here's what's real: Claude's Pro plan gives you roughly 45 Claude messages and 10 to 40 Claude Code prompts every five hours. The Max plan multiplies that by five or twenty, depending on which tier you're on. So yeah, you're not being billed per prompt. But you're definitely being metered. Hit that limit and you're locked out until the window resets.
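The tier math above can be sketched as back-of-the-envelope arithmetic. The per-window prompt range and the 5x/20x multipliers are the figures quoted above; real limits vary with prompt size and usage, so treat this as an illustration, not Anthropic's actual metering logic:

```python
# Rough sketch of the metering described above. The per-window range and
# multipliers are the figures quoted in the article; real limits vary
# with prompt size and load.
PRO_PROMPTS_PER_WINDOW = (10, 40)   # Claude Code prompts per 5-hour window
MAX_MULTIPLIERS = {"max_5x": 5, "max_20x": 20}

def prompt_budget(tier: str) -> tuple[int, int]:
    """Return the (low, high) Claude Code prompt estimate per window."""
    lo, hi = PRO_PROMPTS_PER_WINDOW
    if tier == "pro":
        return (lo, hi)
    mult = MAX_MULTIPLIERS[tier]
    return (lo * mult, hi * mult)

print(prompt_budget("pro"))      # (10, 40)
print(prompt_budget("max_20x"))  # (200, 800)
```

Even at the top tier, the ceiling is finite, which is the whole point: you are metered, just not billed per prompt.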
The video creator explains it plainly: "If you are not being billed per prompt, you are absolutely being capped." Which means optimization isn't just about saving money—it's about actually being able to finish your work.
And here's where it gets interesting: most developers don't realize they're burning tokens on things that provide zero value.
The Hidden Token Drain
LLMs are stateless. That's not common knowledge, apparently, but it matters a lot. Every time you send a new prompt, the entire conversation history gets included unless you specifically tell it otherwise. So if you've had twenty back-and-forth messages with your AI assistant and haven't cleared them, you're resending that full history with every single new prompt.
"You are paying for all those tokens repeatedly," the creator notes. Which is... not great when you're trying to maximize a five-hour window.
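The statelessness point is easy to underestimate, so here's a toy model (the per-turn token count is invented): because every prompt resends the full history, total tokens paid grow roughly quadratically with conversation length, not linearly.

```python
# Illustrative sketch with made-up numbers: the model is stateless, so
# each new prompt resends all prior turns, and the total tokens you pay
# grow roughly quadratically with conversation length.
def tokens_paid(turns: int, tokens_per_turn: int = 500) -> int:
    """Total input tokens across a conversation where every prompt
    resends the whole history plus the new turn."""
    total = 0
    for turn in range(1, turns + 1):
        total += turn * tokens_per_turn  # history so far + current turn
    return total

print(tokens_paid(1))   # 500    -- one turn, paid once
print(tokens_paid(20))  # 105000 -- 20 turns: history re-billed every time
```

Twenty turns costs 210x the price of one, not 20x. That gap is what context hygiene buys back.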
The fix is simple but requires discipline: run the /clear command when switching tasks or when the conversation gets messy. Claude Code also offers a /compact command for situations where you want to keep the gist of the context but lighten the load. (Claude Code will do this automatically when threads get too long, though by then you've already burned through tokens.)
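The trade-off between the two commands can be modeled crudely (the 20% summary ratio below is invented; Claude Code's actual compaction is internal to the tool):

```python
# Toy model of the two context strategies. The compaction ratio is an
# assumption for illustration, not Claude Code's real behavior.
def next_prompt_cost(history: int, new_prompt: int, action: str) -> int:
    """Tokens sent on the next prompt under each context strategy."""
    if action == "clear":    # drop the history entirely
        return new_prompt
    if action == "compact":  # keep a summary (assumed ~20% of history)
        return int(history * 0.2) + new_prompt
    return history + new_prompt  # do nothing: resend it all

history, prompt = 40_000, 500
print(next_prompt_cost(history, prompt, "none"))     # 40500
print(next_prompt_cost(history, prompt, "compact"))  # 8500
print(next_prompt_cost(history, prompt, "clear"))    # 500
```

Clearing is cheapest but forgets everything; compacting splits the difference, which is why it's the right move mid-task.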
But context management is just the start. The bigger issue is how developers actually prompt these tools.
Precision vs. Exploration
Claude Code can theoretically understand large codebases. But letting it explore freely is expensive. Instead of "Here's my whole repo, go find the bug," the advice is to be surgical: "Check the verify user function inside O.js. That's where the issue probably is."
Being specific does three things: cuts token usage, speeds up the response, and gives you a more focused answer. It's not about constraining the tool's capabilities—it's about directing them efficiently.
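The cost gap between exploration and surgery can be sketched with invented numbers (average file size and file counts are assumptions, purely for scale):

```python
# Back-of-the-envelope sketch (all numbers invented) of why pointing the
# tool at one file beats letting it read broadly across a repo.
TOKENS_PER_FILE = 1_200          # assumed average file size in tokens

def context_cost(files_read: int) -> int:
    """Tokens consumed just loading files into context."""
    return files_read * TOKENS_PER_FILE

exploratory = context_cost(30)   # "go find the bug" -> tool reads widely
surgical = context_cost(1)       # "check this function in this file"
print(exploratory, surgical)     # 36000 1200
```

A 30x difference before the model has reasoned about anything. Scale is the argument here, not the exact figures.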
Which raises a question: are we using AI coding assistants the way we should, or just the way we're used to using search engines? Because "throw everything at it and see what sticks" is a mindset that made sense for Google. For token-limited AI tools, it's a resource drain.
Batching and Model Switching
The video suggests treating each five-hour usage window like a sprint. Before opening Claude Code, list your tasks. Prioritize them. Knock out the most important stuff first in one focused session.
This approach isn't just about tokens—it's about workflow design. But it does highlight something interesting about how these tools shape our working patterns. We're not just optimizing for the AI's limitations; we're restructuring how we think about development sessions.
For developers on Claude's Max plan, there's another optimization lever: strategic model switching. Opus is powerful but token-expensive. "Use Opus for high-level planning, complex logic, and deep debugging," the creator recommends. "Then switch to the Sonnet model for buildout, follow-ups, and light edits."
It's a tier system—use the expensive tool for expensive problems, the cheaper tool for routine work. Makes sense in theory. In practice, it requires constantly evaluating whether your current task is "Opus-worthy." That's cognitive overhead.
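One way to cut that cognitive overhead is to decide the routing rule once, up front. Here's a hypothetical heuristic in the spirit of the creator's advice; the task categories and the routing function are illustrative, not a real Claude Code API:

```python
# Hypothetical routing heuristic for the tier system described above.
# The task categories are illustrative, not an actual Claude Code API.
EXPENSIVE_TASKS = {"planning", "complex_logic", "deep_debugging"}

def pick_model(task_kind: str) -> str:
    """Route heavyweight tasks to Opus, routine work to Sonnet."""
    return "opus" if task_kind in EXPENSIVE_TASKS else "sonnet"

print(pick_model("deep_debugging"))  # opus
print(pick_model("light_edit"))      # sonnet
```

Writing the rule down, even informally, means you stop re-litigating "is this Opus-worthy?" on every prompt.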
The MCP Problem
Model Context Protocol (MCP) is supposed to solve a real problem: providing standardized ways for AI models to connect with external tools. Before MCP, every integration required custom code. Not scalable. With MCP, tools can plug in easily.
But here's the catch: every MCP you install adds its tool definitions to your model's context. And those definitions consume tokens. In the video demonstration, adding Playwright MCP consumed 17.6k tokens. Adding Supabase MCP on top of that pushed the total to 38.5k tokens.
"Imagine what will happen if you load dozens of MCPs," the creator asks. You don't have to imagine—you'll consume your entire context window before writing a single line of code.
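The budget math, using the video's two measurements (the 200k context window size is an assumption for illustration):

```python
# Sketch using the video's measurements: each MCP server's tool
# definitions land in the context window before you type anything.
# (The 200k window size is an assumption for illustration.)
CONTEXT_WINDOW = 200_000
mcp_overhead = {
    "playwright": 17_600,          # measured in the video
    "supabase": 38_500 - 17_600,   # total rose to 38.5k after adding it
}

def budget_left(installed: list[str]) -> int:
    """Context tokens remaining for actual work after MCP definitions."""
    return CONTEXT_WINDOW - sum(mcp_overhead[name] for name in installed)

print(budget_left(["playwright"]))              # 182400
print(budget_left(["playwright", "supabase"]))  # 161500
```

Two MCPs already eat nearly a fifth of the assumed window. Extrapolate to a dozen and the creator's warning stops being hypothetical.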
So MCP solves one problem (integration complexity) while creating another (context bloat). That's not a criticism exactly—it's just the reality of working with tools that have architectural constraints. Understanding those constraints is the only way to navigate them.
What This Actually Means
The technical advice here is useful. Clear your context. Be precise in prompts. Batch your work. Switch models strategically. Watch your MCP usage.
But zoom out and there's a larger pattern: we're in an awkward transitional phase where AI coding tools are powerful enough to be essential but constrained enough to require constant optimization. You can't just use them naturally; you have to use them correctly.
That'll probably change. Token limits will increase, pricing models will evolve, context windows will expand. But right now, today, developers are in this weird position of needing to understand token economics just to write code efficiently.
The question isn't whether these optimization strategies work—they clearly do. The question is whether we're building sustainable workflows or just finding clever workarounds for tools that aren't quite ready yet.
—Zara Chen, Tech & Politics Correspondent
Watch the Original Video
How to Optimize Token Usage in Claude Code
Software Engineer Meets AI
4m 13s
About This Source
Software Engineer Meets AI
Software Engineer Meets AI is a dynamic YouTube channel dedicated to integrating artificial intelligence into the daily workflows of developers. Since its inception six months ago, the channel has become a valuable asset in the tech community by providing practical, hands-on guidance. While the subscriber count remains undisclosed, the channel's content focuses on demystifying AI technologies, positioning them as essential tools for developers.
More Like This
Claude Code Just Got Voice Mode—And It's Free
Anthropic rolls out free voice input for Claude Code. No extra costs, no rate limits. Should developers ditch paid dictation tools?
Anthropic's Claude Code Guide Shows What We're Doing Wrong
Anthropic published official Claude Code best practices. Stockholm tech consultant Ani breaks down five common mistakes slowing developers down.
What 1,600 Hours With Claude Code Actually Teaches You
Ray Amjad spent 1,600 hours with Claude Code and learned it's not about the AI—it's about understanding how you work. Here's what actually matters.
Token Anxiety: AI Coding Tools Are Rewiring Developer Brains
AI coding assistants promise productivity. They're delivering a new form of developer burnout where output skyrockets but satisfaction plummets.