
Why Your Claude Code Sessions Cost More Than They Should

Most Claude users don't need higher tier plans—they need to understand how tokens actually work. Here's what's burning through your budget.

Written by AI. Bob Reynolds

April 2, 2026


Photo: Nate Herk | AI Automation / YouTube

Something changed in the past few weeks. Users paying $200 a month for Claude Code started hitting their session limits faster than expected. What once consumed 1% of their allocation now burns through 10%. The complaints spread across X, then Reddit, then developer forums. Anthropic responded with adjustments to peak and off-peak hours. The complaints continued.

Nate Herk, an AI automation consultant, spent the intervening time testing approaches and tracking results. His conclusion: most people don't need a higher plan. They need to understand what's actually happening when they use Claude.

The core issue is simple but not obvious. Every time you send a message to Claude, the system rereads your entire conversation from the beginning. Your first message gets processed once, on turn one; by turn thirty, that same message has been reprocessed thirty times, along with everything sent after it. "This means as you're having a conversation with Claude, your cost is compounding, not just adding, it's exponentially growing," Herk explains in a recent video breaking down token management strategies.

One developer tracked a conversation that exceeded 100 messages. The analysis revealed that 98.5% of tokens were spent rereading old chat history. Only 1.5% went toward processing new information.

This creates a predictable pattern. Your first message might cost 500 tokens. Your thirtieth message costs 15,000 tokens—not because it's more complex, but because Claude is processing everything that came before it. After thirty exchanges, you might have burned through a quarter million tokens. The conversation hasn't grown that much. The overhead has.
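The pattern above can be sketched in a few lines of Python. The 500-token message size is the article's illustrative figure, and the model, a full reread of history on every turn, is a simplification that ignores caching:

```python
# Sketch of how per-message cost compounds when the full history
# is reprocessed on every turn. The 500-token message size is an
# illustrative assumption, not a measured value.

def session_cost(num_messages, tokens_per_message=500):
    """Return (last_message_cost, cumulative_cost) in tokens."""
    cumulative = 0
    last = 0
    for n in range(1, num_messages + 1):
        # Turn n reprocesses all n messages sent so far.
        last = n * tokens_per_message
        cumulative += last
    return last, cumulative

last, total = session_cost(30)
print(last)   # 15000 tokens for the thirtieth message alone
print(total)  # 232500 tokens, roughly a quarter million
```

The cumulative cost grows with the square of the conversation length, which is why a session that "hasn't grown that much" in content can still devour your budget.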

The compounding extends beyond visible messages. Claude reloads your configuration file, your MCP servers, your system prompts, your skills, and your files on every single turn. This invisible overhead drips into your context window whether you're aware of it or not.

Herk ran the /context command in a completely fresh session with no chat history. Before typing a single prompt, he was already down 51,000 tokens—consumed by system prompts, tools, agents, skills, and memory files. One MCP server alone can add 18,000 tokens per message.

The efficiency problem compounds into a quality problem. Research shows AI models exhibit a "lost in the middle" effect: they pay closest attention to the beginning and end of a long context, while the material between those points gets progressively ignored. Bloated context costs more money and produces worse output simultaneously.

Herk organizes his solutions into three tiers. The first tier requires no technical expertise. Start fresh conversations between unrelated tasks. Don't carry context about topic A into a discussion about topic B. Batch multiple prompts into one message instead of sending three separate requests—three messages cost three times what one combined message costs. Use /context and /cost commands to see where tokens actually go. Most users have no idea.
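The batching advice follows directly from the compounding math. A rough sketch, with illustrative history and prompt sizes, shows why three separate messages cost roughly three times what one combined message does:

```python
# Sketch of the batching claim: sending three prompts separately
# rereads the growing history three times, while one combined
# message pays for the history once. Token sizes are illustrative.

HISTORY = 10_000   # existing conversation, in tokens
PROMPT = 500       # each new request, in tokens

def separate(n_prompts):
    cost = 0
    history = HISTORY
    for _ in range(n_prompts):
        history += PROMPT
        cost += history        # each turn rereads everything so far
    return cost

def batched(n_prompts):
    return HISTORY + n_prompts * PROMPT  # one turn, history read once

print(separate(3))  # 33000 tokens
print(batched(3))   # 11500 tokens
```

The longer the existing conversation, the bigger the gap, since the fixed history dominates each separate turn's cost.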

Before dropping a large file or document into the conversation, ask whether Claude needs to read the whole thing. If the bug lives in one function, paste that function. If you need context from one paragraph, paste that paragraph. The precision you demand from Claude should match the precision you exercise yourself.

Watch Claude work, especially on longer tasks. It sometimes goes down wrong paths or gets stuck in loops, rereading the same files. Stopping those detours early saves thousands of tokens. "Why would you let it go down the wrong path, waste all your tokens, and then just have to scrap it all," Herk asks.

The second tier requires understanding how Claude structures its operations. Your CLAUDE.md file gets read at the start of every single message. Keep it under 200 lines and treat it like an index pointing to where more data lives, rather than a repository for all your data. That is a complete mindset shift: the file tells Claude where everything is, not what everything contains.

Compact your context around 60% capacity rather than waiting until 95% when auto-compact triggers. By that point, quality has already degraded. After three or four compacts, get a session summary, clear everything, feed the summary back, and continue.
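The compaction rule of thumb can be written out as a small decision helper. The 60% and 95% thresholds come from the article; the context-window size is a placeholder, not an Anthropic constant:

```python
# Illustrative helper for the 60% compaction rule described above.
# Threshold values follow the article; the window size is a
# hypothetical placeholder, not an official figure.

CONTEXT_WINDOW = 200_000   # assumed window size in tokens
MANUAL_COMPACT_AT = 0.60   # compact proactively here
AUTO_COMPACT_AT = 0.95     # where auto-compact triggers, per the article

def next_action(tokens_used, compacts_so_far):
    usage = tokens_used / CONTEXT_WINDOW
    if compacts_so_far >= 4:
        # After three or four compacts: summarize, clear, start fresh.
        return "summarize-and-clear"
    if usage >= AUTO_COMPACT_AT:
        return "auto-compact (quality already degraded)"
    if usage >= MANUAL_COMPACT_AT:
        return "compact now"
    return "keep going"

print(next_action(130_000, 1))  # compact now
```

The point is to act at 60%, before quality degrades, rather than letting the tool decide for you at 95%.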

Short breaks cost money. Claude uses prompt caching to avoid reprocessing unchanged context, but the cache expires after five minutes. Step away for six minutes and your next message reprocesses everything from scratch at full cost. Some users report mysterious usage spikes that correspond exactly to when they returned from breaks.
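The cache cliff is easy to model. The 10% cached-read rate below is an illustrative assumption about relative pricing, not a quoted Anthropic figure; the five-minute TTL is the one the article describes:

```python
# Sketch of the five-minute cache cliff. The cached-read rate is an
# illustrative assumption about relative pricing; the TTL follows
# the article. A six-minute break makes the same message far pricier.

CACHE_TTL_SECONDS = 5 * 60
CACHED_RATE = 0.1   # assumed: cache reads billed at ~10% of full rate

def message_cost(context_tokens, seconds_since_last_message):
    if seconds_since_last_message <= CACHE_TTL_SECONDS:
        return context_tokens * CACHED_RATE   # cache hit: cheap reread
    return context_tokens * 1.0               # cache expired: full reprocess

print(message_cost(100_000, 240))  # 4-minute break: 10000.0
print(message_cost(100_000, 360))  # 6-minute break: 100000.0
```

Under these assumptions, a context of 100,000 tokens costs ten times more to carry forward after a six-minute break than after a four-minute one, which would explain the "mysterious" spikes.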

Command output enters your context window at full token cost. A command returning 200 commits means all that data gets tokenized and sent to the model. The output appears as one line in your interface, but the token cost is invisible.

The third tier addresses strategic decisions. Choose the right model for each task: Sonnet for most coding work, Haiku for sub-agents, formatting, and simple tasks, and Opus for deep architectural planning, and only when Sonnet isn't enough. Try keeping Opus usage under 20% unless a project specifically demands it.

Agent workflows consume roughly seven to ten times more tokens than standard single-agent sessions because each sub-agent wakes up with its own full context. They reload files, system tools, everything. Agent teams produce interesting results but at severe token cost.
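A rough model shows where the multiplier comes from. The 51,000-token base overhead is the figure Herk measured earlier in the article; the rest is an assumption that each sub-agent reloads that same base context:

```python
# Rough model of why agent teams multiply cost: each sub-agent boots
# with its own full copy of the base context. The 51k overhead figure
# comes from the article; everything else is a simplifying assumption.

BASE_OVERHEAD = 51_000  # tokens loaded before any prompt, per the article

def team_overhead(num_subagents):
    # The orchestrator plus each sub-agent reloads the base context.
    return BASE_OVERHEAD * (1 + num_subagents)

print(team_overhead(0))  # 51000 — single-agent session
print(team_overhead(7))  # 408000 — an 8x multiplier before any real work
```

Even before any actual task tokens, a seven-agent team has spent eight full copies of the startup context, which is consistent with the seven-to-tenfold figure.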

Peak hours matter. Anthropic now drains your five-hour session window faster during peak demand—8 a.m. to 2 p.m. Eastern time on weekdays. Off-peak hours in afternoons, evenings, and weekends stretch that budget further. Schedule big refactors and multi-agent sessions accordingly.

If you're near a session reset with room left in your allocation, use it. Get your money's worth. Let agents run. If you're approaching your limit with significant time remaining, step away. Come back with a full budget instead of getting stuck mid-task.

Herk makes one point that runs counter to the optimization narrative. Hitting your limit shouldn't carry negative connotations. "If you're doing a lot of these hacks and you are not just being wasteful with tokens, then hitting your limit is actually a good thing," he argues. It means you're using the tool enough to extract real value.

People who never hit their limits aren't necessarily managing tokens better. They might not be using Claude enough to matter. The goal isn't to preserve tokens—it's to get maximum productive work from the ones you're paying for.

The invisible costs become visible with measurement. The compounding becomes manageable with structure. What looked like insufficient capacity often turns out to be inefficient usage. The difference between those two problems is that one requires more money while the other just requires understanding how the system actually works.

Bob Reynolds is Senior Technology Correspondent for Buzzrag

Watch the Original Video

18 Claude Code Token Hacks in 18 Minutes

Nate Herk | AI Automation

18m 57s
Watch on YouTube

About This Source

Nate Herk | AI Automation

Nate Herk | AI Automation is a fast-growing YouTube channel with 476,000 subscribers, dedicated to helping businesses harness AI automation effectively. Active for just over six months, the channel focuses on the transformative potential of artificial intelligence for business efficiency and competitiveness, guiding enterprises, whether new to AI or veterans of it, toward optimizing their operations through smart AI applications.

