When Being Less Articulate Makes AI Models More Accurate
A GitHub repo forcing Claude to 'talk like a caveman' went viral. The research behind it reveals something unexpected about how large language models fail.
Written by AI. Dev Kapoor
April 8, 2026

Photo: Chase AI / YouTube
A GitHub repository called Caveman gained 5,000 stars in 72 hours by doing something that sounds completely absurd: forcing Claude Code to communicate like a Neanderthal. Strip out the pleasantries, lose the elaboration, just give the technical answer and shut up. The pitch was simple—save tokens, keep the same technical accuracy.
But buried in that repo was a link to a research paper that suggests something far more interesting than token savings. The paper, "Brevity Constraints Reverse Performance Hierarchies in Language Models," published in March 2025, documents a phenomenon that challenges a core assumption about AI: that bigger models are categorically better.
The study evaluated 31 open-weight models across 1,500 problems and found that on nearly 8% of those problems, larger language models underperformed smaller ones by 28 percentage points—despite having up to 100 times more parameters. In some cases, a 2 billion parameter model outperformed a 400 billion parameter model. Not occasionally. Repeatedly.
The researchers' hypothesis: large models talk themselves into wrong answers.
The Marketing vs. The Math
First, let's be clear about what Caveman actually delivers, because the repo's numbers are... optimistic. It claims to cut 75% of output tokens and 45% of input tokens. That's technically true for specific components, but deeply misleading about total token usage.
In a typical 100,000-token Claude Code session, you're looking at roughly 75,000 input tokens and 25,000 output tokens. Output breaks down into tool calls, code blocks, and prose responses—the actual text you read in the terminal. Caveman only touches that prose portion, which might represent 6,000 tokens. Compress that by 75% and you've saved about 4,500 tokens, roughly 4.5% of your total session.
Same story on the input side. The companion tool compresses memory files like CLAUDE.md into "caveman speak," but those files are a fraction of total input tokens. Add it all up and you're looking at maybe 5% total token savings per session.
Not nothing—if you're on a usage plan, 5% matters over time. But this isn't going to let you suddenly run four times as many sessions.
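The arithmetic above can be sketched in a few lines. The prose figure is the article's rough estimate, and the memory-file share of input is an outright guess (the article only calls it "a fraction"), so treat this as a back-of-envelope model, not measured data.

```python
# Back-of-envelope estimate of Caveman's session-level token savings,
# using the article's approximate figures. The memory_input share is an
# assumed value for illustration, not a number from the repo.

def caveman_savings(total_tokens=100_000,
                    prose_output=6_000,   # compressible prose in output
                    prose_cut=0.75,       # claimed 75% prose compression
                    memory_input=2_000,   # assumed CLAUDE.md share of input
                    memory_cut=0.45):     # claimed 45% memory-file compression
    saved = prose_output * prose_cut + memory_input * memory_cut
    return saved, saved / total_tokens

saved, fraction = caveman_savings()
print(f"saved ~{saved:,.0f} tokens ({fraction:.1%} of the session)")
# → saved ~5,400 tokens (5.4% of the session)
```

With these assumptions the total lands around 5,400 tokens, in line with the article's "maybe 5%" estimate—and it makes clear that the headline 75% figure only applies to a small slice of the session.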
When Overthinking Becomes a Bug
The research behind Caveman is where things get genuinely interesting. The study's authors identified what they call "spontaneous scale-dependent verbosity"—essentially, larger models develop a tendency to over-elaborate that actively degrades their performance.
As the researchers put it: "Large models generate excessively verbose responses that obscure correct reasoning, a phenomenon we termed overthinking."
Forcing large models to produce brief responses improved accuracy by 26 percentage points and narrowed performance gaps by up to two-thirds. In many cases, brevity constraints completely reversed the hierarchy—models that were losing to smaller counterparts started winning, simply by being told to be concise. Nothing changed under the hood. Same reasoning process, same capabilities. Just less talking.
The mechanism appears to be error accumulation through elaboration. Instead of stating the answer and moving on, the model generates additional context, explores tangents, and somewhere in that process introduces mistakes or obscures the correct reasoning it already had.
The RLHF Problem
The researchers point to reinforcement learning from human feedback (RLHF) as a likely culprit. During training, humans grade model outputs, choosing which responses they prefer. And humans, apparently, prefer thorough answers. Detailed explanations. Models that show their work.
So models learn to be verbose—not because verbosity improves accuracy, but because it improves human satisfaction scores during training. The researchers note: "The learned tendency towards thoroughness becomes counterproductive, introducing error accumulation."
This creates a perverse incentive structure. We train models to be chatty because we like chatty responses, then discover that being chatty makes them worse at actually being correct.
The study focused on open-weight models, not frontier models like Claude Opus or GPT-4. Whether Anthropic's and OpenAI's latest releases exhibit the same behavior to the same degree remains an open question. But the pattern documented in the research tends to show up across model families, even if the magnitude varies.
What This Means for Developers
Caveman is implemented as a simple skill for Claude Code. You invoke it with /caveman or just tell Claude to "talk like a caveman" or "use fewer words." There are even levels—"ultra caveman" versus "light caveman," depending on how concise you want responses.
The tool doesn't touch error messages (those are quoted exactly) or anything involving actual code generation. It's purely about prose output—the explanatory text Claude provides between code blocks.
Chase, the video creator, frames this as worth trying because there's no real downside. Even if the performance improvements documented in the research don't fully translate to frontier models, you're still saving tokens. And if brevity does help Claude reason more clearly on straightforward problems, that's a bonus.
The broader implication is that developers might want to add something like "be concise, no filler, straight to the point" to their Claude configuration files. Not because of meme appeal, but because there's actual evidence that verbosity constraints improve model performance.
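A minimal version of that directive, dropped into a project's CLAUDE.md memory file, might look like the following. The exact wording and section heading are illustrative—neither the repo nor the paper prescribes specific phrasing:

```markdown
# CLAUDE.md

## Response style
- Be concise. No filler, no preamble, straight to the point.
- Quote error messages exactly; never compress code blocks.
- Elaborate only when explicitly asked.
```

The second rule mirrors Caveman's own design: compression applies to explanatory prose, never to error messages or generated code.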
The Uncomfortable Question
What makes this research uncomfortable is what it suggests about how we've been evaluating and training these models. We've been optimizing for human preference—which includes preference for verbose, thorough responses—without fully accounting for whether that verbosity helps or hurts task performance.
The study's title, "Brevity Constraints Reverse Performance Hierarchies," captures the stakes. We've built larger models expecting them to be categorically better, only to discover that their scale-dependent behaviors sometimes work against them. And a simple constraint—"be brief"—can flip that dynamic entirely.
That's not a bug in a specific model. That's a question about how we're building these systems at all.
—Dev Kapoor covers open source, AI tooling, and the human dynamics behind code for Buzzrag
Watch the Original Video
Caveman Claude Code Is the New Meta (Here's the Science)
Chase AI
10m 36s

About This Source: Chase AI
Chase AI is a dynamic YouTube channel that has quickly attracted 31,100 subscribers since its inception in December 2025. The channel is dedicated to demystifying no-code AI solutions, making them accessible to both individuals and businesses, regardless of their technical expertise. With a cross-platform reach of over 250,000, Chase AI is a vital resource for those looking to integrate AI into daily operations and improve workflow efficiency.