A 4B Model Beat a 235B Model for Under $500

Here's a scenario worth sitting with: a 235-billion-parameter AI model, one of the largest and most capable reasoning engines available today, gets asked a straightforward financial question. What was YouTube's year-over-year ad revenue growth from 2023 to 2024? It queries a table that doesn't exist. Gets nothing back. Queries again. Still nothing. Then, having exhausted its two-attempt patience, it just... makes up an answer.

Meanwhile, a 4-billion-parameter model that cost less than $500 to fine-tune calls get_table_name first, inspects the schema, runs a query, hits a column error, corrects itself, and returns the right number.

This is the central demonstration in a recent talk by Kobie Crawford, developer advocate at Snorkel AI, delivered at the AI Engineer conference. It's a clean, well-constructed argument—and it challenges one of the AI industry's most deeply ingrained reflexes: when a model underperforms, make it bigger.

The Sledgehammer Problem

The reflex isn't irrational. Larger models genuinely do reason better across a wide range of tasks. The scaling laws that drove the last several years of AI development are real. But Crawford—working in partnership with the RLLM lab at UC Berkeley—is pushing on something more specific: whether raw reasoning capability is actually what's missing when enterprise AI deployments fail.

His framing, borrowed from the RLLM team, is what they call the "Terence Tao effect." Tao is widely regarded as one of the greatest living mathematicians—someone who can work across virtually any domain of mathematics at the highest level. But that breadth of brilliance isn't what a financial analyst needs. A financial analyst needs to pull data from a database, do some arithmetic, and not hallucinate the inputs. Those are different skills, and piling on more reasoning capacity doesn't automatically fix a gap in procedural discipline.

"The idea that you must always get to a much smarter model to do something or deeper reasoning to get something done well," Crawford said, "is the thing we're challenging here."

The 235B model in the demo (Qwen 3, quantized) isn't a bad model. It's an extraordinarily capable one. But it went into a tool-use environment without first checking what was available, and no amount of mathematical sophistication could make up for that. It assumed tables existed that didn't. It never called get_table_name. The instrument was in the toolkit; the model just didn't reach for it.

What the Small Model Actually Learned

The 4B model Snorkel fine-tuned went through a notably different sequence. First: discover the available tables. Second: inspect the schema. Third: run the query. Fourth, when that query returned an error because the column name was wrong: observe the error, correct, and try again.
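To make that sequence concrete, here is a minimal sketch of the discover, inspect, query, correct loop against an in-memory SQLite database. It is an illustration under assumptions, not Snorkel's environment: the table, the column names, the helper functions other than the get_table_name idea, and the revenue figures are all invented for the example and are not real YouTube numbers.

```python
import sqlite3

# Hypothetical database: table, columns, and figures are made up for
# illustration and are NOT real YouTube revenue numbers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ad_revenue (fiscal_year INTEGER, segment TEXT, revenue_musd REAL)"
)
conn.executemany(
    "INSERT INTO ad_revenue VALUES (?, ?, ?)",
    [(2023, "video_ads", 100.0), (2024, "video_ads", 120.0)],
)

def get_table_names(conn):
    """Step 1: discover which tables exist instead of guessing."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return [r[0] for r in rows]

def get_schema(conn, table):
    """Step 2: inspect column names before writing a query."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

def run_query(conn, sql, params=()):
    """Steps 3-4: return (rows, error) so a failure stays visible to the caller."""
    try:
        return conn.execute(sql, params).fetchall(), None
    except sqlite3.OperationalError as exc:
        return None, str(exc)

def yoy_growth(conn, segment):
    table = get_table_names(conn)[0]
    columns = get_schema(conn, table)
    # First attempt guesses a column name ("revenue") that does not exist.
    sql = f"SELECT fiscal_year, revenue FROM {table} WHERE segment = ? ORDER BY fiscal_year"
    rows, error = run_query(conn, sql, (segment,))
    if error is not None:
        # Treat the error as information: take the real column from the schema.
        revenue_col = next(c for c in columns if c.startswith("revenue"))
        sql = sql.replace("fiscal_year, revenue ", f"fiscal_year, {revenue_col} ")
        rows, error = run_query(conn, sql, (segment,))
    (_, previous), (_, current) = rows
    return (current - previous) / previous

growth = yoy_growth(conn, "video_ads")
print(f"{growth:.1%}")
```

The point of the sketch is the control flow, not the SQL: the failed query is returned as data rather than swallowed, which is what gives the caller something to correct against.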

That last part—the self-correction—is worth pausing on. The model wasn't programmed with a fallback rule. It learned, through reinforcement learning on a purpose-built financial dataset, that errors are information to act on rather than dead ends that warrant hallucination. The 235B model's failure mode was to treat two empty query results as license to invent. The 4B model's learned behavior was to treat an error as a prompt to investigate.

Crawford's team used GRPO—a reinforcement learning algorithm—applied to their FinQA environment, a self-contained setup developed with the UC Berkeley RLLM group and published openly on Hugging Face, OpenEnv, and PrimeIntellect. The training dataset was built with domain experts—financial analysts and PhD-level contributors—specifically to ensure the questions were correctly answerable and the answers verifiable. The whole training run took roughly 21 hours. Total cost: under $500.
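For readers unfamiliar with GRPO, the sketch below shows only its group-relative scoring step as the algorithm is commonly described: several answers are sampled for the same question, and each one's reward is normalized against the group's mean and standard deviation. The talk does not detail Snorkel's implementation, so the function name, the epsilon value, and the example rewards are assumptions, and the clipped policy update and KL penalty are omitted entirely.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled answer's reward against its own group.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead. eps avoids division by zero when
    every answer in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: four sampled answers to one question, two correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat their group average get a positive advantage and are reinforced; a group where every answer scores the same contributes no learning signal.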

The benchmark improvements were measurable. On the core FinQA task, the fine-tuned model roughly doubled its pass@1 rate over the baseline. On FinQA Reasoning, the harder multi-table version of the benchmark, performance rose from 13.9% to 26.6%, a gain of 12.7 percentage points.
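As a quick check on the arithmetic, the sketch below computes pass@1 in its usual sense (the share of questions answered correctly on a single attempt) and the size of the FinQA Reasoning gain from the two percentages quoted in the talk. The function name and the four-item example list are illustrative assumptions.

```python
def pass_at_1(first_attempt_passed):
    """Fraction of questions whose single sampled attempt was correct."""
    return sum(first_attempt_passed) / len(first_attempt_passed)

# Illustrative only: four questions, two answered correctly on the first try.
example_rate = pass_at_1([True, False, False, True])

# Figures quoted for FinQA Reasoning, in percent.
baseline, tuned = 13.9, 26.6
relative_gain = tuned / baseline   # ratio of tuned to baseline
point_gain = tuned - baseline      # difference in percentage points

print(f"{relative_gain:.2f}x, +{point_gain:.1f} points")
```

The ratio comes out just under 2, which is why "roughly doubled" is the right way to describe it.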

The Transfer Learning Surprise

The multi-table result is arguably the most interesting data point here, because of how it was produced. When Snorkel's team ran ablations comparing different training curricula—single-table only, mixed single-and-multi-table, and progressive curriculum learning from simple to complex—the single-table training regime won. Not just on single-table tasks. On the harder multi-table benchmark too.

Nearly doubling performance on multi-table questions, after training only on simpler single-table examples, suggests something about what was actually being learned. The improvement wasn't domain-specific SQL knowledge. It was procedural: check what exists before you query, inspect before you assume, correct when you're wrong. Those behaviors, instilled through relatively narrow training data, generalized cleanly to more complex problems.

"Tool discipline," Crawford argues, "just knowing how to use the tools that are in the environment, turned out to be a bigger deal than anything else."

That's a meaningful diagnostic. If the failure mode is behavioral rather than cognitive—a missing habit rather than a missing capability—then the solution is also behavioral. You don't need a bigger model. You need a better-trained one.

The Diagnostic Layer: Rubrics Before Data

One of the more practically useful ideas Crawford surfaces is how Snorkel approaches evaluation before it even starts building training data. Rather than measuring only final output correctness—right or wrong—the team builds rubrics that decompose a model's response into multiple assessable dimensions. Did the model use the correct tool? Did it inspect the schema first? Did it handle errors gracefully?
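A rubric of this kind can be sketched as a set of checks over the sequence of tool calls a model made. The version below is a hypothetical illustration, not Snorkel's rubric: the ToolCall record, the get_schema and run_query tool names, the three dimensions, and both example traces (loosely modeled on the demo described earlier) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    ok: bool  # False if the call failed or returned nothing usable

def score_rubric(trace):
    """Score one trajectory on several behavioral dimensions."""
    names = [call.name for call in trace]
    # Only calls made before the first query count as "checking first".
    first_query = names.index("run_query") if "run_query" in names else len(names)
    before_query = names[:first_query]
    failed = [i for i, call in enumerate(trace) if not call.ok]
    return {
        "discovered_tables": "get_table_name" in before_query,
        "inspected_schema": "get_schema" in before_query,
        # Every failed call must be followed by some later successful call.
        "recovered_from_errors": all(
            any(later.ok for later in trace[i + 1:]) for i in failed
        ),
    }

# Hypothetical traces modeled on the two behaviors in the demo.
small_model = [
    ToolCall("get_table_name", True),
    ToolCall("get_schema", True),
    ToolCall("run_query", False),
    ToolCall("run_query", True),
]
large_model = [ToolCall("run_query", False), ToolCall("run_query", False)]

print(score_rubric(small_model))
print(score_rubric(large_model))
```

Both trajectories could be graded simply right or wrong on the final answer; the rubric is what shows which habit was missing.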

This rubric-based analysis is what allows you to find the specific failure mode worth targeting. In this case, the rubric revealed that the bottleneck wasn't reasoning depth or domain knowledge—it was tool discipline. Without that diagnostic layer, the instinct would likely have been to upgrade to a more powerful model, repeat the eval, see marginal improvement, and conclude the problem was fundamentally hard.

"Instead of simply knowing yes or no at the final," Crawford explains, "you can use the rubric to help you do an analysis of what are the behaviors that you want to actually generate data sets to help you with."

The reinforcement learning loop itself still operates on a single scalar reward—GRPO doesn't consume multi-dimensional rubric outputs directly. But the rubric informs what training data to generate and what behavioral targets to set. It's a diagnostic tool that sits upstream of the training pipeline.
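The split between the two roles can be sketched as follows. The reward function returns a single number per answer, while the rubric reports are tallied separately to decide which behavior new training data should target. The talk does not say how the reward is computed, so the binary correctness check, the tolerance, the dimension names, and the example reports are all assumptions made for illustration.

```python
import math
from collections import Counter

def scalar_reward(predicted, expected, rel_tol=1e-3):
    """The single number handed to the RL loop: 1.0 if the answer is right."""
    return 1.0 if math.isclose(predicted, expected, rel_tol=rel_tol) else 0.0

def weakest_dimension(rubric_reports):
    """Upstream diagnostic: which rubric dimension fails most often?"""
    failures = Counter()
    for report in rubric_reports:
        for dimension, passed in report.items():
            if not passed:
                failures[dimension] += 1
    return failures.most_common(1)[0][0] if failures else None

# Hypothetical rubric reports from three evaluation runs.
reports = [
    {"used_correct_tool": True, "inspected_schema": False, "handled_errors": True},
    {"used_correct_tool": True, "inspected_schema": False, "handled_errors": False},
    {"used_correct_tool": True, "inspected_schema": True, "handled_errors": True},
]

target = weakest_dimension(reports)
print(target, scalar_reward(0.2, 0.2001), scalar_reward(0.2, 0.5))
```

Nothing from the rubric reaches the reward here; it only tells the team which kind of examples to build next.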

What This Doesn't Settle

Crawford is careful not to overclaim. The talk title is deliberately punchy—"Stop Making Models Bigger"—and he acknowledges upfront that large models aren't inherently wrong. The argument is domain-specific: for agentic tool-use tasks in constrained enterprise environments, behavioral fine-tuning on high-quality narrow data can beat brute-force scaling.

That's a more limited claim than it might appear. Financial analysis with structured database tools is a relatively well-defined problem space. The tools have known schemas. The correct answers are verifiable. The failure modes are diagnosable. Not every enterprise AI problem has those properties.

There's also the question of where this approach hits its ceiling. A 4B model can learn tool discipline; whether it can learn the domain reasoning required for genuinely complex financial analysis—multi-step inference, ambiguous data interpretation, judgment calls—is a different question. Crawford's demo establishes that the bottleneck for this class of tasks was behavioral, not cognitive. It doesn't establish that's universally true.

And Snorkel is, not incidentally, a company whose core business is high-quality data for AI training. Crawford is forthright about this: he's a developer advocate, the talk is sponsored, and the research was conducted by Snorkel's own team. The argument that data quality matters more than model size is also an argument for Snorkel's value proposition. That doesn't make the research wrong, but it's context worth keeping in mind.

What the demonstration does establish clearly is that the "just use a bigger model" reflex has real costs (computational, financial, operational, and, in regulated industries, potentially legal) and that those costs aren't always buying what teams assume they're buying. A model that can't use its tools correctly isn't more reliable at 235 billion parameters than it was at 4 billion. It's just a more expensive way of being unreliable.

The more interesting question, which this research opens rather than closes, is how broadly the tool-discipline finding generalizes. If the primary failure mode in agentic AI systems is behavioral rather than cognitive, that reframes the entire problem of enterprise AI deployment—and shifts a significant portion of the work from model selection to training methodology.

Marcus Chen-Ramirez is a senior technology correspondent for Buzzrag covering AI, software development, and the intersection of technology and society.