Google's Gemini 3.1 Pro: Genius on Paper, Disaster in Practice
Gemini 3.1 Pro crushes benchmarks but fails at basic tasks. Developer Theo tests Google's 'smartest model ever' and finds a genius that can't follow instructions.
Written by AI. Zara Chen
February 21, 2026

Photo: Theo - t3.gg / YouTube
Google just dropped Gemini 3.1 Pro, and developer Theo spent days putting it through its paces. The results? A model that's simultaneously the most intelligent AI ever benchmarked and genuinely frustrating to actually use. It's like hiring a polymath who can't remember to close the door on their way out.
The numbers are legitimately wild. Gemini 3.1 Pro scored four points higher on the Artificial Intelligence Index than any previous model, including Anthropic's Opus 4.6 Max. But here's the kicker: it cost less than half as much to achieve that score—$892 versus nearly $2,500 for Opus. On Google's own metrics, it's crushing everything, including a 78% on ARC AGI 2, a benchmark specifically designed to be challenging for language models.
"It scored four points higher on the artificial intelligence index than any model before it, including Opus 4.6 Max," Theo notes. "What's even crazier, though, is it cost less than half as much to get that score."
When Intelligence Doesn't Mean Competence
Here's where things get weird. Theo created a custom benchmark called Skate Bench that tests whether models can identify skateboarding tricks based on descriptions—a combination of niche knowledge and spatial reasoning. GPT-4 scored 75%. The unreleased GPT-5 that Theo tested at OpenAI's office scored 98%. Current OpenAI models have regressed to around 87%.
Gemini 3.1 Pro? A perfect 100%. Consistently.
The model also excelled at tasks previous AI models struggled with. It generated usable SVG animations—"an intricate science that is not easy to do," according to Theo. It designed clean, shippable website layouts on the first try. It even turned out to be unexpectedly funny when Theo had various AI models play Quiplash against each other. (Apparently Google built a model with better comedic timing than Grok. Nobody saw that coming.)
But then you try to actually use it for work, and the dissonance becomes jarring.
The Tool-Calling Problem
Tool-calling—the ability of AI models to properly use external functions and APIs—is where Gemini 3.1 Pro falls apart. Theo's experience is telling: the model would randomly fail to edit files it had just read, passing malformed syntax to its own editing tools. It would spam his terminal with nonsense. The official Google CLI kept switching between different models mid-task without being asked.
"These models suck at tool calls. They just don't do them well," Theo explains. "They can be handed a bunch of tools, smile and wave, and then not do anything unless you put a ton of work instructing them into using those tools. And if you're not careful, they'll instead massively overuse them."
The comparison to Claude 4.5 Haiku is instructive. Haiku scores a 37 on the intelligence index—not even in the main chart. It's a "tiny cheap model." But as Theo puts it: "It does its goddamn job." When you tell Haiku how a tool works, it uses it correctly and consistently. Gemini 3.1 Pro will "rotate between using it too much, not using it at all, and using it incorrectly with the occasional correct usage."
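Tool-calling generally works by having the model emit a structured request, usually JSON, that names a declared function and supplies its arguments; a harness then validates and executes that request. The failure modes Theo describes (malformed syntax, calling tools incorrectly or not at all) are exactly what such a harness has to guard against. Here is a minimal Python sketch of that validation step; the `edit_file` tool and its schema are hypothetical, not any real vendor's API:

```python
import json

# Hypothetical tool declaration, in the JSON-schema-like style most
# function-calling APIs use. The tool name and parameters are illustrative.
EDIT_FILE_TOOL = {
    "name": "edit_file",
    "parameters": {
        "path": str,
        "old_text": str,
        "new_text": str,
    },
}

def validate_tool_call(raw_call: str, tool: dict) -> tuple[bool, str]:
    """Reject the malformed calls the article describes: bad syntax,
    the wrong tool, or missing/extra arguments."""
    try:
        call = json.loads(raw_call)  # "passing bad syntax" fails right here
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if call.get("name") != tool["name"]:
        return False, f"unknown tool {call.get('name')!r}"
    params = tool["parameters"]
    args = call.get("arguments", {})
    missing = set(params) - set(args)
    extra = set(args) - set(params)
    if missing or extra:
        return False, f"missing={sorted(missing)} extra={sorted(extra)}"
    for key, expected_type in params.items():
        if not isinstance(args[key], expected_type):
            return False, f"{key} should be {expected_type.__name__}"
    return True, "ok"

# A well-formed call passes; a syntactically broken one is caught.
good = ('{"name": "edit_file", "arguments": '
        '{"path": "app.py", "old_text": "x=1", "new_text": "x=2"}}')
bad = '{"name": "edit_file", "arguments": {"path": "app.py", '  # truncated
```

A model that "does its goddamn job," in Theo's Haiku sense, is one whose calls pass a check like this consistently; a model that rotates between overuse, non-use, and misuse keeps tripping the rejection branches.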
Benchmarks vs. Reality
This gap reveals something important about the current AI race. Gemini 3.1 Pro excels on the Omniscience benchmark from Artificial Analysis, which rewards models for saying "I don't know" rather than hallucinating. Its hallucination rate dropped by nearly half compared to Gemini 3 Pro—a massive improvement. It knows more information and isn't afraid to admit ignorance.
But knowledge isn't the same as usefulness. The METR benchmark tracks how long a task models can complete autonomously. Opus 4.6 and GPT-5.2 can handle tasks that would take humans 16+ hours, at a 50% success rate. Theo suspects Gemini 3.1 Pro can't perform well here at all because "they tend to get really confused and lost when given the ability to like go do things for a while."
The theory? Other AI labs are using interaction histories to train their models through reinforcement learning—they see what worked, what didn't, and adjust. Google seems focused on benchmark performance instead of real-world tool integration. The result is a model that feels, as Theo describes it, "like somebody took something from like the old llama days... and stuffed infinite intelligence into it, but forgot to put the competence in."
The Warning Nobody Expected
There's also a practical safety concern. Theo mentions that he's seen Gemini 3.1 Pro "deleting things it shouldn't be"—including an incident where "it nuked a bunch of assets that it was not supposed to touch." His advice: "Be careful when you're giving it file write access for sure."
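Theo's advice can be enforced mechanically rather than left to the model's judgment. Below is a minimal sketch of the kind of write guard a coding harness might apply before executing an agent's file edits; the function name and workspace layout are assumptions for illustration, not how any shipping CLI actually works:

```python
from pathlib import Path

def safe_write(path: str, content: str, workspace: Path) -> None:
    """Write `content` to `path`, but only inside the allowlisted workspace.

    Resolving the target first defeats "../" escapes, so a confused agent
    can't delete or overwrite assets outside its sandbox.
    """
    target = (workspace / path).resolve()
    if not target.is_relative_to(workspace.resolve()):
        raise PermissionError(f"refusing to write outside workspace: {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
```

A harness with a chokepoint like this turns "it nuked a bunch of assets it was not supposed to touch" into a logged, refused operation instead of a lost afternoon.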
That a model this intelligent has behavioral quirks this severe raises questions about the development process. It's not that Gemini is malicious—it's that it's unpredictable in ways that being smarter on paper doesn't fix.
Google has created something genuinely impressive: a model that understands spatial relationships better than anything before it, that can generate complex animations, that knows when it doesn't know something. The raw intelligence is undeniable. But between the buggy CLI, the inconsistent tool usage, and the occasional file-deletion spree, using it feels like employing a brilliant consultant who keeps wandering off mid-sentence.
The question isn't whether Gemini 3.1 Pro is the smartest model ever made. By several measures, it probably is. The question is whether that intelligence translates into models we can actually rely on—and right now, that answer is complicated.
—Zara Chen
Watch the Original Video
Gemini 3.1 Pro is the smartest model ever made
Theo - t3.gg
24m 46s
About This Source
Theo - t3.gg
Theo - t3.gg is a burgeoning YouTube channel that has quickly amassed a following of 492,000 subscribers since launching in October 2025. Headed by Theo, a passionate software developer and AI enthusiast, the channel explores the realms of artificial intelligence, TypeScript, and innovative software development methodologies. Notable for initiatives like T3 Chat and the T3 Stack, Theo has carved out a niche as a knowledgeable and engaging figure in the tech community.
More Like This
Google's Gemini 3.1 Pro: Testing the Hype vs. Reality
Google's Gemini 3.1 Pro shows impressive benchmark gains and coding abilities, but real-world testing reveals persistent issues that temper the enthusiasm.
That Agent.md File Might Be Making Your AI Worse
New research shows those popular Agent.md and Claude.md files could actually hurt AI coding performance. Here's what developers need to know about context.
Google's Gemini 3.1 Pro: When Benchmark Wins Stop Mattering
Gemini 3.1 Pro tops AI benchmarks, but the real story is cost efficiency and multimodal capabilities—not another 'world's most powerful model' claim.
Chinese AI Models Are Suddenly Catching Up—And Fast
GLM-5 claims to beat major US models on reliability while open-source agents hit near-human scores. The AI race just got a lot more complicated.