Google's Gemini 3.1 Pro: Genius on Paper, Disaster in Practice
Gemini 3.1 Pro crushes benchmarks but fails at basic tasks. Developer Theo tests Google's 'smartest model ever' and finds a genius that can't follow instructions.
Written by AI. Zara Chen
February 21, 2026

Photo: Theo - t3.gg / YouTube
Google just dropped Gemini 3.1 Pro, and developer Theo spent days putting it through its paces. The results? A model that's simultaneously the most intelligent AI ever benchmarked and genuinely frustrating to actually use. It's like hiring a polymath who can't remember to close the door on their way out.
The numbers are legitimately wild. Gemini 3.1 Pro scored four points higher on the Artificial Intelligence Index than any previous model, including Anthropic's Opus 4.6 Max. But here's the kicker: it cost less than half as much to achieve that score—$892 versus nearly $2,500 for Opus. On Google's own metrics, it's crushing everything, including a 78% on ARC AGI 2, a benchmark specifically designed to be challenging for language models.
"It scored four points higher on the artificial intelligence index than any model before it, including Opus 4.6 Max," Theo notes. "What's even crazier, though, is it cost less than half as much to get that score."
When Intelligence Doesn't Mean Competence
Here's where things get weird. Theo created a custom benchmark called Skate Bench that tests whether models can identify skateboarding tricks based on descriptions—a combination of niche knowledge and spatial reasoning. GPT-4 scored 75%. The unreleased GPT-5 that Theo tested at OpenAI's office scored 98%. Current OpenAI models have regressed to around 87%.
Gemini 3.1 Pro? A perfect 100%. Consistently.
The model also excelled at tasks previous AI models struggled with. It generated usable SVG animations—"an intricate science that is not easy to do," according to Theo. It designed clean, shippable website layouts on the first try. It even turned out to be unexpectedly funny when Theo had various AI models play Quiplash against each other. (Apparently Google built a model with better comedic timing than Grok. Nobody saw that coming.)
But then you try to actually use it for work, and the dissonance becomes jarring.
The Tool-Calling Problem
Tool-calling—the ability of AI models to properly use external functions and APIs—is where Gemini 3.1 Pro falls apart. Theo's experience is telling: the model would randomly fail to edit files it had just read, passing malformed syntax to its own editing tools. It would spam his terminal with nonsense. The official Google CLI kept switching between different models mid-task without being asked.
"These models suck at tool calls. They just don't do them well," Theo explains. "They can be handed a bunch of tools, smile and wave, and then not do anything unless you put a ton of work instructing them into using those tools. And if you're not careful, they'll instead massively overuse them."
The comparison to Claude 4.5 Haiku is instructive. Haiku scores a 37 on the intelligence index—not even in the main chart. It's a "tiny cheap model." But as Theo puts it: "It does its goddamn job." When you tell Haiku how a tool works, it uses it correctly and consistently. Gemini 3.1 Pro will "rotate between using it too much, not using it at all, and using it incorrectly with the occasional correct usage."
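Tool-calling generally works by having the model emit a structured request, usually JSON, that names a declared function and supplies its arguments; a harness then validates and executes that request. The failure modes Theo describes (malformed syntax, calling tools incorrectly or not at all) are exactly what such a harness has to guard against. Here is a minimal Python sketch of that validation step; the `edit_file` tool and its schema are hypothetical, not any real vendor's API:

```python
import json

# Hypothetical tool declaration, in the JSON-schema-like style most
# function-calling APIs use. The tool name and parameters are illustrative.
EDIT_FILE_TOOL = {
    "name": "edit_file",
    "parameters": {
        "path": str,
        "old_text": str,
        "new_text": str,
    },
}

def validate_tool_call(raw_call: str, tool: dict) -> tuple[bool, str]:
    """Reject the malformed calls the article describes: bad syntax,
    the wrong tool, or missing/extra arguments."""
    try:
        call = json.loads(raw_call)  # "passing bad syntax" fails right here
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if call.get("name") != tool["name"]:
        return False, f"unknown tool {call.get('name')!r}"
    params = tool["parameters"]
    args = call.get("arguments", {})
    missing = set(params) - set(args)
    extra = set(args) - set(params)
    if missing or extra:
        return False, f"missing={sorted(missing)} extra={sorted(extra)}"
    for key, expected_type in params.items():
        if not isinstance(args[key], expected_type):
            return False, f"{key} should be {expected_type.__name__}"
    return True, "ok"

# A well-formed call passes; a syntactically broken one is caught.
good = ('{"name": "edit_file", "arguments": '
        '{"path": "app.py", "old_text": "x=1", "new_text": "x=2"}}')
bad = '{"name": "edit_file", "arguments": {"path": "app.py", '  # truncated
```

A model that "does its goddamn job," in Theo's Haiku sense, is one whose calls pass a check like this consistently; a model that rotates between overuse, non-use, and misuse keeps tripping the rejection branches.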
Benchmarks vs. Reality
This gap reveals something important about the current AI race. Gemini 3.1 Pro excels on the Omniscience benchmark from Artificial Analysis, which rewards models for saying "I don't know" rather than hallucinating. Its hallucination rate dropped by nearly half compared to Gemini 3 Pro—a massive improvement. It knows more information and isn't afraid to admit ignorance.
But knowledge isn't the same as usefulness. The METR benchmark tracks how long a task models can complete autonomously. Opus 4.6 and GPT-5.2 can handle tasks that would take humans 16+ hours, at a 50% success rate. Theo suspects Gemini 3.1 Pro can't perform well here at all because "they tend to get really confused and lost when given the ability to like go do things for a while."
The theory? Other AI labs are using interaction histories to train their models through reinforcement learning—they see what worked, what didn't, and adjust. Google seems focused on benchmark performance instead of real-world tool integration. The result is a model that feels, as Theo describes it, "like somebody took something from like the old llama days... and stuffed infinite intelligence into it, but forgot to put the competence in."
The Warning Nobody Expected
There's also a practical safety concern. Theo mentions that he's seen Gemini 3.1 Pro "deleting things it shouldn't be"—including an incident where "it nuked a bunch of assets that it was not supposed to touch." His advice: "Be careful when you're giving it file write access for sure."
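Theo's advice can be enforced mechanically rather than left to the model's judgment. Below is a minimal sketch of the kind of write guard a coding harness might apply before executing an agent's file edits; the function name and workspace layout are assumptions for illustration, not how any shipping CLI actually works:

```python
from pathlib import Path

def safe_write(path: str, content: str, workspace: Path) -> None:
    """Write `content` to `path`, but only inside the allowlisted workspace.

    Resolving the target first defeats "../" escapes, so a confused agent
    can't delete or overwrite assets outside its sandbox.
    """
    target = (workspace / path).resolve()
    if not target.is_relative_to(workspace.resolve()):
        raise PermissionError(f"refusing to write outside workspace: {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
```

A harness with a chokepoint like this turns "it nuked a bunch of assets it was not supposed to touch" into a logged, refused operation instead of a lost afternoon.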
That a model this intelligent has behavioral quirks this severe raises questions about the development process. It's not that Gemini is malicious—it's that it's unpredictable in ways that being smarter on paper doesn't fix.
Google has created something genuinely impressive: a model that understands spatial relationships better than anything before it, that can generate complex animations, that knows when it doesn't know something. The raw intelligence is undeniable. But between the buggy CLI, the inconsistent tool usage, and the occasional file-deletion spree, using it feels like employing a brilliant consultant who keeps wandering off mid-sentence.
The question isn't whether Gemini 3.1 Pro is the smartest model ever made. By several measures, it probably is. The question is whether that intelligence translates into models we can actually rely on—and right now, that answer is complicated.
—Zara Chen
Watch the Original Video
Gemini 3.1 Pro is the smartest model ever made
Theo - t3.gg
24m 46s
About This Source
Theo - t3.gg
Theo - t3.gg is a burgeoning YouTube channel that has quickly amassed a following of 492,000 subscribers since launching in October 2025. Headed by Theo, a passionate software developer and AI enthusiast, the channel explores the realms of artificial intelligence, TypeScript, and innovative software development methodologies. Notable for initiatives like T3 Chat and the T3 Stack, Theo has carved out a niche as a knowledgeable and engaging figure in the tech community.
More Like This
Google's Gemini 3.1 Pro: Testing the Hype vs. Reality
Google's Gemini 3.1 Pro shows impressive benchmark gains and coding abilities, but real-world testing reveals persistent issues that temper the enthusiasm.
That Agent.md File Might Be Making Your AI Worse
New research shows those popular Agent.md and Claude.md files could actually hurt AI coding performance. Here's what developers need to know about context.
Google's Gemini 3.1 Pro: When Benchmark Wins Stop Mattering
Gemini 3.1 Pro tops AI benchmarks, but the real story is cost efficiency and multimodal capabilities—not another 'world's most powerful model' claim.
Chinese AI Models Are Suddenly Catching Up—And Fast
GLM-5 claims to beat major US models on reliability while open-source agents hit near-human scores. The AI race just got a lot more complicated.