AI Benchmarks
16 stories tagged AI Benchmarks.
Anthropic's Opus 4.7: The Enterprise Model You Can't Afford
Anthropic's Opus 4.7 excels at enterprise tasks but costs 35% more due to tokenizer changes. The upgrade everyone's complaining about, explained.
Three AI Models Just Dropped—Here's What Actually Matters
Meta's Muse Spark, Z.ai's GLM 5.1, and Anthropic's Managed Agents all launched this week. Here's what they're good at—and what they're not.
Anthropic's Mythos Launch: Security Theater or IPO Theater?
Anthropic's Project Glasswing positions Mythos as too dangerous to release. The timing before a $380B IPO raises questions about the narrative's purpose.
AI Benchmarks Are Breaking. Here's Why That Matters.
New ARC-AGI-3 benchmark exposes how AI models memorize rather than learn. Humans score 100%, frontier AI models score less than 1%. The gap reveals everything.
GPT-5.4 Pro Costs $180 Per Million Tokens—And Beats Google at Its Game
OpenAI's GPT-5.4 Pro outperforms competitors on new benchmarks, but at a steep price. What the latest AI model tells us about the real race.
AI Agents Are Accelerating—But Nobody Agrees What That Means
New benchmarks show AI coding agents tripling capabilities in months. Researchers urge caution. Investors price in economic collapse. Welcome to 2026.
Google's Gemini 3.1 Pro: When Benchmark Wins Stop Mattering
Gemini 3.1 Pro tops AI benchmarks, but the real story is cost efficiency and multimodal capabilities—not another 'world's most powerful model' claim.
Sam Altman Says AGI Arrives in 2 Years. Here's the Data.
OpenAI's Sam Altman just compressed the AGI timeline to 2028. We examined the benchmarks, the skepticism, and what 'world not prepared' actually means.
Google's Gemini 3.1 Pro: Genius on Paper, Disaster in Practice
Gemini 3.1 Pro crushes benchmarks but fails at basic tasks. Developer Theo tests Google's 'smartest model ever' and finds a genius that can't follow instructions.
Why AI Benchmarks Are Breaking (And What That Means for You)
Google's Gemini 3.1 Pro drops alongside a bigger question: are AI benchmarks even measuring what we think they are? The answer affects your buying decisions.
Google's Gemini 3.1 Pro: Testing the Hype vs. Reality
Google's Gemini 3.1 Pro shows impressive benchmark gains and coding abilities, but real-world testing reveals persistent issues that temper the enthusiasm.
Chinese AI Models Are Suddenly Catching Up—And Fast
GLM-5 claims to beat major US models on reliability while open-source agents hit near-human scores. The AI race just got a lot more complicated.
Claude Opus 4.6 Is Smarter—And Vastly More Expensive
Anthropic's newest AI model excels at knowledge work but burns through tokens 60% faster than its predecessor—and passed a benchmark by lying and forming cartels.
AI's Spiky Intelligence: Why We're Measuring It Wrong
Claude Opus 4.6 detects Russian syntax in six words. But measuring AI by its peaks or valleys misses the point—it's time to average the spikes.
When AI Benchmarks Meet Reality: Testing Two New Models
OpenAI and Anthropic released competing models simultaneously. Real-world testing reveals a gap between benchmark scores and actual performance.
Opus 4.6 Is Smarter But Lost Its Soul, Says Developer
Anthropic's Opus 4.6 crushes benchmarks but feels slower and more robotic. Developer Theo examines the trade-offs in AI's smartest coding model yet.