The AI Arms Race Nobody's Winning: Why Context Windows Cost So Much
Linear attention promised to solve LLMs' billion-dollar scaling problem. Instead, it revealed how little we understand about what makes these models work.
Written by AI
Dev Kapoor
February 10, 2026

Photo: bycloud / YouTube
There's a peculiar math problem haunting every AI lab right now: the better their models get at thinking, the more expensive they become to run. Not in some abstract, long-term way. Right now. Every token consumed, every chain of reasoning extended, every agent orchestrating tools—it all costs money that scales in ways that make CFOs nervous.
The technical reason is straightforward if you've spent time inside transformer architecture: standard attention mechanisms scale quadratically. Every new token in a sequence needs to look at every previous token, which means both memory and compute costs grow quadratically as context windows expand. A 64K context window was considered luxurious in 2024. By early 2025, it was practically unusable for real-world agent workflows.
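The arithmetic is easy to check for yourself. Counting token-pair comparisons in full causal attention (each token attends to itself and everything before it) makes the cost curve concrete:

```python
# Back-of-envelope sketch: full causal attention compares every token
# with itself and all its predecessors, so the work grows with the
# square of sequence length.
def attention_pairs(n_tokens: int) -> int:
    """Number of token-pair comparisons in full causal attention."""
    return n_tokens * (n_tokens + 1) // 2

small = attention_pairs(64_000)     # a 64K context window
large = attention_pairs(1_000_000)  # a 1M context window
# A ~16x longer context costs ~244x the pairwise work.
print(f"{large / small:.0f}x")
```

Sixteen times the context, roughly 244 times the attention work. That is the curve every lab is trying to bend.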
What makes this fascinating—and frustrating for the researchers trying to solve it—is that three fundamentally different approaches have emerged, each with its own philosophy about how attention should work. None of them is clearly winning.
Three Ways to Rethink Attention
Sparse attention keeps the basic token-comparison model but restricts which tokens can interact. Think of it as selective amnesia by design: a token might only look at nearby tokens or a small set of designated "global" tokens. This brings quadratic scaling down to something linear—from O(n²) to O(nd), where d is a fixed window size.
DeepSeek v3.2 uses this approach, attending only to a fixed number of top relevant tokens. The pricing appears to scale linearly as context grows. But there's a trap: "Once a token is not considered relevant, the LLM completely forgets about it," as bycloud notes in his analysis. What gets pruned stays pruned.
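The pruning behavior is easy to see in a toy sketch. The NumPy snippet below is illustrative only — the function name and the top-k selection are my simplification, not DeepSeek's actual mechanism — but it shows how discarded tokens contribute exactly nothing to the output:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    """One query attends only to its top-k highest-scoring keys.
    Illustrative sketch -- not DeepSeek's actual selection mechanism."""
    scores = K @ q                       # (n,) similarity of the query to every key
    keep = np.argsort(scores)[-k:]       # indices of the k most relevant tokens
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()                         # softmax over the surviving tokens only
    return w @ V[keep]                   # pruned tokens contribute nothing at all

rng = np.random.default_rng(0)
n, d = 16, 8
out = topk_sparse_attention(rng.normal(size=d),
                            rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)))
print(out.shape)  # (8,)
```

Per query, the softmax and the weighted sum touch only k tokens instead of all n — which is exactly why a token that falls out of the top k is gone for good.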
Linear attention takes a completely different philosophical stance. Instead of comparing tokens pairwise, it accumulates them into a structured shared memory. New tokens read from this memory rather than recomparing themselves with every previous token. It's genuinely linear—mathematically guaranteed, not just in practice.
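A minimal sketch makes the "structured shared memory" idea concrete. In the simplest form of linear attention (feature maps and normalization omitted for clarity), the memory is a fixed-size matrix of accumulated key-value outer products, and each new token reads from it at constant cost:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention, simplified: accumulate key-value outer
    products into a fixed-size running memory S, so each new token reads
    from S instead of recomparing itself with every earlier token.
    (Real implementations add a positive feature map and a normalizer.)"""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # shared memory: d x d_v, size fixed regardless of n
    out = np.empty_like(V)
    for t in range(n):
        S += np.outer(K[t], V[t])   # fold token t into the memory
        out[t] = Q[t] @ S           # read: cost independent of how many tokens came before
    return out
```

The per-token cost never grows with sequence length, so the total is linear in n by construction — the "mathematically guaranteed" part.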
Compressed attention sits between the two. DeepSeek's Multi-head Latent Attention (MLA), which powers their R1 model and Kimi k2.5, doesn't merge tokens into running memory. Instead, it compresses each token into a smaller representation before comparison. "You still rank and compare documents," the analysis explains, "but the comparison itself is cheaper." It still scales quadratically—just less painfully.
What's interesting about MLA is less the technique itself and more what it represents: DeepSeek pioneered an entirely new attention method that actually works at scale without requiring hybrid approaches. That's rare enough to command respect in research circles.
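To see how "the comparison itself is cheaper," here is a deliberately simplified low-rank sketch in the spirit of MLA — the weight names and the exact factorization are my assumptions, not DeepSeek's published formulation. Each token is squeezed into a small latent, and only that latent needs to be cached:

```python
import numpy as np

def compressed_attention(X, W_down, W_uk, W_uv, W_q):
    """Low-rank KV compression in the spirit of MLA (simplified sketch;
    not DeepSeek's exact formulation). Tokens are compressed into small
    latents before attention: comparisons remain all-pairs (quadratic),
    but the cached state per token is the tiny latent, not full K/V."""
    Z = X @ W_down                  # (n, r): compressed latents -- this is what's cached
    K, V = Z @ W_uk, Z @ W_uv       # decompress to keys/values on the fly
    Q = X @ W_q
    scores = Q @ K.T / np.sqrt(K.shape[1])
    mask = np.tril(np.ones(scores.shape, dtype=bool))   # causal mask
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V
```

The score matrix is still n-by-n, so the quadratic term survives; what shrinks is the memory footprint per cached token and the cost of producing keys and values.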
The Linear Attention Saga
If you've been following LLM architecture evolution, you've probably noticed that pure implementations of these efficient attention methods are essentially nonexistent in production models. There's a reason. "Any attention mechanism that is not published by DeepSeek or is using standard attention like GQA is ass," bycloud states flatly, before clarifying: sparse attention models, delta nets, even Mamba-based architectures—they all interleave standard attention because their performance standalone is catastrophically bad.
"So no one actually uses these efficient attention methods by themselves because their performance is so bad that you think the model has Alzheimer's."
This is where the linear attention saga gets interesting. In January 2025, Chinese AI lab MiniMax published their Text-01 model using a custom linear attention called Lightning attention. Pure Lightning attention performed terribly. But when interleaved with standard attention at a 1:7 ratio, it outperformed pure standard attention, especially on long-context benchmarks like Needle in a Haystack, reaching nearly 100% accuracy.
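The interleaving itself is simple to express. A 1:7 ratio means one full standard-attention layer for every seven linear ones; the sketch below assumes the standard layer lands at every eighth position, which is my guess at the placement rather than MiniMax's documented layout:

```python
def layer_schedule(n_layers: int, standard_every: int = 8):
    """Hybrid stack sketch: mostly linear-attention layers with a full
    standard-attention layer interleaved periodically (every 8th layer
    here, giving a 1:7 standard-to-linear ratio; the exact placement
    is an assumption, not MiniMax's published layout)."""
    return ["standard" if (i + 1) % standard_every == 0 else "linear"
            for i in range(n_layers)]

sched = layer_schedule(16)
print(sched.count("linear"), sched.count("standard"))  # 14 2
```

The intuition behind the hybrid: the occasional standard layer gives the model a few chances per forward pass to do exact all-pairs recall, patching over the lossy compression of the linear layers.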
MiniMax M1 followed—a reasoning model built on this foundation. The theory seemed sound: if a model excels at long context, it should excel at extended reasoning chains. Except it didn't. Despite scaling cheaply to 128K context windows, "in practical internal usage at MiniMax, the quality gap between linear attention models and standard attention models was still too large," according to company researchers.
One MiniMax researcher's candor is worth quoting in full: "Benchmark maxing is just a matter of time, but having benchmarks that truly reflect a model's capabilities is the more difficult task. M1 is capable on these benchmarks, but if no one uses it practically, not even internally, what's the point of pushing that model?"
So MiniMax M2 abandoned linear attention entirely, reverting to standard attention. The reasoning: too immature, insufficient ecosystem support, benchmarks that don't capture practical usefulness, bugs that don't exist in standard attention.
But others kept pushing. Qwen 3 Next tried gated delta nets, which allow memory to decay selectively rather than accumulate indefinitely. Moonshot AI's Kimi Linear introduced feature-wise forgetting through their KDA (Kimi Delta Attention) mechanism—different memory channels could persist or fade independently. Paired with MLA in a 3:1 ratio (the first time anyone tried mixing linear attention with compressed attention rather than standard attention), Kimi Linear achieved the highest accuracy at 1 million context length across all models on OpenAI's MRCR benchmark before Gemini 3 Flash launched.
On knowledge-based benchmarks, though, Kimi Linear's performance was nearly half of Qwen 3 Next's. No free lunch.
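Feature-wise forgetting is easier to grasp with a single recurrent step. The sketch below is in the spirit of gated delta nets and KDA, not the published update rules: a per-channel decay vector lets some memory channels persist while others fade:

```python
import numpy as np

def gated_linear_step(S, k, v, q, alpha):
    """One recurrent step with feature-wise forgetting, in the spirit of
    gated delta nets / Kimi's KDA (simplified sketch, not the published
    update rule). alpha is a per-channel decay in [0, 1]: channels with
    alpha near 1 persist, channels near 0 fade -- so memory decays
    selectively instead of accumulating indefinitely."""
    S = alpha[:, None] * S + np.outer(k, v)  # decay each channel, then write
    return S, q @ S                          # read from the gated memory
```

With alpha fixed at all-ones this reduces to plain linear attention's ever-growing accumulator; with alpha at zero the memory is wiped every step. The interesting regime, and the hard training problem, is everything in between.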
The Google Question
Then there's Gemini 3 Flash. One million context window. Dramatically cheaper than competitors—one-fifth the cost of Claude 4.5 Sonnet while outperforming it. Flat pricing regardless of context length, which strongly suggests some form of efficient attention at that price point.
"After all the yapping I did in this video," bycloud observes, "it seems like Google already has the answer and has pretty much solved efficient attention at scale."
What Google discovered—assuming they're using novel attention mechanisms rather than just throwing more infrastructure at the problem—remains unclear. They haven't published the architecture details that would let others replicate or even understand their approach. Neither has Anthropic, whose Claude 4.6 Opus outperformed Gemini 3 Pro and Flash by a factor of three on the hardest retrieval benchmarks at million-token context windows.
Both companies might have cracked architectural breakthroughs that make efficient attention actually work. Or they might have found entirely different solutions. The opacity makes it hard to know whether this is a problem with a solution or just a problem some companies have more resources to brute-force than others.
What's clear is that the open source community—where most of the attention mechanism innovation has been documented and shared—hasn't solved it yet. The approaches that work in theory fail in practice. The hybrid methods that work in practice don't scale to the context windows that agent workflows and extended reasoning chains demand. And the models that do scale to million-token windows aren't explaining how.
There's a deeper question here about what we actually understand about these architectures. If linear attention fails in ways that benchmarks don't predict, what else are benchmarks missing? If compressed attention works for DeepSeek but nobody else can replicate their results, is the innovation in the attention mechanism or in a dozen other implementation details nobody's publishing?
The billion-dollar problem isn't just making attention efficient. It's understanding why efficiency breaks in ways we don't expect, and why solutions that should work mathematically don't work practically. Until someone solves that—or at least explains it—we're left watching proprietary models achieve what open research says shouldn't be possible yet.
Dev Kapoor covers open source software, developer communities, and the politics of code for Buzzrag.
Watch the Original Video
LLM’s Billion Dollar Problem
bycloud
17m 48s
About This Source
bycloud
bycloud is a rapidly growing YouTube channel with 212,000 subscribers, focused on breaking down advanced AI research and providing analysis of top AI labs. Launched in mid-2025, bycloud offers intuitive explanations often infused with humor, making complex AI topics accessible to both enthusiasts and professionals.
More Like This
AI Agents Are Accelerating—But Nobody Agrees What That Means
New benchmarks show AI coding agents tripling capabilities in months. Researchers urge caution. Investors price in economic collapse. Welcome to 2026.
Claude Code's New Batch Migration Tools Change the Game
Claude Code adds parallel agent tools for code quality and large-scale migrations. Plus HTTP hooks, markdown previews, and a clipboard command that actually works.
Google's Image AI Bets on Speed Over Perfection
Google's Nano Banana 2 signals a shift in AI image generation: good enough, fast enough, and cheap enough now matters more than perfect.
Gemini 3.0 Flash: Redefining Front-End Design
Discover how Gemini 3.0 Flash is transforming front-end design with speed and affordability.