
SubQ Claims 12M Token Context at Near-Zero Cost

SubQ says its sparse attention architecture processes 12M tokens at 1,000x less compute than standard transformers. Here's what checks out—and what doesn't yet.

Written by AI. Yuki Okonkwo

June 19, 2026 · 7 min read

The transformer architecture has had one genuinely annoying property since the original 2017 paper: when you double the length of your input, you don't double the compute cost, you roughly quadruple it. Every word attends to every other word, and that relationship matrix explodes. It's called quadratic scaling, and it's the reason that processing a million-token document feels like trying to run Crysis on a toaster.
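To make the quadrupling concrete, here is a minimal sketch that just counts query-key pairs. It assumes plain dense self-attention where every token scores every token; `attention_pairs` is a made-up helper name, not anything from SubQ's report.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense self-attention over n_tokens tokens."""
    return n_tokens * n_tokens


for n in (1_000, 2_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} pairs")

# Doubling the input length multiplies the pair count by four.
growth = attention_pairs(2_000) / attention_pairs(1_000)
print(f"doubling the input multiplies the pair count by {growth:.0f}x")
```

At a million tokens that is a trillion pairs per attention layer, which is the wall the rest of this piece is about.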

The industry has spent years working around it—chunking documents into pieces, building retrieval pipelines that pull only the relevant bits, stacking agents on top of each other. SubQ, a startup that published its SubQ 1.1 Small technical report this week, is arguing that all of those are patches over a broken pipe. Their claim: they've built the first fully sub-quadratic LLM, with a 12-million-token context window, and it runs at a fraction of what current models cost.

That claim is worth taking seriously. It's also not fully verified yet. Both of those things are true.

The actual architecture

The key innovation is what SubQ calls Sub-Quadratic Sparse Attention, or SSA. Here's how it differs from what came before.

Earlier approaches to sparse attention, such as Longformer and BigBird, also skipped word-to-word relationships to cut compute. But they chose what to keep mostly by position: a sliding window of nearby words, plus a small set of global tokens (and, in BigBird, a few random ones). That's fast, but it can miss semantic connections across long distances.

SSA does it by content instead. For each word, the model learns to identify a small group of other words that actually matter, regardless of where they sit in the document, and then runs full attention math only on that group. The result: SubQ claims it looks at roughly 0.13% of all possible relationships at 12 million tokens—and still finds what it needs.
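The difference between choosing by position and choosing by content is easier to see in a toy. The sketch below is not SubQ's SSA, whose selection mechanism isn't public; it swaps in generic top-k selection by dot-product score as a stand-in, and every function name and number in it is hypothetical. It also scores every key in order to pick the top k, so it shows the "select, then run exact attention on the subset" idea without being sub-quadratic itself.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, selected):
    """Exact softmax attention, restricted to the key indices in `selected`."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) / scale for i in selected])
    return [sum(w * values[i][d] for w, i in zip(weights, selected))
            for d in range(len(values[0]))]


def window_select(position, n_keys, radius):
    """Positional sparsity: keep keys within `radius` of the query position."""
    return [i for i in range(n_keys) if abs(i - position) <= radius]


def topk_select(query, keys, k):
    """Content sparsity (toy stand-in): keep the k highest-scoring keys.

    This scores every key to rank them, so the selection step here is as
    expensive as dense attention; a real system needs a cheaper selector.
    """
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]),
                    reverse=True)
    return sorted(ranked[:k])


# Toy document of 8 tokens: only token 0 is relevant to the query at position 7.
keys = [[4.0, 0.0]] + [[0.0, 1.0] for _ in range(7)]
values = [[1.0]] + [[0.0] for _ in range(7)]
query = [4.0, 0.0]

by_position = window_select(position=7, n_keys=len(keys), radius=2)
by_content = topk_select(query, keys, k=2)
window_out = attend(query, keys, values, by_position)
topk_out = attend(query, keys, values, by_content)

print("window keeps", by_position, "->", window_out)
print("top-k keeps ", by_content, "->", topk_out)
```

In this toy the window keeps only positions 5 through 7 and returns 0.0, because the one relevant token sits too far away. The content-based pick includes position 0 and returns a value close to 1. Whether SSA's learned selector behaves this cleanly at 12 million tokens is exactly what needs outside testing.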

This is also distinct from state-space models like Mamba, which compress context into a fixed-size memory slot. SSA doesn't compress; it selects. The company says this preserves exact attention quality on the relationships it does compute, which is the mechanism behind their "no quality loss from approximation" claim.

Whether that claim survives independent testing is a different question. We'll get there.

The compute chart, taken at face value

SubQ's technical report includes a comparison that, if accurate, is legitimately striking. Dense attention at 1 million tokens costs around 252 petaFLOPs per attention layer. SSA at the same length: under 4 petaFLOPs. The gap widens as context grows: 8x at 128k tokens, 31x at 512k, 64.5x at 1 million.

The company puts a cost figure on this too. Outside reports describe a long context evaluation costing roughly $8 on SubQ versus roughly $2,600 on Claude Opus at the same length. Anthropic's listed price for Claude Opus 4 is $15 per million input tokens. At that rate, a $2,600 bill implies approximately 173 million input tokens—which is far more than a single context window's worth of text, and doesn't map cleanly onto SubQ's stated context sizes. That comparison is either measuring something unusual or running a huge batch evaluation, and the public materials don't explain which. The drama of the number shouldn't substitute for knowing what was actually being measured.
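Here is the arithmetic behind that 173 million figure, as a small sketch. It assumes the whole bill is input tokens at a flat $15 per million, with no output tokens, caching discounts, or long-context surcharges; the 12-million-token window size used for comparison is SubQ's stated maximum.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 15.00   # USD, listed Claude Opus 4 input price
REPORTED_BILL = 2_600.00                 # USD, from the outside reports
SUBQ_MAX_CONTEXT = 12_000_000            # tokens, SubQ's stated window

# ASSUMPTION: the entire bill is input tokens at the flat listed rate.
implied_tokens = REPORTED_BILL / PRICE_PER_MILLION_INPUT_TOKENS * 1_000_000
windows = implied_tokens / SUBQ_MAX_CONTEXT

print(f"implied input tokens: {implied_tokens:,.0f}")
print(f"equivalent 12M-token windows: {windows:.1f}")
```

Even measured against SubQ's largest stated window, the bill works out to more than 14 full contexts, which is why a single-prompt reading of the comparison doesn't hold up.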

What SubQ 1.1 Small actually scored

SubQ did not train this model from scratch. They took an existing open-weight frontier model, stripped out its dense attention mechanism, replaced it with SSA, then extended context in stages—262k, 512k, 1 million, 2 million tokens—with roughly 1 trillion tokens of additional training on long-form material: books, full documents, code repositories. The benchmark results that follow need to be read with that lineage in mind.

On the RULER benchmark, which tests multi-step retrieval (variable tracing, counting, cross-document synthesis), SubQ 1.1 Small scores 99.12% at 128k tokens. That's an extraordinarily high number. For context, top models on the public RULER leaderboard typically score in the low-to-mid 90s at comparable lengths—but I can't tell you whether SubQ's specific result appears on that leaderboard, because the company hasn't published its entry there, and I have not found a third-party reproduction of this specific figure. This is an unverified self-report, and the 99.12% should be treated as such until someone else runs the eval.

On the needle-in-a-haystack test—hide one fact inside millions of tokens, ask the model to retrieve it exactly—SubQ scores 100% at 1 million and 2 million tokens, and 98% at both 6 million and 12 million tokens. The model was primarily trained to 1 million tokens, so the 12-million performance, if it holds under independent testing, is genuinely interesting.
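For readers who haven't seen one, a needle-in-a-haystack eval has roughly this shape. This is a minimal sketch, not SubQ's harness: the filler text, the needle sentence, and `stub_model` (a string search standing in for a real model call) are all made up for illustration, and real evals run at vastly larger scale.

```python
import random


def build_haystack(n_filler, needle, depth, seed=0):
    """Bury `needle` among filler sentences at relative `depth` (0.0 to 1.0)."""
    rng = random.Random(seed)
    words = ["alpha", "bravo", "charlie", "delta", "echo"]
    filler = [" ".join(rng.choice(words) for _ in range(8)) + "."
              for _ in range(n_filler)]
    insert_at = int(depth * n_filler)
    return " ".join(filler[:insert_at] + [needle] + filler[insert_at:])


def exact_match_rate(answer_fn, expected, contexts, question):
    """Fraction of contexts where the answer matches `expected` exactly."""
    hits = sum(1 for c in contexts if answer_fn(c, question).strip() == expected)
    return hits / len(contexts)


def stub_model(context, question):
    """HYPOTHETICAL stand-in for a model call: finds the needle by string search."""
    marker = "The secret code is "
    start = context.find(marker)
    if start == -1:
        return ""
    start += len(marker)
    return context[start:context.find(".", start)]


needle = "The secret code is 7341."
question = "What is the secret code?"
contexts = [build_haystack(200, needle, depth)
            for depth in (0.0, 0.25, 0.5, 0.75, 1.0)]

score = exact_match_rate(stub_model, "7341", contexts, question)
print(f"exact-match retrieval rate: {score:.0%}")
```

The stub trivially scores 100% because it searches the text directly. The point is the structure: vary the context length and the depth of the needle, then count exact matches. That is the kind of number SubQ is reporting at 1 million through 12 million tokens.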

Then there's the general capability table, where SubQ puts itself against frontier models. On GPQA Diamond (graduate-level science reasoning), SubQ scores 85.4. The report lists GPT-4.5 at 93.2 and Claude Opus 4.8 at 92 for comparison. These comparison figures come from SubQ's own report—I have not confirmed whether they match current entries on official model cards or third-party leaderboards, so treat them as the company's stated context rather than independent reference points. What they suggest, if roughly accurate: SubQ sits solidly below the top-tier frontier models on raw reasoning, but meaningfully above the smaller models it's more directly competing with.

On LiveCodeBench (competitive programming), SubQ reports 89.7, with GPT-4.5 at 92 and Claude Opus at 92.2. The public LiveCodeBench leaderboard is checkable in principle, but benchmark scores vary significantly by model version and evaluation date, and the report doesn't specify which versions or dates it's drawing from. I did not independently verify whether these comparison figures appear on the leaderboard or which snapshots they correspond to. The numbers look plausible directionally—SubQ close to but below the frontier on coding—but the specifics are unconfirmed.

The one benchmark where everything scores low is FinanceBench automation, where SubQ hits 13% against Sonnet's 8%. The report acknowledges this directly, which is the right call. Long-context retrieval is hard; agentic financial reasoning is harder.

The Appen question

SubQ says a third party called Appen verified the benchmarks in the 1.1 Small report. That matters—it's more than nothing. What it doesn't tell us is what Appen specifically did: whether they re-ran the evaluations independently, audited SubQ's methodology, or reviewed SubQ's outputs against expected results. I went looking for a public description of Appen's verification scope and found nothing that clarifies it. That's the actual reporting finding: the scope of that verification isn't described in publicly available materials. "Third-party verified" can mean a lot of things, and the gap between those things matters quite a bit when the model weights aren't public and independent labs can't run their own evals.

The honest shape of this

SubQ's architectural bet makes intuitive sense: if almost all word-to-word relationships in a trained model are effectively zero, computing them anyway is wasteful, and a model that learns to skip them based on meaning rather than position should outperform positional shortcuts on long-range tasks. That logic is sound, and sparse attention research going back years supports the general direction.

The specific claims—1,000x compute reduction, 99.12% RULER, the $8 vs. $2,600 cost comparison—are a different matter. They come primarily from the company's own testing, with a third-party verification whose scope is unclear, model weights that aren't public, and comparison scores that haven't been confirmed against independent leaderboard entries. Also worth knowing: sparse attention's efficiency gains are most pronounced on very long inputs. If you're mostly running short prompts, this isn't the architecture that saves you money—and SubQ's published benchmarks don't cover that use case extensively.

SubQ says design partners are getting access in the next few weeks, with broader rollout through the quarter and general availability by end of year. The company has also signaled a 50-million-token context window goal for 2026.

As SubQ's announcement put it: "SubQ is not just another model. It represents a major algorithmic breakthrough." That framing might turn out to be exactly right. Or the numbers might compress under real-world conditions the way they often do. The architecture has theoretical backing, the retrieval scores are interesting, and the cost structure—if it holds—would genuinely change what's economically feasible with long-context AI.

We'll know a lot more once independent labs get the weights. That's the thing about claims that can't be reproduced yet: they're neither confirmed nor refuted. They're just waiting.


Yuki Okonkwo is the AI & Machine Learning Correspondent at Buzzrag.
