SubQ Claims 12M Token Context at Near-Zero Cost

The transformer architecture has had one genuinely annoying property since the original 2017 paper: when you double the length of your input, you don't double the compute cost, you roughly quadruple it. Every word attends to every other word, and that relationship matrix explodes. It's called quadratic scaling, and it's the reason that processing a million-token document feels like trying to run Crysis on a toaster.
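The quadrupling is easy to check by counting. Here is a minimal sketch, assuming attention cost is simply proportional to the number of token pairs scored and ignoring everything else in the model; the function name is mine, not from any library.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Count the query-key pairs dense attention scores for n_tokens."""
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000):
    print(n, dense_attention_pairs(n))

# Doubling the input length quadruples the number of pairs.
ratio = dense_attention_pairs(2_000) / dense_attention_pairs(1_000)
print(ratio)  # 4.0
```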

The industry has spent years working around it—chunking documents into pieces, building retrieval pipelines that pull only the relevant bits, stacking agents on top of each other. SubQ, a startup that published its SubQ 1.1 Small technical report this week, is arguing that all of those are patches over a broken pipe. Their claim: they've built the first fully sub-quadratic LLM, with a 12-million-token context window, and it runs at a fraction of what current models cost.

That claim is worth taking seriously. It's also not fully verified yet. Both of those things are true.

The actual architecture

The key innovation is what SubQ calls Sub-Quadratic Sparse Attention, or SSA. Here's how it differs from what came before.

Earlier approaches to sparse attention, such as Longformer and BigBird, also skipped word-to-word relationships to cut compute. But they chose what to skip mostly by position: look at nearby words, plus a few designated global tokens (and, in BigBird, some random links), and ignore the rest. That's fast, but it can miss semantic connections across long distances.
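To make "by position" concrete, here is a minimal sketch of a sliding-window pattern, the local component of Longformer-style attention. It is a simplification: the global tokens and random links those models also use are left out, and the function name and window size are illustrative choices of mine.

```python
def sliding_window_neighbors(n_tokens: int, window: int) -> list[list[int]]:
    """For each position i, list the positions j it may attend to: |i - j| <= window."""
    return [
        list(range(max(0, i - window), min(n_tokens, i + window + 1)))
        for i in range(n_tokens)
    ]

neighbors = sliding_window_neighbors(n_tokens=10, window=2)
print(neighbors[5])       # [3, 4, 5, 6, 7]
print(0 in neighbors[9])  # False
```

The last line is the weakness in miniature: position 9 can never look at position 0, no matter how relevant it is.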

SSA does it by content instead. For each word, the model learns to identify a small group of other words that actually matter, regardless of where they sit in the document, and then runs full attention math only on that group. The result: SubQ claims it looks at roughly 0.13% of all possible relationships at 12 million tokens—and still finds what it needs.
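SubQ has not published how SSA picks that group, so the following is not the company's method. It is a generic top-k sketch of content-based selection, assuming relevance is a scaled dot product between a query and each key; every name in it is hypothetical.

```python
import math

def topk_attention(query, keys, values, k):
    """Attend only to the k keys that score highest against the query.

    Caveat: this sketch scores every key before selecting, so it is still
    quadratic overall. A truly sub-quadratic method has to find its
    candidates more cheaply than that.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Content-based selection: keep the k best-scoring positions, wherever they sit.
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Exact softmax attention, restricted to the selected positions.
    top = max(scores[i] for i in selected)
    weights = [math.exp(scores[i] - top) for i in selected]
    total = sum(weights)
    output = [
        sum(w / total * values[i][d] for w, i in zip(weights, selected))
        for d in range(len(values[0]))
    ]
    return sorted(selected), output

keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
selected, output = topk_attention([1.0, 0.0], keys, values, k=2)
print(selected)  # [0, 2]
```

Positions 0 and 2 are picked because they resemble the query, not because they are adjacent. For scale: if the 0.13% figure is counted against all possible pairs, it averages out to roughly 15,600 attended positions per token at 12 million tokens.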

This is also distinct from state-space models like Mamba, which compress context into a fixed-size memory state. SSA doesn't compress; it selects. The company says this preserves exact attention quality on the relationships it does compute, which is the mechanism behind their "no quality loss from approximation" claim.

Whether that claim survives independent testing is a different question. We'll get there.

The compute chart, taken at face value

SubQ's technical report includes a comparison that, if accurate, is legitimately striking. Dense attention at 1 million tokens costs around 252 petaFLOPs per attention layer. SSA at the same length: under 4 petaFLOPs. The gap widens as context grows: 8x at 128k tokens, 31x at 512k, 64.5x at 1 million.
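Those figures can at least be checked against each other. The sketch below does nothing more than divide, and it assumes "128k" and "512k" mean 128,000 and 512,000 tokens, which the report excerpt doesn't specify.

```python
dense_pflops_at_1m = 252.0  # per attention layer, as quoted from SubQ's report
speedup_at_1m = 64.5        # as quoted from SubQ's report

implied_ssa_pflops = dense_pflops_at_1m / speedup_at_1m
print(round(implied_ssa_pflops, 2))  # 3.91, consistent with "under 4"

# Speedup per thousand tokens of context, at each quoted length.
speedups = {128_000: 8.0, 512_000: 31.0, 1_000_000: 64.5}
per_thousand = {n: round(s / (n / 1_000), 4) for n, s in speedups.items()}
print(per_thousand)  # {128000: 0.0625, 512000: 0.0605, 1000000: 0.0645}
```

The speedup per thousand tokens stays nearly flat, meaning the advantage grows roughly in step with length. If dense cost grows with the square of length, that is what you would expect from a method whose own cost grows close to linearly.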

The company puts a cost figure on this too. Outside reports describe a long-context evaluation costing roughly $8 on SubQ versus roughly $2,600 on Claude Opus at the same length. Anthropic's listed price for Claude Opus 4 is $15 per million input tokens. At that rate, a $2,600 bill implies approximately 173 million input tokens, which is far more than a single context window's worth of text and doesn't map cleanly onto SubQ's stated context sizes. That comparison is either measuring something unusual or running a huge batch evaluation, and the public materials don't explain which. The drama of the number shouldn't substitute for knowing what was actually being measured.
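The arithmetic behind that objection is short enough to show. This sketch assumes the whole bill is input tokens at list price, with no output tokens, caching, or batch discounts, which is my simplification rather than anything the reports state.

```python
opus_price_per_million_input = 15.00  # USD, the list price cited above
reported_bill = 2_600.00              # USD, from the outside reports

implied_tokens_millions = reported_bill / opus_price_per_million_input
print(round(implied_tokens_millions, 1))  # 173.3

# How many full 12-million-token windows would that be?
subq_window_millions = 12
windows = implied_tokens_millions / subq_window_millions
print(round(windows, 1))  # 14.4
```

Even measured against SubQ's largest claimed window, the bill corresponds to more than fourteen full contexts, so a single run at one length does not explain it.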

What SubQ 1.1 Small actually scored

SubQ did not train this model from scratch. They took an existing open-weight frontier model, stripped out its dense attention mechanism, replaced it with SSA, then extended context in stages—262k, 512k, 1 million, 2 million tokens—with roughly 1 trillion tokens of additional training on long-form material: books, full documents, code repositories. The benchmark results that follow need to be read with that lineage in mind.

On the RULER benchmark, which tests multi-step retrieval (variable tracing, counting, cross-document synthesis), SubQ 1.1 Small scores 99.12% at 128k tokens. That's an extraordinarily high number. For context, top models on the public RULER leaderboard typically score in the low-to-mid 90s at comparable lengths—but I can't tell you whether SubQ's specific result appears on that leaderboard, because the company hasn't published its entry there, and I have not found a third-party reproduction of this specific figure. This is an unverified self-report, and the 99.12% should be treated as such until someone else runs the eval.

On the needle-in-a-haystack test (hide one fact inside millions of tokens, ask the model to retrieve it exactly), SubQ scores 100% at 1 million and 2 million tokens, and 98% at both 6 million and 12 million tokens. The model was primarily trained to 1 million tokens, with context extension stopping at 2 million, so the 6-million and 12-million results sit well beyond its training lengths. If they hold under independent testing, that is genuinely interesting.
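For readers who haven't seen one, here is a minimal sketch of how a needle-in-a-haystack case is built and graded. SubQ's actual harness isn't public; the filler sentence, the needle, and the `ask_model` stand-in are all hypothetical, and the "model" here is a plain string search.

```python
import random

def build_haystack(filler: str, needle: str, n_sentences: int, seed: int = 0) -> str:
    """Bury one needle sentence at a random position among filler sentences."""
    rng = random.Random(seed)
    sentences = [filler] * n_sentences
    sentences.insert(rng.randrange(n_sentences + 1), needle)
    return " ".join(sentences)

def grade(answer: str, expected: str) -> bool:
    """Pass only if the exact expected fact appears in the answer."""
    return expected in answer

needle = "The vault code is 7391."
haystack = build_haystack("The sky was grey that day.", needle, n_sentences=1000)

def ask_model(context: str, question: str) -> str:
    # Hypothetical stand-in for a real model call.
    start = context.find("The vault code is ")
    return context[start:start + len(needle)]

print(grade(ask_model(haystack, "What is the vault code?"), "7391"))  # True
```

A reported score is the pass rate of this check over many needle positions and context lengths, which is why the test measures exact retrieval rather than reasoning.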

Then there's the general capability table, where SubQ puts itself against frontier models. On GPQA Diamond (graduate-level science reasoning), SubQ scores 85.4. The report lists GPT-4.5 at 93.2 and Claude Opus 4.8 at 92 for comparison. These comparison figures come from SubQ's own report—I have not confirmed whether they match current entries on official model cards or third-party leaderboards, so treat them as the company's stated context rather than independent reference points. What they suggest, if roughly accurate: SubQ sits solidly below the top-tier frontier models on raw reasoning, but meaningfully above the smaller models it's more directly competing with.

On LiveCodeBench (competitive programming), SubQ reports 89.7, with GPT-4.5 at 92 and Claude Opus at 92.2. The public LiveCodeBench leaderboard is checkable in principle, but benchmark scores vary significantly by model version and evaluation date, and the report doesn't specify which versions or dates it's drawing from. I did not independently verify whether these comparison figures appear on the leaderboard or which snapshots they correspond to. The numbers look plausible directionally—SubQ close to but below the frontier on coding—but the specifics are unconfirmed.

The one benchmark where everything scores low is FinanceBench automation, where SubQ hits 13% against Sonnet's 8%. The report acknowledges this directly, which is the right call. Long-context retrieval is hard; agentic financial reasoning is harder.

The Appen question

SubQ says a third party called Appen verified the benchmarks in the 1.1 Small report. That matters—it's more than nothing. What it doesn't tell us is what Appen specifically did: whether they re-ran the evaluations independently, audited SubQ's methodology, or reviewed SubQ's outputs against expected results. I went looking for a public description of Appen's verification scope and found nothing that clarifies it. That's the actual reporting finding: the scope of that verification isn't described in publicly available materials. "Third-party verified" can mean a lot of things, and the gap between those things matters quite a bit when the model weights aren't public and independent labs can't run their own evals.

The honest shape of this

SubQ's architectural bet makes intuitive sense: if almost all word-to-word relationships in a trained model are effectively zero, computing them anyway is wasteful, and a model that learns to skip them based on meaning rather than position should outperform positional shortcuts on long-range tasks. That logic is sound, and sparse attention research going back years supports the general direction.

The specific claims—1,000x compute reduction, 99.12% RULER, the $8 vs. $2,600 cost comparison—are a different matter. They come primarily from the company's own testing, with a third-party verification whose scope is unclear, model weights that aren't public, and comparison scores that haven't been confirmed against independent leaderboard entries. Also worth knowing: sparse attention's efficiency gains are most pronounced on very long inputs. If you're mostly running short prompts, this isn't the architecture that saves you money—and SubQ's published benchmarks don't cover that use case extensively.

SubQ says design partners are getting access in the next few weeks, with broader rollout through the quarter and general availability by end of year. The company has also signaled a 50-million-token context window goal for 2026.

As SubQ's announcement put it: "SubQ is not just another model. It represents a major algorithmic breakthrough." That framing might turn out to be exactly right. Or the numbers might compress under real-world conditions the way they often do. The architecture has theoretical backing, the retrieval scores are interesting, and the cost structure—if it holds—would genuinely change what's economically feasible with long-context AI.

We'll know a lot more once independent labs get the weights. That's the thing about claims that can't be reproduced yet: they're neither confirmed nor refuted. They're just waiting.

Yuki Okonkwo is the AI & Machine Learning Correspondent at Buzzrag.