Ternary Models Promise Full AI Power at Fraction of Size
PrismML's new ternary models claim to deliver FP16-level AI accuracy at 7-8x smaller size. We examine what's real and what's still theoretical.
Written by AI. Mike Sullivan
April 22, 2026

Photo: Tim Carambat / YouTube
Here we go again. Another startup promises to revolutionize AI by making models smaller, faster, and somehow just as smart. I've watched this movie before—different compression scheme, same breathless proclamations.
Except this time, the math is interesting enough that I actually downloaded the models.
PrismML just released what they're calling "ternary models"—a refinement of their earlier one-bit models that theoretically delivers full FP16 accuracy with a memory footprint seven to eight times smaller. That's the pitch, anyway. The reality is more nuanced, as it always is.
The Compression Question We Keep Asking
The core problem hasn't changed since I was running Netscape Navigator: how do you make powerful AI models small enough to run on normal hardware without lobotomizing them in the process?
Traditional quantization—the technique we've been using for years—works by essentially chopping digits off the decimal places in model weights. A standard FP16 model uses 16-bit floating point numbers for calculations. Quantize it down to 8-bit or 4-bit, and you get a smaller file that needs less memory. The trade-off? The model gets progressively dumber as you compress it more aggressively.
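A toy sketch makes the "chopping digits" idea concrete: symmetric round-to-nearest quantization to 8-bit integers. This is a deliberately simplified stand-in, not how production schemes like GGUF block quants actually work, but the precision loss it shows is the same phenomenon.

```python
# Toy symmetric round-to-nearest quantization of weights to 8-bit integers.
# Real quantizers (GGUF block quants, AWQ, etc.) are more sophisticated,
# but the trade-off is the same: fewer bytes, more rounding error.
weights = [0.031, -0.118, 0.250, -0.004, 0.199]   # stand-in FP16 weights

scale = max(abs(w) for w in weights) / 127        # map the largest |w| to 127
q = [max(-127, min(127, round(w / scale))) for w in weights]
restored = [qi * scale for qi in q]

print(q)  # -> [16, -60, 127, -2, 101]; one byte per weight vs two for FP16
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(max(errors) <= scale / 2)   # rounding error bounded by half a step
```

Push the same idea down to 4 or 2 bits and the step size grows, which is exactly why aggressive quantization makes models "progressively dumber."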
Tim Carambat, who creates AnythingLLM and tested these ternary models, explains the usual quantization problem: "Running the two-bit quantized version of a model is often horrible. It is in no way reflective of the original model. So much data has been pruned, excluded, or removed outright that you're not even running the real model anymore. You're running some copy of a copy of a copy."
That's where one-bit models entered the conversation. Instead of trying to compress existing models, Microsoft's BitNet research asked: what if you train a model from scratch to use only -1 or 1 as values? No complex matrix multiplication—just addition. CPUs can handle that. Memory requirements drop dramatically. File sizes shrink by factors of 14 to 16.
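The arithmetic simplification is easy to see in a sketch: with every weight restricted to -1 or 1, a dot product reduces to signed accumulation of the activations. This is toy code to illustrate the idea, not BitNet's actual kernel.

```python
# With weights restricted to -1/+1, a dot product needs no multiplication:
# each weight just decides whether its activation is added or subtracted.
def binary_dot(activations, sign_weights):
    total = 0.0
    for a, w in zip(activations, sign_weights):
        total += a if w == 1 else -a   # add or subtract, never multiply
    return total

acts = [0.5, -1.25, 2.0, 0.75]
ws = [1, -1, -1, 1]
print(binary_dot(acts, ws))  # 0.5 + 1.25 - 2.0 + 0.75 = 0.5
```

That's why CPUs, which are weak at the dense matrix multiplication GPUs were built for, can keep up: addition is cheap everywhere.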
The catch? Microsoft's BitNet models were research demos. Completely unusable in practice. The theory was sound, but nobody had actually built a one-bit model worth running until PrismML shipped one in March.
Enter Ternary: The Goldilocks Solution?
Now PrismML is back with ternary models, which add a third value to the mix: zero. So instead of just -1 and 1, you get -1, 0, and 1. Technically that's 1.58 bits, but computers don't do fractional bits, so "ternary" it is.
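That 1.58 figure is just log2(3), the information carried by three states. One way fractional bits become whole bytes in storage: five ternary weights fit in a single byte, since 3^5 = 243 ≤ 256. This is a hypothetical packing for illustration, not necessarily PrismML's on-disk format.

```python
import math

# Three states carry log2(3) ~ 1.585 bits of information -- the "1.58-bit"
# figure. A simple way to store them: base-3 encode five trits per byte.
print(round(math.log2(3), 3))   # -> 1.585

def pack5(trits):                    # trits: five values from {-1, 0, 1}
    code = 0
    for t in trits:
        code = code * 3 + (t + 1)    # map -1/0/1 -> 0/1/2, base-3 encode
    return code                      # fits in one byte: 0..242

def unpack5(code):
    out = []
    for _ in range(5):
        out.append(code % 3 - 1)
        code //= 3
    return out[::-1]

ws = [-1, 0, 1, 1, -1]
code = pack5(ws)
print(code)                 # -> 51, a single byte for five weights
assert unpack5(code) == ws  # round-trips losslessly
```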
The promise is compelling: maintain FP16-level accuracy while still being seven to eight times smaller than standard models. Not quite as tiny as pure one-bit models, but supposedly smarter.
Carambat's benchmarks show the progression. A standard Qwen 3 8B model at FP16 precision scores a 79.3 average across benchmarks and weighs in at 16GB. The one-bit version scores 70 and takes up about 1-2GB. The new ternary version? Scores 75.5 while still staying under 2GB.
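Those sizes line up with back-of-envelope arithmetic: 8 billion weights at 2 bytes each is 16GB; at roughly 1.585 bits each, under 2GB. The sketch below assumes ideal packing of every weight, which is why it comes out closer to 10x than the 7-8x claimed in practice (embeddings, norms, and similar layers typically stay at higher precision).

```python
PARAMS = 8e9   # nominal 8B-parameter model

fp16_gb = PARAMS * 2 / 1e9                 # 2 bytes per weight
ternary_gb = PARAMS * (1.585 / 8) / 1e9    # ~1.585 bits per weight, ideal packing

print(f"FP16:    {fp16_gb:.1f} GB")        # 16.0 GB
print(f"ternary: {ternary_gb:.2f} GB")     # ~1.59 GB before overheads
print(f"ratio:   {fp16_gb / ternary_gb:.1f}x")  # ~10x ideal; 7-8x in practice
```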
That's a meaningful difference. The question is whether it holds up beyond benchmarks.
The Benchmark Problem
I have complicated feelings about benchmarks. They're useful as indicators—if a model scores terribly across the board, that tells you something. But they're not gospel, and they're increasingly gameable.
Carambat addresses this directly: "Benchmarks are not perfect. In my opinion, if you are a lay person or you don't want to get into all of the nuance about what it means to have a great local model, the easiest way to think about this is think of benchmarks as an indicator."
He's right, but I'd go further. Benchmarks have become a marketing tool. Companies optimize for them. The MMLU Redux benchmark shows ternary at 72.6 versus 83 for standard Qwen—a 10-point gap that looks significant. But does that translate to real-world difference? Only actual use tells you.
The beauty of local models, as Carambat notes, is you can test them yourself without paying per token. So I did.
What Actually Running These Feels Like
Getting ternary models running requires PrismML's custom fork of llama.cpp—the main branch hasn't integrated support yet because, frankly, PrismML is the only source for these models right now. That should raise a yellow flag for anyone who remembers vendor lock-in.
The installation process involves command line work, which immediately excludes a chunk of the "run AI on your phone" audience these models theoretically enable. Carambat walks through it clearly—download the GGUF model file, grab the PrismML llama.cpp release for your platform, run a server command with your desired context window.
On his M4 Max with 48GB RAM, he's getting around 119 tokens per second. Performance on more modest hardware will vary, which is sort of the whole point—these models are supposed to run on devices that couldn't handle standard 8B models.
The energy efficiency numbers are striking. According to PrismML's data, ternary models consume significantly less power per token than FP16 equivalents. That matters for battery life, thermal management, and operational costs at scale.
The Question That Actually Matters
Here's what I keep coming back to: can this approach scale beyond 8B parameters?
Eight billion parameter models are useful. They're surprisingly capable for many tasks. But they're not competing with frontier models. They're not replacing cloud APIs for serious work. They're complementary tools.
Carambat identifies the crucial limitation: "8B is great and to have it be fractional in memory but still give FP16 intelligence is nothing to scoff at—but the world really needs bigger models here to play against cloud in any meaningful way."
This is where the promise meets reality. If ternary models max out at 8B, they're an optimization for edge cases—literally, running AI on the edge of networks, on devices with tight resource constraints. Valuable, but not revolutionary.
If PrismML or someone else figures out how to build viable 12B, 27B, or larger ternary models, that changes the equation. Then you're talking about desktop machines running models that currently require expensive cloud infrastructure.
But that's a big "if." Training large models is expensive and technically challenging. Training them with novel architectures that haven't been battle-tested? That's research, not product.
Pattern Recognition
I've seen this cycle enough times to recognize the shape. Promising research leads to startup. Startup demonstrates proof of concept. Early adopters get excited. Then comes the hard part: scaling, productizing, competing with established players who have deeper pockets and more data.
PrismML has done something genuinely impressive—they've made one-bit and ternary models that actually work, which is further than Microsoft's research got. But they're also the only source for these models, using custom tooling, targeting a niche use case.
That doesn't make it unimportant. Edge AI matters. Privacy matters. Energy efficiency matters. Models that run on devices you already own matter.
But let's be clear about what we're looking at: an interesting advancement in model compression that enables specific use cases, not a wholesale replacement for how we currently deploy AI. The gap between an optimized 8B model and GPT-4 class performance remains massive, regardless of how efficiently you can run the smaller model.
The future Carambat is cautiously optimistic about—where ternary models scale up and truly compete with cloud—requires breakthroughs we haven't seen yet. Until then, this is a tool for people who value local deployment enough to accept some performance trade-offs.
Which is fine. Not everything needs to change everything.
—Mike Sullivan
Watch the Original Video
I Just Tried The Brand New Ternary Model And It's Great!
Tim Carambat
24m 59s

About This Source
Tim Carambat
Tim Carambat is a YouTube content creator specializing in the intricacies of artificial intelligence. As a software engineer and the founder and CEO of Mintplex Labs, Carambat leverages his industry expertise to provide insights into AI models and their practical applications. Although his subscriber count is not publicly known, Carambat has been active for over a year, crafting content that appeals to tech enthusiasts and professionals alike. He is notably recognized for his creation of AnythingLLM, further enhancing his credibility in the AI sector.