
NVFP4 vs INT4: The Quantization Format That's 27% Faster

Nvidia's NVFP4 quantization outperforms traditional INT4 by 27% while maintaining quality—but the real story is what this reveals about benchmarking.

Written by AI. Yuki Okonkwo

April 22, 2026


Photo: Alex Ziskind / YouTube

Here's something that shouldn't work but does: a 4-bit language model outperforming an 8-bit version. Not in some synthetic benchmark designed to make a press release look good, but in actual, measurable throughput and quality.

Alex Ziskind, who runs a YouTube channel focused on ML infrastructure, got access to an Nvidia HGX server with eight B200 GPUs (each pulling 1,000W with 184GB of VRAM, because subtlety is dead). He used it to pit two 4-bit quantization formats against each other: traditional integer-based INT4, which we've been using for years, and Nvidia's newer NVFP4—a floating-point format that takes a fundamentally different approach to compression.

Both formats squeeze model weights down to four bits. The difference is how they do it, and that difference matters more than you'd think.

The Speed Gap Is Real

Ziskind tested both formats on Kimi K2.5, the open-weights model that's become something of a darling in the local LLM community. DHH (creator of Ruby on Rails) calls it his "daily driver for all the basic stuff where I don't need PhD level intelligence." Cursor AI's founder Aman Sanger mentioned that after testing multiple options, Kimi K2.5 became the base for their Composer 2 model.

The setup was genuinely apples-to-apples: same hardware, same model (Kimi K2.5), same prompts, same inference server (vLLM). Only the quantization format changed.

At single-user concurrency—basically chat mode—NVFP4 generated 133 tokens per second. INT4 managed 105. That's a 27% advantage right out of the gate.

Pushing to 32 concurrent requests (imagine 32 users hitting the model simultaneously, or an agent system with 32 sub-agents), both formats scaled impressively—about 11x total throughput. But NVFP4 maintained its lead the entire way, peaking around 2,200 tokens per second while INT4 topped out at 1,800.

The performance gap stayed consistent at every concurrency level: 20-27% faster across the board. On modern GPUs where memory bandwidth is the primary bottleneck, that's not a rounding error.
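The relative advantage checks out at both ends of the concurrency sweep. A back-of-envelope sketch using the figures from the video (not a rerun of the benchmark):

```python
# Reported throughput (tokens/sec) at the two ends of the sweep.
single = {"nvfp4": 133, "int4": 105}    # 1 concurrent request
loaded = {"nvfp4": 2200, "int4": 1800}  # peak under load

# NVFP4's relative advantage at each end.
adv_single = single["nvfp4"] / single["int4"] - 1  # ~26.7%
adv_loaded = loaded["nvfp4"] / loaded["int4"] - 1  # ~22.2%

print(f"single-user advantage: {adv_single:.1%}")
print(f"under-load advantage:  {adv_loaded:.1%}")
```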

Why NVFP4 Wins on Speed

The technical reason is straightforward once you know where to look. NVFP4 quantizes both weights and activations down to four bits. INT4 only quantizes the weights—it keeps activations at full 16-bit precision.

That means INT4 is pushing 2.5 times more data through memory per operation (20 bits total versus 8 bits for NVFP4). When your bottleneck is memory bandwidth—and on these GPUs, it is—you're paying a tax on every single operation.
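The traffic model is simple enough to write down. This is a sketch of the article's arithmetic, assuming 16-bit activations on the INT4 path:

```python
# Bits moved per weight/activation pair in a memory-bound operation.
int4_bits = 4 + 16   # 4-bit weight + 16-bit activation = 20 bits
nvfp4_bits = 4 + 4   # 4-bit weight + 4-bit activation = 8 bits

ratio = int4_bits / nvfp4_bits
print(f"INT4 moves {ratio}x the data of NVFP4 per operation")  # 2.5x
```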

But speed means nothing if the output degrades. Quantization is always a trade-off between efficiency and quality, and more aggressive compression should hurt accuracy. The question was: does it?

The Quality Test That Almost Failed

Ziskind ran four standard benchmarks: GSM-8K (grade school math), HumanEval (coding problems with executable tests), IFEval (instruction-following), and Needle in a Haystack (retrieval from long context).

Initial results seemed clear: math was tied, retrieval was perfect on both, but INT4 pulled ahead on code (59% vs 54%) and instructions (68% vs 65%). The story looked clean—NVFP4 is faster, INT4 is slightly better quality.

Except something was wrong. 54% on HumanEval for one of the best coding models available? That's suspiciously low.

Digging into the raw data revealed the issue: 62 out of 164 problems were truncated. The model never finished writing code—not because it couldn't solve the problem, but because it ran out of tokens thinking about it.

Kimi K2.5 is a reasoning model. Before it writes any code, it thinks internally using hidden tokens that count against your token budget. With max_tokens set at 2,048, the model would burn through 1,800 tokens reasoning, leaving almost nothing for actually generating the solution. Sometimes it produced nothing at all.
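The budget arithmetic explains the failure mode. The exact cap is an assumption here, but the reported figures (1,800 reasoning tokens, roughly 250 left over) point at the common 2,048-token default:

```python
max_tokens = 2048        # assumed completion cap (a common default)
reasoning_tokens = 1800  # hidden thinking tokens reported in the video

remaining = max_tokens - reasoning_tokens
print(f"{remaining} tokens left to actually write the solution")  # 248
```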

"See, Kimi K2.5 is a reasoning model. Before it writes any code, it thinks internally in hidden tokens that count against your token budget," Ziskind explains in the video.

Rerunning the benchmark with max_tokens at 8,192—giving the model room to both think and write—changed everything. NVFP4 jumped from 54% to 88%. INT4 jumped from 59% to 92%. The same thing happened with IFEval: both models shot up to around 80% once truncation was eliminated.

The INT4 "quality advantage" was a measurement artifact. NVFP4's reasoning tends to be more verbose—it thinks longer before answering—so it got truncated more often under restrictive token limits. With adequate tokens, quality was essentially tied: INT4 kept a small four-point edge on code (92% vs 88%), and everything else was a statistical wash.

The Trap in Benchmarking Reasoning Models

This is worth dwelling on, because it's a methodological trap with a wide blast radius: token limits tuned for models that answer immediately quietly penalize any model that thinks before it answers.

"This is the trap that anyone benchmarking reasoning models can fall into," Ziskind notes. "Standard token limits silently break your results. What ends up happening is the model that thinks harder looks worse, which is exactly backwards."

In other words, the model that reasons more thoroughly looks weaker under constrained benchmarks—not because it is, but because the measurement ignores its internal reasoning overhead.

This has implications beyond this specific comparison. As reasoning models become more common—and they are, rapidly—benchmark infrastructure needs to adapt. Token budgets that worked fine for GPT-3 will systematically underestimate the capabilities of models that think before they respond.
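A cheap guard against this trap is to check for truncation before scoring anything. A minimal sketch, assuming OpenAI-style responses that report a finish_reason; the records below are illustrative, not Ziskind's actual harness:

```python
# Flag completions that hit the token limit instead of finishing.
results = [
    {"task_id": "HumanEval/0", "finish_reason": "stop"},
    {"task_id": "HumanEval/1", "finish_reason": "length"},  # truncated
    {"task_id": "HumanEval/2", "finish_reason": "length"},  # truncated
]

truncated = [r["task_id"] for r in results if r["finish_reason"] == "length"]
if truncated:
    print(f"{len(truncated)}/{len(results)} runs truncated: "
          "raise max_tokens and rerun before trusting the scores")
```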

The 8-Bit Surprise

Ziskind threw in one more test: FP8, an 8-bit floating-point format with twice the bits—and therefore more precision—per value than NVFP4. In theory, it should be slower but more accurate.

FP8 was indeed slower—96 tokens per second versus NVFP4's 133. That's 28% slower, and exactly what you'd expect when you're moving more data through memory.

But the quality results were not what you'd expect. On HumanEval, FP8 scored 76%—lower than both 4-bit reasoning models at 88-92%. On instruction-following, FP8 hit 64% while the reasoning models scored 80%.

The caveat: the FP8 version wasn't a reasoning model. It was Kimi K2 base, which lacks that internal thinking step. So the comparison isn't perfectly clean. But it reveals something important—on complex tasks like coding and instruction-following, the ability to reason matters more than bit precision.

Interestingly, FP8 did outperform on simple math: 97.5% versus the reasoning models' ~95%. Sometimes reasoning gets in the way. On straightforward arithmetic problems, overthinking is worse than just calculating.

"Sometimes reasoning actually gets in the way, like with simple math problems," Ziskind observes. "That's why the motto of this channel is probably going to be: use the right tool for the right job."

What This Actually Means

If you're choosing a quantization format for deploying Kimi K2.5 or similar reasoning models, NVFP4 is the clear winner: same quality as INT4, 27% faster throughput, and that advantage holds under load.

But the bigger story is what this reveals about how we measure model performance. Benchmarks designed for one generation of models can systematically misrepresent the next generation. Reasoning models think before they answer, and if your benchmark doesn't account for that cognitive overhead, you're measuring the wrong thing.

The other takeaway: more bits don't automatically mean better results. Whether a model can reason matters more than whether it operates at 4-bit or 8-bit precision. FP8 carries twice the bits of NVFP4 yet performed worse on complex tasks because it lacked the reasoning step.

Ziskind's testing was done on enterprise hardware (eight B200 GPUs), which isn't exactly consumer-accessible. But the principles transfer. If you're running local models—whether on a Mac Studio cluster or a homelab server—understanding these trade-offs matters. Speed isn't just about user experience; it's about what you can actually build. An agent system that needs to spawn 32 concurrent reasoning chains becomes viable at 2,200 tokens/second. At 1,800? The math gets harder.
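To make that concrete, here's the per-agent budget when 32 chains share the aggregate throughput—back-of-envelope, assuming throughput divides evenly across requests:

```python
agents = 32
per_agent_nvfp4 = 2200 / agents  # ~68.8 tokens/sec per chain
per_agent_int4 = 1800 / agents   # ~56.3 tokens/sec per chain

# Seconds for each chain to produce a 1,000-token reasoning trace.
print(f"NVFP4: {1000 / per_agent_nvfp4:.1f} s")  # ~14.5 s
print(f"INT4:  {1000 / per_agent_int4:.1f} s")   # ~17.8 s
```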

The industry is moving toward local inference for good reasons—cost, privacy, control. But local deployment means caring about efficiency in ways cloud API users never have to think about. Quantization isn't just a compression technique; it's the technology that makes local AI actually practical. And as this testing shows, the format you choose matters more than the bit count suggests.

— Yuki Okonkwo, AI & Machine Learning Correspondent

Watch the Original Video

Top FREE model… one format made it WAY FASTER

Alex Ziskind

10m 46s
Watch on YouTube

About This Source

Alex Ziskind

With over 425,000 subscribers, Alex Ziskind's YouTube channel is a go-to resource for developers and tech enthusiasts. Combining his 20+ years of coding experience with a knack for content creation, Alex offers a unique mix of tech reviews and insights, flavored with humor. His channel is a haven for those looking to unravel software mysteries and explore the latest in distributed computing and AI technologies.


