Java Parsed 1 Billion Rows in 1.5 Seconds. Here's

New Year's Day, still full of oliebollen (Dutch fried dough — don't ask), and developer Roy van Rijn sees a challenge posted online by Gunnar Morling: how fast can you parse a file with one billion rows of weather data using plain vanilla Java? One file. One billion city-name/temperature pairs. Find the min, max, and average per city. Print it out.

Van Rijn submitted something early — not to win, he says in a 2025 retrospective talk, but to "challenge others to beat me." That impulse kicked off one of the more genuinely interesting open-source competitions of recent memory, and he just walked through the whole thing in roughly 40 minutes. I watched it twice. The second time I took notes.

Starting point: almost five minutes 😬

The baseline implementation Morling provided uses Java's Files.lines: single-threaded, clean, totally reasonable code. It processes a billion rows in about 4 minutes and 50 seconds. For context, the file is roughly 16 GB. That's not embarrassing. That's just what you get when you treat a billion-row file like a normal file.

Van Rijn's first move, hungover on January 1st, was adding .parallel() and a ConcurrentHashMap. Runtime: down to two minutes. Not elegant, but it works: parallelism distributes the row-processing across CPU cores, and the concurrent map keeps things from exploding when multiple threads write at the same time. Two lines of real change, nearly 60% of the time gone.
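For the curious, here's roughly what that two-line change looks like. This is my own minimal sketch, not Morling's baseline or van Rijn's submission: the class name ParallelBaseline and the Stats record are made up for illustration, and I'm assuming one city;temperature pair per line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

public class ParallelBaseline {
    // Hypothetical helper: running min/max/sum/count for one city.
    record Stats(double min, double max, double sum, long count) {
        static Stats of(double t) {
            return new Stats(t, t, t, 1);
        }

        Stats merge(Stats o) {
            return new Stats(Math.min(min, o.min), Math.max(max, o.max),
                             sum + o.sum, count + o.count);
        }

        double mean() {
            return sum / count;
        }
    }

    static Map<String, Stats> aggregate(Path file) throws IOException {
        // ConcurrentHashMap.merge is atomic per key, so worker threads can share the map.
        Map<String, Stats> stats = new ConcurrentHashMap<>();
        try (Stream<String> lines = Files.lines(file)) {
            lines.parallel().forEach(line -> {
                int sep = line.indexOf(';');
                String city = line.substring(0, sep);
                double temp = Double.parseDouble(line.substring(sep + 1));
                stats.merge(city, Stats.of(temp), Stats::merge);
            });
        }
        return new TreeMap<>(stats); // sorted by city name for printing
    }

    public static void main(String[] args) throws IOException {
        aggregate(Path.of(args[0])).forEach((city, s) ->
            System.out.printf("%s=%.1f/%.1f/%.1f%n", city, s.min(), s.mean(), s.max()));
    }
}
```

Take out the .parallel() call and you're back to the single-threaded shape of the baseline.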

Then things got weird.

The layer cake nobody talks about

At some point in the talk, van Rijn quotes Formula 1 legend Jackie Stewart: "You don't have to be an engineer to be a good racing driver, but you have to have some mechanical sympathy." He uses it to introduce the CPU cache hierarchy — L1, L2, L3, L4 — and the exponentially growing cost of reading data from each layer.

I'm going to be honest: I've read about CPU caches before and it's always felt abstract. Diagrams of little boxes, theoretical latency numbers. Van Rijn's version landed differently because he tied it directly to a concrete optimization decision — switching from eight large file segments to many tiny ones so all threads read from nearly the same memory address at nearly the same time, keeping hot data in shared cache levels. Suddenly the hierarchy isn't a diagram. It's load-bearing. The difference between reading from L1 versus main memory is so enormous that structuring your access pattern around it — not as a micro-optimization but as a design decision — is just correct engineering. I found myself thinking about every time I've written something that thrashes memory without a second thought. It's a lot of times.
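Here's how I picture the "many tiny segments" idea in code. Treat it as a sketch under assumptions, not what anyone submitted: a byte array stands in for the file, the class name SmallChunks and the chunk size are mine, and I only count newlines so that chunk boundaries don't have to line up with row boundaries (a real solution has to handle rows that straddle two chunks).

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class SmallChunks {
    // Each worker repeatedly claims the next small chunk from a shared cursor,
    // so at any moment all threads are working on neighbouring regions of the data.
    static long countRows(byte[] data, int threads, int chunkSize) throws InterruptedException {
        AtomicLong cursor = new AtomicLong();
        LongAdder rows = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                while (true) {
                    long start = cursor.getAndAdd(chunkSize);
                    if (start >= data.length) {
                        return; // nothing left to claim
                    }
                    int end = (int) Math.min(data.length, start + chunkSize);
                    long local = 0;
                    for (int p = (int) start; p < end; p++) {
                        if (data[p] == '\n') {
                            local++;
                        }
                    }
                    rows.add(local);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return rows.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] data = "Oslo;1.5\nLima;20.0\nOslo;-2.5\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countRows(data, 4, 8)); // three rows in the sample
    }
}
```

The contrast is with cutting the file into one big slice per thread up front, where each thread wanders off into its own distant part of memory.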

Anyway. Back to the contest.

The tricks, ranked by "wait, WHAT??"

Treating floats as integers: Temperatures have exactly one decimal place. Instead of parsing 12.3 as a float, multiply by 10 and store 123 as an integer. Floating-point math is slower. Do this a billion times and it adds up. Obvious in retrospect, not obvious until someone does it.
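A minimal sketch of the tenths trick, assuming every value has exactly one digit after the decimal point (the class and method names are mine):

```java
public class FixedPoint {
    // Parses text such as "-12.3" into tenths of a degree (-123), using only integer math.
    static int parseTenths(String s) {
        int i = 0;
        boolean negative = s.charAt(0) == '-';
        if (negative) {
            i++;
        }
        int value = 0;
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '.') {
                value = value * 10 + (c - '0'); // skip the dot, keep accumulating digits
            }
        }
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        long sum = parseTenths("12.3") + parseTenths("-0.5");
        // Convert back to degrees only once, when printing.
        System.out.println(sum / 10.0 / 2);
    }
}
```

Min, max, and the running sum all stay integers; the single floating-point division happens at the very end.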

Memory-mapped files: Instead of reading the file with Java's file I/O, you hand it to the OS kernel and say "put this in memory for me." You get back a byte buffer. No file I/O overhead — it's just memory access. Van Rijn says he'd never used memory-mapped files before this. Now he's explaining them at conferences.
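In standard Java that looks something like the sketch below. It's my own illustration (MappedCount is a made-up name), and it only counts lines. One real constraint worth knowing: a single MappedByteBuffer can't exceed 2 GB, so a file this size has to be mapped in windows; the 1 GB window here is an arbitrary choice.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedCount {
    static final long WINDOW = 1L << 30; // arbitrary 1 GB mapping window

    static long countLines(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            long lines = 0;
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                // The OS pages the file in on demand; reads below are plain memory accesses.
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        lines++;
                    }
                }
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines(Path.of(args[0])));
    }
}
```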

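Dropping the garbage collector entirely: Java's Epsilon GC hands out memory and then does literally nothing: no collection, no overhead. It's switched on with the JVM flags -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC. For a program that reads a file once and terminates, there's nothing worth collecting, as long as the heap is big enough to hold everything the program allocates. So: disable the GC. Valid move. Slightly unhinged move.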

SWAR (SIMD within a register): Instead of scanning bytes one at a time to find the semicolon delimiter, read eight bytes at once as a long, XOR against a mask of eight repeated delimiter bytes so that any matching byte becomes zero, then flag the zero byte and find its position with a bit-counting operation. One subtraction, two ANDs, one XOR. Delimiter found. This is the moment in the talk where the audience goes quiet because they're doing math in their heads.
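If you'd rather not do that math in your head, here is the classic zero-byte trick written out. It's a sketch of the general technique, not a copy of anyone's submission; the names SwarFind, firstSemicolon, and wordAt are mine, and I'm assuming the eight bytes are loaded little-endian so that the first byte of text lands in the lowest bits.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SwarFind {
    static final long SEMICOLONS = 0x3B3B3B3B3B3B3B3BL; // ';' repeated in all eight bytes
    static final long ONES = 0x0101010101010101L;
    static final long HIGH_BITS = 0x8080808080808080L;

    // Returns the index (0..7) of the first ';' in the word, or 8 if there is none.
    static int firstSemicolon(long word) {
        long x = word ^ SEMICOLONS;                   // bytes equal to ';' become 0x00
        long match = (x - ONES) & ~x & HIGH_BITS;     // sets 0x80 in the lowest zero byte
        return Long.numberOfTrailingZeros(match) >>> 3; // bit position -> byte index
    }

    // Loads eight bytes starting at offset as one little-endian long.
    static long wordAt(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
    }

    public static void main(String[] args) {
        byte[] row = "Oslo;1.5".getBytes(StandardCharsets.US_ASCII);
        System.out.println(firstSemicolon(wordAt(row, 0))); // prints 4
    }
}
```

The subtraction can smear false flags into bytes above the first real match, which is why only the lowest set bit is trusted. When there is no semicolon at all, the trailing-zero count is 64 and the method returns 8, meaning "keep going with the next eight bytes."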

Branchless programming: CPUs have instruction pipelines they want to keep full, so they guess which way each if statement will go and start executing ahead. When the guess is wrong, that speculative work gets thrown away. The solution is to write code with no branches at all, using bit manipulation to handle all cases simultaneously. Van Rijn wrote a branchless temperature parser he was "certain nobody could beat." Then Quan Anh Nguyen, described in the talk as "just a random guy from Vietnam," showed up.

The thing that needs to be in a museum

Nguyen's temperature parser reads the entire value — sign, digits, decimal — as a single 64-bit long value and extracts the integer result using one multiplication with a magic constant. One. Multiplication. The dot's position in the byte sequence maps to either 12, 20, or 28 depending on the number's format; XOR with 28 gives you the required bit-shift amount; a carefully chosen multiplier collapses all the digit positions into a single integer in one operation with no overflow into adjacent bits. Van Rijn spent ten minutes explaining it and still called it genius by the end.
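I wanted to see it work, so here is a sketch that follows the description above. Take it as my reconstruction rather than a line-for-line copy of the submitted code: the names MagicParse, parseTenths, and loadWord are mine, and it assumes values between -99.9 and 99.9 with exactly one decimal digit, loaded little-endian.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MagicParse {
    // Parses a temperature held in the low bytes of a little-endian word
    // into tenths of a degree, with no branches and one multiplication.
    static long parseTenths(long word) {
        // ASCII digits have bit 4 set, '.' does not. Checking bit 4 of bytes 1..3
        // finds the dot at bit position 12, 20, or 28.
        int dotBit = Long.numberOfTrailingZeros(~word & 0x10101000);
        int shift = dotBit ^ 28;            // 16, 8, or 0; same as 28 - dotBit for these values
        // '-' has bit 4 clear, so this is -1 for negative input and 0 otherwise.
        long sign = (~word << 59) >> 63;
        long dropSign = ~(sign & 0xFF);     // clears the '-' byte when present
        // Line the digits up at fixed byte positions and strip the ASCII high nibbles.
        long digits = ((word & dropSign) << shift) & 0x0F000F0F00L;
        // 0x640a0001 = 100 * 2^24 + 10 * 2^16 + 1, so bits 32..41 of the product
        // hold hundreds * 100 + tens * 10 + units.
        long abs = ((digits * 0x640a0001) >>> 32) & 0x3FF;
        return (abs ^ sign) - sign;         // conditional negate without an if
    }

    // Helper for the demo: first eight bytes of the text, zero-padded.
    static long loadWord(String text) {
        byte[] padded = Arrays.copyOf(text.getBytes(StandardCharsets.US_ASCII), 8);
        return ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        System.out.println(parseTenths(loadWord("-12.3\n"))); // prints -123
    }
}
```

The "no overflow into adjacent bits" part comes down to two facts: the largest result, 999, fits in ten bits, and the next term up is a multiple of 100, whose two lowest bits are always zero, so it starts just above those ten bits.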

"This this deserves a spot in the — or maybe we should make like a museum with code. This should be in it," van Rijn says in the talk. He's not being hyperbolic.

The career turn that followed is something van Rijn mentions at the talk's close: Nguyen, apparently noticed by Oracle engineers, was later hired and relocated from Vietnam to Zurich to work on JVM internals. I want to be upfront: I searched LinkedIn and Oracle's engineering team pages trying to verify this and couldn't confirm it independently — the claim comes solely from van Rijn's talk. He presents it as settled fact, and the story is plausible enough that I'm reporting it with that caveat rather than dropping it. If it's even 80% true it's still completely unhinged — a bit-manipulation trick written for an internet contest becoming a one-way ticket to Zurich. (If anyone can confirm, my DMs are open.)

A branch miss costs 32 instructions. This one costs zero.

Late in the competition, van Rijn noticed something maddening in the perf data: branch mispredictions were killing performance. The culprit was a 50/50 split — about half of city names fit in 8 bytes, half needed 16. Coin flip. Worst case for a CPU predictor.

The fix: always parse 16 bytes. Even for 8-byte names. Just mask out the extra. This added roughly 18 instructions per row — 18% more work, on paper. But it eliminated nearly all branch mispredictions. And a single branch miss costs about 32 wasted instruction slots. Do that math a billion times.
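A sketch of what "always parse 16 bytes, mask out the extra" can look like. The details are my assumptions, not taken from the talk: I use small lookup tables for the masks, the names FixedWidthName, Key, and nameKey are invented, the name length is assumed to be already known (say, from the delimiter search), and names longer than 16 bytes would need a separate slow path.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class FixedWidthName {
    // LOW_MASK[n] keeps the part of the first 8-byte word that belongs to an n-byte name,
    // HIGH_MASK[n] does the same for the second word (n = 0..16).
    static final long[] LOW_MASK = new long[17];
    static final long[] HIGH_MASK = new long[17];
    static {
        // One-time table setup; the per-row code below has no branches.
        for (int n = 0; n <= 16; n++) {
            LOW_MASK[n] = n >= 8 ? -1L : (1L << (8 * n)) - 1;
            HIGH_MASK[n] = n <= 8 ? 0L : (n == 16 ? -1L : (1L << (8 * (n - 8))) - 1);
        }
    }

    // Hypothetical fixed-width key: the city name packed into two longs.
    record Key(long lo, long hi) {}

    // Always reads 16 bytes from the start of the row, whatever the name length,
    // then zeroes the bytes that belong to the delimiter and temperature.
    static Key nameKey(byte[] row, int nameLength) {
        ByteBuffer buf = ByteBuffer.wrap(row).order(ByteOrder.LITTLE_ENDIAN);
        long lo = buf.getLong(0) & LOW_MASK[nameLength];
        long hi = buf.getLong(8) & HIGH_MASK[nameLength];
        return new Key(lo, hi);
    }

    public static void main(String[] args) {
        byte[] a = "Oslo;1.5\nLima;20".getBytes(StandardCharsets.US_ASCII);
        byte[] b = "Oslo;-9.9\nParis;".getBytes(StandardCharsets.US_ASCII);
        // Same city, different trailing bytes: the masked keys are equal.
        System.out.println(nameKey(a, 4).equals(nameKey(b, 4))); // prints true
    }
}
```

A four-letter city and a twelve-letter city run exactly the same instructions, so the predictor has nothing to guess.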

"Everything gets executed and everything is fully predictable. That's what CPUs love," van Rijn says. The counterintuitive lesson: doing more work in a predictable pattern is faster than doing less work unpredictably. The pipeline cares more about what comes next than how much there is of it.

The kernel unmap hack (technically nobody's problem anymore)

Thomas Wuerthinger — who works on GraalVM and eventually won the contest — hit a wall where his biggest bottleneck was the OS kernel zeroing out the 16 GB memory-mapped file on program exit, a Linux security measure. His solution: immediately spawn a worker process to do the actual computation. The main process only waits for piped output from the worker, records the time when the pipe closes, and exits. The worker is still zeroing memory in the background — but the clock has already stopped. Morling said it was fine. Everybody copied it immediately.

Where it ended up

Final result: 1.5 seconds. A 16 GB file, one billion rows, processed start to finish in one and a half seconds. The baseline was four minutes and fifty seconds.

When van Rijn closed the talk I had this moment of sitting back and genuinely not knowing what to do with that number. Not because it's impressive in a press-release way — it's impressive in a way that makes you feel slightly implicated. Every layer of optimization in this story existed before the contest. Memory-mapped files: existed. SWAR: existed. Branchless programming: existed. Epsilon GC: existed. The knowledge was all there, documented, accessible. What the 1 Billion Row Challenge did was create a reason to actually use it.

That gap, between knowing a technique exists and caring enough to reach for it, is most of software performance in practice. Van Rijn didn't go into this contest knowing all of it. He learned SWAR during the contest. He learned about instruction-level parallelism (ILP) from Wuerthinger ten months later, at a different conference. The contest was the forcing function.

So: does any of this translate? Not the sun.misc.Unsafe calls, obviously. Not the epsilon GC in anything that runs longer than one task. But the muscle memory of looking at a problem and asking what is the machine actually doing here — that part transfers everywhere. The contest just made a lot of developers actually develop it.

Zara Chen is a tech and politics correspondent for Buzzrag.