How Matrix Multiplication Goes from Slow to 180 Gigaflops
Engineer Aliaksei Sala shows how to optimize matrix multiplication in C++ from naive to peak performance using cache blocking, SIMD, and clever tricks.
Written by Yuki Okonkwo
February 26, 2026

Photo: CppCon / YouTube
Here's something wild: the difference between a naive matrix multiplication implementation and an optimized one isn't 10% or even 2x. We're talking about performance improvements measured in orders of magnitude—like going from taking your sweet time to absolutely screaming through calculations.
Aliaksei Sala, a lead software engineer at EPAM Systems, walked through this journey at CppCon 2025, and honestly? The progression is kind of mind-blowing. Not because it's using some exotic hardware or unrealistic setup, but because it's doing what we're supposedly already doing—just doing it correctly.
Why This Matters (Hint: It's About AI)
Before we dive into the optimization rabbit hole, let's establish why anyone should care about making matrix multiplication faster. Sala points out that in large language models—you know, the things powering ChatGPT and friends—somewhere between 70-90% of compute time is spent on matrix multiplication. Self-attention mechanisms? Matrix multiplication. Feed-forward networks? More matrix multiplication.
The compute requirements have grown absurdly fast. AlexNet needed 3 teraflops back in the day. Now we're talking exaflops, heading toward zettaflops. "According to OpenAI data, the compute which is required to train a new model doubles every four months," Sala notes. When your computational demands are doubling that fast, squeezing every drop of performance from your hardware stops being academic and starts being existential.
The Naive Baseline (Spoiler: It's Bad)
Sala started with what you'd find in a lot of GitHub repos—a straightforward triple nested loop. Take every row of matrix A, multiply it by every column of matrix B, store the result in matrix C. Clean, readable, intuitive.
And slow as hell: 1052 seconds for his test case. His admittedly older Intel i5 (Haswell architecture) has a theoretical peak of about 180 gigaflops—that was the target to aim for. The naive version? Not even close.
Optimization #1: Just... Change the Loop Order
Here's where it gets interesting. The first optimization wasn't adding anything fancy. Sala just changed the loop order from i-j-k to i-k-j.
That's it. That one change gave him a 12.6x speedup.
Why? Because of how memory actually works. When you access matrix elements with huge strides (jumping around in memory), you're essentially fighting against everything your CPU is trying to do for you. You waste cache lines—when the CPU loads one element, it pulls in neighboring elements too, but if you don't use them, you've just wasted bandwidth. You also disable the hardware prefetcher, which is trying to predict what data you'll need next.
Change the loop order so you're accessing data sequentially, and suddenly your CPU can actually help you.
Cache Blocking (Tiling): Work Smarter, Not Harder
Next up: instead of working element by element, divide your matrices into blocks and multiply block by block. This optimization tripled performance again.
The logic is beautiful: if you're going to use a chunk of data from matrix A with multiple chunks from matrix B, keep that A chunk in fast cache memory and reuse it. Don't keep fetching it from slow main memory over and over like some kind of amnesia patient.
SIMD: Do Four Things at Once
Then came vectorization using SIMD (Single Instruction, Multiple Data) instructions. Instead of adding four numbers with four separate operations, load them all into vector registers and do one operation.
Interestingly, the compiler was already trying to do this automatically. But when Sala did it manually, he noticed something: the compiler wasn't using all the CPU's vector registers. His manual implementation used all 16; the compiler's version used three or four.
Manual vectorization gave another 1.23x speedup. Not massive, but when you're chasing peak performance, you take what you can get.
Getting Cache-Aware: Memory Hierarchy Is Real
This is where things get properly technical. Different levels of cache (L1, L2, L3) have different sizes and speeds. Sala calculated optimal tile sizes based on what would fit in each cache level—keeping the most frequently accessed data (matrix B tiles) in L1 cache, blocks of matrix A in L2, and so on.
Implementing cache-aware blocking meant adding more nested loops, but it paid off. Then he gave the compiler hints about register usage by manually allocating specific numbers of registers for different purposes (12 for C elements, 3 for B elements, 1 for A). That nearly doubled performance again.
Multi-Threading: Finally Using All Those Cores
Up to this point, everything ran on a single core. Sala tried various approaches to parallelize the work—different loop levels, thread pinning strategies. In the end, the simplest approach worked fine: split matrix C by rows and give each thread some rows to handle.
On his four-core CPU, this tripled performance. Math checks out.
The Modern C++ Twist
One of the cooler moments in Sala's talk: he replaced all the "scary intrinsics functions" (his words) with C++26's std::simd library. Just swapped in the standard library types and regular operators like addition and multiplication.
"The most amazing part, it just works," he said. "It's like a zero overhead abstraction. The performance remained the same and just works. Really love it."
That's... actually kind of huge? Writing vectorized code without drowning in architecture-specific intrinsics means this stuff becomes accessible to more people.
What The Assembly Reveals
Throughout the talk, Sala kept checking the generated assembly code. This revealed when the compiler was helping (automatic loop unrolling) and when it was getting in the way (not using all registers). The gap between what compilers can do and what they actually do is fascinating—and sometimes frustrating.
For the final optimization, he used C++ template metaprogramming (parameter packs and fold expressions) to generate unrolled loops at compile time. Instead of manually writing out repetitive code or losing flexibility, let the compiler generate the exact code you need based on constants.
The Bigger Picture
What Sala demonstrated isn't just "here's how to make matrix multiplication fast." It's a masterclass in understanding the gap between what we think our code does and what actually happens on hardware.
Every optimization built on understanding one more layer: memory access patterns, cache hierarchies, SIMD lanes, register allocation, thread scheduling. The naive version wasn't just "not optimized"—it was actively fighting against how modern CPUs work.
And here's the thing: this isn't niche knowledge anymore. When AI training costs are doubling every four months, when inference needs to happen in real-time, when edge devices need to run models locally—this stuff matters. The engineers who understand these optimizations aren't just making things incrementally faster. They're determining what's computationally feasible at all.
Sala's self-described motivation was simple: "I've already paid for my CPU, so I kind of must use it." Fair enough. But the journey from naive to optimized isn't just about extracting value from hardware you own. It's about understanding the massive gap between theoretical performance and what you actually get—and learning exactly which techniques close that gap.
The compute demands aren't slowing down. The question is whether we're going to meet them by throwing more hardware at the problem, or by actually learning to use what we have.
Yuki Okonkwo is Buzzrag's AI & Machine Learning Correspondent
Watch the Original Video
Matrix Multiplication Deep Dive || Cache Blocking, SIMD & Parallelization - Aliaksei Sala - CppCon
CppCon
59m 3s
About This Source
CppCon
CppCon is a YouTube channel serving as a vital educational hub for C++ programming enthusiasts and professionals. With a subscriber base of 175,000, the channel offers a wealth of knowledge through recordings of sessions from its annual conferences, active since 2014. CppCon is a go-to resource for those looking to deepen their understanding of C++ and related programming concepts.