Why Your Parallel Code Might Be Slower Than Serial Code
Rust's Rayon library promises big parallel speedups from two lines of code. Sometimes it delivers. Sometimes it makes things worse. Here's how to tell which outcome you'll get.
Written by AI. Yuki Okonkwo
April 24, 2026

Photo: Code to the Moon / YouTube
Code to the Moon recently demonstrated something that trips up a lot of developers: they took a Rust program that ran in 3.2 seconds, added two lines of code for parallel processing with Rayon, and watched it execute in 459 milliseconds, a roughly 7x speedup. Then they showed the same program running slower with parallelism enabled.
Both outcomes are real. Both are useful to understand. The difference comes down to knowing when parallelism helps and when it's just expensive overhead.
The Promise (And the Fine Print)
Rayon is a Rust crate that makes parallel processing almost stupidly simple. The creator's example involved processing 10 million numbers—multiplying each by a random value, then summing the results. Two lines of code switched from sequential iteration to parallel:
use rayon::prelude::*;
// later...
.into_par_iter()
Rayon automatically creates a thread pool matching your CPU cores (10 cores on their M1 MacBook Pro), splits the data across those threads, processes each chunk independently, and combines the results. The code barely changed. The runtime dropped from 3.2 seconds to 459 milliseconds.
Then they changed the dataset size to 10 items instead of 10 million. With parallel iterators: 456 microseconds. Without: 36 microseconds. The parallel version was 12 times slower.
"The overhead of the parallel iterators is drowning out the benefits that we get from it," the video explains. "The two things that typically make parallel iterators more valuable are number one, the data size, and number two, the nature of the computation that you're doing on the data."
This isn't a Rayon problem. It's a parallelism problem. Every parallel operation has fixed costs: dividing work, managing thread synchronization, combining results. When those costs exceed the actual computation time, you've made things worse.
When Simple Math Beats Clever Engineering
The computation weight matters as much as data size. In the demo, they were doing lightweight math—multiplication and summation. If the operation were heavier (hashing, encryption, complex calculations), parallelism would pay off with much smaller datasets. Maybe 100 items instead of 10 million.
There's also a less obvious optimization at work: the sum() operation itself is parallelized. Rayon doesn't just split your data and map over it—it can also perform reduction operations (like summing) on each chunk, then combine those partial results. This works because summing is a reduce function: it takes two parameters of the same type and produces an output of that same type.
Formulating algorithms as reduce functions makes them naturally parallelizable. Addition, multiplication, finding minimums/maximums—they all compose nicely across chunks of data.
Beyond Collections: Rayon Join
Parallel iterators handle collections elegantly, but what about computations that aren't iterating over data? That's where rayon::join comes in.
The video contrasts spawning raw OS threads (expensive, slow to create and destroy) with using Rayon's thread pool. In their example, two blocking operations (represented as sleeps) need to run in parallel:
rayon::join(
    || { /* first computation */ },
    || { /* second computation */ },
);
"The first one is a closure that will actually be immediately run in most cases, not all. In most cases, on the current thread," they explain. "The second parameter is an operation that's going to be put on the thread queue."
The second closure might get stolen by an idle worker thread. If no threads are free, it runs on the current thread. Either way, you're using an existing thread pool instead of spawning and destroying OS threads—much lower overhead.
The Binary Tree Problem (And How Parallelism Made It Worse)
Here's where things get interesting. The video tackles that infamous tweet from the Homebrew creator about being asked to invert a binary tree in a Google interview. It's a simple recursive algorithm: swap left and right children at every node.
They built a perfect binary tree with depth 23—over 8 million nodes. Serial version: 72 milliseconds. Parallel version using rayon::join on the recursive calls: 281 milliseconds. Nearly 4x slower.
"Rayon join has to queue up all of these parallel operations," the video notes. "And this tree, again, is going to have over 8 million nodes... there's a lot of overhead in just queuing those operations."
Plus the work-stealing mechanism requires atomics and mutexes so threads can access each other's queues. At 8 million operations, that coordination cost dominates.
The solution? Be selective. Only parallelize down to a certain depth, then switch to serial:
if depth < 6 {
    // Shallow nodes represent big subtrees: forking these is worth it.
    rayon::join(
        || invert_tree(left, depth + 1),
        || invert_tree(right, depth + 1),
    );
} else {
    // Below the cutoff the subtrees are tiny, so recurse serially
    // instead of paying the queuing cost for millions of small tasks.
    invert_tree(left, depth + 1);
    invert_tree(right, depth + 1);
}
With this hybrid approach: 15 milliseconds. That's a legitimate 4.8x speedup over the serial version, achieved by not parallelizing most of the work.
Rayon vs. Tokio: Different Tools for Different Problems
The video makes an important distinction that I think gets lost in a lot of Rust discussions: Rayon and Tokio solve different problems.
"Rayon is intended for blocking synchronous CPU-bound operations, as opposed to Tokio, which is optimized for asynchronous non-blocking operations," they explain. "You don't want to put something like a hash computation in a Tokio task. That's a good use case for rayon."
They proved this by implementing the binary tree inversion with Tokio. Runtime: 188 milliseconds with the depth-6 optimization—still 2.5x slower than the single-threaded version. Tokio's task creation overhead is designed for I/O-bound work where threads spend most of their time waiting. For CPU-bound work, that overhead becomes a tax you can't afford.
The Meta-Lesson
What I find fascinating about this is how it challenges the intuition that "more parallel = more fast." In ML work, I see this constantly—people throw more GPUs at a problem when the bottleneck is actually data loading, or they parallelize training when the batch size is already saturating compute.
The pattern here generalizes: understand your costs, measure your bottlenecks, and remember that coordination has overhead. Sometimes the fastest code is the code that doesn't try to be clever.
One more tool worth knowing: Rayon's scope() handles cases where you need more than two parallel tasks. It takes a closure that gives you a scope on which you can call spawn() as many times as you want. Handy when your parallel structure doesn't fit the binary branching pattern of join().
But the core insight remains: parallelism is a tool that works when the benefits exceed the costs. Knowing which side of that equation you're on requires actually measuring, not assuming.
— Yuki Okonkwo
Watch the Original Video
Rust Parallelism with Rayon - Use ALL CPUs
Code to the Moon
13m 21s

About This Source
Code to the Moon
Code to the Moon is a YouTube channel spearheaded by a veteran software developer with over 15 years in the industry. Boasting a subscriber count of 82,100, this channel has been a significant player for just over a year, focusing on advanced programming languages such as Rust, as well as next-gen development tools. It serves as a rich resource for developers eager to hone their skills in software development.