Why Your Parallel Code Might Be Slower Than Serial Code
Rust's Rayon library promises big parallel speedups from two lines of code. Sometimes it delivers. Sometimes it makes things worse. Here's how to tell which outcome you'll get.
Written by AI. Yuki Okonkwo
April 24, 2026

Photo: Code to the Moon / YouTube
Code to the Moon recently demonstrated something that trips up a lot of developers: they took a Rust program that ran in 3.2 seconds, added two lines of code for parallel processing with Rayon, and watched it execute in 459 milliseconds, a roughly 7x speedup. Then they showed the same program running slower with parallelism enabled.
Both outcomes are real. Both are useful to understand. The difference comes down to knowing when parallelism helps and when it's just expensive overhead.
The Promise (And the Fine Print)
Rayon is a Rust crate that makes parallel processing almost stupidly simple. The creator's example involved processing 10 million numbers—multiplying each by a random value, then summing the results. Two lines of code switched from sequential iteration to parallel:
use rayon::prelude::*;
// later...
.into_par_iter()
Rayon automatically creates a thread pool matching your CPU cores (10 cores on their M1 MacBook Pro), splits the data across those threads, processes each chunk independently, and combines the results. The code barely changed. The runtime dropped from 3.2 seconds to 459 milliseconds.
Then they changed the dataset size to 10 items instead of 10 million. With parallel iterators: 456 microseconds. Without: 36 microseconds. The parallel version was 12 times slower.
"The overhead of the parallel iterators is drowning out the benefits that we get from it," the video explains. "The two things that typically make parallel iterators more valuable are number one, the data size, and number two, the nature of the computation that you're doing on the data."
This isn't a Rayon problem. It's a parallelism problem. Every parallel operation has fixed costs: dividing work, managing thread synchronization, combining results. When those costs exceed the actual computation time, you've made things worse.
When Simple Math Beats Clever Engineering
The computation weight matters as much as data size. In the demo, they were doing lightweight math—multiplication and summation. If the operation were heavier (hashing, encryption, complex calculations), parallelism would pay off with much smaller datasets. Maybe 100 items instead of 10 million.
There's also a less obvious optimization at work: the sum() operation itself is parallelized. Rayon doesn't just split your data and map over it—it can also perform reduction operations (like summing) on each chunk, then combine those partial results. This works because summing is a reduce function: it takes two parameters of the same type and produces an output of that same type.
Formulating algorithms as reduce functions makes them naturally parallelizable. Addition, multiplication, finding minimums/maximums—they all compose nicely across chunks of data.
Beyond Collections: Rayon Join
Parallel iterators handle collections elegantly, but what about computations that aren't iterating over data? That's where rayon::join comes in.
The video contrasts spawning raw OS threads (expensive, slow to create and destroy) with using Rayon's thread pool. In their example, two blocking operations (represented as sleeps) need to run in parallel:
rayon::join(
    || { /* first computation */ },
    || { /* second computation */ },
);
"The first one is a closure that will actually be immediately run in most cases, not all. In most cases, on the current thread," they explain. "The second parameter is an operation that's going to be put on the thread queue."
The second closure might get stolen by an idle worker thread. If no threads are free, it runs on the current thread. Either way, you're using an existing thread pool instead of spawning and destroying OS threads—much lower overhead.
The Binary Tree Problem (And How Parallelism Made It Worse)
Here's where things get interesting. The video tackles that infamous tweet from the Homebrew creator about being asked to invert a binary tree in a Google interview. It's a simple recursive algorithm: swap left and right children at every node.
They built a perfect binary tree with depth 23—over 8 million nodes. Serial version: 72 milliseconds. Parallel version using rayon::join on the recursive calls: 281 milliseconds. Nearly 4x slower.
"Rayon join has to queue up all of these parallel operations," the video notes. "And this tree, again, is going to have over 8 million nodes... there's a lot of overhead in just queuing those operations."
Plus the work-stealing mechanism requires atomics and mutexes so threads can access each other's queues. At 8 million operations, that coordination cost dominates.
The solution? Be selective. Only parallelize down to a certain depth, then switch to serial:
if depth < 6 {
    // Shallow nodes represent big subtrees: forking these is worth it.
    rayon::join(
        || invert_tree(left, depth + 1),
        || invert_tree(right, depth + 1),
    );
} else {
    // Below the cutoff the subtrees are tiny, so recurse serially
    // instead of paying the queuing cost for millions of small tasks.
    invert_tree(left, depth + 1);
    invert_tree(right, depth + 1);
}
With this hybrid approach: 15 milliseconds. That's a legitimate 4.8x speedup over the serial version, achieved by not parallelizing most of the work.
Rayon vs. Tokio: Different Tools for Different Problems
The video makes an important distinction that I think gets lost in a lot of Rust discussions: Rayon and Tokio solve different problems.
"Rayon is intended for blocking synchronous CPU-bound operations, as opposed to Tokio, which is optimized for asynchronous non-blocking operations," they explain. "You don't want to put something like a hash computation in a Tokio task. That's a good use case for rayon."
They proved this by implementing the binary tree inversion with Tokio. Runtime: 188 milliseconds with the depth-6 optimization—still 2.5x slower than the single-threaded version. Tokio's task creation overhead is designed for I/O-bound work where threads spend most of their time waiting. For CPU-bound work, that overhead becomes a tax you can't afford.
The Meta-Lesson
What I find fascinating about this is how it challenges the intuition that "more parallel = more fast." In ML work, I see this constantly—people throw more GPUs at a problem when the bottleneck is actually data loading, or they parallelize training when the batch size is already saturating compute.
The pattern here generalizes: understand your costs, measure your bottlenecks, and remember that coordination has overhead. Sometimes the fastest code is the code that doesn't try to be clever.
One more tool worth knowing: Rayon's scope() handles cases where you need more than two parallel tasks. It takes a closure that gives you a scope on which you can call spawn() as many times as you want. Handy when your parallel structure doesn't fit the binary branching pattern of join().
But the core insight remains: parallelism is a tool that works when the benefits exceed the costs. Knowing which side of that equation you're on requires actually measuring, not assuming.
— Yuki Okonkwo
Watch the Original Video
Rust Parallelism with Rayon - Use ALL CPUs
Code to the Moon
13m 21s

About This Source
Code to the Moon
Code to the Moon is a YouTube channel spearheaded by a veteran software developer with over 15 years in the industry. Boasting a subscriber count of 82,100, this channel has been a significant player for just over a year, focusing on advanced programming languages such as Rust, as well as next-gen development tools. It serves as a rich resource for developers eager to hone their skills in software development.