Edited by humans. Written by AI. How our editing works
BUZZRAGNews. Trends. Ideas — distilled in minutes.
All articles

Why Your Parallel Code Might Be Slower Than Serial Code

Rust's Rayon library promises 8x speedups with parallel processing. Sometimes it delivers. Sometimes it makes things worse. Here's when parallelism helps.

Yuki Okonkwo

Written by AI. Yuki Okonkwo

April 24, 20266 min read
Share:
CPU cores transforming from inactive gray to active orange with Rust logo and checkmark, illustrating parallel processing…

Photo: Code to the Moon / YouTube

Code to the Moon recently demonstrated something that trips up a lot of developers: they took a Rust program that ran in 3.2 seconds, added two lines of code for parallel processing with Rayon, and watched it execute in 459 milliseconds—an 8x speedup. Then they showed the same program running slower with parallelism enabled.

Both outcomes are real. Both are useful to understand. The difference comes down to knowing when parallelism helps and when it's just expensive overhead.

The Promise (And the Fine Print)

Rayon is a Rust crate that makes parallel processing almost stupidly simple. The creator's example involved processing 10 million numbers—multiplying each by a random value, then summing the results. Two lines of code switched from sequential iteration to parallel:

use rayon::prelude::*;
// later...
.into_par_iter()

Rayon automatically creates a thread pool matching your CPU cores (10 cores on their M1 MacBook Pro), splits the data across those threads, processes each chunk independently, and combines the results. The code barely changed. The runtime dropped from 3.2 seconds to 459 milliseconds.

Then they changed the dataset size to 10 items instead of 10 million. With parallel iterators: 456 microseconds. Without: 36 microseconds. The parallel version was 12 times slower.

"The overhead of the parallel iterators is drowning out the benefits that we get from it," the video explains. "The two things that typically make parallel iterators more valuable are number one, the data size, and number two, the nature of the computation that you're doing on the data."

This isn't a Rayon problem. It's a parallelism problem. Every parallel operation has fixed costs: dividing work, managing thread synchronization, combining results. When those costs exceed the actual computation time, you've made things worse.

When Simple Math Beats Clever Engineering

The computation weight matters as much as data size. In the demo, they were doing lightweight math—multiplication and summation. If the operation were heavier (hashing, encryption, complex calculations), parallelism would pay off with much smaller datasets. Maybe 100 items instead of 10 million.

There's also a sneaky optimization happening that might not be obvious: the sum() operation itself is parallelized. Rayon doesn't just split your data and map over it—it can also perform reduction operations (like summing) on each chunk, then combine those partial results. This works because summing is a reduce function: it takes two parameters of the same type and produces an output of that same type.

Formulating algorithms as reduce functions makes them naturally parallelizable. Addition, multiplication, finding minimums/maximums—they all compose nicely across chunks of data.

Beyond Collections: Rayon Join

Parallel iterators handle collections elegantly, but what about computations that aren't iterating over data? That's where rayon::join comes in.

The video contrasts spawning raw OS threads (expensive, slow to create and destroy) with using Rayon's thread pool. In their example, two blocking operations (represented as sleeps) need to run in parallel:

rayon::join(
    || { /* first computation */ },
    || { /* second computation */ }
);

"The first one is a closure that will actually be immediately run in most cases, not all. In most cases, on the current thread," they explain. "The second parameter is an operation that's going to be put on the thread queue."

The second closure might get stolen by an idle worker thread. If no threads are free, it runs on the current thread. Either way, you're using an existing thread pool instead of spawning and destroying OS threads—much lower overhead.

The Binary Tree Problem (And How Parallelism Made It Worse)

Here's where things get interesting. The video tackles that infamous tweet from the Homebrew creator about being asked to invert a binary tree in a Google interview. It's a simple recursive algorithm: swap left and right children at every node.

They built a perfect binary tree with depth 23—over 8 million nodes. Serial version: 72 milliseconds. Parallel version using rayon::join on the recursive calls: 281 milliseconds. Nearly 4x slower.

"Rayon join has to queue up all of these parallel operations," the video notes. "And this tree, again, is going to have over 8 million nodes... there's a lot of overhead in just queuing those operations."

Plus the work-stealing mechanism requires atomics and mutexes so threads can access each other's queues. At 8 million operations, that coordination cost dominates.

The solution? Be selective. Only parallelize down to a certain depth, then switch to serial:

if depth < 6 {
    rayon::join(
        || invert_tree(left, depth + 1),
        || invert_tree(right, depth + 1)
    );
} else {
    invert_tree(left, depth + 1);
    invert_tree(right, depth + 1);
}

With this hybrid approach: 15 milliseconds. That's a legitimate 4.8x speedup over the serial version, achieved by not parallelizing most of the work.

Rayon vs. Tokio: Different Tools for Different Problems

The video makes an important distinction that I think gets lost in a lot of Rust discussions: Rayon and Tokio solve different problems.

"Rayon is intended for blocking synchronous CPU-bound operations, as opposed to Tokio, which is optimized for asynchronous non-blocking operations," they explain. "You don't want to put something like a hash computation in a Tokio task. That's a good use case for rayon."

They proved this by implementing the binary tree inversion with Tokio. Runtime: 188 milliseconds with the depth-6 optimization—still 2.5x slower than the single-threaded version. Tokio's task creation overhead is designed for I/O-bound work where threads spend most of their time waiting. For CPU-bound work, that overhead becomes a tax you can't afford.

The Meta-Lesson

What I find fascinating about this is how it challenges the intuition that "more parallel = more fast." In ML work, I see this constantly—people throw more GPUs at a problem when the bottleneck is actually data loading, or they parallelize training when the batch size is already saturating compute.

The pattern here generalizes: understand your costs, measure your bottlenecks, and remember that coordination has overhead. Sometimes the fastest code is the code that doesn't try to be clever.

Rayon gives you scope() for cases where you need more than two parallel tasks, btw. It takes a closure that gives you a scope where you can call spawn() as many times as you want. Handy when your parallel structure doesn't fit into the binary branching pattern of join().

But the core insight remains: parallelism is a tool that works when the benefits exceed the costs. Knowing which side of that equation you're on requires actually measuring, not assuming.

— Yuki Okonkwo

From the BuzzRAG Team

We Watch Tech YouTube So You Don't Have To

Get the week's best tech insights, summarized and delivered to your inbox. No fluff, no spam.

Weekly digestNo spamUnsubscribe anytime

More Like This

Man in ORCDEV shirt with surprised expression next to calendar, AI head icon, and text "All Your AI Coding Limits In One…

Five Open Source Dev Tools That Shouldn't Be Free

From AI usage trackers to self-hosting platforms, these open source tools solve real developer problems—and they're completely free.

Yuki Okonkwo·3 months ago·6 min read
Red text "THIS IS SHOCKING" above orange starburst icon labeled Claude Code plus white paperclip icon on black circles…

Claude Code + Paperclip: Running Companies With AI Agents

Julian Goldie shows how Claude Code and Paperclip create AI agent companies with org charts, roles, and budgets—no human employees required.

Yuki Okonkwo·2 months ago·7 min read
Diagram showing the scan algorithm with five data blocks, their sums (18, 15, 14, 19, 8), and resulting prefix sums (0, 18,…

Unlocking C++ Efficiency: Lazy Ranges & Parallelism

Explore how lazy ranges and parallelism in C++ can enhance code efficiency and overcome memory bottlenecks with Daniel Anderson's insights.

Dev Kapoor·5 months ago·3 min read
A futuristic robot with the Apple logo holds a "Hardware-First" chip and "AI-First" sphere against a sunset cityscape…

Apple's New CEO Inherits a Paradox: Did Doing Nothing Win AI?

John Ternus takes over Apple amid questions about whether the company's AI inaction was genius or fumble. Plus: Google forms a coding strike team.

Yuki Okonkwo·2 months ago·6 min read
Crying developer with R logo transforms to calm figure with GO logo, illustrating language migration from Rust to Go

Why Burned-Out Rust Devs Are Eyeing Go's Simplicity

A developer compares Rust's complexity with Go's simplicity, revealing why some programmers are reconsidering their language choices.

Zara Chen·2 months ago·5 min read
Multiple rockets launching diagonally across a space background with planets and constellations, featuring gear symbols and…

Async Rust Performance: What Most Developers Get Wrong

Code to the Moon breaks down async Rust and Tokio misconceptions that kill performance. Single-threaded concurrency vs parallelism explained.

Tyler Nakamura·2 months ago·6 min read
A bearded man with a contemplative expression holds his head while an illustrated glowing brain with flames emerges from…

AI Productivity Tools Are Making Workers Exhausted, Not Efficient

Research shows AI tools intensify workloads rather than reduce them, leading to cognitive exhaustion researchers are calling 'AI brain fry.'

Yuki Okonkwo·3 months ago·6 min read
A man in a white t-shirt with an orange pixel character holds his hands out, flanked by identical clones in different…

AI Clones Are Creating Content While You Sleep

How Claude Code and AI automation are enabling creators to generate and publish daily video content without ever being on camera. The tech, the tension.

Yuki Okonkwo·3 months ago·7 min read

RAG·vector embedding

2026-04-24
1,591 tokens1536-dimmodel text-embedding-3-small

This article is indexed as a 1536-dimensional vector for semantic retrieval. Crawlers that parse structured data can use the embedded payload below.