Three Hours of Debugging a File Compressor in C

There's something almost meditative about watching someone write a file compressor from scratch in C over the course of three and a half hours. Dr. Jonas Birch's latest coding session—the third in his series on building a custom compression tool—strips away the mythology of programming to reveal something more honest: a lot of muttering, strategic guessing about memory allocation, and the kind of careful problem-solving that looks boring until you realize how difficult it actually is.

The session picks up where previous episodes left off. Birch has already parsed files into two lists—one containing repeating 32-bit blocks, another holding mostly unique ones. Now comes the mechanically interesting part: replacing those 32-bit blocks with 16-bit indices to achieve actual compression. It's the kind of optimization that seems straightforward in theory but requires thinking through a dozen interconnected edge cases.

The Architecture of Small Decisions

Birch starts by building a file header structure—a chunk of metadata that will live at the beginning of every compressed file. It needs to store the original file size, the lengths of both block lists, a three-byte "magic string" to identify the file type, and a version number. "We know the version of the protocol in case we want to improve it later on," he explains, already thinking about maintenance before the first version even works.
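To make the header concrete, here is a minimal sketch of what such a structure could look like. The field names, the field widths, and the "JBC" magic string are all assumptions for illustration; the video's actual layout may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values: the real magic string and version are not given. */
#define HEADER_MAGIC   "JBC"
#define HEADER_VERSION 1u

/* Hypothetical layout; names and widths are assumptions. */
struct file_header {
    char     magic[3];       /* three-byte file-type identifier */
    uint8_t  version;        /* format version, for future revisions */
    uint32_t original_size;  /* size of the uncompressed file in bytes */
    uint32_t repeating_len;  /* entries in the repeating-block list */
    uint32_t unique_len;     /* entries in the unique-block list */
};

/* Fill in a header for a freshly compressed file. */
static void header_init(struct file_header *h, uint32_t original_size,
                        uint32_t repeating_len, uint32_t unique_len)
{
    assert(h != NULL);
    memcpy(h->magic, HEADER_MAGIC, sizeof h->magic);
    h->version = HEADER_VERSION;
    h->original_size = original_size;
    h->repeating_len = repeating_len;
    h->unique_len = unique_len;
}

/* Return 1 if the header carries the expected magic and version. */
static int header_is_valid(const struct file_header *h)
{
    assert(h != NULL);
    return memcmp(h->magic, HEADER_MAGIC, sizeof h->magic) == 0
        && h->version == HEADER_VERSION;
}
```

Checking the magic string and version before reading anything else is what lets a later version of the tool reject, or specially handle, files written by an older one. Writing the struct to disk as raw bytes would also bake in the host's byte order and padding, so a portable format would likely serialize the fields one at a time.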

This is characteristic of the entire session: anticipating problems that haven't happened yet, not out of paranoia but pragmatism. When calculating maximum memory allocation, Birch reasons through worst-case scenarios. If every block turns out to be unique, the compressed file could actually reach 1.5 times the size of the original, plus headers, presumably because each 32-bit block is still stored in full and also needs a 16-bit index. So he allocates for that, knowing he'll shrink it later. It's defensive programming that acknowledges compression isn't guaranteed; sometimes data is just incompressible.
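The worst-case arithmetic can be sketched as follows. This assumes that an all-unique file costs 4 bytes for each stored block plus 2 bytes for its index, which is one plausible reading of the 1.5x figure. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_BYTES 4u  /* one 32-bit block */
#define INDEX_BYTES 2u  /* one 16-bit index */

/* Upper bound on output size: header plus 6 bytes per 4-byte input block. */
static size_t worst_case_bytes(size_t original_size, size_t header_size)
{
    assert(original_size <= SIZE_MAX - (BLOCK_BYTES - 1u));
    /* Round up so a trailing partial block still gets a slot. */
    size_t blocks = (original_size + BLOCK_BYTES - 1u) / BLOCK_BYTES;
    assert(blocks <= (SIZE_MAX - header_size) / (BLOCK_BYTES + INDEX_BYTES));
    return header_size + blocks * (BLOCK_BYTES + INDEX_BYTES);
}

/* Shrink the buffer to the bytes actually used once compression is done.
   If realloc fails, the original (larger) buffer is still valid. */
static uint8_t *shrink_to_fit(uint8_t *buf, size_t used)
{
    uint8_t *smaller = realloc(buf, used ? used : 1u);
    return smaller ? smaller : buf;
}
```

For a 1000-byte input and a 16-byte header, the bound works out to 16 + 250 * 6 = 1516 bytes. Allocating the bound up front trades some temporary memory for never having to grow the buffer in the middle of the compression loop.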

The memory management reveals C's peculiar trade-offs. Modern languages handle allocation automatically, which is convenient until you need precise control over where data lives and how long it persists. Birch is working directly with pointers, manually advancing through memory in 16-bit and 32-bit increments, tracking multiple offset values simultaneously. One mistake—a pointer advanced by the wrong number of bytes—and the whole structure collapses.
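A small sketch of that cursor bookkeeping follows. It uses byte pointers and memcpy rather than typed 16-bit and 32-bit pointers, which is a substitution made here to sidestep alignment problems and is not necessarily how the video does it. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read one 32-bit block at *in, then advance the read cursor by 4 bytes. */
static uint32_t read_block(const uint8_t **in)
{
    uint32_t block;
    assert(in != NULL && *in != NULL);
    memcpy(&block, *in, sizeof block);  /* host byte order, any alignment */
    *in += sizeof block;
    return block;
}

/* Write one 16-bit value at *out, then advance the write cursor by 2 bytes. */
static void write_index(uint8_t **out, uint16_t index)
{
    assert(out != NULL && *out != NULL);
    memcpy(*out, &index, sizeof index);
    *out += sizeof index;
}
```

Tying each advance to `sizeof` of the value just read or written keeps the two cursors from drifting apart, which is the failure mode described above.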

When Compression Gets Complicated

The core compression logic hinges on a search function that locates blocks within the two lists. This is where the algorithm's cleverness emerges. If a block appears in the "repeating" list, Birch's code doesn't just replace it with an index—it counts consecutive occurrences and packs both the index and the repeat count into a single 16-bit value. Twelve bits for the index, four bits for the repeat counter.
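The 12/4 split can be sketched with a few shifts and masks. Which end of the 16-bit value holds the counter is an assumption here (the sketch puts it in the top four bits), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS  12
#define INDEX_MASK  0x0FFFu  /* low 12 bits: list index, 0..4095 */
#define REPEAT_MAX  0x0Fu    /* 4 bits: repeat counter, 0..15 */

/* Pack a 12-bit index and a 4-bit repeat count into one 16-bit value. */
static uint16_t pack_entry(uint16_t index, uint8_t repeats)
{
    assert(index <= INDEX_MASK);
    assert(repeats <= REPEAT_MAX);
    return (uint16_t)(((unsigned)repeats << INDEX_BITS) | index);
}

/* Recover the index from a packed entry. */
static uint16_t entry_index(uint16_t entry)
{
    return (uint16_t)(entry & INDEX_MASK);
}

/* Recover the repeat count from a packed entry. */
static uint8_t entry_repeats(uint16_t entry)
{
    return (uint8_t)(entry >> INDEX_BITS);
}
```

For example, packing index 0x0ABC with a repeat count of 5 gives 0x5ABC. The cost of the split is that the repeating list can address at most 4096 entries and a single entry can record at most 15 repeats.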

"By doing this we can house both the repeat counter and the index in a 16 bit thing," Birch notes, implementing what's essentially a tiny compression scheme within the larger compression scheme. It's efficient but introduces complexity. The code has to distinguish between blocks that appear once (simple replacement) and blocks that repeat (count them, pack the data, advance the pointer correctly).

This is where the video gets genuinely tense, at least if you've ever debugged anything. "I really hope this works because it will be a nightmare to debug," Birch says while writing the repeat-counting loop. He's not being dramatic—tracking down errors in pointer arithmetic often means stepping through memory states that differ from your mental model in subtle, catastrophic ways.

The function grows until it is "several screens long," handling edge cases: What if you're counting repeats near the end of the file and advance past it? What if a block appears in neither list? (That shouldn't happen, but the code checks anyway and returns an error.) What if the repeat counter hits 15, the maximum storable in four bits? Each scenario needs explicit handling.
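Those three checks can be sketched in isolation. This is a simplified illustration and not the function from the video: the names are hypothetical, the search is a plain linear scan, and the count is assumed to include the first occurrence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REPEATS 15u  /* largest value a 4-bit counter can hold */

/* Return the position of block in list, or -1 if it is absent.
   The caller treats -1 as an error instead of guessing. */
static long find_block(const uint32_t *list, size_t len, uint32_t block)
{
    for (size_t i = 0; i < len; i++) {
        if (list[i] == block)
            return (long)i;
    }
    return -1;
}

/* Count consecutive copies of blocks[pos], starting at pos.
   Stops at the counter's cap and never reads past the end of the input. */
static unsigned count_run(const uint32_t *blocks, size_t count, size_t pos)
{
    assert(blocks != NULL && pos < count);
    unsigned run = 1;
    while (run < MAX_REPEATS
           && pos + run < count
           && blocks[pos + run] == blocks[pos]) {
        run++;
    }
    return run;
}
```

The order of the loop conditions matters: the bounds check `pos + run < count` comes before the array read, so the comparison is never evaluated on memory past the end of the file. A run longer than 15 would simply be emitted as several entries, with the caller advancing by the returned count each time.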

The Compiler as Conversation Partner

When Birch finally compiles, the errors arrive in a flood: undeclared variables, type mismatches, logic inversions. "Probably quite a lot to fix. Hopefully nothing complicated inside of our big function," he says with the resigned optimism of someone who knows the compiler is about to teach him where he was wrong.

The debugging phase occupies over an hour of the video. This ratio of planning and writing to fixing is probably more representative of actual software development than any polished tutorial. Birch forgot to declare repeats, misspelled uniques as Unix, and used header where he meant header->field. These aren't interesting mistakes individually, but their accumulation is the real work.

What's notable is Birch's systematic approach. He doesn't panic or start changing random things. He reads the compiler error, locates the line, reasons about what the code is actually doing versus what it should do, makes a targeted fix. This is the unglamorous skill that separates working programmers from people who can write code that works once, under ideal conditions.

By the end, the program compiles. Whether it works remains unclear—the video cuts off during initial testing. This feels appropriate somehow. The victory isn't a working compressor (yet); it's getting three-plus hours of interconnected logic to cohere enough that the compiler accepts it as valid C.

What Low-Level Programming Reveals

There's a reason most modern developers don't write compression algorithms in C from scratch. Libraries exist. Higher-level languages handle memory. The pragmatic move is to use existing tools and spend your cognitive budget elsewhere.

But watching someone work at this level reveals something about how compression actually functions—not as an abstract concept, but as pointer arithmetic and bit shifting and carefully bounded loops. The 32-to-16 bit reduction isn't magic; it's a deliberate tradeoff enabled by maintaining lookup tables. The repeat-count packing isn't clever trivia; it's recognizing that many compressed formats waste bits that could store additional information.

Birch isn't rediscovering established techniques so much as reconstructing why they exist. The decisions he makes—how to structure the header, where to allocate memory, when to check for errors—mirror choices that compression pioneers made decades ago, albeit with different constraints. Contemporary programmers work with zip files and gzip streams without considering that someone had to make all these small decisions correctly for compression to become infrastructure.

The three-and-a-half-hour runtime isn't padding or poor editing. It's an accurate representation of how long this work takes. Programming isn't the part where you type; that's fast. Programming is the part where you think about memory layout and edge cases and whether size_minus_one should actually be size_minus_three, then test that reasoning against a compiler that knows when you're wrong but not why.

The session ends without a demonstrably working compressor, which is fine. Version one rarely works. But all the infrastructure is there: the header format, the memory allocation strategy, the compression logic, the file I/O. What remains is debugging, probably a lot of it. That's the work.

Marcus Chen-Ramirez is a senior technology correspondent for Buzzrag.