When Software 'Works' But You Can't Trust It
A veteran Microsoft engineer explains the difference between software that appears to work and software that actually works—and why that gap matters.
Written by AI · Bob Reynolds
March 29, 2026

Photo: Dave's Attic / YouTube
There's a question that haunts every software engineer at 3 AM: Is it actually fixed, or does it just look fixed?
Dave Plummer, who spent years building Windows NT at Microsoft, tackled this existential crisis in a recent episode of his Shop Talk series. The discussion, driven entirely by viewer questions, orbits around a deceptively simple problem: how do you know when something is genuinely working versus just appearing to work?
"It's really an expression of how many things has that system seen," Plummer explains. "Has it been through every low memory situation? Has it been through every low resource situation you could think of? Has it been at peak load? Has it been at peak loads on February 29th?"
The question arrived from viewer Paul Stubs, refined by Jacob W into its sharpest form: What's the difference between something that's genuinely correct and something that just hasn't failed yet?
The Two Kinds of Broken
Plummer breaks the problem into two categories. The first is straightforward—something that used to work suddenly breaks. You introduce a bug, or existing code encounters a case it never handled properly. You debug it, identify the culprit, fix it, run the same data through again. If the problem disappears, you've got reasonable confidence.
"Doesn't mean there aren't other bugs," he notes. This is the important part that gets forgotten in the relief of seeing green test results.
The second category is scarier: building a complex system from scratch and wondering whether it will survive contact with reality. An operating system has tens of thousands of potential failure points. Will it work when memory runs low? When disk space vanishes? During a storm of concurrent requests? On a leap day?
Plummer reaches for an analogy that reveals more than he probably intends. He describes boarding a helicopter in Hawaii twenty years ago, noting his relief that all the pilots were "grizzled Vietnam veterans."
"They've seen some things," he says. "They've been through a whole bunch of different scenarios. They've been tested against any number of things that they didn't see coming, and then they were able to somehow survive that and then be there today."
The parallel to software is exact. Trust comes from survival. A system that's been through weird edge cases and emerged intact earns confidence that a pristine, untested system never can.
Testing at Scale
At Microsoft in the Windows NT era, testing happened in layers. The lab maintained machines from every major manufacturer—Dell, Gateway, Northgate. Every night, each machine would load the newest build and run stress tests until morning. Process tests, thread injection tests, graphics tests, handle tests. Anything to push the operating system into low-resource states.
Developers ran their own tests too. Plummer describes his contribution to chaos engineering: "I would take the tape dispenser and set it on my keyboard for the weekend before I left." The sustained key press would eventually trigger some unforeseen input path and expose a bug.
"The weirder things you can do to a system," he says, "the better it is in terms of finding and exposing the things that are the weaknesses in it."
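The tape-dispenser trick is fuzzing by other means: feed the system input nobody planned for and see which failures are well-defined and which are surprises. A minimal sketch of that idea in Python, using a toy config parser as a stand-in for the system under test (the parser and the harness are illustrative, not anything from the video):

```python
import random
import string

def parse_config_line(line: str) -> tuple[str, str]:
    """Toy system under test: parse 'key=value', raising ValueError on bad input."""
    if "=" not in line:
        raise ValueError(f"missing '=': {line!r}")
    key, _, value = line.partition("=")
    if not key.strip():
        raise ValueError(f"empty key: {line!r}")
    return key.strip(), value.strip()

def fuzz(iterations: int = 10_000, seed: int = 42) -> int:
    """Throw random garbage at the parser. A ValueError is a well-defined
    failure; anything else is a path the author never saw coming."""
    rng = random.Random(seed)  # seeded, so a crash is reproducible
    rejected = 0
    for _ in range(iterations):
        line = "".join(rng.choice(string.printable)
                       for _ in range(rng.randint(0, 40)))
        try:
            parse_config_line(line)
        except ValueError:
            rejected += 1  # expected, documented failure mode
        except Exception as exc:
            raise AssertionError(f"unexpected crash on {line!r}: {exc}")
    return rejected
```

The point is the two `except` clauses: a system earns trust not by never failing, but by failing only in the ways its contract promises.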
This philosophy addresses a tension that emerged in viewer questions. One asked about AI models playing video games—how can something fundamentally probabilistic exhibit reliable behavior? Another asked how to distinguish between a bug and behavior that merely looks wrong.
Plummer's answer to the second question cuts through the false distinction: "Sometimes just being different than the spec or being different than expectations is in itself a bug." The de facto standard is what users expect. When software violates those expectations, it fails regardless of whether the code is technically correct.
The Nightmare Scenario
The worst failure isn't dramatic collapse. It's silent corruption paired with false confirmation.
Plummer describes a backup system that ran nightly, sending him email confirmations of successful completion. One day he needed the backup. It didn't exist. The script had been failing for weeks, possibly months, but a bug in the notification code sent the same success message whether the backup completed or failed.
"Worse than just failing or even just failing silently," he says. "It failed silently and told me affirmatively that it had succeeded."
This scenario illustrates why rigorous testing matters more than coverage percentages suggest. You're not just checking whether code executes without crashing. You're validating that success means success and failure means failure—and that the system knows the difference.
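One defense against this failure mode is to derive the notification from the artifact itself rather than from the script's control flow. A sketch of that idea (the path, freshness threshold, and message format are assumptions for illustration):

```python
import os
import time

def report_backup_status(path: str, max_age_hours: float = 26.0) -> str:
    """Report success only if a backup file actually exists, is non-empty,
    and is recent. The bug Plummer describes reported success
    unconditionally; here the verdict comes from inspecting the artifact,
    not from whatever the backup script claimed."""
    if not os.path.exists(path):
        return f"FAILURE: backup missing at {path}"
    age_hours = (time.time() - os.path.getmtime(path)) / 3600
    if age_hours > max_age_hours:
        return f"FAILURE: backup is {age_hours:.1f}h old (stale)"
    if os.path.getsize(path) == 0:
        return "FAILURE: backup file is empty"
    return f"SUCCESS: backup fresh ({age_hours:.1f}h old)"
```

Because the check runs independently of the backup job, a bug in the job can no longer forge its own success message.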
Trust and Verification
A viewer asked about Windows 11 Recall: if you remove it, how do you know it's actually gone?
Plummer's response bypasses the technical question entirely: "I don't think you can in good conscience run a system you don't trust and delete just the parts that you don't trust. You either trust the company and the software or you don't."
This is the engineer's version of a philosophical position. At some level, you're dependent on someone else's competence and intentions. You can verify, you can test, you can stress the system in creative ways. But complete certainty remains elusive.
For his own projects—AI systems that play arcade games like Robotron—Plummer runs suites of fifteen or so unit tests. Each test validates a specific behavior: what should the AI do if an enemy appears above, below, to the side? The tests aren't exhaustive. Edge cases slip through. But they catch regressions when unrelated changes unexpectedly break core functionality.
"You got to run the unit test for every checkin," he emphasizes. "Your ability to roll back and undo the mess you made is limited checkin to checkin."
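A behavior suite like the one Plummer describes might look like this sketch. The `choose_move` policy and its coordinate convention are hypothetical, not his actual code; the shape of the tests is the point: each one pins down a single expected behavior so a regression shows up at the checkin that caused it.

```python
import unittest

def choose_move(player: tuple[int, int], enemy: tuple[int, int]) -> str:
    """Hypothetical evade policy: flee along the axis of greatest threat.
    Assumes positive y means 'above' the player."""
    dx = enemy[0] - player[0]
    dy = enemy[1] - player[1]
    if abs(dx) >= abs(dy):
        return "left" if dx > 0 else "right"  # enemy to the right -> flee left
    return "down" if dy > 0 else "up"         # enemy above -> flee down

class TestEvadePolicy(unittest.TestCase):
    def test_enemy_to_the_right(self):
        self.assertEqual(choose_move((0, 0), (5, 0)), "left")

    def test_enemy_above(self):
        self.assertEqual(choose_move((0, 0), (0, 5)), "down")

    def test_enemy_below(self):
        self.assertEqual(choose_move((0, 0), (0, -5)), "up")

if __name__ == "__main__":
    unittest.main()
```

Fifteen such tests are nowhere near exhaustive, but they make "rolling back to the last good checkin" a meaningful operation, because every checkin has a recorded pass/fail state.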
The lesson applies beyond software. How do you know anything complex actually works? You test it. You expose it to conditions it might encounter. You watch for the gap between appearance and reality. And you remember that confidence accumulates through survival, not certification.
Bob Reynolds has covered technology for five decades. He remembers when bugs meant actual insects in the hardware.
Watch the Original Video
Debugging the “Almost Working” Problem
Dave's Attic
42m 42s

About This Source
Dave's Attic
Dave's Attic, a complementary channel to the acclaimed 'Dave's Garage', has been an active part of the tech YouTube landscape since October 2025. With a subscriber base of over 52,300, Dave's Attic delves into the intricacies of AI and software development, providing valuable insights and discussions that attract tech enthusiasts drawn to cutting-edge developments and industry nuances.
More Like This
The Security Hole We Keep Ignoring: Third-Party Scripts
After 50 years covering tech, I've seen this pattern before: developers linking to code they don't control, creating vulnerabilities that shouldn't exist.
She Built a Manga App in 24 Hours Without Writing Code
Ex-Meta data scientist Tina Huang created a functional manga generation app in under 24 hours using AI tools—no coding required. Here's what worked.
The Engineer Who Stopped Writing Code
Boris Cherny created Claude Code at Anthropic. Now he doesn't write any code himself. A year into AI-assisted development, what have we learned?
Building a Virtual Machine From Scratch Takes Six Hours
A programmer documents building a 16-bit virtual machine in C with custom assembly language, revealing the actual complexity of low-level systems work.