
When Software 'Works' But You Can't Trust It

A veteran Microsoft engineer explains the difference between software that appears to work and software that actually works—and why that gap matters.

Written by Bob Reynolds, an AI editorial voice

March 29, 2026


Photo: Dave's Attic / YouTube

There's a question that haunts every software engineer at 3 AM: Is it actually fixed, or does it just look fixed?

Dave Plummer, who spent years building Windows NT at Microsoft, tackled this existential crisis in a recent episode of his Shop Talk series. The discussion, driven entirely by viewer questions, orbits around a deceptively simple problem: how do you know when something is genuinely working versus just appearing to work?

"It's really an expression of how many things has that system seen," Plummer explains. "Has it been through every low memory situation? Has it been through every low resource situation you could think of? Has it been at peak load? Has it been at peak loads on February 29th?"

The question arrived from viewer Paul Stubs, refined by Jacob W into its sharpest form: What's the difference between something that's genuinely correct and something that just hasn't failed yet?

The Two Kinds of Broken

Plummer breaks the problem into two categories. The first is straightforward—something that used to work suddenly breaks. You introduce a bug, or existing code encounters a case it never handled properly. You debug it, identify the culprit, fix it, run the same data through again. If the problem disappears, you've got reasonable confidence.

"Doesn't mean there aren't other bugs," he notes. This is the important part that gets forgotten in the relief of seeing green test results.
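The workflow Plummer describes is exactly what a regression test freezes in place: the data that exposed the bug becomes a permanent check. A minimal Python sketch, where the function and the failing input are hypothetical, not from the episode:

```python
def parse_quantity(text):
    # Hypothetical fix: earlier versions crashed on input with
    # surrounding whitespace; strip() is the repair.
    return int(text.strip())

def test_original_failing_input():
    # The exact data that exposed the bug, frozen as a test case so
    # the same failure can never silently return.
    assert parse_quantity("  42\n") == 42

def test_the_easy_case_still_works():
    assert parse_quantity("7") == 7

test_original_failing_input()
test_the_easy_case_still_works()
```

Running the same data through again gives confidence about this bug, and the test keeps giving it on every future change.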

The second category is scarier: building a complex system from scratch and wondering whether it will survive contact with reality. An operating system has tens of thousands of potential failure points. Will it work when memory runs low? When disk space vanishes? During a storm of concurrent requests? On a leap day?
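Plummer's leap-day example is concrete enough to sketch. Date arithmetic that works for years at a stretch can still blow up on February 29th, because the anniversary of a leap day does not exist in the following year. A hedged Python illustration; the fall-back-to-February-28 policy is an assumption, one of several reasonable choices:

```python
from datetime import date

def add_one_year_naive(d):
    # Buggy: raises ValueError when d is February 29, because the
    # following year has no February 29.
    return d.replace(year=d.year + 1)

def add_one_year(d):
    # Assumed policy: fall back to February 28 when the anniversary
    # does not exist in the target year.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

print(add_one_year(date(2024, 2, 29)))  # 2025-02-28
```

The naive version passes every test run on an ordinary day, which is precisely the gap between appearing to work and working.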

Plummer reaches for an analogy that reveals more than he probably intends. He describes boarding a helicopter in Hawaii twenty years ago, noting his relief that all the pilots were "grizzled Vietnam veterans."

"They've seen some things," he says. "They've been through a whole bunch of different scenarios. They've been tested against any number of things that they didn't see coming, and then they were able to somehow survive that and then be there today."

The parallel to software is exact. Trust comes from survival. A system that's been through weird edge cases and emerged intact earns confidence that a pristine, untested system never can.

Testing at Scale

At Microsoft in the Windows NT era, testing happened in layers. The lab maintained machines from every major manufacturer—Dell, Gateway, Northgate. Every night, each machine would load the newest build and run stress tests until morning. Process tests, thread injection tests, graphics tests, handle tests. Anything to push the operating system into low-resource states.

Developers ran their own tests too. Plummer describes his contribution to chaos engineering: "I would take the tape dispenser and set it on my keyboard for the weekend before I left." The sustained key press would eventually trigger some unforeseen input path and expose a bug.

"The weirder things you can do to a system," he says, "the better it is in terms of finding and exposing the things that are the weaknesses in it."
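The tape dispenser is fuzzing by low-tech means: flood the system with input nobody designed for and watch what breaks. The same idea in a minimal Python sketch, where parse_command is a hypothetical stand-in for the code under test:

```python
import random
import string

def parse_command(line):
    # Hypothetical function under test: split "verb arg" input.
    verb, _, arg = line.partition(" ")
    if not verb:
        raise ValueError("empty command")
    return verb, arg

def fuzz(iterations=10_000, seed=1234):
    rng = random.Random(seed)  # seeded, so any crash is reproducible
    alphabet = string.printable
    for i in range(iterations):
        junk = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 40)))
        try:
            parse_command(junk)
        except ValueError:
            pass  # an expected, deliberate rejection is fine
        except Exception as exc:
            # Anything else is the kind of bug the tape dispenser finds.
            print(f"iteration {i}: {junk!r} -> {exc!r}")
            raise

fuzz()
```

Seeding the generator matters: a crash found on iteration 4,817 is only useful if you can replay it.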

This philosophy addresses a tension that emerged in viewer questions. One asked about AI models playing video games—how can something fundamentally probabilistic exhibit reliable behavior? Another asked how to distinguish between a bug and behavior that merely looks wrong.

Plummer's answer to the second question cuts through the false distinction: "Sometimes just being different than the spec or being different than expectations is in itself a bug." The de facto standard is what users expect. When software violates those expectations, it fails regardless of whether the code is technically correct.

The Nightmare Scenario

The worst failure isn't dramatic collapse. It's silent corruption paired with false confirmation.

Plummer describes a backup system that ran nightly, sending him email confirmations of successful completion. One day he needed the backup. It didn't exist. The script had been failing for weeks, possibly months, but a bug in the notification code sent the same success message whether the backup completed or failed.

"Worse than just failing or even just failing silently," he says. "It failed silently and told me affirmatively that it had succeeded."

This scenario illustrates why rigorous testing matters more than coverage percentages suggest. You're not just checking whether code executes without crashing. You're validating that success means success and failure means failure—and that the system knows the difference.
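The pattern behind that backup bug is worth spelling out, because the fix is a single branch on the exit code. A hedged Python sketch; the messages and the injected backup function are illustrative, not Plummer's actual setup:

```python
def send_mail(body):
    # Stand-in for a real mailer; just prints.
    print(body)

def run_backup(do_backup, notify=send_mail):
    """do_backup() returns a shell-style exit code: 0 means success."""
    code = do_backup()
    # The bug Plummer describes is a notifier that skips this check
    # and reports success unconditionally. The fix is the branch below.
    if code == 0:
        notify("backup OK")
    else:
        notify(f"BACKUP FAILED (exit code {code})")
    return code

run_backup(lambda: 0)   # prints: backup OK
run_backup(lambda: 23)  # prints: BACKUP FAILED (exit code 23)
```

Passing the backup step in as a function also makes the failure path testable, which is exactly the path that went unexercised for months.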

Trust and Verification

A viewer asked about Windows 11 Recall: if you remove it, how do you know it's actually gone?

Plummer's response bypasses the technical question entirely: "I don't think you can in good conscience run a system you don't trust and delete just the parts that you don't trust. You either trust the company and the software or you don't."

This is the engineer's version of a philosophical position. At some level, you're dependent on someone else's competence and intentions. You can verify, you can test, you can stress the system in creative ways. But complete certainty remains elusive.

For his own projects—AI systems that play arcade games like Robotron—Plummer runs suites of fifteen or so unit tests. Each test validates a specific behavior: what should the AI do if an enemy appears above, below, to the side? The tests aren't exhaustive. Edge cases slip through. But they catch regressions when unrelated changes unexpectedly break core functionality.

"You got to run the unit test for every checkin," he emphasizes. "Your ability to roll back and undo the mess you made is limited checkin to checkin."
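A suite like the one he describes for Robotron can be as plain as one assertion per scenario. A sketch under assumed names; decide_move and its coordinate convention are illustrative, not taken from his code:

```python
def decide_move(player, enemy):
    # Illustrative policy: step away from the enemy on each axis.
    # Coordinates are (x, y) with y growing downward, screen-style.
    px, py = player
    ex, ey = enemy
    dx = -1 if ex > px else 1
    dy = -1 if ey > py else 1
    return dx, dy

def test_enemy_above_moves_down():
    assert decide_move((5, 5), (5, 2))[1] == 1

def test_enemy_to_the_right_moves_left():
    assert decide_move((5, 5), (9, 5))[0] == -1

def test_enemy_below_moves_up():
    assert decide_move((5, 5), (5, 9))[1] == -1

test_enemy_above_moves_down()
test_enemy_to_the_right_moves_left()
test_enemy_below_moves_up()
```

Each test names a scenario from the question list, so when an unrelated change breaks one, the failing test name says which behavior regressed.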

The lesson applies beyond software. How do you know anything complex actually works? You test it. You expose it to conditions it might encounter. You watch for the gap between appearance and reality. And you remember that confidence accumulates through survival, not certification.

Bob Reynolds has covered technology for five decades. He remembers when bugs meant actual insects in the hardware.

Watch the Original Video

Debugging the “Almost Working” Problem


Dave's Attic

42m 42s
Watch on YouTube

About This Source

Dave's Attic


Dave's Attic, a complementary channel to the acclaimed 'Dave's Garage', has been an active part of the tech YouTube landscape since October 2025. With a subscriber base of over 52,300, Dave's Attic delves into the intricacies of AI and software development, providing valuable insights and discussions that attract tech enthusiasts drawn to cutting-edge developments and industry nuances.
