AWS US-East-1 Has Failed Six Times in 15 Years

October 20th, 2025. Seven in the morning. Nurses across the United States try to pull up patient records. Nothing loads. Airline gate agents can't issue boarding passes. Coinbase, Slack, Snapchat, Duolingo, Fortnite — all down. Someone asks Alexa what's happening. Alexa says nothing.

A significant portion of the modern internet had stopped. No cyberattack. No ransomware. No natural disaster. A timing collision between two software processes inside a data center in Ashburn, Virginia.

"This outage so massive, in fact, it's really hard to pinpoint an industry that wasn't affected by this."

Here's what a freeCodeCamp breakdown of every major AWS failure makes plain: this was the sixth time. The same region. Fifteen years. Six completely different causes, six completely different failure modes — same impact, same story.

The region that became unavoidable

To understand why this keeps happening, you have to understand what US-East-1 actually is. AWS divides its infrastructure into geographic regions. US-East-1 — Northern Virginia — was the first region. Every foundational service: S3, EC2, DynamoDB, Lambda. All built and tested there first. That made it the default, and defaults compound. Developers building in the early 2010s picked US-East-1 because everything was there. Once enough people were on it, it became the region you had to use. TechnologyChecker.io estimates somewhere between 30 and 50% of all internet traffic now runs through that one cluster in Virginia.

That context is what makes the next six failures so instructive — and so uncomfortable.

2011: The feedback loop nobody designed for

April 21st, 2011. AWS is five years old. The cloud is still a novelty. An engineer begins a routine network upgrade in US-East-1, intending to move traffic to higher-capacity connections. They execute the steps in reverse. The backup network gets the primary traffic load. It can't handle it. Packets start dropping.

Then EBS — the virtual hard drives EC2 instances run on — detects the instability. Thousands of volumes simultaneously try to remirror themselves to protect their data. Normally fine. Normally there's spare network capacity for that. There is none. The remirroring clogs the network further, which causes more EBS volumes to think they're losing data, which causes more remirroring. A feedback loop. Reddit, Foursquare, Quora go dark. Full recovery takes four days.
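The loop described above can be sketched as a toy simulation. This is a minimal illustration, not a model of EBS itself: the function name, the capacity numbers, and the rule that overload pushes a proportional share of healthy volumes into remirroring are all assumptions made for the example.

```python
def simulate_remirror_storm(capacity, baseline, volumes, initially_affected, steps):
    """Return how many volumes are remirroring after each step.

    Assumption: each remirroring volume adds one unit of network load, and
    the fraction of healthy volumes that start remirroring in a step equals
    the overload divided by total capacity.
    """
    remirroring = initially_affected
    history = [remirroring]
    for _ in range(steps):
        load = baseline + remirroring
        overload = max(0, load - capacity)
        healthy = volumes - remirroring
        # Integer arithmetic keeps the toy model deterministic.
        newly_affected = min(healthy, healthy * overload // capacity)
        remirroring += newly_affected
        history.append(remirroring)
    return history


# With spare capacity, the initial 100 remirrors never spread.
calm = simulate_remirror_storm(capacity=1000, baseline=500, volumes=1000,
                               initially_affected=100, steps=4)

# With no spare capacity, every remirror makes the next round worse.
storm = simulate_remirror_storm(capacity=500, baseline=500, volumes=1000,
                                initially_affected=100, steps=4)
```

The only difference between the two runs is headroom. Once the network has none, the protective behavior is the thing that spreads the failure.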

The cloud was only a few years old and had already demonstrated something that would recur: it breaks in ways nobody predicted.

2012: The split that changed how engineers think

June 29th, 2012. A severe thunderstorm, with 80 mph winds and rapid-fire lightning, hits the Ashburn data centers. Emergency generators kick in. Transfer switches, the hardware managing the transition from grid power to generator power, start malfunctioning. Power fluctuates. Pinterest, Instagram, and Heroku all go down. Netflix, running in the same region, looks set to join them.

Except Netflix didn't go down.

Netflix engineers saw their dashboards degrade, and their systems automatically shifted traffic away from the affected zone — because Netflix had spent two years deliberately breaking their own infrastructure in production. Chaos Monkey: software that randomly terminated live instances on purpose, specifically to force their systems to become resilient to exactly this kind of failure. They had rehearsed the disaster.
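The idea behind that rehearsal can be shown in a few lines. This is a hedged sketch of the general technique, not Netflix's actual Chaos Monkey code: the `Instance` class, the zone names, and the helper functions are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    alive: bool = True


def terminate_random_instance(fleet, rng):
    """Chaos step: kill one live instance chosen at random and return it."""
    live = [inst for inst in fleet if inst.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def route_request(fleet):
    """Return the first live instance, or None if the service is down."""
    return next((inst for inst in fleet if inst.alive), None)


def survives_zone_loss(fleet, zone):
    """True if some live instance would remain after losing a whole zone."""
    return any(inst.alive and inst.zone != zone for inst in fleet)


# Hypothetical fleet spread across three zones.
fleet = [Instance("web-1", "zone-a"), Instance("web-2", "zone-b"),
         Instance("web-3", "zone-c")]
victim = terminate_random_instance(fleet, random.Random(0))
still_serving = route_request(fleet) is not None

# A fleet packed into one zone fails the same check.
single_zone = [Instance("web-1", "zone-a"), Instance("web-2", "zone-a")]
```

Running the termination step routinely, in production, is what turns "we think we can lose an instance" into something a team has actually watched happen.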

The engineers watching from the non-Netflix camp that morning had a different experience: the quiet, vertiginous realization that resilience wasn't a setting you turned on, it was an architecture you built over years — and they hadn't built it. That split, between teams that had operationalized failure and teams that were discovering they hadn't, is what turned chaos engineering from a Netflix curiosity into an industry discipline.

2017: The status dashboard that lied

February 28th, 2017. No storm, no hardware failure. A typo.

An S3 engineer debugs a billing issue and runs a command to take a small number of servers offline. They type the wrong number. Instead of a handful of servers, the command removes a large set, including the ones that run the index subsystem: the service that tracks the location of every object in S3. Every photo, every file, every database backup. Without the index, S3 can't find anything.
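One common defense against a mistyped number is a guard that refuses requests above a fixed share of the fleet. The sketch below is an assumption about what such a check could look like. The article does not describe the tooling the engineer used, and `validate_removal` and its 5 percent limit are hypothetical.

```python
def validate_removal(requested, fleet_size, max_percent=5):
    """Return `requested` if it is within the allowed share, else raise.

    Assumption: no single command may remove more than `max_percent`
    percent of the fleet.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    limit = fleet_size * max_percent // 100
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} of {fleet_size} servers "
            f"(limit is {limit})"
        )
    return requested


approved = validate_removal(4, fleet_size=100)   # within the limit
try:
    validate_removal(40, fleet_size=100)         # a plausible typo for 4
    rejected = False
except ValueError:
    rejected = True
```

A guard like this does not stop mistakes. It only caps how large a single mistake can be.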

Trello goes down. Slack goes down. Thousands of apps go dark. The AWS team tries to restart the index subsystem, but S3 has grown so large since 2006 that nobody knows how long a full restart will take. They watch a progress bar move. Slowly.

Someone tries to check the AWS status dashboard for updates. The dashboard is hosted on S3. It won't load. And because it can't update its own status, it reports everything green. Services operational. Amazon was using Amazon to check if Amazon was down.
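The failure here is a status page that treats "no new information" as "everything is fine." A minimal sketch of the alternative, assuming a hypothetical `StatusReport` record and a 60-second freshness window, is to show stale data as unknown rather than green:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusReport:
    healthy: bool
    reported_at: float  # seconds since the epoch


def displayed_status(report: Optional[StatusReport], now: float,
                     max_age_seconds: float = 60.0) -> str:
    """Map the latest report to what the dashboard should show."""
    # A missing or stale report means we do not know, so never show green.
    if report is None or now - report.reported_at > max_age_seconds:
        return "unknown"
    return "operational" if report.healthy else "degraded"


fresh = displayed_status(StatusReport(True, 1000.0), now=1030.0)
stale = displayed_status(StatusReport(True, 1000.0), now=5000.0)
```

This only helps if the dashboard itself runs somewhere that does not share a failure domain with the service it reports on, which is the change the postmortem describes.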

Four hours later, S3 recovers. Amazon's postmortem quietly notes they're moving the status dashboard off S3.

The engineer presumably takes a long walk home.

2020: When nobody owns the whole map

November 25th, 2020 — the day before Thanksgiving. AWS makes a configuration change to Kinesis, a real-time data processing service. Not a glamorous service. Not the kind of thing that shows up in developer Twitter arguments. But it turns out Kinesis is quietly load-bearing inside AWS's own infrastructure in ways that weren't fully documented anywhere.

The change causes Kinesis to overconsume resources. Kinesis slows. IAM starts having problems — that's the authentication service, the one that decides who's allowed to do anything. Cognito starts failing. CloudWatch, the monitoring tool engineers use to diagnose problems, starts failing. Route 53 health checks go sideways. Auto-scaling stops working.

Engineers try to log into the AWS console to diagnose what's happening. They can't log in. IAM is down.

IAM being downstream of Kinesis wasn't just an unmapped dependency — it was an organizational failure, the kind that happens when a platform grows so large and so fast that no single team owns the full picture of what depends on what. The dependency graph becomes institutional dark matter.
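Drawing that map is mechanically simple once the edges are written down. The hard part is organizational. The sketch below walks a dependency graph to list everything a single failure can reach. The edges are illustrative assumptions loosely based on the services named above, not AWS's real internal graph.

```python
from collections import deque

# Hypothetical edges: each service maps to the services it depends on.
DEPENDS_ON = {
    "kinesis": [],
    "iam": ["kinesis"],
    "cognito": ["kinesis"],
    "cloudwatch": ["kinesis"],
    "autoscaling": ["cloudwatch"],
    "console": ["iam"],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


blast_radius = impacted_by("kinesis", DEPENDS_ON)
```

In this toy graph, the console sits two hops away from Kinesis, which is exactly the kind of distance at which nobody thinks to ask.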

Services recover slowly through the night. By Thanksgiving morning, things work again. At the time, almost no one knows what has happened.

2021: Amazon, taken down by Amazon

December 7th, 2021. Something breaks in AWS Lambda and other US-East-1 services. At first it's not catastrophic. Then Alexa stops responding. Ring cameras go offline. Roomba vacuums freeze mid-floor.

Then Amazon itself breaks. Amazon Flex drivers — gig workers delivering packages in their own cars — open their apps to start shifts. The apps won't load. They can't scan packages, can't start routes. Inside fulfillment centers, the handheld scanners stop connecting.

"The company that owns the cloud is being taken down by its own cloud. For a few hours, Amazon's ability to fulfill orders — which you know is its entire reason for existing — is being impacted by AWS going down."

Amazon doesn't go into significant detail about the root cause. As the freeCodeCamp analysis observes: "The silence is its own kind of answer."

2025: The DNS record that wasn't there

October 19th, 2025, 11 p.m. Pacific time. An automated system managing DNS records for DynamoDB (Amazon's flagship database, and the internal coordination layer for dozens of other AWS services) encounters a rare timing condition between two redundant components. They collide in a way that has apparently never happened before. The automation deletes the DNS record for the DynamoDB regional endpoint.

Not the data. Not the servers. Just the address every AWS service uses to find DynamoDB.
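How two well-behaved components can delete a record between them is easier to see in miniature. The sketch below is a generic stale-write race, offered as an assumption about the class of bug rather than a reconstruction of AWS's automation: the `Endpoint` class, the plan versions, and the addresses are all hypothetical.

```python
class Endpoint:
    def __init__(self):
        self.plans = {}     # plan version -> list of IP addresses
        self.active = None  # plan version the DNS record points to

    def resolve(self):
        """Return the addresses clients would get right now."""
        return self.plans.get(self.active, [])


def apply_plan(endpoint, version, guarded):
    """Point the record at `version`; a guarded apply rejects older plans."""
    if guarded and endpoint.active is not None and version <= endpoint.active:
        return False
    endpoint.active = version
    return True


def cleanup_old_plans(endpoint, newest):
    """Delete every plan older than `newest`."""
    for version in [v for v in endpoint.plans if v < newest]:
        del endpoint.plans[version]


def run_race(guarded):
    endpoint = Endpoint()
    endpoint.plans = {1: ["10.0.0.1"], 2: ["10.0.0.2"]}
    apply_plan(endpoint, 2, guarded)        # fast worker applies the newest plan
    apply_plan(endpoint, 1, guarded)        # delayed worker applies a stale plan
    cleanup_old_plans(endpoint, newest=2)   # fast worker removes old plans
    return endpoint.resolve()


unguarded_result = run_race(guarded=False)  # record points at a deleted plan
guarded_result = run_race(guarded=True)     # stale apply was rejected
```

Each step is correct on its own. The empty record only appears when the delayed write lands between the fresh write and the cleanup, which is why this kind of failure can stay hidden for years.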

DynamoDB becomes unreachable. Not broken — unfindable. EC2 stores its operational data in DynamoDB. EC2's orchestration stalls. New instances can't launch. Existing instances can't be managed. The cascade spreads outward. Netflix, Slack, Coinbase, Expedia. Hospitals, airlines, banks.

Engineering teams open their AWS management consoles to diagnose the problem. The consoles are either unreachable or showing stale data.

The systems built to automatically recover were themselves dependent on DynamoDB. AWS engineers had to restore everything manually, in the right order, validating each step.
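"In the right order" has a precise meaning: nothing is restarted before the things it depends on. A minimal sketch using Python's standard `graphlib` module (Python 3.9+) follows. The graph itself is a hypothetical simplification of the dependencies described above.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each service maps to what must be healthy first.
DEPENDS_ON = {
    "dynamodb": ["dns_record"],
    "ec2_orchestration": ["dynamodb"],
    "auto_recovery": ["dynamodb"],
    "console": ["ec2_orchestration"],
}


def restore_order(depends_on):
    """Return services ordered so that dependencies come before dependents."""
    return list(TopologicalSorter(depends_on).static_order())


order = restore_order(DEPENDS_ON)
```

`TopologicalSorter` raises `CycleError` if the graph contains a loop, which is its own useful signal: a recovery system that depends on the thing it recovers has no valid restore order at all.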

The thing that doesn't change

Six outages. Six different causes: a reversed network migration, a storm, a mistyped number, a pre-holiday configuration change that exposed a dependency nobody had fully mapped, a 2021 failure whose root cause was never detailed, a race condition in automation. Not one cause repeated. Every postmortem delivered. Every safeguard promised.

"Systems this complex don't fail in ways you predict. They fail in the ways you didn't think to protect against."

That's not a counsel of despair — it's a description of how systems actually work when they get large enough. The question worth sitting with isn't whether US-East-1 will fail again. It's what we've collectively decided is acceptable when a significant share of the internet runs through a single data center cluster in one state — and whether that's a decision anyone actually made, or just what happens when a default goes unchallenged long enough to become load-bearing infrastructure.

— Dev Kapoor covers open source software, developer communities, and the politics of code for Buzzrag.