AWS US-East-1 Has Failed Six Times in 15 Years
Six AWS outages, one Virginia region, fifteen years. A look at what actually broke each time—and what it reveals about how the internet is built.
Written by AI. Dev Kapoor

Photo: AI. Ines Cienfuegos
October 20th, 2025. Seven in the morning. Nurses across the United States try to pull up patient records. Nothing loads. Airline gate agents can't issue boarding passes. Coinbase, Slack, Snapchat, Duolingo, Fortnite — all down. Someone asks Alexa what's happening. Alexa says nothing.
A significant portion of the modern internet had stopped. No cyberattack. No ransomware. No natural disaster. A timing collision between two software processes inside a data center in Ashburn, Virginia.
"This outage so massive, in fact, it's really hard to pinpoint an industry that wasn't affected by this."
Here's what a freeCodeCamp breakdown of every major AWS failure makes plain: this was the sixth time. The same region. Fifteen years. Six completely different causes, six completely different failure modes — same impact, same story.
The region that became unavoidable
To understand why this keeps happening, you have to understand what US-East-1 actually is. AWS divides its infrastructure into geographic regions. US-East-1, in Northern Virginia, was the first. Every foundational service (S3, EC2, DynamoDB, Lambda) was built and tested there first. That made it the default, and defaults compound. Developers building in the early 2010s picked US-East-1 because everything was there. Once enough people were on it, it became the region you had to use. TechnologyChecker.io estimates that somewhere between 30% and 50% of all internet traffic now runs through that one cluster in Virginia.
That context is what makes the next six failures so instructive — and so uncomfortable.
2011: The feedback loop nobody designed for
April 21st, 2011. AWS is five years old. The cloud is still a novelty. An engineer begins a routine network upgrade in US-East-1, intending to move traffic to higher-capacity connections. They execute the steps in reverse. The backup network gets the primary traffic load. It can't handle it. Packets start dropping.
Then EBS, the virtual hard drives EC2 instances run on, detects the instability. Thousands of volumes simultaneously try to re-mirror themselves to protect their data. Normally fine. Normally there's spare network capacity for that. There is none. The re-mirroring clogs the network further, which causes more EBS volumes to think they're losing data, which causes more re-mirroring. A feedback loop. Reddit, Foursquare, Quora go dark. Full recovery takes four days.
The cloud was five years old and had already demonstrated something that would recur: it breaks in ways nobody predicted.
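The feedback loop is easier to see with numbers. What follows is a toy model, not a reconstruction of EBS: it assumes each affected volume needs one unit of network capacity to re-mirror, and that any demand above the spare capacity shows up as congestion that makes an equal number of additional volumes look unhealthy. The function name and every figure are made up for illustration.

```python
def simulate_remirror_storm(volumes, spare_capacity, initially_affected, max_rounds=10):
    """Toy model of a re-mirroring feedback loop (all numbers hypothetical).

    Each affected volume needs one unit of network capacity to re-mirror.
    Demand beyond the spare capacity is treated as congestion that makes an
    equal number of additional volumes appear unhealthy in the next round.
    Returns the count of affected volumes after each round.
    """
    affected = initially_affected
    history = [affected]
    for _ in range(max_rounds):
        overload = max(0, affected - spare_capacity)
        if overload == 0:
            break  # enough headroom: re-mirroring completes and the loop dies out
        affected = min(volumes, affected + overload)
        history.append(affected)
        if affected == volumes:
            break  # every volume is now trying to re-mirror
    return history


# With headroom, the disturbance stays contained.
print(simulate_remirror_storm(volumes=1000, spare_capacity=100, initially_affected=50))
# Without headroom, each round makes the next one worse.
print(simulate_remirror_storm(volumes=1000, spare_capacity=100, initially_affected=150))
```

In this model the only difference between a non-event and a region-wide storm is whether the first wave of re-mirroring fits inside the spare capacity.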
2012: The split that changed how engineers think
June 29th, 2012. A severe thunderstorm — 80 mph winds, rapid-fire lightning — hits the Ashburn data centers. Emergency generators kick in. Transfer switches, the hardware managing the transition from grid power to generator power, start malfunctioning. Power fluctuates. Netflix, Pinterest, Instagram, Heroku all go down.
Except Netflix didn't go down.
Netflix engineers saw their dashboards degrade, and their systems automatically shifted traffic away from the affected zone — because Netflix had spent two years deliberately breaking their own infrastructure in production. Chaos Monkey: software that randomly terminated live instances on purpose, specifically to force their systems to become resilient to exactly this kind of failure. They had rehearsed the disaster.
The engineers watching from the non-Netflix camp that morning had a different experience: the quiet, vertiginous realization that resilience wasn't a setting you turned on, it was an architecture you built over years — and they hadn't built it. That split, between teams that had operationalized failure and teams that were discovering they hadn't, is what turned chaos engineering from a Netflix curiosity into an industry discipline.
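The idea behind Chaos Monkey fits in a few lines. The sketch below is not Netflix's tool. It is a minimal stand-in, assuming a hypothetical in-memory `Fleet` in place of real cloud APIs, that kills one random live instance per round while a router keeps sending traffic only to zones that still have capacity.

```python
import random


class Fleet:
    """Hypothetical fleet: zone name -> set of live instance ids."""

    def __init__(self, instances_by_zone):
        self.instances_by_zone = {
            zone: set(ids) for zone, ids in instances_by_zone.items()
        }

    def healthy_zones(self):
        # A zone counts as healthy while it has at least one live instance.
        return sorted(zone for zone, ids in self.instances_by_zone.items() if ids)

    def route(self, request_id):
        # Spread requests across whichever zones are still healthy.
        zones = self.healthy_zones()
        if not zones:
            raise RuntimeError("no live instances in any zone")
        return zones[request_id % len(zones)]


def chaos_round(fleet, rng):
    """Terminate one randomly chosen live instance.

    Returns the (zone, instance) that was terminated, or None if nothing is left.
    """
    live = [
        (zone, inst)
        for zone, ids in sorted(fleet.instances_by_zone.items())
        for inst in sorted(ids)
    ]
    if not live:
        return None
    zone, inst = rng.choice(live)
    fleet.instances_by_zone[zone].discard(inst)
    return zone, inst
```

Running `chaos_round` on a schedule and checking that `route` still succeeds is the rehearsal: a failover path that only exists on paper tends to surface as a `RuntimeError` long before a storm finds it.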
2017: The status dashboard that lied
February 28th, 2017. No storm, no hardware failure. A typo.
An S3 engineer debugs a billing issue and runs a command to take a small number of servers offline. They type the wrong number. Instead of a small number, the command removes a large one, including the servers behind the index subsystem: the service that tracks the location of every object in S3. Every photo, every file, every database backup. Without the index, S3 can't find anything.
Trello goes down. Slack goes down. Thousands of apps go dark. The AWS team tries to restart the index subsystem, but S3 has grown so large since 2006 that nobody knows how long a full restart will take. They watch a progress bar move. Slowly.
Someone tries to check the AWS status dashboard for updates. The dashboard is hosted on S3. It won't load. And because it can't update its own status, it reports everything green. Services operational. Amazon was using Amazon to check if Amazon was down.
Four hours later, S3 recovers. Amazon's postmortem quietly notes they're moving the status dashboard off S3.
The engineer presumably takes a long walk home.
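A common defense against this class of mistake is to make the tool itself refuse implausible input. The sketch below is a generic guard, not AWS's tooling: `plan_removal`, its parameters, and the thresholds are all hypothetical. It rejects a removal that is larger than a per-command batch limit or that would push the fleet below a minimum capacity floor.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request would be unsafe to execute."""


def plan_removal(fleet_size, requested, min_capacity, max_batch):
    """Validate a server-removal request before anything is taken offline.

    Returns the fleet size after removal, or raises CapacityGuardError.
    All thresholds are illustrative assumptions.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if requested > max_batch:
        raise CapacityGuardError(
            f"refusing to remove {requested} servers in one command "
            f"(batch limit is {max_batch})"
        )
    remaining = fleet_size - requested
    if remaining < min_capacity:
        raise CapacityGuardError(
            f"removal would leave {remaining} servers, "
            f"below the minimum of {min_capacity}"
        )
    return remaining
```

With a batch limit of 5, `plan_removal(100, 3, 80, 5)` returns 97, while a mistyped `plan_removal(100, 30, 80, 5)` fails before any server is touched. The typo still happens. It just stops being an outage.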
2020: When nobody owns the whole map
November 25th, 2020 — the day before Thanksgiving. AWS makes a configuration change to Kinesis, a real-time data processing service. Not a glamorous service. Not the kind of thing that shows up in developer Twitter arguments. But it turns out Kinesis is quietly load-bearing inside AWS's own infrastructure in ways that weren't fully documented anywhere.
The change causes Kinesis to overconsume resources. Kinesis slows. IAM starts having problems — that's the authentication service, the one that decides who's allowed to do anything. Cognito starts failing. CloudWatch, the monitoring tool engineers use to diagnose problems, starts failing. Route 53 health checks go sideways. Auto-scaling stops working.
Engineers try to log into the AWS console to diagnose what's happening. They can't log in. IAM is down.
IAM being downstream of Kinesis wasn't just an unmapped dependency — it was an organizational failure, the kind that happens when a platform grows so large and so fast that no single team owns the full picture of what depends on what. The dependency graph becomes institutional dark matter.
Services recovered slowly through the night. On Thanksgiving morning, things worked again. At the time, almost no one knew what had happened.
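Drawing the map is not conceptually hard once the edges are written down. The sketch below assumes a hypothetical dependency table loosely based on the services named above. It is not AWS's real graph, and the edges are illustrative guesses. Given that table, it computes everything that transitively depends on a failed service.

```python
from collections import deque

# Hypothetical edges: service -> services it depends on.
DEPENDS_ON = {
    "iam": ["kinesis"],
    "cognito": ["kinesis"],
    "cloudwatch": ["kinesis"],
    "autoscaling": ["cloudwatch"],
    "console_login": ["iam"],
}


def blast_radius(depends_on, failed):
    """Return, sorted, every service that directly or transitively depends on `failed`."""
    # Invert the table so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    seen = set()
    queue = deque([failed])
    while queue:  # breadth-first walk outward from the failed service
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)


print(blast_radius(DEPENDS_ON, "kinesis"))
```

The traversal is the easy part. The hard part is the one the outage exposed: keeping `DEPENDS_ON` complete and current when hundreds of teams add edges independently.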
2021: Amazon, taken down by Amazon
December 7th, 2021. Something breaks in AWS Lambda and other US-East-1 services. At first it's not catastrophic. Then Alexa stops responding. Ring cameras go offline. Roomba vacuums freeze mid-floor.
Then Amazon itself breaks. Amazon Flex drivers — gig workers delivering packages in their own cars — open their apps to start shifts. The apps won't load. They can't scan packages, can't start routes. Inside fulfillment centers, the handheld scanners stop connecting.
"The company that owns the cloud is being taken down by its own cloud. For a few hours, Amazon's ability to fulfill orders — which you know is its entire reason for existing — is being impacted by AWS going down."
Amazon doesn't go into significant detail about the root cause. As the freeCodeCamp analysis observes: "The silence is its own kind of answer."
2025: The DNS record that wasn't there
October 19th, 2025, 11 p.m. An automated system managing DNS records for DynamoDB — Amazon's flagship database, and the internal coordination layer for dozens of other AWS services — encounters a rare timing condition between two redundant components. They collide in a way that's apparently never happened before. The automation deletes the DNS record for the DynamoDB regional endpoint.
Not the data. Not the servers. Just the address every AWS service uses to find DynamoDB.
DynamoDB becomes unreachable. Not broken — unfindable. EC2 stores its operational data in DynamoDB. EC2's orchestration stalls. New instances can't launch. Existing instances can't be managed. The cascade spreads outward. Netflix, Slack, Coinbase, Expedia. Hospitals, airlines, banks.
Engineering teams open their AWS management consoles to diagnose the problem. The consoles are either unreachable or showing stale data.
The systems built to automatically recover were themselves dependent on DynamoDB. AWS engineers had to restore everything manually, in the right order, validating each step.
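One way a collision like this can delete a live record is sketched below. The mechanics are assumed for illustration and are not taken from AWS's automation: two workers apply versioned DNS "plans", a delayed worker overwrites a newer plan with an older one, and a cleanup step then deletes whatever belongs to plans it considers stale. The class, hostname, and addresses are all hypothetical.

```python
class RecordStore:
    """Hypothetical DNS record store that tracks which plan version is live."""

    def __init__(self):
        self.active_version = 0
        self.records = {}

    def apply_unguarded(self, version, records):
        # Last writer wins, even if it is writing an older plan.
        self.active_version = version
        self.records = dict(records)
        return True

    def apply_guarded(self, version, records):
        # Compare-and-set: never replace the live plan with an older one.
        if version <= self.active_version:
            return False
        self.active_version = version
        self.records = dict(records)
        return True

    def cleanup_older_than(self, version):
        # Cleanup assumes stale plans are not live. If the live plan is
        # stale, its records are deleted along with it.
        if self.active_version < version:
            self.records = {}


def run_interleaving(guarded):
    """Replay one unlucky ordering of two workers and return the final records."""
    store = RecordStore()
    apply = store.apply_guarded if guarded else store.apply_unguarded
    old_plan = (1, {"db.region.example": "10.0.0.1"})
    new_plan = (2, {"db.region.example": "10.0.0.2"})

    apply(*new_plan)             # fast worker applies the newest plan
    apply(*old_plan)             # delayed worker finally applies its old plan
    store.cleanup_older_than(2)  # fast worker cleans up plans older than its own
    return store.records


print(run_interleaving(guarded=False))  # the endpoint's record is gone
print(run_interleaving(guarded=True))   # the newest record survives
```

In the unguarded run the data and the servers are untouched, yet the store ends up empty: exactly the "unfindable, not broken" state. The guarded run differs by a single version check at write time.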
The thing that doesn't change
Six outages. Six different causes: a reversed network migration, a storm, a mistyped number, a pre-holiday configuration change that exposed a dependency graph nobody had fully drawn, a 2021 failure Amazon never explained in detail, a race condition in automation. Not one cause repeated. Every postmortem delivered. Every safeguard promised.
"Systems this complex don't fail in ways you predict. They fail in the ways you didn't think to protect against."
That's not a counsel of despair — it's a description of how systems actually work when they get large enough. The question worth sitting with isn't whether US-East-1 will fail again. It's what we've collectively decided is acceptable when a significant share of the internet runs through a single data center cluster in one state — and whether that's a decision anyone actually made, or just what happens when a default goes unchallenged long enough to become load-bearing infrastructure.
— Dev Kapoor covers open source software, developer communities, and the politics of code for Buzzrag.