On October 19 and 20, 2025, one of the world’s most reliable cloud infrastructures—Amazon Web Services (AWS)—faced a major disruption in its Northern Virginia (us-east-1) region. This incident, which started with Amazon DynamoDB, quickly cascaded across many critical AWS services, fundamentally challenging common assumptions about system resilience.
In this video, based on analysis from The Architect’s Notebook, we break down what happened, the core cause, how AWS recovered, and the crucial lessons we can learn as engineers and architects.
📉 What Went Wrong?
The incident began around 11:48 PM PDT on October 19 when Amazon DynamoDB—AWS’s fully managed NoSQL database—started returning API errors in us-east-1.
The root cause was traced to a subtle race condition deep within DynamoDB’s automated DNS management system. A conflict between the three independent DNS Enactors resulted in the deletion of the active DNS records for the dynamodb.us-east-1.amazonaws.com regional endpoint. This meant no one, including AWS's own internal systems, could find where DynamoDB lived.
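To make that failure mode concrete, here is a minimal Python sketch of a check-then-act race between two automation workers sharing one DNS record store. Everything in it (the plan IDs, `apply_plan`, `clean_up`) is a hypothetical illustration of the mechanism described above, not AWS's actual code: one worker acts on a stale check while another worker's cleanup deletes the plan that just went live, leaving the endpoint with no record at all.

```python
# Hypothetical sketch of a check-then-act race between two DNS automation
# workers ("enactors") sharing one record store. Not AWS's implementation --
# it only illustrates the failure shape described above.

dns_record = {"plan": None}   # which plan's IPs the regional endpoint serves
known_plans = {}              # plan_id -> list of IPs


def is_newer(plan_id):
    """Unsafe check: compares against whatever happens to be applied right now."""
    return dns_record["plan"] is None or plan_id > dns_record["plan"]


def apply_plan(plan_id, ips):
    """Check-then-act with no lock or version guard at write time."""
    if is_newer(plan_id):
        known_plans[plan_id] = ips
        dns_record["plan"] = plan_id


def clean_up(older_than):
    """Deletes plans believed obsolete -- including, in the interleaving
    below, the plan that has just (wrongly) become live again."""
    for plan_id in [p for p in known_plans if p < older_than]:
        del known_plans[plan_id]
        if dns_record["plan"] == plan_id:
            dns_record["plan"] = None          # the active record is wiped


# A deterministic interleaving that reproduces the outage shape:
ok = is_newer(1)                 # enactor A checks plan 1, then stalls
apply_plan(2, ["10.0.0.2"])      # enactor B applies the newer plan 2
if ok:                           # enactor A resumes, acting on its stale check
    known_plans[1] = ["10.0.0.1"]
    dns_record["plan"] = 1       # stale plan 1 overwrites the live plan 2
clean_up(older_than=2)           # enactor B's cleanup deletes plan 1 -> record empty

print(dns_record)                # {'plan': None}: the endpoint no longer resolves
```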
💥 The Massive Impact
Because DynamoDB is a core dependency for dozens of AWS services (including EC2 instance management, Lambda, Redshift, and IAM), this single failure rippled through the entire ecosystem:
• EC2 (Virtual Machines): Underlying hosts couldn’t renew their management “leases,” leading to “insufficient capacity” errors for new launches, even after DynamoDB recovered.
• Lambda Functions: Failed to invoke or scale properly.
• Network Load Balancer (NLB): Started failing health checks and removing healthy nodes, causing random connection failures (see the health-check sketch after this list).
• Other Services: ECS, EKS, Fargate, Amazon Connect, and Redshift operations were also affected.
In total, it took AWS engineers nearly 15 hours to fully restore all services.
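The NLB symptom above reflects a general trap: when the health-check path itself depends on something that is down, the checker can withdraw far more capacity than the outage ever touched. Here is a hedged Python sketch of one common countermeasure, a “fail-static” guard; the class name, probe interface, and 50% threshold are assumptions for illustration, not how AWS’s load balancers are actually built.

```python
# Hypothetical health checker with a "fail-static" guard: if its checks would
# withdraw too much capacity at once, it stops trusting its own check path.
# Threshold and names are illustrative assumptions.

class HealthChecker:
    def __init__(self, targets, max_removal_fraction=0.5):
        self.targets = set(targets)        # all registered backend targets
        self.in_service = set(targets)     # targets currently receiving traffic
        self.max_removal_fraction = max_removal_fraction

    def run_checks(self, probe):
        """probe(target) -> bool; may itself fail when a shared dependency is down."""
        failing = {t for t in self.targets if not probe(t)}
        removable = int(len(self.targets) * self.max_removal_fraction)
        if len(failing) > removable:
            # Guardrail: mass failure looks like a broken check path, not
            # broken targets. Keep serving from the last known-good set.
            return "fail-static: keeping current targets in service"
        self.in_service = self.targets - failing
        return f"removed {len(failing)} unhealthy targets"


# Usage: a probe whose own dependency (e.g. a control-plane lookup) is down.
checker = HealthChecker(["node-1", "node-2", "node-3", "node-4"])
print(checker.run_checks(lambda t: False))   # every check fails -> fail static
print(sorted(checker.in_service))            # all four nodes still serving traffic
```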
🔑 Key System Design Lessons Learned
This complex event highlighted several timeless engineering truths:
1. Redundancy Isn’t the Same as Immunity: Having multiple independent DNS Enactors didn’t prevent failure, because they all ran the same automation logic and could still interfere with one another. True resilience requires independent failure domains.
2. Dependencies Multiply the Blast Radius: DynamoDB's failure quickly spread because so many core services depend on it indirectly.
3. Recovery Can Be Harder Than Failure: Even once the DNS record was manually restored, AWS spent hours bringing dependent services like EC2 and NLB back online due to backlog congestion. Recovery workflows must be designed to handle large-scale restarts gracefully.
4. Automation Needs Guardrails: Tiny bugs in automated logic can bring an entire region down if safety checks and rollback paths are overlooked (a minimal guardrail sketch follows below).
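Lesson 4 is the easiest to turn into code. The sketch below revisits the toy DNS-plan model from earlier and adds two cheap guardrails that would have blocked that interleaving: re-check the version at write time (a compare-and-set), and never let cleanup delete the plan that is currently live, so a rollback target always exists. Again, this is an illustrative sketch under those assumptions, not AWS’s implementation.

```python
# Hypothetical guardrails for the toy DNS-plan model sketched earlier:
# 1) monotonic apply -- never overwrite a live plan with an older one;
# 2) cleanup never deletes the live plan, so a rollback target always exists.

class RecordStore:
    def __init__(self):
        self.live_plan = None      # plan_id currently served by the endpoint
        self.plans = {}            # plan_id -> list of IPs

    def apply(self, plan_id, ips):
        # Guardrail 1: re-check against the live version at write time
        # (a compare-and-set), instead of trusting an earlier check.
        if self.live_plan is not None and plan_id <= self.live_plan:
            return f"rejected stale plan {plan_id} (live: {self.live_plan})"
        self.plans[plan_id] = ips
        self.live_plan = plan_id
        return f"applied plan {plan_id}"

    def clean_up(self, older_than):
        # Guardrail 2: never delete the live plan, whatever its age.
        for plan_id in [p for p in self.plans if p < older_than and p != self.live_plan]:
            del self.plans[plan_id]

    def rollback(self, plan_id):
        # Because live plans survive cleanup, rollback stays possible.
        if plan_id in self.plans:
            self.live_plan = plan_id
            return f"rolled back to plan {plan_id}"
        return "rollback target missing"


store = RecordStore()
print(store.apply(2, ["10.0.0.2"]))   # applied plan 2
print(store.apply(1, ["10.0.0.1"]))   # guardrail 1: rejected stale plan 1 (live: 2)
store.clean_up(older_than=3)          # guardrail 2: plan 2 survives because it is live
print(store.apply(3, ["10.0.0.3"]))   # applied plan 3
print(store.rollback(2))              # rolled back to plan 2 -- still available
```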
AWS has responded by disabling the DynamoDB DNS automation globally, adding protections against race conditions, and improving EC2 recovery test suites.
Even the best-engineered systems fail; what matters most is how quickly you detect, communicate, and recover.
--------------------------------------------------------------------------------
If you found this breakdown useful, consider subscribing for more real-world system architecture analyses, design failures, and lessons from large-scale engineering.
This content is based on "Ep #53: The Amazon DynamoDB Outage (Oct 19–20, 2025)" by Amit Raghuvanshi | The Architect’s Notebook.
https://open.substack.com/pub/thearch...