Breakdown of the AWS outage in simple words:

1. Sunday night, a DNS problem hit AWS - the DynamoDB endpoint was lost.
2. This meant services couldn't find DynamoDB (a database that stores tons of data).
3. AWS fixed the DNS issue in about 3 hours.
4. But then EC2 (the system that creates virtual servers) broke, because it needs DynamoDB to work.
5. Then the system that checks whether network load balancers are healthy also failed.
6. This crashed Lambda, CloudWatch, SQS, and 75+ other services - everything that needed network connectivity.
7. This created a chain reaction: servers couldn't talk to each other, new servers couldn't start, everything got stuck.
8. AWS had to intentionally slow down EC2 launches and Lambda functions to prevent total collapse.
9. Recovery took 15+ hours as they fixed each broken service while clearing massive backlogs of stuck requests.

This outage impacted Snapchat, Roblox, Fortnite, the McDonald's app, Ring doorbells, banks, and 1,000+ more websites.

This all happened in one AWS region (us-east-1). This is why multi-region architecture isn't optional anymore (rough sketch below).

Oct 21, 2025 · 12:10 AM UTC

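To make that last point concrete, here's a rough sketch of what region fallback can look like on the client side. Everything in it is illustrative (table name, key, regions), and it assumes the table is already replicated to a second region, e.g. via DynamoDB Global Tables:

# Sketch: a DynamoDB read that falls back to a replica region when the
# primary endpoint is unreachable. Table/key names and regions are
# illustrative; assumes cross-region replication already exists.
import boto3
from botocore.exceptions import ConnectTimeoutError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, then fallback

def get_order(order_id: str) -> dict:
    last_error = None
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region)
        try:
            resp = client.get_item(
                TableName="orders",
                Key={"order_id": {"S": order_id}},
            )
            return resp.get("Item", {})
        except (EndpointConnectionError, ConnectTimeoutError) as exc:
            last_error = exc  # DNS failure or dead endpoint: try next region
    raise RuntimeError("all regions failed") from last_error

The client part is the easy bit; as the replies below point out, replication and consistency are where the real complexity lives.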
Replying to @brankopetric00
Multi-region adds so much complexity. If your app is not mission critical, I'm not sure it's worth the trouble. Multi-AZ, yes of course. But it's very rare for an entire region to go down. Tradeoffs, tradeoffs.
Correct. 🤌
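For contrast, multi-AZ is usually a managed setting rather than an architecture project. A minimal sketch with RDS, where a single flag gives you a synchronous standby in another availability zone (all identifiers made up):

# Sketch: multi-AZ on RDS is one flag; AWS keeps a synchronous standby
# in a second availability zone and fails over automatically.
# Identifiers are illustrative; use Secrets Manager for real credentials.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    DBInstanceClass="db.t3.medium",
    Engine="postgres",
    MasterUsername="appuser",
    MasterUserPassword="change-me",
    AllocatedStorage=100,
    MultiAZ=True,  # standby replica in a second AZ
)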
Replying to @brankopetric00
Why couldn't they migrate resources from us-east-1 to us-west-1 during the outage? I thought high availability in the cloud was a fail-safe to prevent single points of failure.
Their services depended on services in us-east-1, so they had to fix the issues there.
Replying to @brankopetric00
Look for AWS to start courting all the talent they laid off/fired/pissed off over the last five years. Automation isn't foolproof. The environment it works in gets a vote, and very few companies that work at this scale hire Chaos Monkeys.
Replying to @brankopetric00
I wonder what the total cost was if you sum up all the customers' outages. Also, companies' uptime levels would go down quite a bit.
Can't imagine...
Replying to @brankopetric00
Multi-region adds a shitload of complexity unless you reaaaallly need it. And if you dig into the root cause, it wasn't the lack of multi-region.
For AWS, it wasn't. For others depending on their us-east-1 region, it was. I agree that multi-region is super complex and involves more than just infrastructure teams. 👍
Replying to @brankopetric00
Genuine question: AWS always mentions their services are hosted multi-region, so I'm still wondering how one entire region going down takes all those services offline.
This impacted only services in us-east-1; services in other regions worked fine.
Replying to @brankopetric00
Does anyone have any idea what exactly that DNS issue was?
No specific details shared so far.
Replying to @brankopetric00
What DNS problem?
The DynamoDB endpoint was not reachable: internal systems couldn't resolve its hostname to an IP.
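Roughly, a failed lookup looks like this from a client's point of view. The hostname is the real regional endpoint; the failure behavior is a reconstruction, since no specific details have been shared:

# Sketch: what "couldn't resolve the hostname" looks like from a client.
# During the outage, a lookup like this would fail instead of returning IPs.
import socket

host = "dynamodb.us-east-1.amazonaws.com"
try:
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    print([info[4][0] for info in infos])  # resolved IP addresses
except socket.gaierror as exc:
    # gaierror ("getaddrinfo error") means the DNS answer never came back
    print(f"DNS resolution failed for {host}: {exc}")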
Replying to @brankopetric00
Multi-region wouldn't mitigate the issue unless you treat regions as completely isolated, starting from the front-end level, not just the backend level. The redundancy has to start with the entry API call going down two different pipelines to the database, with replication between both.
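A rough sketch of that idea: the entry point writes through two independent regional pipelines, so neither depends on the other's backend. Table and field names are made up; the managed version of this is roughly what DynamoDB Global Tables provide:

# Sketch: dual-pipeline writes from the entry point. Each region gets the
# write through its own client, so one region being down doesn't block
# the other. Names are illustrative.
import boto3

clients = {
    region: boto3.client("dynamodb", region_name=region)
    for region in ("us-east-1", "us-west-2")
}

def put_event(event_id: str, payload: str) -> dict:
    results = {}
    for region, client in clients.items():
        try:
            client.put_item(
                TableName="events",
                Item={"event_id": {"S": event_id}, "payload": {"S": payload}},
            )
            results[region] = "ok"
        except Exception as exc:
            results[region] = f"failed: {exc}"
    return results

Dual writes also raise conflict-resolution and consistency questions, which is part of why multi-region involves more than just infrastructure teams.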
Replying to @brankopetric00
And following yesterday's AWS us-east-1 outage... today's Google Trends strongly implies: uptime is often literally only on somebody's mind when they don't have it anymore 🫣
Replying to @brankopetric00
So hilarious. Millions of dollars of Fortune 500 services hinging on a cringe cheapoDB.
Replying to @brankopetric00
That's always been the argument for multi-region. It's on every org's nice-to-have list; it's management pushback on the associated cost that stops it from being implemented.
Replying to @brankopetric00
(one simple word: DNS)
Replying to @brankopetric00
One would think that with the amount of revenue these companies make, they would have set up multi-region architecture. Were they saving on costs by hosting in only one region?
Replying to @brankopetric00
@brankopetric00 Spot-on breakdown of the AWS us-east-1 chaos: a DNS flop snowballed into 75+ services down for 15 hrs. Lesson from subsea fibre: maybe "self-healing rings" with multi-region paths & auto-failover. Turn single-point fails into seamless reroutes!
Replying to @brankopetric00
Microservices were a huge mistake. The Cloud is an even bigger one!
Replying to @brankopetric00
AWS has known for at least 3-4 years that it has a strong dependence on that region. Two years ago someone restarted a DB in the Virginia region and everything crashed, including their own internal tools. I'm surprised that they weren't able to decouple.
Replying to @brankopetric00
The average dev will hype multi-region and a CDN for a static portfolio while big tech doesn't even bother 🫠
Replying to @brankopetric00
Exactly. DR and multi-region failover aren't optional once you hit production scale. You can automate deployments all day, but if your critical systems (databases, APIs, auth) all live in one region, you are one DNS glitch away from downtime. True resilience means distributing workloads, syncing data across regions, and testing your failover plan before an outage hits.
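For the failover piece, one common pattern is DNS-level failover with health checks. A hedged sketch with Route 53, where zone ID, domain, IPs, and health-check ID are all placeholders:

# Sketch: Route 53 failover routing. The PRIMARY record is served while
# its health check passes; otherwise DNS answers flip to SECONDARY.
# All IDs, names, and addresses are placeholders.
import boto3

route53 = boto3.client("route53")

def failover_record(identifier: str, role: str, ip: str, health_check: str | None):
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check:
        record["HealthCheckId"] = health_check
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId="Z0000000000000000000",
    ChangeBatch={"Changes": [
        failover_record("use1", "PRIMARY", "192.0.2.10", "hc-primary-id"),
        failover_record("usw2", "SECONDARY", "198.51.100.10", None),
    ]},
)

And the last sentence of the reply above is the part people skip: the failover path only counts if it's exercised before the outage.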
Replying to @brankopetric00
Auto-erotically running your control plane on your data plane seems like a contributor too.
Replying to @brankopetric00
fake news
Replying to @brankopetric00
And yet Amazon stock was up
Replying to @brankopetric00
Even multi-region won't always help: "Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues."
Replying to @brankopetric00
How does a DNS "problem" hit anything? Doesn't Amazon have its own DNS? A smooth-running DNS server doesn't go sideways on its own. Something (or someone) "HIT" the DNS.
Replying to @brankopetric00
"This is why multi-region architecture isn't optional anymore." I'd contend it never was ...
Replying to @brankopetric00
So now AWS will earn more with multi-AZ deployments.
Replying to @brankopetric00
Seems like very fragile infrastructure. Billions lost.