The Challenges of Availability Zone Outages

Many of us were bitten by the effects of the recent AWS availability zone outage in US East.

The outage took down parts of popular services such as Slack and Hulu, and the effects were felt worldwide: we were unable to start new one-on-one Slack conversations, for instance.

It created headaches for a lot of people: CTOs, cloud and software architects, and DevOps engineers who had to scramble to keep their services up and running.

I wrote an article about this earlier today. By shedding some light on what the problem is and how to deal with it, I hope I can help cure that headache for you.

Automation is key, and knowing which AWS services are actually confined to a single availability zone, rather than spanning a full region, lets you make good choices.

Such choices include (there's a rough automation sketch after this list):

:arrow_right: Configuring databases to work across availability zones.

:arrow_right: Setting up Auto Scaling groups to scale your stateless components across multiple availability zones.

:arrow_right: Taking regular, incremental snapshots of EBS volumes (which are tied to a single zone) to S3 (which is a regional service), and automating restores from those snapshots.
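
To make those choices concrete, here is a minimal boto3 sketch of the kind of automation I mean. Every identifier in it (the database instance, the Auto Scaling group, the subnets, the volume ID, the target zone) is a placeholder, and in practice you would run something like this from whatever scheduler or pipeline you already have:

```python
import boto3

REGION = "us-east-1"  # placeholder region

rds = boto3.client("rds", region_name=REGION)
autoscaling = boto3.client("autoscaling", region_name=REGION)
ec2 = boto3.client("ec2", region_name=REGION)

# 1. Run the database across availability zones.
#    For RDS, Multi-AZ keeps a standby replica in a second zone.
rds.modify_db_instance(
    DBInstanceIdentifier="my-app-db",  # placeholder instance name
    MultiAZ=True,
    ApplyImmediately=True,
)

# 2. Spread stateless capacity by giving the Auto Scaling group
#    subnets in more than one availability zone.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-app-asg",          # placeholder ASG name
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # placeholder subnets in different AZs
    MinSize=2,
    MaxSize=6,
)

# 3. Snapshot EBS volumes regularly. Snapshots are incremental and stored
#    regionally, so they survive the loss of the volume's own zone.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # placeholder volume ID
    Description="Routine AZ-resilience snapshot",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# Restoring means creating a new volume from the snapshot in a healthy zone.
ec2.create_volume(
    SnapshotId=snapshot["SnapshotId"],
    AvailabilityZone="us-east-1b",  # placeholder: pick a zone that is still up
)
```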

There is automation in both AWS itself and Kubernetes to help you set this up, including the AWS EBS Container Storage Interface (CSI) driver, which can restore Kubernetes Persistent Volumes from EBS snapshots.
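
On the Kubernetes side, that restore looks roughly like this with the official Python client. This is a sketch under a few assumptions: the EBS CSI driver and snapshot controller are installed, a StorageClass (here called ebs-sc) is backed by the driver, and a VolumeSnapshot named db-data-snap already points at one of your EBS snapshots; all names are placeholders.

```python
from kubernetes import client, config

# Load credentials from ~/.kube/config (use load_incluster_config() in-cluster).
config.load_kube_config()
core = client.CoreV1Api()

# A PVC restored from an existing VolumeSnapshot: the EBS CSI driver
# provisions a fresh EBS volume from the underlying snapshot.
pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "restored-data"},  # placeholder PVC name
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "ebs-sc",  # placeholder EBS CSI-backed StorageClass
        "resources": {"requests": {"storage": "20Gi"}},
        "dataSource": {
            "apiGroup": "snapshot.storage.k8s.io",
            "kind": "VolumeSnapshot",
            "name": "db-data-snap",  # placeholder VolumeSnapshot name
        },
    },
}

core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc_manifest)
```

That snapshot-to-PVC path is what makes a zone loss recoverable: the data lives in the regional snapshot, and the volume itself can be recreated in whichever zone is still healthy.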

And the most important message? Don’t worry, you don’t have to do this alone. :hugs:

The above summarizes the article; if you want to read more, the full article is here.

Please let me know what you think here! In particular, I would love to hear how you dealt with this recent outage, and what automation features (in Kubernetes or otherwise) you used to do so.