When The Cloud Gets Dark – How Amazon’s Outage Affected Airbnb

About Tobi Knaup

We’re big fans of AWS at Airbnb. All of our infrastructure runs on it, so like many other websites, we were affected by this week’s AWS outage. However, we survived it with only 30 minutes of total downtime, plus search being unavailable for three hours. We didn’t lose any data. Compared to what happened to other sites, that’s not too bad. I want to share exactly what happened, and some of the things we had in place that saved us from a more widespread outage.

What happened?

Amazon didn’t provide much detail, but it seems the initial problem was that a significant part of the network in two us-east availability zones became unreachable Thursday morning at about 1am. This triggered a number of automated actions. Like many AWS services, Elastic Block Store (EBS) is mirrored across multiple nodes to provide redundancy. If some nodes go down, the service automatically re-mirrors the data to new nodes to make up for the lost redundancy. This is expected to happen in a distributed system, and AWS is built to handle it. Obviously you need some spare capacity to make this possible. Apparently the problem was so widespread that there wasn’t enough extra capacity to handle all the re-mirroring. This led to huge IO latency or completely unresponsive EBS volumes. One of the zones recovered after a couple of hours, but the other zone was still having issues on Friday.
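A volume in this state doesn’t error out; it just stops responding, so it helps to watch queue depth rather than wait for failures. Here is a minimal monitoring sketch using the modern boto3 SDK (which didn’t exist at the time); the volume ID and threshold are placeholders, not values from our setup:

    # Hypothetical sketch: flag EBS volumes whose CloudWatch queue length
    # suggests they have become unresponsive. The volume ID and threshold
    # are placeholders.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    def volume_looks_stuck(volume_id, threshold=50):
        """Return True if the volume's average queue length over the last
        10 minutes exceeds the threshold (a rough sign of stalled I/O)."""
        now = datetime.now(timezone.utc)
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EBS",
            MetricName="VolumeQueueLength",
            Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
            StartTime=now - timedelta(minutes=10),
            EndTime=now,
            Period=300,
            Statistics=["Average"],
        )
        datapoints = stats["Datapoints"]
        return bool(datapoints) and max(d["Average"] for d in datapoints) > threshold

    if volume_looks_stuck("vol-0123456789abcdef0"):
        print("EBS volume appears unresponsive; consider failing over")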

How did it affect us?

We use RDS for all our databases, and it runs on top of EC2 and EBS. Our databases became unresponsive when the event first happened, which took Airbnb down for 18 minutes starting at 12:50am Thursday morning. After that, the master db recovered and our website was back up, because our app servers only use the master. Our slave db, which we use for analytics and background tasks, was hit harder and remained unresponsive until Friday. Our background tasks were failing until the morning hours, which delayed outbound emails, reminders, etc. for up to 8 hours.
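One thing that made this worse is that queries against the sick slave would hang rather than fail. As an illustrative sketch (not our actual stack, and the host names are made up), a worker can set aggressive connection and read timeouts so a hung database raises an error quickly instead of blocking for hours:

    # Illustrative sketch (not our actual stack; host names are made up):
    # give database connections aggressive timeouts so a sick replica
    # raises an error quickly instead of hanging a worker indefinitely.
    import pymysql

    def connect(host):
        return pymysql.connect(
            host=host,
            user="app",
            password="secret",
            database="main",
            connect_timeout=5,   # give up quickly if the host is unreachable
            read_timeout=10,     # give up quickly if a query never returns
            write_timeout=10,
        )

    try:
        conn = connect("slave.db.internal")
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
    except pymysql.err.OperationalError as exc:
        # The replica is unhealthy; alert and let the job retry elsewhere.
        print(f"slave unavailable: {exc}")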

The next partial outage happened Friday morning at around 6am when our search daemon crashed. Search on Airbnb was unavailable until about 9am when we brought it back up. The rest of the website was working fine during this time.

The third outage happened Friday at 6pm. This time, the site was down for 11 minutes. Again, the master db became unresponsive due to EBS problems, and every query eventually timed out. In a twist of fate, we deployed code during the outage. A deploy stops the app server processes running the old code and starts fresh ones with the new code. Because the old processes were stuck waiting for their queries to go through, they didn’t terminate and were still running when the new processes came under full load. We ended up with double the number of processes, which exhausted the memory on the app servers. Some of the new processes then died because they ran out of memory, which made things worse.
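The lesson for us was that a deploy has to assume old processes may never exit on their own. Below is a hypothetical sketch of that restart step, with a grace period followed by a forced kill; the PID handling is simplified and this is not our actual deploy tooling:

    # Hypothetical sketch of the restart step that bit us: terminate the old
    # app server processes, but escalate to SIGKILL after a grace period so
    # workers stuck on hung database queries can't linger alongside the new
    # ones and exhaust memory. PID handling is simplified.
    import os
    import signal
    import time

    GRACE_SECONDS = 30

    def stop_workers(pids, grace=GRACE_SECONDS):
        for pid in pids:
            os.kill(pid, signal.SIGTERM)          # ask nicely first

        deadline = time.time() + grace
        remaining = set(pids)
        while remaining and time.time() < deadline:
            for pid in list(remaining):
                try:
                    os.kill(pid, 0)               # probe: is it still alive?
                except ProcessLookupError:
                    remaining.discard(pid)
            time.sleep(1)

        for pid in remaining:                     # still stuck: force it
            os.kill(pid, signal.SIGKILL)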

How did we handle it?

The web frontend recovered by itself after the first downtime. To fix the background tasks, we pointed them to the master db, which was working fine. The Friday morning search downtime had the same underlying problem, so pointing the indexer to the master db until the slave was back fixed it. After the third outage, we had to go in, manually kill the stuck processes, and then restart the site.
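The switch itself was just repointing the read connection. A rough sketch of that kind of override (the host names and environment variable are invented for this example, not our real configuration):

    # Rough sketch of the workaround: background tasks and the search
    # indexer normally read from the slave, but a single override points
    # them at the master while the slave is unhealthy. Host names and the
    # environment variable are invented for this example.
    import os

    DB_HOSTS = {
        "master": "master.db.internal",
        "slave": "slave.db.internal",
    }

    def db_host(role="slave"):
        # Setting READ_FROM_MASTER=1 reroutes read traffic to the master.
        if os.environ.get("READ_FROM_MASTER") == "1":
            return DB_HOSTS["master"]
        return DB_HOSTS[role]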

The main reason things stayed pretty much under control is that we spent some extra money on redundancy for our data store. We blogged back in November about how we moved our database to RDS with a setup that provides redundancy and failover across multiple availability zones. We are very happy today that we did this migration; problems would have been a lot worse with the old setup. While AWS services are extremely reliable, availability zone failures sometimes happen. This is nothing specific to Amazon or other cloud services. They run on servers and networks that are also found in regular datacenters, and these things can fail. It’s the responsibility of the engineers who build on these services to put proper procedures in place to handle failure.
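For anyone setting this up today, the setting that mattered is Multi-AZ, which keeps a synchronous standby in another availability zone and fails over to it automatically. A minimal sketch with the modern boto3 SDK (not the tooling we used; all identifiers and credentials are placeholders):

    # Minimal sketch (modern boto3; placeholder identifiers and credentials):
    # MultiAZ=True tells RDS to maintain a synchronous standby in a different
    # availability zone and fail over to it automatically.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    rds.create_db_instance(
        DBInstanceIdentifier="main-db",
        Engine="mysql",
        DBInstanceClass="db.m5.large",
        AllocatedStorage=100,
        MasterUsername="admin",
        MasterUserPassword="change-me",
        MultiAZ=True,  # synchronous standby in another availability zone
    )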

While things went ok this time, we also identified some things that we could have done better. We’ll follow up with another post about these changes.

