When The Cloud Gets Dark – How Amazon’s Outage Affected Airbnb
by April 24, 2011
We’re big fans of AWS at Airbnb. All of our infrastructure runs on it, so like many other websites, we were affected by this week’s AWS outage. However, we survived it with only about 30 minutes of total downtime, and with search unavailable for three hours. We didn’t lose any data. Compared to what happened to other sites, that’s not too bad. I want to share exactly what happened, and some of the things we had in place that saved us from a more widespread outage.
Amazon didn’t provide much detail, but it seems that the initial problem was that a significant part of the network in two us-east availability zones became unreachable Thursday morning at about 1am. This triggered a number of automated actions. Like many AWS services, Elastic Block Store (EBS) is mirrored across multiple nodes to provide redundancy. If some nodes go down, the service automatically re-mirrors to new instances to make up for the lost redundancy. This is expected to happen in a distributed system, and AWS is built to handle it, but re-mirroring requires spare capacity. Apparently the problem was so widespread that there wasn’t enough extra capacity to handle all the re-mirroring. This led to huge IO latency or completely unresponsive EBS volumes. One of the zones recovered after a couple of hours, but the other zone was still having issues on Friday.
How did it affect us?
We use RDS for all our databases, and it runs on top of EC2 and EBS. Our databases became unresponsive when the event first happened, which took Airbnb down for 18 minutes starting at 12:50am Thursday morning. After that, the master db recovered and our website was back up, because our app servers only use the master. Our slave db was hit harder and remained unresponsive until Friday. We use it for analytics and background tasks. Our background tasks kept failing until the morning hours, which delayed outbound emails, reminders, etc. by up to 8 hours.
The next partial outage happened Friday morning at around 6am when our search daemon crashed. Search on Airbnb was unavailable until about 9am when we brought it back up. The rest of the website was working fine during this time.
The third outage happened Friday at 6pm. This time, the site was down for 11 minutes. Again, the master db became unresponsive due to EBS problems, and every query eventually timed out. In a twist of fate, we deployed code during the outage. A deploy stops the app server processes running the old code and starts fresh ones with the new code. Because the old processes were stuck waiting for their queries to go through, they didn’t terminate and were still running when the new processes came under full load. We ended up with double the number of processes, which exhausted the memory on the app servers. Some of the new processes died because they ran out of memory, which made things worse.
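One common way to keep stuck old processes from piling up during a deploy is to give each one a grace period and then force-kill it. The following is only a minimal sketch of that idea in Python, not a description of our deploy tooling: the function names, the grace period, and the stand-in worker are all hypothetical.

```python
import subprocess
import sys


def stop_with_deadline(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Ask a process to exit, then force-kill it if it does not exit in time."""
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        # The process is still running after the grace period, for example
        # because it is blocked on a database query that never returns.
        proc.kill()  # forced stop (SIGKILL on POSIX)
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"


def spawn_stand_in_worker() -> subprocess.Popen:
    """Hypothetical stand-in for an old app server process that is busy."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
```

A deploy script built this way would call `stop_with_deadline` on every old worker before sending full load to the new ones, so the old and new sets never hold memory at the same time for longer than the grace period.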
How did we handle it?
The web frontend recovered by itself after the first downtime. To fix the background tasks, we pointed them to the master db, which was working fine. The Friday morning search downtime had the same underlying problem, so pointing the indexer to the master db until the slave was back fixed the problem. Friday afternoon we had to go in and manually kill the dead processes, then restart the site.
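We made the switch to the master db by hand. The same decision could in principle be automated with a health check, roughly as in the sketch below. This is an illustration under stated assumptions, not our actual code: the `ReadRouter` class, the host names, and the health check callable are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReadRouter:
    """Chooses which database background jobs should read from."""

    master: str
    replica: str
    is_healthy: Callable[[str], bool]  # hypothetical health probe

    def endpoint_for_reads(self) -> str:
        # Prefer the replica so analytics and background work stay off the
        # master; fall back to the master only while the replica is down.
        if self.is_healthy(self.replica):
            return self.replica
        return self.master


# Simulate the situation during the outage: the replica is unresponsive.
router = ReadRouter(
    master="master-db",      # hypothetical host name
    replica="replica-db",    # hypothetical host name
    is_healthy=lambda host: host != "replica-db",
)
chosen = router.endpoint_for_reads()
```

The trade-off is that the fallback puts extra read load on the master at exactly the moment the system is degraded, so a real setup would likely want to limit which jobs are allowed to fall back.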
The main reason things stayed pretty much under control is that we spent some extra money on redundancy for our data store. We blogged back in November about how we moved our database to RDS with a setup that provides redundancy and failover across multiple availability zones. We are very happy today that we did this migration. Problems would have been a lot worse with the old setup. While AWS services are extremely reliable, availability zone failures sometimes happen. This is nothing specific to Amazon or other cloud services. They run on the same kinds of servers and networks found in regular datacenters, and these things can fail. It’s the responsibility of the engineers who build on these services to put proper procedures in place to handle failure.
While things went ok this time, we also identified some things that we could have done better. We’ll follow up with another post about these changes.