Airbnb is a rapidly growing marketplace and our infrastructure needs are evolving. Like many other websites, we were using MySQL as our main datastore, but we recently migrated it to RDS. RDS (Relational Database Service) is a web service by Amazon that lets you run a scalable MySQL setup in the cloud while hiding the administrative overhead that usually comes with it. This blog post is about the challenges we faced with MySQL, how RDS solves them, and how we migrated to RDS from our old setup.

The challenges

Before we switched to RDS, Airbnb was running on a single MySQL instance on top of Amazon EC2. We have dozens of background tasks that handle things like payments and analytics. Many of the queries that are run by these tasks are expensive SELECTs, and we were starting to notice an impact on frontend performance. MySQL has an easy solution for read-intensive applications (as most websites are): just add a read-only slave server that uses asynchronous replication and divide read queries between the servers… Except setting up replication takes some care and on-going attention. Our engineering team is already maxed out, so adding that much complexity didn’t seem like a good idea.

While EC2 is generally very reliable, there is always a chance of an instance failure. The recovery process we had in place was completely manual. We would have had to launch a new instance and restart the site after we changed the host name. Our data backup solution for the old setup was to take a snapshot of the underlying EBS volume every couple of hours. In other words, the site would have been offline for a couple hours and there was a possibility of data loss.

As a website grows, the database eventually becomes a bottleneck. Scaling horizontally (sharding) is always a delicate task, and scaling vertically with our old setup would have involved a significant amount of planning and manual work. In order to have consistent data, we would have had to take the site offline, take a snapshot, then launch a bigger instance with that snapshot.

How RDS solved these challenges for us

Amazon RDS supports asynchronous master-slave replication, and it can be launched with the click of a button or an API call. The time-intensive administration tasks are all handled by RDS, and our engineers can spend more time developing features.

RDS supports Multi-AZ deployments (multiple availability zones), where it transparently keeps a hot-standby master in a different AZ that is synchronously updated. Unlike asynchronous replication, there is no replication lag, meaning that the hot standby always has the same data as the live master. In case of an instance failure, network outage, or even unavailability of the whole AZ of the master, the hot standby is automatically promoted to be the new master. The fail-over process is completely automated: AWS changes the CNAME record of the master under the hood. 

A Multi-AZ setup gives you a good amount of redundancy, but there is always a worst-case scenario. What happens if the hot standby fails too? For those rare cases (or when you don’t want to spend the extra money for Multi-AZ), RDS provides point-in-time recovery, which lets you boot a new database instance using a consistent snapshot of your data at any time within your data retention period. Yes, any second you want, and up to about five minutes before a failure! The data retention period is one day by default but can be extended by up to 8 days. In addition to that, RDS allows you to pull a consistent snapshot of your data at any time, much like an EBS snapshot. This is great for archiving.

RDS provides a number of different instance types, similar to EC2. You can start with a small, 1.7GB 1-core server, and scale up to a massive quadruple extra large 68GB 8-core. This range should cover our needs for quite some time. This operation inevitably causes some downtime, but since the whole process is automated, it is kept to a minimum. And again, there is an API for that. 

How to switch

Everything is straightforward if you are setting up a site for the first time, but what if you have an existing database with gigabytes of data that you want to migrate to RDS? This is where the fun begins. At the time we did the migration, we already had gigabytes of data and some tables with millions of records. RDS instances only expose MySQL and don’t give you direct access to the machine via SSH or other tools, so unfortunately there is no way to do this on the file system level. The only practical way to move your data over is to do a mysqldump on the source machine, followed by a mysqlimport in RDS. This can take a significant amount of time if you have a lot of data, and unless you want to spend days or weeks figuring out a complex migration strategy, the only option is to take down the site during the process. We wanted to keep the downtime as low as possible, so we looked for simple ways to transfer a significant amount of our data while the site was still running. We came up with the idea to copy the data in tables that never receive UPDATEs or DELETEs, so we would only have to copy the diff later on. Once those tables were carefully selected, the procedure was pretty straightforward. The actual script that we used to do our migration is available at https://gist.github.com/671874

This approach allowed us to perform the whole migration with only 15 minutes of downtime. The new setup eliminates the effects of long-running queries on our frontend performance, and generally seems to perform significantly better than off-the-shelf MySQL on EC2. Even more important than that, we are now well prepared for our future growth.