by Henry Cai
At Airbnb, we look into all possible ways to improve our product and user experience. Often times this involves lots of analytics behind the scene. Our data pipeline thus far has consisted of Hadoop, MySQL, R and Stata. We’ve used a wide variety of libraries for interfacing with our Hadoop cluster such as Hive, Pig, Cascading and Cascalog. However, we found that analysts aren’t as productive as they can be by using Hadoop, and standalone MySQL was no longer an option given the size of our dataset. We experimented with frameworks such as Spark but found them to be too immature for our use-case. So we turned our eye to Amazon Redshift earlier this year, and the results have been promising. We saw a 5x performance improvement over Hive.
Redshift is Amazon’s SQL based enterprise data warehouse solution for large scale and complex analytics. Under the hood, it is a distributed managed ParAccel cluster. It achieves its high performance through extreme parallelism, columnar data storage, and smart data compression. The setup process is very easy and fast (it took just a few minutes to provision a 16-node cluster) , and you can connect to the system via any Postgres compliant client.
The first advice we can offer is to follow what the manual says closely when migrating your tables and queries. We started out by migrating a few large datasets generated by our existing Hadoop based ETL over to Redshift. The first challenge we had was schema migration. Even though Redshift is based on Postgres 8.0, the “subtle” differences are big enough, forcing you into the Redshift way of doing things. We tried to automate the schema migration, but the problem was bigger than we originally expected and we decided it was beyond the scope of our experiment.
Indexes, timestamp type, and arrays are not supported in Redshift, thus you need to either get rid of them in your schema or find a workaround (only 11 primitive data types are supported at this point in time). This was the most lengthy and tedious part of the migration to Redshift, but it serves as a very good training and evaluation process (whether redshift is the right solution for you).
When defining your schema, but careful with the distribution key, which determines how your data is distributed across the cluster. Check all the queries you run against the table and choose the column that gets joined most frequently to get the best performance. Unfortunately, you can only specify one distribution key and if you are joining against multiple columns on a large scale, you might notice a performance degradation. Also, specify the columns your range queries use the most as sort key, as it will help with the performance.
The next step is loading our data into the system. Which probably sounds easy, but there are few gotchas.
First, in order to load your data into Redshift, it has to be in either S3 or Dynamo DB already. The default data loading is single threaded and could take a long time to load all your data. We found breaking data into slices and loading them in parallel helps a lot.
Second, not all the utf-8 control characters are supported. Some of our data originally came from other SQL databases and unfortunately it has all sorts of utf-8 characters. Redshift only supports control characters up to 3 characters long. Remember, Redshift is intended for analytics and I doubt those characters are of any use for the purpose, so clean them up before loading. If you don’t have many of those cases the other options is to use MAXERROR option to skip them. Another issue we had was NULL values, since Redshift only supports one null value when loading data. If you have multiple ‘NULL’ values in your data what you need to do is to load them as a string into a temp table and cast them back to NULL. Last, we had some data in json format, and we had to convert those into flat files, since it is not supported in Redshift.
After schema migration and data loading we are finally ready to play around with Redshift to see its power in action.
Redshift — 16 node cluster ($13.60 per hour — $0.85 per hour per node)
Node Type: dw.hs1.xlarge
CPU: 4.4 EC2 Compute Units (2 virtual cores) per node
Memory: 15 GiB per node
I/O Performance: Moderate
Storage: 3 HDD with 2 TB of storage per node
Hive/EMR — 44 node cluster ($57 per hour)
Node Type: Cluster Compute Quadruple Extra Large
Memory: 23.00 GB
CPU: 33.5 (2xIntel Xeon X5570)
Storage: 1690 GB (2×840 GB)
I/O Performance: Very High
Runtime Hive: 28 minutes Redshift: less than 6 minutes
Simple range query against a giant table with 3 billion rows, we saw 5x performance improvement over Hive!
For the following slightly more complex query that has two joins with millions of rows.
Hive: 182 seconds
Redshift: 8 seconds
Redshift is 20x faster than the Hive version of the query!
As shown above the performance gain is pretty significant, and the cost saving is even more impressive: $13.60/hour versus $57/hour. This is hard to compare due to the different pricing models, but check out pricing here for more info. In fact, our analysts like Redshift so much that they don’t want to go back to Hive and other tools even though a few key features are lacking in Redshift. Also, we have noticed that big joins of billions of rows tend to run for a very long time, so for that we’d go back to hadoop for help.
From our preliminary experiment with Redshift, although lacking a few features we would like it to have, it is very responsive and can handle range and aggregation against a fairly large dataset very well. Anyone with a little SQL background can get start to use it immediately, and the cost of the system is very reasonable. We don’t think Redshift is a replacement of the hadoop family due to its limitations, but rather it is a very good complement to hadoop for interactive analytics. Check it out on Amazon and we hope you will enjoy the ride as well!