At Airbnb, we look into all possible ways to improve our product and user experience. Often times this involves lots of analytics behind the scene. Our data pipeline thus far has consisted of Hadoop, MySQL, R and Stata. We’ve used a wide variety of libraries for interfacing with our Hadoop cluster such as Hive, Pig, Cascading and Cascalog. However, we found that analysts aren’t as productive as they can be by using Hadoop, and standalone MySQL was no longer an option given the size of our dataset. We experimented with frameworks such as

Spark but found them to be too immature for our use-case. So we turned our eye to Amazon Redshift earlier this year, and the results have been promising. We saw a 5x performance improvement over Hive.

Redshift is Amazon’s SQL based enterprise data warehouse solution for large scale and complex analytics. Under the hood, it is a distributed managed ParAccel cluster. It achieves its high performance through extreme parallelism, columnar data storage, and smart data compression. The setup process is very easy and fast (it took just a few minutes to provision a 16-node cluster) , and you can connect to the system via any Postgres compliant client.

## Experiment Setup:

Redshift — 16 node cluster ($13.60 per hour —$0.85 per hour per node) Node Type: dw.hs1.xlarge CPU: 4.4 EC2 Compute Units (2 virtual cores) per node Memory: 15 GiB per node I/O Performance: Moderate Platform: 64-bit Storage: 3 HDD with 2 TB of storage per node

Hive/EMR — 44 node cluster ($57 per hour) Node Type: Cluster Compute Quadruple Extra Large Memory: 23.00 GB CPU: 33.5 (2xIntel Xeon X5570) Storage: 1690 GB (2×840 GB) Platform: 64-bit I/O Performance: Very High ### Test 1: Runtime Hive: 28 minutes Redshift: less than 6 minutes Simple range query against a giant table with 3 billion rows, we saw 5x performance improvement over Hive! ### Test 2: For the following slightly more complex query that has two joins with millions of rows. Runtime Hive: 182 seconds Redshift: 8 seconds Redshift is 20x faster than the Hive version of the query! Results As shown above the performance gain is pretty significant, and the cost saving is even more impressive:$13.60/hour versus \$57/hour. This is hard to compare due to the different pricing models, but check out pricing

here for more info. In fact, our analysts like Redshift so much that they don’t want to go back to Hive and other tools even though a few key features are lacking in Redshift. Also, we have noticed that big joins of billions of rows tend to run for a very long time, so for that we’d go back to hadoop for help.

## Conclusion

From our preliminary experiment with Redshift, although lacking a few features we would like it to have, it is very responsive and can handle range and aggregation against a fairly large dataset very well. Anyone with a little SQL background can get start to use it immediately, and the cost of the system is very reasonable. We don’t think Redshift is a replacement of the hadoop family due to its limitations, but rather it is a very good complement to hadoop for interactive analytics. Check it out on Amazon and we hope you will enjoy the ride as well!

Want to work with us? We're hiring!