How I Got Started at Airbnb

Mike Lewis (photo by Celeste Noche)

Last fall Airbnb acquired Fondu, a company that Gauri Manglik and I founded. Fondu was helping people discover fresh new restaurants through a beautiful mobile experience. Rather than reading 300 reviews about a restaurant, we felt it would be more useful to follow your friends and respected foodies and see where they were eating.

Since the acquisition, I’ve gotten a lot of questions about the process and the transition. The biggest question I get whenever people hear about the acquisition is “Do you miss working on a small team?” My gut response to this isn’t a yes or no, but actually just the realization that I still feel like I’m on a small team. The product team I work on, Trust, is just as small as Fondu was, and operates with significant autonomy. We have our mission: to make Airbnb the most trusted community on the internet. The only real difference is that instead of using off-the-shelf everything, we’re sitting next to other teams that are building the latest and greatest for us to use. Teams like Data Infrastructure, SRE, Search, Mobile, and many others. Instead of building everything from scratch, we have access to battle-tested functionality that lets us ship good stuff, even faster than the hacky way. Both ways allow quick iteration, but having supporting teams means that the solutions tend to scale better and be more reliable.

What was the acquisition process like?
Similar to dating. After the initial introduction was made, we started talking in earnest, getting a feel for how we each worked, what our values were, and generally if there was chemistry. The more we talked, the more we realized how well we would fit into the Airbnb culture. There was a full battery of interviews, and a negotiation process that lasted a few rounds. We had a few deals on the table, but we went with Airbnb because of its amazing culture and compelling vision. We were also fortunate to have amazing investors who supported us through the entire process.

How has it affected your daily life?
Honestly, I get to focus on code and product a lot more. There are fewer distractions, and most of the details that suck time are eliminated. Some days I miss the uncertainty and the struggle, but that’s been traded for some nice perks. I’ve stopped eating at the same places every day and instead get to enjoy Chef Sam’s diverse meals while chatting with some of the most people-oriented engineers I’ve ever met. Having time to go to the gym daily has been transformational for my health as well. I’m currently working on repairing two years’ worth of physical neglect due to life “in the trenches.” I’ve even been able to start traveling again and actually wrote the first draft of this post from our offices in Delhi.

What’s different about the work you’re doing?
The scale of data at Airbnb is orders of magnitude greater than what we were dealing with at Fondu. We also deal with people’s safety and real money, so our standards for protecting them are celestial. Having more engineering support means that the quality of the systems and infrastructure is also higher, as it needs to be. The tradeoff is that having a larger infrastructure means more moving parts need to be connected and maintained for every project. It’s no longer possible to keep every part of the system fresh in your head because the product does so much.

What problems are you working on now?
I’ve been working on keeping our users safe and preventing fraud. My first major project was building a graph traversal system that continuously relates users sharing common attributes. We use this to identify duplicate accounts created by bad actors. The challenge was that it had to be really fast, as it is accessed on almost every authenticated request. As malicious users become more sophisticated, we’re developing smarter techniques to stay ahead of them and neutralize attack vectors. Since then I’ve been involved in multiple projects, such as Verified ID, related to improving our trust/anti-fraud infrastructure. Some components of those projects have been generalized and are getting internal adoption (such as Saddle).
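To make the idea of relating users through shared attributes concrete, here is a minimal sketch. It is not the production system: it swaps in a union-find structure for the graph traversal, and the account ids, attribute names, and values are all made up for illustration.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_accounts(accounts):
    """Group account ids that share any (attribute, value) pair.

    `accounts` maps a hypothetical account id to a dict of attributes.
    """
    uf = UnionFind()
    first_seen = {}  # (attribute, value) -> first account id seen with it
    for account_id, attrs in accounts.items():
        uf.find(account_id)  # register accounts that share nothing
        for pair in attrs.items():
            if pair in first_seen:
                uf.union(first_seen[pair], account_id)
            else:
                first_seen[pair] = account_id
    clusters = defaultdict(set)
    for account_id in accounts:
        clusters[uf.find(account_id)].add(account_id)
    return list(clusters.values())


# Made-up accounts: u1 and u2 share a phone, u2 and u3 share a device.
accounts = {
    "u1": {"phone": "555-0100", "device": "d-1"},
    "u2": {"phone": "555-0100", "device": "d-2"},
    "u3": {"device": "d-2"},
    "u4": {"phone": "555-0199"},
}
clusters = cluster_accounts(accounts)
```

Here u1, u2, and u3 end up in one cluster even though u1 and u3 share nothing directly, which is the transitive relationship a traversal would also surface. Near-constant-time lookups are one reason a structure like this suits a check that runs on almost every request.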

 
To be honest, the past 7 months have been a real transition, but also a massive opportunity for personal growth. Airbnb’s vision is inspiring, my colleagues challenge me daily, and the problems we’re solving are legendary. If my life were a book, this is a chapter I would not dream of missing out on.

Location Relevance… aka knowing where you want to go in places we’ve never been

by Maxim Charkov, Riley Newman & Jan Overgoor

Here at Airbnb, as you can probably imagine, we’re big fans of travel. We love thinking about the diversity of experiences our host community offers, and we spend a fair amount of time trying to make sense of the tens of thousands of cities where people are booking trips every night. If Apple has the iPad and iPhone, we have New York and Paris. And Kavajë, Außervillgraten, and Bli Bli.

The tricky thing is, most of us haven’t been to Bli Bli. So we try to come up with creative ways to help people find the experience they’re looking for in places we know very little about. The key to this is our search algorithm – a system that combines dozens of signals to surface the listings guests want.

In the early days, our approach was pretty straightforward. Lacking data or personal experience to guide an estimate of what people would want, we returned what we considered to be the highest quality set of listings within a certain radius from the center of wherever someone searched (as determined by Google).

SF heatmap of listings returned without location relevance model

This was a decent first step, and our community worked with it resiliently. However, for a company based in San Francisco, we didn’t have to look far to realize this wasn’t perfect. A general search for our city would return great listings but they were scattered randomly around town, in a variety of neighborhoods, or even outside of town. This is a problem because the location of a listing is as significant to the experience of a trip as the quality of the listing itself. However, while the quality of a listing is fairly easy to measure, the relevance of the location is dependent upon the user’s query. Searching for San Francisco doesn’t mean you want to stay anywhere in San Francisco, let alone the Bay Area more broadly. Therefore, a great listing in Berkeley shouldn’t come up as the first result for someone looking to stay in San Francisco. Conversely, if a user is specifically looking to stay in the East Bay, their search result page shouldn’t be overwhelmed by San Francisco listings, even if they are some of the highest quality ones in the Bay Area.

Exponential distance curve

So we set out to build a location relevance signal into our search model that would endeavor to return the best listings possible, confined to the location a searcher wants to stay.

One heuristic that seems reasonable on the surface is that listings closer to the center of the search area are more relevant to the query. Given that intuition, we introduced an exponential demotion function based upon the distance between the center of the search and the listing location, which we applied on top of the listing’s quality score.
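A minimal sketch of that kind of demotion, assuming the score is simply the quality score multiplied by an exponential decay in distance. The 5 km decay scale, the listings, and their quality scores are made-up values, not the ones we used.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def exponential_demotion(distance_km, scale_km=5.0):
    """1.0 at the search center, shrinking by a factor of e every scale_km."""
    return math.exp(-distance_km / scale_km)


def rank(listings, center, scale_km=5.0):
    """Sort listings by quality score multiplied by the distance demotion."""
    def score(listing):
        distance = haversine_km(center, listing["location"])
        return listing["quality"] * exponential_demotion(distance, scale_km)
    return sorted(listings, key=score, reverse=True)


# Made-up listings and quality scores for illustration.
sf_center = (37.7749, -122.4194)
listings = [
    {"name": "berkeley", "quality": 0.95, "location": (37.8715, -122.2730)},
    {"name": "mission", "quality": 0.80, "location": (37.7599, -122.4148)},
]
ranked_names = [listing["name"] for listing in rank(listings, sf_center)]
```

With these numbers the higher-quality Berkeley listing drops below the Mission listing, because it sits roughly 17 km from the search center.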

SF heatmap with distance demotion

This got us past the issue of random locations, but the signal overemphasized centrality, returning listings predominantly in the city center as opposed to other neighborhoods where people might prefer to stay.

Sigmoid distance curve

To deal with this, we tried shifting from an exponential to a sigmoid demotion curve. This had the benefit of an inflection point, which we could use to tune the demotion function in a more flexible manner. In an A/B test, we found this to generate a positive lift, but it still wasn’t ideal – every city required individual tweaking to accommodate its size and layout. And the city center still benefited from distance-demotion.
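For illustration, a sigmoid demotion could look like the sketch below. The parameter names `midpoint_km` and `steepness` and their default values are assumptions; they stand in for the per-city tuning described above.

```python
import math


def sigmoid_demotion(distance_km, midpoint_km=8.0, steepness=1.0):
    """Logistic demotion: near 1.0 close in, 0.5 at the inflection point,
    and approaching 0.0 far away. Written to avoid overflow for large inputs.
    """
    z = steepness * (distance_km - midpoint_km)
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


# Listings within a few km are barely demoted; the drop happens around
# the inflection point, which is the knob that needs tuning per city.
near = sigmoid_demotion(2.0)
at_midpoint = sigmoid_demotion(8.0)
far = sigmoid_demotion(20.0)
```

Unlike the exponential curve, this one stays nearly flat close to the center, so listings a short distance out are not penalized much. Moving `midpoint_km` is what lets the curve fit a small town or a sprawling city.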

There are, of course, simple solutions to a problem like this. For example, we could expand the radius for search results and diminish the algorithm’s distance weight relative to weights for other factors. But most locations aren’t symmetrical or axis-aligned, so by widening our radius a search for New York could – gasp – return listings in New Jersey. It quickly became clear that predetermining and hardcoding the perfect logic is too tricky when thinking about every city in the world all at once.

Listing density relative to distance from city center, select market comparison

So we decided to let our community solve the problem for us. Using a rich dataset comprised of guest and host interactions, we built a model that estimated a conditional probability of booking in a location, given where the person searched. A search for San Francisco would thus skew towards neighborhoods where people who also search for San Francisco typically wind up booking, for example the Mission District or Lower Haight.
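As a rough sketch, the simplest version of such an estimate is a ratio of counts over (search query, booked location) pairs. The bookings below are invented, and a real model would need smoothing and far more data.

```python
from collections import Counter, defaultdict


def booking_probabilities(bookings):
    """Estimate P(booked location | search query) from observed pairs.

    `bookings` is an iterable of (query, booked_location) tuples.
    """
    counts = defaultdict(Counter)
    for query, location in bookings:
        counts[query][location] += 1
    return {
        query: {
            location: n / sum(by_location.values())
            for location, n in by_location.items()
        }
        for query, by_location in counts.items()
    }


# Made-up bookings that followed a search for "San Francisco".
bookings = [
    ("San Francisco", "Mission District"),
    ("San Francisco", "Mission District"),
    ("San Francisco", "Lower Haight"),
    ("San Francisco", "Berkeley"),
]
probabilities = booking_probabilities(bookings)
```

In this toy data, half of the San Francisco searches end in a Mission District booking, so listings there would get the strongest location signal for that query.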

Choropleth of probability of booking given a general query for San Francisco

This solved the centrality problem and an A/B test again showed positive lift over the previous paradigm.

SF heatmap with location relevance signal

However, it didn’t take long to realize the biases we had introduced. We were pulling every search to where we had the most bookings, creating a gravitational force toward big cities. A search for a smaller location, such as the nearby surf town Pacifica, would return some listings in Pacifica and then many more in San Francisco. But the urban experience San Francisco offers doesn’t match the surf trip most Pacifica searchers are planning. To fix this, we tried normalizing by the number of listings in the search area. In the case of Pacifica, we now returned other small beach towns over SF. Victory!
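One way to read that normalization, sketched under assumptions: divide the bookings each location receives for a query by the number of listings in that location, so large markets no longer win on volume alone. The exact formula we used is not spelled out here, and every count below is made up.

```python
def normalized_scores(bookings_by_location, listings_by_location):
    """Bookings per listing for each location that has any listings."""
    scores = {}
    for location, bookings in bookings_by_location.items():
        listings = listings_by_location.get(location, 0)
        if listings > 0:
            scores[location] = bookings / listings
    return scores


# Made-up bookings that followed a search for "Pacifica".
pacifica_bookings = {"San Francisco": 60, "Pacifica": 30, "Half Moon Bay": 20}
# Made-up listing counts per location.
listing_counts = {"San Francisco": 6000, "Pacifica": 50, "Half Moon Bay": 80}

raw_order = sorted(pacifica_bookings, key=pacifica_bookings.get, reverse=True)
scores = normalized_scores(pacifica_bookings, listing_counts)
normalized_order = sorted(scores, key=scores.get, reverse=True)
```

On raw counts San Francisco comes first. After dividing by listing counts, Pacifica and Half Moon Bay both outrank it.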

Change in location ranking score before and after normalization

At this point we were close to solving the problem, but something still didn’t feel right. In the earlier world of randomly-scattered listings, there were a number of serendipitous bookings. The mushroom dome, for example, is a beloved listing for our community, but few people find it by searching for Aptos, CA. Instead, the vast majority of mushroom dome guests discovered it while searching for Santa Cruz. However, by tightening up our search results for Santa Cruz to be great listings in Santa Cruz, we made the mushroom dome vanish. Thus, we decided to layer in another conditional probability encoding the relationship between the city people booked in and the cities they searched to get there.

Searching for Santa Cruz

The relationship between the two conditional probabilities we used is displayed in the graph to the right. While all of the cities in the graph have a low booking likelihood relative to Santa Cruz itself, they are also mostly small markets and we can give them some credit for depending on Santa Cruz for searches for their bookings. At the same time places like San Jose and Monterey have no clear connection to Santa Cruz, so we can consider them as completely separate markets in search. It’s important that improvements to the model do not lead to regressions in other parts of the world. In this case, little changed for our bigger markets like San Francisco. But this additional signal brings back the mushroom dome and other remote but iconic properties, facilitating the unique experiences our community is looking for.
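The two conditional probabilities can be sketched from the same (searched city, booked city) pairs. The counts below are invented to mirror the Santa Cruz example, and the way the two signals are combined in ranking is left out because it is not described here.

```python
from collections import Counter


def conditionals(pairs):
    """Return P(booked | searched) and P(searched | booked), each keyed by
    (searched_city, booked_city).
    """
    pair_counts = Counter(pairs)
    searched_totals = Counter(searched for searched, _ in pairs)
    booked_totals = Counter(booked for _, booked in pairs)
    p_booked_given_searched = {
        (s, b): n / searched_totals[s] for (s, b), n in pair_counts.items()
    }
    p_searched_given_booked = {
        (s, b): n / booked_totals[b] for (s, b), n in pair_counts.items()
    }
    return p_booked_given_searched, p_searched_given_booked


# Made-up (searched, booked) pairs.
pairs = (
    [("Santa Cruz", "Santa Cruz")] * 16
    + [("Santa Cruz", "Aptos")] * 3
    + [("Santa Cruz", "San Jose")] * 1
    + [("Aptos", "Aptos")] * 1
    + [("San Jose", "San Jose")] * 19
)
p_booked, p_searched = conditionals(pairs)

aptos = ("Santa Cruz", "Aptos")
san_jose = ("Santa Cruz", "San Jose")
```

In this toy data only 15% of Santa Cruz searches book in Aptos, but 75% of Aptos bookings start from a Santa Cruz search, so Aptos depends on Santa Cruz. San Jose scores 5% in both directions, which is why it can be treated as a separate market.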

The location relevance model that we built during this effort relies completely on data from our users’ behavior. We like this because it allows our community to dynamically inform future guests where they will have great experiences, and allows us to apply the model uniformly to all of the places around the world where our hosts are offering up places to stay.


Huge thanks to Stamen Design and OpenStreetMap.org for sharing their map tiles and data, respectively.

Distributed Computing at Airbnb

Airbnb is obsessed with creating a great experience for hosts and guests alike. One of several ways we work on doing this is by analyzing the various sources of data we have. We have a vast volume of data from various sources—logging, third party analytics, and our own internal sets of generated data.

In order to further analyze this data we’ve leveraged a number of open source tools, which may come as no surprise. Some of these names may be familiar to those who have worked on similar data infrastructure projects, but we’re also doing things a bit differently. We run 3 frameworks on a cluster managed by Mesos. The frameworks we rely heavily upon are Chronos, Hadoop, and Storm. I’ll talk about each one of these briefly and how we use them.

Mesos

Mesos can be thought of as an operating system for a cluster. It manages resources and allows you to allocate those resources to frameworks. Mesos is to clusters what the Linux kernel is to individual computers. Mesos is novel for a number of reasons, but the most important thing about Mesos is that it makes it possible to use the same cluster to run multiple frameworks and ensure that resources are being utilized properly. You can allocate just enough computing power, as you need it. No more, no less. Ideally your cluster will be between 95% and 99% utilized at all times. Mesos has other great features, such as taking advantage of Linux containers to ensure processes don’t run away.

Chronos

Chronos has been talked about on this blog before. It was written by engineers at Airbnb to solve a problem that seems easy to solve, but actually isn’t. Chronos allows us to schedule tasks on a cluster, such as running a daily or weekly Pig job on the Hadoop cluster. Chronos runs on Mesos alongside our other frameworks.

Hadoop

Another framework we run on Mesos is Hadoop. Hadoop has been around for a long time and will be familiar to most people. The newest release of Hadoop includes a new resource manager called YARN, which is in some ways similar to Mesos. YARN is specific to Hadoop and only works for MapReduce jobs at the moment, so we have decided to stick with MapReduce v1 on Mesos instead. The future of YARN remains unclear, but it will likely mature into a valuable framework.

Storm

Lastly, we use the Storm framework for running our real time distributed computing tasks, such as doing real time analytics or running jobs that require significant computing resources. Storm is stable, mature, and full-featured. You can run Storm on the same nodes that your MapReduce jobs and Chronos tasks run on, without them interfering with each other.

At the moment the barrier to entry for these projects is fairly high. None of them are easy to deploy, especially in conjunction with Mesos. We have been working with the developers of these projects to make it easier to use and deploy them by contributing patches and submitting bug reports. We hope to help enable others to take advantage of these tools, and we want to give back to the community that has given so much to us.

We’re going to post a follow-up in a few weeks with more details on how we’ve deployed these projects. Stay tuned for more.

Behind the Scenes: Airbnb Neighborhoods

Q+A with Andy & Ben

Let’s start with some background. Why did you guys focus on Neighborhoods and what is the Neighborhoods project?

Andy: Airbnb Neighborhoods was born from the Snow White project, and our research showed that location is the number one criterion for Airbnb travelers when choosing a place to stay. Our goal with Neighborhoods is to help every Airbnb traveler figure out where in a city to stay and feel more connected to the local culture. We like to call this local intuition the sixth sense of traveling.

Neighborhoods is also the successor of our previous company, NabeWise, which was acquired by Airbnb in 2012. At NabeWise, we provided movers and travelers with a comprehensive and essential neighborhood guide that chronicled 25 US cities. By working with Airbnb’s passionate and international community of hosts and guests, we are able to evolve our product and offer truly global, yet locally-nuanced, solutions for enhancing aspirational travel.

How did you guys handle all of the data?

Andy: Neighborhoods itself doesn’t deal with insane amounts of data. Instead we were able to offload the hard work to external services. One such service, Glop (Genome Location Pipeline), regularly associates our data with the neighborhood in which it occurs. It goes something like this:

  1. Airbnb produces lots of data each day: tracking reservations, new users, new listings, etc.
  2. Glop is scheduled by Chronos to run.
  3. Glop churns through all this new data, ignoring it if it isn’t associated with a location (i.e. does not have an associated latitude and longitude).
  4. Glop looks up each (latitude, longitude) pair to see if we have neighborhood coverage there.
  5. If Glop sees that something is in a neighborhood we cover, it will then dump that information to flat files and memcached.

For example, say you list your place, which is located at (45.43647998034738, 12.333568650219718). The next time Glop runs, it will correctly identify your listing as being in San Marco.
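Steps 3 through 5 above can be sketched roughly as follows. This is only an illustration in Python, not Glop itself (which runs on Hadoop): the ray-casting lookup, the record fields, and the rectangular "San Marco" boundary are all assumptions made up for the example.

```python
def point_in_polygon(lat, lng, polygon):
    """Ray-casting test; `polygon` is a list of (lat, lng) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lng1 = polygon[i]
        lat2, lng2 = polygon[(i + 1) % n]
        # Does this edge straddle the line of constant latitude through the point?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that line.
            crossing = lng1 + (lat - lat1) * (lng2 - lng1) / (lat2 - lat1)
            if lng < crossing:
                inside = not inside
    return inside


def locate(records, neighborhoods):
    """Map record id -> neighborhood name for records we have coverage for."""
    located = {}
    for record in records:
        lat, lng = record.get("lat"), record.get("lng")
        if lat is None or lng is None:
            continue  # step 3: ignore data without a location
        for name, polygon in neighborhoods.items():  # step 4: coverage lookup
            if point_in_polygon(lat, lng, polygon):
                located[record["id"]] = name  # step 5 would persist this
                break
    return located


# A made-up rectangular boundary; real neighborhood geometry is far finer.
neighborhoods = {
    "San Marco": [(45.430, 12.330), (45.430, 12.342),
                  (45.440, 12.342), (45.440, 12.330)],
}
records = [
    {"id": "listing-1", "lat": 45.43647998034738, "lng": 12.333568650219718},
    {"id": "user-2"},  # no location, skipped
    {"id": "listing-3", "lat": 48.8566, "lng": 2.3522},  # outside coverage
]
located = locate(records, neighborhoods)
```

Checking every polygon for every point is fine for a sketch. With many neighborhoods, a spatial index (PostGIS provides one) would be the more realistic choice.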

Glop looks something like this:

picture of glop

In order to capture neighborhood boundaries we also built a custom browser-based system for creating the neighborhood geometry. Zack Walker, our cartographer, works with this system to map out each city in very fine grained detail. We’re then able to really play with the geometry and pass it through various filters before importing it into the front end facing app. By the time the front end gets the underlying data, it is relatively small and manageable.

neighborhoods mapping ui 2

neighborhoods mapping ui 1

What was the stack?

Andy: There are actually quite a few components, among which are:

Neighborhoods, the App

Server Side

  • Rails 3.2
  • PostgreSQL/PostGIS
  • Memcached

Client Side

  • CoffeeScript
  • Sass
  • jQuery
  • Handlebars
  • Backbone
  • Underscore

Neighborhoods, the API

Server Side

  • Sinatra
  • PostgreSQL/PostGIS

The Neighborhood Tool

Server Side

  • Rails 3.2
  • PostGIS
  • nsync for data versioning

Neighborhoods, the Data Pipeline

Server Side (EMR)

  • Clojure
  • Java
  • Hadoop
  • Memcached
  • Cascalog

What was your biggest challenge building Neighborhoods?

Andy: My biggest challenge was engineering the best way to give the front-facing Rails app (henceforth neighborhoods-core) access to all the data produced by the pipeline. Neighborhoods-core reads data produced by the pipeline to personalize pages and produce the community visualization. What we needed was a solution that could lookup resources by city or neighborhood. We also wanted our solution to be fast. Very fast.

The “resources” we needed to fetch are de-normalized tuples representing a variety of types of data. A single resource tuple could represent a reservation, a listing or even a user.

At first, it seemed we wanted a SQL database, as our data had relations. However, this was ruled out based on the need for mass updates.

Next, we looked at an in-house NoSQL solution that we call Dyson. Dyson seemed to give us the flexibility we needed with writes and updates, so we tried it. For reference, Dyson is backed by Amazon’s DynamoDB, a reliable, but limited, managed, NoSQL solution. In essence, if we put the data right into DynamoDB, then Dyson can serve it. This led to the creation of a DynamoDB cascading tap. Countless timeouts, headaches and late nights later, we had a working solution.

However, there was a problem, namely DynamoDB’s 64KB item size limit. When you’re storing uncompressed JSON, that’s a pretty easy target to reach. As a band-aid, we engineered a solution involving pages of tuples. To say this solution was sub-optimal is putting it mildly, and the performance was even worse.

With launch quickly approaching, brilliant words saved the day: “You don’t need a database, you need a [expletive deleted] cache” 1. So that’s what we did: we traded our database for a cache. Specifically, we switched from Dyson to Memcached.

How does this story end? 35ms response times.
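To illustrate the shape of that design, here is a minimal sketch in which the pipeline publishes one precomputed blob per city or neighborhood and the app reads it back by key. The key format, the compression step, and the in-memory `FakeCache` standing in for a Memcached client are all assumptions for the example.

```python
import json
import zlib


class FakeCache:
    """In-memory stand-in for a Memcached client (get/set only)."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)


def cache_key(kind, slug):
    """Hypothetical key scheme, e.g. 'resources:neighborhood:san-marco'."""
    return f"resources:{kind}:{slug}"


def publish(cache, kind, slug, resources):
    """Pipeline side: overwrite the whole blob for one city or neighborhood."""
    payload = zlib.compress(json.dumps(resources).encode("utf-8"))
    cache.set(cache_key(kind, slug), payload)


def lookup(cache, kind, slug):
    """App side: a single key read, with an empty list on a cache miss."""
    payload = cache.get(cache_key(kind, slug))
    if payload is None:
        return []
    return json.loads(zlib.decompress(payload).decode("utf-8"))


cache = FakeCache()
publish(cache, "neighborhood", "san-marco", [
    {"type": "listing", "id": 1},
    {"type": "reservation", "id": 7},
])
san_marco = lookup(cache, "neighborhood", "san-marco")
missing = lookup(cache, "city", "nowhere")
```

Because each run overwrites whole blobs, mass updates are just a batch of writes, and each page render needs only one read per city or neighborhood.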

Ben: My biggest challenge was setting up the neighborhood page layout tools. We needed tools in place to allow our content editors, translators, and photographers to begin work before we were even close to final designs. We also realized pretty early on that we needed to allow considerable flexibility in how pages would be laid out. Additionally, this tool had to be easy to use so that it wouldn’t waste our content editors’ time, as that was the limiting factor in whether we would be able to ship. I ended up creating a drag-and-drop page creation system that could import images from Finder, iPhoto, or any other photo viewing software our people were using. Once imported, images could be edited, reflowed, and captioned on the page.

We also ran into a ton of issues because all of our photos are very high resolution and took a while to process. To speed things up, I wrote a high performance image processing server in Clojure that essentially injected itself as a proxy in front of the Rails image upload endpoint. Unfortunately, we ran up against some fairly bad image quality issues for certain images that didn’t occur when processing using imagemagick, so we were never able to fully roll it out.

What did you guys learn?

Ben: It’s important to consider performance from day one. On this project, we kept New Relic Development Mode open in a separate tab at basically all points during development. This allowed us to constantly monitor what our app was actually doing, rather than hoping we had written fast code and then trying to bolt speed on at the last second. We also made our app akamai friendly from the start, so static page caching was just a matter of setting the right headers.

You can check out Neighborhoods here: airbnb.com/neighborhoods

 


1 The brilliant man responsible for this observation is one Davide Cerri.

We’ve open sourced Rendr: Run your Backbone.js apps in the browser and Node.js

You may remember our blog post back in January that first introduced Rendr, our library for running Backbone.js apps seamlessly on both the client and the server.  We originally built Rendr to power our mobile web app, and in the post we explained our approach and showed some sample code.


We’ve been blown away by the response from the community.  With 80,000 hits to the original blog post and a constant stream of questions and comments on Twitter, it quickly became clear that we’d stumbled upon a Zeitgeist.  Many developers shared the same pain points with the traditional client-side MVC approach: poor pageload performance, lack of SEO, duplication of application logic, and context switching between languages.  


The number one question we received went something like this: “When will u release Rendr???”


Well, we’ve up and done it — Rendr is now an open source project, and you can check out the code over at github.com/airbnb/rendr.  A good place to start is the sample app: github.com/airbnb/rendr-app-template.  It provides some minimal boilerplate for a Rendr app, fashioned as “GitHub Browser”, a little app that consumes the public GitHub API.  Rendr is meant to work with any RESTful API, so please give it a try with your API of choice.  If you come up with something you’d like to show off, send it to us at @rendrjs and we’ll host a little gallery of Rendr apps.

Server-side Backbone at HTML5DevConf

Last week at HTML5DevConf I got a chance to speak on Rendr.  Check out the slides for some more context on Rendr.  The talk was also recorded, but the video isn’t ready quite yet.  If you want to see it sooner rather than later, let the conference organizers know by asking @html5devconf nicely.  I’ll update this post when the video’s live.

The conference was great; there were phenomenal speakers, and there were a lot of useful tips on how to build modern, maintainable JavaScript apps.

Indeed, there were no fewer than two other talks on server-side Backbone.  Tim Branyen showed previewcode and Lauri Svan showed backbone-serverside, both prototypes that demonstrate sharing Backbone code between the client and the server.  We got to meet up over lunch and compare approaches, and it was really interesting to see how similar our code was, with a few important differences.

For example, both Tim and Lauri chose AMD via RequireJS for modules, while I chose CommonJS via Stitch. Also, we all used Express as a web server, but they patched Backbone.Router or Backbone.History to route and handle server-side requests, whereas I created separate ClientRouter and ServerRouter classes that delegate to Express or Backbone.Router, and a separate routes file. Tim and Lauri chose to stub out jQuery on the server using Cheerio; I opted to avoid DOM abstractions on the server in favor of strict string-based templating.

What’s important, though, is that we all arrived at this approach after struggling with the drawbacks of a traditional client-side Backbone app.  And I’m sure we’re not the only ones: there must be a number of developers who’ve hacked on similar approaches, just like Bones.  There’s an opportunity here to combine our efforts and work towards a library that we can all use and extend to fit our needs.  

Design Goals

Here is the simple set of design goals which are guiding Rendr’s development.

Write application logic agnostic to environment.

Indicating what data to fetch, which template to render, which route to match, how to transform a model’s data — this logic can and should be abstracted from specific implementation details as much as possible.

Library, not a framework.

In true Backbone style, Rendr strives to be a library as opposed to a framework.  A small collection of base classes which can be built upon is easier to adopt and maintain than a batteries-included web framework.  Solve the problem at hand without imposing unneeded structure on the application.

Minimize code like if (server) {…} else {…}.

If your application has a bunch of conditions that look like this, then it means you’re doing something wrong.  It’s a sign of a leaky abstraction.  Of course, sometimes it’s necessary to know which environment you’re in, but that logic should be consolidated and abstracted away as much as possible.  Which leads us to…

Hide complexity in the library.

There are a few really tricky problems that need to be tackled to achieve these other goals.  The complexity of the solutions should be hidden in the library, keeping the application code clean, but remain accessible when it’s time to override core behaviors.

Talk to a RESTful API.

Backbone is great at integrating with a RESTful API.  Let’s follow that convention, keeping the data source separate from Rendr itself.  It should be possible to write adapters for different data sources, such as Mongo, CouchDB, Postgres, or Riak, but the library shouldn’t impose structure on your model layer.

No server-side DOM.

We prefer string-based templating over using a DOM on the server because DOM implementations are slow, and, well, because it feels hacky.  You shouldn’t need a DOM to render HTML.  However, I’m curious how this will change once Web Components become commonplace.

Simple Express middleware.

Express is the de facto Node.js web server.  Rendr should fit with Express convention, exposing a few simple middleware.  A nice effect of this is that you can tack Rendr routes onto any existing Express app, or have Rendr and non-Rendr content served from the same codebase as necessary.

The Evolution of Rendr

I’d love to see Rendr evolve to become even more modular, pulling out components like the routing, templating, and view classes into separate modules, leaving the minimal set of glue required to sanely build a Backbone app that runs on the client and server.  With a bit of work, it may even be possible to decouple Rendr from Backbone itself, allowing it to work with other client-side MVC libraries.

While you can easily use the code we’ve released to build a production-quality app (if you don’t mind getting your hands a bit dirty), our intention for Rendr isn’t for it to be the next [Rails, Meteor, Ember, Backbone].   Instead, we want to explore the problem of isomorphic JavaScript applications and stimulate discussion in the community.

To that end, we’ve been talking with @mde, who maintains the Geddy project, the first and most complete web framework for Node.js.  Geddy is more of a Rails-style server-side MVC framework, but Matt shares our vision for JavaScript apps that can run on both sides.  We’re looking at what it would take to allow Geddy to take advantage of Rendr’s approach.  What primitives and abstractions do we need to build applications that can seamlessly render views, route requests, and fetch data in both environments?

Rendr, Decaffeinated

Coming from Rails, we Airbnb engineers were initially smitten with CoffeeScript.  We’ve found it increases our productivity and lets us avoid some of the more verbose parts of JavaScript.  Thus, Rendr has been written mostly in CoffeeScript.  But heeding the advice of the broader Node.js community, we’re in the process of converting it over to plain-jane JavaScript.  This will make it easier for more people to adopt and contribute, avoiding fragmentation.  Interesting side note: when we show off Rendr to people from the Node.js community, they choke on the CoffeeScript without fail, but the Backbone community and front-end developers in general seem to prefer CoffeeScript.

Open Source at Airbnb

Rendr comes hot on the heels of the announcement of Chronos, the open source replacement for cron built by our data-infrastructure team.  

We’ve also released a few JavaScript projects that were born out of our efforts building web apps here at Airbnb.

Infinity.js is a small library for managing infinite scroll.  It’s like UITableView for the web.

Polyglot.js is a browser- and Node.js-compatible library for handling internationalization.  The secret-sauce of Polyglot.js is its handling of pluralization rules for non-English languages.  For example, Chinese and Korean have just a single form, whereas Russian, Czech, and Polish have three forms.
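To show what pluralization rules look like in practice, here is a small sketch in Python. It is not Polyglot.js’s API or implementation: the function names are made up, only a handful of locales are covered, and the Russian word forms for “night” are there purely as an example.

```python
def plural_index(locale, n):
    """Return which plural form to use for the count n in a given locale."""
    if locale in ("zh", "ko"):
        return 0  # a single form for every count
    if locale in ("ru", "pl"):
        if locale == "ru" and n % 10 == 1 and n % 100 != 11:
            return 0  # 1, 21, 31, ... but not 11
        if locale == "pl" and n == 1:
            return 0
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return 1  # 2-4, 22-24, ... but not 12-14
        return 2
    if locale == "cs":
        if n == 1:
            return 0
        return 1 if 2 <= n <= 4 else 2
    return 0 if n == 1 else 1  # English-style fallback: singular and plural


def pluralize(locale, forms, n):
    """Pick the right form from `forms` and prefix it with the count."""
    return f"{n} {forms[plural_index(locale, n)]}"


russian_nights = ["ночь", "ночи", "ночей"]
examples = [pluralize("ru", russian_nights, n) for n in (1, 3, 5, 11, 21)]
```

The fallback shows why an English-only rule falls short: Russian uses the same form for 21 as for 1, but a different one for 11.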

We’ve got a few more exciting open source projects coming through the pipeline, so keep an eye out!  Follow @AirbnbNerds and @rendrjs for updates.

And, as always, we’re hiring talented engineers to work all over our stack: frontend, backend, search, trust & safety, machine learning, data-infrastructure, analytics — you name it.

Scaling with CacheObservers

Airbnb’s traffic grows approximately 5x year over year. We’re running in the AWS cloud, using most of their services: hundreds of EC2 servers, tens of RDS instances, and nearly a dozen ElastiCache nodes. At around this time last year, we hit a wall. Traffic was growing faster than we could scale our databases, and we were running the largest RDS instances Amazon sold. There was no more vertical headroom. Unless our tiny engineering team made massive changes, our site would buckle in under a month.

We of course made large refactors to our schemas, in many cases switched datastores for frequently-used columns (especially those being used for key-value data), and vertically partitioned tables into separate databases where possible. But these were large changes and were performed over the course of months, not weeks. One of the ways we bought time for these efforts was by rebuilding our caching infrastructure from the ground up.

Caching At Airbnb

The most frequently accessed pages on our site are our listing pages. They’re also largely static, so we’d been caching the rendered output in Memcached and using ActiveRecord callbacks to expire the pages after related models changed. Unfortunately, the pages displayed information from many data sources: user information (prominently, the host’s… But also information for each review!), the listing itself, data derived from pricing and currency tables, and more. Often Models displayed on the listing pages would change, but would only change private data like home address, or would change public data like their personal description that isn’t shown in a review. Since we were expiring our page caches every time that happened, the pages were being re-rendered far more frequently than necessary — resulting in extra load on the databases.

We looked into using Sweepers, the Rails-3-omakase way of expiring caches. But Sweepers suffered the same problems as callbacks: any Model attribute change expired the page, even if it wasn’t used during rendering. Since there didn’t seem to be a solution that let us specify which attributes were “safe” to change, we built our own on top of ActiveRecord::Observers.

ActiveRecord::Observers, for those not familiar with the deeper workings of Rails, are an implementation of the Gang of Four Observer pattern: they watch things — in this case, ActiveRecord Models — for changes, and when changes occur, take specific actions. Since ActiveRecord Models store a list of all “dirty” (changed) attributes, we were able to build an extension that watches Models for attributes deemed unsafe to change and then expires caches when those attributes do change.
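The core check can be sketched in a few lines of plain Ruby. This is a minimal illustration, not the code we run in production: `expire_cache?` and `SAFE_LISTING_FIELDS` are hypothetical names, and the `changed` argument stands in for the array of dirty attribute names that ActiveRecord exposes through `model.changed`.

```ruby
# Hypothetical helper: decides whether a model change should expire a cache.
# `changed` mimics the dirty attribute names from ActiveRecord's `model.changed`;
# `safe_fields` are the attributes the rendered view does not depend on.
def expire_cache?(changed, safe_fields)
  unsafe = changed.map(&:to_s) - safe_fields.map(&:to_s)
  !unsafe.empty?
end

# Assumed safe list, borrowed from the listing example in this post.
SAFE_LISTING_FIELDS = %w[updated_at latitude longitude]

puts expire_cache?(%w[updated_at latitude], SAFE_LISTING_FIELDS)     # false
puts expire_cache?(%w[updated_at description], SAFE_LISTING_FIELDS)  # true
```

A change that touches only safe attributes leaves the cache alone; a single unsafe attribute is enough to expire it.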

CacheObservers

Traditional Observers are one-to-one with Models, not Views. But our CacheObservers needed to have knowledge of the View structure, which in our case often rendered multiple Models; we couldn’t have them be one-to-one with Models.

Sweepers are slightly different: they observe both Models and specific Controller actions. But we wanted something that knew just about specific views, not about what actions could impact it: every time we added a new edit action, we didn’t want to also have to remember to call the related CacheObserver’s expiration method.

Instead of following the Rails Observer and Sweeper patterns, we made CacheObservers one-to-one with views and one-to-many with Models. CacheObservers have no knowledge of Controllers; they only know what data a view needs to render. There’s a simple DSL for declaring data dependencies:

    caches_on :listing do |cache|
        cache.safe_fields 'updated_at',
                          'latitude',
                          'longitude'
    end

Inside each CacheObserver definition, there can be multiple caches_on blocks. Each caches_on block represents a Model class; the safe_fields are the fields that are safe to change without needing to expire the page cache.
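To make the shape of that DSL concrete, here is one way a base class could collect caches_on declarations. It is a sketch under assumptions: `SketchCacheObserver`, `Dependency`, `dependencies`, and `expire_for?` are hypothetical names invented for illustration, and the real CacheObserver does considerably more.

```ruby
# Sketch only: a stripped-down stand-in for a CacheObserver base class.
class SketchCacheObserver
  # Collects the safe fields declared inside one caches_on block.
  class Dependency
    attr_reader :safe

    def initialize
      @safe = []
    end

    def safe_fields(*names)
      @safe.concat(names.map(&:to_s))
    end
  end

  # Each subclass keeps its own { model_name => Dependency } table.
  def self.dependencies
    @dependencies ||= {}
  end

  def self.caches_on(model_name)
    dependency = Dependency.new
    yield dependency
    dependencies[model_name] = dependency
  end

  # True when any changed attribute falls outside the model's safe list.
  def self.expire_for?(model_name, changed_attributes)
    dependency = dependencies[model_name]
    return false unless dependency # this cache does not depend on the model
    (changed_attributes.map(&:to_s) - dependency.safe).any?
  end
end

class SketchListingCache < SketchCacheObserver
  caches_on :listing do |cache|
    cache.safe_fields 'updated_at', 'latitude', 'longitude'
  end

  caches_on :user do |cache|
    cache.safe_fields 'updated_at', 'billing_address'
  end
end

puts SketchListingCache.expire_for?(:listing, ['latitude'])  # false
puts SketchListingCache.expire_for?(:user, ['first_name'])   # true
```

Keeping the table on the subclass rather than the base class means two cache definitions can treat the same model differently.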

Of course, this isn’t quite enough to implement a smart cache-busting Observer. After all: how can you expire a Listing cache given only a model? To auto-expire caches, CacheObservers need to control the cache keys as well, and have a little bit of extra information to figure out how to get from a User model to the Listing’s cache key.

First, to define the cache key generation:

    class ListingCache < CacheObserver
        key_config.generate_from_url :controller => :listing, :action => :show

        caches_on :listing do |cache|
            cache.safe_fields 'updated_at',
                              'latitude',
                              'longitude'
        end
    end

All we’ve added here is the surrounding class definition, along with one extra line:

    key_config.generate_from_url :controller => :listing, :action => :show

There are many ways to configure key generation (in fact, at the base level it simply accepts a block that returns a string); however, in this case we’ve chosen to use the generate_from_url built-in strategy. We’ve passed it pieces of the hash needed by Rails’ url_for method, but missing a crucial component: the listing id.

Next, we tell the CacheObserver how to generate the key from the Listing model:

    class ListingCache < CacheObserver
      key_config.generate_from_url :controller => :listing, :action => :show

      caches_on :listing do |cache|
        cache.safe_fields 'updated_at',
                          'latitude',
                          'longitude'
        cache.url_options do |listing|
          { :id => listing.id }
        end
      end
    end

The cache.url_options block takes an instance of a model that’s had unsafe attributes changed and returns any pieces of the hash left out of the generate_from_url hash. The CacheObserver then merges the hashes together and generates the key from the result. In this case, we simply return the listing.id as the :id parameter; we could’ve also left out the :action parameter in the generate_from_url method and returned it here, but there’s no reason to and it’s easier to keep it in the class method instead of inside each caches_on block.

You might wonder: if at the base key configuration level the configuration simply accepts a block that returns a string, how could there be a built-in url_options method? Is there tight coupling between the generate_from_url method and the CacheObservers? But there isn’t: url_options is just an alias for key_options, which takes a block and passes its output as the input for the key configuration block. In this case, the URL strategy decorates the cache object to have the alias for clarity, and expects the given block to return a hash.
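The layering is easier to see in code. The sketch below is an assumption-laden illustration: `SketchKeyConfig`, `SketchDependency`, `generate`, `key_for`, and `options_for` are hypothetical names, the URL is built by joining strings rather than by calling url_for, and the alias is defined statically instead of being added by the URL strategy at configuration time.

```ruby
# Sketch only: every class and method name here is hypothetical.
class SketchKeyConfig
  # Base level: store any block that turns key options into a string.
  def generate(&block)
    @generator = block
  end

  # Built-in strategy layered on the base level. It merges the fixed URL
  # pieces with the per-model hash; a real version would hand the merged
  # hash to url_for instead of joining strings by hand.
  def generate_from_url(base)
    generate do |options|
      merged = base.merge(options)
      "/#{merged[:controller]}/#{merged[:action]}/#{merged[:id]}"
    end
  end

  def key_for(options)
    @generator.call(options)
  end
end

class SketchDependency
  # Stores the block that maps a changed model to key options.
  def key_options(&block)
    @key_options = block
  end
  # The URL strategy only adds a friendlier name for the same method.
  alias_method :url_options, :key_options

  def options_for(model)
    @key_options.call(model)
  end
end

SketchListing = Struct.new(:id)

config = SketchKeyConfig.new
config.generate_from_url :controller => :listing, :action => :show

dependency = SketchDependency.new
dependency.url_options { |listing| { :id => listing.id } }

key = config.key_for(dependency.options_for(SketchListing.new(42)))
puts key # /listing/show/42
```

Nothing in `SketchDependency` knows about URLs; only the key generator decides what the options hash means.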

Expiring Multiple Caches

Associating Listing models with a Listing cache is easy — they’re clearly one-to-one. If a listing changes, the cached page for it should be expired. But what about user models? Users could have multiple listings, so clearly returning a single hash won’t work. Luckily, the generate_from_url strategy handles this case:

    caches_on :user do |cache|
      cache.safe_fields 'updated_at',
                        'billing_address'
      cache.url_options do |user|
        listing_ids = []
        user.listings.each do |listing|
          listing_ids.push({ :id => listing.id })
        end
        listing_ids
      end
    end

The url_options block can return an array of hashes, and the CacheObserver will expire all of the associated caches. Of course, in production you probably wouldn’t write the code as above: you’ll potentially generate a bunch of inefficient queries for listings, when all you need are the ids. For simplicity, we’ve presented it this way; writing the more efficient query isn’t difficult but is left as an exercise for the reader.
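Handling both return shapes comes down to normalizing the block's result before looping. The sketch below uses a plain Hash as a stand-in for the cache store; `normalize_key_options` and `expire_all` are hypothetical helpers, and the key format is assumed for illustration.

```ruby
# Hypothetical normalizer: accepts one options hash or an array of them
# and always returns an array, so the expiry loop has a single code path.
def normalize_key_options(result)
  result.is_a?(Hash) ? [result] : Array(result)
end

# Stand-in for a cache store; a plain Hash keeps the sketch self-contained.
def expire_all(store, result)
  normalize_key_options(result).each do |options|
    store.delete("/listing/show/#{options[:id]}")
  end
  store
end

store = {
  '/listing/show/1' => '<html>one</html>',
  '/listing/show/2' => '<html>two</html>',
  '/listing/show/3' => '<html>three</html>'
}

# A user who owns listings 1 and 3 changed an unsafe attribute.
listing_ids = [1, 3]
expire_all(store, listing_ids.map { |id| { :id => id } })
puts store.keys.inspect # ["/listing/show/2"]
```

As a hint for the exercise: on Rails versions that provide it, something like `user.listings.pluck(:id)` fetches only the ids without instantiating each listing, and the resulting array can be mapped to hashes exactly as above.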

Fragment Caches

The above worked well for expiring full-page caches only when absolutely necessary. But often pages have portions that are highly dynamic, but other pieces that are generally static; for that, we cache pages in fragments so that even when the page expires we don’t do extra queries for the static pieces.

The trick with fragment caches is that if they expire, they should expire the full-page cache as well: but they shouldn’t expire their sibling caches. What’s more, nested fragments should expire their parents, who should in turn expire their own parents. This makes sure that your pages are always up-to-date, but maximally cached.

This isn’t the first time this concept has been blogged about; at around the same time as we were building our system, 37Signals began blogging about their “Russian Doll” caching architecture, which has now become the base for Rails 4 caching. CacheObservers work similarly with respect to fragment caches: caches have children (and grandchildren, and great-grandchildren…), and they do what you expect and expire only what’s necessary. The DSL is once again simple:

    child_cache :social do
      child_cache :reviews do
        caches_on :review do |cache|
          # ...
        end
        caches_on :user do |cache|
          # ...
        end
      end
      child_cache :friends do
        caches_on :user do |cache|
          # ...
        end
      end
    end

Each cache is named, and can be accessed by name in the view. The models and safe attributes are separate from their siblings and parents, and in addition to being able to expire on different models, children can expire on different attributes on the same models (or, more unlikely, the same attributes on the same models) as their siblings or parents.
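The expiration rule (a fragment expires its ancestors but never its siblings) can be sketched as a small tree walk. `SketchCacheNode` is a hypothetical class that only tracks freshness; it is not how CacheObservers store fragments.

```ruby
# Sketch only: a hypothetical node in a tree of nested caches.
class SketchCacheNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    @fresh = true
    parent.children << self if parent
  end

  def fresh?
    @fresh
  end

  # Expire this fragment and every ancestor, leaving siblings untouched.
  def expire!
    @fresh = false
    parent.expire! if parent
  end
end

page    = SketchCacheNode.new(:listing_page)
social  = SketchCacheNode.new(:social, page)
reviews = SketchCacheNode.new(:reviews, social)
friends = SketchCacheNode.new(:friends, social)

# A review changed an unsafe attribute.
reviews.expire!

[page, social, reviews, friends].each do |node|
  puts "#{node.name}: #{node.fresh? ? 'fresh' : 'expired'}"
end
```

After the call, the page and the social fragment must be re-rendered, but the friends fragment is still served from cache while they are rebuilt.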

Results

Airbnb grows seasonally: in the spring and summer months we accelerate dramatically, then level off mid-autumn before beginning the relentless upward drive once again following the New Year. We built CacheObservers in pieces last spring: at first, just the full-page caches, then child fragment caches, and eventually even Etag-based extensions to short-circuit the rendering process. By midsummer our full-page cache hit rates had more than doubled, and our database efforts bore fruit. We’d kept the site online, and we slowly transitioned our efforts away from disaster mitigation and towards overall site performance and building the foundation for the next year’s marathon.

Thanks to those efforts, we’re doing better this year. We’re well into the growth season, faster than we’ve been in years, and we’ve been keeping our site at three 9s thanks to the tireless efforts of the SRE and Performance teams, with the Core Services team building out future architecture for next year’s spike in parallel. I’m crossing my fingers for summer, but it’s mostly out of habit. We learned our lesson last year.

If you want to be a part of a fast-moving team tackling giant scaling problems, we’re hiring.

Introducing Chronos: A Replacement for Cron

Chronos is our replacement for cron. It is a distributed and fault-tolerant scheduler which runs on top of Mesos. It’s a framework and supports custom Mesos executors as well as the default command executor. Thus by default, Chronos executes SH (on most systems BASH) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR), even if the Mesos slaves on which execution happens do not have Hadoop installed. Included wrapper scripts allow transferring files and executing them on a remote machine in the background, using asynchronous callbacks to notify Chronos of job completion or failure.

Chronos has a number of advantages over regular cron. It allows you to schedule your jobs using ISO8601 repeating interval notation, which enables more flexibility in job scheduling. Chronos also supports the definition of jobs triggered by the completion of other jobs, and it also supports arbitrarily long dependency chains.
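For readers who haven't met the notation, a repeating interval packs a repeat count, a start time, and a period into one string, such as `R3/2013-03-15T00:00:00Z/PT2H`. The Ruby sketch below expands such a string into run times. It is an illustration rather than Chronos code: `duration_seconds` and `run_times` are hypothetical helpers, the count is treated as the total number of runs, and only day, hour, minute, and second periods with a finite count are handled.

```ruby
require 'time'

# Sketch only: handles the R<count>/<start>/<duration> shape with a
# duration made of days, hours, minutes and seconds. Months, years and
# unbounded repetition ("R/...") are deliberately left out.
DURATION = /\AP(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?\z/

def duration_seconds(text)
  match = DURATION.match(text)
  raise ArgumentError, "unsupported duration: #{text}" unless match
  days, hours, minutes, seconds = match.captures.map(&:to_i)
  ((days * 24 + hours) * 60 + minutes) * 60 + seconds
end

def run_times(schedule)
  repeat, start, duration = schedule.split('/')
  count = Integer(repeat.delete('R')) # assumption: count = number of runs
  first = Time.iso8601(start)
  step = duration_seconds(duration)
  Array.new(count) { |i| first + i * step }
end

run_times('R3/2013-03-15T00:00:00Z/PT2H').each { |t| puts t.iso8601 }
# 2013-03-15T00:00:00Z
# 2013-03-15T02:00:00Z
# 2013-03-15T04:00:00Z
```

Compared with a crontab line, the start time is explicit, so a job can begin at an arbitrary instant and repeat at any period rather than only on minute, hour, or day boundaries.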

Chronos is available on Github

The Backstory

At Airbnb, we heavily rely on data analysis to build great products. Our data-pipeline consists of many technologies such as Hadoop, MySQL, Amazon Redshift and S3. Our software engineers and analysts use a mix of Cascading, Cascalog, Hive and Pig for interfacing with Hadoop. We have scripts that export tables from a vast number of databases into S3 and we use various ETL (extract transform and load) processes to turn blobs of bytes into meaningful information. Many of these transformations consist of multiple steps and some tables are composed of a myriad of data-sources and joins.

We’re not in a private datacenter, and we aren’t running our own Hadoop cluster – we use a managed Hadoop product from Amazon, called Elastic Map/Reduce. High variance in network latency, virtualization, and unpredictable I/O performance are ongoing challenges in a cloud environment. There are many sources of errors; for example, calls to web services are subject to timeouts.

In a complex processing pipeline every step increases the chance of failure. Until December last year, we were relying on a single instance with cron to kick off our hourly, daily and weekly ETL jobs. Cron is a really great tool but we wanted a system that allowed retries, was lightweight and provided an easy-to-use interface giving analysts quick insights into which jobs failed and which ones succeeded.

We also wanted a system that was highly available and could manage any workload, not just Hadoop jobs. Other requirements were that the system could still run BASH scripts and fan out the workload to many systems (as we are exporting many tables, we didn’t want to execute on just one box, although we still wanted central management). At the same time we began looking at Mesos for data-infrastructure. Thus we made the decision to build a new lightweight, fault-tolerant scheduling tool, which we named Chronos, that would run on top of Mesos, using Mesos’ primitives for storing state and distributing work. Mesos also allowed us to dynamically add new workers to our pool without having to change the configuration of the existing cluster.

Chronos UI

Chronos comes with a UI which can be used to add, delete, list, modify and run jobs. It can also show a graph of job dependencies. These screenshots should give you a good idea of what Chronos can do.

Sample-chronos-ui

Check it out on Github

Over the past weeks, we have open-sourced Chronos. You can check it out on our GitHub page: http://airbnb.github.com/chronos

Here’s the video from our Tech Talk on Chronos: https://www.youtube.com/watch?v=FLqURrtS8IA

HTTP+JSON Services in Modern Java

At Airbnb, we build most of our user facing apps in Ruby on Rails, or more recently Node.js and our own Rendr framework.

We also have a number of internal services, and those are mainly written in Java for stability and performance.

Coming from a Ruby world, building anything in Java can feel pretty painful and boring. But thankfully there are modern Java libraries that make it easy and even fun. We build our Java services with Twitter Commons, a collection of libraries for building HTTP (and other) services. There isn’t much documentation about how to get started with Twitter Commons, so we thought we’d post a little tutorial to help you get started.

Redshift Performance & Cost

At Airbnb, we look into all possible ways to improve our product and user experience. Oftentimes this involves lots of analytics behind the scenes. Our data pipeline thus far has consisted of Hadoop, MySQL, R and Stata. We’ve used a wide variety of libraries for interfacing with our Hadoop cluster such as Hive, Pig, Cascading and Cascalog. However, we found that analysts aren’t as productive as they could be when using Hadoop, and standalone MySQL was no longer an option given the size of our dataset. We experimented with frameworks such as Spark but found them to be too immature for our use-case. So we turned our eye to Amazon Redshift earlier this year, and the results have been promising. We saw a 5x performance improvement over Hive.

Our First Node.js App: Backbone on the Client and Server

Here at Airbnb, we’ve been looking curiously at Node.js for a long time now.  We’ve used it for odds and ends, such as the build process for some of our libraries, but we hadn’t built anything production-scale.  Until now.

The Problem

There’s a disconnect in the way we build rich web apps these days.  In order to provide a snappy, fluid UI, more and more of the application logic is moving to the client.  In some cases, this promotes a nice, clean separation of concerns, with the server merely being a set of JSON endpoints and a place to host static files.  This is the approach we’ve taken recently, using Backbone.js + Rails.

But all too often, it’s not so clean; application logic is somewhat arbitrarily split between client and server, or in some cases needs to be duplicated on both sides.  Think date and currency formatting.  You tend to have a Ruby (Python, Java, PHP) library that you’ve been using for a while, but all of a sudden you have to replicate this logic in a JavaScript library.  The same is true for the view layer; some of your markup lives in Mustache or Handlebars templates for the client, while other markup needs to live in ERB or Haml so it can be rendered on the server for SEO or other reasons.

If you’ve seen my tech talk or last blog post, then all this should sound familiar.  Our thesis, four months ago when I gave this talk, was that, in theory, if we have a JavaScript runtime on the server, we should be able to pull most of this application logic back down to the server in a way that can be shared with the client.  Then, as a developer you just focus on writing application code.  Your great new product can run on both sides of the wire, serving up real HTML on first pageload, but then kicking off a client-side JavaScript app.  In other words, the Holy Grail.

Our Solution

I’m proud to announce that we’ve launched our first Holy Grail app into production!  We’ve successfully pulled Backbone onto the server using Node.js, tied together with a library we’re calling Rendr.  We haven’t open-sourced Rendr quite yet, but you can expect it in the coming months, once we’ve had a chance to build a few more apps with it and decouple it from our use case a bit.

So, the app: we re-launched our Mobile Web site using this new stack, replacing the Backbone.js + Rails stack that it used previously.  You may have been using it for over a month without even knowing.  It looks exactly the same as the app it replaced; however, initial pageload feels drastically quicker because we serve up real HTML instead of waiting for the client to download JavaScript before rendering.  Plus, it is fully crawlable by search engines.

Airbnb Mobile website screenshot

The performance gains are an awesome side effect of this design.  In testing, we’re using a metric we call “time to content”, inspired by Twitter’s time to first tweet.  It measures the time it takes for the user to see real content on the screen.  Let’s take our search results page, for example.  Under the old design, before any search results could be rendered in the client, first all of the external JavaScript files had to download, evaluate, and execute.  Then, the Backbone router would look at the URL to determine which page to render, and thus which data to fetch.  Then, our app would make a request to the API for search results.  Finally, once the API request returned with our data, we could render the page of search results.  Keep in mind all of this has most likely happened over a mobile connection, which tends to have very high latency.  All of these steps add up to a “time to content” that can be more than 10 seconds in extreme cases.

Compare this with serving the full search results HTML from the server.  Over a mobile connection, it may take 2 seconds to download the initial page of HTML, but it can be immediately rendered.  Even as the rest of the JavaScript application is being downloaded, the user can interact with the page.  It feels 5x faster.

Gimme the Deets!

We built upon tools we already know and love: Backbone.js and Handlebars.js.  We ended up with a hybrid of Rails, Backbone, and Node conventions.  For example, the app’s directory structure will look familiar to Rails users (minus collections and models):

In a typical Backbone app, “controller” logic—fetching the appropriate data and instantiating view(s) for a particular page—lives in methods on your instance of Backbone.Router.  As your app grows, however, the router quickly becomes bloated with all of these route handlers.  This is why we’ve created real controller objects, which group related actions into more manageable chunks.  This abstraction also allows us to more easily add before and after filters if we want to.  Here’s what a users controller might look like:

First off, you’ll notice we’re using CoffeeScript.  It’s a bit controversial, I know, but we’re fans.  Secondly, notice the module.exports = at the top.  That’s a tell-tale sign of a CommonJS module.  CommonJS is what Node.js uses to require modules, and we’re able to reuse the same syntax in the client using Stitch, which was written by Sam Stephenson, the same fellow who also wrote Sprockets for Rails.

Now, keep in mind that this controller code gets executed on both the client and the server.  For example, if a user lands on “/users/1234”, and a route exists that routes that to users#show (more on that below), then the show action will be invoked.

The @app.fetcher.fetch you see is our way of encapsulating resource fetching.  In this case, a User model with id 1234 will be fetched from the API, instantiated, and passed to the view.  Why not just instantiate the User model yourself in the controller?  You could do that, but the fetcher provides a layer of indirection which allows us to do a bunch of fancy caching in both the client and the server-side.  I’ll leave out the details for now, but that could be the subject of a future blog post!

Routing Requests

We need to be able to match a certain URL to a controller/action pair both on the client-side and the server-side.  Inspired by Rails, we have a routes file that specifies these routes and any additional route parameters.

Notice the optional parameter maxAge on the listings#search route; this is used to set caching headers.  You can add any arbitrary parameters here and access them in the router.  We also plan to make this more advanced as needed, such as adding parameter requirements.

Our routes file format is heavily inspired by a really interesting project called ChaplinJS.  Chaplin is an application framework built on top of Backbone.  Chaplin is under active development and is quite well maintained; definitely worth a look if you want some more structure in your Backbone app.

These routes are parsed by a router, which delegates incoming requests to particular controllers.  We have a ClientRouter, which delegates to Backbone.History and translates pushState events to controller actions, and a ServerRouter, which does the same for Express requests; both extend BaseRouter to share common logic.

We serve the app using Express, the de-facto web server for Node.js.

Views

In Rendr, your views extend our BaseView, which in turn extends Backbone.View, adding a number of methods that allow it to easily render on both the server and the client.  Here’s an example ListingView for you:

First you’ll notice that we require BaseView with require('rendr/base/view').  This CommonJS module path is standard in Node.js for requiring files within NPM packages, and using a trick in our Stitch packaging, we can use the same path in the client.

In addition to the typical methods and properties from Backbone.View, we’ve added some custom ones to hook into the view’s lifecycle.  Notice the #postRender() method; this gets called only in the client-side, right after rendering.  This is where you would put any code that needs to touch the DOM, such as initializing jQuery plugins like slideshows or sliders.

Each view has a Handlebars template associated with it, and a View#getHtml() method.  On the server, we simply call #getHtml() and return the resulting string.  On the client, View#render() calls #getHtml() and updates the innerHTML of its element.

We also have a #getTemplateData() method, which by default returns @model.toJSON().  You can override it to act as a view-model, adding any view-specific properties to pass to the template for rendering.

We decided on an entirely string-based templating approach, preferring not to depend upon a DOM on the server.  One result of this is that it becomes necessary to push all HTML manipulation either into the Handlebars template or into a custom #getHtml() method.  In other words, you cannot rely on any DOM manipulation code to construct your markup; if you have a habit of appending child views, loading spinners, or any other markup in #render(), get used to pushing that down to the template.

Now, here’s the tricky part: when we generate a page of HTML from a hierarchy of views on the server, we also have to ensure that once it reaches the client, all of the views’ event handlers are properly bound to the DOM, and we have living, breathing view instances that can respond to user interactions.  Our approach was to decorate the generated HTML with data-attributes, which specify which view class it represents.  Here’s an example from one of our listing pages:

On pageload, we find all DOM elements with the data-view="some_view_class" attribute, instantiate view objects for each, and reconstruct the view hierarchy.  That way, events are properly bound, and we preserve any parent-child relationships between our view objects so they can listen to and emit events on each other.

Sounds easy enough, right?  Well, it ended up being more difficult to solve than we expected.  One of the problems we ran into is that all data needed to instantiate a view needs to be extractable from the DOM.  You’ll notice in the above code snippet that we also have the data-attributes data-model_name and data-model_id.  This allows us to pass in the correct model or collection into any view, fetching it from the client-side model cache (remember @app.fetcher from our controller? it handles this for us).  This is the pattern we came up with in order to ship; a better approach would be to assign a unique id to each view instance on the server, and have some sort of mapping from view id to view data which we could read from in the client, rather than looking up from the DOM.

There’s another way to ensure views are properly bound to the DOM, which we decided against.  It involves discarding the server-generated HTML upon pageload, regenerating the view hierarchy in the client, and swapping the view DOM tree into the page.  The downside is the potential for weird UX interactions if, for example, a user starts to interact with elements on the page that then get destroyed and replaced.

NestedView

We really like enforcing encapsulation within our views.  We also wanted an easy way to declaratively nest subviews within any view.  We came up with a nifty way of achieving both using Handlebars helpers.  Check out this slightly contrived example.

Given these views and templates:

and this data:

calling View#getHtml() on the parent returns this HTML:

So, you can arbitrarily nest views using a simple Handlebars helper, and simply call #getHtml() on the top-most view to get the entire hierarchy’s HTML.  Nifty, eh?

We found this useful enough as a standalone library that we’ve pulled it out into NestedView.  Check it out on Github.

Models & Collections

Now onto models and collections.

Rendr has a BaseModel and BaseCollection, which extend Backbone.Model and Backbone.Collection.  Here’s the most basic User model:

Pretty thin, right?  Of course, you can add whatever custom methods you want to your models.

The important part is the url property. Rendr expects to get all of its data from a JSON API over HTTP.  This could be on the same server, but in our case, it’s a preexisting API that powers a number of other apps.  The url specified above points to the path on this API server.

Calling Backbone’s CRUD methods on the model or collection (#save(), #fetch(), #destroy(), and now #update() in the latest Backbone) will dispatch a request to this API.  We override Backbone.sync() such that when these methods are called from the server-side, Rendr sends an HTTP request directly to the API server.  When called from the client, Rendr will prepend /api to the model/collection URL, proxying the request to the API through the Rendr server.  This allows us to do any additional formatting of the request or response, and to centralize the API request and response handling logic, at the expense of some network time.

Sprechen Sie Deutsch?

Internationalization (I18n) is incredibly important to us here at Airbnb.  We support 30+ languages, and have localized web sites all around the world.  We’ve been performing I18n in JavaScript for some time, but the need arose to make a more-robust library that can run in CommonJS environments as well.  That’s led to Polyglot.js, our open source JavaScript I18n library.  The extra special sauce is our pluralization logic.  Here’s an excerpt from the docs:

In English (and German, Spanish, Italian, and a few others) there are only two plural forms: singular and not-singular.

However, some languages get a bit more complicated. In Czech, there are three separate forms: 1, 2 through 4, and 5 and up. Russian is even crazier.

Polyglot.js abstracts all that away from you, and just requires you to provide translations for whichever locales you’re interested in.
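To show the idea behind those rules, here is a small sketch of locale-specific plural form selection. It is written in Ruby purely for illustration (Polyglot.js itself is JavaScript), and the `PLURAL_RULES` table and `pluralize` helper are hypothetical, not Polyglot.js's actual data or API.

```ruby
# Sketch only: hypothetical rule table, not Polyglot.js's actual data.
# Each rule maps a count to the index of the plural form to use.
PLURAL_RULES = {
  :en => lambda { |n| n == 1 ? 0 : 1 },                          # singular / not-singular
  :cs => lambda { |n| n == 1 ? 0 : ((2..4).cover?(n) ? 1 : 2) }, # 1 / 2 through 4 / 5 and up
  :zh => lambda { |_n| 0 }                                       # a single form
}

def pluralize(locale, forms, count)
  index = PLURAL_RULES.fetch(locale).call(count)
  forms[index].sub('%{count}', count.to_s)
end

czech_cars = ['%{count} auto', '%{count} auta', '%{count} aut']

puts pluralize(:cs, czech_cars, 1) # 1 auto
puts pluralize(:cs, czech_cars, 3) # 3 auta
puts pluralize(:cs, czech_cars, 5) # 5 aut
```

The translator supplies the forms in a fixed order for each locale, and the library's job is only to pick the right index for a given count.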

What’s Next?

There’s still a lot of work to be done.  Before we release Rendr, we need to build a few more applications with it and modularize the code a bit more.  Luckily we have a growing need for rich JavaScript apps that are fast and SEO-friendly here at Airbnb.  Keep an eye out here and follow @AirbnbNerds and @spikebrehm for more updates.

I’ll be speaking more about Rendr and Airbnb’s experiences building rich JavaScript apps at HTML5DevConf on April 1-2 in San Francisco.  It’s shaping up to be a great conference.  Sign up if you haven’t already!