Airbnb’s traffic grows approximately 5x year over year. We’re running in the AWS cloud, using most of their services: hundreds of EC2 servers, tens of RDS instances, and nearly a dozen ElastiCache nodes.

At around this time last year, we hit a wall. Traffic was growing faster than we could scale our databases, and we were running the largest RDS instances Amazon sold. There was no more vertical headroom. Unless our tiny engineering team made massive changes, our site would buckle in under a month.

We of course made large refactors to our schemas, in many cases switched datastores for frequently-used columns (especially those being used for key-value data), and vertically partitioned tables into separate databases where possible. But these were large changes and were performed over the course of months, not weeks. One of the ways we bought time for these efforts was by rebuilding our caching infrastructure from the ground up.

Caching At Airbnb

The most frequently accessed pages on our site are our listing pages. They’re also largely static, so we’d been caching the rendered output in Memcached and using ActiveRecord callbacks to expire the pages after related models changed. Unfortunately, the pages displayed information from many data sources: user information (prominently, the host’s… But also information for each review!), the listing itself, data derived from pricing and currency tables, and more. Often the Models displayed on the listing pages would change, but only in private data like a home address, or in public data, like a user’s personal description, that isn’t shown in a review. Since we were expiring our page caches every time that happened, the pages were being re-rendered far more frequently than necessary, resulting in extra load on the databases.

We looked into using Sweepers, the Rails-3-omakase way of expiring caches. But Sweepers suffered from the same problem as callbacks: any Model attribute change expired the page, even if the attribute wasn’t used during rendering. Since there didn’t seem to be a solution that let us specify which attributes were “safe” to change, we built our own on top of ActiveRecord::Observers.

ActiveRecord::Observers, for those not familiar with the deeper workings of Rails, are an implementation of the Gang of Four Observer pattern: they watch things (in this case, ActiveRecord Models) for changes, and when changes occur, take specific actions. Since ActiveRecord Models store a list of all “dirty” (changed) attributes, we were able to build an extension that watches Models for attributes deemed unsafe to change and then expires caches when those attributes do change.
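
The core check behind that idea can be sketched in a few lines of plain Ruby. This is only an illustration, not our actual extension: the `expire_cache?` helper and the `SAFE_FIELDS` list are hypothetical names, and the `changed` argument stands in for the array of attribute names that ActiveRecord exposes on a dirty model.

```ruby
# Hypothetical helper (not the real CacheObserver code): decide whether a
# model change should expire a cache, given the attributes that changed.
SAFE_FIELDS = %w[updated_at billing_address].freeze

# `changed` stands in for the attribute-name array a dirty model reports.
def expire_cache?(changed, safe_fields = SAFE_FIELDS)
  # Expire only if at least one changed attribute is NOT on the safe list.
  (changed - safe_fields).any?
end

expire_cache?(%w[updated_at billing_address]) # => false
expire_cache?(%w[updated_at description])     # => true
```

Subtracting the safe list from the changed list leaves only the attributes that could affect the rendered page, so a save that touches nothing but safe fields leaves the cache alone.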

CacheObservers

Traditional Observers are one-to-one with Models, not Views. But our CacheObservers needed to have knowledge of the View structure, which in our case often rendered multiple Models; we couldn’t have them be one-to-one with Models. Sweepers are slightly different: they observe both Models and specific Controller actions. But we wanted something that knew just about specific views, not about what actions could impact them: every time we added a new edit action, we didn’t want to also have to remember to call the related CacheObserver’s expiration method.

Instead of following the Rails Observer and Sweeper patterns, we made CacheObservers one-to-one with views and one-to-many with Models. CacheObservers have no knowledge of Controllers; they only know what data a view needs to render. There’s a simple DSL for declaring data dependencies:

    caches_on :listing do |cache|
      cache.safe_fields 'updated_at',
                        'latitude',
                        'longitude'
    end

Inside each CacheObserver definition, there can be multiple caches_on blocks. Each caches_on block represents a Model class; the safe_fields are the fields that are safe to change without needing to expire the page cache. Of course, this isn’t quite enough to implement a smart cache-busting Observer. After all: how can you expire a Listing cache given only a model? To auto-expire caches, CacheObservers need to control the cache keys as well, and have a little bit of extra information to figure out how to get from a User model to the Listing’s cache key. First, to define the cache key generation:

    class ListingCache < CacheObserver
      key_config.generate_from_url :controller => :listing, :action => :show

      caches_on :listing do |cache|
        cache.safe_fields 'updated_at',
                          'latitude',
                          'longitude'
      end
    end

All we’ve added here is the surrounding class definition, along with one extra line:

    key_config.generate_from_url :controller => :listing, :action => :show

There are many ways to configure key generation (in fact, at the base level it simply accepts a block that returns a string); however, in this case we’ve chosen to use the generate_from_url built-in strategy. We’ve passed it pieces of the hash needed by Rails’ url_for method, but it’s still missing a crucial component: the listing id. Next, we tell the CacheObserver how to generate the key from the Listing model:

    class ListingCache < CacheObserver
      key_config.generate_from_url :controller => :listing, :action => :show

      caches_on :listing do |cache|
        cache.safe_fields 'updated_at',
                          'latitude',
                          'longitude'
        cache.url_options do |listing|
          { :id => listing.id }
        end
      end
    end

The cache.url_options block takes an instance of a model that’s had unsafe attributes changed and returns any pieces of the hash left out of the generate_from_url hash. The CacheObserver then merges the hashes together and generates the key from the result. In this case, we simply return the listing.id as the :id parameter; we could’ve also left out the :action parameter in the generate_from_url method and returned it here, but there’s no reason to and it’s easier to keep it in the class method instead of inside each caches_on block. You might wonder: if at the base key configuration level the configuration simply accepts a block that returns a string, how could there be a built-in url_options method? Is there tight coupling between the generate_from_url method and the CacheObservers? But there isn’t: url_options is just an alias for key_options, which takes a block and passes its output as the input for the key configuration block. In this case, the URL strategy decorates the cache object to have the alias for clarity, and expects the given block to return a hash.
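
To make that decoupling concrete, here is a minimal plain-Ruby sketch of how such a layering could be wired. It is not the CacheObserver source: `KeyConfig`, `ModelCache`, `key_for`, and `options_for` are hypothetical names, the path is built by hand instead of calling url_for so that the example runs without Rails, and the alias is defined directly on the class rather than added by the strategy.

```ruby
# Hypothetical sketch: none of these classes are the real CacheObserver code.
class KeyConfig
  # Base level: any block that turns an options object into a string key.
  def generate(&block)
    @generator = block
  end

  # Built-in strategy: merge the fixed URL pieces with per-model options.
  # A Rails version would pass the merged hash to url_for; here the path
  # is assembled by hand so the sketch has no dependencies.
  def generate_from_url(base)
    generate do |options|
      merged = base.merge(options)
      "/#{merged[:controller]}/#{merged[:action]}/#{merged[:id]}"
    end
  end

  def key_for(options)
    @generator.call(options)
  end
end

class ModelCache
  # Stores a block that maps a model instance to the generator's input.
  def key_options(&block)
    @key_options = block
  end
  alias_method :url_options, :key_options # same method, clearer name

  def options_for(model)
    @key_options.call(model)
  end
end

Listing = Struct.new(:id)

config = KeyConfig.new
config.generate_from_url(:controller => :listing, :action => :show)

cache = ModelCache.new
cache.url_options { |listing| { :id => listing.id } }

key = config.key_for(cache.options_for(Listing.new(42)))
puts key # prints "/listing/show/42"
```

The key generator never needs to know that url_options exists; it only receives whatever the stored block returns, which is why a different strategy could accept a different return type.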

Expiring Multiple Caches

Associating Listing models with a Listing cache is easy: they’re clearly one-to-one. If a listing changes, the cached page for it should be expired. But what about user models? Users could have multiple listings, so clearly returning a single hash won’t work. Luckily, the generate_from_url strategy handles this case:

    caches_on :user do |cache|
      cache.safe_fields 'updated_at',
                        'billing_address'
      cache.url_options do |user|
        listing_ids = []
        user.listings.each do |listing|
          listing_ids.push({ :id => listing.id })
        end
        listing_ids
      end
    end

The url_options block can return an array of hashes, and the CacheObserver will expire all of the associated caches. Of course, in production you probably wouldn’t write the code as above: you’ll potentially generate a bunch of inefficient queries for listings, when all you need are the ids. For simplicity, we’ve presented it this way; writing the more efficient query isn’t difficult but is left as an exercise for the reader.
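
As a rough illustration of accepting both return shapes, the sketch below normalizes a single hash or an array of hashes into a list of keys to expire. The helper names and the hand-built key format are assumptions made for the example, not the CacheObserver implementation.

```ruby
# Hypothetical sketch of how an observer might accept either return shape.
def normalize_options(result)
  # Array(hash) would split a Hash into [key, value] pairs, so check the
  # type explicitly instead of relying on Kernel#Array.
  result.is_a?(Array) ? result : [result]
end

# Builds one cache key per options hash (key format is illustrative only).
def keys_to_expire(result)
  normalize_options(result).map { |opts| "/listing/show/#{opts[:id]}" }
end

keys_to_expire({ :id => 7 })
# => ["/listing/show/7"]
keys_to_expire([{ :id => 7 }, { :id => 9 }])
# => ["/listing/show/7", "/listing/show/9"]
```

As for the query itself, one option on Rails 3.2 or later is `user.listings.pluck(:id)`, which selects only the id column instead of instantiating every Listing.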

Fragment Caches

The above worked well for expiring full-page caches only when absolutely necessary. But pages often have some portions that are highly dynamic and others that are generally static; for that, we cache pages in fragments so that even when the page expires we don’t do extra queries for the static pieces. The trick with fragment caches is that if they expire, they should expire the full-page cache as well, but they shouldn’t expire their sibling caches. What’s more, nested fragments should expire their parents, who should in turn expire their own parents. This makes sure that your pages are always up-to-date, but maximally cached.

This isn’t the first time this concept has been blogged about; at around the same time as we were building our system, 37signals began blogging about their “Russian Doll” caching architecture, which has now become the base for Rails 4 caching. CacheObservers work similarly with respect to fragment caches: caches have children (and grandchildren, and great-grandchildren…), and they do what you expect and expire only what’s necessary. The DSL is once again simple:

    child_cache :social do
      child_cache :reviews do
        caches_on :review do |cache|
          # ...
        end
        caches_on :user do |cache|
          # ...
        end
      end
      child_cache :friends do
        caches_on :user do |cache|
          # ...
        end
      end
    end

Each cache is named, and can be accessed by name in the view. The models and safe attributes are separate from their siblings and parents, and in addition to being able to expire on different models, children can expire on different attributes on the same models (or, less likely, the same attributes on the same models) as their siblings or parents.
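
The expiration rule for nested fragments (expire yourself and every ancestor, but never a sibling) can be modeled with a small tree. The `CacheNode` class below is a hypothetical stand-in written for this post, with names borrowed from the DSL example; the `:listing_page` root is an assumed name for the full-page cache.

```ruby
# Hypothetical model of nested caches; not the real CacheObserver classes.
class CacheNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    parent.children << self if parent
  end

  # Names of every cache invalidated when this one expires:
  # the node itself, then each ancestor up to the full page.
  def expiry_chain
    node = self
    chain = []
    while node
      chain << node.name
      node = node.parent
    end
    chain
  end
end

page    = CacheNode.new(:listing_page)
social  = CacheNode.new(:social, page)
reviews = CacheNode.new(:reviews, social)
friends = CacheNode.new(:friends, social)

reviews.expiry_chain # => [:reviews, :social, :listing_page]
```

Because the walk only follows parent links, the `:friends` fragment stays cached when `:reviews` expires, and it can be reused when the `:social` fragment is re-rendered.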

Results

Airbnb grows seasonally: in the spring and summer months we accelerate dramatically, then level off mid-autumn before beginning the relentless upward drive once again following the New Year. We built CacheObservers in pieces last spring: at first, just the full-page caches, then child fragment caches, and eventually even ETag-based extensions to short-circuit the rendering process. By midsummer our full-page cache hit rates had more than doubled, and our database efforts bore fruit. We’d kept the site online, and we slowly transitioned our efforts away from disaster mitigation and towards overall site performance and building the foundation for the next year’s marathon.

Thanks to those efforts, we’re doing better this year. We’re well into the growth season, faster than we’ve been in years, and we’ve been keeping our site at three 9s thanks to the tireless efforts of the SRE and Performance teams, with the Core Services team building out future architecture for next year’s spike in parallel. I’m crossing my fingers for summer, but it’s mostly out of habit. We learned our lesson last year.

If you want to be a part of a fast-moving team tackling giant scaling problems, we’re hiring.