How we improved search performance by 2x

By Tobi Knaup

If you haven’t used our moving map search recently, you should check it out now, because we made it more than twice as fast! Actually, every search is faster now, but it’s most noticeable in map mode. So how did we do that? Let’s start with some background on our setup. We’re a Rails shop, and we’re using Sphinx as our search engine. The two are connected through ThinkingSphinx, an excellent Ruby gem that provides an easy-to-use query interface and a DSL for defining indexes. The queries we run are a little different from what an average website does, because every single one filters results with spatial constraints (latitude/longitude). We also make heavy use of facets for the various filter options, such as room type, neighborhood, and amenities.
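To make this concrete, here’s a minimal sketch of what such a setup can look like with the classic ThinkingSphinx DSL. The model, field, and attribute names below are made up for illustration, not our actual schema:

    class Hosting < ActiveRecord::Base
      define_index do
        # full-text fields
        indexes name, description

        # numeric attributes used for filtering and faceting
        has room_type_id, :facet => true
        has neighborhood_id, :facet => true

        # latitude and longitude are stored as plain floats (radians)
        has latitude, longitude
      end
    end

    # A spatial query: :geo takes [lat, lng] in radians, and the computed
    # distance is exposed as @geodist for filtering and sorting.
    Hosting.search 'cozy loft',
      :geo   => [0.6594, -2.1366],
      :with  => { '@geodist' => 0.0..5_000.0 },
      :order => '@geodist ASC'

    # Facet counts for the filter sidebar
    Hosting.facets 'cozy loft'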

Why was it slow?

Sphinx works great for most common use cases, but it’s not optimized for spatial queries. While it gives you some basic functions to query and rank by distance, it doesn’t perform any spatial indexing. The latitude and longitude fields are just floats, so spatial queries have to scan the whole index, which of course doesn’t perform or scale well. It also turns out that the configuration generated by ThinkingSphinx doesn’t allow Sphinx to make use of multiple processor cores. While it may sound like this setup doesn’t fit our performance requirements at all, Sphinx is very fast in general. Rewriting or switching to a different engine wasn’t an option for us at the time, so we wanted to make surgical changes to get the maximum out of it. We got help from Vlad and Rich at sphinxsearch.com, who are experts in tuning Sphinx.
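For illustration, this is roughly what a distance-filtered query looks like in SphinxQL (the attribute names are hypothetical). GEODIST is computed per document, so nothing about the query narrows the scan the way a true spatial index would:

    SELECT *, GEODIST(latitude, longitude, 0.6594, -2.1366) AS dist
    FROM hosting_core
    WHERE dist < 5000.0
    ORDER BY dist ASC;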

How we optimized it

The first objective was to allow Sphinx to use all available processor cores. To achieve this, we split the search index into multiple parts and configured Sphinx to use them as a distributed index. Sphinx then uses one thread to search each partial index, and merges the results afterwards. Here is an example configuration snippet that makes use of two cores:

    searchd
    {
      …
      dist_threads = 2
      …
    }

    source hosting_core_0
    {
      …
      sql_query = SELECT … FROM hostings WHERE id % 2 = 0
      …
    }

    source hosting_core_1 : hosting_core_0
    {
      sql_query = SELECT … FROM hostings WHERE id % 2 = 1
    }

    index hosting_core_0
    {
      source = hosting_core_0
      path = /home/sphinx/db/hosting_core_0
    }

    index hosting_core_1
    {
      source = hosting_core_1
      path = /home/sphinx/db/hosting_core_1
    }

    index hosting_core
    {
      type = distributed
      local = hosting_core_0
      local = hosting_core_1
    }

What’s important here is to set dist_threads to the number of processor cores, and to configure one partial index per core. It’s easy to split your data into multiple indexes if you have an auto-increment id column: simply use the modulo operator in the source config blocks, as shown above. Another big performance boost came from upgrading Sphinx from 0.9.9 to 2.0. It’s currently in “stable beta”, which basically means that core features are production quality, while some newly added features might be less tested. The Sphinxsearch guys recommended it, and since we weren’t using any of the cutting-edge features, we felt confident running it in production. The only downside to these changes is that we had to say goodbye to the ThinkingSphinx index configuration DSL, which doesn’t support these advanced settings.
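Since ThinkingSphinx no longer generates this part of the configuration for us, one way to keep the per-core blocks maintainable is to emit them from a small script rather than copying them by hand. A rough sketch, with the core count, table name, and paths as placeholders:

    # generate_sphinx_conf.rb -- emits one source/index pair per core,
    # splitting rows via `id % cores = n` as in the snippet above.
    cores = 2

    parts = ["searchd\n{\n  dist_threads = #{cores}\n}\n"]

    cores.times do |n|
      inherit = n.zero? ? "" : " : hosting_core_0"
      parts << "source hosting_core_#{n}#{inherit}\n{\n" +
               "  sql_query = SELECT * FROM hostings WHERE id % #{cores} = #{n}\n}\n"
      parts << "index hosting_core_#{n}\n{\n  source = hosting_core_#{n}\n" +
               "  path = /home/sphinx/db/hosting_core_#{n}\n}\n"
    end

    locals = (0...cores).map { |n| "  local = hosting_core_#{n}\n" }.join
    parts << "index hosting_core\n{\n  type = distributed\n#{locals}}\n"

    File.open("sphinx.conf", "w") { |f| f.puts parts.join("\n") }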

Statsd

The future

There are a few ways to get even more performance out of Sphinx. It has its own query language, SphinxQL, which allows you to bundle queries and execute them together. This is really helpful for combining multiple facet queries (there’s a sketch at the end of this post). It would require major changes in our app and getting rid of ThinkingSphinx though, so we’ll save that for a later date. Another way to get more parallelism and scalability is to split the index across multiple machines. This works similarly to same-machine distributed indexes and is easy to set up. Although Sphinx has been great for us so far, the lack of spatial indexing will become a problem at some point. We’re currently exploring other architectures that provide this feature. Stay tuned.
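For the curious, here’s a rough sketch of the SphinxQL batching mentioned above, again with hypothetical attribute names. Several statements are sent in a single round trip, and Sphinx can notice that the facet queries share the same full-text match and evaluate it only once:

    SELECT id FROM hosting_core WHERE MATCH('cabin') LIMIT 0, 20;
    SELECT room_type_id, COUNT(*) FROM hosting_core
      WHERE MATCH('cabin') GROUP BY room_type_id;
    SELECT neighborhood_id, COUNT(*) FROM hosting_core
      WHERE MATCH('cabin') GROUP BY neighborhood_id;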

 


Comments

  1. Aaron Blohowiak

    http://en.wikipedia.org/wiki/Geohash would turn lat/long range intersection queries into very, very fast prefix-matching queries. Encoding data with GeoHash (a very simple thing to do) makes it so that the longer the shared prefix, the closer things are together (excepting right around the equator and meridian, which may not be an issue for AirBnB). Sphinx is very, very fast at answering prefix queries, so I suggest you look into it!

  2. Jon Hinson

    Very nice write-up. I’ve been curious as to what architecture you guys use for your spatial searches.

    @Aaron: The problem with geohash is its inflexibility. Items very close might not show up in a search because they are in different grids. Thus, you would have to decrease your precision in order to return them both in the same query. On map interfaces like Airbnb’s, one would want to use bounding boxes or radius searches which, correct me if I’m wrong, wouldn’t be an option if using geohashes.

    I experimented with geohashes using Solr and Sunspot for a new version of one of my apps, but it just didn’t suit my needs. I switched over to the Spatial Search Plugin for Solr and have been happy with it so far.

  3. Aaron Blohowiak

    @Jon: using a bit of math, you can see if the point in question is near a boundary and then do the appropriate additional search. Alternately, you can just perform the 8 surrounding searches every time. This may still be faster than doing lat/long range-based searches in sphinx.

  4. Yaroslav Vorozhko

    Also, you could split the core0 and core1 indexes onto separate HDDs, which would give you a bit more disk I/O performance before you move to a dedicated Sphinx cluster.

  5. pat

    Hey there! It’s great to hear about some serious Sphinx usage – and especially about a site like AirBnB using Thinking Sphinx. Thanks for all the detail!

    I just wanted to let you guys know that what you’re doing (at least, what you’ve described) should still be possible via Thinking Sphinx’s auto-generated file. You can define multiple indices for a model, and then have a different WHERE condition for each one – and dist_threads can be listed in config/sphinx.yml. TS also supports Sphinx 2.x (but perhaps didn’t when you were making these changes). And of course, all of that could be better documented.

    As for SphinxQL support, I’ve started rewriting Thinking Sphinx to use it (check out the edge branch), but it’s not quite ready to handle most situations (most notably facets, given how you’re using them). I’m pretty proud of the rewrite even as it stands right now – it doesn’t insert nearly as much into ActiveRecord, and it’s much lighter generally (especially when the Rails stack reloads on every development request). I can’t make any promises of when it’ll be ready for prime time though – paid work takes priority.

    Would love any feedback you have on how Thinking Sphinx could be improved from the perspective of how you’re using it and Sphinx, too!

  6. katiepatrick

    Hey Tobi – Thanks for the awesome blog! Would you be kind enough to share where you guys get the data for the location autocomplete on the main airbnb search? Do you use geonames or have a downloaded database that you host on your server? Hook into a Google service? http://www.geonames.org/export/ajax-postalcode-autocomplete.html

    Love airbnb!

    Katie

  7. eugenemi

    Hi, would you mind sharing how you integrated “common friends” into your search? Is it also done using Sphinx? Or is it an extra stage that is performed on the results returned by Sphinx?

    – Eugene

  8. alpistee

    Katie: They are using this API from Google: https://developers.google.com/maps/documentation/javascript/places#places_autocomplete

    Tobi, so for each of the query terms you have to save the geo-encoding of the corresponding result for the query? And then you query rooms against a center point or a bounding box? Could you tell us more about that? :)

  9. Zack Ham

    Thanks for the writeup – will definitely put this config tip to use.

    Our @geodist queries were way too slow, so we settled on dramatically narrowing the dataset as a rough first pass in the query, leveraging what Sphinx is good at (full-text search).

    1) Add {floor(lng)}x{floor(lat)} to your index for each record.

    CONCAT(FLOOR(first_lng+180),'x',FLOOR(first_lat+90)) AS tile,

    2) Generate a string of tiles that fully covers the search area based on center of search and radius: https://gist.github.com/zackham/5911093

    3) Add it into your search query

    tile_query = "@tile #{MyModel.tiles_for(center_lat, center_lng, search_radius).join(' | ')}"
    keyword_query = keywords.nil? || keywords.strip == '' ? '' : "@(name,description,etc) #{Riddle.escape(keywords)}"
    extended_query = "#{keyword_query} #{tile_query}"

    Feel free to get in touch with any questions: zack@ridewithgps.com