Airbnb Engineering

Interning with the Airbnb Payments Team

My name is Rasmus Rygaard, and I’m a graduate student at Stanford University about to wrap up my master’s in computer science. I have spent the last 12 weeks interning as a software engineer at Airbnb for the second summer in a row. This year, I joined the Payments team to help build out our global payments platform. It has been a fantastic experience, and I have been amazed by the responsibility I have been given as an intern. During my internship, I sped up our financial pipeline by more than an order of magnitude, isolated and upgraded the code that lets guests pay with credit cards, and set up payouts through local bank transfers that will save our hosts millions of dollars in fees.

Speeding up the Financial Pipeline

My main project for the summer was improving our financial pipeline. Airbnb is fortunate to have plenty of graphs that go up and to the right, but when those graphs track application performance, we would rather have them stay flat. To help our Finance team understand our business, the Payments team built a service to generate and serve key business metrics. That service, however, was not scaling with our business. The Finance team depends on having fresh data available at all times, but our infrastructure was struggling to deliver results fast enough. My task was clear yet open-ended: Speed up the pipeline.

I considered several approaches to speeding up the pipeline, but I ended up adapting our existing MySQL queries to be runnable in Presto. Presto, a distributed in-memory query engine built at Facebook, is heavily used for ad-hoc analytics queries at Airbnb. Its in-memory computation provides significantly better performance than Hive, and for a task that required heavy financial number-crunching, it seemed like the obvious choice. In addition to translating the queries, I built a pipeline that lets us seamlessly run complex queries in Presto even for data that lives in MySQL. The system is written in Ruby, which lets us leverage Presto’s computational power from our existing Rails application. As a result, the functionality that we have moved to our new infrastructure now runs 10 to 15 times faster.
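
To make this concrete, here is a rough sketch of what issuing such a query from a Ruby application can look like, using the open-source presto-client gem; the server address, schema, table, and query are illustrative placeholders rather than our production code.

require "presto-client"

# Connect to the Presto coordinator (hostname, catalog, and schema are placeholders).
client = Presto::Client.new(
  server:  "presto-coordinator.example.com:8080",
  catalog: "hive",
  schema:  "default",
  user:    "financial_pipeline"
)

# Run an aggregation that would be expensive in MySQL; Presto returns
# the column metadata along with the result rows.
columns, rows = client.run(<<-SQL)
  SELECT currency, SUM(amount) AS total_volume
  FROM payments
  GROUP BY currency
SQL

rows.each { |currency, total| puts "#{currency}: #{total}" }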

Upgrading and Isolating Credit Card Code

My second project was to reorganize the way we process credit cards when you book a trip on Airbnb. I upgraded the JavaScript code that handles checkout logic so that our system no longer needs to be exposed to sensitive credit card information. Now, when a user books a listing, we exchange the credit card information for a token in the browser. That token stands in for the card details with our credit card processing partner, so we can complete the transaction exactly as before.
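
To make the flow concrete, here is a simplified sketch of what the server side of a booking can look like once tokenization is in place; the PaymentGateway client, the model methods, and the parameter names are hypothetical stand-ins, not our actual checkout code.

class BookingsController < ApplicationController
  # The browser has already exchanged the raw card details for a token,
  # so the only payment parameter this action ever sees is card_token.
  def create
    reservation = Reservation.find(params[:reservation_id])

    # Hypothetical processor client: we charge against the token,
    # never against a raw card number.
    result = PaymentGateway.charge(
      token:    params[:card_token],
      amount:   reservation.total_amount,
      currency: reservation.currency
    )

    if result.success?
      reservation.confirm!
      redirect_to reservation
    else
      render :payment_error
    end
  end
end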

This was a high-impact project that had to be treated like surgery: we were replacing code on perhaps our most important page, and our guests should not experience any bumps along the way, nor should we cause any downtime. In addition to upgrading the code, I extracted this functionality into a separate module that we now share between our Rails and Node.js apps for web and mobile web. This required decomposing the existing code, but as a side effect, we now have much more modular code around capturing and processing transactions that we can test more aggressively.


Paying Hosts in Local Currencies

Wherever possible, we want to pay our hosts in their local currency. A convenient way of doing that is through local bank transfers, which are faster and less expensive than international wire transfers. My final project for the summer was to work with a third-party provider to start supporting local bank transfers in a number of markets where the existing payout methods are slow or expensive. I was the point person on the engineering side of the integration, working on the technical integration with our provider and on product decisions with our own Global Payments team.

We rely on several partners to support our platform, so partnering with another provider was not new to the team. I, however, had never worked this closely on a third-party integration before. Still, my team trusted me to see through a project that would end up delivering payouts to hosts in countries around the world. Although work that requires syncing with an external company can be difficult to plan, the experience of balancing interests, requirements, and timelines between us and our partner gave me real-world experience that no class project or other internship could provide. Our hosts in these markets can now look forward to saving millions of dollars in processing fees that would otherwise be charged by the providers of the existing payout methods.

My 12 weeks at Airbnb have been an amazing learning experience. As an intern, I have worked on projects with significant impact on our hosts, guests, and core business. I have seen the global scale of our payments platform and how a small team of talented engineers makes sure that money flows safely from guests to hosts on Airbnb. I’ve been blown away by the visible, high-impact projects that the company trusts interns with, and it’s clear that interns are treated just like full-time employees.

Inside the Airbnb iOS Brand Evolution

Earlier this year we embarked on a six-month project to refresh the mobile app to coincide with the launch of our updated brand. The Airbnb product teams continued to ship new features for our mobile apps even during the months it took to implement the new brand. Logistically, this is the equivalent of repainting a plane while it is in flight — without any of the passengers knowing it. And when it landed, we wanted the new look to be a surprise.

When we embarked upon the brand evolution, we chose to use small, nimble engineering teams to keep people from blocking each other. When the spec for a project is changing, it is easy to have too many cooks in the kitchen as engineers begin to move faster than the shifting designs. Until the last month, the team consisted of one engineer on each mobile application and three on the website. Part of the logic behind this decision was also an acknowledgement that design is an iterative process: no PSD file survives first contact with the user. A small engineering team is better able to adapt, with fewer places for communication to break down.

Brand Evolution Workshop

The design team had already been hard at work for months creating the new brand, but they brought the engineers into the room early to help scope the project. At this point, we were considering:

  1. Design language: what would our “toolkit” (buttons, switches, etc.) look like?
  2. How could we improve the usability of the application without moving beyond the scope of a reskin?
  3. How could we do all this without disrupting other teams?

Even thinking within the confines of reskinning the application with a small team, it became apparent that there were some quick usability wins to be had. For example, we changed the background color of our side drawer (AirNav) to be dark, which greatly improved readability. For contrast, here are the before and after shots of our navigation system:

[Image: AirNav before and after]

We made the tabs on the Reservation screen look more obviously like buttons, which made the functionality easier for users to discover. It was tempting to say “yes” to every great idea dreamed up in these design/engineering meetings, but the constraint of small engineering teams meant that we were hyper-aware of our limited resources.

[Image: toolkit PSD]

Above: one of the PSDs in our toolkit, defining modal and user picture styles. Those with a sharp eye will note that this does not exactly match the final version, which was one of the challenges we knew we would be facing during the process.

Build Engineering

We had two options for how to structure our workflow: we could have confined the branding code to a long-lived Git branch, or we could have kept merging to master and split the Xcode targets. We decided to pursue the latter approach for a number of reasons. Most notably, we decided that even though only one engineer was responsible for the brand evolution itself, all team members would be responsible for ensuring that new features were compatible with it.


After duplicating our target and swapping out the app icon, we were faced with a new set of challenges. First, we needed to make sure no assets leaked through accidental inclusion in the distribution target. We meticulously examined every .ipa we built for leaks, and even went so far as to use code words as symbol names to prevent leakage through reverse engineering of the binary.

Next, we needed to decide how to branch the code. On the surface this was simple: we just added a preprocessor macro called STYLE_PRETZEL to the new target. Then we could use #if defined(STYLE_PRETZEL) in our code and rest assured that anything inside the block would not be compiled into the existing deployment target. There were definitely drawbacks to this method, such as:

  • Code complexity: the codebase was quickly riddled with branching logic
  • No header changes: due to the multi-project nature of our Xcode workspace, we could not guarantee that the macro would be available in a header file, so we could not use #if defined in .h files
  • (Almost) no new classes: for both of the above reasons, it was generally best to not add any new classes, but rather to work with existing classes

In some ways, these restrictions actually helped us: they forced us to think about the minimum viable code change necessary to accomplish the task. In other words: if you’re writing lots of code, you’ve probably moved beyond reskinning.

Progressive Implementation

Our internal styling library, Nitrogen, allowed us to swap out colors, fonts, and basic component (“toolkit”) styles. This wholesale replacement was then followed by multiple progressive refinement passes across all the screens. We liked to think of it like the progressive rendering of images downloaded over a slow internet connection:

After the toolkit came the “first-cut”, at which point the app definitely looked new, but was still nowhere near Airbnb’s standard of design perfection; we essentially eyeballed the changes:

[Image: first cut]

Next was “redlining.” The production team took the PSDs provided and used software to call out the exact fonts, colors, margins, etc.:

[Image: redlining]

With nested views, subclassed controllers, and other pitfalls, it can sometimes be hard to know whether the design is actually implemented to spec. To check our work, we used Reveal to compare the correct values to the actual implementation:

[Image: Reveal inspection]

All Hands on Deck

With about a month left until the big unveiling, we deleted the old Xcode target and removed the preprocessor macro and #if defined statements, locking us into shipping the next version of the app with the brand evolution. Not only were all mobile team members now working with the reskinned app, but it was also distributed to employees across the company to start testing.

One of our mantras at Airbnb is that “every frame matters,” and now was the time to prove it. Designers began going through the app with a fine-tooth comb, and engineers’ queues filled with bug reports. It cannot be stressed enough how important these last few weeks were. This is the point at which we were moving text by one pixel here and one pixel there, yet all these pixels added up to the difference between a good app and a clean app.

Lessons Learned

The biggest oversight we made was not thinking about localization earlier. Airbnb is an exceptionally global brand, and it is paramount that the app look stunning no matter what language it is shown in. This speaks to a greater need to “design for the dynamic”: we as engineers need to do a better job communicating with the design team about how they imagine the implementation changing, not just for different sizes of text, but for animations, screen sizes, and so on.

The early decision to split the Xcode targets rather than create a Git branch proved to be a good one. The result was that the rest of the team was at least superficially familiar with the new code, and when we called for All Hands on Deck there was minimal friction.

Conclusion

Overall, the process went exceptionally smoothly. Teams across the company continued to work on their own projects until just a few weeks before launch. On the big day, all of the teams pressed the “launch” button at the same time, and the world saw the new brand across all channels.

How to Drive Growth: Lessons from Facebook, Dropbox, and More

You’re inundated with tactics about how to grow your company. This talk will help you refine a framework for how to think about growth, and arm you with some tools to help hack through the bullshit.

We’ll fly through a 50,000-foot view of the product development process used at Dropbox and Facebook to drive growth.

We’ll dive into the nitty-gritty: dozens of real examples of tests that worked and didn’t work.

Ivan Kirigin is an engineer and founder at YesGraph, a young startup focused on referral recruiting. Previously he led growth at Dropbox, helping drive 12x growth in two years, helped build Facebook Credits, and co-founded Tipjoy (YC W08).

How to Search

I have worked on search projects for around 7 years now at 3 companies and have had the privilege of working with smart search engineers from many other companies. Solving search problems together, we’ve found there are many ways to skin a cat, and we’ve formed opinions on the best ways among them.

In this talk, I’ll share some of these experiences and cover different approaches to delivering the best search results for our users. I’ll cover infrastructure (indexing strategies, sharding, efficient retrieval, etc.) as well as relevance (query rewriting, scoring, etc.). Finally, I’ll spend some time on frontend/product issues and ways to measure search quality.

Sriram Sankar is a Principal Staff Engineer at LinkedIn, where he is leading the development of their next-generation search infrastructure and search quality frameworks. Before that, he led Facebook’s search quality efforts for Graph Search, and was a key contributor to Unicorn, the index powering Graph Search. He previously worked at Google on search quality and ads infrastructure and has held senior technical roles at VMware, WebGain, and Sun. He is the author of JavaCC, one of the leading parser generators for Java. Sriram obtained his PhD from Stanford University, and a BS from the Indian Institute of Technology Kanpur.

Referrals at Airbnb: Driving Sustainable & Scalable Growth

Word of mouth is the largest source of growth for Airbnb, in part because Airbnb experiences are so personal. People use Airbnb to unlock incredible experiences–anything from weekend getaways with friends to cultural exchanges to once-in-a-lifetime events like honeymoons.

The Growth team builds products that help users tell their stories. Our most successful program is Referrals, which accounts for up to 30% of growth in certain markets. We launched Referrals in January on all three of our platforms (our website, our Android app, and our iOS app), allowing users both to send and redeem referrals.

In this talk we will dive into the details of why we decided to build out Referrals, how we built it, and the effect it has had on our growth trajectory. We’ll walk you through in-depth learnings and tips that we’ve acquired along the way. Finally, we’ll share what we’re working on now.

Jimmy Tang is an engineer on the Airbnb Growth team and built referrals for iOS. He works on all things growth, from product to tools to tracking. Previously, he co-founded Yozio, a platform for organic mobile growth.

Gustaf Alstromer is a product manager on the Airbnb Growth team and helped launch Referrals. Gustaf was previously Head of Growth at Voxer before joining Airbnb at the end of 2012.

Architecting a Machine Learning System for Risk

Online risk mitigation

At Airbnb, we want to build the world’s most trusted community.  Guests trust Airbnb to connect them with world-class hosts for unique and memorable travel experiences. Airbnb hosts trust that guests will treat their home with the same care and respect that they would their own.  The Airbnb review system helps users find community members who earn this trust through positive interactions with others, and the ecosystem as a whole prospers.

The overwhelming majority of web users act in good faith, but unfortunately, there exists a small number of bad actors who attempt to profit by defrauding websites and their communities. The trust and safety team at Airbnb works across many disciplines to help protect our users from these bad actors, ideally before they have the opportunity to harm the community.

There are many different kinds of risk that online businesses may have to protect against, with varying exposure depending on the particular business. For example, email providers devote significant resources to protecting users from spam, whereas payments companies deal more with credit card chargebacks.

We can mitigate the potential for bad actors to carry out different types of attacks in different ways.

1) Product changes

Many risks can be mitigated through user-facing changes to the product that require additional verification from the user. For example, requiring email confirmation, or implementing 2FA to combat account takeovers, as many banks have done.

2) Anomaly detection

Scripted attacks are often associated with a noticeable increase in some measurable metric over a short period of time. For example, a sudden 1000% increase in reservations in a particular city could be a result of excellent marketing, or fraud.
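
As a crude illustration of the idea (not our actual detection logic), a check like the following compares the most recent hour of some metric against its trailing average; the threshold and data are invented.

# Hypothetical anomaly check: flag when the last hour's count of an event
# is far above its trailing average. The multiplier is made up for illustration.
def anomalous?(hourly_counts, multiplier: 10)
  baseline = hourly_counts[0..-2].sum / (hourly_counts.size - 1).to_f
  hourly_counts.last > multiplier * baseline
end

anomalous?([40, 55, 47, 52, 600])  # => true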

3) Simple heuristics or a machine learning model based on a number of different variables

Fraudulent actors often exhibit repetitive patterns.  As we recognize these patterns, we can apply heuristics to predict when they are about to occur again, and help stop them.  For complex, evolving fraud vectors, heuristics eventually become too complicated and therefore unwieldy.  In such cases, we turn to machine learning, which will be the focus of this blog post.
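
As a toy illustration, a hand-written rule might start out like the sketch below (every signal and threshold here is invented); each newly observed pattern adds another clause, which is why these rules eventually collapse under their own weight and we reach for a model instead.

# A hypothetical heuristic for flagging risky reservations.
# The signals and thresholds are made up for illustration.
def suspicious_reservation?(reservation, user)
  return true if user.account_age_in_hours < 1 && reservation.total_amount > 1000
  return true if user.failed_card_attempts > 3
  return true if reservation.nights == 1 && user.country != reservation.listing_country

  false
end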

For a more detailed look at other aspects of online risk management, check out Ohad Samet’s great ebook.

Machine Learning Architecture

Different risk vectors can require different architectures. For example, some risk vectors are not time critical, but require computationally intensive techniques to detect. An offline architecture is best suited for this kind of detection. For the purposes of this post, we are focusing on risks requiring realtime or near-realtime action. From a broad perspective, a machine-learning pipeline for these kinds of risk must balance two important goals:

  1. The framework must be fast and robust.  That is, we should experience essentially zero downtime and the model scoring framework should provide instant feedback.  Bad actors can take advantage of a slow or buggy framework by scripting many simultaneous attacks, or by launching a steady stream of relatively naive attacks, knowing that eventually an unplanned outage will provide an opening.  Our framework must make decisions in near real-time, and our choice of a model should never be limited by the speed of scoring or deriving features.
  2. The framework must be agile. Since fraud vectors constantly morph, new models and features must be tested and pushed into production quickly.  The model-building pipeline must be flexible enough to allow data scientists and engineers to remain unconstrained in how they solve problems.

These may seem like competing goals, since optimizing for realtime calculations during a web transaction creates a focus on speed and reliability, whereas optimizing for model building and iteration creates more of a focus on flexibility. At Airbnb, engineering and data teams have worked closely together to develop a framework that accommodates both goals: a fast, robust scoring framework with an agile model-building pipeline.

Feature Extraction and Scoring Framework

In keeping with our service-oriented architecture, we built a separate fraud prediction service to handle deriving all the features for a particular model. When a critical event occurs in our system, e.g., a reservation is created, we query the fraud prediction service for this event. This service calculates all the features for the “reservation creation” model and sends them to our Openscoring service, which is described in more detail below. The Openscoring service returns a score and a decision based on a threshold we’ve set, and the fraud prediction service can then use this information to take action (e.g., put the reservation on hold).
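
The fraud prediction service itself is written in Java (more on that below), but to give a feel for the interaction, here is a minimal Ruby sketch of what a scoring request to Openscoring can look like; the hostname, model id, feature names, output field, and threshold are all illustrative, and the exact endpoint paths and payload shapes depend on the Openscoring version you deploy.

require "net/http"
require "json"
require "uri"

# Hypothetical Openscoring deployment; the model id and features are made up.
uri = URI("http://openscoring.example.com:8080/openscoring/model/reservation_creation")

request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
request.body = {
  id:        "reservation-12345",
  arguments: {
    account_age_days:    2,
    reservation_amount:  840.0,
    ip_country_mismatch: true
  }
}.to_json

response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
score    = JSON.parse(response.body).fetch("result", {})["probability_fraud"]

# The caller turns the score into a decision with a threshold,
# e.g. putting the reservation on hold for manual review.
puts "hold reservation for review" if score && score > 0.9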

The fraud prediction service has to be fast to ensure that we are taking action on suspicious events in near realtime. Like many of our backend services for which performance is critical, it is built in Java, and we parallelize the database queries necessary for feature generation. However, we also want the freedom to occasionally do some heavy computation in deriving features, so we run the service asynchronously so that we never block reservations or other critical flows. This asynchronous model works for many situations where a few seconds of delay in fraud detection has no negative effect. It’s worth noting, however, that there are cases where you may want to react in realtime to block transactions, in which case a synchronous query and precomputed features may be necessary. The service is built in a very modular way and exposes an internal RESTful API, making it easy to add new events and models.

Openscoring

Openscoring is a Java service that provides a JSON REST interface to JPMML, a Java evaluator for the Predictive Model Markup Language (PMML). Both JPMML and Openscoring are open source projects authored by Villu Ruusmann and released under the Apache 2.0 license (edit: the most recent version is licensed under the AGPL 3.0). The JPMML backend of Openscoring consumes PMML, an XML markup language that encodes several common types of machine learning models, including tree models, logit models, SVMs, and neural networks. We have streamlined Openscoring for a production environment by adding several features, including Kafka logging and statsd monitoring. Andy Kramolisch has modified Openscoring to permit using several models simultaneously.

As described below, there are several considerations that we weighed carefully before moving forward with Openscoring:

Advantages

  • Openscoring is open source - this has allowed us to customize Openscoring to suit our specific needs.
  • Supports random forests - we tested a few different learning methods and found that random forests offered an appropriate precision-recall trade-off for our purposes.
  • Fast and robust - after load testing our setup, we found that most responses took under 10ms.
  • Multiple models - after adding our customizations, Openscoring allows us to run many models simultaneously.
  • PMML format - PMML allows analysts and engineers to use whichever compatible machine learning package (R, Python, Java, etc.) they are most comfortable with to build models. The same PMML file can be used with Pattern to perform large-scale distributed model evaluation in batch via Cascading.

Disadvantages

  • PMML doesn’t support some types of models - PMML only supports relatively standard ML models; therefore, we can’t productionize bleeding-edge models or implement significant modifications to standard models.
  • No native support for online learning - models cannot train themselves on-the-fly. A secondary mechanism needs to be in place to automatically account for new ground truth.
  • Rudimentary error handling - PMML is difficult to debug. Editing the XML file by hand is a risky endeavor; this task is best left to the software packages, so long as they support the requisite features. JPMML is known to return relatively cryptic error messages when the PMML is invalid.

After considering all of these factors, we decided that Openscoring best satisfied our two-pronged goal of having a fast and robust, yet flexible machine learning framework.

Model Building Pipeline

[Diagram: model-building pipeline]
A schematic of our model-building pipeline using PMML is illustrated above. The first step involves deriving features from the data stored on the site. Since the combination of features that gives the optimal signal is constantly changing, we store the features in a JSON format, which allows us to generalize the process of loading and transforming features based on their names and types. We then transform the raw features by bucketing or binning values and by replacing missing values with reasonable estimates to improve signal. We also remove features that are shown to be statistically unimportant from our dataset. While for brevity we omit most of the details of how we perform these transformations, it is important to recognize that these steps take a significant amount of time and care. We then use our transformed features to train and cross-validate the model using our favorite PMML-compatible machine learning library, and upload the PMML model to Openscoring. The final model is tested and then used for decision-making if it becomes the best performer.

The model-training step can be performed in any language with a library that outputs PMML. One commonly used and well-supported library is the R PMML package, and generating a PMML file with R requires very little code.

Such an R script has the advantage of simplicity, and a similar script is a great way to start building PMMLs and to get a first model into production. In the long run, however, a setup like this has some disadvantages. First, our script requires that we perform feature transformation as a pre-processing step, and therefore we have to add these transformation instructions to the PMML by editing it afterwards. The R PMML package supports many PMML transformations and data manipulations, but it is far from universal. We deploy the model as a separate step — post model-training — and so we have to manually test it for validity, which can be a time-consuming process. Yet another disadvantage of R is that the implementation of the PMML exporter is somewhat slow for a random forest model with many features and many trees. However, we’ve found that simply re-writing the export function in C++ decreases run time by a factor of 10,000, from a few days to a few seconds.

We can get around the drawbacks of R while maintaining its advantages by building a pipeline based on Python and scikit-learn. Scikit-learn is a Python package that supports many standard machine learning models and includes helpful utilities for validating models and performing feature transformations. We find that Python is a more natural language than R for ad-hoc data manipulation and feature extraction. We automate the process of feature extraction based on a set of rules encoded in the names and types of variables in the features JSON; thus, new features can be incorporated into the model pipeline with no changes to the existing code. Deployment and testing can also be performed automatically in Python by using its standard network libraries to interface with Openscoring. Standard model performance tests (precision-recall, ROC curves, etc.) are carried out using sklearn’s built-in capabilities. Sklearn does not support PMML export out of the box, so we have written an in-house exporter for particular sklearn classifiers. When the PMML file is uploaded to Openscoring, it is automatically tested for correspondence with the scikit-learn model it represents. Because feature transformation, model building, model validation, deployment, and testing are all carried out in a single script, a data scientist or engineer is able to quickly iterate on a model based on new features or more recent data, and then rapidly deploy the new model into production.

Takeaways: ground truth > features > algorithm choice

Although this blog post has focused mostly on our architecture and model building pipeline, the truth is that much of our time has been spent elsewhere.  Our process was very successful for some models, but for others we encountered poor precision-recall.  Initially we considered whether we were experiencing a bias or a variance problem, and tried using more data and more features.  However, after finding no improvement, we started digging deeper into the data, and found that the problem was that our ground truth was not accurate.

Consider chargebacks as an example. A chargeback can be “Not As Described (NAD)” or “Fraud” (this is a simplification), and grouping both types of chargebacks together for a single model would be a bad idea because legitimate users can file NAD chargebacks. This is an easy problem to resolve, and not one we actually had (agents categorize chargebacks as part of our workflow); however, there are other types of attacks where distinguishing legitimate activity from illegitimate activity is more subtle, and this necessitated the creation of new data stores and logging pipelines.

Most people who’ve worked in machine learning will find this obvious, but it’s worth re-stressing:
If your ground truth is inaccurate, you’ve already set an upper limit to how good your precision and recall can be. If your ground truth is grossly inaccurate, that upper limit is pretty low. 

On a related note, sometimes you don’t know what data you’re going to need until you’ve seen a new attack, especially if you haven’t worked in the risk space before, or have worked in it but only in a different sector. So the best advice we can offer in this case is to log everything. Throw it all in HDFS, whether you need it now or not. In the future, you can always use this data to backfill new data stores if you find it useful. This can be invaluable in responding to a new attack vector.

Future Outlook

Although our current ML pipeline uses scikit-learn and Openscoring, our system is constantly evolving. Our current setup is a function of the stage of the company and the amount of resources, both in terms of personnel and data, that are currently available. Smaller companies may only have a few ML models in production and a small number of analysts, and can take time to manually curate data and train models in many non-standardized steps. Larger companies might have many, many models, require a high degree of automation, and get a sizable boost from online training. A unique challenge of working at a hyper-growth company is that the landscape fundamentally changes year over year, and pipelines need to adjust to account for this.

As our data and logging pipelines improve, investing in improved learning algorithms will become more worthwhile, and we will likely shift to testing new algorithms, incorporating online learning, and expanding our model-building framework to support larger data sets. Additionally, some of the most important opportunities to improve our models are based on insights into our unique data, feature selection, and other aspects of our risk systems that we are not able to share publicly. We would like to acknowledge the other engineers and analysts who have contributed to these critical aspects of this project. We work in a dynamic, highly collaborative environment, and this project is an example of how engineers and data scientists at Airbnb work together to arrive at a solution that meets a diverse set of needs. If you’re interested in learning more, contact us about our data science and engineering teams!

 

*Illustration above by Shyam Sundar Srinivasan

Engineering Culture at Airbnb

If you had visited Airbnb’s office yesterday you probably would have noticed something: clapping. I’m not sure why, but sometimes a team will applaud a small victory, then more people will start clapping, then suddenly the entire product and engineering area is a din of applause and cheers. Most people don’t know why they’re clapping; they just want to show support and have fun.

Maybe that’s what good culture is about: defaulting to an attitude of support and celebrating others’ successes. Every company has some kind of culture. Some maintain it with meticulous attention; others just let it happen and hope for the best. Either way, one fact remains: good culture creates an environment where people can do their best work, while bad culture is soul-destroying.

I’ve been at Airbnb for a little over a year now. Previously I’ve been an engineer and manager at many companies including Facebook and Yahoo. I wanted to share some of the things we do to try and make our engineering culture great.

Engineers own their impact

At its core, our philosophy is this: engineers own their own impact. Each engineer is individually responsible for creating as much value for our users and for the company as possible.

We hire primarily for problem-solving. When you have a team of strong problem-solvers, the most efficient way to move the company forward is to leave decision-making up to individual engineers. Our culture, tools, and processes all revolve around giving individual contributors accurate and timely information that they can use to make great decisions. This helps us iterate, experiment, and learn faster.

Making this environment possible requires a few things. Engineers are involved in goal-setting, planning and brainstorming for all projects, and they have the freedom to select which projects they work on. They also have the flexibility to balance long and short term work, creating business impact while managing technical debt. Does this mean engineers just do whatever they want? No. They work to define and prioritize impactful work with the rest of their team including product managers, designers, data scientists and others.

Just as importantly, engineers have transparent access to information. We default to information sharing. The more information engineers have, the more autonomously they can work. Everything is shared unless there’s an explicit reason not to (which is rare). That includes access to the analytics data warehouse, weekly project updates, CEO staff meeting notes, and a lot more.

This environment can be scary, especially for new engineers. No one is going to tell you exactly how to have impact. That’s why one of our values is that helping others takes priority. In our team, no one is ever too busy to help. In particular, our new grad hires are paired with a team that can help them find leveraged problems. Whether it’s a technical question or a strategic one, engineers always prioritize helping each other first.

Structured teams, fluid responsibilities

Being able to decide what’s impactful is possible with a clear company strategy to guide the decision-making process. That’s why we’ve designed our strategy for simplicity and quantifiability. It’s simple enough to fit on a single page and every employee at Airbnb knows how their function relates to the big picture. Knowing what your team’s goal is helps you decide how to use your time, which minimizes time-wasting debates about the existential stuff. And because each of our major goals has a numeric target, we can measure the effectiveness of various projects, learning quickly from our successes and failures.

Our team structure also maps to our company strategy: we work in tight working groups of generally 10 people or fewer, with efficient lines of communication. Teams are primarily composed of engineers, product managers, designers, and data scientists, and some teams partner with other departments within the company. There is strong collaboration between functions: Payments includes people from finance, and Internal Tools includes people from customer experience. It’s common for engineers and designers to pair up and figure out how to make something work in realtime. The best ideas come from close collaboration.

This year, we have ten teams focused on product development and four teams focused on technical infrastructure. Each team is concerned with a specific aspect of Airbnb as a business, and defines its own subgoals and projects on a quarterly basis, using the overall company strategy as a compass.

Although each team owns non-overlapping pieces of the business, collaborating across teams is common and encouraged. For instance, we have discrete Host and Guest teams, since we tend to think of hosts and guests as separate user demographics, each with their own set of needs. But since the interactions between hosts and guests are what make Airbnb special, these teams contribute to their counterparts’ roadmaps, share goals, and partner up on projects, while retaining enough separation to build specific expertise about their constituents’ use cases and needs. Fostering collaboration across teams helps us cover gaps.

It’s common for engineers to switch teams or contribute to areas beyond the scope of their immediate team. For example, it’s routine for a product-focused team to contribute to improving our infrastructure in the workflow of their projects. Engineers have freedom to change teams when the work in another group more closely aligns with their interests and ability to drive impact. In fact, it is encouraged. Managers can facilitate this process, but it’s up to the individual to find the team where he or she can have the greatest impact and initiate a move.

Cultural standards in the development process

The development process at Airbnb is flexible by design. We don’t want to build in different directions, but we also don’t want to be so standardized that we miss out on better tools and methodologies when they emerge. We believe in shaping good judgment in individuals instead of imposing rules across the team.

When our process changes, it happens organically from within the team. Code reviews are an old but good example of this. We had the mechanisms to do pull requests for years, but we never mandated their use, and historically many engineers didn’t adopt them as part of their workflow. In fact, in the early days it was common practice to merge your own changes directly to master and deploy the site. This is kind of like juggling chainsaws blindfolded — looks cool when you pull it off, but eventually you’re going to lose a finger.

At some point a few motivated engineers started highlighting great code reviews at our weekly engineering all-hands meetings. They’d highlight some of the most helpful or thoughtful code reviews they had seen over the week. Soon more engineers started adopting pull requests and a tipping point was reached where it became strange if you didn’t ask for code review. Now it is just how we do development.

At the same time, this cultural shift was mirrored by advances in our tooling. A small team of engineers took it upon themselves to build out our continuous integration infrastructure, enabling the engineering team to run the entire test suite in minutes anytime they checked in a branch. Lowering the barriers to good behavior with tooling catalyzed the team’s cultural change.

This is one example, but there are countless others including how we adopted our project management tools and bug tracker. When we discover a better way of doing things we facilitate awareness of the idea then let it stand on its own merit until it catches on (or doesn’t). This way teams have a lot of flexibility with how they accomplish their work and we create opportunity for new good ideas to emerge.

Any engineer can contribute to any part of the codebase. All repositories are open to all engineers. This is possible because of our culture of automated testing, our code reviews, and our ability to detect anomalies in production through detailed monitoring. The standard etiquette here is borrowed from the open source world: someone from the team that maintains the codebase you’re touching should review your changes before you merge.

This model makes it easier for engineers to unblock themselves. Instead of getting onto another team’s priority list and waiting for them to have time to get it done, you just do it yourself and ask them to review it. That code review happens quickly because, again, helping others takes priority.

Once code is merged engineers deploy their own changes. In a given day, we’ll deploy the site 10 times or more. Our build-and-test process takes under 10 minutes to run and we can complete a full production deploy in about 8 minutes. Because it’s so fast, we ask engineers to deploy their changes as soon as they’re merged. Smaller change sets to production mean less chance for conflict and easier debugging when something goes wrong. It’s common etiquette to be present in our engineering chatroom as you deploy your changes. Our bot announces when the deploy starts and completes and the engineer announces they have verified their changes in production. During this time the engineer is also responsible for watching the metrics to make sure nothing bad happens.

Of course, bad things do happen sometimes. In these cases we may rollback the site, or fix and roll forward. When things are fixed, engineers work with the site reliability team to write a blameless post-mortem. We keep all post-mortems in an incident reporter tool that we developed internally. Post-mortems heavily inform proactive work we do to make infrastructure more reliable.

Career advancement, tracks, and impact

Another one of our beliefs is that engineers can progress just as far as individual contributors as they can as managers. There are two tracks by which engineers can progress in their careers: management and individual contribution. The pay scales are parallel, so there’s no compensation advantage for getting into engineering management at Airbnb.

In fact, becoming a manager isn’t about getting promoted; it’s about changing the focus of your work. Managers are facilitators. They exist to get obstacles out of engineers’ way, whether those are career obstacles, prioritization questions, or technical problems. Their primary responsibility is to support the people around them.

We also value technical strength in our managers. Each manager is involved in dozens of technical decisions a week. Without a strong technical background, their influence in that process can lead to poor results. For this reason, all managers start as individual contributors. They can transition into management when they’re familiar with the code and development practices and, more importantly, when it feels like a natural move. We don’t airdrop managers.

An individual contributor’s primary responsibility is technical execution that drives impact to the business. They are responsible for finding and doing high impact work. In that process another value is to leave it better than you found it. Every project should improve our technical foundation. That responsibility falls to individual contributors and this means that engineers are driving technical decisions and holding each other to high standards of technical work. It also means that engineers negotiate feature trade-offs and deadlines to make sure enough time is given to do quality engineering.

Another way that we help engineers progress is by helping them build their individual profiles outside the company. We do this through blog posts on our nerds blog and through open source. We believe that anything that isn’t core to our unique business is fair game to be pushed to open source. We always want to be contributing useful technology back to the community. We encourage it as a way to help increase awareness around the engineering work we’re doing and to showcase some of the best work by our engineers.

Summing up

Right now, we are still establishing the foundation and practices that will carry us forward over the next several years. Things that seem like trivial decisions today will be amplified 10x down the road when we’re a much bigger team. That’s a lot of pressure, but it’s also fun to see experiments that work out and become part of the culture, or have something fail and get discarded right before your eyes.

Not fucking up the culture is paramount. When you’re growing quickly, it’s important to keep the environment creative and fun. Our engineering team meets every Friday for an hour of technical presentations, animated GIFs, applause, appreciation and cheers. We do multi-day hackathons twice a year that are each worthy of their own posts. I meet with small groups of engineers every week just to ask questions and listen to ideas on how we can improve. We have a nerd cave where engineers can hang out and listen to records while they work. We could probably do an entire post on how we stay connected and have fun as a team but I’ll save that for another day.

What makes Airbnb special is that our culture connects engineers to the company mission and to each other more strongly than anyplace else I’ve seen. Engineers own their impact here, prioritize helping others, default to sharing information, and continually leave the code better than they found it. Our culture empowers engineers to do their best work, and helps them get excited to come to work every day.

Experiment Reporting Framework

At Airbnb we are always trying to learn more about our users and improve their experience on the site. Much of that learning and improvement comes through the deployment of controlled experiments. If you haven’t already read our other post about experimentation, I highly recommend you do so, but I will summarize the two main points: (1) running controlled experiments is the best way to learn about your users, and (2) there are a lot of pitfalls when running experiments. To that end, we built a tool to make running experiments easier by hiding all the pitfalls and automating the analytical heavy lifting.

[Screenshot: experiment reporting UI]

When designing this tool, making experiments simple to run was the primary focus. We also had some specific design goals that came out of what we’ve learned from running and analyzing experiments with our previous tool.

  • Make sure the underlying data are correct. While this seems obvious, I’ll show some examples below of problems we ran into before that caused us to lose confidence in our analysis.

  • Limit the ways someone setting up an experiment could accidentally introduce bias and ensure that we automatically and reliably logged when a user was placed into a treatment for an experiment.

  • Subject experimental changes to the same code review process we use for code changes–they can affect the site in the same way, after all.

  • Run the analysis automatically so the barrier to entry to running (and learning from) an experiment is as low as possible.

Example Experiment

For the rest of this post, let’s consider a sample experiment we might want to run and how we’d get there–from setting it up, to making the code changes, to seeing the results.

[Screenshot: search results page]

Here is our current search results page: on the left we have a map of the results, and on the right, images of the listings. By default, we show 18 results per page, but we wanted to understand how showing 12 or 24 results would affect user behavior. Do users prefer getting more information at once? Or is it confusing to show too much? Let’s walk through the process of running that experiment.

Declaring treatments

For declaring experiments we settled on YAML, since it provides a nice balance between human and machine readability. To define an experiment, you need two key things: the subject and the treatments. The subject is who you want to run this experiment against. In this case, we chose visitor, since not all users who search are logged in. If we were running an experiment on the booking flow (where users have to log in first), we could run the experiment against users. For a more in-depth look at the issues we’ve seen with visitor versus user experiments, check out our other post. Second, we have to define the treatments; in this case we have the control (18 results per page) and our two experimental groups, 12 and 24 results per page. The human_readable fields are what will be used in the UI.

search_per_page:
  human_readable: Search results per page
  subject: visitor
  treatments:
    12_per_page:
      human_readable: 12 per page
    18_per_page:
      human_readable: 18 per page
    24_per_page:
      human_readable: 24 per page
  control: 18_per_page

Deploying

The next step is to implement this experiment in code. In the examples below, we’ll be looking at Ruby code, but we have a very similar function in JavaScript that we can use for running experiments on cached pages.

deliver_experiment(
  "search_per_page",
  :"12_per_page" => lambda { render "search_results", :results => 12 },
  :"18_per_page" => lambda { render "search_results", :results => 18 },
  :"24_per_page" => lambda { render "search_results", :results => 24 },
  :unknown       => lambda { render "search_results", :results => 18 }
)

The first argument is just the name of the experiment (from above). Then we have a key for each treatment defined above, each mapped to a lambda. The deliver_experiment function does three main things: (1) it assigns the subject (based on the experiment’s subject type) to a treatment group, (2) it logs that the subject was put into that treatment group, and (3) it executes the lambda provided for that group. You’ll also notice one more argument, :unknown. This is there in case we run into some unexpected failure. We want to make sure that even if something goes horribly wrong, we still provide the user with a good experience. This group allows us to handle those cases by rendering that view to the user and logging that the unknown treatment was given (and, of course, also logging the error as needed).
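
Conceptually, deliver_experiment boils down to something like the following sketch; the helpers referenced here (Experiments.find, treatment_for, log_treatment, visitor_id) are simplified stand-ins, and the real implementation handles assignment, logging, and error cases more carefully.

# Simplified sketch of a deliver_experiment helper; not the production implementation.
def deliver_experiment(experiment_name, treatments)
  experiment = Experiments.find(experiment_name)   # loaded from the yaml declarations
  subject_id = experiment.subject == "visitor" ? visitor_id : current_user.id

  # (1) deterministically assign the subject to a treatment group
  group = experiment.treatment_for(subject_id)
  # (2) log that the subject was placed into that group
  log_treatment(experiment_name, subject_id, group)
  # (3) execute the lambda for that group
  treatments.fetch(group).call
rescue StandardError
  # If anything goes wrong, fall back to the unknown treatment and log it.
  log_treatment(experiment_name, subject_id, :unknown)
  treatments.fetch(:unknown).call
end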

This design may seem a little unorthodox, but there is a method behind the madness. To understand why we chose lambdas instead of something simpler like if statements, let’s look at a few examples of doing it differently. Imagine, instead, we had a function that would return the treatment for a given user. We could then deploy an experiment like this:

treatment = user.get_treatment("search_per_page")
if treatment == :"12_per_page"
  ...
elsif treatment == :"18_per_page"
  ...
elsif treatment == :"24_per_page"
  ...
else
  ...
end

This would work perfectly, and we could log which treatment a user was put into in the get_treatment function. What if, however, someone is looking at site performance later on and realizes that serving 24 results per page is causing the load times to skyrocket in China? They don’t know about the experiment you’re trying to run, but want to improve the user experience for Chinese users, so they come to the code and make the following change:

treatment = user.get_treatment("search_per_page")
if treatment == :"12_per_page"
  ...
elsif treatment == :"18_per_page"
  ...
elsif user.country != "China" &&
      treatment == :"24_per_page"
  ...
else
  ...
end

Now, what's happening? We're still going to log that Chinese users are put into the 24-results-per-page group (since that happens on line 1), but because of the change they will not actually see 24 results per page. We've biased our experiment. You could do the same thing with the lambda version, but we've found that making it explicit that this code path belongs to an experiment makes people far less likely to put switching logic inside it.

Let's look at another example: what about the following two statements?

if user.country != "China" &&
   user.get_treatment("search_per_page") == :"24_per_page"
  ...
end

if user.get_treatment("search_per_page") == :"24_per_page" &&
   user.country != "China"
  ...
end

In both cases the logic is identical and the same users will see the treatment. The difference is short-circuit evaluation: in the first statement, get_treatment (and therefore its logging) only runs for non-Chinese users, so we log exactly the users who actually see the treatment. In the second statement, get_treatment runs first, so we are back to the problem above: Chinese users are logged as seeing the 24-results-per-page treatment even though they never do.

Analyzing

Finally, once that's all done and deployed into the wild, we wait for the results to roll in. Currently we process the experiment results nightly, although the analysis could easily be run more frequently. You can see a screenshot of the UI for the search results per page experiment in the image at the beginning of the post. At first glance, you'll see red and green cells. These cells signify metrics that we think are statistically significant based on the methods presented in our previous post (red for bad and green for good). The uncolored cells with grey text represent metrics for which we are not yet sufficiently confident in the results. We also plot a sparkline of the p-value and delta over time, which allows a user to look for convergence of these values.

As you can also see from the UI, we provide two other mechanisms for looking at the data, though I won't go into too much detail on them here. The first is filtering the results, for example to only new or returning users. The second is pivoting the results, so that a user can see how a specific metric performed on new versus returning users.

Once we have significant results for the metrics we were interested in for an experiment, we can make a determination about the success (or failure) of that experiment and deploy a specific treatment in the code. To avoid building up confusing code paths, we try to tear down all completed experiments. Experiments can then be marked as retired, which stops the analysis from running but retains the data so it can still be referred to in the future.

We plan to eventually open source much of this work. In the meantime, we hope this post gives you a taste of some of the decisions we made when designing this tool and why we made them. If you’re trying to build something similar (or already have) we’d love to hear from you.

The post Experiment Reporting Framework appeared first on Airbnb Engineering.

]]>
http://nerds.airbnb.com/experiment-reporting-framework/feed/ 2
Experiments at Airbnb http://nerds.airbnb.com/experiments-at-airbnb/?utm_source=rss&utm_medium=rss&utm_campaign=experiments-at-airbnb http://nerds.airbnb.com/experiments-at-airbnb/#comments Tue, 27 May 2014 23:18:54 +0000 http://nerds.airbnb.com/?p=176594318 Airbnb is an online two-sided marketplace that matches people who rent out their homes (‘hosts’) with people who are looking for a place to stay (‘guests’). We use controlled experiments to learn and make decisions at every step of product development, from design to algorithms. They are equally important in shaping the user experience. While […]

The post Experiments at Airbnb appeared first on Airbnb Engineering.

]]>
Airbnb is an online two-sided marketplace that matches people who rent out their homes (‘hosts’) with people who are looking for a place to stay (‘guests’). We use controlled experiments to learn and make decisions at every step of product development, from design to algorithms. They are equally important in shaping the user experience.

While the basic principles behind controlled experiments are relatively straightforward, using experiments in a complex online ecosystem like Airbnb during fast-paced product development can lead to a number of common pitfalls. Some, like stopping an experiment too soon, are relevant to most experiments. Others, like the issue of introducing bias on a marketplace level, start becoming relevant for a more specialized application like Airbnb. We hope that by sharing the pitfalls we’ve experienced and learned to avoid, we can help you to design and conduct better, more reliable experiments for your own application.

Why experiments?

Experiments provide a clean and simple way to make causal inference. It’s often surprisingly hard to tell the impact of something you do by simply doing it and seeing what happens, as illustrated in Figure 1.

Figure 1 – It’s hard to tell the effect of this product launch.

The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically. Controlled experiments isolate the impact of the product change while controlling for the aforementioned external factors. In Figure 2, you can see an example of a new feature that we tested and rejected this way. We thought of a new way to select what prices you want to see on the search page, but users ended up engaging less with it than the old filter, so we did not launch it.

Figure 2 – Example of a new feature that we tested and rejected.

When you test a single change like this, the methodology is often called A/B testing or split testing. This post will not go into the basics of how to run a basic A/B test. There are a number of companies that provide out of the box solutions to run basic A/B tests and a couple of bigger tech companies have open sourced their internal systems for others to use. See Cloudera’s Gertrude, Etsy’s Feature, and Facebook’s PlanOut, for example.

The case of Airbnb

At Airbnb we have built our own A/B testing framework to run experiments, which you will be able to read more about in an upcoming blog post on the details of its implementation. There are a couple of features of our business that make experimentation more involved than a typical change of a button color, and that's why we decided to create our own testing framework.

First, users can browse without logging in or signing up, making it more difficult to tie actions to a single user. People often switch devices (between web and mobile) in the midst of booking. Also, since bookings can take a few days to confirm, we have to wait for those results. Finally, successful bookings often depend on available inventory and the responsiveness of hosts — factors out of our control.

Our booking flow is also complex. First, a visitor has to make a search. The next step is for the searcher to actually contact a host about a listing. Then the host has to accept the inquiry, and finally the guest has to actually book the place. In addition, we have multiple flows that can lead to a booking – a guest can instantly book some listings without first contacting the host, and can also make a booking request that goes straight to booking. This four-step flow is visualized in Figure 3. We look at the process of going through these four stages, but the overall conversion rate between searching and booking is our main metric.

Figure 3 – Example of an experiment result broken down by booking flow steps.
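To make the headline metric concrete, here is a tiny illustration with made-up funnel counts (the numbers are purely hypothetical):

# Made-up counts illustrating the four-step flow and the headline metric,
# the overall search-to-booking conversion rate.
funnel = { searched: 100_000, contacted: 12_000, accepted: 7_500, booked: 5_200 }

funnel.keys.each_cons(2) do |from, to|
  puts format("%-9s -> %-9s %5.1f%%", from, to, 100.0 * funnel[to] / funnel[from])
end

puts format("overall search -> booking: %.2f%%", 100.0 * funnel[:booked] / funnel[:searched])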

How long do you need to run an experiment?

A very common source of confusion in online controlled experiments is how much time you need before you can draw a conclusion from the results. The problem with the naive method of using the p-value as a stopping criterion is that the statistical test that gives you a p-value assumes that you designed the experiment with a sample size and effect size in mind. If you continuously monitor the development of a test and the resulting p-value, you are very likely to see an effect, even if there is none. Another common error is to stop an experiment too early, before an effect becomes visible.

Here is an example of an actual experiment we ran. We tested changing the maximum value of the price filter on the search page from $300 to $1000 as displayed below.

Figure 4 – Example experiment testing the value of the price filter

In Figure 5 we show the development of the experiment over time. The top graph shows the treatment effect (Treatment / Control – 1) and the bottom graph shows the p-value over time. As you can see, the p-value curve crosses the commonly used significance level of 0.05 after 7 days, at which point the effect size is 4%. If we had stopped there, we would have concluded that the treatment had a strong and significant effect on the likelihood of booking. But we kept the experiment running, and it ended up neutral. The final effect size was practically zero, with the p-value indicating that whatever remaining effect size there was should be regarded as noise.

Figure 5 – Result of the price filter experiment over time.

How did we know not to stop when the p-value hit 0.05? It turns out that this pattern of hitting "significance" early and then converging back to a neutral result is actually quite common in our system. There are various reasons for this. Users often take a long time to book, so the early converters have a disproportionately large influence at the beginning of the experiment. Also, even "small" sample sizes in online experiments are massive by the standards of the classical statistics in which these methods were developed. Since the statistical test is a function of the sample and effect sizes, if an early effect size is large through natural variation, the p-value is likely to dip below 0.05 early. But the most important reason is that you are performing a statistical test every time you compute a p-value, and the more tests you perform, the more likely you are to find an effect.
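To get a feel for how large this effect can be, here is a small, self-contained simulation sketch (not from our codebase) that compares checking an A/A experiment's p-value every day against checking it only once at a fixed horizon:

# Simulation sketch: how often an A/A test (no real effect) looks "significant"
# if you peek at the p-value every day versus only at a fixed horizon.
def two_sided_p(conv_a, n_a, conv_b, n_b)
  p_pool = (conv_a + conv_b).to_f / (n_a + n_b)
  se     = Math.sqrt(p_pool * (1 - p_pool) * (1.0 / n_a + 1.0 / n_b))
  return 1.0 if se.zero?
  z = (conv_a.to_f / n_a - conv_b.to_f / n_b) / se
  Math.erfc(z.abs / Math.sqrt(2))                    # two-sided normal p-value
end

days, daily_visitors, rate, runs = 30, 500, 0.05, 2_000
peeking_hits = fixed_hits = 0

runs.times do
  n = conv_a = conv_b = 0
  seen_significant = false
  days.times do
    daily_visitors.times { conv_a += 1 if rand < rate }
    daily_visitors.times { conv_b += 1 if rand < rate }
    n += daily_visitors
    seen_significant ||= two_sided_p(conv_a, n, conv_b, n) < 0.05
  end
  peeking_hits += 1 if seen_significant
  fixed_hits   += 1 if two_sided_p(conv_a, n, conv_b, n) < 0.05
end

puts "false positives with daily peeking: #{(100.0 * peeking_hits / runs).round(1)}%"
puts "false positives at fixed horizon:   #{(100.0 * fixed_hits / runs).round(1)}%"

With daily peeking, the share of null experiments that look "significant" at some point is typically several times the nominal 5% — the same phenomenon Figure 5 shows for a single experiment.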

As a side note, people familiar with our website might notice that, at the time of writing, we did in fact launch the increased max price filter, even though the result was neutral. We found that certain users like the ability to search for high-end places and decided to accommodate them, given that there was no dip in the metrics.

How long should experiments run for, then? To prevent a false negative (a Type II error), the best practice is to decide, before you start the experiment, on the minimum effect size that you care about, and then compute how long you need to run based on your sample size (the number of new samples that come in every day) and the certainty you want. Here is a resource that helps with that computation. Setting the time in advance also minimizes the likelihood of finding a result where there is none.
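As a rough illustration of that computation (a textbook two-proportion power calculation, not the specific calculator linked above):

# Rough sample-size sketch using the standard two-proportion power formula
# (alpha = 0.05 two-sided, 80% power). Just an illustration of why the horizon
# should be fixed up front; numbers are hypothetical.
def samples_per_group(baseline_rate, relative_lift, z_alpha: 1.96, z_beta: 0.84)
  delta    = baseline_rate * relative_lift
  variance = 2 * baseline_rate * (1 - baseline_rate)
  ((z_alpha + z_beta)**2 * variance / delta**2).ceil
end

n = samples_per_group(0.05, 0.05)  # detect a 5% relative lift on a 5% baseline
puts "#{n} subjects per group"     # => about 119,000 per group
# At, say, 20,000 new visitors per day split across two groups, that is
# roughly 12 days -- decided before the experiment starts.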

One problem, though, is that we often don't have a good idea of the size, or even the direction, of the treatment effect. It could be that a change is actually hugely successful and major profits are being lost by not launching the successful variant sooner. Or, on the other hand, sometimes an experiment introduces a bug, in which case it is much better to stop the experiment early before more users are alienated.

The moment when an experiment first dips into the otherwise "significant" region can be an interesting one, even when the pre-allotted time has not yet passed. In the case of the price filter experiment, you can see that when "significance" was first reached, the graph clearly did not look like it had converged yet. We have found this heuristic to be very helpful in judging whether or not a result looks stable. It is important to inspect the development of the relevant metrics over time, rather than to consider only a single effect size with a p-value.

We can use this insight to be a bit more formal about when to stop an experiment before the allotted time. This is useful if you want to make an automated judgment call on whether the change you're testing is performing particularly well or particularly poorly, which matters when you're running many experiments at the same time and cannot inspect them all manually. The intuition is that you should be more skeptical of early results, so the threshold below which you call a result should be very low at the beginning. As more data comes in, you can raise the threshold, because the likelihood of finding a false positive is much lower later in the game.

We solved the problem of choosing the p-value threshold at which to stop an experiment early by running simulations and deriving a curve that gives us a dynamic (time-varying) p-value threshold for deciding whether an early result is worth investigating. We wrote code to simulate our ecosystem and used it to run many simulations with varying values for parameters like the real effect size, the variance, and the level of certainty we want. This gives us an indication of how likely we are to see false positives or false negatives, and also how far off the estimated effect size is in the case of a true positive. In Figure 6 we show an example decision boundary.

Figure 6 – An example of a dynamic p-value curve.

It should be noted that this curve is very particular to our system and the parameters we used for this experiment. We share the graph as an example of the kind of curve you can derive for your own analysis.

Understanding results in context

A second pitfall is failing to understand results in their full context. In general, it is good practice to evaluate the success of an experiment based on a single metric of interest. This is to prevent cherry-picking of ‘significant’ results in the midst of a sea of neutral ones. However, by just looking at a single metric you lose a lot of context that could inform your understanding of the effects of an experiment.

Let’s go through an example. Last year we embarked on a journey to redesign our search page. Search is a fundamental component of the Airbnb ecosystem. It is the main interface to our inventory and the most common way for users to engage with our website. So, it was important for us to get it right. In Figure 7 you can see the before and after stages of the project. The new design puts more emphasis on pictures of the listings (one of our assets since we offer professional photography to our hosts) and the map that displays where listings are located. You can read about the design and implementation process in another blog post here.

Figure 7 – Before and after a full redesign of the search page.

A lot of work went into the project, and we all thought it was clearly better; our users agreed in qualitative user studies. Despite this, we wanted to evaluate the new design quantitatively with an experiment. This can be hard to argue for, especially when testing a big new product like this one: it can feel like a missed marketing opportunity not to launch to everyone at the same time. However, in keeping with our testing culture, we did test the new design — to measure the actual impact and, more importantly, to gather knowledge about which aspects did and didn't work.

After waiting for enough time to pass, as calculated with the methodology described in the previous section, we ended up with a neutral result. The change in the global metric was tiny and the p-value indicated that it was basically a null effect. However, we decided to look into the context and to break down the result to try to see if we could figure out why this was the case. Because we did this, we found that the new design was actually performing fine in most cases, except for Internet Explorer. We then realized that the new design broke an important click-through action for certain older versions of IE, which obviously had a big negative impact on the overall results. When we fixed this, IE displayed similar results to the other browsers, a boost of more than 2%.

Figure 8 – Results of the new search design.

Apart from teaching us to pay more attention to QA for IE, this was a good example of the lessons you can learn about the impact of your change in different contexts. You can break results down by many factors, like browser, country, and user type. It should be noted that doing this in the classic A/B testing framework requires some care. If you test breakdowns individually as if they were independent, you run a big risk of finding effects where there are none, just as with continuously monitoring an experiment as described in the previous section. It's very common to look at a neutral experiment, break it down many ways, and find a single 'significant' effect. Declaring victory for that particular group is likely to be incorrect. The reason for this is that you are performing multiple tests with the assumption that they are all independent, which they are not. One way of dealing with this problem is to decrease the p-value threshold at which you decide the effect is real. Read more about this approach here. Another way is to model the effects on all breakdowns directly with a more advanced method like logistic regression.
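As a sketch of that first approach, here is the simple Bonferroni-style adjustment applied to some hypothetical per-country p-values (the link above covers more sophisticated corrections):

# Sketch of a Bonferroni-style correction when slicing one experiment into
# many breakdowns (e.g. 20 countries). The per-country p-values are made up.
alpha          = 0.05
num_breakdowns = 20
adjusted_alpha = alpha / num_breakdowns   # 0.0025 instead of 0.05

{ "US" => 0.040, "FR" => 0.001, "JP" => 0.210 }.each do |country, p_value|
  verdict = p_value < adjusted_alpha ? "significant" : "treat as noise"
  puts "#{country}: p = #{p_value} -> #{verdict}"
end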

Assuming the system works

The third and final pitfall is assuming that the system works the way you think or hope it does. This should be a concern whether you build your own system to evaluate experiments or use a third-party tool. In either case, it's possible that what the system tells you does not reflect reality, either because it's faulty or because you're not using it correctly. One way to evaluate the system and your interpretation of it is to formulate hypotheses and then verify them.

Figure 9 – Results of an example dummy experiment.

Another way of looking at this is the observation that results that seem too good to be true have a higher likelihood of being false. When you encounter results like this, it is good practice to be skeptical and to scrutinize them in whatever way you can think of before you consider them accurate.

A simple example of this process is to run an experiment where the treatment is equal to the control. These are called A/A or dummy experiments. In a perfect world the system would return a neutral result (most of the time). What does your system return? We ran many ‘experiments’ like this (see an example run in Figure 9) and identified a number of issues within our own system as a result. In one case, we ran a number of dummy experiments with varying sizes of control and treatment groups. A number of them were evenly split, for example with a 50% control and a 50% treatment group (where everybody saw exactly the same website). We also added cases like a 75% control and a 25% treatment group. The results that we saw for these dummy experiments are displayed in Figure 10.

Figure 10 – Results of a number of dummy experiments.

You can see that in the experiments where the control and treatment groups are the same size, the results look neutral as expected (it’s a dummy experiment so the treatment is actually the same as the control). But, for the case where the group sizes are different, there is a massive bias against the treatment group.

We investigated why this was the case and uncovered a serious issue with the way we assigned logged-out visitors to treatment groups. The issue is particular to our system, but the general point stands: verifying that the system works the way you think it does is worthwhile and will probably lead to useful insights.
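If you want to automate this kind of check, a sanity test over a dummy experiment's aggregates might look something like the following sketch (not our actual tooling; the counts are hypothetical):

# Sketch of a sanity check on a dummy (A/A) experiment. Given assignment
# counts and a conversion metric per group, verify that (1) the observed
# split matches the configured weights and (2) the metric delta is noise.
def check_dummy_experiment(groups, expected_weights)
  total = groups.values.sum { |g| g[:subjects] }

  groups.each do |name, g|
    observed = g[:subjects].to_f / total
    expected = expected_weights[name]
    # Standard error of a proportion; flag splits more than ~3 SEs off target.
    se = Math.sqrt(expected * (1 - expected) / total)
    warn "suspicious split for #{name}: #{observed.round(4)} vs #{expected}" if (observed - expected).abs > 3 * se
  end

  control, treatment = groups.values
  p1     = control[:conversions].to_f / control[:subjects]
  p2     = treatment[:conversions].to_f / treatment[:subjects]
  pooled = (control[:conversions] + treatment[:conversions]).to_f / total
  se     = Math.sqrt(pooled * (1 - pooled) * (1.0 / control[:subjects] + 1.0 / treatment[:subjects]))
  z      = (p2 - p1) / se
  warn "dummy experiment is not neutral (z = #{z.round(2)})" if z.abs > 1.96
end

# Hypothetical 75/25 dummy experiment:
check_dummy_experiment(
  { control:   { subjects: 75_000, conversions: 3_730 },
    treatment: { subjects: 25_100, conversions: 1_260 } },
  { control: 0.75, treatment: 0.25 }
)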

One thing to keep in mind when you run dummy experiments is that you should expect some results to come out as non-neutral. This is simply how p-values work: for example, if you run a dummy experiment and look at its performance broken down by 100 different countries, you should expect, on average, 5 of them to give you a non-neutral result at the 0.05 significance level. Keep this in mind when you're scrutinizing a third-party tool!

Conclusion

Controlled experiments are a great way to inform decisions around product development. Hopefully, the lessons in this post will help prevent some common A/B testing errors.

First, the best way to determine how long to run an experiment is to compute, in advance, the sample size you need to make an inference. If the system gives you an early result, you can try to make a heuristic judgment about whether or not the trends have converged. It's generally good to be conservative in this scenario. Finally, if you do need to make automated launch and stopping decisions, it's good to be extra careful by employing a dynamic p-value threshold to determine how certain you can be about a result. The system we use at Airbnb to evaluate experiments employs all three ideas to help us with our decision-making around product changes.

Second, it is important to consider results in context. Break them down into meaningful cohorts and try to deeply understand the impact of the change you made. In general, experiments should be run to make good decisions about how to improve the product, rather than to aggressively optimize for a metric. Optimizing is not impossible, but it often leads to opportunistic decisions for short-term gains. By focusing on learning about the product, you set yourself up for better future decisions and more effective tests.

Finally, it is good to be scientific about your relationship with the reporting system. If something doesn’t seem right or if it seems too good to be true, investigate it. A simple way of doing this is to run dummy experiments, but any knowledge about how the system behaves is useful for interpreting results. At Airbnb we have found a number of bugs and counter-intuitive behaviors in our system by doing this.

Together with Will Moss, I gave a public talk on this topic in April 2014; you can watch a video recording of it here. Will published another blog post on the infrastructure side of things; read it here. We hope this post was insightful for those who want to improve their own experimentation.

The post Experiments at Airbnb appeared first on Airbnb Engineering.

]]>
http://nerds.airbnb.com/experiments-at-airbnb/feed/ 19
OpenAir: Paul Graham in Conversation with Nathan Blecharczyk http://nerds.airbnb.com/openair-paul-graham-conversation-nathan-blecharczyk/?utm_source=rss&utm_medium=rss&utm_campaign=openair-paul-graham-conversation-nathan-blecharczyk http://nerds.airbnb.com/openair-paul-graham-conversation-nathan-blecharczyk/#comments Tue, 13 May 2014 22:57:56 +0000 http://nerds.airbnb.com/?p=176594303 Paul Graham told the founders of Airbnb to “Do things that don’t scale” and it became an early mantra that informed many important moves for the company. Nathan and PG will discuss how companies from Vayable to HomeJoy can build businesses that cross the chasm by doing things that aren’t built to scale… and then […]

The post OpenAir: Paul Graham in Conversation with Nathan Blecharczyk appeared first on Airbnb Engineering.

]]>
Paul Graham told the founders of Airbnb to “Do things that don’t scale” and it became an early mantra that informed many important moves for the company. Nathan and PG will discuss how companies from Vayable to HomeJoy can build businesses that cross the chasm by doing things that aren’t built to scale… and then scaling them.

The post OpenAir: Paul Graham in Conversation with Nathan Blecharczyk appeared first on Airbnb Engineering.

]]>
http://nerds.airbnb.com/openair-paul-graham-conversation-nathan-blecharczyk/feed/ 0