Online risk mitigation
At Airbnb, we want to build the world’s most trusted community. Guests trust Airbnb to connect them with world-class hosts for unique and memorable travel experiences. Airbnb hosts trust that guests will treat their home with the same care and respect that they would their own. The Airbnb review system helps users find community members who earn this trust through positive interactions with others, and the ecosystem as a whole prospers.
The overwhelming majority of web users act in good faith, but unfortunately, a small number of bad actors attempt to profit by defrauding websites and their communities. The trust and safety team at Airbnb works across many disciplines to help protect our users from these bad actors, ideally before they have the opportunity to harm the community.
There are many different kinds of risk that online businesses may have to protect against, with varying exposure depending on the particular business. For example, email providers devote significant resources to protecting users from spam, whereas payments companies deal more with credit card chargebacks.
We can mitigate the potential for bad actors to carry out different types of attacks in different ways.
1) Product changes
Many risks can be mitigated through user-facing changes to the product that require additional verification from the user: for example, requiring email confirmation, or implementing two-factor authentication (2FA) to combat account takeovers, as many banks have done.
2) Anomaly detection
Scripted attacks are often associated with a noticeable increase in some measurable metric over a short period of time. For example, a sudden 1000% increase in reservations in a particular city could be a result of excellent marketing, or fraud.
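As a rough sketch of the idea, a detector along these lines compares the newest value of a metric against its trailing history (the counts, threshold, and names here are purely illustrative):

```python
# Minimal spike detector: flag an hourly count that exceeds the
# trailing mean by more than k standard deviations. All names and
# thresholds are illustrative, not a production system.
from statistics import mean, stdev

def is_anomalous(history, current, k=5.0):
    """history: recent hourly counts; current: the newest count."""
    if len(history) < 2:
        return False  # not enough history to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    return current > mu + k * max(sigma, 1.0)  # floor sigma so a flat history behaves

hourly_reservations = [40, 35, 42, 38, 41, 39]
print(is_anomalous(hourly_reservations, 420))  # True: roughly a 1000% spike
```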
3) Simple heuristics or a machine learning model based on a number of different variables
Fraudulent actors often exhibit repetitive patterns. As we recognize these patterns, we can apply heuristics to predict when they are about to occur again, and help stop them. For complex, evolving fraud vectors, heuristics eventually become too complicated and therefore unwieldy. In such cases, we turn to machine learning, which will be the focus of this blog post.
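As a toy illustration of such a heuristic (all field names and thresholds are hypothetical):

```python
# A hand-written heuristic over a few signals. Rules like this are
# cheap to deploy, but each new fraud pattern adds another clause,
# which is why complex vectors eventually call for a learned model.
def looks_risky(account_age_days, reservations_last_hour,
                card_country, ip_country):
    if account_age_days < 1 and reservations_last_hour > 3:
        return True
    if card_country != ip_country and reservations_last_hour > 5:
        return True
    return False

print(looks_risky(0, 4, "US", "US"))  # True: new account, burst of bookings
```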
For a more detailed look at other aspects of online risk management, check out Ohad Samet’s great ebook.
Machine Learning Architecture
Different risk vectors can require different architectures. For example, some risk vectors are not time-critical, but require computationally intensive techniques to detect; an offline architecture is best suited to this kind of detection. For the purposes of this post, we focus on risks requiring real-time or near-real-time action. From a broad perspective, a machine-learning pipeline for these kinds of risk must balance two important goals:
- The framework must be fast and robust. That is, we should experience essentially zero downtime and the model scoring framework should provide instant feedback. Bad actors can take advantage of a slow or buggy framework by scripting many simultaneous attacks, or by launching a steady stream of relatively naive attacks, knowing that eventually an unplanned outage will provide an opening. Our framework must make decisions in near real-time, and our choice of a model should never be limited by the speed of scoring or deriving features.
- The framework must be agile. Since fraud vectors constantly morph, new models and features must be tested and pushed into production quickly. The model-building pipeline must be flexible to allow data scientists and engineers to remain unconstrained in terms of how they solve problems.
These may seem like competing goals, since optimizing for realtime calculations during a web transaction creates a focus on speed and reliability, whereas optimizing for model building and iteration creates more of a focus on flexibility. At Airbnb, engineering and data teams have worked closely together to develop a framework that accommodates both goals: a fast, robust scoring framework with an agile model-building pipeline.
Feature Extraction and Scoring Framework
In keeping with our service-oriented architecture, we built a separate fraud prediction service to handle deriving all the features for a particular model. When a critical event occurs in our system, e.g., a reservation is created, we query the fraud prediction service for this event. This service can then calculate all the features for the “reservation creation” model and send them to our Openscoring service, which is described in more detail below. The Openscoring service returns a score and a decision based on a threshold we’ve set, and the fraud prediction service can then use this information to take action (e.g., put the reservation on hold).
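Concretely, the flow looks something like the following sketch; the endpoint, model name, feature names, output field, and threshold are all hypothetical, and the payload shape follows Openscoring’s JSON REST convention of an id plus an "arguments" map (details vary by version):

```python
# Sketch of the event-scoring flow: derive features for an event,
# score them via Openscoring's JSON REST interface, act on a threshold.
import requests

MODEL_URL = "http://openscoring.internal:8080/openscoring/model/reservation_creation"
HOLD_THRESHOLD = 0.8

def score_event(features):
    payload = {"id": "req-123", "arguments": features}
    response = requests.post(MODEL_URL, json=payload, timeout=1.0)
    response.raise_for_status()
    return response.json()["result"]

result = score_event({"guest_account_age_days": 0, "n_prior_reservations": 7})
if result.get("probability_fraud", 0.0) > HOLD_THRESHOLD:  # hypothetical output field
    print("placing reservation on hold")
```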
The fraud prediction service has to be fast, to ensure that we are taking action on suspicious events in near real-time. Like many of our backend services for which performance is critical, it is built in Java, and we parallelize the database queries necessary for feature generation. However, we also want the freedom to occasionally do some heavy computation in deriving features, so we run the service asynchronously, ensuring that we never block reservations or other critical flows. This asynchronous model works for many situations where a few seconds of delay in fraud detection has no negative effect. It’s worth noting, however, that there are cases where you may want to react in real-time to block transactions, in which case a synchronous query and precomputed features may be necessary. The service is built in a very modular way and exposes an internal RESTful API, making it easy to add new events and models.
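The production service is written in Java, but the shape of the parallelization can be sketched in a few lines of Python; the feature functions here are hypothetical stand-ins for database queries:

```python
# Parallel feature derivation: each feature's backing query runs
# concurrently, so total latency approaches that of the slowest
# query rather than the sum of all of them.
from concurrent.futures import ThreadPoolExecutor

def guest_account_age_days(event):
    return 12  # would query the users table

def reservations_last_hour(event):
    return 3   # would query the reservations table

FEATURES = {
    "guest_account_age_days": guest_account_age_days,
    "reservations_last_hour": reservations_last_hour,
}

def derive_features(event):
    with ThreadPoolExecutor(max_workers=len(FEATURES)) as pool:
        futures = {name: pool.submit(fn, event) for name, fn in FEATURES.items()}
        return {name: future.result() for name, future in futures.items()}

print(derive_features({"type": "reservation_created", "id": 42}))
```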
Openscoring
Openscoring is a Java service that provides a JSON REST interface to the Java Predictive Model Markup Language (PMML) evaluator, JPMML. Both JPMML and Openscoring are open source projects released under the Apache 2.0 license and authored by Villu Ruusmann (edit: the most recent version is licensed under the AGPL 3.0). The JPMML backend of Openscoring consumes PMML, an XML markup language that encodes several common types of machine learning models, including tree models, logit models, SVMs, and neural networks. We have streamlined Openscoring for a production environment by adding several features, including Kafka logging and statsd monitoring. Andy Kramolisch has modified Openscoring to permit using several models simultaneously.
As described below, there are several considerations that we weighed carefully before moving forward with Openscoring:
Advantages
- Openscoring is open source - this has allowed us to customize Openscoring to suit our specific needs.
- Supports random forests - we tested a few different learning methods and found that random forests offered an appropriate precision-recall trade-off for our purposes.
- Fast and robust - after load testing our setup, we found that most responses took under 10ms.
- Multiple models - after adding our customizations, Openscoring allows us to run many models simultaneously.
- PMML format - PMML allows analysts and engineers to use any compatible machine learning package (R, Python, Java, etc.) they are most comfortable with to build models. The same PMML file can be used with Pattern to perform large-scale distributed model evaluation in batch via Cascading.
Disadvantages
- PMML doesn’t support some types of models - PMML only supports relatively standard ML models; therefore, we can’t productionize bleeding-edge models or implement significant modifications to standard models.
- No native support for online learning - models cannot train themselves on-the-fly. A secondary mechanism needs to be in place to automatically account for new ground truth.
- Rudimentary error handling - PMML is difficult to debug. Editing the xml file by hand is a risky endeavor. This task is best left to the software packages, so long as they support the requisite features. JPMML is known to return relatively cryptic error messages when the PMML is invalid.
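For a sense of why hand-editing is risky, here is a minimal, hand-simplified sketch of what PMML markup looks like for a single decision tree (field names are hypothetical); a real exported random forest runs to many thousands of such lines:

```xml
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="toy fraud tree (illustrative only)"/>
  <DataDictionary numberOfFields="2">
    <DataField name="n_failed_logins" optype="continuous" dataType="double"/>
    <DataField name="is_fraud" optype="categorical" dataType="string">
      <Value value="0"/>
      <Value value="1"/>
    </DataField>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="is_fraud" usageType="target"/>
      <MiningField name="n_failed_logins"/>
    </MiningSchema>
    <Node score="0">
      <True/>
      <Node score="1">
        <SimplePredicate field="n_failed_logins" operator="greaterThan" value="3"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
```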
After considering all of these factors, we decided that Openscoring best satisfied our two-pronged goal of having a fast and robust, yet flexible machine learning framework.
Model Building Pipeline
[Figure: schematic of the model-building pipeline, from feature extraction through training, PMML export, upload to Openscoring, and testing]
A schematic of our model-building pipeline using PMML is illustrated above. It proceeds in six steps:
1) Derive features from the data stored on the site. Since the combination of features that gives the optimal signal is constantly changing, we store the features in a JSON format, which allows us to generalize the process of loading and transforming features based on their names and types.
2) Transform the raw features to improve signal, bucketing or binning values and replacing missing values with reasonable estimates. We also remove features that are shown to be statistically unimportant. We omit most of the details of these transformations for brevity, but it is important to recognize that they take a significant amount of time and care.
3) Train and cross-validate the model on the transformed features using our favorite PMML-compatible machine learning library.
4) Export the trained model as a PMML file.
5) Upload the PMML model to Openscoring.
6) Test the final model, and use it for decision-making if it becomes the best performer.
The model-training step can be performed in any language with a library that outputs PMML. One commonly used and well-supported library is the R PMML package. As illustrated below, generating a PMML file with R requires very little code.
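A minimal sketch of such a script, with hypothetical file and column names:

```r
# Train a random forest and export it as PMML. The pmml() generic from
# the R pmml package handles randomForest objects directly; saveXML()
# comes from the XML package it depends on.
library(randomForest)
library(pmml)

features <- read.csv("reservation_features.csv")   # hypothetical feature dump
model <- randomForest(as.factor(is_fraud) ~ ., data = features, ntree = 500)
saveXML(pmml(model), file = "reservation_creation.pmml")
```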
This R script has the advantage of simplicity, and a script similar to this is a great way to start building PMML files and to get a first model into production. In the long run, however, a setup like this has some disadvantages. First, our script requires that we perform feature transformation as a pre-processing step, so we have to add these transformation instructions to the PMML file by editing it afterwards. The R PMML package supports many PMML transformations and data manipulations, but it is far from universal. Second, we deploy the model as a separate step, post model-training, and so we have to manually test it for validity, which can be a time-consuming process. Yet another disadvantage of R is that its implementation of the PMML exporter is somewhat slow for a random forest model with many features and many trees. However, we’ve found that simply re-writing the export function in C++ decreases run time by a factor of 10,000, from a few days to a few seconds.

We can get around the drawbacks of R while maintaining its advantages by building a pipeline based on Python and scikit-learn. Scikit-learn is a Python package that supports many standard machine learning models and includes helpful utilities for validating models and performing feature transformations. We find that Python is a more natural language than R for ad-hoc data manipulation and feature extraction. We automate the process of feature extraction based on a set of rules encoded in the names and types of variables in the features JSON; thus, new features can be incorporated into the model pipeline with no changes to the existing code. Deployment and testing can also be performed automatically in Python by using its standard network libraries to interface with Openscoring. Standard model performance tests (precision-recall, ROC curves, etc.) are carried out using scikit-learn’s built-in capabilities. Scikit-learn does not support PMML export out of the box, so we have written an in-house exporter for particular scikit-learn classifiers. When the PMML file is uploaded to Openscoring, it is automatically tested for correspondence with the scikit-learn model it represents. Because feature transformation, model building, model validation, deployment, and testing are all carried out in a single script, a data scientist or engineer is able to quickly iterate on a model based on new features or more recent data, and then rapidly deploy the new model into production. A sketch of such a script appears below.
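A minimal sketch of a single script of this kind, assuming a stand-in dataset, a hypothetical Openscoring URL, and a PMML file produced by an exporter like the in-house one described above:

```python
import numpy as np
import requests
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in training data; in practice this comes from the features JSON.
X = np.random.rand(1000, 5)
y = (X[:, 0] > 0.8).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("CV average precision:",
      cross_val_score(clf, X, y, scoring="average_precision").mean())
clf.fit(X, y)

url = "http://openscoring.internal:8080/openscoring/model/reservation_creation"

# Deploy: upload the PMML file produced by the (unreleased) exporter.
with open("reservation_creation.pmml", "rb") as f:
    requests.put(url, data=f, headers={"Content-Type": "text/xml"},
                 timeout=5.0).raise_for_status()

# Test: the deployed model should reproduce the local model's score.
arguments = {f"x{i}": float(v) for i, v in enumerate(X[0])}
remote = requests.post(url, json={"id": "check-0", "arguments": arguments},
                       timeout=1.0).json()["result"]
# "probability_1" is a hypothetical output field name.
assert abs(remote["probability_1"] - clf.predict_proba(X[:1])[0, 1]) < 1e-6
```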
Takeaways: ground truth > features > algorithm choice
Although this blog post has focused mostly on our architecture and model building pipeline, the truth is that much of our time has been spent elsewhere. Our process was very successful for some models, but for others we encountered poor precision-recall. Initially we considered whether we were experiencing a bias or a variance problem, and tried using more data and more features. However, after finding no improvement, we started digging deeper into the data, and found that the problem was that our ground truth was not accurate.
Consider chargebacks as an example. A chargeback can be “Not As Described” (NAD) or “Fraud” (this is a simplification), and grouping both types of chargebacks together for a single model would be a bad idea, because legitimate users can file NAD chargebacks. This is an easy problem to resolve, and not one we actually had (agents categorize chargebacks as part of our workflow); however, there are other types of attacks where distinguishing legitimate activity from illegitimate activity is more subtle, and these cases necessitated the creation of new data stores and logging pipelines.
Most people who’ve worked in machine learning will find this obvious, but it’s worth re-stressing:
If your ground truth is inaccurate, you’ve already set an upper limit to how good your precision and recall can be. If your ground truth is grossly inaccurate, that upper limit is pretty low.
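A toy simulation makes this ceiling concrete: even an oracle that flags exactly the true fraud looks mediocre when measured against labels with 20% noise (all rates here are illustrative):

```python
# Measure an oracle detector against noisy labels: with 5% true fraud
# and 20% of labels flipped, even perfect predictions appear imperfect.
import random
random.seed(0)

N = 100_000
truth = [random.random() < 0.05 for _ in range(N)]               # real fraud
labels = [t if random.random() < 0.8 else not t for t in truth]  # 20% mislabeled

flagged = truth  # an oracle that flags exactly the true fraud
tp = sum(f and l for f, l in zip(flagged, labels))
fp = sum(f and not l for f, l in zip(flagged, labels))
fn = sum(not f and l for f, l in zip(flagged, labels))
print("apparent precision:", tp / (tp + fp))  # ~0.80
print("apparent recall:", tp / (tp + fn))     # ~0.17: noise swamps the rare class
```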
On that note: sometimes you don’t know what data you’re going to need until you’ve seen a new attack, especially if you haven’t worked in the risk space before, or have worked in it only in a different sector. The best advice we can offer in this case is to log everything. Throw it all in HDFS, whether you need it now or not. You can always use this data later to backfill new data stores if it proves useful, and that can be invaluable when responding to a new attack vector.
Future Outlook
Although our current ML pipeline uses scikit-learn and Openscoring, our system is constantly evolving. Our current setup is a function of the stage of the company and the amount of resources, in terms of both personnel and data, that are currently available. Smaller companies may have only a few ML models in production and a small number of analysts, and can take time to manually curate data and train models in many non-standardized steps. Larger companies might have many, many models, require a high degree of automation, and get a sizable boost from online training. A unique challenge of working at a hyper-growth company is that the landscape fundamentally changes year over year, and pipelines need to adjust to account for this.
As our data and logging pipelines improve, investing in improved learning algorithms will become more worthwhile, and we will likely shift to testing new algorithms, incorporating online learning, and expanding our model-building framework to support larger data sets. Additionally, some of the most important opportunities to improve our models are based on insights into our unique data, feature selection, and other aspects of our risk systems that we are not able to share publicly. We would like to acknowledge the other engineers and analysts who have contributed to these critical aspects of this project. We work in a dynamic, highly collaborative environment, and this project is an example of how engineers and data scientists at Airbnb work together to arrive at a solution that meets a diverse set of needs. If you’re interested in learning more, contact us about our data science and engineering teams!
*Illustration above by Shyam Sundar Srinivasan

Comments
Hey guys, props on your first non-search model scoring pipeline. You mentioned your fraud vectors are changing all the time. Have you considered ensembling an unsupervised approach with your supervised technique? Fraud targets tend to be sparse and hard to recognize, so many companies build systems that describe the ‘typical’ transaction with hundreds or thousands of automatically generated rules, and then flag for review if a statistically significant number of them are violated. Hope it helps! -RG
There are certainly vectors for which an unsupervised approach works. As we mentioned above, anomaly detection is an important part of our system. However, we haven’t tried using an ensemble of supervised and unsupervised methods for a specific type of attack, and I’d be interested in hearing about specific examples for which an ensemble like this has worked well. Taking chargeback fraud as an example, an unsupervised approach would probably find many different groups (frequent travelers, consultants traveling during the week, active hosts, etc.), and it’s not clear to me that one group would have more bad actors than the others, versus having a few bad actors in each group. Feel free to contact me directly (my first name @airbnb.com).
Please open source your PMML exporter for sk-learn and give something back to the community.
There are tentative plans to do so, but there is always a balance between contributing new projects, maintaining our existing open source projects (see http://nerds.airbnb.com/open-source/) and addressing pressing concerns within the company. One complicating factor is that this particular piece of the pipeline is owned by the analytics team, which is relatively lean (around 15 members for the entire company, and only a few dedicated to risk). We understand that this code could be generally useful and will do our best to prioritize releasing it.
Like many others, I’d love to use your PMML exporter for sk-learn. Do let us know when you may release it. Great doc – thanks!
Great article! We would love to gain access to your PMML exporter as well – we are currently building the prototype for our analytics pipeline and having the ability to generate models in python and export them as PMML for real time scoring would be a huge win. Here’s to hoping it makes it on the priority list!
Great post. Very instructive!
Totally agree with the guys asking for open sourcing the PMML exporter.
I think there is a great need for such library. This will take scikit-learn and machine learning in Python to the next level.
Thank you so much for looking into this! The PMML export out of scikit-learn would be a great contribution to the Python/ML community!!
Thank you Naseem and Aaron for this excellent overview. Also, greetings to Chris, Andy, Lukasz and others who contributed greatly to the design and development of Openscoring in the summer of 2013. Your support and encouragement has been invaluable to me.
First, I would like to correct a statement about JPMML and Openscoring licensing terms. This article covers the original 1.0.X codebase, which is released under the BSD 3-Clause License. Starting from March 2014, all my efforts have been pointed towards the new 1.1.X codebase, which was re-licensed under the GNU Affero General Public License (AGPL), version 3.0. So, all your statements about software freedom and openness still hold, but there may be some restrictions on proprietary use.
Second, I’m intrigued by step #5 in your Model Building Pipeline. Have you investigated and generalized the most common causes of test failures? Are they “soft failures,” where Openscoring and scikit-learn make different predictions, or “hard failures,” where Openscoring is simply unable to make a prediction (i.e., throws an exception)? The former can be explained by subtle differences between the PMML specification and selected machine learning software implementations. For example, when there is a tie between several scores, the PMML specification says to return the first score as the winner, whereas other software (e.g., R) may decide to return the last score as the winner (or select it randomly). In the latter case, could you please share those PMML files with me?
Finally, what are your plans for releasing your internal PMML tools? I know there are very many people longing for a faster export of RandomForest models.
Villu, thank you for your comments and for contributing this excellent piece of software. Regarding steps #5 and #6 (the upload and checking steps), we find it useful to perform sanity checks to ensure that the analyst has deployed the model he or she intended to. This step was particularly relevant when we were developing the PMML exporter and performing complicated transformations to the data. Now that we have streamlined our process, we rarely encounter failures. Openscoring has, to my knowledge, always performed as intended, so long as it received the proper input.
Also, I have edited the sentence about licensing and linked to the current openscoring page.
Great post. I work on risk ML at Facebook. I’m wondering what the performance difference was for random forest vs. logistic regression or GBT with your data set. Thanks!
Great question. Our current features, settings, and training sets are optimized for random forests, so simply running a different model through our pipeline is not necessarily a fair comparison at this point. In general, we find that for our setup random forests perform the best, GLMs perform the worst and GBTs are in between. If we fix the recall to a desired value, the precision can vary by up to a factor of two between random forests and GLMs. (Shout-out to Alok Gupta for calculating the most current statistics.)
Oh interesting. Why did you decide to fix the recall instead of the false positive rate?
Naseem, Aaron,
Have you considered using an advanced ML service like BigML? I believe that you would significantly reduce all the complexity and brittleness of the steps and scripts above.
Besides providing a RESTful API and bindings for several languages, one can use BigML with an advanced client-side, command-line interface called bigmler. For example:
You could create an RF in just one line (http://bigmler.readthedocs.org/):
bigmler --train data.csv --objective is_fraud --number_of_models 20 --randomize
You can also do feature selection and cross validation in one line:
bigmler analyze --dataset dataset/5357eb2637203f1668000004 --features
BigML also provides a dataset transformation language called Flatline (https://github.com/bigmlcom/flatline). With it, it’s possible to filter rows according to sophisticated criteria and to add new features as combination of other columns or rows.
BigML also makes dealing with missing values very simple at both modeling and prediction time, and you can always clean up your dataset upfront with Flatline.
If you’re not in a programming mood, you can do everything from a beautiful web interface. You can get interactive models that can help you better understand your data.
Every ML resource BigML generates (source, dataset, model, evaluation, etc) is available, via a REST API or the web interface, as a transparent, easily manipulable JSON document (a lightweight version of PMML we call JSON PML). This allows you to implement more sophisticated ML strategies using any of the BigML’s bindings.
All your models, evaluations, etc will have an id and be stored in your dashboard for further use or analysis. You can easily retrieve them by tag, date, etc
You can get lightning-fast predictions using BigML PredictServer (https://bigml.com/predictserver) or just using our Node.js bindings, which can cache your models/ensembles and serve thousands of predictions per second.
If you’re concerned about data privacy, you can run BigML in your own private cloud or we can deploy a private version in your servers as well.
As you point out, a significant portion of your time goes into data transformations and feature engineering. So, why spend time reinventing an architecture when BigML can give you one for free, or at a very affordable price? BigML can save you some valuable time that you can later spend on elaborating better modeling and prediction strategies, and on measuring the impact of your ML solution; after all, that’s what makes the difference.
Random Forest should be excellent for matching and recommendation algorithms as well (“I am looking for a place like this…”)
Very interesting, and a great solution! My main question is why you built all this from scratch when there are so many ML API services. Did you try them out and find them lacking? What you’ve done is pretty cool, but it reminds me of the early 2000s, when folks would have home-grown ETL solutions instead of a proper ETL tool. Eventually those were a nightmare for the business to maintain. I’m asking because I have a lot of clients in a very similar position and would love to hear your take.
I think that each company will have a different cost-benefit analysis to consider based on its goals, resources, and stage of growth. For example, it wouldn’t make any sense for Google to use an ML API service: the scale and complexity at which they operate mean that it is more cost effective, performant, and reliable for them to build this in-house. On the other end of the scale, a two-person bootstrapped startup probably shouldn’t be writing its own ML algorithms (unless that is explicitly the company’s core competency).
At Airbnb, machine learning is critical enough to our business, and we foresee operating at such a scale, that it is more cost effective and performant to build what we’ve described in-house. This is just what we think makes sense for us at this point in time – we had a different system last year, and this system will likely evolve into something different in the coming years.
Hi Naseem & Aaron,
You mentioned a requirement for your scoring engine was:
Since fraud vectors constantly morph, new models and features must be tested and pushed into production quickly.
I was wondering how you deal with changes to features? Does that require a programmer to change your fraud prediction service to pull in any new data that is required or can your statisticians change this in some other way?
Cheers,
Dean
Hi guys, thank you very much for sharing. Nice to see a scoring system implemented this way for risk.
Hey guys, this is a fantastic article. Cool!
For “models cannot train themselves on-the-fly,” could you give me some hints or reading materials on how to enable training on-the-fly? Thanks a lot!