Airbnb Engineering (http://nerds.airbnb.com)

Life of a ticket
http://nerds.airbnb.com/life-of-a-ticket/
Tue, 25 Aug 2015 16:41:56 +0000

The post Life of a ticket appeared first on Airbnb Engineering.

At the end of 2012, a cumulative total of over 4 million guests had stayed on Airbnb.

This was huge! It took us nearly four years to reach our first million total guests; now over 2 million guests stay on Airbnb every month. With this growth comes the challenge of scaling marketplace operations, which is what my team works to improve.

I work on a team called Internal Products, which comprises two areas: Product Excellence and Service Excellence. We’re tasked with building tools that make it easier for our guests and hosts to help themselves and that increase the productivity of our Customer Experience specialists.

As Yishan Wong says, “your operating efficiency … [is] directly impacted by the ingenuity of your internal products.” At Airbnb, we want to offer the best customer service in the world, so we take this very seriously and have an entire team dedicated to it.

Our team gets to spend its time directly with its customers, the Customer Experience (CX) team, understanding their needs and fixing problems. This tight feedback loop with our CX specialists lets us build tools that are tailored to their needs and improve them by iterating quickly.

One of the clear issues our team uncovered was how to speed up the processing of a customer support ticket. The best way to share our solution is to walk through our approach to the problem by examining the lifespan of a ticket.

There are 3 channels to get in touch with a specialist at Airbnb:

• Voice: by calling our support number. In this scenario, when a person calls our support line, we will gather what we know about them based on their phone number. This is where we assign them to a specific agent based on their current state (if they have a current reservation, if they’re a host, a guest, etc.).
• Email: in this scenario, a person will have to contact us via our contact page. They will be asked to select their reservation first (if relevant) and what they need help with (their issue), after which we magically display some troubleshooting tips, based on their current state as well. If this does not answer their question, they will have the opportunity to send us an email.
• Chat: similar to the email flow above, after selecting their issue, they will have the option to start a chat with one of our CX specialists or with a community member.

After calling, starting a chat or sending the email, a ticket will be created with us. This ticket eventually gets assigned to an Airbnb CX specialist, where the guest/host and the specialist correspond. When the issue is resolved, the specialist solves the ticket, and the case is closed.

Our team was put in charge of improving two areas while working on the ticket flow:

• whenever we magically display the troubleshooting tips with the Contact Page Editor
• whenever a ticket is processed with the ticket backend (Admin Console)

### The Contact Page Editor

Let’s rewind to 2012 for a moment. At the time, when a person needed to contact us, they would simply select an issue on the contact page, fill out the form and submit. We aim to be accessible, but we want to make sure we prioritize the most urgent requests. In order to do this we needed to give certain people and issues priority, and reduce the priority of issues that could be handled by people themselves. We needed an application that would enable our content team to make the necessary changes, rather than needing engineers every time we had a product update or workflow change.

We set out to turn our contact page into a troubleshooting wizard: something that could provide custom content depending on the issue you choose, and add robust tagging and routing in the event a ticket needed to be created.

For example, if someone contacts us about how to upload their profile picture, we could display some FAQs on how to upload their profile picture, provide a link to the upload page and if this still didn’t help, give them the option to contact Airbnb.

And you could go even deeper by displaying content based on the person’s platform, whether they are a host or a guest, whether they have a reservation within the next 3 days, etc.

My team and I built the tools our content team needs to personalize each issue, so that we ask the right question, display the right help content, and put you in touch with a CX specialist. We call it the Contact Page Editor and it looks like this:

The left section of this screenshot shows what the Contact Page Editor looks like (admin side) whereas the right section shows the same content but on the contact page.

The backend relies on a simple decision tree that contains different types of nodes. There are three types of nodes:

• Content nodes: they will only display content like a list of FAQs, a text or a linked button.
• Contextual nodes: they will display content based on the person’s state. For example if a guest/host has a reservation starting within 3 days, if they use an iPhone or Android, etc.
• Action nodes: they contain other nodes and they will display content based on the person’s interactions like a select option dropdown.
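
To make the three node types concrete, here is a minimal sketch of how such a tree could be represented and walked. The node shapes, the field names (`type`, `test`, `options`, `items`), the `daysUntilCheckIn` state field and the `resolve` function are all assumptions made for illustration; the post does not describe the actual schema.

```javascript
// Hypothetical node shapes, not Airbnb's actual schema.
// - content:    leaf that only displays something (FAQ list, text, linked button)
// - contextual: picks a branch from the person's state
// - action:     picks a branch from the person's interaction (e.g. a dropdown choice)
const tree = {
  type: 'action',
  prompt: 'What do you need help with?',
  options: {
    'profile-photo': {
      type: 'content',
      items: ['FAQ: How do I upload a profile photo?', 'Button: Go to upload page'],
    },
    'reservation': {
      type: 'contextual',
      // Assumed state field: days until the person's next check-in.
      test: (state) => state.daysUntilCheckIn <= 3,
      ifTrue: { type: 'content', items: ['Call us: your trip starts soon'] },
      ifFalse: { type: 'content', items: ['FAQ: Changing a reservation'] },
    },
  },
};

// Walk the tree and collect the content to display.
// `state` describes the person; `choices` lists the options they selected, in order.
function resolve(node, state, choices) {
  const remaining = choices.slice();
  let current = node;
  while (current && current.type !== 'content') {
    if (current.type === 'contextual') {
      current = current.test(state) ? current.ifTrue : current.ifFalse;
    } else if (current.type === 'action') {
      const choice = remaining.shift();
      // No selection yet (or an unknown one): stop and show the prompt.
      if (choice === undefined || !(choice in current.options)) {
        return [current.prompt];
      }
      current = current.options[choice];
    } else {
      throw new Error('Unknown node type: ' + current.type);
    }
  }
  return current ? current.items : [];
}

console.log(resolve(tree, { daysUntilCheckIn: 2 }, ['reservation']));
```

Keeping the tree as plain data is what would let a content team edit it without an engineer: only `resolve` is code, and it does not change when an issue or a piece of help content is added.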

By asking the right kind of question to our guest/host, each and every issue becomes more personalized, which helps speed up person/CX specialist interaction since there would be less back and forth.

By building a flow of custom and more personalized content on the contact page, we were able to help 15% more people in 2013.

### The Admin Console

The admin console is our Customer Experience tool, tailored to our Customer Experience specialists’ needs. We observed and analyzed the work done by our specialists to design a tool that not only automates tasks a computer can do faster, but also surfaces information only when it’s needed.

The admin console is a Rails application with a very heavy front-end. We started the project with 2 engineers. The goal here was to build a tool to boost productivity in solving tickets coming from the contact page for our Customer Experience team.

I was in charge of the front-end, and at the time we were only using Backbone. Because the admin console is very dynamic and its content changes constantly, I decided to use knockout.js for data-binding. This let us update the DOM through bindings instead of manipulating it directly with a library like jQuery, which removed extra complexity by keeping a single source of truth in the code. The code snippets below show an example of how a typical Backbone/Knockout component was built.


<!-- template.hbs -->
<div id='myComponent'>
  <span data-bind='text: name'></span>
  <ul>
    <!-- ko foreach: listings -->
    <li data-bind='text: name'></li>
    <!-- /ko -->
  </ul>
</div>


// view.js
var Backbone = require('backbone');
var ko = require('knockout');
var _ = require('underscore');

var data = {
  name: 'John Doe',
  listings: [{
    name: 'Quiet room in Duboce Triangle'
  }]
};

// The Backbone model remains the source of truth for the data.
var userModel = new Backbone.Model(data);

// The Knockout view model mirrors the model's attributes as observables.
var userViewModel = {
  name: ko.observable(userModel.get('name')),
  listings: ko.observableArray(userModel.get('listings'))
};

// When the model changes, push the new values into the matching observables.
userModel.on('change', function() {
  _.each(userViewModel, function(value, key) {
    if (userModel.has(key)) {
      userViewModel[key](userModel.get(key));
    }
  });
}, this);

// Bind the view model to the template's root element.
ko.applyBindings(userViewModel, document.getElementById('myComponent'));


Over time, we added more functionality to the admin console and it didn’t scale very well. The pages got slower and the code harder to read. It was time to move away from Backbone/Knockout and look for alternatives … and React came along. The application became much faster, more responsive and cleaner thanks to React’s virtual DOM and the combination of the template and JavaScript in a single file. As for the engineers, they have a more enjoyable developer experience, write less buggy code, get a cleaner architecture (e.g. less code duplication, better readability and modularity) and can work faster because React makes you rethink how you write your code.

The code below shows the same Backbone/Knockout component from above written in ES6 syntax and built with React and flux.


// Modules
import React from 'react';

// Data
const data = {
  name: "John Doe",
  listings: [{
    id: 1,
    name: "Quiet Room in Duboce Triangle"
  }]
};

// Components
class ListingView extends React.Component {
  constructor(props) {
    super(props);
    // Keep the data in component state so render() always reflects it.
    this.state = data;
  }

  render() {
    return (
      <div>
        <span>{this.state.name}</span>
        <ul>
          {this.state.listings.map((listing) => {
            // A stable key lets React match list items between renders.
            return <li key={listing.id}>{listing.name}</li>;
          })}
        </ul>
      </div>
    );
  }
}

export default ListingView;

Our team was the first one to experiment with React. We used it in the admin console and in the Resolution Center. This successful experiment of React on the Resolution Center gave our front-end team the confidence to make React part of the front-end stack at Airbnb. We still have some Backbone code for routing, models, and collections in older apps, but we are moving away from it. New applications are being written with Alt for unidirectional data flow and react-router for routing.

Our team is a great playground for any engineer who wants to experiment with new technologies, since we are building products for our friends and fellow employees. Due to the interplay of teams, our roadmap can be more flexible than those of other groups. One weekend in 2012, I decided to build an internal chat called Tin Can. I built it with node.js, websockets and redis, all new technologies that I had never used. I presented it to the team the following Monday and shipped it a week later after feedback. Tin Can is now used by all Customer Experience agents and processes 70k messages a day, improving our specialists’ productivity every day. We have since rewritten the code to scale the product, but this is a great example of how quickly an idea can become a product on our team.

These are only two of the many projects my team has been working on to improve our customers’ experience. We help people get in touch with an Airbnb specialist when needed, but we also improve our specialists’ ability to help our guests and hosts by building tools that are faster, more thoughtful, and more intuitive. All of which is done with fun and colored pants in the pony lounge.

Developer Infrastructure with Airbnb and Bundler
http://nerds.airbnb.com/developer-infrastructure-with-airbnb-and-bundler/
Thu, 20 Aug 2015 21:59:47 +0000

The post Developer Infrastructure with Airbnb and Bundler appeared first on Airbnb Engineering.


### How does Bundler work, anyway?

Bundler has turned out to be a super-useful tool for installing and managing dependencies, but many Rubyists don’t really have a handle on why it exists or how, exactly, it works. This talk aims to explain the huge mess that existed before Bundler, and then talk about the solutions to those problems that Bundler provides. The talk won’t spend a lot of time on code slides, but will include a fairly detailed explanation of Ruby’s require system, Rubygems, gem dependencies, dependency graph resolution, and how Bundler interacts with them. At the end, the talk will cover the Bundler “superpowers”, allowing rapid development and deployment in ways that were highly impractical before Bundler existed.

André Arko
André thinks Ruby is pretty neat. He leads the Bundler team, co-authored the third edition of The Ruby Way, and runs Ruby Together, the Ruby trade association. At his day job, he provides expert development, architecture, and teaching through Cloud City Development in San Francisco.

### Automate the Boring Parts: Web Apps to Ship Web Apps

Delays, bureaucracy, angry release engineers: turning working software into production systems can be a drag. What if you could write software to automate away the pain, and have intelligent systems ship your code for you when it’s ready? Learn how Airbnb replaced manual release processes with code to make shipping smooth, fast, and simple.

Matt Baker
Matt is currently a software engineer at Airbnb on the Developer Infrastructure team, and was previously a game developer and computer graphics researcher. His work spans high-performance distributed systems and low-level rendering techniques. Having grown up across six states and two countries, he enjoys reading, ‘riting, ‘rithmetic, and road trips.

Igor Serebryany

Airbnb Price Tips Powered By Aerosolve Machine Learning
http://nerds.airbnb.com/airbnb-machine-learning/
Wed, 19 Aug 2015 20:11:58 +0000

The post Airbnb Price Tips Powered By Aerosolve Machine Learning appeared first on Airbnb Engineering.


Matching buyers with sellers depends a lot on price. Price too high and buyers are turned off. Price too low and the seller loses out.

To increase the odds of matching buyers and sellers, Airbnb’s team of data scientists and engineers have worked hard to develop pricing technology, said Bar Ifrach, Airbnb data scientist, during a recent OpenAir 2015 talk.

“We’re trying to empower our hosts with tools to price their listings and get bookings seamlessly and effectively, so we have more hosts and stays on Airbnb and more matches on the platform,” Ifrach said.

Airbnb displays a host’s calendar. Days shown in white are available to book. Gray indicates days that are already booked, are in the past, or that the host doesn’t want to book.

At the bottom of days in white, the host sees a color bar, which indicates how likely the host is to get a booking for that day at a given price. Green indicates a high likelihood. Yellow means there’s a medium chance of booking at the current price; red suggests a low probability of booking.

The technology behind this is divided into two parts, Ifrach said: Modeling and Aerosolve.

With modeling, Airbnb is “trying to predict for every day of year, for every listing, what will be the likelihood of getting a booking for any possible price,” Ifrach explained. “Then we can find a price that works best.” Airbnb modeling is doing this “on a huge, global scale” by looking at millions of derived features and over 5 billion training points, he said.
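
As a rough illustration of "predict the likelihood for any price, then find a price that works best", the sketch below scans candidate prices against a booking-probability curve. Everything specific in it is an assumption: the logistic curve stands in for the real model, and "best" is taken to mean the highest expected revenue (price times booking probability), which the talk does not confirm.

```javascript
// Stand-in for the model: probability of a booking at a given price.
// The logistic shape and its parameters are invented for this example.
function bookingProbability(price, referencePrice, sensitivity) {
  return 1 / (1 + Math.exp(sensitivity * (price - referencePrice)));
}

// Assumed objective: maximise expected revenue = price * P(booking | price).
function bestPrice(candidatePrices, probabilityAt) {
  let best = null;
  for (const price of candidatePrices) {
    const probability = probabilityAt(price);
    const expectedRevenue = price * probability;
    if (best === null || expectedRevenue > best.expectedRevenue) {
      best = { price, probability, expectedRevenue };
    }
  }
  return best;
}

// Candidate nightly prices from 50 to 300 in steps of 10.
const candidates = [];
for (let p = 50; p <= 300; p += 10) candidates.push(p);

const tip = bestPrice(candidates, (p) => bookingProbability(p, 150, 0.05));
console.log(tip);
```

In this sketch the same `probability` value could also drive the calendar's color bar, with thresholds chosen for green, yellow and red.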

Three main considerations go into the pricing model:

• Demand, or the impact of seasonality or special events (such as Austin’s SXSW conference) on an area. Airbnb’s model translates demand features into pricing predictions.
• A listing’s location, such as the market, neighborhood, or street block. Example: San Francisco has many distinct neighborhoods that appeal to different crowds, and “we need to account for that” in pricing, said Ifrach. This is accomplished through grids and k-d trees.
• A listing’s type and quality. Airbnb’s pricing model is “unique,” Ifrach said, because it incorporates such factors as a property’s size and specific qualities.

As an example, he compared a houseboat situated across from the Eiffel Tower in Paris and a private room with a view of the tower. To determine pricing for each, Airbnb’s model takes into account the differences in the two listing types (houseboat vs. a room), amenities (such as air conditioning), and other qualities. To quantify a listing’s quality, Airbnb’s modeling uses Dirac and Cubic splines to capture the effect of guest reviews.

The second part of Airbnb’s dynamic pricing technology is Aerosolve, “machine learning for humans,” said Airbnb engineer Hector Yee, who took the stage after Ifrach.

Airbnb introduced Aerosolve at OpenAir 2015 as an open-source machine learning library available on GitHub. Aerosolve is designed to interpret complex data sets so that humans can easily understand them. While Airbnb’s modeling suggests pricing to hosts, Aerosolve’s goal is to add context to the pricing recommendation so the host understands it. “You have to understand why the model makes its decisions. Maybe the host’s pricing is too high because they have no reviews or because it’s low season,” Yee explained.

The blog post “Aerosolve: Machine learning for humans” offers details about how Aerosolve works and how developers can use it.

Anomaly Detection for Airbnb’s Payment Platform
http://nerds.airbnb.com/anomaly-detection/
Tue, 18 Aug 2015 20:19:57 +0000

The post Anomaly Detection for Airbnb’s Payment Platform appeared first on Airbnb Engineering.

With hosts and guests around the globe, Airbnb aspires to provide a frictionless payments experience where our guests can pay in their local currency via a familiar payment method, and our hosts can receive money via convenient means in their preferred currency. For example, in Brazil the currency is the Brazilian Real, and people are familiar with Boleto as a payment method. These are quite different from what we use in the US, and the same problem repeats across the 190 countries Airbnb serves.

In order to achieve this, our Payments team has built a world-class payments platform that is secure and easy to use. The team’s responsibilities include support of guest payments and host payouts, new payment experiences like gift cards, and assisting in financial reconciliation, to name a few.

Because Airbnb operates in 190 countries, we support a great number of currencies and processors. Most of the time, our system functions without incident, but we do encounter some hiccups where a certain currency cannot be processed or a certain payment gateway is inaccessible. In order to catch these interruptions as quickly as possible, the Data Science team has built an anomaly detection system that can identify problems in real time as they develop. This helps the product team detect issues and opportunities quickly while freeing up data scientists’ time to work on A/B testing (new payment methods, new product launches), statistical analysis (impact of price on bookings, forecasting), and building machine learning models to personalize user experience.

To give you a look at the anomaly detection tool we built to detect outliers in our payments dataset, in this post I will use a few different sets of mock data to demonstrate how the model works. I will pretend to run an ecommerce shop in the summer of 2020 that sells three items for hackers: Monitors, Keyboards, and Mouses. I also have two suppliers: Lima and Hackberry.

### Motivation

The primary objective for the anomaly detection system is to find outliers in our time series dataset. Sometimes, a high level outlook would be sufficient, but most of the time we need to cut the data to decipher underlying trends. Let’s consider the case below where we are monitoring Monitors’ imports.

The overall numbers for Monitors look quite normal. We then take a look at the imports of Monitors by our two different suppliers: Lima and Hackberry.

Here we can see that Lima, our major supplier for Monitors, was not delivering the expected amount for around 3 days starting on the 18th of August, 2020. We automatically fell back on our secondary supplier, Hackberry, during this time. If we had only looked at the high-level data, we wouldn’t have detected the issue, but looking a few levels deeper gives us the information to target the right problem.

### The model

Simple regression model

An intuitive model is to run a simple Ordinary Least Squares regression with dummy variables indicating the day of the week. The model takes the following form:

$$y = at + b + \sum_{i=1}^{7} a_i*{I_\textrm{day}}_i + e$$

where y is the amount we want to track, t is the time variable, $${I_\textrm{day}}_i$$ is the indicator variable that denotes if today is the $$i^\textrm{th}$$ day of the week, and e is the error term. This model is pretty simple, and it generally does a good job identifying the trend. However, there are several drawbacks:

• The growth predictor is linear. If we experience exponential growth, it’s not good at modeling the trend.
• There is a strong assumption that the time series only shows weekly seasonality. It cannot deal with products with other seasonality patterns.
• Too many dummy variables require a bigger sample size for the coefficients to achieve the desirable significance level.
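
One practical detail when fitting this form: exactly one of the seven indicators is 1 on any given day, so together with the intercept b they are perfectly collinear and the coefficients are not uniquely determined. A common remedy is to drop one indicator and treat that day as the baseline. Which convention the original model used is not stated, so the six-indicator form below is an assumption:

$$\sum_{i=1}^{7} {I_\textrm{day}}_i = 1 \quad\Longrightarrow\quad y = at + b + \sum_{i=1}^{6} a_i*{I_\textrm{day}}_i + e$$

Each remaining $$a_i$$ then measures how the $$i^\textrm{th}$$ day differs from the omitted day, and b absorbs the baseline level.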

Even though we can observe the pattern of the metrics we want to track, and manually change the form of the model (i.e. we can add additional dummy variables when observing a strong monthly or yearly seasonality), the process is not scalable. An automated way to identify seasonality helps us to avoid our own bias and enables us to use this technique in data sets beyond payments.

Fast Fourier Transform model

When building a model about a time series with both trend and seasonality, it’s a common practice to build a model that takes the following form:

$$Y = S + T + e$$

where Y is the metric; S represents the seasonality; T represents the trend; e is the error term. For example, in our simple regression model, S is represented by the summation of the indicator functions, and T is represented by at+b.

In this section, we will develop new methods to detect trend and seasonality, incorporating the knowledge we gained from the previous section. We will use the sales of the two imaginary products, Keyboards and Mouses, to demonstrate how the model works. The two products’ sales values are depicted in the following graph:

As shown above, Keyboards is the major product, launched in September 2016, and Mouses was introduced in August 2017. We will model the seasonalities and trends, and try to find anomalies where the error terms are too far from the average.

### Seasonality

To detect the seasonality, we will use the Fast Fourier Transform (FFT). In the simple linear regression model, we assumed a weekly seasonality. As we can see above, there is no strong weekly pattern in Mouses, so blindly assuming such a pattern can hurt the model by adding unnecessary dummy variables. In general, FFT is a good tool for detecting seasonal patterns if we have a good amount of historical data. After applying FFT to both time series, we get the following graph:

where season_day is the period for the cosine wave of the transformation. In FFT, we usually only select periods with peak amplitudes to represent the seasonality and treat all other periods as noise. In this case, for Keyboards, we see two big peaks at 7 and 3.5 days and another two smaller peaks at 45 and 60 days. For Mouses, we see a significant peak at 7 days, and some smaller peaks at 35, 60, and 80 days. The seasonalities of Keyboards and Mouses generated from the FFT are represented in the following graph:

As we can see, the amplitude of the seasonality for Keyboards grows with time, and it contains a major weekly seasonality, whereas the amplitude of Mouses shows a significant seasonality both in weekly trend and in a 40 day period.
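
The peak-picking step can be sketched in a few lines. To stay dependency-free, this example uses a direct discrete Fourier transform, which is O(n²), rather than an actual FFT; both produce the same amplitudes. The function names, the "keep the N largest amplitudes" rule and the mock series are assumptions for illustration, not the production code.

```javascript
// Amplitude spectrum via a direct discrete Fourier transform.
// An FFT computes the same values faster; this version favours readability.
function amplitudeSpectrum(series) {
  const n = series.length;
  const mean = series.reduce((sum, v) => sum + v, 0) / n;
  const spectrum = [];
  // k counts full cycles over the whole series, so the period is n / k samples.
  for (let k = 1; k <= Math.floor(n / 2); k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (2 * Math.PI * k * t) / n;
      re += (series[t] - mean) * Math.cos(angle);
      im -= (series[t] - mean) * Math.sin(angle);
    }
    spectrum.push({ period: n / k, amplitude: (2 * Math.sqrt(re * re + im * im)) / n });
  }
  return spectrum;
}

// Keep the periods with the largest amplitudes; everything else is treated as noise.
function topPeriods(series, count) {
  return amplitudeSpectrum(series)
    .sort((a, b) => b.amplitude - a.amplitude)
    .slice(0, count)
    .map((entry) => entry.period);
}

// Mock daily data: 20 weeks with a weekly cycle plus a weaker 3.5-day cycle.
const days = 140;
const sales = [];
for (let t = 0; t < days; t++) {
  sales.push(100 + 10 * Math.cos((2 * Math.PI * t) / 7) + 4 * Math.cos((2 * Math.PI * t) / 3.5));
}

console.log(topPeriods(sales, 2));
```

On real data the peaks are less clean than in this mock series, so a fixed count is a simplification; a threshold relative to the median amplitude is one alternative.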

### Trend

Here we will use the rolling median as a trend of a time series. The assumption here is that the growth is not very significant in a very short period of time. For example, for a certain day, we will use the previous 7-day rolling median as the trend level for that day. The advantage of using the median instead of the mean is having better stability in case of outliers. For example, if we have a sudden increase of 10x value for a day or two, that will not affect the trend if we look at median. However, it will affect our trend if we use mean. In this demonstration, we use 14 day median as our trend, shown in the graph below:
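
A small sketch of the rolling-median trend, compared with a rolling mean on a series that contains a single spike. The helper names and the 3-day window are chosen for the example only; the post itself uses a 14-day window.

```javascript
function median(values) {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Trend for day t = summary of the `window` days before it (day t itself excluded).
// Early days use whatever history exists; day 0 has none and gets null.
function rolling(series, window, summarize) {
  return series.map((_, t) => {
    const history = series.slice(Math.max(0, t - window), t);
    return history.length > 0 ? summarize(history) : null;
  });
}

// Flat series with one 10x spike on day 4.
const sales = [10, 10, 10, 10, 100, 10, 10, 10];
const medianTrend = rolling(sales, 3, median);
const meanTrend = rolling(sales, 3, mean);

console.log(medianTrend); // the spike never moves the median
console.log(meanTrend);   // the mean is pulled up while the spike is in the window
```

With a 3-day window, a single outlier is always outvoted by the two normal days around it, which is why the median trend stays flat while the mean trend jumps.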

### Error

After getting the seasonality and trend, we are going to evaluate the error term. We will use the error terms to determine if we have an anomaly in the time series dataset. When we combine the trend and the seasonality together, and subtract it from the original sales data, we get the error term. We plot the error term in the following graph:

As we can see, we have some spikes in the error term which represent anomalies in the time series data. Depending on how many false positives we can tolerate, we can choose how many standard deviations away from 0 we allow. Here we will use 4 standard deviations to get a reasonable amount of alerts.

As we see above, the alert system does a good job identifying most of the spikes in the error term, which correspond to anomalies in the data. Note that some of the anomalies we have detected do not look abnormal to human eyes, but they are real anomalies once the seasonality pattern is taken into account.

Overall, based on our internal tests, this model performs well in identifying the anomalies while making minimal assumptions about the data.
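
The residual check described in this section can be sketched as follows. The function names and the mock series are hypothetical; the trend and seasonality arrays are assumed to come from earlier steps such as a rolling median and an FFT fit.

```javascript
// error[t] = observed[t] - (trend[t] + seasonality[t])
function residuals(observed, trend, seasonality) {
  return observed.map((y, t) => y - (trend[t] + seasonality[t]));
}

// Flag indices whose error is more than `k` standard deviations away from 0.
// The standard deviation is measured around the mean of the errors.
function findAnomalies(errors, k) {
  const n = errors.length;
  const mean = errors.reduce((sum, e) => sum + e, 0) / n;
  const variance = errors.reduce((sum, e) => sum + (e - mean) ** 2, 0) / n;
  const limit = k * Math.sqrt(variance);
  const flagged = [];
  errors.forEach((e, t) => {
    if (Math.abs(e) > limit) flagged.push(t);
  });
  return flagged;
}

// Mock data: flat trend, weekly seasonality, small alternating noise, one spike on day 20.
const trend = new Array(30).fill(100);
const seasonality = trend.map((_, t) => 10 * Math.cos((2 * Math.PI * t) / 7));
const observed = trend.map((level, t) => level + seasonality[t] + (t % 2 === 0 ? 1 : -1));
observed[20] += 50;

const errors = residuals(observed, trend, seasonality);
console.log(findAnomalies(errors, 4));
```

Note that a large spike inflates the standard deviation it is measured against, so short series need a lower `k` or a more robust spread estimate before a spike can stand out.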

### Conclusion

Hopefully this blog post provides some insights on how to build a model to detect anomalies. Most anomaly detection models involve modeling seasonality and trend. One key element of modeling is to make as few assumptions as possible. This will make the model general enough to suit a lot more situations. However, if some assumptions greatly simplify your modeling process, don’t shy away from adopting those assumptions.

Building Periscope for Android
http://nerds.airbnb.com/building-periscope-for-android/
Thu, 13 Aug 2015 18:56:16 +0000

The post Building Periscope for Android appeared first on Airbnb Engineering.


Yesterday, Periscope announced that it had reached 10 million accounts. At OpenAir, Sara Haider detailed the multiple challenges she and her colleagues faced in developing an Android version of the live video streaming app.

There are two video livestreaming protocols, Haider explained: Real-Time Messaging Protocol (RTMP), which is suitable for low latency and real-time interaction among a few participants but isn’t widely supported; and HTTP Live Streaming (HLS), which sits atop the HTTP layer. HLS has much higher latency, is ideal for non-interactive broadcasts, and is widely supported. Low latency is a must for Periscope, however, so both protocols were needed to deliver the user experience on Android.

For playback, Periscope used ExoPlayer, Android’s application-level media player, which began supporting HLS earlier this year but lacks RTMP support. “So we built from scratch a completely custom RTMP broadcasting and playback stack for Periscope on Android,” Haider said.

Because RTMP isn’t widely scalable, Periscope relies on HLS to support scale. For example, when a Periscope video stream chatroom gains a lot of participants, Periscope limits the number of people who can comment to keep things moving smoothly for everyone. Those users get an HLS video stream that keeps them in sync while also enabling Periscope to scale the broadcast “to thousands of users,” Haider said. (Jump to 5:19 in the video to hear more.)

The chats accompanying a video feed are handled on their own Real-Time Chat (RTC) channel and come in instantly, though the video feed has variable latency. Also, Android clocks are “completely unreliable,” Haider said, with variations of as much as 15 seconds behind or ahead. To compensate for those variations, Periscope uses the Network Time Protocol (NTP).
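
The talk only says that NTP is used to compensate for unreliable device clocks; how Periscope applies the result is not described. For reference, the standard NTP arithmetic for a single request/response exchange looks like this (the function name and the example timestamps are made up):

```javascript
// Standard NTP arithmetic for one request/response exchange (all times in ms).
// t0: client sends request (client clock)   t1: server receives it (server clock)
// t2: server sends reply   (server clock)   t3: client receives it (client clock)
function estimateClock(t0, t1, t2, t3) {
  return {
    // How far the server clock is ahead of the client clock.
    offset: ((t1 - t0) + (t2 - t3)) / 2,
    // Round-trip network time, excluding the server's processing time.
    roundTrip: (t3 - t0) - (t2 - t1),
  };
}

// Example: the device clock runs 15 s behind the server,
// with 40 ms of network latency each way and 5 ms of server processing.
const t0 = 1000;              // client clock
const t1 = t0 + 15000 + 40;   // server clock
const t2 = t1 + 5;            // server clock
const t3 = t0 + 40 + 5 + 40;  // client clock

const { offset, roundTrip } = estimateClock(t0, t1, t2, t3);

// Convert a timestamp taken on the device into server time.
const toServerTime = (deviceTime) => deviceTime + offset;
console.log(offset, roundTrip, toServerTime(2000));
```

The offset estimate is exact only when the network delay is the same in both directions, so clients typically repeat the exchange and prefer samples with a short round trip.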

The Periscope Android team consisted, at various times, of three app engineers, one video engineer, and one designer. The “bootstrapped stack” the team built consists of open-source libraries Retrofit, OkHttp, EventBus, Glide, and Spongy Castle. They used the standard tools DexGuard, Crashlytics (Twitter’s crash reporting system), Localytics, Android Studio, and Genymotion. As for application libraries, Periscope used PubNub for its chatroom channel in addition to the ExoPlayer media player. (Go to 14:21 in the video to hear more.)

How Marketplace Search Differs from Traditional Search
http://nerds.airbnb.com/how-marketplace-search-differs-from-traditional-search/
Tue, 04 Aug 2015 17:33:58 +0000

The post How Marketplace Search Differs from Traditional Search appeared first on Airbnb Engineering.


A marketplace search is when you go to a specific website, such as Airbnb or Amazon, to hunt for a particular product, service or, in Airbnb’s case, property. As Airbnb engineering manager Surabhi Gupta explained in her recent OpenAir 2015 talk, there are three main reasons why marketplace search is more challenging for engineers than traditional search (such as on Google or Bing):

1. Conversion takes more than a click.

With traditional search, a user types a keyword or phrase, reviews the results, then clicks on the most relevant result. With a marketplace search on Airbnb, other steps sometimes must happen before a user ‘converts’ (makes a booking).

For example, a user may identify a listing she likes but want to ask the host questions first. The host can respond to the questions and accept the reservation; answer the questions but reject the booking; or not get the chance to answer before the guest moves on to another listing.

Airbnb’s search ranking uses machine learning to try to predict a final booking outcome, in order to help the guest easily find the best listing. To accomplish this, the Airbnb search team modeled the five intermediate states it cares most about: Impression (the displayed search results); Clicks (did the user like a result enough to click on it?); Accept (the host accepts the guest’s reservation); Reject (the dates don’t work out); and Booking.

Each state is given a score based on past user actions. By serving up search results influenced by how likely the user is to book the properties displayed, Airbnb has achieved “a huge booking gain,” Gupta said.
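The talk summary does not say how the per-state scores are computed or combined, so the following is only a rough sketch of the idea. The state weights, the count dictionaries, and the function names are all hypothetical; a real ranking model would learn these from data rather than hard-code them.

```python
# Hypothetical weights for each funnel state; the values are made up for illustration.
STATE_WEIGHTS = {"click": 0.1, "accept": 0.5, "reject": -0.5, "booking": 1.0}


def listing_score(counts):
    """Combine per-impression rates of each state into a single score.

    `counts` maps a state name ("impression", "click", ...) to how often
    that state was observed for a listing in past user actions.
    """
    impressions = counts.get("impression", 0)
    if impressions == 0:
        return 0.0  # no history, so no evidence either way
    return sum(
        weight * counts.get(state, 0) / impressions
        for state, weight in STATE_WEIGHTS.items()
    )


def rank_listings(listings):
    """Order listing ids from highest to lowest score.

    `listings` maps a listing id to its counts dictionary.
    """
    return sorted(listings, key=lambda lid: listing_score(listings[lid]), reverse=True)
```

In this sketch a listing that attracts many clicks but is frequently rejected can rank below one with fewer clicks and more accepted reservations, which is the intuition behind modeling the intermediate states rather than clicks alone.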

2. Decision making requires a lot of context.

Because Airbnb properties are unique, users need a lot of context to decide whether a listing is right for them. Airbnb’s engineering team has experimented with different ways of presenting information to users. For example, because location and price are the two most important requirements, Airbnb has given users information about neighborhood characteristics displayed on a city map as well as price histograms.

3. The supply is perishable.

Airbnb listings are “perishable” because they are unique properties whose availability comes and goes. To prevent users from reading about a property only to discover it’s not available when they want it, Airbnb built a real-time information infrastructure using MySQL databases, a centralized index, and Ruby on Rails. (Jump to 15:53 in the video for more details.)

Going forward, Airbnb wants to make it easier for users to pick up exactly where they left off when returning to the site; add more personalization options; and obtain a deeper understanding of its listings in order to give guests “the best possible experience” when matching them with hosts.


]]>
Netflix Algorithms Are Key to the ‘Future of Internet Television’ http://nerds.airbnb.com/netflix-algorithms-are-key-to-the-future-of-internet-television/ http://nerds.airbnb.com/netflix-algorithms-are-key-to-the-future-of-internet-television/#comments Fri, 24 Jul 2015 00:02:05 +0000 http://nerds.airbnb.com/?p=176595119 When Netflix’s 60 million subscribers log in to the streaming video service, their home page is populated with TV show and movie recommendations. The recommendations are key to Netflix’s success, as they drive two out of every three hours of video streamed.   The user’s home page “is where all our algorithms for recommending TV […]

The post Netflix Algorithms Are Key to the ‘Future of Internet Television’ appeared first on Airbnb Engineering.

]]>

When Netflix’s 60 million subscribers log in to the streaming video service, their home page is populated with TV show and movie recommendations. The recommendations are key to Netflix’s success, as they drive two out of every three hours of video streamed.

The user’s home page “is where all our algorithms for recommending TV shows and movies come together,” said Carlos Gomez Uribe, Netflix VP of Innovation. In his recent OpenAir 2015 talk, Gomez Uribe said over 100 engineers are focused on developing algorithms to help Netflix meet its business goal of “inventing the future of Internet television.”

One Netflix algorithm organizes the entire video catalog in a personalized way for users. Another looks for similarities between all content Netflix offers. A master algorithm “looks at all the other algorithms to decide which videos make it onto a user’s home page,” said Gomez Uribe.

Keyword searches drive 20 percent of video streaming hours, so Netflix’s search algorithm is tied into its recommendations and other algorithms. When users search for a title Netflix doesn’t have, the search results will display recommendations for similar shows. “We try to recommend movies related to a search, even though it’s not exactly what you wanted. All this requires a large number of algorithms,” Gomez Uribe said.

Personalization is important because it’s more likely to drive higher engagement with Netflix content vs. simply showing a user what’s popular. When Netflix organizes videos by popularity on user home pages, the “take rate” (the percentage of suggested videos that are actually watched) is “OK,” Gomez Uribe said. “But when we personalize recommendations, the take rate goes way up.” (Go to 5:00 in the video to hear more.)

Algorithms also help Netflix perform long-term A/B testing on its user interface, providing alternate ways to organize and display recommendations to users, said Gomez Uribe. In turn, the A/B testing can help Netflix measure subscriber cancellation rates more effectively. Cancellations are an easier metric to track than new member sign-ups because the latter are often fueled by word of mouth—which is notoriously difficult to track.

A/B testing has enabled Netflix to “stand our ground” on occasion, Gomez Uribe added. In 2011, Netflix.com unveiled a new user interface. A/B test results influenced the design, as the data showed that the new look-and-feel decreased cancellations and increased hours streamed.

Netflix was “so proud” of the new interface that it ran a blog post about it, “New Look and Feel for the Netflix Website” (June 8, 2011). But in short order, Netflix received a considerable number of snarky comments about the new look. Wrote one displeased subscriber: “Please inform your employers that a drunken dyslexic monkey would be a more acceptable design lead for your web concepts.”

Data from the A/B tests told Netflix that, despite the snark, “the majority of users were better off” with the new interface. And so, rather than rolling back to the old UI, Netflix moved forward with the new one, continuing to fine-tune it along the way. (Discussion begins around 10:40 in the video.)


]]>
http://nerds.airbnb.com/netflix-algorithms-are-key-to-the-future-of-internet-television/feed/ 0
At Airbnb, Data Science Belongs Everywhere: Insights from Five Years of Hypergrowth http://nerds.airbnb.com/scaling-data-science/ http://nerds.airbnb.com/scaling-data-science/#comments Tue, 07 Jul 2015 17:32:21 +0000 http://nerds.airbnb.com/?p=176595109 Five years ago, I joined Airbnb as its first data scientist. At that time, the few people who’d even heard of the company were still figuring out how to pronounce its name, and the roughly 7 person team (depending on whether you counted that guy on the couch, the intern, and the barista at our […]

The post At Airbnb, Data Science Belongs Everywhere: Insights from Five Years of Hypergrowth appeared first on Airbnb Engineering.

]]>
Five years ago, I joined Airbnb as its first data scientist.

At that time, the few people who’d even heard of the company were still figuring out how to pronounce its name, and the roughly 7-person team (depending on whether you counted that guy on the couch, the intern, and the barista at our favorite coffee shop) was still operating out of the founders’ apartment in SOMA. Put simply, it was pretty early stage.

Bringing me on was a forward-looking move on the part of our founders. This was just prior to the big data craze and the conventional wisdom that data can be a defining competitive advantage. Back then, it was a lot more common to build a data team later in a company’s lifecycle. But they were eager to learn and evolve as fast as possible, and I was attracted to the company’s culture and mission. So even though we were a very small-data shop at the time, I decided to get involved.

There’s a romanticism in Silicon Valley about the early days of a startup: you move fast, make foundational decisions, and any good idea could become the next big thing. From my perspective, that was all true.

Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies.

But five years and 43,000% growth later, things have gotten a bit more complicated. I’m happy to say that we’re also more sophisticated in the way we leverage data, and there’s now a lot more of it. The trick has been to manage scale in a way that brings together the magic of those early days with the growing needs of the present — a challenge that I know we aren’t alone in facing.

So I thought it might be worth pairing our posts on specific problems we’re solving with an overview of the higher-level issues data teams encounter as companies grow, and how we at Airbnb have responded. This will mostly center around how to connect data science with other business functions, but I’ll break it into three concepts — how we characterize data science, how it’s involved in decision-making, and how we’ve scaled it to reach all sides of Airbnb. I won’t say that our solutions are perfect, but we do work every day to retain the excitement, culture, and impact of the early days.

### Data Isn’t Numbers, It’s People

The foundation upon which a data science team rests is the culture and perception of data elsewhere in the organization, so defining how we think about data has been a prerequisite to ingraining data science in business functions.

In the past, data was often referenced in cold, numeric terms. It was construed purely as a measurement tool, which paints data scientists as Spock-like characters expected to have statistics memorized and available upon request. Interactions with us would therefore tend to come in the form of a request for a fact: How many listings do we have in Paris? What are the top 10 destinations in Italy?

While answering questions and measuring things is certainly part of the job, at Airbnb we characterize data in a more human light: it’s the voice of our customers. A datum is a record of an action or event, which in most cases reflects a decision made by a person. If you can recreate the sequence of events leading up to that decision, you can learn from it; it’s an indirect way of the person telling you what they like and don’t like – this property is more attractive than that one, I find these features useful but those... not so much.

This sort of feedback can be a goldmine for decisions about community growth, product development, and resource prioritization. But only if you can decipher it. Thus, data science is an act of interpretation – we translate the customer’s ‘voice’ into a language more suitable for decision-making.

This idea resonates at Airbnb because listening to guests and hosts is core to our culture. Since the early days, our team has met with community members to understand how to make our product better suit their needs. We still do this, but the scale of the community is now beyond the point where it’s feasible to connect with everyone everywhere.

So, data has become an ally. We use statistics to understand individual experiences and aggregate those experiences to identify trends across the community; those trends inform decisions about where to drive the business.

Over time, our colleagues on other teams have come to understand that the data team isn’t a bunch of Vulcans, but rather that we represent the very human voices of our customers. This has paved the way for changes to the structure of data science at Airbnb.

### Proactive Partnership v. Reactive Stats-Gathering

A good data scientist is therefore able to get in the mind of people who use our product and understand their needs. But if they’re alone in a forest with no one to act on the insight they uncovered, what difference does it make?

Our distinction between good and great is impact — using insights to influence decisions and ensuring that the decisions had the intended effect. While this may seem obvious, it doesn’t happen naturally – when data scientists are pressed for time, they have a tendency to toss the results of an analysis ‘over the wall’ and then move on to the next problem. This isn’t because they don’t want to see the project through, but with so much energy invested into understanding the data, ensuring statistical methods are rigorous, and making sure results are interpreted correctly, the communication of their work can feel like a trivial afterthought.

But when decision-makers don’t understand the ramifications of an insight, they don’t act on it. When they don’t act on it, the value of the insight is lost.

The solution, we think, is connecting data scientists as tightly as possible with decision-makers. In some cases, this happens naturally; for example when we develop data products (more on this in a future post). But there’s also a strong belief in cross-functional collaboration at Airbnb, which brings up questions about how to structure the team within the broader organization.

A lot has been written about the pros and cons of centralized and embedded data science teams, so I won’t focus on that. But suffice it to say we’ve landed on a hybrid of the two.

We began with the centralized model, tempted by the opportunities it offers to learn from each other and stay aligned on metrics, methodologies, and knowledge of past work. While this was all true, we’re ultimately in the business of decision-making, and found we couldn’t do this successfully when siloed: partner teams didn’t fully understand how to interact with us, and the data scientists on our team didn’t have the full context of what they were meant to solve or how to make it actionable. Over time we became viewed as a resource and, as a result, our work became reactive – responding to requests for statistics rather than being able to think proactively about future opportunities.

So we made the decision to move from a fully-centralized arrangement to a hybrid centralized/embedded structure: we still follow the centralized model, in that we have a singular data science team where our careers unfold, but we have broken this into sub-teams that partner more directly with engineers, designers, product managers, marketers, and others. Doing so has accelerated the adoption of data throughout the company, and has elevated data scientists from reactive stats-gatherers to proactive partners. And by not fully shifting toward an embedded model we’re able to maintain a vantage point over every piece of the business, allowing us to form a neural core that can help all sides of the company learn from one another.

### Customer-driven decisions

Structure is a big step toward empowering impactful data science, but it isn’t the full story. Once situated within a team that can take action against an insight, the question becomes how and when to leverage the community’s voice for business decisions.

Through our partnership with all sides of the company, we’ve encountered many perspectives on how to integrate data into a project. Some people are naturally curious and like to begin by understanding the context of the problem they’re facing. Others view data as a reflection of the past and therefore a weaker guide for planning; but these folks tend to focus more on measuring the impact of their gut-driven decisions.

Both perspectives are fair. Being completely data-driven can lead to optimizing toward a local maximum; finding a global maximum requires shocking the system from time to time. But they reflect different points where data can be leveraged in a project’s lifecycle.

Over time, we’ve identified four stages of the decision-making process that benefit from different elements of data science:

1. We begin by learning about the context of the problem, putting together a full synopsis of past research and efforts toward addressing the opportunity. This is more of an exploratory process aimed at sizing opportunities, and generating hypotheses that lead to actionable insights.
2. That synopsis translates to a plan, which encompasses prioritizing the lever we intend to utilize and forming a hypothesis for the effect of our efforts. Predictive analytics is more relevant in this stage, as we have to make a decision about what path to follow, which is based on where we expect to have the largest impact.
3. As the plan gets underway, we design a controlled experiment through which to roll the plan out. A/B testing is very common now, but our collaboration with all sides of the business opens up opportunities to use experimentation in a broader sense — operational market-based tests, as well as more traditional online environments.
4. Finally, we measure the results of the experiment, identifying the causal impact of our efforts. If successful, we launch to the whole community; if not, we cycle back to learning why it wasn’t successful and repeat the process.
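To make the measurement step concrete, here is a minimal sketch of one common way to check whether a difference in conversion between a control group and a treatment group is larger than chance would explain: a two-proportion z-test. This is an illustrative assumption, not a description of Airbnb’s experimentation tooling; the function name and arguments are hypothetical.

```python
import math


def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Compare conversion rates of two experiment arms.

    Returns (lift, z, p_value), where lift is the absolute difference in
    conversion rate (treatment minus control) and p_value is two-sided.
    """
    rate_control = conv_control / n_control
    rate_treatment = conv_treatment / n_treatment
    lift = rate_treatment - rate_control

    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_err == 0:
        return lift, 0.0, 1.0  # no variation at all, nothing to test

    z = lift / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return lift, z, p_value
```

A test like this only covers the statistical part of step 4. In a two-sided marketplace, users in different arms can influence each other, so a simple test of this kind should be read as a starting point rather than a full causal analysis.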

Sometimes a step is fairly straightforward, for example if the context of the problem is obvious – the fact that we should build a mobile app doesn’t necessitate a heavy synopsis upfront. But the more disciplined we’ve become about following each step sequentially, the more impactful everyone at Airbnb has become. This makes sense because, ultimately, this process pushes us to solve problems relevant to the community in a way that addresses their needs.

### Democratizing Data Science

The above model is great when data scientists have sufficient bandwidth. But the reality of a hypergrowth startup is that the scale and speed at which decisions need to be made will inevitably outpace the growth of the data science team.

This became especially clear in 2011 when Airbnb exploded internationally. Early in the year, we were still a small company based entirely in SF, meaning our army of three data scientists could effectively partner with everyone.

Six months later, we opened over 10 international offices simultaneously, while also expanding our product, marketing, and customer support teams. Our ability to partner directly with every employee suddenly, and irrevocably, disappeared.

Just as it became impossible to meet every new member of the community, it was now also impossible to meet and work with every employee. We needed to find a way to democratize our work, broadening from individual interactions, to empowering teams, the company, and even our community.

Doing this successfully requires becoming more efficient and effective, mostly through investment in the technology surrounding data. Here are some examples of how we’ve approached each level of scale:

1. Individual interactions become more efficient as data scientists are empowered to move more quickly. Investing in data infrastructure is the biggest lever here – adopting faster and more reliable technologies for querying an ever-growing volume of data. Stabilizing ETL has also been valuable, for example through our development of Airflow.
2. Empowering teams is about removing the burden of reporting and basic data exploration from the shoulders of data scientists so they can focus on more impactful work. Dashboards are a common example of a solution. We’ve also developed a tool to help people author queries (Airpal) against a robust and intuitive data warehouse.
3. Beyond individual teams, where our work is more tactical, we think about the culture of data in the company as a whole. Educating people on how we think about Airbnb’s ecosystem, as well as how to use tools like Airpal, removes barriers to entry and inspires curiosity about how everyone can better leverage data. Similar to empowering teams, this has helped liberate us from ad hoc requests for stats.
4. The broadest example of scaling data science is enabling guests and hosts to learn from each other directly. This mostly happens through data products, where machine learning models interpret signals from one set of community-members to help guide others. Location relevance was one example we wrote about, but as this work is becoming more commonplace in other areas of the company, we’ve developed tools for making it easier to launch and understand the models we develop.

Scaling a data science team to a company in hypergrowth isn’t easy. But it is possible. Especially if everyone agrees that it’s not just a nice part of the company, it’s an essential part of the company.

### Wrestling the train from the monkey

Five years in, we’ve learned a lot. We’ve improved how we leverage the data we collect; how we interact with decision-makers; and how we democratize this ability out to the company. But to what extent has all of this work been successful?

Measuring the impact of a data science team is ironically difficult, but one signal is that there’s now a unanimous desire to consult data for decisions that need to be made by technical and non-technical people alike. Our team members are seen as partners in the decision-making process, not just reactive stats-gatherers.

Another is that our increasing ability to distill the causal impact of our work has helped us wrestle the train away from the monkey. This has been trickier than one might expect because Airbnb’s ecosystem is complicated — a two-sided marketplace with network effects, strong seasonality, infrequent transactions, and long time horizons — but these challenges make the work more exciting. And as much as we’ve accomplished over the last few years, I think we’re still just scratching the surface of our potential.

We’re at a point where our infrastructure is stable, our tools are sophisticated, and our warehouse is clean and reliable. We’re ready to take on exciting new problems. On the immediate horizon we look forward to shifting from batch to realtime processing; developing a more robust anomaly detection system; deepening our understanding of network effects; and increasing our sophistication around matching and personalization.

But these ideas are just the beginning. Data is the (aggregated) voice of our customers. And wherever we go next–wherever we belong next–will be driven by those voices.

This post originally appeared on VentureBeat.


]]>
http://nerds.airbnb.com/scaling-data-science/feed/ 0
Recap of OpenAir http://nerds.airbnb.com/recap-of-openair/ http://nerds.airbnb.com/recap-of-openair/#comments Mon, 06 Jul 2015 16:04:20 +0000 http://nerds.airbnb.com/?p=176595098 Three weeks ago we hosted OpenAir 2015, our second technology conference. We had an amazing turnout of bright minds from across the industry, more than doubling attendance from 2014. A new generation of companies are emerging whose customers aren’t judging them by their apps and websites but on the experiences and content the products connect […]

The post Recap of OpenAir appeared first on Airbnb Engineering.

]]>
Three weeks ago we hosted OpenAir 2015, our second technology conference. We had an amazing turnout of bright minds from across the industry, more than doubling attendance from 2014.

A new generation of companies is emerging whose customers aren’t judging them by their apps and websites but on the experiences and content the products connect them with. With that in mind, the theme for OpenAir 2015 was scaling human connection, and we focused on online to offline and the better matching that enables it.

Throughout the day we learned how Instagram helps its users discover new content that inspires them, how Stripe helps people transact across borders, how LinkedIn used data to power its social network, how Periscope came to life on Android, and, of course, how Airbnb helps turn strangers into friends.

Behind all of these challenges there are central concepts that we as a tech industry need to understand better – trust, personalization and the data that enables both.

With that in mind, Airbnb open-sourced two new tools for wrangling data. The first is Airflow, a sophisticated tool to programmatically author, schedule and monitor data pipelines. People in the industry will know this work as ETL engineering. The second is Aerosolve, a machine learning package for Apache Spark. It’s designed to combine high capacity to learn with an accessible workflow that encourages iteration and deep understanding of underlying patterns on a human level. Since we launched these tools they have gotten over 2,000 stars on GitHub – we can’t wait to see how people use and contribute to them.

We also announced a new tool for our hosts called Price Tips, which is powered by Aerosolve. Price Tips creates ongoing tips for our hosts on how to price their listing, not just for one day, but for each day of the year. This pricing is fully dynamic — it takes into account demand, location, travel trends, amenities, type of home and much more. There are hundreds of signals that go into the model to produce each price tip. We believe that better pricing will be a great way to further empower our hosts to meet their personal goals through hosting.

Finally, we closed out the opening keynote morning with the launch of our brand new Gift Cards website. Now anyone in the US can give their family, friends, colleagues, frenemies, whomever, the gift of travel on Airbnb. And for those lucky folks in the audience, we gave everyone a $100 gift card.

We will be following up with more videos from the event, so keep your eyes on this space.


]]>
http://nerds.airbnb.com/recap-of-openair/feed/ 0
Designing Machine Learning Models: A Tale of Precision and Recall http://nerds.airbnb.com/designing-machine-learning-models/ http://nerds.airbnb.com/designing-machine-learning-models/#comments Wed, 01 Jul 2015 18:02:21 +0000 http://nerds.airbnb.com/?p=176595076 At Airbnb, we are focused on creating a place where people can belong anywhere. Part of that sense of belonging comes from trust amongst our users and knowing that their safety is our utmost concern. While the vast majority of our community is made up of friendly and trustworthy hosts and guests, there exists a […]

The post Designing Machine Learning Models: A Tale of Precision and Recall appeared first on Airbnb Engineering.

]]>
At Airbnb, we are focused on creating a place where people can belong anywhere. Part of that sense of belonging comes from trust amongst our users and knowing that their safety is our utmost concern.

While the vast majority of our community is made up of friendly and trustworthy hosts and guests, there exists a tiny group of users who try to take advantage of our site. These are very rare occurrences, but nevertheless, this is where the Trust and Safety team comes in.

The Trust and Safety team deals with any type of fraud that might happen on our platform. It is our main objective to try to protect our users and the company from various types of risks. An example risk is chargebacks – a problem that most ecommerce companies are familiar with. To reduce the number of fraudulent actions, the Data Scientists within the Trust and Safety team build various Machine Learning models to help identify the different types of risks. For more information on the architecture behind our models, please refer to a previous blog post on Architecting A Machine Learning System For Risk.

In this post, I give a brief overview of the thought process that comes with building a Machine Learning model. Of course every model is different, but hopefully it will give readers an insight on how we use data in a Machine Learning application to help protect our users, and the different approaches we use to improve our models. For this blog post, suppose we want to build a model to predict if certain fictional characters are evil*.

### What are we trying to predict?

The most fundamental question in model building is determining what you would like the model to predict. I know this sounds silly, but oftentimes this question alone raises other, deeper questions.

Even a seemingly straightforward character classification model can raise many questions as we think more deeply about the kind of model to build. For example, what do we want this model to score: just newly introduced characters or all characters? If the former, how far into the introduction do we want to score the characters? If the latter, how often do we want to score these characters?

A first thought might be to build a model that scores each character upon introduction. However, with such a model, we would not be able to track characters’ scores over time. Furthermore, we could be missing out on potentially evil characters that might have “good” characteristics at the time of introduction.

We could instead build a model that scores a character every time he/she appears in the plot. This would allow us to study the scores over time and detect anything unusual. But, given that there might not be any character development in every single appearance, this may not be the most practical route to pursue.

After much consideration, we might decide on a model design that falls between these two initial ideas, i.e. build a model that scores each character each time something significant happens, such as gathering new allies or acquiring dragons. This way, we would still be able to track the characters’ scores over time without unnecessarily scoring those with no recent development.

### How do we model scores?

Since our objective is to analyze scores over time, our training data set needs to reflect characters’ activities across a period of time. The resulting training data set will look similar to the following:

The periods associated with each character are not necessarily consecutive since we are only interested in days where there exist significant developments.

In this instance, Jarden has significant character developments on 3 different occasions and is constantly growing his army over time. Dineas has significant character developments on 5 different occasions and is responsible for 4 dragons mid-plot.
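As a rough illustration of the layout described above, the sketch below represents the training set as one row per character per significant development. The column names and every numeric value are hypothetical; only the shape follows the description (three rows for Jarden with a growing army, five rows for Dineas with four dragons mid-plot, and non-consecutive days).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One row of the training set: a character on a day with a significant development."""
    character: str
    day: int        # day of the development; days need not be consecutive
    soldiers: int   # hypothetical feature
    dragons: int    # hypothetical feature


# All values below are invented for illustration.
TRAINING_ROWS = [
    Observation("Jarden", 3, 1_000, 0),
    Observation("Jarden", 17, 2_500, 0),
    Observation("Jarden", 42, 6_000, 0),
    Observation("Dineas", 1, 500, 0),
    Observation("Dineas", 8, 500, 2),
    Observation("Dineas", 20, 800, 4),
    Observation("Dineas", 33, 1_200, 3),
    Observation("Dineas", 51, 1_500, 3),
]


def rows_for(character):
    """Return all observations for one character, in the order they were recorded."""
    return [row for row in TRAINING_ROWS if row.character == character]
```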

### Sampling

Often with Machine Learning models, it is necessary to down-sample the number of observations. The sampling process itself can be quite straightforward i.e. once one has the desired training data set, one can do a row-based sampling on the population.

However, because the model described herein is dealing with multiple periods per character, row-based sampling might result in scenarios where the occasions pertaining to a character get split between the data for model build and the validation data. The table below shows an example of such a scenario:

This is not ideal because we are not getting a holistic picture of each character and those missing observations could be crucial to building a good model.

For this reason, we need to do character-based sampling. This will ensure that either all of the occasions pertaining to a character get included in the model build data, or none at all.

The same logic applies when it comes time to splitting our data into training and validation sets.
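A minimal sketch of character-based splitting follows, assuming each row is a dictionary with a `"character"` key. The function name, the `validation_fraction` default, and the seed are illustrative choices, not part of the original model.

```python
import random


def split_by_character(rows, validation_fraction=0.25, seed=0):
    """Split rows so that every character lands entirely in one set.

    Sampling is done over the distinct characters, not over rows, so all
    occasions for a character stay together.
    """
    characters = sorted({row["character"] for row in rows})
    rng = random.Random(seed)  # local generator keeps the split reproducible
    rng.shuffle(characters)

    n_validation = max(1, round(len(characters) * validation_fraction))
    validation_characters = set(characters[:n_validation])

    train = [row for row in rows if row["character"] not in validation_characters]
    validation = [row for row in rows if row["character"] in validation_characters]
    return train, validation
```

The same function can serve for down-sampling: keep only the rows returned for a sampled subset of characters instead of sampling individual rows.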

### Feature Engineering

Feature engineering is an integral part of Machine Learning, and a good understanding of the data helps generate ideas on the types of features to engineer for a better model. Examples of feature engineering include feature normalization and treatment of categorical features.

Feature normalization is a way to standardize features that allows for more sensible comparisons. Let’s take the table below as an example:

Both characters have 10,000 soldiers. However, Serion has been in power for 5 years, while Dineas has only been in power for 2 years. Comparing the absolute number of soldiers across these characters might not be very useful. However, normalizing them by the characters’ years in power could provide better insights and produce a more predictive feature.
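Using the figures from the paragraph above, the normalization can be sketched as a simple ratio. The function name is hypothetical, and dividing by years in power is just the normalization suggested in the text.

```python
def soldiers_per_year(soldiers, years_in_power):
    """Normalize army size by time in power."""
    if years_in_power <= 0:
        raise ValueError("years_in_power must be positive")
    return soldiers / years_in_power


serion_rate = soldiers_per_year(10_000, 5)  # 2000.0 soldiers per year
dineas_rate = soldiers_per_year(10_000, 2)  # 5000.0 soldiers per year
```

The raw counts are identical, but the normalized feature shows Dineas building an army two and a half times as fast as Serion.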

Feature engineering on categorical features probably deserves a separate blog post because there are so many ways to deal with them. For missing-value imputation in particular, please take a look at a previous blog post on Overcoming Missing Values in a Random Forest Classifier.

The most common approach for transforming categorical features is vectorizing (also known as one-hot encoding). However, when dealing with many categorical features with many different levels, it is more practical to use conditional-probability coding (CP-coding).

The basic idea of CP-coding is to compute the probability of an event occurring given a categorical level. This method allows us to project all levels of a categorical feature into a single numerical variable.

However, this type of transformation may result in noisy values for levels that are not well represented. In the example above, we only have one observation from the House of Tallight. As a result, the corresponding probability is either 0 or 1. To get around this issue and to reduce the noise in general, one can adjust how the probabilities are computed by taking into account the weighted average and the global probability, as well as by introducing a smoothing hyperparameter.
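One way to sketch CP-coding with smoothing is shown below. The post does not give the exact formula, so the additive form used here, (positives + m × global rate) / (count + m), is an assumption, as are the function name `cp_encode`, the `smoothing` parameter, and the made-up house name "Example".

```python
from collections import defaultdict


def cp_encode(levels, labels, smoothing=0.0):
    """Map each categorical level to a smoothed P(label == 1 | level).

    With smoothing == 0 this is the raw conditional probability. Larger
    values pull sparsely observed levels toward the global probability.
    """
    if not labels:
        raise ValueError("need at least one observation")
    global_rate = sum(labels) / len(labels)

    counts = defaultdict(int)
    positives = defaultdict(int)
    for level, label in zip(levels, labels):
        counts[level] += 1
        positives[level] += label

    return {
        level: (positives[level] + smoothing * global_rate)
        / (counts[level] + smoothing)
        for level in counts
    }


# Hypothetical data: 1 = evil, 0 = good.
houses = ["Tallight", "Example", "Example", "Example", "Example"]
is_evil = [1, 0, 0, 1, 0]

raw = cp_encode(houses, is_evil)                     # Tallight -> 1.0
smoothed = cp_encode(houses, is_evil, smoothing=4)   # Tallight -> about 0.52
```

With a single observation, the raw value for Tallight is pinned at 1.0. With `smoothing=4`, it moves to roughly 0.52, much closer to the global rate of 0.4, while the better-represented level moves less (0.25 to about 0.325).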

So, which method is better? It depends on the number and levels of the categorical features. CP-coding is good because it reduces the dimensionality of the feature, but by doing so, we sacrifice information on feature-to-feature interactions, which is something that vectorizing retains. Alternatively, we could integrate both methods, i.e. combine the categorical features of interest and then perform CP-coding on the interacted features.

### Evaluating Model Performance

When it comes time to evaluate model performance, we need to be mindful of the proportion of good/evil characters. With our example model, the data is aggregated at [character*period] level (left table below).

However, the model performance should be measured at character level (right table below).

As a result, the proportion of good/evil characters between the model build and model performance data is significantly different. It is crucial that one assigns proper weights when evaluating a model’s precision and recall.

Additionally, because we would likely have down-sampled the number of observations, we need to rescale the model’s precision and recall to account for the sampling process.
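The rescaling can be sketched as follows, under a loudly stated assumption: each class was down-sampled uniformly at random at a known rate, so every sampled observation stands in for `1 / rate` observations in the full population. The function name and the `positive_rate` / `negative_rate` parameters are hypothetical, and the counts are invented for illustration.

```python
def rescaled_precision_recall(tp, fp, fn, positive_rate=1.0, negative_rate=1.0):
    """Undo class-wise down-sampling before computing precision and recall.

    positive_rate: fraction of evil characters kept by the sampling step.
    negative_rate: fraction of good characters kept by the sampling step.
    """
    # TP and FN come from the positive class, FP from the negative class.
    tp_full = tp / positive_rate
    fn_full = fn / positive_rate
    fp_full = fp / negative_rate

    precision = tp_full / (tp_full + fp_full) if tp_full + fp_full else 0.0
    recall = tp_full / (tp_full + fn_full) if tp_full + fn_full else 0.0
    return precision, recall


# Invented counts measured on a sample that kept every evil character
# but only 10% of the good ones.
sample_precision = 80 / (80 + 20)  # 0.8 on the sample
precision, recall = rescaled_precision_recall(
    tp=80, fp=20, fn=20, negative_rate=0.1
)
```

In this invented case the sample precision of 0.8 drops to about 0.29 once the 20 sampled false positives are scaled back up to 200, while recall stays at 0.8 because the evil class was not sampled.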

### Assessing Precision and Recall

The two main performance metrics for model evaluation are Precision and Recall. In our example, precision is the proportion of characters predicted to be evil that are actually evil. It measures the accuracy of the model’s positive predictions at a given threshold. Recall, on the other hand, is the proportion of all evil characters that the model is able to detect. It measures how comprehensive the model is at identifying evil characters at a given threshold. This can be confusing, so I’ve broken it down in the table below to illustrate the difference:

It is often helpful to classify the numbers into the 4 different bins:

1. True Positives (TP): Character is evil and model predicts it as such
2. False Positives (FP): Character is good, but model predicts it to be evil
3. True Negatives (TN): Character is good and model predicts it as such
4. False Negatives (FN): Character is evil, but model fails to identify it

Precision answers the question: out of the characters predicted to be evil, how many did the model identify correctly? That is, TP / (TP + FP).

Recall answers the question: out of all evil characters, how many did the model predict to be evil? That is, TP / (TP + FN).

Observe that even though the numerator is the same, the two denominators refer to different sub-populations.
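The four bins and the two formulas above can be sketched in a few lines. The label lists here are invented for illustration, and the function names are not from the original post.

```python
def confusion_counts(actual_evil, predicted_evil):
    """Count TP, FP, TN and FN from parallel lists of booleans."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(actual_evil, predicted_evil):
        if actual and predicted:
            tp += 1  # evil, and predicted evil
        elif not actual and predicted:
            fp += 1  # good, but predicted evil
        elif not actual and not predicted:
            tn += 1  # good, and predicted good
        else:
            fn += 1  # evil, but the model missed it
    return tp, fp, tn, fn


def precision_recall(tp, fp, fn):
    """Same numerator, different denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels for eight characters.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, True, False, False, False, False]

tp, fp, tn, fn = confusion_counts(actual, predicted)  # (2, 1, 3, 2)
precision, recall = precision_recall(tp, fp, fn)      # (2/3, 0.5)
```

Here the model flags three characters and two of them are evil (precision 2/3), but it finds only two of the four evil characters (recall 0.5).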

There is always a trade-off between choosing high precision vs. high recall. Depending on the purpose of the model, one might choose higher precision over higher recall. However, for fraud prediction models, higher recall is generally preferred even if some precision is sacrificed.
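To make the trade-off concrete, the sketch below scores six invented characters and compares two decision thresholds. The scores, labels, and the function name `precision_recall_at` are assumptions made for this example.

```python
def precision_recall_at(scores, actual_evil, threshold):
    """Precision and recall when scores >= threshold are predicted evil."""
    predicted = [score >= threshold for score in scores]
    tp = sum(a and p for a, p in zip(actual_evil, predicted))
    fp = sum((not a) and p for a, p in zip(actual_evil, predicted))
    fn = sum(a and (not p) for a, p in zip(actual_evil, predicted))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented model scores (higher means more likely evil) and true labels.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.20]
actual = [True, True, False, True, False, False]

strict = precision_recall_at(scores, actual, threshold=0.75)   # (1.0, 2/3)
lenient = precision_recall_at(scores, actual, threshold=0.50)  # (0.75, 1.0)
```

Lowering the threshold from 0.75 to 0.50 catches the remaining evil character, raising recall from 2/3 to 1.0, at the cost of one false positive, which lowers precision from 1.0 to 0.75.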

There are many ways one can improve a model’s precision and recall. These include adding better features, optimizing the pruning of trees, and building a bigger forest, to name a few. However, given how extensive this discussion can be, I will leave it for a separate blog post.

### Epilogue

Hopefully, this blog post has given readers a glimpse of what building a Machine Learning model entails. Unfortunately, there is no one-size-fits-all solution for building a good model, but knowing the context of the data well is key because it translates into deriving more predictive features, and thus a better model.

Lastly, classifying characters as good or evil can be subjective, but labels are a really important part of machine learning and bad labeling usually results in a poor model. Happy Modeling!

* This model assumes that each character is either born good or evil i.e. if they are born evil, then they are labeled as evil their entire lives. The model design will be completely different if we assume characters could cross labels mid-life.

The post Designing Machine Learning Models: A Tale of Precision and Recall appeared first on Airbnb Engineering.

]]>