Airbnb Engineering (http://nerds.airbnb.com)

Building for Trust: Insights from our efforts to distill the fuel for the sharing economy
http://nerds.airbnb.com/building-for-trust/ | Tue, 29 Mar 2016

“I want to tell you about the time that I almost got kidnapped in the trunk of a red Mazda Miata.”

That was the intro to a talk Joe Gebbia, one of Airbnb’s co-founders, recently gave at TED. You can watch it here to find out how the story ends, but (spoiler alert) the theme centers on trust — one of the most important challenges we face at Airbnb.

Designing for trust is a well-understood topic across the hospitality industry, but our efforts to democratize hospitality mean we have to rely on trust in an even more dramatic way. Not long ago, our friends and families thought we were crazy for believing that someone would let a complete stranger stay in their home. That feeling stemmed from the fact that most of us were raised to fear strangers.

“Stranger danger” is a natural human defense mechanism; overcoming it requires a leap of faith for both guests and hosts. But that’s a leap we can actively support by understanding what trust is, how it works, and how to build products that support it.

How best to support trust — particularly between groups of people who may not have the opportunity to interact with each other on a daily basis — is a core research topic for our data science and experience research teams. In preparation for Joe’s talk, we reflected on how we think about trust, and we pulled together insights from a variety of past projects. The goal of this post is to share some of the thoughts and insights that didn’t make it into the TED talk and to inspire more thinking about how to cultivate the fuel that helps the sharing economy run: trust.

Building the scaffold

When Airbnb was just getting started, we were keenly aware of the need to build products that encourage trust. Convincing someone to try us for the first time would require some confidence that our platform helps to protect them, so we chose to take on a series of complex problems.

We began with the assumption that people are fundamentally good and that, with the right tools in place, we could help overcome the stranger-danger bias. To do so, we needed to remove anonymity, giving guests and hosts an identity in our community. We built profile pages where they could upload pictures of themselves, write a description of who they are, link social media accounts, and highlight feedback from past trips. Over time we’ve emphasized these identity pages more and more. Profile pictures, for example, are now mandatory — because they are heavily relied upon. In nearly 50% of trips, guests visit a host’s profile at least once, and 68% of those visits occur in the planning phase before booking. When people are new to Airbnb these profiles are especially useful: compared to experienced guests, first-time guests are 20% more likely to visit a host’s profile before booking.

In addition to fostering identity, we knew we also needed defensive mechanisms that would help build confidence. So we chose to handle payments ourselves — a complicated technical challenge, but one that would enable us to better understand who was making a booking. This also put us in a position to design rules that remove some uncertainty around payments. For example, we wait until 24 hours after a guest checks in before releasing funds to the host, giving both parties time to notify us if something isn’t right. And when something goes wrong there needs to be a way to reach us, so we built a customer support organization that now covers every timezone and many languages, 24/7.
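As a sketch of how a rule like the payout hold might be expressed (the names and structure here are hypothetical illustrations, not Airbnb's actual payments code), the release decision reduces to a simple predicate:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the 24-hour payout hold described above;
# illustrative only, not Airbnb's actual payments implementation.
HOLD_PERIOD = timedelta(hours=24)

def payout_releasable(checkin: datetime, now: datetime, issue_reported: bool) -> bool:
    """Release funds to the host only once 24 hours have passed since
    check-in and neither party has reported a problem."""
    return not issue_reported and (now - checkin) >= HOLD_PERIOD
```

The point of the delay is that the decision is reversible during the window: either party reporting a problem blocks the release until support can step in.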

One way we measure the effect of these efforts is through retention — the likelihood that a guest or host uses Airbnb more than once. This isn’t a direct measure of trust, but the more people come to trust Airbnb, the more likely they are to continue using our service, so the two are likely correlated. Evaluating customer support through this lens makes a clear case for its value: if a guest has a negative experience, such as a host canceling their reservation right before the trip, their retention rate drops by 26%; intervention by customer support almost entirely negates this loss, shrinking the drop from 26% to less than 6%.

We didn’t get everything right at first, and we still don’t, but we have improved. One thing we learned after a bad early experience is that we needed to do more to give hosts confidence that we’d be there for them if anything went wrong, so we rolled out our $1 million guarantee for eligible hosts. Each year, more and more people give Airbnb a try because we’ve been able to build confidence that their experience is likely to be a good one. This isn’t the same as trusting the person they will stay with, but it’s an important first step: if trust is the building that hosts and guests construct together, then confidence is the scaffold. Just like the scaffold on a building, our efforts to build confidence make it easier for the work of trust-building to happen, but they won’t create trust. Only hosts and guests can do that.

Enacting trust

Researchers define trust in many ways, but one interesting definition comes from political scientist Russell Hardin. He argues that trust is really about “encapsulated interest”: if I trust you, I believe that you’re going to look after me and my interests, that you’re going to take my interests to heart and make decisions about them like I would. People who are open to trusting others aren’t suckers — they usually need evidence that the odds are stacked in their favor when they choose to trust a stranger. Reviews form the raw material for that evidence, which we collect and surface to users on our platform. This is one of our most important data products; we refer to it as our reputation system. The reputation system is an invaluable tool for the Airbnb community, and it’s heavily used — more than 75% of trips are voluntarily reviewed. This is particularly interesting because reviews don’t benefit the individuals who leave them; they benefit future guests and hosts, validating members of the Airbnb community and helping compatible guests and hosts find each other.
Having any reputation at all is a strong determinant of a host’s ability to get a booking — a host without reviews is about four times less likely to get a booking than one with at least one review. Our reputation system helps guide guests and hosts toward positive experiences, and it also helps overcome stereotypes and biases that unconsciously affect our decisions. We know biases exist in society, and one of the strongest biases we have in life is that we tend to trust others who are similar to us — sociologists call this homophily. As strong a social force as homophily is, it turns out reputation information can help counteract it. In a recent collaborative study with a team of social psychologists at Stanford, we found evidence of homophily among Airbnb travelers, but we also found that having enough positive reviews can counteract it, meaning in effect that high reputation can overcome high similarity. (Publication forthcoming.)

Given the importance of reputation, we’re always looking for ways to increase the quantity and quality of reviews. Several years ago, one member of our team observed that reviews could be biased upward by fears of retaliation, and that bad experiences were less likely to be reviewed at all. So we experimented with a ‘double blind’ process in which guest and host reviews would be revealed only after both had been submitted, or after a 14-day waiting period, whichever came first. The result was a 7% increase in review rates and a 2% increase in negative reviews. These may not sound like big numbers, but the results compound over time — it was a simple tweak that has improved the travel experiences of millions of people since.

The effect on communities

Once trust takes root, powerful community effects begin to emerge and long-standing barriers can begin to fall. First and foremost, people from different cultures become more connected.
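The double-blind reveal rule described above — publish both reviews once both sides have submitted, or when the 14-day window closes, whichever comes first — reduces to a small predicate. This is an illustrative sketch; the names and structure are assumptions, not Airbnb's actual code:

```python
from datetime import datetime, timedelta

# Sketch of the double-blind review reveal rule; illustrative only.
REVIEW_WINDOW = timedelta(days=14)

def reviews_visible(checkout: datetime, now: datetime,
                    guest_submitted: bool, host_submitted: bool) -> bool:
    """Reveal reviews when both parties have submitted, or when the
    14-day window has elapsed — whichever comes first."""
    both_submitted = guest_submitted and host_submitted
    window_elapsed = (now - checkout) >= REVIEW_WINDOW
    return both_submitted or window_elapsed
```

Because neither side can read the other's review before writing their own, there is no opening for retaliation — which is what lifted both the review rate and the share of honest negative reviews.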
On New Year’s Eve last year, for example, over a million guests, hailing from almost every country on earth, spent the night with hosts in over 150 countries. Rather than staying with other tourists, they stayed with locals, creating an opportunity for cross-cultural connection that can break down barriers and increase understanding. This is visualized in the graphic below, which shows how countries are connected via Airbnb trips. Countries on the vertical axis are where people are traveling from, and countries on the horizontal axis are where people are traveling to. The associated link will take you to an interactive visualization where you can see trends in connections relative to different measures of distance.

Airbnb experiences are overwhelmingly positive, which creates a natural incentive to continue hosting. Hosts’ acceptance rates rise as they gain more experience hosting. And we see evidence of their relishing the cross-cultural opportunities Airbnb provides: guests from a different country than the host gain a 6% edge in acceptance rates.

Hosting also produces more practical benefits. About half of our hosts report that the financial boost they receive through Airbnb helps them stay in their homes and pay for regular household expenses like rent and groceries; depending on the market, 5–20% of hosts report that this added source of earnings has helped them avoid foreclosure or eviction. The remaining earnings pad long-term savings and emergency funds, which help hosts weather future financial shocks, or are spent on vacations, which provide similar economic benefits in other markets.

As communities become more trusting, they also become more durable; they can serve as a source of strength when times are tough. In 2012, after Hurricane Sandy hit the east coast, one of our hosts in New York asked for help changing her listing’s price to $0 in order to provide shelter to neighbors in need.
Our engineers worked around the clock to build this and other features that would enable our community to respond to natural disasters, which resulted in over 1,400 additional hosts making spaces available to those affected by the hurricane. Since then, we have evolved these tools into an international disaster response program, allowing our community to support those impacted by disasters — as well as the relief workers supporting the response — in cities around the world. In the last year we have responded to disasters and crises including the Nepal earthquake, the Syrian refugee crisis, the Paris attacks, and most recently Cyclone Winston in Fiji.

Joe, Brian, and Nate realized from the beginning how crucial trust was going to be for Airbnb. These are just some of the stories that have followed years of effort to build confidence in our platform and facilitate one-on-one trust. While the results are quite positive, we still have a long way to go.

One ongoing challenge for us is to concretely measure trust. We regularly ask hosts and guests about their travel experiences, their relationships with each other, and their perceptions of the Airbnb community overall. But none of those are perfect proxies for trust, and they don’t scale well. The standard mechanisms researchers often use to measure trust are cumbersome, and we can’t reliably infer trust from behavioral data. But we’re working to build this measurement capacity so we can continue to carefully design and optimize for trust over time.

Our motivation to understand trust doesn’t end with the need to build great products — it’s also about understanding what communities full of trust can do. In one night last year, 1.2m guests stayed with 300,000 hosts. Each of those encounters is an opportunity to break down barriers and form new relationships that further strengthen communities over time.

Beginning with Ourselves
http://nerds.airbnb.com/beginning-with-ourselves/ | Mon, 22 Feb 2016


In a recent post, we offered some insights into how we scaled Airbnb’s data science team in the context of hyper-growth. We aspired to build a team that was creative and impactful, and we wanted to develop a lasting, positive culture. Much of that depends on the points articulated in that previous post; however, there is another part of the story that deserves its own post, on a topic that has been receiving national attention: diversity.

For us, this challenge came into focus a year ago. We’d had a successful year of hiring in terms of volume, but realized that in our push for growth we were not being as mindful of culture and diversity as we wanted to be. For example, only 10% of our new data scientists were women, which meant both that we were out of sync with our community of guests and hosts, and that the existing female data scientists at Airbnb were quickly becoming outnumbered. This was far from intentional, but that was exactly the problem — our hiring efforts did not emphasize a gender-balanced team.

There are, of course, many ways to think about team balance; gender is just one dimension that stood out to us. And there are known structural issues that form a headwind against progress in achieving gender balance (source). So, in a hyper-growth environment where you’re under pressure to build your team, it is easy to recruit and hire a larger proportion of male data scientists.

But this was not the team we wanted to build. Homogeneity brings a narrower range of ideas and gathers momentum toward a vicious cycle, in which it becomes harder to attract and retain talent within a minority group as it becomes increasingly underrepresented. If Airbnb aspires to build a world where people can belong anywhere, we needed to begin with our team.

We worried that some form of unconscious bias had infiltrated our interviews, leading to lower conversion rates for women. But before diving into a solution, we decided to treat this like any problem we work on — begin with research, identify an opportunity, experiment with a solution, and iterate.

Over the year since, the results have been dramatic: 47% of hires were women, doubling the overall ratio of female data scientists on our team from 15% to 30%. The effect this has had on our culture is clear — in a recent internal survey, our team was found to have the highest average employee satisfaction in the company. In addition, 100% of women on our team indicated that they expect to still be here a year from now and felt like they belonged at Airbnb.

Our work is by no means done. There’s still more to learn and other dimensions of diversity to improve, but we feel good enough about our progress to share some insights. We hope that teams at other companies can adopt similar approaches and build a more balanced industry of data scientists.

When we analyze the experience of a guest or host on Airbnb, we break it into two parts: the top of the funnel (are there enough guests looking for places to stay, and enough hosts with available rooms?) and conversion (did we find the right match, and did it result in a booking?). Analyzing recruiting experiences is quite similar.

And, like any project, our first task was to clean our data. We used the EEOC reporting in Greenhouse (our recruiting tool) to better understand the diversity of our applicants, doing our own internal audit of data quality as well. One issue we faced is that while Greenhouse collects diversity data on applicants who apply directly through the Airbnb jobs page, it does not collect information on the demographics of referrals (candidates recommended for the job by current Airbnb employees), who represent a large fraction of hires. We then combined this with data from an internal audit of our team’s history and from Workday, our HR tool, in order to compare the composition of applicants to the composition of our team.

When we dug in, we found that historically about 30% of our applicants — the top of the funnel — had been women. This told us that there were opportunities for improvement on both fronts. Our proportion of female applicants was twice the proportion of women on our team, so there was clearly room for improvement in our hiring process — the conversion portion. But the applicant pool itself was far from male/female parity, so the top of the funnel could also prove a meaningful lever.
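As a sketch of the kind of funnel comparison this involves (all numbers below are invented for illustration, not our actual recruiting data), one can track each group's share of candidates at every stage; where a group's share shrinks between stages, conversion for that group is lower:

```python
# Invented example numbers — not Airbnb's actual recruiting data.
funnel = {
    "applied": {"women": 300, "men": 700},
    "onsite":  {"women": 45,  "men": 155},
    "hired":   {"women": 6,   "men": 34},
}

def share(counts: dict, group: str) -> float:
    """Fraction of a funnel stage made up by one group."""
    return counts[group] / sum(counts.values())

# A shrinking share from "applied" to "hired" localizes the drop-off.
for stage, counts in funnel.items():
    print(stage, f"{share(counts, 'women'):.1%}")
```

With Greenhouse and Workday data joined, the same comparison pinpoints which interview step, not just which funnel half, has the unequal conversion.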

In addition, we wanted to ensure that our efforts to diversify our data science team didn’t end with us. Making changes to the top of the funnel — to how many women want to and feel qualified to apply for data science jobs — could help us do that. Our end goal is to create a world where there is diversity across the entire data science field, not just at Airbnb.

We decided that the best way to achieve these goals would be to look beyond our own applicants to inspire and support women in the broader field. One observation was that while there were a multitude of meetups for women who code, and many great communities of women in engineering, we hadn’t seen the same proliferation of events for women in data science.

We decided to create a series of lightning talks featuring women in data, under the umbrella of the broader Airbnb “Taking Flight” initiative. The goals were twofold: to showcase the many contributions women make to the field, and to create a forum where the community could gather to celebrate them. At the same time, we wanted to highlight diversity on multiple dimensions. For each lightning talk, we created a panel of women from many different racial and ethnic backgrounds, practicing different types of data science. The talks were open to anyone who supported women in data science.

We came up with the title “Small Talks, Big Data” and started with an event in November 2014 where we served food and created a space and time for mingling. The event sold out, with over 100 RSVPs. Afterward we ran a survey to see what our attendees thought we could improve in subsequent events and turned “Small Talks, Big Data” into a series, all of which have continued to sell out. Given this level of interest, several of the women on our team volunteered to write blog posts about their accomplishments (for example, Lisa’s analysis of NPS and Ariana’s overview of machine learning) in order to circulate their stories beyond San Francisco, and to give talks and interviews (for example, Get to know Data Science Panelist Elena Grewal). Many applicants to our team have cited these talks and posts as inspirations to consider working at Airbnb.

In parallel to these large community events, we put together smaller get-togethers for senior women in the field to meet, support one another, and share best practices. We hosted an initial dinner at Airbnb and were amazed at the wonderful conversations and friendships sparked by the event. This group has continued to meet informally, with women from other companies taking the lead on hosting events at their offices, further exposing the group to opportunities across the field.

Increasing conversion

Alongside our efforts to broaden our applicant pool, we scrutinized our approach to interviewing. As with any conversion funnel, we broke our process down into discrete steps, allowing us to isolate where the drop-off was occurring.

There are essentially three stages to interviewing for a data science role at Airbnb: a take-home challenge used to assess technical skill and attention to detail; an onsite presentation demonstrating communication and analytical rigor; and a set of 1:1 conversations with future colleagues where we evaluate compatibility with our culture and fit for the role itself. Conversion in the third stage was relatively equal between men and women, but quite different in the first two.

We wanted to keep unconscious bias from affecting our grading of take-home challenges, whether from reviewers being swayed by the name and background of the candidate (via access to their resume) or from subjective views of what constitutes success. To combat this, we removed access to candidate names[1] and implemented a binary scoring system for the challenge, tracking whether candidates did or did not complete certain tasks, in an effort to make ratings clearer and more objective. We provided graders with a detailed description of what to look for and how to score, and trained them on past challenges before allowing them to grade candidates in flight. The same challenge would circulate through multiple graders to ensure consistency.
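A minimal sketch of what blind, binary grading can look like (the rubric items and identifiers here are hypothetical, not our actual rubric):

```python
# Hypothetical rubric — each item is scored pass/fail, leaving no
# free-form "overall impression" for bias to creep into.
RUBRIC = (
    "loaded_and_cleaned_data",
    "stated_assumptions",
    "validated_results",
    "answered_business_question",
)

def grade(submission_id: str, checks: dict) -> dict:
    """Graders see only an anonymous submission id, never a name or
    resume, and mark each rubric item as done or not done."""
    passed = sum(1 for item in RUBRIC if checks.get(item, False))
    return {"submission": submission_id, "score": passed, "out_of": len(RUBRIC)}
```

Circulating the same submission through multiple graders then becomes a consistency check: binary scores on concrete tasks should agree far more often than free-form ratings would.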

Our hypothesis for the onsite presentation was that we had created an environment that catered more to men. Often, a candidate would be escorted into a room where a panel of mostly male data scientists would scrutinize their approach to solving the onsite challenge. The most common critique of unsuccessful candidates was that they were ‘too junior’, stemming from poor communication or a lack of confidence. Our assumption was that this perception was skewed by candidates being nervous or intimidated in the presentation atmosphere we had created.

A few simple changes materially improved this experience. We made it a point to ensure women made up at least half of the interview panel for female candidates. We also began scheduling an informal coffee chat between the candidate and a member of the panel before the presentation, so they would have a familiar face in the room (we did this for both male and female candidates, and both groups appreciated the change). And, in our roundup discussions following the presentation, we would focus the conversation on objective traits of the presentation rather than subjective interpretations of overall success.

Taken together, these efforts had a dramatic effect on conversion rates. While our top-of-funnel initiatives increased the relative volume of female candidates, our interviewing initiatives helped create an environment in which female candidates would be just as likely to succeed as any male candidate. Furthermore, these changes to our process didn’t just help with diversity; they improved the candidate experience and effectiveness of hiring data scientists in general.

Why this is important

The steps we took over the last year grew the share of women on our team from 15% to 30%, which has made our team stronger and our work more impactful. How?

First, it makes us smarter (source) by allowing divergent voices, opinions, and ideas to emerge. As Airbnb scales, it has access to more data and increasingly relies upon the data science team’s creativity and sophistication for making strategic decisions about our future. If we were to maintain a homogeneous team, we would continue to rely upon the same approaches to the challenges we face: investing in the diversity of data scientists is an investment in the diversity of perspectives and ideas that will help us jump from local to global maxima. Airbnb is a global company, and people from a multitude of backgrounds use Airbnb. We can be smarter about how we understand that data when our team better reflects the different backgrounds of our guests and hosts.

Second, a diverse team allows us to better connect our insights with the company. The impact of a data science team is dependent upon its ability to influence the adoption of its recommendations. It is common for new members of the field to assume that statistical significance speaks for itself; however, colleagues in other fields tend to assume the statistical voodoo of a data scientist’s work is valid and instead focus on the way their ideas are conveyed. Our impact is therefore limited by our ability to connect with our colleagues and convince them of the potential our recommendations hold. Indeed, the pairing of personalities between data scientists and partners is often more impactful than the pairing of skillsets, especially at the leadership level. Increasing diversity is an investment in our ability to influence a broader set of our company’s leadership.

Finally, and perhaps most importantly, increasing our team’s diversity has improved our culture. The women on the data science team feel that they belong and that their careers can grow at Airbnb. As a result, they are more likely to stay with the company and are more invested in helping to build this team, referring people in their networks for open roles. We are not done, but we have reversed course from a vicious to a virtuous cycle. And the results aren’t restricted to women — the culture of the team as a whole has improved significantly over past years; in our annual internal survey, the data science team scores the highest in employee satisfaction across the company.

Of course, gender is only one dimension of diversity that we aim to balance within the team. In 2015 it was our starting point. As we look to 2016 and beyond, we will use this playbook to enhance diversity in other respects, and we expect this will strengthen our team, our culture, and our company.


Growth at Scale: Getting to product + sharing fit
http://nerds.airbnb.com/growth-at-scale/ | Mon, 01 Feb 2016


Travelers like to talk about Airbnb.

Some of this conversation occurs outside of Airbnb’s purview (like an in-person conversation or a Snap of an Airbnb listing), but other times it happens on airbnb.com, the iOS or Android apps, or on a site linking to Airbnb. In the latter case, we (Airbnb) have an opportunity to influence how the conversation transpires.

In this post, I will share a few features and optimizations that the Growth Team launched in 2015 to aid travelers in their conversation about Airbnb.

The first code I shipped at Airbnb added a Twitter Card to the referral page. Being a Twitter nerd, I knew how nice it is to share links backed by Twitter Cards, which give viewers more context (and hopefully increase the click-through rate). This is the current experience of sharing an Airbnb referral code on Twitter (via https://www.airbnb.com/invite):

The highlighted text gives the Twitter user a default tweet which they can modify. I like how ours includes the dollar amount and a short message.

This Twitter Card shows a title, image, and description for the shared link.

By adding the above Twitter Card, we reassured viewers about the referral link’s authenticity and surrounded the tweet with details which may have been left out of the user-customized message.
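Under the hood, a Twitter Card is just a handful of meta tags in the page's head. The property names below (`twitter:card`, `twitter:title`, and so on) are Twitter's documented tags; the content values and image URL are illustrative placeholders, not the exact copy we shipped:

```html
<!-- Minimal "summary" Twitter Card; values are illustrative -->
<meta name="twitter:card" content="summary">
<meta name="twitter:site" content="@Airbnb">
<meta name="twitter:title" content="Get $25 off your first trip on Airbnb">
<meta name="twitter:description" content="Sign up with this invite link to claim your travel credit.">
<meta name="twitter:image" content="https://www.airbnb.com/images/invite-card.png">
```

Twitter's crawler reads these tags from the shared URL, so the rich preview appears even when the user rewrites the default tweet text entirely.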

Airbnb Listings via Email and Facebook

The first iteration of our new listing share widget. See the current version here: https://www.airbnb.com/rooms/5781222

Earlier this year, we looked at how people were sharing Airbnb listings and realized that most travelers preferred to share via Email and Facebook. So we ran some experiments and ended up altering our sharing widget to focus on Email and Facebook and put everything else in a “More” dropdown. This drove a significant increase in sharing since we molded our product to match people’s intended conversation format.

The US version of the desktop sharing module for Airbnb listings.

One of my personal favorite changes of 2015 was adding Facebook Messenger as a prioritized sharing option ahead of Facebook timeline sharing. Our experiment results led us to further tailor our sharing to match a private trip-collaboration mindset rather than a mass share to all friends and followers.

Airbnb Listing Photos

A brand new type of sharing introduced in 2015 was the ability to share individual photos of listings. We based our work on a hunch that people might want to share different photos than just the first host-picked photo for a listing. After running a successful experiment, we made each photo shareable through all of our channels so now you can email, message, pin, or even embed any listing photo!

Sharing Airbnb in 2016

On the Growth Team we are always thinking about new ways to help people share the Airbnb experience. Each time we successfully add more metadata to a social share, tune the UI, or craft delightful contextual copy, we enhance digital conversations about Airbnb. Stay tuned for more features and updates in the coming year!


Airbnb Product and Engineering Teams Now Landing in Portland
http://nerds.airbnb.com/portland/ | Fri, 15 Jan 2016


Airbnb is on a mission to help people belong anywhere in the world. To reach this goal, we are building an exceptional product for our guests and hosts, with a fast-growing team of some of the smartest people in the industry. That’s why today, we are excited to announce that we are expanding our presence to include a product and engineering team at our award-winning office in Portland, Oregon.

This is the first time we’ve had engineers, product managers, designers, usability researchers, and data scientists outside of San Francisco, and we wanted to take a really thoughtful approach to expanding our team. For the last few months, an engineering landing team has laid the groundwork for some of the cool projects the team is going to work on this year, as well as scoped out the best places to get caffeinated. Additionally, we’ve gotten to know a lot of folks in Portland through hosting the HackPDX Winter Hackathon in December, and we are excited to announce that in 2016 we will be sponsoring Django Girls Portland.

Our new product team in Portland is going to build a world-class platform and set of tools for our global customer experience organization. To learn more about some of the projects the team has tackled so far, check out this blog post from Emre, one of the engineers on the team.


]]>
Mobile Infrastructure http://nerds.airbnb.com/mobile-infrastructure/ http://nerds.airbnb.com/mobile-infrastructure/#comments Mon, 14 Dec 2015 21:26:26 +0000 http://nerds.airbnb.com/?p=176595377 Building a Continuous testing environment for Android Performance Performance improvements in your application shouldn’t be put off until the last minute! But sadly, performance profiling & tooling is still pretty manual and archaic. In this talk, Colt McAnlis will walk through how to build an automated perf-testing environment for your code. Then, you can run […]

The post Mobile Infrastructure appeared first on Airbnb Engineering.

]]>

Building a Continuous testing environment for Android Performance

Performance improvements in your application shouldn’t be put off until the last minute! But sadly, performance profiling & tooling is still pretty manual and archaic. In this talk, Colt McAnlis will walk through how to build an automated perf-testing environment for your code. Then you can run tests daily, weekly, or whenever your co-worker checks in their code (you know the one… they like to use ENUMs everywhere…).

Colt McAnlis is a Developer Advocate at Google focusing on performance and compression. Before that, he was a graphics programmer in the games industry, working at Blizzard, Microsoft (Ensemble), and Petroglyph. He’s been an adjunct professor at SMU Guildhall, a Udacity instructor (twice), and a book author. When he’s not working with developers, Colt spends his time preparing for an invasion of giant ants from outer space.

Deep links are great opportunities to engage users by linking them to deeper content within an application. While Android provides mechanisms to handle deep links, they aren’t perfect and leave room to improve how deep links are managed. We’ll present our tool, DeepLinkDispatch, and demonstrate how Airbnb handles deep links in a structured and convenient way.

Christian Deonier is a member of the Airbnb Android team focused on foundation for features and architecture of the application. He is also the co-creator of DeepLinkDispatch, an Android library for deep links. Before joining Airbnb, he focused on making mobile applications on both iOS and Android, and also worked at Oracle. In his free time, he races cars and, true to Airbnb, travels frequently.

Felipe Lima is a Brazilian Software Engineer at Airbnb working on the Android team, focused on its infrastructure, developer productivity and Open Source tools. Before joining Airbnb, Felipe worked at We Heart It.

Localization on iOS

Airbnb is an international company that aims to bring a local experience to all of our users. Presenting information in a way that is appropriate for the user’s locale is crucial. In this talk, we will share some of the engineering challenges involved in delivering truly local experiences on iOS, and how we’ve solved them here at Airbnb.

Youssef Francis is a Software Engineer on the Airbnb iOS team, focused on improving the search, discovery and booking experience on mobile. Before joining Airbnb, he founded a small startup dedicated to building usability enhancements for jailbroken iOS devices, and has been a member of the iOS jailbreak community since 2007. He likes to play board games and solve puzzles.


]]>
How well does NPS predict rebooking? http://nerds.airbnb.com/nps-rebooking/ http://nerds.airbnb.com/nps-rebooking/#comments Thu, 10 Dec 2015 23:16:14 +0000 http://nerds.airbnb.com/?p=176595350 Data scientists at Airbnb collect and use data to optimize products, identify problem areas, and inform business decisions. For most guests, however, the defining moments of the “Airbnb experience” happen in the real world – when they are traveling to their listing, being greeted by their host, settling into the listing, and exploring the destination. […]

The post How well does NPS predict rebooking? appeared first on Airbnb Engineering.

]]>
Data scientists at Airbnb collect and use data to optimize products, identify problem areas, and inform business decisions. For most guests, however, the defining moments of the “Airbnb experience” happen in the real world – when they are traveling to their listing, being greeted by their host, settling into the listing, and exploring the destination. These are the moments that make or break the Airbnb experience, no matter how great we make our website. The purpose of this post is to show how we can use data to understand the quality of the trip experience, and in particular how the Net Promoter Score (NPS) adds value.

Currently, the best information we can gather about the offline experience is from the review that guests complete on Airbnb.com after their trip ends. The review, which is optional, asks for textual feedback and rating scores from 1-5 for the overall experience as well as subcategories: Accuracy, Cleanliness, Checkin, Communication, Location, and Value. Starting at the end of 2013, we added one more question to our review form, the NPS question.

NPS, or the “Net Promoter Score”, is a widely used customer loyalty metric introduced by Fred Reichheld in 2003 (https://hbr.org/2003/12/the-one-number-you-need-to-grow/ar/1). We ask guests “How likely are you to recommend Airbnb to a friend?” – a question called “likelihood to recommend”, or LTR. Guests who respond with a 9 or 10 are labeled “promoters”, or loyal enthusiasts, while guests who respond with a score of 0 to 6 are “detractors”, or unhappy customers. Those who leave a 7 or 8 are considered “passives”. Our company’s NPS is then calculated by subtracting the percent of detractors from the percent of promoters, and ranges from -100 (worst case: all responses are detractors) to +100 (best case: all responses are promoters).
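As a concrete sketch, the score can be computed directly from a batch of LTR responses (the scores below are hypothetical, not Airbnb data):

```python
def nps(ltr_scores):
    """Net Promoter Score from a list of 0-10 LTR responses."""
    promoters = sum(1 for s in ltr_scores if s >= 9)
    detractors = sum(1 for s in ltr_scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(ltr_scores)

# Hypothetical batch: 5 promoters, 3 passives, 2 detractors
scores = [10, 10, 9, 9, 9, 8, 7, 7, 6, 3]
print(nps(scores))  # 50% promoters - 20% detractors -> 30.0
```

Note that passives drop out of the numerator but still count in the denominator, which is why adding passive responses pulls the score toward zero.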

By measuring customer loyalty as opposed to satisfaction with a single stay, NPS surveys aim to be a more effective methodology to determine the likelihood that the customer will return to book again, spread the word to their friends, and resist market pressure to defect to a competitor. In this blog post, we look to our data to find out if this is actually the case. We find that higher NPS does in general correspond to more referrals and rebookings. But we find that controlling for other factors, it does not significantly improve our ability to predict if a guest will book on Airbnb again in the next year. Therefore, the business impact of increasing NPS scores may be less than what we would estimate from a naive analysis.

Methodology

We will refer to a single person’s response to the NPS question as their LTR (likelihood to recommend) score. While NPS ranges from -100 to +100, LTR is an integer that ranges from 0 to 10. In this study, we look at all guests with trips that ended between January 15, 2014 and April 1, 2014. If a guest took more than one trip within that time frame, only the first trip is considered. We then try to predict if the guest will make another booking with Airbnb, up to one year after the end of the first trip.

One thing to note is that leaving a review after a trip is optional, as are the various components of the review itself. A small fraction of guests do not leave a review, or leave a review but choose not to respond to the NPS question. While NPS is typically calculated only from responders, in this analysis we include non-responders by factoring in both guests who do not leave a review and those who leave a review but choose not to answer the NPS question.

To assess the predictive power of LTR, we control for other parameters that are correlated with rebooking. These include:

• Overall review score and responses to review subcategories. All review categories are on a scale of 1-5.
• Guest acquisition channel (e.g. organic or through marketing campaigns)
• Trip destination (e.g. America, Europe, Asia, etc.)
• Origin of guest
• Previous bookings by the guest on Airbnb
• Trip Length
• Number of guests
• Price per night
• Month of checkout (to account for seasonality)
• Room type (entire home, private room, shared room)
• Number of other listings the host owns

We acknowledge that our approach may have the following shortcomings:

• There may be other forms of loyalty not captured by rebooking. While we do look at referrals submitted through our company’s referral program, customer loyalty can also manifest through word-of-mouth referrals that are not captured in this study.
• There may be a longer time horizon for some guests to rebook. We look one year out, but some guests may travel less frequently and would rebook in two to three years.
• One guest’s LTR may not be a direct substitute for the aggregate NPS. It is possible that even if we cannot accurately predict one customer’s likelihood to rebook based on their LTR, we would fare better if we used NPS to predict an entire cohort’s likelihood to rebook.

Despite these shortcomings, we hope that this study will provide a data-informed way to think about the value NPS brings to our understanding of the offline experience.

Descriptive Stats of the Data

Our data covers more than 600,000 guests. Of the guests who submitted a review, two-thirds were NPS promoters, and more than half gave an LTR of 10. Of the 600,000 guests in our data set, only 2% were detractors.

While the overall review score for a trip is aimed at assessing the quality of the trip, the NPS question serves to gauge customer loyalty. We look at how correlated these two variables are by examining the distributions of LTR scores broken down by overall review score. Although LTR and the overall review rating are correlated, they do carry somewhat different information. For example, of the small number of guests who had a disappointing experience and left a 1-star review, 26% were actually promoters of Airbnb, indicating that they were still very positive about the company.

Keeping in mind that a very small fraction of our travelers are NPS detractors and that LTR is heavily correlated with the overall review score, we investigate how LTR correlates with rebooking and referral rates.

We count a guest as a referrer if they referred at least one friend via our referral system in the 12 months after trip end. We see that out of guests who responded to the NPS question, higher LTR corresponds to a higher rebook rate and a higher referral rate.

Without controlling for other variables, someone with an LTR of 10 is 13% more likely to rebook and 4% more likely to submit a referral in the next 12 months than a detractor (0-6). Interestingly, the increase in rebooking rate for responders is nearly linear in LTR (we did not have enough data to differentiate between people who gave responses between 0 and 6). This implies that, for Airbnb, collapsing people who respond with a 9 versus a 10 into one “promoter” bucket loses information. We also note that guests who did not leave a review behave the same as detractors; in fact, they are slightly less likely to rebook and submit a referral than guests with an LTR of 0-6. However, guests who submitted a review but did not answer the NPS question (labeled “no_nps”) behave similarly to promoters. These results indicate that when measuring NPS, it is important to track the response rate as well.

Next, we look at how other factors might influence rebooking rates. For instance, we find even from our 10 weeks of data that rebooking rates are seasonal. This is likely because off-season travelers tend to be loyal customers and frequent travelers.

We see that guests who had shorter trips are more likely to rebook. This could be because some guests use Airbnb mostly for longer stays, and they simply aren’t as likely to take another such trip in the next year.

We also see that the rebooking rate has a roughly parabolic relationship to the price per night of the listing: guests who stayed in very expensive listings are less likely to rebook, but so are guests who stayed in very cheap listings.

Which review categories are most predictive of rebooking?

In addition to the Overall star rating and the LTR score, guests can choose to respond to the following subcategories in their review, all of which are on a 1-5 scale:

• Accuracy
• Cleanliness
• Checkin
• Communication
• Location
• Value

In this section we will investigate the power of review ratings to predict whether or not a guest will take another trip on Airbnb in the 12 months after trip end. We will also study which subcategories are most predictive of rebooking.

To do this, we compare a series of nested logistic regression models. We start off with a base model whose predictors include only the non-review characteristics of the trip that we mentioned in the section above:

f0 = 'rebooked ~ dim_user_acq_channel + n_guests + nights + I(price_per_night*10) + I((price_per_night*10)^2) + guest_region + host_region + room_type + n_host_listings + first_time_guest + checkout_month'

Then, we build a series of models adding one of the review categories to this base model:

f1 = f0 + communication
f2 = f0 + cleanliness
f3 = f0 + checkin
f4 = f0 + accuracy
f5 = f0 + value
f6 = f0 + location
f7 = f0 + overall_score
f8 = f0 + ltr_score

We compare the quality of each of the models f1 to f8 against that of the nested model f0 by comparing the Akaike information criterion (AIC) of the fits. AIC trades off the goodness of fit of the model against the number of parameters, thus discouraging overfitting.
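The comparison can be sketched as follows on synthetic data (all variable names, coefficients, and sizes here are invented for illustration; the real models use the trip covariates listed earlier). AIC = 2k − 2 ln L, so a lower value is better:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def aic(model, X, y):
    """AIC = 2k - 2 ln(L) for a fitted binary logistic model."""
    k = X.shape[1] + 1  # one coefficient per feature, plus the intercept
    log_lik = -log_loss(y, model.predict_proba(X)[:, 1], normalize=False)
    return 2 * k - 2 * log_lik

# Synthetic stand-ins: three base trip covariates and an LTR score that
# genuinely influences rebooking.
rng = np.random.default_rng(0)
n = 5000
trip_info = rng.normal(size=(n, 3))
ltr = rng.integers(0, 11, size=n)
log_odds = 0.8 * trip_info[:, 0] + 0.1 * ltr - 1.0
rebooked = rng.random(n) < 1 / (1 + np.exp(-log_odds))

X0 = trip_info                          # base model f0
X1 = np.column_stack([trip_info, ltr])  # f8 = f0 + ltr_score
# Large C approximates an unpenalized fit, so the AICs are comparable
m0 = LogisticRegression(C=1e6, max_iter=1000).fit(X0, rebooked)
m1 = LogisticRegression(C=1e6, max_iter=1000).fit(X1, rebooked)
print(aic(m0, X0, rebooked), aic(m1, X1, rebooked))  # lower AIC wins
```

With a covariate that truly matters, the nested model’s likelihood gain easily outweighs the two-point AIC penalty for the extra parameter; when the added category carries little new information, the AIC barely moves or rises.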

If we were to include just one review category, LTR and overall score are essentially tied for first place. Adding any one of the subcategories also improves the model, but not as much as including overall score or LTR.

Next, we adjust our base model to include LTR and repeat the process to find the second review category to add.

Given LTR, the subcategory that improves our model the most is the overall review score. Adding a second review category to the model only marginally improves the fit (note the difference in scale between the two graphs).

We repeat this process, incrementally adding review categories to the model until the additions no longer produce a statistically significant improvement. We are left with the following set of review categories:

• LTR
• Overall score
• Any three of the six subcategories

These findings show that because the review categories are strongly correlated with one another, once we have the LTR and the overall score, we only need three of the six subcategories to optimize our model. Adding more subcategories will add more degrees of freedom without significantly improving the predictive accuracy of the model.

Finally we tested the predictive accuracies of our models:

Categories                                    Accuracy
LTR only                                      55.997%
Trip info only                                63.495%
Trip info + LTR                               63.58%
Trip info + other review categories           63.593%
Trip info + LTR + other review categories     63.595%

Using only a guest’s LTR at the end of a trip, we can accurately predict whether they will rebook in the next 12 months 56% of the time. Given just basic information we know about the guest, host, and trip, we improve this predictive accuracy to 63.5%. Adding review categories (not including LTR) yields an additional 0.1% improvement. Given all this, adding LTR to the model improves the predictive accuracy by only another 0.002%.

Conclusions

Post-trip reviews (including LTR) only marginally improve our ability to predict whether or not a guest rebooks within 12 months of checkout. Controlling for trip and guest characteristics, review star ratings improve our predictive accuracy by only ~0.1%. Of all the review categories, LTR is the most useful in predicting rebooking, but it adds only a 0.002% increase in predictive accuracy once we control for the other review categories. This is because LTR and review scores are highly correlated.

Reviews serve purposes other than predicting rebooking: they enable trust in the platform, help hosts build their reputation, and can also be used for host quality enforcement. We found that guests with higher LTR are more likely to refer someone through our referral program; they may also be more likely to refer through word of mouth, while detractors could deter potential users from joining the platform. These additional ways in which NPS could be connected to business performance are not explored here. But given the extremely low number of detractors and passives, and the marginal power post-trip LTR has in predicting rebooking, we should be cautious about putting excessive weight on guest NPS.


]]>
How Technology and Engineers Can Impact Social Change http://nerds.airbnb.com/impacting-social-change/ http://nerds.airbnb.com/impacting-social-change/#comments Wed, 18 Nov 2015 23:55:32 +0000 http://nerds.airbnb.com/?p=176595344 At the OpenAir 2015 conference, the panel discussion “How Tech Can Reach Underserved Communities” explored how technology can create positive social change—and how engineers in particular can make a difference. The panelists: Alanna Scott, engineer, Airbnb Grace Garey, co-founder, Watsi Raquel Romano, software engineer, Google.org Moderator: Mario Lugay, impact advisor, Kapor Center for Social Impact […]

The post How Technology and Engineers Can Impact Social Change appeared first on Airbnb Engineering.

]]>

At the OpenAir 2015 conference, the panel discussion “How Tech Can Reach Underserved Communities” explored how technology can create positive social change—and how engineers in particular can make a difference.

The panelists:

Alanna Scott, engineer, Airbnb
Grace Garey, co-founder, Watsi
Raquel Romano, software engineer, Google.org
Moderator: Mario Lugay, impact advisor, Kapor Center for Social Impact

Some highlights (edited for brevity or clarity):

Examples of projects to reach underserved communities (3:11 in the video)

Romano said she had been working with a Google.org group focused on crisis response and reaching people before, during, and after a natural disaster. For example, the team developed data feeds that would provide warnings about impending local floods or hurricanes in relevant search results for Google users.

Scott said Airbnb started a Disaster Response Tool three years ago in the wake of Hurricane Sandy. “We were inspired by a host (in the area where the storm hit) who started opening up her home to people who had been displaced. We wanted to build something to support what she was doing and enable the rest of our host community to participate as well.”

Scott: “The Disaster Response Tool was built as a side project. But now we can activate the tool within minutes for a specific location or area that has been hit by a natural disaster. Hosts can list their space for free, we waive all of our fees, and we create a way for displaced people in that area to find a place to stay.”

Garey: “Watsi is entirely a social impact organization. We let people directly fund healthcare for people all around the world, and 100 percent of donations go to the patient. Technology seemed to be the answer we needed to focus on. We saw people using technology like Airbnb to bust open narrow channels to allow person-to-person interaction and create new ways to solve a problem. So we decided to do the same thing to tackle healthcare in a new way.”

How technology can make a difference (16:13)

Scott: “In the case of a natural disaster, people don’t always have reliable Internet access, or they might not have much battery left on their phone. So we’ve been thinking about how those people can use Airbnb when they are facing technical limitations.”

Helping people in disaster-hit areas may require “using old technology” rather than the latest tech, Scott continued. For example, SMS messaging often continues to work after a disaster when phone calls, email and online access can be difficult, so Airbnb has been exploring ways for users to book or accept reservations through SMS.

Romano: “We’re working on an initiative at Google.org to see how technology can help people with disabilities live more independently. What if we could recognize and translate sign language? What if we could analyze content in video and provide natural language descriptions of it?” Another area of investigation is mobility, in which “eye trackers connect to a communication device, so you can communicate with the world by typing with your eyes.”

How engineers can contribute to social change (20:46)

Scott: “We have a woman user in Florence who donates 50 percent of her Airbnb earnings to a community art project. Another user donates 10 percent of his earnings, and he and his guest decide together which local organization to contribute to. So my advice is to look at how your users are already helping other people with your product, then figure out how to scale it and open it up to your whole community.”

Romano recommended marrying your passion for technology with social issues you care about, because the two are “an amazing combination.” Find others with shared passions by asking around. “Talk to people about what they’re working on and tell them what you’re interested in.”

Romano added that “it’s really hard when you’re trying to prioritize and focus to create space and resources to work on (social impact projects). What works is when people just start doing things (for social impact) without asking for permission. You get other passionate people together and come up with a proof of concept and you can start seeing how it could be better if you had a product manager, user experience person, and multiple engineers working on it.”

Garey added that in 10 to 15 years, the areas of engineering and social change will blur. “So don’t feel like you have to make a choice between working at a company with a product that’s creating value and making a lot of money vs. doing something that’s good for the world. You can do well and do good at the same time.”


]]>
Confidence Splitting Criterions Can Improve Precision And Recall in Random Forest Classifiers http://nerds.airbnb.com/confidence-splitting-criterions/ http://nerds.airbnb.com/confidence-splitting-criterions/#comments Tue, 20 Oct 2015 16:34:09 +0000 http://nerds.airbnb.com/?p=176595333 The Trust and Safety Team maintains a number of models for predicting and detecting fraudulent online and offline behaviour. A common challenge we face is attaining high confidence in the identification of fraudulent actions. Both in terms of classifying a fraudulent action as a fraudulent action (recall) and not classifying a good action as a […]

]]>
The Trust and Safety Team maintains a number of models for predicting and detecting fraudulent online and offline behaviour. A common challenge we face is attaining high confidence in the identification of fraudulent actions, both in terms of classifying a fraudulent action as fraudulent (recall) and not classifying a good action as fraudulent (precision).

A classification model we often use is a Random Forest Classifier (RFC). However, by adjusting the logic of this algorithm slightly, so that we look for high confidence regions of classification, we can significantly improve the recall and precision of the classifier’s predictions. To do this we introduce a new splitting criterion (explained below) and show experimentally that it can enable more accurate fraud detection.

A RFC is a collection of randomly grown ‘Decision Trees’. A decision tree is a method for partitioning a multi-dimensional space into regions of similar behaviour. In the context of fraud detection, identifying events as ‘0’ for non-fraud and ‘1’ for fraud, a decision tree is binary and tries to find regions in the signal space that are mainly 0s or mainly 1s. Then, when we see a new event, we can look at which region it belongs to and decide if it is a 0s region or a 1s region.

Typically, a Decision Tree is grown by starting with the whole space, and iteratively dividing it into smaller and smaller regions until a region only contains 0s or only contains 1s. Each final uniform region is called a ‘leaf’. The method by which a parent region is partitioned into two child regions is often referred to as the ‘Splitting Criterion’. Each candidate partition is evaluated and the partition which optimises the splitting criterion is used to divide the region. The parent region that gets divided is called a ‘node’.

Suppose a node has $$N$$ observations and take $$L_0^{(i)}$$ and $$R_0^{(i)}$$ to denote the number of 0s in the left and right child respectively, and similarly $$L_1^{(i)}$$ and $$R_1^{(i)}$$ for 1s. So for each candidate partition $$i$$ we have $$N=L_0^{(i)}+L_1^{(i)}+R_0^{(i)}+R_1^{(i)}$$. Now let $$l_0^{(i)}=L_0^{(i)}/(L_0^{(i)}+L_1^{(i)})$$ be the probability of selecting a 0 in the left child node for partition $$i$$, and similarly denote the probabilities $$l_1^{(i)}$$, $$r_0^{(i)}$$, and $$r_1^{(i)}$$. The two most common splitting criterions are:

A. Gini Impurity: Choose $$i$$ to minimise the probability of mislabeling i.e. $$i_{gini} = \arg \min_i H_{gini}^{(i)}$$ where

$$H_{gini}^{(i)} = \frac{L_0^{(i)}+L_1^{(i)}}{N} \big[ l_0^{(i)} (1-l_0^{(i)}) + l_1^{(i)} (1-l_1^{(i)}) \big] + \frac{R_0^{(i)}+R_1^{(i)}}{N} \big[ r_0^{(i)} (1-r_0^{(i)}) + r_1^{(i)} (1-r_1^{(i)}) \big]$$

B. Entropy: Choose $$i$$ to maximise the informational content of the labeling i.e. $$i_{entropy} = \arg \min_i H_{entropy}^{(i)}$$ where

$$H_{entropy}^{(i)} = - \frac{L_0^{(i)}+L_1^{(i)}}{N} \big[ l_0^{(i)} \log(l_0^{(i)}) + l_1^{(i)} \log(l_1^{(i)}) \big] - \frac{R_0^{(i)}+R_1^{(i)}}{N} \big[ r_0^{(i)} \log(r_0^{(i)}) + r_1^{(i)} \log(r_1^{(i)}) \big].$$

However, notice that both of these criterions are scale invariant: a node with $$N=300$$ observations and partition given by $$L_1=100,L_0=100,R_1=100,R_0=0$$ achieves an identical splitting criterion score $$H_{gini}=1/3$$ to a node with $$N=3$$ observations and $$L_1=1,L_0=1,R_1=1,R_0=0$$. The former split is a far stronger result (a more difficult partition to achieve) than the latter. It may be useful to have a criterion that can differentiate between the likelihoods of these two partitions.
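To make the scale invariance concrete, here is a small sketch computing the weighted Gini impurity of the two partitions above; both score exactly 1/3 despite the 100x difference in sample size:

```python
def gini_split(L0, L1, R0, R1):
    """Weighted Gini impurity of a candidate binary partition."""
    def gini(a, b):
        n = a + b
        if n == 0:
            return 0.0
        pa, pb = a / n, b / n
        return pa * (1 - pa) + pb * (1 - pb)
    N = L0 + L1 + R0 + R1
    return (L0 + L1) / N * gini(L0, L1) + (R0 + R1) / N * gini(R0, R1)

# The two partitions from the text score identically even though the
# first is based on 100x more observations.
print(gini_split(100, 100, 0, 100))  # N = 300 -> 0.333...
print(gini_split(1, 1, 0, 1))        # N = 3   -> 0.333...
```

Because only the proportions enter the formula, no amount of extra evidence makes a split look more trustworthy to Gini (or entropy), which is the gap the confidence criterion below is designed to close.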

Confidence Splitting Criterion

Theory

Let $$L^{(i)}=L_0^{(i)}+L_1^{(i)}$$ and $$R^{(i)}=R_0^{(i)}+R_1^{(i)}$$. And let $$p_0$$ be the proportion of 0s and $$p_1$$ the proportion of 1s in the node we wish to split. Then we want to find partitions where the distribution of 0s and 1s is unlikely to have occurred by chance. If the null hypothesis is that each observation is a 0 with probability $$p_0$$, then the probability of $$L_0^{(i)}$$ or more 0s occurring in a partition of $$L^{(i)}$$ observations is given by the Binomial random variable $$X(n,p)$$:

$$\mathbb{P}[X(L^{(i)},p_0) \ge L_0^{(i)}] = 1 - B(L_0^{(i)};L^{(i)},p_0)$$

where $$B(x;N,p)$$ is the cumulative distribution function for a Binomial random variable $$X=x$$ with $$N$$ trials and probability $$p$$ of success. Similarly for $$p_1$$, the probability of $$L_1^{(i)}$$ or more 1s occurring in the left partition is $$1-B(L_1^{(i)};L^{(i)},p_1)$$. Taking the minimum of these two probabilities gives us the equivalent of a two-tailed hypothesis test: the probability is essentially the p-value, under the null hypothesis, of a partition at least as extreme as the one given. We can repeat the statistical test for the right partition and take the product of the two p-values to give an overall partition probability.
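The two-tailed test for one child can be sketched with scipy’s binomial survival function (the counts and proportions below are hypothetical; the post’s own implementation lives inside a patched scikit-learn, not scipy):

```python
from scipy.stats import binom

def partition_p_value(L0, L1, p0, p1):
    """Two-tailed p-value for one child of a candidate partition.

    p0, p1 are the node-level proportions of 0s and 1s (the null
    hypothesis); L0, L1 are the counts that landed in this child.
    """
    L = L0 + L1
    tail0 = binom.sf(L0 - 1, L, p0)  # P[X(L, p0) >= L0]
    tail1 = binom.sf(L1 - 1, L, p1)  # P[X(L, p1) >= L1]
    return min(tail0, tail1)

# Hypothetical node with 30% 1s: a left child of 200 observations that
# contains 100 fraudulent events is extremely unlikely under the null.
print(partition_p_value(100, 100, 0.7, 0.3))
```

A child whose composition simply mirrors the parent’s proportions yields a p-value near 0.5 from both tails, so only genuinely surprising partitions score well.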

Now, to bias the splitting towards identifying high-density regions of 1s in the observation space, one idea is to modify $$B(L_0^{(i)};L^{(i)},p_0)$$ to be non-zero only if it is sufficiently large. In other words, replace it with

$$\mathbb{1}_{ \{B(L_0^{(i)};L^{(i)},p_0) \ge C_0 \} }B(L_0^{(i)};L^{(i)},p_0)$$

where $$C_0 \in [0,1]$$ is the minimum confidence we require. $$C_0$$ might take value 0.5, 0.9, 0.95, 0.99, etc. It should be chosen to match the identification desired. Similarly for $$C_1$$. Thus we propose a new splitting criterion given by:

C. Confidence: Choose $$i$$ to minimise the probability of the partition being chance i.e. $$i_{confidence} = \arg \min_i H_{confidence}^{(i)}$$ where

$$H_{confidence}^{(i)} = \min_{j=0,1} \{1 - \mathbb{1}_{ \{ B(L_j^{(i)};L^{(i)},p_j) \ge C_j \} } \, B(L_j^{(i)};L^{(i)},p_j) \} \times \min_{j=0,1} \{1 - \mathbb{1}_{ \{ B(R_j^{(i)};R^{(i)},p_j) \ge C_j \} } \, B(R_j^{(i)};R^{(i)},p_j) \}$$

where $$C_0$$ and $$C_1$$ are to be chosen to optimise the identification of 0s or 1s respectively.

Implementation

To run this proposed new splitting criterion, we cut a new branch of the open-source Python scikit-learn repository and updated its Random Forest Classifier library. Two modifications were made to the analytical $$H_{confidence}^{(i)}$$ function to optimise the calculation:

1. Speed: Calculating the exact value of $$B(x;N,p)$$ is expensive, especially over many candidate partitions for many nodes across many trees. For large values of $$N$$ we can approximate the Binomial cumulative distribution function by the Normal cumulative distribution $$\Phi(x;Np,\sqrt{Np(1-p)})$$ which itself can be approximated using Karagiannidis & Lioumpas (2007) as a ratio of exponentials.
2. Accuracy: For large values of $$N$$ and small values of $$p$$, the tails of the Binomial distribution can be very small, so subtraction and multiplication of tail values can be corrupted by the limits of machine precision. To overcome this, we take the logarithm of the above approximation when calculating $$H_{confidence}^{(i)}$$.
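Both tweaks can be illustrated with scipy (arbitrary numbers, and scipy’s Normal log-survival function standing in for the Karagiannidis & Lioumpas exponential-ratio approximation used in the actual branch):

```python
import math
from scipy.stats import binom, norm

# Deep in the tail of a large-N Binomial, the exact survival probability
# underflows double precision, while the Normal approximation evaluated
# in log space stays finite and usable.
N, p = 1_000_000, 0.001
x = 2500                                       # far above the mean N*p = 1000
mu, sigma = N * p, math.sqrt(N * p * (1 - p))

exact = binom.sf(x - 1, N, p)                  # underflows toward 0.0
log_tail = norm.logsf(x, loc=mu, scale=sigma)  # log P[X >= x], finite
print(exact, log_tail)
```

Working with `log_tail` directly lets the criterion compare astronomically small p-values that are indistinguishable from zero in linear space.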

After these tweaks to the algorithm we find an insignificant change to the runtime of the Scikit-Learn routines. The Python code with the new criterion looks something like this:


from sklearn.ensemble import RandomForestClassifier

# 'conf' is the new criterion from our scikit-learn branch (not available
# in upstream scikit-learn), here with [C_0, C_1] = [0.95, 0.95]
rfc = RandomForestClassifier(n_estimators=1000, criterion='conf', conf=[0.95, 0.95])
rfc.fit(x_train, y_train)
pred = rfc.predict_proba(x_test)


For more details on the Machine Learning model building process at Airbnb you can read previous posts such as Designing Machine Learning Models: A Tale of Precision and Recall and How Airbnb uses machine learning to detect host preferences. And for details on our architecture for detecting risk you can read more at Architecting a Machine Learning System for Risk.

Evaluation

Data

To test the improvements the Confidence splitting criterion can provide, we use the same dataset as in the previous post Overcoming Missing Values In A Random Forest Classifier, namely the adult dataset from the UCI Machine Learning Repository. As before, the goal is to predict whether the income level of the adult is greater than or less than \$50k per annum, using the 14 features provided.

We tried 6 different combinations of $$[C_0,C_1]$$ against the baseline RFC with Gini Impurity and looked at the changes in the Precision-Recall curves. As always, we train on one portion of the data and evaluate on a held-out test set. We build an RFC of 1000 trees in each of the 7 scenarios.

Results

Observe that $$C_0=0.5$$ (yellow and blue lines) offers very little improvement over the baseline RFC: modest absolute recall improvements of 5% at the 95% precision level. For $$C_0=0.9$$ (green and purple lines), however, we see a steady increase in recall at precision levels of 45% and upwards. At 80% precision and above, $$C_0=0.9$$ improves recall by an absolute 10%, rising to 13% at the 95% precision level. There is little variation between $$C_1=0.9$$ (green line) and $$C_1=0.99$$ (purple line) for $$C_0=0.9$$, although $$[C_0,C_1]=[0.9,0.9]$$ (green line) does seem to be superior. For $$C_0=0.99$$ (pale blue and pink lines), the improvement is not as impressive or consistent.

Final Thoughts

It would be useful to extend the analysis to compare the new splitting criterion against optimising existing hyper-parameters. In the Scikit-Learn implementation of RFCs we could experiment with min_samples_split or min_samples_leaf to overcome the scaling problem. We could also test different values of class_weight to capture the asymmetry introduced by non-equal $$C_0$$ and $$C_1$$.
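A comparison against tuning the existing knobs could be sketched as below, assuming stock scikit-learn; the grid values are illustrative, not the ones we would necessarily use:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data with a class imbalance similar in spirit to fraud detection.
X, y = make_classification(n_samples=1000, n_features=14,
                           weights=[0.8, 0.2], random_state=0)

# Grid over the stock hyper-parameters mentioned above.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={
        'min_samples_leaf': [1, 5, 20],
        'class_weight': [None, {0: 1, 1: 5}],  # asymmetry akin to unequal C_0, C_1
    },
    scoring='average_precision',  # summarizes the precision-recall curve
    cv=3,
)
grid.fit(X, y)
```

`grid.best_params_` then gives the strongest stock configuration to benchmark the Confidence criterion against.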

More work can be done on the implementation of this methodology, and there is still some outstanding analytical investigation into how the confidence thresholds $$C_j$$ tie to the improvements in recall or precision. Note however that the methodology already generalises to non-binary classifiers, i.e. where $$j = 0, 1, 2, 3, \ldots$$. It could also be useful to implement this new criterion in the Apache Spark RandomForest library.

For the dataset examined, the new splitting criterion seems to be able to better identify regions of higher density of 0s or 1s. Moreover, by taking into account the size of the partition and the probability of such a distribution of observations under the null hypothesis, we can better detect 1s. In the context of Trust and Safety, this translates into being able to more accurately detect fraudulent actions.

The business implications of moving the Receiver Operating Characteristic curve outwards (equivalently, moving the Precision-Recall curve outwards) have been discussed in a previous post. As described in the ‘Efficiency Implications’ section of the Overcoming Missing Values In A Random Forest Classifier post, even gains of a fraction of a percentage point in recall or precision can lead to enormous dollar savings in fraud mitigation and operational efficiency respectively.

How We Partitioned Airbnb’s Main Database in Two Weeks http://nerds.airbnb.com/how-we-partitioned-airbnbs-main-db/ http://nerds.airbnb.com/how-we-partitioned-airbnbs-main-db/#comments Tue, 06 Oct 2015 17:08:05 +0000 http://nerds.airbnb.com/?p=176595306 “Scaling = replacing all components of a car while driving it at 100mph” – Mike Krieger, Instagram Co-founder @ Airbnb OpenAir 2015 Airbnb peak traffic grows at a rate of 3.5x per year, with a seasonal summer peak. Heading into the 2015 summer travel season, the infrastructure team at Airbnb was hard at work scaling […]

The post How We Partitioned Airbnb’s Main Database in Two Weeks appeared first on Airbnb Engineering.


“Scaling = replacing all components of a car while driving it at 100mph”

– Mike Krieger, Instagram Co-founder @ Airbnb OpenAir 2015

Airbnb peak traffic grows at a rate of 3.5x per year, with a seasonal summer peak.

Heading into the 2015 summer travel season, the infrastructure team at Airbnb was hard at work scaling our databases to handle the expected record summer traffic. One particularly impactful project aimed to partition certain tables by application function onto their own database, which typically would require a significant engineering investment in the form of application layer changes, data migration, and robust testing to guarantee data consistency with minimal downtime. In an attempt to save weeks of engineering time, one of our brilliant engineers proposed the intriguing idea of leveraging MySQL replication to do the hard part of guaranteeing data consistency. (This idea is independently listed as an explicit use case of Amazon RDS’s “Read Replica Promotion” functionality.) By tolerating a brief and limited downtime during the database promotion, we were able to perform this operation without writing a single line of bookkeeping or migration code. In this blog post, we will share some of our work and what we learned in the process.

First, some context

We tend to agree with our friends at Asana and Percona that horizontal sharding is bitter medicine, and so we prefer vertical partitions by application function for spreading load and isolating failures. For instance, we have dedicated databases, each running on its own dedicated RDS instance, that map one-to-one to our independent Java and Rails services. However for historical reasons, much of our core application data still live in the original database from when Airbnb was a single monolithic Rails app.

Using a client side query profiler that we built in-house (it’s client side due to the limitations of RDS) to analyze our database access pattern, we discovered that Airbnb’s message inbox feature, which allows guests and hosts to communicate, accounted for nearly 1/3 of the writes on our main database. Furthermore, this write pattern grows linearly with traffic, so partitioning it out would be a particularly big win for the stability of our main database. Since it is an independent application function, we were also confident that all cross-table joins and transactions could be eliminated, so we began prioritizing this project.

In examining our options for this project, two realities influenced our decision making. First, the last time we partitioned a database was three years ago in 2012, so pursuing this operation at our current size was a new challenge for us and we were open to minimizing engineering complexity at the expense of planned downtime. Second, as we entered 2015 with around 130 software engineers, our teams were spread across a large surface area of products–ranging from personalized search, customer service tools, trust and safety, global payments, to reliable mobile apps that assume limited connectivity–leaving only a small fraction of engineering dedicated to infrastructure. With these considerations in mind, we opted to make use of MySQL replication in order to minimize the engineering complexity and investment needed.

Our plan

The decision to use MySQL’s built-in replication to migrate the data for us meant that we no longer had to build the most challenging pieces to guarantee data consistency ourselves as replication was a proven quantity. We run MySQL on Amazon RDS, so creating new read replicas and promoting a replica to a standalone master is easy. Our setup resembled the following:

We created a new replica (message-master) from our main master database that would serve as the new independent master after its promotion. We then attached a second-tier replica (message-replica) that would serve as the message-master’s replica. The catch is that the promotion process can take several minutes or longer to complete, during which time we have to intentionally fail writes to the relevant tables to maintain data consistency. Given that a site-wide downtime from an overwhelmed database would be much more costly than a localized and controlled message inbox downtime, the team was willing to make this tradeoff to cut weeks of development time. It is worth mentioning that for those who run their own database, replication filters could be used to avoid replicating unrelated tables and potentially reduce the promotion period.

Phase one: preplanning

Moving message inbox tables to a new database could render existing queries with cross-table joins invalid after the migration. Because a database promotion cannot be reverted, the success of this operation depended on our ability to identify all such cases and deprecate them or replace them with in-app joins. Fortunately, our internal query analyzer allowed us to easily identify such queries for most of our main services, and we were able to revoke relevant database permission grants for the remaining services to gain full coverage. One of the architectural tenets that we are working towards at Airbnb is that services should own their own data, which would have greatly simplified the work here. While technically straightforward, this was the most time consuming phase of the project as it required a well-communicated cross-team effort.

Next, we have a very extensive data pipeline that powers both offline data analytics and downstream production services, so the next step in the preplanning was to move all of the relevant pipelines to consume the data exports of message-replica, ensuring that we consume the newest data after the promotion. One side effect of our migration plan was that the new database would have the same name as our existing database (not to be confused with the names of our RDS instances, e.g. message-master and message-replica) even though the data would diverge after the promotion. However, this actually allowed us to keep our naming convention consistent in our data pipelines, so we opted not to pursue a database rename.

Lastly, because our main Airbnb Rails app held exclusive write access to these tables, we were able to swap all relevant service traffic to the new message database replica to reduce the complexity of the main operation.

Phase two: the operation

Members of the Production Infrastructure team on the big day.

Once all the preplanning work was done, the actual operation was performed as follows:

1. Communicate the planned sub-10 minute message inbox downtime with our customer service team. We are very sensitive to the fact that any downtime could leave guests stranded in a foreign country as they try to check-in to their Airbnb, so it was important to keep all relevant functions in the loop and perform the op during the lowest weekly traffic.
2. Deploy change for message inbox queries to use the new message database user grants and database connections. At this stage, we still point the writes to the main master while reads go to the message replica, and so this should have no outward impact yet. However, we delay this step until the op begins because it doubles the connections to the main master, so we want this stage to be as brief as possible. Swapping the database host in the next step does not require a deploy as we have configuration tools to update the database host entries in Zookeeper, where they can be discovered by SmartStack.
3. Swap all message inbox write traffic to the message master. Because it has not been promoted yet, all writes on the new master fail and we start clocking our downtime. While read queries will succeed, in practice nearly all of messaging is down during this phase because marking a message as read requires a db write.
4. Kill all database connections on the main master with the message database user introduced in step 2. By killing connections directly, as opposed to doing a deploy or cluster restart, we minimize the time it takes to move all writes to the replica that will serve as the new master, a prerequisite for replication to catch up.
5. Verify that replication has caught up by inspecting:
1. The newest entries in all the message inbox tables on message master and message replica
2. All message connections on the main master are gone
3. New connections on the message master are made
6. Promote message master. From our experience, the database is completely down for about 30 seconds during a promotion on RDS and in this time reads on the master fail. However, writes will fail for nearly 4 minutes as it takes about 3.5 minutes before the promotion kicks in after it is initiated.
7. Enable Multi-AZ deployment on the newly-promoted message master before the next RDS automated backup window. In addition to improved failover support, Multi-AZ minimizes latency spikes during RDS snapshots and backups.
8. Once all the metrics look good and databases stable, drop irrelevant tables on the respective databases. This wrap-up step is important to ensure that no service consumes stale data.

Should the op have failed, we would have reverted the database host entries in Zookeeper and the message inbox functionality would have been restored almost immediately. However, we would have lost any writes that made it to the now-independent message databases. Theoretically it would be possible to backfill to restore the lost messages, but it would be a nontrivial endeavor and confusing for our users. Thus, we robustly tested each of the above steps before pursuing the op.

The result

Clear drop in main database master writes.

End-to-end, this project took about two weeks to complete, incurred just under 7 1/2 minutes of message inbox downtime, and reduced the size of our main database by 20%. Most significantly, it brought us major database stability gains by reducing the write queries on our main master database by 33%. These offloaded queries were projected to grow by another 50% in the coming months, which would certainly have overwhelmed our main database, so this project bought us valuable time to pursue longer-term database stability and scalability investments.

One surprise: RDS snapshots can significantly elevate latency

According to the RDS documentation:

Unlike Single-AZ deployments, I/O activity is not suspended on your primary during backup for Multi-AZ deployments for the MySQL, Oracle, and PostgreSQL engines, because the backup is taken from the standby. However, note that you may still experience elevated latencies for a few minutes during backups for Multi-AZ deployments.

We generally have Multi-AZ deployment enabled on all master instances of RDS to take full advantage of RDS’s high availability and failover support. During this project, we observed that given a sufficiently heavy database load, the latency experienced during an RDS snapshot even with Multi-AZ deployment can be significant enough to create a backlog of our queries and bring down our database. We were always aware that snapshots lead to increased latency, but prior to this project we had not been aware of the possibility of full downtime from nonlinear increases in latency relative to database load.

This is significant given that snapshots are a core RDS functionality that we depend on for daily automated backups. Previously unbeknownst to us, as the load on our main database increased, so did the likelihood of RDS snapshots causing site instability. Thus, in pursuing this project, we realized that it had been more urgent than we initially anticipated.

Xinyao, lead engineer on the project, celebrates after the op.

Acknowledgements: Xinyao Hu led the project while I wrote the initial plan with guidance from Ben Hughes and Sonic Wang. Brian Morearty and Eric Levine helped refactor the code to eliminate cross-table joins. The Production Infrastructure team enjoyed a fun afternoon running the operation.

Check out more past projects from the Production Infrastructure team on the Airbnb Engineering blog.


Unboxing the Random Forest Classifier: The Threshold Distributions http://nerds.airbnb.com/unboxing-the-random-forest-classifier/ http://nerds.airbnb.com/unboxing-the-random-forest-classifier/#comments Thu, 01 Oct 2015 20:18:30 +0000 http://nerds.airbnb.com/?p=176595290 In the Trust and Safety team at Airbnb, we use the random forest classifier in many of our risk mitigation models. Despite our successes with it, the ensemble of trees along with the random selection of features at each node makes it difficult to succinctly describe how features are being split. In this post, we […]

The post Unboxing the Random Forest Classifier: The Threshold Distributions appeared first on Airbnb Engineering.


In the Trust and Safety team at Airbnb, we use the random forest classifier in many of our risk mitigation models. Despite our successes with it, the ensemble of trees along with the random selection of features at each node makes it difficult to succinctly describe how features are being split. In this post, we propose a method to aggregate and summarize those split values by generating weighted threshold distributions.

Motivation

Despite its versatility and out-of-the-box performance, the random forest classifier is often referred to as a black box model. It is easy to see why some might be inclined to think so. First, the optimal decision split at each node is only drawn from a random subset of the feature set. And to make matters more obscure, the model generates an ensemble of trees using bootstrap samples of the training set. All this just means that a feature might split at different nodes of the same tree, possibly with different split values, and this could be repeated in multiple trees.

With all this randomness being thrown around, one certainty remains: after training a random forest, we know exactly every detail of the forest. For each node, we know which feature is used to split, with what threshold value, and with what efficiency. All the details are there; the challenge is knowing how to piece them together to build an accurate and informative description of the allegedly black box.

One common way to describe a trained random forest in terms of its features is to rank them by importance based on their splitting efficiencies. Although this method manages to quantify the impurity-decrease contribution of each feature, it does not shed light on how the model makes decisions from them. In this post we propose one way to concisely describe the inner decisions of the forest: mine the entire forest node by node and present the distributions of the thresholds for each feature, weighted by split efficiency and sample size.

Methodology

For the purpose of illustrating this method, we resort to the publicly available Online News Popularity Data Set from the UCI Machine Learning Repository. The dataset contains 39,797 observations of Mashable articles, each with 58 features. The positive label for this dataset is defined as whether the number of shares for a particular article is greater than or equal to 1,400. The features are all numerical and range from simple statistics, like the number of words in the content, to more complex ones, like the closeness to a particular LDA-derived topic.

After training the random forest, we crawl through the entire forest and extract the following information from each non-terminal (or non-leaf) node:

1. Feature name
2. Split threshold – The value at which the node is splitting
3. Sample Size – Number of observations that went through the node when training
4. Is greater than threshold? – Direction where majority of positive observations go
5. Gini change – Decrease in impurity after split
6. Tree index – Identifier for the tree

which can be collected into a table. Such a table will very likely contain the same feature multiple times, across different trees and even within the same tree. It might be tempting at this point to just collect all the thresholds for a particular feature and pile them up in a histogram. This, however, would not be fair, since nodes where the optimal feature-threshold tuples are found from a handful of observations should not carry the same weight as those found from thousands of observations. Hence we define, for a splitting node $$i$$, the sample size $$N_i$$ as the number of observations that reached node $$i$$ during the training phase of the random forest.
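Scikit-learn exposes each fitted tree's internals through the `tree_` attribute, so the node-by-node crawl described above can be sketched roughly as follows. Note that the chirality flag here is approximated by comparing the positive-class share in the two children, which is one reasonable reading of "direction where the majority of positive observations go":

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in binary-classification data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

rows = []  # one record per non-terminal node, as in the table described above
for tree_idx, est in enumerate(forest.estimators_):
    t = est.tree_
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:          # leaf node: nothing to record
            continue
        n = t.n_node_samples[node]
        # Impurity decrease: G(i) - (L/N) G(l) - (R/N) G(r)
        delta_g = (t.impurity[node]
                   - t.n_node_samples[left] / n * t.impurity[left]
                   - t.n_node_samples[right] / n * t.impurity[right])
        # Positive-class share in each child, as a proxy for chirality.
        share_left = t.value[left][0, 1] / t.value[left][0].sum()
        share_right = t.value[right][0, 1] / t.value[right][0].sum()
        rows.append({
            'feature': t.feature[node],
            'threshold': t.threshold[node],
            'sample_size': n,
            'is_greater_than': int(share_right > share_left),
            'gini_change': delta_g,
            'tree_index': tree_idx,
        })
```

Each entry of `rows` corresponds to one row of the table: feature, split threshold, sample size, chirality, Gini change, and tree index.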

Similarly, larger impurity changes should be weighted more than those with near zero impurity change. For that, we first define the impurity change $$\Delta G(i)$$ at the splitting node $$i$$ as,

$$\Delta G(i) = G(i) - \frac{L_i}{N_i} G(l_i) - \frac{R_i}{N_i} G(r_i)$$

where, $$N_i$$ is the sample size of the splitting node, $$L_i$$ the sample size for the left child node with index $$l_i$$, and $$R_i$$ the sample size for the right child node with index $$r_i$$. As for the gini impurity itself at node $$i$$, we define it as

$$G(i) = 1 - \sum\limits_{k=0}^{K-1} \left( \frac{n_{i,k}}{N_i} \right)^2$$

where $$K$$ is the number of classes for which we are classifying, $$n_{i,k}$$ the number of observations with label $$k$$, and $$N_i$$ total number of observations. Here is an illustration that should clarify the above definitions:
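The two definitions above can also be checked numerically with a small sketch; the toy class counts below are made up for illustration:

```python
def gini(counts):
    """Gini impurity G(i) = 1 - sum_k (n_{i,k} / N_i)^2 from class counts at a node."""
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)

def impurity_change(parent, left, right):
    """Delta G(i) = G(i) - (L_i/N_i) G(l_i) - (R_i/N_i) G(r_i), from class-count lists."""
    n = sum(parent)
    return (gini(parent)
            - sum(left) / n * gini(left)
            - sum(right) / n * gini(right))

# Toy split: a balanced node partitioned into two mostly-pure children.
parent = [10, 10]             # [negatives, positives]
left, right = [9, 2], [1, 8]
dg = impurity_change(parent, left, right)
```

A balanced binary node has impurity 0.5, and any split that separates the classes, like the one above, yields a strictly positive `dg`.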

Since our goal is to generate an overall description of the trained model, we place more weight on nodes that split more efficiently. A simple combined weight factor is just the product of the sample size and the impurity change. More specifically, we can express this combined weight for a non-terminal node $$i$$ as $$(1 + \Delta G(i)) \times N_i$$ (the Gini impurity change $$\Delta G(i)$$ is offset by one to make the weight strictly positive).

Another type of variation among nodes splitting on the same feature can be exemplified with the following case:

where the filled circles are the positively labeled observations and the unfilled circles are the negatively labeled ones. In this example, although both nodes split on the same feature, their reasons for doing so are quite different. In the left splitting node, the majority of the positive observations end up on the less-or-equal-to branch, whereas in the right splitting node, the majority of the positive observations end up on the greater-than branch. In other words, the left splitting node views the positively labeled observations as more likely to have smaller feature X values, whereas the right splitting node views them as more likely to have larger feature X values. This difference is accounted for using the is_greater_than_threshold (or chirality) flag, where 1 is true (greater than) and 0 is false (less than or equal to).

Finally, for each feature, we can build two threshold distributions (one per chirality), each weighted by the combined weight $$(1 + \Delta G(i)) \times N_i$$.
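A minimal sketch of that aggregation, using hypothetical node records for a single made-up feature:

```python
import numpy as np

# Hypothetical node records for one feature: (threshold, chirality, N_i, delta_G).
nodes = [
    (12.0, 1, 500, 0.05),
    (15.0, 1, 800, 0.10),
    (16.0, 1, 300, 0.02),
    (40.0, 0, 600, 0.08),
]

# Accumulate thresholds and combined weights (1 + delta_G) * N_i per chirality.
by_chirality = {0: ([], []), 1: ([], [])}
for threshold, chirality, n, dg in nodes:
    thresholds, weights = by_chirality[chirality]
    thresholds.append(threshold)
    weights.append((1 + dg) * n)

# One weighted histogram per chirality; here, the greater-than side.
hist_gt, edges = np.histogram(by_chirality[1][0], bins=5, range=(10, 20),
                              weights=by_chirality[1][1])
```

Plotting the two histograms on shared axes gives exactly the kind of paired threshold distributions shown in the figures that follow.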

Example

After training the classifier model, we crawl through the entire forest and collect all the information specified in the previous section. This information empowers us to describe which thresholds dominate the splits for, say, num_hrefs (the number of links in the article):

In the above plot we see two distributions. The orange one corresponds to the nodes whose chirality is greater-than, and the red-ish (or Rausch) one to those whose chirality is less-than-or-equal-to. These weighted distributions for num_hrefs indicate that whenever num_hrefs is used to decide whether an article is popular (1,400+ shares), the dominant description is "greater than ~15 links", illustrated by the spiking bin of the greater-than distribution near 15. Another interesting illustration of this method involves global_rate_positive_words and global_rate_negative_words, defined as the proportion of, respectively, positive and negative words in the content of the article. The former is depicted as follows:

where, as far as the model is concerned, popular articles tend to have larger global_rate_positive_words, with the cutoff dominated by 0.03. An even more interesting distribution is that of global_rate_negative_words:

where the distributions indicate that a rate of negative words greater than ~0.01 per content size adds a healthy dose of popularity, whereas too much of it, say more than ~0.02, will make the model predict lower popularity. This is inferred from the greater-than distribution spiking at ~0.01, whereas the less-than-or-equal-to distribution spikes at ~0.02.

What’s Next

In Trust and Safety we are eager to make our models more transparent. This post dealt with only one very specific way to inspect the inner workings of a trained model. Other possible approaches include asking the following questions:

• What observations does the model see as clusters?
• How do features interact in a random forest?
