No Strangers

Airbnb is trying to build a world where people can belong anywhere and there are no strangers. This helps hosts feel comfortable opening their homes and guests feel confident traveling around the globe to stay with people they have never met before.

While almost all members of the Airbnb community interact in good faith, there is an ever-shrinking group of bad actors who seek to take advantage of the platform for profit. This problem is not unique to Airbnb: social networks battle attempts to spam or phish users for their details; e-commerce sites try to prevent the use of stolen credit cards. The Trust and Safety team at Airbnb works tirelessly to remove bad actors from the Airbnb community and to help make the platform a safer and more trustworthy place to experience belonging.

Missing Values In A Random Forest

We can train machine learning models to identify new bad actors (for more details see the previous blog post Architecting a Machine Learning System for Risk). One particular family of models we use is Random Forest Classifiers (RFCs). An RFC is a collection of decision trees, each independently grown using labeled and complete input training data. By complete we explicitly mean that there are no missing values, i.e. no NULL or NaN values. But in practice, data often has (many) missing values. In particular, very predictive features do not always have values available, so they must be imputed before a random forest can be trained.

Typically, random forest methods/packages encourage two ways of handling missing values: a) drop data points with missing values (not recommended); b) fill in missing values with the median (for numerical values) or mode (for categorical values). While a) throws away available information by dropping data points, b) can paint with too broad a brush for data sets with many gaps and significant structure.
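
For reference, baseline b) is typically a one-liner per column; here is a minimal pandas sketch (the `median_mode_fill` name and the assumption of a tidy DataFrame are illustrative, not from any production system):

```python
import pandas as pd

def median_mode_fill(df: pd.DataFrame) -> pd.DataFrame:
    """Baseline imputation: median for numerical columns, mode for categorical."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out
```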

There are alternative techniques for dealing with missing values, but most of these are computationally expensive, e.g. repeated iterations of random forest training to compute proximities. What we propose in this post is a one-step pre-computation method which normalises features to construct a distance metric for filling in missing values with the median of their k nearest neighbours.

All of the features in our fraud prediction models fall into two types: a) numerical and b) categorical. Boolean features can be thought of as a special case of categorical features. Since we work in the business of fraud detection, our labels are binary: 0 if the data point is not fraud and 1 if the data point is fraud. Below are some feature transformations we wish to compare for missing value treatment.

Transformations

  1. Let's first deal with numerical features. Write a numerical feature as a random variable \(X\) taking values in \(\mathbb{R}\), and denote the data labels by the random variable \(Y\) taking values in \(\{0,1\}\). Then for \(X=x\) the transformation \(F_n:\mathbb{R} \rightarrow [0,1]\) defined by$$
    x \mapsto F_n(x) := \mathbb{P}[X \leq x | Y=1]
    $$has the following properties:

    i) \(F_n\) is order preserving, i.e. if \(a \leq b\) then \(F_n(a) \leq F_n(b)\)
    ii) \(F_n\) is invertible, i.e. given a value \(y=F_n(x)\) we can recover \(x=F_n^{-1}(y)\) (follows from i)
    iii) \(F_n(X|{Y=1}) \sim U[0,1]\), i.e. \(F_n(X|{Y=1})\) is uniformly distributed on \([0,1]\)


  2. Similarly, consider a categorical feature represented by a random variable \(X\) taking values in a finite set \(\Omega = \{a,b,…\}\). Take an ordering \(\preceq\) on \(\Omega\) according to the rate of fraud for each categorical value. Then for categorical \(X=x\) we can define the transformation \(F_c:\Omega \rightarrow [0,1]\) analogously by$$
    x \mapsto F_c(x) := \mathbb{P}[X \preceq x | Y=1]
    $$which also has properties i), ii) and iii) from above with respect to the ordering \(\preceq\) (see the code sketch after this list).
  3. A more common method for transforming a numerical feature \(X\) to the unit interval is given by \(G_n\) where$$
    x \mapsto G_n(x) := \mathbb{P}[X \leq x]
    $$which has properties i) and ii), but satisfies iii) only unconditionally, i.e. \(G_n(X) \sim U[0,1]\) across all data points rather than across the fraudulent ones.
  4. We can adapt the popular method \(g(x) = \mathbb{P}[Y=1 | X=x]\) for transforming a categorical feature \(X\) to a numerical value to give us \(G_c:\Omega \rightarrow [0,1]\) defined by$$
    x \mapsto G_c(x) := [ g(x) - \min g(x) ] / [\max g(x) - \min g(x)]
    $$The conditional probability transform \(g\) intuitively makes more sense, but violates properties i), ii) and iii) above. In particular, without invertibility of \(G_c\), an RFC cannot easily (or practically) distinguish between two different categorical values \(a\) and \(b\) that have \(G_c(a) \approx G_c(b)\).
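
To make the transforms concrete, here is a minimal Python sketch of fitting empirical versions of \(F_n\) and \(F_c\) from labeled data. It assumes pandas Series `x` (a feature) and `y` (0/1 fraud labels) with at least one fraud example; the names `fit_F_n` and `fit_F_c` are illustrative, not from any production system:

```python
import numpy as np
import pandas as pd

def fit_F_n(x: pd.Series, y: pd.Series):
    """Empirical F_n(v) = P[X <= v | Y = 1] for a numerical feature."""
    fraud_vals = np.sort(x[y == 1].dropna().to_numpy())
    def F_n(v):
        # Fraction of fraud observations with value <= v (empirical CDF)
        return np.searchsorted(fraud_vals, v, side="right") / len(fraud_vals)
    return F_n

def fit_F_c(x: pd.Series, y: pd.Series) -> dict:
    """Empirical F_c(a) = P[X ⪯ a | Y = 1], categories ordered by fraud rate."""
    order = y.groupby(x).mean().sort_values().index       # the ordering ⪯
    fraud_mass = x[y == 1].value_counts(normalize=True)   # P[X = a | Y = 1]
    return fraud_mass.reindex(order).fillna(0).cumsum().to_dict()
```

A fitted transform is then applied column-wise, e.g. `df["amount"].map(fit_F_n(df["amount"], df["label"]))` for hypothetical `amount` and `label` columns.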

Interpretation

The aim of the scaling transforms \(F_n\) and \(F_c\) is twofold: first, to make the transformation invertible so that no information is lost; second, to uniformly distribute the fraudulent data points in the interval \([0,1]\) for each feature. If we think of data as points in \(N\)-dimensional space, where \(N\) is the number of features, then the distance in each dimension between two data points becomes comparable. By comparable we mean that a distance of 0.4 in the first dimension contains twice as many fraud data points as a distance of 0.2 in the second dimension. This enables better construction of distance metrics to identify fraud.

Imputation Using K-Nearest Neighbours

Suppose there are \(N\) features \(X_1,X_2,…,X_N\) and data points \(\mathbf{a},\mathbf{b},…\in \mathbf{D}\). After we have transformed the \(i\)th feature, by \(F_i\) say, for \(i=1,2,…,N\) we can construct a distance metric \(d\) between any two data points \(\mathbf{a}=(a_1,a_2,…,a_N)\) and \(\mathbf{b}=(b_1,b_2,…,b_N)\) as follows:

$$
d(\mathbf{a},\mathbf{b};\lambda) := \sum_{i=1}^{N} \left[ \lambda + \mathbf{1}_{\{a_i \neq \text{NULL},\; b_i \neq \text{NULL}\}} \left( |F_i(a_i)-F_i(b_i)| - \lambda \right) \right]
$$

where \(\lambda\) is a pre-chosen constant which determines how to weight the distance between two feature values when at least one of them is missing: each such feature contributes \(\lambda\) to the distance, while a fully observed feature contributes \(|F_i(a_i)-F_i(b_i)|\). By increasing \(\lambda\) we push out neighbours of a data point that have many missing values. Then, for a data point \(\mathbf{a}\) with missing value \(a_j\), say, we calculate the nearest-neighbour missing value (NNMV) as:

$$
m(a_j;d,k) = \text{median} ( [\text{argmin}^{(k)}_{\mathbf{b} \in \mathbf{D}} d(\mathbf{a},\mathbf{b};\lambda)]_j )
$$

where \(k\) denotes how many nearest neighbours we wish to use for calculating the median value of the \(j\)th feature. In other words, find the \(k\) closest neighbours of \(\mathbf{a}\) and then take the median of their values in the \(j\)th dimension.
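
A brute-force NumPy sketch of this imputation is below. It assumes every feature has already been mapped into \([0,1]\) by its transform and that missing values are encoded as NaN; the name `knn_impute` and the optional reference set `D` are illustrative:

```python
import numpy as np

def knn_impute(X, D=None, lam=0.5, k=100):
    """Fill NaNs in X with the median of each row's k nearest neighbours
    in D (defaults to X itself) under the lambda-penalised distance d."""
    X = np.asarray(X, dtype=float)
    D = X if D is None else np.asarray(D, dtype=float)
    filled = X.copy()
    for i in range(len(X)):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # Per-feature term: |F_i(a_i) - F_i(b_i)| when both are present, else lambda
        diff = np.abs(D - X[i])                # NaN wherever either value is missing
        dists = np.where(np.isnan(diff), lam, diff).sum(axis=1)
        neighbours = np.argsort(dists)[:k]     # indices of the k closest points in D
        for j in np.where(missing)[0]:
            vals = D[neighbours, j]
            vals = vals[~np.isnan(vals)]       # a neighbour may also miss feature j
            if vals.size:
                filled[i, j] = np.median(vals)
    return filled
```

This costs \(O(|\mathbf{D}|)\) work per imputed row (and a row can select itself as a neighbour when `D` defaults to `X`); a production version would use an indexed nearest-neighbour search, but the point stands that this is a one-step pre-computation rather than an iterative proximity scheme.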

Experiment

In order to see the effect of the above feature transforms, we use the Adult dataset from the UCI Machine Learning Repository and assess the performance of the model under different feature transformations and proportions of missing values. The dataset contains 32,561 rows and 14 features, of which 8 are categorical and the remaining 6 are numerical. The boolean labels correspond to whether the income level of the adult is greater than or less than $50k per annum. We divide the dataset into a training and test set in the ratio of 4 to 1 respectively.

For the first experiment we compare different models using the following methodology (a code sketch follows the list):

  1. Use the same training and test data set split
  2. Remove \(M\)% of values from the training and test data sets
  3. Either calculate the median/mode for the missing values, or compute \(m(\,\cdot\,; d(\cdot,\cdot;\lambda), 100)\), from the training set
  4. Fill in the missing values in both training and test set using the previous step’s calculations
  5. Train an RFC with the same number of trees (100)
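
A minimal sketch of steps 2 through 5, assuming scikit-learn, NumPy, and the `knn_impute` sketch above (the `run_experiment` helper and its defaults are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def run_experiment(X_train, y_train, X_test, y_test, M=0.3, lam=0.5, k=100, seed=0):
    """Steps 2-5: remove M% of values, impute, train a 100-tree RFC, score by AUC."""
    rng = np.random.default_rng(seed)
    def mask(X):
        X = X.copy()
        X[rng.random(X.shape) < M] = np.nan    # step 2: knock out M% of entries
        return X
    X_train, X_test = mask(X_train), mask(X_test)
    # Steps 3-4: test rows are imputed from training-set neighbours only
    # (assumes enough neighbours have each feature present so no NaNs remain)
    X_train_filled = knn_impute(X_train, lam=lam, k=k)
    X_test_filled = knn_impute(X_test, D=X_train, lam=lam, k=k)
    # Step 5: train with a fixed number of trees
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train_filled, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test_filled)[:, 1])
```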

Observe that \(M\) and \(\lambda\) are unspecified parameters; we will loop over different values of these during experimentation. The performance of each model will be judged using the Area Under the Curve (AUC) score, which measures the area under the Receiver Operating Characteristic (ROC) graph (a plot of the true positive rate vs the false positive rate). We will test the following nine models:

  0. Baseline model (no feature transformation and missing value imputation using median/mode)
  1. \(G_n\) and \(G_c\) with median imputation
  2. \(F_n\) and \(F_c\) with median imputation
  3. \(G_n\) and \(G_c\) with NNMV imputation
  4. \(G_n\) and \(F_c\) with NNMV imputation
  5. \(F_n\) and \(G_c\) with NNMV imputation
  6. \(O_n\) and \(F_c\) with NNMV imputation
  7. \(F_n\) and \(O_c\) with NNMV imputation
  8. \(F_n\) and \(F_c\) with NNMV imputation

where \(O_n(x) = O_c(x) := 0\) is the null map that takes all of a feature's values to zero. The \(O_n\) and \(O_c\) maps help us verify that both the numerical and categorical features are contributing to the model.

In the second experiment we compare model 8) to model 0) for different combinations of the number of trees and the number of nearest neighbours.

Results

Model Comparison

First we consider how the models perform for different values of \(M\) and \(\lambda\) with a fixed number of trees (100) in the RFC training process and fixed number of nearest neighbours (100) for NNMV imputation.

[Figure: AUC scores of the nine models across different values of \(M\) and \(\lambda\)]

The graphs above display interesting patterns, some intuitive and some surprising:

  1. As we increase the percentage of missing values, the performance of the baseline model deteriorates
  2. In the scenario where no missing values are added (top left graph) but the data set still has some missing values, the improvement from all models is roughly the same at 0.007
  3. For all scenarios, the models with median imputation (red and yellow lines) perform very similarly to each other and sometimes over 0.005 worse than the NNMV models
  4. Across all scenarios the ‘\(F_n\) and \(F_c\) + NNMV’ graph (light brown line) outperforms the median imputation models, with the improvement increasing as the percentage of missing values increases
  5. Across all scenarios the ‘\(F_n\) and \(F_c\) + NNMV’ graph (light brown line) outperforms the other NNMV graphs, by up to 0.003 in some cases (bottom right graph)
  6. Across all scenarios the \(O_n\) and \(O_c\) graphs (light blue and pink lines) reduce the power of the model considerably, as we would expect; so much so that they fall outside the range of the graphs
  7. The results do not appear to be very sensitive to \(\lambda\) as long as \(\lambda>0\).

Robustness Checks

Having observed the outperformance of model 8) over the other candidates, we next check how model 8) compares to the baseline as we vary i) the number of trees in the RFC training and ii) the number of nearest neighbours in the NNMV imputation. We take the fourth scenario above, where 60% of values are missing, and choose \(\lambda=0.5\).

[Figure: AUC and training time for model 8) vs the baseline, varying the number of trees and the number of nearest neighbours]

The left-hand plot does not suggest that the performance of model 8) improves with the number of nearest neighbours used in the NNMV imputation. However, there is a consistent pattern of improved performance as the number of trees increases, plateauing after about 100 trees. The right-hand plot shows how much faster it is to train an RFC with the transformed feature set. This is to be expected: instead of exploding categorical features into many binary features, as the baseline model does, model 8) keeps the number of features fixed.

Efficiency Implications

Consider the ROC curves for one of the scenarios above, say, where the number of trees is 100 and the number of nearest neighbours used is 100.

[Figure: ROC curves for the baseline and model 8), with 100 trees and 100 nearest neighbours]

The improvement of the ROC curve suggests, for example, that holding recall fixed at 80%, the false positive rate falls from 26% to 24%. Suppose each day we are scoring 1 million events, 99% of which are non-fraud, each flagged event needs to be manually reviewed by a human, and each review takes 10 seconds. Then the aforementioned decrease in the false positive rate saves reviewing 1,000,000 x 0.99 x 0.02 = 19,800 events per day; at 10 seconds per review, that is 6 reviews per minute, so 19,800 / (6 x 60) = 55 hours of reviewing saved per day! This is why even improvements in the second or third decimal place of the AUC score of an RFC can have a dramatic effect on a department's efficiency.
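
The back-of-envelope arithmetic can be checked directly; the figures below are the ones quoted in this section:

```python
# Review-time savings from a 2-point drop in the false positive rate at fixed recall.
events_per_day = 1_000_000
non_fraud_rate = 0.99
fpr_drop = 0.26 - 0.24            # false positive rate falls from 26% to 24%
reviews_per_hour = 6 * 60         # a 10-second review = 6 reviews per minute

saved_reviews = events_per_day * non_fraud_rate * fpr_drop   # 19,800 events
saved_hours = saved_reviews / reviews_per_hour               # 55 hours per day
print(f"{saved_reviews:,.0f} fewer reviews ≈ {saved_hours:.0f} hours saved per day")
```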

Want to work with us? We're hiring!