The Trust and Safety team maintains a number of models for predicting and detecting fraudulent online and offline behaviour. A common challenge we face is attaining high confidence in the identification of fraudulent actions, both in terms of classifying a fraudulent action as fraudulent (recall) and not classifying a good action as fraudulent (precision).

A classification model we often use is a Random Forest Classifier (RFC). However, by adjusting the logic of this algorithm slightly, so that we look for high confidence regions of classification, we can significantly improve the recall and precision of the classifier’s predictions. To do this we introduce a new splitting criterion (explained below) and show experimentally that it can enable more accurate fraud detection.

A RFC is a collection of randomly grown ‘Decision Trees’. A decision tree is a method for partitioning a multi-dimensional space into regions of similar behaviour. In the context of fraud detection, identifying events as ‘0’ for non-fraud and ‘1’ for fraud, a decision tree is binary and tries to find regions in the signal space that are mainly 0s or mainly 1s. Then, when we see a new event, we can look at which region it belongs to and decide if it is a 0s region or a 1s region.

Typically, a Decision Tree is grown by starting with the whole space, and iteratively dividing it into smaller and smaller regions until a region only contains 0s or only contains 1s. Each final uniform region is called a ‘leaf’. The method by which a parent region is partitioned into two child regions is often referred to as the ‘Splitting Criterion’. Each candidate partition is evaluated and the partition which optimises the splitting criterion is used to divide the region. The parent region that gets divided is called a ‘node’.

Suppose a node has $$N$$ observations and take $$L_0^{(i)}$$ and $$R_0^{(i)}$$ to denote the number of 0s in the left and right child respectively, and similarly $$L_1^{(i)}$$ and $$R_1^{(i)}$$ for 1s. So for each candidate partition $$i$$ we have $$N=L_0^{(i)}+L_1^{(i)}+R_0^{(i)}+R_1^{(i)}$$. Now let $$l_0^{(i)}=L_0^{(i)}/(L_0^{(i)}+L_1^{(i)})$$ be the probability of selecting a 0 in the left child node for partition $$i$$, and similarly denote the probabilities $$l_1^{(i)}$$, $$r_0^{(i)}$$, and $$r_1^{(i)}$$. The two most common splitting criteria are:

A. Gini Impurity: Choose $$i$$ to minimise the probability of mislabeling i.e. $$i_{gini} = \arg \min_i H_{gini}^{(i)}$$ where

$$H_{gini}^{(i)} = \frac{L_0^{(i)}+L_1^{(i)}}{N} \big[ l_0^{(i)} (1-l_0^{(i)}) + l_1^{(i)} (1-l_1^{(i)}) \big] + \frac{R_0^{(i)}+R_1^{(i)}}{N} \big[ r_0^{(i)} (1-r_0^{(i)}) + r_1^{(i)} (1-r_1^{(i)}) \big]$$

B. Entropy: Choose $$i$$ to minimise the entropy of the labeling, thereby maximising the information gain, i.e. $$i_{entropy} = \arg \min_i H_{entropy}^{(i)}$$ where

$$H_{entropy}^{(i)} = - \frac{L_0^{(i)}+L_1^{(i)}}{N} \big[ l_0^{(i)} \log(l_0^{(i)}) + l_1^{(i)} \log(l_1^{(i)}) \big] - \frac{R_0^{(i)}+R_1^{(i)}}{N} \big[ r_0^{(i)} \log(r_0^{(i)}) + r_1^{(i)} \log(r_1^{(i)}) \big].$$

However, notice that both of these criteria are scale invariant: a node with $$N=300$$ observations and partition given by $$L_1=100,L_0=100,R_1=100,R_0=0$$ achieves an identical splitting criterion score $$H_{gini}=1/3$$ to a node with $$N=3$$ observations and $$L_1=1,L_0=1,R_1=1,R_0=0$$. Yet the former split is a far stronger result (a more difficult partition to achieve) than the latter. It would be useful to have a criterion that can differentiate between the likelihoods of these two partitions.
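To make the scale invariance concrete, here is a minimal sketch of the weighted Gini computation from criterion A (the helper name `gini_score` is ours, not Scikit-Learn's):

```python
def gini_score(L0, L1, R0, R1):
    """Weighted Gini impurity of a candidate partition (lower is better)."""
    N = L0 + L1 + R0 + R1

    def impurity(n0, n1):
        # p0*(1-p0) + p1*(1-p1) for one child
        n = n0 + n1
        if n == 0:
            return 0.0
        p0, p1 = n0 / n, n1 / n
        return p0 * (1 - p0) + p1 * (1 - p1)

    return (L0 + L1) / N * impurity(L0, L1) + (R0 + R1) / N * impurity(R0, R1)

# The two partitions from the text score identically (1/3) despite the
# hundred-fold difference in sample size:
print(gini_score(100, 100, 0, 100))  # N = 300 node
print(gini_score(1, 1, 0, 1))        # N = 3 node
```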

### Confidence Splitting Criterion

Theory

Let $$L^{(i)}=L_0^{(i)}+L_1^{(i)}$$ and $$R^{(i)}=R_0^{(i)}+R_1^{(i)}$$, and let $$p_0$$ be the proportion of 0s and $$p_1$$ the proportion of 1s in the node we wish to split. We want to find partitions where the distribution of 0s and 1s is unlikely to have occurred by chance. If the null hypothesis is that each observation is a 0 with probability $$p_0$$, then the probability of $$L_0^{(i)}$$ or more 0s occurring in a partition of $$L^{(i)}$$ observations is given by the Binomial random variable $$X(n,p)$$:

$$\mathbb{P}[X(L^{(i)},p_0) \geq L_0^{(i)}] = 1 - B(L_0^{(i)};L^{(i)},p_0)$$

where $$B(x;N,p)$$ is the cumulative distribution function, evaluated at $$x$$, of a Binomial random variable with $$N$$ trials and probability $$p$$ of success. Similarly for $$p_1$$, the probability of $$L_1^{(i)}$$ or more 1s occurring in the left partition is $$1-B(L_1^{(i)};L^{(i)},p_1)$$. Taking the minimum of these two probabilities gives us the equivalent of a two-tailed hypothesis test: the probability is essentially the p-value under the null hypothesis of a partition at least as extreme as the one given. We can repeat the statistical test for the right partition and take the product of the two p-values to give an overall partition probability.
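A minimal pure-Python sketch of this two-tailed test for a single child; the helper names `binom_sf` and `partition_pvalue` are ours, chosen for illustration:

```python
from math import comb

def binom_sf(k, n, p):
    """Upper tail P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(max(k, 0), n + 1))

def partition_pvalue(n0, n1, p0, p1):
    """Two-tailed p-value for one child containing n0 0s and n1 1s, under
    the null that each observation is a 0 w.p. p0 and a 1 w.p. p1."""
    n = n0 + n1
    return min(binom_sf(n0, n, p0), binom_sf(n1, n, p1))

# Using the node proportions from the earlier example (p0 = 1/3, p1 = 2/3):
# the N = 300 node's all-1s right child (100 ones) is astronomically
# unlikely under the null, while the N = 3 node's right child (1 one)
# has p-value 2/3 -- this test can tell the two partitions apart.
print(partition_pvalue(0, 100, 1/3, 2/3))
print(partition_pvalue(0, 1, 1/3, 2/3))
```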

Now, to enable biasing the splitting towards identifying high-density regions of 1s in the observation space, one idea is to modify $$B(L_0^{(i)};L^{(i)},p_0)$$ so that it is non-zero only when it is sufficiently large. In other words, replace it with

$$\mathbb{1}_{ \{B(L_0^{(i)};L^{(i)},p_0) \geq C_0 \} }B(L_0^{(i)};L^{(i)},p_0)$$

where $$C_0 \in [0,1]$$ is the minimum confidence we require. $$C_0$$ might take values such as 0.5, 0.9, 0.95, or 0.99, and should be chosen to match the level of confidence desired in the identification; similarly for $$C_1$$. Thus we propose a new splitting criterion:

C. Confidence: Choose $$i$$ to minimise the probability of the partition arising by chance i.e. $$i_{confidence} = \arg \min_i H_{confidence}^{(i)}$$ where

$$H_{confidence}^{(i)} = \min_{j=0,1} \big\{1 - \mathbb{1}_{ \{ B(L_j^{(i)};L^{(i)},p_j) \geq C_j \} } \, B(L_j^{(i)};L^{(i)},p_j) \big\} \times \min_{j=0,1} \big\{1 - \mathbb{1}_{ \{ B(R_j^{(i)};R^{(i)},p_j) \geq C_j \} } \, B(R_j^{(i)};R^{(i)},p_j) \big\}$$

where $$C_0$$ and $$C_1$$ are to be chosen to optimise the identification of 0s or 1s respectively.
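A direct, unoptimised transcription of criterion C (the helper names are ours, and we use the exact Binomial CDF here rather than the approximations discussed in the implementation section):

```python
from math import comb

def binom_cdf(k, n, p):
    """B(k; n, p): P[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(0, k + 1))

def child_term(n0, n1, p0, p1, C0, C1):
    """min over j of 1 - 1{B(n_j; n, p_j) >= C_j} * B(n_j; n, p_j)."""
    n = n0 + n1
    terms = []
    for count, p, C in ((n0, p0, C0), (n1, p1, C1)):
        b = binom_cdf(count, n, p)
        terms.append(1.0 - (b if b >= C else 0.0))  # indicator gate
    return min(terms)

def h_confidence(L0, L1, R0, R1, p0, p1, C0=0.95, C1=0.95):
    """Criterion C: product of the two children's gated terms."""
    return (child_term(L0, L1, p0, p1, C0, C1)
            * child_term(R0, R1, p0, p1, C0, C1))

# Revisiting the scale-invariance example: the (100, 100) left child of the
# N = 300 node clears the C0 gate and scores near 0, while the (1, 1) left
# child of the N = 3 node fails both gates and scores the worst value, 1.
print(child_term(100, 100, 1/3, 2/3, 0.95, 0.95))  # tiny
print(child_term(1, 1, 1/3, 2/3, 0.95, 0.95))      # 1.0
```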

### Implementation

To run this proposed new splitting criterion, we cut a new branch of the open-source Python Scikit-Learn repository and updated the Random Forest Classifier library. Two modifications were made to the analytical $$H_{confidence}^{(i)}$$ function to optimise the calculation:

1. Speed: Calculating the exact value of $$B(x;N,p)$$ is expensive, especially over many candidate partitions for many nodes across many trees. For large values of $$N$$ we can approximate the Binomial cumulative distribution function by the Normal cumulative distribution $$\Phi(x;Np,\sqrt{Np(1-p)})$$ which itself can be approximated using Karagiannidis & Lioumpas (2007) as a ratio of exponentials.
2. Accuracy: For large values of $$N$$ and small values of $$p$$ the tails of the Binomial distribution can be very small, so subtraction and multiplication of the tail values can be corrupted by the limits of machine floating-point precision. To overcome this we take the logarithm of the above approximation when calculating $$H_{confidence}^{(i)}$$.
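The two tweaks can be sketched together as follows. This is our illustrative reconstruction, not the patched Scikit-Learn code: the Karagiannidis & Lioumpas (2007) approximation of the Gaussian tail $$Q(x) \approx (1-e^{-1.4x})e^{-x^2/2}/(1.135\sqrt{2\pi}x)$$, kept in log space so deep tails do not underflow to zero:

```python
from math import erfc, exp, log, log1p, pi, sqrt

def log_q_approx(x):
    """log of the Gaussian upper tail Q(x), x > 0, via the
    Karagiannidis & Lioumpas (2007) ratio-of-exponentials form."""
    return log1p(-exp(-1.4 * x)) - x * x / 2.0 - log(1.135 * sqrt(2.0 * pi) * x)

def log_binom_sf_approx(k, n, p):
    """log P[X >= k] for X ~ Binomial(n, p) via the Normal approximation
    Phi(x; Np, sqrt(Np(1-p))), usable deep in the tail where the plain
    float probability would underflow to 0."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    z = (k - 0.5 - mu) / sigma  # continuity correction
    if z <= 0:
        # Not in the upper tail: the exact erfc-based value is safe here.
        return log(0.5 * erfc(z / sqrt(2.0)))
    return log_q_approx(z)

# Deep in the tail the probability itself would underflow, but the
# log stays representable:
print(log_binom_sf_approx(190, 200, 0.5))  # strongly negative
```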

After these tweaks to the algorithm we find an insignificant change to the runtime of the Scikit-Learn routines. The Python code with the new criterion looks something like this:


```python
from sklearn.ensemble import RandomForestClassifier

# using [C_0, C_1] = [0.95, 0.95]
rfc = RandomForestClassifier(n_estimators=1000, criterion='conf', conf=[0.95, 0.95])
rfc.fit(x_train, y_train)
pred = rfc.predict_proba(x_test)
```


For more details on the Machine Learning model building process at Airbnb you can read previous posts such as Designing Machine Learning Models: A Tale of Precision and Recall and How Airbnb uses machine learning to detect host preferences. And for details on our architecture for detecting risk you can read more at Architecting a Machine Learning System for Risk.

### Evaluation

Data

To test the improvements the Confidence splitting criterion can provide, we use the same dataset we used in the previous post Overcoming Missing Values In A Random Forest Classifier, namely the adult dataset from the UCI Machine Learning Repository. As before, the goal is to predict whether the income of the adult is greater than or less than \$50k per annum using the 14 features provided.

We tried 6 different combinations of $$[C_0,C_1]$$ against the baseline RFC with Gini Impurity and looked at the changes in the Precision-Recall curves. As always we train on a training set and evaluate on a held-out test set. We build a RFC of 1000 trees in each of the 7 scenarios.
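For reference, a Precision-Recall curve of the kind we compare per scenario can be computed from predicted probabilities as below. This is a deliberately minimal stand-in (it does not merge tied scores, unlike Scikit-Learn's `precision_recall_curve`), and the toy data is invented for illustration:

```python
def precision_recall_points(y_true, y_score):
    """Precision and recall swept over descending score thresholds."""
    pairs = sorted(zip(y_score, y_true), reverse=True)
    total_pos = sum(y_true)
    tp = fp = 0
    points = []
    for _, y in pairs:
        tp += y        # true positive if the label at this threshold is 1
        fp += 1 - y    # false positive otherwise
        points.append((tp / (tp + fp), tp / total_pos))
    return points

# Toy labels and scores from a hypothetical classifier:
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
for precision, recall in precision_recall_points(y_true, y_score):
    print(precision, recall)
```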

Results

Observe that $$C_0=0.5$$ (yellow and blue lines) offers very little improvement over the baseline RFC, with modest absolute recall improvements of 5% at the 95% precision level. However, for $$C_0=0.9$$ (green and purple lines) we see a steady increase in recall at precision levels of 45% and upwards. At 80% precision and above, $$C_0=0.9$$ improves recall by an absolute amount of 10%, rising to 13% at the 95% precision level. There is little variation between $$C_1=0.9$$ (green line) and $$C_1=0.99$$ (purple line) for $$C_0=0.9$$, although $$[C_0,C_1]=[0.9,0.9]$$ (green line) does seem to be superior. For $$C_0=0.99$$ (pale blue and pink lines), the improvement is not as impressive or consistent.

### Final Thoughts

It would be useful to extend the analysis to compare the new splitting criterion against optimising existing hyper-parameters. In the Scikit-Learn implementation of RFCs we could experiment with `min_samples_split` or `min_samples_leaf` to overcome the scaling problem. We could also test different values of `class_weight` to capture the asymmetry introduced by non-equal $$C_0$$ and $$C_1$$.
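Such a baseline comparison might look like the sketch below, using only stock Scikit-Learn knobs (no custom criterion). The synthetic dataset is a stand-in we invented for illustration, not the adult dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 14 features, imbalanced labels like a fraud problem.
X, y = make_classification(n_samples=500, n_features=14,
                           weights=[0.9, 0.1], random_state=0)

# Existing knobs that address the same scaling problem: a larger minimum
# leaf size forbids tiny "pure" regions, and class_weight biases the trees
# towards the rare class much as unequal C_0 and C_1 would.
rfc = RandomForestClassifier(n_estimators=100,
                             min_samples_leaf=5,
                             class_weight={0: 1, 1: 5},
                             random_state=0)
rfc.fit(X, y)
proba = rfc.predict_proba(X)[:, 1]  # predicted probability of class 1
```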

More work can be done on the implementation of this methodology, and there is still some outstanding analytical investigation into how the confidence thresholds $$C_j$$ tie to the improvements in recall or precision. Note, however, that the methodology already generalises to non-binary classifiers, i.e. where $$j=0,1,2,3,\ldots$$. It could also be useful to implement this new criterion in the Apache Spark RandomForest library.