In the Trust and Safety team at Airbnb, we use the random forest classifier in many of our risk mitigation models. Despite our successes with it, the ensemble of trees along with the random selection of features at each node makes it difficult to succinctly describe how features are being split. In this post, we propose a method to aggregate and summarize those split values by generating weighted threshold distributions.

Motivation

Despite its versatility and out-of-box performances, the random forest classifier is often referred to as a black box model. It is easy to see why some might be inclined to think so. First, the optimal decision split at each node is only drawn from a random subset of the feature set. And to make matters more obscure, the model generates an ensemble of trees using bootstrap samples of the training set. All these just means that a feature might split at different nodes of the same tree and possibly with different split values and this could be repeated in multiple trees.

With all this randomness being thrown around, a certainty still remains. After training a random forest, we know exactly every detail of the forest. For each node, we know what feature is used to split, with what threshold value, and with what efficiency. All the details are there, the challenge is knowing how to piece them together to built an accurate and informative description of the allegedly black box.

One common way to describe a trained random forest in terms of the features is to rank them by their importances based on their splitting efficiencies. Although this method manages to quantify the contribution of impurity decreases from each of the features, it does not shed light on how the model makes decisions from them. We propose in this post one way to concisely describe the inner decisions of the forest by mining node-by-node the entire forest and present the weighted distributions (by split efficiencies and sample sizes) of the thresholds for each feature.

Methodology

For purpose of illustrating this method, we resort to the publicly available Online News Popularity Data Set from the UCI Machine Learning Repository. The dataset contains 39,797 observations of Mashable articles each with 58 features. The positive labels for this dataset are defined as whether the number of shares for a particular article is greater or equal to 1,400. The features are all numerical and ranges from simple statistics like number of words in the content to more complex ones like the closeness to a particular LDA-derived topic.

After training the random forest, we crawl through the entire forest and extract the following information from each non-terminal (or non-leaf) node:

  1. Feature name
  2. Split threshold – The value at which the node is splitting
  3. Sample Size – Number of observations that went through the node when training
  4. Is greater than threshold? – Direction where majority of positive observations go
  5. Gini change – Decrease in impurity after split
  6. Tree index – Identifier for the tree

which can be collected into a table like the following: Screen Shot 2015-10-01 at 11.24.00 AM A common table like the above will very likely contain the same feature multiple times across different trees and even within the same tree. It might be tempting at this point to just collect all the thresholds for a particular feature and pile them up in a histogram. This, however, would not be fair since nodes where the optimal feature-threshold tuples are found from a handful of observations should not have the same weight as those found from thousands of observations. Hence, we define, for a splitting node \(i\), the sample size \(N_i\) as the number of observations that reached the splitting node i during the training phase of the random forest.

Similarly, larger impurity changes should be weighted more than those with near zero impurity change. For that, we first define the impurity change \(\Delta G(i)\) at the splitting node \(i\) as,

$$ \Delta G(i) = G(i) – \frac{L_i}{N_i} G(l_i) – \frac{R_i}{N_i} G(r_i) $$

where, \(N_i\) is the sample size of the splitting node, \(L_i\) the sample size for the left child node with index \(l_i\), and \(R_i\) the sample size for the right child node with index \(r_i\). As for the gini impurity itself at node \(i\), we define it as

$$ G(i) = 1 – \sum\limits_{k=0}^{K} \left( \frac{n_{i,k}}{N_i} \right)^2 $$

where \(K\) is the number of classes for which we are classifying, \(n_{i,k}\) the number of observations with label \(k\), and \(N_i\) total number of observations. Here is an illustration that should clarify the above definitions:

Screen Shot 2015-10-01 at 11.24.55 AM

Since our goal is that of generating an overall description of the trained model, we place more weights on those that split more efficiently. A simple combined weight factor can be just the product of the sample size and the impurity change. More specifically, we can express this combined weight for a non-terminal node \(i\) as \((1 + \Delta G(i)) \times N_i\) (the gini impurity change \(\Delta G(i)\) is added an offset of one to make its range strictly positive).

Another type of variation among nodes splitting on the same feature can be exemplified with the following case:

Screen Shot 2015-10-01 at 11.27.02 AM

where the filled circles are the positively labeled observations whereas the unfilled circles are the negatively labeled ones. In this example, although both nodes split on the same feature, their reasons to do so is quite different. In the left splitting node, the majority of the positive observations end up on the less-or-equal-to branch whereas in the right splitting node, the majority of the positive observations end up in on the greater-than branch. In other words, the left splitting node views the positively labeled observations to be more likely to have smaller feature X values whereas the right splitting node views the positively labeled observations to be more likely to have larger feature X values. This difference is accounted using the is_greater_than_threshold (or chirality) flag, where a 1 is true (or greater than) and a 0 is false (or less or equal to).

Finally, for each feature, we can build two threshold distributions (one per chirality) and have them be weighted by the combined weight of \((1 + \Delta G_i) \times N_i\).

Example

After training the classifier model, we crawl through the entire forest and collect all the information specified in the previous section. This information empowers us to describe which thresholds dominates the splittings for, say num_hrefs (number of links in the article):

 

Screen Shot 2015-10-01 at 11.28.25 AM

In the above plot we see that there are two distributions. The orange one corresponds to the nodes where the chirality is greater_than and the red-ish (or rausch) one where the chirality is less-than-equal-to. These weighted distributions for num_hrefs indicate that whenever the feature num_hrefs is used to decide whether an article is a popular one (+1,400 shares) the dominant description is greater than ~15 links, which is illustrated by the spiking bin of the greater-than distribution near 15. Another interesting illustration of this method is on the global_rate_positive_words and the global_rate_negative_words, which are defined as the proportion of, respectively, positive and negative words in the content of the article. The former one is depicted as follows: Screen Shot 2015-10-01 at 11.44.23 AM

where, as far as the model is concerned, popular articles tend to be dominated by larger global_rate_positive_words where the cutoff is dominated by 0.03. However, a more interesting distribution is that for the global_rate_negative_words:

Screen Shot 2015-10-01 at 12.36.26 PM

where the distributions indicate use of negative words greater than ~0.01 words per content size adds a healthy dose of popularity whereas too much of it, say, more than ~0.02 will make the model predict a lower popularity. This is inferred from the spike of the greater-than distribution spiking at ~0.01 whereas the less-than-equal-to distribution spikes at ~0.02.

What’s Next

In TnS we are eager to make our models more transparent. This post only deals with one very specific way to inspect the inner workings of a trained model. Other possible approaches can be by asking the following questions:

  • What observations does the model see as clusters?
  • How do features interact in a random forest?
Want to work with us? We're hiring!