Random Forest for a non-Technical Audience

One of the most widely used algorithms today is actually quite tricky to explain

If you haven’t already, you should read my article on Decision Trees here. Understanding the Decision Tree and its flaws is paramount to understand why the Random Forest exists, and why it is powerful.

According to KDNuggets the top 3 most commonly used algorithms by Data Scientists are Linear/Logistic Regression, Decision Tree or Random Forest and Gradient Boosting models. With 78% of responders using Decision Trees or Random Forests, it’s clear that this model is something we should understand.

If you are unable to explain a model to a business audience, it unlikely that it will get approval or buy-in from stakeholders. This article aims to explain Random Forest in a way that everyone can understand.

Random Forest can be used for both Classification and Regression, but I’m going to focus on Classification here.

This model is an ensemble method, meaning we use lots of models together. The main ensemble techniques are bagging and boosting. Random Forest falls within the bagging category. This will be explained later.

A quick recap on Decision Trees. If you aren’t familiar with these concepts, you can read my article here.

We want to split our data by one of our features, the feature is the node and the resulting data is a leaf.
We want to minimise impurity in each leaf.
Impurity is calculated by using the proportions of each class in the leaf. We can use the Gini or Entropy score.
The feature chosen at each node is the one that minimises in the Gini score

There are a few flaws with this approach

The algorithm is greedy which means we are selecting the best feature based on the split at that node, not the resulting splits in future nodes
Trees can overfit easily when they get very deep. We can limit this by pruning the tree but this reduces predictive power.

The answer is to introduce randomness and use more trees hence the name; Random Forest.

Summary

The algorithm can be broken down into the following steps, which I’ll explain one by one:

Create a new dataset from the existing dataset (bootstrapping).
Fit a Decision Tree on this dataset.
2a. Randomly sample X features at each node of the tree.
Repeat hundreds of times.
Combine the predictions from every tree (aggregating).

Bootstrapping

Bootstrapping is a statistical method of resampling that gained popularity in the 80s. It is used to calculate confidence intervals and standard error as well as playing the key part in the the Random Forest model. Here’s how it works…

First, we take our dataset, which in this is example is 12 fruits. Hopefully the pictures are self explanatory but as in the Decision Tree example, we have Apples, Pears and Grapes of different sizes and colours.

Our sample dataset. Image by Author.

Let’s create our first Bootstrap sample. This is done by sampling with replacement which simply means each sample can be taken multiple times. I’ve highlighted an example below where the yellow pear was sampled twice.

Creating a single bootstrapped sample. Image by Author.

It’s worth noting that although we could create some extreme samples by doing this, the Central Limit Theorem states that the means of our samples will be normally distributed around the mean of our original dataset.

Let’s create 5 bootstrap samples for this example (the scikit-learn default is 100).

Image by Author.

Hopefully, you can see how doing this could bring more randomness into our data. Consider how a decision tree might look when trained on each of these bootstrapped samples and which features would generate the best Gini score, causing different root nodes.

Randomise Features

We also have another step to introduce randomness. This is selecting a fixed number of features, randomly, from our dataset each time we create a node in a Decision Tree.

In our previous example, we had just 3 features; colour, shape and size. To illustrate this example slightly better, let’s imagine we also have a sweetness score, from 1–5.

By default, scikit-learn sets the max features as the square root of the total. In this example we have 4 features, so the max we will use at each node is 2.

When creating each node of the decision tree, we randomly select 2 features. This creates very different trees across our bootstrapped samples.

Features can be reused at each stage, however the algorithm will not reselect binary features unless it is in separate branches, because the split has already been made.

Fitting Decision Trees

We can now fit a decision tree on each bootstrapped dataset.

Let’s follow an example through for bootstrapped dataset 1. I have only approximated the Gini scores here so if you calculate them yourself, you might see some differences.

We are fitting a regular Decision Tree on the sample but with a small difference. At each node, we can only pick from 2 randomly selected features.

For the root node, the two randomly chosen features are Size and Shape. The root node is selected in the standard way. We calculate the Gini score and pick the best feature. In this case, its Size.

The root node of a Decision Tree. Image by Author.

We achieve purity for the leaf on the left but the leaf on the right still needs further classification.

Proceeding with the right-hand leaf, the randomly selected features at this node are Colour and Sweetness. As usual, we select the best feature using the Gini score. Colour = Red gives the best score here, so this is used as the next node.

Second node of the Decision Tree. Image by Author.

The node on the right isn’t pure yet, so we create another node from 2 more randomly selected features. Here, we get Shape and Size. Note that Size can be used again but we wouldn’t pick 3 cm as everything in this branch is already over 3cm.

Shape gives us purity, so the tree is complete

The last node to achieve purity. Image by Author.

Here’s the final tree.

Remember, at each node, we could only select from 2 features. These are randomly selected and we don’t control what they are.

Decision Tree on bootstrapped sample 1. Image by Author.

Repeat 100’s of Times

We’ve only used 5 samples here, the scikit-learn default is 100 but typically we’ll use several hundred when modelling. Increasing the number of trees will benefit the model, but suffers from diminishing returns — eventually more trees won’t produce significantly better results but will increase computation.

We now have hundreds of different samples, each with different combinations of features and with a decision tree fit on each one.

Here’s some example trees for our 5 samples.

Example trees for each bootstrapped sample. Image by Author.

How do we make predictions with all these trees?

Making Predictions

This is done simply by aggregate voting. Every tree gets an equally weighted vote. This way, poor predictions and overfit trees are likely to be compensated for by the other trees.

We simply add up the votes for each class to calculate a probability score which can be used for classification.

In this example, trees 1,2,3 and 5 correctly predict our new data as grapes. Tree 4 predicts apples. This gives a result of Grapes with a probability of 80%.

Image by Author.

Scoring

In Classification, we can assess the Random Forest model with typical metrics; accuracy, precision, recall and so on. However, the use of bootstrapping also provides a unique metric called out-of-bag scoring.

When creating bootstrapped samples, typically 1/3 of the original samples will not be included. These can be thought of as unseen data for each individual tree.

In the image below, we can see the out-of-bag samples for each bootstrap sample. You can see this isn’t close to 1/3 for each tree because of my not-so-random method of creating the samples.

Image by Author.

To create an out-of-bag score, we take each sample and generate a prediction for that sample from every tree that wasn’t trained on that sample.

In the example above, trees 2 and 4 didn’t see the large green pear, so we could generate a prediction from these trees on this sample because they haven’t been trained on it.

We can do this for every sample and calculate the percentage of correct predictions; the out-of-bag score.

Hyperparameters

I’m not going to go into a lot of detail here, rather than go through all the parameters, I’ll just explain the main concepts. If you aren’t familiar with hyperparameters, you can skip this section.

The majority of Random Forest parameters relate to the depth of the individual Decision Trees. Remember, if the depth of trees is left unchecked, they will overfit. However, bagging should prevent this from having too much impact. We can still get a performance boost by pruning trees. There are a number of different ways to prune trees using hyperparameters, the main thing to remember is that some of the hyperparameters overlap so make sure you aren’t limiting your trees with one parameter while trying to test another.

It’s always import to considr the number of individual Decision Trees we use. It’s important to find the balance between a powerful model and the amount of computer power you are using, bearing in mind that after a point, more trees won’t add significant model improvement.

Hyperparameters can be optimised using cross-validation on the accuracy or even the out-of-bag score.

Photo by Marita Kavelashvili on Unsplash

At the start of this article, I pointed out two key flaws in the standard Decision Tree model:

The algorithm is greedy which means we are selecting the best feature based on the split at that node, not the resulting splits in future nodes
Trees can overfit easily when they get very deep. We can limit this by pruning the tree but this reduces predictive power.

Random Forest tackles these in the following ways:

The trees are still greedy but we are now testing hundreds of permutations of trees with different combinations of features. We don’t need to find the optimal tree because we are using the power of many trees.
Overfitting is limited by aggregating the results from many trees, and only allowing random features at each decision boundary. Bootstrap samples can also help with overfitting. We can still prune the individual trees if needed.

Hopefully you can see how the Random Forest improves Decision Trees by introducing Bootstrapping and Aggregation. The combination of these is known as bagging.

The algorithm is still highly interpretable, we can extract feature importances. However, we can’t dive into the algorithm and look at each individual tree, so we can’t plot tree diagrams and decision boundaries as we can with regular Decision Trees.

Random Forest is one of the most widely used tree-based algorithms today but is one of many ensemble methods.

As usual, please feel free to contact me directly with any questions.

One of the most widely used algorithms today is actually quite tricky to explain

Summary

Bootstrapping

Randomise Features

Fitting Decision Trees

Repeat 100’s of Times

Making Predictions

Scoring

Hyperparameters

Footer