Ensemble techniques, bagging and pasting, bootstrapping, hard voting, soft voting, and more
Complete Roadmap:
- How does Random-Forest work?
- Important terminology involved in the Random-Forest algorithm.
- Ensemble, Hard Voting and Soft Voting, Bagging and Pasting, Bootstrapping, Random Patches and Subspaces, Out-Of-Bag Evaluation.
- Live implementation of Random-Forest using the Scikit-Learn library.
- Plotting the decision trees of a Random-Forest.
Random-Forest:
Suppose you pose a complex question to thousands of random people, then aggregate their answers. In many cases, you will find that this aggregated answer is better than an expert’s answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.
Random-Forest is a bagging technique: it builds a forest of decision trees using bootstrapping and bagging. Each tree in the forest acts as a weak learner. Every weak learner is trained on a bootstrap sample of the training data (train_X, train_y) and makes its own prediction on test_X; Random-Forest then ensembles all these results into a single, more accurate prediction. In other words, the individual weak learners combine into one strong learner. By bootstrapping the dataset and ensembling all the decision trees, the forest overcomes the low-bias, high-variance (overfitting) problem of a single decision tree. A good real-life example of Random-Forest is Who Wants to Be a Millionaire (Kaun Banega Crorepati): the show's classic "Audience Poll" lifeline puts the question to the audience, who vote among the four available options, and the option with the most votes (in percentage) is selected.
All Terminology Involved in the Random-Forest Algorithm.
1. Ensemble Technique.
The ensemble technique performs bagging with bootstrapping (sampling records and features with replacement) to generate n decision trees. Each tree, or weak learner, in the forest predicts on the test data (classification or regression). Random-Forest then combines all these predictions into a single outcome: a voting classifier (hard voting or soft voting) for classification, or the average of the predictions for regression.
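A minimal sketch of the regression side of this, assuming Scikit-Learn's RandomForestRegressor on synthetic data; it verifies that the forest's prediction is simply the average of its individual trees' predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (stand-in for a real dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Manually average the 50 trees' predictions for the first 3 rows...
per_tree = np.stack([tree.predict(X[:3]) for tree in forest.estimators_])
print(per_tree.mean(axis=0))

# ...and confirm it matches the forest's own ensemble prediction.
print(forest.predict(X[:3]))
```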
2. Hard-Voting and Soft-Voting.
The majority-vote classifier is called a hard-voting classifier. For example, if three classifiers predict 1 and one predicts 0, hard voting selects 1 as the outcome. In Scikit-Learn, predicting the class with the highest class probability, averaged over all the individual classifiers, is called soft voting. If class 1 has an average probability of 0.75 while class 0 has 0.25, soft voting selects class 1.
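A minimal sketch of both strategies using Scikit-Learn's VotingClassifier; the three base classifiers and the make_moons data are arbitrary choices for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

estimators = [
    ("lr", LogisticRegression()),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),  # probabilities needed for soft voting
]

# Hard voting: each classifier casts one vote; the majority class wins.
hard_clf = VotingClassifier(estimators, voting="hard").fit(X, y)

# Soft voting: average the predicted class probabilities, pick the highest.
soft_clf = VotingClassifier(estimators, voting="soft").fit(X, y)

print(hard_clf.predict(X[:5]), soft_clf.predict(X[:5]))
```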
3. Bootstrapping:
In statistics, resampling with replacement is called bootstrapping. Let's unpack what that means: when the data is passed to the algorithm, it generates n samples with which to create and train the decision trees. Each of these n samples is built by selecting records and features (x, y) from the dataset with replacement, which means one or more records can repeat. According to statisticians, each bootstrap sample contains, on average, only about 63% of the distinct instances in the original dataset.
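A minimal NumPy sketch of bootstrapping, with a toy array standing in for the dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1000)  # toy "dataset" of 1000 records

# A bootstrap sample: same size as the data, drawn WITH replacement,
# so some records repeat and others never appear.
sample = rng.choice(data, size=len(data), replace=True)

# On average only ~63% of the distinct records make it into the sample.
print(np.unique(sample).size / data.size)  # ~0.63
```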
4. Random Patches and Random Subspaces:
The BaggingClassifier( ) from sklearn.ensemble supports this type of sampling, controlled by four hyperparameters: 1. max_features, 2. bootstrap_features, 3. max_samples, 4. bootstrap. The technique is useful when you are dealing with high-dimensional inputs such as images. Sampling both records and features (x, y) is called the random patches method; it is enabled by setting bootstrap=True with max_samples below 1.0, together with bootstrap_features=True and/or max_features below 1.0.
Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling the features (bootstrap_features=True and/or max_features below 1.0) is called the random subspaces method.
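A minimal sketch of both settings, assuming synthetic high-dimensional data from make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic high-dimensional data (stand-in for e.g. image features).
X, y = make_classification(n_samples=500, n_features=40, random_state=42)

# Random Patches: sample both training instances and features.
patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=True, max_samples=0.7,            # sample the instances
    bootstrap_features=True, max_features=0.5,  # sample the features
    random_state=42,
).fit(X, y)

# Random Subspaces: keep all instances, sample only the features.
subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=False, max_samples=1.0,           # keep all instances
    bootstrap_features=True, max_features=0.5,  # sample the features
    random_state=42,
).fit(X, y)
```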
5. Bagging and Pasting:
In machine learning, Bagging stands for Bootstrap Aggregating: resampling the data with replacement. All the data points are bootstrapped from the dataset, passed through the decision trees to generate n predictions, and all those predictions are aggregated; this whole process is known as bagging.
When the sampling is performed without replacement, it is called pasting.
Suppose we train five hundred decision trees and compare their predictions: one tree versus five hundred, the aggregation of trees performs better than any individual tree.
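A minimal sketch of that comparison, assuming Scikit-Learn's BaggingClassifier on the make_moons toy dataset:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single decision tree on its own.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Bagging: 500 trees, each trained on a bootstrap sample (bootstrap=True).
# Setting bootstrap=False instead would give pasting.
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500, bootstrap=True, random_state=42,
).fit(X_train, y_train)

print(accuracy_score(y_test, tree.predict(X_test)))  # single tree
print(accuracy_score(y_test, bag.predict(X_test)))   # ensemble, typically higher
```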
6. Out-Of-Bag Evaluation.
With bagging, some instances are sampled several times for any given predictor, while others are not sampled at all. By default, RandomForestClassifier( ) samples the training set with replacement (bootstrap=True). This means that, on average, each predictor sees only about 63% of the training instances. The remaining ~37% of the training instances, which are never sampled, are called out-of-bag (OOB) instances.
Since a predictor never sees the OOB instances during training, it can be evaluated on them; the result is called the OOB score. In Scikit-Learn you can set oob_score=True in RandomForestClassifier( ) to request an automatic OOB evaluation after training. The OOB score is useful for estimating how the model will perform on new instances.
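A minimal sketch, assuming a synthetic dataset from make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# bootstrap=True is the default; oob_score=True evaluates each tree on
# the ~37% of training instances it never saw during training.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            oob_score=True, random_state=42).fit(X, y)

print(rf.oob_score_)  # accuracy estimate without a separate validation set
```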
One more popular term is the out-of-bag error. Each instance is an OOB instance for some of the predictors, which can therefore predict on it; the difference between the actual value and that prediction (actual minus predicted) is called the OOB error.
Implementation of the Random Forest Classifier in Scikit-Learn.
The data looks like the image below.
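A minimal, self-contained sketch of the workflow, using Scikit-Learn's built-in Iris dataset as a stand-in for the data shown in the image:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small classification dataset.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Train a forest of 100 decision trees.
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = rf_clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```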
Plot the Decision Trees of the Random-Forest:
You can plot any number of trees from the forest; here I'm using only 7 trees for reference.
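A minimal sketch, assuming the rf_clf forest and iris data from the implementation sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Plot the first 7 trees of the forest trained above (rf_clf).
fig, axes = plt.subplots(1, 7, figsize=(28, 4))
for i, ax in enumerate(axes):
    plot_tree(rf_clf.estimators_[i],
              feature_names=iris.feature_names,
              class_names=list(iris.target_names),
              filled=True, max_depth=2, ax=ax)  # max_depth keeps the plots readable
    ax.set_title(f"Tree {i + 1}")
plt.show()
```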
Reference :
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, by Aurélien Géron (O'Reilly).