“Storytelling is the most powerful way to put ideas into the world.” - Robert McKee
In this article, I will try to explain the concepts of ensemble machine learning through small stories.
Lately, I haven’t come across a Kaggle competition-winning solution that doesn’t use ensemble machine learning. So it seems worthwhile to understand the basic concepts of ensemble learning through some examples.
Ensemble machine learning
Suppose you want to buy a house. To understand whether this is the perfect house for you, you will ask questions of your friends who have bought a house, real-estate brokers, neighbors, colleagues, and your parents. You will weigh each of the answers and try to arrive at a final decision. This, exactly, is ensemble learning.
Ensemble machine learning is the art of creating a model by merging different categories of learners, to obtain better predictions and stability.
Naive Ensemble machine learning techniques are:
- Max voting — Continuing the house example: if you ask 10 people about the house and 7 of them advise against buying it, your answer, by max voting, is not to buy the house.
- Averaging — If each of these people gives a probability that you should buy the house (say your parents tell you this house will be 70% suitable for you), you take the average of all these probabilities and decide accordingly.
- Weighted averaging — Suppose you trust your parents and close friends more than anyone else. You give higher weights (say 60%) to the probabilities given by these people and lower weights (40%) to the others, then take the weighted average to get the final probability.
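As a quick sketch, the three naive techniques amount to a few lines of NumPy. The votes, probabilities, and weights below are made-up numbers for illustration:

```python
import numpy as np

# Ten people's yes/no answers on buying the house (1 = buy, 0 = don't buy).
votes = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])

# Max voting: the majority answer wins (7 of 10 say "don't buy").
majority = np.bincount(votes).argmax()

# Averaging: each person gives a probability that the house suits you.
probs = np.array([0.7, 0.4, 0.3, 0.5, 0.6, 0.2, 0.4, 0.8, 0.3, 0.3])
avg = probs.mean()

# Weighted averaging: trust parents and close friends (the first three
# answers here, an arbitrary choice for illustration) more than the rest.
weights = np.array([0.6, 0.6, 0.6, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4])
weighted_avg = np.average(probs, weights=weights)

print(majority, round(avg, 2), round(weighted_avg, 3))  # 0 0.45 0.452
```

Note that the weights need not sum to one; `np.average` normalizes them internally.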
Advanced ensemble machine learning techniques are:
- Bagging — Also known as Bootstrap Aggregating.
I have a set of multi-colored balls in a bag. I ask a kid to pick 5 balls, then put the balls back and ask the kid to pick 5 balls again, over and over. This repetitive task is known as bootstrapping, or sampling with replacement.
“Bootstrapping is any test or metric that uses random sampling with replacement, and falls under the broader class of resampling methods.” -Wikipedia
Now, for every draw of 5 balls, I compute the probability of a white ball. If I get 2 white balls out of 5, the probability is 2/5, i.e. 40%; if I get 0 white balls out of 5, it is 0/5, i.e. 0%.
In the end, I average the probabilities over all the draws to estimate the probability of drawing a white ball from the bag.
So essentially, I am building a small model from each sample of balls drawn (with the balls put back each time), and at the end I combine the predictions of all these models to obtain the final answer: the probability. This is bagging.
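The ball story can be simulated in a few lines; the bag’s composition here (4 white balls out of 10) is an assumption for illustration:

```python
import random

random.seed(0)
bag = ["white"] * 4 + ["red"] * 3 + ["blue"] * 3  # 10 balls, 4 white

# Bootstrapping: draw 5 balls with replacement, many times over.
estimates = []
for _ in range(1000):
    sample = random.choices(bag, k=5)            # sampling with replacement
    estimates.append(sample.count("white") / 5)  # P(white) in this sample

# Average the per-sample probabilities, as in the story.
p_white = sum(estimates) / len(estimates)
print(round(p_white, 2))  # close to the true value of 0.4
```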
ML version of bagging:
- From the original dataset, multiple random samples are generated with replacement
- A weak learner (a base model such as a decision tree) is trained on each of these subsets; all these weak learners are independent of each other and can run in parallel
- Finally, the predictions from the weak learners are combined into the prediction of the strong learner (the final bagging model)
- One of the most popular examples of bagging is Random Forest
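A minimal bagging sketch with scikit-learn on a synthetic dataset: `BaggingClassifier` bootstraps the data and aggregates decision trees (its default base learner), while `RandomForestClassifier` adds random feature selection at each split on top of bagging.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Bagging: 50 decision trees, each fit on a bootstrap sample of the
# training data, with their predictions combined by voting.
bag = BaggingClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Random Forest: bagging plus random feature selection at each split.
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

print(bag.score(X_te, y_te), rf.score(X_te, y_te))
```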
Advantages of bagging:
a. Less overfitting — An aggregate of many weak learners typically outperforms a single learner trained on the whole set, and overfits less
b. Stable — Reduces variance on high-variance, low-bias datasets
c. Faster — Can be run in parallel, as each bootstrap sample can be processed on its own before combination
Disadvantages of bagging:
a. Expensive — Computationally costly if the dataset is large
b. Bias — On a dataset with high bias, bagging carries that high bias into the aggregate
c. Complex — The model loses interpretability.
- Boosting
One day, I thought, why not cook food on my own? So I cooked, but found I had added extra salt, so it was not tasty. The next time I cooked, I used less salt, but the food was too spicy. The next time, I used adequate spice and salt, but the food got burnt. So the next time, I used adequate spice and salt, cooked at a low flame, and stayed watchful. Finally, I cooked tasty food. At the start, I was a weak learner, but I kept learning from my own mistakes and in the end became a strong learner.
In boosting, weak learners (e.g. shallow decision trees) with relatively high bias are built sequentially, such that each subsequent weak learner aims to reduce the errors (mistakes) of the previous one. Each learner learns from its predecessors and updates the residual errors, so the next learner in the sequence learns from an updated version of the residuals. Each of these weak learners contributes some vital information to the prediction, enabling boosting to produce a strong learner by effectively combining the weak ones. The final strong learner brings down both the bias and the variance.
There are two types of boosting:
A. Weight-based boosting
i. A sample dataset is taken to train a weak learner. For example, I have three independent variables X1, X2, and X3, and a dependent variable Y which I have to predict.
ii. We get a prediction from the weak learner and calculate the absolute error of each row. Every weight-based algorithm, such as AdaBoost or LogitBoost, has its own way of computing these weights; here, just for reference, I have assigned a higher weight to the rows with higher error.
iii. Again a weak-learner model is created to predict Y, this time using the misclassification weights. The model tries to learn from the previous model so that it makes fewer errors where the weight is high, and so we get a better model each time.
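A minimal weight-based boosting sketch, assuming scikit-learn and a synthetic dataset. AdaBoost re-weights misclassified rows between rounds, exactly the mechanism described in the steps above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# AdaBoost builds shallow trees sequentially; after each round it increases
# the weight of misclassified rows, so the next weak learner focuses on them.
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```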
B. Residual-based boosting
i. Step i is the same as above: we train a weak learner on the randomly sampled data.
ii. In this case, we obtain the error (residual) of each row from the prediction.
iii. Using the error as my new dependent variable, I now make a prediction with another weak learner on the same independent variables.
iv. Now, my new boosted model is weak learner 1 + weak learner 2, and its prediction is the normalized sum of both models.
You can see that it learns quickly and reduces the bias (error) significantly.
This is just one example of a residual-based boosting algorithm. In a real scenario, every residual-based algorithm has its own way of reducing bias.
Some examples of residual-based algorithms are Gradient Boosting, XGBoost, LightGBM, and CatBoost.
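The two-weak-learner residual scheme in steps i–iv can be sketched with two shallow regression trees on made-up data, the second tree fit on the residuals of the first:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)  # noisy target

# Weak learner 1: a shallow tree fit on y.
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)       # errors of the first model

# Weak learner 2: fit on the residuals, same independent variables.
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# Boosted prediction = weak learner 1 + weak learner 2.
boosted = tree1.predict(X) + tree2.predict(X)

mse1 = np.mean((y - tree1.predict(X)) ** 2)
mse2 = np.mean((y - boosted) ** 2)
print(mse1 > mse2)  # the second stage reduces the training error
```

Real residual-based algorithms such as Gradient Boosting repeat this step many times and shrink each learner’s contribution with a learning rate.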
- Stacking
This time, I want to learn batting technique in cricket. I know several good batsmen who play near my place. When I asked them, I learned that one batsman is good at the leg glance, another at the hook and pull, and another at the sweep and drive. Rather than learning from a single batsman, I tried to learn each batsman’s specialist shots. Indeed, this helped me become, in the end, a better batsman than all of them.
What is different about this ensemble technique as compared to bagging and boosting?
- Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and fit on the same dataset (e.g. instead of samples of the training dataset).
- Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (e.g. instead of a sequence of models that correct the predictions of prior models).
What is stacking exactly?
Wolpert introduced the term stacking to the data science world in 1992: “Stacking is a technique in which the predictions of a collection of models are given as inputs to a second-level learning algorithm. This second-level algorithm is trained to combine the model predictions optimally to form a final set of predictions.”
1. Take the training dataset and create K-fold splits. In our example, let’s take 5 folds.
2. Train the first of three algorithms on K-1 folds and predict the Kth fold of the training data (we can use more or fewer than 3 algorithms). The 3 algorithms can be SVM, KNN, and Random Forest; the point is that, unlike boosting and bagging, it is not necessary to use only decision trees.
3. Repeat step 2 K-1 more times, each time taking another K-1 folds for training and predicting the held-out fold with the first algorithm (SVM).
4. Now that the first model is trained, predict the validation dataset C.
5. Repeat steps 2 to 4 for each of the remaining algorithms to obtain their predictions.
6. Now, train on dataset B using the predictions obtained from each of the algorithms, and based on this training, predict dataset C.
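The steps above map onto scikit-learn’s `StackingClassifier`, which generates the out-of-fold predictions internally; the dataset and algorithm choices here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Level-0 learners: three different algorithms, as in the steps above.
base = [("svm", SVC()), ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=1))]

# StackingClassifier produces 5-fold out-of-fold predictions from each base
# model and trains the level-1 learner (logistic regression) on them.
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(),
                           cv=5).fit(X_tr, y_tr)
print(stack.score(X_te, y_te))
```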
In this way, we perform stacking. Let’s move towards the last ensemble technique.
- Blending
The story of blending is the same as for stacking: here too, I learn from different batsmen to become a better batsman.
What is then the difference between stacking and blending?
Blending follows the same approach as stacking but uses only a holdout (validation) set carved out of the training set to make predictions. In other words, unlike stacking, predictions are made on the holdout set only. The holdout set and its predictions are used to build a model that is run on the test set.
1. Take the training dataset, a validation set (generally 10–30% of the training dataset), and the test dataset.
2. Train several algorithms (decision tree, SVM, Random Forest, or any other) on the training dataset.
3. Based on the above training, predict the validation dataset and the test dataset with all the algorithms.
4. Using the predictions from step 3 as input, train a second-level model on the validation-set predictions, and predict the test dataset.
Is it confusing?
If so, follow the pictorial explanation:
I train dataset (A) using 3 algorithms: Algo_1 (SVM), Algo_2 (Random Forest), and Algo_3 (KNN). Based on this training, I obtain the model parameters.
Now, based on the above training, I predict the validation dataset (B) and the test dataset (C) using the three algorithms. Above, we can see the prediction results of the three algorithms for both datasets B and C.
The next step is to use these predictions as input: I train on dataset B again, using the three algorithms’ predictions as features, and based on this training, I predict dataset C.
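A minimal blending sketch on a synthetic dataset, following the A/B/C split described above (all names and split sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=2)
# Split into train (A), validation (B), and test (C) sets.
X_a, X_rest, y_a, y_rest = train_test_split(X, y, test_size=0.4, random_state=2)
X_b, X_c, y_b, y_c = train_test_split(X_rest, y_rest, test_size=0.5, random_state=2)

models = [SVC(probability=True), RandomForestClassifier(random_state=2),
          KNeighborsClassifier()]

# Train each algorithm on A, then predict B and C.
preds_b, preds_c = [], []
for m in models:
    m.fit(X_a, y_a)
    preds_b.append(m.predict_proba(X_b)[:, 1])
    preds_c.append(m.predict_proba(X_c)[:, 1])

# Use the predictions on B as features to train the level-1 model,
# then predict C with it.
meta = LogisticRegression().fit(np.column_stack(preds_b), y_b)
print(meta.score(np.column_stack(preds_c), y_c))
```

Unlike the stacking sketch, no cross-validation is needed here: the level-1 model only ever sees predictions on the holdout set B.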
Stacking and blending are relevant when the predictions made by the different algorithms, or the errors of those predictions, are uncorrelated with each other. A simple example was my batting-technique story: I learned from batsmen with different expertise, not the same expertise.
This article covered ensemble machine learning: bagging, boosting, stacking, and blending. Using stories and pictures, complex techniques like boosting and bagging were explained in a lucid manner.
In this article, you have learned:
- What is ensemble machine learning?
- What are the different types of ensemble machine learning, both naive and advanced?
- The details of each ensemble machine learning technique
Hope you liked this article. Please clap to encourage me to write more such articles, and do share it with your friends or colleagues if you found it helpful.