Boosting is an ensemble method that combines several weak learners into a strong learner sequentially. In boosting methods, we train the predictors sequentially, each trying to correct its predecessor.
Gradient Boosting combines Gradient Descent and Boosting. In gradient boosting, each new model minimizes the loss function of its predecessor using the Gradient Descent method. This procedure continues until a sufficiently good estimate of the target variable has been achieved.
Unlike bagging-style ensemble techniques, gradient boosting builds a series of trees in which each tree tries to correct the mistakes of its predecessor.
For understanding gradient boosting, try thinking about a golfer whacking a golf ball towards the hole, covering a certain ground distance on every shot. He repeatedly hits the ball, working his way towards the hole by reassessing the direction and magnitude after each stage.
After each stroke, the golfer determines the appropriate nudge by computing the difference between the actual hole position and his current approximation; this difference is known as the residual vector. The golfer's situation is similar to what we do in gradient boosting: repeatedly trying to get as close to the target variable as possible.
The objective of the Gradient Boosting algorithm is to minimize the loss function, i.e. the difference between the actual class and the predicted class. Classification algorithms frequently use the logarithmic loss function, whereas regression algorithms use squared errors. There are many standard loss functions supported by gradient boosting classifiers, on the condition that the loss function is differentiable.
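As a quick illustration of the two losses mentioned above, the snippet below computes log loss for a single classification prediction and squared error for a single regression prediction. The sample values are made up purely for illustration.

```python
import math

# Hypothetical single-sample values to illustrate each loss.
y_true, p_pred = 1, 0.8          # true class and predicted probability

# Log loss (binary cross-entropy), commonly used for classification.
log_loss = -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

y_obs, y_hat = 3.0, 2.5          # observed and predicted regression targets

# Squared error, commonly used for regression.
squared_error = (y_obs - y_hat) ** 2

print(log_loss, squared_error)
```

Both losses are differentiable in the prediction, which is exactly the condition gradient boosting needs in order to fit each new tree to the negative gradient.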
Typically, decision trees are used as the weak learners. Individually they classify the data poorly, which means they have a high error rate, and they are insufficient on their own to make good predictions.
As more learners are added into the model, the output of trees can be added together to minimize the loss in the prediction. This process employs a similar procedure to gradient descent on the calculated loss, thereby reducing the loss iteratively.
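This additive process is what an off-the-shelf implementation carries out internally. A minimal sketch using scikit-learn's `GradientBoostingClassifier` (assuming scikit-learn is installed; the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 shallow trees (weak learners), each fitted to reduce the
# remaining loss of the ensemble built so far.
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```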
STEPS TO GRADIENT BOOSTING CLASSIFICATION
STEP 1: Fit a simple linear regression or a decision tree on the data [x = input, y = output].
STEP 2: Calculate the error residuals by subtracting the predicted target values from the actual values [e1 = y_true − y_predicted1].
STEP 3: Fit a new model on the error residuals as the target variable, keeping the input variables the same [giving e_predicted1].
STEP 4: Add the predicted residuals to the previous predictions [y_predicted2 = y_predicted1 + e_predicted1].
STEP 5: Fit the next model on the remaining residuals [e2 = y_true − y_predicted2].
Repeat steps 2 to 5 until the model starts overfitting or the sum of the residuals stops changing.
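The steps above can be sketched directly in code. The following is a hand-rolled version for squared-error loss (the regression case, where the residuals are simply y_true minus the current prediction), using shallow scikit-learn trees as the weak learners; the data and hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# Step 1: start from a simple initial prediction (here, the mean of y).
y_pred = np.full_like(y, y.mean())
learning_rate = 0.1
trees = []
for _ in range(100):
    residuals = y - y_pred                     # Step 2: error residuals
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                     # Step 3: fit a tree to the residuals
    y_pred += learning_rate * tree.predict(X)  # Step 4: add predicted residuals
    trees.append(tree)                         # Step 5: repeat on what remains

print(np.mean((y - y_pred) ** 2))  # training MSE shrinks as trees are added
```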
Let us try to understand gradient boosting classification by considering a sample dataset. It has several input features x1, x2, x3, and we try to predict the target variable y, which is a binary output.
Step 1: Make an initial guess using the log of the odds of the target variable.
To turn this log-odds value into a class probability, we apply the logistic (sigmoid) transformation, the two-class case of softmax.
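For example, with six hypothetical samples of which four are positive, the initial guess and its probability form look like this (the counts are made up for illustration):

```python
import numpy as np

y = np.array([1, 1, 1, 1, 0, 0])        # hypothetical binary targets
pos, neg = y.sum(), len(y) - y.sum()

log_odds = np.log(pos / neg)             # initial prediction: log(4/2)
prob = 1.0 / (1.0 + np.exp(-log_odds))   # sigmoid: back to a probability

print(log_odds, prob)                    # every sample starts at p = 4/6
```

Note that the sigmoid of the log-odds simply recovers the positive-class fraction, so every sample starts with the same predicted probability.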
Step 2: Calculate the error residuals (pseudo-residuals) by subtracting the predictions from the observed values. The table then looks as follows:
Step 3: Build the classification tree.
This is an example of a classification tree with just two leaves. In practice, however, gradient boosting trees often have many more leaves, and a single leaf can contain multiple residual values. Gradient boosting therefore applies a transformation to compute each leaf's output for classification. Consider the following tree:
Therefore, the value of the second leaf is given by the following transformation:
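For log loss, the standard leaf-value transformation is the sum of the residuals in the leaf divided by the sum of p(1 − p) over the samples in that leaf, where p is each sample's previously predicted probability. The numbers below are hypothetical leaf contents chosen only to illustrate the arithmetic:

```python
# Hypothetical residuals landing in one leaf, and the previous
# predicted probability of each of those samples.
residuals = [0.33, -0.67]
prev_probs = [0.67, 0.67]

# Leaf output = sum of residuals / sum of p * (1 - p).
leaf_value = sum(residuals) / sum(p * (1 - p) for p in prev_probs)
print(leaf_value)
```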
Step 4: Make the prediction.
The learning rate defines the contribution of the new tree. The new log-odds prediction can then be converted to a probability using the sigmoid function.
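Concretely, the update adds the learning rate times the leaf value to the previous log-odds, and the sigmoid converts the result back to a probability. The values below are illustrative placeholders, not taken from a real dataset:

```python
import math

prev_log_odds = 0.69   # hypothetical log-odds from the initial guess
learning_rate = 0.1    # shrinks each tree's contribution
leaf_value = -0.77     # hypothetical output of the leaf this sample falls in

# New prediction = old log-odds + learning_rate * leaf output.
new_log_odds = prev_log_odds + learning_rate * leaf_value
new_prob = 1.0 / (1.0 + math.exp(-new_log_odds))  # sigmoid transformation

print(new_log_odds, new_prob)
```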
As you can see, the predicted probability has decreased from the previous estimate.
Step 5: Repeat these steps until the model starts overfitting or the sum of the residuals stops changing.
One of the main features of gradient boosting is that it offers tuning of several parameters. However, finding an optimal combination of parameters can be a challenging and often time-consuming task. The most common hyperparameters are as follows:
Learning rate: It controls the contribution of each tree and how quickly the algorithm converges using gradient descent. Smaller values reduce the chance of overfitting but make the algorithm take longer to converge; larger values make the algorithm fast but may never reach an optimal fit.
Number of trees: The objective here is to find the optimal number of trees, using cross-validation, so as to minimize the loss function. Early stopping can also be employed to find this number automatically.
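Early stopping is built into scikit-learn's implementation: training halts once the score on a held-out validation fraction stops improving, so the tree count is chosen automatically rather than by hand. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on the number of trees
    validation_fraction=0.2,   # hold out 20% to monitor the loss
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=0,
)
clf.fit(X, y)
print(clf.n_estimators_)       # number of trees actually fitted
```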
Tree depth: The number of splits in a tree is known as its depth (d). The algorithm often works well with d = 1; d can be greater than 1 but is usually kept below 10.
Subsample: It specifies the fraction of training instances used to train each tree. If subsample = 0.20, then a randomly selected 20% of the training instances is used for each tree.
ADVANTAGES OF GRADIENT BOOSTING
Gradient Boosting supports various loss functions and provides a number of hyperparameter tuning options, which makes it very flexible.
It works well with numerical as well as categorical features as-is.
No data imputation is required.
DISADVANTAGES OF GRADIENT BOOSTING
It keeps minimizing all errors and is hence prone to overfitting; cross-validation should be used to counteract this.
The technique often requires many trees, so it can be time- and memory-intensive.
Gradient Boosting is one of the most powerful ensemble algorithms and is well suited to both regression and classification tasks. It is prone to overfitting, but the various methods discussed above can be employed to deal with this.
It is more computationally demanding than many other machine learning algorithms, but a successful data scientist or machine learning engineer should understand how a gradient boosting algorithm works.