Entropy and Information Gain
What are the Entropy and Information Gain metrics? They are two closely related metrics we can use to build a decision tree, as an alternative to the Gini Index.
To start, Entropy is defined as a measurement of the impurity or uncertainty within a dataset. In other words, entropy measures how mixed the class observations are within the data. Let me show you with the image sample below.
When the group is impure (the classes are mixed), the entropy is close to one, but when the group is pure (only one class within the data), the entropy is close to zero.
Entropy can be expressed with the following equation:

H(S) = − Σ p(c) · log₂ p(c)

where the sum runs over every class c, S is the dataset, and p(c) is the proportion of observations in S that belong to class c.
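To make this concrete, here is a minimal Python sketch of the entropy formula; the function name `entropy` and the count-based input are my own illustrative choices, not from the original:

```python
from math import log2

def entropy(class_counts):
    """Shannon entropy of a group, given the observation count of each class."""
    total = sum(class_counts)
    # p(c) * log2(1 / p(c)) is the same as -p(c) * log2(p(c)); classes with
    # zero observations contribute nothing to the sum.
    return sum((c / total) * log2(total / c) for c in class_counts if c > 0)

print(entropy([10, 10]))  # evenly mixed two-class group -> 1.0
print(entropy([20, 0]))   # pure group -> 0.0
```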
Well, then what is Information Gain? This metric is tied to entropy because, by definition, Information Gain is the difference in entropy before and after a split on a feature. In other words, Information Gain measures how much the impurity is reduced by splitting.
Information Gain can be expressed with the following equation:

IG(A, S) = H(S) − Σ p(t) · H(t)

where A is the feature used for splitting, the sum runs over every subset t in T (the collection of subsets created by splitting S on feature A), p(t) is the proportion of observations that fall into subset t, and H(t) is the entropy of that subset.
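Continuing the sketch, the gain is the parent entropy minus the weighted sum of the subset entropies; again, the name `information_gain` and the count-based inputs are illustrative assumptions:

```python
from math import log2

def entropy(class_counts):
    total = sum(class_counts)
    return sum((c / total) * log2(total / c) for c in class_counts if c > 0)

def information_gain(parent_counts, subsets):
    """IG(A, S) = H(S) - sum over every subset t of p(t) * H(t)."""
    n = sum(parent_counts)
    # Weight each subset's entropy by the fraction of observations it holds.
    weighted = sum((sum(t) / n) * entropy(t) for t in subsets)
    return entropy(parent_counts) - weighted

# A perfect split separates the classes completely, so all of the
# parent's entropy (1.0 here) is gained back.
print(information_gain([10, 10], [[10, 0], [0, 10]]))  # -> 1.0
```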
Since Information Gain measures how much the impurity is reduced by splitting, what we want is the highest Information Gain score: a higher gain means the split produced more homogeneous subsets.
So, with these metrics, how do we calculate which feature gives us the best split? There are several steps, and I will outline each of them in detail below.
First, we need to compute the entropy for the whole dataset. Let’s use our previous example dataset (16 green and 13 blue observations) to calculate it.
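Plugging the counts into the formula, the dataset entropy works out to roughly 0.992, which is close to one because the two classes are nearly balanced. A quick check in Python:

```python
from math import log2

p_green, p_blue = 16 / 29, 13 / 29
dataset_entropy = p_green * log2(1 / p_green) + p_blue * log2(1 / p_blue)
print(round(dataset_entropy, 3))  # 0.992
```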
After we have the dataset entropy, the next step is to measure the Information Gain for each candidate split. In our example above, we have X1 = 2 and X1 = 3 as candidate splits. Let’s start by calculating the entropy of each subset produced by the X1 = 2 split.
We have now acquired both subset entropies, but, if you recall the Information Gain equation, we still need to weight each subset entropy by its subset probability and sum the results. Putting all the numbers into the Information Gain equation, the Information Gain we get for X1 = 2 is 0.208.
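The actual subset counts for the X1 = 2 split come from the figure, so the counts below are hypothetical stand-ins purely to show the mechanics of the weighted sum; with these made-up numbers the gain comes out near 0.195, not the 0.208 from the example:

```python
from math import log2

def entropy(class_counts):
    total = sum(class_counts)
    return sum((c / total) * log2(total / c) for c in class_counts if c > 0)

parent = [16, 13]              # 16 green, 13 blue (from the example)
left, right = [13, 4], [3, 9]  # hypothetical subset counts, for illustration only
n = sum(parent)
weighted = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
gain = entropy(parent) - weighted
print(round(gain, 3))  # ~0.195 with these hypothetical counts
```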
In other words, with X1 = 2 as the splitter, we reduce the impurity by 0.208. Next, we calculate the Information Gain for the X1 = 3 split in the same way; the result for X1 = 3 is 0.168. Because X1 = 2 has the higher Information Gain, we choose X1 = 2 as our best splitter.
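Selecting the best splitter then amounts to taking the candidate with the highest gain; 0.208 and 0.168 are the values from the example above:

```python
gains = {"X1 = 2": 0.208, "X1 = 3": 0.168}
best_splitter = max(gains, key=gains.get)
print(best_splitter)  # X1 = 2
```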
These steps are repeated until no more splits are possible, or until some stopping criterion we set beforehand halts the calculation (similar to the Gini Index).
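To tie the steps together, here is a minimal end-to-end sketch of the greedy loop on a single numeric feature. Everything in it is an illustrative assumption rather than code from this article: the helper names (`best_split`, `build_tree`), the threshold search, and the max-depth stopping criterion.

```python
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(
        (labels.count(c) / total) * log2(total / labels.count(c))
        for c in set(labels)
    )

def best_split(xs, labels):
    """Greedily try every threshold on the feature; keep the best gain."""
    parent_h, n = entropy(labels), len(labels)
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, labels) if x <= t]
        right = [y for x, y in zip(xs, labels) if x > t]
        if not left or not right:
            continue
        gain = (parent_h
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

def build_tree(xs, labels, depth=0, max_depth=3):
    split = best_split(xs, labels)
    # Stop when the node is pure, no useful split exists, or max depth is hit.
    if len(set(labels)) == 1 or split is None or depth == max_depth:
        return max(set(labels), key=labels.count)  # majority-class leaf
    t, _ = split
    left = [(x, y) for x, y in zip(xs, labels) if x <= t]
    right = [(x, y) for x, y in zip(xs, labels) if x > t]
    return {
        f"x <= {t}": build_tree(*zip(*left), depth + 1, max_depth),
        f"x > {t}": build_tree(*zip(*right), depth + 1, max_depth),
    }

# Toy data: one feature (X1) with green/blue labels.
xs = [1, 2, 2, 3, 3, 4]
labels = ["green", "green", "green", "blue", "blue", "blue"]
print(build_tree(xs, labels))  # {'x <= 2': 'green', 'x > 2': 'blue'}
```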
In any case, both the Gini Index and the Entropy/Information Gain metrics are splitting criteria the algorithm can use to create a decision tree. Both approaches greedily search for the best feature at every split, which means the bigger the dataset, the longer the process can take.