Let’s understand with an example:
One of the most popular nursery rhymes, where rain decides whether Johny (or Arthur, depending on the version) gets to play outside, is our sample today. The only change is that not just rain but any bad weather affects the child’s play, and we will use a decision tree (DT) to predict his presence outside.
The data looks like this:
‘Temperature’, ‘WindSpeed’, ‘Outlook’, and ‘Humidity’ are the features, and ‘Play’ is the target variable. Since the data has only categorical features and no missing values, we can use ID3.
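Here is a minimal sketch of the data as a pandas DataFrame. Note the assumption: every count used in this article (9 Yes / 5 No, 4 Hot / 6 Mild / 4 Cool, the per-value entropies, the gains) matches the classic Play Tennis dataset, so that is what the sketch encodes; the variable name data is mine.

import pandas as pd

# Presumably the classic Play Tennis data: all the counts and entropies
# computed in this article match it row for row.
data = pd.DataFrame({
    'Outlook':     ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
                    'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                    'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity':    ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                    'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'WindSpeed':   ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                    'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'Play':        ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                    'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})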
Let’s go through the splitting criteria before jumping into the algorithm itself. For simplicity, we will discuss each criterion only for the binary classification case.
Entropy: This is used to measure the heterogeneity of a sample; for binary classification its value lies between 0 and 1. If the sample is homogeneous, the entropy is 0; if the sample has equal proportions of the two classes, the entropy is 1.
S = -(p * log₂p + q * log₂q)
where p and q are the respective proportions of the 2 classes in the sample.
This can also be written as:
S = -(p * log₂p + (1-p)* log₂(1-p))
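As a quick sanity check, this formula is a one-liner in Python. A minimal sketch (the function name entropy is my own choice):

import math

def entropy(p):
    # Binary entropy of a sample in which a fraction p belongs to one class.
    if p in (0.0, 1.0):
        return 0.0  # a homogeneous sample has zero entropy
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

print(entropy(0.5))     # 1.0    -> equal class proportions
print(entropy(9 / 14))  # ~0.940 -> the root node of our example, computed next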
Information Gain: This is the difference between the entropy of a node and the weighted average entropy of the child nodes created by a split. In the notation used below, the gain of splitting a node with entropy S on a feature is G(feature) = S - I(feature), where I(feature) is the weighted average entropy of the children. The feature which gives maximum Information Gain is chosen for the split.
Let’s understand this with our example:
The entropy of the root node (9 Yes and 5 No):
S = -(9/14)*log₂(9/14) - (5/14)*log₂(5/14) = 0.94
There are 4 possible ways to split the root node (‘Temperature’, ‘WindSpeed’, ‘Outlook’, and ‘Humidity’). Thus we compute the weighted average entropy of the child nodes for each of these candidate splits:
I(Temperature) = Hot*E(Temperature=Hot) + Mild*E(Temperature=Mild) + Cool*E(Temperature=Cool)
where Hot, Mild, and Cool represent the proportions of the 3 values in the data.
I(Temperature) = (4/14)*1 + (6/14)*0.918 + (4/14)*0.811 = 0.911
Here the entropy for each value is computed by filtering the sample on that value of the feature and then applying the entropy formula. For example, E(Temperature=Hot) is computed on the rows of the original sample where Temperature is Hot (in this case we have an equal number of Yes and No, which means the entropy equals 1).
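In code, the weighted average entropy can be sketched as follows, reusing the data DataFrame assumed earlier; the helper names class_entropy and average_entropy are mine:

import math

def class_entropy(labels):
    # Entropy of a label column, e.g. the 'Play' values of a filtered sample.
    proportions = labels.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in proportions)

def average_entropy(data, feature, target='Play'):
    # Weighted average entropy of the child nodes created by splitting on `feature`.
    total = 0.0
    for value, subset in data.groupby(feature):
        weight = len(subset) / len(data)  # e.g. 4/14 for Temperature=Hot
        total += weight * class_entropy(subset[target])
    return total

print(round(average_entropy(data, 'Temperature'), 3))  # 0.911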
We compute the Information Gain of splitting on Temperature by subtracting the average entropy for Temperature from the root node’s entropy.
G(Temperature) = S - I(Temperature) = 0.94 - 0.911 = 0.029
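Continuing the sketch, the Gain is just this subtraction (information_gain is again my own helper name):

def information_gain(data, feature, target='Play'):
    # Entropy of the current node minus the weighted entropy of its children.
    return class_entropy(data[target]) - average_entropy(data, feature, target)

print(round(information_gain(data, 'Temperature'), 3))  # 0.029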
Likewise, we compute the Gain for all four features and choose the one with the maximum Gain.
G(WindSpeed) = S - I(WindSpeed) = 0.94 - 0.892 = 0.048
G(Outlook) = S - I(Outlook) = 0.94 - 0.693 = 0.247
G(Humidity) = S - I(Humidity) = 0.94 - 0.788 = 0.152
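Choosing the split then reduces to taking the maximum over the candidate features, reusing the helpers sketched above:

features = ['Temperature', 'WindSpeed', 'Outlook', 'Humidity']
gains = {f: round(information_gain(data, f), 3) for f in features}
print(gains)                      # Outlook tops the list at ~0.247
print(max(gains, key=gains.get))  # 'Outlook'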
Outlook has the maximum Information Gain, so we split the root node on Outlook, and each child node represents the sample filtered by one of the values of Outlook, i.e. Sunny, Overcast, and Rainy.
Now we repeat the same process, treating each newly formed node as a root node with its filtered sample: compute its entropy, evaluate each candidate split by computing the weighted average entropy of the resulting children, and subtract that from the current node’s entropy to get the Information Gain. Please note that ID3 doesn’t allow reusing a feature that an ancestor node has already split on; thus each feature can be used at most once along any path of the tree.
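Putting the pieces together, here is a compact recursive sketch of ID3 under the same assumptions as the snippets above. Note how each recursive call drops the chosen feature from the candidate list, which is exactly the single-use rule just described:

def id3(data, features, target='Play'):
    labels = data[target]
    if labels.nunique() == 1:  # pure node: make it a leaf
        return labels.iloc[0]
    if not features:           # no features left: majority-class leaf
        return labels.mode()[0]
    # Split on the feature with maximum Information Gain.
    best = max(features, key=lambda f: information_gain(data, f, target))
    remaining = [f for f in features if f != best]  # each feature used at most once per path
    return {best: {value: id3(subset, remaining, target)
                   for value, subset in data.groupby(best)}}

tree = id3(data, ['Temperature', 'WindSpeed', 'Outlook', 'Humidity'])
print(tree)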
Here is the final tree formed by all the splits:
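For reference, on the Play Tennis data assumed above the splits work out to this tree: Overcast is a pure Yes node, Sunny splits further on Humidity, and Rainy splits on WindSpeed.

Outlook
├── Sunny → Humidity
│   ├── High → No
│   └── Normal → Yes
├── Overcast → Yes
└── Rainy → WindSpeed
    ├── Strong → No
    └── Weak → Yes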
A simple implementation in Python can be found here.
I tried my best to explain how ID3 works, but I know you might still have questions. Please let me know in the comments, and I would be happy to answer them all.
Thanks for reading!