In this blog I am capturing some key points about decision trees. When preparing for a data science interview, it is natural to brush up on the fundamentals, and here you will find the essential definitions related to decision trees.
Interpretation and Construction
It resembles an upside-down tree.
A decision tree splits the data into multiple sets. Then, each of these sets is further split into subsets to arrive at a decision.
If a test splits the data into more than two partitions, this is called a multiway decision tree.
Decision trees make it very easy to determine the important attributes: the algorithm performs tests on attributes in order to split the data into multiple partitions.
So a decision tree can go back and tell us the factors leading to a given decision. (Difference from SVMs: if an SVM diagnoses a person with heart disease, we cannot figure out the reason behind the prediction, whereas a decision tree gives us the exact reason.)
The difference between decision tree classification and decision tree regression is that in regression each leaf predicts a continuous value rather than a class label; in standard regression trees this is the mean of the training points that reach the leaf, while variants such as model trees fit a linear regression model at each leaf.
Try to split the nodes such that the resulting nodes are as homogeneous as possible.
Go step-by-step, picking an attribute and splitting the data such that the homogeneity increases after every split. Stop splitting when the resulting leaves are sufficiently homogeneous.
Common methods to measure homogeneity are the Gini index, entropy and information gain (for classification), and R-squared (for regression).
Gini index: it uses the probabilities of the class labels as an indicator of homogeneity. In the convention used here, the Gini index of a set is the sum of the squared class probabilities, so if the data set is completely homogeneous — the probability of finding a data point with one of the labels is 1 and the probability of finding a data point with the other label is 0 — the Gini index is 1.
Choose the attribute whose split yields the higher (weighted) Gini index.
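A minimal sketch of the Gini index in the convention above (sum of squared class probabilities, so 1 means fully homogeneous); the function name is mine, for illustration:

```python
# Gini index as used in this post: sum of squared class probabilities.
# 1 means a fully homogeneous set; lower values mean more mixing.
from collections import Counter

def gini_index(labels):
    """Return sum(p_i^2) over the class labels in `labels`."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_index(["yes", "yes", "yes", "yes"]))  # 1.0 (homogeneous)
print(gini_index(["yes", "yes", "no", "no"]))    # 0.5 (maximally mixed)
```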
Entropy and Information Gain:
Entropy quantifies the degree of disorder in the data; for a two-class problem its value, like the Gini index, varies from 0 to 1.
If a data set is completely homogeneous, then the entropy of such a data set is 0, i.e. there’s no disorder.
Information gain measures how much the entropy has decreased between the parent set and the partitions obtained after splitting.
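The two definitions above can be sketched directly; the function names are mine, and the two-class example splits a maximally mixed parent into two pure halves:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of the class labels in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Entropy of the parent minus the size-weighted entropy of the partitions."""
    n = len(parent)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]
print(information_gain(parent, [left, right]))  # 1.0: a perfect split
```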
Predictions made by a decision tree are easily interpretable.
It can seamlessly handle all kinds of data — numeric, categorical, strings, Boolean, and so on.
It does not require normalization, since it only has to compare values within a single attribute.
Decision trees often give us an idea of the relative importance of the explanatory attributes that are used for prediction.
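A quick sketch of reading off attribute importance, using scikit-learn (my choice of library; the post itself later refers to R's rpart). The fitted tree's `feature_importances_` reports how much each attribute reduced impurity across all splits, normalized to sum to 1:

```python
# Illustrative only: iris is a stand-in dataset, not from the post.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)
# Print each attribute's relative importance for the tree's predictions.
for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```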
Decision trees tend to overfit the data. If allowed to grow with no check on its complexity, a tree will keep splitting until it has correctly classified (or rather, memorised) all the data points in the training set.
Decision trees also tend to be very unstable, which is a consequence of overfitting: a few changes in the data can change a tree considerably.
To control overfitting in decision trees: truncation and pruning.
Truncation: stop the tree while it is still growing, so that it does not end up with leaves containing very few data points.
Limit the minimum size of a partition after a split: if a node has enough samples, split it; if not, do not split it further.
Set a minimum threshold on the change in the measure of homogeneity, so that a split is made only if it improves homogeneity by at least that amount.
Limit the depth of the tree
Set a minimum threshold on the number of samples that appear in a leaf.
Set a limit on the maximum number of leaves present in the tree.
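The truncation criteria above map onto hyperparameters in scikit-learn (my choice for the sketch; the names differ from rpart's), roughly one parameter per criterion:

```python
# Illustrative only: iris and the specific thresholds are my assumptions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    min_samples_split=10,        # need enough samples in a node to attempt a split
    min_impurity_decrease=0.01,  # split only if homogeneity improves by this much
    max_depth=3,                 # limit the depth of the tree
    min_samples_leaf=5,          # minimum number of samples in a leaf
    max_leaf_nodes=8,            # cap on the number of leaves
    random_state=0,
).fit(X, y)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```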
Pruning: let the tree grow to any complexity, then cut its branches in a bottom-up fashion, starting from the leaves. Chopping off branches decreases the tree's complexity and helps reduce overfitting, and in practical implementations it is more common to use pruning strategies to avoid overfitting.
The data set is divided into three parts: the training set, the validation set and the test set. The validation set is used to tune hyperparameters, i.e. after deciding on a set of hyperparameters for a tree, you check the accuracy of the tree on the validation set.
To decide whether to prune a branch, check the performance of the pruned tree on the validation set. If the accuracy of the pruned tree is higher than the accuracy of the original tree (on the validation set), then you keep that branch chopped.
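A minimal sketch of validation-set pruning, assuming scikit-learn and its cost-complexity pruning (`ccp_alpha`) as the pruning mechanism; the dataset and splits are illustrative, not from the post:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Three-way split: train, validation (for pruning decisions), test (final check).
X, y = load_breast_cancer(return_X_y=True)
X_grow, X_test, y_grow, y_test = train_test_split(X, y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_grow, y_grow, random_state=0)

# Grow a full tree, then score progressively heavier-pruned versions on the
# validation set and keep the pruning strength (alpha) that does best there.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)  # guard against tiny negative float error
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
print(pruned.get_n_leaves(), "leaves; test accuracy:", pruned.score(X_test, y_test))
```

Note that the test set is touched only once, at the very end; all pruning decisions are made against the validation set.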
There are various hyperparameters in R's rpart library that let you truncate the tree, namely minsplit, minbucket, maxdepth, etc.