This post serves as a high-level overview of decision trees. It covers how decision trees train with recursive binary splitting and how features are selected using "information gain" and the "Gini index". I will also be tuning hyperparameters and pruning a decision tree for optimization. The two decision tree algorithms covered in this post are CART (Classification and Regression Trees) and ID3 (Iterative Dichotomiser 3).
Decision trees are very popular for predictive modeling and can perform both classification and regression. They are highly interpretable and provide a foundation for more complex algorithms, e.g., random forests.
The structure of a decision tree can be thought of as a directed acyclic graph (DAG), a sequence of nodes where each edge is directed from earlier to later. The graph flows in one direction, and no node can be a child of itself. Looking at the DAG above, we can see it starts with a root node; the best attributes become the interior nodes, i.e., decision nodes. Each internal node checks a condition and makes a decision, partitioning the sample space in two. The leaf nodes represent classifications: when a record reaches a leaf node, the algorithm assigns it the label of that leaf. This process is referred to as recursive partitioning of the sample space. Terminology when working with decision trees:
- Parent Node - a node divided into sub-nodes
- Child Node - a sub-node of a parent node
- Root Node - represents the sample space/population that will be split into two or more sets (sub-nodes)
- Decision Node - a sub-node that splits into further sub-nodes
- Leaf Node - a node without splits (children)
- Branch - a subsection of a decision tree
- Pruning - reducing the size of a decision tree by removing nodes
Decision trees use a cost function to choose the best split: we are searching for the attribute/feature that performs best at classifying the training data. This process is repeated until a leaf node is reached, and is therefore referred to as recursive binary splitting. During this procedure, all candidate split points are lined up, and the tree tests each split and selects the one that returns the lowest cost, making this a greedy approach.
Something to note: since the algorithm repeatedly partitions the data into smaller subsets, the final subsets (leaf nodes) can consist of only a few, or even single, data points. This causes the algorithm to have low bias and high variance.
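To make the greedy search concrete, here is a minimal sketch (not the post's code) of choosing a binary split on a single numeric feature. It uses misclassification rate as a stand-in cost function; CART and ID3 use Gini impurity and entropy, covered in the next sections.

```python
def misclassification_cost(labels):
    """Fraction of labels that disagree with the majority label (a simple stand-in cost)."""
    if not labels:
        return 0.0
    majority = max(set(labels), key=labels.count)
    return sum(1 for y in labels if y != majority) / len(labels)

def best_split(values, labels):
    """Greedily test every candidate threshold and keep the one with the lowest weighted cost."""
    best_threshold, best_cost = None, float("inf")
    for threshold in sorted(set(values))[1:]:  # candidate split points
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        # Cost of a split = size-weighted average cost of the two partitions
        cost = (len(left) * misclassification_cost(left)
                + len(right) * misclassification_cost(right)) / len(labels)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(values, labels))  # (10, 0.0) -> a perfect split
```

A real tree would then recurse on each of the two partitions, which is where the "recursive" in recursive binary splitting comes from.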
Entropy & Information Gain
A widely used metric with decision trees is entropy. Shannon entropy, named after Claude Shannon, gives us a measure of uncertainty. When it comes to data, entropy tells us how messy our data is: a high entropy value indicates less predictive power. Think of the entropy of a feature as the amount of information in that feature. Decision trees work to maximize the purity of the classes when making splits, providing more clarity of the classes in the leaf nodes. The entropy is calculated before and after each split. If the entropy increases, another split will be tried or that branch of the tree will stop growing, i.e., the current tree has the lowest entropy. If the entropy decreases, the split will be kept. The formula for calculating the entropy of an entire dataset:
H(S) = −Σᵢ₌₁ⁿ pᵢ · log₂(pᵢ)

where n is the number of groups and pᵢ is the probability of belonging to the i-th group. Let's say we have a dataset containing 462 positive (1) labels and 438 negative (0) labels. We can calculate the entropy of the dataset by:

H = −(462/900) · log₂(462/900) − (438/900) · log₂(438/900) ≈ 0.9995
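As a quick check of the numbers above, a short sketch (the function name is my own, not from the post):

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution, given class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 462 positive and 438 negative labels: a nearly even split,
# so the entropy is close to the maximum of 1 bit
print(round(entropy([462, 438]), 4))  # 0.9995
```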
Information gain uses entropy as its measure of impurity. It is the difference in entropy from before the split to after, giving us a number for how much the uncertainty has decreased. It is also the key splitting criterion used in the ID3 classification tree algorithm. To calculate the information gain:

Gain(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)

i.e., the entropy of the parent node minus the size-weighted average entropy of the child nodes Sᵥ produced by splitting on attribute A.
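Information gain can be sketched as the parent's entropy minus the size-weighted entropy of the children. The child counts below are made up for illustration:

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution, given class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted average entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split of the 462/438 dataset into two child nodes
parent = [462, 438]
children = [[300, 50], [162, 388]]
print(round(information_gain(parent, children), 4))
```

The gain is always between 0 and the parent's entropy; a gain of 0 means the split told us nothing about the class.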
Gini Index
When performing classification tasks, the Gini index function is used. Named after Corrado Gini, this function tells us how "pure" the leaf nodes in the tree are. For binary classification, the Gini impurity is always a value from 0 to 0.5; the higher the value, the more disordered the group. To calculate the Gini impurity:

Gini = 1 − Σᵢ₌₁ⁿ pᵢ²
where pᵢ is the probability of belonging to the i-th group. The equation states that the Gini impurity is 1 minus the sum of the squared class probabilities in each split.
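A minimal sketch of the Gini impurity (again, not the post's code):

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([450, 450]))  # 0.5 -> perfectly mixed binary node
print(gini([900, 0]))    # 0.0 -> perfectly pure node
```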
Pruning Decision Trees
Since decision trees train by performing recursive binary splitting, we can also set parameters that stop the tree from growing. The more complex a decision tree is, the more prone it is to overfitting. We can prune the tree by trimming it with hyperparameters such as:
- max_depth – determines how deep the tree is allowed to grow
- min_samples_leaf – the minimum number of training samples required in each leaf node
- max_leaf_nodes – the maximum number of leaf nodes
- min_impurity_decrease – a node is split only if the split decreases impurity by at least this value; otherwise it becomes a leaf
There are more parameters that can be changed; for a full list and a more detailed explanation, take a look at the scikit-learn documentation.
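Putting the hyperparameters above together, a sketch with scikit-learn (using the built-in iris dataset as a stand-in; the values here are illustrative, not tuned):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: constrain the tree's growth with the hyperparameters above
clf = DecisionTreeClassifier(
    max_depth=3,
    min_samples_leaf=5,
    max_leaf_nodes=10,
    min_impurity_decrease=0.01,
    random_state=42,
)
clf.fit(X_train, y_train)
print(clf.get_depth(), clf.get_n_leaves(), clf.score(X_test, y_test))
```

Because of the constraints, the fitted tree is guaranteed to be no deeper than 3 levels and to have at most 10 leaves, which keeps it interpretable and less prone to overfitting.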
Decision Trees with Scikit-Learn
Let’s build a decision tree classifier with sklearn. I will be using the Titanic dataset, with the target being the Survived feature. The dataset I’m loading in has previously been cleaned. For a description of the features in the dataset, see the data dictionary below.
Importing the necessary libraries
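A typical set of imports for this workflow might look like the following (a sketch; the post's own code may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
```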