This post will serve as a high-level overview of decision trees. It will cover how decision trees train with recursive binary splitting and how they select features using **information gain** and the **Gini index**. I will also be tuning hyperparameters and pruning a decision tree for optimization. The two decision tree algorithms covered in this post are **CART (Classification and Regression Trees)** and **ID3 (Iterative Dichotomiser 3)**.

Decision trees are very popular for predictive modeling and can perform both classification and regression. They are highly interpretable and provide a foundation for more complex algorithms, e.g., random forests.

The structure of a decision tree can be thought of as a **directed acyclic graph (DAG)**: a sequence of nodes where each edge is directed from an earlier node to a later one. The graph flows in one direction, and no node can be a child of itself. Take a look at the DAG above: it starts with a root node, and the best attributes become interior nodes, i.e., decision nodes. Each internal node checks a condition and performs a decision, partitioning the sample space in two. The leaf nodes represent classifications: when a record reaches a leaf node, the algorithm assigns it the label of that leaf. This process is referred to as **recursive partitioning of the sample space**. Terminology when working with decision trees:

- **Parent Node**: a node divided into sub-nodes
- **Child Node**: a sub-node of a parent node
- **Root Node**: represents the sample space/population that will be split into two or more sets (sub-nodes)
- **Decision Node**: a sub-node that splits into further sub-nodes
- **Leaf Node**: a node without splits (children)
- **Branch**: a subsection of a decision tree
- **Pruning**: reducing the size of a decision tree by removing nodes
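As a rough sketch of how this node structure fits together (the class and attribute names here are my own, not from any particular library), a binary decision tree can be represented as linked nodes, where decision nodes route a record left or right and leaf nodes assign a label:

```python
# Minimal sketch of a binary decision-tree node (illustrative only).
class Node:
    def __init__(self, feature=None, threshold=None, label=None):
        self.feature = feature      # feature index tested at a decision node
        self.threshold = threshold  # go left if x[feature] <= threshold
        self.label = label          # class label, set only on leaf nodes
        self.left = None            # child for records meeting the condition
        self.right = None           # child for the rest

    def predict(self, x):
        # A leaf assigns its label; a decision node recursively routes the
        # record to one of its two children (recursive partitioning).
        if self.label is not None:
            return self.label
        child = self.left if x[self.feature] <= self.threshold else self.right
        return child.predict(x)
```

For example, a tree with one decision node (`x[0] <= 2.5`) and two leaves would send `[1.0]` to the left leaf and `[3.0]` to the right leaf.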

## Splitting Criteria

Decision trees use a cost function to choose the best split: we are looking for the attribute/feature that performs best at classifying the training data. This process is repeated on each resulting subset until a leaf node is reached, which is why it is referred to as **recursive binary splitting**. When performing this procedure, the candidate values are lined up, the tree tests the different splits, and it selects the one returning the lowest cost, making this a greedy approach.
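To make the greedy search concrete, here is a rough sketch (my own helper names, not a full CART implementation) that lines up every observed value as a candidate threshold, scores each candidate with a cost function (weighted Gini impurity here), and keeps the cheapest split:

```python
# Sketch of greedy binary split selection (illustrative only).
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y):
    """Exhaustively test every (feature, threshold) pair; return the
    one with the lowest weighted Gini cost -- the greedy choice."""
    best = (None, None, float("inf"))  # (feature, threshold, cost)
    n = len(y)
    for feature in range(len(X[0])):
        # line up every observed value as a candidate threshold
        for threshold in sorted({row[feature] for row in X}):
            left = [y[i] for i in range(n) if X[i][feature] <= threshold]
            right = [y[i] for i in range(n) if X[i][feature] > threshold]
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if cost < best[2]:  # greedy: keep the lowest cost seen so far
                best = (feature, threshold, cost)
    return best
```

On a toy dataset like `X = [[1], [2], [8], [9]]`, `y = ["a", "a", "b", "b"]`, the search finds the threshold `2` on feature `0`, which separates the classes perfectly (cost `0.0`). This procedure would then be applied recursively to each partition.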

Something to note: since the algorithm repeatedly partitions the data into smaller subsets, the final subsets (leaf nodes) can consist of only a few, or even single, data points. This causes the algorithm to have **low bias and high variance**.

## Entropy & Information Gain

A widely used metric with decision trees is entropy. Shannon entropy, named after *Claude Shannon*, provides us with a measure of uncertainty. When it comes to data, entropy tells us how messy our data is: a high entropy value indicates less predictive power. Think of the entropy of a feature as the amount of information in that feature. Decision trees work to maximize the purity of the classes when making splits, providing more clarity of the classes in the leaf nodes. Entropy is calculated before and after each candidate split: if a split decreases the entropy, it is kept; if the entropy increases, another split is tried, or the branch of the tree stops growing, i.e., the current tree already has the lowest entropy. The formula for calculating the entropy of an entire dataset: