By printing the variable data, the heart disease data set can be inspected; the heart_disease.names file explains what each column means.
Preprocessing the data
The variable data contains the 4 dataframes read from the 4 files. Printing data reveals some unexpected values, namely question marks. These indicate missing values, which are not convenient to have amidst your data. There are different ways to handle this, such as imputing the missing values or treating them separately, but the easiest way is simply to remove all rows that contain them. The most convenient approach is to first concatenate the 4 separate dataframes into 1 big dataframe and then delete all rows with missing values.
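A minimal sketch of that step, assuming the 4 files were already read into a list called data with column names assigned (including a column named target):

```python
import numpy as np
import pandas as pd

# Combine the 4 dataframes into one and drop every row that contains a '?'
# (data is assumed to be the list of 4 dataframes loaded earlier).
df = pd.concat(data, ignore_index=True)
df = df.replace('?', np.nan).dropna().reset_index(drop=True)
```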
With Counter from the collections library, the occurrence of each label in the target column can be retrieved.
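For example, with the cleaned dataframe df from above:

```python
from collections import Counter

# Count how often each label occurs in the target column.
print(Counter(df['target']))
```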
The counts of the labels differ a lot, which is not advantageous for any classification task, since the classifier will be biased towards the most common class (in this case 0). The heart_disease.names file gives the definition of these labels. To keep the distribution manageable, the focus lies on classifying the difference between 0 and all the higher values. The target column can therefore be simplified by changing the labels to boolean values.
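One way to do this:

```python
# Collapse the problem to two classes: False for label 0 (no heart disease),
# True for labels 1-4 (some degree of heart disease).
df['target'] = df['target'].astype(int) > 0
```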
Categorical Decision Tree
Constructing a Decision Tree using the ID3 algorithm is a top-down, greedy process: first the feature containing the most information is sought out. When the most important feature is found, the data set is split into groups of rows that share the same value for that feature. This process keeps repeating itself until a stopping criterion is reached. The implementation discussed in this article consists of 4 steps.
Step 1: Split
First the split process must be defined as a function, and for this the unique values of a column must be determined using the unique_values function.
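A possible implementation, assuming the function takes a dataframe and a column name:

```python
def unique_values(df, column):
    # Return the unique values that occur in the given column of df.
    return df[column].unique()
```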
Now this function can be used in the create_split function, which, given a dataframe and a column name, returns a dictionary with the unique column values as keys and the split dataframes as values. In other words, the dataframe rows are grouped by each of the unique values in that column.
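A sketch of create_split built on unique_values:

```python
def create_split(df, column):
    # Map each unique value of `column` to the sub-dataframe of rows
    # that have that value.
    return {value: df[df[column] == value]
            for value in unique_values(df, column)}
```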
Step 2: Entropy
Second, the entropy must be computed. Entropy has different definitions, but for decision trees the information-theoretic entropy is the relevant one. The information-theoretic entropy, also known as Shannon entropy, expresses the amount of information contained in a distribution of the data: the less intricate the distribution, the fewer bits are needed on average to describe it.
This measure of entropy will be used to compare the possible splits of the decision tree and establish which one contains the most information. After simplifying the data by changing the labels to boolean values, only 2 classes are left, and for a two-class problem the entropy is defined as follows:
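Φ(p) = −p · log₂(p) − (1 − p) · log₂(1 − p)

where p is the fraction of rows with a positive (True) label.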
To construct the entropy function and implement this equation in Python, the equation is separated into 3 parts:
1. Calculate the ratio p:
2. Compute all values of the log product and return 0 in the case p = 0:
3. Finally, combine the functions of part 1 and part 2 to implement the entropy equation:
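One possible implementation of the three parts, with placeholder helper names ratio and log_prod that may differ from the article's own:

```python
import numpy as np

def ratio(labels):
    # Part 1: the fraction p of labels that are True.
    return sum(labels) / len(labels)

def log_prod(p):
    # Part 2: p * log2(p), defined as 0 when p == 0.
    return 0 if p == 0 else p * np.log2(p)

def entropy(labels):
    # Part 3: the two-class entropy of a list of boolean labels.
    p = ratio(labels)
    return -log_prod(p) - log_prod(1 - p)
```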
To check that the entropy function works correctly, a plot_entropy function is written to plot the entropy. The plot should look like a downward-opening, parabola-like curve that peaks at p = 0.5.
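A minimal version using matplotlib:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_entropy():
    # Plot the two-class entropy for p between 0 and 1.
    ps = np.linspace(0, 1, 101)
    plt.plot(ps, [-log_prod(p) - log_prod(1 - p) for p in ps])
    plt.xlabel('p')
    plt.ylabel('entropy')
    plt.show()

plot_entropy()
```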
Step 3: Information Gain
Next, a metric must be computed to assess how good the splits in the decision tree are. In this implementation the Information Gain is used, which is defined as the entropy of the original distribution Φ(p) minus the entropy Iₘ of the distribution resulting from the split on m:
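Gain(m) = Φ(p) − Iₘ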
This metric keeps track of how much the entropy changes after making a specific split, so it measures the gain in predictability of the data as a result of partitioning it on a specific variable. When a set of target labels is split on a specific variable, 2 or more lists are created, each with their own entropy. Combining the entropies of the resulting subsets into a single value is defined as:
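Iₘ = Σⱼ (nⱼ / n) · Φ(pⱼ)

where nⱼ is the number of labels in the j-th subset created by the split, n is the total number of labels, and pⱼ is the fraction of positive labels in that subset (the usual weighted average of the subset entropies).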
Implementing this equation in Python to compute Iₘ is done with a function called split_entropy.
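A sketch that takes a list of label lists (the assumed input format):

```python
def split_entropy(split_labels):
    # Size-weighted average entropy of the label lists resulting from a split.
    n = sum(len(labels) for labels in split_labels)
    return sum(len(labels) / n * entropy(labels) for labels in split_labels)
```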
The get_split_labels function is needed to convert the dictionary that create_split returns into a list of lists containing the target labels of each of the splits. This format can then be used directly as input for the split_entropy function.
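A possible version, assuming it receives the split dictionary and the target column name:

```python
def get_split_labels(split, target):
    # Extract the target labels of each sub-dataframe in the split,
    # in the list-of-lists format expected by split_entropy.
    return [list(sub_df[target]) for sub_df in split.values()]
```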
Now the Information Gain function can be implemented using the entropy, split_entropy and get_split_labels functions.
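A sketch with an assumed (df, column, target) signature:

```python
def information_gain(df, column, target):
    # Entropy of the whole dataframe minus the entropy after splitting on `column`.
    split = create_split(df, column)
    split_labels = get_split_labels(split, target)
    return entropy(list(df[target])) - split_entropy(split_labels)
```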
Step 4: ID3
All the tools for constructing the ID3 algorithm are now in place. First, to make things clearer, a namedtuple Node is created that carries the selected split column, the most common label in this node and a dictionary mapping each value of the split to its result.
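A possible definition; the field names are assumptions:

```python
from collections import namedtuple

# column:   the column this node splits on
# label:    the most common target label in this node (used as a fallback)
# branches: dict mapping each value of `column` to a sub-tree or a leaf label
Node = namedtuple('Node', ['column', 'label', 'branches'])
```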
The ID3 algorithm can be summarised in 7 steps:
1. If all labels in the current dataset are the same, return that pure label (checked with the unique_values function).
2. If there are no more columns left to split on, return the most common label (mode()).
3. Calculate the Information Gain for each possible split in the dataset.
4. Pick the split with the highest Information Gain.
5. Create a Node for the new/current branch of the tree.
6. Recursively call ID3 on each sub-dataset to create its sub-tree and use the sub-trees as the branches of the Node from the previous step.
7. Return the Decision Tree.
First a small test dataframe is constructed by hand and a test target is set.
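For illustration, a hypothetical toy dataframe; the columns and values below are invented for this sketch and are not taken from the article:

```python
import pandas as pd

test_df = pd.DataFrame({
    'outlook': ['sunny', 'sunny', 'rain', 'rain', 'overcast', 'overcast'],
    'windy':   ['yes',   'no',    'yes',  'no',   'yes',      'no'],
    'play':    [False,   True,    False,  True,   True,       True],
})
test_target = 'play'
```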
Now all 7 steps can be implemented to create the ID3 algorithm.
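A sketch that follows the 7 steps using the helpers defined above; pandas' mode() is used here for the most common label:

```python
def ID3(df, target):
    # Step 1: if the node is pure, return its single label.
    if len(unique_values(df, target)) == 1:
        return df[target].iloc[0]
    # Step 2: if no columns are left to split on, return the most common label.
    columns = [c for c in df.columns if c != target]
    most_common = df[target].mode().iloc[0]
    if not columns:
        return most_common
    # Steps 3 & 4: compute the Information Gain of every column and pick the best.
    gains = {column: information_gain(df, column, target) for column in columns}
    best_column = max(gains, key=gains.get)
    # Steps 5 & 6: build a Node and recursively grow a sub-tree for every value
    # of the chosen column, dropping that column so it is not reused further down.
    branches = {value: ID3(sub_df.drop(columns=[best_column]), target)
                for value, sub_df in create_split(df, best_column).items()}
    # Step 7: return the finished (sub-)tree.
    return Node(best_column, most_common, branches)
```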
Putting the test dataframe into the ID3 function then results in a nested structure of Node tuples and leaf labels: the “tree”.
Now that a tree is built, it is important to realise that the ID3 algorithm does NOT search the entire space of possible trees to find the best possible decision tree. Instead, it picks the best decision per node, making locally optimal choices that are not always globally optimal.
Classifying
The tree can now be used to classify a given row of the dataframe. At each node there are 3 cases:
- When there is a Node attached to this branch, look further.
- When a leaf node (last node of a branch) is encountered, return it.
- If the value from the column in the row doesn’t have a corresponding branch in that node, then return the most common label in that node.
For this the function classify was written; it takes a tree and the row that has to be classified.
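A recursive sketch covering the 3 cases:

```python
def classify(tree, row):
    # Case 2: a leaf (plain label) was reached, return it.
    if not isinstance(tree, Node):
        return tree
    value = row[tree.column]
    # Case 3: no branch for this value, fall back to the most common label.
    if value not in tree.branches:
        return tree.label
    # Case 1: a Node is attached to this branch, look further.
    return classify(tree.branches[value], row)
```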
Validation
Now that everything is complete, the ID3 algorithm can be applied to the heart disease dataset. First, a validation split must be made to obtain a training and a test set. This can easily be done with train_test_split from the sklearn.model_selection library.
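For example (the split ratio and random_state are arbitrary choices, not values from the article):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as a test set.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```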
Next, a function is needed to separate the categorical from the numerical features and to create training and test sets for both.
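A sketch in which the column groups follow heart_disease.names; both the column lists and the function name split_feature_types are assumptions:

```python
# Which columns are treated as categorical vs. numerical (adjust to your own column names).
categorical_columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
numerical_columns = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

def split_feature_types(train_df, test_df, target):
    # Return categorical and numerical versions of the train/test sets,
    # each keeping the target column.
    cat_train = train_df[categorical_columns + [target]]
    cat_test = test_df[categorical_columns + [target]]
    num_train = train_df[numerical_columns + [target]]
    num_test = test_df[numerical_columns + [target]]
    return cat_train, cat_test, num_train, num_test
```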
Finally, a validate function is written that takes a decision tree, a dataframe and the target column name. It should first classify all rows in the dataframe using the decision tree and then return the percentage that was classified correctly:
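One possible implementation, iterating over the rows with iterrows and returning the accuracy as a percentage:

```python
def validate(tree, df, target):
    # Classify every row and count how many predictions match the true label.
    correct = sum(classify(tree, row) == row[target] for _, row in df.iterrows())
    return 100 * correct / len(df)
```

With the splits from above, one could for instance build a tree on cat_train and check its score with validate(tree, cat_test, 'target').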
The Categorical Decision Tree is officially done! In the next part of this series I will be discussing the Numerical Decision Tree.