By Zachary Galante — Senior Data Science Student at Bryant University
In Machine Learning, a very popular algorithm is the Decision Tree Classifier. In this article, the Banknote dataset will be used to illustrate the capabilities of this model.
Decision Tree
A decision tree is a basic machine learning algorithm that can be used for classification problems. At a high level, a decision tree starts with a condition at the root of the tree and, based on whether that condition is true or false, moves down a different path to the next condition. This process repeats until the tree reaches a final decision.
In the image below, a basic decision tree is shown. This particular tree shows the decision-making process for whether a job offer will be accepted by a particular candidate. The first condition checks whether the offer is between $50,000 and $80,000, which decides the path for the rest of the tree. If the candidate declines the offer based on the base salary, the tree ends and the decision has been made; if the salary is acceptable, the tree continues. This basic structure can become much more complex for different types of classification problems.
With this basic understanding of how a decision tree works, one can now be implemented in Python. The Bank Note dataset from UC Irvine (linked in the references) will be used to predict whether an image of a bank note is authentic or fraudulent.
Data Dictionary / Description:
As shown below, the dataset is made up of 1,372 records, each containing 5 features.
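The exact loading code will depend on where the file is stored; the sketch below assumes the raw text file from the UCI page has been downloaded locally as 'data_banknote_authentication.txt' and reads it into a pandas DataFrame with named columns (the file itself has no header row).

```python
import pandas as pd

# A minimal sketch of loading the data. It assumes the UCI file has been
# downloaded locally; since the file has no header row, column names are
# supplied here.
columns = ['variance', 'skewness', 'kurtosis', 'entropy', 'class']
df = pd.read_csv('data_banknote_authentication.txt', header=None, names=columns)

print(df.shape)   # (1372, 5)
print(df.head())
```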
Features:
In this dataset, the features are numerical values extracted from 400 x 400 pixel grayscale images of bank notes; per the UCI description, the first three are computed from a wavelet transform of each image.
Variance: This is a statistical measure of how much the data varies around the mean.
Skewness: A measure of the asymmetry of the data's distribution. If the distribution leans toward one side of the axis, the data may be described as being 'skewed to the left or right'. This is also shown in the image below.
Kurtosis: Another measure of the distribution of data. This measure refers to the tails of the data and how much they differ from the tails of a Normal Distribution.
Entropy: A measure of the randomness in the image.
Class: The class label of the data, 0 for an authentic bank note, 1 for a fraudulent bank note.
To gain a deeper understanding of the target variable, 'Class', the following plot has been created to visualize the distribution of its values.
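The original plot was produced in the notebook; a sketch along these lines, assuming the DataFrame from the loading step above, reproduces the same view:

```python
import matplotlib.pyplot as plt

# Count how many records fall into each class and plot the counts.
# 'df' is the DataFrame loaded earlier; the column name 'class'
# follows the names assigned when the file was read in.
df['class'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Class (0 = authentic, 1 = fraudulent)')
plt.ylabel('Count')
plt.title('Distribution of the target variable')
plt.show()
```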
Now, with a basic understanding of the dataset, the model can be created. The target variable, 'class', is assigned to the variable 'y', while the other four features are assigned to the variable 'x'.
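A sketch of that step, assuming the column names used when the data was loaded:

```python
# Separate the target variable from the features.
x = df.drop(columns=['class'])  # variance, skewness, kurtosis, entropy
y = df['class']                 # 0 = authentic, 1 = fraudulent
```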
The data is then split into training and testing sets, using the default split of 75% training and 25% testing, as shown below.
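With scikit-learn this is a single call; a sketch of the split, with a random_state added here purely for reproducibility:

```python
from sklearn.model_selection import train_test_split

# train_test_split defaults to 75% training / 25% testing when
# test_size is not specified.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

print(x_train.shape, x_test.shape)  # roughly (1029, 4) and (343, 4)
```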
Now that the data has been split, it can be used by the model for both training and testing. In the first line of code in the image below, the DecisionTreeClassifier is assigned to the variable 'tree'. The model is then fit to the training data and scored on the testing data, resulting in a 97% accuracy score and indicating that a decision tree is a good fit for this problem.
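The original code was shown as an image; a sketch of the same steps (the random_state argument is an addition here, included only so results are repeatable):

```python
from sklearn.tree import DecisionTreeClassifier

# Create the classifier, fit it on the training data,
# and score it on the held-out test data.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(x_train, y_train)

accuracy = tree.score(x_test, y_test)
print(f'Test accuracy: {accuracy:.2%}')  # ~97% in the original run
```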
Conclusions
As this was a basic dataset with a binary target variable, the model showed great results with a 97% accuracy score. With an understanding of how decision trees work, further models such as a Random Forest can be explored as well; this will be covered in the next article.
References
UCI Machine Learning Repository, Banknote Authentication Data Set: https://archive.ics.uci.edu/ml/datasets/banknote+authentication