Hello! In the second post of my Machine Learning series, we will:
- use a new classification model,
- examine the relationship between the predictor variables and the target variable,
- see new methods for data preprocessing,
- talk about new metrics we can use to measure the performance of our model.
It will also be an article where we talk about class imbalance, the mistakes we can make while evaluating model performance, and what can be done to fix them.
We start by importing our core libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('winequality-red.csv')
In this article, we will develop a wine tasting model that tries to predict the quality of wine samples based on their chemical properties. I obtained the data set from the source we used in the first article. You can find it here.
We will not elaborate on the methods we mentioned in the first article of the series. First, we start with the Exploratory Data Analysis part. You can get information about this section by reviewing this article.
Numerical EDA
df.info()
Our dataset consists of 1599 rows and 12 columns. We will try to predict the quality property by learning from the first 11 properties.
df.describe()
Here we see summary statistics for our property values. Besides the mean, max, std and min values of each column, we have rows labeled 25%, 50% and 75%.
- min and max are the smallest and largest values in that column,
- mean is the average of the column values, and std is the standard deviation,
- the 25%, 50% and 75% rows are percentiles: they show the value at or below which that percentage of the column's values fall.
When we examine the quality column in the table, we see that the values vary between 3 and 8 (the min value is 3, the max value is 8). A quarter of the data in this column is 5 or less (3, 4, 5), while three quarters are 6 or less (3, 4, 5, 6). Our average value is 5.63.
These values show us that most of the quality values in our data set have average values (such as 5 and 6) and the distribution is unbalanced.
Problem: This imbalance will cause our model to distinguish medium-quality wines, but not low-quality or high-quality ones. We will try to develop a solution to this problem, called class imbalance, under the related heading.
df.head(10)
We see that values such as volatile acidity, citric acid are quite close to 0, while free sulfur dioxide and total sulfur dioxide values are generally much higher than 0.
Remember that classifiers like kNN, which we mentioned in the first article, work by calculating distances. In this case, features with numerically large values will sit far out in the feature space and dominate the distance calculation, and our model may make incorrect predictions because of them. We will come to the solution for this situation shortly.
Another situation we will examine is the increase / decrease relationship between our predictor variables. Feature selection is one of the first and most important steps taken when solving any machine learning problem. Each feature in our dataset is represented simply by a column. The effect of each column (attribute) on the target variable may not be the same.
Adding a predictor variable with a weak relationship with the target variable to the model has some negative effects on performance. More predictive variables lead to longer training times, increased computational complexity, and reduced impact of some potentially important features.
This situation creates the need for feature selection.
We have created a correlation matrix using the seaborn library. This matrix shows us the relationship of all properties to each other in a color palette.
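A minimal sketch of how such a heatmap can be produced (the figure size and color palette here are my own choices, not necessarily the original ones):

cor = df.corr()  # pairwise correlations between all columns
plt.figure(figsize=(12, 10))
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)  # annotate each cell with its value
plt.show()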
In the correlation matrix, values are located between -1 and +1. Values close to -1 are interpreted as negative correlations, values close to +1 are interpreted as positive correlations.
The values of two variables with positive correlation increase or decrease together.
While the value of one of the two variables with negative correlation increases, that of the other decreases.
If the value is close to 0, it indicates that there is no connection between these two variables. Our goal is to find the properties with a correlation value close to 0 with the quality property and eliminate them.
We assign the properties whose correlation with quality falls outside the -0.1 to +0.1 band to the variable we created with the name relevant_features. One of these variables is quality itself, which naturally has a correlation value of 1. We will use the remaining 8 variables as predictors.
to_drop = cor_target[cor_target<0.1]
Residual sugar, free sulfur dioxide and pH are the features for which we detect low correlation and will not use. We collected them in the to_drop variable.
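Putting the selection together, something along these lines would produce cor_target and to_drop; using absolute values is my reconstruction of how both negative and positive correlations end up under one threshold:

cor_target = abs(cor['quality'])  # absolute correlation of every feature with the target
relevant_features = cor_target[cor_target > 0.1]  # features with a meaningful relationship
to_drop = cor_target[cor_target < 0.1]  # weakly correlated features to eliminate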
When we examine the to_drop variable, two extra pieces of information, Name and dtype, are returned to us. So this variable is not the Python list we know. However, in order to drop these columns from our training set, we need them as a list. Let's learn the data type with the type() command.
type(to_drop)
The output tells us it is a pandas Series. We first convert it to a DataFrame, another pandas data type.
to_drop_frame = to_drop.to_frame()
Finally, we turn these variables into a list and finalize the list of columns that we will drop by adding the quality variable.
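A sketch of that conversion, assuming the feature names live in the frame's index; the variable names to_drop_list, X and y are my own:

to_drop_list = to_drop_frame.index.tolist()  # feature names as a plain Python list
to_drop_list.append('quality')  # the target itself must not be a predictor

X = df.drop(to_drop_list, axis=1)  # the 8 predictor columns
y = df['quality']  # the target variable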
After determining our predictor and target variables and assigning them to X and y, there is one more operation to do in this section. As df.describe() has shown us, our quality attribute takes the values 3, 4, 5, 6, 7 and 8. Considering that we will reserve roughly a fifth of the 1599 samples for testing, the approximately 320 test samples would be divided among 6 different classes. This can reduce the performance of our model.
As we mentioned in the df.describe() section, the examples we have are generally of medium quality.
Therefore, in order to recognize high- and low-quality wines, we adopt the scenario that wines scoring 3 and 4 are low, wines scoring 5 and 6 are medium, and wines scoring 7 and 8 are high quality. Let's apply this mapping; a sketch follows below.
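A minimal version of that relabeling; encoding the three classes as 0 (low), 1 (medium) and 2 (high) is my assumption:

# 3-4 -> 0 (low), 5-6 -> 1 (medium), 7-8 -> 2 (high)
y = pd.cut(df['quality'], bins=[2, 4, 6, 8], labels=[0, 1, 2]).astype(int)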
If the data set we have is larger, it is possible to develop more effective algorithmic solutions that have a place in the literature.
As an advanced reading on these solutions: https://medium.com/quantyca/how-to-handle-class-imbalance-problem-9ee3062f2499
A decision tree is a structure used to divide a data set into smaller sets by applying a set of decision rules.
The decision tree is one of the predictive modeling approaches used in statistics, data mining and machine learning. It uses a tree structure (the predictive model) to go from observations about an item (predictor variables, represented in the branches) to conclusions about the item's target value (the target variable, represented in the leaves). Tree models in which the target variable takes a discrete set of values are called classification trees; in these structures, leaves represent class labels and branches represent the combinations of features that lead to those class labels.
The topmost node of a decision tree is called the root. Each observation is first evaluated against the condition at the root.
Below the root are the internal nodes. Each observation is routed through these nodes; as the number of nodes increases, the complexity of the model increases.
At the bottom of the decision tree are the leaves. The leaves give us the result.
While creating the pipeline, we will use two parameters for grid search: the criterion and max_depth parameters of scikit-learn's DecisionTreeClassifier. Basically, we will look for answers to two questions: which information criterion should our model use when branching the data, and at what depth does it make the most of that information?
For detailed information on these and all other decision tree parameters, you can examine the scikit-learn documentation and try this example with different parameters and values.
Before we start coding our decision tree, we have one more preparation to make for our model.
In the introduction part of the article, we mentioned that some of our predictor variables have values fairly close to 0 while some are considerably larger than 0. This causes the variables with large values to affect the model disproportionately (and negatively). We would like to represent all of our predictor variables on similar scales.
For this, we will use the StandardScaler method of scikit-learn.
Next, let's code all of the preprocessing steps we've described, together with the model and the grid search.
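A sketch of those steps, assuming an 80/20 train/test split (consistent with the roughly 320 test samples mentioned earlier); the exact parameter ranges searched are my assumption:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hold out a test set; stratify to preserve the class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the features, then feed them to the decision tree
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('tree', DecisionTreeClassifier(random_state=42))])

# Search over the splitting criterion and the maximum depth
param_grid = {'tree__criterion': ['gini', 'entropy'],
              'tree__max_depth': range(1, 10)}

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))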
In this step, let's briefly talk about the concepts of gini and entropy.
Gini (or Gini Impurity): The probability of incorrectly classifying a randomly selected item in the data set.
Entropy: The entropy of a variable is the average level of “information”, “surprise” or “uncertainty” inherent in the variable's possible outcomes. At the root of the decision tree, splitting starts with the feature that reduces entropy the most.
If we explain the concept of entropy with an example, the entropy of the information whether it is sunny or rainy is lower than the entropy of the information of the movies shown in the cinema. This means that you are more likely to start from the weather when deciding how to spend the day. If it’s rainy, you’d rather sit at home rather than go to the movies, so you don’t have to think about what movies are in the cinema.
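To make these two measures concrete, here is a small sketch that computes both for a toy label distribution (the distribution itself is made up for illustration):

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy: -sum of p * log2(p) over the classes
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

weather = ['sunny'] * 8 + ['rainy'] * 2  # a fairly predictable variable
print(gini(weather), entropy(weather))   # 0.32, ~0.72 - low uncertainty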
For further reading on entropy and information theory concepts: https://en.wikipedia.org/wiki/Entropy_(information_theory)
Using the grid search technique we mentioned in the first article, we came to the conclusion that our best criterion is gini and our optimum depth is 4. When we used these parameters, we saw that our model achieved 86% accuracy. So does our model really work that well?
The accuracy of the model alone is not enough, especially for imbalanced data sets. In order to fully measure the performance of our model, we need some different metrics.
For this, let’s first refer to the concept of Confusion Matrix.
True Positive and True Negative are the fields that contain our correct predictions, False Positive and False Negative are the fields that contain our false predictions. As an example, let’s say our job is to identify spam mails in a set of mails we have.
- True Positive: Our model predicted the email as spam, and that’s correct.
- True Negative: Our model predicted the email as not spam, and that’s correct.
- False Positive: Our model predicted the email as spam and this is wrong.
- False Negative: Our model predicted the email as not spam, and this is wrong.
The accuracy of the model is found by dividing the number of true predictions by the total number of predictions. Suppose there are 10 cancer patients and 90 healthy people in a population of 100. If our model predicts that the entire group is healthy, it will achieve 90% accuracy. However, since it also predicts those 10 patients as healthy, it is a fatal model as well as a "successful" one.
Therefore, we need different metrics to measure the performance of the model.
Precision is the ratio of true positives to all positive predictions, TP / (TP + FP). It is very important in situations where the cost of False Positive predictions is high. For example, if e-mails that should reach your Inbox fall into the Spam box as a result of your model's incorrect predictions, you will miss important e-mails and be worse off. In this case, a high Precision value is an important criterion for evaluating the performance of the model.
Recall, in turn, is the ratio of true positives to all actual positives, TP / (TP + FN); it matters most when the cost of a False Negative is high. Regarding the mail example, it may seem harmless that some of the e-mails we should have caught as spam fall into the Inbox, but a high Recall score can be of vital importance, especially on health and banking data sets.
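All of these metrics are available in scikit-learn; here is a small sketch on made-up spam labels (1 = spam), just to show the calls:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions for ten e-mails (1 = spam)
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4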
Now, with the best parameters we have obtained, let’s rerun the training / prediction process by assigning our decision tree to a variable named best_tree and examine the model performance based on the performance metrics we learned.
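A sketch of that step, keeping the scaling inside the pipeline; the best parameters (gini, max_depth=4) come from the grid search above:

from sklearn.metrics import classification_report

# Rebuild the pipeline with the best parameters found by grid search
best_tree = Pipeline([('scaler', StandardScaler()),
                      ('tree', DecisionTreeClassifier(criterion='gini',
                                                      max_depth=4,
                                                      random_state=42))])
best_tree.fit(X_train, y_train)

# Per-class precision, recall and F1 instead of a single accuracy number
print(classification_report(y_test, best_tree.predict(X_test)))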
We see that our model cannot recognize any of the poor-quality wines (17 samples). So a model whose accuracy is a few points lower, but which also recognizes low-quality wines, may be more useful.
One of the reasons these examples cannot be distinguished may be that the model was not trained on enough of them. So in the next step, let's start the training process again after shrinking our test_size.
Let’s assign our new model to best_tree variable with its best parameters and examine our success metrics again.
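A sketch of that re-split; the exact test_size value is my assumption, the point is only that it is smaller than before:

# A smaller test set leaves more low-quality examples for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

best_tree.fit(X_train, y_train)
print(classification_report(y_test, best_tree.predict(X_test)))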
This time, we see that 75% (precision) of the wines the model marks as low quality are indeed low quality, and that it correctly predicts 30% (recall) of the poor-quality wines.
In the previous experiment, we could not accurately predict any of the 17 samples; this time, even though there were only 10 low-quality samples in the test set, we see an increase in performance on that class.
As there were more samples labeled 0 in our training set, we were able to increase the predictive score of the poor quality wines remaining in our test set.
You can also examine the random sampling technique to solve imbalance problems like the one in this dataset (though it is better applied to larger datasets to see its full effect). There are basically two ways to apply this technique:
- Oversampling, multiplying the examples of the minority classes,
- Undersampling, reducing the examples of the majority classes.
Either way, you end up with more balanced classes.
There is a library in the Python ecosystem developed for this problem; you can examine it, with examples, via the link.
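Assuming that library is imbalanced-learn (imblearn), a minimal oversampling sketch would look like this, applied to the training split only so the test set stays untouched:

from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class samples until all classes have equal counts
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)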
For a detailed reading of this approach, you can review this post.
Finally, we end our article by drawing our decision tree with the tree module of the scikit-learn library, as sketched below. See you in the next articles.
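A minimal version of that plotting step; extracting the fitted classifier from the pipeline reflects my best_tree sketch above:

from sklearn import tree

# Pull the fitted classifier out of the pipeline and draw it
fitted_tree = best_tree.named_steps['tree']
plt.figure(figsize=(20, 10))
tree.plot_tree(fitted_tree, feature_names=X.columns,
               class_names=['low', 'medium', 'high'], filled=True)
plt.show()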
Source Code