Logistic Regression is one of the most basic and widely used classification algorithms in machine learning. It shows up in many real-world scenarios like spam detection, cancer detection, and classifying the IRIS dataset. It is mostly used for binary classification problems, but it can also be extended to multiclass classification.
Logistic Regression predicts the probability that a given data point belongs to a certain class. In this article, I will be using the famous heart disease dataset from Kaggle.
In this dataset, the main goal is to predict whether a given person has heart disease. Logistic regression will predict the probability that the person has heart disease.
To understand logistic regression, you first need to understand linear regression, as it forms the base for logistic regression.
As you probably already know, linear regression is a linear model that assumes a relationship exists between the input variables (independent variables) and the output variable (dependent variable).
It suggests that the output value can be predicted as a linear combination of input variables. It basically tries to fit a straight line to the data:
If the input variables are x1, x2, x3, and x4, then linear regression suggests that the y value can be predicted as:
y = w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4
The main goal of the linear regression model is to find the optimal values of w0, w1, w2, w3, and w4 such that the cost function value is as low as possible.
What is the cost function? Cost Function quantifies the error between predicted values and expected values. For linear regression, the cost function used is the mean squared error.
The formula for mean squared error is:
MSE = (1/m) * Σ (hθ(x) - y)^2
where m stands for the number of samples, hθ(x) represents the predicted value, and y represents the actual value.
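As a quick illustration, here is how this error could be computed with NumPy (the y_true and y_pred arrays below are hypothetical values, purely for illustration):

import numpy as np

# hypothetical actual and predicted values, just for illustration
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])

m = len(y_true)                          # number of samples
mse = np.sum((y_pred - y_true) ** 2) / m
print(mse)                               # 0.135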
Now let's move on to logistic regression.
Logistic regression does the same thing as linear regression, but after finding the optimal line it passes the output value through an activation function.
What is an activation function? An activation function is a non-linear transformation applied to the value fed into it.
Why do we need an activation function? Linear regression can only model a linear relationship between inputs and output. An activation function is used to create more complex, non-linear models.
Let's look at an example dataset: the x-axis contains the number of hours a student has studied and the y-axis shows whether the student has passed or failed.
Do you think a linear model will fit this data well? Nahhh
What is the activation function used in logistic regression? Sigmoid Function
A sigmoid function is a bounded, differentiable, S-shaped curve ranging from 0 to 1. The mathematical formula for the sigmoid function is:
σ(z) = 1 / (1 + e^(-z))
This is what a sigmoid function looks like:
Now you may wonder how this is related to linear regression at all, and why I gave that introduction. The z in the formula of the sigmoid function is actually the predicted y value we got from linear regression using this formula:
y = w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4
So the sigmoid function combined with linear regression basically goes like this:
p = σ(w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4) = 1 / (1 + e^-(w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4))
So does this fit the data better? Yes, it does.
So what does the y value in this represent? It represents the probability that the student has passed the exam i.e. whether the data point belongs to a certain class.
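As a rough sketch of these two steps, here is the hours-studied example in NumPy (w0 and w1 are made-up weights, not fitted values):

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# made-up weights and a single feature (hours studied), for illustration only
w0, w1 = -4.0, 1.2
hours_studied = np.array([1, 2, 3, 4, 5])

z = w0 + w1 * hours_studied   # the linear regression part
p = sigmoid(z)                # probability that the student passes
print(p.round(3))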
How is the classification made after this? Using the sigmoid function, we get the probability of each data point belonging to a certain class. Then what? We set a decision boundary: probabilities below it are assigned to class 0 and probabilities above it to class 1.
In the pass/fail case, as in most logistic regression problems, the decision boundary is set at p = 0.5.
If p > 0.5, the prediction is Pass.
If p < 0.5, the prediction is Fail.
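That thresholding rule is a one-liner in NumPy (the probabilities below are hypothetical, just to show the idea):

import numpy as np

# hypothetical predicted probabilities from the sigmoid step
p = np.array([0.06, 0.17, 0.40, 0.69, 0.88])
predictions = np.where(p > 0.5, "Pass", "Fail")
print(predictions)   # ['Fail' 'Fail' 'Fail' 'Pass' 'Pass']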
The cost function quantifies the error between the predicted values and the expected values. For linear regression, the cost function used was the mean squared error, but mean squared error cannot be used here: if we used it with gradient descent, there would be many local minima, and the algorithm might get stuck at a local minimum rather than the global minimum.
The cost function used for logistic regression is CROSS ENTROPY. The formula is:
Cross Entropy = - Σ pi * log(qi)
Here pi represents the actual value of y and qi represents the predicted probability value. So pi will be 1 for CLASS-1 (PASS) and 0 for CLASS-0 (FAIL).
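Here is a minimal NumPy sketch of this loss for the binary case, averaged over the samples (the labels and probabilities are hypothetical):

import numpy as np

# hypothetical actual labels (1 = PASS, 0 = FAIL) and predicted probabilities
p = np.array([1, 1, 0, 0, 1])
q = np.array([0.9, 0.7, 0.2, 0.4, 0.6])

# binary cross-entropy, averaged over the samples
loss = -np.mean(p * np.log(q) + (1 - p) * np.log(1 - q))
print(loss)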
And gradient descent is used to reduce this cost function.
I am not going to be diving more into cross-entropy loss and gradient descent into this article. I will be discussing them in a separate blog post. Subscribe to get notified of my upcoming blog posts.
Just a quick note on what gradient descent does: it iteratively tries different values of the parameters, in this case w0, w1, w2, w3, and w4, to reduce the cost function and move toward a minimum.
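Just to make that concrete, here is a bare-bones gradient descent loop for logistic regression on toy data. The data and learning rate are made up, and this is only a sketch, not how sklearn actually implements it:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: hours studied vs. pass (1) / fail (0), purely for illustration
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1])

w = np.zeros(X.shape[1])   # weight (w1)
b = 0.0                    # intercept (w0)
learning_rate = 0.1

for _ in range(5000):
    p = sigmoid(X @ w + b)              # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)     # gradient of the cross-entropy loss w.r.t. w
    grad_b = np.mean(p - y)             # gradient w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)   # learned weight and intercept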
I will be using the famous heart disease dataset from Kaggle. First, we need to import all the required libraries: pandas, numpy, matplotlib, and seaborn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("heartdisease.csv")
df.head()
See that the dataset contains many columns like age, sex, cp, trestbps, etc., and the goal is to predict the target class.
# visualize pairwise relationships between the features
sns.pairplot(df)
Then split the dataset into training and test sets using the train_test_split() function.
y = df['target']
df.drop("target",axis=1,inplace=True)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df,y,test_size = 0.2)
To use logistic regression, you need to import the LogisticRegression class from the sklearn.linear_model module.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter = 1000)
lr.fit(X_train,y_train)
The fitted estimator shows all of its parameters; everything except max_iter is left at its default:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
Then make predictions on the test set:
pred = lr.predict(X_test)
1. Accuracy Value
For calculating the accuracy score, use the model_name.score() function.
print(lr.score(X_test,y_test))
0.8688524590163934
See that the model has 86.8% accuracy which is pretty high.
If you want to improve this accuracy, logistic regression has several hyperparameters you can tune, such as C, penalty, solver, and max_iter.
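For example, here is a sketch of tuning a few of them with GridSearchCV, reusing the X_train and y_train created earlier (the grid values are just illustrative, not recommendations for this dataset):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# illustrative parameter grid; the values are examples only
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l2"],
    "max_iter": [1000, 5000],
}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)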
2. Confusion Matrix
The confusion matrix is another common technique used to evaluate a classification model. To compute one, use the confusion_matrix() function from the sklearn.metrics module.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, pred)
print(cm)
[[25  6]
 [ 2 28]]
Ok, so what does this 25, 6, 2, and 28 represent? Check out my article on How to get insights from confusion matrix to understand this.
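In short, for a binary problem these four numbers are the counts of true negatives, false positives, false negatives, and true positives, and they can be unpacked from the cm matrix computed above:

# unpack the 2x2 confusion matrix computed above
tn, fp, fn, tp = cm.ravel()
print("True negatives:", tn)
print("False positives:", fp)
print("False negatives:", fn)
print("True positives:", tp)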
3. Precision and Recall
Recall: out of all the samples that actually belong to the positive class, how many did we predict as positive.
Precision: out of all the samples we predicted as positive, how many actually are positive.
from sklearn.metrics import precision_score, recall_score
print("Precision score:",precision_score(y_test, pred))
print("Recall score:",recall_score(y_test, pred))Precision score: 0.8235294117647058
Recall score: 0.9333333333333333
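These match what we would get by working them out by hand from the confusion matrix counts above:

# worked out by hand from the confusion matrix above
tp, fp, fn = 28, 6, 2
precision = tp / (tp + fp)   # 28 / 34 ≈ 0.8235
recall = tp / (tp + fn)      # 28 / 30 ≈ 0.9333
print(precision, recall)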
4. ROC Curve and AUC Score
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier across different threshold values.
It helps in understanding the trade-off between the true positive rate and the false positive rate across different threshold values.
For plotting the ROC curve in Python, first import the roc_curve function from the sklearn.metrics module.
from sklearn.metrics import roc_curve,roc_auc_score
y_pred_proba = lr.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="auc=" + str(auc))
plt.legend()
plt.show()
Let's conclude the post with the advantages and disadvantages of Logistic Regression.
Advantages:
- Does not need high computation power
- Simple and easy to understand
- Efficient and straightforward
Disadvantages:
- Vulnerable to overfitting
- May fail to converge on some datasets
Thanks for reading through the complete article. Subscribe to get notified of future content like this.