One of the most famous definitions of machine learning, by Tom Mitchell, states: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” Building and training algorithms that can learn the problem at hand is the whole idea of machine learning. Supervised learning problems are divided into regression and classification. When the output lies in a continuous range, e.g. the price of a car or the amount of rainfall, it is a regression problem. When the output is categorical, say whether a transaction is fraudulent or not, it is a classification problem. In this series of articles, I will give intuitions on the different types of algorithms that are used extensively to solve problems. In this article, we will discuss one of the most widely used classification algorithms: Logistic Regression. Let’s start classifying! ☺
What does a linear regression algorithm do? It tries to produce a numerical output such that the loss, or residual, when compared to the actual value is as low as possible. Logistic regression works on almost the same principle, but instead of the output being any numeric value, we want it to lie between 0 and 1. To achieve this, we change the form of our hypothesis to satisfy the condition 0 ≤ hθ(x) ≤ 1. The sigmoid function transforms the output of a linear function into a value between 0 and 1, which can then be interpreted as the probability of belonging to a particular class. Consider a linear function with n variables x1 to xn, and let θ be the vector of coefficients, or weights, associated with those variables.
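As a minimal sketch of this hypothesis in plain Python (the function names `sigmoid` and `hypothesis` are mine, chosen for illustration):

```python
import math

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    # linear combination theta . x passed through the sigmoid,
    # so the output can be read as P(y = 1 | x)
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# x = [1, x1, x2] with a leading 1 for the intercept term
print(hypothesis([0.0, 1.0, -2.0], [1.0, 3.0, 1.0]))  # ≈ 0.731
```

Whatever the linear part θ·x evaluates to, the sigmoid maps it into (0, 1), which is exactly the constraint we wanted on hθ(x).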
The graph below shows what a sigmoid curve looks like.
The gradient descent algorithm is the most commonly used approach in linear regression to arrive at the optimal weights for the independent variables. Gradient descent computes the loss between the predicted and actual values, and the algorithm adjusts the coefficients so that this loss is as low as possible. Graphically, gradient descent works its way downhill to get as close to the global minimum as possible. But this approach hits a dead end if we carry the squared-error loss over to logistic regression: the non-linearity of the sigmoid makes the squared-error loss non-convex, as shown in the left figure below, so gradient descent can get stuck in local minima. To get our loss into a convex form, we use the log loss function instead to arrive at suitable coefficients.
The log loss function of logistic regression is divided into two parts: one for the label y = 0 and the other for y = 1.
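In standard notation, with the same hypothesis hθ(x) as above, the two parts of the cost function are:

```latex
\mathrm{Cost}\big(h_\theta(x),\, y\big) =
\begin{cases}
  -\log\!\big(h_\theta(x)\big) & \text{if } y = 1 \\[4pt]
  -\log\!\big(1 - h_\theta(x)\big) & \text{if } y = 0
\end{cases}
```

Each branch penalizes only the probability assigned to the true label, and the penalty grows without bound as that probability heads toward the wrong extreme.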
So why a log function? From the graphs below it is clear that the penalty imposed for a wrong assignment is very large. The log loss approaches infinity as the predicted value approaches 0 when the true label is 1, and similarly the penalty incurred when the predicted value approaches 1 but the true label is 0 is also huge. The log loss function therefore clearly serves the purpose and works well for a classification algorithm.
The two log loss functions above, for y = 0 and y = 1, can be combined into a single log loss function.
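Putting the pieces together, the combined loss and a gradient descent update on it can be sketched in plain Python (the toy data, learning rate, and function names here are illustrative assumptions, not from the article):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(theta, X, y):
    # combined form: J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ]
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m

def gradient_step(theta, X, y, lr):
    # one gradient-descent update; dJ/dtheta_j = (1/m) * sum_i (h_i - y_i) * x_ij
    m = len(y)
    grads = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j in range(len(theta)):
            grads[j] += (h - yi) * xi[j]
    return [t - lr * g / m for t, g in zip(theta, grads)]

# toy data: each row is [1, feature], with the leading 1 for the intercept
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1]
theta = [0.0, 0.0]
for _ in range(100):
    theta = gradient_step(theta, X, y, 0.1)
# because the log loss is convex, each small step moves toward the global minimum
```

Because the combined loss is convex in θ, plain gradient descent steadily reduces it, which is exactly what the squared-error loss could not guarantee here.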
- The logistic regression algorithm is simple and can be easily interpreted.
- It can be easily scaled to problems with multiple classes.
1. Logistic regression performs poorly when the data cannot be linearly separated, so the decision boundary it constructs will be a poor fit.
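As a quick, illustrative sketch of this limitation, we can train the logistic regression from the snippets above on the XOR pattern, whose two classes no straight line can separate (the data and training loop here are my own example):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR: the label is 1 exactly when the two features differ;
# no linear decision boundary classifies all four points correctly
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]  # leading 1 = intercept
y = [0, 1, 1, 0]

theta = [0.0, 0.0, 0.0]
for _ in range(5000):
    grads = [0.0, 0.0, 0.0]
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j in range(3):
            grads[j] += (h - yi) * xi[j]
    theta = [t - 0.1 * g / 4 for t, g in zip(theta, grads)]

preds = [1 if sigmoid(sum(t * v for t, v in zip(theta, xi))) >= 0.5 else 0
         for xi in X]
accuracy = sum(p == yi for p, yi in zip(preds, y)) / 4
# a linear boundary can never get all four XOR points right, so accuracy stays below 1.0
```

However long we train, the model cannot reach perfect accuracy on this data, which is the poorly constructed decision boundary the disadvantage above describes.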
Hope you had a good read. Give a clap to show your support and follow me for more articles ☺