It might sound laborious, but it's not a big deal. As the name implies, it is a function, but what does the term "cost" mean? Cost simply means error or loss. So let's get into the deeper side of the cost function.
For the sake of better understanding, I'm going to explain the cost function in terms of simple linear regression. If you don't have a basic understanding of simple linear regression, check out my previous post: https://arjun-s.medium.com/simple-linear-regression-ground-level-understanding-e278ebf028d3.
Before starting, let me give you a quick recap of the hypothesis.
The hypothesis of a simple linear regression model is given below:

hθ(x) = θ0 + θ1x

Don't worry, it's just the straight line equation you studied in school algebra classes, only there we wrote it as y = mx + c.
hθ(x) is the hypothesis function, sometimes also written as h(x).
θ0 and θ1 are the parameters of the linear regression that need to be learnt. We need to find the θ0 and θ1 for which hθ(x) has minimum error/loss (the cost function; I will get into it in a minute), so that the straight line (hypothesis) best fits our data set.
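To make the notation concrete, here is a minimal Python sketch of the hypothesis; the parameter values used below are made up purely for illustration, they are not learnt values.

def hypothesis(x, theta0, theta1):
    # The straight-line hypothesis: h(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

# Arbitrary example values (nothing has been learnt yet):
print(hypothesis(3.0, theta0=1.0, theta1=2.0))  # 1.0 + 2.0 * 3.0 = 7.0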
What’s best fit means?
Choose the parameters θ0 and θ1 of the hypothesis such that hθ(x) is close to y for the training examples (x, y). Didn't quite get that? Let me explain; take a look below.
Let me start with a sample data set.
We are given a bunch of training examples, and we have to fit the best possible line through them. Wait a second! Why are we doing this? Why do we need a line? Because for a new value of x we can find the corresponding y, or result, with the help of the line/hypothesis. That is what we are training our model for. From the data points given above (red points), we can see that they tend to follow a linear trend, so we can use a simple linear regression model to plot the approximation line (hypothesis).
We have to find the best possible parameters θ0 and θ1 for a proper fit of the hypothesis.
In the graph above, the grey line represents the hypothesis. OK, we got the line, but how exactly?
FINDING THE BEST FIT PARAMETER VALUES FOR θ0 AND θ1 (just a hint; we will discuss this in later articles)
For the best fit parameter values, the error, or cost function, should be at its minimum. Look at the yellow dotted lines in the graph below connecting the training points to the approximation line. These denote the error between the predicted value and the real value. We need this error to be minimal; that is, we need our predictions to be more and more accurate. Mathematically speaking, the distance between the points in our data set and the line should be minimal.
Let's first get the equation of the cost function.
DERIVING THE EQUATION OF THE COST FUNCTION
We have to calculate the error for every one of these points to get the overall error. The error for a single point can be calculated as h(x) − y, the difference between the predicted value and the real value.
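As a quick sketch, assuming a small made-up data set and some hand-picked candidate parameter values, the per-point errors can be computed like this:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up training inputs
y = np.array([2.1, 3.9, 6.2, 8.1])   # made-up training outputs

theta0, theta1 = 0.0, 2.0            # candidate parameters, chosen by hand
predictions = theta0 + theta1 * x    # h(x) for every training example
errors = predictions - y             # error for each point: h(x) - y
print(errors)                        # roughly [-0.1  0.1 -0.2 -0.1]: some negative, some positive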
Now you might think: OK, that's easy, to get the overall error just sum all of the individual errors. But that's not quite the case here. You need to square the individual errors before summing them, and this serves several purposes:
- ) The squaring is necessary to remove any negative signs. If we sum up some positive and some negative errors, they may cancel out to 0, and we don't want that to happen (see the quick numeric sketch after this list).
- ) By squaring the errors, we get a much higher value for points that are far away from the approximation line. Therefore, if our approximation line misses some points by a large distance, the resulting error will be quite large. This heavily penalizes a poor fit.
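To see the first point in action, here is a tiny numeric sketch with arbitrary error values:

errors = [2.0, -2.0, 3.0, -3.0]      # individual errors, arbitrary values
print(sum(errors))                   # 0.0, which looks perfect but isn't
print(sum(e ** 2 for e in errors))   # 26.0, squaring exposes the misses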
This sum of the squares of all errors is further divided by 2m (m denotes the total number of training examples). But why?
Dividing by m ensures that the cost function doesn't depend on the number of examples in the training set, which allows a better comparison across models. The 1/2 is a constant that cancels the 2 appearing in the derivative of the squared term (the derivative of ½u² is simply u), which keeps the calculations clean when doing gradient descent (will discuss in an upcoming article).
And the final equation will look like this:

J(θ0, θ1) = (1/2m) Σ (h(x(i)) − y(i))², with the sum taken over i = 1 to m

where h(x(i)) is the predicted value for the i-th training example, y(i) is its real value, and m denotes the total number of training examples.
This cost function is also called the squared error function. It is the most commonly used cost function for linear regression, as it is simple and performs well.
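Putting the pieces together, here is a minimal NumPy sketch of this squared error cost function, reusing the same made-up data as above:

import numpy as np

def cost(theta0, theta1, x, y):
    # J(theta0, theta1) = (1 / (2m)) * sum((h(x) - y) ** 2)
    m = len(x)                            # total number of training examples
    errors = (theta0 + theta1 * x) - y    # h(x) - y for every point
    return np.sum(errors ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(cost(0.0, 2.0, x, y))   # small cost: this line fits the data well
print(cost(0.0, 5.0, x, y))   # much larger cost: this line fits badly

Trying out parameter pairs by hand like this is exactly the search that gradient descent will automate for us.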
As I told you earlier, our aim is to minimize this function. This can be done using various methods, and one of the most commonly used algorithms for finding the minimum of the cost function is gradient descent (I will be stressing this method more in the coming articles).
This will be the first machine learning algorithm we learn.
And things are only getting more exciting from here.