Simple linear regression is fittingly simple. It is the first algorithm one comes across while venturing into machine learning territory. Its genesis, however, lies in statistics; practitioners later smuggled it into machine learning and several other spheres, like business and economics. Anyone who has taken a first-year undergraduate course in probability and statistics can do simple linear regression. All it entails is finding the equation of the best-fit line through a bunch of data points. To do so, you follow a standard protocol: calculate the differences between the actual target values and the predicted values, square them, and then minimize the sum of those squared differences. On the surface, there's no transparent link between regression and probability; it has more to do with calculus. But there's a far more stirring side to regression analysis, concealed by the gratification and ease of importing Python libraries.
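The "standard protocol" above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation; the hours-vs-scores numbers are made-up toy data.

```python
import numpy as np

# Toy data: hours studied (x) vs. test score (y) -- illustrative values only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

# Closed-form least-squares estimates for the slope and intercept
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# The quantity being minimized: the sum of squared residuals
residuals = y - (intercept + slope * x)
sse = np.sum(residuals ** 2)
```

These closed-form estimates are exactly what minimizing the sum of squared residuals with calculus yields, so a library call like `np.polyfit(x, y, 1)` returns the same slope and intercept.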

**Zooming in on Regression**

Let’s ponder a simple regression problem on an imaginary dataset where X and Y hold their customary identities: explanatory and target variables. The holy grail with regression, in a nutshell, is to disinter a line adept at approximating the target variable (y values) with minimal error. But hold back. Instead of hounding for the line, think of all the x values plotted on the x-axis. Consider lines parallel to the y-axis passing through each x. Draw them on paper if it helps, something like shown below.

What do the grey lines represent? Well, if you account for the turbulent factors present in the real world, then Y could be any value for a given X. For instance, despite studying for the same number of hours, you may score differently on two separate monthly tests. Each line thus designates the range of possible Y values, all real numbers, for its X.

If I now challenge you to estimate the target variable for a given x, how would you proceed? The answer will unveil the probabilistic panorama of regression. Would it not help if I provided you with the conditional probability distribution of Y given X, P(Y|X)? Of course it would, but there are no means to extract the exact distribution function. So, we make an assumption, the first of many. Assume that the probability of Y given X, P(Y|X), follows a normal distribution. Why normal? Depending on your prior knowledge of the dataset you’re working on, you are free to choose any appropriate distribution. However, for reasons that’ll soon be clear, we’ll resort to the normal distribution.
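To make the assumption concrete, here is a small sketch of what "P(Y|X) is normal" means: for each x, Y is drawn from a normal distribution whose mean depends on x. The slope, intercept, and spread below are hypothetical values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" relationship; these numbers are assumptions for the sketch
true_intercept, true_slope, sigma = 47.0, 4.5, 2.0

def sample_y_given_x(x, n=1000):
    """Draw n samples from the assumed P(Y|X=x) ~ Normal(a + b*x, sigma)."""
    mean = true_intercept + true_slope * x
    return rng.normal(loc=mean, scale=sigma, size=n)

# For x = 3, the conditional mean is 47 + 4.5 * 3 = 60.5;
# the sample mean of many draws should hover near it
samples = sample_y_given_x(3.0)
```

Plotting a histogram of `samples` would show the familiar bell curve centered on the conditional mean, which is exactly the picture the grey vertical lines hint at.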