Step-by-Step Discussion and Worked Examples, Implemented Manually and in R
A linear relationship between two variables is very common, so many mathematical and statistical models have been developed to exploit it and extract more information from the data. This article explains one of the most popular methods in statistics: Simple Linear Regression (SLR).
This Article Covers:
Development of a Simple Linear Regression model
Assessment of how well the model fits
Hypothesis test using ANOVA table
That’s a lot of material to learn in one day if you are new to it. All the topics will be covered with a working example. Please work through the example yourself to understand it well.
Developing the SLR model should not be too hard. Either apply the formulas by hand or let the software do it; both are straightforward.
The assessment and the hypothesis testing part may be confusing if you are totally new to it. You may have to go over it a few times slowly. I will try to be precise and to the point.
Simple Linear Regression (SLR)
When a linear relationship is observed between two quantitative variables, Simple Linear Regression can be used to take the explanation and assessment of that data further. Here is an example of a linear relationship between two variables:
The dots in this graph show a positive upward trend. That means if the hours of study increase, exam scores also increase. In other words, there is a positive correlation between the hours of study and the exam scores. From a graph like that, the strength and direction of the correlation between two variables can be judged by eye. But the graph alone cannot quantify the correlation or tell you how much the exam score changes with each additional hour of study. If you can quantify that, it will be possible to forecast the exam scores when you know the hours of study. That will be very useful, right?
Simple Linear Regression (SLR) does just that. It uses the old school formula of the straight line that we all learned in school. Here is the formula:
y = c + mx
y is the dependent variable,
x is the independent variable,
m is the slope and
c is the intercept
In the graph above, the exam Score is the ‘y’ and the Hours of Study is the ‘x’. Exam score depends on the hours of study. So, Exam Score is the dependent variable, and Hours of Study is the independent variable.
Slope and intercept are to be determined using the Simple Linear Regression.
Linear regression is all about fitting the best-fit line through the points and finding its intercept and slope. If you can do that, you will be able to estimate the exam score whenever the hours of study are available. How accurate that estimate is depends on some more information. We will get there slowly.
In statistics, beta0 and beta1 are the terms commonly used instead of c and m. So, the equation above looks like this:
y_hat = beta0 + beta1*x
The red dotted line in the graph above is called the Least Squares Regression line. It should be as close as possible to the dots, and the most common way of achieving that is the least squares regression method.
Here y_hat is the estimated or predicted value of the dependent variable(exam scores in the example above).
Remember, predicted values can be different from the original values of the dependent variables. In the graph above, the original data points are scattered. But the predicted or expected values from the equation above will be right on the red dotted line. So, there will be a difference between the original y and the predicted values y_hat.
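To make "as close as possible to the dots" concrete, here is a minimal sketch in R. The numbers are made up for illustration and are not data from this article, and the helper name `sse` is only an assumption for this example. It compares the sum of squared differences between y and y_hat for the least squares line against two slightly altered lines.

```r
# Made-up illustrative data (NOT from the article)
x = c(1, 2, 3, 4, 5, 6)        # hours of study (hypothetical)
y = c(52, 55, 61, 64, 70, 71)  # exam scores (hypothetical)

# Sum of squared vertical distances between the points and the line b0 + b1*x
sse = function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Least squares fit; coef() returns the intercept and the slope
fit = lm(y ~ x)
b0 = coef(fit)[["(Intercept)"]]
b1 = coef(fit)[["x"]]

sse_ls = sse(b0, b1)           # least squares line
sse_shift = sse(b0 + 1, b1)    # same slope, intercept moved up by 1
sse_tilt = sse(b0, b1 + 0.5)   # same intercept, steeper slope

print(c(least_squares = sse_ls, shifted = sse_shift, tilted = sse_tilt))
```

Any other line you try should give a larger sum of squares than the least squares line, which is exactly what "least squares" means.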
The beta0 and beta1 can be calculated using the least squares regression formulas as follows:
beta1 = r * Sy / Sx
beta0 = y_bar - beta1 * x_bar
r is the sample correlation coefficient between ‘x’ and ‘y’.
y_bar is the sample mean of the ‘y’ variable.
x_bar is the sample mean of the ‘x’ variable.
Sx is the sample standard deviation of the ‘x’ variable.
Sy is the sample standard deviation of the ‘y’ variable.
Example of Developing a Linear Regression Model
I hope the discussion above was clear. If not, that’s ok. Now, we will work on an example that will make everything clear.
Here is the dataset to be used for this example:
This dataset contains arm lengths and leg lengths of 30 people. The scatter plot looks like this:
Please feel free to download this dataset and follow along:
There is a linear trend here. Let’s see if we can develop a linear regression equation from this data that reasonably predicts the leg length from the arm length.
Arm length is the x-variable
Leg length is the y-variable
Let’s have a look at the formulas above. If we want to find the calculated values of y based on the arm length, we need to calculate the beta0 and beta1.
Quantities required to calculate beta1: the correlation coefficient, the standard deviation of the arm lengths, and the standard deviation of the leg lengths.
Quantities required to calculate beta0: the mean of the leg lengths, beta1, and the mean of the arm lengths.
All of these can be calculated very easily from the dataset. I used R to calculate them. You can use any other language you are comfortable with.
First, read the dataset into RStudio:
al = read.csv('arm_leg.csv')
I already showed the whole dataset before. It has two columns: ‘arm’ and ‘leg’ which represent the length of the arms and the length of the legs of people respectively.
For the convenience of calculation, I will save the length of the arms and the length of the legs in separate variables:
arm = al$arm
leg = al$leg
Here is how to find the mean and the standard deviation of the ‘arm’ and ‘leg’ columns:
arm_bar = mean(arm)
leg_bar = mean(leg)
s_arm = sd(arm)
s_leg = sd(leg)
R also has a ‘cor’ function to calculate the correlation between two columns:
r = cor(arm, leg)
Now, we have all the information we need to calculate beta0 and beta1. Let’s use the formulas for beta0 and beta1 described before:
beta1 = r*s_leg/s_arm
beta0 = leg_bar - beta1*arm_bar
The beta1 and beta0 are 0.9721 and 1.9877 respectively.
I wanted to explain the process of working on a linear regression problem from scratch.
Otherwise, R has the ‘lm’ function, to which you can simply pass the two variables; it outputs the slope (beta1) and the intercept (beta0).
m = lm(leg~arm)
m
Call:
lm(formula = leg ~ arm)
Coefficients:
(Intercept)          arm
     1.9877       0.9721
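As a quick sanity check, the manual formulas and ‘lm’ should give the same intercept and slope. The sketch below uses made-up measurements, not the article's arm_leg.csv, and the names `arm_demo`, `leg_demo` and `m_demo` are assumptions chosen so that they do not overwrite the variables used in this article.

```r
# Made-up illustrative measurements in cm (NOT the article's arm_leg.csv data)
arm_demo = c(31.0, 33.5, 35.2, 36.8, 38.0, 39.4, 40.1, 41.7, 42.9, 44.1)
leg_demo = c(32.4, 34.0, 36.9, 37.1, 39.5, 39.8, 41.6, 42.0, 44.3, 44.6)

# Manual route: beta1 = r * Sy / Sx and beta0 = y_bar - beta1 * x_bar
beta1_demo = cor(arm_demo, leg_demo) * sd(leg_demo) / sd(arm_demo)
beta0_demo = mean(leg_demo) - beta1_demo * mean(arm_demo)

# Software route: lm() fits the same least squares line
m_demo = lm(leg_demo ~ arm_demo)

manual = c(beta0_demo, beta1_demo)
from_lm = unname(coef(m_demo))   # order: intercept first, then slope

print(manual)
print(from_lm)
print(all.equal(manual, from_lm))
```

‘coef’ is handy when you need the fitted values as numbers for further calculation instead of reading them off the printed output.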
Plugging in the values of slope and intercept, the linear regression equation for this dataset is:
y = 1.9877 + 0.9721x
If you know a person’s arm length, you can now estimate the length of his or her legs using this equation. For example, if the length of the arms of a person is 40.1, the length of that person’s leg is estimated to be:
y = 1.9877 + 0.9721*40.1
It is about 40.97. This way, you can estimate the leg lengths of other people with different arm lengths as well.
But remember this is just an estimate or a calculated value of the length of that person’s legs.
One caution though. When you use the arm length to calculate the leg length, remember not to extrapolate. That means being aware of the range of the data used to fit the model. For example, this model was fitted on arm lengths between 31 and 44.1 cm. Do not calculate the leg length for an arm length of 20 cm. That may not give you a correct estimate.
Interpreting the slope, the estimate, and the intercept in plain language:
The slope of 0.9721 means that if the length of the arms increases by one unit, the length of the legs increases by 0.9721 units on average. Please focus on the word ‘average’.
Not every person who has an arm length of 40.1 will have a leg length of 40.97. It could be a little different. But our model suggests that on average, it is 40.97. As you can see, not all the dots are on the red line. The red dotted line is nothing but the line of all the averages.
The intercept of 1.9877 means that if the length of the arms is zero, the length of the legs will still be 1.9877 on average. An arm length of zero is not possible, so in this case the intercept is only theoretical. But in other cases, it is meaningful. For example, think of a linear relationship between the hours of study and the exam score. There might be a linear relationship such that the exam score increases with the hours of study. But even if a student did not study at all, he or she may still obtain some score.
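The estimate and the caution about extrapolation can be combined in a small helper. This is only a sketch under stated assumptions: the data are made up (not the article's dataset), and `predict_leg` is a hypothetical function written for this example, built on R's ‘predict’ function.

```r
# Made-up illustrative measurements in cm (NOT the article's arm_leg.csv data)
arm_demo = c(31.0, 33.5, 35.2, 36.8, 38.0, 39.4, 40.1, 41.7, 42.9, 44.1)
leg_demo = c(32.4, 34.0, 36.9, 37.1, 39.5, 39.8, 41.6, 42.0, 44.3, 44.6)
m_demo = lm(leg_demo ~ arm_demo)

# Hypothetical helper: estimate the leg length, but refuse to extrapolate
predict_leg = function(model, new_arm, observed_arm) {
  lo = min(observed_arm)
  hi = max(observed_arm)
  if (any(new_arm < lo | new_arm > hi)) {
    stop("arm length outside the observed range [", lo, ", ", hi, "]")
  }
  # The column name in newdata must match the predictor name in the model
  unname(predict(model, newdata = data.frame(arm_demo = new_arm)))
}

est = predict_leg(m_demo, 40.1, arm_demo)
print(est)

# 20 is far below the smallest observed arm length, so the helper stops
out_of_range = tryCatch(predict_leg(m_demo, 20, arm_demo),
                        error = function(e) conditionMessage(e))
print(out_of_range)
```

Stopping with an error is one possible design choice. A warning would also work if you still want the number but need a reminder that it is an extrapolation.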
How good is this estimate?
This is a good question, right? We can estimate. But how close is this estimate to the real length of that person’s leg?
To explain that we need to see the regression line first.
Using the ‘abline’ function a regression line can be drawn in R:
plot(arm, leg, main="Arm Length vs Leg Length",
xlab="Length of Arms", ylab = "Length of Legs")
abline(m, lty = 8, col="red")
Look at this picture. The original points (black dots) are scattered around. The estimated points will fall straight on the red dotted line. As a result, the estimated length of legs will often differ from the real length of legs in this dataset.
So, it is important to check how well the regression line fits the data.
To find that out, we need to really understand the y-values. For any given data point, there are three y-values to consider.
- The real or observed y-value (which we get from the dataset; in this example, the length of the legs). Let’s call each of these ‘y_i’.
- The predicted y-value (the leg length that we can calculate from the linear regression equation; remember that it might be different from the original data point y_i). We will call it ‘y_ihat’ for this demonstration.
- The sample average of the y-variable. We already calculated it and saved it in the variable ‘leg_bar’; here we will call it ‘y_bar’.
For assessing how well the regression model fits the dataset, y_i, y_ihat and y_bar will all be very important.
The distance between y_ihat and y_bar is called the regression component.
regression component = y_ihat - y_bar
The distance between the original y point y_i and the calculated y point y_ihat is called the residual component.
residual component = y_i - y_ihat
A rule of thumb is that a regression line that fits the data well will have regression components bigger than the residual components across the data points. In contrast, a regression line that does not fit the data well will have residual components larger than the regression components.
Makes sense, right? If the observed data points are too different from the calculated data points, then the regression line did not fit well. If all the data points fell on the regression line, the residual components would be zero.
If we add the regression component and the residual component:
Total = (y_ihat - y_bar) + (y_i - y_ihat) = y_i - y_bar
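Here is a short sketch of these components in R, using made-up measurements rather than the article's dataset (the `_demo` names are assumptions for this example). R's ‘fitted’ function returns the y_ihat values of a model.

```r
# Made-up illustrative measurements in cm (NOT the article's arm_leg.csv data)
arm_demo = c(31.0, 33.5, 35.2, 36.8, 38.0, 39.4, 40.1, 41.7, 42.9, 44.1)
leg_demo = c(32.4, 34.0, 36.9, 37.1, 39.5, 39.8, 41.6, 42.0, 44.3, 44.6)
m_demo = lm(leg_demo ~ arm_demo)

y_bar = mean(leg_demo)
y_hat = fitted(m_demo)                     # predicted values, on the line

regression_component = y_hat - y_bar
residual_component = leg_demo - y_hat      # same values as residuals(m_demo)
total = leg_demo - y_bar

# First few rows, one row per data point
print(head(data.frame(leg_demo, y_hat, regression_component,
                      residual_component, total), 3))

# The two components add up to the total for every point
print(all.equal(unname(regression_component + residual_component), total))
```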
How to quantify this over the whole dataset? You could simply subtract the mean ‘y’ (y_bar) from the observed y values (y_i) and add up the results. But that gives some positive and some negative values, which cancel each other out. So the plain sum does not represent the real differences between the mean ‘y’ and the observed y values.
One popular way to quantify this is to take the sum of squares. That way, there won’t be any negatives.
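A tiny sketch of the cancellation, again with made-up values that are not from the article's dataset:

```r
# Made-up illustrative leg lengths in cm (NOT the article's data)
leg_demo = c(32.4, 34.0, 36.9, 37.1, 39.5, 39.8, 41.6, 42.0, 44.3, 44.6)

deviations = leg_demo - mean(leg_demo)

print(sum(deviations))     # positives and negatives cancel: zero up to rounding error
print(sum(deviations^2))   # squaring removes the signs, so nothing cancels
```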
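The total sum of squares or ‘Total SS’ is:
Total SS = sum of (y_i - y_bar)^2 over all data points
The regression sum of squares or ‘Reg SS’ is:
Reg SS = sum of (y_ihat - y_bar)^2 over all data points
The residual sum of squares or ‘Res SS’ is:
Res SS = sum of (y_i - y_ihat)^2 over all data points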
Total SS can also be calculated as the sum of ‘Reg SS’ and ‘Res SS’.
Total SS = Reg SS + Res SS
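The three sums of squares can be computed directly in R. The sketch below uses made-up measurements, not the article's dataset, and the `_demo` names are assumptions for this example. It also cross-checks the result against the ‘Sum Sq’ column that R's ‘anova’ function reports for a fitted model.

```r
# Made-up illustrative measurements in cm (NOT the article's arm_leg.csv data)
arm_demo = c(31.0, 33.5, 35.2, 36.8, 38.0, 39.4, 40.1, 41.7, 42.9, 44.1)
leg_demo = c(32.4, 34.0, 36.9, 37.1, 39.5, 39.8, 41.6, 42.0, 44.3, 44.6)
m_demo = lm(leg_demo ~ arm_demo)

y_bar = mean(leg_demo)
y_hat = fitted(m_demo)

total_ss = sum((leg_demo - y_bar)^2)
reg_ss = sum((y_hat - y_bar)^2)
res_ss = sum((leg_demo - y_hat)^2)

print(c(total = total_ss, regression = reg_ss, residual = res_ss))

# Total SS = Reg SS + Res SS, up to floating-point rounding
print(all.equal(total_ss, reg_ss + res_ss))

# anova() lists the regression row first, then the residual row
print(anova(m_demo)[["Sum Sq"]])
```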
Everything is ready! Now it’s time to calculate the R-squared value. R-squared is the measure that represents how well the regression line fits the data. Here is the formula for R-squared:
R-squared = Reg SS / Total SS
If the R-squared value is 1, that means, all the variation in the response variable (y-variable) can be explained by the explanatory variable (x-variable).
In contrast, if the R-squared value is 0, that means none of the variation in the response variable can be explained by the explanatory variable.
This is one of the most popular ways of assessing the fit of the model to the data.
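As a sketch, R-squared can be computed from the sums of squares and compared with the value R reports in ‘summary’. The data are made up and are not the article's dataset, and the `_demo` names are assumptions for this example. In simple linear regression, R-squared also equals the square of the correlation coefficient r.

```r
# Made-up illustrative measurements in cm (NOT the article's arm_leg.csv data)
arm_demo = c(31.0, 33.5, 35.2, 36.8, 38.0, 39.4, 40.1, 41.7, 42.9, 44.1)
leg_demo = c(32.4, 34.0, 36.9, 37.1, 39.5, 39.8, 41.6, 42.0, 44.3, 44.6)
m_demo = lm(leg_demo ~ arm_demo)

total_ss = sum((leg_demo - mean(leg_demo))^2)
reg_ss = sum((fitted(m_demo) - mean(leg_demo))^2)

r_squared = reg_ss / total_ss                  # R-squared = Reg SS / Total SS
r_squared_lm = summary(m_demo)$r.squared       # value reported by R
r_squared_cor = cor(arm_demo, leg_demo)^2      # square of the correlation

print(c(manual = r_squared, from_summary = r_squared_lm, from_cor = r_squared_cor))
```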
Here is the general form of the ANOVA table. You already know some of the parameters used in the table. We will discuss the rest after the table.