Linear regression is an old-school statistical modeling approach that is still popular. With modern languages and libraries, it has become much easier to work with. In my last article, I explained the development of a simple linear regression method and its analysis in detail. If you haven’t seen it, here is the link:
Multiple linear regression is an extension of simple linear regression. In simple linear regression, we worked on the relationship between one independent variable or explanatory variable and one dependent variable or response variable. Simple linear regression uses this very common general formula:
y = mx + c
where,
y = dependent variable or response variable
x = independent variable or explanatory variable
m = slope
c = intercept
If x and y share a linear relationship, you can predict ‘y’ if you have the ‘x’ data available.
In statistics, beta0 and beta1 are used instead of c and m. So, the formula becomes:
y = beta0 + beta1*x
This equation is good enough when you are establishing a relationship between profit and sales, arm length and leg length, systolic blood pressure and diastolic blood pressure, and so on. That is, when there is only one explanatory variable and one response variable.
But in real-world scenarios, we often want to analyze the relationship between one response variable and several explanatory variables. When the response variable is an exam score, there might be several explanatory variables such as study time, attendance in school, playtime, and sleep hours. We want to analyze the relationship between all the possible explanatory variables and the response variable (exam score).
In this case, the equation of the linear regression becomes:
y = beta0 + beta1*x1 + beta2*x2 + beta3*x3 + … + betan*xn (Equation 1)
If we think of the exam score example mentioned before, y is the exam score. x1, x2, and x3 are the study time, attendance at school, and playtime. We need to determine the values of beta0, beta1, beta2, beta3, and so on.
Calculating the values of the betas is very easy and straightforward in R. Let’s work on an example.
Model Development, Interpretation, and Assessment
For this demonstration, we will use a dataset that contains age, weight, body mass index (BMI), and systolic blood pressure. We will consider systolic blood pressure as the dependent variable, and age, weight, and BMI as the independent or explanatory variables. I will start with age as the only explanatory variable, then add weight and BMI one by one to understand the effect of each of them on the model and on the response variable (systolic blood pressure).
Please feel free to download the dataset and follow along, if you want to practice:
Let’s import the dataset first.
data = read.csv("health_data.csv")
head(data)
As we will examine the relationship between age and systolic blood pressure first, it will be interesting to see a scatter plot of age and systolic blood pressure. Here is the scatter plot:
plot(data$Age, data$Systolic_blood_pressure,
main= "Systolic Blood Pressure vs Age",
xlab = "Age", ylab = "Systolic Blood Pressure")
It shows a linear trend, though with a lot of noise around it. In R, we can fit the linear regression model directly using the ‘lm’ function. I will save this model in a variable ‘m’.
m = lm(data$Systolic_blood_pressure ~ data$Age)
m
Output:
Call:
lm(formula = data$Systolic_blood_pressure ~ data$Age)

Coefficients:
(Intercept)     data$Age
     94.872        0.635
The output shows that the intercept (beta0) is 94.872 and the slope (beta1) is 0.635. We considered x1 as age, so the linear regression equation becomes:
y = 94.872 + 0.635*Age
As we considered only one explanatory variable, there are no x2, x3, or beta2, beta3 terms.
Here, the intercept of 94.872 means that if the age is zero or very close to zero, the systolic blood pressure will still be 94.872. But the minimum age in this dataset is 18 (feel free to check on your own), so an age of zero is far outside the range of the data. That’s why the intercept is not so meaningful in this case.
The slope of 0.635 means that if the age increases by 1 unit, the systolic blood pressure will increase by 0.635 units on average.
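If you want to see how well this line follows the data, you can redraw the scatter plot and overlay the fitted line. Here is a quick sketch using base R’s abline:
plot(data$Age, data$Systolic_blood_pressure,
main= "Systolic Blood Pressure vs Age",
xlab = "Age", ylab = "Systolic Blood Pressure")
abline(m, col = "red")   # overlay the regression line from the model 'm'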
Using this equation, you can calculate a person’s systolic blood pressure if you know their age. For example, if a person is 32 years old, the calculated systolic blood pressure will be:
y = 94.872 + 0.635*32 = 115.192
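You can reproduce this calculation in R directly from the fitted model; a small sketch using the coefficients stored in ‘m’:
# intercept (beta0) plus slope (beta1) times an age of 32
coef(m)[1] + coef(m)[2] * 32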
How accurate this estimate is, we will determine later in this article. Now it is time to add one more variable.
Adding Weight to the Model
This is pretty simple. In the model ‘m’, we considered only one explanatory variable ‘Age’. This time we will have two explanatory variables: Age and Weight. It can be done using the same ‘lm’ function and I will save this model in a variable ‘m1’.
m1 = lm(data$Systolic_blood_pressure ~ data$Age + data$Weight)
m1
Output:
Call:
lm(formula = data$Systolic_blood_pressure ~ data$Age + data$Weight)

Coefficients:
(Intercept)     data$Age  data$Weight
    84.2799       0.6300       0.1386
Here, the intercept (beta0) is 84.28. Notice that it is different from the intercept in ‘m’ (94.87). This time the slope for the Age variable (beta1) is 0.63, which is not so different from beta1 in model ‘m’. This slope means that if Age increases by 1 unit, systolic blood pressure will increase by 0.63 units on average when the Weight variable is controlled or fixed.
On the other hand, the slope for the Weight variable (beta2) is 0.1386, which means that if weight increases by 1 unit, systolic blood pressure will increase by 0.1386 units on average when the Age variable is controlled or fixed.
The linear regression equation becomes:
y = 84.2799 + 0.63* Age + 0.1386 * Weight
If you know a person’s Age and Weight, you will be able to estimate that person’s systolic blood pressure using this formula.
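As a side note, a slightly more idiomatic way to do this in R is to fit the model with a ‘data’ argument, so that ‘predict’ can be applied to new observations. A minimal sketch, where the Age and Weight values below are hypothetical:
# refit with a data argument so predict() accepts new observations
m1_alt = lm(Systolic_blood_pressure ~ Age + Weight, data = data)
predict(m1_alt, newdata = data.frame(Age = 32, Weight = 70))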
Adding BMI to this Model
Lastly, we add BMI to see if it changes the dynamics of the model. Let’s use the ‘lm’ function again and save this model in a variable named ‘m2’.
m2 = lm(data$Systolic_blood_pressure ~ data$Age + data$Weight + data$BMI)
m2
Output:
Call:
lm(formula = data$Systolic_blood_pressure ~ data$Age + data$Weight +
    data$BMI)

Coefficients:
(Intercept)     data$Age  data$Weight     data$BMI
    89.5218       0.6480       0.3209      -0.7244
Notice the output carefully. The intercept changed again; it is 89.52 this time. The slope for Age is now 0.648, where it was 0.63 in the previous model. The slope for Weight is 0.3209, while it was 0.1386 in the previous model. So, after adding BMI, the values of beta0, beta1, and beta2 changed quite noticeably. This kind of shift is expected when explanatory variables are correlated with one another, as Weight and BMI typically are. The slope of the BMI variable is -0.7244.
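You can check this for yourself. Here is a quick sketch that looks at how strongly Weight and BMI are correlated in this dataset:
# a strong correlation helps explain why the Weight slope
# shifted once BMI entered the model
cor(data$Weight, data$BMI)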
The linear regression equation becomes:
y = 89.5218 + 0.648*Age + 0.3209*Weight - 0.7244*BMI
Woo! Our multiple linear regression model is ready! Now if we know the age, weight, and BMI of a person, we will be able to calculate the systolic blood pressure of that person!
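For example, here is a quick sketch for a hypothetical person; the Age, Weight, and BMI values below are made up for illustration:
# coefficient vector order: (Intercept), Age, Weight, BMI
sum(coef(m2) * c(1, 32, 70, 24))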
How accurate is the systolic blood pressure calculated from this equation?
Let’s find out. One very common and popular way to assess the fit of the data in multiple linear regression is the coefficient of determination (R-squared). The formula for R-squared is the same as in simple linear regression:
R_squared = sum((y_calc - y_mean)^2) / sum((y - y_mean)^2)
Here,
y_calc is the calculated value of the response variable. In this case, the values of systolic blood pressure that are calculated using the linear regression model
y_mean is the mean of original systolic blood pressure values
y is the original systolic blood pressures from the dataset
The R-squared value represents the proportion of the variation in the response variable that can be explained by the explanatory variables.
I will use R to calculate R-squared. It is very simple in R. We have three models, saved in three different variables: m, m1, and m2. It will be good to see the fit of each model, so I will calculate the R-squared value for all three. Here is the R-squared value for the first model ‘m’, where the only explanatory variable was ‘Age’.
R_squared1 = sum((fitted(m) - mean(data$Systolic_blood_pressure))**2) / sum((data$Systolic_blood_pressure - mean(data$Systolic_blood_pressure))**2)
R_squared1
Output:
0.3795497
That means 37.95% of the variation in systolic blood pressure can be explained by Age.
The R-squared value for the second model ‘m1’ where explanatory variables were ‘Age’ and ‘Weight’:
R_squared2 = sum((fitted(m1) - mean(data$Systolic_blood_pressure))**2) / sum((data$Systolic_blood_pressure - mean(data$Systolic_blood_pressure))**2)
R_squared2
Output:
0.3958562
39.58% of the variation in systolic blood pressure can be explained by ‘Age’ and ‘Weight’ together. Notice that R-squared improved a bit after adding Weight to the model.
Lastly, the R-squared value for the model m2 where the explanatory variables were ‘Age’, ‘Weight’, and ‘BMI’.
R_squared3 = sum((fitted(m2) - mean(data$Systolic_blood_pressure))**2) / sum((data$Systolic_blood_pressure - mean(data$Systolic_blood_pressure))**2)
R_squared3
Output:
0.4099555
40.99% of the variation in systolic blood pressure can be explained by ‘Age’, ‘Weight’, and ‘BMI’.
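As a sanity check, R also reports R-squared as part of each model summary. The values below should match the manual calculations above:
# built-in R-squared for the three models
c(summary(m)$r.squared, summary(m1)$r.squared, summary(m2)$r.squared)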
The next steps might be a bit hard to understand fully if you are not familiar with confidence interval or hypothesis test concepts. Here is an article to learn confidence interval concepts:
Here is an article on hypothesis tests. Please check it out:
Inference
In this section, we will work on an F-test to see if the model is significant, that is, whether at least one of the explanatory variables has a linear association with the response variable.
There is a five-step process to perform this hypothesis test:
Step 1:
Set the hypothesis and select the alpha level:
We set a null hypothesis and an alternative hypothesis. The null hypothesis is that the slopes of all the variables are zero, which means there is no linear association between any of the explanatory variables and the response variable. Here is the null hypothesis:
H0: beta1 = beta2 = beta3 = 0
If we find enough evidence against the null hypothesis, we will reject it. That will give us evidence that at least one of the slopes is not equal to zero, which means at least one of the explanatory variables has a linear association with the response variable. This is the alternative hypothesis:
H1: at least one of beta1, beta2, beta3 is not equal to 0
I am setting the alpha level as 0.05. That means the confidence level is 95%.
Step 2:
Select the appropriate test statistic. Here we will use the F-test. The test statistic is:
F = (Regression Sum of Squares / k) / (Residual Sum of Squares / (n - k - 1))
with degrees of freedom k and n - k - 1.
Here, k is the number of explanatory variables, which gives the numerator degrees of freedom. In this example, k is 3 (Age, Weight, and BMI).
n is the number of rows, or the number of data points. In this dataset, there are 100 rows, so n = 100. Feel free to check it using the ‘nrow(data)’ function.
We will discuss how to calculate the F-stat in a little bit.
Step 3:
State the decision rule:
We need to determine the critical value from the F-distribution with df = 3, n - k - 1 = 100 - 3 - 1 = 96, and alpha = 0.05.
There are two ways to find this critical value. You can use an F-distribution table, but I prefer using R. So, here is the value from the F-distribution calculated in R:
qf(0.95, 3, 96)
Output:
[1] 2.699393
The critical value from the F-distribution is 2.699.
The decision rule is:
Reject the null hypothesis if F ≥ 2.699
Otherwise, do not reject the null hypothesis.
Step 4:
Compute the F-Statistic.
Here is the table that calculates the F-statistic.
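If you want to verify the result directly in R, the model summary also stores the F-statistic together with its degrees of freedom. A quick sketch:
# F-statistic (value, numerator df, denominator df) for model 'm2'
fstat = summary(m2)$fstatistic
fstat
# apply the decision rule: reject the null hypothesis if F >= the critical value
fstat["value"] >= qf(0.95, fstat["numdf"], fstat["dendf"])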