Borrowed from the domain of statistics, linear regression is a simple and widely used model in machine learning. It is particularly useful for predictive analytics, where the goal is to make the most accurate predictions possible from historical data. Linear regression models the relationship between independent (explanatory) and dependent (response) variables.
When a single independent variable is used to predict the response, the process is termed simple linear regression; when more than one independent variable is considered, the process is called multiple linear regression. Thankfully, both scenarios can be run on datasets imported into R.
In this particular case study, I wanted to see if there was a significant linear relationship between the number of fish meals consumed per week and the total mercury levels found amongst fishermen. The dataset used in this analysis is attached as an appendix item at the end of the article. Since the data involve only two variables, I applied a simple linear regression model to the dataset in question.
This article focuses on practical steps for conducting linear regression in R, so there is an assumption that you have prior knowledge of linear regression, hypothesis testing, ANOVA tables, and confidence intervals. If you need additional background on these topics, I recommend checking out the tutorials listed at the end of this article.
Step 1: Save the data to a file (Excel or CSV) and read it into R for analysis
This step is completed as follows.
1. Save the CSV file locally (e.g., on your desktop)
2. In RStudio, navigate to “Session” -> “Set Working Directory” -> “Choose Directory” -> select the folder where the file was saved in Step 1
3. In RStudio, run the command:
data <- read.csv("fisherman_mercury_levels.csv")
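Once the data frame is created, it is worth a quick sanity check that the rows, columns, and types are what you expect. A minimal sketch (the exact output depends on your copy of the dataset):

```r
# Quick sanity checks on the data frame created above
head(data)     # first six rows
str(data)      # dimensions and column types
summary(data)  # per-column summaries; useful for spotting oddities
```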
Step 2: To get a sense of the data, generate a scatterplot. Consciously decide which variable should be on the x-axis and which should be on the y-axis. Using the scatterplot, evaluate the form, direction, and strength of the association between the variables.
Looking at the plot, there is a noticeable positive, linear association between the number of fish meals consumed per week and the total mercury levels found amongst the fishermen. As the number of fish meals per week increases, the total mercury levels similarly increase. The association is strong, since the points cluster tightly together: if the least-squares line were drawn through the data, most points would lie close to it rather than scattered far away.
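A base-R sketch of the scatterplot, using the data frame `data` from Step 1. The column names `fishmlwk` and `total_mercury` are hypothetical placeholders; substitute whatever `names(data)` shows for your file:

```r
# Explanatory variable (fish meals per week) on the x-axis,
# response (total mercury level) on the y-axis.
# NOTE: column names here are hypothetical -- check names(data).
plot(data$fishmlwk, data$total_mercury,
     xlab = "Fish meals per week",
     ylab = "Total mercury level",
     main = "Mercury level vs. weekly fish meals",
     pch  = 19)
```

Putting the explanatory variable on the x-axis matters: the regression of y on x is not the same line as the regression of x on y.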
Step 3: Calculate the correlation coefficient. What does the correlation tell us?
The correlation coefficient is a statistical measure that evaluates the strength and direction (positive or negative) of the linear relationship between two variables. Running the cor() function in R on the number of fish meals per week and the mercury levels for the 100 fishermen gives a value of 0.78.
Correlation coefficients range from -1 to 1, with negative values indicating a negative association and positive values a positive one, so it can be concluded that these two variables are positively correlated. The closer the coefficient is to 1 in absolute value, the stronger the association; a value of 0.78 therefore indicates a strong positive correlation.
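In R the coefficient is a single call (column names hypothetical, as before); `cor.test()` additionally returns a p-value and confidence interval for the correlation:

```r
# Pearson correlation between the two columns (hypothetical names)
r <- cor(data$fishmlwk, data$total_mercury)
r  # reported as 0.78 for this dataset

# Optional: significance test for the correlation
cor.test(data$fishmlwk, data$total_mercury)
```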
Step 4: Find the equation of the least-squares regression line and write it out. Add the regression line to the scatterplot you generated above.
The least-squares regression line takes the form ŷ = β0 + β1x, where the slope is β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and the intercept is β0 = ȳ − β1x̄. For this dataset, the fitted line is ŷ = 1.3339 + 0.4841x.
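In practice `lm()` computes the least-squares estimates directly, and `abline()` overlays the fitted line on the scatterplot. A sketch, again with hypothetical column names:

```r
# Fit mercury level as a linear function of weekly fish meals
# (column names are hypothetical -- check names(data))
model <- lm(total_mercury ~ fishmlwk, data = data)
coef(model)  # beta0 (intercept) and beta1 (slope) estimates

# Redraw the scatterplot and overlay the fitted line
plot(data$fishmlwk, data$total_mercury, pch = 19,
     xlab = "Fish meals per week", ylab = "Total mercury level")
abline(model, col = "red", lwd = 2)
```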
Step 5: What is the estimate for β1? How can we interpret this value? What is the estimate for β0? What is the interpretation of this value?
The estimate for β1 is 0.4841, which is the slope of the least-squares regression line. This value indicates a positive, linear increase in the response variable as the explanatory variable increases: each additional fish meal per week is associated with an increase of about 0.48 units in mercury level, or roughly a 1-unit increase for every 2 additional meals.
The estimate for β0 is 1.3339, which is the y-intercept of the least-squares regression line. It is of particular interest because it reflects the expected mercury level when no meals containing fish are consumed per week. The β0 value shows that there is a non-zero baseline mercury level: not eating fish in the weekly meals does not guarantee a zero mercury level in the fishermen.
Step 6: Calculate the ANOVA table and the coefficient table, which gives the standard error of β̂1. Formally test the hypothesis that β1 = 0 using the t-test at significance level α = 0.10.
Formal Test for Linear Association
1. Specify the null hypothesis (H0: β1 = 0)
2. Specify the alternative hypothesis (Ha: β1 > 0)
3. Set the significance level (α = 0.10)
i. Determine the appropriate value from the t-distribution with n-2 = 100–2 = 98 degrees of freedom and associated with a right-hand tail probability of α = 0.10.
ii. Using R, the critical t-value associated with α = 0.10 and df = 98 is 1.2903.
iii. Reject H0 if t ≥ 1.2903
Otherwise, do not reject H0
4. Compute the t-statistic (t = β̂1 / SE(β̂1), which the regression output reports as 12.26)
Reject H0 since 12.26 > 1.29.
There is a linear association between the number of meals including fish per week and the levels of mercury in the fishermen. As a result, it can be concluded that eating more meals with fish per week is associated with higher mercury levels amongst fishermen (the data are observational, so a strictly causal claim is not warranted).
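All the pieces of the test above come from two R calls: `qt()` for the critical value and the coefficient table of the fitted model for the t-statistic. A sketch, refitting the model for completeness (column names hypothetical):

```r
# Fit the simple linear regression (hypothetical column names)
model <- lm(total_mercury ~ fishmlwk, data = data)

# One-sided critical value at alpha = 0.10 with n - 2 = 98 df (~1.2903)
qt(0.90, df = 98)

# Coefficient table: estimates, standard errors, t values, p-values;
# the slope row's t value is the 12.26 used in the decision above
summary(model)$coefficients
```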
The ANOVA table can be used to find the R^2 value for this linear association.
The R^2 value of 0.61 indicates that 61% of the variability in mercury levels amongst the fishermen can be explained by the number of fish meals consumed per week.
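Both quantities are available from the fitted model. A useful cross-check: in simple linear regression, R² is just the squared correlation coefficient (0.78² ≈ 0.61). Column names remain hypothetical:

```r
model <- lm(total_mercury ~ fishmlwk, data = data)  # hypothetical names
anova(model)               # regression and residual sums of squares
summary(model)$r.squared   # ~0.61 for this dataset
```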
Additionally, the 90% confidence interval for β1 is calculated below.
The 90% confidence interval for β1 is 0.41 to 0.55. We can be 90% confident that this interval contains the true slope: if the sampling were repeated many times, about 90% of the intervals constructed this way would capture the true β1. (The true slope is a fixed value, so strictly speaking it is the interval, not the slope, that varies from sample to sample.)
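`confint()` computes the interval directly from the standard error and the two-sided critical value qt(0.95, 98), since 5% sits in each tail for a 90% interval. Column names hypothetical:

```r
model <- lm(total_mercury ~ fishmlwk, data = data)  # hypothetical names
confint(model, level = 0.90)  # 90% CIs for the intercept and slope;
                              # the slope row is the (0.41, 0.55) above
```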
1. Source Data
2. R Code
3. Statistical Tutorials
Five-Step Hypothesis Testing:
Hope this article helps you on your data analytics journey!