
What is Regression Analysis?
Regression is the process of predicting a dependent variable from the independent variables at hand. For example, a retail company might predict next year's revenue from factors such as product sales, sales achieved across the globe, customer satisfaction growth, and any losses in sales; all of these parameters feed into the prediction. Such a prediction can be made with a linear regression model. Let's see it in detail.
Linear Regression
Linear regression determines the linear relationship between the dependent and independent variables. It is expressed in the form of the equation of a straight line:
Y = mx + c

where:
Y is the dependent variable
x is the independent variable
m is the slope / gradient
c is the intercept
Slope / Gradient
The slope m measures the rate of change: how much y changes for a one-unit increase in x.
Intercept
The intercept c is the value of y when x = 0.
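As a tiny illustration (all numbers made up), the slope and intercept of the line through two points can be computed directly:

# Two hypothetical points on a line
x1, y1 = 1, 5
x2, y2 = 3, 9

m = (y2 - y1) / (x2 - x1)  # slope: rise over run -> 2.0
c = y1 - m * x1            # intercept: the value of y at x = 0 -> 3.0
print(m, c)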
Simple Linear Regression
Simple linear regression uses a single independent variable. For instance, predicting rent based on square feet alone is simple linear regression.
Multiple Linear Regression
Multiple linear regression uses two or more independent variables. Predicting rent based on both square feet and the age of the building is an example of multiple linear regression. It can be expressed in the format below:
y = b + m1x1 + m2x2 + ... + mnxn
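To make this concrete, here is a minimal sketch of multiple linear regression on made-up rent data (square feet and building age; every number below is hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [square_feet, building_age_years] -> monthly rent
X = np.array([[650, 10], [800, 5], [950, 20], [1100, 2], [1300, 15]])
y = np.array([900, 1250, 1200, 1700, 1800])

model = LinearRegression().fit(X, y)
print(model.intercept_)  # b in the equation above
print(model.coef_)       # [m1, m2], one coefficient per feature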
The simple linear regression model can be represented graphically as a best-fit line through the data points, while the multiple linear regression model can be represented as a plane (for two independent variables) or a hyperplane (for more).
How Do You Know This Is the Best-Fit Line?
The best-fit line is obtained by minimizing the residuals. A residual is the vertical distance between the actual value Y and the estimated value Y', as sketched below:
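Here is a small sketch (with made-up actual and estimated values) of the residuals and the sum of squared residuals that the best-fit line minimizes:

import numpy as np

# Hypothetical actual values Y and estimates Y' from some candidate line
y_actual = np.array([30000.0, 42000.0, 55000.0])
y_estimated = np.array([31000.0, 40500.0, 56000.0])

residuals = y_actual - y_estimated  # distance between actual Y and estimated Y'
sse = np.sum(residuals ** 2)        # least squares picks the line that minimizes this
print(residuals, sse)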
Let’s start implementing the Simple Linear Regression model using sample data.
Import the libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plot
import statsmodels.api as sm
Reading the salary dataset
df = pd.read_csv(r'Salary_Data.csv')
Previewing the data
df.head(5)
(Output: the first five rows of the salary dataset.)
Selecting the independent/dependent variables
The independent variable is years of experience, and the dependent variable is salary.
x = df.iloc[:, :-1].values  # independent variable: years of experience
Y = df.iloc[:, 1].values    # dependent variable: salary
Splitting into training & testing data
We will split both variables into a training set and a test set. We have 30 observations, so we will take 20 observations for the training set and 10 for the test set. We split the dataset so that we can train our model on the training data and then evaluate it on unseen test data. The code is given below:
from sklearn.model_selection import train_test_split
x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=1/3, random_state=0)
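As a quick sanity check (the dataset has 30 rows, as noted above), you can verify that the split matches the intended 20/10:

print(x_train.shape, x_test.shape)  # expected: (20, 1) and (10, 1)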
Applying the Linear Regression model
The next step is to fit our model to the training dataset. To do so, we will import the LinearRegression class of the linear_model library from scikit-learn. After importing the class, we create an object of the class named regression. The code is given below:
from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(x_train, Y_train)
In the above code, we used the fit() method to fit our simple linear regression object to the training set. We passed x_train and Y_train, our training data for the independent and dependent variables. Fitting the regression object to the training set lets the model learn the relationship between the predictor and target variables.
Finding the Intercept & Coefficients
In other words, we determine the slope and intercept, which can be read from the fitted LinearRegression() object:
print(regression.intercept_)  # c, the intercept
print(regression.coef_)       # m, the slope (one value per feature)
Mathematically, this can be expressed as below:

y = 26723.9 + 9339 * x

To predict the salary for 5 years of experience:

y = 26723.9 + 9339 * 5 = 73418.9
Based on the intercept and coefficient, the prediction can be made.
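The same prediction can be made directly with the fitted model; the exact number depends on your data and split:

# Predict the salary for 5 years of experience using the fitted model
print(regression.predict([[5.0]]))  # roughly 73419 with the coefficients above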
Prediction
Here we predict the dependent variable (salary) from the independent variable (experience). Our model is now ready to predict the output for new observations. In this step, we provide the test dataset (new observations) to the model and check whether it can predict the correct output.
We will create a prediction vector y_pred, containing predictions for the test set, and x_pred, containing predictions for the training set (used below to draw the regression line).
y_pred= regression.predict(x_test)
x_pred= regression.predict(x_train)
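To see how these predictions compare with the actual test salaries, a quick side-by-side view helps:

# Actual vs. predicted salaries on the test set
comparison = pd.DataFrame({'Actual': Y_test, 'Predicted': y_pred})
print(comparison)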
Visualizing the Training set results
Now in this step, we will visualize the training set results. To do so, we use the scatter() function of the pyplot library, which we already imported at the start. The scatter() function creates a scatter plot of the observations.
On the x-axis we plot the employees' years of experience, and on the y-axis their salaries. We pass the real values of the training set: the years of experience x_train, the training salaries Y_train, and the color of the observations. Here we use green, but it can be any color of your choice.
Next, we plot the regression line using the plot() function of the pyplot library, passing the years of experience for the training set, the predicted salaries for the training set x_pred, and the color of the line.
Then we give the plot a title using the title() function of the pyplot library, passing the name "Salary vs Experience (Training Dataset)".
After that, we label the x-axis and y-axis using the xlabel() and ylabel() functions.
Finally, we render everything with show(). The code is given below:
plot.scatter(x_train, Y_train, color="green")
plot.plot(x_train, x_pred, color="red")
plot.title("Salary vs Experience (Training Dataset)")
plot.xlabel("Years of Experience")
plot.ylabel("Salary(In Rupees)")
plot.show()
In the above plot, the real observations appear as green dots, and the predicted values lie along the red regression line. The regression line shows the relationship between the dependent and independent variables.
The quality of the fit can be judged from the differences between the actual and predicted values. As we can see in the plot, most of the observations are close to the regression line, so our model fits the training set well.
Visualizing the Test set results
In the previous step, we visualized the performance of our model on the training set. Now we will do the same for the test set. The code remains the same as above, except that we use x_test and Y_test instead of x_train and Y_train.
Here we also change the color of the observations to differentiate the two plots, but this is optional. The regression line itself stays the same, since it was fitted on the training data.
plot.scatter(x_test, Y_test, color="blue")
plot.plot(x_train, x_pred, color="red")
plot.title("Salary vs Experience (Test Dataset)")
plot.xlabel("Years of Experience")
plot.ylabel("Salary(In Rupees)")
plot.show()
In the above plot, the observations are shown in blue and the predictions by the red regression line. As we can see, most of the observations are close to the line, so our simple linear regression model makes good predictions.
Determining the accuracy of the model
The accuracy can be assessed with the R-squared and adjusted R-squared values, which can be calculated with the statsmodels API. The R-squared statistic provides a measure of fit. As we increase the number of independent variables, the R-squared value always increases, but this does not mean each new variable is actually related to the output: adding features is not guaranteed to yield better results even though R-squared rises. To correct this, we use the adjusted R-squared value, which penalizes features that do not correlate with the output: Adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors.
# statsmodels' OLS does not add an intercept by default, so add a constant
# column first; without it the line is forced through the origin
x_train_const = sm.add_constant(x_train)
model = sm.OLS(Y_train, x_train_const)
result = model.fit()
result.summary()
This model has a high R-squared value (0.976), which means it explains most of the variance and provides a good fit to the data.
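As a cross-check, scikit-learn's score() method reports R-squared on held-out data, and the adjusted value can be computed by hand; the numbers may differ slightly from the statsmodels summary:

r2 = regression.score(x_test, Y_test)          # R-squared on the test set
n, p = x_test.shape                            # observations, predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared
print(r2, adj_r2)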
The next article discusses multiple linear regression.
Sit tight 🙂
Happy learning 🙂