Here is what you have been looking for!
An end-to-end, beginner-friendly walkthrough of a simple linear regression model using Scikit-learn. I have done my best to explain even the minute details so that this article is comprehensible for everyone.
FIRST AND FOREMOST, LET ME GIVE YOU A QUICK UNDERSTANDING OF SCIKIT-LEARN:
Scikit-learn is probably the most useful library for machine learning in Python. The sklearn library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction. The library is largely written in Python and is built upon NumPy, SciPy and Matplotlib. It also offers feature extraction, cross-validation, feature selection and more. Moreover, it is an open-source library.
You can download the dataset from my GitHub repository here. Or you can download any simple linear regression dataset from Kaggle.
Let's begin.
We need to import some libraries:
- Scikit-learn (sklearn): most machine learning beginners start building their models with this powerful library.
- Pandas: provides easy-to-use data structures and data analysis tools for the Python programming language.
- Matplotlib: helps us plot a variety of graphs and charts so that we can visualise our results easily.
Import these libraries.
Use pandas to load the dataset 'train.csv' into a variable 'data'.
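As a rough sketch of this loading step: the snippet below reads a CSV with pandas. The three rows in `csv_text` are made up so that the snippet runs without the real file. With 'train.csv' on disk, the call would simply be `pd.read_csv("train.csv")`.

```python
import io

import pandas as pd

# Stand-in for 'train.csv': three made-up rows with the same 'x' and 'y'
# columns, so the snippet runs without the real file.
csv_text = "x,y\n1,1.2\n2,1.9\n3,3.1\n"

# With the real file on disk this would be: data = pd.read_csv("train.csv")
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first rows of the table
print(data.shape)    # (number of rows, number of columns)
```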
Let's see what this data looks like:
The data contains 700 training examples. There is only a single feature 'x', since we have taken a univariate linear regression dataset, and the corresponding solution for each 'x' is given in the 'y' column. Now let's plot this dataset to understand the shape of the training set. Only then can we decide which model to fit to the data and choose the required algorithm.
PLOTTING THE DATASET
Graphically, the dataset looks like this:
From the plot we can conclude that the data follows a linear trend, so we can easily apply a linear regression model to it.
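A scatter plot like the one described can be sketched with Matplotlib as follows. The seven points are made-up values that roughly follow y = x. They are only a stand-in for the real 700-row dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up points that roughly follow y = x (a stand-in for the real dataset).
data = pd.DataFrame({"x": [5, 20, 35, 50, 65, 80, 95],
                     "y": [4.1, 21.3, 33.8, 51.2, 64.0, 81.5, 94.2]})

fig, ax = plt.subplots()
ax.scatter(data["x"], data["y"], s=10)   # one dot per training example
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Training data")
# plt.show()  # uncomment to open the plot window when running interactively
```

If the dots line up along a straight band, as they do here, a linear model is a reasonable first choice.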
Now we can use a pandas DataFrame to take the columns out of 'data' and arrange them as tabular rows and columns.
But why? You will see in a moment.
Changing to DataFrame variables
We have now split the features (x) and the solutions (y) of the dataset using pandas DataFrames, storing the feature values in 'X_axis' and the solutions in 'Y_axis'.
You can think of a DataFrame as a spreadsheet or an SQL table. It is generally the most commonly used pandas object.
Look at the figure given.
The dataset is now organized into two 2-D data structures:
- X_axis: the feature values, x = 24, 50, 15, ...
- Y_axis: the solutions for the corresponding 'x', that is y = 21.549452, 47.464463, etc.
Each DataFrame consists of 700 rows and 1 column. In other words, each is a 700×1 matrix, or simply a column vector. This 2-D layout (one row per sample, one column per feature) is the shape scikit-learn expects for its inputs.
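The split into two single-column DataFrames might look like the sketch below. The four rows are made up. Only the variable names `X_axis` and `Y_axis` come from the article.

```python
import pandas as pd

# Made-up rows standing in for the 700-row 'data' table.
data = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1.2, 1.9, 3.1, 3.9]})

X_axis = pd.DataFrame(data["x"])   # features, shape (n_rows, 1)
Y_axis = pd.DataFrame(data["y"])   # solutions, shape (n_rows, 1)

print(X_axis.shape, Y_axis.shape)
```

With the real dataset, both shapes would be (700, 1).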
(POINT TO NOTE: Usually we split the dataset into two parts, training data and test data, typically in a 70:30 or 75:25 ratio. The test data is used after training to check how the model behaves on inputs it has not seen, that is, to estimate its accuracy. In this code I am not splitting the dataset into training and test data. Instead, I feed new inputs directly in the code without checking them against known answers. Note that this is not good practice; this program is only meant for understanding linear regression with Scikit-learn. I will upload future articles where we split the data before finding the solution.)
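For reference, a 75:25 split can be sketched with scikit-learn's `train_test_split`. This is not what the article's code does. The 20 toy rows and the `random_state` value are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 20 toy rows lying on y = 2x + 1 (not the article's dataset).
data = pd.DataFrame({"x": range(20), "y": [2 * v + 1 for v in range(20)]})
X_axis = data[["x"]]
Y_axis = data[["y"]]

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_axis, Y_axis, test_size=0.25, random_state=0
)

print(len(X_train), len(X_test))
```

The model would then be fitted on `X_train` and `Y_train` only, and scored on `X_test` and `Y_test`.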
Now let's build the regression model:
LinearRegression() fits a linear model with coefficients (weights or parameters) w = (w1, …, wn) to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. The error must be minimized to get the best fit to the data. We fit the linear model using fit(), which finds the coefficients with the minimum loss, and store the fitted model in the variable 'model'.
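A minimal sketch of the fitting step, assuming toy data that lies exactly on y = 2x + 1 (so the result is easy to check by eye). It is not the article's dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1 (not the article's dataset).
X_axis = pd.DataFrame({"x": [0, 1, 2, 3, 4]})
Y_axis = pd.DataFrame({"y": [1, 3, 5, 7, 9]})

model = LinearRegression().fit(X_axis, Y_axis)  # least-squares fit
predicted = model.predict(X_axis)               # predictions for the training inputs

print(predicted)  # one row per input, because Y_axis is 2-D
```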
Let's see what the predicted values look like:
model.coef_ gives the coefficient (parameter) of the hypothesis that best fits the data.
model.intercept_ gives the point where the hypothesis intercepts the Y-axis.
Our hypothesis equation now looks like this:
Y = (1.00077619)X - 0.11596738
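Reading the slope and intercept off a fitted model can be sketched as below. The toy data lies exactly on y = 2x + 1, so the printed values will differ from the article's 1.00077619 and -0.11596738.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1 (not the article's dataset).
X_axis = pd.DataFrame({"x": [0, 1, 2, 3, 4]})
Y_axis = pd.DataFrame({"y": [1, 3, 5, 7, 9]})
model = LinearRegression().fit(X_axis, Y_axis)

slope = model.coef_[0][0]        # coef_ has shape (1, 1) because Y_axis is 2-D
intercept = model.intercept_[0]  # intercept_ has shape (1,)

print(f"Y = ({slope:.4f})X + ({intercept:.4f})")
```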
Now let's evaluate the model.
model.score() returns the coefficient of determination R² of the prediction.
It takes a feature matrix X_axis and the expected target values Y_axis. The predictions for X_axis are compared with Y_axis, and the estimator returns either the accuracy (for classifiers) or the R² score (for regressors). The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). Here we got the score as,
which is pretty good, so we can consider this a good model.
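To make the scoring step concrete, the sketch below computes R² in two ways on made-up, slightly noisy data: with `model.score` and with `sklearn.metrics.r2_score` applied to the model's own predictions. Both should give the same number.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up, slightly noisy points near y = 2x + 1 (not the article's dataset).
X_axis = pd.DataFrame({"x": [0, 1, 2, 3, 4]})
Y_axis = pd.DataFrame({"y": [1.1, 2.9, 5.2, 6.8, 9.0]})
model = LinearRegression().fit(X_axis, Y_axis)

score = model.score(X_axis, Y_axis)                   # R² on the given data
same_score = r2_score(Y_axis, model.predict(X_axis))  # identical computation

print(score, same_score)
```

Keep in mind that a score computed on the training data itself tends to be optimistic. A held-out test set gives a fairer estimate.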
Now let's predict the solution for a new value, X_axis_new = [[24]].
We got the solution 23.9026612.
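A single-value prediction can be sketched like this, again on toy data lying on y = 2x + 1, so the result here is 49 rather than the article's 23.9026612. The new value is wrapped in a DataFrame with the column name 'x'. A plain `[[24]]` also works, but recent scikit-learn versions warn when a model fitted on named columns receives unnamed input.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1 (not the article's dataset).
X_axis = pd.DataFrame({"x": [0, 1, 2, 3, 4]})
Y_axis = pd.DataFrame({"y": [1, 3, 5, 7, 9]})
model = LinearRegression().fit(X_axis, Y_axis)

# One new sample, as a 1x1 table with the same column name as the training data.
X_axis_new = pd.DataFrame({"x": [24]})
prediction = model.predict(X_axis_new)

print(prediction)  # a 1x1 array holding 2 * 24 + 1
```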
Now let's predict solutions for a group of new values, a = [6, 78, 91].
The DataFrame of 'a' looks like the table above. The column name is automatically set to 0, because no column name was given. Change it to a string; here the column is renamed to 'x' so that it matches the training data:
Now we can predict the solutions using model.predict, which gives us
[[5.88868977], [77.94457549], [90.95466597]]
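The rename-then-predict step might look like the sketch below. It uses toy data on y = 2x + 1, so the outputs are 13, 157 and 183 rather than the article's values.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1 (not the article's dataset).
X_axis = pd.DataFrame({"x": [0, 1, 2, 3, 4]})
Y_axis = pd.DataFrame({"y": [1, 3, 5, 7, 9]})
model = LinearRegression().fit(X_axis, Y_axis)

a = pd.DataFrame([6, 78, 91])    # the single column is named 0 by default
a = a.rename(columns={0: "x"})   # match the training column name

predictions = model.predict(a)
print(predictions)               # one row per value in 'a'
```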
Now let's visualize the results for better understanding:
The hypothesis Y = (1.00077619)X - 0.11596738 is drawn as a red line over the dataset. This is the line that best fits the data.
Plotting the regression line.
Plotting the predicted value for X_axis_new = [[24]]
The new solution is shown as a yellow dot in the graph below.
Plotting the predicted values for sample a = [6, 78, 91]
The new solutions are shown as green dots in the graph below.
OK, now let's construct a line connecting the solutions of sample a = [6, 78, 91].
With all the predictions we have made, the overall figure looks like the one below.
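The combined figure can be sketched roughly as follows. The colours follow the description above (red line, yellow dot, green dots). The training points are made up and the dashed style of the connecting line is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up points that roughly follow y = x (a stand-in for the real dataset).
X_axis = pd.DataFrame({"x": [5, 20, 35, 50, 65, 80, 95]})
Y_axis = pd.DataFrame({"y": [4.1, 21.3, 33.8, 51.2, 64.0, 81.5, 94.2]})
model = LinearRegression().fit(X_axis, Y_axis)

X_axis_new = pd.DataFrame({"x": [24]})
a = pd.DataFrame({"x": [6, 78, 91]})

fig, ax = plt.subplots()
ax.scatter(X_axis["x"], Y_axis["y"], s=10, label="training data")
# Regression line: the model's predictions over the training inputs.
ax.plot(X_axis["x"], model.predict(X_axis).ravel(), color="red",
        label="regression line")
# Prediction for the single new value.
ax.scatter(X_axis_new["x"], model.predict(X_axis_new).ravel(),
           color="yellow", zorder=3, label="x = 24")
# Predictions for sample a, plus a line connecting them.
a_pred = model.predict(a).ravel()
ax.scatter(a["x"], a_pred, color="green", zorder=3, label="sample a")
ax.plot(a["x"], a_pred, color="green", linestyle="--")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
# plt.show()  # uncomment to open the plot window when running interactively
```

All predicted points fall on the red line, since they are computed from the same fitted equation.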
Your linear regression model is ready. Let me be clear: this model has a lot of limitations, but for a beginner, for the sake of understanding, it is enough. There are many other algorithms that can build better models than this one (that doesn't mean this one is bad). By working through it we can learn the limitations of this model and build a better one next time. Understanding the basic concepts makes the journey smoother.
I will be uploading more articles on Scikit-learn in the future. I hope you liked this article.
You can download the code from my GitHub repository here.