Everything I did to create my first machine learning model.
There are tons of amazing videos, blogs, and courses online for learning machine learning concepts. You can even find guides on how to learn data science in one month, 3 months, 6 months, and so on. Those are excellent if you follow them as suggested. I was confused about which one to choose, and then I found my own way of learning: take a real use case and build the model from scratch, learning each step along the way. This is working for me so far.
There are a few steps to carry out before and after building a model. We will understand each step while working on the problem.
- Data Exploration.
- Data Cleaning.
- Model building.
- Model evaluation.
- Prediction.
Thanks to Python, we can do all these steps easily. We will be using the pandas, NumPy, and Matplotlib libraries. Let us take a simple dataset to build our project. The problem statement is to predict a person's happiness based on the given dataset.
We need to read and understand each column in the dataset and find any patterns or correlations between them.
Input:
import pandas as pd
# Store the dataset in the DataFrame 'income_data'
income_data = pd.read_csv("Data/income_data.csv")
data = income_data.copy()  # work on a copy so the original stays untouched
print(data)  # display the data
Output:
     Unnamed: 0    income  happiness
0             1  3.862647   2.314489
1             2  4.979381   3.433490
2             3  4.923957   4.599373
3             4  3.214372   2.791114
4             5  7.196409   5.596398
..          ...       ...        ...
493         494  5.249209   4.568705
494         495  3.471799   2.535002
495         496  6.087610   4.397451
496         497  3.440847   2.070664
497         498  4.530545   3.710193

[498 rows x 3 columns]
So we have 498 records and 3 columns in the dataset. We have two variables — income and happiness. We will check the basic statistics of this dataset.
data.describe()
Output:
       Unnamed: 0      income   happiness
count  498.000000  498.000000  498.000000
mean   249.500000    4.466902    3.392859
std    143.904482    1.737527    1.432813
min      1.000000    1.506275    0.266044
25%    125.250000    3.006256    2.265864
50%    249.500000    4.423710    3.472536
75%    373.750000    5.991913    4.502621
max    498.000000    7.481521    6.863388
Please ignore the statistics for the first column; we will remove it in the cleanup section. For income and happiness, we can see the count, mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median).
Let us find the correlation between the two variables.
data.corr()
            Unnamed: 0    income  happiness
Unnamed: 0    1.000000  0.024831   0.029269
income        0.024831  1.000000   0.865634
happiness     0.029269  0.865634   1.000000
The correlation coefficient between income and happiness is about 0.87, which indicates a strong positive linear relationship. Let us visualize this relationship.
import matplotlib.pyplot as plt
data.plot(x='income', y='happiness', style='o')
plt.title('Income vs Happiness')
plt.xlabel('Income')
plt.ylabel('Happiness')
plt.show()
The plot shows that happiness increases roughly linearly with income.
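To make the correlation value less of a black box, here is a minimal sketch of what the Pearson coefficient measures, written in plain Python. It uses only the first five rows printed earlier, so its result will not match the 0.865634 computed on all 498 rows. The name `pearson_correlation` is an illustrative helper, not a pandas function.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable spreads around its own mean.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

# First five rows of the dataset shown earlier (not the full 498 rows).
income = [3.862647, 4.979381, 4.923957, 3.214372, 7.196409]
happiness = [2.314489, 3.433490, 4.599373, 2.791114, 5.596398]

r = pearson_correlation(income, happiness)
print(round(r, 3))
```

A value close to 1 means the points lie close to a rising straight line, close to -1 a falling one, and close to 0 means no linear relationship.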
Before building our model, we need to make sure the variables are numeric, remove non-essential fields, and check that no null values are present.
data.isnull().sum()  # count the null values in each column
Output:
Unnamed: 0    0
income        0
happiness     0
dtype: int64
There are no null values, and the variables are already numeric.
However, the first column is not needed for our model, so we will drop it.
data.drop(data.columns[[0]], axis = 1, inplace = True) #Remove based on first index
data.head() #Display first five rows
Output:
income happiness
0 3.862647 2.314489
1 4.979381 3.433490
2 4.923957 4.599373
3 3.214372 2.791114
4 7.196409 5.596398
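The same cleanup can also be written by column name, which may be easier to read than a positional index. The sketch below is only an illustration: `sample` is a made-up three-row miniature of the dataset, and the missing income value was added on purpose to show `dropna()` at work (the real dataset has no nulls).

```python
import pandas as pd

# Hypothetical miniature of the dataset, with one missing value added on purpose.
sample = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "income": [3.862647, 4.979381, None],
    "happiness": [2.314489, 3.433490, 4.599373],
})

cleaned = (
    sample
    .drop(columns=["Unnamed: 0"])  # remove the leftover index column by name
    .dropna()                      # discard rows that contain a missing value
    .reset_index(drop=True)        # renumber the remaining rows from 0
)
print(cleaned)
```

Chaining the steps returns a new DataFrame and leaves `sample` unchanged, which is an alternative to `inplace=True`.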
We have analyzed and cleaned our dataset. Since our target is continuous and roughly linear in the input, we can use linear regression to build the model.
What is Linear regression?
Imagine we have two variables as below.
Here we can easily say that the price is directly proportional to the number of apples, and we could write simple code to find the price for any other quantity. But real data will rarely be this straightforward, so we cannot find the slope just by looking at it.
How did we find the pattern in the above example? By reading and processing the data. Similarly, we will feed the data to the machine and teach it to find the slope.
We can go to Canada from India in many ways, but we need to use the optimum route to make our journey more comfortable.
Similarly, the values are scattered, and we could draw a line through them anywhere. But we need the optimum line: the one for which the sum of the squared distances between the line and the target values is minimum. This method is called least squares regression.
Don't worry! Python already has a linear regression model, in the scikit-learn library, that does this for us.
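For a single input variable, the optimum line can be computed directly with the standard closed-form least squares formulas. The sketch below is plain Python run on only the first five rows printed earlier, so its slope and intercept are not the ones the full model will learn. `fit_line` and `sum_squared_errors` are illustrative helper names.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The least squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the line and the targets."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# First five rows of the dataset shown earlier (not the full 498 rows).
income = [3.862647, 4.979381, 4.923957, 3.214372, 7.196409]
happiness = [2.314489, 3.433490, 4.599373, 2.791114, 5.596398]

slope, intercept = fit_line(income, happiness)
best = sum_squared_errors(income, happiness, slope, intercept)
nudged = sum_squared_errors(income, happiness, slope + 0.1, intercept)

print("slope:", slope, "intercept:", intercept)
print("fitted line beats a nudged line:", best < nudged)
```

Moving the slope away from the fitted value makes the sum of squared errors larger, which is what "optimum" means here.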
# Data preparation
input_value = data.iloc[:, :-1].values  # every column except the last (income)
target_value = data.iloc[:, 1].values  # the second column (happiness)
# Splitting the data for test and train
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(input_value, target_value,
                                                    test_size=0.2, random_state=0)
# Building the model on the training data only
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Computing the regression line
line = regressor.coef_ * input_value + regressor.intercept_
# Plotting the data and the regression line
plt.scatter(input_value, target_value)
plt.plot(input_value, line, 'red')
plt.show()
This works the same way we were assessed in school. Suppose there are 10 problems in a chapter, and the teacher knows the answers to all 10. They teach us 8 problems (here this is called training the model) and give us the other 2 as an assignment to test whether we understood. Then they evaluate the assignment by comparing our answers with the actual answers.
Similarly, we split the data with the train_test_split function, trained the model on the training data, and will now evaluate it on the test data by comparing its predictions with the actual results. This gives us the model's performance.
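The 8-and-2 idea can be sketched in a few lines of plain Python. This is only an illustration of shuffling and holding out a fraction of the data; `split_train_test` is a hypothetical helper and not how scikit-learn implements `train_test_split`.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out the first test_fraction as test data."""
    shuffled = list(rows)
    # A fixed seed makes the shuffle, and so the split, repeatable.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

problems = list(range(1, 11))  # the 10 problems in the chapter
train, test = split_train_test(problems)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters because data is often stored in some order, and the test portion should be a fair sample of the whole.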
We will use the Mean Absolute Error (MAE), the average absolute difference between the predicted and actual values, to evaluate the model.
from sklearn import metrics
pred_cv = regressor.predict(X_test)  # predictions for the held-out test data
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred_cv))
Output :
Mean Absolute Error: 0.6174050608886752
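The metric itself is simple enough to compute by hand. As a minimal sketch, the plain-Python function below averages the absolute differences, using only the first five Actual and Predicted pairs from the comparison table in this post, so its result will differ from the value reported for all 100 test rows. The function name mirrors the scikit-learn one but is a separate, illustrative implementation.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# First five Actual / Predicted pairs from the comparison table (not all 100 rows).
actual = [1.775933, 1.877147, 2.465761, 1.560355, 0.898733]
predicted = [3.033184, 2.045445, 1.530116, 2.281021, 1.840929]

mae = mean_absolute_error(actual, predicted)
print("MAE on these five rows:", mae)
```

Because the errors are not squared, MAE is in the same units as happiness, which makes it easy to read: it is how far off a prediction is on average.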
We can now compare the model's predicted values with the actual values on the test data.
predict_data = pd.DataFrame({'Actual': y_test, 'Predicted': pred_cv})
predict_data
Output:
      Actual  Predicted
0   1.775933   3.033184
1   1.877147   2.045445
2   2.465761   1.530116
3   1.560355   2.281021
4   0.898733   1.840929
..       ...        ...
95  3.615471   3.718798
96  4.802092   4.503831
97  4.328417   4.414682
98  5.498147   5.176406
99  1.095999   2.381832
Now that the model is built, we can provide our own value as input to predict the target value.
income_input = [[1.775933]]
happiness_pred = regressor.predict(income_input)
print("Income :", income_input)
print("Happiness :", happiness_pred)
Output:
Income : [[1.775933]]
Happiness : [1.41746106]
GitHub: https://github.com/Karthik1693/Self_Learning_Project/tree/master/Income%20Happiness