Statistical linear regression approach and normal distribution curve
Multiple linear regression is a wonderful algorithm for predicting continuous values, and house price is a continuous value. The data used in this article is downloaded from Kaggle. The challenge is to predict the price of a house based on the independent features given in the data set.
The features given in the data set are shown below:
id 21613 non-null int64 - Discrete type
date 21613 non-null object - Continuous type
price 21613 non-null float64 - Continuous type
bedrooms 21613 non-null int64 - Discrete type
bathrooms 21613 non-null float64 - Discrete type
sqft_living 21613 non-null int64 - Continuous type
sqft_lot 21613 non-null int64 - Continuous type
floors 21613 non-null float64 - Discrete type
waterfront 21613 non-null int64 - Discrete type
view 21613 non-null int64 - Categorical type - nominal
condition 21613 non-null int64 - Categorical type - ordinal
grade 21613 non-null int64 - Categorical type - ordinal
sqft_above 21611 non-null float64 - Continuous type
sqft_basement 21613 non-null int64 - Continuous type
yr_built 21613 non-null int64 - Discrete type
yr_renovated 21613 non-null int64 - Discrete type
zipcode 21613 non-null int64 - Discrete type
lat 21613 non-null float64 - Continuous type
long 21613 non-null float64 - Continuous type
sqft_living15 21613 non-null int64 - Continuous type
sqft_lot15 21613 non-null int64 - Continuous type
From the column names, the target variable is the price column and the rest are the independent variables.
First, we need to import the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Importing the CSV data with pandas.
house_df = pd.read_csv("housedata.csv")
house_df.head(5)
To know the shape of the data set, i.e. the number of rows and columns:
print(house_df.shape)

#output:
(21613, 21)
To know the names of the columns.
print(house_df.columns)
To know the info about the data in terms of type and null values.
print(house_df.info())
To check whether there are any missing values in the columns, we use the isna() function and take the sum of missing values for each column.
#finding missing values
house_df.isna().sum()
The above code shows that there are two missing values in the “sqft_above” column. To know the statistical information of the data frame, we use the describe() function, which gives the count, mean, quartile values, and so on.
house_df.describe().T
To find outliers in the feature columns, we use box plot visualization. For this, I create a for loop over the columns of type int64 and float64. We can also infer outliers from the describe() output: if there is a large gap between the 75% quartile and the maximum value, or between the minimum and the 25% quartile, the column likely contains outliers.
#outliers with boxplot
for column in house_df:
    if house_df[column].dtype in ['int64', 'float64']:
        plt.figure()
        house_df.boxplot(column=[column])
The loop produces a box plot for every numeric column feature. We also need to check whether there are any outliers in the target column.
house_df.boxplot(column = ['price'])
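As a numeric complement to the box plots, here is a minimal sketch (not part of the original pipeline) that counts the outliers in each numeric column using the same 1.5 * IQR rule that the box plot whiskers are based on:

#count outliers per numeric column with the 1.5*IQR rule
numeric_cols = house_df.select_dtypes(include=['int64', 'float64']).columns
for column in numeric_cols:
    q1 = house_df[column].quantile(0.25)
    q3 = house_df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((house_df[column] < lower) | (house_df[column] > upper)).sum()
    print(column, n_outliers)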
After these checks, we don’t need to take all the features as independent variables, and price will be the target variable. So we choose the features that are most important for house price prediction and build a new data frame for our model.
house_feat_data = house_df[["price","date","bedrooms","bathrooms",
                            "sqft_living","floors","waterfront",
                            "view","condition","grade"]]
In the new data frame, date is an object type. We will split it into year and month, because the day is not important; the year and month give a valuable signal for the prediction. There are also some categorical features in the new data, so we use the dummy method to make separate encoded columns as part of feature engineering.
#using slicing to extract year and month from date
house_feat_data["year"] = house_df["date"].str[0:4]
house_feat_data["month"] = house_df["date"].str[4:6]

#to remove date column
house_feat_data = house_feat_data.drop(columns=["date"])

#Encoding categorical columns
features = ["bedrooms","bathrooms","floors","waterfront","view",
"condition","grade", "year","month"]
house_en = pd.get_dummies(house_feat_data, columns = features)
print(house_en.columns)
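To make the effect of get_dummies() concrete, here is a small self-contained toy example (the values are made up for illustration); each distinct category value becomes its own indicator column:

import pandas as pd

#hypothetical toy frame: 'condition' is an ordinal category stored as integers
toy = pd.DataFrame({"price": [300000, 450000], "condition": [3, 5]})
#produces the columns price, condition_3 and condition_5
print(pd.get_dummies(toy, columns=["condition"]))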
Now we will import the regression model and split the data set into train and test sets. The test_size argument controls the ratio of the split: 0.2 means 20% of the data goes to the test set and 80% to the training set.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

train_house, test_house = train_test_split(house_en, test_size=0.2)
print(train_house.shape, test_house.shape)

#output:
(17290, 89) (4323, 89)
Split the data into independent and dependent variables.
house_features = house_en.columns.drop("price")
target = ["price"]
Now we fit the model on the training data; at this stage, our model is ready. The score is the R² score of the model.
model = LinearRegression()
model.fit(train_house[house_features], train_house[target])
model.score(train_house[house_features], train_house[target])

#output:
0.6191940799329386
This score tells how well the regression line fits the data. The value is not good enough, but this is an introductory example; it can be improved by handling the missing values and outliers.
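As a rough sketch of what that clean-up could look like (illustrative only, not applied in this article), one common option is median imputation for the two missing sqft_above values and clipping extreme target values at the IQR fences:

#fill the two missing sqft_above values with the column median
house_df['sqft_above'] = house_df['sqft_above'].fillna(house_df['sqft_above'].median())

#clip target outliers to the 1.5*IQR fences instead of dropping rows
q1, q3 = house_df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
house_df['price'] = house_df['price'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)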
We will now find the root mean squared error (RMSE) for both the train and test predictions.
from sklearn.metrics import mean_squared_error

train_predict = model.predict(train_house[house_features])
mean_squared_error(train_house[target], train_predict)**0.5
#output:
0.32567881525600667

test_predict = model.predict(test_house[house_features])
mean_squared_error(test_house[target], test_predict)**0.5
#output:
0.32780719274631726
These values are the square root of the average squared residuals between the actual and predicted points. Mean squared error and the R² value move in opposite directions: the lower the error, the higher the R².
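To see the relationship concretely: R² = 1 − SS_res/SS_tot and MSE = SS_res/n, so for a fixed target variance a lower MSE implies a higher R². A minimal check on the test predictions, reusing the variables from the code above:

#R² = 1 - SS_res/SS_tot, MSE = SS_res/n
y_true = test_house[target].values.ravel()
y_pred = test_predict.ravel()
mse = np.mean((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - mse * len(y_true) / ss_tot
print("RMSE:", mse ** 0.5, "R2:", r2)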
We can also check the skewness of the target variable to see whether it is right- or left-skewed. If it is skewed, we can transform it toward a normal distribution characterized by its mean and standard deviation.
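Before plotting, a quick numeric check is possible with pandas’ built-in skew() method (a positive value indicates a right skew):

#skewness > 0 means right-skewed, < 0 means left-skewed
print(house_df['price'].skew())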
import seaborn as sns
from scipy import stats

plt.subplots(figsize=(12,9))
sns.distplot(house_df['price'], fit=stats.norm)

# fit the mean and standard deviation
(mu, sigma) = stats.norm.fit(house_df['price'])

# plot with the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)],
           loc='best')
plt.ylabel('Frequency')

#Probability plot
fig = plt.figure()
stats.probplot(house_df['price'], plot=plt)
plt.show()
From the above distribution, we observe that the target column is right-skewed, and we can also conclude that the outliers lie on the right side. To bring the target variable to a normal distribution, we apply a log function, so that after standardizing the values in the target column the distribution approaches a standard normal distribution.
#The target variable is right-skewed. Now we need to transform it to make its distribution normal.

#using log function to normalize
house_df['price'] = np.log1p(house_df['price'])

#Check again for a more normal distribution
plt.subplots(figsize=(12,9))
sns.distplot(house_df['price'], fit=stats.norm)

# fit the mean and standard deviation
(mu, sigma) = stats.norm.fit(house_df['price'])

# plot with the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)],
           loc='best')
plt.ylabel('Frequency')

#Probability plot
fig = plt.figure()
stats.probplot(house_df['price'], plot=plt)
plt.show()
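One caveat worth noting: once price has been transformed with np.log1p, a model trained on it predicts log prices, so predictions have to be mapped back to the original scale with the inverse transform np.expm1. A minimal sketch, assuming the model is refitted on the transformed target:

#log1p and expm1 are inverses: expm1(log1p(x)) == x
log_price_pred = model.predict(test_house[house_features])  #predictions in log space
price_pred = np.expm1(log_price_pred)  #back to the original dollar scale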
Conclusion:
This article showed you the working behavior of the algorithm: the prediction is done with a linear regression model. Still, some cleaning needs to be done to make the model better.