Statistical linear regression approach and normal distribution curve
Multiple linear regression is a wonderful algorithm for predicting continuous values, and house price is a continuous value. The data used in this article is downloaded from Kaggle. The challenge is to predict the price of a house based on the independent features given in the data set.
The features given in the data set are shown below:
id 21613 non-null int64 - Discrete type
date 21613 non-null object - Continuous type
price 21613 non-null float64 - Continuous type
bedrooms 21613 non-null int64 - Discrete type
bathrooms 21613 non-null float64 - Discrete type
sqft_living 21613 non-null int64 - Continuous type
sqft_lot 21613 non-null int64 - Continuous type
floors 21613 non-null float64 - Discrete type
waterfront 21613 non-null int64 - Discrete type
view 21613 non-null int64 - Categorical type - nominal
condition 21613 non-null int64 - Categorical type - ordinal
grade 21613 non-null int64 - Categorical type - ordinal
sqft_above 21611 non-null float64 - Continuous type
sqft_basement 21613 non-null int64 - Continuous type
yr_built 21613 non-null int64 - Discrete type
yr_renovated 21613 non-null int64 - Discrete type
zipcode 21613 non-null int64 - Discrete type
lat 21613 non-null float64 - Continuous type
long 21613 non-null float64 - Continuous type
sqft_living15 21613 non-null int64 - Continuous type
sqft_lot15 21613 non-null int64 - Continuous type
From the column names, the target variable is the price column and the rest are the independent variables.
First, we need to import the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Importing the CSV data with pandas.
house_df = pd.read_csv("housedata.csv")
house_df.head(5)
To know the shape of the data set, i.e. the number of rows and columns:
print(house_df.shape)

#output:
(21613, 21)
To know the names of the columns.
print(house_df.columns)
To know the info about the data in terms of type and null values.
print(house_df.info())
To check whether there are any missing values in the columns, we use the isna() function and take the sum of missing values for each column.
#finding missing values
house_df.isna().sum()
The above code shows that there are two missing values in the “sqft_above” column. To know the statistical information of the data frame, we use the describe() function, which gives the count, mean, quartile values, and so on.
house_df.describe().T
To find outliers in the feature columns, we use box plot visualization. For this, I create a for loop over the columns of type int64 and float64. We can also infer outliers from the describe() output: if there is a large gap between the 75% quartile and the maximum value, or between the minimum and the 25% quartile, the column likely contains outliers.
#outliers with boxplot
for column in house_df:
    if house_df[column].dtype in ['int64', 'float64']:
        plt.figure()
        house_df.boxplot(column=[column])
The loop produces a box plot for every numeric column feature. We also need to check whether there are any outliers in the target column.
house_df.boxplot(column = ['price'])
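As a numeric complement to the box plots, here is a minimal sketch (not part of the original pipeline) that counts the outliers in each numeric column using the same 1.5 * IQR rule that the box plot whiskers are based on:

#count outliers per numeric column with the 1.5*IQR rule
numeric_cols = house_df.select_dtypes(include=['int64', 'float64']).columns
for column in numeric_cols:
    q1 = house_df[column].quantile(0.25)
    q3 = house_df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((house_df[column] < lower) | (house_df[column] > upper)).sum()
    print(column, n_outliers)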
After these checks, we don’t need to take all the features as independent variables, and price will be the target variable. So we choose the features that are most important for house price prediction and build a new data frame for our model.
house_feat_data = house_df[["price","date","bedrooms","bathrooms",
                            "sqft_living","floors","waterfront",
                            "view","condition","grade"]]
In the new data frame, date is an object type. We will split it into year and month, because the day is not important; the year and month give a valuable signal for the prediction. There are also some categorical features in the new data, so we use the dummy method to make separate encoded columns as part of feature engineering.
#using slicing to extract year and month from date
house_feat_data["year"] = house_df["date"].str[0:4]
house_feat_data["month"] = house_df["date"].str[4:6]

#to remove date column
house_feat_data = house_feat_data.drop(columns=["date"])

#Encoding categorical columns
features = ["bedrooms","bathrooms","floors","waterfront","view",
"condition","grade", "year","month"]
house_en = pd.get_dummies(house_feat_data, columns = features)
print(house_en.columns)
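To make the effect of get_dummies() concrete, here is a small self-contained toy example (the values are made up for illustration); each distinct category value becomes its own indicator column:

import pandas as pd

#hypothetical toy frame: 'condition' is an ordinal category stored as integers
toy = pd.DataFrame({"price": [300000, 450000], "condition": [3, 5]})
#produces the columns price, condition_3 and condition_5
print(pd.get_dummies(toy, columns=["condition"]))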
Now we will import the regression model and split the data set into train and test sets. The test_size argument controls the ratio of the split: 0.2 means 20% of the data goes to the test set and 80% to the training set.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

train_house, test_house = train_test_split(house_en, test_size=0.2)
print(train_house.shape, test_house.shape)

#output:
(17290, 89) (4323, 89)
Split the data into independent and dependent variables.
house_features = house_en.columns.drop("price")
target = ["price"]
Now we fit the model on the training data; at this stage, our model is ready. The score is the R² score of the model.
model = LinearRegression()
model.fit(train_house[house_features], train_house[target])
model.score(train_house[house_features], train_house[target])

#output:
0.6191940799329386
This score tells how well the regression line fits the data. The value is not good enough, but this is an introductory example; it can be improved by handling the missing values and outliers.
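As a rough sketch of what that clean-up could look like (illustrative only, not applied in this article), one common option is median imputation for the two missing sqft_above values and clipping extreme target values at the IQR fences:

#fill the two missing sqft_above values with the column median
house_df['sqft_above'] = house_df['sqft_above'].fillna(house_df['sqft_above'].median())

#clip target outliers to the 1.5*IQR fences instead of dropping rows
q1, q3 = house_df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
house_df['price'] = house_df['price'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)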
We will now find the root mean squared error (RMSE) for both the train and test predictions.
from sklearn.metrics import mean_squared_error

train_predict = model.predict(train_house[house_features])
mean_squared_error(train_house[target], train_predict)**0.5
#output:
0.32567881525600667

test_predict = model.predict(test_house[house_features])
mean_squared_error(test_house[target], test_predict)**0.5
#output:
0.32780719274631726
These values are the square root of the average squared residuals between the actual and predicted points. Mean squared error and the R² value move in opposite directions: the lower the error, the higher the R².
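To see the relationship concretely: R² = 1 − SS_res/SS_tot and MSE = SS_res/n, so for a fixed target variance a lower MSE implies a higher R². A minimal check on the test predictions, reusing the variables from the code above:

#R² = 1 - SS_res/SS_tot, MSE = SS_res/n
y_true = test_house[target].values.ravel()
y_pred = test_predict.ravel()
mse = np.mean((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - mse * len(y_true) / ss_tot
print("RMSE:", mse ** 0.5, "R2:", r2)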
We can also check the skewness of the target variable to see whether it is right- or left-skewed. If it is skewed, we can transform it toward a normal distribution characterized by its mean and standard deviation.
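Before plotting, a quick numeric check is possible with pandas’ built-in skew() method (a positive value indicates a right skew):

#skewness > 0 means right-skewed, < 0 means left-skewed
print(house_df['price'].skew())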
import seaborn as sns
from scipy import stats

plt.subplots(figsize=(12,9))
sns.distplot(house_df['price'], fit=stats.norm)

# fit the mean and standard deviation
(mu, sigma) = stats.norm.fit(house_df['price'])

# plot with the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)],
           loc='best')
plt.ylabel('Frequency')

#Probability plot
fig = plt.figure()
stats.probplot(house_df['price'], plot=plt)
plt.show()
From the above distribution, we observe that the target column is right-skewed, and we can also conclude that the outliers lie on the right side. To bring the target variable to a normal distribution, we apply a log function, so that after standardizing the values in the target column the distribution approaches a standard normal distribution.
#The target variable is right-skewed. Now we need to transform it to make its distribution normal.

#using log function to normalize
house_df['price'] = np.log1p(house_df['price'])

#Check again for a more normal distribution
plt.subplots(figsize=(12,9))
sns.distplot(house_df['price'], fit=stats.norm)

# fit the mean and standard deviation
(mu, sigma) = stats.norm.fit(house_df['price'])

# plot with the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)],
           loc='best')
plt.ylabel('Frequency')

#Probability plot
fig = plt.figure()
stats.probplot(house_df['price'], plot=plt)
plt.show()
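One caveat worth noting: once price has been transformed with np.log1p, a model trained on it predicts log prices, so predictions have to be mapped back to the original scale with the inverse transform np.expm1. A minimal sketch, assuming the model is refitted on the transformed target:

#log1p and expm1 are inverses: expm1(log1p(x)) == x
log_price_pred = model.predict(test_house[house_features])  #predictions in log space
price_pred = np.expm1(log_price_pred)  #back to the original dollar scale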
Conclusion:
This article showed you the working behavior of the algorithm: the prediction is done with a linear regression model. Still, some cleaning needs to be done to make the model better.