A few days ago I found myself searching the internet for winning code from a data science competition. The answer I was given on Quora was that people who win coding competitions are unlikely to share their code because, in addition to being winning code, it will be a tangle of code scattered all over the program.
During my quest for winning code, however, I did come across a Towards Data Science article where the writer used feature selection to produce what he called winning code. The tactic this person used was to discard columns of data that have a low correlation with the target. I have experimented with feature selection in a past post, and the link to that post can be found here:- Foray into Feature Selection: How Accuracy Improved by Selecting Features in Kaggle’s Ames House Price Competition Question | by Tracyrenee | AI In Plain English | Medium
Because I have never discarded columns of data that have a low correlation, I decided to give this technique a try. I therefore decided to use the columns that the writer deemed to have a high correlation and see what level of accuracy I could achieve using this methodology.
I decided to use Kaggle’s House Price competition because I have worked on this dataset in the past and I wanted to see how my past work compared to this idea of discarding columns of data that have a low correlation.
The problem statement for the competition is as follows:-
“Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.”
I created the program in Kaggle’s free Jupyter Notebook, which has Python and several libraries already installed on it.
Once I had created a new notebook, I imported the four main libraries that I would be using in the program, namely numpy, pandas, matplotlib and seaborn:-
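A minimal sketch of those imports, assuming the usual aliases, would look like this:

```python
# Core libraries used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```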
I then imported the csv files stored in Kaggle’s input directory and read them into the notebook I would be working in:-
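As a sketch, assuming Kaggle’s standard input path for this competition, the files would be read along these lines:

```python
# Paths assume Kaggle's input directory for the House Prices competition
path = "../input/house-prices-advanced-regression-techniques/"
train = pd.read_csv(path + "train.csv")
test = pd.read_csv(path + "test.csv")
print(train.shape, test.shape)
```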
When the files had been imported into the notebook and read, I began an exploratory data analysis. I looked at the sale prices of the homes in the train dataset and found the prices ranged from $34,900 to $755,000 with a mean of $180,921:-
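Those figures can be reproduced with a one-line summary of the target column:

```python
# Summary statistics of the target: the min, max and mean quoted above
train["SalePrice"].describe()
```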
I then created a heatmap, which I used to determine which columns of data have a high correlation with the sale price of the homes in Ames, Iowa:-
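A sketch of such a heatmap, built from the numeric columns only, might look like this:

```python
# Correlation matrix of the numeric columns, visualised with seaborn
plt.figure(figsize=(12, 10))
corr = train.select_dtypes(include=np.number).corr()
sns.heatmap(corr, cmap="coolwarm")
plt.show()
```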
I decided to use the features recommended in the article I had previously read to see whether discarding the low-correlation columns would have a large impact on the house price predictions:-
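The exact columns came from that article, so the list below is only an illustrative guess at the usual high-correlation features in the Ames dataset, not necessarily the precise set used in my notebook:

```python
# Hypothetical selection of columns known to correlate strongly with SalePrice;
# the list in the actual notebook may differ
features = ["OverallQual", "GrLivArea", "GarageCars", "GarageArea",
            "TotalBsmtSF", "1stFlrSF", "FullBath", "TotRmsAbvGrd",
            "YearBuilt", "YearRemodAdd"]
```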
I then appended the test dataset to the train dataset and dropped the “SalePrice” column, thus creating one dataset that encompassed both train and test, being X_tot:-
I created the test_Id variable from test.Id, which will be used at the end of the notebook to prepare the submission.
I created the y variable, which is train.SalePrice and is the target.
I revised the X_tot variable so that it contained only the selected features. This reduced dataset is only a fraction of the original and will hopefully be less noisy:-
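Assuming the feature list above, the combined dataset, the test IDs and the target could be prepared roughly as follows:

```python
# Keep the test IDs for the submission file and the target for training
test_Id = test.Id
y = train.SalePrice

# Stack train (minus the target) on top of test, then keep only the
# selected features
X_tot = pd.concat([train.drop("SalePrice", axis=1), test],
                  ignore_index=True)
X_tot = X_tot[features]
print(X_tot.shape)
```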
I then checked for null values and found there were some that needed to be imputed:-
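A quick check along these lines shows which of the selected columns still contain nulls:

```python
# Count missing values per selected column
X_tot.isnull().sum()
```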
Because all of the columns that needed to be imputed are numeric, I decided to use sklearn’s IterativeImputer() function, an experimental imputer that estimates each missing value from the other values in the dataset before filling in the nulls:-
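Because IterativeImputer is still experimental, it has to be enabled explicitly before it can be imported; a sketch of the imputation step:

```python
# IterativeImputer must be enabled explicitly because it is experimental
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)
X_tot = pd.DataFrame(imputer.fit_transform(X_tot), columns=X_tot.columns)
```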
Once all of the null values had been imputed, I normalised the data by converting all of the cells to a value between zero and one:-
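One way to scale every column into the zero-to-one range is sklearn’s MinMaxScaler, as sketched below:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each column so its values fall between zero and one
scaler = MinMaxScaler()
X_tot = pd.DataFrame(scaler.fit_transform(X_tot), columns=X_tot.columns)
```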
Once the data had been normalised, I split X_tot into two datasets, one the length of the train dataset and another the length of the test dataset:-
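Since the train rows sit above the test rows in X_tot, the split is a simple slice on the length of the train dataset:

```python
# The first len(train) rows belong to the train set, the rest to the test set
X_Train = X_tot.iloc[:len(train)]
X_Test = X_tot.iloc[len(train):]
print(X_Train.shape, X_Test.shape)
```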
I then split X_Train and y up into datasets for training and validation using sklearn’s train_test_split() function. In this instance I decided to make the validation set only 5% of the X_Train dataset:-
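A sketch of that split, with the random_state being my own assumption:

```python
from sklearn.model_selection import train_test_split

# Hold back 5% of the training rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_Train, y, test_size=0.05, random_state=0)
```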
I then selected the model I would be using to make the prediction. I experimented with several models before deciding upon sklearn’s BaggingRegressor(). I achieved a 97.65% accuracy using this model:-
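A sketch of the modelling step with default hyperparameters (the random_state is my assumption); note that score() on a regressor returns R², which is the figure quoted as accuracy:

```python
from sklearn.ensemble import BaggingRegressor

# Fit the bagging ensemble and report its R² on the training data
model = BaggingRegressor(random_state=0)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
```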
I then used the model to predict on the validation set and achieved a 93.44% accuracy:-
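The validation check follows the same pattern:

```python
# Predict on the held-out rows and report R² on the validation data
val_preds = model.predict(X_val)
print(model.score(X_val, y_val))
```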
The graph below shows a visual display of the predictions of the model against the validation set:-
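One way to produce such a comparison, reusing the validation predictions from above, is a simple scatter of predicted against actual prices:

```python
# Predicted vs actual sale prices on the validation set
plt.figure(figsize=(8, 6))
plt.scatter(y_val, val_preds, alpha=0.6)
plt.xlabel("Actual SalePrice")
plt.ylabel("Predicted SalePrice")
plt.title("Validation predictions vs actual values")
plt.show()
```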
When I was satisfied with the predictions that had been made on the validation set, I fed the test data into the model and made predictions on it:-
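A sketch of the final prediction and of assembling the submission file in the format Kaggle expects:

```python
# Predict sale prices for the test rows and build the submission frame
test_preds = model.predict(X_Test)
submission = pd.DataFrame({"Id": test_Id, "SalePrice": test_preds})
submission.to_csv("submission.csv", index=False)
```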
Once the program had been fully executed, I saved it and submitted the submission to Kaggle. I achieved 0.16063, which was not my best score but not bad either:-
In conclusion, the methodology recommended to me did not give me a winning score, but it certainly was not bad. I did, however, have to experiment with the estimators I used because I achieved varying results depending on which model I chose.
The code for this post can be found in its entirety in my personal Kaggle account, being here:- House Prices — Reduce Columns — Bagging | Kaggle