Boosting Performance by Generating Features from External Data with Python
In this article, we’ll create a forecasting model to predict housing prices in Seattle. We will first make a model using the properties’ attributes such as sqft, rooms, bedrooms, bathrooms, view, etc.
Then we’ll significantly improve that model by generating features from external data, such as proximity to cultural spaces, parks, public art spots, golf courses, swimming beaches, and picnic tables, measuring the improvement from each added feature.
- Step 1: Explore the Seattle Housing Prices Data
- Step 2: Create a Price Prediction Model
- Step 3: Add Features from External Data
- Step 4: Compare and Analyze Results
To make the model, we’ll use a ‘Seattle Housing Prices’ dataset taken from this House Prediction Project. It contains 21,613 records of sale prices from 2015, along with several attributes of the houses.
It has many columns; one of them, ‘price’, will be our target variable for prediction.
Let’s take a look at the correlation heatmap of price vs. the rest.
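A heatmap like this can be built with pandas and seaborn. The sketch below uses a small synthetic stand-in for the dataset (the real one would be loaded from a CSV); the column names and values here are illustrative, not the actual data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; in practice you would load it,
# e.g. df = pd.read_csv('seattle_housing.csv')
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'sqft_living': rng.uniform(500, 5000, n),
    'bedrooms': rng.integers(1, 6, n).astype(float),
    'bathrooms': rng.integers(1, 4, n).astype(float),
})
df['price'] = 200 * df['sqft_living'] + 10000 * df['bedrooms'] + rng.normal(0, 50000, n)

# Correlation of every feature against 'price', strongest first
corr = df.corr()['price'].drop('price').sort_values(ascending=False)
print(corr)

# For the heatmap itself (requires seaborn/matplotlib):
# import seaborn as sns; import matplotlib.pyplot as plt
# sns.heatmap(df.corr(), annot=True, cmap='coolwarm'); plt.show()
```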
It’s important to mention that the source doesn’t provide a description of what each feature is and some of them are a bit unclear. For the sake of this tutorial, I assume ‘sqft_living15’ is some variation of ‘sqft_living’ and so forth. Also, how they quantify ‘condition’, ‘view’ or ‘waterfront’ is a bit of a mystery but it doesn’t bother me that much because none of them are of fundamental relevance.
Let’s create a standard model to get the prediction score, training on 90% of the data and testing on the remaining 10%.
The author of the cited article got the best result with GradientBoostingRegressor, so let’s keep the exact same experiment with the exact same parameters, so we can isolate how much improvement comes from the new features.
The score is 0.79 (from a 0–1 range).
We want to understand how much improvement each feature brings to the table. Let’s do an experiment: start by predicting with one randomly chosen variable, then add one variable at a time and watch how the score climbs to 0.79.
To be clear, the first row is the score the model got by only using ‘yr_renovated’, the second one is the score it got by using both ‘yr_renovated’ and ‘sqft_living’, and so forth.
This is the roadmap of how every new feature improved the model for this particular experiment. Of course, not all features bring value (some of them even subtract it) and some of them are very relevant.
However, a feature won’t necessarily help in the same way if it’s added in a different order, so let’s run this experiment 30 times and aggregate the data to get a more realistic picture of feature importance.
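The repeated experiment can be sketched like this: shuffle the feature order, add features one at a time, and record each feature’s marginal score change. The data is again a synthetic stand-in, and only 5 runs are used here (the article uses 30) to keep the sketch fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: only 'sqft_living' actually drives the price
rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    'sqft_living': rng.uniform(500, 5000, n),
    'bedrooms': rng.integers(1, 6, n).astype(float),
    'yr_renovated': rng.integers(0, 2, n).astype(float),
})
df['price'] = 200 * df['sqft_living'] + rng.normal(0, 30000, n)

features = ['sqft_living', 'bedrooms', 'yr_renovated']
improvements = {f: [] for f in features}

for _ in range(5):  # the article runs this 30 times
    order = rng.permutation(features).tolist()
    prev_score, used = 0.0, []
    for feat in order:
        used.append(feat)
        X_tr, X_te, y_tr, y_te = train_test_split(
            df[used], df['price'], test_size=0.1, random_state=0)
        score = GradientBoostingRegressor().fit(X_tr, y_tr).score(X_te, y_te)
        # Marginal improvement this feature contributed in this ordering
        improvements[feat].append(score - prev_score)
        prev_score = score

avg = {f: float(np.mean(v)) for f, v in improvements.items()}
print(avg)
```

Averaging over many shuffled orderings smooths out the order dependence noted above.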
Let’s take a look at the average improvement by feature after 30 experiments.
It would be interesting to explore why sqft_lot seems to damage the model, but otherwise there’s nothing particularly outstanding in these results.
Now we want to generate useful features by adding relevant data by location.
To start, let’s use this ‘Picnic Tables in Seattle’ dataset I gathered from data.seattle.gov, which has (shockingly) the location of every picnic table in Seattle.
The hypothesis is that proximity to picnic tables can have a meaningful relationship with housing prices (a property is more valuable if there’s a park nearby).
To generate a useful numerical feature and add it to our dataset, we use the Location Blend Algorithm. It works like this: we start from our original data (Seattle housing prices) and, for each observation, add the number of nearby observations from the ‘external’ dataset (picnic tables) within a given radius.
The ‘number of picnic tables within a 1 km radius’ is added as a new column on our Seattle Housing dataset, as seen below:
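To make the idea concrete, here is a minimal sketch of such a radius count using the haversine great-circle distance. This is an illustration of the concept only, not OpenBlender’s implementation; the coordinates and the `count_within_radius` helper are made up for the example.

```python
import numpy as np
import pandas as pd

def count_within_radius(houses, points, radius_km):
    """For each house, count external points (e.g. picnic tables)
    within radius_km, using the haversine great-circle distance."""
    lat1 = np.radians(houses['lat'].to_numpy())[:, None]
    lon1 = np.radians(houses['lon'].to_numpy())[:, None]
    lat2 = np.radians(points['lat'].to_numpy())[None, :]
    lon2 = np.radians(points['lon'].to_numpy())[None, :]
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    dist_km = 2 * 6371.0 * np.arcsin(np.sqrt(a))  # Earth radius ~6371 km
    return (dist_km <= radius_km).sum(axis=1)

# Tiny made-up example: two houses and three picnic tables around Seattle
houses = pd.DataFrame({'lat': [47.6097, 47.6800], 'lon': [-122.3331, -122.3800]})
tables = pd.DataFrame({'lat': [47.6100, 47.6105, 47.7500],
                       'lon': [-122.3340, -122.3300, -122.4000]})
houses['picnic_tables_1km'] = count_within_radius(houses, tables, radius_km=1.0)
print(houses['picnic_tables_1km'].tolist())  # → [2, 0]
```

The first house has two tables within 1 km; the second has none nearby.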
Now, we location-blend the ‘count’ of picnic tables within a 1 km radius as a new feature on our dataframe using the OpenBlender API.
Now our Seattle Housing dataframe has a new numerical feature: the count of picnic tables within a 1 km radius.
Now, let’s add a lot of other features:
And let’s create features for both a 300 meter radius and a 1 km radius.
Now we have a 47-column dataframe of numerical features.
Now let’s run the new data through our model again.
The score is 0.914!
Let’s randomly add variables again and compare the roadmap with that of the earlier model without external data.
The maximum score without external data was 0.79, the new score with the data is 0.91 using the same test set and the same model.
This is an enormous improvement, and we only added a few features; there’s a practically infinite universe of features one could add to improve the score.
Let’s run the experiment 30 times again to gain insight into the relevance of each feature.
Cultural Spaces seems to have played a particularly important role, and many of the new features provided significant improvement.
Here’s the Github link to this repo.