In this post, I discuss how my previous feature engineering and modeling methods failed and how I fixed them. This post is the first part of the “Redoing Projects” series. I try to keep the discussion focused on techniques rather than project details, so you won’t need much context to follow it.
Brief Background
This project aims at predicting the prices of houses sold in Ames, Iowa between 2006 and 2010, using 79 features. The training set had 1460 observations and the test set had 1459 observations.
Mistake I: missing value imputation
My first mistake was replacing missing values with summary statistics before checking why they were missing in the first place. There are three types of missing values:
- Missing at random: whether a value is missing is correlated with other variables, but not with the variable itself. For example, the variable “total area of houses” may be missing only in a few neighborhoods. In this case, you might want to use summary statistics (mean/median/max) within those neighborhoods to fill in the missing areas.
- Missing completely at random (ideal): whether a value is missing has nothing to do with any variable, observed or unobserved. If the number of missing values of this type is small, just drop them.
- Missing not at random (worst): whether a value is missing depends on the variable itself. For example, the variable “areas of parking lots” is never missing when the parking area is small; in other words, all the large parking lots lack records of their areas. One common cause of this type of missingness is selection bias.
How can you check which type your missing values fall into? The answer is data exploration! Admittedly, it’s almost impossible to be 100% sure that values are missing completely at random, but the other two types can be checked with relatively straightforward approaches.
- To check “missing at random,” draw plots and create tables to compare the counts of missing values across groups (e.g., neighborhoods). If a group has a very high number of missing values, or missing values only appear in a few groups, you might want to compute summary statistics within those groups and use them to fill in the missing values.
- To check “missing not at random,” choose a highly correlated variable with few missing values. In the parking-lot example above, we can choose the variable “total area on the ground floor,” because when the total area on the ground floor is larger, the area of the parking lot is very likely to be larger (you can create a plot and compute the correlation to check this). By comparing the ground-floor areas of houses with missing parking-lot areas to those of houses with recorded parking-lot areas, we get a sense of which parking lots lack area records: are they attached to larger houses, i.e., probably larger parking lots? If you do have this type of missing values, you might want to drop the variable if it isn’t important, or use a two-stage approach: first predict the parking-lot areas of houses with large ground floors (using regression or KNN), then use the estimated parking-lot areas as part of the inputs to your model for predicting house prices. Both checks and the two-stage approach are sketched in the code below.
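Here is a minimal sketch of both checks with pandas, assuming the Kaggle training file train.csv and borrowing the Ames column names Neighborhood, GarageArea (“areas of parking lots”) and 1stFlrSF (“total areas on the ground floor”) purely for illustration; whether these exact columns have missing values in your copy of the data isn’t the point, the pattern of the check is:

```python
import pandas as pd

# Assumed file name from the Kaggle House Prices competition; adjust the path if needed
train = pd.read_csv("train.csv")

# Illustrative columns: GarageArea ~ "areas of parking lots",
# 1stFlrSF ~ "total areas on the ground floor"
missing = train["GarageArea"].isna()

# Check "missing at random": count missing values per neighborhood.
# If missing values cluster in a few neighborhoods, group-wise summary
# statistics are a reasonable fill-in.
print(missing.groupby(train["Neighborhood"]).sum().sort_values(ascending=False).head())

# Check "missing not at random": compare the ground-floor areas of houses
# with and without a recorded garage area. A clearly larger distribution in
# the missing group suggests the large garages are the ones without records.
print(train.loc[missing, "1stFlrSF"].describe())
print(train.loc[~missing, "1stFlrSF"].describe())

# If the pattern looks like "missing at random" within neighborhoods,
# fill with the neighborhood median.
train["GarageArea"] = train.groupby("Neighborhood")["GarageArea"].transform(
    lambda s: s.fillna(s.median())
)
```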
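And a minimal sketch of the two-stage idea for the “missing not at random” case, continuing with the train frame from the sketch above; the KNN regressor and the predictor columns are placeholders for whichever first-stage model and correlated features you prefer:

```python
from sklearn.neighbors import KNeighborsRegressor

# Stage 1: predict the missing garage areas from correlated, complete features
predictors = ["1stFlrSF", "GrLivArea"]  # assumed stand-ins for "ground floor area"
observed = train[train["GarageArea"].notna()]
to_fill = train[train["GarageArea"].isna()]

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(observed[predictors], observed["GarageArea"])

if len(to_fill) > 0:
    train.loc[to_fill.index, "GarageArea"] = knn.predict(to_fill[predictors])

# Stage 2: use the completed GarageArea column as an ordinary input in the
# model that predicts house prices.
```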
Mistake II: tree algorithms with one-hot encoder
My second mistake was feeding sparse data to tree algorithms (random forest and decision tree). Once categorical features are one-hot encoded, especially when a feature has many categories, the resulting sparse data confuses tree algorithms and biases them towards zero, i.e., trees are more likely to split a node at zero. To see why, think about a simple and extreme case after one-hot encoding a categorical feature:
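For concreteness, here is a toy version of that case: a single categorical feature with four levels A, B, C and D (the letters are just placeholders), one-hot encoded into four 0/1 indicator columns.

```python
import pandas as pd

# One categorical feature with four levels
df = pd.DataFrame({"category": ["A", "B", "C", "D"]})

# One-hot encoding produces one sparse 0/1 column per level
print(pd.get_dummies(df["category"], dtype=int))
#    A  B  C  D
# 0  1  0  0  0
# 1  0  1  0  0
# 2  0  0  1  0
# 3  0  0  0  1
```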
Without even computing Gini impurity scores, your tree will keep splitting at A = 0, B = 0, C = 0 and D = 0, because once the tree hits a 1 at a node, it doesn’t have to grow any further: all the other indicator variables are 0. This was the case in my previous version of this project. By removing the one-hot encoder and using only label-encoded categorical features, the root mean squared error of the random forest decreased from 0.19 to 0.06. Tree algorithms are not the best choice here, however: a simple linear regression reduced the error to 0.006.
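Below is a minimal sketch of the fix, not the exact code from this project: label-encode the categorical columns with scikit-learn’s OrdinalEncoder and fit a random forest. The train.csv file name, the log-transformed target (which matches the scale of the errors quoted above) and the quick median fill for numeric gaps are assumptions made just to keep the sketch self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder

train = pd.read_csv("train.csv")        # assumed Kaggle file layout
y = np.log1p(train["SalePrice"])        # assumed log target
X = train.drop(columns=["Id", "SalePrice"])

# Integer-encode categorical columns instead of one-hot encoding them
cat_cols = X.select_dtypes(include="object").columns
X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols].astype(str))

# Quick median fill for numeric gaps, only so the sketch runs end to end;
# see Mistake I for how imputation should actually be chosen.
X = X.fillna(X.median())

rf = RandomForestRegressor(n_estimators=300, random_state=0)
rmse = -cross_val_score(rf, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(rmse.mean())
```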
Mistake III: trusting PCA blindly
My third mistake was using PCA without checking its impact on model performance. PCA is a common technique for reducing feature dimensionality and collinearity (collinear features are combined into one principal component). However, it can decrease prediction accuracy when features have non-linear dependence. PCA is flexible in that it lets users choose how much of the original features’ variance is retained by the principal components. The downside of this flexibility is that important features with small variance might get dropped. Think about an extreme case:
You have three predictors x, w and z and want to predict y. Assume w = y and that y has very little variance. If you apply PCA to the features and specify the proportion of variance to be explained, you will probably end up losing w from your model. This is not ideal.
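A minimal sketch of that extreme case with made-up numbers, where w equals y and carries almost no variance while x and z are high-variance noise:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000

y = 0.01 * rng.standard_normal(n)   # target with very little variance
w = y.copy()                        # perfect predictor, same tiny variance
x = 10 * rng.standard_normal(n)     # high-variance noise
z = 10 * rng.standard_normal(n)     # high-variance noise
X = np.column_stack([x, w, z])

# Keep enough components to explain 95% of the variance:
# the two noise directions survive, the w direction does not.
X_pca = PCA(n_components=0.95).fit_transform(X)
print(X_pca.shape)                  # (1000, 2)

print(cross_val_score(LinearRegression(), X, y, cv=5).mean())      # ~1.0 (w kept)
print(cross_val_score(LinearRegression(), X_pca, y, cv=5).mean())  # ~0 or worse (w lost)
```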
Indeed, in this project both the tree algorithms and linear regression had lower root mean squared error without PCA.
If you’re interested, my code is on Github: https://github.com/QingchuanLyu/Predicting-House-Prices
Thanks for reading :)