Now on to the main purpose of this article. In this section, we will look at three imputation techniques from the scikit-learn library in Python:
- Simple imputer
- Iterative imputer
- KNN imputer
To demonstrate the differences between the 3 techniques, I have created a sample data frame as follows.
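A data frame along these lines can be built with pandas. The values below are an assumption: they are picked so that they agree with the figures quoted later in this section (a mean age of 31.2 over the first 5 rows, and ages of 26 and 35 on rows 3 and 5), and they may not be exactly the values used originally.

```python
import numpy as np
import pandas as pd

# Assumed sample values, chosen to be consistent with the numbers quoted
# in this section. They are illustrative, not confirmed original data.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
        "SibSp": [1, 1, 0, 1, 0, 0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    },
    index=range(1, 7),  # label the rows 1 to 6 so "row 6" matches the text
)

print(df)
print(df.isna().sum())  # count of missing values per column
```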
Notice the missing value on row 6 in the Age column of the data frame. Our goal here will be to use different imputation techniques to replace the missing value with a substituted value and subsequently study the differences between the techniques.
Simple imputer
Simple imputer is an example of a univariate approach to imputing missing values: it considers only the feature being imputed and ignores every other column.
It replaces each missing value with a single summary statistic of that column. The most common choices are:
- Mean
- Median
- Most frequent (mode)
Here, our simple imputer has filled the missing value in the Age column with the mean age of the first 5 rows, which is 31.2.
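A minimal sketch of this step with scikit-learn's `SimpleImputer` is shown below. The data frame values are assumed, chosen only to match the figures quoted in this section.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Assumed sample values, consistent with the numbers quoted in the text.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
        "SibSp": [1, 1, 0, 1, 0, 0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    },
    index=range(1, 7),
)

# strategy can also be "median" or "most_frequent".
imputer = SimpleImputer(strategy="mean")

# fit_transform returns a NumPy array, so wrap it back into a data frame.
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(round(imputed.loc[6, "Age"], 1))  # 31.2, the mean of the 5 known ages
```

Each column is imputed with its own statistic, so the SibSp and Fare values play no part in the number chosen for Age.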
Although easy and straightforward, simple imputer is a rather blunt approach to imputing missing values. As we saw earlier, Age is positively correlated with Fare, so it would be worthwhile to also consider the values in the Fare column when performing imputation.
This is where multivariate imputation comes in: it takes multiple features in a dataset into account during imputation.
Iterative imputer
Iterative imputer is an example of a multivariate approach to imputation. It models the missing values in a column by using information from the other columns in a dataset. More specifically, it treats the column with missing values as a target variable, while the remaining columns are used as predictor variables to predict the target variable.
In our sample data frame, the Age column has one missing value on row 6 and is therefore assigned as the target variable in this scenario. This leaves the SibSp and Fare columns as our predictor variables.
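Iterative imputer will use the first 5 rows of the data frame to train a predictive model. Once the model is ready, it will use the SibSp and Fare values on row 6 as inputs and predict the Age value for that row. When several columns have missing values, this process is repeated for each of them in turn over a number of rounds, hence the name.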
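The idea can be sketched by hand with an ordinary regression model. This is only an illustration of the train-then-predict step described above: `LinearRegression` is a stand-in chosen here for simplicity (scikit-learn's `IterativeImputer` uses `BayesianRidge` by default), and the data frame values are assumed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed sample values, consistent with the numbers quoted in the text.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
        "SibSp": [1, 1, 0, 1, 0, 0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    },
    index=range(1, 7),
)

predictors = ["SibSp", "Fare"]

# Rows with a known Age form the training set (rows 1 to 5).
train = df[df["Age"].notna()]
# Rows with a missing Age are the ones to predict (row 6).
missing = df[df["Age"].isna()]

# Age is the target; SibSp and Fare are the predictors.
model = LinearRegression().fit(train[predictors], train["Age"])
predicted_age = model.predict(missing[predictors])[0]

print(predicted_age)
```

With only 5 training rows the prediction is not very reliable; the point is the mechanics rather than the number.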
This is what the result of our iterative imputer looks like.
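For reference, a minimal sketch of the scikit-learn call follows. `IterativeImputer` is still flagged as experimental, so it needs the explicit `enable_iterative_imputer` import first. The data frame values are assumed, and `random_state=0` is an arbitrary choice made here for repeatability.

```python
import numpy as np
import pandas as pd
# This import is required to unlock the experimental IterativeImputer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Assumed sample values, consistent with the numbers quoted in the text.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
        "SibSp": [1, 1, 0, 1, 0, 0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    },
    index=range(1, 7),
)

# Default settings: a BayesianRidge estimator, starting from a mean fill.
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(imputed)
```

Unlike the simple imputer, the value filled in for row 6 now depends on that row's SibSp and Fare values.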
KNN imputer
KNN is short for k-nearest neighbours, which is a machine learning algorithm in its own right, but here we are using it in the context of imputation. KNN imputer is another multivariate imputation technique, although its underlying algorithm differs from that of iterative imputer. Concretely, for each row with a missing value, KNN imputer finds the k rows nearest to it, with distance measured on the features that are present, and fills the missing value with the average of that feature across those neighbours.
To illustrate this, here I have set k equal to 2. In other words, I want KNN imputer to impute the missing Age value on row 6 with the average age of the 2 observations that are closest to that row.
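A minimal sketch of this setup with scikit-learn's `KNNImputer`, where k is set through the `n_neighbors` parameter. The data frame values are assumed, chosen only to match the figures quoted in this section.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Assumed sample values, consistent with the numbers quoted in the text.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
        "SibSp": [1, 1, 0, 1, 0, 0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    },
    index=range(1, 7),
)

# k = 2: average the Age of the 2 nearest rows (uniform weights by default).
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(imputed.loc[6, "Age"])  # 30.5
```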
KNN imputer has determined that rows 3 and 5 are the closest to row 6, so the imputed value is the average age of those two rows: (26 + 35) / 2 = 30.5.
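The choice of neighbours can be checked by hand. The sketch below ranks the other rows by plain Euclidean distance on SibSp and Fare. `KNNImputer` itself uses a NaN-aware Euclidean distance, but since Age is the only missing value here, the ranking comes out the same. The data frame values are assumed.

```python
import numpy as np
import pandas as pd

# Assumed sample values, consistent with the numbers quoted in the text.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
        "SibSp": [1, 1, 0, 1, 0, 0],
        "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    },
    index=range(1, 7),
)

# Distances are measured on the columns that row 6 actually has.
features = df[["SibSp", "Fare"]]
target = features.loc[6]
others = features.drop(index=6)

# Euclidean distance from row 6 to every other row.
distances = np.sqrt(((others - target) ** 2).sum(axis=1))

# The 2 rows with the smallest distance are the neighbours.
nearest = distances.nsmallest(2).index

print(sorted(nearest.tolist()))        # [3, 5]
print(df.loc[nearest, "Age"].mean())   # 30.5
```

Fare dominates the distance here because it is on a much larger scale than SibSp, which is one reason features are often scaled before KNN imputation.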