This imputation technique is similar to missing category imputation, where missing data is marked with a new label such as “Missing”. The main difference is that for numerical data you need to select a value that does not appear in your data column and is unlikely to fall within its range. For example, to represent a missing value in the Salary column, we would not use anything greater than or equal to zero. When I have to do this, I prefer a value such as -1, since Salary will never be negative.
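Here is a minimal pandas sketch of what this could look like; the DataFrame and its values are made up for illustration, and -1 stands in as the arbitrary value discussed above:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data: Salary has missing entries
df = pd.DataFrame({
    "Salary": [52000.0, np.nan, 61000.0, np.nan, 48000.0],
    "Occupation": ["Engineer", "Analyst", np.nan, "Manager", "Analyst"],
})

# Arbitrary value imputation: fill missing salaries with -1,
# a value that cannot occur naturally since Salary is never negative
df["Salary"] = df["Salary"].fillna(-1)

print(df)
```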
This is a simple technique to implement, aside from picking a suitable arbitrary value. However, using this form of imputation for numerical values can alter the distribution of the data in that column. It can also cause problems if the arbitrary value you select is too close to the mean, median, or mode, shifting those statistics. Another drawback is that it disrupts the covariance with other data columns. Lastly, the larger the percentage of missing data, the more pronounced all of these effects become.
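To see that distortion concretely, this short sketch (using a made-up Salary series with a large share of missing values) compares the summary statistics before and after imputing -1; the mean drops sharply and the spread widens:

```python
import pandas as pd
import numpy as np

# Hypothetical Salary column where roughly 40% of the values are missing
salary = pd.Series([52000.0, np.nan, 61000.0, np.nan, 48000.0])

print(salary.describe())             # statistics computed on observed values only
print(salary.fillna(-1).describe())  # mean and std shift noticeably after imputing -1
```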
Note: If you read the MCAR article, feel free to skip this part, since it covers the same technique.
This technique is more of an extension of the imputation techniques above, and it can be applied to any data type. For each column with missing data, you create a new column indicating whether the entry in that column is missing. This can be a dichotomous column where 0 = “not missing” and 1 = “missing”. You would do this before applying your imputation technique. In our sample, we would create the columns Salary_Missing and Occupation_Missing to show where the data is missing in those respective columns.
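A minimal pandas sketch of how this might look; the data here is made up, and the indicator columns follow the names used in the sample above:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data with missing Salary and Occupation entries
df = pd.DataFrame({
    "Salary": [52000.0, np.nan, 61000.0, np.nan],
    "Occupation": ["Engineer", np.nan, "Analyst", "Manager"],
})

# Add a dichotomous indicator column before imputing:
# 1 = the original value was missing, 0 = it was present
df["Salary_Missing"] = df["Salary"].isna().astype(int)
df["Occupation_Missing"] = df["Occupation"].isna().astype(int)

# Then apply whichever imputation technique you prefer,
# e.g. the arbitrary values discussed earlier
df["Salary"] = df["Salary"].fillna(-1)
df["Occupation"] = df["Occupation"].fillna("Missing")

print(df)
```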
This technique is easy to perform and helps you keep track of which data was originally missing once you fill in those null values with an imputation technique. One trade-off is the expansion of your data set: you are adding a new column for every column with missing data. This can add up quickly and leave you with a wider data set for training your machine learning model than you intended, thus increasing training time.