Real-world datasets often contain missing values, and a large share of a data scientist's time is spent on data preparation, which includes data cleaning. Missing values can result from unrecorded observations, data corruption, non-response, or skip patterns in surveys. They can appear as an empty string, NA, N/A, or None. When reading data, pandas converts blank fields and common markers such as NA, N/A, and n/a to NaN. However, it does not recognize markers such as na, ?, or n.a. unless you pass them through the na_values argument of read_csv. The missing values in each column of a data set can be counted with the df.isnull().sum() command in Python.
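As a minimal sketch of the point above, the snippet below reads a small, made-up CSV twice: once with the pandas defaults and once with extra missing-value markers passed through na_values. The column names and values are hypothetical and only serve as an illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text, used only for illustration.
raw = io.StringIO(
    "make,year,mileage\n"
    "Audi,2015,NA\n"
    "BMW,,42000\n"
    "Ford,2012,?\n"
    "Kia,2018,n.a.\n"
)

# With default settings, "NA" and the empty field become NaN,
# while "?" and "n.a." are kept as ordinary strings.
df_default = pd.read_csv(raw)
default_counts = df_default.isnull().sum()

# Passing extra markers through na_values makes pandas treat them as missing too.
raw.seek(0)
df_clean = pd.read_csv(raw, na_values=["?", "n.a.", "na"])
clean_counts = df_clean.isnull().sum()

print(default_counts)
print(clean_counts)
```

In the first read only one mileage value is counted as missing; in the second, all three placeholder entries are.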
Types of Missing Data
- Missing at Random (MAR): There is a relationship between the proportion of missing values and the observed data. For example, in the graph below we can see that the proportion of missing values in the mileage column is correlated with the car's manufacturing year. Missing data of this type can therefore be predicted from other features.
- Missing Completely at Random (MCAR): The proportion of missing values is unrelated to any observation in the data. An example is a weighing scale that ran out of batteries, so some of the measurements are simply missing.
- Missing Not at Random (MNAR): The missingness is related to factors that we have not observed, often the missing value itself. For example, the weighing scale's mechanism may wear out over time and produce more missing values as time progresses, but we might not record that wear.
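To make the difference between MCAR and MAR concrete, here is a small simulation sketch. The data, the 2010 cut-off, and the missingness probabilities are all assumptions chosen for illustration; they do not come from a real data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic car data (hypothetical values).
cars = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=n),
    "mileage": rng.normal(60_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of losing its mileage value.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of a missing mileage depends on an observed column (year).
# Cars built before 2010 lose the value 40% of the time, newer ones 5%.
old = (cars["year"] < 2010).to_numpy()
mar_mask = rng.random(n) < np.where(old, 0.40, 0.05)

# Compare the missing rate for old and new cars under each mechanism.
mcar_old, mcar_new = mcar_mask[old].mean(), mcar_mask[~old].mean()
mar_old, mar_new = mar_mask[old].mean(), mar_mask[~old].mean()

print(f"MCAR missing rate: old={mcar_old:.2f}, new={mcar_new:.2f}")
print(f"MAR  missing rate: old={mar_old:.2f}, new={mar_new:.2f}")
```

Under MCAR the two rates should be roughly equal, while under MAR the rate for older cars should be clearly higher. MNAR cannot be checked this way, because the factor driving the missingness is not in the data.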
Approaches to Handle Missing Values
1 Drop Columns and Rows Containing Missing Values
Remove the columns and rows containing missing values when the data is MCAR. The problem with this approach is the loss of information. A common rule of thumb is to delete a column if more than 70 to 75 percent of its values are missing. With a large data set, we can also delete a row if it contains a null value for a particular feature. However, row deletion doesn't work well if more than 30 percent of the values in the data set are missing.
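The following sketch shows one way to apply this with pandas. The data frame is made up, and the 70 percent column threshold is just the rule of thumb mentioned above, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "mileage" is mostly missing.
df = pd.DataFrame({
    "year": [2012, 2015, np.nan, 2018, 2020],
    "mileage": [np.nan, np.nan, np.nan, np.nan, 30000.0],
    "price": [5000.0, 7000.0, 6500.0, np.nan, 12000.0],
})

# Share of missing values per column.
missing_share = df.isnull().mean()

# Drop columns whose missing share exceeds the assumed 70% threshold.
col_threshold = 0.70
reduced = df.loc[:, missing_share <= col_threshold]

# Then drop any remaining rows that still contain a missing value.
complete = reduced.dropna(axis=0, how="any")

print(missing_share)
print(complete)
```

Here the mileage column is dropped because 80 percent of it is missing, and two of the five rows are then removed, which shows how quickly this approach can shrink a small data set.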
2 Imputing Missing Values with the Mean, Median, or Mode
We can replace a missing value with the mean, median, or mode of that particular feature. However, this method can introduce bias and distort the spread of the data: filling many gaps with the same value shrinks the feature's variance. The approach is most useful when the data set is small, because it prevents the information loss caused by removing rows and columns.
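A minimal sketch of this kind of imputation with pandas is shown below. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "mileage": [10000.0, 20000.0, np.nan, 90000.0],
    "fuel": ["petrol", "diesel", "petrol", None],
})

# Numeric feature: fill with the mean or the median of the observed values.
mean_filled = df["mileage"].fillna(df["mileage"].mean())
median_filled = df["mileage"].fillna(df["mileage"].median())

# Categorical feature: fill with the most frequent value.
# mode() returns a Series (there can be ties), so take the first entry.
mode_filled = df["fuel"].fillna(df["fuel"].mode()[0])

# The variance after mean imputation is smaller than the observed variance.
var_before = df["mileage"].var()
var_after = mean_filled.var()

print(mean_filled.tolist(), median_filled.tolist(), mode_filled.tolist())
print(var_before, var_after)
```

The median is usually the safer choice for skewed features, since a single extreme value (90000 here) pulls the mean away from the typical observation.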
3 Assigning a new category for missing values in categorical features
The third approach to handling missing values is to create another categorical class for them. This method works only for categorical features, as they consist of a definite number of classes. This technique results in low variance, and it also prevents the loss of data.
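As a short sketch, the new class can be created with fillna. The label "Missing" and the fuel column are assumptions made for this example; any label that does not clash with an existing class would work.

```python
import pandas as pd

# Hypothetical categorical feature with two missing entries.
df = pd.DataFrame({"fuel": ["petrol", None, "diesel", None, "petrol"]})

# Treat "missing" as its own class instead of guessing a value.
df["fuel_filled"] = df["fuel"].fillna("Missing")

counts = df["fuel_filled"].value_counts()
print(counts)
```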
4 Predicting the Missing Values
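We can predict the missing values with the help of an ML algorithm trained on the features that do not have missing values. When those features are related to the incomplete one, this technique tends to give more accurate estimates than filling in a single value such as the mean.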
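One possible sketch of this idea uses a linear regression from scikit-learn. The choice of model, the column names, and the toy data are assumptions for illustration; any regressor or classifier that suits the feature could be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: mileage is missing for two cars, year is complete.
df = pd.DataFrame({
    "year": [2010, 2012, 2014, 2016, 2018, 2020],
    "mileage": [100000.0, 80000.0, np.nan, 40000.0, np.nan, 0.0],
})

# Split into rows where the target is known and rows where it is missing.
known = df[df["mileage"].notnull()]
unknown = df[df["mileage"].isnull()]

# Train on the complete rows, using only features without missing values.
model = LinearRegression()
model.fit(known[["year"]], known["mileage"])

# Fill the gaps with the model's predictions.
df_imputed = df.copy()
df_imputed.loc[unknown.index, "mileage"] = model.predict(unknown[["year"]])

print(df_imputed)
```

Because the toy data is perfectly linear, the predictions fall exactly on the line through the known points. Real data is noisier, so the imputed values should be treated as estimates.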
Conclusion
Missing data can lead to invalid results because relevant information is absent. Missing values need to be handled before modelling, as training an ML model on a data set that contains them can result in an error: many Python libraries, including most estimators in scikit-learn, do not accept them.