Dealing with Missing Data Properly Using the missingno Package in Python
Recently, Kaggle started a playground competition, the Categorical Feature Encoding Challenge II. This competition built on a previous one by adding a twist… missing data. I did a short analysis of that missing data and built a notebook you can see here, but I thought I’d do a more thorough explanation in a blog!
One of the first things you’ll ever want to do with a dataset is deal with missing data. In the world of geology, mining, and oil and gas, missing data is very common. Machine learning pipelines require that something be done with these missing values; NaN (Not a Number) is not acceptable. We could blindly drop any columns or rows with missing data, but we could lose a large amount of valuable data that way. We could impute a mean, but that may not represent the data well either (or even be possible… “What is the mean value of granite and shale?”).
Dealing with missing data will usually require four steps:
1. Check for missing values. Are they NaN, or some other representation of NULL? This can be subtly misleading as well. Sometimes -999 is used to stand in for a NULL value, which can have obvious, harsh, unintended consequences if it is treated as a real measurement.
2. Analyze the amount and type of missing data. If it’s small enough, maybe we can get away with dropping the features or instances that contain the missing data. How random is the missing data?
3. Either delete or impute these values. We can impute with the mean, median, or most frequent value, or lean on other domain-specific information. We can even build another machine learning model that treats the feature with missing values as the target, to try to “guess” them. (A short code sketch of these steps follows this list.)
4. Evaluate and compare the performance of each imputation option. The iterative process, as always.
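As a rough sketch of what steps 1 and 3 might look like in pandas and scikit-learn (the file name and column names here are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# 'assays.csv' and its columns are hypothetical stand-ins.
df = pd.read_csv('assays.csv')

# Step 1: check for missing values, including sentinel codes like -999
# that hide real NULLs behind a plausible-looking number.
df = df.replace(-999, np.nan)
print(df.isna().sum())                 # NaN count per column

# Step 2: how much is missing, as a fraction of each column?
print(df.isna().mean().sort_values(ascending=False))

# Step 3, option A: drop instances missing a critical feature.
df_dropped = df.dropna(subset=['grade_ppm'])

# Step 3, option B: impute with a simple statistic instead.
imputer = SimpleImputer(strategy='median')
df[['grade_ppm']] = imputer.fit_transform(df[['grade_ppm']])
```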
This specific blog will focus on step 2.
The three types of missing data you can come across are (a short simulation of each follows this list):
- Missing Completely At Random (MCAR): The probability of a value being missing is unrelated to any variable in the data, observed or not. This implies complete randomness in the missingness; real data is rarely like this.
- Missing At Random (MAR): The probability of a value being missing is related to another measured variable. As an example, an exploration company only sends samples for uranium assay if the radiometric reading is over 200 cps.
- Not Missing At Random (NMAR): The probability of a value being missing is correlated with the value itself. Hypothetically, if a geochemical database has replaced values below the detection limit with NaN, this would be NMAR.
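To make the three mechanisms concrete, here is a small simulation; the variable names, distributions, and thresholds are invented, chosen to mirror the geology examples above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    'radiometrics_cps': rng.uniform(50, 400, n),
    'uranium_ppm': rng.lognormal(2.0, 1.0, n),
})

# MCAR: values vanish purely at random, unrelated to anything measured.
mcar = df['uranium_ppm'].mask(rng.random(n) < 0.1)

# MAR: missingness depends on another observed variable -- no assay
# was ordered unless the radiometric reading exceeded 200 cps.
mar = df['uranium_ppm'].mask(df['radiometrics_cps'] <= 200)

# NMAR: missingness depends on the value itself -- readings below a
# detection limit of 2 ppm were stored as NaN.
nmar = df['uranium_ppm'].mask(df['uranium_ppm'] < 2.0)
```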
When trying to classify the type (or types) of missing data in a feature, common sense and knowledge of the data will always prevail. Understanding how and why the data was collected and stored will be your first task as a “data detective”.
If that fails, we can do some statistical testing, using typical t-tests, along with the Python package missingno, which gives us some nice visualizations. Using these tests, we can distinguish two cases: either the data is MCAR, or it is MAR or NMAR. Unfortunately, we cannot tell MAR and NMAR apart using statistical methods alone.
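A crude version of such a test, continuing with the simulated frame above: split an observed variable by whether the suspect column is missing, and compare the two groups with a t-test. A significant difference argues against MCAR.

```python
from scipy import stats

# Split radiometrics by the nullity of the MAR uranium series.
present = df.loc[mar.notna(), 'radiometrics_cps']
missing = df.loc[mar.isna(), 'radiometrics_cps']

# Welch's t-test: for the MAR series the groups differ sharply
# (missing assays all sit at low radiometric readings), so the
# p-value is effectively zero and MCAR is rejected.
t_stat, p_value = stats.ttest_ind(present, missing, equal_var=False)
print(f't = {t_stat:.2f}, p = {p_value:.4g}')
```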
Three ways missingno can visualize missing data are the matrix, the heatmap, and the dendrogram. Below are examples from the Cat in the Dat II dataset and the Missing Migrants dataset, both available on Kaggle.
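All three plots accept a pandas DataFrame directly. A minimal sketch, assuming the Kaggle CSV has been downloaded locally (the path below is a placeholder):

```python
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno

df = pd.read_csv('cat-in-the-dat-ii/train.csv')  # placeholder path

msno.matrix(df.sample(500))  # one row per instance; white gaps are NaNs
msno.heatmap(df)             # nullity correlation between column pairs
msno.dendrogram(df)          # hierarchical clustering of that correlation
plt.show()
```

Sampling a few hundred rows for the matrix plot just keeps it readable on large datasets; the heatmap and dendrogram summarize all rows either way.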
Cat in the Dat Matrix – MCAR
Missing Migrant Matrix – MAR or NMAR
Cat in the Dat Heatmap – MCAR
Missing Migrant Heatmap – MAR or NMAR
Cat in the Dat Dendrogram – MCAR
Missing Migrant Dendrogram – MAR or NMAR
Tags: kaggle, missing data