## Dealing with Missing Data Properly Using the missingno Package in Python

Recently, Kaggle started a playground competition Categorical Feature Encoding Challenge II. This competition built on a previous competition by adding a twist… missing data. I did a short analysis of that missing data and built a notebook you can see here, but I thought I’d do a more thorough explanation in a blog!

One of the first things you’ll ever want to do with a dataset is to deal with missing data. In the world of geology, mining, and oil and gas, missing data is very common. Machine learning pipelines require that something be done with these missing values. NaN (Not a number), is not acceptable. We could blindly drop any columns or rows with missing data, but we could lose a large amount of valuable data that way. We could impute a mean, but this may not be the best way to represent that data either (or even possible… “What is the mean value of granite and shale?”).

##### Dealing with missing data will usually require 4 steps:
1. Check for missing values. Are the NaN, or some other representation of NULL? This can be subtly misleading as well. Sometimes -999 can be used to represent a NULL value, which can have obvious, harsh, unintended consequences.
2. Analyze the amount of missing values, and the type. If it’s small enough, maybe we can get away with dropping the features or instances that contain the missing data. How random is the missing data?
3. Either delete or impute these values. We can use mean, median, most frequent, or use other domain specific information to help impute these values. We can even build another machine learning algorithm using the feature with the missing values as the target to try and “guess” the value.
4. Evaluate and compare the performance of each imputed option. The iterative process, as always.

This specific blog will focus on step 2.

##### The 3 types of missing data you can come across are:
• Missing Completely At Random (MCAR) In this case the probability of one variable missing does not correlate to another variable that is also missing. This implies there is complete randomness in the missing data, however real data is rarely like this.
• Missing At Random (MAR): In this case the probability of one variable missing is related to another measured variable. As an example, an exploration company only sends assay samples testing for uranium, if radiometric data is over 200 cps.
• Not Missing At Random (NMAR): The probability of a missing values is correlated to that variable itself. Hypothetically, if a geochemical database has replaced values that are lower than detection limit with NaN, this would be NMAR.

When trying to classify the type (or types) of missing data in a feature, common sense and knowledge of the data will always prevail. How and why it was collected and stored will be your first task as a “data detective”.

If that fails, we can do some statistical testing, using typical t-tests and using the python package missingno, which allows us to use some nice visualizations. Using these tests, we can determine two cases, either the data is MCAR, or it is MAR or NMAR. Unfortunately, we cannot determine between MAR and NMAR using statistical methods.

3 ways missingno can visualize missing data is through a matrix, a heatmap, and a dendrogram. Below are examples from the Cat in the Dat II dataset and Missing Migrants dataset both available in Kaggle.

Cat in the Dat Matrix – MCAR

Missing Migrant Matrix – MAR or NMAR

Cat in the Dat Heatmap – MCAR

Missing Migrant Heatmap – MAR or NMAR

Cat in the Dat Dendrogram – MCAR

Missing Migrant Dendrogram – MAR or NMAR

Tags: ,
Posted in machine learning, statistics | No Comments »

## From Cholera to Kardashians

Here is a link to some of the various projects I’ve done with DataCamp. These are guided projects with a variety of topics, with some of my favorites being, extracting stock sentiment from the news, predicting honey bees from bumble bees from images using deep learning, the discovery of the importance of handwashing, and recreating John Snow’s map of the Cholera outbreak in London 1854.

DataCamp Python Projects

Tags: ,
Posted in machine learning, programming | No Comments »

## Orange: A Data Mining Tool

To continue from my previous post, I will introduce a great tool for basic data mining and machine learning that absolutely any geologist can use with no programming knowledge needed. That tool is Orange. It is free, open source, and intuitive. It can be used to simply visualize your data, and can even go as far as applying some of the most common machine learning algorithms that are used today.

Above we see the initial set-up of a project in Orange. On the left are the various widgets that are available to us. In the middle is our workflow. Typically we would start off with a file, and link that to a data file on our PC (e.g. CSV file). In this case the data was provided to us was 1729 samples of various rock types, locations names, and a few geochemical assay results.

We can also attach a Data Table to the file, to allow us to view the data in a familiar table format.

And as with any exploratory data analysis, some visualization is good to do. Here we can just connect a scatter plot (in visualize tools), to our file.

With any machine learning algorithm, and important step is to normalize the data. In this case, we will center by mean, and scale by standard deviation.

From here we continue on with our machine learning workflow. Below is an example of a basic, completed project that predicts the name of the rock on the remaining data. In this case Random Forest performed the best and was used in the prediction.

As you can see, this is just scraping the surface of Orange (or I suppose the peel!). There are numerous tutorials online that would do a much better job at getting into the nitty-gritty, as I myself am just starting to use this. Python itself is still more powerful, and more flexible. In fact Orange uses Python as it’s backend. However, I expect you can go very far with Orange. And although you can quickly start playing around with some machine learning, knowing how to set up training data and test data, and how to interpret the results, still requires careful thought.

Tags: ,
Posted in geology, machine learning | No Comments »

## Course Introduction and Reasons Why Machine Learning Projects Fail

I just had the privledge of attending the short course titled above at PDAC, 2019. I would like to thank the course instructors:

First I will give a quick overview of the first day were we went into the history of machine learning, and some of the basics. First to clearly define what Artificial Intelligence is versus Machine Learning. AI involves building machines that react like humans. To give an example, the new “Turing Test” would be to ask it “Can you go into the house and make me a cup of coffee?”. True AI should be able to do this, and we are nowhere near that point. Machine learning on the other hand, is a subset of AI that involves using algorithms to make predictions and classifications based on a large set of training data. A single algorithm can adpat and change it’s own parameters to solve a number of problems. Machine learning can be supervised, where we provide the labels for the data (e.g. rock names, ore, waste, etc.), or it could be unsupervised, where data is clustered based on similarities. Reinforcement learning is another field which is focused on performance and involves finding a balance between exporation and exploitation (e.g. multi-arm bandit problems). A humourus quote that captured the difference.

“If you’re seeing it in PowerPoint, it’s artificial intelligence. If you are seeing it in python, it’s machine learning.”

Cases were we will see machine learning perform the best will be automating menial tasks (e.g. core logging, autonomous driving, and drilling), dealing with highly complex data that humans are not capable of seeing trends in (e.g. exploration with layers of data over past 3D), and cases where rapid reaction time is necessary (e.g. real-time geometallurgy).

One important thing to keep in mind, this will always be a tool for the geologist to use, and not something to replace the geologist entirely. Data must be collected and curated competently, and must be interpreted properly afterwards.

However, this tool has the potential to greatly enhance the ability for the geologists to do both of these things.

A number of other key terms were discussed like cost functions, precision, recall, F-Scores, ROC curves, overfitting and underfitting, all of which deserve their own discussions, which I will do in later posts.

We also went over reasons why machine learning projects fail, which I believe deserves some specific attention:

• Asking the wrong questions: A specific goal should be delineated before the process begins. This allows you to focus resources on what kinds of data needs to be collected. Aimlessly looking through data is a dangerous endeavor as well. We are notorious as humans, in seeing patterns that don’t exist.
• Lack of firm support by key stakeholders: Data science projects often have impacts across many departments in an organization. Defining the strategy keeps the project on track, and prevents stakeholder apathy.
• Data problems: This is a problem I’m particularly familiar with. Quality, consistency, and incompleteness of data is frequently a major problem (A PDF is not a geophysical survey). If there is not enough data, a data scientist should reserve the right to ask for more data. And data collection and data wrangling is often going to be a large part of the job.
• Lack of the right data science “team”: Even within pure data science teams, you are rarely going to find one person that does everything. There are data engineers, data scientists, data analysists, with experience in Exploratory Data Analysis, Statistics, Coding, Feature Engineering, Visualization, and storytelling. This on top of the absolutely essential domain knowledge that the geologists can provide. Finding that unicorn can also set you up for a failed project should that person become unavailable in the middle of your project.
• Overly complex models: As often the case, keeping it simple can often lead to better results.
• Over-promising: Particularly with the increased interest in this area of research, keeping expectations reasonable is important. Often improvements don’t occur right away as each project requires it’s own solutions and refinements as time goes on.

That’s it for now, but I’ll post again shortly about a great new tool for geologists that requires no coding-savvy at all… Orange!

Tags: ,
Posted in geology, machine learning | No Comments »