## Dealing with Missing Data Properly Using the missingno Package in Python

Recently, Kaggle started a playground competition, the Categorical Feature Encoding Challenge II. This competition built on a previous competition by adding a twist… missing data. I did a short analysis of that missing data and built a notebook you can see here, but I thought I’d do a more thorough explanation in a blog!

One of the first things you’ll ever want to do with a dataset is to deal with missing data. In the world of geology, mining, and oil and gas, missing data is very common. Machine learning pipelines require that something be done with these missing values. NaN (Not a Number) is not acceptable. We could blindly drop any columns or rows with missing data, but we could lose a large amount of valuable data that way. We could impute a mean, but this may not be the best way to represent that data either (or even possible… “What is the mean value of granite and shale?”).

##### Dealing with missing data will usually require 4 steps:
1. Check for missing values. Are they NaN, or some other representation of NULL? This can be subtly misleading as well. Sometimes -999 is used to represent a NULL value, which can have obvious, harsh, unintended consequences if it is treated as a real measurement.
2. Analyze the amount of missing values, and the type. If it’s small enough, maybe we can get away with dropping the features or instances that contain the missing data. How random is the missing data?
3. Either delete or impute these values. We can use mean, median, most frequent, or use other domain specific information to help impute these values. We can even build another machine learning algorithm using the feature with the missing values as the target to try and “guess” the value.
4. Evaluate and compare the performance of each imputed option. The iterative process, as always.
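
Steps 1 and 2 above can be sketched in a few lines of plain Python. This is only an illustration: the sentinel codes, the record layout, and the helper names `is_missing` and `missing_fraction` are assumptions of mine, and in practice you would likely do the same thing on a pandas DataFrame.

```python
import math

# Hypothetical sentinel codes; the right set depends on how the data was recorded.
SENTINELS = {-999, -9999, "", "NA", "NULL"}


def is_missing(value, sentinels=SENTINELS):
    """Return True for None, NaN, or any agreed sentinel code."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value in sentinels


def missing_fraction(rows):
    """Fraction of missing entries per column for a list of dict records."""
    columns = rows[0].keys()
    return {
        col: sum(is_missing(row[col]) for row in rows) / len(rows)
        for col in columns
    }


# Made-up assay records, for illustration only.
records = [
    {"lith": "granite", "U_ppm": 12.0},
    {"lith": "", "U_ppm": -999},
    {"lith": "shale", "U_ppm": float("nan")},
    {"lith": "granite", "U_ppm": 8.5},
]
fractions = missing_fraction(records)
print(fractions)
```

Treating -999 as missing before computing any statistic is the point here: left in place, it would drag a mean or a regression far from the real values.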

This specific blog will focus on step 2.

##### The 3 types of missing data you can come across are:
• Missing Completely At Random (MCAR): In this case the probability that a value is missing is unrelated to any other variable, whether observed or missing. This implies complete randomness in the missing data; however, real data is rarely like this.
• Missing At Random (MAR): In this case the probability that a value is missing is related to another measured variable. As an example, an exploration company only sends samples for uranium assay if the radiometric reading is over 200 cps.
• Not Missing At Random (NMAR): The probability that a value is missing is related to the value of that variable itself. Hypothetically, if a geochemical database has replaced values that are lower than the detection limit with NaN, this would be NMAR.

When trying to classify the type (or types) of missing data in a feature, common sense and knowledge of the data will always prevail. Finding out how and why it was collected and stored will be your first task as a “data detective”.

If that fails, we can do some statistical testing, using typical t-tests and the Python package missingno, which provides some nice visualizations. Using these tests, we can distinguish two cases: either the data is MCAR, or it is MAR or NMAR. Unfortunately, we cannot distinguish between MAR and NMAR using statistical methods.

Three ways missingno can visualize missing data are a matrix, a heatmap, and a dendrogram. Below are examples from the Cat in the Dat II dataset and the Missing Migrants dataset, both available on Kaggle.
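
In missingno these plots come from `msno.matrix(df)`, `msno.heatmap(df)`, and `msno.dendrogram(df)` on a pandas DataFrame. The heatmap shows the nullity correlation between pairs of columns, and the sketch below computes that quantity by hand with the standard library so it is clear what the colours mean. The column values and the function name `nullity_correlation` are made up for illustration.

```python
import math


def nullity_correlation(col_a, col_b):
    """Pearson correlation between the missingness indicators of two columns."""
    a = [1.0 if v is None else 0.0 for v in col_a]
    b = [1.0 if v is None else 0.0 for v in col_b]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # a column that is never (or always) missing
    return cov / math.sqrt(var_a * var_b)


# Made-up columns; None marks a missing value.
depth = [10, None, 30, None, 50, 60]
assay = [1.2, None, 0.8, None, 2.1, 0.4]  # missing exactly where depth is
lith = ["a", "b", None, "c", "d", None]   # never missing where depth is

together = nullity_correlation(depth, assay)
apart = nullity_correlation(depth, lith)
print(together, apart)
```

A value near 1 means two columns tend to be missing together, a value near -1 means one is present when the other is missing, and values near 0 are what we would expect to see across the board if the data were MCAR.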

Cat in the Dat Matrix – MCAR

Missing Migrant Matrix – MAR or NMAR

Cat in the Dat Heatmap – MCAR

Missing Migrant Heatmap – MAR or NMAR

Cat in the Dat Dendrogram – MCAR

Missing Migrant Dendrogram – MAR or NMAR

## From Cholera to Kardashians

Here is a link to some of the various projects I’ve done with DataCamp. These are guided projects on a variety of topics. Some of my favorites are extracting stock sentiment from the news, distinguishing honey bees from bumble bees in images using deep learning, the discovery of the importance of handwashing, and recreating John Snow’s map of the 1854 cholera outbreak in London.

DataCamp Python Projects

## Comparing Distribution. Chi-square and other robust methods.

Most of the discussion previously has assumed normal distributions, and given some options for nonparametric situations, but we need tests to determine what type of distribution we are dealing with.

The chi-square goodness-of-fit test is the first, and simplest way to do this. It compares a known distribution to the empirical data set. However, it does require a large enough sample size, and can be influenced by binning choices.

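As with any hypothesis test, we first assume that the two distributions are equal, and we reject this only if we have sufficient evidence that they are different. The test statistic for a chi-square goodness-of-fit is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where Oᵢ and Eᵢ are the observed and expected frequencies in each of the k categories, and the statistic has k − 1 degrees of freedom.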

Let’s do a simple example of tossing a 6-sided die. We expect an even distribution from 60 tosses, 10 for each possibility. That is, 1 through 6 should have the same frequencies. From 60 tosses we get 10 ones, 14 twos, 6 threes, 18 fours, 2 fives, and 10 sixes.

With an expected count of 10 in every category, the statistic is (0 + 16 + 16 + 64 + 64 + 0)/10 = 16.0, with 6 − 1 = 5 degrees of freedom. At a significance level of 0.05 the critical value is 11.07, and we reject the null hypothesis if our statistic exceeds it. This is true, so we reject our null hypothesis. This die is loaded (with a significance level of 0.05). Even at a significance level of 0.01, the critical value for 5 degrees of freedom is 15.09, so we would still reject our null hypothesis, although by a much narrower margin.
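
The die example is easy to check in code. The sketch below recomputes the statistic from the observed counts; the critical value is hard-coded from a chi-square table for 5 degrees of freedom at α = 0.05, which is my assumption about the table being used rather than something computed here.

```python
# Observed counts for faces 1 through 6 from the 60 tosses.
observed = [10, 14, 6, 18, 2, 10]

# Under the null hypothesis every face is equally likely.
expected = [sum(observed) / len(observed)] * len(observed)

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

# Assumed table value: chi-square critical value for df = 5, alpha = 0.05.
CRITICAL_05_DF5 = 11.07
reject_null = chi_square > CRITICAL_05_DF5

print(chi_square, degrees_of_freedom, reject_null)
```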

### Robust methods to estimate distribution

With the chi-square test only applicable to large data sets (n>30), we need a test to use for smaller samples. The Kolmogorov-Smirnov (K-S) test fills this need. The test statistic relies on the maximum distance between the two cumulative probability curves.

K-S example from Wikipedia

CDF is the theoretical cumulative distribution function, and EDF is the empirical distribution function.

The two major advantages of this test are that the distribution of the statistic itself doesn’t depend on the CDF being tested, and that it is an exact test that doesn’t depend on an adequate sample size.

We reject our null hypothesis if the test statistic D is greater than the critical value we get from our table. There are a variety of ways to compute D, and they should be equivalent. However, you should ensure that the statistic is calculated in a way that matches the critical value table used.
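
As a rough sketch of how D can be computed, the function below takes the largest gap between the EDF and a theoretical CDF, checking the gap just before and just after each step of the EDF. The sample values and the function name `ks_statistic` are made up, and this is one common convention for D, so check it against whichever critical value table you use.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest vertical distance between the sample's EDF and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The EDF jumps from (i - 1)/n to i/n at x, so check both sides of the step.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Made-up samples, for illustration only.
d_uniform = ks_statistic([0.1, 0.4, 0.7], lambda x: x)  # against Uniform(0, 1)
d_normal = ks_statistic([-1.2, -0.3, 0.1, 0.4, 1.5], NormalDist(0, 1).cdf)
print(d_uniform, d_normal)
```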

## Univariate Statistics Part 3, How do you compare central tendency

These are tests used to compare the averages between two distributions. Comparing two means, we could compare the mean to a predefined value (e.g., Are these sandwiches on average different than Subway’s 12” claim?). Or we could compare two means from two separate sample sets together (e.g., Do the sandwiches at this Subway differ from the sandwiches at another?). Both tests use similar methods, however if we are comparing to a predefined value, we use a normal distribution, and if we are comparing two sample sets, we use a t-distribution.

We also must consider sample size, as with any sample set under 30, we cannot assume our standard deviation is accurate. Therefore, with sample sizes under 30 we use the t-distribution. We therefore have a few scenarios to consider:

• One sample vs. a predefined population mean
  ◦ Large sample size
  ◦ Small sample size
• Two samples against each other
  ◦ Large sample size
  ◦ Small sample size
• Paired experiments

### One sample vs. a predefined population mean

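Our t-scores or z-scores are calculated as:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, \qquad t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where x̄ is the sample mean, μ₀ is the predefined value, n is the sample size, σ is the known population standard deviation, and s is the sample standard deviation (the t-score has n − 1 degrees of freedom).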

Let’s go back to our sandwiches example. We want to know whether our sandwiches are really 12” in length. We have 10 sandwiches and measure them (11.9, 11.8, 12.1, 11.9, 12.0, 12.1, 11.9, 11.8, 12.2, and 12.0 inches). These measurements give a sample mean of 11.97 inches and a sample standard deviation of about 0.134 inches.

We also must determine our significance level α, which we will say is 0.05, and state our hypotheses. We use a two-tailed test, so our hypotheses are H0: μ = 12 and Ha: μ ≠ 12.

We have a small sample size, so we use the t-distribution. The test statistic is t = (11.97 − 12)/(0.134/√10) ≈ −0.709, with 9 degrees of freedom.

The two-tailed p-value for this t-score is approximately 0.50. As this is greater than our significance level of 0.05, we cannot reject the null hypothesis.

One thing to note: as sample size increases, the t-distribution approaches the normal distribution, so a general approach is to always use the t-distribution.
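
The sandwich test can be reproduced with the standard library. This is a sketch under a few assumptions of mine: `statistics.stdev` is used for the sample standard deviation (n − 1 denominator), and the critical value is hard-coded from a t table for 9 degrees of freedom, two-tailed α = 0.05, instead of computing a p-value.

```python
import math
import statistics

# Sandwich lengths in inches, from the example above.
lengths = [11.9, 11.8, 12.1, 11.9, 12.0, 12.1, 11.9, 11.8, 12.2, 12.0]
mu_0 = 12.0  # the claimed length we test against

n = len(lengths)
mean = statistics.mean(lengths)
s = statistics.stdev(lengths)  # sample standard deviation (n - 1 denominator)
t_stat = (mean - mu_0) / (s / math.sqrt(n))

# Assumed table value: two-tailed critical t for df = 9, alpha = 0.05.
T_CRIT_TWO_TAILED_05_DF9 = 2.262
reject_null = abs(t_stat) > T_CRIT_TWO_TAILED_05_DF9

print(mean, s, t_stat, reject_null)
```

Comparing |t| against the table value leads to the same decision as comparing the p-value against α; it just avoids needing the t-distribution CDF, which the standard library does not provide.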

### Means of two different sample groups

Now when using the t-test on two different sample groups with a small sample size, we can obtain a better estimate by pooling our sample variances, as long as the population variances are equal between the samples, the samples are independent of each other, and the samples follow a normal distribution.

For this example, we are testing to see if one granite (Granite A) is elevated in U. We have 30 samples of each granite (Granite A, and Granite B). Granite A has a mean U of 923 ppm, and a standard deviation of 433 ppm. Granite B has a mean U of 723 ppm, and a standard deviation of 399 ppm.

We define our significance level at α = 5%. That is, we want less than a 5% probability of seeing a 200 ppm difference between the two samples by chance alone, or less than a 5% chance of incorrectly rejecting the null hypothesis.

We look up the critical Z value in a standard normal distribution table and find that for a 95% confidence level on our one-tailed test, we need a critical value of 1.65 (or 1.65 standard deviations from the mean).

The distance from the mean to the critical value of 1.65 is:

$$1.65 \times \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} = 1.65 \times \sqrt{\frac{433^2}{30} + \frac{399^2}{30}} \approx 1.65 \times 107.5 \approx 177 \text{ ppm}$$

Based on this information, we know that if the two granites had the same mean, there would be only a 5% chance of obtaining a difference of 177 ppm or higher. Our observed difference is 200 ppm. As this is higher than the critical difference, we can reject our null hypothesis and state that Granite A is elevated in U compared to Granite B.
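
Here is a small sketch of the granite comparison, using only the summary statistics quoted above. The variable names are mine, and the critical Z of 1.65 is taken from the text as given.

```python
import math

# Summary statistics for U (ppm) from the example above.
mean_a, sd_a, n_a = 923.0, 433.0, 30  # Granite A
mean_b, sd_b, n_b = 723.0, 399.0, 30  # Granite B

Z_CRIT = 1.65  # one-tailed critical value at alpha = 0.05, as quoted in the text

# Standard error of the difference between the two sample means.
se_diff = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)

critical_difference = Z_CRIT * se_diff
observed_difference = mean_a - mean_b
z_observed = observed_difference / se_diff
reject_null = observed_difference > critical_difference

print(se_diff, critical_difference, z_observed, reject_null)
```

Comparing the observed difference with the critical difference is the same test as comparing `z_observed` with 1.65; the two just sit on different scales.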

### Paired experiments

Once we begin comparing different methods on the same population, we can no longer assume that our samples are independent. The variance between samples is greater than the variance between methods (i.e. Rock A analyzed by method A will be close to Rock A analyzed by method B, however Rocks B to Z will show much more variability in either method). Because of this, we cannot pool our variances as before. We instead use the pairwise differences, di, and consider their mean to follow a t-distribution. d̄ and sD are the mean and standard deviation of the differences, and nD is the number of pairs. μD is the true mean difference we are testing, and D0 is the value it takes under our null hypothesis. We also assume that the distribution of the differences is normal, and that the differences are a random sample. Our test statistic is:

$$t = \frac{\bar{d} - D_0}{s_D / \sqrt{n_D}}$$

with nD − 1 degrees of freedom.

This paired experiment is a simplified example of blocking, where comparisons are made between similar experimental units. This blocking needs to be done prior to performing the experiment.

The Wilcoxon signed-rank test can be used when we can’t assume a normal distribution in a paired experiment.

Let’s do an example of a paired experiment using the Wilcoxon signed-rank test. We will use the fictional data below to determine if arsenic (As) is increasing in soil samples over a 1-year period.

In this sample data set, I used a Shapiro-Wilk test to test for normality (α=0.05). For February 2018, we calculated a p-value of 0.457, so we cannot reject normality and we assume the sample set is normal. For February 2019, our calculated p-value is 0.047; this is statistically significant and therefore we assume it is not normally distributed.

As usual we state our hypotheses:

We state our level of significance at α=0.05, and our number of observations n=9 (we ignore values with no differences such as location 4). First, we need to calculate some new information for our test: the absolute difference, and the rank.

As this is a Wilcoxon signed rank test, we need to know the rank sum of the negative and positive differences. We ignore sample pairs with no differences.

Our Wilcoxon test statistic is the smaller of these two rank sums, therefore Wstat = 13. We use a table to find that the critical value for n = 10 and α = 0.05 is Wcrit = 8. Therefore, Wstat > Wcrit, and we cannot reject the null hypothesis. If our critical value was greater than our test statistic, we would be able to reject the null.
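
The ranking procedure described above can be sketched as follows. The arsenic values here are made up (they are not the table from this example), the function name is mine, and ties in the absolute differences get the average of the ranks they span, which is a common convention that I am assuming.

```python
def wilcoxon_signed_rank(before, after):
    """Return (W+, W-, W, n) for paired samples, dropping zero differences."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ordered = sorted(diffs, key=abs)

    # Assign ranks by absolute difference; tied values share their average rank.
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus, min(w_plus, w_minus), len(ordered)


# Hypothetical As values (ppm) at six locations, one year apart.
feb_2018 = [10, 12, 9, 14, 11, 8]
feb_2019 = [12, 11, 13, 14, 16, 11]

w_plus, w_minus, w_stat, n_pairs = wilcoxon_signed_rank(feb_2018, feb_2019)
print(w_plus, w_minus, w_stat, n_pairs)
```

The returned `w_stat` is what gets compared with the critical value from the table for the number of non-zero pairs.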

### Some robust methods for comparing two medians

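The two most powerful tests when normality is not certain are the Mann-Whitney and the Terry-Hoeffding tests.

The Mann-Whitney test (also called the Wilcoxon rank-sum test or the U-test) ranks the two groups as one combined set and calculates the sum of ranks for each group, R1 and R2. From these, two U values are calculated:

$$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2$$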

The smaller of these two values is taken. If the null hypothesis is true, both values should be roughly equal. For n1 and n2 ≤ 20, the critical values table is consulted. A significant value is observed if the calculated U is less than the critical value. Larger data sets are converted to standard normal by:

$$Z = \frac{U - \dfrac{n_1 n_2}{2}}{\sqrt{\dfrac{n_1 n_2}{12}\left[(N + 1) - \sum \dfrac{t^3 - t}{N (N - 1)}\right]}}$$

where N = n1 + n2, t is the number of tied values for a given rank, and the sum runs over all groups of ties. In this case, we have the opposite scenario, where a significant value is observed when the absolute value of the calculated Z is larger than the critical value.
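
As a minimal sketch of the small-sample procedure, the code below ranks the combined data, sums the ranks per group, and takes the smaller U. It uses the common definition U = n1·n2 + n(n + 1)/2 − R for each group, which is an assumption on my part about the form intended here; the data and function names are made up.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_1, group_2):
    """Return (U, U1, U2), where U is the smaller of the two U values."""
    n1, n2 = len(group_1), len(group_2)
    ranks = average_ranks(list(group_1) + list(group_2))
    r1 = sum(ranks[:n1])  # rank sum of the first group
    r2 = sum(ranks[n1:])  # rank sum of the second group
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2), u1, u2


# Hypothetical measurements from two small groups.
u_stat, u1, u2 = mann_whitney_u([1, 3, 5], [2, 4, 6, 8])
print(u_stat, u1, u2)
```

A handy sanity check is that U1 + U2 always equals n1·n2.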

The Terry-Hoeffding test is performed similarly but uses the sum of the rankits for the smaller group of data instead of the raw ranks.

Here E(N,Ri) is the rankit for the Ri-th rank of N values, which results in a test with greater power. T is used to do a one-sided test to determine if the smaller group has a larger median, -T to test if the smaller group has a smaller median, and the absolute value of T for two-sided tests. For sample sizes greater than 20, the approximated critical value is:

In this case, r is the critical value of Pearson’s r, given that N and n2 are sufficiently large.

If doing a two-tailed test, testing the null hypothesis for equality, we reject the null when:

If doing a one-tailed test, we reject the null when:

## Measures of spread and scale

Often, we may want to compare our sample variance with a hypothetical population variance, in order to ensure it doesn’t exceed a certain value. Back to our sandwiches, perhaps Subway is okay with footlongs varying in length by no more than 1”. The test statistic we use follows the chi-squared distribution with n − 1 degrees of freedom:

$$\chi^2 = \frac{(n - 1) s^2}{\sigma_0^2}$$

In this test we hypothesize that H0: σ² = σ₀², or that the population variance is equal to our hypothesized value. Our alternative hypothesis could be Ha: σ² > σ₀², for a one-tailed test. In that case χ² gets larger and larger as s² increases in comparison to our hypothesized σ₀², so large values are evidence against the null. Alternatively, the opposite scenario would imply small values show evidence against our null hypothesis. With two-tailed tests, we reject our null with very large, and very small values of χ².

As an example, let’s imagine a quality control experiment to ensure that 500 gram sand samples taken in the field do not exceed a variance of 50 g². Our hypotheses could be H0: σ² = 50 g² and Ha: σ² > 50 g².

We take 9 samples, with weights of 455, 460, 473, 496, 503, 512, 523, 527, and 540 grams. Our s² is calculated to be 920.94 g² (dividing the sum of squared deviations by n − 1 = 8). Now we calculate our χ²:

$$\chi^2 = \frac{(n - 1) s^2}{\sigma_0^2} = \frac{8 \times 920.94}{50} \approx 147.35$$

Using R, or a chi-squared table, we find that the p-value for our chi-square score is <0.00001. Although we haven’t stated an acceptable α, we can say there is strong evidence (p-value <0.00001) that the true variance of our sample weights is greater than 50 g².
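
The sand sample check is quick to reproduce. Note that `statistics.variance` divides by n − 1 (the sample variance), while `statistics.pvariance` divides by n. The critical value below is an assumed table lookup for 8 degrees of freedom at α = 0.05, included only to show the comparison; the example itself did not fix an α.

```python
import statistics

# Sample weights in grams, from the example above.
weights = [455, 460, 473, 496, 503, 512, 523, 527, 540]
sigma0_sq = 50.0  # hypothesized maximum variance, in g^2

n = len(weights)
s_sq = statistics.variance(weights)  # sample variance, n - 1 denominator
chi_square = (n - 1) * s_sq / sigma0_sq

# Assumed table value: chi-square critical value for df = 8, alpha = 0.05.
CRITICAL_05_DF8 = 15.51
reject_null = chi_square > CRITICAL_05_DF8

print(s_sq, chi_square, reject_null)
```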

If we are comparing two sample variances, we use the F distribution. In this case we have a sample variance s₁², from n₁ observations drawn from a population with variance σ₁², and another sample variance s₂², from n₂ observations drawn from a population with variance σ₂². Both have independent observations, and both are from normally distributed populations. The test statistic is F = s₁²/s₂², with n₁ − 1 and n₂ − 1 degrees of freedom. As usual we test H0: σ₁² = σ₂² against either Ha: σ₁² > σ₂², Ha: σ₁² < σ₂², or Ha: σ₁² ≠ σ₂².

Let’s do another example. Does a sample method that collects 2000 gram samples have a greater variability than one that collects 500 gram samples?

Using a computer, we find the p-value of F = 2.785 with 10 and 6 degrees of freedom is 0.111225. Our p-value is greater than our α (0.05), and therefore the result is not significant, and we cannot reject our null hypothesis.

We must mention a few caveats. First, some sample sizes are very small, and larger sample sizes would be ideal. Secondly, normality is very important for both chi-squared and F tests. If normality is not certain, non-parametric tests should be done.

The most powerful non-parametric test (Rock, 1988a) would be the Klotz test, based on the squares of normal scores. Normal scores are Ai, where Ai = Φ⁻¹(Ri/(N+1)), Φ⁻¹ is the percent point function (the inverse of the cumulative distribution function) of the standard normal distribution, Ri is the rank of the i-th observation, and N is the sample size. The test statistic is calculated as follows:

As usual, if the calculated value is below the test value from the normal distribution, we accept the null hypothesis, if not we reject in favor of the alternative. That is, where α is the significance level:
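
The normal scores Ai = Φ⁻¹(Ri/(N+1)) defined above are easy to compute with the standard library. The sketch below covers only that step, not the full Klotz statistic; the function name is mine, the input values are made up, and ties are not handled (each observation simply gets its position in sorted order).

```python
from statistics import NormalDist


def normal_scores(values):
    """Map each value to inv_cdf(rank / (N + 1)) of the standard normal."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    inverse_cdf = NormalDist().inv_cdf  # percent point function
    scores = [0.0] * n
    for rank, index in enumerate(order, start=1):
        scores[index] = inverse_cdf(rank / (n + 1))
    return scores


# Made-up observations; ranks are 2, 1, 3.
scores = normal_scores([503, 455, 540])
squared_scores = [a ** 2 for a in scores]
print(scores, squared_scores)
```

Dividing by N + 1 instead of N keeps the argument strictly between 0 and 1, so the largest observation never maps to an infinite score.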

## Univariate Statistics Part 1, Formulating your hypothesis, and the p-value

We often would like to compare a population parameter to our sample. We may know this parameter in advance, or we may want to compare it to another sample. Suppose we want to know for sure Subway is giving us a real footlong sandwich when we ask for one. We buy 30 12″ sandwiches and measure their lengths to be 11.5″ on average. Is this close enough to 12″ to accept Subway’s claim? Or we could have 30 samples from 2 different granites. Are these two granites geochemically equivalent or not?

In order to do this we follow a 5-step process:

1. Formulate a null hypothesis, H0 (what we assume to be true), and our alternative hypotheses, Ha (which we only accept with sufficient evidence to reject the null)
2. Specify our level of significance. How important is it to avoid Type I errors, that is, rejecting the null hypothesis when it was in fact true? This could be very important when discussing the legal system and whether or not a person is innocent of a crime. It may be of less or equal importance in Subway sandwiches depending on your love of sandwiches.
3. Calculate our test statistic.
4. Define the region of rejection.
5. Either accept our null hypothesis, or reject it in favour of our alternative hypothesis.

In our hypothesis testing, we can make two errors.

• Type I errors occur when we reject our null hypothesis, but the null is true. For example, our justice system assumes a person is innocent. If we reject this, and put an innocent person in jail, we have made a type I error.
• Type II errors occur when we accept the null, but our alternative was true. In this case, a guilty person gets away without punishment.

Although type I errors are often considered worse, this is not always true. Knowing the difference and the consequences of each error can impact your decisions. These errors are also linked: for a fixed sample size, if we try to decrease the probability of type II errors, the probability of type I errors will increase.

The p-value is the probability, assuming the null hypothesis is true, of observing a relationship at least as strong as the one in our sample through randomness alone. The p-value helps us quantify the significance of a result if it exists. It is a decreasing index of the reliability of the results: the smaller the p-value, the more reliable we believe that relation to be. In other words, if we reject the null hypothesis at a p-value of 0.05, we accept a 5% probability of rejecting our null hypothesis when in fact it was correct (i.e. making a type I error). Or, if the null were true, we would have a 1/20 chance that our test would result in a relationship that is equal to or stronger than what we have observed.

We can also view all hypothesis tests as being, in a sense, “negative”. We can never reject the null hypothesis with 100% certainty, but we can disprove it to some level of significance, for example α = 0.05 (a 5% level of significance).

The power of a test is the probability that our test rejects the null hypothesis when the alternative hypothesis is true. We must specify the level of significance (α), that is, the probability of a type I error, before we do our test. If we want to minimize our type II errors (β), we express our null hypothesis with the intent to reject it. The power of a test is then 1 − β.

Let’s do an example. Let’s assume we have 49 samples from a normally distributed population. We know σ = 14, but μ is unknown. We start with our hypothesis test, we suggest our population mean is 50, and test it against a one-sided alternative that the population mean is less than 50.

H0: μ = 50

Ha: μ < 50

We set our significance level at α = 0.05. We now pick our appropriate test statistic. We have a normal distribution, and we know our population standard deviation, so the appropriate test statistic is:

$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

We calculate for which values of Z we reject our null hypothesis. We check a Z table, and find that the Z value with an area of 0.05 to the left is -1.64. We therefore reject if Z ≤ -1.64. To facilitate our calculations of the power of the test, we express the rejection region in terms of the sample mean x̄. We need to find for what values of x̄ we will reject the null hypothesis, by expressing our Z in terms of x̄ and solving as follows:

$$\bar{x} \le \mu_0 - 1.64 \, \frac{\sigma}{\sqrt{n}} = 50 - 1.64 \times \frac{14}{\sqrt{49}} = 46.72$$

This means we have a 0.05 probability of committing a Type I error when our sample mean is 46.72 or less.

Now, if µ = 45, and our null hypothesis wrongly states µ = 50, what is the probability of a Type II error? That is, what’s the probability of not rejecting the null when it is in fact false?

We standardize as follows:

$$Z = \frac{46.72 - 45}{14 / \sqrt{49}} = \frac{1.72}{2} = 0.86$$

We look up the Z table and find that the area to the right of Z = 0.86 is approximately 0.195. Therefore, our probability of a Type II error is β = 0.195. And the probability of rejecting the null hypothesis when it is false, that is the power of the test, is 1 – β = 1 – 0.195 = 0.805!
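
For anyone who would rather let the computer read the Z table, here is a sketch of the same power calculation using `statistics.NormalDist`. It uses the exact 5% quantile instead of the rounded -1.64, so the numbers will differ slightly from a hand calculation in the later decimals. The variable names are mine.

```python
import math
from statistics import NormalDist

mu_0 = 50.0     # mean under the null hypothesis
mu_true = 45.0  # the mean we suppose is actually true
sigma = 14.0    # known population standard deviation
n = 49
alpha = 0.05

std_error = sigma / math.sqrt(n)

# Left-tailed test: reject when the sample mean falls at or below this value.
z_crit = NormalDist().inv_cdf(alpha)
x_bar_crit = mu_0 + z_crit * std_error

# Type II error: the sample mean lands above the cutoff although mu = mu_true.
beta = 1 - NormalDist(mu_true, std_error).cdf(x_bar_crit)
power = 1 - beta

print(x_bar_crit, beta, power)
```

Changing `mu_true`, `n`, or `alpha` and rerunning is a quick way to see how the power responds to effect size, sample size, and significance level.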

If my explanations leave you with a few questions, which wouldn’t surprise me in the slightest, check out this jbstatistics video (or many of the other ones), which are much better at showing graphics to explain the concepts and helped me understand them to begin with.

## Introduction

Most often in university statistics courses, parametric techniques are given the primary focus. These techniques involve summarizing distributions with typical parameters like mean, median, mode, variance, and standard deviation. The focus is on central tendency and the spread of subsets of the populations to make general conclusions and determine equivalence.

This has some limitations when dealing with geological problems. Samples are typically not drawn from a normal population, and despite common usage, typically are not from log-normal populations either. Also, it is often the goal to identify unusual values, outliers, and not general conclusions about the entire population like mean and median values would.

The goal of the next series of blog posts is to go over these classical statistical methods along with some more robust nonparametric techniques. These techniques are less sensitive to outlying values and skewed distributions, which makes them more relevant to geologists. We will discuss 1 and 2 variable, and multivariate scenarios, all with the goal of better understanding the data set and visualizing variations. The end goal is often to find unexpected results and generate new ideas.

There have been huge advances both in the types of data we collect and the amount over the past 20 years. We are getting so good at collecting large amounts of data, but very little has been done in developing new forms of statistical analysis. Spatial and time variables are also special types of data that need to be treated differently and in their own context. The availability of open source software and advances in hardware has made analysis less expensive but can also lead to inappropriate methods.

The test data set I will use in this report is from the Unearthed competition data. The Explorer Challenge Unearthed is from a competition in South Australia where OZ Minerals opened up their data from the Mount Woods project. It lies close to the Prominent Hill copper and gold mine, but no discovery had been made at Mount Woods. This region is known for its copper and gold deposits, but any economic mineral could be of interest.

Two terabytes of data are available for testing statistical methods. Typical economic geology hurdles exist. Large scale deposits are the result of a combination of factors and are very rare. Deposits are often at depth, making data collection difficult. It’s also difficult to distinguish ore-grade deposits from unmineralized rocks. When choosing what statistical method to use, it’s important to determine the data type.

Categorical data can be divided into nominal or ordinal data:

• Nominal data is data that has two or more categories, but no intrinsic order. An example of this could be lithological rock types.
• Ordinal data is organized into two or more categories as well, but these categories can be ranked. An example of this could be descriptions of alteration as trace, weak, moderate, or strong.

Continuous variables can be divided into interval or ratio data.

• Interval data is continuous data measured along a continuum. Temperature in Celsius or Fahrenheit is an example of this type of data. The difference between 5 degrees and 15 degrees is the same as between 15 degrees and 25 degrees.
• Ratio data is the same, with the condition of 0 being equal to none of that variable. For example, Kelvin scale, or in our case geochemical assays. Most natural science data is of this type.

The data available includes a drillhole database with 678 drillholes, with 659 geological logs, and 585 chemical assays of up to 55 elements. It is not known if the distribution of the various features of the data is normal (Gaussian), log-normal, or neither. There is a variety of types of data available for testing. The drillhole data will be the bulk of the data used for testing in this report. 27,000 core tray photographs are available but won’t be used for this study.

Raw geophysics data is also available for test data. Other possible data sets to use could be regional airborne magnetics and radiometric surveys, prospect-level ground gravity surveys, 2D seismic surveys, regional magnetotellurics, regional and prospect-scale induced polarisation surveys, and multi-field inversion and modelling around Prominent Hill.

Data wrangling, cleaning, analyses and presentations will all be done using Microsoft Excel and Python 3.x.

## My First Jupyter Notebook!, and some code for Excel users

This is my first Jupyter Notebook, and I thought I’d do something very basic: some very simple tools to help someone start using pandas DataFrames with Excel files. The file is a very, very abbreviated Excel file of geochem data (some trace elements), from Nancy Normore. The data is a few samples of breccias from the Flin Flon area, with 5 trace elements, fragment types (mafic or felsic), areas, and UTMs. This data is purposefully a little messy (EDA is important), and I plan to do some demonstrations on how to clean data in Python in the future using this same dataset. If you know some basic Python skills (e.g. for loops, while, if statements), you’ll probably start to see some ways to streamline your workflow.

And just as I was finishing this, DataCamp posted a webinar on the same subject. I love their stuff and highly suggest a subscription as intermediate level training.

With all that said I’m going to have to switch gears on subject matter here, as I have a Networks class to finish. However, if you’re curious about how to do something, feel free to ask! I may have a solution for you, or at least be able to find one.

The notebook you will want to open is PyForExcel.ipynb

## Orange: A Data Mining Tool

To continue from my previous post, I will introduce a great tool for basic data mining and machine learning that absolutely any geologist can use with no programming knowledge needed. That tool is Orange. It is free, open source, and intuitive. It can be used to simply visualize your data, and can even go as far as applying some of the most common machine learning algorithms that are used today.

Above we see the initial set-up of a project in Orange. On the left are the various widgets that are available to us. In the middle is our workflow. Typically we would start off with a file, and link that to a data file on our PC (e.g. CSV file). In this case the data provided to us was 1729 samples of various rock types, location names, and a few geochemical assay results.

We can also attach a Data Table to the file, to allow us to view the data in a familiar table format.

And as with any exploratory data analysis, some visualization is good to do. Here we can just connect a scatter plot (in visualize tools), to our file.

With any machine learning algorithm, an important step is to normalize the data. In this case, we will center by mean, and scale by standard deviation.
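
Orange does this through a widget, but for anyone curious what “center by mean, scale by standard deviation” amounts to, here is a minimal sketch in plain Python. The function name and the values are made up, and I am assuming the sample standard deviation (n − 1 denominator); a tool may use the population version instead.

```python
import statistics


def standardize(values):
    """Center values on their mean and scale by their sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Made-up assay values, for illustration only.
scaled = standardize([2.0, 4.0, 6.0])
print(scaled)
```

After this step every feature has mean 0 and standard deviation 1, so features measured in very different units contribute on a comparable scale.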

From here we continue on with our machine learning workflow. Below is an example of a basic, completed project that predicts the name of the rock on the remaining data. In this case Random Forest performed the best and was used in the prediction.

As you can see, this is just scraping the surface of Orange (or I suppose the peel!). There are numerous tutorials online that would do a much better job at getting into the nitty-gritty, as I myself am just starting to use this. Python itself is still more powerful, and more flexible. In fact Orange uses Python as its backend. However, I expect you can go very far with Orange. And although you can quickly start playing around with some machine learning, knowing how to set up training data and test data, and how to interpret the results, still requires careful thought.

## Course Introduction and Reasons Why Machine Learning Projects Fail

I just had the privilege of attending the short course titled above at PDAC, 2019. I would like to thank the course instructors:

First I will give a quick overview of the first day, where we went into the history of machine learning and some of the basics. The first step was to clearly define what Artificial Intelligence is versus Machine Learning. AI involves building machines that react like humans. To give an example, the new “Turing Test” would be to ask it “Can you go into the house and make me a cup of coffee?”. True AI should be able to do this, and we are nowhere near that point. Machine learning, on the other hand, is a subset of AI that involves using algorithms to make predictions and classifications based on a large set of training data. A single algorithm can adapt and change its own parameters to solve a number of problems. Machine learning can be supervised, where we provide the labels for the data (e.g. rock names, ore, waste, etc.), or it can be unsupervised, where data is clustered based on similarities. Reinforcement learning is another field, which is focused on performance and involves finding a balance between exploration and exploitation (e.g. multi-armed bandit problems). A humorous quote captured the difference:

“If you’re seeing it in PowerPoint, it’s artificial intelligence. If you are seeing it in python, it’s machine learning.”

Cases where we will see machine learning perform the best will be automating menial tasks (e.g. core logging, autonomous driving, and drilling), dealing with highly complex data in which humans are not capable of seeing trends (e.g. exploration with layers of data beyond 3D), and cases where rapid reaction time is necessary (e.g. real-time geometallurgy).

One important thing to keep in mind, this will always be a tool for the geologist to use, and not something to replace the geologist entirely. Data must be collected and curated competently, and must be interpreted properly afterwards.

However, this tool has the potential to greatly enhance the ability for the geologists to do both of these things.

A number of other key terms were discussed like cost functions, precision, recall, F-Scores, ROC curves, overfitting and underfitting, all of which deserve their own discussions, which I will do in later posts.

We also went over reasons why machine learning projects fail, which I believe deserves some specific attention:

• Asking the wrong questions: A specific goal should be delineated before the process begins. This allows you to focus resources on what kinds of data need to be collected. Aimlessly looking through data is a dangerous endeavor as well. As humans, we are notorious for seeing patterns that don’t exist.
• Lack of firm support by key stakeholders: Data science projects often have impacts across many departments in an organization. Defining the strategy keeps the project on track, and prevents stakeholder apathy.
• Data problems: This is a problem I’m particularly familiar with. Quality, consistency, and completeness of data are frequently major problems (a PDF is not a geophysical survey). If there is not enough data, a data scientist should reserve the right to ask for more data. And data collection and data wrangling are often going to be a large part of the job.
• Lack of the right data science “team”: Even within pure data science teams, you are rarely going to find one person that does everything. There are data engineers, data scientists, and data analysts, with experience in Exploratory Data Analysis, Statistics, Coding, Feature Engineering, Visualization, and storytelling. This is on top of the absolutely essential domain knowledge that the geologists can provide. Finding that unicorn can also set you up for a failed project should that person become unavailable in the middle of your project.
• Overly complex models: As is often the case, keeping it simple can lead to better results.
• Over-promising: Particularly with the increased interest in this area of research, keeping expectations reasonable is important. Often improvements don’t occur right away, as each project requires its own solutions and refinements as time goes on.

That’s it for now, but I’ll post again shortly about a great new tool for geologists that requires no coding-savvy at all… Orange!