Archive for the ‘statistics’ Category

The Wilcoxon Sign Rank Test: A Geological Example

Geological data is often not normally distributed, and with data that is not normally distributed parametric methods should not be used.  In this case, I did a paired experiment, which is a simplified example of blocking, where comparisons are made between similar experimental units. This blocking needs to be done prior to performing the experiment.

Wilcoxon sign rank test can be used when we can’t assume a normal distribution in a paired experiment.

We will use the fictional data below to determine if Arsenic (As) is increasing in soil samples over a 1-year period.

Location IDFeb 2018 (As ppm)Feb 2019 (As ppm)
Figure 1: Fictional As soil sample data.

In this sample data set, I used a Shapiro-Wilk test to test for normality (∝=0.05) (“Shapiro-Wilk Test Calculator”, 2020). For February 2018, we calculated a p-value of 0.457 that our sample set is normal, so we assume that it is normal. For February 2019, our calculated p-value is 0.047, this is statistically significant and therefore we assume it is not normally distributed.

As usual we state our hypothesis:

We state our level of significance at ∝=0.05, and our number of observations n=9(we ignore values with no differences such as location 4). First, we need to calculate some new information for our test: the absolute difference, and the rank.

Location IDFeb 2018 (As ppm)Feb 2019 (As ppm)DifferenceAbsolute DifferenceRank
Figure 2: Calculation for the Wilcoxon signed rank test.

As this is a Wilcoxon signed rank test, we need to know the rank sum of the negative and positive differences. We ignore sample pairs with no differences.

Our Wilcox test statistic is the smallest of these two calculations, therefore wstat=13. We use a table to find that the critical value for n=10 and ∝=0.05, is wcrit=8, is  (“Wilcoxon Signed-Ranks Table”, 2020). Therefore,  wstat>wcrit, and we cannot reject the null hypothesis. It is possible there has been no change in As levels between the two years. If our critical value was greater than our test statistic, we would be able to reject the null and confirm statistically significant change in As values.

Tags: , ,
Posted in geology, statistics, univariate | No Comments »

Missingness Analysis

Dealing with Missing Data Properly Using the missingno Package in Python

Recently, Kaggle started a playground competition Categorical Feature Encoding Challenge II. This competition built on a previous competition by adding a twist… missing data. I did a short analysis of that missing data and built a notebook you can see here, but I thought I’d do a more thorough explanation in a blog!

One of the first things you’ll ever want to do with a dataset is to deal with missing data. In the world of geology, mining, and oil and gas, missing data is very common. Machine learning pipelines require that something be done with these missing values. NaN (Not a number), is not acceptable. We could blindly drop any columns or rows with missing data, but we could lose a large amount of valuable data that way. We could impute a mean, but this may not be the best way to represent that data either (or even possible… “What is the mean value of granite and shale?”).

Dealing with missing data will usually require 4 steps:
  1. Check for missing values. Are the NaN, or some other representation of NULL? This can be subtly misleading as well. Sometimes -999 can be used to represent a NULL value, which can have obvious, harsh, unintended consequences.
  2. Analyze the amount of missing values, and the type. If it’s small enough, maybe we can get away with dropping the features or instances that contain the missing data. How random is the missing data?
  3. Either delete or impute these values. We can use mean, median, most frequent, or use other domain specific information to help impute these values. We can even build another machine learning algorithm using the feature with the missing values as the target to try and “guess” the value.
  4. Evaluate and compare the performance of each imputed option. The iterative process, as always.

This specific blog will focus on step 2.

The 3 types of missing data you can come across are:
  • Missing Completely At Random (MCAR) In this case the probability of one variable missing does not correlate to another variable that is also missing. This implies there is complete randomness in the missing data, however real data is rarely like this.
  • Missing At Random (MAR): In this case the probability of one variable missing is related to another measured variable. As an example, an exploration company only sends assay samples testing for uranium, if radiometric data is over 200 cps.
  • Not Missing At Random (NMAR): The probability of a missing values is correlated to that variable itself. Hypothetically, if a geochemical database has replaced values that are lower than detection limit with NaN, this would be NMAR.

When trying to classify the type (or types) of missing data in a feature, common sense and knowledge of the data will always prevail. How and why it was collected and stored will be your first task as a “data detective”.

If that fails, we can do some statistical testing, using typical t-tests and using the python package missingno, which allows us to use some nice visualizations. Using these tests, we can determine two cases, either the data is MCAR, or it is MAR or NMAR. Unfortunately, we cannot determine between MAR and NMAR using statistical methods.

3 ways missingno can visualize missing data is through a matrix, a heatmap, and a dendrogram. Below are examples from the Cat in the Dat II dataset and Missing Migrants dataset both available in Kaggle.

Cat in the Dat Matrix – MCARMatrix-Cat

Missing Migrant Matrix – MAR or NMARMatrix-Migrants

Cat in the Dat Heatmap – MCARHeatmap-Cat

Missing Migrant Heatmap – MAR or NMARHeatmap-Migrants

Cat in the Dat Dendrogram – MCARDendrogram-Cat

Missing Migrant Dendrogram – MAR or NMARDendrogram-Migrants


Tags: ,
Posted in machine learning, statistics | No Comments »

Univariate Statistics Part 4, Methods to compare distributions

Comparing Distribution. Chi-square and other robust methods.

Most of the discussion previously has assumed normal distributions, and given some options for nonparametric situations, but we need tests to determine what type of distribution we are dealing with.

The chi-square goodness-of-fit test is the first, and simplest way to do this. It compares a known distribution to the empirical data set. However, it does require a large enough sample size, and can be influenced by binning choices.

For example, we first assume is that two distributions are equal, and we reject this if we have sufficient proof they are different. The test statistic for a chi-square goodness-of-fit is:


Let’s do a simple example of a 6-sided dice toss. We expect an even distribution from 60 tosses, 10 for each possibility. That is 1 through 6 should have the same frequencies. From 60 tosses we get, 10 ones, 14 twos, 6 threes, 18 fours, 2 fives, and 10 sixes.

Chi-Squared-Test Example

Chi-Squared-Test Example 2

This is true, so we reject or null hypothesis. This die is loaded (with a significance level of 0.05). However, if we wanted to use a significance level of 0.01, our critical value would be 16.81, and we would not be able to reject our null hypothesis. Maybe this die isn’t so loaded after all.

Robust methods to estimate distribution

With the chi-square test only applicable to large data sets (n>30), we need a test to use for smaller samples. The Kolmogorov-Smirnov (K-S) test fills this need. The test statistic relies on the maximum distance between the two cumulative probability curves.

K-S Example from Wikipedia

K-S example from Wikipedia

K-S Test statistic

CDF is the theoretical cumulative distribution function, and EDF is the empirical distribution function.

The two major advantages of this test is the statistic itself doesn’t depend on the CDF being tested. Also, this is an exact test, that doesn’t depend on an adequate sample size.

We reject our null hypothesis, if the test statistic D, is greater than the critical value we get from our table. There are a variety of ways to get the D value but they should be equivalent. However, you should ensure that the statistic is calculated in a way that matches the critical value table used.

Tags: , ,
Posted in statistics, univariate | No Comments »

Univariate Statistics Part 3, How do you comparing central tendency

These are tests used to compare the averages between to distributions. Comparing two means, we could compare the mean to a predefined value (e.g., Are these sandwiches on average different that Subway’s 12” claim?). Or we could compare two means from two separate sample sets together (e.g., Do the sandwiches at this Subway differ from the sandwiches at another?). Both tests use similar methods, however if we are comparing to a predefined value, we use a normal distribution, and if we are comparing two sample sets, we use a t-distribution.

We also must consider sample size, as with any sample set under 30, we cannot assume our standard deviation is accurate. Therefore, with sample sizes under 30 we use the t-distribution. We therefore have a few scenarios to consider:

  • One sample vs. a predefined population mean
    • Large sample size
    • Small sample size
  • Two samples against each other
    • Large sample size
    • Small sample size
  • Paired experiments

One sample vs. a predefined population mean

Our t-scores or z-scores are calculated as:

Let’s go back to our sandwiches example. We want to know whether our sandwiches are really 12” in length. We have 10 sandwiches and measure them (11.9, 11.8, 12.1, 11.9, 12.0, 12.1, 11.9, 11.8, 12.2, and 12.0 inches). We have the following calculations for our test statistic:

We also must determine our significance level α, which we will say is 0.05. And state our hypotheses. We use a two-tailed test, so our hypotheses are:

We have a small sample size, so we use the t-distribution.

I used a t-distribution calculator to calculate the p-value of -1.108 and found that p(two-tailed) is 0.2966. As this is greater than our significance level of 0.05, we cannot reject the null hypothesis.

One thing to note, as sample size increases, the t-distribution begins to approximate normal distribution, so a general approach is to use the t-distribution.

Means of two different sample groups

Now when using the t-test on two different sample groups, with a small size, we can obtain a better estimate by pooling our sample variances. As long as our population variances are equal between the samples, the samples are independent to each other, and the samples follow a normal distribution.

For this example, we are testing to see if one granite (Granite A) is elevated in U. We have 30 samples of each granite (Granite A, and Granite B). Granite A has a mean U of 923 ppm, and a standard deviation of 433 ppm. Granite B has a mean U of 723 ppm, and a standard deviation of 399 ppm.

We define our significance level at α=5%. That is, we want less than a 5% probability of getting 200 ppm difference between the two samples, or less than a 5% chance of incorrectly rejecting the null hypothesis.

We look at the critical Z value on a normalized normal distribution chart and find that for a 95% confidence level on our one-tailed test, we need a critical value of 1.65 (or 1.65 standard deviations from the mean).

The distance from the mean to the critical value of 1.65 is:

Based on this information, we know that there is a 5% chance of obtaining a difference of 177 ppm or higher. Our observed difference is 200 ppm. As this is higher than our significance level, we can reject our null hypothesis and state that Granite A is elevated in U compared to Granite B.

Paired experiments

Once we begin comparing different methods to the same population, we lose the requirement that our samples are independent. The variance between samples is greater than the variance between methods (i.e. Rock A analyzed by method A, will be close to Rock A analyzed by method B, however Rock B to Z will show much more variability in either method). Because of this, we cannot pool our variances as before. We instead use pairwise differences, di, and consider this to follow a t-distribution. d̄1, and sD, are the mean and standard deviation. nD is the number of pairs. D0, is the value of our null hypothesis, and μD is the observed value we are testing against. We also make our assumption that the distribution of the differences is normal, and the differences are a random sample. Our test statistic is:

This paired experiment is a simplified example of blocking, where comparisons are made between similar experimental units. This blocking needs to be done prior to performing the experiment.

Wilcoxon sign rank test can be used when we can’t assume a normal distribution in a paired experiment.

Let’s do an example of a paired experiment using the Wilcoxon Sign rank test. We will use the fictional data below to determine if Arsenic (As) is increasing in soil samples over a 1-year period.

In this sample data set, I used a Shapiro-Wilk test to test for normality (α=0.05). For February 2018, we calculated a p-value of 0.457 that our sample set is normal, so we assume that it is normal. For February 2019, our calculated p-value is 0.047, this is statistically significant and therefore we assume it is not normally distributed.

As usual we state our hypotheses:

We state our level of significance at α=0.05, and our number of observations n=9(we ignore values with no differences such as location 4). First, we need to calculated some new information for our test, the absolute difference, and the rank.

As this is a Wilcoxon signed rank test, we need to know the rank sum of the negative and positive differences. We ignore sample pairs with no differences.

Our Wilcox test statistic is the smallest of these two calculations, therefore ?stat=13. We use a table to find that the critical value for ?=10,??? ?=0.05, is ?crit=8. Therefore, ?stat > ?crit , and we cannot reject the null hypothesis. If our critical value was greater than our test statistic, we would be able to reject the null.

Some robust methods for comparing two medians

The two most powerful tests when normality is not certain, are the Mann-Whitney and the Terry-Hoeffding test.

The Mann-Whitney, Wilcoxin rank-sum, or the U-test ranks the two groups into one combined set and calculates the sum of ranks for each group R1 and R2.

The smaller of these two values is taken. If the null hypothesis is true, both values should be equal. For n1 and n2 ≤ 20, the critical values table is consulted. A significant value is observed if the calculated U is less than the critical value. Larger data sets are converted to standard normal by:

Where t is the number of tied values for a given rank. In this case, we have the opposite scenario where a significant value is observed with the Z is calculated larger than the critical value.

The Terry-Hoeffding test is performed similarly but uses the sum of the rankits with the smaller group of data instead of the raw ranks.

Where E(N,Ri ) is the rankit for the Rth rank of N values. Resulting in a test with greater power. T is used to do a one-sided test to determine if the smaller group has a larger median, -T to test if the smaller group has a smaller median, and the absolute value of T for two-sided tests. For sample sizes greater than 20, the approximated critical value is:

In this case, r is the critical value of Pearson’s r, given that N and n2 are sufficiently large.

If doing a two-tailed test, testing the null hypothesis for equality, we reject the null when:

If doing a one-tailed test, we reject the null when:

Posted in statistics, univariate | No Comments »

Measures of spread and scale

Often, we may want to compare our sample variance, with a hypothetical population variance, in order to ensure it doesn’t exceed a certain value. Back to our sandwiches, perhaps Subway is okay with footlongs varying in length by no more than 1”. The test statistic we use is the chi-squared distribution with n-1 degrees of freedom.

In this test we hypothesize that H0: s2 = σ02, or that our hypothesized value is equal to our sample value. Our alternative hypothesis could be Ha: s2 > σ02, for a one-tailed test. This would imply that χ2, would get larger and larger as s2 increases in comparison to our hypothesized σ2. Alternatively, the opposite scenario would imply small values show evidence against our null hypothesis. With two-tailed tests, we reject our null with very large, and very small values of χ2.

As an example, let’s imagine a quality control experiment to ensure that 500 gram sand samples taken in the field do not exceed a variance of 50 grams2. Our hypothesis could be:

We take 9 samples, and each have the weights 455, 460, 473, 496, 503, 512, 523, 527, 540 grams. Our s2is calculated to be 818.6172. Now we calculate our χ2:

Using R, or a chi-squared table, we find that the p-value for our chi-square score is <0.00001. Although we haven’t stated an acceptable α, we can say there is strong evidence with p-value <0.0001 that the true variance of our sample weights is greater than 50 grams2.

If we are comparing two sample variances, we use the F distribution. In this case we have a sample variance s12, from n1 observations with a variance of σ12. And another sample variance s22, from n2 observations with a variance of σ22. Both have independent observations, and both are from normally distributed populations. As usual we test: H0: σ12= σ22, against either:

Let’s do another example. Does a sample method that collects 2000 gram samples have a greater variability than one that collects 500 gram samples?

Using a computer, we find the p-value of F = 2.785 with 10, 6 degrees of freedom is 0.111225. Our p-value is greater than our α(0.05), and therefore the result is not significant, and we cannot reject our null hypothesis.

We must mention a few caveats. Some sample sizes are very small, and larger sample sizes would be ideal. And secondly, normality is very important for both chi-squared and F distributions. If normality is not certain, non-parametric tests should be done.

The most powerful non-parametric test (Rock, 1988a), would be the Klotz test, based on the squares of normal scores. Normal scores are, Ai, where Ai= φ-1(Ri/(N+1)), and φ-1 is the percent point (cumulative distribution) function of the standard normal distribution, Ri is the rank of the i-th observation, and N is the sample size. The test statistic is calculated as follows:

As usual, if the calculated value is below the test value from the normal distribution, we accept the null hypothesis, if not we reject in favor of the alternative. That is, where α is the significance level:

Tags: ,
Posted in statistics, univariate | No Comments »

We often would like to compare a population parameter to our sample. We may know this parameter in advance, or we may want to compare it to another sample. Suppose we want to know for sure Subway is giving us a real footlong sandwich when we ask for one. We buy 30 12″ sandwiches and measure their lengths to be 11.5″ on average. Is this close enough to 12″ to accept Subway’s claim? Or we could have 30 samples from 2 different granites. Are these two granites geochemically equivalent or not?

In order to do this we follow a 5-step process

  1. Formulate a null hypothesis, H0 (what we assume to be true), and our alternative hypotheses, Ha (which we only accept with sufficient evidence to reject the null)
  2. Specify our level of significance. How important is it to avoid Type I errors. That is rejecting the null hypothesis, when it was in fact true. This could be very important when discussing the legal system and whether or not a person is innocent of a crime. It may be of less or equal importance in Subway sandwiches depending on your love of sandwiches.
  3. Calculate our test statistic.
  4. Define the region of rejection.
  5. Either accept our null hypothesis, or reject it in favour of our alternative hypothesis.

In our hypothesis testing, we can make two errors. 

  • Type I errors occur when we reject our null hypothesis, but the null is true. For example, our justice system assumes a person is innocent. If we reject this, and put an innocent person in jail, we have made a type I error.
  • Type II errors occurs when we accept the null, but our alternative was true. In this case, a guilty person gets away without punishment.

Although often type I errors are considered worse, this is not always true. Knowing the difference and the consequences for each error can impact your decisions. These errors are correlated, if we try to decrease the number of type II errors, the number of type I errors will increase.

This is the probability that the observed relationship between two sample sets is due to randomness, and that no difference exists between them. The p-value helps us quantify this significance if it exists. It is a decreasing index of the reliability of the results. The smaller the p-value, the more reliable we believe that relation. Or in other words, a p-value of 0.05 indicates that we have a 5% probability of rejecting our null hypothesis, when in fact it was correct (i.e. making a type I error). Or we have a 1/20 chance that our test would result in a relationship that is equal to or stronger than what we have observed.

We can also look at it as all hypothesis tests are in a sense “negative”. We can never 100% reject the null hypothesis, but we can disprove it to some level of significance. For example, α = 0.05 (5% level of significance).

power of a test is the probability that our test rejects the null hypothesis, when the alternative hypothesis is true. We must specify the level of significance (α), before we do our test (or the probability of a type I error). If we want to minimize our type II errors (β), we express our null hypothesis with the intent to reject it. The power of a test is then 1-β.

Let’s do an example. Let’s assume we have 49 samples from a normally distributed population. We know σ = 14, but μ is unknown. We start with our hypothesis test, we suggest our population mean is 50, and test it against a one-sided alternative that the population mean is less than 50.

H0: μ = 50Ha: μ < 50

We set our significance level at α = 0.05. We now pick our appropriate test statistic. We have a normal distribution, and we know our population standard deviation, the appropriate test statistic is:

We calculate which values of Z we reject our null hypothesis. We check our Z score from a Z table, and find that the Z value with an area of 0.05 to the left is -1.64. We therefore reject if Z ≤ -1.64. To facilitate our calculations of the power of the test, we express the rejection region as . We need to find what values of will we reject the null hypothesis. By expressing our Z and calculating  as follows:

This means we have a 0.05 probability of committing a Type I error when our sample mean is 46.72 or less.

Now, if µ = 45, and our null hypothesis wrongly states µ = 50, what is the probability of a Type II error? That is, what’s the probability of not rejecting the null when it is in fact false.

We standardize as follows:

We look up in the Z table and find the area to the right of Z approximately 0.469. Therefore, our probability of a Type II error is β=0.469. And the probability of rejecting the null hypothesis when it is false, that is the power of the test, is 1 – β = 1 – 0.469 = 0.531!

If my explanations leave you with a few questions, which wouldn’t surprise me in the slightest, check out this jbstatistics video (or many of the other ones which are much better at showing graphics to explain the concepts and helped me understand them to begin with.

Tags: , , ,
Posted in statistics, univariate | No Comments »

Exploratory Data Analysis and data types in a geological context


Most often in university statistics courses, parametric techniques are given the primary focus. These techniques involve summarizing distributions with typical parameters like mean, median, mode, variance, and standard deviation. The focus is on central tendency and the spread of subsets of the populations to make general conclusions and determine equivalence.

This has some limitations when dealing with geological problems. Samples are typically not drawn from a normal population, and despite common usage, typically are not from log-normal populations either. Also, it is often the goal to identify unusual values, outliers, and not general conclusions about the entire population like mean and median values would.

The goal of the next series of blog posts is to go over these classical statistical methods along with some more robust nonparametric techniques. These techniques will be less sensitive to outlying values, skewed distributions, which are more relevant to geologists. We will discuss 1 and 2 variable, and multivariate scenarios, all with the goal of better understanding the data set and visualizing variations. The end goal is often to find unexpected results and generate new ideas.

There have been huge advances both in the types of data we collect and the amount over the past 20 years. We are getting so good at collecting large amounts of data, but very little has been done in developing new forms of statistical analysis. Spatial and time variables are also special types of data that need to be treated different and in their own context. The availability of open source software and advances in hardware has made analysis less expensive but can also lead to inappropriate methods.

The test data set I will use in this report is from the Unearthed competition data. The Explorer Challenge Unearthed is from a competition in South Australia where OZ Minerals opened up their data from the Mount Woods project. It lies close to the Prominent Hill copper and gold mine, but no discovery had been made at Mount Woods. This region is known for its copper and gold deposits, but any economic mineral could be of interest.

2 Terabytes of data is available for testing statistical methods. Typical economic geology hurdles exist. Large scale deposits are a result of a combination of factors and very rare. Deposits are often at depth, making collecting data difficult. It’s also difficult to distinguish ore-grade deposits from unmineralized rocks. When choosing what statistical method to use it’s important to determine the data type

Categorical data can be divided into nominal or ordinal

  • Nominal data is data that has two or more categories, but no intrinsic order. An example of this could be lithological rock types. 
  • Ordinal data is organized into two or more categories as well, but these categories can be ranked. An example of this could be descriptions of alteration as trace, weak, moderate, or strong.

Continuous variables can be divided into interval or ratio data. 

  • Interval data is continuous data measured along a continuum. Temperature in Celsius or Fahrenheit is an example of this type of data. The difference between 5 degrees and 15 degrees is the same as between 15 degrees and 25 degrees. 
  • Ratio data is the same, with the condition of 0 being equal to none of that variable. For example, Kelvin scale, or in our case geochemical assays. Most natural science data is of this type.

Table 1

The data available includes a drillhole database with 678 drillholes, with 659 geological logs, and 585 chemical assays of up to 55 elements. It is not known if the distribution of the various features of the data is normal (Gaussian), log-normal, or neither. There is a variety of types of data available for testing. The drillhole data will be the bulk of the data used for testing in this report. 27,000 core tray photographs are available but won’t be used for this study.

Raw geophysics data is also available for test data. Other possible data sets to use could be regional airborne magnetics and radiometric surveys, prospect-level ground gravity surveys, 2D seismic surveys, regional magnetotellurics, regional and prospect-scale induced polarisations surveys, and multi-field inversion and modelling around Prominent Hill.

Data wrangling, cleaning, analyses and presentations will all be done using Microsoft Excel and Python 3.x.

Tags: ,
Posted in geology, statistics | No Comments »