Archive for the ‘geology’ Category

The Wilcoxon Signed-Rank Test: A Geological Example

Geological data is often not normally distributed, and parametric methods should not be used on data that is not normally distributed. In this case, I did a paired experiment, which is a simplified example of blocking, where comparisons are made between similar experimental units. This blocking needs to be decided before the experiment is performed.

The Wilcoxon signed-rank test can be used when we can’t assume a normal distribution in a paired experiment.

We will use the fictional data below to determine whether arsenic (As) is increasing in soil samples over a one-year period.

Location ID | Feb 2018 (As ppm) | Feb 2019 (As ppm)
1           | 18                | 11
2           | 12                | 7
3           | 17                | 16
4           | 10                | 10
5           | 14                | 17
6           | 8                 | 10
7           | 9                 | 15
8           | 13                | 17
9           | 12                | 20
10          | 9                 | 35
Figure 1: Fictional As soil sample data.

For this sample data set, I used a Shapiro-Wilk test for normality (α = 0.05) (“Shapiro-Wilk Test Calculator”, 2020). For February 2018, the calculated p-value is 0.457, so we cannot reject normality and we treat the sample as normally distributed. For February 2019, the calculated p-value is 0.047; this is statistically significant, so we conclude the sample is not normally distributed.
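For readers who prefer code over an online calculator, here is a minimal sketch of the same normality check in Python with scipy, using the fictional values from Figure 1:

```python
from scipy import stats

feb_2018 = [18, 12, 17, 10, 14, 8, 9, 13, 12, 9]
feb_2019 = [11, 7, 16, 10, 17, 10, 15, 17, 20, 35]

# Shapiro-Wilk null hypothesis: the sample is drawn from a normal distribution.
for label, sample in [("Feb 2018", feb_2018), ("Feb 2019", feb_2019)]:
    w, p = stats.shapiro(sample)
    verdict = "reject normality" if p < 0.05 else "cannot reject normality"
    print(f"{label}: W = {w:.3f}, p = {p:.3f} -> {verdict}")
```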

As usual we state our hypotheses: the null hypothesis H0 is that there is no change in As levels between the two years (the median difference is zero), and the alternative hypothesis H1 is that As levels have increased (a one-sided test).

We state our level of significance at α = 0.05, and our number of observations is n = 9 (we ignore pairs with no difference, such as location 4). First, we need to calculate some new columns for our test: the difference, the absolute difference, and the rank of the absolute difference.

Location ID | Feb 2018 (As ppm) | Feb 2019 (As ppm) | Difference | Absolute Difference | Rank
1           | 18                | 11                | -7         | 7                   | 7
2           | 12                | 7                 | -5         | 5                   | 5
3           | 17                | 16                | -1         | 1                   | 1
4           | 10                | 10                | 0          | 0                   | (ignored)
5           | 14                | 17                | 3          | 3                   | 3
6           | 8                 | 10                | 2          | 2                   | 2
7           | 9                 | 15                | 6          | 6                   | 6
8           | 13                | 17                | 4          | 4                   | 4
9           | 12                | 20                | 8          | 8                   | 8
10          | 9                 | 35                | 26         | 26                  | 9
Figure 2: Calculation for the Wilcoxon signed rank test.

As this is a Wilcoxon signed-rank test, we need the rank sums of the negative and positive differences, ignoring sample pairs with no difference. Here the negative rank sum is w- = 7 + 5 + 1 = 13, and the positive rank sum is w+ = 3 + 2 + 6 + 4 + 8 + 9 = 32.

Our Wilcoxon test statistic is the smaller of these two sums, therefore wstat = 13. We use a table to find that the critical value for n = 9 and α = 0.05 (one-tailed) is wcrit = 8 (“Wilcoxon Signed-Ranks Table”, 2020). Therefore wstat > wcrit, and we cannot reject the null hypothesis: it is possible there has been no change in As levels between the two years. If our test statistic had been less than or equal to the critical value, we would have been able to reject the null hypothesis and conclude there was a statistically significant change in As values.
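As a cross-check, here is a minimal sketch of the same test in Python with scipy. Note that for a one-sided alternative scipy reports the sum of the positive-difference ranks (w+ = 32 here) as its statistic rather than the smaller sum, but the p-value answers the same question:

```python
from scipy import stats

feb_2018 = [18, 12, 17, 10, 14, 8, 9, 13, 12, 9]
feb_2019 = [11, 7, 16, 10, 17, 10, 15, 17, 20, 35]

# One-sided alternative: As levels increased from 2018 to 2019.
# zero_method="wilcox" drops the zero difference at location 4, giving n = 9.
res = stats.wilcoxon(feb_2019, feb_2018,
                     zero_method="wilcox", alternative="greater")
print(f"statistic = {res.statistic}, p-value = {res.pvalue:.3f}")

if res.pvalue < 0.05:
    print("Reject H0: As levels have increased.")
else:
    print("Cannot reject H0: no detectable increase in As levels.")
```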

Posted in geology, statistics, univariate | No Comments »

Exploratory Data Analysis and data types in a geological context

Introduction

Most often in university statistics courses, parametric techniques are given the primary focus. These techniques involve summarizing distributions with familiar parameters like the mean, median, mode, variance, and standard deviation. The focus is on the central tendency and spread of samples drawn from populations, in order to draw general conclusions and test for equivalence.

This has some limitations when dealing with geological problems. Samples are typically not drawn from a normal population and, despite common usage, typically not from log-normal populations either. Also, the goal is often to identify unusual values (outliers), not to draw general conclusions about the entire population the way mean and median values would.

The goal of the next series of blog posts is to go over these classical statistical methods along with some more robust nonparametric techniques, which are less sensitive to outlying values and skewed distributions and therefore more relevant to geologists. We will discuss univariate, bivariate, and multivariate scenarios, all with the goal of better understanding the data set and visualizing variation. The end goal is often to find unexpected results and generate new ideas.

There have been huge advances in both the types and the amount of data we collect over the past 20 years. We have become very good at collecting large amounts of data, but comparatively little has been done to develop new forms of statistical analysis. Spatial and time variables are also special types of data that need to be treated differently and in their own context. The availability of open source software and advances in hardware have made analysis less expensive, but can also lead to the use of inappropriate methods.

The test data set I will use in this report comes from the Unearthed Explorer Challenge, a competition in South Australia where OZ Minerals opened up its data from the Mount Woods project. Mount Woods lies close to the Prominent Hill copper and gold mine, but no discovery has yet been made there. The region is known for its copper and gold deposits, but any economic mineral could be of interest.

Two terabytes of data are available for testing statistical methods, and the typical economic geology hurdles exist: large-scale deposits result from a rare combination of factors; deposits are often at depth, making data collection difficult; and it is hard to distinguish ore-grade deposits from unmineralized rocks. When choosing a statistical method, it is important to first determine the data type.

Categorical data can be divided into nominal or ordinal:

  • Nominal data is data that has two or more categories, but no intrinsic order. An example of this could be lithological rock types. 
  • Ordinal data is organized into two or more categories as well, but these categories can be ranked. An example of this could be descriptions of alteration as trace, weak, moderate, or strong.

Continuous variables can be divided into interval or ratio data (a short pandas sketch of all four data types follows below):

  • Interval data is continuous data measured along a continuum, with no true zero. Temperature in Celsius or Fahrenheit is an example of this type of data: the difference between 5 degrees and 15 degrees is the same as between 15 degrees and 25 degrees.
  • Ratio data is the same, with the added condition that 0 means none of that variable, for example the Kelvin scale or, in our case, geochemical assays. Most natural science data is of this type.
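As a minimal sketch of how these four types might be represented in Python with pandas (the column names and values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    # Nominal: categories with no intrinsic order.
    "lithology": pd.Categorical(["granite", "basalt", "gneiss", "basalt"]),
    # Ordinal: ordered categories (alteration intensity).
    "alteration": pd.Categorical(
        ["trace", "strong", "weak", "moderate"],
        categories=["trace", "weak", "moderate", "strong"],
        ordered=True,
    ),
    # Interval: zero is arbitrary (degrees Celsius).
    "temp_c": [5.0, 15.0, 25.0, 10.0],
    # Ratio: zero means none of the variable (Cu assay in ppm).
    "cu_ppm": [120.0, 3400.0, 85.0, 560.0],
})

# Ordered categories support meaningful comparisons; nominal ones do not.
print(df["alteration"] >= "moderate")
```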


The available data includes a drillhole database with 678 drillholes, 659 geological logs, and 585 chemical assays of up to 55 elements. It is not known whether the distributions of the various features are normal (Gaussian), log-normal, or neither. There is a variety of data types available for testing. The drillhole data will form the bulk of the data used in this report; 27,000 core tray photographs are also available but won’t be used for this study.

Raw geophysics data is also available as test data. Other possible data sets include regional airborne magnetics and radiometric surveys, prospect-level ground gravity surveys, 2D seismic surveys, regional magnetotellurics, regional and prospect-scale induced polarisation surveys, and multi-field inversion and modelling around Prominent Hill.

Data wrangling, cleaning, analysis, and presentation will all be done using Microsoft Excel and Python 3.x.

Posted in geology, statistics | No Comments »

Orange: A Data Mining Tool

To continue from my previous post, I will introduce a great tool for basic data mining and machine learning that absolutely any geologist can use with no programming knowledge needed. That tool is Orange. It is free, open source, and intuitive. It can be used to simply visualize your data, and can even go as far as applying some of the most common machine learning algorithms that are used today.

Above we see the initial set-up of a project in Orange. On the left are the various widgets available to us; in the middle is our workflow. Typically we would start with a File widget and link it to a data file on our PC (e.g. a CSV file). In this case, the data provided to us was 1,729 samples of various rock types, location names, and a few geochemical assay results.

We can also attach a Data Table to the file, to allow us to view the data in a familiar table format.

And as with any exploratory data analysis, some visualization is a good place to start. Here we can simply connect a Scatter Plot (under the Visualize tools) to our file.

With any machine learning algorithm, an important step is to normalize the data. In this case, we will center by mean and scale by standard deviation.
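For comparison, a minimal sketch of this same preprocessing step in Python with scikit-learn (the feature values here are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical assay values: rows are samples, columns are features.
X = np.array([[12.0, 340.0],
              [ 7.5, 120.0],
              [19.2, 980.0]])

# Center by mean and scale by standard deviation (z-scores),
# the same normalization described above.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))  # ~0 for each column
print(X_scaled.std(axis=0))   # ~1 for each column
```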

From here we continue on with our machine learning workflow. Below is an example of a basic, completed project that predicts the rock name for the remaining data. In this case, Random Forest performed best and was used for the prediction.
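A rough Python sketch of an equivalent scikit-learn workflow is below. The features, labels, and split are hypothetical stand-ins for what the Orange widgets do, so on this random data accuracy will sit near chance; real assay data would carry actual signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: assay features and rock-name labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))  # e.g. 4 geochemical assays per sample
y = rng.choice(["granite", "basalt", "gneiss"], size=200)

# Hold out data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):.2f}")
```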

As you can see, this is just scraping the surface of Orange (or I suppose the peel!). There are numerous tutorials online that do a much better job of getting into the nitty-gritty, as I am myself just starting to use it. Python itself is still more powerful and more flexible; in fact, Orange uses Python as its backend. However, I expect you can go very far with Orange. And although you can quickly start playing around with some machine learning, knowing how to set up training and test data, and how to interpret the results, still requires careful thought.

Posted in geology, machine learning | No Comments »

Course Introduction and Reasons Why Machine Learning Projects Fail

I just had the privilege of attending the short course titled above at PDAC 2019, and I would like to thank the course instructors.

First I will give a quick overview of the first day, where we went into the history of machine learning and some of the basics, starting by clearly defining what Artificial Intelligence is versus Machine Learning. AI involves building machines that react like humans. To give an example, the new “Turing Test” would be to ask a machine, “Can you go into the house and make me a cup of coffee?” True AI should be able to do this, and we are nowhere near that point.

Machine learning, on the other hand, is a subset of AI that involves using algorithms to make predictions and classifications based on a large set of training data. A single algorithm can adapt and change its own parameters to solve a number of problems. Machine learning can be supervised, where we provide the labels for the data (e.g. rock names, ore, waste, etc.), or unsupervised, where data is clustered based on similarities. Reinforcement learning is another field, focused on performance, which involves finding a balance between exploration and exploitation (e.g. multi-armed bandit problems). A humorous quote captured the difference:

“If you’re seeing it in PowerPoint, it’s artificial intelligence. If you are seeing it in python, it’s machine learning.”
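To make the supervised/unsupervised distinction above concrete, here is a minimal scikit-learn sketch on hypothetical data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Hypothetical assay features for a set of samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Supervised: we provide the labels (e.g. ore vs waste) for training.
y = rng.choice(["ore", "waste"], size=100)
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: no labels; samples are clustered by similarity.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])
```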

The cases where we will see machine learning perform best are in automating menial tasks (e.g. core logging, autonomous driving, and drilling), dealing with highly complex data in which humans are not capable of seeing trends (e.g. exploration with many layers of 3D data), and cases where rapid reaction time is necessary (e.g. real-time geometallurgy).

One important thing to keep in mind: this will always be a tool for the geologist to use, not something to replace the geologist entirely. Data must be collected and curated competently, and must be interpreted properly afterwards.

However, this tool has the potential to greatly enhance geologists’ ability to do both of these things.

A number of other key terms were discussed, like cost functions, precision, recall, F-scores, ROC curves, overfitting, and underfitting, all of which deserve their own discussions in later posts.

We also went over reasons why machine learning projects fail, which I believe deserve some specific attention:

  • Asking the wrong questions: A specific goal should be delineated before the process begins. This allows you to focus resources on the kinds of data that need to be collected. Aimlessly looking through data is a dangerous endeavor as well; as humans, we are notorious for seeing patterns that don’t exist.
  • Lack of firm support from key stakeholders: Data science projects often have impacts across many departments in an organization. Defining the strategy keeps the project on track and prevents stakeholder apathy.
  • Data problems: This is a problem I’m particularly familiar with. Poor quality, inconsistency, and incompleteness of data are frequently major problems (a PDF is not a geophysical survey). If there is not enough data, a data scientist should reserve the right to ask for more, and data collection and data wrangling are often going to be a large part of the job.
  • Lack of the right data science “team”: Even within pure data science teams, you will rarely find one person who does everything. There are data engineers, data scientists, and data analysts, with experience in exploratory data analysis, statistics, coding, feature engineering, visualization, and storytelling. This is on top of the absolutely essential domain knowledge that geologists provide. Relying on a single unicorn can also set you up for a failed project should that person become unavailable in the middle of it.
  • Overly complex models: As is often the case, keeping it simple can lead to better results.
  • Over-promising: Particularly with the increased interest in this area of research, keeping expectations reasonable is important. Improvements often don’t occur right away, as each project requires its own solutions and refinements over time.

That’s it for now, but I’ll post again shortly about a great new tool for geologists that requires no coding-savvy at all… Orange!

Posted in geology, machine learning | No Comments »