# Exploratory Data Analysis and data types in a geological context

## Introduction

University statistics courses usually give parametric techniques the primary focus. These techniques summarize distributions with a handful of values such as the mean, median, mode, variance, and standard deviation. The emphasis is on central tendency and spread, with subsets of a population compared in order to draw general conclusions and determine equivalence.

This approach has limitations when dealing with geological problems. Samples are typically not drawn from a normal population and, despite the common assumption, are typically not drawn from a log-normal population either. In addition, the goal is often to identify unusual values (outliers) rather than to draw general conclusions about the entire population, which is what mean and median values describe.

The goal of the next series of blog posts is to go over these classical statistical methods along with some more robust nonparametric techniques. The robust techniques are less sensitive to outlying values and skewed distributions, which makes them more relevant to geologists. We will discuss univariate, bivariate, and multivariate scenarios, all with the goal of better understanding the data set and visualizing variations. The end goal is often to find unexpected results and generate new ideas.
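To make the sensitivity to outliers concrete, here is a minimal sketch comparing classical summaries (mean, standard deviation) with robust ones (median, median absolute deviation). The assay values are invented for illustration and are not taken from the Mount Woods data, and the helper name `mad` is my own.

```python
import statistics


def mad(values):
    """Median absolute deviation: the median of |x - median(x)|."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])


# Invented copper assays in ppm (illustration only, not Mount Woods data).
background = [32, 35, 38, 40, 41, 43, 45, 47, 50, 52]
with_outlier = background + [4800]  # one strongly anomalous sample

for label, data in [("background", background), ("with outlier", with_outlier)]:
    print(
        f"{label:>12}: mean={statistics.mean(data):.1f} "
        f"stdev={statistics.stdev(data):.1f} "
        f"median={statistics.median(data):.1f} "
        f"MAD={mad(data):.1f}"
    )
```

A single anomalous sample pulls the mean and standard deviation far away from the background values, while the median and MAD barely move. That stability is what makes the robust summaries a useful baseline against which to flag outliers.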

Over the past 20 years there have been huge advances in both the types and the amount of data we collect. We have become very good at collecting large amounts of data, but comparatively little has been done to develop new forms of statistical analysis. Spatial and time variables are also special types of data that need to be treated differently and in their own context. The availability of open source software and advances in hardware have made analysis less expensive, but they can also lead to the use of inappropriate methods.

The test data set I will use in this report is from the Unearthed competition data. The Explorer Challenge was an Unearthed competition in South Australia in which OZ Minerals opened up its data from the Mount Woods project. The project lies close to the Prominent Hill copper and gold mine, but no discovery had been made at Mount Woods. This region is known for its copper and gold deposits, but any economic mineral could be of interest.

Two terabytes of data are available for testing statistical methods. The typical economic geology hurdles exist. Large-scale deposits result from a combination of factors and are very rare. Deposits are often at depth, making data collection difficult. It’s also difficult to distinguish ore-grade deposits from unmineralized rocks.

When choosing which statistical method to use, it’s important to determine the data type.

Categorical data can be divided into nominal or ordinal data.

• Nominal data is data that has two or more categories, but no intrinsic order. An example of this could be lithological rock types.
• Ordinal data is organized into two or more categories as well, but these categories can be ranked. An example of this could be descriptions of alteration as trace, weak, moderate, or strong.
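The practical difference between the two categorical types is which operations make sense. The sketch below uses invented logging records (not the Mount Woods logs) and only the Python standard library; names such as `ALTERATION_ORDER` are my own.

```python
import statistics
from collections import Counter

# Invented logging records: (lithology, alteration intensity).
logs = [
    ("granite", "weak"),
    ("gneiss", "strong"),
    ("granite", "trace"),
    ("dolerite", "moderate"),
    ("granite", "moderate"),
]

# Nominal: categories have no order, so counting and the mode are meaningful.
lithology_counts = Counter(lith for lith, _ in logs)
most_common_lithology = lithology_counts.most_common(1)[0][0]

# Ordinal: categories have a rank, so sorting, min/max, and the median
# are meaningful as well. The spacing between ranks is NOT defined,
# so a mean of the rank codes should not be interpreted.
ALTERATION_ORDER = ["trace", "weak", "moderate", "strong"]
rank = {name: i for i, name in enumerate(ALTERATION_ORDER)}

ranks = [rank[alt] for _, alt in logs]
# median_low always returns a value that actually occurs in the data.
median_alteration = ALTERATION_ORDER[statistics.median_low(ranks)]
strongest_alteration = ALTERATION_ORDER[max(ranks)]

print(most_common_lithology, median_alteration, strongest_alteration)
```

Keeping the rank order in one explicit list avoids the common trap of sorting ordinal labels alphabetically, which would place "moderate" before "trace".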

Continuous variables can be divided into interval or ratio data.

• Interval data is measured along a continuous scale on which equal differences mean the same thing everywhere, but the zero point is arbitrary. Temperature in Celsius or Fahrenheit is an example of this type of data. The difference between 5 degrees and 15 degrees is the same as between 15 degrees and 25 degrees.
• Ratio data has the same properties, with the added condition that 0 means none of that variable is present, so ratios between values are meaningful. Examples are temperature on the Kelvin scale or, in our case, geochemical assays. Most natural science data is of this type.
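A short sketch of why the distinction matters: differences are valid on both scales, but ratios only make sense when zero means "none". The temperatures and assay values below are made up for illustration.

```python
import math


def celsius_to_kelvin(temp_c):
    return temp_c + 273.15


t1_c, t2_c = 10.0, 20.0
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k
print(math.isclose(diff_c, diff_k))

# Ratios are only meaningful on the ratio scale (Kelvin).
ratio_c = t2_c / t1_c  # depends on the arbitrary zero of Celsius
ratio_k = t2_k / t1_k  # physically meaningful
print(ratio_c, round(ratio_k, 3))

# Assays are ratio data: 0 ppm means no copper, so "twice as much" is valid.
cu_a_ppm, cu_b_ppm = 100.0, 200.0
cu_ratio = cu_b_ppm / cu_a_ppm
print(cu_ratio)
```

Saying that 20 °C is "twice as hot" as 10 °C is therefore misleading, whereas saying that a 200 ppm sample contains twice the copper of a 100 ppm sample is fine.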

The data available includes a drillhole database with 678 drillholes, 659 geological logs, and 585 chemical assays of up to 55 elements. It is not known if the distribution of the various features of the data is normal (Gaussian), log-normal, or neither. A variety of data types is available for testing. The drillhole data will be the bulk of the data used for testing in this report. 27,000 core tray photographs are available but won’t be used for this study.
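One rough way to screen a variable for normal versus log-normal behaviour is to compare the skewness of the raw values with the skewness of their logarithms. The sketch below is only an illustration on synthetic data generated with the standard library, not on the Mount Woods assays; skewness is a quick screen rather than a formal test, and the function name `skewness` is my own.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Synthetic "assays" drawn from a log-normal distribution (illustration only).
rng = random.Random(42)
assays = [rng.lognormvariate(3.0, 1.0) for _ in range(2000)]

skew_raw = skewness(assays)
skew_log = skewness([math.log(x) for x in assays])

print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
```

Strong positive skew in the raw values that largely disappears after the log transform is consistent with a log-normal distribution. If the log-transformed values are still clearly skewed, the variable is probably neither normal nor log-normal, which is the case where robust methods are most useful.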

Raw geophysics data is also available for testing. Other possible data sets include regional airborne magnetic and radiometric surveys, prospect-level ground gravity surveys, 2D seismic surveys, regional magnetotellurics, regional and prospect-scale induced polarisation surveys, and multi-field inversion and modelling around Prominent Hill.

Data wrangling, cleaning, analyses and presentations will all be done using Microsoft Excel and Python 3.x.