## Archive for the ‘univariate’ Category

### The Wilcoxon Sign Rank Test: A Geological Example

Geological data are often not normally distributed, and parametric methods should not be used on data that are not normally distributed. In this case, I did a paired experiment, which is a simplified example of blocking, where comparisons are made between similar experimental units. This blocking needs to be done prior to performing the experiment.

The Wilcoxon signed rank test can be used when we can’t assume a normal distribution in a paired experiment.

We will use the fictional data below to determine if Arsenic (As) is increasing in soil samples over a 1-year period.

In this sample data set, I used a Shapiro-Wilk test to test for normality (α=0.05) (“Shapiro-Wilk Test Calculator”, 2020). For February 2018, the calculated p-value is 0.457, so we cannot reject normality and assume the data are normal. For February 2019, the calculated p-value is 0.047; this is statistically significant, and we therefore assume the data are not normally distributed.

As usual we state our hypotheses: H0, there has been no change in As levels (the median difference is zero); Ha, As levels have increased (the median difference is greater than zero).

We state our level of significance at α=0.05, and our number of observations n=9 (we ignore values with no differences, such as location 4). First, we need to calculate some new information for our test: the absolute difference, and the rank.

As this is a Wilcoxon signed rank test, we need to know the rank sum of the negative and positive differences. We ignore sample pairs with no differences.

Our Wilcoxon test statistic is the smaller of these two rank sums, therefore wstat=13. We use a table to find that the critical value for n=10 and α=0.05 is wcrit=8 (“Wilcoxon Signed-Ranks Table”, 2020). Therefore, wstat>wcrit, and we cannot reject the null hypothesis. It is possible there has been no change in As levels between the two years. If our test statistic had been less than or equal to the critical value, we would have rejected the null and confirmed a statistically significant change in As values.
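Since the original data table isn’t reproduced here, a minimal sketch with hypothetical stand-in values (ppm, ten locations, one with no change) shows how the same test can be run in Python with scipy:

```python
# Wilcoxon signed-rank test on paired soil-As measurements (ppm).
# These values are hypothetical stand-ins, not the original data table.
from scipy.stats import wilcoxon

as_2018 = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 3.2, 4.4, 5.0, 4.1]
as_2019 = [4.8, 5.9, 4.1, 6.0, 5.6, 6.4, 3.7, 4.3, 6.0, 4.5]

# wilcoxon() drops zero differences (like location 4 above) automatically
# and returns the smaller of the two signed rank sums as the statistic.
stat, p = wilcoxon(as_2018, as_2019)
print(stat, p)
```

With these stand-in numbers only one difference is negative, so the smaller rank sum is tiny and the test rejects; with the real data you would compare wstat against the table value as above.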

## Comparing Distributions: Chi-square and other robust methods

Most of the discussion previously has assumed normal distributions, and given some options for nonparametric situations, but we need tests to determine what type of distribution we are dealing with.

The chi-square goodness-of-fit test is the first and simplest way to do this. It compares a known distribution to the empirical data set. However, it does require a large enough sample size, and it can be influenced by binning choices.

We first assume that the two distributions are equal, and we reject this assumption only if we have sufficient proof they are different. The test statistic for a chi-square goodness-of-fit is χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed frequency and Eᵢ the expected frequency in each bin.

Let’s do a simple example of a 6-sided die toss. We expect an even distribution from 60 tosses: 10 for each possibility, that is, 1 through 6 should have the same frequencies. From 60 tosses we get 10 ones, 14 twos, 6 threes, 18 fours, 2 fives, and 10 sixes.

Our calculated χ² is 16.0, and with k − 1 = 5 degrees of freedom the critical value at α = 0.05 is 11.07. Since 16.0 > 11.07, we reject our null hypothesis: this die is loaded (with a significance level of 0.05). Even at a stricter significance level of 0.01 the critical value is 15.09, which our statistic still exceeds, so the conclusion holds.
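For reference, the same calculation can be done in one line with scipy:

```python
# Chi-square goodness-of-fit for the 60-toss die example.
from scipy.stats import chisquare

observed = [10, 14, 6, 18, 2, 10]   # counts of faces 1..6 from 60 tosses
expected = [10] * 6                  # a fair die: 60 / 6 = 10 each

# statistic = sum((O - E)^2 / E), with k - 1 = 5 degrees of freedom
stat, p = chisquare(observed, expected)
print(stat, p)
```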

### Robust methods to estimate distribution

With the chi-square test only applicable to large data sets (n>30), we need a test to use for smaller samples. The Kolmogorov-Smirnov (K-S) test fills this need. The test statistic relies on the maximum distance between the two cumulative probability curves.

K-S example from Wikipedia

CDF is the theoretical cumulative distribution function, and EDF is the empirical distribution function.

The two major advantages of this test are that the statistic itself doesn’t depend on the CDF being tested, and that it is an exact test that doesn’t depend on an adequate sample size.

We reject our null hypothesis if the test statistic D is greater than the critical value we get from our table. There are a variety of ways to get the D value, but they should be equivalent. However, you should ensure that the statistic is calculated in a way that matches the critical value table used.
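As a quick sketch of how this looks in practice (in Python with scipy, using randomly generated sample data for illustration), the one-sample K-S test compares an empirical sample against a theoretical CDF:

```python
# One-sample K-S test: compare an empirical sample against a theoretical
# normal CDF. The sample here is randomly generated for illustration.
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=25)

# D is the maximum distance between the EDF and the theoretical CDF.
D, p = kstest(sample, norm.cdf)
print(D, p)
```

Note that small samples are fine here, which is exactly the advantage over chi-square.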

Posted in statistics, univariate

### Univariate Statistics Part 3, How do you compare central tendency?

These are tests used to compare the averages of two distributions. We could compare a mean to a predefined value (e.g., are these sandwiches on average different than Subway’s 12” claim?). Or we could compare the means of two separate sample sets (e.g., do the sandwiches at this Subway differ from the sandwiches at another?). Both tests use similar methods; however, if we are comparing to a predefined value, we use a normal distribution, and if we are comparing two sample sets, we use a t-distribution.

We also must consider sample size, as with any sample set under 30, we cannot assume our standard deviation is accurate. Therefore, with sample sizes under 30 we use the t-distribution. We therefore have a few scenarios to consider:

• One sample vs. a predefined population mean
    • Large sample size
    • Small sample size
• Two samples against each other
    • Large sample size
    • Small sample size
• Paired experiments

### One sample vs. a predefined population mean

Our t-scores or z-scores are calculated as z = (x̄ − μ0)/(σ/√n) when the population standard deviation σ is known, or t = (x̄ − μ0)/(s/√n) when we must estimate it with the sample standard deviation s.

Let’s go back to our sandwiches example. We want to know whether our sandwiches are really 12” in length. We have 10 sandwiches and measure them (11.9, 11.8, 12.1, 11.9, 12.0, 12.1, 11.9, 11.8, 12.2, and 12.0 inches). We have the following calculations for our test statistic: x̄ = 11.97, s ≈ 0.134, and t = (11.97 − 12)/(0.134/√10) ≈ −0.71.

We also must determine our significance level α, which we will say is 0.05, and state our hypotheses. We use a two-tailed test, so our hypotheses are H0: μ = 12 and Ha: μ ≠ 12.

We have a small sample size, so we use the t-distribution.

I used a t-distribution calculator to find the two-tailed p-value for t ≈ −0.71 with 9 degrees of freedom, which is about 0.50. As this is greater than our significance level of 0.05, we cannot reject the null hypothesis.
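A quick check of the same test in Python with scipy, using the ten listed measurements:

```python
# One-sample t-test of the sandwich lengths against the claimed 12".
from scipy.stats import ttest_1samp

lengths = [11.9, 11.8, 12.1, 11.9, 12.0, 12.1, 11.9, 11.8, 12.2, 12.0]

res = ttest_1samp(lengths, popmean=12.0)
print(res.statistic, res.pvalue)  # two-tailed p-value
```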

One thing to note: as sample size increases, the t-distribution approximates the normal distribution more and more closely, so a general approach is to simply use the t-distribution.

### Means of two different sample groups

Now, when using the t-test on two different sample groups with small sizes, we can obtain a better estimate by pooling our sample variances, as long as the population variances are equal between the samples, the samples are independent of each other, and the samples follow a normal distribution. The pooled variance is sp² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2).

For this example, we are testing to see if one granite (Granite A) is elevated in U. We have 30 samples of each granite (Granite A, and Granite B). Granite A has a mean U of 923 ppm, and a standard deviation of 433 ppm. Granite B has a mean U of 723 ppm, and a standard deviation of 399 ppm.

We define our significance level at α=5%. That is, we will reject the null hypothesis only if a difference as large as the observed 200 ppm would occur with less than 5% probability under it, accepting a 5% chance of incorrectly rejecting the null hypothesis.

We look up the critical Z value in a standard normal table and find that for a 95% confidence level on our one-tailed test, we need a critical value of 1.65 (that is, 1.65 standard deviations from the mean).

The distance from the mean to the critical value of 1.65 is 1.65 × SE = 1.65 × √(433²/30 + 399²/30) ≈ 1.65 × 107.5 ≈ 177 ppm.

Based on this information, we know that there is only a 5% chance of obtaining a difference of 177 ppm or higher if the null hypothesis is true. Our observed difference is 200 ppm. As this is higher than the critical distance, we can reject our null hypothesis and state that Granite A is elevated in U compared to Granite B.
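The arithmetic above can be sketched with nothing but the standard library:

```python
# Large-sample z comparison of mean U between the two granites.
import math

mean_a, sd_a, n_a = 923, 433, 30   # Granite A (ppm)
mean_b, sd_b, n_b = 723, 399, 30   # Granite B (ppm)

# Standard error of the difference in means
se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# Smallest difference significant at the one-tailed 5% level
min_diff = 1.645 * se

# Observed difference expressed as a z-score
z = (mean_a - mean_b) / se
print(se, min_diff, z)
```

The z-score comes out above 1.645, which is the same conclusion as comparing 200 ppm against the ~177 ppm critical distance.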

### Paired experiments

Once we begin comparing different methods on the same samples, we lose the requirement that our samples are independent. The variance between samples is greater than the variance between methods (i.e., Rock A analyzed by method A will be close to Rock A analyzed by method B, but Rocks B through Z will show much more variability in either method). Because of this, we cannot pool our variances as before. We instead use the pairwise differences, di, and consider them to follow a t-distribution. d̄ and sD are the mean and standard deviation of the differences, nD is the number of pairs, and D0 is the value of the mean difference μD under our null hypothesis. We also assume that the distribution of the differences is normal, and that the differences are a random sample. Our test statistic is t = (d̄ − D0)/(sD/√nD), with nD − 1 degrees of freedom.

This paired experiment is a simplified example of blocking, where comparisons are made between similar experimental units. This blocking needs to be done prior to performing the experiment.
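A paired t-test along these lines can be sketched in Python with scipy; the two “methods” below are hypothetical measurements on the same six rocks, for illustration only:

```python
# Paired t-test sketch: the same rocks analyzed by two methods.
# The measurements below are hypothetical, for illustration only.
from scipy.stats import ttest_rel

method_a = [12.1, 45.3, 7.8, 33.0, 19.5, 27.2]
method_b = [12.6, 46.1, 8.1, 33.4, 20.3, 27.9]

# Internally this tests the mean of the pairwise differences d_i
# against D0 = 0 with a t-distribution on n_D - 1 degrees of freedom.
res = ttest_rel(method_a, method_b)
print(res.statistic, res.pvalue)
```

Note how each rock’s two measurements track each other closely even though the rocks themselves span a wide range; that between-sample variability is exactly what pairing removes.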

The Wilcoxon signed rank test can be used when we can’t assume a normal distribution in a paired experiment.

Let’s do an example of a paired experiment using the Wilcoxon Sign rank test. We will use the fictional data below to determine if Arsenic (As) is increasing in soil samples over a 1-year period.

In this sample data set, I used a Shapiro-Wilk test to test for normality (α=0.05). For February 2018, the calculated p-value is 0.457, so we cannot reject normality and assume the data are normal. For February 2019, the calculated p-value is 0.047; this is statistically significant, and we therefore assume the data are not normally distributed.

As usual we state our hypotheses: H0, there has been no change in As levels (the median difference is zero); Ha, As levels have increased (the median difference is greater than zero).

We state our level of significance at α=0.05, and our number of observations n=9 (we ignore values with no differences, such as location 4). First, we need to calculate some new information for our test: the absolute difference, and the rank.

As this is a Wilcoxon signed rank test, we need to know the rank sum of the negative and positive differences. We ignore sample pairs with no differences.

Our Wilcoxon test statistic is the smaller of these two rank sums, therefore wstat=13. We use a table to find that the critical value for n=10 and α=0.05 is wcrit=8. Therefore, wstat > wcrit, and we cannot reject the null hypothesis. If our test statistic had been less than or equal to the critical value, we would have rejected the null.

### Some robust methods for comparing two medians

The two most powerful tests when normality is not certain are the Mann-Whitney test and the Terry-Hoeffding test.

The Mann-Whitney test (also called the Wilcoxon rank-sum test, or the U-test) ranks the two groups as one combined set and calculates the sum of ranks for each group, R1 and R2. From these we compute U1 = R1 − n1(n1 + 1)/2 and U2 = R2 − n2(n2 + 1)/2.

The smaller of these two values is taken as U. If the null hypothesis is true, both values should be similar. For n1 and n2 ≤ 20, the critical values table is consulted; a significant result is observed if the calculated U is less than the critical value. Larger data sets are converted to standard normal by z = (U − n1n2/2) / √[(n1n2/12)((N + 1) − Σ(t³ − t)/(N(N − 1)))], where N = n1 + n2.

Here t is the number of tied values for a given rank. In this case, we have the opposite scenario: a significant result is observed when the calculated Z is larger than the critical value.
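The rank-sum recipe above can be written out directly in Python. The two small groups below are hypothetical, chosen with no tied values so that simple ranking works:

```python
# Mann-Whitney U by hand, following the rank-sum recipe in the text.
# The two small groups are hypothetical illustration data with no ties.
def mann_whitney_u(group1, group2):
    combined = sorted(group1 + group2)
    # This ranking assumes no tied values, which holds for this toy data.
    rank = {v: i + 1 for i, v in enumerate(combined)}
    r1 = sum(rank[v] for v in group1)   # rank sum of group 1
    r2 = sum(rank[v] for v in group2)   # rank sum of group 2
    n1, n2 = len(group1), len(group2)
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    return min(u1, u2)  # compare against the critical-value table

u = mann_whitney_u([1.2, 3.4, 2.2, 5.1], [4.0, 6.3, 7.7, 5.9])
print(u)  # U = 1 for this toy data
```

For real data with ties, `scipy.stats.mannwhitneyu` handles the tie correction and the normal approximation for you.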

The Terry-Hoeffding test is performed similarly but uses the sum of the rankits of the smaller group of data instead of the raw ranks.

Here E(N, Ri) is the rankit for the Ri-th rank of N values, resulting in a test with greater power. T is used to do a one-sided test to determine if the smaller group has a larger median, −T to test if the smaller group has a smaller median, and the absolute value of T for two-sided tests. For sample sizes greater than 20, the approximated critical value is:

In this case, r is the critical value of Pearson’s r, given that N and n2 are sufficiently large.

If doing a two-tailed test, testing the null hypothesis for equality, we reject the null when:

If doing a one-tailed test, we reject the null when:

Posted in statistics, univariate

## Measures of spread and scale

Often, we may want to compare our sample variance with a hypothetical population variance, in order to ensure it doesn’t exceed a certain value. Back to our sandwiches: perhaps Subway is okay with footlongs varying in length by no more than 1”. The test statistic we use, χ² = (n − 1)s²/σ0², follows the chi-squared distribution with n − 1 degrees of freedom.

In this test we hypothesize that H0: σ² = σ0², or that the population variance is equal to the hypothesized value. Our alternative hypothesis could be Ha: σ² > σ0² for a one-tailed test. This would imply that χ² gets larger and larger as s² increases in comparison to our hypothesized σ0². Alternatively, the opposite alternative (Ha: σ² < σ0²) would imply that small values show evidence against our null hypothesis. With two-tailed tests, we reject our null with very large and very small values of χ².

As an example, let’s imagine a quality control experiment to ensure that the weights of 500 gram sand samples taken in the field do not exceed a variance of 50 grams². Our hypotheses could be H0: σ² = 50 and Ha: σ² > 50.

We take 9 samples, with weights of 455, 460, 473, 496, 503, 512, 523, 527, and 540 grams. Our s² is calculated to be 920.94. Now we calculate our χ² = (n − 1)s²/σ0² = 8 × 920.94/50 ≈ 147.4.

Using R, or a chi-squared table, we find that the p-value for our chi-square score is <0.00001. Although we haven’t stated an acceptable α, we can say there is strong evidence (p < 0.00001) that the true variance of our sample weights is greater than 50 grams².
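The same calculation, sketched in Python with scipy:

```python
# Chi-square test on the variance of the 500 g sand samples.
from scipy.stats import chi2

weights = [455, 460, 473, 496, 503, 512, 523, 527, 540]
n = len(weights)
mean = sum(weights) / n
s2 = sum((w - mean) ** 2 for w in weights) / (n - 1)  # sample variance

sigma0_sq = 50.0                       # hypothesized maximum variance
chi2_stat = (n - 1) * s2 / sigma0_sq   # ~ chi-square with n - 1 = 8 df
p = chi2.sf(chi2_stat, df=n - 1)       # one-tailed (upper) p-value
print(s2, chi2_stat, p)
```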

If we are comparing two sample variances, we use the F distribution. In this case we have one sample variance s1², from n1 observations of a population with variance σ1², and another sample variance s2², from n2 observations of a population with variance σ2². The observations are independent, and both samples come from normally distributed populations. As usual we test H0: σ1² = σ2², against either Ha: σ1² > σ2² (one-tailed) or Ha: σ1² ≠ σ2² (two-tailed), using the statistic F = s1²/s2².

Let’s do another example. Does a sample method that collects 2000 gram samples have a greater variability than one that collects 500 gram samples?

Using a computer, we find the p-value of F = 2.785 with 10 and 6 degrees of freedom is 0.1112. Our p-value is greater than our α (0.05); therefore the result is not significant, and we cannot reject our null hypothesis.
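That p-value comes straight from the upper tail of the F distribution, e.g. in Python:

```python
# Upper-tail p-value for the F statistic from the sampling-method example.
from scipy.stats import f

F = 2.785          # ratio of the two sample variances, s1^2 / s2^2
dfn, dfd = 10, 6   # numerator and denominator degrees of freedom

p = f.sf(F, dfn, dfd)  # one-tailed p-value
print(p)
```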

We must mention a few caveats. First, some of these sample sizes are very small, and larger samples would be ideal. Secondly, normality is very important for both the chi-squared and F distributions; if normality is not certain, non-parametric tests should be used.

The most powerful non-parametric test (Rock, 1988a) would be the Klotz test, based on the squares of the normal scores. The normal scores are Ai, where Ai = Φ⁻¹(Ri/(N+1)), Φ⁻¹ is the percent point function (inverse cumulative distribution function) of the standard normal distribution, Ri is the rank of the i-th observation, and N is the sample size. The test statistic is built from the sum of the squared normal scores of one group.

As usual, if the calculated value is below the critical value from the normal distribution at significance level α, we accept the null hypothesis; if not, we reject it in favor of the alternative.
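The normal scores Ai = Φ⁻¹(Ri/(N+1)) are easy to compute with scipy. The two groups below are hypothetical, and the value printed is just the raw sum of squared normal scores for the first group, not the normalized statistic:

```python
# Normal scores A_i = ppf(R_i / (N + 1)) as defined for the Klotz test.
# The data values here are hypothetical illustration data.
from scipy.stats import norm, rankdata

group1 = [3.1, 4.7, 2.2, 5.9]        # the smaller group
group2 = [3.8, 6.4, 1.5, 7.2, 5.0]
combined = group1 + group2

N = len(combined)
ranks = rankdata(combined)            # R_i, 1-based ranks of the pooled data
scores = norm.ppf(ranks / (N + 1))    # A_i, the normal scores

# Raw sum of squared normal scores for group 1 (first len(group1) entries).
k = sum(a ** 2 for a in scores[: len(group1)])
print(k)
```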

Posted in statistics, univariate

### Univariate Statistics Part 1, Formulating your hypothesis, and the p-value

We often would like to compare a population parameter to our sample. We may know this parameter in advance, or we may want to compare it to another sample. Suppose we want to know for sure Subway is giving us a real footlong sandwich when we ask for one. We buy 30 12″ sandwiches and measure their lengths to be 11.5″ on average. Is this close enough to 12″ to accept Subway’s claim? Or we could have 30 samples from 2 different granites. Are these two granites geochemically equivalent or not?

In order to do this we follow a 5-step process:

1. Formulate a null hypothesis, H0 (what we assume to be true), and our alternative hypotheses, Ha (which we only accept with sufficient evidence to reject the null)
2. Specify our level of significance: how important is it to avoid Type I errors, that is, rejecting the null hypothesis when it was in fact true? This could be very important when discussing the legal system and whether or not a person is innocent of a crime. It may be of less or equal importance in Subway sandwiches, depending on your love of sandwiches.
3. Calculate our test statistic.
4. Define the region of rejection.
5. Either accept our null hypothesis, or reject it in favour of our alternative hypothesis.

In our hypothesis testing, we can make two errors.

• Type I errors occur when we reject our null hypothesis, but the null is true. For example, our justice system assumes a person is innocent. If we reject this, and put an innocent person in jail, we have made a type I error.
• Type II errors occur when we accept the null, but our alternative was true. In this case, a guilty person gets away without punishment.

Although type I errors are often considered worse, this is not always true. Knowing the difference, and the consequences of each error, can impact your decisions. The two error rates trade off against each other: for a given sample size, if we try to decrease the number of type II errors, the number of type I errors will increase.

The p-value is the probability, assuming the null hypothesis is true, of observing a relationship as strong as or stronger than the one in our sample. It helps us quantify the significance of an observed relationship, and is a decreasing index of the reliability of the results: the smaller the p-value, the more confident we are in the relation. In other words, rejecting the null at p = 0.05 means we have a 5% probability of rejecting our null hypothesis when in fact it was correct (i.e. making a type I error), or a 1/20 chance that randomness alone would produce a relationship equal to or stronger than what we observed.

We can also look at it this way: all hypothesis tests are in a sense “negative”. We can never 100% reject the null hypothesis, but we can disprove it to some level of significance, for example α = 0.05 (a 5% level of significance).

The power of a test is the probability that our test rejects the null hypothesis when the alternative hypothesis is true. We must specify the level of significance α (the probability of a type I error) before we do our test. If we want to minimize our type II errors (β), we express our null hypothesis with the intent to reject it. The power of a test is then 1 − β.

Let’s do an example. Assume we have 49 samples from a normally distributed population. We know σ = 14, but μ is unknown. We start with our hypothesis test: we suggest our population mean is 50, and test it against a one-sided alternative that the population mean is less than 50.

H0: μ = 50
Ha: μ < 50

We set our significance level at α = 0.05. We now pick our appropriate test statistic. We have a normal distribution and we know our population standard deviation, so the appropriate test statistic is Z = (x̄ − μ0)/(σ/√n) = (x̄ − 50)/(14/√49).

We calculate for which values of Z we reject our null hypothesis. We check a Z table and find that the Z value with an area of 0.05 to the left is −1.64. We therefore reject if Z ≤ −1.64. To facilitate our calculation of the power of the test, we express the rejection region in terms of the sample mean x̄: solving (x̄ − 50)/(14/√49) ≤ −1.64 gives x̄ ≤ 50 − 1.64 × 2 = 46.72.

This means we have a 0.05 probability of committing a Type I error when our sample mean is 46.72 or less.

Now, if µ = 45, and our null hypothesis wrongly states µ = 50, what is the probability of a Type II error? That is, what’s the probability of not rejecting the null when it is in fact false.

We standardize as follows: Z = (46.72 − 45)/(14/√49) = 1.72/2 = 0.86.

We look up 0.86 in the Z table and find the area to the right is approximately 0.195. Therefore, our probability of a Type II error is β ≈ 0.195. And the probability of rejecting the null hypothesis when it is false, that is the power of the test, is 1 − β = 1 − 0.195 = 0.805!
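The whole power calculation can be sketched in Python with scipy, using the numbers from this example:

```python
# Power calculation for the one-sided test H0: mu = 50 vs Ha: mu < 50,
# with sigma = 14, n = 49, alpha = 0.05, and a true mean of 45.
from math import sqrt
from scipy.stats import norm

sigma, n, alpha = 14, 49, 0.05
se = sigma / sqrt(n)                 # 14 / 7 = 2

# Rejection region in terms of the sample mean: reject if xbar <= cutoff
cutoff = 50 + norm.ppf(alpha) * se   # about 46.71

# Type II error: probability of NOT rejecting when mu is really 45
beta = norm.sf((cutoff - 45) / se)
power = 1 - beta
print(cutoff, beta, power)
```

Using the exact quantile −1.645 instead of the table value −1.64 shifts the cutoff slightly, but the power still comes out at about 0.80.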

If my explanations leave you with a few questions, which wouldn’t surprise me in the slightest, check out this jbstatistics video (or many of the other ones), which are much better at showing graphics to explain the concepts and helped me understand them to begin with.