These are tests used to compare the averages of two distributions. When comparing means, we could compare a sample mean to a predefined value (e.g., are these sandwiches on average different from Subway’s 12” claim?), or we could compare the means of two separate sample sets (e.g., do the sandwiches at this Subway differ from the sandwiches at another?). Both tests use similar methods; however, if we are comparing to a predefined value we use a normal distribution, and if we are comparing two sample sets we use a t-distribution.
We must also consider sample size: with any sample of fewer than 30 observations, we cannot assume our sample standard deviation is an accurate estimate of the population standard deviation. Therefore, with sample sizes under 30 we use the t-distribution. This leaves a few scenarios to consider:
- One sample vs. a predefined population mean
    - Large sample size
    - Small sample size
- Two samples against each other
    - Large sample size
    - Small sample size
- Paired experiments
One sample vs. a predefined population mean
Our t-scores or z-scores are calculated as:

z = (x̄ − μ₀) / (σ / √n)   or   t = (x̄ − μ₀) / (s / √n)

where x̄ is the sample mean, μ₀ is the predefined population mean, σ (or s) is the population (or sample) standard deviation, and n is the sample size. The t-statistic has n − 1 degrees of freedom.
Let’s go back to our sandwiches example. We want to know whether our sandwiches are really 12” in length. We have 10 sandwiches and measure them (11.9, 11.8, 12.1, 11.9, 12.0, 12.1, 11.9, 11.8, 12.2, and 12.0 inches). We have the following calculations for our test statistic:

We must also choose our significance level α, which we will set at 0.05, and state our hypotheses. We use a two-tailed test, so our hypotheses are:

H₀: μ = 12 inches
H₁: μ ≠ 12 inches
We have a small sample size, so we use the t-distribution.

I used a t-distribution calculator with our test statistic of −1.108 and found that the two-tailed p-value is 0.2966. As this is greater than our significance level of 0.05, we cannot reject the null hypothesis.
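If you prefer to script this, here is a minimal sketch of the same one-sample, two-tailed test, assuming Python with SciPy is available (any stats package would do); it uses the ten sandwich lengths listed above and the hypothesized mean of 12 inches.

```python
import numpy as np
from scipy import stats

# Sandwich lengths in inches (the ten measurements listed above)
lengths = np.array([11.9, 11.8, 12.1, 11.9, 12.0, 12.1, 11.9, 11.8, 12.2, 12.0])

# Two-tailed one-sample t-test against the claimed mean of 12 inches
t_stat, p_value = stats.ttest_1samp(lengths, popmean=12.0)

print(f"mean = {lengths.mean():.3f}, s = {lengths.std(ddof=1):.3f}")
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Cannot reject the null hypothesis")
```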
One thing to note: as sample size increases, the t-distribution approaches the normal distribution, so a safe general approach is to use the t-distribution.
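To see this convergence numerically, here is a small sketch (again assuming SciPy) comparing the two-tailed 95% critical values of the t-distribution with the normal value as the degrees of freedom grow:

```python
from scipy import stats

# Two-tailed 95% critical values: t approaches the normal value as df grows
z_crit = stats.norm.ppf(0.975)
for df in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df = {df:4d}: t_crit = {t_crit:.3f}  (z_crit = {z_crit:.3f})")
```

By 30 degrees of freedom the t critical value is already within about 4% of the normal value.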
Means of two different sample groups
When using the t-test on two different sample groups with small sizes, we can obtain a better estimate by pooling our sample variances, provided the population variances of the two groups are equal, the samples are independent of each other, and the samples follow a normal distribution. The pooled variance and test statistic are:

s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)

t = (x̄₁ − x̄₂) / (s_p · √(1/n₁ + 1/n₂)), with n₁ + n₂ − 2 degrees of freedom
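As a sketch of how the pooled-variance t-test could be run in code (SciPy assumed; the two small samples below are hypothetical and only illustrate the call):

```python
import numpy as np
from scipy import stats

# Hypothetical small samples (fewer than 30 observations each)
group_a = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3])
group_b = np.array([9.6, 9.9, 9.5, 10.0, 9.7, 9.4])

# equal_var=True pools the two sample variances (the classic two-sample t-test)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"pooled t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")
```

Setting equal_var=True is what invokes the pooled-variance form; with equal_var=False SciPy performs Welch's test instead, which does not assume equal population variances.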
For this example, we are testing to see if one granite (Granite A) is elevated in U. We have 30 samples of each granite (Granite A and Granite B). Granite A has a mean U of 923 ppm and a standard deviation of 433 ppm. Granite B has a mean U of 723 ppm and a standard deviation of 399 ppm.

We define our significance level at α = 5%. That is, we want less than a 5% probability that a difference this large would arise by chance alone, or less than a 5% chance of incorrectly rejecting the null hypothesis.
We look up the critical z-value in a standard normal distribution table and find that, for a 95% confidence level on our one-tailed test, we need a critical value of 1.65 (i.e., 1.65 standard deviations from the mean).
Expressed in ppm, the distance from the mean to the critical value of 1.65 standard deviations is:

1.65 × √(433²/30 + 399²/30) = 1.65 × 107.5 ppm ≈ 177 ppm
Based on this information, we know that there is only a 5% chance of obtaining a difference of 177 ppm or higher if the null hypothesis is true. Our observed difference is 200 ppm. As this exceeds the critical difference, we can reject our null hypothesis and state that Granite A is elevated in U compared to Granite B.
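The same calculation can be reproduced from the summary statistics quoted above; this is a minimal sketch, with SciPy used only for the normal-table lookup:

```python
import math
from scipy import stats

# Summary statistics for the two granites (uranium, ppm)
mean_a, sd_a, n_a = 923.0, 433.0, 30   # Granite A
mean_b, sd_b, n_b = 723.0, 399.0, 30   # Granite B

# Standard error of the difference between the two means
se_diff = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# One-tailed critical value at alpha = 0.05 (about 1.65)
z_crit = stats.norm.ppf(0.95)

# Smallest difference that would be significant at the 5% level (~177 ppm)
min_significant_diff = z_crit * se_diff

observed_diff = mean_a - mean_b
z_stat = observed_diff / se_diff

print(f"SE of difference = {se_diff:.1f} ppm")
print(f"critical difference = {min_significant_diff:.0f} ppm, observed = {observed_diff:.0f} ppm")
print(f"z = {z_stat:.2f} vs z_crit = {z_crit:.2f}")
```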
Paired experiments
Once we begin comparing different methods applied to the same samples, we lose the requirement that our samples are independent. The variance between samples is greater than the variance between methods (i.e., Rock A analyzed by method A will be close to Rock A analyzed by method B, but Rocks B through Z will show much more variability under either method). Because of this, we cannot pool our variances as before. We instead use the pairwise differences, dᵢ, and consider them to follow a t-distribution. d̄ and s_D are the mean and standard deviation of the differences, n_D is the number of pairs, and D₀ is the value of the true mean difference μ_D specified by the null hypothesis. We also assume that the distribution of the differences is normal and that the differences are a random sample. Our test statistic is:

t = (d̄ − D₀) / (s_D / √n_D), with n_D − 1 degrees of freedom
This paired experiment is a simplified example of blocking, where comparisons are made between similar experimental units. This blocking needs to be done prior to performing the experiment.
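A paired t-test of this kind is a one-liner in most packages. The sketch below assumes SciPy, and the two short arrays are hypothetical paired measurements of the same rocks by two methods:

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements: the same rocks analyzed by two methods
method_a = np.array([12.1, 45.3, 8.7, 33.0, 21.5, 18.2])
method_b = np.array([12.4, 44.9, 9.1, 33.6, 22.0, 18.5])

# Paired (related-samples) t-test on the pairwise differences d_i
t_stat, p_value = stats.ttest_rel(method_a, method_b)

d = method_a - method_b
print(f"mean difference = {d.mean():.3f}, s_D = {d.std(ddof=1):.3f}, n_D = {len(d)}")
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")
```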
The Wilcoxon signed rank test can be used when we can’t assume a normal distribution in a paired experiment.
Let’s do an example of a paired experiment using the Wilcoxon signed rank test. We will use the fictional data below to determine if arsenic (As) is increasing in soil samples over a 1-year period.

In this sample data set, I used a Shapiro-Wilk test to check for normality (α = 0.05). For February 2018, the p-value is 0.457, so we cannot reject normality and treat the data as normal. For February 2019, the p-value is 0.047; this is statistically significant, so we conclude the data are not normally distributed.
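For reference, a Shapiro-Wilk check like the one described is a single call, assuming SciPy; the array soil_as below is a placeholder for one of the columns in the table, not the actual data:

```python
from scipy import stats

# Placeholder for one year's arsenic measurements (ppm); substitute the real column
soil_as = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 6.0, 4.7, 5.2, 4.1]

w_stat, p_value = stats.shapiro(soil_as)
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")  # p > 0.05: no evidence against normality
```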
As usual we state our hypotheses:

We state our level of significance at α = 0.05, and our number of observations is n = 9 (we ignore values with no differences, such as location 4). First, we need to calculate some new information for our test: the absolute difference and the rank.

As this is a Wilcoxon signed rank test, we need to know the rank sum of the negative and positive differences. We ignore sample pairs with no differences.

Our Wilcoxon test statistic is the smaller of these two rank sums, therefore W_stat = 13. We use a table to find that the critical value for n = 10 and α = 0.05 is W_crit = 8. Therefore W_stat > W_crit, and we cannot reject the null hypothesis. If our test statistic had been smaller than the critical value, we would have been able to reject the null.
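The same signed-rank procedure can be run in SciPy (assumed here); as_2018 and as_2019 below are placeholders for the two columns of the table above, not the original values:

```python
from scipy import stats

# Placeholders for the February 2018 and February 2019 arsenic columns (ppm)
as_2018 = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 6.0, 4.7, 5.2, 4.1]
as_2019 = [4.6, 5.0, 4.2, 4.9, 6.1, 4.8, 6.3, 5.0, 5.6, 4.5]

# zero_method="wilcox" drops pairs with zero difference, as in the worked example;
# the reported statistic is the smaller of the positive and negative rank sums
w_stat, p_value = stats.wilcoxon(as_2019, as_2018, zero_method="wilcox")
print(f"W = {w_stat:.1f}, p = {p_value:.4f}")
```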
Some robust methods for comparing two medians
The two most powerful tests when normality is not certain are the Mann-Whitney test and the Terry-Hoeffding test.
The Mann-Whitney test (also called the Wilcoxon rank-sum test or the U-test) ranks the two groups as one combined set and calculates the sum of ranks for each group, R₁ and R₂. From these rank sums:

U₁ = R₁ − n₁(n₁ + 1)/2
U₂ = R₂ − n₂(n₂ + 1)/2
The smaller of these two values is taken. If the null hypothesis is true, both values should be approximately equal. For n₁ and n₂ ≤ 20, a critical values table is consulted; a significant result is observed if the calculated U is less than the critical value. Larger data sets are converted to standard normal by:

z = (U − n₁n₂/2) / √{ (n₁n₂/12) · [ (N + 1) − Σ(t³ − t) / (N(N − 1)) ] },   N = n₁ + n₂

where t is the number of tied values for a given rank. In this case we have the opposite scenario: a significant result is observed when the calculated z is larger than the critical value.
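Both the small-sample (exact) and large-sample (normal approximation) forms of the Mann-Whitney test are available in SciPy; the sketch below uses two hypothetical groups:

```python
from scipy import stats

# Two hypothetical groups of measurements
group_1 = [3.1, 4.7, 2.8, 5.2, 3.9, 4.1, 3.3]
group_2 = [5.6, 4.9, 6.1, 5.8, 4.5, 6.3, 5.0]

# Exact distribution (small samples, no ties)
u_stat, p_exact = stats.mannwhitneyu(group_1, group_2, alternative="two-sided", method="exact")

# Normal approximation with tie correction (large samples)
_, p_asymp = stats.mannwhitneyu(group_1, group_2, alternative="two-sided", method="asymptotic")

print(f"U = {u_stat:.1f}, exact p = {p_exact:.4f}, asymptotic p = {p_asymp:.4f}")
```

SciPy reports a p-value directly rather than leaving the comparison of U against a critical-values table to the reader.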
The Terry-Hoeffding test is performed similarly but uses the sum of the rankits of the smaller group of data instead of the raw ranks:

T = Σ E(N, Rᵢ), summed over the ranks Rᵢ of the smaller group
where E(N, Rᵢ) is the rankit for the Rᵢ-th rank of N values, which results in a test with greater power. T is used for a one-sided test to determine if the smaller group has a larger median, −T to test if the smaller group has a smaller median, and the absolute value of T for two-sided tests. For sample sizes greater than 20, the approximated critical value is:

In this case, r is the critical value of Pearson’s r, given that N and n2 are sufficiently large.
If doing a two-tailed test, testing the null hypothesis for equality, we reject the null when:

If doing a one-tailed test, we reject the null when:

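SciPy has no built-in Terry-Hoeffding test, but the statistic itself is straightforward to compute. The sketch below is my own illustration, not a reference implementation: it approximates the rankits E(N, i) with Blom's formula and sums them over the smaller group.

```python
import numpy as np
from scipy import stats

def terry_hoeffding_T(small_group, large_group):
    """Sum of rankits over the smaller group (Blom's approximation for E(N, i))."""
    combined = np.concatenate([small_group, large_group])
    N = len(combined)
    # Ranks of the combined data (average ranks for ties)
    ranks = stats.rankdata(combined)
    # Blom's approximation to the expected normal order statistics (rankits)
    rankits = stats.norm.ppf((ranks - 0.375) / (N + 0.25))
    # T is the sum of the rankits belonging to the smaller group
    return rankits[: len(small_group)].sum()

# Hypothetical data: the first group is the smaller one
small = np.array([3.1, 4.7, 2.8, 5.2, 3.9])
large = np.array([5.6, 4.9, 6.1, 5.8, 4.5, 6.3, 5.0, 4.8])

print(f"T = {terry_hoeffding_T(small, large):.3f}")
```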