# Univariate Statistics Part 4, Methods to compare distributions

## Comparing Distributions: Chi-square and other robust methods

Most of the previous discussion assumed normal distributions and offered some options for nonparametric situations, but we still need tests to determine what type of distribution we are dealing with.

The chi-square goodness-of-fit test is the first and simplest way to do this. It compares the empirical data set to a known distribution. However, it requires a large enough sample size, and the result can be influenced by binning choices.

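The null hypothesis is that the two distributions are equal, and we reject it only if we have sufficient evidence that they are different. The test statistic for a chi-square goodness-of-fit test is:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed count in bin $i$, $E_i$ is the count expected under the assumed distribution, and $k$ is the number of bins. We reject the null hypothesis when $\chi^2$ exceeds the critical value for the chosen significance level.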

Let’s do a simple example of tossing a six-sided die. From 60 tosses of a fair die we expect an even distribution, 10 for each possibility; that is, 1 through 6 should have the same frequencies. From our 60 tosses we get 10 ones, 14 twos, 6 threes, 18 fours, 2 fives, and 10 sixes.

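For these counts the chi-square statistic is (0 + 16 + 16 + 64 + 64 + 0) / 10 = 16. With six categories there are 6 - 1 = 5 degrees of freedom, so the critical value at a significance level of 0.05 is 11.07. Since 16 > 11.07, we reject our null hypothesis: this die appears to be loaded. At a significance level of 0.01 the critical value is 15.09, so we would still reject, though only narrowly. At a stricter level of 0.001 the critical value is 20.52, and we would not be able to reject our null hypothesis. The conclusion depends on how strict a significance level we choose.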
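The die example can be checked numerically. The following is a minimal sketch in plain Python using the counts from the text; the variable names are illustrative, and the critical value is hard-coded from a standard chi-square table (5 degrees of freedom, significance level 0.05) rather than computed.

```python
# Observed counts for faces 1..6, taken from the die example in the text.
observed = [10, 14, 6, 18, 2, 10]
n_tosses = sum(observed)
expected = n_tosses / len(observed)  # fair die: 10 tosses per face

# Chi-square statistic: sum of (observed - expected)^2 / expected over all faces.
chi_square = sum((o - expected) ** 2 / expected for o in observed)
dof = len(observed) - 1  # six categories -> 5 degrees of freedom

# Assumption: critical value copied from a chi-square table (df = 5, alpha = 0.05).
CRITICAL_05 = 11.07
reject_null = chi_square > CRITICAL_05

print(f"chi-square = {chi_square:.2f}, df = {dof}, reject at 0.05: {reject_null}")
```

Every expected count here is 10, so the usual rule of thumb that each bin should have an expected count of at least 5 is satisfied.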

### Robust methods to estimate distribution

Because the chi-square test is only applicable to large data sets (n > 30), we need a test to use for smaller samples. The Kolmogorov-Smirnov (K-S) test fills this need. Its test statistic is the maximum vertical distance between the two cumulative probability curves.

Figure: K-S example (from Wikipedia).

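Here CDF is the theoretical cumulative distribution function, and EDF is the empirical distribution function built from the sample. The test statistic D is the largest absolute difference between the two.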
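To make the definition concrete, here is a rough sketch of how D could be computed by hand in plain Python. The function names `normal_cdf` and `ks_statistic` are hypothetical helpers written for this illustration, and the sample values are invented. The theoretical CDF is assumed to be a standard normal, evaluated with `math.erf`.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Theoretical CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest vertical gap between the EDF of `sample` and the given CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The EDF jumps from (i - 1) / n to i / n at x, so check both sides of the step.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical measurements, invented for illustration only.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5, 2.2]
d = ks_statistic(sample, normal_cdf)
print(f"D = {d:.3f}")
```

Checking both sides of each step matters because the EDF is a staircase: the largest gap can occur just before a jump as well as just after it.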

This test has two major advantages. First, the distribution of the test statistic does not depend on the CDF being tested. Second, it is an exact test, so it does not depend on an adequate sample size.

We reject our null hypothesis if the test statistic D is greater than the critical value we get from our table. There are a variety of ways to compute D, and they should be equivalent. However, you should ensure that the statistic is calculated in a way that matches the critical value table used.