Many statistical problems involve comparing samples drawn from one or more distributions: for example, are two samples drawn from the same distribution? Do two samples have the same mean or variance? Is a sample consistent with being drawn from a Gaussian?
There are a wide variety of classical statistical tests for different cases. "Classical" here usually means frequentist statistics and the results and significance are typically presented as p-values, confidence intervals, or "N sigma" (technically N sigma refers to Gaussian probabilities). The variety of classical tests can be confusing and sometimes people apply the wrong one or violate the assumptions, so be careful.
The Kolmogorov-Smirnov test for comparing two distributions is very commonly used in astronomy, along with its somewhat less common relatives like the Mann-Whitney test, the Shapiro-Wilk test, etc. (aside: I don't know why so many classical tests have these "Bob-Ray test" hyphenated names, but it's a thing).
Another common test is the "t test" or "Student's t" for testing the significance of the difference in means between two samples (aside: strictly it tests for a difference in means between two samples assumed to have the same variance, and how do you know the variance? Hold onto that question because it is a good and subtle one.) There are a host of other named statistics, not all covered here, like z-tests (basically the concept of defining a statistic that is N sigma off a Gaussian distribution), F-tests, and so on.
An important point about these tests is that they almost always take the form of rejecting a null hypothesis. For example, the null hypothesis could be that two samples are drawn from the same distribution, you do a K-S test, and if the p-value is small, that rejects the null hypothesis: the two samples aren't drawn from the same distribution. But a large p-value doesn't prove that they are the same distribution. You can only prove differences. Similarly for doing a t-test on difference of means.
Note that both of these tests are giving you a significance, not measuring a parameter and its uncertainty. Further, in classical statistics a p-value, e.g. "D of 0.95 gives p-value of 0.04," is the probability that you would get a data-statistic D this extreme or more extreme if the null hypothesis were true, i.e. the probability of the data given the model (null hypothesis). It is not the same as the probability of the model given the data, but that is how a lot of people casually treat it.
If you get confused by the wide variety of classical test statistics, some of which seem kind of pulled out of a hat, you may be ready to buy the bridge that Bayesians are selling! But not until Chapter 5. Bayesian statistics provides a framework for calculating the probability of the model given the data.
It's useful to look at the contents of scipy.stats, see http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html#module-scipy.stats
This is sort of a digression and is a cautionary tale about sample selection when comparing distributions. If you select the sample in some way that is related to what you are trying to measure, then your sample may be biased. An example is selecting low scoring students, tutoring them, and then measuring if their scores improve. There may be some improvement just because the original selection was biased, not due to the tutoring.
The original measurement of this effect by Galton is that if you take tall parents and measure the height of their children, they will be closer to the population average. (Even though height is hereditary - it's not perfectly hereditary.) Other examples include picking your telescope site based on what site had the best seeing/weather in the last 5 years, or picking the mutual fund that performed best last year: future performance will typically not be as good.
Here we don't describe the distributions by parameters in formulae. The best-known test is the Kolmogorov-Smirnov test, which operates on the normalized cumulative distributions (CDFs) of two samples $x_{1,i}$ and $x_{2,i}$. The normalized cumulative distribution is $F_1(x_i) = \mathrm{rank}(x_i)/n_1$ and goes from 0 to 1. The K-S statistic is $D$, the maximum difference between the cumulative distributions.
$$D = \max_x |F_1(x) - F_2(x)|$$
It turns out that without any constraint on the distributions, smart people can derive a p-value from $D$. Note that $D$ only cares what the difference is at one point, the maximum, but since these are cumulative distributions that includes information about how the distribution behaves elsewhere in $x$. $F_1(x)$ is sort of a rate at which points in the distribution of $x_1$ accumulate as you integrate up in $x$.
scipy.stats includes K-S routines: stats.ks_2samp, and stats.kstest which compares to a reference distribution, e.g. a perfect Gaussian (and stats.ksone, which has a whole series of methods for calculating things from a parametric pdf). They return a D and p-value. As mentioned before, the p-value tells if you can reject the hypothesis of equal distributions. It doesn't say if they are the same.
In [1]:
%pylab
import numpy as np
from scipy import stats
vals = np.random.normal(loc=0.0, scale=1, size=1000)
print stats.kstest(vals, "norm")
# 1-sample K-S test vs gaussian. output is the statistic D, and a p-value
In [3]:
vals = np.random.normal(loc=0.0, scale=1, size=1000)
vals2 = np.random.normal(loc=0.1, scale=1, size=1000)
print stats.ks_2samp(vals, vals2)
# 2-sample K-S test. output is the statistic D, and a p-value
In [10]:
vals1 = np.random.normal(loc=0.0, scale=1, size=100)
vals2 = np.random.normal(loc=0.3, scale=1, size=200)
plt.clf()
fig = plt.hist(vals1,bins=50,color='b')
fig = plt.hist(vals2,bins=50,color='r')
In [11]:
vsort1 = sorted(vals1)
vsort2 = sorted(vals2)
# empirical CDFs: normalized rank, running from 0 to 1
rank1norm = np.arange(size(vsort1)).astype(float)/(size(vsort1)-1)
rank2norm = np.arange(size(vsort2)).astype(float)/(size(vsort2)-1)
# interpolate sample 2's CDF onto sample 1's x values so the two can be differenced
rank2norm_interp1 = np.interp(vsort1,vsort2,rank2norm)
absdiff = np.absolute(rank1norm-rank2norm_interp1)
# where D is largest
indmax = np.argmax(absdiff)
d = absdiff[indmax]
v_of_max = vsort1[indmax]
# print indmax, d, v_of_max
plt.clf()
fig = plt.plot(vsort1,rank1norm,'b-')
fig = plt.plot(vsort2,rank2norm,'r-')
fig = plt.plot([v_of_max,v_of_max],[rank1norm[indmax],rank2norm_interp1[indmax]],'k-')
The way the K-S $D$ is defined means that it is sensitive to the shape of the DF near the center of the distribution, not the tails. It also isn't sensitive to changes in the differential DF, like small regions that deviate (bumps, dropouts). There are other tests with different statistics: the Anderson-Darling test is more sensitive to the tails. This is scipy.stats.anderson.
The Mann-Whitney test is a related item that basically compares medians, for when you might care about the median but not the dispersion. It works by rank ordering the combination of the two samples and looking at the differences of the ranks that belong to sample 1 vs. the ranks belonging to sample 2. It's scipy.stats.mannwhitneyu.
In [13]:
x, y = np.random.normal(0, 1, size=(2,1000))
z = np.random.normal(0.1, 1, size=1000)
print stats.mannwhitneyu(x, y)
print stats.mannwhitneyu(x, z)
# Mann-Whitney U test. output is the U number, and p-value
Yet another test is the Wilcoxon signed rank test. (Apparently, Wilcoxon didn't have any friends to share his test name.) This is for when you have two samples that are paired, like before/after. It computes the differences in pairs and looks at the ranks of positive and negative differences.
In [14]:
x, y = np.random.normal(0, 1, size=(2,1000))
z = np.random.normal(0.1, 1, size=1000)
print stats.wilcoxon(x, y)
print stats.wilcoxon(x, z)
The lesson of this section is that there are no obviously good methods for this problem. There's no 2-d equivalent to a K-S test because there's no unique cumulative distribution in 2-d. They don't mention this, but sometimes people try to make a 2-D K-S test in the astro literature. I suspect most of these are flawed.
They give an example due to Peacock and to Fasano & Franceschini that involves dividing your sample into quadrants around points, looking at the differences in counts, and computing some statistic. This has some behavior that is related to a correlation coefficient. It seems a little bit of a hack.
(My opinion) If you have data that is more continuous, like two images, you can calculate a cross-correlation between them.
Here we can ask if we can reject the null hypothesis that a sample is drawn from some Gaussian distribution. The mean and variance are allowed to vary; it is Gaussianity itself that is being tested.
You can estimate the mean and standard deviation $\mu$ and $\sigma$ from the data, but the behavior of the test statistics is often different if $\mu$ or $\sigma$ are estimated rather than known in advance, since there is an uncertainty on those values.
Aside: This is a subtle point that can also muck up other statistical tests - it's pretty common in tests of the kind "Is such and such N sigma off?" Often these are examples of "z-tests" with assumed Gaussian rms, but if you have to estimate the rms from a small sample, it's an issue. See the Wikipedia article, http://en.wikipedia.org/wiki/Z-test
You can use an Anderson-Darling or K-S test to compare to a Gaussian distribution, being careful about the p-value, since computing $\mu$ or $\sigma$ from the data changes the mapping between statistic and p-value. There is also a Shapiro-Wilk test that is sensitive to tails.
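As a minimal sketch of that subtlety (this cell is illustrative, not from the book): the naive thing is to standardize by the sample mean and standard deviation and then run stats.kstest against "norm", remembering that the resulting p-value is then only approximate.
In [ ]:
import numpy as np
from scipy import stats

x = np.random.normal(loc=0.5, scale=2.0, size=1000)

# naive approach: standardize using mu and sigma estimated from the data itself
xs = (x - x.mean()) / x.std(ddof=1)
print(stats.kstest(xs, "norm"))
# caveat: because mu and sigma came from the same data, the nominal p-value is
# only approximate; the corrected version of this test is known as Lilliefors' test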
In [15]:
x, y = np.random.normal(0, 1, size=(2,1000))
z = np.random.normal(0.1, 1, size=1000)
print stats.anderson(x, 'norm')
print stats.shapiro(x)
Often one wants to detect non-Gaussian outliers. One way is to compare the standard deviation $s$ and the interquartile-range-based width estimate $\sigma_G = 0.7413(q_{75} - q_{25})$. The latter is robust to outliers, so $s/\sigma_G > 1$ indicates heavy tails, and various numbers can be defined to get you a p-value for rejecting Gaussianity.
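A minimal sketch of that comparison, computing $\sigma_G$ directly from percentiles with numpy (astroML also provides a median_sigmaG helper, used in the Figure 4.7 code below):
In [ ]:
import numpy as np

np.random.seed(0)
# a Gaussian core with a few gross outliers mixed in
x = np.concatenate([np.random.normal(0, 1, 1000),
                    np.random.normal(0, 10, 20)])

s = np.std(x, ddof=1)                 # ordinary standard deviation
q25, q75 = np.percentile(x, [25, 75])
sigmaG = 0.7413 * (q75 - q25)         # robust width from the interquartile range

print("s = %.3f" % s)
print("sigma_G = %.3f" % sigmaG)
print("s/sigma_G = %.2f (>1 suggests heavy tails)" % (s / sigmaG))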
Insert Figure 4.7 here or see below.
This is a pretty tough question, because bimodality is difficult to define. If you take a histogram and bin it too coarsely, you can make bimodality wash out, and if you bin it too finely, you can make the histogram noisy and have lots of peaks. They punt the question to Section 5.7.3.
If you have a sample that is not rejected for Gaussianity, you can use classical statistical tests to compare means and variances. These are more efficient than the nonparametric methods, perhaps not by much, but they are very common in the literature, so it's good to have heard of the t-test and F-test.
If you have two Gaussian datasets $x_1$ and $x_2$ that have the same $\sigma$ and want to know if they have different means, you can just compute the means and standard errors and ask how significant the difference $\Delta = \bar{x}_1 - \bar{x}_2$ is in terms of the errors summed in quadrature, using the Gaussian error function erf().
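A minimal sketch of that calculation with made-up data, assuming the $\sigma$'s are known exactly (the two-sided tail probability from stats.norm.sf is equivalent to the erf expression):
In [ ]:
import numpy as np
from scipy import stats

np.random.seed(1)
sigma1, sigma2 = 1.0, 1.0                    # treated as KNOWN here
x1 = np.random.normal(0.0, sigma1, 500)
x2 = np.random.normal(0.2, sigma2, 500)

delta = x1.mean() - x2.mean()
err = np.sqrt(sigma1**2 / len(x1) + sigma2**2 / len(x2))  # errors added in quadrature
z = delta / err
p = 2 * stats.norm.sf(abs(z))                # two-sided Gaussian tail probability
print("Delta = %.3f +/- %.3f, z = %.2f, p = %.3g" % (delta, err, z, p))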
However, if you have to calculate the $\sigma_1$ and $\sigma_2$ from the data, then the added uncertainty means that the difference/error is not a Gaussian distribution! It is a Student's $t$ distribution which has heavier tails (Section 3.3.8). Everybody casually screws this up, including me. The distribution depends on the degrees of freedom $N_1+N_2-2$, and approaches Gaussian for large N, thankfully.
If you can't assume that the two samples have the same sigma, then you should use a variant called Welch's $t$ test. This would happen if they weren't coming from the same parent distribution.
scipy.stats includes functions ttest_rel (paired samples), ttest_ind (independent samples from a parent distribution), and ttest_1samp (1 sample against a known mean).
Aside: "Student's t" was published in 1908 by a Guinness research chemist who developed it to compare samples of stout, monitoring the brewing process. He was on a research leave in a biology lab, but had to publish under the pseudonym Student. Also, you had me at "Guinness research chemist."
In [17]:
x, y = np.random.normal(0, 1, size=(2,1000))
z = np.random.normal(0.1, 1, size=1000)
print stats.ttest_ind(x,y)
print stats.ttest_ind(x,z)
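And a quick sketch of Welch's variant, assuming your scipy is recent enough to have the equal_var keyword on ttest_ind:
In [ ]:
import numpy as np
from scipy import stats

a = np.random.normal(0.0, 1.0, size=200)
b = np.random.normal(0.3, 3.0, size=50)        # different variance and sample size

print(stats.ttest_ind(a, b))                   # Student's t: assumes equal variances
print(stats.ttest_ind(a, b, equal_var=False))  # Welch's t-test: does not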
The F-test is similarly used when you can assume Gaussianity and you want to test if a difference in variance between two samples is significant. The F statistic is the ratio of variances $F = s_1^2 / s_2^2$. Be careful with this as it is probably sensitive to the assumption of Gaussianity.
Note that the book says to use scipy.stats.f_oneway, but actually that does an F-test comparing variance of a full sample to variance of subgroups to detect differences in the MEAN. This is called an ANOVA (analysis of variance) and is very common in medical or social sciences, but I never see it used in astronomy.
In [18]:
x, y = np.random.normal(0, 1, size=(2,1000))
z = np.random.normal(0.3, 1, size=1000)
z2 = np.random.normal(0, 3, size=1000)
print stats.f_oneway(x,y)
print stats.f_oneway(x,z)
print stats.f_oneway(x,z2)
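For the variance-ratio test itself there isn't a one-line scipy function here, so below is a minimal sketch using the F distribution in scipy.stats.f (f_test_variances is a made-up helper name). scipy's stats.bartlett and stats.levene are built-in alternatives for testing equality of variances, the latter being less sensitive to non-Gaussianity.
In [ ]:
import numpy as np
from scipy import stats

def f_test_variances(a, b):
    # two-sided F-test for equal variances (illustrative helper; assumes Gaussianity)
    s1, s2 = np.var(a, ddof=1), np.var(b, ddof=1)
    F = s1 / s2
    dof1, dof2 = len(a) - 1, len(b) - 1
    # two-sided p-value: twice the smaller tail of the F distribution
    p = 2 * min(stats.f.cdf(F, dof1, dof2), stats.f.sf(F, dof1, dof2))
    return F, min(p, 1.0)

x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1, 1000)
z2 = np.random.normal(0, 3, 1000)
print(f_test_variances(x, y))    # same variance: large p
print(f_test_variances(x, z2))   # very different variance: tiny p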
Histograms are a type of density estimation of a discrete sample. Histograms can be evil because of dependence on binning, but they are really common. For a more complex look at this problem you can look at kernel estimation: e.g. plopping a small gaussian at the location of each data point and adding them up to produce the estimated density function.
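A minimal sketch of that kernel idea using scipy.stats.gaussian_kde, which picks a bandwidth automatically (fancier options exist elsewhere, e.g. scikit-learn's KernelDensity):
In [ ]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.random.normal(0, 1, size=1000)
kde = stats.gaussian_kde(x)          # a Gaussian kernel at each point, summed and normalized
grid = np.linspace(-4, 4, 200)

plt.clf()
# normed=True is called density=True in newer matplotlib
plt.hist(x, bins=50, normed=True, histtype='step', label='histogram')
plt.plot(grid, kde(grid), 'k-', label='gaussian_kde')
plt.legend()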
Choosing the right bin width for a histogram depends on the sample size and the size of features you might hope to see. Essentially you are modeling the data DF as a piecewise constant function. One rule of thumb is Scott's rule:
$\Delta_b = 3.5\sigma / N^{1/3}$
This has some nice properties related to Gaussians. A non-Gaussian generalization is the Freedman-Diaconis rule:
$\Delta_b = 2(q_{75}-q_{25}) / N^{1/3}$
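A quick sketch computing both rules of thumb by hand for a sample (astroML's hist, used a few cells below, can apply bins='freedman' and similar options for you):
In [ ]:
import numpy as np

x = np.random.normal(0, 1, size=1000)
N = len(x)

sigma = np.std(x, ddof=1)
q25, q75 = np.percentile(x, [25, 75])

scott_width = 3.5 * sigma / N ** (1. / 3.)       # Scott's rule
fd_width = 2.0 * (q75 - q25) / N ** (1. / 3.)    # Freedman-Diaconis rule

print("Scott bin width: %.3f" % scott_width)
print("Freedman-Diaconis bin width: %.3f" % fd_width)
print("F-D number of bins: %d" % np.ceil((x.max() - x.min()) / fd_width))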
Numpy and matplotlib can compute-and-plot histograms easily with pyplot.hist, or compute the bin values with np.histogram. If you have already computed a set of histogram values, you can plot it as a bar chart with pyplot.bar, although I'd hope there is a cleaner way to do this.
Note that this is an example where the pyplot default looks snazzy but crap. It's plotting the histogram as a bar chart rather than a function (excessive ink between the bars), and is unnecessarily opaque. You can improve it with histtype='step'. Fill color without the bars is histtype='stepfilled'.
In [20]:
plt.clf()
x = np.random.normal(0, 1, size=1000)
#fig = plt.hist(x, bins=50)
fig = plt.hist(x, bins=50, histtype='step')
In [21]:
plt.clf()
x = np.random.normal(0, 1, size=1000)
counts, xbins = np.histogram(x, bins=50)
# xbins is values of bin edges, not centers, and has 51 elements
# This is a hack for which there should be a better way.
binwidth = (xbins[-1] - xbins[0]) / (size(xbins)-1)
fig = plt.bar(xbins[0:-1]+binwidth/2, counts, width=binwidth)
In [22]:
plt.clf()
from astroML.plotting import hist
fig = hist(x, bins='freedman')
#fig = hist(x, bins='freedman', histtype='step')
Histogram bins ought to have units of either counts $n_i$ per bin width $\Delta_b$, or counts normalized to total $n_i/N_{tot}$ per bin width. If you get the units right, integrating the histogram should give a value that doesn't depend on choice of bin width.
It is typical to assign uncertainties to each bin's counts, using the Gaussian approximation to Poisson errors, "error" on $n_i$ $= \sqrt{n_i}$. This will go bad at small counts per bin and one should use the actual Poisson errors, or at least a better approximation - the error on 0 isn't 0. (My note: the Gaussian approximation gets bad at less than 5 counts/bin.)
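A minimal sketch of a normalized histogram with $\sqrt{n_i}$ error bars, dividing by $N_{tot}$ and the bin width so the result integrates to 1:
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(0, 1, size=1000)
counts, edges = np.histogram(x, bins=30)
widths = np.diff(edges)
centers = 0.5 * (edges[:-1] + edges[1:])

norm = counts.sum() * widths            # N_tot * Delta_b for each bin
density = counts / norm                 # normalized so the histogram integrates to 1
err = np.sqrt(counts) / norm            # Gaussian approximation to Poisson errors
# note: empty bins get error = 0 here, which is exactly the failure mode noted above

plt.clf()
plt.errorbar(centers, density, yerr=err, fmt='.')
plt.xlabel('x')
plt.ylabel('normalized counts per unit x')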
Luminosity functions are a type of density function estimation, typically from a set of discrete measurements. So while this may seem like a sharp transition, it is related to histograms.
The major problem in luminosity function estimation is that we usually have some kind of flux limit, so the data is truncated (Section 4.2.7). If you have a volume-limited sample then calculating the number of objects per brightness per volume is easier, but this is rare.
If you have a selection function $S(x)$ that depends only on the variable of interest $x$, then you can do a completeness correction from the observed $f(x)$ to the true $h(x)$: $$h(x) = f(x)/S(x).$$
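A trivial sketch of that correction on binned, made-up counts with an assumed known $S(x)$:
In [ ]:
import numpy as np

# toy completeness correction: everything here is made up for illustration
x_centers = np.linspace(0.5, 9.5, 10)                                   # bin centers in x
f_obs = np.array([50, 48, 45, 40, 33, 25, 17, 10, 5, 2], dtype=float)   # observed counts
S = np.clip(1.2 - 0.1 * x_centers, 0.05, 1.0)                           # selection function S(x), assumed known
h_true = f_obs / S                                                      # completeness-corrected estimate
print(h_true)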
However if your selection function depends on some other variable $y$ then you have to make corrections for values of $x$ that could not be observed. For the LF case think of $x$ as luminosity and $y$ as distance.
The classic way of correcting this in astronomy is the $1/V_{max}$ method, where each object is weighted by $1/Volume$ using the volume within which it could have been observed. This section discusses an alternate method, the Lynden-Bell $C^-$ method, which makes an in-principle less restrictive set of assumptions. $1/V_{max}$ assumes that the LF is constant within the volume probed, while $C^-$ assumes that it is separable, i.e. $$n(x,y) = \Psi(x) \rho(y).$$
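A minimal toy sketch of the $1/V_{max}$ weighting for a Euclidean, flux-limited sample (everything here is made up for illustration; real surveys need K-corrections, sky coverage, evolution, etc.):
In [ ]:
import numpy as np

# toy flux-limited sample in Euclidean space; all numbers are made up
np.random.seed(2)
L = 10 ** np.random.uniform(0, 2, size=5000)       # true luminosities
d = np.random.uniform(0.1, 10.0, size=5000)        # true distances
flux = L / (4 * np.pi * d ** 2)
flim = 0.05                                        # survey flux limit
sel = flux > flim                                  # the observed (truncated) sample

# largest distance at which each observed object would still be detected,
# capped at the survey's own distance limit
d_max = np.minimum(np.sqrt(L[sel] / (4 * np.pi * flim)), 10.0)
V_max = (4. / 3.) * np.pi * d_max ** 3
weights = 1.0 / V_max

# luminosity function estimate: sum of 1/V_max weights in luminosity bins,
# divided by the bin widths
bins = np.logspace(0, 2, 15)
lf, _ = np.histogram(L[sel], bins=bins, weights=weights)
lf /= np.diff(bins)
print(lf)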
In [16]:
# Figure 4.7
#
# Author: Jake VanderPlas
# License: BSD
# The figure produced by this code is published in the textbook
# "Statistics, Data Mining, and Machine Learning in Astronomy" (2013)
# For more information, see http://astroML.github.com
# To report a bug or issue, use the following forum:
# https://groups.google.com/forum/#!forum/astroml-general
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
#----------------------------------------------------------------------
# This function adjusts matplotlib settings for a uniform feel in the textbook.
# Note that with usetex=True, fonts are rendered with LaTeX. This may
# result in an error if LaTeX is not installed on your system. In that case,
# you can set usetex to False.
from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=8, usetex=True)
from astroML.stats import mean_sigma, median_sigmaG
# create some distributions
np.random.seed(1)
normal_vals = stats.norm(loc=0, scale=1).rvs(10000)
dual_vals = stats.norm(0, 1).rvs(10000)
dual_vals[:4000] = stats.norm(loc=3, scale=2).rvs(4000)
x = np.linspace(-4, 10, 1000)
normal_pdf = stats.norm(0, 1).pdf(x)
dual_pdf = 0.6 * stats.norm(0, 1).pdf(x) + 0.4 * stats.norm(3, 2).pdf(x)
vals = [normal_vals, dual_vals]
pdf = [normal_pdf, dual_pdf]
xlims = [(-4, 4), (-4, 10)]
#------------------------------------------------------------
# Compute the statistics and plot the results
fig = plt.figure(figsize=(5, 7))
fig.subplots_adjust(left=0.13, right=0.95,
bottom=0.06, top=0.95,
hspace=0.1)
for i in range(2):
    ax = fig.add_subplot(2, 1, 1 + i)  # 2 x 1 subplot
    # compute some statistics
    A2, sig, crit = stats.anderson(vals[i])
    D, pD = stats.kstest(vals[i], "norm")
    W, pW = stats.shapiro(vals[i])
    mu, sigma = mean_sigma(vals[i], ddof=1)
    median, sigmaG = median_sigmaG(vals[i])
    N = len(vals[i])
    Z1 = 1.3 * abs(mu - median) / sigma * np.sqrt(N)
    Z2 = 1.1 * abs(sigma / sigmaG - 1) * np.sqrt(N)
    print 70 * '_'
    print " Kolmogorov-Smirnov test: D = %.2g p = %.2g" % (D, pD)
    print " Anderson-Darling test: A^2 = %.2g" % A2
    print " significance | critical value "
    print " --------------|----------------"
    for j in range(len(sig)):
        print " %.2f | %.1f%%" % (sig[j], crit[j])
    print " Shapiro-Wilk test: W = %.2g p = %.2g" % (W, pW)
    print " Z_1 = %.1f" % Z1
    print " Z_2 = %.1f" % Z2
    # plot a histogram
    ax.hist(vals[i], bins=50, normed=True, histtype='stepfilled', alpha=0.5)
    ax.plot(x, pdf[i], '-k')
    ax.set_xlim(xlims[i])
    # print information on the plot
    info = "Anderson-Darling: $A^2 = %.2f$\n" % A2
    info += "Kolmogorov-Smirnov: $D = %.2g$\n" % D
    info += "Shapiro-Wilk: $W = %.2g$\n" % W
    info += "$Z_1 = %.1f$\n$Z_2 = %.1f$" % (Z1, Z2)
    ax.text(0.97, 0.97, info,
            ha='right', va='top', transform=ax.transAxes)
    if i == 0:
        ax.set_ylim(0, 0.55)
    else:
        ax.set_ylim(0, 0.35)
    ax.set_xlabel('$x$')
    ax.set_ylabel('$p(x)$')
plt.show()