Kolmogorov-Smirnov test

In this notebook we will illustrate the use of the Kolmogorov-Smirnov test (K-S test) using functions from the SciPy stats module. In particular, we will look at the influence of the sample size on the test results.

Table of contents

Preamble

Data setup

K-S test

Other tests

Odds and ends

Preamble

The first step is to set up the computational environment. For that we will start by importing the major packages needed for this notebook:


In [1]:
import sys
import math
import numpy as np
import scipy as sp
import matplotlib as mpl
import pandas as pd

Let us check the versions being used:


In [2]:
print('System: {}'.format(sys.version))
for package in (np, sp, mpl, pd):
    print('Package: {} {}'.format(package.__name__, package.__version__))


System: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
Package: numpy 1.11.2
Package: scipy 0.18.1
Package: matplotlib 1.5.3
Package: pandas 0.19.0

Now we import the stats module from the scipy package, together with the additional modules required by this notebook:


In [3]:
from scipy import stats
import matplotlib.pyplot as plt

To make the plots visible inside the notebook, we need the following Jupyter "magic" command:


In [4]:
%matplotlib inline

Now we are ready to start exploring.

Back to top

Data setup

First of all, we will define a normal random variable (RV) with mean 1.0 and standard deviation 0.1:


In [5]:
mean = 1.0
std = 0.1
rv = stats.norm(loc=mean, scale=std)

This is said to be a 'frozen' RV object since the location and scale parameters are given. Let us plot the probability density function (pdf) as well as the cumulative distribution function (cdf), between the 0.001 and 0.999 quantiles, to check the random variable:


In [6]:
x = np.linspace(rv.ppf(0.001), rv.ppf(0.999), 1000)
fig, ax = plt.subplots(1, 1)
ax.plot(x, rv.pdf(x), label='frozen pdf')
ax.plot(x, rv.cdf(x), lw=2, label='frozen cdf')
ax.axhline(0.5, ls=':')
ax.axhline(1.0, ls=':')
ax.axvline(mean, ls='-.', label='mean')
ax.legend(loc='best', frameon=False)
plt.show()
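
As a side note, the frozen object carries its location and scale, so its methods can be called without repeating them; below is a quick check of its moments and median, together with the equivalent calls on the unfrozen stats.norm:


In [ ]:
# the frozen RV already knows loc and scale
print(rv.mean(), rv.std(), rv.ppf(0.5))
# equivalent calls on the unfrozen distribution, passing loc and scale explicitly
print(stats.norm.mean(loc=mean, scale=std), stats.norm.std(loc=mean, scale=std),
      stats.norm.ppf(0.5, loc=mean, scale=std))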


Now we will generate nine random samples, with sizes ranging from 10 to 5000, and look at their histograms together with the normal pdf to see how well they match:


In [7]:
Nsamples = [u*d for d in (10,100,1000) for u in (1,2,5)]
ysamples = []
fig, ax = plt.subplots(3, 3, sharex=True, sharey=True, figsize=(8,6))
for row in range(3):
    for col in range(3):
        N = Nsamples[3*row+col]
        ax[row,col].plot(x, rv.pdf(x), 'k-', lw=2)
        y = rv.rvs(size=N)
        b = int(math.sqrt(N))
        ysamples.append(y)
        # 'normed' is valid in the Matplotlib 1.5 used here; newer versions use density=True instead
        ax[row,col].hist(y, bins=b, normed=True, histtype='stepfilled', alpha=0.2)
        ax[row,col].set_title('{} samples, {} bins'.format(N, b))
plt.show()


The number of bins grows with the sample size (we used int(sqrt(N)) bins) so that the discretization becomes finer as more data are available. As can be seen, the larger the random sample, the more closely its histogram matches the normal pdf. Similarly, we can plot the random samples, after sorting them in ascending order (the empirical cumulative distribution function), together with the normal cdf:


In [8]:
fig, ax = plt.subplots(3, 3, sharex=True, sharey=True, figsize=(8,6))
for row in range(3):
    for col in range(3):
        N = Nsamples[3*row+col]
        y = ysamples[3*row+col]
        ax[row,col].plot(x, rv.cdf(x), 'k-', lw=2)
        yn = np.sort(y)
        xn = np.linspace(0., 1., num=N)
        ax[row,col].plot(yn, xn, 'o', alpha=0.2)
        ax[row,col].axhline(0.5, ls=':')
        ax[row,col].axvline(mean, ls=':')
        ax[row,col].set_title('{} samples'.format(N))
plt.show()


As expected after seeing the histogram plots, the larger the sample size, the more closely the ordered sample follows the normal cdf.
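
As a side note, the vertical coordinates in these plots come from np.linspace(0., 1., num=N), which places the k-th ordered sample at height (k-1)/(N-1) rather than at the usual empirical-cdf height k/N. For these sample sizes the difference is invisible, but the conventional empirical cdf is easy to build explicitly; a minimal sketch for the first (10-sample) case:


In [ ]:
# conventional empirical cdf: a jump of 1/N at each ordered sample
y0 = np.sort(ysamples[0])
ecdf0 = np.arange(1, len(y0) + 1) / len(y0)
print(np.column_stack((y0, ecdf0)))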

K-S test

Finally, let us compute the K-S test for goodness of fit:


In [9]:
Dsamples = []
psamples = []
for N,y in zip(Nsamples, ysamples):
    D, pvalue = stats.kstest(y, 'norm', args=(mean, std))
    Dsamples.append(D)
    psamples.append(pvalue)
    print('{:4d} samples: D={}, pvalue={}'.format(N, D, pvalue))


  10 samples: D=0.2753656378887702, pvalue=0.3672249663303915
  20 samples: D=0.16207336255334687, pvalue=0.6311904860261404
  50 samples: D=0.12271525544264794, pvalue=0.4093877060599882
 100 samples: D=0.052767400233737355, pvalue=0.9434452597119428
 200 samples: D=0.04967174052786183, pvalue=0.7215677555413473
 500 samples: D=0.04359284400909841, pvalue=0.2904804671538115
1000 samples: D=0.04504318112027783, pvalue=0.03352035385205698
2000 samples: D=0.024740356806760677, pvalue=0.17003506437368565
5000 samples: D=0.01355273829933401, pvalue=0.31737455312039525
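
For reference, the K-S statistic D is simply the largest vertical distance between the empirical cdf and the theoretical cdf, i.e. between the two curves compared in the previous plots. A minimal sketch of that computation for the largest sample, which should match the D reported above for 5000 samples:


In [ ]:
# two-sided K-S statistic: D = max(D+, D-), the largest gap between the
# empirical cdf (with jumps of 1/N) and the theoretical cdf
yn = np.sort(ysamples[-1])
N = len(yn)
cdf_vals = rv.cdf(yn)
d_plus = (np.arange(1, N + 1) / N - cdf_vals).max()
d_minus = (cdf_vals - np.arange(0, N) / N).max()
print(max(d_plus, d_minus))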

Let us plot these results to see how scattered they are:


In [10]:
fig, ax = plt.subplots(1, 1)
ax.scatter(psamples, Dsamples, s=Nsamples, alpha=0.2)
for p,D,N in zip(psamples, Dsamples, Nsamples):
    ax.text(p, D, str(N), ha='center', va='center')
ax.set_xlabel('p-value')
ax.set_ylabel('D')
ax.set_title('K-S test results')
plt.show()


It is curious to see that the highest p-value does not occur for the smallest K-S statistic (D). In fact, as the size of the sample increases, the p-value shows no tendency to increase, even though both the histogram plot and the ordered-sample plot fit the normal pdf and cdf more and more closely. This should not be surprising: the p-value depends (asymptotically) on sqrt(N)*D rather than on D alone, and when the data really do come from the hypothesized distribution the p-value is approximately uniformly distributed on [0, 1], whatever the sample size.
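
To illustrate this point, the following side experiment repeats the K-S test on many fresh samples of two different sizes and summarizes the resulting p-values, which should be roughly uniform on [0, 1] in both cases:


In [ ]:
# under the null hypothesis the p-value is roughly uniform on [0, 1],
# regardless of the sample size
np.random.seed(0)  # fixed seed, for repeatability of this side experiment only
for N in (100, 5000):
    pvals = np.array([stats.kstest(rv.rvs(size=N), 'norm', args=(mean, std))[1]
                      for _ in range(200)])
    print('N={:4d}: mean p-value {:.2f}, fraction below 0.05 {:.3f}'.format(
        N, pvals.mean(), (pvals < 0.05).mean()))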

Other tests

For comparison, we will perform two other tests. First, the Anderson-Darling test (A-D test), which tests whether the data come from a particular distribution (the normal distribution by default in scipy.stats.anderson, with location and scale estimated from the sample rather than fixed beforehand):


In [11]:
for N,y in zip(Nsamples, ysamples):
    A2, critical_values, significance_level = stats.anderson(y)
    print('{:4d} samples: A2={}'.format(N, A2), critical_values, significance_level)


  10 samples: A2=0.5069708309412295 [ 0.501  0.57   0.684  0.798  0.95 ] [ 15.   10.    5.    2.5   1. ]
  20 samples: A2=0.27668806915380273 [ 0.506  0.577  0.692  0.807  0.96 ] [ 15.   10.    5.    2.5   1. ]
  50 samples: A2=0.2909090175886888 [ 0.538  0.613  0.736  0.858  1.021] [ 15.   10.    5.    2.5   1. ]
 100 samples: A2=0.1849651760906994 [ 0.555  0.632  0.759  0.885  1.053] [ 15.   10.    5.    2.5   1. ]
 200 samples: A2=0.4283317149499055 [ 0.565  0.644  0.772  0.901  1.071] [ 15.   10.    5.    2.5   1. ]
 500 samples: A2=0.3411598750071221 [ 0.571  0.651  0.781  0.911  1.083] [ 15.   10.    5.    2.5   1. ]
1000 samples: A2=0.46830249492018083 [ 0.574  0.653  0.784  0.914  1.088] [ 15.   10.    5.    2.5   1. ]
2000 samples: A2=0.20674428709617132 [ 0.575  0.655  0.785  0.916  1.09 ] [ 15.   10.    5.    2.5   1. ]
5000 samples: A2=0.5176316371325811 [ 0.576  0.655  0.786  0.917  1.091] [ 15.   10.    5.    2.5   1. ]
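
The A-D result is read by comparing the statistic A2 against the critical values: normality is rejected at a given significance level whenever A2 exceeds the corresponding critical value. A minimal sketch of that comparison, re-running stats.anderson on the largest sample:


In [ ]:
# compare the A-D statistic against the critical values
# (the significance levels returned by stats.anderson are in percent)
A2, critical_values, significance_levels = stats.anderson(ysamples[-1])
for cv, sl in zip(critical_values, significance_levels):
    verdict = 'rejected' if A2 > cv else 'not rejected'
    print('{:4.1f}% level (critical value {:.3f}): normality {}'.format(sl, cv, verdict))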

Second, since we are using a normal random variable, the Shapiro-Wilk test (S-W test) for normality:


In [12]:
Wsamples = []
psamples = []
for N,y in zip(Nsamples, ysamples):
    W, pvalue = stats.shapiro(y)
    Wsamples.append(W)
    psamples.append(pvalue)
    print('{:4d} samples: W={}, pvalue={}'.format(N, W, pvalue))


  10 samples: W=0.9056932330131531, pvalue=0.2526984214782715
  20 samples: W=0.9664967656135559, pvalue=0.6798665523529053
  50 samples: W=0.9820287823677063, pvalue=0.640101432800293
 100 samples: W=0.9936828017234802, pvalue=0.92562335729599
 200 samples: W=0.9884196519851685, pvalue=0.10419952124357224
 500 samples: W=0.9973545074462891, pvalue=0.6103889346122742
1000 samples: W=0.9976917505264282, pvalue=0.17618140578269958
2000 samples: W=0.9993303418159485, pvalue=0.7211231589317322
5000 samples: W=0.9994010925292969, pvalue=0.1019936203956604

The A-D test results are not easy to plot (there is no single p-value to pair with the statistic), but the S-W test results can be plotted in the same way as the K-S ones:


In [13]:
fig, ax = plt.subplots(1, 1)
ax.scatter(psamples, Wsamples, s=Nsamples, alpha=0.2)
for p,W,N in zip(psamples, Wsamples, Nsamples):
    ax.text(p, W, str(N), ha='center', va='center')
ax.set_xlabel('p-value')
ax.set_ylabel('W')
ax.set_title('S-W test results')
plt.show()


Both tests point in the same direction: the A-D statistic stays below the 5% critical value and the S-W p-values are all above 0.05 for every sample size, so normality is never rejected at that level, and, as with the K-S test, the p-values show no clear trend with the sample size.
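
As a final side note, pandas (imported in the preamble but not used so far) is handy for collecting the results side by side. A minimal sketch using the lists kept above (psamples was reused for the S-W p-values, so only those are tabulated):


In [ ]:
# gather the statistics kept in the lists above into a single table
results = pd.DataFrame({'K-S D': Dsamples,
                        'S-W W': Wsamples,
                        'S-W p-value': psamples},
                       index=pd.Index(Nsamples, name='N'))
print(results)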

Back to top

Odds and ends

This notebook was created by Paulo Xavier Candeias.

Back to top