In this notebook we will illustrate the use of the Kolmogorov-Smirnov test (K-S test) using functions from the SciPy stats module. In particular, we will look at the influence of the sample size on the test results.
The first step is to set up the computational environment. For that we will start by importing the major packages needed for this notebook:
In [1]:
import sys
import math
import numpy as np
import scipy as sp
import matplotlib as mpl
import pandas as pd
Let us check the versions being used:
In [2]:
print('System: {}'.format(sys.version))
for package in (np, sp, mpl, pd):
    print('Package: {} {}'.format(package.__name__, package.__version__))
Now, we import the stats module from the scipy package, as well as the other modules required by this notebook:
In [3]:
from scipy import stats
import matplotlib.pyplot as plt
To make the plots visible inside the notebook, we need the following Jupyter "magic" command:
In [4]:
%matplotlib inline
Next, we define a normal random variable with a given mean and standard deviation:
In [5]:
mean = 1.0
std = 0.1
rv = stats.norm(loc=mean, scale=std)
This is said to be a 'frozen' RV object since the location and scale parameters are given. Let us plot the probability density function (pdf) as well as the cumulative distribution function (cdf), between percentiles 0.001 and 0.999, to check the random variable:
In [6]:
x = np.linspace(rv.ppf(0.001), rv.ppf(0.999), 1000)
fig, ax = plt.subplots(1, 1)
ax.plot(x, rv.pdf(x), label='frozen pdf')
ax.plot(x, rv.cdf(x), lw=2, label='frozen cdf')
ax.axhline(0.5, ls=':')
ax.axhline(1.0, ls=':')
ax.axvline(mean, ls='-.', label='mean')
ax.legend(loc='best', frameon=False)
plt.show()
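As a side note, the percentiles used above come from the percent point function (ppf), which is the inverse of the cdf. The following minimal sketch (reusing the frozen rv object defined above) checks that relationship:
# Sketch: the ppf is the inverse of the cdf for the frozen normal rv defined above.
for q in (0.001, 0.5, 0.999):
    xq = rv.ppf(q)  # value below which a fraction q of the probability mass lies
    print('q={:5.3f} -> x={:.4f} -> cdf(x)={:5.3f}'.format(q, xq, rv.cdf(xq)))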
Now we will generate nine random samples, with sizes ranging from 10 to 5000, and plot their histograms together with the normal pdf to see how well they match:
In [7]:
Nsamples = [u*d for d in (10,100,1000) for u in (1,2,5)]
ysamples = []
fig, ax = plt.subplots(3, 3, sharex=True, sharey=True, figsize=(8,6))
for row in range(3):
    for col in range(3):
        N = Nsamples[3*row+col]
        ax[row,col].plot(x, rv.pdf(x), 'k-', lw=2)
        y = rv.rvs(size=N)
        b = int(math.sqrt(N))
        ysamples.append(y)
        ax[row,col].hist(y, bins=b, density=True, histtype='stepfilled', alpha=0.2)
        ax[row,col].set_title('{} samples, {} bins'.format(N, b))
plt.show()
The number of histogram bins grows with the sample size (roughly the square root of the number of samples), so the discretization becomes finer as the sample grows. As can be seen, the larger the random sample, the better it fits the normal pdf. Similarly, we can plot each random sample, sorted in ascending order (its empirical cumulative distribution function), together with the normal cdf:
In [8]:
fig, ax = plt.subplots(3, 3, sharex=True, sharey=True, figsize=(8,6))
for row in range(3):
    for col in range(3):
        N = Nsamples[3*row+col]
        y = ysamples[3*row+col]
        ax[row,col].plot(x, rv.cdf(x), 'k-', lw=2)
        yn = np.sort(y)
        xn = np.linspace(0., 1., num=N)
        ax[row,col].plot(yn, xn, 'o', alpha=0.2)
        ax[row,col].axhline(0.5, ls=':')
        ax[row,col].axvline(mean, ls=':')
        ax[row,col].set_title('{} samples'.format(N))
plt.show()
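Note that the plotting positions used above (np.linspace from 0 to 1) are just one of several common conventions for the empirical cdf. A minimal sketch of the perhaps more usual step convention, i/N for the i-th ordered value, is shown below (the helper name ecdf is illustrative only):
# Sketch: empirical cdf of a sample, using the convention F_n(x_(i)) = i/N.
def ecdf(sample):
    yn = np.sort(sample)
    fn = np.arange(1, len(sample) + 1) / len(sample)
    return yn, fn

yn, fn = ecdf(ysamples[0])  # the smallest sample, with 10 values
print(np.column_stack((yn, fn)))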
Now, let us apply the K-S test to each of the random samples, testing them against the normal distribution with the given mean and standard deviation:
In [9]:
Dsamples = []
psamples = []
for N,y in zip(Nsamples, ysamples):
    D, pvalue = stats.kstest(y, 'norm', args=(mean, std))
    Dsamples.append(D)
    psamples.append(pvalue)
    print('{:4d} samples: D={}, pvalue={}'.format(N, D, pvalue))
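To make explicit what kstest computes, the following minimal sketch (the helper name ks_statistic is illustrative only) reproduces the K-S statistic D as the largest distance between the empirical cdf of each sample and the reference normal cdf:
# Sketch: D is the largest distance between the empirical cdf and the reference cdf.
def ks_statistic(sample, cdf):
    yn = np.sort(sample)
    n = len(sample)
    fn = np.arange(1, n + 1) / n   # empirical cdf just after each ordered value
    fn_prev = np.arange(0, n) / n  # empirical cdf just before each ordered value
    d_plus = np.max(fn - cdf(yn))
    d_minus = np.max(cdf(yn) - fn_prev)
    return max(d_plus, d_minus)

for N, y in zip(Nsamples, ysamples):
    print('{:4d} samples: D={}'.format(N, ks_statistic(y, rv.cdf)))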
Let us plot the K-S test results (D versus p-value), which turn out to be quite scattered:
In [10]:
fig, ax = plt.subplots(1, 1)
ax.scatter(psamples, Dsamples, s=Nsamples, alpha=0.2)
for p,D,N in zip(psamples, Dsamples, Nsamples):
    ax.text(p, D, str(N), ha='center', va='center')
ax.set_xlabel('p-value')
ax.set_ylabel('D')
ax.set_title('K-S test results')
plt.show()
It is curious that the highest p-value does not occur for the lowest K-S statistic (D). In fact, as the sample size increases, the p-value shows no tendency to increase, even though the histogram plots and the ordered-sample plots fit the normal pdf and cdf, respectively, increasingly well.
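This behaviour is expected: when the sample really does come from the reference distribution, the p-value of the K-S test is approximately uniformly distributed between 0 and 1, whatever the sample size, so larger samples make D smaller but do not make the p-value systematically larger. A minimal sketch of this, repeating the test 200 times (an arbitrary number) for a small and a large sample size:
# Sketch: under the null hypothesis the p-value is roughly uniform, regardless of N.
np.random.seed(0)  # fixed seed, only to make this illustration reproducible
for N in (10, 5000):
    pvalues = [stats.kstest(rv.rvs(size=N), 'norm', args=(mean, std))[1]
               for _ in range(200)]
    print('N={:4d}: mean p-value={:.3f}, fraction below 0.05={:.3f}'.format(
        N, np.mean(pvalues), np.mean(np.array(pvalues) < 0.05)))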
For comparison, we will perform two other tests. First, the Anderson-Darling test (A-D test), which tests whether the data come from a particular distribution:
In [11]:
for N,y in zip(Nsamples, ysamples):
    A2, critical_values, significance_level = stats.anderson(y)
    print('{:4d} samples: A2={}'.format(N, A2), critical_values, significance_level)
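Unlike kstest and shapiro, anderson does not return a p-value; instead, the statistic A2 has to be compared with the critical values returned for each significance level. A minimal sketch of that comparison, for the largest sample only:
# Sketch: normality is rejected at a given level when A2 exceeds the critical value.
A2, critical_values, significance_levels = stats.anderson(ysamples[-1])
for cv, sl in zip(critical_values, significance_levels):
    verdict = 'rejected' if A2 > cv else 'not rejected'
    print('significance {:4.1f}%: critical value {:.3f} -> {}'.format(sl, cv, verdict))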
Second, since we are using a normal random variable, the Shapiro-Wilk test (S-W test) for normality:
In [12]:
Wsamples = []
psamples = []
for N,y in zip(Nsamples, ysamples):
    W, pvalue = stats.shapiro(y)
    Wsamples.append(W)
    psamples.append(pvalue)
    print('{:4d} samples: W={}, pvalue={}'.format(N, W, pvalue))
The A-D test results are not easy to plot, but the S-W test results can be plotted in the same way as the K-S ones:
In [13]:
fig, ax = plt.subplots(1, 1)
ax.scatter(psamples, Wsamples, s=Nsamples, alpha=0.2)
for p,W,N in zip(psamples, Wsamples, Nsamples):
    ax.text(p, W, str(N), ha='center', va='center')
ax.set_xlabel('p-value')
ax.set_ylabel('W')
ax.set_title('S-W test results')
plt.show()
Both tests show that ... (to be concluded!).