In [68]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss
import seaborn  # imported for its plot-styling side effects
scipy Kolmogorov–Smirnov test

A short note on how to use the scipy Kolmogorov–Smirnov test, because quite frankly the documentation was not compelling and I always forget how the scipy.stats module handles distribution arguments.
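As a quick reminder of how those arguments work (a sketch added here, not one of the original cells): shape parameters can either be passed explicitly on every call, or frozen into a distribution object once.

a, b = 0.5, 20
ss.beta.cdf(0.1, a, b)    # shape parameters passed on each call
frozen = ss.beta(a, b)    # or frozen into the distribution object once
frozen.cdf(0.1)           # same result, no shape arguments needed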
Using a beta distribution, generate rvs from it and save them as d1; then make a second set of data containing the first set with its last 50 points replaced by 50 draws from a different distribution: this means d2 is not beta-distributed. We plot the Gaussian KDEs to show how large the effect is graphically.
In [65]:
xvals = np.linspace(0, 0.5, 500)
args = [0.5, 20]                               # beta shape parameters a, b
d1 = ss.beta.rvs(args[0], args[1], size=1000)  # 1000 beta-distributed samples
noise = ss.norm.rvs(0.4, 0.1, size=50)         # 50 normal draws to contaminate d2
# d2 is d1 with its last 50 points replaced by the normal draws
d2 = np.append(d1[:len(d1) - len(noise)], noise)

fig, (ax1, ax2) = plt.subplots(nrows=2)
ax1.plot(xvals, ss.gaussian_kde(d1).evaluate(xvals), "-k", label="d1")
ax1.legend()
ax2.plot(xvals, ss.gaussian_kde(d2).evaluate(xvals), "-k", label="d2")
ax2.legend()
plt.show()
The trick here is to give scipy.stats.kstest the data, then a callable implementing the cumulative distribution function. There is a way to just pass in the distribution's name as a string, but I was unconvinced that the arguments were going to the proper place, so I prefer this method.
In [66]:
beta_cdf = lambda x: ss.beta.cdf(x, args[0], args[1])
ss.kstest(d1, beta_cdf)
Out[66]:
In [67]:
ss.kstest(d2, beta_cdf)
Out[67]:
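For reference, the string form mentioned above should look like this, with the shape parameters forwarded through the args tuple (a sketch, not one of the cells actually run here):

ss.kstest(d1, "beta", args=(args[0], args[1]))  # distribution named as a string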
Clearly this works as expected: for d1 we get a fairly large $p$-value, suggesting we should accept the null hypothesis (the data is beta-distributed with the given $a$ and $b$), while for d2 we get a low $p$-value, suggesting we should discard the null hypothesis (which we know is the right decision, since that was the way the data was built).
It remains unclear how to choose the $p$ threshold.
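A common convention (my assumption here, not something the test itself dictates) is a significance level like $\alpha = 0.05$; a minimal sketch of the resulting decision rule:

alpha = 0.05  # conventional significance level; an assumed choice, not dictated by the test
stat, pvalue = ss.kstest(d2, beta_cdf)
if pvalue < alpha:
    print("reject the null: the data do not look beta-distributed")
else:
    print("fail to reject the null")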