freqopttest
This notebook will introduce you to freqopttest, a Python package for nonparametric, interpretable two-sample testing, as described in
Interpretable Distribution Features with Maximum Testing Power
Wittawat Jitkrittum, Zoltán Szabó, Kacper Chwialkowski, Arthur Gretton
NIPS 2016.
The paper can be found here.
Before running this notebook, make sure that you have freqopttest
included in Python's search path. See here for how to do this.
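If the package is not installed, one simple way is to append the repository location to sys.path before the imports; the path below is only a placeholder for wherever freqopttest lives on your machine.
import sys
# '/path/to/freqopttest' is a placeholder; point it to the actual repository location.
sys.path.append('/path/to/freqopttest')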
Make sure that the following import statements do not produce any fatal error.
In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import freqopttest.util as util
import freqopttest.data as data
import freqopttest.kernel as kernel
import freqopttest.tst as tst
import freqopttest.glo as glo
import theano
A two-sample test proposes a null hypothesis $H_0: P=Q$ against an alternative hypothesis $H_1: P\neq Q$. As an example, let us consider a simple two-dimensional toy problem where
$$ P(x) = \mathcal{N}(x \mid (0,0), I), \\ Q(y) = \mathcal{N}(y \mid (m_y,0), I). $$

We refer to this problem as the Gaussian mean difference (GMD) problem. If $m_y=0$, then $H_0$ is true. If not, then $H_1$ holds.
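Purely as an illustration of the GMD problem (the package constructs it for us in the next cell), sampling from $P$ and $Q$ directly with numpy might look like the following sketch; the sample size and variable names here are arbitrary.
import numpy as np

np.random.seed(0)
n_toy, d_toy, m_y = 200, 2, 1.0
# X ~ P = N((0,0), I);  Y ~ Q = N((m_y, 0), I)
X = np.random.randn(n_toy, d_toy)
Y = np.random.randn(n_toy, d_toy) + np.array([m_y, 0.0])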
In this framework, a toy problem is represented by a SampleSource object (a source of two samples). Many toy problems are included in the freqopttest.data module. Classes implementing a SampleSource have names starting with SS. Here, we will construct an SSGaussMeanDiff, which implements the Gaussian mean difference problem.
In [2]:
# Full sample size
n = 600
# Consider two dimensions here
dim = 2
seed = 15
# mean shift
my = 1.0
ss = data.SSGaussMeanDiff(dim, my=my)
#ss = data.SSGaussVarDiff(dim)
#ss = data.SSSameGauss(dim)
#ss = data.SSBlobs()
A SampleSource can be sampled to get a TSTData (two-sample test data) object. A TSTData object is just an encapsulation of two $n\times d$ numpy arrays representing the two samples. The TSTData object will be fed to a testing algorithm to conduct the two-sample test. All two-sample test algorithms are implemented to take a TSTData object as their input.

Let us sample from the SampleSource to get a TSTData.
In [3]:
# draw n points from P, and n points from Q.
# tst_data is an instance of TSTData
tst_data = ss.sample(n, seed=seed)
Note that in practice, we have access to only two data matrices: X, an $n \times d$ matrix, and Y, an $n \times d$ matrix. We can also directly construct a TSTData with

tst_data = data.TSTData(X, Y)
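For example, with any two numpy arrays of matching dimension (the arrays below are just placeholder draws), a minimal sketch is:
import numpy as np
import freqopttest.data as data

# Placeholder arrays; in practice X and Y are your own samples from P and Q.
X = np.random.randn(300, 2)
Y = np.random.randn(300, 2)
my_tst_data = data.TSTData(X, Y)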
In this notebook, however, we sample a TSTData from a SampleSource.
In [4]:
# Let's plot the data
x, y = tst_data.xy()
plt.plot(x[:, 0], x[:, 1], 'b.', markeredgecolor='b', label=r'$X\sim P$', alpha=0.5)
plt.plot(y[:, 0], y[:, 1], 'r.', markeredgecolor='r', label=r'$Y\sim Q$', alpha=0.5)
plt.axis('equal')
plt.legend()
Out[4]:
Now that we have tst_data, let us randomly split it into two disjoint halves of equal size: tr and te. The training set tr will be used for parameter optimization. The test set te will be used for the actual two-sample test. Both tr and te are of type TSTData.
In [5]:
tr, te = tst_data.split_tr_te(tr_proportion=0.5, seed=2)
In [6]:
# J = number of test locations (features)
J = 1
# significance level of the test
alpha = 0.01
In [7]:
# These are options to the optimizer. Many have default values.
# See the code for the full list of options.
op = {
'n_test_locs': J, # number of test locations to optimize
'max_iter': 200, # maximum number of gradient ascent iterations
'locs_step_size': 1.0, # step size for the test locations (features)
'gwidth_step_size': 0.1, # step size for the Gaussian width
'tol_fun': 1e-4, # stop if the objective does not increase more than this.
'seed': seed+5, # random seed
}
# optimize on the training set
test_locs, gwidth, info = tst.MeanEmbeddingTest.optimize_locs_width(tr, alpha, **op)
The optimization procedure returns three things:

test_locs: a $J\times d$ numpy array of the $J$ optimized test locations.
gwidth: the optimized Gaussian width (a floating-point number).
info: a dictionary containing information gathered during the optimization, e.g., test location trajectories. This is only for logging and is not needed to construct a test.

From the optimization, we now have test_locs (the optimized test locations as a $J\times d$ matrix) and gwidth (the optimized Gaussian width). Let us use these two parameters to perform the ME test on the test data te (not tr).
In [8]:
# Construct a MeanEmbeddingTest object with the best optimized test features,
# and optimized Gaussian width
met_opt = tst.MeanEmbeddingTest(test_locs, gwidth, alpha)
In [9]:
# Do the two-sample test on the test data te.
# The returned test_result is a dictionary.
test_result = met_opt.perform_test(te)
In [10]:
test_result
Out[10]:
Here we see that the test rejects the null hypothesis with a very small p-value.
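To read the outcome programmatically, something like the following sketch can be used; the dictionary key names are an assumption here, so check test_result.keys() for the exact names returned by your version.
# Key names ('h0_rejected', 'pvalue') are assumed; inspect test_result.keys() to confirm.
print('Reject H0?', test_result['h0_rejected'])
print('p-value: %.4g' % test_result['pvalue'])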
Exercise: Go back to the beginning, change my=0, and run this notebook again. With this setting, $H_0: P=Q$ is true, so the test should not reject the null hypothesis (a false rejection occurs only with probability roughly $\alpha$).
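Alternatively, without editing the cells above, the same pipeline can be repeated in a single cell with the mean shift set to zero; this sketch only reuses calls already introduced above.
# Same pipeline as above with my = 0, so that H0: P = Q holds.
ss_h0 = data.SSGaussMeanDiff(dim, my=0.0)
tst_data_h0 = ss_h0.sample(n, seed=seed)
tr_h0, te_h0 = tst_data_h0.split_tr_te(tr_proportion=0.5, seed=2)
locs_h0, gwidth_h0, _ = tst.MeanEmbeddingTest.optimize_locs_width(tr_h0, alpha, **op)
met_h0 = tst.MeanEmbeddingTest(locs_h0, gwidth_h0, alpha)
met_h0.perform_test(te_h0)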
See ipynb/me_test.ipynb for more ways to use the ME test. See ipynb/smooth_cf_test.ipynb for how to use the SCF test with optimization.
Here we show what is contained in the info variable returned from the optimization.
In [11]:
# Plot evolution of the test locations, Gaussian width
# trajectories of the Gaussian width
gwidths = info['gwidths']
fig, axs = plt.subplots(2, 2, figsize=(10, 9))
axs[0, 0].plot(gwidths)
axs[0, 0].set_xlabel('iteration')
axs[0, 0].set_ylabel('Gaussian width')
axs[0, 0].set_title('Gaussian width evolution')
# evolution of objective values
objs = info['obj_values']
axs[0, 1].plot(objs)
axs[0, 1].set_title(r'Objective $\lambda(T)$')
# trajectories of the test locations
# locs has shape (#iterations, J, d); plot each coordinate separately
locs = info['test_locs']
for coord in [0, 1]:
    locs_coord = locs[:, :, coord]
    J = locs_coord.shape[1]
    axs[1, coord].plot(locs_coord)
    axs[1, coord].set_xlabel('iteration')
    axs[1, coord].set_ylabel('coordinate %d of test_locs' % coord)
    axs[1, coord].set_title('evolution of %d test location(s)' % J)
print('optimized width: %.3f'%gwidth)
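As a small optional check on interpretability, the optimized test locations can be overlaid on the data scatter; a minimal sketch reusing x, y, and test_locs from above:
# Overlay the optimized test location(s) on the two samples.
plt.figure()
plt.plot(x[:, 0], x[:, 1], 'b.', alpha=0.3, label=r'$X\sim P$')
plt.plot(y[:, 0], y[:, 1], 'r.', alpha=0.3, label=r'$Y\sim Q$')
plt.plot(test_locs[:, 0], test_locs[:, 1], 'k*', markersize=15, label='optimized test locations')
plt.axis('equal')
plt.legend()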