Introduction to fsic

This notebook will introduce you to fsic, a Python package for nonparametric independence testing. See the GitHub page of fsic for more information.

Make sure that fsic is included in Python's search path. In particular, the following import statements should not produce any fatal errors.
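
If you have cloned the repository rather than installed the package, one way to make it importable is to append the clone location to sys.path before importing (the path below is a placeholder; substitute your own):

import sys
# Placeholder path: replace with the actual location of your fsic clone.
sys.path.append('/path/to/fsic')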


In [247]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import fsic.util as util
import fsic.data as data
import fsic.kernel as kernel
import fsic.indtest as it
import fsic.glo as glo
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import scipy
import theano


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Demo: NFSIC Test

NFSIC (Normalized Finite Set Independence Criterion) is an independence test of the null hypothesis

$H_0: X \text{ and } Y \text{ are independent }$

against the alternative hypothesis

$H_1: X \text{ and } Y \text{ are dependent }$

where $X \in \mathbb{R}^{d_x}$ and $Y \in \mathbb{R}^{d_y}$ are random variables. For demonstration purposes, let us consider a simple one-dimensional toy problem in which $X, Y$ are known to be dependent, i.e., $H_1$ is true. We refer to this problem as the Sinusoid problem.

Sinusoid problem

In the Sinusoid problem, $X, Y \in [-\pi, \pi]$ and the joint probability density is given by

$p_{xy}(x, y) \propto 1 + \sin(\omega x)\sin(\omega y)$

where $\omega$ is the frequency of the sinusoid, which controls the difficulty of the problem (the higher the $\omega$, the harder it is to detect the dependence). We will use $\omega=1$. A plot of a sample drawn from this model is shown below.
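
As an aside, since the unnormalized density is bounded above by 2, such a sample could be drawn by simple rejection sampling. The sketch below is illustrative only (the function name sample_sinusoid is made up for this sketch); in this notebook we use fsic's own sampler, introduced next.

import numpy as np

def sample_sinusoid(n, omega=1, seed=1):
    # Rejection sampling from p(x, y) proportional to 1 + sin(omega*x)*sin(omega*y)
    # on [-pi, pi]^2. The unnormalized density is at most 2, so accept a uniform
    # draw (x, y) with probability (1 + sin(omega*x)*sin(omega*y)) / 2.
    rng = np.random.RandomState(seed)
    xs, ys = [], []
    while len(xs) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        if rng.uniform() < (1 + np.sin(omega*x)*np.sin(omega*y)) / 2.0:
            xs.append(x)
            ys.append(y)
    return np.array(xs), np.array(ys)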

In this framework, a toy problem is represented by a PairedSource object (a source of paired samples). Many toy problems are included in the fsic.data module. Classes implementing a PairedSource have names starting with PS. Here, we construct a PS2DSinFreq, which implements the Sinusoid problem.


In [248]:
omega = 1
ps = data.PS2DSinFreq(freq=omega)

# There are many more PairedSource implementations.
#ps = data.PSIndSameGauss(dx, dy)
#ps = data.PS2DUnifRotate(angle=np.pi/4)
#ps = data.PSUnifRotateNoise(angle=np.pi/3, noise_dim=2)

In [249]:
# Significance level of the test
alpha = 0.01

# Number of paired samples to draw
n = 1000

# Random seed
seed = 1

In [250]:
# Draw n paired samples from the PairedSource with the random seed
pdata = ps.sample(n, seed=seed)

A sample drawn from a PairedSource is represented as an object of the PairedData class. A PairedData object is just an encapsulation of the paired sample $(X, Y)$, and is what gets fed to a testing algorithm to conduct the independence test.

In practice, we only have access to the data matrices: X, an $(n \times d_x)$ matrix, and Y, an $(n \times d_y)$ matrix. We can also directly construct a PairedData with

pdata = data.PairedData(X, Y)

Here, we sample it from a PairedSource.
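
For example, a direct construction might look like this (a minimal sketch with arbitrary stand-in matrices; the names Xm, Ym, and pdata_direct are made up and not used elsewhere in this notebook):

Xm = np.random.randn(100, 2)   # 100 observations of a 2-dimensional X, one row each
Ym = np.random.randn(100, 3)   # 100 paired observations of a 3-dimensional Y
pdata_direct = data.PairedData(Xm, Ym)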


In [251]:
# Let's plot the data.
X, Y = pdata.xy()
plt.plot(X, Y, 'k.')
plt.xlim([-np.pi, np.pi])
plt.ylim([-np.pi, np.pi])
plt.xlabel('x')
plt.ylabel('y')


Out[251]:
<matplotlib.text.Text at 0x7f587792d5d0>

Now that we have pdata, let us randomly split it into two disjoint halves: tr and te. The training set tr will be used for parameter optimization. The test set te will be used for the actual independence test. Both tr and te are of type PairedData.


In [252]:
tr, te = pdata.split_tr_te(tr_proportion=0.5, seed=seed+1)

Let us optimize the parameters of NFSIC on tr. The optimization relies on Theano to compute the gradient.


In [253]:
# J is the number of test locations
J = 1

# There are many options for the optimization. 
# Almost all of them have default values. 
# Here, we will list a few to give you a sense of what you can control.
op = {
    'n_test_locs':J,  # number of test locations
    'max_iter':200, # maximum number of gradient ascent iterations
    'V_step':1, # Step size for the test locations of X
    'W_step':1, # Step size for the test locations of Y
    'gwidthx_step':1, # Step size for the Gaussian width of X
    'gwidthy_step':1, # Step size for the Gaussian width of Y
    'tol_fun':1e-4, # Stop if the objective function does not increase more than this
    'seed':seed+7 # random seed
}

# Do the optimization with the options in op.
op_V, op_W, op_gwx, op_gwy, info = it.GaussNFSIC.optimize_locs_widths(tr, alpha, **op)

The optimization procedure returns

  1. op_V: optimized test locations (features) for $X$. A $J \times d_x$ numpy array.
  2. op_W: optimized test locations for $Y$. A $J \times d_y$ numpy array.
  3. op_gwx: optimized Gaussian width (for Gaussian kernels) for $X$. A floating point number.
  4. op_gwy: optimized Gaussian width (for Gaussian kernels) for $Y$. A floating point number.
  5. info: information gathered during the optimization, i.e., variable trajectories. A dictionary (inspected briefly in the sketch below).
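
As a quick sanity check, the shapes of the returned values can be inspected (a minimal sketch; the exact contents of info are best explored via its keys rather than assumed):

print('op_V shape:', op_V.shape)    # (J, d_x) test locations for X
print('op_W shape:', op_W.shape)    # (J, d_y) test locations for Y
print('Gaussian widths:', op_gwx, op_gwy)
print('info keys:', list(info.keys()))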

Let us use these values to construct an NFSIC test. An NFSIC test using Gaussian kernels is implemented in fsic.indtest.GaussNFSIC. This is equivalent to using fsic.indtest.NFSIC, the generic implementation that accepts arbitrary kernels, with two Gaussian kernels (see the commented-out lines below).


In [254]:
nfsic_opt = it.GaussNFSIC(op_gwx, op_gwy, op_V, op_W, alpha)

# This is the same as 
#k = kernel.KGauss(op_gwx)
#l = kernel.KGauss(op_gwy)
#nfsic_opt = it.NFSIC(k, l, op_V, op_W, alpha)

Perform the independence test on the testing data te.


In [261]:
# return a dictionary of testing results
nfsic_opt.perform_test(te)


Out[261]:
{'alpha': 0.01,
 'h0_rejected': True,
 'pvalue': 1.1098408285686999e-20,
 'test_stat': 86.955616273046701,
 'time_secs': 0.001252889633178711}

The test correctly rejects $H_0$ with a very small p-value.
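
The returned dictionary can also be used programmatically; for instance (a small sketch using only the keys shown in the output above):

results = nfsic_opt.perform_test(te)
if results['h0_rejected']:
    print('Reject H0: p-value = {0:.3g}'.format(results['pvalue']))
else:
    print('Fail to reject H0 at alpha = {0}'.format(results['alpha']))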

Learned test location(s)

Let us check the optimized test locations.


In [262]:
# Let's plot the locations on top of the data
xtr, ytr = tr.xy()
plt.plot(xtr, ytr, 'k.', label='Data')
plt.plot(op_V, op_W, 'r*', label='Test location', markersize=20)
plt.xlim([-np.pi, np.pi])
plt.ylim([-np.pi, np.pi])
plt.title('Training data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(numpoints=1)


Out[262]:
<matplotlib.legend.Legend at 0x7f5877f38c10>

We expect the learned location(s) to be in a region where $p(x, y)$ differs most from $p(x)p(y)$. For the Sinusoid problem the marginals are uniform, so this difference is proportional to $\sin(\omega x)\sin(\omega y)$, whose magnitude (for $\omega = 1$) peaks at $(\pm\pi/2, \pm\pi/2)$; the sketch below visualizes it.
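
A minimal sketch of that difference, overlaid with the learned location (this assumes omega, op_V, and op_W from the cells above are still in scope):

# p(x, y) - p(x)p(y) is proportional to sin(omega*x)*sin(omega*y)
# because the marginals of the Sinusoid density are uniform.
xg = np.linspace(-np.pi, np.pi, 100)
yg = np.linspace(-np.pi, np.pi, 100)
XX, YY = np.meshgrid(xg, yg)
diff = np.sin(omega*XX) * np.sin(omega*YY)
plt.contourf(XX, YY, diff, 20, cmap='RdBu_r')
plt.colorbar(label='p(x,y) - p(x)p(y) (unnormalized)')
plt.plot(op_V, op_W, 'k*', markersize=20, label='Test location')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(numpoints=1)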

Exercise

Go back to where we define the PairedSource and change omega to 0. This makes $(X, Y) \sim U([-\pi, \pi]^2)$; in this case, $X$ and $Y$ are independent, i.e., $H_0$ is true. Run the whole procedure again and verify that the test does not reject $H_0$. (Technically, the probability of rejecting is about $\alpha$.)
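
To check the "probability of rejecting is about $\alpha$" claim empirically, one could repeat the whole pipeline under $H_0$ and count rejections. A rough sketch, using only the functions introduced above (slow, since it re-optimizes the parameters in every trial; the variable names are made up for this sketch):

# Estimate the rejection rate under H0 (omega = 0). For a level-alpha test,
# the empirical rate should be roughly alpha.
ps_h0 = data.PS2DSinFreq(freq=0)
n_trials = 50
rejections = 0
for t in range(n_trials):
    pdata_t = ps_h0.sample(n, seed=t)
    tr_t, te_t = pdata_t.split_tr_te(tr_proportion=0.5, seed=t+1)
    V, W, gwx, gwy, _ = it.GaussNFSIC.optimize_locs_widths(tr_t, alpha, **op)
    test_t = it.GaussNFSIC(gwx, gwy, V, W, alpha)
    rejections += test_t.perform_test(te_t)['h0_rejected']
print('Rejected {0} out of {1} trials (alpha = {2})'.format(rejections, n_trials, alpha))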


In [ ]: