The covariance test

One of the first works in this framework of post-selection inference is the covariance test. The test was motivated by the drop in covariance of the residual across one step of the LARS path.

The basic theory behind the covtest can be seen by sampling $n$ IID standard Gaussians and looking at the spacing between the top two order statistics. A simple Mills' ratio calculation leads to $$ Z^n_{(1)} (Z^n_{(1)} - Z^n_{(2)}) \overset{D}{\to} \text{Exp}(1), \qquad n \to \infty. $$
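
Heuristically (using the Mills' ratio approximation $\bar{\Phi}(z) \approx \phi(z)/z$ for large $z$): conditional on $Z^n_{(2)} = z$, the maximum behaves like a standard Gaussian conditioned to exceed $z$, so $$ P\left(Z^n_{(1)} > z + t/z \,\big|\, Z^n_{(2)} = z\right) = \frac{\bar{\Phi}(z + t/z)}{\bar{\Phi}(z)} \approx \frac{\phi(z + t/z)}{\phi(z)} \approx e^{-t}, $$ and since $Z^n_{(1)}/Z^n_{(2)} \to 1$, the product $Z^n_{(1)}(Z^n_{(1)} - Z^n_{(2)})$ has the same $\text{Exp}(1)$ limit.

Here is a little simulation.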


In [1]:
import numpy as np
%pylab inline

from statsmodels.distributions import ECDF
from selection.covtest import covtest, reduced_covtest


Populating the interactive namespace from numpy and matplotlib

In [2]:
Z = np.random.standard_normal((2000,50))
T = np.zeros(2000)
for i in range(2000):
    W = np.sort(Z[i])
    T[i] = W[-1] * (W[-1] - W[-2])

Ugrid = np.linspace(0,1,101)
covtest_fig = plt.figure(figsize=(6,6))
ax = covtest_fig.gca()
ax.plot(Ugrid, ECDF(np.exp(-T))(Ugrid), linestyle='steps', c='k',
        label='covtest')
ax.set_title('Null distribution')
ax.legend(loc='upper left')


Out[2]:
<matplotlib.legend.Legend at 0x10a114790>

The covariance test is an asymptotic result, and can be used in a sequential procedure called forward stop to determine when to stop along the LASSO path.
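
As a rough sketch of the stopping rule (the ForwardStop rule; this helper is an illustration and not part of the selection package): given sequential p-values $p_1, \dots, p_m$, stop at the largest $k$ for which $\frac{1}{k}\sum_{i=1}^k -\log(1 - p_i) \le \alpha$.

In [ ]:
def forward_stop(pvalues, alpha=0.10):
    # ForwardStop (sketch): largest k such that the running mean of
    # -log(1 - p_i), i <= k, is at most alpha; returns 0 if no such k.
    pvalues = np.asarray(pvalues)
    running_mean = np.cumsum(-np.log(1 - pvalues)) / np.arange(1, pvalues.shape[0] + 1)
    passed = np.nonzero(running_mean <= alpha)[0]
    return passed[-1] + 1 if passed.size else 0

forward_stop([0.001, 0.004, 0.6, 0.3, 0.9], alpha=0.10)   # keeps the first two steps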

An exact version of the covariance test, applicable to a general class of problems beyond the LASSO, was developed using the Kac-Rice formula. A sequential version along the LARS path was also developed, which we refer to as the spacings test.

Here is the exact test, which is the first step of the spacings test.


In [3]:
from scipy.stats import norm as ndist
Texact = np.zeros(2000)
for i in range(2000):
    W = np.sort(Z[i])
    Texact[i] = ndist.sf(W[-1]) / ndist.sf(W[-2])
ax.plot(Ugrid, ECDF(Texact)(Ugrid), c='blue', linestyle='steps', label='exact covTest')
covtest_fig


Out[3]:

Covariance test for regression

The above tests were based on an IID sample, though both the covtest and its exact version can be used in a regression setting. Both tests need access to the covariance of the noise.

Formally, suppose $$ y|X \sim N(\mu, \Sigma). $$ The exact test is a test of $$H_0:\mu=0$$ based on $$ \lambda_{\max} = \|X^Ty\|_{\infty}, $$ the value of $\lambda$ at which the first variable enters the LASSO path. That is, $\lambda_{\max}$ is the smallest $\lambda$ for which 0 solves $$ \text{minimize}_{\beta} \frac{1}{2} \|y-X\beta\|^2_2 + \lambda \|\beta\|_1. $$
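
To see why $\lambda_{\max} = \|X^Ty\|_\infty$: by the KKT (subgradient) conditions, $\beta=0$ solves the problem exactly when every coordinate of $X^Ty$ lies in $[-\lambda, \lambda]$. Here is a small numeric check of this fact, a sketch on simulated data using sklearn's Lasso (whose penalty parameter corresponds to $\lambda/n$ in this notation; sklearn is not otherwise used in this notebook).

In [ ]:
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
Xc, yc = rng.standard_normal((40, 10)), rng.standard_normal(40)
lam_max = np.fabs(np.dot(Xc.T, yc)).max()

# the solution is identically zero just above lam_max and has active variables well below it
for frac in [1.05, 0.5]:
    fit = Lasso(alpha=frac * lam_max / Xc.shape[0], fit_intercept=False).fit(Xc, yc)
    print(frac, np.count_nonzero(fit.coef_))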

Formally, the exact test conditions on the variable $i^*(y)$ that achieves $\lambda_{\max}$ and tests the weaker null hypothesis $$H_0:X[:,i^*(y)]^T\mu=0.$$ The covtest is an approximation of this test, based on the same Mills' ratio calculation. (Roughly, that calculation says the overshoot of a Gaussian above a level $u$ is approximately an exponential random variable with mean $u^{-1}$.) We compare the exact and approximate p-values on a single draw below.

Here is a simulation under $\Sigma = \sigma^2 I$ with $\sigma$ known.


In [4]:
n, p, nsim, sigma = 50, 1000, 500, 1.5

def instance(n, p, beta=None, sigma=sigma):
    # design with a shared Gaussian factor, giving pairwise column correlation ~ 1/2
    X = np.random.standard_normal((n,p)) + np.random.standard_normal(n)[:,None]
    X /= X.std(0)[None,:]   # standardize each column
    X /= np.sqrt(n)         # scale columns to have standard deviation 1/sqrt(n)
    Y = np.random.standard_normal(n) * sigma
    if beta is not None:
        Y += np.dot(X, beta)
    return X, Y

In [5]:
X, Y = instance(n, p, sigma=sigma) 
cone, pval, idx, sign = covtest(X, Y)
pval


Out[5]:
0.21828522556482188
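
For this single draw we can also compare the approximate and exact p-values directly, using the same exact keyword that the simulations below pass to covtest; the two are typically close, as the Mills' ratio approximation suggests.

In [ ]:
covtest(X, Y, sigma=sigma, exact=False)[1], covtest(X, Y, sigma=sigma, exact=True)[1]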

In [6]:
def simulation(beta):
    Pcov = []
    Pexact = []

    for i in range(nsim):
        X, Y = instance(n, p, sigma=sigma, beta=beta)
        Pcov.append(covtest(X, Y, sigma=sigma, exact=False)[1])
        Pexact.append(covtest(X, Y, sigma=sigma, exact=True)[1])

    Ugrid = np.linspace(0,1,101)
    plt.figure(figsize=(6,6))
    plt.plot(Ugrid, ECDF(Pcov)(Ugrid), label='covtest', ls='steps', c='k')
    plt.plot(Ugrid, ECDF(Pexact)(Ugrid), label='exact covtest', ls='steps', c='blue')
    plt.legend(loc='lower right')

Null


In [7]:
beta = np.zeros(p)
simulation(beta)


1-sparse


In [8]:
beta = np.zeros(p)
beta[0] = 4
simulation(beta)


2-sparse


In [9]:
beta = np.zeros(p)
beta[:2] = 4
simulation(beta)


5-sparse


In [10]:
beta = np.zeros(p)
beta[:5] = 4
simulation(beta)


Several steps -- covtest version


In [11]:
from selection.covtest import forward_step

Pu = []
Pk = []
Pc = []

beta = np.zeros(p)
nsim = 100
for _ in range(nsim):
    X, Y = instance(n, p, sigma=sigma, beta=beta)
    Pvals = forward_step(X, Y, sigma=sigma,
                         tests=['covtest', 
                                'reduced_known',
                                'reduced_unknown'],
                         nstep=10)
    Pu.append(Pvals['reduced_unknown'])
    Pk.append(Pvals['reduced_known'])
    Pc.append(Pvals['covtest'])


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-92e633de78d5> in <module>()
     13                                 'reduced_known',
     14                                 'reduced_unknown'],
---> 15                          nstep=10)
     16     Pu.append(Pvals['reduced_unknown'])
     17     Pk.append(Pvals['reduced_known'])

TypeError: forward_step() got an unexpected keyword argument 'tests'

In [ ]:
np.array(Pu).shape

In [ ]:
def simulation(beta):
    Pu = []
    Pr = []
    Pk = []

    for i in range(nsim):
        X, Y = instance(n, p, sigma=sigma, beta=beta)
        Pu.append(reduced_covtest(X, Y, burnin=500, ndraw=5000)[1])
        Pr.append(reduced_covtest(X, Y, burnin=500, ndraw=5000, sigma=sigma)[1])
        Pk.append(covtest(X, Y, sigma=sigma)[1])
        
    g = np.linspace(0,1,101)
    plt.figure(figsize=(8,8))
    plt.plot(g, ECDF(Pu)(g), label='unknown, reduced')
    plt.plot(g, ECDF(Pk)(g), label='known, full')
    plt.plot(g, ECDF(Pr)(g), label='known, reduced')
    plt.legend(loc='lower right')

In [ ]:
g = np.linspace(0,1,101)
[plt.plot(g, ECDF(np.array(Pc)[:,i])(g)) for  i in range(10)]
