One of the first works in this framework of post-selection inference is the covariance test. The test was motivated by the drop in covariance between the response and the fitted values across one step of the LARS path.
The basic theory behind the covtest can be seen by sampling $n$ IID
Gaussians and looking at the spacing between the two largest order statistics.
A simple Mills' ratio calculation leads to
$$
Z^n_{(1)} (Z^n_{(1)} - Z^n_{(2)}) \overset{D}{\to} \text{Exp}(1)
$$
Here is a little simulation.
In [1]:
import numpy as np
%pylab inline
from statsmodels.distributions import ECDF
from selection.covtest import covtest, reduced_covtest
In [2]:
Z = np.random.standard_normal((2000,50))
T = np.zeros(2000)
for i in range(2000):
    W = np.sort(Z[i])
    T[i] = W[-1] * (W[-1] - W[-2])
Ugrid = np.linspace(0,1,101)
covtest_fig = plt.figure(figsize=(6,6))
ax = covtest_fig.gca()
ax.plot(Ugrid, ECDF(np.exp(-T))(Ugrid), drawstyle='steps', c='k',
        label='covtest')
ax.set_title('Null distribution')
ax.legend(loc='upper left')
Out[2]:
The covariance test is an asymptotic result; its p-values can be used in a sequential procedure called ForwardStop to determine when to stop along the LASSO path.
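As a standalone sketch (not the implementation in the selection package): the ForwardStop rule of G'Sell et al. transforms sequential p-values by $-\log(1-p_i)$ and stops at the largest step $k$ whose running average of transformed p-values stays below the target level $\alpha$. The helper name `forward_stop` below is hypothetical.

```python
import numpy as np

def forward_stop(pvals, alpha):
    """ForwardStop sketch: return the largest k such that the average of
    -log(1 - p_i) over the first k sequential p-values is at most alpha."""
    pvals = np.asarray(pvals, float)
    running = np.cumsum(-np.log(1 - pvals)) / np.arange(1, len(pvals) + 1)
    passing = np.nonzero(running <= alpha)[0]
    # if no prefix passes, stop immediately (select 0 steps)
    return passing[-1] + 1 if len(passing) else 0
```

For instance, `forward_stop([0.001, 0.002, 0.01, 0.8, 0.9], 0.1)` selects the first 3 steps: the small early p-values keep the running average below 0.1 until the large p-values push it over.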
An exact version of the covariance test was developed using the Kac-Rice formula in a general framework that applies to problems beyond the LASSO. A sequential version along the LARS path, which we refer to as the spacings test, was developed as well.
Here is the exact test, which is the first step of the spacings test.
In [3]:
from scipy.stats import norm as ndist
Texact = np.zeros(2000)
for i in range(2000):
    W = np.sort(Z[i])
    Texact[i] = ndist.sf(W[-1]) / ndist.sf(W[-2])
ax.plot(Ugrid, ECDF(Texact)(Ugrid), c='blue', drawstyle='steps', label='exact covtest')
covtest_fig
Out[3]:
The above tests were based on an IID sample, though
both the covtest and its exact version can be used
in a regression setting. Both tests need access to the covariance
of the noise.
Formally, suppose $$ y|X \sim N(\mu, \Sigma). $$ The exact test is a test of $$H_0:\mu=0$$ based on $$ \lambda_{\max} = \|X^Ty\|_{\infty}, $$ the point at which the first variable enters the LASSO path. That is, $\lambda_{\max}$ is the smallest $\lambda$ for which 0 solves $$ \text{minimize}_{\beta} \frac{1}{2} \|y-X\beta\|^2_2 + \lambda \|\beta\|_1. $$
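The characterization of $\lambda_{\max}$ follows from the KKT conditions of the LASSO: $\beta = 0$ is a solution exactly when every coordinate of $X^Ty$ lies in $[-\lambda, \lambda]$. A quick numerical check of this, on hypothetical toy data (the variable names and sizes below are illustrative, not from the notebook):

```python
import numpy as np

# toy data; rng, n, p are illustrative choices
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# lambda_max is the smallest penalty at which beta = 0 solves the lasso
lam_max = np.fabs(np.dot(X.T, y)).max()

# KKT check: beta = 0 is optimal iff |X^T(y - X*0)|_infty <= lambda
zero_is_solution_at_lam_max = np.all(np.fabs(np.dot(X.T, y)) <= lam_max)
zero_is_solution_below = np.all(np.fabs(np.dot(X.T, y)) <= 0.99 * lam_max)
print(zero_is_solution_at_lam_max, zero_is_solution_below)
```

At $\lambda = \lambda_{\max}$ the zero vector satisfies the subgradient condition, while any smaller penalty forces the maximizing variable into the model.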
More precisely, the exact test conditions on the variable $i^*(y)$ that achieves $\lambda_{\max}$ and tests the weaker null hypothesis $$H_0:X[:,i^*(y)]^T\mu=0.$$ The covtest is an approximation of this test, based on the same Mills' ratio calculation. (Roughly, the calculation says that the overshoot of a Gaussian above a high level $u$ is approximately an exponential random variable with mean $u^{-1}$.)
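The overshoot heuristic is easy to check by simulation. In this sketch, the level `u` and sample size are illustrative choices; the approximation is asymptotic in $u$, so at a moderate level like $u = 3$ the empirical mean is only roughly $1/u$.

```python
import numpy as np

# rejection-sample the overshoot of a standard Gaussian above a level u
rng = np.random.default_rng(0)
u = 3.0
Z = rng.standard_normal(5_000_000)
overshoot = Z[Z > u] - u

# Mills' ratio heuristic: overshoot is approximately Exp with mean 1/u
print(overshoot.mean(), 1 / u)
```

The exact conditional mean is $\varphi(u)/\bar\Phi(u) - u$, which converges to $1/u$ as $u \to \infty$.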
Here is a simulation under $\Sigma = \sigma^2 I$ with $\sigma$ known.
In [4]:
n, p, nsim, sigma = 50, 1000, 500, 1.5
def instance(n, p, beta=None, sigma=sigma):
    # shared row effect makes the columns equicorrelated with correlation 1/2
    X = np.random.standard_normal((n,p)) + np.random.standard_normal(n)[:,None]
    X /= X.std(0)[None,:]
    X /= np.sqrt(n)
    Y = np.random.standard_normal(n) * sigma
    if beta is not None:
        Y += np.dot(X, beta)
    return X, Y
In [5]:
X, Y = instance(n, p, sigma=sigma)
cone, pval, idx, sign = covtest(X, Y)
pval
Out[5]:
In [6]:
def simulation(beta):
    Pcov = []
    Pexact = []
    for i in range(nsim):
        X, Y = instance(n, p, sigma=sigma, beta=beta)
        Pcov.append(covtest(X, Y, sigma=sigma, exact=False)[1])
        Pexact.append(covtest(X, Y, sigma=sigma, exact=True)[1])
    Ugrid = np.linspace(0,1,101)
    plt.figure(figsize=(6,6))
    plt.plot(Ugrid, ECDF(Pcov)(Ugrid), label='covtest', drawstyle='steps', c='k')
    plt.plot(Ugrid, ECDF(Pexact)(Ugrid), label='exact covtest', drawstyle='steps', c='blue')
    plt.legend(loc='lower right')
In [7]:
beta = np.zeros(p)
simulation(beta)
In [8]:
beta = np.zeros(p)
beta[0] = 4
simulation(beta)
In [9]:
beta = np.zeros(p)
beta[:2] = 4
simulation(beta)
In [10]:
beta = np.zeros(p)
beta[:5] = 4
simulation(beta)
In [11]:
from selection.covtest import forward_step
Pu = []
Pk = []
Pc = []
beta = np.zeros(p)
nsim = 100
for _ in range(nsim):
    X, Y = instance(n, p, sigma=sigma, beta=beta)
    Pvals = forward_step(X, Y, sigma=sigma,
                         tests=['covtest',
                                'reduced_known',
                                'reduced_unknown'],
                         nstep=10)
    Pu.append(Pvals['reduced_unknown'])
    Pk.append(Pvals['reduced_known'])
    Pc.append(Pvals['covtest'])
In [ ]:
np.array(Pu).shape
In [ ]:
def simulation(beta):
    Pu = []
    Pr = []
    Pk = []
    for i in range(nsim):
        X, Y = instance(n, p, sigma=sigma, beta=beta)
        Pu.append(reduced_covtest(X, Y, burnin=500, ndraw=5000)[1])
        Pr.append(reduced_covtest(X, Y, burnin=500, ndraw=5000, sigma=sigma)[1])
        Pk.append(covtest(X, Y, sigma=sigma)[1])
    g = np.linspace(0,1,101)
    plt.figure(figsize=(8,8))
    plt.plot(g, ECDF(Pu)(g), label='unknown, reduced')
    plt.plot(g, ECDF(Pk)(g), label='known, full')
    plt.plot(g, ECDF(Pr)(g), label='known, reduced')
    plt.legend(loc='lower right')
In [ ]:
g = np.linspace(0,1,101)
for i in range(10):
    plt.plot(g, ECDF(np.array(Pc)[:,i])(g))
In [ ]: