The covariance test¶
One of the first methods in this framework of post-selection inference is the covariance test. The test was motivated by the drop in covariance of the residual across one step of the LARS path.
The basic theory behind the covtest can be seen by sampling $n$ IID standard Gaussians and looking at the spacing between the top two order statistics. A simple Mills' ratio calculation leads to

$$Z_{(1)} \left(Z_{(1)} - Z_{(2)}\right) \overset{D}{\longrightarrow} \text{Exp}(1) \qquad \text{as } n \to \infty.$$
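In outline, the calculation uses the Mills' ratio approximation $\bar{\Phi}(u) \approx \phi(u)/u$ for large $u$:

$$\mathbb{P}\left(Z > u + t/u \,\middle|\, Z > u\right) = \frac{\bar{\Phi}(u + t/u)}{\bar{\Phi}(u)} \approx \frac{u}{u + t/u} \, e^{-t - t^2/(2u^2)} \longrightarrow e^{-t},$$

so conditionally on $Z_{(2)} = u$ the gap $Z_{(1)} - Z_{(2)}$ is approximately $\text{Exp}(u)$, and multiplying by $Z_{(1)} \approx u$ gives the $\text{Exp}(1)$ limit.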
Here is a little simulation.
[1]:
import numpy as np
np.random.seed(0)
%matplotlib inline
import matplotlib.pyplot as plt
from statsmodels.distributions import ECDF
from selectinf.algorithms.covtest import covtest
We will sample 2000 times from $Z \sim N(0, I_{50 \times 50})$ and look at the normalized spacing between the top two values.
[2]:
Z = np.random.standard_normal((2000, 50))

# normalized spacing between the top two order statistics of each sample
T = np.zeros(2000)
for i in range(2000):
    W = np.sort(Z[i])
    T[i] = W[-1] * (W[-1] - W[-2])

# under the Exp(1) limit, exp(-T) should be approximately Uniform(0,1)
Ugrid = np.linspace(0, 1, 101)
covtest_fig = plt.figure(figsize=(6,6))
ax = covtest_fig.gca()
ax.plot(Ugrid, ECDF(np.exp(-T))(Ugrid), drawstyle='steps', c='k',
        label='covtest', linewidth=3)
ax.set_title('Null distribution')
ax.legend(loc='upper left');
(figure: 'Null distribution', ECDF of the covtest statistic $e^{-T}$ against the uniform)
The covariance test is an asymptotic result, and can be used in a sequential procedure called forward stop to determine when to stop along the LASSO path.
An exact version of the covariance test was developed in a general framework covering problems beyond the LASSO, using the Kac-Rice formula. A sequential version along the LARS path was also developed, which we refer to as the spacings test.
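For IID Gaussians the first step has a closed form: conditional on the second-largest value, the maximum is a standard Gaussian truncated to the right of it, so under the global null

$$\frac{\bar{\Phi}(Z_{(1)})}{\bar{\Phi}(Z_{(2)})} \sim U(0,1)$$

exactly, where $\bar{\Phi}$ is the standard Gaussian survival function.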
Here is the exact test, which is the first step of the spacings test.
[3]:
from scipy.stats import norm as ndist

# survival function ratio of the top two order statistics:
# exactly Uniform(0,1) under the global null
Texact = np.zeros(2000)
for i in range(2000):
    W = np.sort(Z[i])
    Texact[i] = ndist.sf(W[-1]) / ndist.sf(W[-2])

ax.plot(Ugrid, ECDF(Texact)(Ugrid), c='blue', drawstyle='steps',
        label='exact covtest', linewidth=3)
covtest_fig
[3]:
(figure: 'Null distribution', ECDFs of the covtest and exact covtest statistics against the uniform)
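As a quick sanity check, one can compare both transformed statistics to $U(0,1)$ with scipy's Kolmogorov-Smirnov test:

[ ]:
from scipy.stats import kstest

# both should be near Uniform(0,1) under the global null; the covtest
# version is only asymptotically exponential, so its fit is rougher
print(kstest(np.exp(-T), 'uniform'))
print(kstest(Texact, 'uniform'))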
Covariance test for regression¶
The above tests were based on an IID sample, though both the covtest and its exact version can be used in a regression setting. Both tests need access to the covariance of the noise.
Formally, suppose

$$y \sim N(\mu, \Sigma);$$

the exact test is a test of

$$H_0 : \mu = 0.$$

The test is based on

$$\lambda_{\max} = \max_i \, |X_i^T y|.$$

This value of $\lambda$ is the value at which the first variable enters the LASSO. That is, $\lambda_{\max}$ is the smallest $\lambda$ for which 0 solves

$$\text{minimize}_{\beta} \; \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$

Formally, the exact test conditions on the variable $i^*(y)$ that achieves $\lambda_{\max}$ and tests the weaker null hypothesis

$$H_0 : X_{i^*(y)}^T \mu = 0.$$
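As a small illustration of the role of $\lambda_{\max}$, here is a sketch using scikit-learn (an assumption: it is not used elsewhere in this notebook, and its LASSO objective is scaled by $1/n$, so its penalty parameter is $\alpha = \lambda/n$). The solution should be identically zero just above $\lambda_{\max}$ and nonzero just below it:

[ ]:
from sklearn.linear_model import Lasso

# fresh illustrative data, just for this check
X0 = np.random.standard_normal((50, 200))
y0 = np.random.standard_normal(50)
lam_max = np.fabs(np.dot(X0.T, y0)).max()

# sklearn minimizes (1/(2n))||y - Xb||^2_2 + alpha*||b||_1,
# so alpha = lam/n matches the objective above
for lam in [1.01 * lam_max, 0.99 * lam_max]:
    soln = Lasso(alpha=lam / 50, fit_intercept=False).fit(X0, y0).coef_
    print(lam > lam_max, np.count_nonzero(soln))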
The covtest is an approximation of this test, based on the same Mills' ratio calculation. (That calculation roughly says that the overshoot of a Gaussian above a high level $u$ is approximately an exponential random variable with mean $u^{-1}$.)
Here is a simulation under $\Sigma = \sigma^2 I$ with $\sigma$ known. The design matrix, before standardization, is Gaussian equicorrelated in the population with parameter $1/2$.
[4]:
n, p, nsim, sigma = 50, 200, 1000, 1.5

def instance(n, p, beta=None, sigma=sigma):
    # adding a shared Gaussian column induces population
    # correlation 1/2 between distinct columns
    X = (np.random.standard_normal((n, p)) +
         np.random.standard_normal(n)[:, None])
    X /= X.std(0)[None, :]
    X /= np.sqrt(n)
    Y = np.random.standard_normal(n) * sigma
    if beta is not None:
        Y += np.dot(X, beta)
    return X, Y
Let’s make a dataset under our global null and compute the exact covtest p-value.
[5]:
X, Y = instance(n, p, sigma=sigma)
cone, pval, idx, sign = covtest(X, Y, exact=False)
pval
[5]:
0.06488988967797554
The object cone is an instance of selectinf.constraints.affine.constraints, which does much of the work for affine selection procedures. The variables idx and sign store which variable achieved $\lambda_{\max}$ and the sign of its correlation with $y$.
[6]:
cone
[6]:
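Assuming the attribute names from the selectinf source (linear_part and offset, encoding the region $\{z : Az \le b\}$), the constraint can be inspected directly:

[ ]:
# assumed attributes: A = cone.linear_part, b = cone.offset
print(cone.linear_part.shape, cone.offset.shape)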
[7]:
def simulation(beta):
    # collect p-values from both tests over nsim draws
    Pcov = []
    Pexact = []
    for i in range(nsim):
        X, Y = instance(n, p, sigma=sigma, beta=beta)
        Pcov.append(covtest(X, Y, sigma=sigma, exact=False)[1])
        Pexact.append(covtest(X, Y, sigma=sigma, exact=True)[1])

    # compare the empirical CDFs of the p-values to the uniform
    Ugrid = np.linspace(0, 1, 101)
    plt.figure(figsize=(6,6))
    plt.plot(Ugrid, ECDF(Pcov)(Ugrid), label='covtest',
             drawstyle='steps', c='k', linewidth=3)
    plt.plot(Ugrid, ECDF(Pexact)(Ugrid), label='exact covtest',
             drawstyle='steps', c='blue', linewidth=3)
    plt.legend(loc='lower right')
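For example, running the simulation under the global null:

[ ]:
# under beta = 0, both ECDF curves should track the diagonal,
# i.e. the p-values should be close to Uniform(0,1)
simulation(np.zeros(p))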