The covariance test

One of the first methods in this framework of post-selection inference is the covariance test. The test was motivated by the drop in covariance of the residual over one step of the LARS path.

The basic theory behind the covtest can be seen by sampling \(n\) IID standard Gaussians and looking at the spacing between the two largest values. Writing \(Z^n_{(1)} \geq Z^n_{(2)}\) for the top two order statistics, a simple Mills' ratio calculation leads to

\[Z^n_{(1)} (Z^n_{(1)} - Z^n_{(2)}) \overset{D}{\to} \text{Exp}(1)\]
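
To see where the exponential limit comes from (a heuristic sketch, not a proof): conditional on the runner-up \(Z^n_{(2)} = z\), the top value is a standard Gaussian conditioned to exceed \(z\). With \(\bar{\Phi}\) the standard Gaussian survival function, Mills' ratio gives

\[\mathbb{P}\left(Z^n_{(1)} - z > t \mid Z^n_{(1)} > z\right) = \frac{\bar{\Phi}(z+t)}{\bar{\Phi}(z)} \approx e^{-zt - t^2/2} \approx e^{-zt},\]

so the overshoot is approximately \(\text{Exp}(z)\). Multiplying by \(Z^n_{(1)} \approx z\), which grows with \(n\), gives the \(\text{Exp}(1)\) limit.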

Here is a little simulation.

[1]:
import numpy as np
np.random.seed(0)
%matplotlib inline
import matplotlib.pyplot as plt
from statsmodels.distributions import ECDF
from selectinf.algorithms.covtest import covtest

We will sample 2000 times from \(Z \sim N(0,I_{50 \times 50})\) and look at the normalized spacing between the top 2 values.

[2]:
Z = np.random.standard_normal((2000,50))
T = np.zeros(2000)
for i in range(2000):
    W = np.sort(Z[i])
    T[i] = W[-1] * (W[-1] - W[-2])   # top value times spacing to the runner-up

Ugrid = np.linspace(0,1,101)
covtest_fig = plt.figure(figsize=(6,6))
ax = covtest_fig.gca()
# if T ~ Exp(1), then exp(-T) ~ U(0,1), so its ECDF should track the diagonal
ax.plot(Ugrid, ECDF(np.exp(-T))(Ugrid), drawstyle='steps', c='k',
        label='covtest', linewidth=3)
ax.set_title('Null distribution')
ax.legend(loc='upper left');
../_images/algorithms_covtest_4_0.png

The covariance test is an asymptotic result, and its sequential \(p\)-values can be used in a stopping rule called forward stop to determine when to stop along the LASSO path, as sketched below.
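
Here is a minimal sketch of the ForwardStop rule (the helper name and interface are ours, not part of selectinf): given sequential \(p\)-values \(p_1, \dots, p_m\), it stops at the largest \(k\) whose running mean of \(-\log(1-p_i)\) is at most the target level \(\alpha\).

import numpy as np

def forward_stop(pvals, alpha=0.1):
    # ForwardStop: keep the largest k whose running mean of
    # -log(1 - p_i) is at most alpha
    pvals = np.asarray(pvals)
    stats = np.cumsum(-np.log1p(-pvals)) / np.arange(1, len(pvals) + 1)
    passed = np.nonzero(stats <= alpha)[0]
    return passed[-1] + 1 if len(passed) > 0 else 0

# small p-values early (signal) followed by null-looking ones
print(forward_stop([0.001, 0.003, 0.4, 0.9, 0.6], alpha=0.1))  # -> 2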

An exact version of the covariance test, based on the Kac-Rice formula, was developed in a general framework that covers problems beyond the LASSO. A sequential version along the LARS path, which we refer to as the spacings test, was developed as well.

Here is the exact test, which is the first step of the spacings test.

[3]:
from scipy.stats import norm as ndist
Texact = np.zeros(2000)
for i in range(2000):
    W = np.sort(Z[i])
    # ratio of Gaussian survival functions at the top two order statistics:
    # exactly U(0,1) under the global null
    Texact[i] = ndist.sf(W[-1]) / ndist.sf(W[-2])
ax.plot(Ugrid, ECDF(Texact)(Ugrid), c='blue', drawstyle='steps', label='exact covTest',
        linewidth=3)
covtest_fig
[3]:
../_images/algorithms_covtest_6_0.png

Covariance test for regression

The tests above were based on an IID sample, but both the covtest and its exact version can be used in a regression setting. Both tests require the covariance of the noise to be known.

Formally, suppose

\[y|X \sim N(\mu, \Sigma).\]

The exact test is a test of

\[H_0:\mu=0.\]

The test is based on

\[\lambda_{\max} = \|X^Ty\|_{\infty}.\]

This is the value of \(\lambda\) at which the first variable enters the LASSO path. That is, \(\lambda_{\max}\) is the smallest \(\lambda\) for which \(\beta=0\) solves

\[\text{minimize}_{\beta} \frac{1}{2} \|y-X\beta\|^2_2 + \lambda \|\beta\|_1.\]
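
As a quick numerical check of this characterization (a sketch assuming scikit-learn is available; its lasso objective divides the quadratic term by \(n\), hence the rescaling of \(\lambda\)):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
y = rng.standard_normal(50)
lam_max = np.fabs(X.T.dot(y)).max()

# sklearn's Lasso minimizes (1/(2n))||y - Xb||^2_2 + alpha*||b||_1,
# so lambda on the scale above corresponds to alpha = lambda / n
for lam in [1.01 * lam_max, 0.99 * lam_max]:
    nnz = np.count_nonzero(Lasso(alpha=lam / len(y)).fit(X, y).coef_)
    print(round(lam / lam_max, 2), nnz)   # expect 0 nonzeros, then 1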

Formally, the exact test conditions on the variable \(i^*(y)\) that achieves \(\lambda_{\max}\) and tests a weaker null hypothesis

\[H_0:X[:,i^*(y)]^T\mu=0.\]

The covtest is an approximation of this test, based on the same Mills' ratio calculation. (Roughly speaking, the calculation says that the overshoot of a Gaussian above a high level \(u\) is approximately an exponential random variable with mean \(u^{-1}\).)
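
A quick numerical illustration of this heuristic, drawing exactly from the tail via the inverse survival function (an illustrative sketch):

import numpy as np
from scipy.stats import norm as ndist

u = 5.0
U = np.random.uniform(size=100000)
Z = ndist.isf(U * ndist.sf(u))   # exact draws from Z | Z > u
print((u * (Z - u)).mean())      # a bit below 1 at u = 5; tends to 1 as u grows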

Here is a simulation under \(\Sigma = \sigma^2 I\) with \(\sigma\) known. The design matrix, before standardization, is Gaussian equicorrelated in the population with parameter 1/2.

[4]:
n, p, nsim, sigma = 50, 200, 1000, 1.5

def instance(n, p, beta=None, sigma=sigma):
    # adding a shared row-wise Gaussian makes distinct columns
    # equicorrelated with population correlation 1/2
    X = (np.random.standard_normal((n,p)) + np.random.standard_normal(n)[:,None])
    X /= X.std(0)[None,:]           # standardize the columns
    X /= np.sqrt(n)
    Y = np.random.standard_normal(n) * sigma
    if beta is not None:
        Y += np.dot(X, beta)
    return X, Y
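
As a quick sanity check (illustrative), the population correlation between distinct columns is 1/2, since each entry is the sum of an independent standard normal and a term shared across the row; standardization does not change this:

Xbig, _ = instance(5000, 5)
print(np.corrcoef(Xbig.T).round(2))   # off-diagonal entries near 0.5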

Let’s make a dataset under our global null and compute the covtest \(p\)-value.

[5]:
X, Y = instance(n, p, sigma=sigma)
cone, pval, idx, sign = covtest(X, Y, exact=False)
pval
[5]:
0.06488988967797554

The object cone is an instance of selectinf.constraints.affine.constraints, which does much of the work for affine selection procedures. The variables idx and sign store which variable achieved \(\lambda_{\max}\) and the sign of its correlation with \(y\).

[6]:
cone
[6]:
$$Z \sim N(\mu,\Sigma) | AZ \leq b$$
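
The affine inequality \(AZ \leq b\) is stored on the object. For instance (attribute names taken from selectinf.constraints.affine; treat them as an assumption about the current API):

A, b = cone.linear_part, cone.offset   # assumed attribute names for A and b
print(A.shape, b.shape)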
[7]:
def simulation(beta):
    # compare the ECDFs of the covtest and exact covtest p-values
    # over nsim draws from the instance above
    Pcov = []
    Pexact = []

    for i in range(nsim):
        X, Y = instance(n, p, sigma=sigma, beta=beta)
        Pcov.append(covtest(X, Y, sigma=sigma, exact=False)[1])
        Pexact.append(covtest(X, Y, sigma=sigma, exact=True)[1])

    Ugrid = np.linspace(0,1,101)
    plt.figure(figsize=(6,6))
    plt.plot(Ugrid, ECDF(Pcov)(Ugrid), label='covtest', drawstyle='steps', c='k', linewidth=3)
    plt.plot(Ugrid, ECDF(Pexact)(Ugrid), label='exact covtest', drawstyle='steps', c='blue', linewidth=3)
    plt.legend(loc='lower right')

Null

[8]:
beta = np.zeros(p)
simulation(beta)
../_images/algorithms_covtest_16_0.png

1-sparse

[9]:
beta = np.zeros(p)
beta[0] = 4
simulation(beta)
../_images/algorithms_covtest_18_0.png

2-sparse

[10]:
beta = np.zeros(p)
beta[:2] = 4
simulation(beta)
../_images/algorithms_covtest_20_0.png

5-sparse

[11]:
beta = np.zeros(p)
beta[:5] = 4
simulation(beta)
../_images/algorithms_covtest_22_0.png