{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Randomized LASSO\n", "\n", "This selection algorithm allows the researcher to form a model \n", "after observing the *subgradient* of this optimization problem\n", "\n", "$$\n", "\\text{minimize}_{\\beta} \\frac{1}{2} \\|y-X\\beta\\|^2_2 + \\sum_j \\lambda_j |\\beta_j| - \\omega^T\\beta + \\frac{\\epsilon}{2} \\|\\beta\\|^2_2\n", "$$\n", "\n", "where $\\omega \\sim N(0,\\Sigma)$ is Gaussian randomization with a covariance specified by the user. Data splitting\n", "is (asymptotically) a special case of this randomization mechanism." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(100, 20)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "from selectinf.randomized.api import lasso\n", "from selectinf.tests.instance import gaussian_instance\n", "\n", "np.random.seed(0) # for reproducibility\n", "\n", "X, y = gaussian_instance(n=100,\n", " p=20, \n", " s=5, \n", " signal=3,\n", " equicorrelated=False, \n", " rho=0.4,\n", " random_signs=True)[:2]\n", "n, p = X.shape\n", "n, p" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Randomization mechanism\n", "\n", "By default, isotropic Gaussian randomization is chosen with variance chosen based on \n", "mean diagonal of $X^TX$ and the standard deviation of $y$." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1, 6, 17, 18])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L = lasso.gaussian(X, y, 2 * np.diag(X.T.dot(X)) * np.std(y)) \n", "signs = L.fit()\n", "active_set = np.nonzero(signs != 0)[0]\n", "active_set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that variables `[1,6,17,18]` are chosen here. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference\n", "\n", "For inference, the user can in principle choose any target jointly normal with $\\nabla \\ell(\\beta^*;X,y) = \n", "X^T(X\\beta^*-y)$ where $\\beta^*$ is the population minimizer under the model $(X_i,y_i) \\overset{IID}{\\sim} F$.\n", "\n", "For convenience, we have provided some targets, though our functions expect boolean representation of the active set." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from selectinf.randomized.lasso import selected_targets\n", "active_bool = np.zeros(p, np.bool)\n", "active_bool[active_set] = True\n", "\n", "(observed_target,\n", " cov_target,\n", " cov_target_score,\n", " alternatives) = selected_targets(L.loglike, np.ones(n), active_bool)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given our target $\\widehat{\\theta}$ and its estimated covariance $\\Sigma$\n", "as well as its joint covariance $\\tilde{\\Gamma}$ with $\\nabla \\ell(\\beta^*; X,y)$ we use th linear\n", "decomposition \n", "$$\n", "\\begin{aligned}\n", "\\nabla \\ell(\\beta^*; X,y) &= \\nabla \\ell(\\beta^*; X,y) - \\tilde{\\Gamma} \\Sigma^{-1} \\widehat{\\theta} + \\tilde{\\Gamma} \\Sigma^{-1} \\widehat{\\theta} \\\\\n", "&= N + \\Gamma \\widehat{\\theta}.\n", "\\end{aligned}\n", "$$\n", "\n", "We have arranged things so that (pre-selection) $N$ is uncorrelated (and asympotically independent of) $\\widehat{\\theta}$.\n", "\n", "We can then form univariate tests of $H_{0,j}:\\theta_j=0$ based on this conditional distribution.\n", "As the form is unknown, we approximate it using MCMC with `ndraw` steps after a burnin of `burnin` steps.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4,)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "observed_target.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "Xsel_inv = np.linalg.pinv(X[:, active_set])\n", "np.testing.assert_allclose(observed_target, Xsel_inv.dot(y))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "dispersion = np.linalg.norm(y - X[:, active_set].dot(Xsel_inv.dot(y)))**2 / (n - len(active_set))\n", "np.testing.assert_allclose(cov_target, dispersion * Xsel_inv.dot(Xsel_inv.T))\n", "np.testing.assert_allclose(cov_target_score, - X.T.dot(X)[:,active_set].dot(cov_target).T, rtol=np.inf, atol=1.e-10) # some zeros so relative" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "pivots, pvals, intervals = L.summary(observed_target,\n", " cov_target, # \\Sigma\n", " cov_target_score, # \\tilde{\\Gamma}\n", " alternatives,\n", " ndraw=10000,\n", " burnin=2000,\n", " compute_intervals=True)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.04349979, 0.0516205 , 0.00783708, 0.44920772])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pvals" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-29.28009003, -15.89881596],\n", " [ 30.51031835, 37.58445178],\n", " [-25.87190582, -8.52983107],\n", " [ -3.73877411, 7.06250691]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "intervals" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "all,-slideshow", "formats": "ipynb,Rmd" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }