A significance test for the lasso


Richard Lockhart^1, Jonathan Taylor^2, Ryan J. Tibshirani^3, Robert Tibshirani^2
^1 Simon Fraser University, ^2 Stanford University, ^3 Carnegie Mellon University
arXiv preprint, v1 [math.ST], 30 Jan

Abstract

In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result assumes some (reasonable) regularity conditions on the predictor matrix X, and covers the important high-dimensional case p > n. Of course, for testing the significance of an additional variable between two nested linear models, one may use the usual chi-squared test, comparing the drop in residual sum of squares (RSS) to a χ²_1 distribution. But when this additional variable is not fixed, but has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than χ²_1 under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the ℓ_1 penalty. Therefore the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties, adaptivity and shrinkage, and its null distribution is tractable and asymptotically Exp(1).

Keywords: lasso, least angle regression, significance test

1 Introduction

Given an outcome vector y ∈ R^n and a matrix X ∈ R^{n×p} of predictor variables, we consider the usual linear regression setup:

    y = Xβ + σε,    (1)

where β ∈ R^p are unknown coefficients to be estimated, σ² > 0 is the marginal noise variance, and the components of the noise vector ε ∈ R^n are i.i.d. N(0, 1).[1] We focus on the lasso estimator (Tibshirani 1996, Chen et al. 1998), defined as

    β̂ = argmin_{β ∈ R^p} (1/2)||y − Xβ||²_2 + λ||β||_1,    (2)

where λ ≥ 0 is a tuning parameter, controlling the degree of sparsity in the estimate β̂. Here we assume that the columns of X are in general position in order to ensure uniqueness of the lasso solution [this is quite a weak condition, to be discussed again shortly; see also Tibshirani (2012)].

[1] If an intercept term is desired, then we can still assume a model of the form (1) after centering y and the columns of X; see Section 2.2 for discussion of this point.
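As a concrete illustration of the estimator (2) and its solution path, the following short sketch (ours, not code from the paper) simulates data from model (1) and computes the lasso path with scikit-learn's lars_path; note that scikit-learn scales the penalty by 1/n, so the knots in the parameterization of (2) are n times the returned alphas.

```python
# Minimal sketch (not part of the paper): the lasso solution path for data
# simulated from model (1). Assumes scikit-learn; its lars_path scales the
# penalty by 1/n_samples, so knots in the scaling of (2) are n * alphas.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p, sigma = 100, 10, 1.0
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)          # unit-norm columns
beta = np.zeros(p)                       # the global null, beta = 0
y = X @ beta + sigma * rng.standard_normal(n)

alphas, active, coefs = lars_path(X, y, method="lasso")
knots = n * alphas                       # lambda_1 >= lambda_2 >= ... in (2)
print("first knots:", knots[:4])
print("order of entry:", active[:4])
```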

There has been a considerable amount of recent work dedicated to the lasso problem, both in terms of computation and theory. A comprehensive summary of the literature in either category would be too long for our purposes here, so we instead give a short summary: for computational work, some relevant contributions are Friedman et al. (2007), Beck & Teboulle (2009), Friedman et al. (2010), Becker, Bobin & Candes (2011), Boyd et al. (2011), Becker, Candes & Grant (2011); and for theoretical work see, e.g., Greenshtein & Ritov (2004), Fuchs (2005), Donoho (2006), Candes & Tao (2006), Zhao & Yu (2006), Wainwright (2009), Candes & Plan (2009). Generally speaking, theory for the lasso is focused on bounding the estimation error ||Xβ̂ − Xβ||²_2 or ||β̂ − β||²_2, or ensuring exact recovery of the underlying model, supp(β̂) = supp(β) [with supp(·) denoting the support function]; favorable results in both respects can be shown under the right assumptions on the generative model (1) and the predictor matrix X.

Strong theoretical backing, as well as fast algorithms, have made the lasso a highly popular tool. Yet there are still major gaps in our understanding of the lasso as an estimation procedure. In many real applications of the lasso, a practitioner will undoubtedly seek some sort of inferential guarantees for his or her computed lasso model, but, generically, the usual constructs like p-values, confidence intervals, etc., do not exist for lasso estimates. There is a small but growing literature dedicated to inference for the lasso, and important progress has certainly been made, mostly through the use of resampling methods; we review this work in Section 2.5. The current paper focuses on a significance test for lasso models that does not employ resampling, but instead proposes a test statistic that has a simple and exact asymptotic null distribution.

Section 2 defines the problem that we are trying to solve, and gives the details of our proposal, the covariance test statistic. Section 3 considers an orthogonal predictor matrix X, in which case the statistic greatly simplifies. Here we derive its Exp(1) asymptotic distribution using relatively simple arguments from extreme value theory. Section 4 treats a general (non-orthogonal) X, and under some regularity conditions, derives an Exp(1) limiting distribution for the covariance test statistic, but through a different method of proof that relies on discrete-time Gaussian processes. Section 5 empirically verifies convergence of the null distribution to Exp(1) over a variety of problem setups. Up until this point we have assumed that the error variance σ² is known; in Section 6 we discuss the case of unknown σ². Section 7 gives some real data examples. Section 8 covers extensions to the elastic net, generalized linear models, and the Cox model for survival data. We conclude with a discussion in Section 9.

2 Significance testing in linear modeling

Classic theory for significance testing in linear regression operates on two fixed nested models. For example, if M and M ∪ {j} are fixed subsets of {1, ..., p}, then to test the significance of the jth predictor in the model (with variables in M ∪ {j}), one naturally uses the chi-squared test, which computes the drop in residual sum of squares (RSS) from regression on M ∪ {j} and M,

    R_j = (RSS_M − RSS_{M∪{j}})/σ²,    (3)

and compares this to a χ²_1 distribution. (Here σ² is assumed to be known; when σ² is unknown, we use the sample variance in its place, which results in the F-test, equivalent to the t-test, for testing the significance of variable j.)
Often, however, one would like to run the same test for M and M ∪ {j} that are not fixed, but are the outputs of an adaptive or greedy procedure. Unfortunately, adaptivity invalidates the use of a χ²_1 null distribution for the statistic (3). As a simple example, consider forward stepwise regression: starting with an empty model M = ∅, we enter predictors one at a time, at each step choosing the predictor j that gives the largest drop in residual sum of squares. In other words, forward stepwise regression chooses j at each step in order to maximize R_j in (3), over all j ∉ M. Since R_j follows a χ²_1 distribution under the null hypothesis for each fixed j, the maximum possible R_j will clearly be stochastically larger than χ²_1 under the null.
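This inflation is easy to reproduce numerically. The sketch below (our illustration, not the authors' code) simulates the fully null case with an orthonormal design, where the fixed-j statistic is R_j = (X_j^T y)²/σ² and forward stepwise selects its maximizer; a naive χ²_1 test at level 0.05 then rejects roughly 40% of the time, in line with the figure quoted below for Figure 1(a).

```python
# Sketch (ours): under the global null, the drop in RSS for the adaptively
# chosen first predictor is max_j R_j, which is far from chi-squared_1.
# With orthonormal X, R_j = (X_j^T y)^2 / sigma^2.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, p, sigma, nsim = 100, 10, 1.0, 2000
crit = chi2.ppf(0.95, df=1)                            # nominal 0.05 cutoff, 3.84
rejections = 0
for _ in range(nsim):
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
    y = sigma * rng.standard_normal(n)                 # beta = 0
    R = (X.T @ y) ** 2 / sigma ** 2                    # fixed-j chi-squared statistics
    rejections += np.max(R) > crit                     # forward stepwise picks the max
print("type I error of the naive chi-squared test:", rejections / nsim)
# prints roughly 0.4, echoing the ~39% figure for Figure 1(a)
```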

Figure 1: A simple example with n = 100 observations and p = 10 orthogonal predictors. All true regression coefficients are zero, β = 0. On the left is a quantile-quantile plot, constructed over 1000 simulations, of the standard chi-squared statistic R_1 in (3), measuring the drop in residual sum of squares for the first predictor to enter in forward stepwise regression, versus the χ²_1 distribution. The dashed vertical line marks the 95% quantile of the χ²_1 distribution. The right panel shows a quantile-quantile plot of the covariance test statistic T_1 in (5) for the first predictor to enter in the lasso path, versus its asymptotic distribution Exp(1). The covariance test explicitly accounts for the adaptive nature of lasso modeling, whereas the usual chi-squared test is not appropriate for adaptively selected models, e.g., those produced by forward stepwise regression.

Therefore, using a chi-squared test to evaluate the significance of a predictor entered by forward stepwise regression would be far too liberal (having type I error much larger than the nominal level). Figure 1(a) demonstrates this point by displaying the quantiles of R_1 in forward stepwise regression (the chi-squared statistic for the first predictor to enter) versus those of a χ²_1 variate, in the fully null case (when β = 0). A test at level 0.05, for example, using the χ²_1 cutoff of 3.84, would have actual type I error of about 39%. The failure of standard testing methodology when applied to forward stepwise regression is not an anomaly: in general, there is no simple way to carry out the significance tests designed for fixed linear models in an adaptive setting. Our aim is hence to provide a (new) significance test for the predictor variables chosen adaptively by the lasso, which we describe next.

2.1 The covariance test statistic

The test statistic that we propose here is constructed from the lasso solution path, i.e., the solution β̂(λ) in (2) as a function of the tuning parameter λ ∈ [0, ∞). The lasso path can be computed by the well-known LARS algorithm of Efron et al. (2004) [see also Osborne et al. (2000a), Osborne et al. (2000b)], which traces out the solution as λ decreases from ∞ to 0. Note that when rank(X) < p, there are possibly many lasso solutions at each λ and therefore many possible solution paths; we assume that the columns of X are in general position,[2] implying that there is a unique lasso solution at each λ > 0 and hence a unique path.

[2] To be precise, we say that points X_1, ..., X_p ∈ R^n are in general position provided that no k-dimensional affine subspace L ⊆ R^n, k < min{n, p}, contains more than k + 1 elements of {±X_1, ..., ±X_p}, excluding antipodal pairs. Equivalently: the affine span of any k + 1 points s_1 X_{i_1}, ..., s_{k+1} X_{i_{k+1}}, for any signs s_1, ..., s_{k+1} ∈ {−1, 1}, does not contain any element of the set {±X_i : i ≠ i_1, ..., i_{k+1}}.

The assumption that X has columns in general position is a very weak one [much weaker, e.g., than assuming that rank(X) = p]. For example, if the entries of X are drawn from a continuous probability distribution on R^{np}, then the columns of X are almost surely in general position, and this is true regardless of the sizes of n and p. See Tibshirani (2012).

Before defining our statistic, we briefly review some properties of the lasso path. The path β̂(λ) is a continuous and piecewise linear function of λ, with knots (changes in slope) at values λ_1 ≥ λ_2 ≥ ... ≥ λ_r ≥ 0 (these knots depend on y, X). At λ = ∞, the solution β̂(∞) has no active variables (i.e., all variables have zero coefficients); for decreasing λ, each knot λ_k marks the entry or removal of some variable from the current active set (i.e., its coefficient becomes nonzero or zero, respectively). Therefore the active set, and also the signs of active coefficients, remain constant in between knots. At any point λ in the path, the corresponding active set A = supp(β̂(λ)) of the lasso solution indexes a linearly independent set of predictor variables, i.e., rank(X_A) = |A|, where we use X_A to denote the columns of X in A. For a matrix X satisfying the positive cone condition (a restrictive condition that covers, e.g., orthogonal matrices), there are no variables removed from the active set as λ decreases, and therefore the number of knots is min{n, p}.

We can now precisely define the problem that we are trying to solve: at a given step in the lasso path (i.e., at a given knot), we consider testing the significance of the variable that enters the active set. To this end, we propose a test statistic defined at the kth step of the path. First we define some needed quantities. Let A be the active set just before λ_k, and suppose that predictor j enters at λ_k. Denote by β̂(λ_{k+1}) the solution at the next knot in the path λ_{k+1}, using predictors A ∪ {j}. Finally, let β̃_A(λ_{k+1}) be the solution of the lasso problem using only the active predictors X_A, at λ = λ_{k+1}. To be perfectly explicit,

    β̃_A(λ_{k+1}) = argmin_{β_A ∈ R^{|A|}} (1/2)||y − X_A β_A||²_2 + λ_{k+1}||β_A||_1.    (4)

We propose the covariance test statistic defined by

    T_k = (⟨y, Xβ̂(λ_{k+1})⟩ − ⟨y, X_A β̃_A(λ_{k+1})⟩)/σ².    (5)

Intuitively, the covariance statistic in (5) is a function of the difference between Xβ̂ and X_A β̃_A, the fitted values given by incorporating the jth predictor into the current active set, and leaving it out, respectively. These fitted values are parametrized by λ, and so one may ask: at which value of λ should this difference be evaluated? Well, note first that β̃_A(λ_k) = β̂_A(λ_k), i.e., the solution of the reduced problem at λ_k is simply that of the full problem, restricted to the active set A (as verified by the KKT conditions). Clearly then, this means that we cannot evaluate the difference at λ = λ_k, as the jth variable has a zero coefficient upon entry at λ_k, and hence Xβ̂(λ_k) = X_A β̂_A(λ_k) = X_A β̃_A(λ_k). Indeed, the natural choice for the tuning parameter in (5) is λ = λ_{k+1}: this allows the jth coefficient to have its fullest effect on the fit Xβ̂ before the entry of the next variable at λ_{k+1} (or possibly, the deletion of a variable from A at λ_{k+1}).

Secondly, one may also ask about the particular choice of function of Xβ̂(λ_{k+1}) − X_A β̃_A(λ_{k+1}). The covariance statistic in (5) uses an inner product of this difference with y, which can be roughly thought of as an (uncentered) covariance, hence explaining its name.[3]
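For readers who want to compute T_k numerically, here is a minimal sketch (ours) of definition (5); it assumes unit-norm columns, no deletions from the active set before step k, and scikit-learn's 1/n scaling of the lasso penalty, so that sklearn's alpha equals λ/n in the notation of (2).

```python
# Sketch of the covariance statistic (5), not the authors' code.
# Assumes: unit-norm columns, no deletions before step k, sklearn's 1/n scaling.
import numpy as np
from sklearn.linear_model import lars_path, Lasso

def covariance_statistic(X, y, k, sigma2):
    """Covariance statistic T_k of (5) at the k-th step of the lasso path."""
    n = len(y)
    alphas, active, coefs = lars_path(X, y, method="lasso")
    A = list(active[:k - 1])                   # active set just before lambda_k
    full_fit = X @ coefs[:, k]                 # X beta_hat(lambda_{k+1}), read off the path
    if A:
        # reduced problem (4) at lambda_{k+1}; sklearn's alpha = lambda / n
        reduced = Lasso(alpha=alphas[k], fit_intercept=False).fit(X[:, A], y)
        reduced_fit = X[:, A] @ reduced.coef_  # X_A beta_tilde_A(lambda_{k+1})
    else:
        reduced_fit = np.zeros(n)              # empty active set: zero fit
    return (y @ full_fit - y @ reduced_fit) / sigma2
```

For example, covariance_statistic(X, y, k=1, sigma2=1.0) gives T_1 for the first variable to enter; with unit-norm columns it should agree with λ_1(λ_1 − λ_2)/σ², the knot form derived in Section 2.3.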

At a high level, the larger the covariance of y with Xβ̂ compared to that with X_A β̃_A, the more important the role of variable j in the proposed model A ∪ {j}. There certainly may be other functions that would seem appropriate here, but the covariance form in (5) has a distinctive advantage: this statistic admits a simple and exact asymptotic null distribution, assuming normality of the errors in (1). The null hypothesis here is that the current lasso model contains all truly active variables, A ⊇ supp(β), and in Sections 3 and 4, we show that under the null,

    T_k →d Exp(1),

i.e., T_k is asymptotically distributed as a standard exponential random variable, given reasonable assumptions on X and the magnitudes of the nonzero true coefficients. In the above limit, we are considering both n, p → ∞ (and in Section 4 we allow for the possibility p > n, the high-dimensional case). We will also see that the result T_k →d Exp(1) applies to the first null predictor entered, after all signal variables (if there are any) have been entered. For the lth null variable entered, the distribution is Exp(1/l). Since l is unknown, we propose to always use the Exp(1) distribution for testing. Now Exp(1/l) is stochastically smaller than Exp(1) for l > 1, so this yields a conservative test. See Figure 1(b) for a quantile-quantile plot of T_1 versus an Exp(1) variate for the same fully null example (β = 0) used in Figure 1(a); this shows that the weak convergence to Exp(1) can be quite fast, as the quantiles are decently matched even for p = 10.

Before proving this limiting distribution in Sections 3 (for an orthogonal X) and 4 (for a general X), we give an example of its application to real data, and discuss issues related to practical usage. We also derive useful alternative expressions for the statistic, discuss the connection to degrees of freedom, and review related work.

2.2 Prostate cancer data example and practical issues

We consider a training set of 67 observations and 8 predictors, the goal being to predict the log PSA level of men who had surgery for prostate cancer. For more details see Hastie et al. (2008) and the references therein. Table 1 shows the results of forward stepwise regression and the lasso. Both methods entered the same predictors in the same order. The forward stepwise p-values are smaller than the lasso p-values: forward stepwise would enter four predictors at the 0.05 level, while the lasso would enter only one or maybe two predictors. However, we know that the forward stepwise p-values are not accurate, as they are based on a null distribution that does not account for the adaptive choice of predictors.

Remark A. In the above example we implied that one might stop entering variables when the p-value rises above some threshold. More generally, our proposed test statistic and associated p-values could be used as the basis for multiple testing and false discovery rate control methods for this problem; we leave that to future work.

Remark B. Note that for a general X, a predictor may enter the active set more than one time, since it may leave the active set at some point. In this case we treat each entry as a separate problem. Therefore, our test is specific to a step in the path, and not to a predictor at large.

Remark C. When we have an intercept in the model (1), we center y and the columns of X, and use the covariance test as is. Empirically the resulting distribution of T is close to Exp(1). Theoretically, centering creates dependence between the components of ε, but the dependence is weak and we do not study it here.

Remark D.
The covariance test is applied in a sequential manner, estimating p-values for each predictor as it enters. A more difficult problem is to test the significance of any of the nonzero predictors in a linear model fit by the lasso, at some arbitrary value of the tuning parameter λ. We discuss this problem briefly in Section 9.

[3] From its definition in (5), we get

    σ²T_k = ⟨y − μ, Xβ̂(λ_{k+1})⟩ − ⟨y − μ, X_A β̃_A(λ_{k+1})⟩ + ⟨μ, Xβ̂(λ_{k+1}) − X_A β̃_A(λ_{k+1})⟩

by expanding y = (y − μ) + μ, with μ = Xβ denoting the true mean. The first two terms are now really empirical covariances, and the last term is typically small. In fact, when X is orthogonal, it is not hard to see that this last term is exactly zero under the null hypothesis.

Table 1: Forward stepwise and lasso applied to the prostate cancer data example. The error variance is estimated by σ̂², the MSE of the full model. Forward stepwise regression p-values are based on comparing the drop in residual sum of squares (divided by σ̂²) to an F(1, n − p) distribution (using χ²_1 instead produced slightly smaller p-values). The lasso p-values use a simple modification of the covariance test (5) for unknown variance, given in Section 6.

    Step   Predictor entered   Forward stepwise   Lasso
    1      lcavol
    2      lweight
    3      svi
    4      lbph
    5      pgg45
    6      age
    7      lcp
    8      gleason

2.3 Alternate expressions for the covariance statistic

Here we derive two alternate forms for the covariance statistic in (5). The first lends some insight into the role of shrinkage, and the second is helpful for the convergence results that we establish in Sections 3 and 4. We rely on some basic properties of lasso solutions; see, e.g., Tibshirani & Taylor (2012), Tibshirani (2012). To remind the reader, we are assuming that X has columns in general position.

For any fixed λ, if the lasso solution has active set A = supp(β̂(λ)) and signs s_A = sign(β̂_A(λ)), then it can be written explicitly (over the active variables) as

    β̂_A(λ) = (X_A^T X_A)^{-1} X_A^T y − λ(X_A^T X_A)^{-1} s_A.

In the above expression, the first term (X_A^T X_A)^{-1} X_A^T y simply gives the regression coefficients of y on the active variables X_A, and the second term λ(X_A^T X_A)^{-1} s_A can be thought of as a shrinkage term, shrinking the values of these coefficients towards zero. Further, the lasso fitted value at λ is

    Xβ̂(λ) = P_A y − λ(X_A^T)^+ s_A,    (6)

where P_A = X_A(X_A^T X_A)^{-1} X_A^T denotes the projection onto the column space of X_A, and (X_A^T)^+ = X_A(X_A^T X_A)^{-1} is the (Moore-Penrose) pseudo-inverse of X_A^T.

Using the representation (6) for the fitted values, we can derive our first alternate expression for the covariance statistic in (5). If A and s_A are the active set and signs just before the knot λ_k, and j is the variable added to the active set at λ_k, then by (6),

    Xβ̂(λ_{k+1}) = P_{A∪{j}} y − λ_{k+1}(X_{A∪{j}}^T)^+ s_{A∪{j}},

where s_{A∪{j}} joins s_A and the sign of the jth coefficient, i.e., s_{A∪{j}} = sign(β̂_{A∪{j}}(λ_{k+1})). Let us assume for the moment that the solution of the reduced lasso problem (4) at λ_{k+1} has all variables active and s_A = sign(β̃_A(λ_{k+1})); remember, this holds for the reduced problem at λ_k, and we will return to this assumption shortly. Then, again by (6),

    X_A β̃_A(λ_{k+1}) = P_A y − λ_{k+1}(X_A^T)^+ s_A,

and plugging the above two expressions into (5),

    T_k = y^T(P_{A∪{j}} − P_A)y/σ² + λ_{k+1} y^T((X_A^T)^+ s_A − (X_{A∪{j}}^T)^+ s_{A∪{j}})/σ².    (7)

Note that the first term above is

    y^T(P_{A∪{j}} − P_A)y/σ² = (||y − P_A y||²_2 − ||y − P_{A∪{j}} y||²_2)/σ²,

which is exactly the chi-squared statistic for testing the significance of variable j, as in (3). Hence if A, j were fixed, then without the second term, T_k would have a χ²_1 distribution under the null. But of course A, j are not fixed, and so, much like we saw previously with forward stepwise regression, the first term in (7) will be generically larger than χ²_1, because j is chosen adaptively based on its inner product with the current lasso residual vector. Interestingly, the second term in (7) adjusts for this adaptivity: with this term, which is composed of the shrinkage factors in the solutions of the two relevant lasso problems (on X and X_A), we prove in the coming sections that T_k has an asymptotic Exp(1) null distribution. Therefore, the presence of the second term restores the (asymptotic) mean of T_k to 1, which is what it would have been if A, j were fixed and the second term were missing. In short, adaptivity and shrinkage balance each other out.

This insight aside, the form (7) of the covariance statistic leads to a second representation that will be useful for the theoretical work in Sections 3 and 4. We call this the knot form of the covariance statistic, described in the next lemma.

Lemma 1. Let A be the active set just before the kth step in the lasso path, i.e., A = supp(β̂(λ_k)), with λ_k being the kth knot. Also let s_A denote the signs of the active coefficients, s_A = sign(β̂_A(λ_k)), and j the predictor that enters the active set at λ_k. Then, assuming that

    s_A = sign(β̃_A(λ_{k+1})),    (8)

or in other words, all coefficients are active in the reduced lasso problem (4) at λ_{k+1} and have signs s_A, we have

    T_k = C(A, s_A, j) · λ_k(λ_k − λ_{k+1})/σ²,    (9)

where C(A, s_A, j) = ||(X_{A∪{j}}^T)^+ s_{A∪{j}} − (X_A^T)^+ s_A||²_2.

The proof starts with expression (7), and arrives at (9) through simple algebraic manipulations. We defer it until Appendix A.1.

When does the condition (8) hold? This was a key assumption behind both of the forms (7) and (9) for the statistic. We first note that the solution β̃_A of the reduced lasso problem has signs s_A at λ_k, so it will have the same signs s_A at λ_{k+1} provided that no variables are deleted from the active set in the solution path β̃_A(λ) for λ ∈ [λ_{k+1}, λ_k]. Therefore, assumption (8) holds:

1. When X satisfies the positive cone condition (which includes X orthogonal), because no variables ever leave the active set in this case. In fact, for X orthogonal, it is straightforward to check that C(A, s_A, j) = 1, so T_k = λ_k(λ_k − λ_{k+1})/σ².
2. When k = 1 (we are testing the first variable to enter), as a variable cannot leave the active set right after it has entered. If k = 1 and X has unit norm columns, ||X_i||²_2 = 1 for i = 1, ..., p, then we again have C(A, s_A, j) = 1 (note that A = ∅), so T_1 = λ_1(λ_1 − λ_2)/σ².
3. When s_A = sign((X_A)^+ y), i.e., s_A contains the signs of the least squares coefficients on X_A, because the same active set and signs cannot appear at two different knots in the lasso path (applied here to the reduced lasso problem on X_A).

The first and second scenarios are considered in Sections 3 and 4.1, respectively. The third scenario is actually somewhat general and occurs, e.g., when s_A = sign((X_A)^+ y) = sign(β_A), i.e., both the lasso and least squares on X_A recover the signs of the true coefficients. Section 4.2 studies the general X and k > 1 case, wherein this third scenario is important.
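The knot form (9) is simple to evaluate directly; the following sketch (our transcription, with our own function and argument names) computes C(A, s_A, j) and T_k from the knot values, assuming X_A has full column rank.

```python
# Sketch (ours) of the knot form (9): T_k = C(A, s_A, j) * lam_k * (lam_k - lam_next) / sigma2.
import numpy as np

def knot_form_statistic(X, A, s_A, j, s_j, lam_k, lam_next, sigma2):
    """Covariance statistic via (9), with C(A, s_A, j) as in Lemma 1."""
    def eta(cols, signs):
        # (X_A^T)^+ s_A = X_A (X_A^T X_A)^{-1} s_A; an empty active set gives 0
        if len(cols) == 0:
            return np.zeros(X.shape[0])
        XA = X[:, list(cols)]
        return XA @ np.linalg.solve(XA.T @ XA, np.asarray(signs, dtype=float))
    diff = eta(list(A) + [j], list(s_A) + [s_j]) - eta(list(A), list(s_A))
    C = float(diff @ diff)
    return C * lam_k * (lam_k - lam_next) / sigma2
```

With unit-norm columns and A = ∅ this returns λ_1(λ_1 − λ_2)/σ², matching the second scenario in the list above.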

2.4 Connection to degrees of freedom

There is an interesting connection between the covariance statistic in (5) and the degrees of freedom of a fitting procedure. In the regression setting (1), for an estimate ŷ [which we think of as a fitting procedure ŷ = ŷ(y)], its degrees of freedom is typically defined (Efron 1986a) as

    df(ŷ) = (1/σ²) Σ_{i=1}^n Cov(ŷ_i, y_i).    (10)

In words, df(ŷ) sums the covariances of each observation y_i with its fitted value ŷ_i. Hence the more adaptive a procedure, the higher this covariance, and the greater its degrees of freedom. Using this definition, one can reason [and confirm by simulation, just as in Figure 1(a)] that with k predictors entered into the model, forward stepwise regression has used substantially more than k degrees of freedom. But something quite remarkable happens when we consider the lasso: for a model containing k nonzero coefficients, the degrees of freedom of the lasso fit is equal to k (either exactly or in expectation, depending on the assumptions) [Efron et al. (2004), Zou et al. (2007), Tibshirani & Taylor (2012)]. Why does this happen? Roughly speaking, it is the same adaptivity versus shrinkage phenomenon at play. [Recall our discussion in the last section following the expression (7) for the covariance statistic.] The lasso adaptively chooses the active predictors, which costs extra degrees of freedom; but it also shrinks the nonzero coefficients (relative to the usual least squares estimates), which decreases the degrees of freedom just the right amount, so that the total is simply k (a short numerical illustration is given at the end of Section 2.5). The current work in this paper arose from our desire to find a statistic whose degrees of freedom was equal to one after each LARS step.

2.5 Related work

There is quite a bit of recent work that is related to the proposal of this paper. Wasserman & Roeder (2009) propose a procedure for variable selection and p-value estimation based on sample splitting, and this was extended by Meinshausen et al. (2009). Meinshausen & Bühlmann (2010) propose Stability Selection, a generic method which controls the expected number of false positive selections. Zhang & Zhang (2011) and Bühlmann (2012) derive p-values for parameter components in lasso and ridge regression, based on stochastic upper bounds for the parameter estimates. Minnier et al. (2011) use perturbation resampling-based procedures to approximate the distribution of a general class of penalized parameter estimates. Berk et al. (2010) and Laber & Murphy (2011) propose methods for conservative statistical inference after model selection and classification, respectively. One big difference with the work here: we propose a simple statistic with an exact asymptotic null distribution and do not require any resampling or sample splitting. Unlike other approaches, our proposal exploits the special properties of the lasso, and does not work for other selection procedures.
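The adaptivity-versus-shrinkage claim of Section 2.4 can be checked in a few lines; the following Monte Carlo sketch (ours, using the closed forms available for an orthonormal X) estimates the degrees of freedom (10) of the lasso fit with k nonzero coefficients and of a k-step forward stepwise fit.

```python
# Sketch (ours): Monte Carlo estimates of the degrees of freedom (10) for an
# orthonormal X, where both fits have closed forms. The lasso fit with k nonzero
# coefficients has df close to k, while k-step forward stepwise uses more.
import numpy as np

rng = np.random.default_rng(2)
n, p, k, sigma, nsim = 100, 10, 3, 1.0, 5000
X, _ = np.linalg.qr(rng.standard_normal((n, p)))       # orthonormal columns
Y = sigma * rng.standard_normal((nsim, n))             # true mean is zero
U = Y @ X                                              # U_j = X_j^T y, one row per simulation
thr = np.sort(np.abs(U), axis=1)[:, -(k + 1)]          # (k+1)st largest |U_j| = lambda_{k+1}
lasso_coef = np.sign(U) * np.maximum(np.abs(U) - thr[:, None], 0)   # soft-thresholding
fs_coef = U * (np.abs(U) > thr[:, None])               # keep the k largest, no shrinkage
# df = sum_i Cov(yhat_i, y_i) / sigma^2, estimated across simulations (mean of y is zero)
df_lasso = np.sum(np.mean(Y * (lasso_coef @ X.T), axis=0)) / sigma**2
df_fs = np.sum(np.mean(Y * (fs_coef @ X.T), axis=0)) / sigma**2
print(df_lasso, df_fs)   # roughly k for the lasso, noticeably more than k for forward stepwise
```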
3 An orthogonal predictor matrix X

We discuss the special case of an orthogonal predictor matrix X, i.e., one that satisfies X^T X = I. Even though the results here can be seen as special cases of those for a general X in Section 4, the arguments in the current orthogonal X case rely on relatively straightforward extreme value theory and are hence much simpler than their general X counterparts (which analyze the knots in the lasso path via Gaussian process theory). Furthermore, the Exp(1) limiting distribution for the covariance statistic translates here to some interesting and previously unknown (as far as we can tell) results on the order statistics of independent standard normals. For these reasons, we discuss the orthogonal case in detail.

As noted in the discussion following Lemma 1 (see the first point), we know that the covariance statistic for testing the entry of the variable at the kth step in the lasso path is T_k = λ_k(λ_k − λ_{k+1})/σ². Using orthogonality, we can rewrite ||y − Xβ||²_2 = ||X^T y − β||²_2 + C for a constant C (not depending on β) in the criterion in (2), and then we can see that the lasso solution at any given value of λ has the closed form

    β̂_j(λ) = S_λ(X_j^T y),   j = 1, ..., p,

where X_1, ..., X_p are the columns of X, and S_λ : R → R is the soft-thresholding function,

    S_λ(x) = x − λ if x > λ,   0 if −λ ≤ x ≤ λ,   x + λ if x < −λ.

Letting U_j = X_j^T y, j = 1, ..., p, the knots in the lasso path are simply the values of λ at which the coefficients become nonzero (i.e., cease to be thresholded),

    λ_1 = |U|_(1), λ_2 = |U|_(2), ..., λ_p = |U|_(p),

where |U|_(1) ≥ |U|_(2) ≥ ... ≥ |U|_(p) are the order statistics of |U_1|, ..., |U_p| (somewhat of an abuse of notation). Therefore,

    T_k = |U|_(k)(|U|_(k) − |U|_(k+1))/σ².

Next, we study the case k = 1, the test for the first predictor to enter the active set along the lasso path. We then examine the case k > 1, the test for subsequent predictors.

3.1 The first predictor to enter, k = 1

Consider the covariance test statistic for the first predictor to enter the active set, i.e., for k = 1,

    T_1 = |U|_(1)(|U|_(1) − |U|_(2))/σ².

We are interested in the distribution of T_1 under the null hypothesis; since we are testing the first step, this is H_0 : y ~ N(0, σ²I). Under the null, U_1, ..., U_p are i.i.d., U_j ~ N(0, σ²), and so |U_1|/σ, ..., |U_p|/σ follow a χ_1 distribution (absolute value of a standard Gaussian). That T_1 has an asymptotic Exp(1) null distribution is now given by the next result.

Lemma 2. Let V_1 ≥ V_2 ≥ ... ≥ V_p be the order statistics of an independent sample of χ_1 variates (i.e., they are the sorted absolute values of an independent sample of standard Gaussian variates). Then

    V_1(V_1 − V_2) →d Exp(1) as p → ∞.

Proof. The χ_1 distribution has CDF F(x) = (2Φ(x) − 1)·1(x > 0), where Φ is the standard normal CDF. We first compute

    lim_{t→∞} F''(t)(1 − F(t))/(F'(t))² = −lim_{t→∞} t(1 − Φ(t))/φ(t) = −1,

the last equality using Mills' ratio.

Then a theorem in de Haan & Ferreira (2006) implies that, for constants a_p = F^{-1}(1 − 1/p) and b_p = pF'(a_p), the random variables W_1 = b_p(V_1 − a_p) and W_2 = b_p(V_2 − a_p) converge jointly in distribution,

    (W_1, W_2) →d (−log E_1, −log(E_1 + E_2)),

where E_1, E_2 are independent standard exponentials. Now note that

    V_1(V_1 − V_2) = (a_p + W_1/b_p)(W_1 − W_2)/b_p = (a_p/b_p)(W_1 − W_2) + W_1(W_1 − W_2)/b_p².

We claim that a_p/b_p → 1; this would give the desired result, as it would imply that the first term above converges in distribution to log(E_1 + E_2) − log(E_1), which is standard exponential, and the second term converges to zero, as b_p → ∞. Writing a_p, b_p more explicitly, we see that 1 − 1/p = 2Φ(a_p) − 1, i.e., 1 − Φ(a_p) = 1/(2p), and b_p = 2pφ(a_p). Using Mills' inequalities, and multiplying by 2p,

    (φ(a_p)/a_p)(1 − 1/a_p²) ≤ 1 − Φ(a_p) ≤ φ(a_p)/a_p,
    (b_p/a_p)(1 − 1/a_p²) ≤ 1 ≤ b_p/a_p.

Since a_p → ∞, this means that b_p/a_p → 1, completing the proof.

We were unable to find this remarkably simple result elsewhere in the literature. An easy generalization is as follows.

Lemma 3. If V_1 ≥ V_2 ≥ ... ≥ V_p are the order statistics of an independent sample of χ_1 variates, then for any fixed k ≥ 1,

    (V_1(V_1 − V_2), V_2(V_2 − V_3), ..., V_k(V_k − V_{k+1})) →d (Exp(1), Exp(1/2), ..., Exp(1/k))

as p → ∞, where the limiting distribution (on the right-hand side above) has independent components.

We leave the proof of Lemma 3 to Appendix A.2, since it follows from arguments very similar to those given for Lemma 2. Practically, Lemma 3 tells us that under the global null hypothesis y ~ N(0, σ²I), comparing the covariance statistic T_k at the kth step of the lasso path to an Exp(1) distribution is increasingly conservative [at the first step, T_1 is asymptotically Exp(1); at the second step, T_2 is asymptotically Exp(1/2); at the third step, T_3 is asymptotically Exp(1/3); and so forth]. This progressive conservatism is favorable, if we place importance on parsimony in the fitted model: we are less and less likely to incur a false rejection of the null hypothesis as the size of the model grows. Moreover, we know that the test statistics T_1, T_2, ... at successive steps are asymptotically independent, and hence so are the corresponding p-values; from the point of view of multiple testing corrections, this is nearly an ideal scenario.

Of real interest is the distribution of T_k not under the global null, but rather under the broader null hypothesis that all variables left out of the current model are truly inactive variables (i.e., they have zero coefficients in the true model). We study this in the next section.
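A quick Monte Carlo check of Lemma 3 (ours, for illustration): the successive statistics V_k(V_k − V_{k+1}) built from χ_1 order statistics should have means near 1, 1/2, 1/3, ... and be nearly uncorrelated.

```python
# Sketch (ours): empirical check of Lemma 3 for chi_1 order statistics.
import numpy as np

rng = np.random.default_rng(3)
p, nsim = 2000, 3000
V = np.sort(np.abs(rng.standard_normal((nsim, p))), axis=1)[:, ::-1]   # V_1 >= V_2 >= ...
T = V[:, :4] * (V[:, :4] - V[:, 1:5])      # V_k (V_k - V_{k+1}) for k = 1, ..., 4
print(T.mean(axis=0))                      # roughly [1, 1/2, 1/3, 1/4]
print(np.corrcoef(T, rowvar=False))        # off-diagonal entries near zero
```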

3.2 Subsequent predictors, k > 1

We suppose that exactly k_0 components of the true coefficient vector β are nonzero, and consider testing the entry of the predictor at step k = k_0 + 1. We show that, under the null hypothesis that all truly active predictors are added to the model at steps 1, ..., k_0, the test statistic T_{k_0+1} is asymptotically Exp(1); further, the test statistic T_{k_0+d} at a future step k = k_0 + d is asymptotically Exp(1/d).

The basic idea behind our argument is as follows: if we assume that the nonzero components of β are large enough in magnitude, then it is not hard to show (relying on orthogonality, here) that the truly active predictors are added to the model along the first k_0 steps of the lasso path, with probability tending to one. The test statistic at the (k_0 + 1)st step and beyond would therefore depend on the order statistics of |U_i| for truly inactive variables i, subject to the constraint that the largest of these values is smaller than the smallest |U_j| for truly active variables j. But with our strong signal assumption, i.e., that the nonzero entries of β are large in absolute value, this constraint has essentially no effect, and we are back to studying the order statistics from a χ_1 distribution, as in the last section. We make this idea precise below.

Theorem 1. Assume that X ∈ R^{n×p} is orthogonal, and that there are k_0 nonzero components in the true coefficient vector β. Let A = supp(β) be the true active set. Also assume that the smallest nonzero true coefficient is large compared to √(2 log p),

    min_{j∈A} |β_j|/σ − √(2 log p) → ∞ as p → ∞.

Let B denote the event that the first k_0 variables entering the model along the lasso path are those in A, i.e.,

    B = { min_{j∈A} |U_j| > max_{j∉A} |U_j| },

where U_j = X_j^T y, j = 1, ..., p. Then P(B) → 1 as p → ∞, and for each fixed d ≥ 1, we have

    (T_{k_0+1}, T_{k_0+2}, ..., T_{k_0+d}) →d (Exp(1), Exp(1/2), ..., Exp(1/d))

as p → ∞. The same convergence in distribution holds conditionally on B.

Proof. We first bound P(B). Let θ_p = min_{j∈A} |β_j|, and choose a_p such that a_p/σ − √(2 log p) → ∞ and θ_p − a_p → ∞. Note that U_j ~ N(β_j, σ²), independently for j = 1, ..., p. For j ∈ A,

    P(|U_j| ≤ a_p) = Φ((a_p − β_j)/σ) − Φ((−a_p − β_j)/σ) ≤ Φ((a_p − θ_p)/σ) → 0,

so

    P(min_{j∈A} |U_j| > a_p) = ∏_{j∈A} P(|U_j| > a_p) → 1.

At the same time,

    P(max_{j∉A} |U_j| ≤ a_p) = (Φ(a_p/σ) − Φ(−a_p/σ))^{p−k_0} → 1.

Therefore P(B) → 1. This in fact means that P(C) − P(C ∩ B) → 0 for any sequence of events C, so only the weak convergence of (T_{k_0+1}, ..., T_{k_0+d}) remains to be proved. For this, we let m = p − k_0, and V_1 ≥ V_2 ≥ ... ≥ V_m denote the order statistics of the sample |U_j|/σ, j ∉ A, of independent χ_1 variates. Then, on the event B, we have

    T_{k_0+i} = V_i(V_i − V_{i+1}), for i = 1, ..., d.

As P(B) → 1, we have in general

    T_{k_0+i} = V_i(V_i − V_{i+1}) + o_P(1), for i = 1, ..., d.

Hence we are essentially back in the setting of the last section, and the desired convergence result follows from the same arguments as in the proof of Lemma 3.
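The following small simulation (ours) mimics the setting of Theorem 1 with an orthogonal design and k_0 = 3 strong coefficients, and compares T_{k_0+1} for the first null variable against the Exp(1) benchmark.

```python
# Sketch (ours): Theorem 1 in action, orthogonal X, k0 = 3 strong coefficients.
import numpy as np

rng = np.random.default_rng(4)
n, p, k0, sigma, nsim = 200, 50, 3, 1.0, 3000
X, _ = np.linalg.qr(rng.standard_normal((n, p)))       # orthogonal columns
beta = np.zeros(p); beta[:k0] = 6.0                    # strong signals, well above sqrt(2 log p)
T = np.empty(nsim)
for i in range(nsim):
    y = X @ beta + sigma * rng.standard_normal(n)
    U = np.sort(np.abs(X.T @ y))[::-1]                 # knots |U|_(1) >= |U|_(2) >= ...
    T[i] = U[k0] * (U[k0] - U[k0 + 1]) / sigma**2      # T_{k0+1}
print(T.mean(), np.mean(T > -np.log(0.05)))            # close to 1 and at most ~0.05 if Exp(1) holds
```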

4 A general predictor matrix X

In this section, we assume that the predictor matrix X ∈ R^{n×p} has unit norm columns, ||X_i||_2 = 1, i = 1, ..., p, and that the columns X_1, ..., X_p ∈ R^n are in general position, but otherwise X can be arbitrary. These are very weak assumptions. Our proposed covariance test statistic (5) is closely intertwined with the knots λ_1 ≥ ... ≥ λ_r in the lasso path, as it was defined in terms of the difference between fitted values at successive knots. Moreover, Lemma 1 showed that (provided there are no sign changes in the reduced lasso problem over [λ_{k+1}, λ_k]) this test statistic can be expressed even more explicitly in terms of the values of these knots. As was the case in the last section, this knot form is quite important for our analysis here. Therefore, it is helpful to recall (Efron et al. 2004, Tibshirani 2012) the explicit formulae for the knots in the lasso path. If A denotes the active set and s_A denotes the signs of the active coefficients at a knot λ_k,

    A = supp(β̂(λ_k)),   s_A = sign(β̂_A(λ_k)),

then the next knot λ_{k+1} is given by

    λ_{k+1} = max{λ^join_{k+1}, λ^leave_{k+1}},    (11)

where λ^join_{k+1} and λ^leave_{k+1} are the values of λ at which, if we were to decrease the tuning parameter from λ_k and continue along the current (linear) trajectory for the lasso coefficients, a variable would join and leave the active set A, respectively. These values are

    λ^join_{k+1} = max_{j∉A, s∈{−1,1}} [X_j^T(I − P_A)y / (s − X_j^T(X_A^T)^+ s_A)] · 1{X_j^T(I − P_A)y / (s − X_j^T(X_A^T)^+ s_A) ≤ λ_k},    (12)

where P_A = X_A(X_A^T X_A)^{-1}X_A^T is the projection onto the column space of X_A, and X_A^+ = (X_A^T X_A)^{-1}X_A^T is the pseudo-inverse of X_A; and

    λ^leave_{k+1} = max_{j∈A} [(X_A^+ y)_j / ((X_A^T X_A)^{-1} s_A)_j] · 1{(X_A^+ y)_j / ((X_A^T X_A)^{-1} s_A)_j ≤ λ_k}.    (13)

As we did in Section 3 with the orthogonal X case, we begin by studying the asymptotic distribution of the covariance statistic in the case k = 1 (i.e., the first model along the path), wherein the expressions for the next knot (11), (12), (13) greatly simplify. Following this, we give a sketch of the arguments for the more difficult case k > 1. For the sake of readability we defer the proofs and most technical details until the appendix.

4.1 The first predictor to enter, k = 1

As per our discussion following Lemma 1, we know that the first predictor to enter along the lasso path cannot leave at the next step, so assumption (8) holds, and the covariance statistic for testing the entry of the first variable is T_1 = λ_1(λ_1 − λ_2)/σ². Now let U_j = X_j^T y, j = 1, ..., p, and R = X^T X. With λ_0 = ∞, we have A = ∅, and trivially, no variables can leave the active set. The first knot is hence given by (12), which can be expressed as

    λ_1 = max_{j=1,...,p, s∈{−1,1}} sU_j.    (14)

Letting j_1, s_1 be the first variable to enter and its sign (i.e., they achieve the maximum in the above expression), and recalling that j_1 cannot leave the active set immediately after it has entered, the second knot is again given by (12), written as

    λ_2 = max_{j≠j_1, s∈{−1,1}} [(sU_j − sR_{j,j_1}U_{j_1}) / (1 − ss_1 R_{j,j_1})] · 1{(sU_j − sR_{j,j_1}U_{j_1}) / (1 − ss_1 R_{j,j_1}) ≤ s_1 U_{j_1}}.
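For completeness, here is a direct numpy transcription (ours) of the next-knot computation (11)-(13); it assumes X_A has full column rank, and restricts the candidate values to the interval [0, λ_k] alongside the indicators shown in (12) and (13).

```python
# Sketch (ours) of the next-knot formulas (11)-(13); variable names are our own.
import numpy as np

def next_knot(X, y, A, s_A, lam_k):
    """Next knot (11) given the current active set A (a list) and signs s_A."""
    n, p = X.shape
    A = list(A)
    s_A = np.asarray(s_A, dtype=float)
    if A:
        XA = X[:, A]
        G_inv = np.linalg.inv(XA.T @ XA)
        P_A_y = XA @ (G_inv @ (XA.T @ y))   # P_A y
        v = XA @ (G_inv @ s_A)              # (X_A^T)^+ s_A
        ls = G_inv @ (XA.T @ y)             # X_A^+ y, the least squares coefficients
        shr = G_inv @ s_A                   # (X_A^T X_A)^{-1} s_A
    else:
        P_A_y, v = np.zeros(n), np.zeros(n)
    # joining values, equation (12)
    lam_join = 0.0
    for j in [j for j in range(p) if j not in A]:
        for s in (-1.0, 1.0):
            t = X[:, j] @ (y - P_A_y) / (s - X[:, j] @ v)
            if 0.0 <= t <= lam_k:
                lam_join = max(lam_join, t)
    # leaving values, equation (13)
    lam_leave = 0.0
    if A:
        ratios = ls / shr
        ok = (ratios >= 0.0) & (ratios <= lam_k)
        if ok.any():
            lam_leave = float(ratios[ok].max())
    return max(lam_join, lam_leave)
```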

The general position assumption on X implies that |R_{j,j_1}| < 1, and so 1 − ss_1 R_{j,j_1} > 0, for all j ≠ j_1, s ∈ {−1, 1}. It is easy to show then that the indicator inside the maximum above can be dropped, and hence

    λ_2 = max_{j≠j_1, s∈{−1,1}} (sU_j − sR_{j,j_1}U_{j_1}) / (1 − ss_1 R_{j,j_1}).    (15)

Our goal now is to calculate the asymptotic distribution of T_1 = λ_1(λ_1 − λ_2)/σ², with λ_1 and λ_2 as above, under the null hypothesis; to be clear, since we are testing the significance of the first variable to enter along the lasso path, the null hypothesis is

    H_0 : y ~ N(0, σ²I).    (16)

The strategy that we use here for the general X case, which differs from our extreme value theory approach for the orthogonal X case, is to treat the quantities inside the maxima in expressions (14), (15) for λ_1, λ_2 as discrete-time Gaussian processes. First, we consider the zero mean Gaussian process

    g(j, s) = sU_j, for j = 1, ..., p, s ∈ {−1, 1}.    (17)

We can easily compute the covariance function of this process:

    E[g(j, s)g(j', s')] = ss'R_{j,j'}σ²,

where the expectation is taken over the null distribution in (16). From (14), we know that the first knot is simply

    λ_1 = max_{j,s} g(j, s),

and from (15), the second knot is

    λ_2 = max_{j≠j_1, s} (g(j, s) − E[g(j, s)g(j_1, s_1)]/σ² · g(j_1, s_1)) / (1 − E[g(j, s)g(j_1, s_1)]/σ²),

with j_1 and s_1 being the first variable to enter and its sign, i.e., λ_1 = g(j_1, s_1). Hence in addition to (17), we consider the process

    h^{(j_1,s_1)}(j, s) = (g(j, s) − E[g(j, s)g(j_1, s_1)]/σ² · g(j_1, s_1)) / (1 − E[g(j, s)g(j_1, s_1)]/σ²).    (18)

An important property: for fixed j_1, s_1, the entire process h^{(j_1,s_1)}(j, s) is independent of g(j_1, s_1). This can be seen by verifying that E[h^{(j_1,s_1)}(j, s)g(j_1, s_1)] = 0, and noting that g(j_1, s_1) and h^{(j_1,s_1)}(j, s), all j ≠ j_1, s ∈ {−1, 1}, are jointly normal for fixed j_1, s_1. Now define

    M(j_1, s_1) = max_{j≠j_1, s} h^{(j_1,s_1)}(j, s),    (19)

and from the above we know that for fixed j_1, s_1, M(j_1, s_1) is independent of g(j_1, s_1). If j_1, s_1 are instead treated as random variables that maximize g(j, s), then λ_2 = M(j_1, s_1). Therefore, to study the distribution of T_1 = λ_1(λ_1 − λ_2)/σ², we are interested in the random variable

    g(j_1, s_1)(g(j_1, s_1) − M(j_1, s_1))/σ²

on the event

    {g(j_1, s_1) ≥ g(j, s) for all j, s}.

It turns out that this event, which concerns the argument maximizers of g, can be rewritten as an event concerning only the relative values of g and M [see Taylor et al. (2005) for the analogous result for continuous-time processes].

14 Lemma 4. With g,m as defined in (17, (18, (19, we have {g(j 1,s 1 g(j,s for all j,s} = {g(j 1,s 1 M(j 1,s 1 }. This is an important realization because the dual representation {g(j 1,s 1 M(j 1,s 1 } is more tractable, once we partition the space over the possible argument minimizers j 1,s 1, and use the fact that M(j 1,s 1 is independent of g(j 1,s 1 for fixed j 1,s 1. In this vein, we express the distribution of T 1 = λ 1 (λ 1 λ 2 /σ 2 in terms of the sum P(T 1 > t = ( P g(j 1,s 1 ( g(j 1,s 1 M(j 1,s 1 /σ 2 > t, g(j 1,s 1 M(j 1,s 1. j 1,s 1 The terms in the above sum can be simplified: dropping for notational convenience the dependence on j 1,s 1, we have g(g M/σ 2 > t, g M g/σ > u(t,m/σ, where u(a,b = (b + b 2 +4a/2, which follows by simply solving for g in the quadratic equation g(g M/σ 2 = t. Therefore P(T 1 > t = P (g(j 1,s 1 /σ > u ( t,m(j 1,s 1 /σ j 1,s 1 = Φ ( u(t,m/σ F M(j1,s 1(dm, (20 j 1,s 1 0 where Φ is the standard normal survival function (i.e., Φ = 1 Φ, for Φ the standard normal CDF, F M(j1,s 1 is the distribution of M(j 1,s 1, and we have used the fact that g(j 1,s 1 and M(j 1,s 1 are independent for fixed j 1,s 1, and also M(j 1,s 1 0 on the event {g(j 1,s 1 M(j 1,s 1 }. (The latter follows as Lemma 4 shows this event to be equivalent to j 1,s 1 being the argument maximizers of g, which means that M(j 1,s 1 = λ 2 0. Continuing from (20, we can write the difference between P(T 1 > t and the standard exponential tail, P(Exp(1 > t = e t, as P(T 1 > t e t = j 1,s 1 0 ( Φ( u(t,m/σ e Φ(m/σF Φ(m/σ t M(j1,s 1(dm, (21 where we used the fact that Φ ( m/σ F M(j1,s 1(dm = ( g(j 1,s 1 M(j 1,s 1 = 1. j1,s1p j 1,s 1 0 We now examine the term inside the braces in (21, the difference between a ratio of normal survival functions and e t ; our next lemma shows that this term vanishes as m. Lemma 5. For any t 0, Φ ( u(t,m lim = e m Φ(m t. Hence, loosely speaking, if each M(j 1,s 1 fast enough as p, then the right-hand side in (21 converges to zero, and T 1 converges weakly to Exp(1. This is made precise below. Lemma 6. Consider M(j 1,s 1 defined in (18, (19 over j 1 = 1,...p and s 1 { 1,1}. If for any fixed m 0 > 0 j 1,s 1 P ( M(j 1,s 1 m 0 0 as p, (22 then the right-hand side in (21 converges to zero as p, and so P(T 1 > t e t for all t 0. 14

The assumption in (22) is written in terms of random variables whose distributions are induced by the steps along the lasso path; to make our assumptions more transparent, we show that (22) is implied by a conditional variance bound involving the predictor matrix X alone, and arrive at the main result of this section.

Theorem 2 (Exponential null distribution, general X, k = 1). Assume that X ∈ R^{n×p} has unit norm columns in general position, and let R = X^T X. Assume also that there is some δ > 0 such that for each j = 1, ..., p, there exists a subset of indices S ⊆ {1, ..., p}\{j} with

    1 − R_{i,S\{i}}(R_{S\{i},S\{i}})^{-1}R_{S\{i},i} ≥ δ for all i ∈ S,    (23)

and the size of S growing faster than log p,

    |S|/log p → ∞ as p → ∞.    (24)

Then P(T_1 > t) → e^{−t} as p → ∞ for all t ≥ 0, under the null hypothesis in (16).

4.2 Subsequent predictors, k > 1

Here we give a sketch of the distribution theory for the statistic (5) for testing the significance of subsequent predictors, i.e., for k > 1. Precise statements will be made in future work. We assume that there are k_0 large nonzero components of β, large enough that these variables are entered into the lasso model (and not deleted from the active set) over the first k_0 steps of the lasso path. We consider the distribution of the covariance statistic when a noise variable j_k is added at step k = k_0 + 1. Under the same assumptions, we can also show that the constant sign condition (8) holds (see the third point following Lemma 1), so we may use the knot form of the covariance statistic

    T_k = C(A, s_A, j_k)·λ_k(λ_k − λ_{k+1})/σ²,

with A and s_A being the active set and signs just before knot λ_k. The general calculations for this case all have more or less the same form as they do in the last section. However, a main complication is that the indicator terms inside the maxima in (12), (13) do not vanish for general k as they did in (14), (15). Therefore, in the notation of the last section, we may define g and M analogously, but these two are no longer independent (for a fixed value of the variable j_k to enter and its sign s_k). We hence introduce a triplet of random variables M^+, M^−, M^0, defined carefully so that we can decompose the distribution of the test statistic as

    P(T_k > t) = Σ_{j_k,s_k} P(g(j_k, s_k)(g(j_k, s_k) − M(j_k, s_k))/σ² > t, g(j_k, s_k) ≥ M^+(j_k, s_k), g(j_k, s_k) ≤ M^−(j_k, s_k), 0 ≤ M^0(j_k, s_k)).    (25)

In the above expression, for each j_k, s_k, the random variable g(j_k, s_k) is independent of the triplet (M^+(j_k, s_k), M^−(j_k, s_k), M^0(j_k, s_k)). Furthermore, we have the constraint

    Σ_{j_k,s_k} P(g(j_k, s_k) ≥ M^+(j_k, s_k), g(j_k, s_k) ≤ M^−(j_k, s_k), 0 ≤ M^0(j_k, s_k)) = 1,

which says that these events form a partition (up to a null set). A similar calculation to that given in the last section then shows that the right-hand side in (25), with M(j_k, s_k) replaced by M^+(j_k, s_k), converges to e^{−t} as p → ∞, under the assumption that for any fixed m_0,

    Σ_{j_k,s_k} P(M^+(j_k, s_k) ≤ m_0) → 0.

Adding the condition Σ_{j_k,s_k} P(M^+(j_k, s_k) > M(j_k, s_k)) → 0, we get a (conservative) e^{−t} bound for P(T_k > t) in (25).

Figure 2: Simulated data: quantile-quantile plots of the RSS test from forward stepwise regression (left) and the covariance test from the lasso (right), for testing the (null) 4th predictor. The left panel compares the forward stepwise statistic to the χ²_1 distribution, and the right panel compares the covariance statistic T to the Exp(1) distribution.

5 Simulation of the null distribution

5.1 Orthonormal design

In this example we generated N = 100 observations, each with p = 10 standard Gaussian features. The first three coefficients were equal to 3, and the rest zero. The error standard deviation was σ = 0.5, so that the first three predictors had strong effects and always entered first. Figure 2 shows the results for testing the 4th (null) predictor to enter. The panels show the drop in RSS test from forward stepwise regression and the covariance test, with σ² assumed known. We see that the Exp(1) distribution provides a good approximation for the distribution of the covariance statistic, while χ²_1 is a poor approximation for the RSS test. Figure 3 shows the results for entering the 5th, 6th and 7th predictors. The test will be conservative: with a nominal level of 0.05, the actual type I errors are 0.01 and 0.002 at the first two of these steps, and smaller still at the third. The solid line has slope 1, while the broken lines have slopes 1/2, 1/3, 1/4 respectively, as predicted by Theorem 1.

5.2 Simulations: general design matrix

In Table 2 we simulated null data, and examined the distribution of the covariance test statistic T for the first predictor to enter. We varied the number of predictors p, the feature correlation ρ, and the form of the feature correlation matrix. In the first two correlation setups, the pairwise correlation between each pair of features was ρ, in the data and population, respectively. In the AR(1) setup, the correlation between features j and k is ρ^{|j−k|}. Finally, in the block diagonal setup, the correlation matrix has two equal sized blocks, with population correlation ρ in each block. We see that the Exp(1) distribution is a reasonably good approximation throughout.

In Table 3 we simulated data with the first k coefficients equal to 4.0, and the rest zero, for k = 1, 2, 3. The rest of the setup was the same as in Table 2, except that the sample size was fixed at 50. We computed the mean, variance and tail probability of the covariance statistic T_{k+1} for entering the next (null) (k+1)st predictor, discarding those simulations in which a non-null predictor was chosen in the first k steps. (This occurred 1.7%, 4.0%, and 7.0% of the time, respectively.) The table shows that the Exp(1) distribution is again a reasonably good approximation.

In Figure 4 we estimate the power curves for forward stepwise and the lasso (the latter using the covariance statistic), and find that they have similar power; details are in the figure caption. However, the cutpoints for forward stepwise, estimated here by simulation, are not typically available in practice.

Figure 3: Testing the (null) 5th, 6th and 7th predictors: quantile-quantile plots of the covariance statistic versus Exp(1) in each panel. The solid line has slope 1, while the broken lines have slopes 1/2, 1/3, 1/4 respectively, as predicted by Theorem 1.

6 The case of unknown σ

So far we have assumed that the error variance is known. In practice it will typically be unknown; in that case, we can estimate it and proceed by analogy to standard linear model theory. In particular, consider the numerator from (5),

    W_k = ⟨y, Xβ̂(λ_{k+1})⟩ − ⟨y, X_A β̃_A(λ_{k+1})⟩.    (26)

Consider first the case N > p. Then we can estimate σ² by the mean residual error σ̂² = Σ_{i=1}^N (y_i − μ̂_{full,i})²/(N − p), with μ̂_full the least squares fit for the full model. Then asymptotically

    F_k = W_k/σ̂² ~ F_{2,N−p}.    (27)

This follows because W_k/σ² is asymptotically Exp(1), which is the same as χ²_2/2; (N − p)σ̂²/σ² is asymptotically χ²_{N−p}; and the two are independent. Here is a proof of the independence. Letting P_X denote the projection onto the column space of X, the lasso problem is unchanged if we solve it with P_X y in place of y. The estimate σ̂² is a function of (I − P_X)y. It is clear that P_X y and (I − P_X)y are uncorrelated, and if we assume that y is normally distributed, then P_X y and (I − P_X)y are independent. Therefore the lasso fit and the estimate of σ² are functions of independent quantities, hence independent. This is true for any λ. The statistic that we propose involves both the lasso fit on X and on X_A. For the lasso problem on X_A, we can still replace y with P_X y, because (I − P_X)X_A = 0. Hence the same argument applies to the lasso fit on X_A; that is, it is independent of σ̂².

As an example, consider one of the setups from Table 2, with N = 100, p = 80, AR(1) feature correlation ρ^{|j−k|}, and the model truly null. Consider testing the first predictor to enter. We chose p close to N to expose the difference between the σ known and unknown cases. Table 4 shows the results.
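A sketch of this recipe (ours, reusing the covariance_statistic function from the Section 2.1 sketch, and assuming N > p): estimate σ² from the full-model least squares residuals, form F_k = W_k/σ̂², and compare to F(2, N − p).

```python
# Sketch (ours) of the unknown-sigma version of the test, as in (26)-(27).
import numpy as np
from scipy.stats import f as f_dist

def covariance_test_unknown_sigma(X, y, k):
    """F-statistic (27) and its p-value; requires n > p for the variance estimate."""
    n, p = X.shape
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2_hat = np.sum((y - X @ beta_full) ** 2) / (n - p)   # mean residual error
    W_k = covariance_statistic(X, y, k, sigma2=1.0)           # numerator (26), sigma^2 factored out
    F_k = W_k / sigma2_hat
    return F_k, f_dist.sf(F_k, 2, n - p)                      # statistic and p-value
```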


Inference Conditional on Model Selection with a Focus on Procedures Characterized by Quadratic Inequalities Inference Conditional on Model Selection with a Focus on Procedures Characterized by Quadratic Inequalities Joshua R. Loftus Outline 1 Intro and background 2 Framework: quadratic model selection events

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Statistics Journal Club, 36-825 Sangwon Justin Hyun and William Willie Neiswanger 1 Paper Summary 1.1 Quick intuitive summary

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

STAT 462-Computational Data Analysis

STAT 462-Computational Data Analysis STAT 462-Computational Data Analysis Chapter 5- Part 2 Nasser Sadeghkhani a.sadeghkhani@queensu.ca October 2017 1 / 27 Outline Shrinkage Methods 1. Ridge Regression 2. Lasso Dimension Reduction Methods

More information

Least Angle Regression, Forward Stagewise and the Lasso

Least Angle Regression, Forward Stagewise and the Lasso January 2005 Rob Tibshirani, Stanford 1 Least Angle Regression, Forward Stagewise and the Lasso Brad Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani Stanford University Annals of Statistics,

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

Bias-free Sparse Regression with Guaranteed Consistency

Bias-free Sparse Regression with Guaranteed Consistency Bias-free Sparse Regression with Guaranteed Consistency Wotao Yin (UCLA Math) joint with: Stanley Osher, Ming Yan (UCLA) Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U) UC Riverside, STATS Department March

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014

Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014 Case Study 3: fmri Prediction Fused LASSO LARS Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, 2014 Emily Fox 2014 1 LASSO Regression

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

LASSO Review, Fused LASSO, Parallel LASSO Solvers

LASSO Review, Fused LASSO, Parallel LASSO Solvers Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Prediction & Feature Selection in GLM

Prediction & Feature Selection in GLM Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

The lasso: some novel algorithms and applications

The lasso: some novel algorithms and applications 1 The lasso: some novel algorithms and applications Robert Tibshirani Stanford University ASA Bay Area chapter meeting Collaborations with Trevor Hastie, Jerome Friedman, Ryan Tibshirani, Daniela Witten,

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Regularization Paths. Theme

Regularization Paths. Theme June 00 Trevor Hastie, Stanford Statistics June 00 Trevor Hastie, Stanford Statistics Theme Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Mee-Young Park,

More information

Introduction to the genlasso package

Introduction to the genlasso package Introduction to the genlasso package Taylor B. Arnold, Ryan Tibshirani Abstract We present a short tutorial and introduction to using the R package genlasso, which is used for computing the solution path

More information

Recent Developments in Post-Selection Inference

Recent Developments in Post-Selection Inference Recent Developments in Post-Selection Inference Yotam Hechtlinger Department of Statistics yhechtli@andrew.cmu.edu Shashank Singh Department of Statistics Machine Learning Department sss1@andrew.cmu.edu

More information

Saharon Rosset 1 and Ji Zhu 2

Saharon Rosset 1 and Ji Zhu 2 Aust. N. Z. J. Stat. 46(3), 2004, 505 510 CORRECTED PROOF OF THE RESULT OF A PREDICTION ERROR PROPERTY OF THE LASSO ESTIMATOR AND ITS GENERALIZATION BY HUANG (2003) Saharon Rosset 1 and Ji Zhu 2 IBM T.J.

More information

Sampling Distributions

Sampling Distributions Merlise Clyde Duke University September 3, 2015 Outline Topics Normal Theory Chi-squared Distributions Student t Distributions Readings: Christensen Apendix C, Chapter 1-2 Prostate Example > library(lasso2);

More information

MAT2342 : Introduction to Applied Linear Algebra Mike Newman, fall Projections. introduction

MAT2342 : Introduction to Applied Linear Algebra Mike Newman, fall Projections. introduction MAT4 : Introduction to Applied Linear Algebra Mike Newman fall 7 9. Projections introduction One reason to consider projections is to understand approximate solutions to linear systems. A common example

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Sampling Distributions

Sampling Distributions Merlise Clyde Duke University September 8, 2016 Outline Topics Normal Theory Chi-squared Distributions Student t Distributions Readings: Christensen Apendix C, Chapter 1-2 Prostate Example > library(lasso2);

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

arxiv: v5 [stat.me] 11 Oct 2015

arxiv: v5 [stat.me] 11 Oct 2015 Exact Post-Selection Inference for Sequential Regression Procedures Ryan J. Tibshirani 1 Jonathan Taylor 2 Richard Lochart 3 Robert Tibshirani 2 1 Carnegie Mellon University, 2 Stanford University, 3 Simon

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

IV. Matrix Approximation using Least-Squares

IV. Matrix Approximation using Least-Squares IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that

More information

Confidence Intervals for Low-dimensional Parameters with High-dimensional Data

Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Conjugate direction boosting

Conjugate direction boosting Conjugate direction boosting June 21, 2005 Revised Version Abstract Boosting in the context of linear regression become more attractive with the invention of least angle regression (LARS), where the connection

More information

arxiv: v2 [math.st] 12 Feb 2008

arxiv: v2 [math.st] 12 Feb 2008 arxiv:080.460v2 [math.st] 2 Feb 2008 Electronic Journal of Statistics Vol. 2 2008 90 02 ISSN: 935-7524 DOI: 0.24/08-EJS77 Sup-norm convergence rate and sign concentration property of Lasso and Dantzig

More information

Lecture 5: Soft-Thresholding and Lasso

Lecture 5: Soft-Thresholding and Lasso High Dimensional Data and Statistical Learning Lecture 5: Soft-Thresholding and Lasso Weixing Song Department of Statistics Kansas State University Weixing Song STAT 905 October 23, 2014 1/54 Outline Penalized

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

On Model Selection Consistency of Lasso

On Model Selection Consistency of Lasso On Model Selection Consistency of Lasso Peng Zhao Department of Statistics University of Berkeley 367 Evans Hall Berkeley, CA 94720-3860, USA Bin Yu Department of Statistics University of Berkeley 367

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

The lasso: some novel algorithms and applications

The lasso: some novel algorithms and applications 1 The lasso: some novel algorithms and applications Newton Institute, June 25, 2008 Robert Tibshirani Stanford University Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Gen Nowak,

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Karush-Kuhn-Tucker Conditions. Lecturer: Ryan Tibshirani Convex Optimization /36-725

Karush-Kuhn-Tucker Conditions. Lecturer: Ryan Tibshirani Convex Optimization /36-725 Karush-Kuhn-Tucker Conditions Lecturer: Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given a minimization problem Last time: duality min x subject to f(x) h i (x) 0, i = 1,... m l j (x) = 0, j =

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Lawrence D. Brown* and Daniel McCarthy*

Lawrence D. Brown* and Daniel McCarthy* Comments on the paper, An adaptive resampling test for detecting the presence of significant predictors by I. W. McKeague and M. Qian Lawrence D. Brown* and Daniel McCarthy* ABSTRACT: This commentary deals

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA

LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA The Annals of Statistics 2009, Vol. 37, No. 1, 246 270 DOI: 10.1214/07-AOS582 Institute of Mathematical Statistics, 2009 LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA BY NICOLAI

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Sparse inverse covariance estimation with the lasso

Sparse inverse covariance estimation with the lasso Sparse inverse covariance estimation with the lasso Jerome Friedman Trevor Hastie and Robert Tibshirani November 8, 2007 Abstract We consider the problem of estimating sparse graphs by a lasso penalty

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Linear Algebra and Robot Modeling

Linear Algebra and Robot Modeling Linear Algebra and Robot Modeling Nathan Ratliff Abstract Linear algebra is fundamental to robot modeling, control, and optimization. This document reviews some of the basic kinematic equations and uses

More information

Strong rules for discarding predictors in lasso-type problems

Strong rules for discarding predictors in lasso-type problems J. R. Statist. Soc. B (0) 74, Part, pp. Strong rules for discarding predictors in lasso-type problems Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor and Ryan

More information

Lasso Screening Rules via Dual Polytope Projection

Lasso Screening Rules via Dual Polytope Projection Journal of Machine Learning Research 6 (5) 63- Submitted /4; Revised 9/4; Published 5/5 Lasso Screening Rules via Dual Polytope Projection Jie Wang Department of Computational Medicine and Bioinformatics

More information