Post-selection Inference for Forward Stepwise and Least Angle Regression

Post-selection Inference for Forward Stepwise and Least Angle Regression
Ryan Tibshirani (Carnegie Mellon University) and Rob Tibshirani (Stanford University)
Joint work with Jonathan Taylor and Richard Lockhart
September 2014

[Slides 2-4: a running joke showing face-matching results from picadilo.com for photos of Ryan Tibshirani (CMU; PhD student of Taylor, 2011) and Rob Tibshirani (Stanford), with reported match scores of 81%, 71%, and 69%.]

Conclusion: confidence, that is, the strength of evidence, matters!

Outline
- Setup and basic question
- Quick review of least angle regression and the covariance test
- A new framework for inference after selection
- Application to forward stepwise and least angle regression
- Application of these and related ideas to other problems

Setup and basic question

Given an outcome vector y ∈ R^n and a predictor matrix X ∈ R^{n×p}, we consider the usual linear regression setup:

    y = Xβ* + σε,

where β* ∈ R^p is the vector of unknown coefficients to be estimated, and the components of the noise vector ε ∈ R^n are i.i.d. N(0, 1).

Main question: if we apply least angle regression or forward stepwise regression, how can we compute valid p-values and confidence intervals?

Forward stepwise regression

This procedure enters predictors one at a time, choosing at each stage the predictor that most decreases the residual sum of squares. Defining RSS to be the residual sum of squares for the model containing k predictors, and RSS_null to be the residual sum of squares before the kth predictor was added, we can form the usual statistic

    R_k = (RSS_null − RSS) / σ²

(with σ assumed known), and compare it to a χ²_1 distribution.

Simulated example: naive forward stepwise

Setup: n = 100, p = 10, true model null. [Figure: quantile-quantile plot of the first-step test statistic against the χ²_1 distribution.]

The test is too liberal: for nominal size 5%, the actual type I error is 39%. (Yes, Larry, one can get proper p-values by sample splitting: but that is messy, with a loss of power.)
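A minimal R sketch of this kind of null simulation (my own illustration, not the authors' code): it repeats the first forward-stepwise step under the global null and checks how often the naive χ²_1 test at level 5% rejects. The 39% figure above is from the slide; n, p, and the number of replications below are just illustrative.

```r
set.seed(1)
n <- 100; p <- 10; sigma <- 1; nrep <- 2000
X <- matrix(rnorm(n * p), n, p)
X <- scale(X, center = TRUE, scale = FALSE)        # center the columns
reject <- logical(nrep)
for (r in seq_len(nrep)) {
  y <- rnorm(n, sd = sigma)                        # global null: no signal
  y <- y - mean(y)
  rss_null <- sum(y^2)
  # first forward-stepwise step: pick the variable giving the largest drop in RSS
  rss_one <- apply(X, 2, function(x) sum(lm.fit(cbind(x), y)$residuals^2))
  R1 <- (rss_null - min(rss_one)) / sigma^2        # naive chi-squared statistic
  reject[r] <- R1 > qchisq(0.95, df = 1)
}
mean(reject)   # empirical type I error, well above the nominal 5%
```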

Quick review of LAR and the covariance test

Least angle regression (LAR) is a method for constructing the path of solutions for the lasso:

    min_{β_0, β}  Σ_i ( y_i − β_0 − Σ_j x_ij β_j )² + λ Σ_j |β_j|

LAR is a more democratic version of forward stepwise regression:
- Find the predictor most correlated with the outcome
- Move the parameter vector in the least squares direction until some other predictor has as much correlation with the current residual
- This new predictor is added to the active set, and the procedure is repeated
- Optional ("lasso mode"): if a nonzero coefficient hits zero, that predictor is dropped from the active set, and the process is restarted

Least angle regression in a picture

[Figure: LAR coefficient profiles plotted against log(λ), with the knots λ_1 > λ_2 > λ_3 > λ_4 > λ_5 marking the steps at which variables enter the active set.]
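As a quick illustration (not part of the slides), such a path and its knots can be computed with the lars package; the sketch below assumes lars is installed and that the fitted object's lambda component holds the knot values, as described in the package documentation.

```r
# install.packages("lars")   # if needed
library(lars)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
fit <- lars(X, y, type = "lar")   # least angle regression path
fit$lambda                        # knots lambda_1 >= lambda_2 >= ... where variables enter
plot(fit)                         # coefficient profiles, as in the picture above
```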

The covariance test for LAR

(Lockhart, Taylor, Ryan Tibshirani, Rob Tibshirani; discussion paper in the Annals of Statistics, 2014)

The covariance test provides a p-value for each variable as it is added to the lasso model via the LAR algorithm. In particular, it tests the hypothesis

    H_0: A ⊇ supp(β*),

where A is the running active set at the current step of LAR.

The covariance test for LAR

Suppose we want a p-value for predictor 2, entering at step 3.

[Figure: LAR coefficient path, with the knot at which predictor 2 enters highlighted.]

Compute the covariance at λ_4: ⟨y, Xβ̂(λ_4)⟩.

[Figure: LAR coefficient path, evaluated at the knot λ_4.]

Drop x_2, yielding active set A; refit at λ_4, and compute the resulting covariance at λ_4, giving

    T_3 = ( ⟨y, Xβ̂(λ_4)⟩ − ⟨y, X_A β̂_A(λ_4)⟩ ) / σ²

[Figure: LAR coefficient path for the refit without x_2.]

Null distribution of the covariance statistic

Under the null hypothesis that all signal variables are in the model, H_0: A ⊇ supp(β*), the covariance statistic satisfies

    T_j = (1/σ²) ( ⟨y, Xβ̂(λ_{j+1})⟩ − ⟨y, X_A β̂_A(λ_{j+1})⟩ )  →  Exp(1)   as n, p → ∞

Equivalent "knot form":

    T_j = (c_j / σ²) λ_j (λ_j − λ_{j+1}),

with c_j = 1 in the global null case (j = 1).

Simulated example: covariance test

Setup: n = 100, p = 10, true model null. [Figure: quantile-quantile plot of the covariance statistic against the Exp(1) distribution.]
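A rough R sketch of this simulation (again mine, not the authors' code), using the knot form of the statistic with σ = 1. It assumes the lars fit's lambda component returns the LAR knots on the standardized scale (lars normalizes columns to unit norm by default), so the Exp(1) approximation is only meant to hold roughly here.

```r
library(lars)
set.seed(2)
n <- 100; p <- 10; nrep <- 1000
X <- matrix(rnorm(n * p), n, p)
T1 <- numeric(nrep)
for (r in seq_len(nrep)) {
  y <- rnorm(n)                         # global null, sigma = 1
  fit <- lars(X, y, type = "lar")
  lam <- fit$lambda                     # assumed: LAR knots on the standardized scale
  T1[r] <- lam[1] * (lam[1] - lam[2])   # knot form, c_1 = 1, sigma^2 = 1
}
qqplot(qexp(ppoints(nrep)), T1,
       xlab = "Exp(1) quantiles", ylab = "Covariance statistic T_1")
abline(0, 1)
```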

Example: prostate cancer data

Data from a study of the level of prostate-specific antigen and p = 8 clinical measures, in n = 67 men about to receive a radical prostatectomy.

                Stepwise, naive    LAR, covariance test
    lcavol          0.000              0.000
    lweight         0.000              0.052
    svi             0.041              0.174
    lbph            0.045              0.929
    pgg45           0.226              0.353
    age             0.191              0.650
    lcp             0.065              0.051
    gleason         0.883              0.978

Shortcomings of the covariance test

The covariance test is highly intuitive and actually pretty broad: it can be extended to other sparse estimation problems, e.g., graphical models and clustering (G'Sell et al. 2014). But it has some definite weaknesses:

1. Places correlation restrictions on the predictors X (somewhat similar to standard conditions for exact model recovery)
2. Assumes linearity of the underlying model (i.e., y = Xβ + ε)
3. Significance statements are asymptotic
4. Hard to get confidence statements out of the covariance test statistic (not easily "pivotable")
5. Doesn't have a cool enough name

We'll discuss a new framework that overcomes 1-4, and especially 5.

What's coming next: a roadmap

1. General framework for inference after polyhedral selection
2. Application to forward stepwise regression and LAR
3. For LAR, we obtain the spacing test, which is exact in finite samples

Consider the global null, σ² = 1, and let λ_1, λ_2 be the first two LAR knots.

    Covariance test:   λ_1 (λ_1 − λ_2)  →  Exp(1)   as n, p → ∞

    Spacing test:      (1 − Φ(λ_1)) / (1 − Φ(λ_2))  ~  Unif(0, 1)   for any n, p

(Intriguing connection: these are asymptotically equivalent.)
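To make the contrast concrete, here is a tiny R sketch (mine, not the authors') computing both p-values from the first two knots; it assumes λ_1, λ_2 are LAR knots on the standardized scale with σ = 1, e.g. taken from a lars fit's lambda component as in the earlier sketch.

```r
# lam1, lam2: first two LAR knots under the global null, sigma = 1
covtest_pvalue <- function(lam1, lam2) exp(-lam1 * (lam1 - lam2))          # P(Exp(1) > T_1)
spacing_pvalue <- function(lam1, lam2) (1 - pnorm(lam1)) / (1 - pnorm(lam2))
```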

Preview: prostate cancer data

[Figure: naive versus selection-adjusted intervals for the coefficients of lcavol, lweight, svi, lbph, pgg45, lcp, age, and gleason.]

A new framework for inference after selection

The polyhedral testing framework

Suppose we observe y ~ N(θ, Σ), with mean parameter θ unknown (but covariance Σ known).

We wish to make inferences on ν^T θ, a linear contrast of the mean θ, conditional on y ∈ S, where S is the polyhedron

    S = { y : Γy ≥ u }

The vector ν = ν(S) is allowed to depend on S.

E.g., we'd like a p-value P(y, S, ν) that satisfies

    P_{ν^T θ = 0} ( P(y, S, ν) ≤ α | Γy ≥ u ) = α   for any 0 ≤ α ≤ 1

Fundamental result

Lemma (Polyhedral selection as truncation). For any ν ≠ 0, we have

    { Γy ≥ u } = { V^−(y) ≤ ν^T y ≤ V^+(y),  V^0(y) ≥ 0 },

where, with γ = ΓΣν / (ν^T Σν),

    V^−(y) = max_{j: γ_j > 0} [ u_j − (Γy)_j + γ_j ν^T y ] / γ_j
    V^+(y) = min_{j: γ_j < 0} [ u_j − (Γy)_j + γ_j ν^T y ] / γ_j
    V^0(y) = min_{j: γ_j = 0} [ (Γy)_j − u_j ]

Moreover, the triplet (V^−, V^+, V^0)(y) is independent of ν^T y.
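A direct, minimal R transcription of these formulas (my sketch, under the {y : Γy ≥ u} convention above; the function and return-value names are made up):

```r
# Truncation limits from the polyhedral lemma, for the selection event {y : Gamma %*% y >= u}
truncation_limits <- function(y, Gamma, u, Sigma, nu) {
  Gy  <- as.vector(Gamma %*% y)
  vSv <- as.numeric(t(nu) %*% Sigma %*% nu)
  gam <- as.vector(Gamma %*% Sigma %*% nu) / vSv          # gamma in the lemma
  nty <- as.numeric(crossprod(nu, y))                     # nu^T y
  r   <- (u - Gy + gam * nty) / gam                       # only used where gam != 0
  list(Vminus = if (any(gam > 0)) max(r[gam > 0]) else -Inf,
       Vplus  = if (any(gam < 0)) min(r[gam < 0]) else  Inf,
       V0     = if (any(gam == 0)) min(Gy[gam == 0] - u[gam == 0]) else Inf,
       vty    = nty,                                      # observed nu^T y
       sd     = sqrt(vSv))                                # sd of nu^T y before truncation
}
```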

Proof

[Figure: geometric picture of the lemma: y is decomposed into its component ν^T y along ν and its projection P_ν y; within the polyhedron {Γy ≥ u}, moving along ν keeps y in the set exactly when V^− ≤ ν^T y ≤ V^+.]

Using this result for conditional inference

In other words, the distribution of ν^T y | {Γy ≥ u} is the same as that of

    ν^T y | { V^− ≤ ν^T y ≤ V^+,  V^0 ≥ 0 }

This is a truncated Gaussian distribution (with random limits). How do we use this result for conditional inference on ν^T θ? Follow two main ideas:

1. For Z ~ N_{[a,b]}(μ, σ²), with F^{[a,b]}_{μ,σ²} its CDF, we have

       P ( F^{[a,b]}_{μ,σ²}(Z) ≤ α ) = α

2. Hence also

       P ( F^{[V^−, V^+]}_{ν^T θ, ν^T Σν}(ν^T y) ≤ α | Γy ≥ u ) = α

Therefore, to test H_0: ν^T θ = 0 against H_1: ν^T θ > 0, we can take as a conditional p-value

    P = 1 − F^{[V^−, V^+]}_{0, ν^T Σν}(ν^T y),

since

    P_{ν^T θ = 0} ( 1 − F^{[V^−, V^+]}_{0, ν^T Σν}(ν^T y) ≤ α | Γy ≥ u ) = α

Furthermore, because the same statement holds for any fixed μ,

    P_{ν^T θ = μ} ( 1 − F^{[V^−, V^+]}_{μ, ν^T Σν}(ν^T y) ≤ α | Γy ≥ u ) = α,

we can compute a conditional confidence interval by inverting the pivot, yielding

    P ( ν^T θ ∈ [δ_{α/2}, δ_{1−α/2}] | Γy ≥ u ) = 1 − α
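A small R sketch (mine, not the authors' software) of this pivot and its inversion; the function names are made up, the root-finding bracket is a crude heuristic, and no care is taken with numerical stability when the truncation interval is extreme.

```r
# Truncated-Gaussian p-value: nu^T y | selection ~ N(mu, s^2) truncated to [Vminus, Vplus]
tg_pvalue <- function(x, Vminus, Vplus, s, mu = 0) {
  (pnorm((Vplus - mu) / s) - pnorm((x - mu) / s)) /
    (pnorm((Vplus - mu) / s) - pnorm((Vminus - mu) / s))
}

# Conditional confidence interval [delta_{alpha/2}, delta_{1-alpha/2}], by inverting the
# pivot in mu (the p-value above is increasing in mu)
tg_interval <- function(x, Vminus, Vplus, s, alpha = 0.05, width = 20 * s) {
  lo <- uniroot(function(mu) tg_pvalue(x, Vminus, Vplus, s, mu) - alpha / 2,
                interval = c(x - width, x + width))$root
  hi <- uniroot(function(mu) tg_pvalue(x, Vminus, Vplus, s, mu) - (1 - alpha / 2),
                interval = c(x - width, x + width))$root
  c(lower = lo, upper = hi)
}
```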

Illustration

[Figure: a truncated normal density on [V^−, V^+] (limits in black), with the observed value ν^T y marked in red.]

We observe random limits V^−, V^+ (in black), and a random point ν^T y (in red); we form the truncated normal density f^{[V^−, V^+]}_{0, ν^T Σν}, and the mass to the right of ν^T y is our p-value.

Application to forward stepwise and least angle regression

Sequential model selection as a polyhedral set

Suppose we run k steps of forward stepwise or LAR, and encounter the sequence of active sets A_l, l = 1, ..., k. We can express

    { y : ( Â_l(y), ŝ_{A_l}(y) ) = (A_l, s_{A_l}),  l = 1, ..., k } = { y : Γy ≥ 0 }

for some matrix Γ. Here s_{A_l} gives the signs of the active coefficients at step l. This describes all vectors y for which the algorithm would make the same selections (variables and signs) over k steps.

Important points (a sketch of one block of Γ follows below):
- For forward stepwise, Γ has 2pk rows
- For LAR, Γ has only k + 1 rows!
- Computation of V^−, V^+ is O(number of rows of Γ)
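Purely as an illustration of what one block of Γ looks like (my sketch, not the paper's construction), here is the polyhedron for the first forward-stepwise step, assuming the columns of X are centered and scaled to unit norm; it has 2(p − 1) rows, matching the roughly 2p rows per step noted above.

```r
# Selection event for the first forward-stepwise step: variable j1 with sign s1 is chosen
# iff  s1 * x_{j1}^T y >= +/- x_j^T y  for every other j, i.e.  Gamma %*% y >= 0.
fs_step1_polyhedron <- function(X, y) {
  cors <- drop(crossprod(X, y))
  j1 <- which.max(abs(cors))
  s1 <- sign(cors[j1])
  others <- setdiff(seq_len(ncol(X)), j1)
  Gamma <- rbind(t(s1 * X[, j1] - X[, others, drop = FALSE]),
                 t(s1 * X[, j1] + X[, others, drop = FALSE]))
  list(Gamma = Gamma, u = rep(0, nrow(Gamma)), j1 = j1, s1 = s1)
}
```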

Inference for forward stepwise and LAR

Using our polyhedral framework, we can derive conditional p-values or confidence intervals for any linear contrast ν^T θ. Notes:

- Here ν can depend on the polyhedron, i.e., on the selections A_l, s_{A_l}, l = 1, ..., k
- We place no restrictions on the predictors X (general position)
- Interesting case to keep in mind: ν = X_{A_k} (X_{A_k}^T X_{A_k})^{-1} e_k, so that

      ν^T θ = e_k^T (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T θ,

  which is the kth coefficient in the multiple regression of θ on the active variables A_k

Our conditional p-value for H_0: ν^T θ = 0 is

    P = [ Φ(V^+ / (σ‖ν‖_2)) − Φ(ν^T y / (σ‖ν‖_2)) ] / [ Φ(V^+ / (σ‖ν‖_2)) − Φ(V^− / (σ‖ν‖_2)) ]

This has exact (!) conditional size:

    P_{ν^T θ = 0} ( P ≤ α | ( Â_l(y), ŝ_{A_l}(y) ) = (A_l, s_{A_l}), l = 1, ..., k ) = α

When ν = X_{A_k} (X_{A_k}^T X_{A_k})^{-1} e_k: this tests the significance of the projected regression coefficient of the last variable entered into the model, conditional on all selections that have been made so far.
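Tying the earlier sketches together, one could compute this p-value for the coefficient of the first variable entered by forward stepwise (again a rough sketch with my own helper names, assuming σ is known, Σ = σ²I, and unit-norm columns of X; the sign adjustment to ν is my choice for a sensible one-sided test).

```r
set.seed(3)
n <- 100; p <- 10; sigma <- 1
X <- scale(matrix(rnorm(n * p), n, p))              # centered columns
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")           # rescale columns to unit L2 norm
y <- rnorm(n, sd = sigma)

sel <- fs_step1_polyhedron(X, y)                    # polyhedron sketch from above
# For k = 1 and unit-norm x_{j1}, X_A (X_A^T X_A)^{-1} e_1 = x_{j1}; adjust by the selected sign
nu  <- sel$s1 * X[, sel$j1]
lim <- truncation_limits(y, sel$Gamma, sel$u, sigma^2 * diag(n), nu)
tg_pvalue(lim$vty, lim$Vminus, lim$Vplus, s = lim$sd)   # selection-adjusted p-value
```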

Our conditional confidence interval is I = [δ_{α/2}, δ_{1−α/2}], where the endpoints are computed by inverting the truncated Gaussian pivot. This has exact (!) conditional coverage:

    P ( ν^T θ ∈ [δ_{α/2}, δ_{1−α/2}] | ( Â_l(y), ŝ_{A_l}(y) ) = (A_l, s_{A_l}), l = 1, ..., k ) = 1 − α

When ν = X_{A_k} (X_{A_k}^T X_{A_k})^{-1} e_k: this interval traps the projected regression coefficient of the last variable to enter with probability 1 − α, conditional on all selections made so far.

The spacing test

The spacing test is the name we give to our framework applied to LAR. (Tentative paper title, 2014: "A Spacing Odyssey".) Because Γ is especially simple for LAR, this test is easy and efficient to implement.

The p-value for H_0: e_k^T (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T θ = 0 is

    P = [ Φ(V^+ / (σ ω_k)) − Φ(λ_k / (σ ω_k)) ] / [ Φ(V^+ / (σ ω_k)) − Φ(V^− / (σ ω_k)) ],

where λ_k is the kth LAR knot, and

    ω_k = ‖ X_{A_k} (X_{A_k}^T X_{A_k})^{-1} s_{A_k} − X_{A_{k−1}} (X_{A_{k−1}}^T X_{A_{k−1}})^{-1} s_{A_{k−1}} ‖_2

Example: selection adjustment under the global null

[Figure: quantile-quantile plots of p-values against Uniform(0,1) expected quantiles, at steps 1 through 4, for three procedures: the max-t test, selection-adjusted (exact) forward stepwise, and the spacing test.]

Example: back to prostate data

                Stepwise, naive    Stepwise, adjusted
    lcavol          0.000              0.000
    lweight         0.000              0.006
    svi             0.047              0.425
    lbph            0.047              0.168
    pgg45           0.234              0.577
    lcp             0.083              0.273
    age             0.137              0.059
    gleason         0.883              0.844

                LAR, covariance test    LAR, spacing test
    lcavol          0.000                   0.000
    lweight         0.044                   0.050
    svi             0.165                   0.134
    lbph            0.929                   0.917
    pgg45           0.346                   0.016
    age             0.648                   0.581
    lcp             0.043                   0.058
    gleason         0.978                   0.858

Selection intervals

Interestingly, our conditional intervals also hold unconditionally. Consider ν = X_{A_k} (X_{A_k}^T X_{A_k})^{-1} e_k for concreteness; by averaging over all possible selections (A_l, s_{A_l}), l = 1, ..., k, we obtain

    P ( e_k^T (X_{Â_k}^T X_{Â_k})^{-1} X_{Â_k}^T θ ∈ [δ_{α/2}, δ_{1−α/2}] ) = 1 − α

We call this a selection interval; in contrast to a typical confidence interval, it tracks a moving target, here the (random) projected regression coefficient of the kth variable to enter. Roughly speaking, we can think of this interval as trapping the projected coefficient of the kth most important variable, as deemed by the algorithm, with probability 1 − α.

Example: selection intervals for LAR

[Figure: selection intervals across 100 realizations for the first and second predictors entered, and selection intervals for the first predictor entered at each LAR step.]

Application of these and related ideas to other problems

- Beyond polyhedra (Taylor, Loftus, Ryan Tibshirani 2014)
- Graphical models and clustering (G'Sell, Taylor, Tibshirani 2014)
- Many normal means (Reid, Taylor, Tibshirani 2014)
- Lasso with fixed λ (Lee, Sun, Sun, Taylor 2014)
- Marginal screening (Sun, Taylor 2014)
- PCA (Choi, Taylor, Tibshirani, in preparation)

PCA example

[Figure: scree plots of singular values for a matrix of true rank 2, at SNR = 1.5 and SNR = 0.23.]

(p-values for the right panel: 0.030, 0.064, 0.222, 0.286, 0.197, 0.831, 0.510, 0.185, 0.126)

Generalization to principal component analysis

Model Y = B + ε, and we want to test rank(B) ≤ k. The test is based on the singular values of the Wishart matrix Y^T Y. The largest singular value has a Tracy-Widom distribution asymptotically (Johnstone 2001), and this can be used to construct a test of the global null rank(B) = 0; the test can be applied sequentially for other ranks (Kritchman and Nadler, 2008). We derive the conditional distribution of each singular value λ_k, conditional on λ_{k+1}, ..., λ_p, and use this to obtain an exact test.

[Figure: quantile-quantile plots of p-values against Uniform(0,1) for true ranks 0, 1, and 2, at steps 1 through 4, comparing the exact conditional method ("Jon's method"), an integrated-out version, and Nadler's sequential test.]

Final comments

- Data analysts need tools for inference after selection, and these are now becoming available
- But much work is left to be done (power, robustness, computation, etc.)
- R language software is on its way!

Questions

Larry: How do you estimate σ²? Estimating σ² is harder than estimating the signal!
Ale: What's the Betti number of the Rips complex of the selection set under Hausdorff paracompactness?
Jay: Is there a Bayesian analog?
Jim Ramsey: Assuming normality is bogus. Nature just gives you a bunch of numbers.
Ryan: Why can't we just apply trend filtering?

Questions

Larry: You are doing inference for ν^T θ, but your ν is random. How do you do inference for E(ν)^T θ?
Ale: How does this work for log-linear models?
Jing: How does this work for sparse PCA?
Max: How dependent are the p-values? Could this work with ForwardStop?