Post-selection Inference for Forward Stepwise and Least Angle Regression

Post-selection Inference for Forward Stepwise and Least Angle Regression
Ryan Tibshirani (Carnegie Mellon University) and Rob Tibshirani (Stanford University)
Joint work with Jonathan Taylor and Richard Lockhart
September 2014

[Slides 2-4: a running joke showing face-matching results from picadilo.com for photos of Ryan Tibshirani (CMU; PhD student of Taylor, 2011) and Rob Tibshirani (Stanford), with reported match scores of 81%, 71%, and 69%.]

Conclusion: confidence, that is, the strength of evidence, matters!

Outline
- Setup and basic question
- Quick review of least angle regression and the covariance test
- A new framework for inference after selection
- Application to forward stepwise and least angle regression
- Application of these and related ideas to other problems

Setup and basic question

Given an outcome vector y ∈ R^n and a predictor matrix X ∈ R^{n×p}, we consider the usual linear regression setup:

    y = Xβ* + σε,

where β* ∈ R^p is the vector of unknown coefficients to be estimated, and the components of the noise vector ε ∈ R^n are i.i.d. N(0, 1).

Main question: if we apply least angle regression or forward stepwise regression, how can we compute valid p-values and confidence intervals?

Forward stepwise regression

This procedure enters predictors one at a time, choosing at each stage the predictor that most decreases the residual sum of squares. Defining RSS to be the residual sum of squares for the model containing k predictors, and RSS_null to be the residual sum of squares before the kth predictor was added, we can form the usual statistic

    R_k = (RSS_null − RSS) / σ²

(with σ assumed known), and compare it to a χ²_1 distribution.

Simulated example: naive forward stepwise

Setup: n = 100, p = 10, true model null. [Figure: quantile-quantile plot of the first-step test statistic against the χ²_1 distribution.]

The test is too liberal: for nominal size 5%, the actual type I error is 39%. (Yes, Larry, one can get proper p-values by sample splitting: but that is messy, with a loss of power.)
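A minimal R sketch of this kind of null simulation (my own illustration, not the authors' code): it repeats the first forward-stepwise step under the global null and checks how often the naive χ²_1 test at level 5% rejects. The 39% figure above is from the slide; n, p, and the number of replications below are just illustrative.

```r
set.seed(1)
n <- 100; p <- 10; sigma <- 1; nrep <- 2000
X <- matrix(rnorm(n * p), n, p)
X <- scale(X, center = TRUE, scale = FALSE)        # center the columns
reject <- logical(nrep)
for (r in seq_len(nrep)) {
  y <- rnorm(n, sd = sigma)                        # global null: no signal
  y <- y - mean(y)
  rss_null <- sum(y^2)
  # first forward-stepwise step: pick the variable giving the largest drop in RSS
  rss_one <- apply(X, 2, function(x) sum(lm.fit(cbind(x), y)$residuals^2))
  R1 <- (rss_null - min(rss_one)) / sigma^2        # naive chi-squared statistic
  reject[r] <- R1 > qchisq(0.95, df = 1)
}
mean(reject)   # empirical type I error, well above the nominal 5%
```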

Quick review of LAR and the covariance test

Least angle regression (LAR) is a method for constructing the path of solutions for the lasso:

    min_{β_0, β}  Σ_i ( y_i − β_0 − Σ_j x_ij β_j )² + λ Σ_j |β_j|

LAR is a more democratic version of forward stepwise regression:
- Find the predictor most correlated with the outcome
- Move the parameter vector in the least squares direction until some other predictor has as much correlation with the current residual
- This new predictor is added to the active set, and the procedure is repeated
- Optional ("lasso mode"): if a nonzero coefficient hits zero, that predictor is dropped from the active set, and the process is restarted

Least angle regression in a picture

[Figure: LAR coefficient profiles plotted against log(λ), with the knots λ_1 > λ_2 > λ_3 > λ_4 > λ_5 marking the steps at which variables enter the active set.]
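As a quick illustration (not part of the slides), such a path and its knots can be computed with the lars package; the sketch below assumes lars is installed and that the fitted object's lambda component holds the knot values, as described in the package documentation.

```r
# install.packages("lars")   # if needed
library(lars)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
fit <- lars(X, y, type = "lar")   # least angle regression path
fit$lambda                        # knots lambda_1 >= lambda_2 >= ... where variables enter
plot(fit)                         # coefficient profiles, as in the picture above
```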

The covariance test for LAR

(Lockhart, Taylor, Ryan Tibshirani, Rob Tibshirani; discussion paper in the Annals of Statistics, 2014)

The covariance test provides a p-value for each variable as it is added to the lasso model via the LAR algorithm. In particular, it tests the hypothesis

    H_0: A ⊇ supp(β*),

where A is the running active set at the current step of LAR.

The covariance test for LAR

Suppose we want a p-value for predictor 2, entering at step 3.

[Figure: LAR coefficient path, with the knot at which predictor 2 enters highlighted.]

Compute the covariance at λ_4: ⟨y, Xβ̂(λ_4)⟩.

[Figure: LAR coefficient path, evaluated at the knot λ_4.]

Drop x_2, yielding active set A; refit at λ_4, and compute the resulting covariance at λ_4, giving

    T_3 = ( ⟨y, Xβ̂(λ_4)⟩ − ⟨y, X_A β̂_A(λ_4)⟩ ) / σ²

[Figure: LAR coefficient path for the refit without x_2.]

Null distribution of the covariance statistic

Under the null hypothesis that all signal variables are in the model, H_0: A ⊇ supp(β*), the covariance statistic satisfies

    T_j = (1/σ²) ( ⟨y, Xβ̂(λ_{j+1})⟩ − ⟨y, X_A β̂_A(λ_{j+1})⟩ )  →  Exp(1)   as n, p → ∞

Equivalent "knot form":

    T_j = (c_j / σ²) λ_j (λ_j − λ_{j+1}),

with c_j = 1 in the global null case (j = 1).

Simulated example: covariance test

Setup: n = 100, p = 10, true model null. [Figure: quantile-quantile plot of the covariance statistic against the Exp(1) distribution.]
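A rough R sketch of this simulation (again mine, not the authors' code), using the knot form of the statistic with σ = 1. It assumes the lars fit's lambda component returns the LAR knots on the standardized scale (lars normalizes columns to unit norm by default), so the Exp(1) approximation is only meant to hold roughly here.

```r
library(lars)
set.seed(2)
n <- 100; p <- 10; nrep <- 1000
X <- matrix(rnorm(n * p), n, p)
T1 <- numeric(nrep)
for (r in seq_len(nrep)) {
  y <- rnorm(n)                         # global null, sigma = 1
  fit <- lars(X, y, type = "lar")
  lam <- fit$lambda                     # assumed: LAR knots on the standardized scale
  T1[r] <- lam[1] * (lam[1] - lam[2])   # knot form, c_1 = 1, sigma^2 = 1
}
qqplot(qexp(ppoints(nrep)), T1,
       xlab = "Exp(1) quantiles", ylab = "Covariance statistic T_1")
abline(0, 1)
```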

Example: prostate cancer data

Data from a study of the level of prostate-specific antigen and p = 8 clinical measures, in n = 67 men about to receive a radical prostatectomy.

                Stepwise, naive    LAR, covariance test
    lcavol          0.000              0.000
    lweight         0.000              0.052
    svi             0.041              0.174
    lbph            0.045              0.929
    pgg45           0.226              0.353
    age             0.191              0.650
    lcp             0.065              0.051
    gleason         0.883              0.978

Shortcomings of the covariance test

The covariance test is highly intuitive and actually pretty broad: it can be extended to other sparse estimation problems, e.g., graphical models and clustering (G'Sell et al. 2014). But it has some definite weaknesses:

1. Places correlation restrictions on the predictors X (somewhat similar to standard conditions for exact model recovery)
2. Assumes linearity of the underlying model (i.e., y = Xβ + ε)
3. Significance statements are asymptotic
4. Hard to get confidence statements out of the covariance test statistic (not easily "pivotable")
5. Doesn't have a cool enough name

We'll discuss a new framework that overcomes 1-4, and especially 5.

What's coming next: a roadmap

1. General framework for inference after polyhedral selection
2. Application to forward stepwise regression and LAR
3. For LAR, we obtain the spacing test, which is exact in finite samples

Consider the global null, σ² = 1, and let λ_1, λ_2 be the first two LAR knots.

    Covariance test:   λ_1 (λ_1 − λ_2)  →  Exp(1)   as n, p → ∞

    Spacing test:      (1 − Φ(λ_1)) / (1 − Φ(λ_2))  ~  Unif(0, 1)   for any n, p

(Intriguing connection: these are asymptotically equivalent.)
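To make the contrast concrete, here is a tiny R sketch (mine, not the authors') computing both p-values from the first two knots; it assumes λ_1, λ_2 are LAR knots on the standardized scale with σ = 1, e.g. taken from a lars fit's lambda component as in the earlier sketch.

```r
# lam1, lam2: first two LAR knots under the global null, sigma = 1
covtest_pvalue <- function(lam1, lam2) exp(-lam1 * (lam1 - lam2))          # P(Exp(1) > T_1)
spacing_pvalue <- function(lam1, lam2) (1 - pnorm(lam1)) / (1 - pnorm(lam2))
```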

Preview: prostate cancer data

[Figure: naive versus selection-adjusted intervals for the coefficients of lcavol, lweight, svi, lbph, pgg45, lcp, age, and gleason.]

A new framework for inference after selection

The polyhedral testing framework

Suppose we observe y ~ N(θ, Σ), with mean parameter θ unknown (but covariance Σ known).

We wish to make inferences on ν^T θ, a linear contrast of the mean θ, conditional on y ∈ S, where S is the polyhedron

    S = { y : Γy ≥ u }

The vector ν = ν(S) is allowed to depend on S.

E.g., we'd like a p-value P(y, S, ν) that satisfies

    P_{ν^T θ = 0} ( P(y, S, ν) ≤ α | Γy ≥ u ) = α   for any 0 ≤ α ≤ 1

Fundamental result

Lemma (Polyhedral selection as truncation). For any ν ≠ 0, we have

    { Γy ≥ u } = { V^−(y) ≤ ν^T y ≤ V^+(y),  V^0(y) ≥ 0 },

where, with γ = ΓΣν / (ν^T Σν),

    V^−(y) = max_{j: γ_j > 0} [ u_j − (Γy)_j + γ_j ν^T y ] / γ_j
    V^+(y) = min_{j: γ_j < 0} [ u_j − (Γy)_j + γ_j ν^T y ] / γ_j
    V^0(y) = min_{j: γ_j = 0} [ (Γy)_j − u_j ]

Moreover, the triplet (V^−, V^+, V^0)(y) is independent of ν^T y.
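A direct, minimal R transcription of these formulas (my sketch, under the {y : Γy ≥ u} convention above; the function and return-value names are made up):

```r
# Truncation limits from the polyhedral lemma, for the selection event {y : Gamma %*% y >= u}
truncation_limits <- function(y, Gamma, u, Sigma, nu) {
  Gy  <- as.vector(Gamma %*% y)
  vSv <- as.numeric(t(nu) %*% Sigma %*% nu)
  gam <- as.vector(Gamma %*% Sigma %*% nu) / vSv          # gamma in the lemma
  nty <- as.numeric(crossprod(nu, y))                     # nu^T y
  r   <- (u - Gy + gam * nty) / gam                       # only used where gam != 0
  list(Vminus = if (any(gam > 0)) max(r[gam > 0]) else -Inf,
       Vplus  = if (any(gam < 0)) min(r[gam < 0]) else  Inf,
       V0     = if (any(gam == 0)) min(Gy[gam == 0] - u[gam == 0]) else Inf,
       vty    = nty,                                      # observed nu^T y
       sd     = sqrt(vSv))                                # sd of nu^T y before truncation
}
```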

Proof

[Figure: geometric picture of the lemma: y is decomposed into its component ν^T y along ν and its projection P_ν y; within the polyhedron {Γy ≥ u}, moving along ν keeps y in the set exactly when V^− ≤ ν^T y ≤ V^+.]

Using this result for conditional inference

In other words, the distribution of ν^T y | {Γy ≥ u} is the same as that of

    ν^T y | { V^− ≤ ν^T y ≤ V^+,  V^0 ≥ 0 }

This is a truncated Gaussian distribution (with random limits). How do we use this result for conditional inference on ν^T θ? Follow two main ideas:

1. For Z ~ N_{[a,b]}(μ, σ²), with F^{[a,b]}_{μ,σ²} its CDF, we have

       P ( F^{[a,b]}_{μ,σ²}(Z) ≤ α ) = α

2. Hence also

       P ( F^{[V^−, V^+]}_{ν^T θ, ν^T Σν}(ν^T y) ≤ α | Γy ≥ u ) = α

Therefore, to test H_0: ν^T θ = 0 against H_1: ν^T θ > 0, we can take as a conditional p-value

    P = 1 − F^{[V^−, V^+]}_{0, ν^T Σν}(ν^T y),

since

    P_{ν^T θ = 0} ( 1 − F^{[V^−, V^+]}_{0, ν^T Σν}(ν^T y) ≤ α | Γy ≥ u ) = α

Furthermore, because the same statement holds for any fixed μ,

    P_{ν^T θ = μ} ( 1 − F^{[V^−, V^+]}_{μ, ν^T Σν}(ν^T y) ≤ α | Γy ≥ u ) = α,

we can compute a conditional confidence interval by inverting the pivot, yielding

    P ( ν^T θ ∈ [δ_{α/2}, δ_{1−α/2}] | Γy ≥ u ) = 1 − α
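A small R sketch (mine, not the authors' software) of this pivot and its inversion; the function names are made up, the root-finding bracket is a crude heuristic, and no care is taken with numerical stability when the truncation interval is extreme.

```r
# Truncated-Gaussian p-value: nu^T y | selection ~ N(mu, s^2) truncated to [Vminus, Vplus]
tg_pvalue <- function(x, Vminus, Vplus, s, mu = 0) {
  (pnorm((Vplus - mu) / s) - pnorm((x - mu) / s)) /
    (pnorm((Vplus - mu) / s) - pnorm((Vminus - mu) / s))
}

# Conditional confidence interval [delta_{alpha/2}, delta_{1-alpha/2}], by inverting the
# pivot in mu (the p-value above is increasing in mu)
tg_interval <- function(x, Vminus, Vplus, s, alpha = 0.05, width = 20 * s) {
  lo <- uniroot(function(mu) tg_pvalue(x, Vminus, Vplus, s, mu) - alpha / 2,
                interval = c(x - width, x + width))$root
  hi <- uniroot(function(mu) tg_pvalue(x, Vminus, Vplus, s, mu) - (1 - alpha / 2),
                interval = c(x - width, x + width))$root
  c(lower = lo, upper = hi)
}
```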

Illustration

[Figure: a truncated normal density on [V^−, V^+] (limits in black), with the observed value ν^T y marked in red.]

We observe random limits V^−, V^+ (in black), and a random point ν^T y (in red); we form the truncated normal density f^{[V^−, V^+]}_{0, ν^T Σν}, and the mass to the right of ν^T y is our p-value.

Application to forward stepwise and least angle regression

Sequential model selection as a polyhedral set

Suppose we run k steps of forward stepwise or LAR, and encounter the sequence of active sets A_l, l = 1, ..., k. We can express

    { y : ( Â_l(y), ŝ_{A_l}(y) ) = (A_l, s_{A_l}),  l = 1, ..., k } = { y : Γy ≥ 0 }

for some matrix Γ. Here s_{A_l} gives the signs of the active coefficients at step l. This describes all vectors y for which the algorithm would make the same selections (variables and signs) over k steps.

Important points (a sketch of one block of Γ follows below):
- For forward stepwise, Γ has 2pk rows
- For LAR, Γ has only k + 1 rows!
- Computation of V^−, V^+ is O(number of rows of Γ)
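Purely as an illustration of what one block of Γ looks like (my sketch, not the paper's construction), here is the polyhedron for the first forward-stepwise step, assuming the columns of X are centered and scaled to unit norm; it has 2(p − 1) rows, matching the roughly 2p rows per step noted above.

```r
# Selection event for the first forward-stepwise step: variable j1 with sign s1 is chosen
# iff  s1 * x_{j1}^T y >= +/- x_j^T y  for every other j, i.e.  Gamma %*% y >= 0.
fs_step1_polyhedron <- function(X, y) {
  cors <- drop(crossprod(X, y))
  j1 <- which.max(abs(cors))
  s1 <- sign(cors[j1])
  others <- setdiff(seq_len(ncol(X)), j1)
  Gamma <- rbind(t(s1 * X[, j1] - X[, others, drop = FALSE]),
                 t(s1 * X[, j1] + X[, others, drop = FALSE]))
  list(Gamma = Gamma, u = rep(0, nrow(Gamma)), j1 = j1, s1 = s1)
}
```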

Inference for forward stepwise and LAR

Using our polyhedral framework, we can derive conditional p-values or confidence intervals for any linear contrast ν^T θ. Notes:

- Here ν can depend on the polyhedron, i.e., on the selections A_l, s_{A_l}, l = 1, ..., k
- We place no restrictions on the predictors X (general position)
- Interesting case to keep in mind: ν = X_{A_k} (X_{A_k}^T X_{A_k})^{-1} e_k, so that

      ν^T θ = e_k^T (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T θ,

  which is the kth coefficient in the multiple regression of θ on the active variables A_k

Our conditional p-value for H_0: ν^T θ = 0 is

    P = [ Φ(V^+ / (σ‖ν‖_2)) − Φ(ν^T y / (σ‖ν‖_2)) ] / [ Φ(V^+ / (σ‖ν‖_2)) − Φ(V^− / (σ‖ν‖_2)) ]

This has exact (!) conditional size:

    P_{ν^T θ = 0} ( P ≤ α | ( Â_l(y), ŝ_{A_l}(y) ) = (A_l, s_{A_l}), l = 1, ..., k ) = α

When ν = X_{A_k} (X_{A_k}^T X_{A_k})^{-1} e_k: this tests the significance of the projected regression coefficient of the last variable entered into the model, conditional on all selections that have been made so far.
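Tying the earlier sketches together, one could compute this p-value for the coefficient of the first variable entered by forward stepwise (again a rough sketch with my own helper names, assuming σ is known, Σ = σ²I, and unit-norm columns of X; the sign adjustment to ν is my choice for a sensible one-sided test).

```r
set.seed(3)
n <- 100; p <- 10; sigma <- 1
X <- scale(matrix(rnorm(n * p), n, p))              # centered columns
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")           # rescale columns to unit L2 norm
y <- rnorm(n, sd = sigma)

sel <- fs_step1_polyhedron(X, y)                    # polyhedron sketch from above
# For k = 1 and unit-norm x_{j1}, X_A (X_A^T X_A)^{-1} e_1 = x_{j1}; adjust by the selected sign
nu  <- sel$s1 * X[, sel$j1]
lim <- truncation_limits(y, sel$Gamma, sel$u, sigma^2 * diag(n), nu)
tg_pvalue(lim$vty, lim$Vminus, lim$Vplus, s = lim$sd)   # selection-adjusted p-value
```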

Our conditional confidence interval is I = [δ_{α/2}, δ_{1−α/2}], where the endpoints are computed by inverting the truncated Gaussian pivot. This has exact (!) conditional coverage:

    P ( ν^T θ ∈ [δ_{α/2}, δ_{1−α/2}] | ( Â_l(y), ŝ_{A_l}(y) ) = (A_l, s_{A_l}), l = 1, ..., k ) = 1 − α

When ν = X_{A_k} (X_{A_k}^T X_{A_k})^{-1} e_k: this interval traps the projected regression coefficient of the last variable to enter with probability 1 − α, conditional on all selections made so far.

The spacing test

The spacing test is the name we give to our framework applied to LAR. (Tentative paper title, 2014: "A Spacing Odyssey".) Because Γ is especially simple for LAR, this test is easy and efficient to implement.

The p-value for H_0: e_k^T (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T θ = 0 is

    P = [ Φ(V^+ / (σ ω_k)) − Φ(λ_k / (σ ω_k)) ] / [ Φ(V^+ / (σ ω_k)) − Φ(V^− / (σ ω_k)) ],

where λ_k is the kth LAR knot, and

    ω_k = ‖ X_{A_k} (X_{A_k}^T X_{A_k})^{-1} s_{A_k} − X_{A_{k−1}} (X_{A_{k−1}}^T X_{A_{k−1}})^{-1} s_{A_{k−1}} ‖_2

Example: selection adjustment under the global null

[Figure: quantile-quantile plots of p-values against Uniform(0,1) expected quantiles, at steps 1 through 4, for three procedures: the max-t test, selection-adjusted (exact) forward stepwise, and the spacing test.]

Example: back to prostate data

                Stepwise, naive    Stepwise, adjusted
    lcavol          0.000              0.000
    lweight         0.000              0.006
    svi             0.047              0.425
    lbph            0.047              0.168
    pgg45           0.234              0.577
    lcp             0.083              0.273
    age             0.137              0.059
    gleason         0.883              0.844

                LAR, covariance test    LAR, spacing test
    lcavol          0.000                   0.000
    lweight         0.044                   0.050
    svi             0.165                   0.134
    lbph            0.929                   0.917
    pgg45           0.346                   0.016
    age             0.648                   0.581
    lcp             0.043                   0.058
    gleason         0.978                   0.858

Selection intervals

Interestingly, our conditional intervals also hold unconditionally. Consider ν = X_{A_k} (X_{A_k}^T X_{A_k})^{-1} e_k for concreteness; by averaging over all possible selections (A_l, s_{A_l}), l = 1, ..., k, we obtain

    P ( e_k^T (X_{Â_k}^T X_{Â_k})^{-1} X_{Â_k}^T θ ∈ [δ_{α/2}, δ_{1−α/2}] ) = 1 − α

We call this a selection interval; in contrast to a typical confidence interval, it tracks a moving target, here the (random) projected regression coefficient of the kth variable to enter. Roughly speaking, we can think of this interval as trapping the projected coefficient of the kth most important variable, as deemed by the algorithm, with probability 1 − α.

Example: selection intervals for LAR

[Figure: selection intervals across 100 realizations for the first and second predictors entered, and selection intervals for the first predictor entered at each LAR step.]

Application of these and related ideas to other problems

- Beyond polyhedra (Taylor, Loftus, Ryan Tibshirani 2014)
- Graphical models and clustering (G'Sell, Taylor, Tibshirani 2014)
- Many normal means (Reid, Taylor, Tibshirani 2014)
- Lasso with fixed λ (Lee, Sun, Sun, Taylor 2014)
- Marginal screening (Sun, Taylor 2014)
- PCA (Choi, Taylor, Tibshirani, in preparation)

PCA example

[Figure: scree plots of singular values for a matrix of true rank 2, at SNR = 1.5 and SNR = 0.23.]

(p-values for the right panel: 0.030, 0.064, 0.222, 0.286, 0.197, 0.831, 0.510, 0.185, 0.126)

Generalization to principal component analysis

Model Y = B + ε, and we want to test rank(B) ≤ k. The test is based on the singular values of the Wishart matrix Y^T Y. The largest singular value has a Tracy-Widom distribution asymptotically (Johnstone 2001), and this can be used to construct a test of the global null rank(B) = 0; the test can be applied sequentially for other ranks (Kritchman and Nadler, 2008). We derive the conditional distribution of each singular value λ_k, conditional on λ_{k+1}, ..., λ_p, and use this to obtain an exact test.

[Figure: quantile-quantile plots of p-values against Uniform(0,1) for true ranks 0, 1, and 2, at steps 1 through 4, comparing the exact conditional method ("Jon's method"), an integrated-out version, and Nadler's sequential test.]

Final comments

- Data analysts need tools for inference after selection, and these are now becoming available
- But much work is left to be done (power, robustness, computation, etc.)
- R language software is on its way!

Questions

Larry: How do you estimate σ²? Estimating σ² is harder than estimating the signal!
Ale: What's the Betti number of the Rips complex of the selection set under Hausdorff paracompactness?
Jay: Is there a Bayesian analog?
Jim Ramsey: Assuming normality is bogus. Nature just gives you a bunch of numbers.
Ryan: Why can't we just apply trend filtering?

Questions

Larry: You are doing inference for ν^T θ, but your ν is random. How do you do inference for E(ν)^T θ?
Ale: How does this work for log-linear models?
Jing: How does this work for sparse PCA?
Max: How dependent are the p-values? Could this work with ForwardStop?