Post-selection Inference for Forward Stepwise and Least Angle Regression


1 Post-selection Inference for Forward Stepwise and Least Angle Regression. Ryan & Rob Tibshirani, Carnegie Mellon University & Stanford University. Joint work with Jonathan Taylor, Richard Lockhart. September / 45

2 Matching results from picadilo.com: Ryan Tibshirani, CMU (PhD student of Taylor, 2011); Rob Tibshirani, Stanford. 2 / 45

3 Top matches from picadilo.com (81%, 71%): Ryan Tibshirani, CMU; Rob Tibshirani, Stanford. 3 / 45

4 [Photos with match scores 81%, 71%, 69%] Ryan Tibshirani, CMU; Rob Tibshirani, Stanford. 4 / 45

5 Conclusion: Confidence, the strength of evidence, matters! 5 / 45

6 Outline: Setup and basic question; Quick review of least angle regression and the covariance test; A new framework for inference after selection; Application to forward stepwise and least angle regression; Application of these and related ideas to other problems. 6 / 45

7 Setup and basic question. Given an outcome vector y ∈ R^n and a predictor matrix X ∈ R^{n×p}, we consider the usual linear regression setup: y = Xβ + σε, where β ∈ R^p are the unknown coefficients to be estimated, and the components of the noise vector ε ∈ R^n are i.i.d. N(0, 1). Main question: if we apply least angle or forward stepwise regression, how can we compute valid p-values and confidence intervals? 7 / 45

8 Forward stepwise regression. This procedure enters predictors one at a time, choosing the predictor that most decreases the residual sum of squares at each stage. Defining RSS to be the residual sum of squares for the model containing k predictors, and RSS_null the residual sum of squares before the kth predictor was added, we can form the usual statistic R_k = (RSS_null − RSS)/σ² (with σ assumed known), and compare it to a χ²₁ distribution. 8 / 45
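To make the naive chi-squared comparison concrete, here is a minimal simulation sketch (Python, with made-up data standing in for a real design) of the first forward-stepwise step: fit each single predictor, keep the one with the largest drop in RSS, and compare R_1 = (RSS_null − RSS)/σ² to χ²₁. The function name and the settings n = 100, p = 10 are illustrative, chosen to mirror the simulated example on the next slide; this is not the authors' code.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, p, sigma = 100, 10, 1.0          # global null: all true coefficients are zero

def naive_first_step_pvalue(X, y, sigma):
    """Naive chi-squared p-value for the first forward-stepwise step."""
    rss_null = np.sum((y - y.mean()) ** 2)
    rss_best = np.inf
    for j in range(X.shape[1]):     # best single-predictor fit by least squares
        Xj = np.column_stack([np.ones(len(y)), X[:, j]])
        resid = y - Xj @ np.linalg.lstsq(Xj, y, rcond=None)[0]
        rss_best = min(rss_best, np.sum(resid ** 2))
    R1 = (rss_null - rss_best) / sigma ** 2
    return chi2.sf(R1, df=1)        # compare the drop in RSS to chi^2_1

# Under the global null the naive test is far too liberal: the empirical
# type I error at the 5% level is well above 5% (about 39% on the next slide).
pvals = [naive_first_step_pvalue(rng.standard_normal((n, p)),
                                 sigma * rng.standard_normal(n), sigma)
         for _ in range(2000)]
print("empirical type I error at 5%:", np.mean(np.array(pvals) < 0.05))
```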

9 Simulated example: naive forward stepwise. Setup: n = 100, p = 10, true model null. [Figure: observed test statistics against the χ²₁ distribution.] The test is too liberal: for nominal size 5%, the actual type I error is 39%. (Yes, Larry, one can get proper p-values by sample splitting: but it is messy, with a loss of power.) 9 / 45

10 Quick review of LAR and the covariance test. Least angle regression, or LAR, is a method for constructing the path of solutions for the lasso: min over β₀, β of Σ_i (y_i − β₀ − Σ_j x_ij β_j)² + λ Σ_j |β_j|. LAR is a more democratic version of forward stepwise regression: find the predictor most correlated with the outcome; move the parameter vector in the least squares direction until some other predictor has as much correlation with the current residual; this new predictor is added to the active set, and the procedure is repeated. Optional ("lasso mode"): if a non-zero coefficient hits zero, that predictor is dropped from the active set, and the process is restarted. 10 / 45
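As a quick, non-authoritative illustration of tracing this path in practice, scikit-learn's lars_path runs LAR (or its lasso mode) and returns the order in which variables enter along with the breakpoints of the path. Everything below (design size, true coefficients) is made up for the example; note also that scikit-learn reports its alphas on a per-observation scale, so they match the LAR knots only up to a constant factor, which is worth checking against your installed version.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)                 # unit-norm columns, as in LAR
beta = np.zeros(p)
beta[:2] = [4.0, -3.0]                         # two signal variables
y = X @ beta + rng.standard_normal(n)

# method="lasso" drops a predictor from the active set if its coefficient hits zero
alphas, active, coefs = lars_path(X, y, method="lasso")
print("order of entry:", active)               # indices of variables as they become active
print("path breakpoints (sklearn scale):", np.round(alphas[:4], 4))
print("coefficients at the end of the path:", np.round(coefs[:, -1], 3))
```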

11 Least angle regression in a picture. [Figure: LAR coefficient paths plotted against log(λ), with the knots λ₁, λ₂, λ₃, λ₄, ... marked.] 11 / 45

12 The covariance test for LAR (Lockhart, Taylor, Ryan Tibshirani, Rob Tibshirani, discussion paper in Annals of Statistics, 2014). The covariance test provides a p-value for each variable as it is added to the lasso model via the LAR algorithm. In particular it tests the hypothesis H₀: A ⊇ supp(β*), where A is the running active set at the current step of LAR. 12 / 45

13 The covariance test for LAR. Suppose we want a p-value for predictor 2, entering at step 3. [Figure: LAR coefficient paths against log(λ), with knots λ₁, λ₂, λ₃, λ₄, ... marked.] 13 / 45

14 Compute the covariance at λ₄: ⟨y, Xβ̂(λ₄)⟩. [Figure: LAR coefficient paths against log(λ), with knots λ₁, λ₂, λ₃, λ₄, ... marked.] 14 / 45

15 Drop x₂, yielding active set A; refit at λ₄, and compute the resulting covariance at λ₄, giving T₃ = (⟨y, Xβ̂(λ₄)⟩ − ⟨y, X_A β̂_A(λ₄)⟩)/σ². [Figure: LAR coefficient paths against log(λ), with knots λ₁, λ₂, λ₃, λ₄, ... marked.] 15 / 45

16-17 Null distribution of the covariance statistic. Under the null hypothesis that all signal variables are in the model, H₀: A ⊇ supp(β*), the covariance statistic satisfies T_j = (⟨y, Xβ̂(λ_{j+1})⟩ − ⟨y, X_A β̂_A(λ_{j+1})⟩)/σ² → Exp(1) as n, p → ∞. Equivalent "knot form": T_j = c_j λ_j(λ_j − λ_{j+1})/σ², with c_j = 1 for the global null case (j = 1). 16 / 45
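A small simulation sketch of the knot form under the global null: for an orthogonal design the LAR knots are simply the sorted values of |X^T y|, so T_1 = λ_1(λ_1 − λ_2)/σ² is easy to compute and can be compared with Exp(1). The settings are illustrative, and the Exp(1) limit is asymptotic in p, so with p = 10 the agreement is only approximate.

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(2)
n, p, sigma, reps = 100, 10, 1.0, 5000
T1 = np.empty(reps)
for r in range(reps):
    # orthonormal design via QR; under the global null, y is pure noise
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))
    y = sigma * rng.standard_normal(n)
    lam = np.sort(np.abs(X.T @ y))[::-1]        # LAR knots for an orthogonal design
    T1[r] = lam[0] * (lam[0] - lam[1]) / sigma ** 2
# compare a few simulated quantiles against the Exp(1) reference
qs = [0.50, 0.90, 0.95]
print("simulated quantiles:", np.round(np.quantile(T1, qs), 3))
print("Exp(1) quantiles:   ", np.round(expon.ppf(qs), 3))
```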

18 Simulated example: covariance test. Setup: n = 100, p = 10, true model null. [Figure: covariance test statistic against the Exp(1) distribution.] 17 / 45

19 Example: prostate cancer data. Data from a study of the level of prostate-specific antigen and p = 8 clinical measures, in n = 67 men about to receive a radical prostatectomy. [Table: p-values from naive forward stepwise and the LAR covariance test for lcavol, lweight, svi, lbph, pgg45, age, lcp, gleason.] 18 / 45

20-27 Shortcomings of the covariance test

The covariance test is highly intuitive and actually pretty broad: it can be extended to other sparse estimation problems [e.g., graphical models and clustering (G'Sell et al. 2014)]. But it has some definite weaknesses:

1. Places correlation restrictions on the predictors X (somewhat similar to standard conditions for exact model recovery)
2. Assumes linearity of the underlying model (i.e., y = Xβ + ε)
3. Significance statements are asymptotic
4. Hard to get confidence statements out of the covariance test statistic (not easily "pivotable")
5. Doesn't have a cool enough name

We'll discuss a new framework that overcomes 1-4, and especially 5. 19 / 45

28-34 What's coming next: a roadmap

1. General framework for inference after polyhedral selection
2. Application to forward stepwise regression and LAR
3. For LAR, we obtain the spacing test, which is exact in finite samples

Consider the global null, σ² = 1, and λ₁, λ₂ the first two LAR knots.
Covariance test: λ₁(λ₁ − λ₂) → Exp(1) as n, p → ∞
Spacing test: (1 − Φ(λ₁)) / (1 − Φ(λ₂)) ~ Unif(0, 1) for any n, p
(Intriguing connection: these are asymptotically equivalent) 20 / 45

35 Preview: prostate cancer data. [Table: naive versus selection-adjusted p-values for lcavol, lweight, svi, lbph, pgg45, lcp, age, gleason.] 21 / 45

36 A new framework for inference after selection 22 / 45

37-39 The polyhedral testing framework

Suppose we observe y ~ N(θ, Σ), with mean parameter θ unknown (but covariance Σ known). We wish to make inferences on ν^T θ, a linear contrast of the mean θ, conditional on y ∈ S, where S is a polyhedron S = {y : Γy ≥ u}. The vector ν = ν(S) is allowed to depend on S. E.g., we would like a p-value P(y, S, ν) that satisfies P_{ν^T θ = 0}(P(y, S, ν) ≤ α | Γy ≥ u) = α for any 0 ≤ α ≤ 1. 23 / 45

40 Fundamental result

Lemma (Polyhedral selection as truncation). For any ν ≠ 0, we have
{Γy ≥ u} = {V^- ≤ ν^T y ≤ V^+, V^0 ≥ 0},
where, with γ = ΓΣν / (ν^T Σν),
V^-(y) = max_{j: γ_j > 0} [u_j − (Γy)_j + γ_j ν^T y] / γ_j
V^+(y) = min_{j: γ_j < 0} [u_j − (Γy)_j + γ_j ν^T y] / γ_j
V^0(y) = min_{j: γ_j = 0} [(Γy)_j − u_j].
Moreover, the triplet (V^-, V^+, V^0)(y) is independent of ν^T y. 24 / 45
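A direct transcription of the lemma into code might look like the sketch below, assuming the polyhedron convention {Γy ≥ u} used above; the helper name is hypothetical. Given Γ, u, ν, Σ and a realization y, it returns the truncation limits for ν^T y, followed by a tiny two-dimensional check.

```python
import numpy as np

def truncation_limits(Gamma, u, nu, Sigma, y):
    """Return (V-, V+, V0) of the lemma for the polyhedron {y : Gamma @ y >= u}."""
    gamma = Gamma @ (Sigma @ nu) / (nu @ Sigma @ nu)     # gamma = Gamma Sigma nu / (nu^T Sigma nu)
    nty = nu @ y
    Gy = Gamma @ y
    with np.errstate(divide="ignore", invalid="ignore"):
        vals = (u - Gy + gamma * nty) / gamma            # only used where gamma != 0
    v_minus = np.max(vals[gamma > 0], initial=-np.inf)   # lower truncation limit
    v_plus = np.min(vals[gamma < 0], initial=np.inf)     # upper truncation limit
    v_zero = np.min((Gy - u)[gamma == 0], initial=np.inf)  # must be >= 0 on the event
    return v_minus, v_plus, v_zero

# check: condition on the event {y1 >= y2, y1 >= 0} and look at nu^T y = y1
Gamma = np.array([[1.0, -1.0], [1.0, 0.0]])
u = np.zeros(2)
nu = np.array([1.0, 0.0])
y = np.array([1.3, 0.4])
print(truncation_limits(Gamma, u, nu, np.eye(2), y))      # (0.4, inf, inf): y1 is truncated to [0.4, inf)
```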

41 Proof. [Figure: geometric picture of the polyhedron {Γy ≥ u}, the direction ν, the projection P_ν y, and the truncation limits V^-, V^+ for ν^T y.] 25 / 45

42-46 Using this result for conditional inference

In other words, the distribution of ν^T y | Γy ≥ u is the same as that of ν^T y | V^- ≤ ν^T y ≤ V^+, V^0 ≥ 0. This is a truncated Gaussian distribution (with random limits). How do we use this result for conditional inference on ν^T θ? Follow two main ideas:

1. For Z ~ N_{[a,b]}(μ, σ²), with F^{[a,b]}_{μ,σ²} its CDF, we have P(F^{[a,b]}_{μ,σ²}(Z) ≤ α) = α
2. Hence also P(F^{[V^-,V^+]}_{ν^T θ, ν^T Σν}(ν^T y) ≤ α | Γy ≥ u) = α

26 / 45

47-50 Therefore, to test H₀: ν^T θ = 0 against H₁: ν^T θ > 0, we can take as a conditional p-value
P = 1 − F^{[V^-,V^+]}_{0, ν^T Σν}(ν^T y),
since P_{ν^T θ = 0}(1 − F^{[V^-,V^+]}_{0, ν^T Σν}(ν^T y) ≤ α | Γy ≥ u) = α.
Furthermore, because the same statement holds for any fixed μ,
P_{ν^T θ = μ}(1 − F^{[V^-,V^+]}_{μ, ν^T Σν}(ν^T y) ≤ α | Γy ≥ u) = α,
we can compute a conditional confidence interval by inverting the pivot, yielding
P(ν^T θ ∈ [δ_{α/2}, δ_{1−α/2}] | Γy ≥ u) = 1 − α. 27 / 45
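A minimal sketch of the resulting pivot, assuming Σ = σ²I as in the regression setting, so that ν^T Σν = σ²‖ν‖²: the truncated-normal CDF F^{[a,b]}_{μ,σ²} is a ratio of standard normal CDFs, and the one-sided p-value is one minus the pivot evaluated at ν^T y with μ = 0. Far out in the tails the naive ratio of Φ's can lose precision; a production implementation would evaluate it more carefully, which this sketch does not attempt. Function names and the numbers in the example are illustrative.

```python
import numpy as np
from scipy.stats import norm

def truncated_normal_cdf(x, mu, sd, a, b):
    """F^{[a,b]}_{mu, sd^2}(x): CDF of N(mu, sd^2) truncated to the interval [a, b]."""
    num = norm.cdf((x - mu) / sd) - norm.cdf((a - mu) / sd)
    den = norm.cdf((b - mu) / sd) - norm.cdf((a - mu) / sd)
    return num / den

def selective_pvalue(nty, v_minus, v_plus, sd, mu=0.0):
    """One-sided p-value for H0: nu^T theta = mu against nu^T theta > mu."""
    return 1.0 - truncated_normal_cdf(nty, mu, sd, v_minus, v_plus)

# example: nu^T y = 2.1 observed, truncation region [1.5, inf), sd = sigma * ||nu||_2 = 1
print(selective_pvalue(2.1, 1.5, np.inf, 1.0))
```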

51-53 Illustration. [Figure: truncated normal density.] We observe random limits V^-, V^+ (in black), and a random point ν^T y (in red); we form the truncated normal density f^{[V^-,V^+]}_{0, ν^T Σν}, and the mass to the right of ν^T y is our p-value. 28 / 45

54 Application to forward stepwise and least angle regression 29 / 45

55-62 Sequential model selection as a polyhedral set

Suppose we run k steps of forward stepwise or LAR, and encounter the sequence of active sets A_l, l = 1, ..., k. We can express
{y : (Â_l(y), ŝ_{A_l}(y)) = (A_l, s_{A_l}), l = 1, ..., k} = {y : Γy ≥ 0}
for some matrix Γ. Here s_{A_l} gives the signs of the active coefficients at step l. This describes all vectors y such that the algorithm would make the same selections (variables and signs) over k steps.

Important points:
- For forward stepwise, Γ has 2pk rows
- For LAR, Γ has only k + 1 rows!
- Computation of V^-, V^+ is O(number of rows of Γ)

30 / 45
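As a concrete illustration of how a selection event becomes a polyhedron (a hand-rolled sketch, not the authors' implementation), here is Γ for a single step (k = 1) of forward stepwise with unit-norm columns: variable j1 entering with sign s1 means s1·x_{j1}^T y ≥ ±x_j^T y for every j, which is 2p linear inequalities of the form Γy ≥ 0, matching the 2pk row count quoted above.

```python
import numpy as np

def first_step_polyhedron(X, y):
    """Gamma such that {y : Gamma @ y >= 0} encodes the first forward-stepwise selection."""
    scores = X.T @ y
    j1 = int(np.argmax(np.abs(scores)))          # variable entering at step 1
    s1 = np.sign(scores[j1])                     # its sign
    rows = []
    for j in range(X.shape[1]):
        # winning condition: s1 * x_{j1}^T y >= + x_j^T y  and  s1 * x_{j1}^T y >= - x_j^T y
        rows.append(s1 * X[:, j1] - X[:, j])
        rows.append(s1 * X[:, j1] + X[:, j])
    return np.array(rows), j1, s1

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 5))
X /= np.linalg.norm(X, axis=0)
y = rng.standard_normal(50)
Gamma, j1, s1 = first_step_polyhedron(X, y)
print(Gamma.shape, "rows; selected variable", j1, "with sign", int(s1))
print("all constraints hold at the observed y:", bool(np.all(Gamma @ y >= -1e-12)))
```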

63-67 Inference for forward stepwise and LAR

Using our polyhedral framework, we can derive conditional p-values or confidence intervals for any linear contrast ν^T θ. Notes:
- Here ν can depend on the polyhedron, i.e., can depend on the selection A_l, s_{A_l}, l = 1, ..., k
- We place no restrictions on the predictors X (general position)
- Interesting case to keep in mind: ν = X_{A_k}(X_{A_k}^T X_{A_k})^{-1} e_k, so that ν^T θ = e_k^T (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T θ, which is the kth coefficient in the multiple regression of θ on the active variables A_k

31 / 45

68-71 Our conditional p-value for H₀: ν^T θ = 0 is
P = [Φ(V^+/(σ‖ν‖₂)) − Φ(ν^T y/(σ‖ν‖₂))] / [Φ(V^+/(σ‖ν‖₂)) − Φ(V^-/(σ‖ν‖₂))].
This has exact (!) conditional size:
P_{ν^T θ = 0}(P ≤ α | (Â_l(y), ŝ_{A_l}(y)) = (A_l, s_{A_l}), l = 1, ..., k) = α.
When ν = X_{A_k}(X_{A_k}^T X_{A_k})^{-1} e_k: this tests the significance of the projected regression coefficient of the last variable entered into the model, conditional on all selections that have been made so far. 32 / 45

72-75 Our conditional confidence interval is I = [δ_{α/2}, δ_{1−α/2}], where the endpoints are computed by inverting the truncated Gaussian pivot. This has exact (!) conditional coverage:
P(ν^T θ ∈ [δ_{α/2}, δ_{1−α/2}] | (Â_l(y), ŝ_{A_l}(y)) = (A_l, s_{A_l}), l = 1, ..., k) = 1 − α.
When ν = X_{A_k}(X_{A_k}^T X_{A_k})^{-1} e_k: this interval traps the projected regression coefficient of the last variable to enter with probability 1 − α, conditional on all selections made so far. 33 / 45
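The endpoints δ_{α/2}, δ_{1−α/2} come from inverting the truncated-normal pivot in μ; since the pivot is monotone decreasing in μ, a simple bisection is enough for a sketch. Everything here is illustrative: the coarse bracket of ±8 standard deviations keeps the naive ratio of normal CDFs numerically stable, and a production implementation would use more careful tail evaluations instead of clipping.

```python
import numpy as np
from scipy.stats import norm

def tn_cdf(x, mu, sd, a, b):
    """CDF of N(mu, sd^2) truncated to [a, b]."""
    z = lambda t: norm.cdf((t - mu) / sd)
    return (z(x) - z(a)) / (z(b) - z(a))

def selection_interval(nty, v_minus, v_plus, sd, alpha=0.10, width=8.0):
    """Invert the truncated-normal pivot in mu to get [delta_{alpha/2}, delta_{1-alpha/2}]."""
    def endpoint(target):
        lo, hi = nty - width * sd, nty + width * sd   # coarse but numerically safe bracket
        for _ in range(100):                          # bisection: tn_cdf is decreasing in mu
            mid = 0.5 * (lo + hi)
            if tn_cdf(nty, mid, sd, v_minus, v_plus) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    # lower endpoint solves pivot = 1 - alpha/2, upper endpoint solves pivot = alpha/2
    return endpoint(1 - alpha / 2), endpoint(alpha / 2)

# example: nu^T y = 2.1 observed, truncation region [1.5, inf), sd = sigma * ||nu||_2 = 1
print(selection_interval(2.1, 1.5, np.inf, 1.0, alpha=0.10))
```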

76-79 The spacing test

The spacing test is the name we give to our framework applied to LAR (... tentative paper title 2014: "A Spacing Odyssey"). Because Γ is especially simple for LAR, this test is easy and efficient to implement. The p-value for H₀: e_k^T (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T θ = 0 is
P = [Φ(V^+/(σ‖ω_k‖₂)) − Φ(λ_k/(σ‖ω_k‖₂))] / [Φ(V^+/(σ‖ω_k‖₂)) − Φ(V^-/(σ‖ω_k‖₂))],
where λ_k is the kth LAR knot, and ω_k = X_{A_k}(X_{A_k}^T X_{A_k})^{-1} s_{A_k} − X_{A_{k−1}}(X_{A_{k−1}}^T X_{A_{k−1}})^{-1} s_{A_{k−1}}. 34 / 45
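To see the finite-sample claim in action under the global null, here is a small simulation sketch with σ = 1 and an orthogonal design, so that the LAR knots are simply the sorted values of |X^T y|: the step-one spacing p-value (1 − Φ(λ_1))/(1 − Φ(λ_2)) should be exactly Unif(0, 1) for any n and p. The settings are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, p, reps = 50, 8, 5000
pvals = np.empty(reps)
for r in range(reps):
    # orthonormal design; under the global null, y is pure N(0, I) noise with sigma = 1
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))
    y = rng.standard_normal(n)
    lam = np.sort(np.abs(X.T @ y))[::-1]          # LAR knots for an orthogonal design
    pvals[r] = norm.sf(lam[0]) / norm.sf(lam[1])  # (1 - Phi(lambda_1)) / (1 - Phi(lambda_2))
# exact uniformity: the empirical CDF should sit on the diagonal
for q in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(q, round(float(np.mean(pvals <= q)), 3))
```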

80 Example: selection adjustment under the global null. [Figure: QQ plots of p-values against expected uniform quantiles at steps 1-4, for the max-t statistic, the selection-adjusted (exact) forward stepwise test, and the spacing test.] 35 / 45

81 Example: back to prostate data. [Table: p-values for lcavol, lweight, svi, lbph, pgg45, lcp, age, gleason, comparing naive and adjusted forward stepwise, and the LAR covariance test versus the LAR spacing test.] 36 / 45

82-88 Selection intervals

Interestingly, our conditional intervals also hold unconditionally. Consider ν = X_{A_k}(X_{A_k}^T X_{A_k})^{-1} e_k for concreteness; by averaging over all possible selections (A_l, s_{A_l}), l = 1, ..., k, we obtain
P(e_k^T (X_{Â_k}^T X_{Â_k})^{-1} X_{Â_k}^T θ ∈ [δ_{α/2}, δ_{1−α/2}]) = 1 − α.
We call this a selection interval; in contrast to a typical confidence interval, it tracks a moving target, here the (random) projected regression coefficient of the kth variable to enter. Roughly speaking, we can think of this interval as trapping the projected coefficient of the kth most important variable as deemed by the algorithm, with probability 1 − α. 37 / 45

89 Example: selection intervals for LAR. [Figures: selection intervals for predictor 1 and for predictor 2 (parameter value, by realization), and selection intervals for the first predictor entered, at each LAR step (parameter value, by LAR step).] 38 / 45

90 Application of these and related ideas to other problems: Beyond polyhedra (Taylor, Loftus, Ryan Tibs 2014); Graphical models and clustering (G'Sell, Taylor, Tibs 2014); Many normal means (Reid, Taylor, Tibs 2014); Lasso with fixed λ (Lee, Sun, Sun, Taylor 2014); Marginal screening (Sun, Taylor 2014); PCA (Choi, Taylor, Tibs, in preparation). 39 / 45

91 PCA example. [Figure: scree plots of singular values against index for a matrix of (true) rank = 2, at SNR = 1.5 and SNR = 0.23.] (p-values for the right panel: 0.030, 0.064, 0.222, 0.286, 0.197, 0.831, 0.510, 0.185, ...) 40 / 45

92 Generalization to principal component analysis. Model Y = B + ε, and we want to test rank(B) ≤ k. The test is based on the singular values of the Wishart matrix Y^T Y. The largest singular value has a Tracy-Widom distribution asymptotically (Johnstone 2001), and this can be used to construct a test of the global null rank(B) = 0. This can be applied sequentially for other ranks (Kritchman and Nadler, 2008). We derive the conditional distribution of each singular value λ_k, conditional on λ_{k+1}, ..., λ_p, and use this to obtain an exact test. 41 / 45

93 [Figure: QQ plots of p-values against uniform quantiles (qs), for true rank 0, 1, and 2 and steps 1-4, comparing Jon's method (integrated out) with Nadler's.] 42 / 45

94 Final comments. Data analysts need tools for inference after selection, and these are now becoming available. But much work is left to be done (power, robustness, computation, etc.). R language software is on its way! 43 / 45

95 Questions. Larry: How do you estimate σ²? Estimating σ² is harder than estimating the signal! Ale: What's the Betti number of the Rips complex of the selection set under Hausdorff paracompactness? Jay: Is there a Bayesian analog? Jim Ramsey: Assuming normality is bogus. Nature just gives you a bunch of numbers. Ryan: Why can't we just apply trend filtering? 44 / 45

96 Questions. Larry: You are doing inference for ν^T θ but your ν is random. How do you do inference for E(ν)^T θ? Ale: How does this work for log-linear models? Jing: How does this work for sparse PCA? Max: How dependent are the p-values? Could this work with ForwardStop? 45 / 45
