Some new ideas for post selection inference and model assessment


Robert Tibshirani, Stanford. WHOA!! 2018. Thanks to Jon Taylor and Ryan Tibshirani for helpful feedback.

Two topics:
1. How to improve post-selection inference for the lasso: Keli Liu, Jelena Markovic & RT (with further generalizations by Jon Taylor).
2. Maybe we're answering the wrong question in #1: post model-fitting exploration via Next-Door analysis: Leying Guan & RT.

Collaborators: Keli Liu, Jelena Markovic, Leying Guan.

Post-selection inference for the lasso. Data $(x_i, y_i)$, $i = 1, 2, \dots, N$; $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$; $X$ fixed. Model: $y_i = \beta_0 + \sum_j x_{ij}\beta_j + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$.

The lasso:
$$\hat\beta = \operatorname*{arg\,min}_{\beta_0,\beta_1,\dots,\beta_p}\Big\{\sum_i \big(y_i - \beta_0 - \sum_j x_{ij}\beta_j\big)^2 + \lambda \sum_j |\beta_j|\Big\}$$
for some $\lambda \ge 0$.
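As a concrete illustration, here is a minimal sketch in Python (using numpy and scikit-learn on simulated data; the design, the value of λ, and all variable names are illustrative assumptions, not taken from the talk) of fitting the lasso at a fixed λ and recording the active set about which post-selection questions will be asked.

```python
# Minimal sketch: fit the lasso at a fixed tuning parameter and record the active set.
# Note: sklearn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1, so alpha plays
# the role of a rescaled lambda.  All settings below are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0] + [0.0] * (p - 5))
y = X @ beta_true + rng.standard_normal(n)

lam = 0.1                                    # fixed tuning parameter (illustrative)
fit = Lasso(alpha=lam, fit_intercept=True).fit(X, y)
active = np.flatnonzero(fit.coef_)           # the selected variables
print("active set:", active, "signs:", np.sign(fit.coef_[active]))
```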

Review of the truncated Gaussian approach: polyhedral selection events. Response vector $y \sim N(\mu, \Sigma)$. Suppose we make a selection that can be written as $\{y : Ay \le b\}$ with $A$, $b$ not depending on $y$. This is true for forward stepwise regression, the lasso with fixed $\lambda$, least angle regression, and other procedures.

The polyhedral lemma [Lee et al.; Ryan Tibshirani et al.]: for any vector $\eta$,
$$F^{[V^-,\,V^+]}_{\eta^\top\mu,\;\sigma^2\eta^\top\eta}(\eta^\top y)\;\Big|\;\{Ay \le b\} \;\sim\; \mathrm{Unif}(0,1)$$
(a truncated Gaussian pivot), where $V^-$, $V^+$ are computable values that are functions of $\eta$, $A$, $b$ (and of $y$ only through its component orthogonal to $\eta$). Typically one chooses $\eta$ so that $\eta^\top y$ is the least-squares estimate of the partial regression coefficient for a selected variable.
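To make the lemma concrete, the following sketch (my own illustration following the construction of Lee et al., not code from the talk; the toy $A$, $b$, $\eta$ are made up) computes the truncation limits $V^-$, $V^+$ and the truncated-Gaussian pivot for a contrast $\eta^\top y$ given a polyhedral selection event $\{Ay \le b\}$ with $y \sim N(\mu, \sigma^2 I)$.

```python
# Sketch of the polyhedral lemma: given y ~ N(mu, sigma^2 I), a selection event
# {A y <= b}, and a contrast eta, compute V-, V+ and the truncated-Gaussian pivot,
# which is Unif(0,1) under the null eta'mu = null_value.
# The A, b, eta below are illustrative placeholders, not from the talk.
import numpy as np
from scipy.stats import norm

def truncated_gaussian_pivot(y, A, b, eta, sigma, null_value=0.0):
    s2 = sigma**2 * (eta @ eta)              # Var(eta'y)
    c = eta / (eta @ eta)                    # direction so that y = c*(eta'y) + z
    t = eta @ y
    z = y - c * t                            # part of y independent of eta'y
    Ac, Az = A @ c, A @ z
    lower = (b - Az)[Ac < 0] / Ac[Ac < 0]    # constraints bounding eta'y from below
    upper = (b - Az)[Ac > 0] / Ac[Ac > 0]    # constraints bounding eta'y from above
    v_minus = lower.max() if lower.size else -np.inf
    v_plus = upper.min() if upper.size else np.inf
    s = np.sqrt(s2)
    # CDF of N(null_value, s2) truncated to [v_minus, v_plus], evaluated at t
    num = norm.cdf((t - null_value) / s) - norm.cdf((v_minus - null_value) / s)
    den = norm.cdf((v_plus - null_value) / s) - norm.cdf((v_minus - null_value) / s)
    return v_minus, v_plus, num / den

# Toy selection event "y1 >= y2 and y1 >= 0", written as A y <= b:
A = np.array([[-1.0, 1.0], [-1.0, 0.0]])
b = np.zeros(2)
y = np.array([1.3, 0.4])
eta = np.array([1.0, 0.0])                   # inference on mu_1
print(truncated_gaussian_pivot(y, A, b, eta, sigma=1.0))
```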

[Figure: geometry of the polyhedral lemma; the selection region $\{Ay \le b\}$, the decomposition of $y$ along $\eta$ via $P_\eta y$, and the truncation limits $V^-(y)$ and $V^+(y)$ for $\eta^\top y$.]

Example: the lasso with fixed $\lambda$. HIV data: mutations that predict response to a drug. [Figure: selection intervals for the lasso with fixed tuning parameter $\lambda$; for each predictor, the naive interval and the selection-adjusted interval for its coefficient.]

A big shortcoming of this approach: intervals are often very wide, and can even be infinite. Why? We have conditioned on too much, leaving not enough variation for inference [Fithian, Taylor: "data carving"]. Jonathan Taylor & co-authors have worked to solve this problem by adding noise to the data before model fitting; this is clever and produces shorter intervals and more powerful tests. Here we show how the problem can be largely solved without randomization, to provide shorter intervals.

Forming a data-driven query: two costs.
1. Variable selection: the data is used to decide which variables are worthy of attention, e.g., running the lasso and focusing on the active set.
2. Target formation: having settled on a subset $M \subseteq \{1, \dots, p\}$ of variables for careful study, what should be the target of our estimation? Two choices: the full target $\beta^F_j$, $j \in M$, where $\beta^F = (X^\top X)^{-1} X^\top \mu$, or the partial target $\beta^{(M)} = (X_M^\top X_M)^{-1} X_M^\top \mu$ (a small numerical sketch of the two targets follows below).
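Here is a small numpy sketch contrasting the two targets (my own illustration: the design, true coefficients, and selected set M are assumptions, and μ is taken to be the simulated mean Xβ).

```python
# Sketch: the full target beta^F = (X'X)^{-1} X' mu versus the partial target
# beta^(M) = (X_M' X_M)^{-1} X_M' mu for a selected subset M.
# Illustrative simulated data; requires N > p for the full target to be defined.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 6
X = rng.standard_normal((n, p))
X[:, 1] = 0.7 * X[:, 0] + 0.3 * X[:, 1]          # make x0 and x1 correlated
beta_true = np.array([1.0, 1.0, 0.5, 0.0, 0.0, 0.0])
mu = X @ beta_true                               # mean vector E[y] = X beta

M = [0, 2, 3]                                    # a hypothetical selected set (omits x1)
full_target = np.linalg.solve(X.T @ X, X.T @ mu)           # beta^F, equals beta_true here
XM = X[:, M]
partial_target = np.linalg.solve(XM.T @ XM, XM.T @ mu)     # beta^(M): projection of mu onto X_M
# The partial target for x0 absorbs part of the omitted, correlated x1,
# so it differs from the full-model coefficient of x0.
print("full target    :", np.round(full_target, 3))
print("partial target :", dict(zip(M, np.round(partial_target, 3))))
```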

Consequences. With the full target, our only cost is in #1. Our proposal: instead of conditioning on the entire active set and signs, we condition just on the event that a given variable $X_j$ was chosen [minimal conditioning: it's the event that leads us to ask a question about $X_j$]. This leads to a truncated Gaussian distribution on the union of two disjoint intervals, with exact coverage under Gaussian errors. With the partial target, we have to deal with both #1 and #2; details in a few slides.

Full-model coefficients, prostate cancer data. [Figure: selection intervals for lcavol, svi, lweight, age, lbph, pgg45 and gleason under four methods: Naïve (0.33), TZ_V (0.29), TZ_M (0.82), TZ_Ms (1.19).] Naïve ignores selection; TZ_V conditions just on the selected variable; TZ_M conditions on the active set; TZ_Ms conditions on the active set and signs (Lee et al.).

Partial targets. Idea: we choose a subset $\hat H \subseteq \hat M$ of high-value targets (details below). How we summarize the effect of a variable $j \in \hat M$ depends on whether $j$ is a high-value target.
High value: we summarize the effect of $j$ using $\beta^{\hat H}_j$, where $\beta^{\hat H} = (X_{\hat H}^\top X_{\hat H})^{-1} X_{\hat H}^\top \mu$. So our choice of target is fully adaptive for high-value targets.
Low value: if variable $j$ is selected by the lasso but is not deemed a high-value target, we summarize its effect via $\beta^{\hat H \cup \{j\}}_j$, where $\beta^{\hat H \cup \{j\}} = (X_{\hat H \cup \{j\}}^\top X_{\hat H \cup \{j\}})^{-1} X_{\hat H \cup \{j\}}^\top \mu$ and $X_{\hat H \cup \{j\}}$ is the matrix containing the high-value targets as well as variable $j$. The coefficient $\beta^{\hat H \cup \{j\}}_j$ is the effect of variable $j$ after partialing out the effect of the high-value targets, i.e., it asks whether variable $j$ contributes any explanatory power beyond the variables in $\hat H$.

Defining high- and low-value targets. Stable-t: take $\hat H$ to be those variables in $\hat M$ with t-statistics surpassing a Bonferroni-corrected threshold. We first fit an OLS model using all the variables in $\hat M$, i.e., $\hat\beta^{\hat M} = (X_{\hat M}^\top X_{\hat M})^{-1} X_{\hat M}^\top y$, and declare $j$ a high-value target if the t-statistic for $\hat\beta^{\hat M}_j$ is large, i.e., if
$$\frac{\big|\hat\beta^{\hat M}_j\big|}{\sigma\sqrt{\big[(X_{\hat M}^\top X_{\hat M})^{-1}\big]_{jj}}} > c$$
for some cutoff $c$. If we choose $c$ by Bonferroni, it has the form $\Phi^{-1}\!\big(1 - \tfrac{\alpha}{2p}\big) \approx \sqrt{2\log p}$ for large $p$. We again get a truncated Gaussian over a union of intervals, and exact coverage in finite samples.
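A short sketch of the stable-t rule (illustrative only: σ is treated as known, the data are simulated, and the λ used for the lasso is arbitrary) for splitting the lasso active set $\hat M$ into high- and low-value targets.

```python
# Sketch of the stable-t rule: fit OLS on the lasso active set M_hat and call
# variable j "high value" when |t_j| exceeds the Bonferroni cutoff
# c = Phi^{-1}(1 - alpha/(2p)).  sigma is assumed known; all settings illustrative.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, sigma, alpha = 100, 20, 1.0, 0.10
X = rng.standard_normal((n, p))
beta_true = np.concatenate([[3.0, 2.0, 0.3], np.zeros(p - 3)])
y = X @ beta_true + sigma * rng.standard_normal(n)

M_hat = np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_)     # lasso active set
XM = X[:, M_hat]
beta_M = np.linalg.solve(XM.T @ XM, XM.T @ y)                # OLS on the active set
se = sigma * np.sqrt(np.diag(np.linalg.inv(XM.T @ XM)))
t_stats = beta_M / se
cutoff = norm.ppf(1 - alpha / (2 * p))                       # Bonferroni cutoff, ~ sqrt(2 log p)
H_hat = M_hat[np.abs(t_stats) > cutoff]                      # high-value targets
low_value = np.setdiff1d(M_hat, H_hat)                       # selected but below the cutoff
print("active set:", M_hat, "\nhigh value:", H_hat, "\nlow value:", low_value)
```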

Partial-model coefficients, prostate cancer data. [Figure: selection intervals for lcavol, svi, lweight, age, lbph, pgg45 and gleason, split into high-value and low-value targets, under four methods: Naïve (0.30), TZ_stab-t (0.40), TZ_M (0.80), TZ_Ms (1.12).] Naïve ignores selection; TZ_V conditions just on the selected variable; TZ_M conditions on the active set; TZ_Ms conditions on the active set and signs (Lee et al.); TZ_stab-t uses stable-t for high-value target selection.

Simulation: n = 100, p = 250, pure noise. [Figure: boxplots of the lengths of 90% confidence intervals for partial regression coefficients under six methods: naive, Bonferroni (bonf), TZ_t, TZ_l1, TZ_M, TZ_Ms; annotated (length, coverage) pairs include (0.32, 0.00), (0.51, 0.47), (0.78, 0.92), (0.74, 0.91), (0.51, 0.92) and a length of 7.62, with proportions of infinite intervals t: 0.02, l1: 0.00, M: 0.09, Ms: 0.47.] Naive ignores selection; Bonf is Bonferroni; TZ_t uses stable-t for high-value target selection; TZ_l1 uses stable-$\ell_1$ for high-value target selection; TZ_M conditions on the active set; TZ_Ms conditions on the active set and signs (Lee et al.).
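The collapse of naive coverage in this pure-noise setting is easy to reproduce; the sketch below (my own illustration with arbitrary choices of λ and of the number of repetitions, not the simulation code behind the figure) checks the coverage of naive 90% OLS intervals for the partial regression coefficients of lasso-selected variables, all of which are truly zero here.

```python
# Sketch: under pure noise (all partial coefficients are 0), naive OLS intervals
# computed for lasso-selected variables badly under-cover.  Illustrative settings only.
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, alpha_level, lam = 100, 250, 0.10, 0.20
covered, total = 0, 0
for rep in range(200):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)                         # pure noise: mu = 0
    sel = np.flatnonzero(Lasso(alpha=lam).fit(X, y).coef_)
    if sel.size == 0 or sel.size > n - 10:
        continue
    XM = X[:, sel]
    bhat, *_ = np.linalg.lstsq(XM, y, rcond=None)
    df = n - sel.size
    s2 = np.sum((y - XM @ bhat) ** 2) / df
    se = np.sqrt(s2 * np.diag(np.linalg.inv(XM.T @ XM)))
    half = stats.t.ppf(1 - alpha_level / 2, df) * se
    covered += np.sum(np.abs(bhat) <= half)            # interval contains the true value 0
    total += sel.size
print(f"naive 90% interval coverage under pure noise: {covered / total:.2f}")
```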

Wrapup. All of this is for $N > p$. The ideas are extended to the high-dimensional full-target case via ROSI: in preparation with Kevin Fry, Keli Liu, Jonathan Taylor and Rob Tibshirani. It gets good power as well, with application to large GWAS problems. This will be added to our selectiveInference R and Python packages.

Next-door analysis: motivation. Having fit a model by, e.g., the lasso, post-selection inference (as above) focuses on significance and confidence intervals for each chosen feature. But scientists will often have different questions:
- Is the chosen model the uniquely best one?
- Are there other models with similar prediction performance?
- Is a given predictor indispensable, or can it be swapped out for one or more other predictors?
These are model-centric, as opposed to feature-centric, questions. Our proposed solution is an application of the LOCO (leave-one-covariate-out) method of Lei et al. (the CMU group) [no data splitting; focus on models, not variables].

[Diagram: the chosen model {x1, x2, x3}, which has minimum CV error, surrounded by its leave-one-out neighbors with somewhat higher error: leave out x1 gives {x2, x3, x4}; leave out x2 gives {x1, x3}; leave out x3 gives {x1, x5}.]

Algorithm: Next-Door analysis for the lasso.
1. Fit the lasso with parameter $\hat\lambda$ chosen by cross-validation. Let the solution be $\hat\beta(\hat\lambda)$, and let $S$ be the active set, i.e., the set of predictors whose coefficients in $\hat\beta(\hat\lambda)$ are non-zero.
2. For each $j \in S$, solve the lasso problem with the coefficient of the $j$th predictor fixed at 0 (a code sketch of this step follows below):
$$\{\hat\beta_0, \hat\beta\}(\hat\lambda; j) = \operatorname*{arg\,min}_{\beta:\,\beta_j = 0}\; \tfrac{1}{2}\sum_i \Big(y_i - \beta_0 - \sum_{l \ne j} X_{il}\beta_l\Big)^2 + \hat\lambda \sum_{l \ne j} |\beta_l| \qquad (1)$$
Let $\hat\beta(\hat\lambda; j)$ be the resulting coefficients and $d_j$ the increase in validation error for this model relative to the base model.
3. Form an approximately unbiased estimate of $d_j$ and test whether predictor $j$ is indispensable: that is, test whether the increase in estimated prediction error $d_j$ is significantly larger than zero.
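Here is a minimal sketch of step 2 (my own illustration using scikit-learn: a single held-out validation split stands in for the cross-validation error, and the debiasing and significance test of step 3 are not implemented).

```python
# Sketch of Next-Door step 2: for each active predictor j, refit the lasso at the
# same lambda with beta_j forced to 0 (equivalently, with column j removed), and
# record the increase d_j in validation error over the base model.
# Illustrative only: a train/validation split stands in for CV error.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta_true = np.concatenate([[2.0, 2.0, 1.0], np.zeros(p - 3)])
y = X @ beta_true + rng.standard_normal(n)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
cv = LassoCV(cv=5).fit(X_tr, y_tr)                      # step 1: lambda by cross-validation
lam = cv.alpha_
S = np.flatnonzero(cv.coef_)                            # active set
base_err = np.mean((y_va - cv.predict(X_va)) ** 2)

d = {}
for j in S:                                             # step 2: next-door models
    keep = np.delete(np.arange(p), j)                   # beta_j = 0  <=>  drop column j
    fit_j = Lasso(alpha=lam).fit(X_tr[:, keep], y_tr)
    err_j = np.mean((y_va - fit_j.predict(X_va[:, keep])) ** 2)
    d[j] = err_j - base_err                             # increase in validation error
print("base validation error:", round(base_err, 3))
print("error increase per dropped predictor:", {k: round(v, 3) for k, v in d.items()})
```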

Details. We need to condition on the selection events: (1) the chosen model has minimum CV error, and (2) predictor $j$ is in the chosen model. We use tricks of Markovic and Taylor (adding noise in CV) and Xiaoying Tian (adding $\pm$ noise for $C_p$) to obtain approximately debiased prediction error estimates, and the bootstrap to get approximate type I error control.

Table: prostate cancer results. The leftmost column shows the fitted model from the lasso, and the remaining columns show the nearby models corresponding to the removal of each predictor (lcavol, lwt, svi, lcp, lbph, pgg45, age). [Table: for each model, which of lcavol, lwt, svi, lcp, lbph, pgg45 and age are included, together with its CV error, debiased error, test error, selection frequency, NextDoor p-value and post-selection feature p-value; the NextDoor p-value measures feature indispensability.]

Final comments. Paper on arXiv by Guan & Tibshirani. The NextDoor R package will soon be on CRAN. Idea: run glmnet to fit the model, then run NextDoor on the output to get a post-fitting summary report.
