Lecture 17 May 11, 2018


Stats 300C: Theory of Statistics, Spring 2018
Lecture 17, May 11, 2018
Prof. Emmanuel Candes
Scribe: Emmanuel Candes and Zhimei Ren

1 Outline

Agenda: Topics in selective inference
1. Inference After Model Selection Via the Lasso
2. Verifying the Winner

2 Lasso

2.1 Lasso Selection

This section is about the work of Lee, Sun, Sun and Taylor. To get CIs that are shorter than those from POSI (last lecture), we can restrict the analyst's choices by considering only LASSO selection events. The selected model is

$$\hat\beta = \arg\min_b \tfrac{1}{2}\|y - Xb\|_2^2 + \lambda\|b\|_1, \qquad \hat M = \{j : \hat\beta_j \neq 0\}.$$

Object of inference: $\beta^{\hat M} := X_{\hat M}^{+}\mu$ (regression coefficients in the reduced model), where $X_{\hat M}^{+}$ is the pseudo-inverse of $X_{\hat M}$.

Goal: CIs covering the parameters $\beta^{\hat M}$ ($\hat M$ random).

The LASSO selection event when $p = 3$ can be visualized in Figure 1. Each colored region corresponds to a selected set and sign pattern. Each region is a polytope, which can be written as an intersection of half-spaces, $\{y : Ay \le b\}$. The main idea for LASSO selection is to condition on the selection event and on the signs of the fitted coefficients, i.e.

$$y \mid \{\hat M = M,\ \hat s = s\} \;\sim\; \underbrace{N(\mu, \sigma^2 I)\cdot 1(Ay \le b)}_{\text{truncated multivariate normal}}.$$
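To see concretely why a selection event is a polytope, a small special case may help. The sketch below assumes an orthonormal design, $X^T X = I$, which the lecture does not assume; it is used here only because the lasso solution then reduces to soft-thresholding, so the inequalities can be read off directly.

```latex
% Assumption (illustration only): X^T X = I_p, with columns x_1, ..., x_p.
% The lasso solution is then soft-thresholding of x_j^T y at level lambda:
\hat\beta_j = \operatorname{sign}(x_j^T y)\,\bigl(|x_j^T y| - \lambda\bigr)_+ ,
\qquad j = 1, \dots, p .
% Selecting model M with signs s is therefore a set of linear inequalities in y:
\{\hat M = M,\ \hat s = s\}
  = \bigl\{\, y \;:\; s_j\, x_j^T y > \lambda \ \ (j \in M), \quad
      |x_j^T y| \le \lambda \ \ (j \notin M) \,\bigr\} .
```

Each condition is a half-space in $y$ (for example, $-s_j x_j^T y < -\lambda$), so stacking them gives a region of the form $\{y : Ay \le b\}$, up to the boundary.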

Figure 1: LASSO selection event

2.2 Computation

We wish to do inference about $\beta^M_j = e_j^T X_M^{+}\mu =: \eta^T\mu$, and for this we need the distribution of $\eta^T y \mid \{Ay \le b\}$. However, this can be a complicated mixture of truncated normals, and it is computationally expensive to sample from. A computationally tractable approach is to condition on more, namely on $P_{\eta^\perp} y$, the projection of $y$ onto the orthogonal complement of $\eta$:

$$\eta^T y \mid \{Ay \le b,\ P_{\eta^\perp} y\} \;\overset{d}{=}\; \underbrace{TN}_{\text{truncated normal}}\bigl(\underbrace{\eta^T\mu}_{\text{mean}},\ \underbrace{\sigma^2\|\eta\|^2}_{\text{var}},\ \underbrace{[V^-(y), V^+(y)]}_{\text{truncation interval}}\bigr).$$

The idea can be visualized in Figure 2. The equality in distribution above is not trivial; it crucially uses the fact that $\eta^T y$ and $P_{\eta^\perp} y$ are independent, since they are projections of a Gaussian vector with independent components along orthogonal directions. After this reduction, we only need to work with a univariate truncated normal distribution.

The CDF of a normal distribution truncated to $[a, b]$ is

$$F^{[a,b]}_{\mu,\sigma^2}(t) = \frac{\Phi\!\left(\frac{t-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right)}{\Phi\!\left(\frac{b-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right)}.$$

For a fixed value of $P_{\eta^\perp} y$, with $[v^-, v^+] = [V^-(y), V^+(y)]$,

$$F^{[v^-,v^+]}_{\eta^T\mu,\ \sigma^2\|\eta\|^2}(\eta^T y) \mid \{Ay \le b,\ P_{\eta^\perp} y\} \;\overset{d}{=}\; \mathrm{Unif}(0,1).$$

Now, since a mixture of uniforms is uniform, we obtain

$$F^{[V^-(y),V^+(y)]}_{\eta^T\mu,\ \sigma^2\|\eta\|^2}(\eta^T y) \mid \{Ay \le b\} \;\overset{d}{=}\; \mathrm{Unif}(0,1).$$

The above can be summarized in Theorem 1.
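The computation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the explicit formulas for $V^-(y)$ and $V^+(y)$ are not given in these notes and are taken here, as an assumption, from the standard construction for a polytope $\{y : Ay \le b\}$ with noise covariance $\sigma^2 I$. The function names `truncation_limits` and `tn_cdf` are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal CDF Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncation_limits(A, b, eta, y):
    """Return (V_minus, V_plus) for the polytope {y : A y <= b}.

    Assumes noise covariance sigma^2 * I. Writes y = z + c * (eta^T y) with
    c = eta / ||eta||^2, so each row of A y <= b bounds eta^T y from one side.
    """
    eta_sq = sum(e * e for e in eta)
    eta_y = sum(e * yi for e, yi in zip(eta, y))
    c = [e / eta_sq for e in eta]
    z = [yi - ci * eta_y for yi, ci in zip(y, c)]  # component unaffected by eta^T y
    v_minus, v_plus = -math.inf, math.inf
    for row, bj in zip(A, b):
        ac = sum(r * ci for r, ci in zip(row, c))
        az = sum(r * zi for r, zi in zip(row, z))
        if ac < 0:      # row gives a lower bound on eta^T y
            v_minus = max(v_minus, (bj - az) / ac)
        elif ac > 0:    # row gives an upper bound on eta^T y
            v_plus = min(v_plus, (bj - az) / ac)
    return v_minus, v_plus

def tn_cdf(t, mean, sd, lo, hi):
    """CDF at t of N(mean, sd^2) truncated to [lo, hi] (formula from the notes)."""
    num = norm_cdf((t - mean) / sd) - norm_cdf((lo - mean) / sd)
    den = norm_cdf((hi - mean) / sd) - norm_cdf((lo - mean) / sd)
    return num / den

# Toy polytope {1 <= y_1 <= 5} in R^2, with eta picking out the first coordinate.
A = [[-1, 0], [1, 0]]
b = [-1, 5]
eta, y = [1, 0], [2.0, 0.3]
v_minus, v_plus = truncation_limits(A, b, eta, y)
pivot = tn_cdf(2.0, 0.0, 1.0, v_minus, v_plus)  # pivot evaluated at eta^T mu = 0, sigma = 1
```

The plain difference of $\Phi$ values is fine for illustration but loses precision when the mean is far from the truncation interval; a careful implementation would work with tail probabilities instead.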


Figure 2: Conditional sampling distributions

Figure 3: Pivotal quantity is uniform

Theorem 1: Let $F^{I}_{\mu,\sigma^2}$ denote the CDF of $TN(\mu, \sigma^2; I)$. Then

$$F^{[V^-(y),V^+(y)]}_{\eta^T\mu,\ \sigma^2\|\eta\|^2}(\eta^T y) \mid \{Ay \le b\} \;\sim\; \mathrm{Unif}(0,1)$$

and is a pivotal quantity.

Now that we have

$$T := F^{[V^-(y),V^+(y)]}_{\eta^T\mu,\ \sigma^2\|\eta\|^2}(\eta^T y) \mid \{Ay \le b\} \;\sim\; \mathrm{Unif}(0,1),$$

we can invert the pivotal quantity to obtain intervals with selective Type I error control:

$$T \;\Longrightarrow\; a^-(\eta, y) \le \eta^T\mu \le a^+(\eta, y).$$

As a consequence, the conditional coverage is

$$P\bigl(a^-(\eta, y) \le \eta^T\mu \le a^+(\eta, y) \mid Ay \le b\bigr) = 0.95,$$

$$P\bigl(\beta^M_j \in C_j \mid \hat M = M,\ \hat s = s\bigr) = 1 - \alpha.$$
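Inverting the pivot can be sketched numerically as follows. This is an illustration under stated assumptions rather than the method used in the lecture: it uses an equal-tailed interval (the pivot set to $1-\alpha/2$ and $\alpha/2$), relies on the truncated normal CDF being decreasing in its mean, and searches for the endpoints by bisection within a fixed window of `width` standard deviations around the observed value. The names `solve_mean` and `selective_interval` are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def tn_cdf(t, mean, sd, lo, hi):
    """CDF at t of N(mean, sd^2) truncated to [lo, hi], using the nearer tail."""
    a, x, b = ((v - mean) / sd for v in (lo, t, hi))
    if a > 0:
        # Mean lies below the interval: use upper tails to avoid 1 - 1 cancellation.
        tail = lambda z: 0.5 * math.erfc(z / SQRT2)
        return (tail(a) - tail(x)) / (tail(a) - tail(b))
    cdf = lambda z: 0.5 * math.erfc(-z / SQRT2)
    return (cdf(x) - cdf(a)) / (cdf(b) - cdf(a))

def solve_mean(target, t, sd, lo, hi, width=20.0, iters=200):
    """Bisection for the mean m with tn_cdf(t, m, sd, lo, hi) == target.

    Uses the fact that the pivot is decreasing in m. The search window
    t +/- width * sd is an assumption of this sketch.
    """
    left, right = t - width * sd, t + width * sd
    if not tn_cdf(t, left, sd, lo, hi) >= target >= tn_cdf(t, right, sd, lo, hi):
        raise ValueError("root not bracketed; widen the search window")
    for _ in range(iters):
        mid = 0.5 * (left + right)
        if tn_cdf(t, mid, sd, lo, hi) > target:
            left = mid    # pivot still too large, so the root is at a larger mean
        else:
            right = mid
    return 0.5 * (left + right)

def selective_interval(t, sd, lo, hi, alpha=0.05):
    """Equal-tailed interval for eta^T mu from eta^T y = t truncated to [lo, hi]."""
    lower = solve_mean(1.0 - alpha / 2.0, t, sd, lo, hi)
    upper = solve_mean(alpha / 2.0, t, sd, lo, hi)
    return lower, upper

# Without truncation this reduces to the usual z-interval t +/- 1.96 * sd.
plain = selective_interval(2.0, 1.0, -math.inf, math.inf)
# With truncation to [1, inf) the interval is shifted toward smaller means.
selective = selective_interval(2.0, 1.0, 1.0, math.inf)
```

Here `t` plays the role of $\eta^T y$, `sd` of $\sigma\|\eta\|$, and `[lo, hi]` of $[V^-(y), V^+(y)]$.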

We also have false coverage rate (FCR) control:

$$E\left[\frac{\#\{j \in \hat M : C_j \text{ does not cover } \beta^{\hat M}_j\}}{|\hat M|}\right] \le \alpha.$$

Figure 4: Comparison on diabetes dataset

Figure 4 shows an experiment on the diabetes dataset. The selective intervals are close to the z-intervals for the significant variables. Data splitting widens the intervals by a factor of $\sqrt{2}$, and POSI widens them as well. To summarize, the LASSO approach provides shorter CIs than POSI; the price to pay is that we need to commit to the LASSO with a fixed value of $\lambda$.

We wish to emphasize that there are many other works in this area: Fithian et al. ('14), Lee et al. ('15), Lockhart et al. ('14), Van de Geer et al. ('14), Efron ('14), Javanmard et al. ('14), Leeb et al. ('14), ...

3 Verifying the Winner

We now discuss another selective inference problem, which has a different flavor. The material below is from Will Fithian's Ph.D. dissertation (Stanford University, 2015). In his dissertation, Will extended the location family results of Gutmann & Maymin ('87).

3.1 The Iowa Republican Poll (May, 2015)

The following table lists the results of a May 2015 poll for the Iowa Republican vote (there are 667 samples in this poll):

Table 1: Iowa Republican Vote

Candidate       Percentage of votes
Scott Walker    21%
Rand Paul       13%
Marco Rubio     13%
Ted Cruz        12%
...             ...

The question we want to ask ourselves is this: is Scott Walker truly winning? This is a different question from asking, before the poll, whether Scott Walker is winning. In the first case, what we are really asking is whether the current leader in the poll is truly leading the vote, whereas the second question focuses on Scott Walker specifically. Is the parameter we are interested in $\pi_{SW}/\pi_{RP}$ or $\pi_{\text{1st best}}/\pi_{\text{2nd best}}$, and how should we estimate these?

3.2 Selective Hypothesis Testing

How could we test whether Scott Walker is winning? That is, how can we test the null $H_0 : \pi_{SW} \le \max_{i \neq SW} \pi_i$? Here, we simply want to test (for a fixed $i$)

$$H_i = \bigcup_{j \neq i} H_{ij}, \qquad \text{where } H_{ij} \text{ denotes the hypothesis } \pi_i \le \pi_j,$$

and where $i$ realizes the maximum of the observed counts. From now on, we are therefore interested in the selection event $A_i = \{X_i > \max_{j \neq i} X_j\}$ (i.e., it is not really Scott Walker that we are interested in, but whoever comes first in the poll results). Hence, in this example we are not really selecting a model, just a hypothesis. Without loss of generality, we let $i = 1$ in the rest of these notes.

3.3 Selective test

Our selective test is as follows.

1. First we construct a selective p-value $p_{1j}$ for $H_{1j}$ on $A_1$. Again, assume without loss of generality that $j = 2$. If we condition on the counts for everyone except candidates 1 and 2 (and on $X_1 + X_2$), we have

$$X_1 \mid (X_1 + X_2, X_3, \dots, X_k) \;\sim\; \mathrm{Bin}\!\left(X_1 + X_2,\ \frac{\pi_1}{\pi_1 + \pi_2}\right).$$

If we further condition on $X_1 > X_2$, we have a truncated binomial, on which we want to test $\pi_1 \le \pi_2$ or, equivalently, that the binomial probability satisfies $\frac{\pi_1}{\pi_1 + \pi_2} \le \frac{1}{2}$. To construct the p-value $p_{12}$ for this test, with $m = X_1 + X_2$, we compute

$$p_{12} = P\bigl(\mathrm{Binom}(m, 1/2) \ge X_1 \mid \mathrm{Binom}(m, 1/2) > m/2\bigr) = \frac{P(\mathrm{Binom}(m, 1/2) \ge X_1)}{P(\mathrm{Binom}(m, 1/2) > m/2)} = 2\,P(\mathrm{Binom}(m, 1/2) \ge X_1) = 2 \sum_{k = X_1}^{X_1 + X_2} \binom{X_1 + X_2}{k}\, 2^{-(X_1 + X_2)}.$$

The factor 2 is exact when $m$ is odd, since then $P(\mathrm{Binom}(m, 1/2) > m/2) = 1/2$; for even $m$ the denominator is slightly smaller than $1/2$. Should we worry that we condition on so much that we would decrease the power of our test?

2. We iterate this procedure to construct p-values for the different hypotheses $H_{1j}$ for all $j$. To test the union $H_1 = \bigcup_{j > 1} H_{1j}$, we combine these p-values by defining $p_1 = \max_{j \ge 2} p_{1j}$, and reject if $p_1 \le \alpha$. This is valid since

$$P(p_1 \le \alpha \mid A_1) \;\le\; \min_{j > 1} P(p_{1j} \le \alpha \mid A_1) \;\le\; \alpha, \qquad (1)$$

because we constructed level-$\alpha$ tests for each $H_{1j}$. It is not difficult to see that the maximum $p_{1j}$ above is achieved when $j$ is the runner-up.

3.4 Examining the selective test: back to classics!

Now we consider the question of whether Scott Walker is really more likely to win than Rand Paul, and by how much. It follows from our analysis that we only need to look at Walker vs. Paul, with $p_{SW,RP}$ based on

$$\mathcal{L}(X_{SW} \mid X_{SW} + X_{RP} = 227,\ X_{\text{others}},\ \text{SW wins}) = \mathcal{L}(X_{SW} \mid X_{SW} + X_{RP} = 227,\ X_{SW} \ge 114).$$

We observe that selective inference recovers the classical answer (see Gutmann & Maymin ('87)):

$$p_{SW} = \max_{j \neq SW} p_{SW,j} = 2\,P(\mathrm{Binom}(227, 1/2) \ge 140).$$

If Walker and Paul were in fact tied, then Walker's share of their combined 227 votes would be distributed as Binomial(227, 0.5). Because the (two-tailed) p-value for this pairwise test is below $\alpha$, we reject the null and conclude that Walker is really winning.
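The selective p-value for the winner against the runner-up can be computed exactly with integer arithmetic. The sketch below is only an illustration of the truncated-binomial formula in these notes; the function names are hypothetical, and the counts 140 and 87 are the Walker and Paul counts implied by the combined total of 227 and the threshold 140 used above.

```python
from math import comb

def binom_upper_tail(m, x):
    """P(Binom(m, 1/2) >= x), computed exactly from binomial coefficients."""
    return sum(comb(m, k) for k in range(x, m + 1)) / 2 ** m

def selective_p_value(x1, x2):
    """P(B >= x1 | B > m/2) for B ~ Binom(m, 1/2) with m = x1 + x2.

    This is the p-value for pi_1 <= pi_2 conditional on candidate 1 leading.
    """
    if x1 <= x2:
        raise ValueError("the selection event requires x1 > x2")
    m = x1 + x2
    first_winning_count = m // 2 + 1  # smallest integer strictly above m/2
    return binom_upper_tail(m, x1) / binom_upper_tail(m, first_winning_count)

# Walker vs. Paul: 140 and 87 of their combined 227 votes.
p_walker = selective_p_value(140, 87)
print(p_walker)
```

Because 227 is odd, the conditioning event has probability exactly 1/2, so this coincides with twice the one-sided binomial tail, i.e. the classical two-tailed p-value.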
