Recent Developments in Post-Selection Inference


Yotam Hechtlinger
Department of Statistics

Shashank Singh
Department of Statistics
Machine Learning Department

Abstract

It is common in modern applications to use data-dependent model selection tools to select a promising model before drawing inference on the parameters of the selected model. However, this simple series of steps conceals a significant fault that is often left unattended: the act of selection biases the distributions of test statistics and makes standard inference procedures unsound. This is referred to as the problem of post-selection inference. We review recent work showing how to correct for this in certain common settings.

1 Introduction

In the linear regression setting, if $M$ denotes the set of indices for the predictors spanning $X_M$, and $\hat\beta^M$ the least squares solution for the regression of $Y$ on the predictors $X_M$, then it is well known that, under the null hypothesis that the true coefficient is 0, for each $j \in M$,
\[
  Z_j^M := \frac{\hat\beta_j^M}{\mathrm{se}(\hat\beta_j^M)} \sim N(0, 1).
\]
This pivotal distribution is the driving force behind inference on least squares solutions. It is implicitly assumed that $\beta_j^M$ was chosen to be the predictor of interest prior to sampling the data, and hence that the model is fixed. Furthermore, it is also assumed that the only predictor of interest is $j$; if there is a need to provide inference over a larger set of predictors, standard corrections for multiple comparisons, such as the Bonferroni correction, can be used. In practice, the widespread use of model selection tools introduces a stochastic component to the selection of the model. When the data is used twice (i.e., for both model selection and inference), the assumption of a fixed model is violated and the inference is no longer sound.

1.1 An illustrative simulation

The following simulation demonstrates the problem in a simple case, which, unfortunately, can describe an enthusiastic but naive researcher hunting for a discovery. Consider $p$ independent predictors, $X_1, \ldots, X_p \sim N(0, 1)$, and an independent observation vector $Y \sim N(0, 1)$ (i.e., all vectors are independent random noise). In order to find the most promising predictor explaining $Y$, $p$ different univariate regression models are fitted to $Y$, and the most significant is chosen at each iteration. Figure 1 shows the distribution of $\max_{j,M} Z_j^M$ as a function of the number of predictors $p$. In the simulation, univariate regression was used in order to simplify the example. In multivariate model selection, a model $X_M$ has $|M|$ different predictors. Although there is dependence between the statistics of the same predictor across different models, naively correcting for the maximum would require $\sum_{k=1}^{p} k \binom{p}{k}$ tests. This exponential growth already results in 192 different tests for $p = 6$, emphasizing that the unsound distribution of $\max_{j,M} Z_j^M$ presented in Figure 1 (for $n = 100$), which offers less than 8% coverage, is easily accessible with improper use of model selection, even in low-dimensional settings.
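The following is a minimal sketch (our own, not the authors' code) of the Section 1.1 simulation: independent noise predictors and response, $p$ univariate regressions, and the naive strategy of keeping the largest $|Z|$-statistic. The sample size $n = 100$ matches Figure 1; the replication count and the grid of $p$ values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 100, 2000
z_crit = 1.96  # nominal two-sided cutoff z_{1 - alpha/2} for alpha = 0.05

for p in (1, 2, 5, 10, 20):
    max_abs_z = np.empty(n_rep)
    for r in range(n_rep):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)          # pure noise: the global null holds
        z = np.empty(p)
        for j in range(p):
            xj = X[:, j]
            beta = xj @ y / (xj @ xj)       # univariate least squares slope
            resid = y - beta * xj
            se = np.sqrt(resid @ resid / (n - 1) / (xj @ xj))
            z[j] = beta / se                # Z-statistic of the j-th univariate model
        max_abs_z[r] = np.abs(z).max()
    # fraction of pure-noise datasets where the "best" predictor looks significant
    print(f"p = {p:2d}: rejection rate at |Z| > 1.96: {np.mean(max_abs_z > z_crit):.3f}")
```

For $p = 1$ the rejection rate stays near the nominal 5%, but it grows rapidly with $p$, which is exactly the distortion Figure 1 displays.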

[Figure 1: The distribution of $\max_{j,M} Z_j^M$ for several values of the dimension $p$. When $p = 1$, $Z_j^M \sim N(0, 1)$. However, as $p$ increases, rejecting when $|Z_j^M| > z_{1-\alpha/2}$ is almost certain to produce false positive results.]

1.2 Historical approaches to the post-selection inference problem

One approach to ensuring the validity of post-selection inference embeds the problem in the field of multiple comparisons. By treating the inference for each coefficient as a single inference, controlling the family-wise error rate over the family of inferences guarantees valid inference for the selected model. The simplest such solution is to apply a Bonferroni correction, dividing the type I error threshold by the (large) number of tests being performed. Recently, Berk et al. (2013) provided a less conservative approach which simultaneously bounds all linear functions that arise as coefficient estimates in all submodels. However, due to the exponential growth of the number of inferences as a function of the dimension, these simultaneous methods quickly become too conservative.

A more precise solution to the problem requires accounting for the specific nature of the selection procedure. For the example above, this is quite straightforward: since each $Z_j^M \sim N(0, 1)$, $\max_{j,M} Z_j^M$ has the cumulative distribution function (CDF) $\Phi^p$, where $\Phi$ denotes the CDF of $N(0, 1)$, and, therefore,
\[
  P_{H_0}\!\left( \max_{j,M} Z_j^M \in [-c, c] \right) = \Phi^p(c) - \Phi^p(-c).
\]
For $p \geq 3$, $\Phi^p(-c) \approx 0$, and a one-sided test typically suffices. Thus a valid acceptance region controls $\Phi^p(c) = 1 - \alpha$, which is equivalent to rejecting when the maximum exceeds $c = z_{(1-\alpha)^{1/p}}$.

In more general cases, understanding the distribution of the statistics conditioned on the selection is not as trivial. Conditional inference has been actively researched throughout the years by Buehler and Feddersen (1963), Brown (1967), Olshen (1973), and others. Weinstein et al. (2013) provide a conditional approach to confidence intervals for the case of Gaussian distributions, which can be utilized when selection is determined by some predefined threshold (say $z_{1-\alpha/2}$).
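To make this correction concrete, the short sketch below (our own) computes the exact cutoff $c = z_{(1-\alpha)^{1/p}}$ and, for comparison, the selective type I error of the naive cutoff $z_{1-\alpha/2} = 1.96$ in the idealized setting of $p$ independent $N(0,1)$ statistics; $\alpha = 0.05$ is an arbitrary choice.

```python
from scipy.stats import norm

alpha = 0.05
for p in (1, 2, 5, 10, 20):
    c = norm.ppf((1 - alpha) ** (1 / p))                   # solves Phi^p(c) = 1 - alpha
    naive = 1 - (norm.cdf(1.96) - norm.cdf(-1.96)) ** p    # P(max_j |Z_j| > 1.96) under H0, independent Z_j
    print(p, round(c, 3), round(naive, 3))
```

Already at $p = 20$ the naive cutoff rejects a true null almost two thirds of the time, while the corrected cutoff grows only slowly with $p$.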

1.3 Recent progress

In this project, we review two recent breakthroughs in the conditional inference school, which have contributed substantially to solving the problem of post-selection inference in several common settings. The first, due to Fithian et al. (2014), approaches conditional inference in a general way by defining conditional tests and confidence intervals, and then explores in detail how to perform conditional inference in an exponential family setting. This results in an important theoretical framework, but involves quantities that are not always easy to compute in practice. The second paper, due to Taylor et al. (2014), considers the special case of a Gaussian noise model and a specific class of polyhedral selection procedures, which select variables based on a set of affine constraints on the response variable. In this case, they are able to derive p-values and selection intervals for exact inference conditioned on the model selected. While this approach has less generality, in contrast to Fithian et al. (2014), Taylor et al. (2014) are able to give easily computable tools for inference, as well as tight, computationally efficient approximations for several common selection procedures. In the following sections we present the methods and results of both Fithian et al. (2014) and Taylor et al. (2014) in greater detail, and then conclude with a general discussion of post-selection inference.

2 Conditioning on selection in exponential families

Fithian et al. (2014) provide definitions and notation for conditional inference, and demonstrate how it can be performed in the exponential family setting. Their leading guideline is that the inference must be valid given that the question was asked, where a question is precisely defined as a tuple $(M, H_{0,j}^M)$, testing the hypothesis $H_{0,j}^M$ for the $j$-th predictor in the selected model $M$. In that case, a level-$\alpha$ selective test is expected to control the selective type I error:
\[
  P_{M, H_0}\!\left( \text{reject } H_{0,j}^M \,\middle|\, (M, H_{0,j}^M) \text{ selected} \right) \leq \alpha.
\]
However, Fithian et al. (2014) develop an even more general framework for the problem. For any event $A$, a hypothesis test is a function $\phi(y)$ of the random variable $Y$, taking values in $[0, 1]$. $\phi(y)$ is said to control the selective type I error if
\[
  E_F[\phi(Y) \mid A] \leq \alpha, \qquad \text{for all } F \in H_0.
\]
Adopting the conditional point of view, classical statistical theory can be restated in a conditional manner, defining power functions, confidence intervals, tests, and admissibility as expected in a conditional context. For example, selective confidence intervals have $1 - \alpha$ coverage for the parameter $\theta(F)$ if
\[
  P_F\!\left( \theta(F) \in C(Y) \mid A \right) \geq 1 - \alpha, \qquad \text{for all } F,
\]
and they can be obtained by inverting selective tests, as in the classical case.

The major question remaining is how to control the conditional distributions of test statistics. The general case is often quite hard, but a main result of the paper is that, for test statistics with exponential family distributions, it follows naturally. If $Y$ follows a multi-parameter exponential family, then by the factorization theorem
\[
  Y \sim \exp\{\theta^\top T(y) - \psi(\theta)\}\, f_0(y),
\]
and so, conditioning $Y$ on the event $A$,
\[
  (Y \mid Y \in A) \sim \exp\{\theta^\top T(y) - \psi(\theta)\}\, f_0(y)\, \mathbb{I}\{y \in A\},
\]
which is also an exponential family distribution. This reduces the problem to the well known problem of exponential family inference, in which many classical tools are available, such as the uniformly most powerful unbiased (UMPU) test developed by Lehmann and Scheffe (1955).
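As a concrete one-parameter instance of the display above (our illustration, not an example taken from Fithian et al. (2014)), let $Y \sim N(\theta, 1)$ and condition on the selection event $A = \{Y > c\}$. Then
\[
  f_\theta(y \mid Y \in A)
  = \frac{e^{\theta y - \theta^2/2}\, \phi(y)\, \mathbb{I}\{y > c\}}{\Phi(\theta - c)}
  = \exp\{\theta y - \psi_A(\theta)\}\, \phi(y)\, \mathbb{I}\{y > c\},
  \qquad \psi_A(\theta) = \tfrac{\theta^2}{2} + \log \Phi(\theta - c),
\]
where $\phi$ is the standard normal density. The conditional law is again a one-parameter exponential family with the same sufficient statistic $T(y) = y$, only with the carrier truncated and the log-partition function $\psi$ replaced by $\psi_A$, so classical exponential family tests can be applied directly.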
However, even with the UMPU test, computing the test values requires knowing the likelihood function $L_\theta(Y \mid A)$ for all $\theta \in \Theta$. In simple cases this can be computed explicitly, such

as the selection of the maximum described in the simulation. In most cases, however, computationally expensive Monte Carlo methods are required.

The final point to be discussed is the choice of the event $A$ on which to condition. $A$ must be generated from a selection variable partitioning the space according to the different models, but there are many ways to partition the space in such a way, and the inference can vary significantly with the choice of the conditioning event. An important result demonstrates that the finer the partitioning, the more information is absorbed by the conditioning, leaving less for inference. A particular case of interest is data splitting, which can be viewed as a scenario in which the conditioning discards all of the information in the training set. A coarser partition would condition only on the training set having produced the selected model, yielding greater power. The paper demonstrates that it is almost always possible to choose such a coarser partition than the one implied by data splitting, and hence data splitting is inadmissible.

3 The special case of Gaussian inference after polyhedral selection

Taylor et al. (2014) pioneer a new exact approach to post-selection inference for a class of variable selection procedures that are polyhedral (as defined in Section 3.1). For such selectors, they derive exact p-values for testing the significance of predictor variables, conditioned on the result of variable selection. To do this, they make two assumptions on the distribution of the data. First, they assume the covariate matrix $X$ lies in general position (a very weak rank-like assumption, which is satisfied almost surely by continuously distributed variables). The second and more restrictive assumption is that the noise is Gaussian. That is, the true model has the form
\[
  y = \theta(X) + \varepsilon, \qquad (1)
\]
where $\varepsilon \sim N(0, \Sigma)$ and $\theta : \mathbb{R}^p \to \mathbb{R}$ is some (not necessarily linear) function of the predictors, applied to each row of $X$. In this section, we first describe the structure of polyhedral selection and outline how this allows us to perform valid post-selection inference under the model (1). We then show that several common variable selection procedures are polyhedral, and derive the corresponding post-selection inference procedures.

[Figure 2: Illustration of marginal screening as a polyhedral selector. Here, we observe two samples and have three covariates. If the response $y \in \mathbb{R}^2$ falls in either shaded region, then variables $x_1$ and $x_3$ will be selected (as they are sufficiently correlated with $y$, while $x_2$ is not). The bottom-left and top-right regions correspond to negative and positive correlations, and hence negative and positive estimated coefficients, respectively.]

3.1 Polyhedral selection

Suppose we observe $n$ samples of $p$ predictor variables $(x_1, \ldots, x_p) \in \mathbb{R}^p$ and a corresponding response variable $y \in \mathbb{R}$, giving a covariate matrix $X \in \mathbb{R}^{n \times p}$ and a response vector $y \in \mathbb{R}^n$. Recall that a polyhedron in $\mathbb{R}^n$ is any set $P \subseteq \mathbb{R}^n$ defined by a finite set of affine constraints (i.e., $P = \{z \in \mathbb{R}^n : \Gamma z \geq u\}$ for some $\Gamma \in \mathbb{R}^{m \times n}$, $u \in \mathbb{R}^m$, where the inequality holds componentwise). A variable selection procedure is called polyhedral if, for any covariate matrix $X \in \mathbb{R}^{n \times p}$, the sets of outcomes $y \in \mathbb{R}^n$ resulting in any particular selection of variables with particular correlation signs $\mathrm{sign}(X_j^\top y)$ are all polyhedra.

As an illustrative example, a very simple polyhedral selection procedure is marginal screening (MS), illustrated in Figure 2. MS simply selects those variables which are sufficiently correlated with the response. That is, for some $c \in [0, 1]$, MS will select a variable $x_j$ if and only if $|X_j^\top y| \geq c \|X_j\|_2$, where $c$ can be chosen based on the desired number of selected variables. Suppose, for instance, that MS selects the $l$ variables $\{x_{k_i}\}_{i=1}^{l}$ with a particular sequence of signs $\{s_i\}_{i=1}^{l} = \{\mathrm{sign}(X_{k_i}^\top y)\}_{i=1}^{l}$. This occurs if and only if, for each selected $x_{k_i}$, $s_i X_{k_i}^\top y \geq c \|X_{k_i}\|_2$ and, for each unselected $x_j$, both $X_j^\top y \geq -c \|X_j\|_2$ and $-X_j^\top y \geq -c \|X_j\|_2$. Thus, if, for each $i \in \{1, \ldots, l\}$, $\Gamma \in \mathbb{R}^{(l + 2(p-l)) \times n}$ has a row $s_i X_{k_i}^\top$ and $u \in \mathbb{R}^{l + 2(p-l)}$ has a corresponding entry $c \|X_{k_i}\|_2$, and, for each $j \notin \{k_1, \ldots, k_l\}$, $\Gamma$ has two rows, $X_j^\top$ and $-X_j^\top$, and $u$ has two corresponding entries, both $-c \|X_j\|_2$, then this selection event is exactly the polyhedron $\{y \in \mathbb{R}^n : \Gamma y \geq u\}$. See Lee and Taylor (2014) for a detailed study of inference after model selection through marginal screening.

The next section presents two lemmas, which together suggest how to perform exact post-selection inference after polyhedral selection.

3.2 Conditional Gaussian inference after polyhedral selection

The hypothesis that $y$ is uncorrelated with $x_j$ can be phrased as $x_j^\top \theta = 0$ (recalling that $y$ is $\theta$ plus noise), suggesting that we are interested in the distribution of $v^\top y$ for certain $v \in \mathbb{R}^n$. Lemma 1 gives a representation of a polyhedron in terms of an arbitrary such $v \in \mathbb{R}^n$.

[Figure 3: Illustration of the polyhedron $P = \{y \in \mathbb{R}^n : \Gamma y \geq u\}$ encoded in terms of $V^{lo}(y)$, $V^{hi}(y)$, and $V^0(y)$. Since $v$ points precisely to the right, $V^{lo}(y)$ encodes the rightmost constraint to the left of $y$, $V^{hi}(y)$ encodes the leftmost constraint to the right of $y$, and $V^0(y)$ encodes constraints parallel to $v$ (of which none are shown here).]

Lemma 1 (Polyhedral selection as truncation): Suppose $y \sim N(\theta, \Sigma)$, where $\theta \in \mathbb{R}^n$ is unknown but $\Sigma \in \mathbb{R}^{n \times n}$ is known. Consider a polyhedron $P = \{y \in \mathbb{R}^n : \Gamma y \geq u\}$, where $\Gamma \in \mathbb{R}^{m \times n}$ and $u \in \mathbb{R}^m$. If $v^\top \Sigma v \neq 0$, then, for $\rho := \frac{\Gamma \Sigma v}{v^\top \Sigma v}$ and
\[
  V^0(y) := \max_{j : \rho_j = 0} \; u_j - (\Gamma y)_j, \qquad
  V^{hi}(y) := \min_{j : \rho_j < 0} \; \frac{u_j - (\Gamma y)_j + \rho_j v^\top y}{\rho_j}, \qquad
  V^{lo}(y) := \max_{j : \rho_j > 0} \; \frac{u_j - (\Gamma y)_j + \rho_j v^\top y}{\rho_j},
\]
the triple $(V^0(y), V^{hi}(y), V^{lo}(y))$ is independent of $v^\top y$, and
\[
  y \in P \iff V^0(y) \leq 0 \text{ and } v^\top y \in [V^{lo}(y), V^{hi}(y)].
\]
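To make Lemma 1 and the marginal screening construction concrete, here is a minimal numerical sketch (our own code, not from Taylor et al. (2014) or Lee and Taylor (2014)). The sample size, dimension, threshold $c$, and the simple data-generating model are illustrative assumptions, and the contrast $v$ is taken to be the one testing the first selected coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, c, sigma = 25, 4, 0.5, 1.0
X = rng.standard_normal((n, p))
y = X[:, 0] + sigma * rng.standard_normal(n)    # hypothetical data-generating model

# Marginal screening: select j iff |X_j' y| >= c * ||X_j||_2, and record signs.
corr = X.T @ y
norms = np.linalg.norm(X, axis=0)
selected = np.flatnonzero(np.abs(corr) >= c * norms)
signs = np.sign(corr[selected])
assert selected.size > 0

# Encode the selection event {Gamma y >= u}: one row per selected variable,
# two rows per unselected variable, as in Section 3.1.
rows, lims = [], []
for s, k in zip(signs, selected):
    rows.append(s * X[:, k]); lims.append(c * norms[k])
for j in range(p):
    if j not in selected:
        rows.append(X[:, j]);  lims.append(-c * norms[j])
        rows.append(-X[:, j]); lims.append(-c * norms[j])
Gamma, u = np.array(rows), np.array(lims)
assert np.all(Gamma @ y >= u)                   # the observed y lies in the polyhedron

# Lemma 1: truncation limits for a contrast v (here, the first selected coefficient).
Sigma = sigma ** 2 * np.eye(n)
j0 = selected[0]
v = X[:, j0] / norms[j0] ** 2                   # v'y is the univariate LS coefficient
rho = Gamma @ Sigma @ v / (v @ Sigma @ v)
num = u - Gamma @ y + rho * (v @ y)
V_lo = np.max(num[rho > 0] / rho[rho > 0]) if np.any(rho > 0) else -np.inf
V_hi = np.min(num[rho < 0] / rho[rho < 0]) if np.any(rho < 0) else np.inf
V_0 = np.max((u - Gamma @ y)[rho == 0]) if np.any(rho == 0) else -np.inf
print(V_lo, v @ y, V_hi, V_0)
```

Under the $\Gamma y \geq u$ convention adopted above, the printed values satisfy $V^{lo} \leq v^\top y \leq V^{hi}$ (and $V^0 \leq 0$ whenever constraints with $\rho_j = 0$ exist), as Lemma 1 requires.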

The intuition behind Lemma 1 is illustrated in Figure 3. The independence results from the fact that $(V^0(y), V^{hi}(y), V^{lo}(y))$ is a function of $y - \frac{\Sigma v v^\top y}{v^\top \Sigma v}$, which is independent of $v^\top y$ because $y \sim N(\theta, \Sigma)$. Since $v^\top y$ has a Gaussian distribution, Lemma 1 tells us that $v^\top y \mid y \in P$ has a 1-dimensional truncated Gaussian distribution.

We now establish some notation which will allow us to construct our desired test statistic. For $a, b, \mu, \sigma \in \mathbb{R}$ with $a < b$ and $\sigma > 0$, let $F_{\mu,\sigma^2}^{[a,b]}$ denote the CDF of $N(\mu, \sigma^2)$ truncated to $[a, b]$, i.e.,
\[
  F_{\mu,\sigma^2}^{[a,b]}(x) = \frac{\Phi\!\left(\frac{x - \mu}{\sigma}\right) - \Phi\!\left(\frac{a - \mu}{\sigma}\right)}{\Phi\!\left(\frac{b - \mu}{\sigma}\right) - \Phi\!\left(\frac{a - \mu}{\sigma}\right)}, \qquad x \in [a, b],
\]
where $\Phi$ denotes the CDF of $N(0, 1)$. We then have the following pivotal statistic:

Lemma 2 (Conditional p-values): For $v \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$, if $v^\top \Sigma v \neq 0$, then
\[
  F_{v^\top \theta,\; v^\top \Sigma v}^{[V^{lo}(y),\, V^{hi}(y)]}(v^\top y) \;\Big|\; y \in P \;\sim\; \mathrm{Unif}(0, 1),
\]
i.e., $F_{v^\top \theta,\; v^\top \Sigma v}^{[V^{lo}(y),\, V^{hi}(y)]}(v^\top y)$ is a valid p-value conditioned on $y \in P$.

The proof of Lemma 2 is also fairly simple, and depends essentially on the independence of $(V^0, V^{lo}, V^{hi})$ and $v^\top y$. Lemma 2 in turn gives rise to the following two-sided conditional hypothesis test of $H_0 : v^\top \theta = 0$ against $H_1 : v^\top \theta \neq 0$ (a more powerful one-sided conditional test for the alternative $H_1 : v^\top \theta > 0$ is similarly easy to derive):

Lemma 3 (Two-sided conditional inference after polyhedral selection): Suppose $v^\top \Sigma v \neq 0$. Define the statistic
\[
  T := 2 \min\left\{ F_{0,\, v^\top \Sigma v}^{[V^{lo}(y),\, V^{hi}(y)]}(v^\top y),\; 1 - F_{0,\, v^\top \Sigma v}^{[V^{lo}(y),\, V^{hi}(y)]}(v^\top y) \right\}.
\]
Then $T$ is a valid conditional p-value for $H_0$; i.e., under $H_0$, $T \mid y \in P \sim \mathrm{Unif}(0, 1)$.
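To make Lemmas 2 and 3 concrete, the following minimal sketch (our own) implements the truncated Gaussian CDF and the two-sided conditional p-value; the truncation limits and the observed value of $v^\top y$ are placeholders standing in for the quantities computed via Lemma 1.

```python
import numpy as np
from scipy.stats import norm

def trunc_gauss_cdf(x, mu, sigma2, a, b):
    """CDF of N(mu, sigma2) truncated to [a, b], evaluated at x (naive formula)."""
    s = np.sqrt(sigma2)
    num = norm.cdf((x - mu) / s) - norm.cdf((a - mu) / s)
    den = norm.cdf((b - mu) / s) - norm.cdf((a - mu) / s)
    return num / den

def selective_p_value(vty, var_vty, V_lo, V_hi):
    """Two-sided conditional p-value of Lemma 3 for H0: v'theta = 0."""
    F = trunc_gauss_cdf(vty, 0.0, var_vty, V_lo, V_hi)
    return 2 * min(F, 1 - F)

# Placeholder numbers purely for illustration: v'y = 2.1 would look "significant"
# unconditionally, but conditioned on the selection event it is not.
print(selective_p_value(vty=2.1, var_vty=1.0, V_lo=1.5, V_hi=6.0))
```

In practice the naive CDF formula can be numerically unstable when the truncation region lies far in the tail, which is one reason dedicated implementations are preferred.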

3.3 Application to forward stepwise regression

In Section 3.1, we showed that a simple selection procedure, marginal screening, is polyhedral. A somewhat more elaborate selection procedure, forward stepwise regression (FS), is discussed by Taylor et al. (2014). FS is an iterative procedure which, in each iteration, adds the variable causing the largest drop in residual error (the assumption that $X$ has columns in general position guarantees uniqueness of the selected variables and their signs). In particular, let $A_k = [j_1, \ldots, j_k]$ and $s_{A_k} = [s_1, \ldots, s_k]$ denote the ordered list of active variables and their signs (upon entering) after $k$ iterations. Since FS greedily chooses variables which minimize the residual sum of squares (RSS), for any $j \notin \{j_1, \ldots, j_k\}$,
\[
  \mathrm{RSS}(y, X_{A_k}) \;\leq\; \mathrm{RSS}(y, X_{A_{k-1} \cup \{j\}}) \qquad (2)
\]
and
\[
  s_k = \mathrm{sign}\!\left(e_k^\top (X_{A_k})^{+}\, y\right). \qquad (3)
\]
This allows us to express FS as a polyhedral selection method, as follows.

Theorem 1 (FS selection is polyhedral): The set $P := \{y \in \mathbb{R}^n : \hat{A}_k(y) = A_k,\ \hat{s}_{A_k}(y) = s_{A_k}\}$ of response vectors $y$ resulting in the FS active set $A_k$ and signs $s_{A_k}$ is a polyhedron $P = \{y \in \mathbb{R}^n : \Gamma y \geq 0\}$.

Proof: We construct $\Gamma$ by adding $2(p - k)$ rows in each iteration $k$ of FS. If $j_1$ and $s_1$ are the first variable and sign chosen by FS, then the conditions (2) and (3) are equivalent to
\[
  \frac{s_1 X_{j_1}^\top y}{\|X_{j_1}\|_2} \;\geq\; \pm \frac{X_j^\top y}{\|X_j\|_2}, \qquad j \notin A_k.
\]
Thus, we add $2(p - k)$ rows to $\Gamma$ of the form
\[
  \frac{s_1 X_{j_1}^\top}{\|X_{j_1}\|_2} \pm \frac{X_j^\top}{\|X_j\|_2}
\]
(with one row for each pair of a $\pm$ sign and $j \notin A_k$); later iterations are handled analogously.

To test the null hypothesis $H_0 : v^\top \theta = 0$ against $H_1 : v^\top \theta \neq 0$, we simply compute $V^{lo}(y)$, $V^{hi}(y)$, $V^0(y)$ as in Lemma 1 and then use the test statistic
\[
  T_k := 2 \min\left\{ F_{0,\, \sigma^2 \|v\|_2^2}^{[V^{lo}(y),\, V^{hi}(y)]}(v^\top y),\; 1 - F_{0,\, \sigma^2 \|v\|_2^2}^{[V^{lo}(y),\, V^{hi}(y)]}(v^\top y) \right\},
\]
and, similarly, a $(1 - \alpha)$-confidence interval for $v^\top \theta$ is $[\delta_{\alpha/2}, \delta_{1-\alpha/2}]$ satisfying
\[
  1 - F_{\delta_{\alpha/2},\, \sigma^2 \|v\|_2^2}^{[V^{lo}(y),\, V^{hi}(y)]}(v^\top y) = \alpha/2
  \qquad \text{and} \qquad
  1 - F_{\delta_{1-\alpha/2},\, \sigma^2 \|v\|_2^2}^{[V^{lo}(y),\, V^{hi}(y)]}(v^\top y) = 1 - \alpha/2.
\]

3.4 Other consequences

The ideas presented in Section 3.2 can be applied and extended in a straightforward manner to derive other useful tools for model selection and post-selection inference, some of which we discuss below.

Other Selection Procedures: Lee et al. (2013) apply the polyhedral selection framework to the lasso at any value of the regularization parameter $\lambda$ and derive appropriate p-values and selection intervals. Taylor et al. (2014) do the same for least angle regression (LAR). Because, in practice, the constraint matrix $\Gamma$ used in Lemma 1 can grow very large (with roughly $4pk$ rows after $k$ steps of LAR), Taylor et al. (2014) also derive a much smaller approximate set of constraints (with at most $k + 1$ rows), which they show yields a highly accurate and computationally efficient approximation to the true selection polyhedron.

Selection Intervals: The hypothesis test presented in Lemma 3 can be inverted to compute exact selection intervals (as described in Section 4.2).

Stopping Rules: Using general methods for sequential hypothesis testing from G'Sell et al. (2013b), we can establish a stopping rule for iterative selection methods (such as FS and LAR) with guarantees on the false discovery rate.
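As a companion to the "Selection Intervals" point above, here is a sketch (our own) of inverting the Lemma 3 pivot numerically to obtain a selection interval for $v^\top \theta$; the observed $v^\top y$, its variance, the truncation limits, and the root-finding bracket are placeholder assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def trunc_gauss_sf(x, mu, sigma2, a, b):
    """1 - F for N(mu, sigma2) truncated to [a, b], evaluated at x."""
    s = np.sqrt(sigma2)
    den = norm.cdf((b - mu) / s) - norm.cdf((a - mu) / s)
    return (norm.cdf((b - mu) / s) - norm.cdf((x - mu) / s)) / den

def selection_interval(vty, var_vty, V_lo, V_hi, alpha=0.05, width=7.0):
    """Solve 1 - F_{delta}(v'y) = alpha/2 and 1 - alpha/2 for the two endpoints.
    `width` is a crude search bracket (in standard deviations) around v'y."""
    s = np.sqrt(var_vty)
    lo = brentq(lambda d: trunc_gauss_sf(vty, d, var_vty, V_lo, V_hi) - alpha / 2,
                vty - width * s, vty + width * s)
    hi = brentq(lambda d: trunc_gauss_sf(vty, d, var_vty, V_lo, V_hi) - (1 - alpha / 2),
                vty - width * s, vty + width * s)
    return min(lo, hi), max(lo, hi)

# Placeholder numbers matching the p-value sketch above.
print(selection_interval(vty=2.1, var_vty=1.0, V_lo=1.5, V_hi=6.0))
```

For these placeholder numbers the interval is far wider than the naive $v^\top y \pm 1.96$ interval and contains 0, consistent with the non-significant conditional p-value in the earlier sketch.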

4 Discussion

4.1 Coverage in Model Selection

It is important to emphasize that the entire field of inference after model selection is highly sensitive to its assumptions. Without the (usually unreasonable) assumption that the true model is the linear model just selected, the meaning of the coefficients is vague. Even assuming that there exists a real vector $\beta$ generating $Y$ in a linear way, any selected model will typically contain only a subset of the predictors, in which case each of the selected components estimates a linear combination of the entries of the full vector $\beta$. In particular, any estimator upon which we perform inference is highly influenced by correlations with predictors that are not selected. This point has been demonstrated in G'Sell et al. (2013a). One way of defining coverage in this context is as in Berk et al. (2013), in which an estimator should reliably cover its own expected value. This addresses the problem of defining confidence intervals and valid tests, although the interpretation of the interval is still unclear, as the expected value is also affected by the unselected predictors.

4.2 Selection Intervals

As mentioned previously, conditional hypothesis tests such as that presented in Lemma 3 can be inverted to compute selection intervals. In contrast with traditional confidence intervals, selection intervals guarantee coverage of the expectation of an estimator (which is itself a random variable resulting from model selection), consistent with the notion of conditional inference given in Section 4.1. That is, we can find an interval $I$ around an estimated coefficient $\hat\beta_j$ with the property
\[
  P\!\left( E[\hat\beta_j] \in I \,\middle|\, y \in P \right) = 1 - \alpha.
\]

4.3 Unconditional Inference

Taylor et al. (2014) point out a simple but important observation that characterizes the utility of conditional hypothesis tests. In particular, if $A$ is the event of a type I error, then, for a valid $\alpha$-level hypothesis test conditioned on some random variable $X = x$,
\[
  \alpha \geq P[A \mid X = x] = \frac{P[A,\, X = x]}{P[X = x]}, \quad \text{and hence} \quad \alpha\, P[X = x] \geq P[A,\, X = x].
\]
Integrating over $x$ then gives $\alpha \geq P[A]$. Thus, a valid conditional $\alpha$-level hypothesis test is also a valid unconditional $\alpha$-level hypothesis test. In the case of variable selection, $X$ might be the subset of selected variables. In fact, a similar argument shows that the conditional test is valid under any coarsening of the condition; for example, a valid test conditioned on $X = x$ would also be a valid test conditioned on $X \in \mathcal{X}$, for any set $\mathcal{X}$ containing $x$.

References

Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid post-selection inference. The Annals of Statistics, 41(2), 2013.

Larry Brown. The conditional level of Student's t test. The Annals of Mathematical Statistics, 38(4), 1967.

Robert J. Buehler and Alan P. Feddersen. Note on a conditional property of Student's t. The Annals of Mathematical Statistics, 34(3), 1963.

William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint, 2014.

Max Grazier G'Sell, Trevor Hastie, and Robert Tibshirani. False variable selection rates in regression. arXiv preprint, 2013a.

Max Grazier G'Sell, Stefan Wager, Alexandra Chouldechova, and Robert Tibshirani. Sequential selection procedures and false discovery rate control. arXiv preprint, 2013b.

Jason D. Lee and Jonathan E. Taylor. Exact post model selection inference for marginal screening. In Advances in Neural Information Processing Systems, 2014.

Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. Exact post-selection inference with the lasso. arXiv preprint, 2013.

E. L. Lehmann and Henry Scheffe. Completeness, similar regions and unbiased tests, Part II. Sankhya, 15, 1955.

R. A. Olshen. The conditional level of the F-test. Journal of the American Statistical Association, 68(343), 1973.

Jonathan Taylor, Richard Lockhart, Ryan J. Tibshirani, and Robert Tibshirani. Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint, 2014.

Asaf Weinstein, William Fithian, and Yoav Benjamini. Selection adjusted confidence intervals with more power to determine the sign. Journal of the American Statistical Association, 108(501), 2013.
