VALID POST-SELECTION INFERENCE


Submitted to the Annals of Statistics

VALID POST-SELECTION INFERENCE

By Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao
The Wharton School, University of Pennsylvania

It is common practice in statistical data analysis to perform data-driven variable selection and derive statistical inference from the resulting model. Such inference enjoys none of the guarantees that classical statistical theory provides for tests and confidence intervals when the model has been chosen a priori. We propose to produce valid post-selection inference by reducing the problem to one of simultaneous inference and hence suitably widening conventional confidence and retention intervals. Simultaneity is required for all linear functions that arise as coefficient estimates in all submodels. By purchasing simultaneity insurance for all possible submodels, the resulting post-selection inference is rendered universally valid under all possible model selection procedures. This inference is therefore generally conservative for particular selection procedures, but it is always less conservative than full Scheffé protection. Importantly it does not depend on the truth of the selected submodel, and hence it produces valid inference even in wrong models. We describe the structure of the simultaneous inference problem and give some asymptotic results.

Research supported in part by NSF Grant DMS. Corresponding Author: lbrown@wharton.upenn.edu. AMS 2000 subject classifications: Primary 62J05, 62J15. Keywords and phrases: Linear Regression, Model Selection, Multiple Comparison, Family-wise Error, High-dimensional Inference, Sphere Packing.

1. Introduction: The Problem with Statistical Inference after Model Selection. Classical statistical theory grants validity of statistical tests and confidence intervals assuming a wall of separation between the selection of a model and the analysis of the data being modeled. In practice, this separation rarely exists and more often a model is found by a data-driven selection process. As a consequence inferential guarantees derived from classical theory are invalidated. Among model selection methods that are problematic for classical inference, variable selection stands out because it is regularly taught, commonly practiced, and highly researched as a technology. Even though statisticians may have a general awareness that the data-driven selection of variables (predictors, covariates) must somehow affect subsequent classical inference from F- and t-based tests and confidence

intervals, the practice is so pervasive that it appears in classical undergraduate textbooks on statistics such as Moore and McCabe (2003).

The reason for the invalidation of classical inference guarantees is that a data-driven variable selection process produces a model that is itself stochastic, and this stochastic aspect is not accounted for by classical theory. Models become stochastic when the stochastic component of the data is involved in the selection process. (In regression with fixed predictors the stochastic component is the response.) Models are stochastic in a well-defined way when they are the result of formal variable selection procedures such as stepwise or stagewise forward selection or backward elimination or all-subset searches driven by complexity penalties (such as C_p, AIC, BIC, risk-inflation, LASSO, ...) or prediction criteria such as cross-validation, or more recent proposals such as LARS and the Dantzig selector (for an overview see, for example, Hastie, Tibshirani, and Friedman (2009)). Models are also stochastic but in an ill-defined way when they are informally selected through visual inspection of residual plots or normal quantile plots or other regression diagnostics. Finally, models become stochastic in an opaque way when their selection is affected by human intervention based on post hoc considerations such as "in retrospect only one of these two variables should be in the model" or "it turns out the predictive benefit of this variable is too weak to warrant the cost of collecting it." In practice, all three modes of variable selection may be exercised in the same data analysis: multiple runs of one or more formal search algorithms may be performed and compared, the parameters of the algorithms may be subjected to experimentation, and the results may be critiqued with graphical diagnostics; a round of fine-tuning based on substantive deliberations may finalize the analysis.

Posed so starkly, the problems with statistical inference after variable selection may well seem insurmountable. At a minimum, one would expect technical solutions to be possible only when a formal selection algorithm is (1) well-specified, (1a) in advance and (1b) covering all eventualities, (2) strictly adhered to in the course of data analysis, and (3) not improved on by informal and post-hoc elements. It may, however, be unrealistic to expect this level of rigor in most data analysis contexts, with the exception of well-conducted clinical trials. The real challenge is therefore to devise statistical inference that is valid following any type of variable selection, be it formal, informal, post hoc, or a combination thereof. Meeting this challenge with a relatively simple proposal is the goal of this article. This proposal for valid Post-Selection Inference, or PoSI for short, consists of a large-scale family-wise error guarantee that can be shown to account for all types of variable selection, including those of the informal and post-hoc varieties. On

the other hand, the proposal is no more conservative than necessary to account for selection, and in particular it can be shown to be less conservative than Scheffé's simultaneous inference.

The framework for our proposal is in outline as follows, with details to be elaborated in subsequent sections: We consider linear regression with predictor variables whose values are considered fixed, and with a response variable that has normal and homoscedastic errors. The framework does not require that any of the eligible linear models is correct, not even the full model, as long as a valid error estimate is available. We assume that the selected model is the result of some procedure that makes use of the response, but the procedure does not need to be fully specified. A crucial aspect of the framework concerns the use and interpretation of the selected model: We assume that, after variable selection is completed, the selected predictor variables and only they will be relevant; all others will be eliminated from further consideration. This assumption, seemingly innocuous and natural, has critical consequences: It implies that statistical inference will be sought for the coefficients of the selected predictors only and in the context of the selected model only. Thus the appropriate targets of inference are the best linear coefficients within the selected model, where each coefficient is adjusted for the presence of all other included predictors but not those that were eliminated. Therefore the coefficient of an included predictor generally requires inference that is specific to the model in which it appears. Summarizing in a motto: a difference in adjustment implies a difference in parameters and hence in inference. The goal of the present proposal is therefore simultaneous inference for all coefficients within all submodels. Such inference can be shown to be valid following any variable selection procedure, be it formal, informal, post hoc, fully or only partly specified.

Problems associated with post-selection inference were recognized long ago, for example, by Buehler and Fedderson (1963), Brown (1967), Olshen (1973), Sen (1979), Sen and Saleh (1987), Dijkstra and Veldkamp (1988), Pötscher (1991), Kabaila (1998). More recently specific problems have been the subject of incisive analyses and criticisms by the Vienna School of Pötscher, Leeb and Schneider; see, for example, Leeb and Pötscher (2003; 2005; 2006a; 2006b; 2008a; 2008b), Pötscher (2006), Leeb (2006), Pötscher and Leeb (2009), Pötscher and Schneider (2009, 2010, 2011), as well as Kabaila and Leeb (2006) and Kabaila (2009). Important progress was made by Hjort and Claeskens (2003) and Claeskens and Hjort (2003).

This article proceeds as follows: In Section 2 we first develop the submodel view of the targets of inference after model selection and contrast it with the full model view (Section 2.1); we then introduce assumptions

with a view toward valid inference in wrong models (Section 2.2). Section 3 is about estimation and its targets from the submodel point of view. Section 4 develops the methodology for PoSI confidence intervals (CIs) and tests. After some structural results for the PoSI problem in Section 5, we show in Section 6 that with increasing number of predictors p the width of PoSI CIs can range between the asymptotic rates O(√(log p)) and O(√p). We give examples for both rates and, inspired by problems in sphere packing and covering, we give upper bounds for the limiting constant in the O(√p) case. We conclude with a discussion in Section 7. Some proofs are deferred to the appendix, and some elaborations to the online appendix. Computations will be described in a separate article. Simulation-based methods yield satisfactory accuracy specific to a design matrix up to p ≈ 20, while non-asymptotic universal upper bounds can be computed for larger p.

2. Targets of Inference and Assumptions. It is a natural intuition that model selection distorts inference by distorting sampling distributions of parameter estimates: Estimates in selected models should tend to generate more Type I errors than conventional theory allows because the typical selection procedure favors models with strong, hence highly significant predictors. This intuition correctly points to a multiplicity problem that grows more severe as the number of predictors subject to selection increases. This is the problem we address in this article.

Model selection poses additional problems that are less obvious but no less fundamental: There exists an ambiguity as to the role and meaning of the parameters in submodels. On one view, the relevant parameters are always those of the full model, hence the selection of a submodel is interpreted as estimating the deselected parameters to be zero and estimating the selected parameters under a zero constraint on the deselected parameters. On another view, the submodel has its own parameters, and the deselected parameters are not zero but non-existent. These distinctions are not academic as they imply fundamentally different ideas regarding the targets of inference, the measurement of statistical performance, and the problem of post-selection inference. The two views derive from different purposes of equations: Underlying the full model view of parameters is the use of a full equation to describe a data generating mechanism for the response; the equation hence has a causal interpretation. Underlying the submodel view of parameters is the use of any equation to merely describe association between predictor and response variables; no data generating or causal claims are implied. In this article we address the latter use of equations. Issues relating to the

former use are discussed in the online appendix, Section B.

2.1. The Submodel Interpretation of Parameters. In what follows we elaborate three points that set the submodel interpretation of coefficients apart from the full model interpretation, with important consequences for the rest of this article: (1) The full model has no special status other than being the repository of available predictors. (2) The coefficients of excluded predictors are not zero; they are not defined and therefore don't exist. (3) The meaning of a predictor's coefficient depends on which other predictors are included in the selected model.

(1) The full model available to the statistician often cannot be argued to have special status because of inability to identify and measure all relevant predictors. Additionally, even when a large and potentially complete suite of predictors can be measured there is generally a question of predictor redundancy that may make it desirable to omit some of the measurable predictors from the final model. It is a common experience in the social sciences that models proposed on theoretical grounds are found on empirical grounds to have their predictors entangled by collinearities that permit little meaningful statistical inference. This situation is not limited to the social sciences: in gene expression studies it may well occur that numerous sites have a tendency to be expressed concurrently, hence as predictors in disease studies they will be strongly confounded. The emphasis on full models may be particularly strong in econometrics where there is a notion that "a longer regression ... has a causal interpretation, while a shorter regression does not" (Angrist and Pischke 2009, p. 59). Even in causal models, however, there is a possibility that included adjustor variables will adjust away some of the causal variables of interest. Generally, in any creative observational study involving novel predictors it will be difficult a priori to exclude collinearities that might force a rethinking of the predictors. In conclusion, whenever predictor redundancy is a potential issue, it cannot a priori be claimed that the full model provides the parameters of primary interest.

(2) In the submodel interpretation of parameters, claiming that the coefficients of deselected predictors are zero does not properly describe the role of predictors. Deselected predictors have no role in the submodel equation; they become no different than predictors that had never been considered. The selected submodel becomes the vehicle of substantive research irrespective of what the full model was. As such the submodel stands on its own. This view is especially appropriate if the statistician's task is to determine

which predictors are to be measured in the future.

(3) The submodel interpretation of parameters is deeply seated in how we teach regression. We explain that the meaning of a regression coefficient depends on which of the other predictors are included in the model: the slope is the average difference in the response for a unit difference in the predictor, at fixed levels of all other predictors in the model. This ceteris paribus clause is essential to the meaning of a slope. That there is a difference in meaning when there is a difference in covariates is most drastically evident when there is a case of Simpson's paradox. For example, if purchase likelihood of a high-tech gadget is predicted from Age, it might be found against expectations that younger people have lower purchase likelihood, whereas a regression on Age and Income might show that at fixed levels of income younger people have indeed higher purchase likelihood. This case of Simpson's paradox would be enabled by the expected positive collinearity between Age and Income. Thus the marginal slope on Age is distinct from the Income-adjusted slope on Age, as the two slopes answer different questions, apart from having opposite signs. In summary, different models result in different parameters with different meanings. Must we use the full model with both predictors? Not if Income data is difficult to obtain or if it provides little improvement in R² beyond Age. The model based on Age alone cannot be said to be a priori wrong. If, for example, the predictor and response variables have jointly multivariate normal distributions, then every linear submodel is correct. These considerations drive home, once again, that sometimes no model has special status.

In summary, a range of applications call for a framework in which the full model is not the sole provider of parameters, where rather each submodel defines its own. The consequences of this view will be developed in Section 3.
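
As a small numerical illustration of the Age/Income example above, the following Python sketch (synthetic data whose coefficients are assumed purely for illustration, not taken from any real study) shows the marginal slope on Age and the Income-adjusted slope on Age coming out with opposite signs, so that the two submodels answer genuinely different questions.

import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.normal(45, 12, n)
income = 1.5 * age + rng.normal(0, 8, n)          # positive collinearity with Age
mu = -0.05 * age + 0.06 * income                  # assumed mean purchase propensity
y = mu + rng.normal(0, 1, n)

def slopes(*columns):
    """Least-squares coefficients for a model with an intercept and the given columns."""
    X = np.column_stack([np.ones(n)] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("marginal Age slope:        ", slopes(age)[1])          # typically positive here
print("Income-adjusted Age slope: ", slopes(age, income)[1])  # typically negative

Neither slope is "the" Age effect; each is the target parameter of its own submodel, which is exactly the point of the submodel interpretation.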

2.2. Assumptions, Models as Approximations, and Error Estimates. We state assumptions for estimation and for the construction of valid tests and CIs when fitting arbitrary linear equations. The main goal is to prepare the ground for valid statistical inference after model selection, not assuming that selected models are correct. We consider a quantitative response vector Y ∈ R^n, assumed random, and a full predictor matrix X = (X_1, X_2, ..., X_p) ∈ R^{n×p}, assumed fixed. We allow X to be of non-full rank, and n and p to be arbitrary. In particular, we allow n < p. Throughout the article we let

(2.1) d ≡ rank(X) = dim(span(X)),

hence d ≤ min(n, p). Due to frequent reference we call d = p (≤ n) the "classical case".

It is common practice to assume the full model Y ∼ N_n(Xβ, σ²I) to be correct. In the present framework, however, first-order correctness, E[Y] = Xβ, will not be assumed. By implication, first-order correctness of any submodel will not be assumed either. Effectively,

(2.2) µ ≡ E[Y] ∈ R^n

is allowed to be unconstrained and, in particular, need not reside in the column space of X. That is, the model given by X is allowed to be first-order wrong, and hence we are in a well-defined sense serious about G. E. P. Box's famous quote. What he calls "wrong models" we prefer to call "approximations": All predictor matrices X provide approximations to µ, some better than others, but the degree of approximation plays no role in the clarification of statistical inference. The main reason for elaborating this point is as follows: after model selection the case for correct models is clearly questionable, even for consistent model selection procedures (Leeb and Pötscher 2003, p. 101); but if correctness of submodels is not assumed, it is only natural to abandon this assumption for the full model also, in line with the idea that the full model has no special status.

As we proceed with estimation and inference guarantees in the absence of first-order correctness we will rely on assumptions as follows: For estimation (Section 3), we will only need the existence of µ = E[Y]. For testing and CI guarantees (Section 4), we will make conventional second order and distributional assumptions:

(2.3) Y ∼ N(µ, σ²I).

The assumptions (2.3) of homoscedasticity and normality are as questionable as first order correctness, and we will report elsewhere on approaches that avoid them. For now we follow the vast model selection literature that relies on the technical advantages of assuming homoscedastic and normal errors.

Accepting the assumption (2.3), we address the issue of estimating the error variance σ², because the valid tests and CIs we construct require a valid estimate σ̂² of σ² that is independent of LS estimates. In the classical case, the most common way to assert such an estimate is to assume that the full model is first order correct, µ = Xβ, in addition to (2.3), in which case the mean squared residual (MSR) σ̂²_F = ‖Y − Xβ̂‖²/(n − p) of the full model will do. However, other possibilities for producing a valid estimate σ̂² exist, and they may allow relaxing the assumption of first order correctness:

Exact replications of the response obtained under identical conditions might be available in sufficient numbers. An estimate σ̂² can be obtained as the MSR of the one-way ANOVA of the groups of replicates.

In general, a larger linear model than the full model might be considered as correct, hence σ̂² could be the MSR from this larger model.

A different possibility is to use another dataset, similar to the one currently being analyzed, to produce an independent estimate σ̂² by whatever valid estimation method. A special case of the preceding is a random split-sample approach whereby one part of the data is reserved for producing σ̂² and the other part for estimating coefficients, selecting models, and carrying out post-model selection inference.

A different type of estimate σ̂² may be based on considerations borrowed from non-parametric function estimation (Hall and Carroll 1989).

The purpose of pointing out these possibilities is to separate, at least in principle, the issue of first-order model incorrectness from the issue of error estimation under the assumption (2.3). This separation puts the case n < p within our framework, as the valid and independent estimation of σ² is a problem faced by all n < p approaches.

3. Estimation and its Targets in Submodels. Following Section 2.1, the value and meaning of a regression coefficient depends on what the other predictors in the model are. An exception occurs, of course, when the predictors are perfectly orthogonal, as in some designed experiments or in function fitting with orthogonal basis functions. In this case a coefficient has the same value and meaning across all submodels. This article is hence a story of (partial) collinearity.

3.1. Multiplicity of Regression Coefficients. We will give meaning to LS estimators and their targets in the absence of any assumptions other than the existence of µ = E[Y], which in turn is permitted to be entirely unconstrained in R^n. Besides resolving the issue of estimation in first order wrong models, the major purpose here is to elaborate the idea that the slope of a predictor generates different parameters in different submodels. As each predictor appears in 2^{p−1} submodels, the p regression coefficients of the full model generally proliferate into a plethora of as many as p·2^{p−1} distinct regression coefficients according to the submodels they appear in.

To describe the situation we start with notation. To denote a submodel we use the (non-empty) index set M = {j_1, j_2, ..., j_m} ⊂ M_F = {1, ..., p} of the predictors X_{j_i} in the submodel; the size of the submodel is m = |M| and that of the full model is p = |M_F|.

Let X_M = (X_{j_1}, ..., X_{j_m}) denote the n × m submatrix of X with columns indexed by M. We will only allow submodels M for which X_M is of full rank: rank(X_M) = m ≤ d. We let β̂_M be the unique least squares estimate in M:

(3.1) β̂_M = (X_M^T X_M)^{-1} X_M^T Y.

Now that β̂_M is an estimate, what is it estimating? Following Section 2.1, we will not interpret β̂_M as estimates of the full model coefficients and, more generally, of any model other than M. Thus it is natural to ask that β̂_M define its own target through the requirement of unbiasedness:

(3.2) β_M ≡ E[β̂_M] = (X_M^T X_M)^{-1} X_M^T E[Y] = argmin_{β ∈ R^m} ‖µ − X_M β‖².

This definition requires no other assumption than the existence of µ = E[Y]. In particular there is no need to assume first order correctness of M or M_F. Nor does it matter to what degree M provides a good approximation to µ in terms of approximation error ‖µ − X_M β_M‖².

In the classical case d = p ≤ n, we can define the target of the full-model estimate β̂ = (X^T X)^{-1} X^T Y as a special case of (3.2) with M = M_F:

(3.3) β ≡ E[β̂] = (X^T X)^{-1} X^T E[Y].

In the general (including the non-classical) case, let β be any (possibly non-unique) minimizer of ‖µ − Xβ‖²; the link between β and β_M is as follows:

(3.4) β_M = (X_M^T X_M)^{-1} X_M^T X β.

Thus the target β_M is an estimable linear function of β, without first-order correctness assumptions. Equation (3.4) follows from span(X_M) ⊂ span(X).

Notation: To distinguish the regression coefficients of the predictor X_j relative to the submodel it appears in, we write β_{j·M} = E[β̂_{j·M}] for the components of β_M = E[β̂_M] with j ∈ M. An important convention is that indexes are always elements of the full model, j ∈ {1, 2, ..., p} = M_F, for what we call "full model indexing".
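
As a minimal Python sketch (synthetic design and a mean µ deliberately not linear in X, both assumed purely for illustration), the submodel targets of (3.2) can be computed directly from µ, and they satisfy the estimability relation (3.4) even though neither the submodels nor the full model are first-order correct:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p = 60, 4
X = rng.normal(size=(n, p))
mu = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]     # E[Y] deliberately NOT linear in X

def beta_target(cols):
    """beta_M = argmin over b of ||mu - X_M b||^2 = (X_M'X_M)^{-1} X_M' mu, as in (3.2)."""
    XM = X[:, cols]
    return np.linalg.solve(XM.T @ XM, XM.T @ mu)

beta_full = beta_target(list(range(p)))             # full-model target (3.3)
for M in combinations(range(p), 2):                 # all two-predictor submodels
    cols = list(M)
    XM = X[:, cols]
    lhs = beta_target(cols)
    rhs = np.linalg.solve(XM.T @ XM, XM.T @ (X @ beta_full))   # right side of (3.4)
    print(cols, lhs.round(3), "agrees with (3.4):", np.allclose(lhs, rhs))

The printed targets differ from submodel to submodel even though they all derive from the same µ, which is the proliferation of parameters described above; relation (3.4) holds exactly because span(X_M) ⊂ span(X).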

3.2. Interpreting Regression Coefficients in First-Order Incorrect Models. The regression coefficient β_{j·M} is conventionally interpreted as the average difference in the response for a unit difference in X_j, ceteris paribus in the model M. This interpretation no longer holds when the assumption of first order correctness is given up. Instead, the phrase "average difference in the response" should be replaced with the unwieldy phrase "average difference in the response approximated in the submodel M". The reason is that the target of the fit Ŷ_M = X_M β̂_M in the submodel M is µ_M = X_M β_M, hence in M we estimate unbiasedly not the true µ but its LS approximation µ_M.

A second interpretation of regression coefficients is in terms of adjusted predictors: For j ∈ M define the M-adjusted predictor X_{j·M} as the residual vector of the regression of X_j on all other predictors in M. Multiple regression coefficients, both estimates β̂_{j·M} and parameters β_{j·M}, can be expressed as simple regression coefficients with regard to the M-adjusted predictors:

(3.5) β̂_{j·M} = X_{j·M}^T Y / ‖X_{j·M}‖², β_{j·M} = X_{j·M}^T µ / ‖X_{j·M}‖².

The left hand formula lends itself to an interpretation of β̂_{j·M} in terms of the well-known leverage plot which shows Y plotted against X_{j·M} and the line with slope β̂_{j·M}. This plot is valid without first-order correctness assumption.

A third interpretation can be derived from the second: To unclutter notation let x = (x_i)_{i=1...n} be any adjusted predictor X_{j·M}, so that β̂ = x^T Y/‖x‖² and β = x^T µ/‖x‖² are the corresponding β̂_{j·M} and β_{j·M}. Introduce (1) case-wise slopes through the origin, both as estimates β̂_{(i)} = Y_i/x_i and as parameters β_{(i)} = µ_i/x_i, and (2) case-wise weights w_{(i)} = x_i² / Σ_{i′=1...n} x_{i′}². Equations (3.5) are then equivalent to the following: β̂ = Σ_i w_{(i)} β̂_{(i)}, β = Σ_i w_{(i)} β_{(i)}. Hence regression coefficients are weighted averages of case-wise slopes, and this interpretation holds without first-order assumptions.
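
The two representations are easy to verify numerically. The following Python sketch (synthetic data, assumed for illustration; no model is assumed correct) checks that the multiple-regression coefficient equals the simple regression of Y on the M-adjusted predictor, formula (3.5), and equals the weighted average of case-wise slopes:

import numpy as np

rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # submodel M = {intercept, X1, X2}
Y = rng.normal(size=n)

j = 2                                                         # coefficient of interest
others = [k for k in range(X.shape[1]) if k != j]
H = X[:, others] @ np.linalg.solve(X[:, others].T @ X[:, others], X[:, others].T)
x_adj = X[:, j] - H @ X[:, j]                                 # M-adjusted predictor X_{j.M}

beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0][j]             # multiple-regression estimate
beta_adj = x_adj @ Y / (x_adj @ x_adj)                        # formula (3.5)
w = x_adj**2 / np.sum(x_adj**2)                               # case-wise weights
beta_avg = np.sum(w * (Y / x_adj))                            # weighted case-wise slopes
# (the identity is exact; numerically it assumes no x_adj entry is exactly zero)

print(beta_ls, beta_adj, beta_avg)                            # all three agree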

4. Universally Valid Post-Selection Confidence Intervals. 4.1. Test Statistics with One Error Estimate for All Submodels. We consider inference for β̂_M and its target β_M. Following Section 2.2 we require a normal homoscedastic model for Y, but we leave its mean µ = E[Y] entirely unspecified: Y ∼ N(µ, σ²I). We then have equivalently β̂_M ∼ N(β_M, σ²(X_M^T X_M)^{-1}) and β̂_{j·M} ∼ N(β_{j·M}, σ²/‖X_{j·M}‖²). Again following Section 2.2 we assume the availability of a valid estimate σ̂² of σ² that is independent of all estimates β̂_{j·M}, and we further assume σ̂² ∼ σ²χ²_r/r for r degrees of freedom. If the full model is assumed correct, n > p and σ̂² = σ̂²_F, then r = n − p. In the limit r → ∞ we obtain σ̂ = σ, the case of known σ, which will be used starting with Section 6.

Let t_{j·M} denote a t-ratio for β_{j·M} that uses σ̂ irrespective of M:

(4.1) t_{j·M} ≡ (β̂_{j·M} − β_{j·M}) / (σ̂ [(X_M^T X_M)^{-1}]_{jj}^{1/2}) = (β̂_{j·M} − β_{j·M}) / (σ̂/‖X_{j·M}‖) = (Y − µ)^T X_{j·M} / (σ̂ ‖X_{j·M}‖),

where [...]_{jj} refers to the diagonal element corresponding to X_j. The quantity t_{j·M} = t_{j·M}(Y) has a central t-distribution with r degrees of freedom. It is essential that the standard error estimate in the denominator of (4.1) does not involve the MSR σ̂_M from the submodel M, for two reasons: We do not assume that the submodel M is first-order correct, hence σ̂²_M would in general have a distribution that is a multiple of a non-central χ² distribution with unknown non-centrality parameter. More disconcertingly, σ̂²_M would be the result of selection: σ̂²_{M̂} (see Section 4.2). Not much of real use is known about its distribution (see, for example, Brown 1967 and Olshen 1973). These problems are avoided by using one valid estimate σ̂² that is independent of all submodels. With this choice of σ̂, confidence intervals for β_{j·M} take the form

(4.2) CI_{j·M}(K) = [β̂_{j·M} ± K σ̂ [(X_M^T X_M)^{-1}]_{jj}^{1/2}] = [β̂_{j·M} ± K σ̂/‖X_{j·M}‖].

If K = t_{r,1−α/2} is the 1 − α/2 quantile of a t-distribution with r degrees of freedom, then the interval is marginally valid with a 1 − α coverage guarantee: P[β_{j·M} ∈ CI_{j·M}(K)] = 1 − α. This holds if the submodel M is not the result of variable selection.

4.2. Model Selection and Its Implications for Parameters. In practice, the model M tends to be the result of some form of model selection that makes use of the stochastic component of the data, which is the response vector Y (X being fixed, Section 2.2). This model should therefore be expressed as M̂ = M̂(Y). In general we allow a variable selection procedure to be any (measurable) map

(4.3) M̂ : Y ↦ M̂(Y), R^n → M_all,

where M_all is the set of all full-rank submodels:

(4.4) M_all ≡ {M : M ⊂ {1, 2, ..., p}, rank(X_M) = |M|}.

Thus the procedure M̂ is a discrete map that divides R^n into as many as |M_all| different regions with shared outcome of model selection.

Data dependence of the selected model M̂ has strong consequences: Most fundamentally, the selected model M̂ = M̂(Y) is now random. Whether the model has been selected by an algorithm or by human choice, if the response Y has been involved in the selection, the resulting model is a random object because it could have been different for a different realization of the random vector Y. Associated with the random model M̂(Y) is the parameter vector of coefficients β_{M̂(Y)}, which is now randomly chosen also: It has a random dimension m(Y) = |M̂(Y)|: β_{M̂(Y)} ∈ R^{m(Y)}. For any fixed j, it may or may not be the case that j ∈ M̂(Y). Conditional on j ∈ M̂(Y), the parameter β_{j·M̂(Y)} changes randomly as the adjustor covariates in M̂(Y) vary randomly. Thus the set of parameters for which inference is sought is random also.

4.3. Post-Selection Coverage Guarantees for Confidence Intervals. With randomness of the selected model and its parameters in mind, what is a desirable form of post-selection coverage guarantee for confidence intervals? A natural requirement would be a 1 − α confidence guarantee for the coefficients of the predictors that are selected into the model:

(4.5) P[ ∀ j ∈ M̂ : β_{j·M̂} ∈ CI_{j·M̂}(K) ] ≥ 1 − α.

Several points should be noted:

The guarantee is family-wise for all selected predictors j ∈ M̂, though the sense of "family-wise" is unusual because M̂ = M̂(Y) is random.

The guarantee has nothing to say about predictors j ∉ M̂ that have been deselected, regardless of the substantive interest they might have. Predictors of overarching interest should be protected from variable selection, and for these one can use a modification of the PoSI approach which we call "PoSI1"; see Section 4.10.

Because predictor selection is random, M̂ = M̂(Y), two realized samples y⁽¹⁾, y⁽²⁾ ∈ R^n from Y may result in different sets of selected

predictors, M̂(y⁽¹⁾) ≠ M̂(y⁽²⁾). It would be a fundamental misunderstanding to wonder whether the guarantee holds for both realizations. Instead, the guarantee (4.5) is about the procedure

Y ↦ (σ̂(Y), M̂(Y), β̂_{M̂(Y)}(Y)) ↦ CI_{j·M̂}(K) (j ∈ M̂)

for the long run of independent realizations of Y (by the LLN), and not for any particular realizations y⁽¹⁾, y⁽²⁾. A standard formulation used to navigate these complexities after a realization y of Y has been analyzed is the following: "For j ∈ M̂ we have 1 − α confidence that the interval CI_{j·M̂(y)}(K) contains β_{j·M̂(y)}."

Marginal guarantees for individual predictors require some care because β_{j·M̂} does not exist for j ∉ M̂. This makes "β_{j·M̂} ∈ CI_{j·M̂}(K)" an incoherent statement that does not define an event. Guarantees are possible if the condition j ∈ M̂ is added with a conjunction or is being conditioned on: the marginal and conditional probabilities

P[ j ∈ M̂ & β_{j·M̂} ∈ CI_{j·M̂}(K_j) ] and P[ β_{j·M̂} ∈ CI_{j·M̂}(K_j) | j ∈ M̂ ]

are both well-defined and can be the subject of coverage guarantees; see the online appendix, Section B.4.

Finally, we note that the smallest constant K that satisfies the guarantee (4.5) is specific to the procedure M̂. Thus different variable selection procedures would require different constants. Finding procedure-specific constants is a challenge that will be intentionally bypassed by the present proposals.

4.4. Universal Validity for all Selection Procedures. The PoSI procedure proposed here produces a constant K that provides universally valid post-selection inference for all model selection procedures M̂:

(4.6) P[ β_{j·M̂} ∈ CI_{j·M̂}(K) ∀ j ∈ M̂ ] ≥ 1 − α ∀ M̂.

Universal validity irrespective of the model selection procedure M̂ is a strong property that raises questions of whether the approach is too conservative. There are, however, some arguments in its favor:

(1) Universal validity may be desirable or even essential for applications in which the model selection procedure is not specified in advance or for which the analysis involves some ad hoc elements that cannot be accurately prespecified. Even so, we should think of the actually chosen model as part of a procedure Y ↦ M̂(Y), and though the ad hoc steps are not specified for Y other than the observed one, this is not a problem because our protection

is irrespective of what a specification might have been. This view also allows data analysts to change their minds, to improvise and informally decide in favor of a model other than that produced by a formal selection procedure, or to experiment with multiple selection procedures.

(2) There exists a model selection procedure that requires the full strength of universally valid PoSI, and this procedure may not be entirely unrealistic as an approximation to some types of data analytic activities: "significance hunting", that is, selecting that model which contains the statistically most significant coefficient; see Section 4.9.

(3) There is a general question about the wisdom of proposing ever tighter confidence and retention intervals for practical use when in fact these intervals are valid only under tightly controlled conditions. It might be realistic to suppose that much applied work involves more data peeking than is reported in published articles. With inference that is universally valid after any model selection procedure we have a way to establish which rejections are safe, irrespective of unreported data peeking as part of selecting a model.

(4) Related to the previous point is the fact that today there is a realization that a considerable fraction of published empirical work is unreproducible or reports exaggerated effects (well-known in this regard is Ioannidis 2005). A factor contributing to this problem might well be liberal handling of variable selection and absent accounting for it in subsequent inference.

4.5. Restricted Model Selection. The concerns over PoSI's conservative nature can be alleviated somewhat by introducing a degree of flexibility to the PoSI problem with regard to the universe of models being searched. Such flexibility is additionally called for from a practical point of view because it is not true that all submodels in M_all (4.4) are always being searched. Rather, the search is often limited in a way that can be specified a priori, without involvement of Y. For example, a predictor of interest may be forced into the submodels of interest, or there may be a restriction on the size of the submodels. Indeed, if p is large, a restriction to a manageable set of submodels is a computational necessity. In much of what follows we can allow the universe M of allowable submodels to be an (almost) arbitrary but pre-specified non-empty subset of M_all; w.l.o.g. we can assume ∪_{M ∈ M} M = {1, 2, ..., p}. Because we allow only non-singular submodels (see (4.4)) we have |M| ≤ d ∀ M ∈ M, where as always d = rank(X). Selection procedures are now maps

(4.7) M̂ : Y ↦ M̂(Y), R^n → M.

The following are examples of model universes with practical relevance (see also Leeb and Pötscher (2008a), Section 1.1, Example 1).

(1) Submodels that contain the first p′ predictors (1 ≤ p′ ≤ p): M₁ = {M ∈ M_all : {1, 2, ..., p′} ⊂ M}. Classical: |M₁| = 2^{p−p′}. Example: forcing an intercept into all models.

(2) Submodels of size m′ or less ("sparsity option"): M₂ = {M ∈ M_all : |M| ≤ m′}. Classical: |M₂| = (p choose 1) + ··· + (p choose m′).

(3) Submodels with fewer than m′ predictors dropped from the full model: M₃ = {M ∈ M_all : |M| > p − m′}. Classical: |M₃| = |M₂|.

(4) Nested models: M₄ = {{1, ..., j} : j ∈ {1, ..., p}}. |M₄| = p. Example: selecting the degree up to p − 1 in a polynomial regression.

(5) Models dictated by an ANOVA hierarchy of main effects and interactions in a factorial design.

This list is just an indication of possibilities. In general, the smaller the set {(j, M) : j ∈ M ∈ M} is, the less conservative the PoSI approach is, and the more computationally manageable the problem becomes. With sufficiently strong restrictions, in particular using the sparsity option (2) and assuming the availability of an independent valid estimate σ̂, it is possible to apply PoSI in certain non-classical p > n situations.

Further reduction of the PoSI problem is possible by pre-screening adjusted predictors without the response Y. In a fixed-design regression, any variable selection procedure that does not involve Y does not invalidate statistical inference. For example, one may decide not to seek inference for predictors in submodels that impart a Variance Inflation Factor (VIF) above a user-chosen threshold: VIF_{j·M} = ‖X_j‖²/‖X_{j·M}‖² if X_j is centered. The VIF does not make use of Y, hence elimination according to VIF_{j·M} > c does not invalidate inference.

4.6. Reduction of Universally Valid Post-Selection Inference to Simultaneous Inference. We show that universally valid post-selection inference (4.6) follows from simultaneous inference in the form of family-wise error control for all parameters in all submodels. The argument depends on the following lemma that may fall into the category of the trivial but not immediately obvious.

Lemma 4.1. ("Significant Triviality Bound") For any model selection procedure M̂ : R^n → M, the following inequality holds for all Y ∈ R^n:

max_{j ∈ M̂(Y)} |t_{j·M̂(Y)}(Y)| ≤ max_{M ∈ M} max_{j ∈ M} |t_{j·M}(Y)|.

Proof: This is a special case of the triviality f(M̂(Y)) ≤ max_{M ∈ M} f(M), where f(M) = max_{j ∈ M} |t_{j·M}(Y)|.
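
The inequality is easy to see numerically. The following Python sketch (synthetic design and response, σ taken as known for simplicity, all assumed purely for illustration) enumerates M_all for a small X, computes every |t_{j·M}| as in (4.1), and checks that the model picked by an arbitrary Y-dependent rule never beats the overall maximum:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p, sigma = 30, 4, 1.0
X = rng.normal(size=(n, p))
mu = X @ np.array([1.0, 0.5, 0.0, 0.0])
Y = mu + sigma * rng.normal(size=n)

def t_stats(M):
    """|t_{j.M}| for all j in M, using the known sigma and the submodel targets of Section 3."""
    XM = X[:, M]
    G = np.linalg.inv(XM.T @ XM)
    beta_hat = G @ XM.T @ Y
    beta_M = G @ XM.T @ mu                       # target beta_M, not the full-model beta
    se = sigma * np.sqrt(np.diag(G))
    return np.abs(beta_hat - beta_M) / se

M_all = [list(M) for r in range(1, p + 1) for M in combinations(range(p), r)]
overall_max = max(t_stats(M).max() for M in M_all)

# an arbitrary Y-dependent selection rule, standing in for any formal or informal procedure:
# pick the size-2 model with the smallest residual sum of squares
rss = lambda M: np.sum((Y - X[:, M] @ np.linalg.lstsq(X[:, M], Y, rcond=None)[0]) ** 2)
M_hat = min([M for M in M_all if len(M) == 2], key=rss)
print(t_stats(M_hat).max(), "<=", overall_max)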

The right hand max-|t| bound of the lemma is sharp in the sense that there exists a variable selection procedure M̂ that attains the bound; see Section 4.9. Next we introduce the 1 − α quantile of the right hand max-|t| statistic of the lemma: Let K be the minimal value that satisfies

(4.8) P[ max_{M ∈ M} max_{j ∈ M} |t_{j·M}| ≤ K ] ≥ 1 − α.

This value will be called "the PoSI constant". It does not depend on any model selection procedures, but it does depend on the design matrix X, the universe M of models subject to selection, the desired coverage 1 − α, and the degrees of freedom r in σ̂, hence K = K(X, M, α, r).

Theorem 4.1. For all model selection procedures M̂ : R^n → M we have

(4.9) P[ max_{j ∈ M̂} |t_{j·M̂}| ≤ K ] ≥ 1 − α,

where K = K(X, M, α, r) is the PoSI constant.

This follows immediately from Lemma 4.1. Although mathematically trivial we give the above the status of a theorem as it is the central statement of the reduction of universal post-selection inference to simultaneous inference. The following is just a repackaging of Theorem 4.1:

Corollary 4.1. Simultaneous Post-Selection Confidence Guarantees hold for any model selection procedure M̂ : R^n → M:

(4.10) P[ β_{j·M̂} ∈ CI_{j·M̂}(K) ∀ j ∈ M̂ ] ≥ 1 − α,

where K = K(X, M, α, r) is the PoSI constant.

Simultaneous inference provides strong family-wise error control, which in turn translates to strong error control for tests following model selection.

Corollary 4.2. Strong Post-Selection Error Control holds for any model selection procedure M̂ : R^n → M:

P[ ∃ j ∈ M̂ : β_{j·M̂} = 0 & |t⁽⁰⁾_{j·M̂}| > K ] ≤ α,

where K = K(X, M, α, r) is the PoSI constant and t⁽⁰⁾_{j·M} is the t-statistic for the null hypothesis β_{j·M} = 0.

The proof is standard (see the online appendix, Section B.3). The corollary states that, with probability 1 − α, in a selected model all PoSI-significant rejections have detected true alternatives.

4.7. Computation of the PoSI Constant. Several portions of the following treatment are devoted to a better understanding of the structure and value of the PoSI constant K(X, M, α, r). Except for very special choices it does not seem possible to provide closed form expressions for its value. However the structural geometry and other properties to be described later do enable a reasonably efficient computational algorithm. R-code for computing the PoSI constant for small to moderate values of p is available on the authors' web pages. This code is accompanied by a manuscript that will be published elsewhere describing the computational algorithm and generalizations. For the basic setting involving M_all the algorithm will conveniently provide values of K(X, M_all, α, r) for matrices X of rank ≤ 20, or slightly larger depending on available computing speed and memory. It can also be adapted to compute K for some other families contained within M_all, such as some discussed in Section 4.5.
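
For orientation, a brute-force Monte Carlo approximation of K(X, M_all, α) is straightforward for small p. The following Python sketch is a plain simulation for the known-σ case (r = ∞), not the authors' R code or algorithm referred to above; the design and replication count are assumptions made purely for illustration.

import numpy as np
from itertools import combinations

def posi_constant(X, alpha=0.05, reps=2000, rng=None):
    """Crude Monte Carlo estimate of the PoSI constant K of (4.8) for M_all, sigma known."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, p = X.shape
    directions = []                                   # unit vectors X_{j.M}/||X_{j.M}||
    for size in range(1, p + 1):
        for M in combinations(range(p), size):
            cols = list(M)
            if np.linalg.matrix_rank(X[:, cols]) < len(cols):
                continue                              # keep only full-rank submodels, as in (4.4)
            for j in cols:
                others = [k for k in cols if k != j]
                if others:
                    fit = np.linalg.lstsq(X[:, others], X[:, j], rcond=None)[0]
                    x_adj = X[:, j] - X[:, others] @ fit
                else:
                    x_adj = X[:, j]
                directions.append(x_adj / np.linalg.norm(x_adj))
    L = np.array(directions)
    # under (2.3), t_{j.M} = X_{j.M}'(Y - mu)/(sigma ||X_{j.M}||), so with Z ~ N(0, I_n)
    # the max-|t| of (4.8) is the row-wise maximum of |L Z|
    Z = rng.normal(size=(n, reps))
    max_abs_t = np.abs(L @ Z).max(axis=0)
    return np.quantile(max_abs_t, 1 - alpha)

X = np.random.default_rng(4).normal(size=(50, 5))
print("simulated PoSI K for this X:", round(posi_constant(X), 3))

The key fact the sketch relies on is the last expression in (4.1): every t_{j·M} is a unit linear form in (Y − µ)/σ, so only the collection of adjusted directions matters, a point developed geometrically in Section 5.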

4.8. Scheffé Protection. Realizing the idea that the LS estimators in different submodels are generally unbiased estimates of different parameters, we generated a simultaneous inference problem involving up to p·2^{p−1} linear contrasts β_{j·M}. In view of the enormous number of linear combinations for which simultaneous inference is sought, one should wonder whether the problem is not best solved by Scheffé's method (1959) which provides simultaneous inference for all linear combinations. To accommodate rank-deficient X, we cast Scheffé's result in terms of t-statistics for arbitrary non-zero x ∈ span(X):

(4.11) t_x ≡ (Y − µ)^T x / (σ̂ ‖x‖).

The t-statistics in (4.1) are obtained for x = X_{j·M}. Scheffé's guarantee is

(4.12) P[ sup_{x ∈ span(X)} |t_x| ≤ K_Sch ] = 1 − α,

where the Scheffé constant is

(4.13) K_Sch = K_Sch(α, d, r) = √(d F_{d,r,1−α}).

It provides an upper bound for all PoSI constants:

Proposition 4.1. K(X, M, α, r) ≤ K_Sch(α, d, r) ∀ X, M, d = rank(X).

Thus for j ∈ M̂ a parameter estimate β̂_{j·M̂} whose t-ratio exceeds K_Sch in magnitude is universally safe from having the rejection of H_0: β_{j·M̂} = 0 invalidated by variable selection.

The universality of the Scheffé constant is a tip-off that it may be too loose for some predictor matrices X, and obtaining the sharper constant K(X) may be worthwhile. An indication is given by the following comparison as r → ∞: For the Scheffé constant it holds K_Sch ∼ √d. For orthogonal designs it holds K_orth ∼ √(2 log d). (For orthogonal designs see Section 5.5.) Thus the PoSI constant K_orth is much smaller than K_Sch. The large gap between the two suggests that the Scheffé constant may be too conservative at least in some cases. We will study certain non-orthogonal designs for which the PoSI constant is O(√(log d)) in Section 6.1. On the other hand, the PoSI constant can approach the order O(√d) of the Scheffé constant K_Sch as well, and we will study an example in Section 6.2. Even though in this article we will give asymptotic results for d = p → ∞ and r → ∞ only, we mention another kind of asymptotics whereby r is held constant while d = p → ∞: In this case K_Sch is in the order of the product of √d and the 1 − α quantile of the inverse-root-chi-square distribution with r degrees of freedom. In a similar way, the constant K_orth for orthogonal designs is in the order of the product of √(2 log d) and the 1 − α quantile of the inverse-chi-square distribution with r degrees of freedom.

4.9. PoSI-Sharp Model Selection: SPAR. There exists a model selection procedure that requires the full protection of the simultaneous inference procedure (4.8). It is the "significance hunting" procedure that selects the model containing the most significant "effect":

M̂_SPAR(Y) ≡ argmax_{M ∈ M} max_{j ∈ M} |t_{j·M}(Y)|.

We name this procedure SPAR for "Single Predictor Adjusted Regression". It achieves equality with the significant triviality bound in Lemma 4.1 and is therefore the worst case procedure for the PoSI problem. In the submodel M̂_SPAR(Y) the less significant predictors matter only in so far as they boost the significance of the winning predictor by adjusting it accordingly. This procedure ignores the quality of the fit to Y provided by the model. While our present purpose is to point out the existence of a selection procedure that requires full PoSI protection, SPAR could be of practical interest when the analysis is centered on strength of effects, not quality of model fit.
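
To put rough numbers on the gap discussed in Section 4.8, the following Python sketch evaluates both constants in the known-σ limit r → ∞, a simplification assumed here for convenience: then K_Sch = √(χ²_{d,1−α}), and for an orthogonal design the max-|t| in (4.8) reduces to the maximum of d independent |N(0,1)| variables.

import numpy as np
from scipy import stats

alpha = 0.05
for d in (5, 20, 100, 1000):
    k_scheffe = np.sqrt(stats.chi2.ppf(1 - alpha, df=d))              # sqrt(d * F_{d,inf,1-alpha})
    k_orth = stats.norm.ppf(0.5 * (1 + (1 - alpha) ** (1.0 / d)))     # solves (2*Phi(K)-1)^d = 1-alpha
    print(d, round(k_scheffe, 2), round(k_orth, 2), round(np.sqrt(2 * np.log(d)), 2))

Already at d = 100 the two constants differ by roughly a factor of three, which is the room a design-specific PoSI constant can exploit; the last column shows how closely √(2 log d) tracks the orthogonal constant.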

4.10. One Primary Predictor and Controls: PoSI1. Sometimes a regression analysis is centered on a predictor of interest, X_j, and on inference for its coefficient β_{j·M}. The other predictors in M act as controls, so their purpose is to adjust the primary predictor for confounding effects and possibly to boost the primary predictor's own effect. This situation is characterized by two features:

Variable selection is limited to models that contain the primary predictor. We therefore define for any model universe M a sub-universe M_j of models that contain the primary predictor X_j: M_j ≡ {M : j ∈ M ∈ M}, so that for M ∈ M we have j ∈ M iff M ∈ M_j.

Inference is sought for the primary predictor X_j only, hence the relevant test statistic is now |t_{j·M}| and no longer max_{j ∈ M} |t_{j·M}|. The former statistic is coherent because it is assumed that j ∈ M.

We call this the "PoSI1" situation in contrast to the unconstrained PoSI situation. Similar to PoSI, PoSI1 starts with a "significant triviality bound":

Lemma 4.2. ("Primary Predictor's Significant Triviality Bound") For a fixed predictor X_j and model selection procedure M̂ : R^n → M_j, it holds:

|t_{j·M̂(Y)}(Y)| ≤ max_{M ∈ M_j} |t_{j·M}(Y)|.

For a proof, the only thing to note is j ∈ M̂(Y) by the assumption M̂(Y) ∈ M_j.

We next define the PoSI1 constant for the predictor X_j as the 1 − α quantile of the max-|t| statistic on the right side of the lemma: Let K_j = K_j(X, M, α, r) be the minimal value that satisfies

(4.14) P[ max_{M ∈ M_j} |t_{j·M}| ≤ K_j ] ≥ 1 − α.

Importantly, this constant is dominated by the general PoSI constant:

(4.15) K_j(X, M, α, r) ≤ K(X, M, α, r),

for the obvious reason that the present max-|t| is smaller than the general PoSI max-|t| due to M_j ⊂ M and the restriction of inference to X_j. The constant K_j provides the following PoSI1 guarantee shown as the analog of Theorem 4.1 and Corollary 4.1 folded into one:

Theorem 4.2. Let M̂ : R^n → M_j be a selection procedure that always includes the predictor X_j in the model. Then we have

(4.16) P[ |t_{j·M̂}| ≤ K_j ] ≥ 1 − α,

and accordingly we have the following post-selection confidence guarantee:

(4.17) P[ β_{j·M̂} ∈ CI_{j·M̂}(K_j) ] ≥ 1 − α.

Inequality (4.16) is immediate from Lemma 4.2. The triviality bound of the lemma is attained by the following variable selection procedure which we name "SPAR1":

(4.18) M̂_j(Y) ≡ argmax_{M ∈ M_j} |t_{j·M}(Y)|.

It is a potentially realistic description of some data analyses when a predictor of interest is determined a priori, and the goal is to optimize this predictor's effect. This procedure requires the full protection of the PoSI1 constant K_j. In addition to its methodological interest, the PoSI1 situation addressed by Theorem 4.2 is of theoretical interest: Even though the PoSI1 constant K_j is dominated by the unrestricted PoSI constant K, we will construct in Section 6.2 an example of predictor matrices for which the PoSI1 constant increases at the Scheffé rate and is asymptotically more than 63% of the Scheffé constant K_Sch. It follows that near-Scheffé protection can be needed even for SPAR1 variable selection.
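
A Monte Carlo sketch of K_j analogous to the one given for K after Section 4.7 (again Python, known-σ case, a plain simulation rather than the authors' algorithm, with all numerical choices assumed for illustration) makes the domination (4.15) visible for a concrete design:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, p, alpha, reps = 50, 5, 0.05, 2000
X = rng.normal(size=(n, p))

def unit_adjusted(j, M):
    """Unit-norm M-adjusted predictor X_{j.M}/||X_{j.M}||."""
    others = [k for k in M if k != j]
    x = X[:, j]
    if others:
        x = x - X[:, others] @ np.linalg.lstsq(X[:, others], X[:, j], rcond=None)[0]
    return x / np.linalg.norm(x)

models = [M for r in range(1, p + 1) for M in combinations(range(p), r)]
Z = rng.normal(size=(n, reps))                       # (Y - mu)/sigma under (2.3)

L_all = np.array([unit_adjusted(j, M) for M in models for j in M])       # PoSI: all (j, M)
L_1 = np.array([unit_adjusted(0, M) for M in models if 0 in M])          # PoSI1 for predictor X_1

K = np.quantile(np.abs(L_all @ Z).max(axis=0), 1 - alpha)
K_1 = np.quantile(np.abs(L_1 @ Z).max(axis=0), 1 - alpha)
print("K_1 =", round(K_1, 3), "<=", "K =", round(K, 3))

For a given realization, the SPAR1 rule (4.18) simply picks the model whose row of L_1 maximizes |l^T Z|, which is why it exhausts the protection that K_1 buys.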

5. The Structure of the PoSI Problem. 5.1. Canonical Coordinates. We can reduce the dimensionality of the PoSI problem from n × p to d × p, where d = rank(X) ≤ min(n, p), by introducing Scheffé's canonical coordinates. This reduction is important both geometrically and computationally because the PoSI coverage problem really takes place in the column space of X.

DEFINITION: Let Q = (q_1, ..., q_d) ∈ R^{n×d} be any orthonormal basis of the column space of X. Note that Ŷ = QQ^T Y is the orthogonal projection of Y onto the column space of X even if X is not of full rank. We call X̃ = Q^T X ∈ R^{d×p} and Ỹ = Q^T Ŷ ∈ R^d canonical coordinates of X and Ŷ.

We extend the notation X_M for extraction of subsets of columns to canonical coordinates X̃_M. Accordingly slopes obtained from canonical coordinates will be denoted by β̂_M(X̃, Ỹ) = (X̃_M^T X̃_M)^{-1} X̃_M^T Ỹ to distinguish them from the slopes obtained from the original data, β̂_M(X, Y) = (X_M^T X_M)^{-1} X_M^T Y, if only to state in the following proposition that they are identical.

Proposition 5.1. Properties of canonical coordinates:
(1) Ỹ = Q^T Y.
(2) X̃_M^T X̃_M = X_M^T X_M and X̃_M^T Ỹ = X_M^T Y.
(3) β̂_M(X̃, Ỹ) = β̂_M(X, Y) for all submodels M.
(4) Ỹ ∼ N(µ̃, σ²I_d), where µ̃ = Q^T µ.
(5) X̃_{j·M} = Q^T X_{j·M}, where j ∈ M and X̃_{j·M} ∈ R^d is the residual vector of the regression of X̃_j onto the other columns of X̃_M.
(6) t_{j·M} = (β̂_{j·M}(X̃, Ỹ) − β_{j·M})/(σ̂/‖X̃_{j·M}‖).
(7) In the classical case d = p, X̃ can be chosen to be an upper triangular or a symmetric matrix.

The proofs of (1)-(6) are elementary. As for (7), an upper triangular X̃ can be obtained from a QR-decomposition based on a Gram-Schmidt procedure: X = QR, X̃ = R. A symmetric X̃ is obtained from a singular value decomposition: X = UDV^T, Q = UV^T, X̃ = VDV^T. Canonical coordinates allow us to analyze the PoSI coverage problem in R^d. In what follows we will freely assume that all objects are rendered in canonical coordinates and write X and Y for X̃ and Ỹ, implying that the predictor matrix is of size d × p and the response is of size d.

5.2. PoSI Coefficient Vectors in Canonical Coordinates. We simplify the PoSI coverage problem (4.8) as follows: Due to pivotality of t-statistics, the problem is invariant under translation of µ and rescaling of σ (see equation (4.1)). Hence it suffices to solve coverage problems for µ = 0 and σ = 1. In canonical coordinates this implies E[Ỹ] = 0_d, hence Ỹ ∼ N(0_d, I_d). For this reason we use the more familiar notation Z instead of Ỹ. The random vector Z/σ̂ has a d-dimensional t-distribution with r degrees of freedom, and any linear combination u^T Z/σ̂ with a unit vector u has a 1-dimensional t-distribution. Letting X_{j·M} be the adjusted predictors in canonical coordinates, the estimates (3.5) and their t-statistics (4.1) simplify to

(5.1) β̂_{j·M} = X_{j·M}^T Z / ‖X_{j·M}‖² = l_{j·M}^T Z, t_{j·M} = X_{j·M}^T Z / (‖X_{j·M}‖ σ̂) = l̄_{j·M}^T Z / σ̂,
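
As a quick check of Proposition 5.1 (3), the following Python sketch (random full-rank design, assumed purely for illustration) builds canonical coordinates from a QR decomposition and confirms that every submodel's least-squares coefficients are unchanged, so the coverage problem can indeed be studied in R^d.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, p = 40, 4
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

Q, R = np.linalg.qr(X)                      # columns of Q: orthonormal basis of span(X)
X_t = Q.T @ X                               # canonical coordinates of X (equals R here)
Y_t = Q.T @ (Q @ (Q.T @ Y))                 # Q' applied to Y_hat, which simplifies to Q'Y

def beta(Xmat, y, M):
    XM = Xmat[:, M]
    return np.linalg.solve(XM.T @ XM, XM.T @ y)

for M in combinations(range(p), 2):
    print(M, np.allclose(beta(X, Y, list(M)), beta(X_t, Y_t, list(M))))   # True for every M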


More information

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club 36-825 1 Introduction Jisu Kim and Veeranjaneyulu Sadhanala In this report

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression:

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression: Biost 518 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture utline Choice of Model Alternative Models Effect of data driven selection of

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Mallows Cp for Out-of-sample Prediction

Mallows Cp for Out-of-sample Prediction Mallows Cp for Out-of-sample Prediction Lawrence D. Brown Statistics Department, Wharton School, University of Pennsylvania lbrown@wharton.upenn.edu WHOA-PSI conference, St. Louis, Oct 1, 2016 Joint work

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

Reading for Lecture 6 Release v10

Reading for Lecture 6 Release v10 Reading for Lecture 6 Release v10 Christopher Lee October 11, 2011 Contents 1 The Basics ii 1.1 What is a Hypothesis Test?........................................ ii Example..................................................

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Empirical Economic Research, Part II

Empirical Economic Research, Part II Based on the text book by Ramanathan: Introductory Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 7, 2011 Outline Introduction

More information

Recent Developments in Post-Selection Inference

Recent Developments in Post-Selection Inference Recent Developments in Post-Selection Inference Yotam Hechtlinger Department of Statistics yhechtli@andrew.cmu.edu Shashank Singh Department of Statistics Machine Learning Department sss1@andrew.cmu.edu

More information

Inferences for Regression

Inferences for Regression Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In

More information

with the usual assumptions about the error term. The two values of X 1 X 2 0 1

with the usual assumptions about the error term. The two values of X 1 X 2 0 1 Sample questions 1. A researcher is investigating the effects of two factors, X 1 and X 2, each at 2 levels, on a response variable Y. A balanced two-factor factorial design is used with 1 replicate. The

More information

Generalized Cp (GCp) in a Model Lean Framework

Generalized Cp (GCp) in a Model Lean Framework Generalized Cp (GCp) in a Model Lean Framework Linda Zhao University of Pennsylvania Dedicated to Lawrence Brown (1940-2018) September 9th, 2018 WHOA 3 Joint work with Larry Brown, Juhui Cai, Arun Kumar

More information

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

More information

Regression Diagnostics for Survey Data

Regression Diagnostics for Survey Data Regression Diagnostics for Survey Data Richard Valliant Joint Program in Survey Methodology, University of Maryland and University of Michigan USA Jianzhu Li (Westat), Dan Liao (JPSM) 1 Introduction Topics

More information

Geographically Weighted Regression as a Statistical Model

Geographically Weighted Regression as a Statistical Model Geographically Weighted Regression as a Statistical Model Chris Brunsdon Stewart Fotheringham Martin Charlton October 6, 2000 Spatial Analysis Research Group Department of Geography University of Newcastle-upon-Tyne

More information

Selective Inference for Effect Modification

Selective Inference for Effect Modification Inference for Modification (Joint work with Dylan Small and Ashkan Ertefaie) Department of Statistics, University of Pennsylvania May 24, ACIC 2017 Manuscript and slides are available at http://www-stat.wharton.upenn.edu/~qyzhao/.

More information

Lecture 4 Multiple linear regression

Lecture 4 Multiple linear regression Lecture 4 Multiple linear regression BIOST 515 January 15, 2004 Outline 1 Motivation for the multiple regression model Multiple regression in matrix notation Least squares estimation of model parameters

More information

Chapter 3 Multiple Regression Complete Example

Chapter 3 Multiple Regression Complete Example Department of Quantitative Methods & Information Systems ECON 504 Chapter 3 Multiple Regression Complete Example Spring 2013 Dr. Mohammad Zainal Review Goals After completing this lecture, you should be

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Statistics Journal Club, 36-825 Sangwon Justin Hyun and William Willie Neiswanger 1 Paper Summary 1.1 Quick intuitive summary

More information

3/10/03 Gregory Carey Cholesky Problems - 1. Cholesky Problems

3/10/03 Gregory Carey Cholesky Problems - 1. Cholesky Problems 3/10/03 Gregory Carey Cholesky Problems - 1 Cholesky Problems Gregory Carey Department of Psychology and Institute for Behavioral Genetics University of Colorado Boulder CO 80309-0345 Email: gregory.carey@colorado.edu

More information

The purpose of this section is to derive the asymptotic distribution of the Pearson chi-square statistic. k (n j np j ) 2. np j.

The purpose of this section is to derive the asymptotic distribution of the Pearson chi-square statistic. k (n j np j ) 2. np j. Chapter 9 Pearson s chi-square test 9. Null hypothesis asymptotics Let X, X 2, be independent from a multinomial(, p) distribution, where p is a k-vector with nonnegative entries that sum to one. That

More information

Lecture 6: Linear Regression (continued)

Lecture 6: Linear Regression (continued) Lecture 6: Linear Regression (continued) Reading: Sections 3.1-3.3 STATS 202: Data mining and analysis October 6, 2017 1 / 23 Multiple linear regression Y = β 0 + β 1 X 1 + + β p X p + ε Y ε N (0, σ) i.i.d.

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Orthogonal, Planned and Unplanned Comparisons

Orthogonal, Planned and Unplanned Comparisons This is a chapter excerpt from Guilford Publications. Data Analysis for Experimental Design, by Richard Gonzalez Copyright 2008. 8 Orthogonal, Planned and Unplanned Comparisons 8.1 Introduction In this

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Homoskedasticity. Var (u X) = σ 2. (23)

Homoskedasticity. Var (u X) = σ 2. (23) Homoskedasticity How big is the difference between the OLS estimator and the true parameter? To answer this question, we make an additional assumption called homoskedasticity: Var (u X) = σ 2. (23) This

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information

Using regression to study economic relationships is called econometrics. econo = of or pertaining to the economy. metrics = measurement

Using regression to study economic relationships is called econometrics. econo = of or pertaining to the economy. metrics = measurement EconS 450 Forecasting part 3 Forecasting with Regression Using regression to study economic relationships is called econometrics econo = of or pertaining to the economy metrics = measurement Econometrics

More information

Comparisons among means (or, the analysis of factor effects)

Comparisons among means (or, the analysis of factor effects) Comparisons among means (or, the analysis of factor effects) In carrying out our usual test that μ 1 = = μ r, we might be content to just reject this omnibus hypothesis but typically more is required:

More information

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES BIOL 458 - Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES PART 1: INTRODUCTION TO ANOVA Purpose of ANOVA Analysis of Variance (ANOVA) is an extremely useful statistical method

More information

Tutorial on Mathematical Induction

Tutorial on Mathematical Induction Tutorial on Mathematical Induction Roy Overbeek VU University Amsterdam Department of Computer Science r.overbeek@student.vu.nl April 22, 2014 1 Dominoes: from case-by-case to induction Suppose that you

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Linear Models in Econometrics

Linear Models in Econometrics Linear Models in Econometrics Nicky Grant At the most fundamental level econometrics is the development of statistical techniques suited primarily to answering economic questions and testing economic theories.

More information

Regression With a Categorical Independent Variable: Mean Comparisons

Regression With a Categorical Independent Variable: Mean Comparisons Regression With a Categorical Independent Variable: Mean Lecture 16 March 29, 2005 Applied Regression Analysis Lecture #16-3/29/2005 Slide 1 of 43 Today s Lecture comparisons among means. Today s Lecture

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Computational Tasks and Models

Computational Tasks and Models 1 Computational Tasks and Models Overview: We assume that the reader is familiar with computing devices but may associate the notion of computation with specific incarnations of it. Our first goal is to

More information

Sparse Approximation and Variable Selection

Sparse Approximation and Variable Selection Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation

More information

ECNS 561 Multiple Regression Analysis

ECNS 561 Multiple Regression Analysis ECNS 561 Multiple Regression Analysis Model with Two Independent Variables Consider the following model Crime i = β 0 + β 1 Educ i + β 2 [what else would we like to control for?] + ε i Here, we are taking

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Ordinary Least Squares Linear Regression

Ordinary Least Squares Linear Regression Ordinary Least Squares Linear Regression Ryan P. Adams COS 324 Elements of Machine Learning Princeton University Linear regression is one of the simplest and most fundamental modeling ideas in statistics

More information

An overview of applied econometrics

An overview of applied econometrics An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Today s Class (or 3): Summary of steps in building unconditional models for time What happens to missing predictors Effects of time-invariant predictors

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

Lecture 5: Clustering, Linear Regression

Lecture 5: Clustering, Linear Regression Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 .0.0 5 5 1.0 7 5 X2 X2 7 1.5 1.0 0.5 3 1 2 Hierarchical clustering

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

Checking model assumptions with regression diagnostics

Checking model assumptions with regression diagnostics @graemeleehickey www.glhickey.com graeme.hickey@liverpool.ac.uk Checking model assumptions with regression diagnostics Graeme L. Hickey University of Liverpool Conflicts of interest None Assistant Editor

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Lecture 10: Generalized likelihood ratio test

Lecture 10: Generalized likelihood ratio test Stat 200: Introduction to Statistical Inference Autumn 2018/19 Lecture 10: Generalized likelihood ratio test Lecturer: Art B. Owen October 25 Disclaimer: These notes have not been subjected to the usual

More information

CHAPTER 5. Outlier Detection in Multivariate Data

CHAPTER 5. Outlier Detection in Multivariate Data CHAPTER 5 Outlier Detection in Multivariate Data 5.1 Introduction Multivariate outlier detection is the important task of statistical analysis of multivariate data. Many methods have been proposed for

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

Introduction to Confirmatory Factor Analysis

Introduction to Confirmatory Factor Analysis Introduction to Confirmatory Factor Analysis Multivariate Methods in Education ERSH 8350 Lecture #12 November 16, 2011 ERSH 8350: Lecture 12 Today s Class An Introduction to: Confirmatory Factor Analysis

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information