Resampling-Based Information Criteria for Best-Subset Regression


University of Haifa
From the SelectedWorks of Philip T. Reiss
2012

Resampling-Based Information Criteria for Best-Subset Regression

Philip T. Reiss, New York University
Lei Huang, Columbia University
Joseph E. Cavanaugh, University of Iowa
Amy Krain Roy, Fordham University

Available at:

Annals of the Institute of Statistical Mathematics manuscript No.
(will be inserted by the editor)

Resampling-Based Information Criteria for Best-Subset Regression

Philip T. Reiss · Lei Huang · Joseph E. Cavanaugh · Amy Krain Roy

Received: date / Revised: date

Abstract  When a linear model is chosen by searching for the best subset among a set of candidate predictors, a fixed penalty such as that imposed by the Akaike information criterion may penalize model complexity inadequately, leading to biased model selection. We study resampling-based information criteria that aim to overcome this problem through improved estimation of the effective model dimension. The first proposed approach builds upon previous work on bootstrap-based model selection. We then propose a more novel approach based on cross-validation. Simulations and analyses of a functional neuroimaging data set illustrate the strong performance of our resampling-based methods, which are implemented in a new R package.

Keywords  Adaptive model selection · Covariance inflation criterion · Cross-validation · Extended information criterion · Functional connectivity · Overoptimism

P. T. Reiss
Department of Child and Adolescent Psychiatry, New York University, New York, NY 10016, USA, and Nathan S. Kline Institute for Psychiatric Research, Orangeburg, NY 10962, USA
E-mail: phil.reiss@nyumc.org

L. Huang
Department of Child and Adolescent Psychiatry, New York University, New York, NY 10016, USA

J. E. Cavanaugh
Department of Biostatistics, University of Iowa College of Public Health, Iowa City, IA 52242, USA

A. K. Roy
Department of Psychology, Fordham University, Bronx, NY 10458, USA

1 Introduction

A popular strategy for model selection is to choose the candidate model minimizing an estimate of the expected value of some loss function for a hypothetical

future set of outcomes. That estimate may take the form of the value of the loss function for the given data plus a penalty or correction term. The latter term compensates for the fact that the loss function will tend to have a lower value for the data from which the model fit was obtained than for a different data set to which the same fitted model is applied. In other words, the correction represents the overoptimism (Efron, 1983; Pan and Le, 2001) inherent in using the observed loss as an estimate of the expected loss in a future data set. This general template can lead to ostensibly very different model selection criteria, such as the Akaike (1973, 1974) information criterion (AIC), Mallows' (1973) C_p statistic, and more recent covariance penalty methods (Efron, 2004).

In this paper we are concerned with the problem of subset selection for linear models (Miller, 2002), which can be stated as follows. Suppose we are given an n-dimensional outcome vector y, and an n × P matrix X_full, where the first column of X_full consists of 1s, and the remaining P − 1 columns correspond to candidate predictors. Let A = { {1} ∪ Ã : Ã ⊆ {2, ..., P} }. Any A ∈ A defines a model matrix X = X_A consisting of those columns of X_full indexed by the elements of A, and thereby defines a model

    y = Xβ + ε,    (1)

where ε is an n-dimensional vector of mean-zero errors. (The definition of A implies that we consider only models that include an intercept.) Our goal is to choose the A ∈ A for which the estimated expected loss for a future data set is lowest. Following Akaike, we use −2 times the log likelihood as the loss function, which favors a model whose predictive distribution is closest to the true distribution in the sense of Kullback–Leibler information (Konishi and Kitagawa, 2008).

The Akaike paradigm is built upon a solid foundation, thanks to its connection with likelihood theory and information theory. In part for this reason, AIC remains the best-known and most widely used criterion for model selection. Our contribution addresses a problem that is often overlooked in practice: the fact that, when selecting one among many possible models, the overoptimism is inflated. For example, AIC adds the penalty 2p to −2 times the log likelihood, where p = |A|. Selecting the AIC-minimizing model is best suited to settings in which there is only one candidate model of each size p. But when searching among all size-p subsets for each p, the overoptimism associated with selecting the best size-p subset is greater than 2p. Adaptive model selection procedures, in the sense of Tibshirani and Knight (1999), increase the overoptimism penalty to account for searching among a number of possible models of each size (cf. Shen and Ye, 2002, who define adaptive model selection somewhat differently).

We develop an approach to adaptive model selection that is particularly relevant to applications with a moderate number of candidate predictors, i.e., P/n less than 1 but not necessarily near zero. Whereas AIC is based on an asymptotic approximation that assumes p ≪ n, the corrected AIC (AIC_c)

(Sugiura, 1978; Hurvich and Tsai, 1989) applies an exact penalty term that is not proportional to p. This penalty term (see (7) below) indicates that as p grows while n remains fixed, the rate of growth in overoptimism increases; in other words, complexity is more costly for small-to-moderate than for large samples, so that complexity penalization should depend on both p and n. This important case of moderate predictor dimension has not been fully addressed in previous work on adaptive linear model selection. Since in this case it is problematic to view the overoptimism as proportional to p, adaptive model selection methods that seek to replace the AIC penalty 2p with λp for some λ > 2 (e.g., Foster and George, 1994; Ye, 1998; Shen and Ye, 2002) may be less than ideal. Our proposed methods more closely resemble the covariance inflation criterion of Tibshirani and Knight (1999), which uses permuted versions of the data to estimate the overoptimism. While this estimate works well when the true model is null, these authors acknowledge (p. 543) that it is biased when the true model is non-null. Instead of data permutation, the adaptive methods that we describe rely on two alternative resampling approaches, bootstrapping and cross-validation, to produce overoptimism estimates that are appropriate whether or not the true model is the null model.

This work was motivated by research relating psychological outcomes to functional connectivity (FC), the temporal correlation of activity levels in different brain regions of interest. In a study of eight left-hemisphere regions, Stein et al. (2007) identified 10 between-region connections (i.e., pairs of regions; see Figure 1) whose FC, assessed using functional magnetic resonance imaging (fMRI), may be related to psychological measures. We were interested in exploring FC for these 10 connections as predictors of two such measures: the Rosenberg (1965) self-esteem score, and the General Distress: Depressive Symptoms (GDD) subscale of the Mood and Anxiety Symptom Questionnaire (MASQ) (Watson et al., 1995). Given the paucity of scientific theory to guide model building, it is natural to turn to automatic model selection criteria to find the best subset of the 10 candidate predictors.

Section 2 introduces information criteria for linear model selection, and explains in more detail the need for adaptive approaches. Section 3 describes the bootstrap-based extended information criterion of Ishiguro et al. (1997), and its extension to adaptive linear model selection. Section 4 proposes an alternative adaptive criterion, based on cross-validatory estimation of the overoptimism, which may overcome some of the limitations of the bootstrap method. Simulations in Section 5 demonstrate that our adaptive methods perform well compared with previous approaches. Section 6 presents analyses of our functional connectivity data set, and Section 7 offers concluding remarks.

Fig. 1  The eight brain regions studied by Stein et al. (2007), who identified the 10 indicated connections (pairs of regions joined by edges) as potential predictors of psychological outcomes. Black dots indicate regions that lie on the mid-sagittal plane shown, while grey dots indicate left-hemisphere regions that have been projected onto this plane for illustration. Abbreviations: AMY, amygdala; INS, insula; OFC, orbitofrontal cortex; PCC, posterior cingulate cortex; PFC, prefrontal cortex; PHG, parahippocampal gyrus; SUB, subgenual cingulate; SUP, supragenual cingulate.

2 Information criteria for linear models

2.1 AIC and corrected AIC

In what follows, we shall refer to model (1) with X = X_A as model A. The notation X for the design matrix, and β̂, σ̂² for the parameter estimates, will refer to an a priori model A with |A| = p, fitted to the n observations. We shall add subscripts to X, β̂, σ̂² (i) to denote a model selected as the best model of a particular dimension, as opposed to a priori, and/or (ii) to refer to resampled data sets and the associated model fits.

Imagine a future realization of the data with the same matrix of predictors X = (x_1, ..., x_n)ᵀ as in (1), but a new outcome vector y⁺ that is independent of y, conditionally on X. A good model will give rise to maximum likelihood estimates (MLEs) β̂, σ̂² such that the expected −2 times log likelihood for the fitted model at the future data (y⁺, X), i.e.,

    E[−2ℓ(β̂, σ̂² | y⁺, X)],    (2)

will be as small as possible; here the expectation is with respect to the joint likelihood of (y, y⁺) conditional on X. Information criteria estimate the expected loss (2) by

    −2ℓ(β̂, σ̂² | y, X) + Ĉ,    (3)

where Ĉ is an estimate of the overoptimism

    C ≡ E[−2ℓ(β̂, σ̂² | y⁺, X)] − E[−2ℓ(β̂, σ̂² | y, X)].    (4)

The idea of (3) is that −2ℓ(β̂, σ̂² | y, X) has a downward bias as an estimate of (2), since (y, X) is the original data set for which the likelihood was maximized to obtain (β̂, σ̂²). C is this bias, and the penalty term Ĉ in (3) is an estimate of it.

If model (1) holds with ε ~ N(0, σ²I), then −2ℓ(β̂, σ̂² | y, X) = n log σ̂² + n, so that

    C = E[(n log σ̂² + ‖y⁺ − Xβ̂‖²/σ̂²) − (n log σ̂² + n)] = E(‖y⁺ − Xβ̂‖²/σ̂²) − n,    (5)

and the generic information criterion (3) reduces to

    n log σ̂² + n + Ĉ.    (6)

Sugiura (1978) and Hurvich and Tsai (1989) derive the exact value

    C = C_AICc(p) ≡ n(2p + 2)/(n − p − 2)    (7)

(see also the succinct treatment of Davison, 2003). Substituting this value for Ĉ in (6) gives the AIC_c

    n log σ̂² + n(n + p)/(n − p − 2).    (8)

If p ≪ n, (7) is approximately 2(p + 1) ≈ 2p, the ordinary AIC penalty. The fact that formula (8) assumes that (1) is a correct model, i.e. either the true data-generating model or a larger model, is sometimes seen as a limitation of choosing among candidate models by minimizing AIC_c. In practice, though, model selection by minimal AIC_c has proved effective, since it weeds out both overfitted models (which are heavily penalized) and underfitted models (which have low likelihood). However, this procedure is less than ideal when choosing among all possible subsets, as we explain next.

2.2 Selection bias in best-subset regression

For p ∈ {1, ..., P}, let M(p) denote the set in A of size p such that model M(p) attains the highest maximized likelihood of any A ∈ A with |A| = p. A naïve approach to subset selection would choose the AIC_c-minimizing subset M(p_AICc), where p_AICc minimizes n log σ̂²_{M(p)} + n(n + p)/(n − p − 2) over p ∈ {1, ..., P}. Note, however, that this criterion equals min_{|A|=p} (n log σ̂²_A) + n(n + p)/(n − p − 2), which is smaller on average than (8) for a fixed a priori size-p model, and hence is a downward-biased estimate of (2). Equivalently, whereas the overoptimism (5) is equal to C_AICc(p) = n(2p + 2)/(n − p − 2) for a fixed size-p model, it is larger when considering the best size-p model. We shall denote this larger value by C_ad(p).
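To make the naïve procedure concrete, here is a minimal R sketch, assuming the leaps package is available; the function name and return format are illustrative choices, not part of the authors' reams package. It finds the best subset of each size by exhaustive (branch-and-bound) search and then minimizes the fixed-penalty AIC_c of (8) over p, exactly the procedure whose selection bias is described above.

```r
## Naive best-subset AIC_c minimization (Section 2.2): a sketch
library(leaps)

naive_best_subset_aicc <- function(X, y) {
  n <- length(y)
  P1 <- ncol(X)                                 # number of candidate predictors (P - 1)
  search <- summary(regsubsets(X, y, nvmax = P1, method = "exhaustive"))
  p <- rowSums(search$which)                    # model size, counting the intercept
  sigma2_hat <- search$rss / n                  # MLE of sigma^2 for each best size-p model
  aicc <- n * log(sigma2_hat) + n * (n + p) / (n - p - 2)
  # score the null (intercept-only) model as well, p = 1
  aicc_null <- n * log(mean((y - mean(y))^2)) + n * (n + 1) / (n - 3)
  if (aicc_null <= min(aicc)) return(character(0))             # null model wins
  colnames(search$which)[search$which[which.min(aicc), ]][-1]  # names of selected predictors
}
```

As argued above, the penalty n(n + p)/(n − p − 2) used here is calibrated for a single a priori model of size p, so this search is tilted away from the null model; the resampling-based criteria below replace it with an estimate of C_ad(p).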

The difference C_AICc(p) − C_ad(p) can be thought of as the selection bias associated with using (7) to estimate the overoptimism of the selected model of size p. Of course, this selection bias would have no impact on model selection if it did not depend on p. But it does: in particular, it equals zero for p ∈ {1, P}, since there are only one size-1 (null) model and one size-P (full) model, but it is negative for 1 < p < P, so that AIC_c minimization will be tilted against the null model. This situation is somewhat akin to multiple hypothesis testing (cf. George and Foster, 2000): with many candidate predictors, even if the null model is true, there will be an unacceptably high probability of choosing a non-null model, unless some sort of correction is applied. An appropriate correction is to use an information criterion (6) in which, instead of C_AICc(p), we take Ĉ to be an estimate of C_ad(p). This is the strategy pursued by the resampling-based information criteria described in the next two sections.

3 The extended (bootstrap) information criterion

3.1 The fixed-model case

Ishiguro et al. (1997) propose a nonparametric bootstrap approach to estimating C in a much more general setting than model (1). (See also Konishi and Kitagawa (1996), and the bootstrap model selection criterion of Shao (1996), which is based on prediction error loss rather than likelihood.) Suppose we sample n pairs (y_i, x_i) from the data, with replacement, B times, and denote the bth bootstrap data set thus generated by (y*_b, X*_b) and the associated MLEs by β̂*_b, σ̂*²_b. Ishiguro et al.'s extended information criterion (EIC) uses

    Ĉ_boot = (1/B) Σ_{b=1}^B [ −2ℓ(β̂*_b, σ̂*²_b | y, X) + 2ℓ(β̂*_b, σ̂*²_b | y*_b, X*_b) ]    (9)

as an estimate of the overoptimism (4). Intuitively, bootstrapping creates simulated replicates of the data-generating process in which the empirical distribution of (y*_b, X*_b) replaces that of (y, X), whereas the empirical distribution of (y, X) replaces the population distribution; these correspondences motivate using (9) to estimate (4). In the linear model setting this reduces to estimating E(‖y⁺ − Xβ̂‖²/σ̂²), the first term of (5), by (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b, so that (9) and (6) yield

    Ĉ_boot = (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b − n,    (10)

    EIC = n log σ̂² + (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b.    (11)
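A minimal R sketch of the fixed-model EIC of (10)–(11), resampling (y_i, x_i) pairs nonparametrically; the function name and the choice B = 200 are illustrative assumptions, not the authors' reams implementation.

```r
## Fixed-model EIC via the nonparametric bootstrap, eqs. (10)-(11): a sketch
eic_fixed <- function(X, y, B = 200) {
  n <- length(y)
  sigma2_hat <- mean(residuals(lm(y ~ X))^2)    # full-data MLE of sigma^2
  loss_on_full <- replicate(B, {
    idx <- sample(n, replace = TRUE)            # resample (y_i, x_i) pairs
    fit_b <- lm(y[idx] ~ X[idx, , drop = FALSE])
    sigma2_b <- mean(residuals(fit_b)^2)        # bootstrap-sample MLE of sigma^2
    # scaled squared error of the bootstrap fit, evaluated on the ORIGINAL data
    sum((y - cbind(1, X) %*% coef(fit_b))^2) / sigma2_b
  })
  c(C_boot = mean(loss_on_full) - n,                       # eq. (10)
    EIC    = n * log(sigma2_hat) + mean(loss_on_full))     # eq. (11)
}
```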

3.2 The best-subset case

To correct for the selection bias problem described in Section 2.2, one can modify (11) to define an information criterion associated with selection of the best subset of size p. Whereas the bootstrap overoptimism estimate (10) is appropriate for a fixed model, a natural extension to the best-subset case is to estimate C_ad(p) by

    Ĉ_ad,boot(p) = (1/B) Σ_{b=1}^B ‖y − X_{M*_b(p)} β̂*_{b;M*_b(p)}‖² / σ̂*²_{b;M*_b(p)} − n,    (12)

where M*_b(p) is the best subset of size p for the bth resampled data set, and β̂*_{b;M*_b(p)} and σ̂*²_{b;M*_b(p)} are the MLEs from the corresponding model for that data set. Substituting into the generic criterion (6) gives an adaptive EIC

    EIC_ad(p) = n log σ̂²_{M(p)} + (1/B) Σ_{b=1}^B ‖y − X_{M*_b(p)} β̂*_{b;M*_b(p)}‖² / σ̂*²_{b;M*_b(p)},    (13)

which is to be minimized with respect to p. To understand the motivation for (12), (13), recall that in bootstrap estimation of the overoptimism, each bootstrap data set takes the place of the original data, while the original data stands in for the new data set. Estimate (12) substitutes the bootstrap data for the original data both for selection of the best size-p model M*_b(p) and for the resulting parameter estimates β̂*_{b;M*_b(p)}, σ̂*²_{b;M*_b(p)}, in order to capture, and compensate for, the overoptimism arising from the entire procedure of choosing the best size-p subset. See Appendix A.1 for an alternative definition of best-subset EIC.

Criterion (13) is defined only for the best size-p model for each p. To be able to assign a score to model A for any set A ∈ A, we extend the definition by applying penalty (12) not only to the best p-term model but to every such model, and thus obtain

    EIC_ad(A) = n log σ̂²_A + (1/B) Σ_{b=1}^B ‖y − X_{M*_b(|A|)} β̂*_{b;M*_b(|A|)}‖² / σ̂*²_{b;M*_b(|A|)}.    (14)

The best size-p model M*_b(p) for each bootstrap data set can be computed efficiently using the branch-and-bound algorithm implemented in the package leaps (Lumley, 2009) for R (R Development Core Team, 2010). This algorithm would be difficult to integrate with parametric bootstrap samples; hence our preference for nonparametric bootstrapping.
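The adaptive criterion (12)–(13) can be sketched as follows in R, performing a fresh best-subset search within each bootstrap sample via leaps; again, the function name and structure are illustrative and this is not the reams implementation. The null model (p = 1) is omitted for brevity.

```r
## Adaptive EIC, eqs. (12)-(13): a sketch
library(leaps)

eic_adaptive <- function(X, y, B = 200) {
  n <- length(y); P1 <- ncol(X)
  orig <- summary(regsubsets(X, y, nvmax = P1, method = "exhaustive"))
  loss <- matrix(NA_real_, B, P1)               # loss[b, j]: best size-(j + 1) model, sample b
  for (b in seq_len(B)) {
    idx <- sample(n, replace = TRUE)
    boot <- summary(regsubsets(X[idx, ], y[idx], nvmax = P1, method = "exhaustive"))
    for (j in seq_len(P1)) {
      sel <- boot$which[j, -1]                  # best size-(j + 1) subset M*_b(p), p = j + 1
      fit_b <- lm(y[idx] ~ X[idx, sel, drop = FALSE])
      sigma2_b <- mean(residuals(fit_b)^2)
      pred <- cbind(1, X[, sel, drop = FALSE]) %*% coef(fit_b)
      loss[b, j] <- sum((y - pred)^2) / sigma2_b    # evaluated on the ORIGINAL data
    }
  }
  sigma2_Mp <- orig$rss / n                     # MLE sigma^2 of best size-p model, original data
  data.frame(p         = 2:(P1 + 1),            # model size including the intercept
             C_ad_boot = colMeans(loss) - n,                      # eq. (12)
             EIC_ad    = n * log(sigma2_Mp) + colMeans(loss))     # eq. (13)
}
```

In practice one would also score the null model, and (14) can be used to assign EIC_ad values to arbitrary subsets rather than only to the best subset of each size.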

3.3 Small-sample performance of EIC

Returning to the fixed-model setting of Section 3.1, we undertook to study the performance of the bootstrap overoptimism estimate Ĉ_boot (10) when the true expected loss (2) is known: namely, when model (1) is correct and the errors are independent and identically distributed (IID) normal. As noted above, we then have C = n(2p + 2)/(n − p − 2). We were thus able to compare the penalty Ĉ_boot to this gold standard in 300 Monte Carlo simulations for each of the values of n and p displayed in Table 1 (see Appendix B for details of the simulation procedure). For fixed n, E(Ĉ_boot) − C is seen to increase with p, so that EIC will be biased toward smaller models.

Table 1  Median (and range from 5th to 95th percentile) of E(Ĉ_boot) − C in 300 simulations.

The positive bias of Ĉ_boot can be better understood by rewriting Ĉ_boot as

    (1/B) Σ_{b=1}^B Σ_{i=1}^n (1 − r_i^b)(y_i − x_iᵀβ̂*_b)²/σ̂*²_b
      = (1/B) Σ_{b=1}^B [ Σ_{i: r_i^b=0} (y_i − x_iᵀβ̂*_b)²/σ̂*²_b − Σ_{i: r_i^b>1} (r_i^b − 1)(y_i − x_iᵀβ̂*_b)²/σ̂*²_b ],    (15)

where r_i^b is the number of occurrences of the ith observation (x_i, y_i) in the bth bootstrap sample. The first term within the brackets is the sum of scaled squared errors for those observations not included in the bth bootstrap sample; the second is a weighted sum of scaled squared errors for observations which are included multiple times, and which therefore have high influence on the estimate β̂*_b. Consequently, the summands in this second term will tend to be atypically small, making (15) a positively biased estimate of the overoptimism. The key point for our purposes is that this bias increases with p, especially when n is small (in agreement with the EIC results of Konishi and Kitagawa (2008), p. 209), so that EIC will tend to underfit in the fixed-model case. The simulation results in Section 5 suggest that the same is true of EIC_ad in the best-subset case.

4 A cross-validation information criterion

4.1 Motivation and initial definition

We have seen that EIC seeks to gauge the extent to which the original-data likelihood overestimates the expected likelihood for an independent data set, using bootstrap samples as surrogates for the original data, and the full data as a surrogate for an independent data set. But clearly the full-data outcomes are not independent of the bootstrap-sample outcomes, and the argument at the end of the previous section suggests that this lack of independence is what makes EIC biased in small samples.

This led us to consider an alternative: using cross-validation (CV) to estimate the overoptimism. We remark that the bootstrap .632 and .632+ estimators of prediction error (Efron, 1983; Efron and Tibshirani, 1997; cf. Efron, 2004), which seek to improve upon CV, use bootstrap samples in a similar spirit to CV: essentially, the data points excluded from bootstrap samples are used to correct for overoptimism. However, this work is concerned with a different class of loss functions and with prediction error for a fixed model, and it is not clear how to adapt the .632 and .632+ estimators to our context of selecting among all possible subsets.

Let

    y = (y_1ᵀ, ..., y_Kᵀ)ᵀ  and  X = (X_1ᵀ, ..., X_Kᵀ)ᵀ,

where, for k = 1, ..., K, y_k ∈ R^{n_v} and X_k has n_v rows, with n_v = n/K. Let (y_{−k}, X_{−k}) denote the kth training set, i.e. the n_t = n − n_v observations not included in (y_k, X_k) (the kth validation set), and let β̂_{−k}, σ̂²_{−k} be the MLEs obtained by fitting model (1) to (y_{−k}, X_{−k}). We can then define a CV-based analogue of the overoptimism

    C* ≡ E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖² / σ̂²_{−k} ) − n,    (16)

where the expectation is with respect to the joint distribution of (y, X), along with its natural unbiased estimate

    Ĉ_CV = Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖² / σ̂²_{−k} − n,

a CV-based analogue of the EIC overoptimism estimate (10). Two remarks are in order:

1. C* is approximately equal to

    E( ‖y⁺ − X⁺ β̂‖² / σ̂² ) − n,    (17)

where (y⁺, X⁺) is an independent set of n observations drawn from the same joint distribution, and the expectation is with respect to the distribution of (y, X, y⁺, X⁺). But (16) is slightly larger than (17) because the estimates β̂_{−k}, σ̂²_{−k} are based on n_t < n observations, inflating the mean prediction error.

2. (17) is similar to the overoptimism C (5), but the definition of C assumes the original predictor matrix X is fixed and we draw an independent set of outcomes y⁺ from the same conditional (on X) distribution as y.

In order to apply C* to selecting the model dimension, we will need to express it as C*(p), an explicit function of p, analogous to the overoptimism formula C_AICc(p) of (7). Since C* depends on the distribution of X as well as that of y, our derivation of C*(p) (see below, Section 4.2) will rely on assumptions regarding the predictor distribution.
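For a fixed model, the estimate Ĉ_CV just defined can be computed as in the following R sketch; the equal-sized random folds and the function name are illustrative assumptions.

```r
## Fixed-model CV overoptimism estimate C-hat_CV (Section 4.1): a sketch
chat_cv <- function(X, y, K = 10) {
  n <- length(y)
  fold <- sample(rep(seq_len(K), length.out = n))   # random fold labels of (roughly) equal size
  total <- 0
  for (k in seq_len(K)) {
    train <- fold != k
    fit <- lm(y[train] ~ X[train, , drop = FALSE])
    sigma2_k <- mean(residuals(fit)^2)              # MLE sigma^2 on the k-th training set
    pred <- cbind(1, X[!train, , drop = FALSE]) %*% coef(fit)
    total <- total + sum((y[!train] - pred)^2) / sigma2_k
  }
  total - n                                         # sum over folds minus n
}
```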

Similarly to (12), we can define adaptive extensions of Ĉ_CV and C* for M(p), the best model of dimension p:

    Ĉ_ad,CV(p) = Σ_{k=1}^K ‖y_k − X_{k;M_{−k}(p)} β̂_{−k;M_{−k}(p)}‖² / σ̂²_{−k;M_{−k}(p)} − n,
    C*_ad(p) = E[Ĉ_ad,CV(p)],

where M_{−k}(p) is the best size-p model for the kth training set; β̂_{−k;M_{−k}(p)}, σ̂²_{−k;M_{−k}(p)} are the associated MLEs for this training set; and X_{k;M_{−k}(p)} comprises the corresponding columns of X_k.

The function C*(p), given explicitly in (23), is strictly increasing on its domain and hence invertible, so we have C_AICc(p) = C_AICc ∘ C*⁻¹ ∘ C*(p). This suggests the following plug-in estimator of the overoptimism of M(p):

    Ĉ_ad(p) ≡ C_AICc ∘ C*⁻¹[Ĉ_ad,CV(p)]    (18)
            = C_AICc(df_p) = n(2 df_p + 2)/(n − df_p − 2),

where

    df_p = C*⁻¹[Ĉ_ad,CV(p)].    (19)

One can think of df_p as the effective degrees of freedom, or effective model dimension, of M(p). As we show in Appendix C, generally speaking df_p ≥ p.

The overoptimism estimate (18) leads to

    CVIC(p) = n log σ̂²_{M(p)} + n + Ĉ_ad(p) = n log σ̂²_{M(p)} + n(n + df_p)/(n − df_p − 2),    (20)

i.e., AIC_c for M(p), but with p replaced by df_p in the penalty, as our model selection criterion. Like (13), this criterion is defined only for the best model of each size; but it can be extended to all candidate models, analogously to (14).
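Putting the pieces together, the sketch below computes Ĉ_ad,CV(p) by K-fold CV with a best-subset search on each training set, converts it to the effective dimension df_p of (19) using the formula for C*(p) given in (23) below, and returns CVIC(p) as in (20). It assumes leaps and equal-sized folds, and is an illustration rather than the reams implementation.

```r
## CVIC, eqs. (18)-(20), using C*(p) from eq. (23): a sketch
library(leaps)

cvic <- function(X, y, K = 10) {
  n <- length(y); P1 <- ncol(X)
  nt <- n - n / K                                   # training-set size (equal folds assumed)
  fold <- sample(rep(seq_len(K), length.out = n))
  cv_sum <- numeric(P1)                             # indexed by p - 1 = 1, ..., P1
  for (k in seq_len(K)) {
    tr <- fold != k
    best_k <- summary(regsubsets(X[tr, ], y[tr], nvmax = P1, method = "exhaustive"))
    for (j in seq_len(P1)) {
      sel <- best_k$which[j, -1]                    # best size-(j + 1) subset on k-th training set
      fit <- lm(y[tr] ~ X[tr, sel, drop = FALSE])
      sigma2_k <- mean(residuals(fit)^2)
      pred <- cbind(1, X[!tr, sel, drop = FALSE]) %*% coef(fit)
      cv_sum[j] <- cv_sum[j] + sum((y[!tr] - pred)^2) / sigma2_k
    }
  }
  C_ad_cv <- cv_sum - n                             # adaptive CV overoptimism estimate
  # invert C*: n + C*(df) = n (nt + 1)(nt - 2) / (nt - df - 2)^2, so
  df_p <- nt - 2 - sqrt(n * (nt + 1) * (nt - 2) / (n + C_ad_cv))      # eq. (19)
  orig <- summary(regsubsets(X, y, nvmax = P1, method = "exhaustive"))
  sigma2_Mp <- orig$rss / n
  data.frame(p = 2:(P1 + 1), df_p = df_p,
             CVIC = n * log(sigma2_Mp) + n * (n + df_p) / (n - df_p - 2))   # eq. (20)
}
```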

4.2 Derivation of C*(p)

To evaluate (18) and hence (20), we must derive the aforementioned function C*(p). We begin by writing H = (h_ij)_{1≤i,j≤n} ≡ X(XᵀX)⁻¹Xᵀ, partitioned into blocks (H_kl)_{1≤k,l≤K}, where each block is n_v × n_v. We can then state the following result, which gives the expectation of Ĉ_CV, conditional on the K folds that make up X. Theorems 1 and 2 are proved in Appendix D.

Theorem 1  Suppose model (1) holds where X is an n × p matrix and ε ~ N(0, σ²I_n). Assume that n_t > p + 2 and that I_{n_v} − H_kk is invertible for each k. Then

    E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖² / σ̂²_{−k} | X_1, ..., X_K ) = [n_t / (n_t − p − 2)] Σ_{k=1}^K tr[(I_{n_v} − H_kk)⁻¹].    (21)

To derive C*(p), i.e. an expression for E(Ĉ_CV) that depends only on the model dimension p but not on X, we require distributional assumptions on the rows of X.

Theorem 2  Let X = (1 X̃), where x̃_1, ..., x̃_n, the rows of X̃, are IID (p − 1)-variate normal. Under the assumptions of Theorem 1, if X_{−k} has rank p for each k, then

    E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖² / σ̂²_{−k} ) = n(n_t + 1)(n_t − 2) / (n_t − p − 2)².    (22)

By (16) and (22), we may take

    C*(p) = n[ (n_t + 1)(n_t − 2) / (n_t − p − 2)² − 1 ] = n(n_t + 1)(n_t − 2)/(n_t − p − 2)² − n,    (23)

provided that the predictor vectors are IID multivariate normal. In practice this condition may not hold; but Appendix E offers evidence that (23) is a reasonable approximation in most realistic settings. We therefore adopt (23) as our formula for C*(p) in what follows.

4.3 Refining the criterion by constrained monotone smoothing

Three obvious limitations of CVIC (20) are:

1. It is based on an overoptimism estimate Ĉ_ad,CV(p) that is stochastic, unlike the fixed penalty C_AICc(p).
2. For p = 2, ..., P − 1, we may be willing to accept the stochastic estimate Ĉ_ad,CV(p), since no closed-form expression for C*_ad(p) exists. But as noted in Section 2.2, for p = 1, P, there is only one candidate model of size p, so C*_ad(p) reduces to C_AICc(p) = n(2p + 2)/(n − p − 2). In effect, for p = 1, P, df_p is just a noisy version of p, and CVIC is just a noisy version of AIC_c, and hence inferior to simply using AIC_c.
3. Whereas C*_ad is an increasing function of p, Ĉ_ad,CV need not be. This may cause CVIC to favor overfitting in some cases.

We can mitigate the first of these problems, and eliminate the other two, by means of the constrained penalized spline algorithm implemented in the R package mgcv (Wood, 2006). This algorithm allows us to compute a smooth function df(p), approximating df_p at p = 1, ..., P, such that (i) df(1) = 1 and

df(P) = P, and (ii) df is constrained to be monotonically increasing, by the method of Wood (1994). (Figure 8 displays the raw df_p plotted against the smoothed df(p) for a real data example.) We can then replace the raw CVIC (20) with the monotonic variant

    CVIC_mon(p) = n log σ̂²_{M(p)} + n[n + df(p)] / [n − df(p) − 2].

4.4 Summary of the proposed adaptive methods

We provide here a brief summary of the resampling-based information criteria considered above. Section 3 introduced Ishiguro et al.'s (1997) EIC, based on using bootstrap samples to estimate the overoptimism (5), and proposed the adaptive extension EIC_ad. In Section 3.3 we showed that the bootstrap tends to overpenalize larger models in small samples, and this motivated an alternative, cross-validatory approach to adaptive linear model selection. Section 4.1 defined the CVIC in terms of a quantity C*(p) which we derived in Section 4.2. In Section 4.3 we sought to overcome some of CVIC's limitations by applying constrained monotone smoothing to the effective degrees of freedom, resulting in the new criterion CVIC_mon.

A somewhat unappealing feature of the CVIC overoptimism estimate (18) is that it estimates C_ad(p) indirectly, by applying C_AICc ∘ C*⁻¹ to Ĉ_ad,CV(p), in contrast to the direct bootstrap estimate Ĉ_ad,boot(p) (12). On the other hand, whereas Ĉ_ad,boot(p) is an adaptive extension of Ĉ_boot, which we have shown to be a biased overoptimism estimator, Ĉ_ad,CV(p) is an adaptive extension of the unbiased estimator Ĉ_CV of C*(p), offering some hope that the resulting model selection criterion CVIC will outperform the bootstrap criterion EIC_ad. Appendix A.2 discusses two alternative CV-based approaches to subset selection.
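The reams package obtains the smoothed df(p) by the constrained penalized-spline method described in Section 4.3 (Wood, 1994, 2006), via mgcv. As a much simpler stand-in that conveys the idea, the sketch below pins the endpoints and applies base R isotonic regression; it is an illustrative simplification, not the method actually used.

```r
## Monotone smoothing of the raw effective dimensions df_p (Section 4.3):
## a crude stand-in for the paper's constrained penalized-spline fit.
smooth_df <- function(df_raw) {
  P <- length(df_raw)                  # df_raw[p] is the raw df_p, p = 1, ..., P
  df_raw[1] <- 1                       # endpoint constraints: df(1) = 1 ...
  df_raw[P] <- P                       # ... and df(P) = P
  fit <- isoreg(seq_len(P), df_raw)    # monotone (isotonic) regression
  pmin(pmax(fit$yf, 1), P)             # keep the smoothed values in [1, P]
}

cvic_mon <- function(n, sigma2_Mp, df_smooth) {
  # sigma2_Mp[p]: full-data MLE error variance of the best size-p model
  n * log(sigma2_Mp) + n * (n + df_smooth) / (n - df_smooth - 2)
}
```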

5 Simulation study

5.1 Setup

We conducted a simulation study to compare the performance of minimization of (1) AIC; (2) AIC_c; (3) BIC; (4) CIC; (5) EIC_ad; (6) CVIC; and (7) CVIC_mon. Four sets of 300 simulations were performed. Each set began by choosing a predictor matrix X = (x_1 ... x_50)ᵀ whose rows were independently generated from a multivariate normal distribution with mean zero and covariance matrix having (j, k) entry 0.7^{|j−k|}. (Note that here, as in Tibshirani and Knight's (1999) simulation study, X does not include a column of 1s, since the true intercept is 0.) Then, for each individual simulation, outcomes y_1, ..., y_50 were generated from the model y_i = x_iᵀβ + ε_i, where ε_1, ..., ε_50 are independent normal variates with mean 0 and variance σ² = 1. In the first set of simulations the true coefficient vector β ∈ R²⁰ was set to zero. For the remaining simulations, there were two sets of nonzero coefficients centered around the 5th and 15th predictors. These were set initially to β_{5+j} = β_{15+j} = h − |j| for |j| < h, where h was 1, 2, 3 in the second, third, and fourth sets of simulations, resulting in 2, 6, and 10 nonzero coefficients. We then multiplied β by a constant chosen so that the final β would satisfy a prescribed value of R² ≡ βᵀXᵀXβ/(nσ² + βᵀXᵀXβ). This quantity is the theoretical R² used by Tibshirani and Knight (1999), i.e., the regression sum of squares divided by the expected total sum of squares for the true model; note that the more general expression given by Luo et al. (2006) would be required if the true model had a nonzero intercept. For CIC and EIC_ad, resampling was performed 40 times per simulation; for the two CVIC variants, leave-one-out CV was used.

5.2 Comparative performance

Figures 2–5 display the true coefficients for each set of simulations, along with the proportion of simulations in which each coefficient was included in the model (i.e., the detection rate for truly nonzero coefficients, and the false alarm rate for zero coefficients) for each method. Figure 6 shows boxplots of the model error ‖Xβ̂ − Xβ‖² (i.e., the mean prediction error minus σ²) for each method, in each set of simulations.

For the true models with 0 or 2 nonzero coefficients, the proposed methods EIC_ad, CVIC and CVIC_mon are the best performers, achieving near-perfect variable selection. The constrained smoothing makes CVIC_mon somewhat superior to CVIC, but EIC_ad attains the lowest model error of all the methods. For the models with 6 or 10 nonzero coefficients, the three proposed methods are again notably resistant to false alarms, but have difficulty detecting the truly nonzero coefficients; hence these methods' model error is no better than that of the other methods, and indeed EIC_ad, which tends to overpenalize large models (see Section 3.3), has the highest model error when the true model has 10 nonzero coefficients. Since the selection bias discussed in Section 2.2 tends to favor non-null over null models, it is unsurprising that our adaptive methods provide the greatest benefit when all or most of the true coefficients are zero.

CIC, like our proposed methods, aims to take into account the process of searching among numerous candidate models, but the simulation results suggest that its ability to protect against spurious predictors diminishes as the true model grows: CIC's false alarm rates are lower than AIC_c's when the true model is null, but higher for the larger models.
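A sketch of the data-generating design of Section 5.1 is given below. The target theoretical R² is left as a user-supplied argument (target_r2), since it is a tuning choice of the study, and MASS::mvrnorm is assumed available.

```r
## Simulation design of Section 5.1: a sketch
library(MASS)

make_sim_data <- function(n = 50, P = 20, h = 0, target_r2, sigma2 = 1) {
  Sigma <- 0.7^abs(outer(1:P, 1:P, "-"))       # cov(x_j, x_k) = 0.7^|j - k|
  X <- mvrnorm(n, mu = rep(0, P), Sigma = Sigma)
  beta <- numeric(P)
  if (h > 0) {                                 # h = 1, 2, 3 gives 2, 6, 10 nonzero coefficients
    for (j in seq(-(h - 1), h - 1)) beta[c(5, 15) + j] <- h - abs(j)
    # rescale so that beta' X'X beta / (n sigma^2 + beta' X'X beta) = target_r2
    q <- drop(t(beta) %*% crossprod(X) %*% beta)
    beta <- beta * sqrt(target_r2 * n * sigma2 / ((1 - target_r2) * q))
  }
  y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  list(X = X, y = y, beta = beta)
}
```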

Fig. 2  This figure and the three figures that follow show detection rates and false alarm rates for four true models with 20 candidate predictors. The top left subfigure shows the true coefficients, which are all zero in this case, while the other subfigures display the relative frequency of inclusion of each predictor, in 300 simulations, based on the seven methods listed in the text.

Table 2  Mean, over 100 simulated data sets, of the mean and standard deviation of the overoptimism estimates from 30 replicates of each of the proposed adaptive procedures. The last row shows C_AICc (7) for comparison. Note that the value 4.26 in the first column is the true overoptimism, and is recovered exactly by CVIC_mon. In the remaining columns, the true overoptimism C_ad is greater than C_AICc due to the effect of model selection.

              Number of predictors with nonzero coefficients
              0              2               6              10
EIC_ad        4.8 (1.3)      10.76 (1.5)     72.6 (5.05)
CVIC          3.7 (1.24)     7.86 (2.09)     (7.06)         (8.20)
CVIC_mon      4.26 (0)       12.8 (2.9)      (5.9)          (7.23)
AIC_c         4.26           8.89            19.51          32.43

5.3 Variability

A key criticism of resampling-based overoptimism estimators is that their inherent variability leads to unstable model selection. To investigate the variability of our estimators of C_ad, we performed another four sets of 100 simulations, using the exact same specifications for the sample size, predictors and outcomes as above. For each simulated data set, we computed 30 replicates of the EIC_ad overoptimism estimate for the true model size using 100 bootstrap samples, and 30 replicates of the CVIC and CVIC_mon overoptimism estimates using 10-fold CV; we then obtained the mean and standard deviation (SD) of the 30 estimates by each method. The means of these values over the 100 data sets are given in Table 2.
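For a single simulated data set, the replication scheme of Section 5.3 can be sketched as follows, reusing the illustrative eic_adaptive and cvic functions from the earlier sketches; p_true denotes the size (including intercept) of the true model, and the CVIC column reports the plug-in penalty C_AICc(df_p). Smoothing df_p across p before the plug-in step would give the corresponding CVIC_mon penalty.

```r
## Variability of the overoptimism estimates (Section 5.3): a sketch
overoptimism_spread <- function(X, y, p_true, reps = 30) {
  n <- length(y)
  est <- replicate(reps, {
    eic <- eic_adaptive(X, y, B = 100)           # 100 bootstrap samples per replicate
    cv  <- cvic(X, y, K = 10)                    # 10-fold CV per replicate
    df  <- cv$df_p[cv$p == p_true]
    c(EIC_ad = eic$C_ad_boot[eic$p == p_true],
      CVIC   = n * (2 * df + 2) / (n - df - 2))  # plug-in penalty C_AICc(df_p)
  })
  apply(est, 1, function(z) c(mean = mean(z), sd = sd(z)))
}
```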

Fig. 3  Detection and false alarm rates for the true coefficient vector shown at top left, i.e., two equal positive coefficients and all other coefficients equal to zero. Grey bars indicate detection rates for truly nonzero coefficients, and white bars give false alarm rates for zero coefficients.

Fig. 4  Detection and false alarm rates for the true coefficient vector shown at top left (six nonzero coefficients).

Fig. 5  Detection and false alarm rates for the true coefficient vector shown at top left (ten nonzero coefficients).

Fig. 6  Model error ‖Xβ̂ − Xβ‖² under the four simulation settings (0, 2, 6, and 10 nonzero coefficients), for the seven methods listed in the text.

The raw CVIC penalty is seen to have somewhat higher SD than that of EIC_ad, except in the column for 10 true predictors, where EIC_ad has much higher SD. Note that in that column the mean is much higher for EIC_ad than for CVIC; although the true value C_ad is unknown, the EIC_ad penalty seems to be positively biased here, in line with the fixed-model results in Table 1. The SD values for CVIC_mon suggest that the constrained monotone smoothing succeeds in reducing the penalty variability; the only exception is that with 2 true predictors, the SD is slightly higher for CVIC_mon than for raw CVIC, but here the mean is also notably higher.

6 Application: functional connectivity in the human brain

We now turn to the application outlined in the Introduction. Self-esteem and MASQ-GDD (depression) scores, and FC values for the 10 connections described above, were acquired in a sample of 43 participants scanned with resting-state fMRI (Biswal et al., 1995) at New York University. The regions of interest were defined, and FC was computed, as in Stark et al. (2008). Approximately optimal Box-Cox transformations were applied to the two psychological outcomes (the third power for self-esteem, and the logarithm for MASQ-GDD), which were then regressed on subsets of the 10 FC predictors. We compare the subset selection results for AIC, AIC_c, EIC_ad, and CVIC_mon.

For regression of self-esteem score on the 10 connections, there are 13 models, with 1–5 predictors, having somewhat lower AIC than the null model. The best model (AIC value 77.2, versus 77.7 for the second-best model and a higher value for the null model) includes the AMY-SUB, OFC-AMY, SUB-INS, and SUP-PCC connections. But since a number of models have AIC near the minimum, it seems reasonable to adopt a pluralistic approach that considers all of the near-optimal models and asks which predictors appear most often in them. Of the 13 models that outperform the null model, two of the above four connections stand out in terms of inclusion frequency: SUB-INS appears in all 13 models, while OFC-AMY occurs in all but two, suggesting that these two connections may be most strongly associated with self-esteem. However, AIC_c, EIC_ad and CVIC_mon choose the null model as the best model, suggesting that chance variation accounts for the AIC-based findings.

For the MASQ-GDD score, on the other hand, the adaptive criteria yield rather different results than either AIC or AIC_c. The number of models outscoring the null model is 78 for AIC and 56 for AIC_c, but only 3 for CVIC_mon (the models including PCC-AMY only, SUB-INS only, or both), and 1 for EIC_ad (the model including both PCC-AMY and SUB-INS). Figure 7 displays the 10 best models according to AIC_c, and the 3 models with lower CVIC_mon values than the null model. All four criteria choose the model with the PCC-AMY and SUB-INS connections as the best. Moreover, these two connections occur in all 10 of the lowest-AIC_c models, whereas no other connection occurs in more than 3 of these models. But whereas the AIC_c results indicate that one or two additional predictors might improve the model, the EIC_ad and CVIC_mon

Fig. 7  Information criterion values for the best models for MASQ-GDD score, based on AIC_c and on CVIC_mon. The set of line segments at a given y-coordinate indicates the connections included in the model attaining that value of the information criterion, and the dotted line in the right plot indicates CVIC_mon for the null model. (Candidate connections labelled on the vertical axis: AMY-PHG, AMY-SUB, INS-AMY, OFC-AMY, OFC-PFC, PCC-AMY, SUB-INS, SUB-SUP, SUP-AMY, SUP-PCC; horizontal axis: criterion value.)

results imply that such predictors would be spurious. The 10 lowest-AIC models (not shown in Figure 7), like the lowest-AIC_c models, consistently include PCC-AMY and SUB-INS, but the former models are generally even larger, including as many as 6 of the connections. Figure 8 shows the effective degrees of freedom obtained by the CVIC_mon method.

The fitted models regressing log-transformed GDD on PCC-AMY and/or SUB-INS suggest that both are positively related with depression: a one-standard-deviation increase in either connectivity score is associated with approximately a 10% increase in the GDD subscale. The coefficients of determination are roughly 11% for PCC-AMY alone, 13% for SUB-INS alone, and 25% for the model containing both.

7 Discussion

In the Introduction, we argued that our resampling approach to model selection should be particularly relevant when the predictor dimension p is not negligible compared with the sample size n. On the other hand, when p becomes too large to compute a score such as AIC for all subsets, it is more common to turn to other methods such as partial least squares (Helland, 1988), ridge regression (Hoerl and Kennard, 1970) or the lasso (Tibshirani, 1996). Indeed, such methods have been applied successfully in moderate-p applications. Nevertheless, when it is feasible to evaluate a score such as CVIC_mon for all subsets, there are advantages to doing so, including the fact that this approach provides a natural framework for obtaining not just a single model estimate but a collection of leading candidate models, as illustrated in our functional connectivity example (Section 6 above).

Fig. 8  CVIC_mon for the GDD model: effective model dimension, raw (df_p) and smoothed (df(p)), for the selected model of each size. The line of identity is shown in grey.

The above data analysis also demonstrates that the relatively low false-alarm rate of resampling-based information criteria makes them particularly useful for highly exploratory discovery science studies in which preventing spurious findings is of central importance. In other settings, however, these criteria's tendency to err toward non-detection of true predictors may represent a serious limitation. Another concern is the variability of our methods' overoptimism estimates, which cannot be reduced by the technique of Konishi and Kitagawa (1996) since that device is applicable only in the fixed-model case. In ongoing research, we are seeking ways to surmount these limitations.

To extend our resampling-based information criteria from linear to generalized linear models (GLMs), two difficulties must be surmounted. First, finding the best model for each resampled data set requires efficient subset selection algorithms, which are readily available for linear models (e.g., the above-cited leaps algorithm) but much less developed for GLMs. Second, the closed-form expressions for expected overoptimism used in the CVIC criterion are valid only for the linear case. The methods of Lawless and Singhal (1978) and Cerdeira et al. (2009) have enabled us to overcome the first difficulty and implement a logistic regression version of EIC_ad, but this criterion appears to suffer from a marked tendency to underfit. This preliminary finding adds to our motivation to overcome the second difficulty and extend CVIC to the generalized linear case.

The methods of this paper have been implemented in an R package called reams (resampling-based adaptive model selection), available at

Appendix A: Some alternatives to our adaptive criteria

A.1 An alternative form of best-subset EIC

An alternative to (13) would be

    n log σ̂²_{M(p)} + (1/B) Σ_{b=1}^B ‖y − X_{M(p)} β̂*_{b;M(p)}‖² / σ̂*²_{b;M(p)}.    (24)

This criterion differs from (13) in that, although new parameter estimates β̂*_{b;M(p)}, σ̂*²_{b;M(p)} are obtained for each bootstrap sample, we reuse the original-data selected subset M(p), rather than selecting a new subset M*_b(p) for each bootstrap sample as in (13). Criterion (24) is more computationally efficient than (13), and may be equivalent to the EIC_2 criterion of Konishi and Kitagawa (2008). However, because this criterion uses the bootstrap data as a surrogate for the real data only for parameter estimation, but not for model selection, it may not fully capture the overoptimism associated with parameter estimates from a selected model. Indeed, in simulations similar to those reported in Section 5, we found criterion (24) to be susceptible, like nonadaptive criteria, to a high rate of false detections.

A.2 Two alternative CV methods

An alternative to the CVIC (20), which at first glance may seem much simpler, is to compute the usual K-fold CV estimate of prediction error Σ_{k=1}^K ‖y_k − X_{k;A} β̂_{−k;A}‖² for every possible subset A, and choose the A for which this quantity is minimized. However, to the best of our knowledge, it is nontrivial to extend efficient all-subsets regression algorithms to perform all-subsets CV. Moreover, this approach would be biased in favor of model dimensions p for which the number of subsets of size p is highest.

Another alternative to CVIC is to minimize

    n log σ̂²_{M(p)} + n + Ĉ_ad,CV(p),    (25)

i.e., to define our criterion directly in terms of Ĉ_ad,CV(p) instead of the overoptimism estimate Ĉ_ad(p) = C_AICc ∘ C*⁻¹[Ĉ_ad,CV(p)] appearing in (20). However, we chose the latter quantity for consistency with the standard definitions (4), (5) of the overoptimism C. Had we opted for (25), it would not have been clear whether our criterion performed differently from AIC_c due to its adaptive nature, or due to substituting the estimand C* for the traditional C.

Appendix B: The simulation procedure of Section 3.3

Suppose the bth bootstrap sample consists of cases i_1^b, i_2^b, ..., i_n^b. The resampled data set can be written as y*_b = S_b y and X*_b = S_b X, where S_b is the sampling-with-replacement matrix (e_{i_1^b} e_{i_2^b} ... e_{i_n^b})ᵀ; here e_k is the n-dimensional vector with 1 in the kth position and 0 elsewhere. At first glance, Monte Carlo estimation of E(Ĉ_boot) for a given design matrix X

entails randomly generating (i) the bootstrap sampling matrix S_b and (ii) the error vector ε. However, since

    E(Ĉ_boot) = E( (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b ) − n
              = E[ E( (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b | S_b ) ] − n,    (26)

we can eliminate the second source of simulation error by means of an analytic expression for the inner expectation in (26). We do this in two steps: expressing the fractional expression in (26) as a quotient of quadratic forms, and exploiting a formula for the expectation of such a quotient.

Step 1. It is easily seen that S_bᵀS_b = D_b ≡ diag(r_1^b, r_2^b, ..., r_n^b), where r_k^b is the number of occurrences of k in the bth bootstrap sample. If fewer than p of r_1^b, r_2^b, ..., r_n^b are positive, i.e., if the bootstrap sample contains fewer than p distinct cases, then σ̂*²_b = 0 and EIC is undefined. On the other hand, if the bootstrap sample contains at least p distinct cases, then XᵀD_bX would ordinarily be nonsingular. Assuming this to be the case, one can show that ‖y − Xβ̂*_b‖²/σ̂*²_b = (n yᵀQ_bᵀQ_b y)/(yᵀQ_bᵀD_bQ_b y), where Q_b = I − X(XᵀD_bX)⁻¹XᵀD_b. If X represents a correct model, i.e. (1) holds, then since Q_bX = 0, we obtain

    ‖y − Xβ̂*_b‖²/σ̂*²_b = n εᵀQ_bᵀQ_b ε / (εᵀQ_bᵀD_bQ_b ε).    (27)

Step 2. If ε is a vector of n IID normal variates with mean zero, A is a symmetric n × n matrix, and B is a positive semidefinite n × n matrix with singular value decomposition PΛPᵀ, then by Theorem 6 of Magnus (1986),

    E( εᵀAε / εᵀBε ) = ∫₀^∞ |I + 2tΛ|^{−1/2} tr[(I + 2tΛ)⁻¹ PᵀAP] dt,    (28)

provided the expectation exists. By (27), the inner expectation in (26) is given by this integral with A = n Q_bᵀQ_b, B = Q_bᵀD_bQ_b. One can thus estimate E(Ĉ_boot) for a given X by numerically computing the integral (28) for each of a large set of bootstrap sampling matrices S_b. In our simulation, for each (n, p), we generated 300 design matrices X by sampling each element independently from the standard normal distribution, and drew a single bootstrap sample for each X.

Appendix C: An explicit expression for df_p

We attempt here to provide some intuition regarding df_p. By (19), C*(df_p) = Ĉ_ad,CV(p); (23) then leads to

    df_p = n_t − sqrt{ n(n_t + 1)(n_t − 2) / [n + Ĉ_ad,CV(p)] } − 2,

or alternatively

    df_p = p(1/κ) + (1 − 1/κ)(n_t − 2),    (29)

where κ = sqrt{ [n + Ĉ_ad,CV(p)] / [n + C*(p)] }, i.e., the square root of the ratio of Σ_{k=1}^K ‖y_k − X_{k;M_{−k}(p)} β̂_{−k;M_{−k}(p)}‖²/σ̂²_{−k;M_{−k}(p)} to E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} ) as given by (22). The convex combination formula (29) makes it clear that df_p ≥ p if and only if Ĉ_ad,CV(p) ≥ C*(p), as would ordinarily be the case.

Appendix D: Proofs of Theorems 1 and 2

D.1 Theorem 1

We suppress the conditioning on X_1, ..., X_K in what follows. Since β̂_{−k} and σ̂²_{−k} are independent, so are the numerator and denominator on the left side of (21). Therefore, since σ̂²_{−k} ~ σ²χ²_{n_t − p}/n_t,

    E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} ) = Σ_{k=1}^K E(1/σ̂²_{−k}) E(‖y_k − X_k β̂_{−k}‖²)
                                               = [n_t / ((n_t − p − 2)σ²)] Σ_{k=1}^K E(‖y_k − X_k β̂_{−k}‖²).

It therefore suffices to show that

    Σ_{k=1}^K E(‖y_k − X_k β̂_{−k}‖²) = σ² Σ_{k=1}^K tr[(I_{n_v} − H_kk)⁻¹].    (30)

To prove (30) we use the identity y_k − X_k β̂_{−k} = (I_{n_v} − H_kk)⁻¹(y_k − X_k β̂), the generalization to K-fold CV of a well-known result for leave-one-out CV (e.g., Reiss et al., 2010). Combining this with y − Xβ̂ = (I_n − H)ε implies that the left side of (30) equals E[εᵀ(I_n − H)B(I_n − H)ε] = σ² tr[B(I_n − H)], where B is the block-diagonal matrix

    B = diag[ (I_{n_v} − H_11)⁻², (I_{n_v} − H_22)⁻², ..., (I_{n_v} − H_KK)⁻² ].

Clearly tr[B(I_n − H)] = Σ_{k=1}^K tr[(I_{n_v} − H_kk)⁻¹], so (30) follows.

D.2 Theorem 2

Since E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} ) = E[ E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} | X_1, ..., X_K ) ], (21) implies that it suffices to prove

    Σ_{k=1}^K E[ tr{(I_{n_v} − H_kk)⁻¹} ] = n(n_t + 1)(n_t − 2) / [n_t(n_t − p − 2)].    (31)

By the Sherman–Morrison–Woodbury lemma (e.g., Harville, 2008) and the fact that H_kk = X_k(XᵀX)⁻¹X_kᵀ,

    tr[(I_{n_v} − H_kk)⁻¹] = tr[ I_{n_v} + X_k(X_{−k}ᵀX_{−k})⁻¹X_kᵀ ]
                           = n_v + Σ_{i ∈ V_k} x_iᵀ(X_{−k}ᵀX_{−k})⁻¹x_i
                           = n_v + Σ_{i ∈ V_k} (1  x̃_iᵀ) [ n_t , n_t x̄_{−k}ᵀ ; n_t x̄_{−k} , X̃_{−k}ᵀX̃_{−k} ]⁻¹ (1 ; x̃_i),

where V_k = {i : x_i belongs to the kth validation set}, X̃_{−k} comprises the last p − 1 columns of X_{−k}, and x̄_{−k} is the vector of column means of X̃_{−k}. By a standard formula for the inverse of a partitioned matrix, the above yields

    tr[(I_{n_v} − H_kk)⁻¹] = n_v + Σ_{i ∈ V_k} [ 1/n_t + x̄_{−k}ᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹x̄_{−k} − 2 x̃_iᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹x̄_{−k} + x̃_iᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹x̃_i ],

where X̃ᶜ_{−k} is the column-centered version of X̃_{−k}. If x̄_{−k} = μ + η̄_{−k} and x̃_i = μ + η_i, where μ is the mean of the predictor distribution, then some algebra leads to

    E[ tr{(I_{n_v} − H_kk)⁻¹} ] = n_v + Σ_{i ∈ V_k} [ 1/n_t + E{ η̄_{−k}ᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹η̄_{−k} } + E{ η_iᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹η_i } ].

The two quadratic forms above are distributed as 1/[n_t(n_t − 1)] times the Hotelling T²(p, n_t − 1) distribution and as 1/(n_t − 1) times the T²(p, n_t − 1) distribution, respectively. Hence

    E[ tr{(I_{n_v} − H_kk)⁻¹} ] = n_v + Σ_{i ∈ V_k} [ 1/n_t + { 1/[n_t(n_t − 1)] + 1/(n_t − 1) } (n_t − 1)p/(n_t − p − 2) ].

Summing over k leads to (31).

Appendix E: Robustness of (22)

E.1 Expectation of n + Ĉ_CV in general

Our formula for C*(p) is based on (22), which gives the expectation of n + Ĉ_CV = Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} assuming IID multivariate normal predictor vectors. Here we consider the robustness of (22) to departures from this distribution, for the most popular form of CV: leave-one-out CV, i.e., K = n. In this case the right-hand expressions in (21) and (22) reduce to

    [(n − 1)/(n − p − 3)] Σ_{i=1}^n 1/(1 − h_ii)   and   n²(n − 3)/(n − p − 3)²,

respectively. Let r_{n,p}(X) be the ratio of these two values, i.e.,

    r_{n,p}(X) = [(n − 1)(n − p − 3) / (n²(n − 3))] Σ_{i=1}^n 1/(1 − h_ii).    (32)

The ratio of the actual expectation of n + Ĉ_CV to that given by Theorem 2 is then ∫ r_{n,p}(X) dX, where the integration is with respect to the distribution of the random matrix X (if one wishes to view X as fixed, this distribution becomes a point mass). We then have the following result.

Proposition 1  Let h_max = h_max(X) be the largest diagonal element of H. If the assumptions of Theorem 2 hold and h_max < 1, then

    (n − 1)(n − p − 3) / [(n − 3)(n − p)] ≤ r_{n,p}(X) ≤ c_X (n − 1)(n − p − 3) / [n(n − 3)],

where c_X = min{ 1/(1 − h_max), 1 + p/[n(1 − h_max)²], 1 + p/n + p h_max/[n(1 − h_max)³] }.

25 24 Philip T. Reiss et al. Lower bound Upper bound h max h max p/n p/n Fig. 9 Contour plots, for n = 00, of the bounds imposed by Proposition on the ratio r n,p(x). Note that points below the line of identity are excluded, since h max must be at least p/n. We can gain insight into the practical implications of Proposition by plotting the above bounds on r n,p(x) with respect to p/n and h max. The bounds depend only weakly on n; Figure 9 displays them for n = 00. For p not too large, the lower bound is just slightly below, and the random quantity h max will ordinarily have most of its mass in the region where the upper bound is also near ; hence r n,p(x)dx, i.e., (22) should provide a good approximation. The bounds deviate markedly from only when either p/n or h max is very high, i.e., either the model dimension is very large or there are inordinately high-leverage observations two scenarios in which the entire enterprise of linear modeling is generally unreliable. E.2 Proof of Proposition By (32), it suffices to show that p/n n n i= h ii c X. The first inequality holds since the first expression is the harmonic mean, whereas the second expression is the arithmetic mean, of,...,. The second inequality is derived h h nn from the following three inequalities: n i= n ; h ii h max n i= h ii = n i= < n + [ + h ii ( h ii )2 p ( h max) 2 ; ] where 0 < h ii < h ii for each i


More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Institute of Statistics and Econometrics Georg-August-University Göttingen Department of Statistics

More information

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17 Model selection I February 17 Remedial measures Suppose one of your diagnostic plots indicates a problem with the model s fit or assumptions; what options are available to you? Generally speaking, you

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Chapter 7: Model Assessment and Selection

Chapter 7: Model Assessment and Selection Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has

More information

Performance Evaluation and Comparison

Performance Evaluation and Comparison Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Model Choice Lecture 2

Model Choice Lecture 2 Model Choice Lecture 2 Brian Ripley http://www.stats.ox.ac.uk/ ripley/modelchoice/ Bayesian approaches Note the plural I think Bayesians are rarely Bayesian in their model selection, and Geisser s quote

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Multivariate Regression Generalized Likelihood Ratio Tests for FMRI Activation

Multivariate Regression Generalized Likelihood Ratio Tests for FMRI Activation Multivariate Regression Generalized Likelihood Ratio Tests for FMRI Activation Daniel B Rowe Division of Biostatistics Medical College of Wisconsin Technical Report 40 November 00 Division of Biostatistics

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection

Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection Junfeng Shang, and Joseph E. Cavanaugh 2, Bowling Green State University, USA 2 The University of Iowa, USA Abstract: Two

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Correction for Tuning Bias in Resampling Based Error Rate Estimation

Correction for Tuning Bias in Resampling Based Error Rate Estimation Correction for Tuning Bias in Resampling Based Error Rate Estimation Christoph Bernau & Anne-Laure Boulesteix Institute of Medical Informatics, Biometry and Epidemiology (IBE), Ludwig-Maximilians University,

More information

STAT 730 Chapter 5: Hypothesis Testing

STAT 730 Chapter 5: Hypothesis Testing STAT 730 Chapter 5: Hypothesis Testing Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 28 Likelihood ratio test def n: Data X depend on θ. The

More information

x 21 x 22 x 23 f X 1 X 2 X 3 ε

x 21 x 22 x 23 f X 1 X 2 X 3 ε Chapter 2 Estimation 2.1 Example Let s start with an example. Suppose that Y is the fuel consumption of a particular model of car in m.p.g. Suppose that the predictors are 1. X 1 the weight of the car

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is

More information

Tuning Parameter Selection in L1 Regularized Logistic Regression

Tuning Parameter Selection in L1 Regularized Logistic Regression Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2012 Tuning Parameter Selection in L1 Regularized Logistic Regression Shujing Shi Virginia Commonwealth University

More information

Bias Correction of Cross-Validation Criterion Based on Kullback-Leibler Information under a General Condition

Bias Correction of Cross-Validation Criterion Based on Kullback-Leibler Information under a General Condition Bias Correction of Cross-Validation Criterion Based on Kullback-Leibler Information under a General Condition Hirokazu Yanagihara 1, Tetsuji Tonda 2 and Chieko Matsumoto 3 1 Department of Social Systems

More information

Testing Some Covariance Structures under a Growth Curve Model in High Dimension

Testing Some Covariance Structures under a Growth Curve Model in High Dimension Department of Mathematics Testing Some Covariance Structures under a Growth Curve Model in High Dimension Muni S. Srivastava and Martin Singull LiTH-MAT-R--2015/03--SE Department of Mathematics Linköping

More information

Prediction & Feature Selection in GLM

Prediction & Feature Selection in GLM Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

An Unbiased C p Criterion for Multivariate Ridge Regression

An Unbiased C p Criterion for Multivariate Ridge Regression An Unbiased C p Criterion for Multivariate Ridge Regression (Last Modified: March 7, 2008) Hirokazu Yanagihara 1 and Kenichi Satoh 2 1 Department of Mathematics, Graduate School of Science, Hiroshima University

More information

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap University of Zurich Department of Economics Working Paper Series ISSN 1664-7041 (print) ISSN 1664-705X (online) Working Paper No. 254 Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and

More information

1 Overview. Coefficients of. Correlation, Alienation and Determination. Hervé Abdi Lynne J. Williams

1 Overview. Coefficients of. Correlation, Alienation and Determination. Hervé Abdi Lynne J. Williams In Neil Salkind (Ed.), Encyclopedia of Research Design. Thousand Oaks, CA: Sage. 2010 Coefficients of Correlation, Alienation and Determination Hervé Abdi Lynne J. Williams 1 Overview The coefficient of

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

An Akaike Criterion based on Kullback Symmetric Divergence in the Presence of Incomplete-Data

An Akaike Criterion based on Kullback Symmetric Divergence in the Presence of Incomplete-Data An Akaike Criterion based on Kullback Symmetric Divergence Bezza Hafidi a and Abdallah Mkhadri a a University Cadi-Ayyad, Faculty of sciences Semlalia, Department of Mathematics, PB.2390 Marrakech, Moroco

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

Resampling techniques for statistical modeling

Resampling techniques for statistical modeling Resampling techniques for statistical modeling Gianluca Bontempi Département d Informatique Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di Resampling techniques p.1/33 Beyond the empirical error

More information

The Nonparametric Bootstrap

The Nonparametric Bootstrap The Nonparametric Bootstrap The nonparametric bootstrap may involve inferences about a parameter, but we use a nonparametric procedure in approximating the parametric distribution using the ECDF. We use

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren 1 / 34 Metamodeling ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 1, 2015 2 / 34 1. preliminaries 1.1 motivation 1.2 ordinary least square 1.3 information

More information

Statistics 262: Intermediate Biostatistics Model selection

Statistics 262: Intermediate Biostatistics Model selection Statistics 262: Intermediate Biostatistics Model selection Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Today s class Model selection. Strategies for model selection.

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Econometrics Working Paper EWP0402 ISSN 1485-6441 Department of Economics TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Lauren Bin Dong & David E. A. Giles Department

More information

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny 008 by The University of Chicago. All rights reserved.doi: 10.1086/588078 Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny (Am. Nat., vol. 17, no.

More information

Selecting an Orthogonal or Nonorthogonal Two-Level Design for Screening

Selecting an Orthogonal or Nonorthogonal Two-Level Design for Screening Selecting an Orthogonal or Nonorthogonal Two-Level Design for Screening David J. Edwards 1 (with Robert W. Mee 2 and Eric D. Schoen 3 ) 1 Virginia Commonwealth University, Richmond, VA 2 University of

More information

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance

More information

Bias and variance reduction techniques for bootstrap information criteria

Bias and variance reduction techniques for bootstrap information criteria Ann Inst Stat Math (2010) 62:209 234 DOI 10.1007/s10463-009-0237-1 Bias and variance reduction techniques for bootstrap information criteria Genshiro Kitagawa Sadanori Konishi Received: 17 November 2008

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

INTRODUCTION TO PATTERN RECOGNITION

INTRODUCTION TO PATTERN RECOGNITION INTRODUCTION TO PATTERN RECOGNITION INSTRUCTOR: WEI DING 1 Pattern Recognition Automatic discovery of regularities in data through the use of computer algorithms With the use of these regularities to take

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Theorems. Least squares regression

Theorems. Least squares regression Theorems In this assignment we are trying to classify AML and ALL samples by use of penalized logistic regression. Before we indulge on the adventure of classification we should first explain the most

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

Appendix A: Review of the General Linear Model

Appendix A: Review of the General Linear Model Appendix A: Review of the General Linear Model The generallinear modelis an important toolin many fmri data analyses. As the name general suggests, this model can be used for many different types of analyses,

More information

2.2 Classical Regression in the Time Series Context

2.2 Classical Regression in the Time Series Context 48 2 Time Series Regression and Exploratory Data Analysis context, and therefore we include some material on transformations and other techniques useful in exploratory data analysis. 2.2 Classical Regression

More information

Geographically Weighted Regression as a Statistical Model

Geographically Weighted Regression as a Statistical Model Geographically Weighted Regression as a Statistical Model Chris Brunsdon Stewart Fotheringham Martin Charlton October 6, 2000 Spatial Analysis Research Group Department of Geography University of Newcastle-upon-Tyne

More information