Resampling-Based Information Criteria for Best-Subset Regression


University of Haifa
From the SelectedWorks of Philip T. Reiss
2012

Resampling-Based Information Criteria for Best-Subset Regression

Philip T. Reiss, New York University
Lei Huang, Columbia University
Joseph E. Cavanaugh, University of Iowa
Amy Krain Roy, Fordham University

Available at:

Annals of the Institute of Statistical Mathematics manuscript No.
(will be inserted by the editor)

Resampling-Based Information Criteria for Best-Subset Regression

Philip T. Reiss · Lei Huang · Joseph E. Cavanaugh · Amy Krain Roy

Received: date / Revised: date

Abstract  When a linear model is chosen by searching for the best subset among a set of candidate predictors, a fixed penalty such as that imposed by the Akaike information criterion may penalize model complexity inadequately, leading to biased model selection. We study resampling-based information criteria that aim to overcome this problem through improved estimation of the effective model dimension. The first proposed approach builds upon previous work on bootstrap-based model selection. We then propose a more novel approach based on cross-validation. Simulations and analyses of a functional neuroimaging data set illustrate the strong performance of our resampling-based methods, which are implemented in a new R package.

Keywords  Adaptive model selection · Covariance inflation criterion · Cross-validation · Extended information criterion · Functional connectivity · Overoptimism

P. T. Reiss
Department of Child and Adolescent Psychiatry, New York University, New York, NY 10016, USA, and Nathan S. Kline Institute for Psychiatric Research, Orangeburg, NY 10962, USA
E-mail: phil.reiss@nyumc.org

L. Huang
Department of Child and Adolescent Psychiatry, New York University, New York, NY 10016, USA

J. E. Cavanaugh
Department of Biostatistics, University of Iowa College of Public Health, Iowa City, IA 52242, USA

A. K. Roy
Department of Psychology, Fordham University, Bronx, NY 10458, USA

1 Introduction

A popular strategy for model selection is to choose the candidate model minimizing an estimate of the expected value of some loss function for a hypothetical

future set of outcomes. That estimate may take the form of the value of the loss function for the given data plus a penalty or correction term. The latter term compensates for the fact that the loss function will tend to have a lower value for the data from which the model fit was obtained than for a different data set to which the same fitted model is applied. In other words, the correction represents the overoptimism (Efron, 1983; Pan and Le, 2001) inherent in using the observed loss as an estimate of the expected loss in a future data set. This general template can lead to ostensibly very different model selection criteria, such as the Akaike (1973, 1974) information criterion (AIC), Mallows' (1973) C_p statistic, and more recent covariance penalty methods (Efron, 2004).

In this paper we are concerned with the problem of subset selection for linear models (Miller, 2002), which can be stated as follows. Suppose we are given an n-dimensional outcome vector y, and an n × P matrix X_full, where the first column of X_full consists of 1s, and the remaining P − 1 columns correspond to candidate predictors. Let A = { {1} ∪ Ã : Ã ⊆ {2, ..., P} }. Any A ∈ A defines a model matrix X = X_A consisting of those columns of X_full indexed by the elements of A, and thereby defines a model

    y = Xβ + ε,    (1)

where ε is an n-dimensional vector of mean-zero errors. (The definition of A implies that we consider only models that include an intercept.) Our goal is to choose the A ∈ A for which the estimated expected loss for a future data set is lowest. Following Akaike, we use −2 times the log likelihood as the loss function, which favors a model whose predictive distribution is closest to the true distribution in the sense of Kullback–Leibler information (Konishi and Kitagawa, 2008).

The Akaike paradigm is built upon a solid foundation, thanks to its connection with likelihood theory and information theory. In part for this reason, AIC remains the best-known and most widely used criterion for model selection. Our contribution addresses a problem that is often overlooked in practice: the fact that, when selecting one among many possible models, the overoptimism is inflated. For example, AIC adds the penalty 2p to −2 times the log likelihood, where p = |A|. Selecting the AIC-minimizing model is best suited to settings in which there is only one candidate model of each size p. But when searching among all size-p subsets for each p, the overoptimism associated with selecting the best size-p subset is greater than 2p. Adaptive model selection procedures, in the sense of Tibshirani and Knight (1999), increase the overoptimism penalty to account for searching among a number of possible models of each size (cf. Shen and Ye, 2002, who define adaptive model selection somewhat differently).

We develop an approach to adaptive model selection that is particularly relevant to applications with a moderate number of candidate predictors, i.e., P/n less than 1 but not necessarily near zero. Whereas AIC is based on an asymptotic approximation that assumes p ≪ n, the corrected AIC (AIC_c)

(Sugiura, 1978; Hurvich and Tsai, 1989) applies an exact penalty term that is not proportional to p. This penalty term (see (7) below) indicates that as p grows while n remains fixed, the rate of growth in overoptimism increases; in other words, complexity is more costly for small-to-moderate than for large samples, so that complexity penalization should depend on both p and n. This important case of moderate predictor dimension has not been fully addressed in previous work on adaptive linear model selection. Since in this case it is problematic to view the overoptimism as proportional to p, adaptive model selection methods that seek to replace the AIC penalty 2p with λp for some λ > 2 (e.g., Foster and George, 1994; Ye, 1998; Shen and Ye, 2002) may be less than ideal. Our proposed methods more closely resemble the covariance inflation criterion of Tibshirani and Knight (1999), which uses permuted versions of the data to estimate the overoptimism. While this estimate works well when the true model is null, these authors acknowledge (p. 543) that it is biased when the true model is non-null. Instead of data permutation, the adaptive methods that we describe rely on two alternative resampling approaches, bootstrapping and cross-validation, to produce overoptimism estimates that are appropriate whether or not the true model is the null model.

This work was motivated by research relating psychological outcomes to functional connectivity (FC), the temporal correlation of activity levels in different brain regions of interest. In a study of eight left-hemisphere regions, Stein et al. (2007) identified 10 between-region connections (i.e., pairs of regions; see Figure 1) whose FC, assessed using functional magnetic resonance imaging (fMRI), may be related to psychological measures. We were interested in exploring FC for these 10 connections as predictors of two such measures: the Rosenberg (1965) self-esteem score, and the General Distress: Depressive Symptoms (GDD) subscale of the Mood and Anxiety Symptom Questionnaire (MASQ) (Watson et al., 1995). Given the paucity of scientific theory to guide model building, it is natural to turn to automatic model selection criteria to find the best subset of the 10 candidate predictors.

Section 2 introduces information criteria for linear model selection, and explains in more detail the need for adaptive approaches. Section 3 describes the bootstrap-based extended information criterion of Ishiguro et al. (1997), and its extension to adaptive linear model selection. Section 4 proposes an alternative adaptive criterion, based on cross-validatory estimation of the overoptimism, which may overcome some of the limitations of the bootstrap method. Simulations in Section 5 demonstrate that our adaptive methods perform well compared with previous approaches. Section 6 presents analyses of our functional connectivity data set, and Section 7 offers concluding remarks.

Fig. 1  The eight brain regions studied by Stein et al. (2007), who identified the 10 indicated connections (pairs of regions joined by edges) as potential predictors of psychological outcomes. Black dots indicate regions that lie on the mid-sagittal plane shown, while grey dots indicate left-hemisphere regions that have been projected onto this plane for illustration. Abbreviations: AMY, amygdala; INS, insula; OFC, orbitofrontal cortex; PCC, posterior cingulate cortex; PFC, prefrontal cortex; PHG, parahippocampal gyrus; SUB, subgenual cingulate; SUP, supragenual cingulate.

2 Information criteria for linear models

2.1 AIC and corrected AIC

In what follows, we shall refer to model (1) with X = X_A as model A. The notation X for the design matrix, and β̂, σ̂² for the parameter estimates, will refer to an a priori model A with |A| = p, fitted to the n observations. We shall add subscripts to X, β̂, σ̂² (i) to denote a model selected as the best model of a particular dimension, as opposed to a priori, and/or (ii) to refer to resampled data sets and the associated model fits.

Imagine a future realization of the data with the same matrix of predictors X = (x_1, ..., x_n)ᵀ as in (1), but a new outcome vector y⁺ that is independent of y, conditionally on X. A good model will give rise to maximum likelihood estimates (MLEs) β̂, σ̂² such that the expected −2 times log likelihood for the fitted model at the future data (y⁺, X), i.e.,

    E[−2ℓ(β̂, σ̂² | y⁺, X)],    (2)

will be as small as possible; here the expectation is with respect to the joint likelihood of (y, y⁺) conditional on X. Information criteria estimate the expected loss (2) by

    −2ℓ(β̂, σ̂² | y, X) + Ĉ,    (3)

where Ĉ is an estimate of the overoptimism

    C ≡ E[−2ℓ(β̂, σ̂² | y⁺, X)] − E[−2ℓ(β̂, σ̂² | y, X)].    (4)

The idea of (3) is that −2ℓ(β̂, σ̂² | y, X) has a downward bias as an estimate of (2), since (y, X) is the original data set for which the likelihood was maximized to obtain (β̂, σ̂²). C is this bias, and the penalty term Ĉ in (3) is an estimate of it.

If model (1) holds with ε ~ N(0, σ²I), then −2ℓ(β̂, σ̂² | y, X) = n log σ̂² + n, so that

    C = E[(n log σ̂² + ‖y⁺ − Xβ̂‖²/σ̂²) − (n log σ̂² + n)] = E(‖y⁺ − Xβ̂‖²/σ̂²) − n,    (5)

and the generic information criterion (3) reduces to

    n log σ̂² + n + Ĉ.    (6)

Sugiura (1978) and Hurvich and Tsai (1989) derive the exact value

    C = C_AICc(p) ≡ n(2p + 2)/(n − p − 2)    (7)

(see also the succinct treatment of Davison, 2003). Substituting this value for Ĉ in (6) gives the AIC_c

    n log σ̂² + n(n + p)/(n − p − 2).    (8)

If p ≪ n, (7) is approximately 2(p + 1) ≈ 2p, the ordinary AIC penalty. The fact that formula (8) assumes that (1) is a correct model, i.e. either the true data-generating model or a larger model, is sometimes seen as a limitation of choosing among candidate models by minimizing AIC_c. In practice, though, model selection by minimal AIC_c has proved effective, since it weeds out both overfitted models (which are heavily penalized) and underfitted models (which have low likelihood). However, this procedure is less than ideal when choosing among all possible subsets, as we explain next.

2.2 Selection bias in best-subset regression

For p ∈ {1, ..., P}, let M(p) denote the set in A of size p such that model M(p) attains the highest maximized likelihood of any A ∈ A with |A| = p. A naïve approach to subset selection would choose the AIC_c-minimizing subset M(p_AICc), where p_AICc minimizes n log σ̂²_{M(p)} + n(n + p)/(n − p − 2) over p ∈ {1, ..., P}. Note, however, that this criterion equals min_{|A|=p} (n log σ̂²_A) + n(n + p)/(n − p − 2), which is smaller on average than (8) for a fixed a priori size-p model, and hence is a downward-biased estimate of (2). Equivalently, whereas the overoptimism (5) is equal to C_AICc(p) = n(2p + 2)/(n − p − 2) for a fixed size-p model, it is larger when considering the best size-p model. We shall denote this larger value by C_ad(p).
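To make the naïve procedure concrete, here is a minimal R sketch, assuming the leaps package is available; the function name and return format are illustrative choices, not part of the authors' reams package. It finds the best subset of each size by exhaustive (branch-and-bound) search and then minimizes the fixed-penalty AIC_c of (8) over p, exactly the procedure whose selection bias is described above.

```r
## Naive best-subset AIC_c minimization (Section 2.2): a sketch
library(leaps)

naive_best_subset_aicc <- function(X, y) {
  n <- length(y)
  P1 <- ncol(X)                                 # number of candidate predictors (P - 1)
  search <- summary(regsubsets(X, y, nvmax = P1, method = "exhaustive"))
  p <- rowSums(search$which)                    # model size, counting the intercept
  sigma2_hat <- search$rss / n                  # MLE of sigma^2 for each best size-p model
  aicc <- n * log(sigma2_hat) + n * (n + p) / (n - p - 2)
  # score the null (intercept-only) model as well, p = 1
  aicc_null <- n * log(mean((y - mean(y))^2)) + n * (n + 1) / (n - 3)
  if (aicc_null <= min(aicc)) return(character(0))             # null model wins
  colnames(search$which)[search$which[which.min(aicc), ]][-1]  # names of selected predictors
}
```

As argued above, the penalty n(n + p)/(n − p − 2) used here is calibrated for a single a priori model of size p, so this search is tilted away from the null model; the resampling-based criteria below replace it with an estimate of C_ad(p).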

The difference C_AICc(p) − C_ad(p) can be thought of as the selection bias associated with using (7) to estimate the overoptimism of the selected model of size p. Of course, this selection bias would have no impact on model selection if it did not depend on p. But it does: in particular, it equals zero for p ∈ {1, P}, since there are only one size-1 (null) model and one size-P (full) model, but it is negative for 1 < p < P, so that AIC_c minimization will be tilted against the null model. This situation is somewhat akin to multiple hypothesis testing (cf. George and Foster, 2000): with many candidate predictors, even if the null model is true, there will be an unacceptably high probability of choosing a non-null model, unless some sort of correction is applied. An appropriate correction is to use an information criterion (6) in which, instead of C_AICc(p), we take Ĉ to be an estimate of C_ad(p). This is the strategy pursued by the resampling-based information criteria described in the next two sections.

3 The extended (bootstrap) information criterion

3.1 The fixed-model case

Ishiguro et al. (1997) propose a nonparametric bootstrap approach to estimating C in a much more general setting than model (1). (See also Konishi and Kitagawa (1996), and the bootstrap model selection criterion of Shao (1996), which is based on prediction error loss rather than likelihood.) Suppose we sample n pairs (y_i, x_i) from the data, with replacement, B times, and denote the bth bootstrap data set thus generated by (y*_b, X*_b) and the associated MLEs by β̂*_b, σ̂*²_b. Ishiguro et al.'s extended information criterion (EIC) uses

    Ĉ_boot = (1/B) Σ_{b=1}^B [ −2ℓ(β̂*_b, σ̂*²_b | y, X) + 2ℓ(β̂*_b, σ̂*²_b | y*_b, X*_b) ]    (9)

as an estimate of the overoptimism (4). Intuitively, bootstrapping creates simulated replicates of the data-generating process in which the empirical distribution of (y*_b, X*_b) replaces that of (y, X), whereas the empirical distribution of (y, X) replaces the population distribution; these correspondences motivate using (9) to estimate (4). In the linear model setting this reduces to estimating E(‖y⁺ − Xβ̂‖²/σ̂²), the first term of (5), by (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b, so that (9) and (6) yield

    Ĉ_boot = (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b − n,    (10)

    EIC = n log σ̂² + (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b.    (11)
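A minimal R sketch of the fixed-model EIC of (10)–(11), resampling (y_i, x_i) pairs nonparametrically; the function name and the choice B = 200 are illustrative assumptions, not the authors' reams implementation.

```r
## Fixed-model EIC via the nonparametric bootstrap, eqs. (10)-(11): a sketch
eic_fixed <- function(X, y, B = 200) {
  n <- length(y)
  sigma2_hat <- mean(residuals(lm(y ~ X))^2)    # full-data MLE of sigma^2
  loss_on_full <- replicate(B, {
    idx <- sample(n, replace = TRUE)            # resample (y_i, x_i) pairs
    fit_b <- lm(y[idx] ~ X[idx, , drop = FALSE])
    sigma2_b <- mean(residuals(fit_b)^2)        # bootstrap-sample MLE of sigma^2
    # scaled squared error of the bootstrap fit, evaluated on the ORIGINAL data
    sum((y - cbind(1, X) %*% coef(fit_b))^2) / sigma2_b
  })
  c(C_boot = mean(loss_on_full) - n,                       # eq. (10)
    EIC    = n * log(sigma2_hat) + mean(loss_on_full))     # eq. (11)
}
```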

3.2 The best-subset case

To correct for the selection bias problem described in Section 2.2, one can modify (11) to define an information criterion associated with selection of the best subset of size p. Whereas the bootstrap overoptimism estimate (10) is appropriate for a fixed model, a natural extension to the best-subset case is to estimate C_ad(p) by

    Ĉ_ad,boot(p) = (1/B) Σ_{b=1}^B ‖y − X_{M*_b(p)} β̂*_{b;M*_b(p)}‖² / σ̂*²_{b;M*_b(p)} − n,    (12)

where M*_b(p) is the best subset of size p for the bth resampled data set, and β̂*_{b;M*_b(p)} and σ̂*²_{b;M*_b(p)} are the MLEs from the corresponding model for that data set. Substituting into the generic criterion (6) gives an adaptive EIC

    EIC_ad(p) = n log σ̂²_{M(p)} + (1/B) Σ_{b=1}^B ‖y − X_{M*_b(p)} β̂*_{b;M*_b(p)}‖² / σ̂*²_{b;M*_b(p)},    (13)

which is to be minimized with respect to p. To understand the motivation for (12), (13), recall that in bootstrap estimation of the overoptimism, each bootstrap data set takes the place of the original data, while the original data stands in for the new data set. Estimate (12) substitutes the bootstrap data for the original data both for selection of the best size-p model M*_b(p) and for the resulting parameter estimates β̂*_{b;M*_b(p)}, σ̂*²_{b;M*_b(p)}, in order to capture, and compensate for, the overoptimism arising from the entire procedure of choosing the best size-p subset. See Appendix A.1 for an alternative definition of best-subset EIC.

Criterion (13) is defined only for the best size-p model for each p. To be able to assign a score to model A for any set A ∈ A, we extend the definition by applying penalty (12) not only to the best p-term model but to every such model, and thus obtain

    EIC_ad(A) = n log σ̂²_A + (1/B) Σ_{b=1}^B ‖y − X_{M*_b(|A|)} β̂*_{b;M*_b(|A|)}‖² / σ̂*²_{b;M*_b(|A|)}.    (14)

The best size-p model M*_b(p) for each bootstrap data set can be computed efficiently using the branch-and-bound algorithm implemented in the package leaps (Lumley, 2009) for R (R Development Core Team, 2010). This algorithm would be difficult to integrate with parametric bootstrap samples; hence our preference for nonparametric bootstrapping.
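The adaptive criterion (12)–(13) can be sketched as follows in R, performing a fresh best-subset search within each bootstrap sample via leaps; again, the function name and structure are illustrative and this is not the reams implementation. The null model (p = 1) is omitted for brevity.

```r
## Adaptive EIC, eqs. (12)-(13): a sketch
library(leaps)

eic_adaptive <- function(X, y, B = 200) {
  n <- length(y); P1 <- ncol(X)
  orig <- summary(regsubsets(X, y, nvmax = P1, method = "exhaustive"))
  loss <- matrix(NA_real_, B, P1)               # loss[b, j]: best size-(j + 1) model, sample b
  for (b in seq_len(B)) {
    idx <- sample(n, replace = TRUE)
    boot <- summary(regsubsets(X[idx, ], y[idx], nvmax = P1, method = "exhaustive"))
    for (j in seq_len(P1)) {
      sel <- boot$which[j, -1]                  # best size-(j + 1) subset M*_b(p), p = j + 1
      fit_b <- lm(y[idx] ~ X[idx, sel, drop = FALSE])
      sigma2_b <- mean(residuals(fit_b)^2)
      pred <- cbind(1, X[, sel, drop = FALSE]) %*% coef(fit_b)
      loss[b, j] <- sum((y - pred)^2) / sigma2_b    # evaluated on the ORIGINAL data
    }
  }
  sigma2_Mp <- orig$rss / n                     # MLE sigma^2 of best size-p model, original data
  data.frame(p         = 2:(P1 + 1),            # model size including the intercept
             C_ad_boot = colMeans(loss) - n,                      # eq. (12)
             EIC_ad    = n * log(sigma2_Mp) + colMeans(loss))     # eq. (13)
}
```

In practice one would also score the null model, and (14) can be used to assign EIC_ad values to arbitrary subsets rather than only to the best subset of each size.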

3.3 Small-sample performance of EIC

Returning to the fixed-model setting of Section 3.1, we undertook to study the performance of the bootstrap overoptimism estimate Ĉ_boot (10) when the true expected loss (2) is known: namely, when model (1) is correct and the errors are independent and identically distributed (IID) normal. As noted above, we then have C = n(2p + 2)/(n − p − 2). We were thus able to compare the penalty Ĉ_boot to this gold standard in 300 Monte Carlo simulations for each of the values of n and p displayed in Table 1 (see Appendix B for details of the simulation procedure). For fixed n, E(Ĉ_boot) − C is seen to increase with p, so that EIC will be biased toward smaller models.

Table 1  Median (and range from 5th to 95th percentile) of E(Ĉ_boot) − C in 300 simulations.

The positive bias of Ĉ_boot can be better understood by rewriting Ĉ_boot as

    (1/B) Σ_{b=1}^B Σ_{i=1}^n (1 − r_i^b)(y_i − x_iᵀβ̂*_b)²/σ̂*²_b
      = (1/B) Σ_{b=1}^B [ Σ_{i: r_i^b=0} (y_i − x_iᵀβ̂*_b)²/σ̂*²_b − Σ_{i: r_i^b>1} (r_i^b − 1)(y_i − x_iᵀβ̂*_b)²/σ̂*²_b ],    (15)

where r_i^b is the number of occurrences of the ith observation (x_i, y_i) in the bth bootstrap sample. The first term within the brackets is the sum of scaled squared errors for those observations not included in the bth bootstrap sample; the second is a weighted sum of scaled squared errors for observations which are included multiple times, and which therefore have high influence on the estimate β̂*_b. Consequently, the summands in this second term will tend to be atypically small, making (15) a positively biased estimate of the overoptimism. The key point for our purposes is that this bias increases with p, especially when n is small (in agreement with the EIC results of Konishi and Kitagawa (2008), p. 209), so that EIC will tend to underfit in the fixed-model case. The simulation results in Section 5 suggest that the same is true of EIC_ad in the best-subset case.

4 A cross-validation information criterion

4.1 Motivation and initial definition

We have seen that EIC seeks to gauge the extent to which the original-data likelihood overestimates the expected likelihood for an independent data set, using bootstrap samples as surrogates for the original data, and the full data as a surrogate for an independent data set. But clearly the full-data outcomes are not independent of the bootstrap-sample outcomes, and the argument at the end of the previous section suggests that this lack of independence is what makes EIC biased in small samples.

This led us to consider an alternative: using cross-validation (CV) to estimate the overoptimism. We remark that the bootstrap .632 and .632+ estimators of prediction error (Efron, 1983; Efron and Tibshirani, 1997; cf. Efron, 2004), which seek to improve upon CV, use bootstrap samples in a similar spirit to CV: essentially, the data points excluded from bootstrap samples are used to correct for overoptimism. However, this work is concerned with a different class of loss functions and with prediction error for a fixed model, and it is not clear how to adapt the .632 and .632+ estimators to our context of selecting among all possible subsets.

Let

    y = (y_1ᵀ, ..., y_Kᵀ)ᵀ  and  X = (X_1ᵀ, ..., X_Kᵀ)ᵀ,

where, for k = 1, ..., K, y_k ∈ R^{n_v} and X_k has n_v rows, with n_v = n/K. Let (y_{−k}, X_{−k}) denote the kth training set, i.e. the n_t = n − n_v observations not included in (y_k, X_k) (the kth validation set), and let β̂_{−k}, σ̂²_{−k} be the MLEs obtained by fitting model (1) to (y_{−k}, X_{−k}). We can then define a CV-based analogue of the overoptimism

    C* ≡ E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖² / σ̂²_{−k} ) − n,    (16)

where the expectation is with respect to the joint distribution of (y, X), along with its natural unbiased estimate

    Ĉ_CV = Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖² / σ̂²_{−k} − n,

a CV-based analogue of the EIC overoptimism estimate (10). Two remarks are in order:

1. C* is approximately equal to

    E( ‖y⁺ − X⁺ β̂‖² / σ̂² ) − n,    (17)

where (y⁺, X⁺) is an independent set of n observations drawn from the same joint distribution, and the expectation is with respect to the distribution of (y, X, y⁺, X⁺). But (16) is slightly larger than (17) because the estimates β̂_{−k}, σ̂²_{−k} are based on n_t < n observations, inflating the mean prediction error.

2. (17) is similar to the overoptimism C (5), but the definition of C assumes the original predictor matrix X is fixed and we draw an independent set of outcomes y⁺ from the same conditional (on X) distribution as y.

In order to apply C* to selecting the model dimension, we will need to express it as C*(p), an explicit function of p, analogous to the overoptimism formula C_AICc(p) of (7). Since C* depends on the distribution of X as well as that of y, our derivation of C*(p) (see below, Section 4.2) will rely on assumptions regarding the predictor distribution.
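For a fixed model, the estimate Ĉ_CV just defined can be computed as in the following R sketch; the equal-sized random folds and the function name are illustrative assumptions.

```r
## Fixed-model CV overoptimism estimate C-hat_CV (Section 4.1): a sketch
chat_cv <- function(X, y, K = 10) {
  n <- length(y)
  fold <- sample(rep(seq_len(K), length.out = n))   # random fold labels of (roughly) equal size
  total <- 0
  for (k in seq_len(K)) {
    train <- fold != k
    fit <- lm(y[train] ~ X[train, , drop = FALSE])
    sigma2_k <- mean(residuals(fit)^2)              # MLE sigma^2 on the k-th training set
    pred <- cbind(1, X[!train, , drop = FALSE]) %*% coef(fit)
    total <- total + sum((y[!train] - pred)^2) / sigma2_k
  }
  total - n                                         # sum over folds minus n
}
```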

Similarly to (12), we can define adaptive extensions of Ĉ_CV and C* for M(p), the best model of dimension p:

    Ĉ_ad,CV(p) = Σ_{k=1}^K ‖y_k − X_{k;M_{−k}(p)} β̂_{−k;M_{−k}(p)}‖² / σ̂²_{−k;M_{−k}(p)} − n,
    C*_ad(p) = E[Ĉ_ad,CV(p)],

where M_{−k}(p) is the best size-p model for the kth training set; β̂_{−k;M_{−k}(p)}, σ̂²_{−k;M_{−k}(p)} are the associated MLEs for this training set; and X_{k;M_{−k}(p)} comprises the corresponding columns of X_k.

The function C*(p), given explicitly in (23), is strictly increasing on its domain and hence invertible, so we have C_AICc(p) = C_AICc ∘ C*⁻¹ ∘ C*(p). This suggests the following plug-in estimator of the overoptimism of M(p):

    Ĉ_ad(p) ≡ C_AICc ∘ C*⁻¹[Ĉ_ad,CV(p)]    (18)
            = C_AICc(df_p) = n(2 df_p + 2)/(n − df_p − 2),

where

    df_p = C*⁻¹[Ĉ_ad,CV(p)].    (19)

One can think of df_p as the effective degrees of freedom, or effective model dimension, of M(p). As we show in Appendix C, generally speaking df_p ≥ p.

The overoptimism estimate (18) leads to

    CVIC(p) = n log σ̂²_{M(p)} + n + Ĉ_ad(p) = n log σ̂²_{M(p)} + n(n + df_p)/(n − df_p − 2),    (20)

i.e., AIC_c for M(p), but with p replaced by df_p in the penalty, as our model selection criterion. Like (13), this criterion is defined only for the best model of each size; but it can be extended to all candidate models, analogously to (14).
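Putting the pieces together, the sketch below computes Ĉ_ad,CV(p) by K-fold CV with a best-subset search on each training set, converts it to the effective dimension df_p of (19) using the formula for C*(p) given in (23) below, and returns CVIC(p) as in (20). It assumes leaps and equal-sized folds, and is an illustration rather than the reams implementation.

```r
## CVIC, eqs. (18)-(20), using C*(p) from eq. (23): a sketch
library(leaps)

cvic <- function(X, y, K = 10) {
  n <- length(y); P1 <- ncol(X)
  nt <- n - n / K                                   # training-set size (equal folds assumed)
  fold <- sample(rep(seq_len(K), length.out = n))
  cv_sum <- numeric(P1)                             # indexed by p - 1 = 1, ..., P1
  for (k in seq_len(K)) {
    tr <- fold != k
    best_k <- summary(regsubsets(X[tr, ], y[tr], nvmax = P1, method = "exhaustive"))
    for (j in seq_len(P1)) {
      sel <- best_k$which[j, -1]                    # best size-(j + 1) subset on k-th training set
      fit <- lm(y[tr] ~ X[tr, sel, drop = FALSE])
      sigma2_k <- mean(residuals(fit)^2)
      pred <- cbind(1, X[!tr, sel, drop = FALSE]) %*% coef(fit)
      cv_sum[j] <- cv_sum[j] + sum((y[!tr] - pred)^2) / sigma2_k
    }
  }
  C_ad_cv <- cv_sum - n                             # adaptive CV overoptimism estimate
  # invert C*: n + C*(df) = n (nt + 1)(nt - 2) / (nt - df - 2)^2, so
  df_p <- nt - 2 - sqrt(n * (nt + 1) * (nt - 2) / (n + C_ad_cv))      # eq. (19)
  orig <- summary(regsubsets(X, y, nvmax = P1, method = "exhaustive"))
  sigma2_Mp <- orig$rss / n
  data.frame(p = 2:(P1 + 1), df_p = df_p,
             CVIC = n * log(sigma2_Mp) + n * (n + df_p) / (n - df_p - 2))   # eq. (20)
}
```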

4.2 Derivation of C*(p)

To evaluate (18) and hence (20), we must derive the aforementioned function C*(p). We begin by writing H = (h_ij)_{1≤i,j≤n} ≡ X(XᵀX)⁻¹Xᵀ, partitioned into blocks (H_kl)_{1≤k,l≤K}, where each block is n_v × n_v. We can then state the following result, which gives the expectation of Ĉ_CV, conditional on the K folds that make up X. Theorems 1 and 2 are proved in Appendix D.

Theorem 1  Suppose model (1) holds where X is an n × p matrix and ε ~ N(0, σ²I_n). Assume that n_t > p + 2 and that I_{n_v} − H_kk is invertible for each k. Then

    E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖² / σ̂²_{−k} | X_1, ..., X_K ) = [n_t / (n_t − p − 2)] Σ_{k=1}^K tr[(I_{n_v} − H_kk)⁻¹].    (21)

To derive C*(p), i.e. an expression for E(Ĉ_CV) that depends only on the model dimension p but not on X, we require distributional assumptions on the rows of X.

Theorem 2  Let X = (1 X̃), where x̃_1, ..., x̃_n, the rows of X̃, are IID (p − 1)-variate normal. Under the assumptions of Theorem 1, if X_{−k} has rank p for each k, then

    E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖² / σ̂²_{−k} ) = n(n_t + 1)(n_t − 2) / (n_t − p − 2)².    (22)

By (16) and (22), we may take

    C*(p) = n[ (n_t + 1)(n_t − 2) / (n_t − p − 2)² − 1 ] = n(n_t + 1)(n_t − 2)/(n_t − p − 2)² − n,    (23)

provided that the predictor vectors are IID multivariate normal. In practice this condition may not hold; but Appendix E offers evidence that (23) is a reasonable approximation in most realistic settings. We therefore adopt (23) as our formula for C*(p) in what follows.

4.3 Refining the criterion by constrained monotone smoothing

Three obvious limitations of CVIC (20) are:

1. It is based on an overoptimism estimate Ĉ_ad,CV(p) that is stochastic, unlike the fixed penalty C_AICc(p).
2. For p = 2, ..., P − 1, we may be willing to accept the stochastic estimate Ĉ_ad,CV(p), since no closed-form expression for C*_ad(p) exists. But as noted in Section 2.2, for p = 1, P, there is only one candidate model of size p, so C*_ad(p) reduces to C_AICc(p) = n(2p + 2)/(n − p − 2). In effect, for p = 1, P, df_p is just a noisy version of p, and CVIC is just a noisy version of AIC_c, and hence inferior to simply using AIC_c.
3. Whereas C*_ad is an increasing function of p, Ĉ_ad,CV need not be. This may cause CVIC to favor overfitting in some cases.

We can mitigate the first of these problems, and eliminate the other two, by means of the constrained penalized spline algorithm implemented in the R package mgcv (Wood, 2006). This algorithm allows us to compute a smooth function df(p), approximating df_p at p = 1, ..., P, such that (i) df(1) = 1 and

df(P) = P, and (ii) df is constrained to be monotonically increasing, by the method of Wood (1994). (Figure 8 displays the raw df_p plotted against the smoothed df(p) for a real data example.) We can then replace the raw CVIC (20) with the monotonic variant

    CVIC_mon(p) = n log σ̂²_{M(p)} + n[n + df(p)] / [n − df(p) − 2].

4.4 Summary of the proposed adaptive methods

We provide here a brief summary of the resampling-based information criteria considered above. Section 3 introduced Ishiguro et al.'s (1997) EIC, based on using bootstrap samples to estimate the overoptimism (5), and proposed the adaptive extension EIC_ad. In Section 3.3 we showed that the bootstrap tends to overpenalize larger models in small samples, and this motivated an alternative, cross-validatory approach to adaptive linear model selection. Section 4.1 defined the CVIC in terms of a quantity C*(p) which we derived in Section 4.2. In Section 4.3 we sought to overcome some of CVIC's limitations by applying constrained monotone smoothing to the effective degrees of freedom, resulting in the new criterion CVIC_mon.

A somewhat unappealing feature of the CVIC overoptimism estimate (18) is that it estimates C_ad(p) indirectly, by applying C_AICc ∘ C*⁻¹ to Ĉ_ad,CV(p), in contrast to the direct bootstrap estimate Ĉ_ad,boot(p) (12). On the other hand, whereas Ĉ_ad,boot(p) is an adaptive extension of Ĉ_boot, which we have shown to be a biased overoptimism estimator, Ĉ_ad,CV(p) is an adaptive extension of the unbiased estimator Ĉ_CV of C*(p), offering some hope that the resulting model selection criterion CVIC will outperform the bootstrap criterion EIC_ad. Appendix A.2 discusses two alternative CV-based approaches to subset selection.
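The reams package obtains the smoothed df(p) by the constrained penalized-spline method described in Section 4.3 (Wood, 1994, 2006), via mgcv. As a much simpler stand-in that conveys the idea, the sketch below pins the endpoints and applies base R isotonic regression; it is an illustrative simplification, not the method actually used.

```r
## Monotone smoothing of the raw effective dimensions df_p (Section 4.3):
## a crude stand-in for the paper's constrained penalized-spline fit.
smooth_df <- function(df_raw) {
  P <- length(df_raw)                  # df_raw[p] is the raw df_p, p = 1, ..., P
  df_raw[1] <- 1                       # endpoint constraints: df(1) = 1 ...
  df_raw[P] <- P                       # ... and df(P) = P
  fit <- isoreg(seq_len(P), df_raw)    # monotone (isotonic) regression
  pmin(pmax(fit$yf, 1), P)             # keep the smoothed values in [1, P]
}

cvic_mon <- function(n, sigma2_Mp, df_smooth) {
  # sigma2_Mp[p]: full-data MLE error variance of the best size-p model
  n * log(sigma2_Mp) + n * (n + df_smooth) / (n - df_smooth - 2)
}
```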

5 Simulation study

5.1 Setup

We conducted a simulation study to compare the performance of minimization of (1) AIC; (2) AIC_c; (3) BIC; (4) CIC; (5) EIC_ad; (6) CVIC; and (7) CVIC_mon. Four sets of 300 simulations were performed. Each set began by choosing a predictor matrix X = (x_1 ... x_50)ᵀ whose rows were independently generated from a multivariate normal distribution with mean zero and covariance matrix having (j, k) entry 0.7^{|j−k|}. (Note that here, as in Tibshirani and Knight's (1999) simulation study, X does not include a column of 1s, since the true intercept is 0.) Then, for each individual simulation, outcomes y_1, ..., y_50 were generated from the model y_i = x_iᵀβ + ε_i, where ε_1, ..., ε_50 are independent normal variates with mean 0 and variance σ² = 1. In the first set of simulations the true coefficient vector β ∈ R²⁰ was set to zero. For the remaining simulations, there were two sets of nonzero coefficients centered around the 5th and 15th predictors. These were set initially to β_{5+j} = β_{15+j} = h − |j| for |j| < h, where h was 1, 2, 3 in the second, third, and fourth sets of simulations, resulting in 2, 6, and 10 nonzero coefficients. We then multiplied β by a constant chosen so that the final β would satisfy a prescribed value of R² ≡ βᵀXᵀXβ/(nσ² + βᵀXᵀXβ). This quantity is the theoretical R² used by Tibshirani and Knight (1999), i.e., the regression sum of squares divided by the expected total sum of squares for the true model; note that the more general expression given by Luo et al. (2006) would be required if the true model had a nonzero intercept. For CIC and EIC_ad, resampling was performed 40 times per simulation; for the two CVIC variants, leave-one-out CV was used.

5.2 Comparative performance

Figures 2–5 display the true coefficients for each set of simulations, along with the proportion of simulations in which each coefficient was included in the model (i.e., the detection rate for truly nonzero coefficients, and the false alarm rate for zero coefficients) for each method. Figure 6 shows boxplots of the model error ‖Xβ̂ − Xβ‖² (i.e., the mean prediction error minus σ²) for each method, in each set of simulations.

For the true models with 0 or 2 nonzero coefficients, the proposed methods EIC_ad, CVIC and CVIC_mon are the best performers, achieving near-perfect variable selection. The constrained smoothing makes CVIC_mon somewhat superior to CVIC, but EIC_ad attains the lowest model error of all the methods. For the models with 6 or 10 nonzero coefficients, the three proposed methods are again notably resistant to false alarms, but have difficulty detecting the truly nonzero coefficients; hence these methods' model error is no better than that of the other methods, and indeed EIC_ad, which tends to overpenalize large models (see Section 3.3), has the highest model error when the true model has 10 nonzero coefficients. Since the selection bias discussed in Section 2.2 tends to favor non-null over null models, it is unsurprising that our adaptive methods provide the greatest benefit when all or most of the true coefficients are zero.

CIC, like our proposed methods, aims to take into account the process of searching among numerous candidate models, but the simulation results suggest that its ability to protect against spurious predictors diminishes as the true model grows: CIC's false alarm rates are lower than AIC_c's when the true model is null, but higher for the larger models.
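A sketch of the data-generating design of Section 5.1 is given below. The target theoretical R² is left as a user-supplied argument (target_r2), since it is a tuning choice of the study, and MASS::mvrnorm is assumed available.

```r
## Simulation design of Section 5.1: a sketch
library(MASS)

make_sim_data <- function(n = 50, P = 20, h = 0, target_r2, sigma2 = 1) {
  Sigma <- 0.7^abs(outer(1:P, 1:P, "-"))       # cov(x_j, x_k) = 0.7^|j - k|
  X <- mvrnorm(n, mu = rep(0, P), Sigma = Sigma)
  beta <- numeric(P)
  if (h > 0) {                                 # h = 1, 2, 3 gives 2, 6, 10 nonzero coefficients
    for (j in seq(-(h - 1), h - 1)) beta[c(5, 15) + j] <- h - abs(j)
    # rescale so that beta' X'X beta / (n sigma^2 + beta' X'X beta) = target_r2
    q <- drop(t(beta) %*% crossprod(X) %*% beta)
    beta <- beta * sqrt(target_r2 * n * sigma2 / ((1 - target_r2) * q))
  }
  y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  list(X = X, y = y, beta = beta)
}
```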

Fig. 2  This figure and the three figures that follow show detection rates and false alarm rates for four true models with 20 candidate predictors. The top left subfigure shows the true coefficients, which are all zero in this case, while the other subfigures display the relative frequency of inclusion of each predictor, in 300 simulations, based on the seven methods listed in the text.

Table 2  Mean, over 100 simulated data sets, of the mean and standard deviation of the overoptimism estimates from 30 replicates of each of the proposed adaptive procedures. The last row shows C_AICc (7) for comparison. Note that the value 4.26 in the first column is the true overoptimism, and is recovered exactly by CVIC_mon. In the remaining columns, the true overoptimism C_ad is greater than C_AICc due to the effect of model selection.

              Number of predictors with nonzero coefficients
              0              2               6              10
EIC_ad        4.8 (1.3)      10.76 (1.5)     72.6 (5.05)
CVIC          3.7 (1.24)     7.86 (2.09)     (7.06)         (8.20)
CVIC_mon      4.26 (0)       12.8 (2.9)      (5.9)          (7.23)
AIC_c         4.26           8.89            19.51          32.43

5.3 Variability

A key criticism of resampling-based overoptimism estimators is that their inherent variability leads to unstable model selection. To investigate the variability of our estimators of C_ad, we performed another four sets of 100 simulations, using the exact same specifications for the sample size, predictors and outcomes as above. For each simulated data set, we computed 30 replicates of the EIC_ad overoptimism estimate for the true model size using 100 bootstrap samples, and 30 replicates of the CVIC and CVIC_mon overoptimism estimates using 10-fold CV; we then obtained the mean and standard deviation (SD) of the 30 estimates by each method. The means of these values over the 100 data sets are given in Table 2.
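For a single simulated data set, the replication scheme of Section 5.3 can be sketched as follows, reusing the illustrative eic_adaptive and cvic functions from the earlier sketches; p_true denotes the size (including intercept) of the true model, and the CVIC column reports the plug-in penalty C_AICc(df_p). Smoothing df_p across p before the plug-in step would give the corresponding CVIC_mon penalty.

```r
## Variability of the overoptimism estimates (Section 5.3): a sketch
overoptimism_spread <- function(X, y, p_true, reps = 30) {
  n <- length(y)
  est <- replicate(reps, {
    eic <- eic_adaptive(X, y, B = 100)           # 100 bootstrap samples per replicate
    cv  <- cvic(X, y, K = 10)                    # 10-fold CV per replicate
    df  <- cv$df_p[cv$p == p_true]
    c(EIC_ad = eic$C_ad_boot[eic$p == p_true],
      CVIC   = n * (2 * df + 2) / (n - df - 2))  # plug-in penalty C_AICc(df_p)
  })
  apply(est, 1, function(z) c(mean = mean(z), sd = sd(z)))
}
```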

Fig. 3  Detection and false alarm rates for the true coefficient vector shown at top left, i.e., two equal positive coefficients and all other coefficients equal to zero. Grey bars indicate detection rates for truly nonzero coefficients, and white bars give false alarm rates for zero coefficients.

Fig. 4  Detection and false alarm rates for the true coefficient vector shown at top left (six nonzero coefficients).

Fig. 5  Detection and false alarm rates for the true coefficient vector shown at top left (ten nonzero coefficients).

Fig. 6  Model error ‖Xβ̂ − Xβ‖² under the four simulation settings (0, 2, 6, and 10 nonzero coefficients), for the seven methods listed in the text.

The raw CVIC penalty is seen to have somewhat higher SD than that of EIC_ad, except in the column for 10 true predictors, where EIC_ad has much higher SD. Note that in that column the mean is much higher for EIC_ad than for CVIC; although the true value C_ad is unknown, the EIC_ad penalty seems to be positively biased here, in line with the fixed-model results in Table 1. The SD values for CVIC_mon suggest that the constrained monotone smoothing succeeds in reducing the penalty variability; the only exception is that with 2 true predictors, the SD is slightly higher for CVIC_mon than for raw CVIC, but here the mean is also notably higher.

6 Application: functional connectivity in the human brain

We now turn to the application outlined in the Introduction. Self-esteem and MASQ-GDD (depression) scores, and FC values for the 10 connections described above, were acquired in a sample of 43 participants scanned with resting-state fMRI (Biswal et al., 1995) at New York University. The regions of interest were defined, and FC was computed, as in Stark et al. (2008). Approximately optimal Box-Cox transformations were applied to the two psychological outcomes (the third power for self-esteem, and the logarithm for MASQ-GDD), which were then regressed on subsets of the 10 FC predictors. We compare the subset selection results for AIC, AIC_c, EIC_ad, and CVIC_mon.

For regression of self-esteem score on the 10 connections, there are 13 models, with 1–5 predictors, having somewhat lower AIC than the null model. The best model (AIC value 77.2, versus 77.7 for the second-best model and a higher value for the null model) includes the AMY-SUB, OFC-AMY, SUB-INS, and SUP-PCC connections. But since a number of models have AIC near the minimum, it seems reasonable to adopt a pluralistic approach that considers all of the near-optimal models and asks which predictors appear most often in them. Of the 13 models that outperform the null model, two of the above four connections stand out in terms of inclusion frequency: SUB-INS appears in all 13 models, while OFC-AMY occurs in all but two, suggesting that these two connections may be most strongly associated with self-esteem. However, AIC_c, EIC_ad and CVIC_mon choose the null model as the best model, suggesting that chance variation accounts for the AIC-based findings.

For the MASQ-GDD score, on the other hand, the adaptive criteria yield rather different results than either AIC or AIC_c. The number of models outscoring the null model is 78 for AIC and 56 for AIC_c, but only 3 for CVIC_mon (the models including PCC-AMY only, SUB-INS only, or both), and 1 for EIC_ad (the model including both PCC-AMY and SUB-INS). Figure 7 displays the 10 best models according to AIC_c, and the 3 models with lower CVIC_mon values than the null model. All four criteria choose the model with the PCC-AMY and SUB-INS connections as the best. Moreover, these two connections occur in all 10 of the lowest-AIC_c models, whereas no other connection occurs in more than 3 of these models. But whereas the AIC_c results indicate that one or two additional predictors might improve the model, the EIC_ad and CVIC_mon

Fig. 7  Information criterion values for the best models for MASQ-GDD score, based on AIC_c and on CVIC_mon. The set of line segments at a given y-coordinate indicates the connections included in the model attaining that value of the information criterion, and the dotted line in the right plot indicates CVIC_mon for the null model. (Candidate connections labelled on the vertical axis: AMY-PHG, AMY-SUB, INS-AMY, OFC-AMY, OFC-PFC, PCC-AMY, SUB-INS, SUB-SUP, SUP-AMY, SUP-PCC; horizontal axis: criterion value.)

results imply that such predictors would be spurious. The 10 lowest-AIC models (not shown in Figure 7), like the lowest-AIC_c models, consistently include PCC-AMY and SUB-INS, but the former models are generally even larger, including as many as 6 of the connections. Figure 8 shows the effective degrees of freedom obtained by the CVIC_mon method.

The fitted models regressing log-transformed GDD on PCC-AMY and/or SUB-INS suggest that both are positively related with depression: a one-standard-deviation increase in either connectivity score is associated with approximately a 10% increase in the GDD subscale. The coefficients of determination are roughly 11% for PCC-AMY alone, 13% for SUB-INS alone, and 25% for the model containing both.

7 Discussion

In the Introduction, we argued that our resampling approach to model selection should be particularly relevant when the predictor dimension p is not negligible compared with the sample size n. On the other hand, when p becomes too large to compute a score such as AIC for all subsets, it is more common to turn to other methods such as partial least squares (Helland, 1988), ridge regression (Hoerl and Kennard, 1970) or the lasso (Tibshirani, 1996). Indeed, such methods have been applied successfully in moderate-p applications. Nevertheless, when it is feasible to evaluate a score such as CVIC_mon for all subsets, there are advantages to doing so, including the fact that this approach provides a natural framework for obtaining not just a single model estimate but a collection of leading candidate models, as illustrated in our functional connectivity example (Section 6 above).

Fig. 8  CVIC_mon for the GDD model: effective model dimension, raw (df_p) and smoothed (df(p)), for the selected model of each size. The line of identity is shown in grey.

The above data analysis also demonstrates that the relatively low false-alarm rate of resampling-based information criteria makes them particularly useful for highly exploratory discovery science studies in which preventing spurious findings is of central importance. In other settings, however, these criteria's tendency to err toward non-detection of true predictors may represent a serious limitation. Another concern is the variability of our methods' overoptimism estimates, which cannot be reduced by the technique of Konishi and Kitagawa (1996) since that device is applicable only in the fixed-model case. In ongoing research, we are seeking ways to surmount these limitations.

To extend our resampling-based information criteria from linear to generalized linear models (GLMs), two difficulties must be surmounted. First, finding the best model for each resampled data set requires efficient subset selection algorithms, which are readily available for linear models (e.g., the above-cited leaps algorithm) but much less developed for GLMs. Second, the closed-form expressions for expected overoptimism used in the CVIC criterion are valid only for the linear case. The methods of Lawless and Singhal (1978) and Cerdeira et al. (2009) have enabled us to overcome the first difficulty and implement a logistic regression version of EIC_ad, but this criterion appears to suffer from a marked tendency to underfit. This preliminary finding adds to our motivation to overcome the second difficulty and extend CVIC to the generalized linear case.

The methods of this paper have been implemented in an R package called reams (resampling-based adaptive model selection), available at

Appendix A: Some alternatives to our adaptive criteria

A.1 An alternative form of best-subset EIC

An alternative to (13) would be

    n log σ̂²_{M(p)} + (1/B) Σ_{b=1}^B ‖y − X_{M(p)} β̂*_{b;M(p)}‖² / σ̂*²_{b;M(p)}.    (24)

This criterion differs from (13) in that, although new parameter estimates β̂*_{b;M(p)}, σ̂*²_{b;M(p)} are obtained for each bootstrap sample, we reuse the original-data selected subset M(p), rather than selecting a new subset M*_b(p) for each bootstrap sample as in (13). Criterion (24) is more computationally efficient than (13), and may be equivalent to the EIC_2 criterion of Konishi and Kitagawa (2008). However, because this criterion uses the bootstrap data as a surrogate for the real data only for parameter estimation, but not for model selection, it may not fully capture the overoptimism associated with parameter estimates from a selected model. Indeed, in simulations similar to those reported in Section 5, we found criterion (24) to be susceptible, like nonadaptive criteria, to a high rate of false detections.

A.2 Two alternative CV methods

An alternative to the CVIC (20), which at first glance may seem much simpler, is to compute the usual K-fold CV estimate of prediction error Σ_{k=1}^K ‖y_k − X_{k;A} β̂_{−k;A}‖² for every possible subset A, and choose the A for which this quantity is minimized. However, to the best of our knowledge, it is nontrivial to extend efficient all-subsets regression algorithms to perform all-subsets CV. Moreover, this approach would be biased in favor of model dimensions p for which the number of subsets of size p is highest.

Another alternative to CVIC is to minimize

    n log σ̂²_{M(p)} + n + Ĉ_ad,CV(p),    (25)

i.e., to define our criterion directly in terms of Ĉ_ad,CV(p) instead of the overoptimism estimate Ĉ_ad(p) = C_AICc ∘ C*⁻¹[Ĉ_ad,CV(p)] appearing in (20). However, we chose the latter quantity for consistency with the standard definitions (4), (5) of the overoptimism C. Had we opted for (25), it would not have been clear whether our criterion performed differently from AIC_c due to its adaptive nature, or due to substituting the estimand C* for the traditional C.

Appendix B: The simulation procedure of Section 3.3

Suppose the bth bootstrap sample consists of cases i_1^b, i_2^b, ..., i_n^b. The resampled data set can be written as y*_b = S_b y and X*_b = S_b X, where S_b is the sampling-with-replacement matrix (e_{i_1^b} e_{i_2^b} ... e_{i_n^b})ᵀ; here e_k is the n-dimensional vector with 1 in the kth position and 0 elsewhere. At first glance, Monte Carlo estimation of E(Ĉ_boot) for a given design matrix X

entails randomly generating (i) the bootstrap sampling matrix S_b and (ii) the error vector ε. However, since

    E(Ĉ_boot) = E( (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b ) − n
              = E[ E( (1/B) Σ_{b=1}^B ‖y − Xβ̂*_b‖²/σ̂*²_b | S_b ) ] − n,    (26)

we can eliminate the second source of simulation error by means of an analytic expression for the inner expectation in (26). We do this in two steps: expressing the fractional expression in (26) as a quotient of quadratic forms, and exploiting a formula for the expectation of such a quotient.

Step 1. It is easily seen that S_bᵀS_b = D_b ≡ diag(r_1^b, r_2^b, ..., r_n^b), where r_k^b is the number of occurrences of k in the bth bootstrap sample. If fewer than p of r_1^b, r_2^b, ..., r_n^b are positive, i.e., if the bootstrap sample contains fewer than p distinct cases, then σ̂*²_b = 0 and EIC is undefined. On the other hand, if the bootstrap sample contains at least p distinct cases, then XᵀD_bX would ordinarily be nonsingular. Assuming this to be the case, one can show that ‖y − Xβ̂*_b‖²/σ̂*²_b = (n yᵀQ_bᵀQ_b y)/(yᵀQ_bᵀD_bQ_b y), where Q_b = I − X(XᵀD_bX)⁻¹XᵀD_b. If X represents a correct model, i.e. (1) holds, then since Q_bX = 0, we obtain

    ‖y − Xβ̂*_b‖²/σ̂*²_b = n εᵀQ_bᵀQ_b ε / (εᵀQ_bᵀD_bQ_b ε).    (27)

Step 2. If ε is a vector of n IID normal variates with mean zero, A is a symmetric n × n matrix, and B is a positive semidefinite n × n matrix with singular value decomposition PΛPᵀ, then by Theorem 6 of Magnus (1986),

    E( εᵀAε / εᵀBε ) = ∫₀^∞ |I + 2tΛ|^{−1/2} tr[(I + 2tΛ)⁻¹ PᵀAP] dt,    (28)

provided the expectation exists. By (27), the inner expectation in (26) is given by this integral with A = n Q_bᵀQ_b, B = Q_bᵀD_bQ_b. One can thus estimate E(Ĉ_boot) for a given X by numerically computing the integral (28) for each of a large set of bootstrap sampling matrices S_b. In our simulation, for each (n, p), we generated 300 design matrices X by sampling each element independently from the standard normal distribution, and drew a single bootstrap sample for each X.

Appendix C: An explicit expression for df_p

We attempt here to provide some intuition regarding df_p. By (19), C*(df_p) = Ĉ_ad,CV(p); (23) then leads to

    df_p = n_t − sqrt{ n(n_t + 1)(n_t − 2) / [n + Ĉ_ad,CV(p)] } − 2,

or alternatively

    df_p = p(1/κ) + (1 − 1/κ)(n_t − 2),    (29)

where κ = sqrt{ [n + Ĉ_ad,CV(p)] / [n + C*(p)] }, i.e., the square root of the ratio of Σ_{k=1}^K ‖y_k − X_{k;M_{−k}(p)} β̂_{−k;M_{−k}(p)}‖²/σ̂²_{−k;M_{−k}(p)} to E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} ) as given by (22). The convex combination formula (29) makes it clear that df_p ≥ p if and only if Ĉ_ad,CV(p) ≥ C*(p), as would ordinarily be the case.

Appendix D: Proofs of Theorems 1 and 2

D.1 Theorem 1

We suppress the conditioning on X_1, ..., X_K in what follows. Since β̂_{−k} and σ̂²_{−k} are independent, so are the numerator and denominator on the left side of (21). Therefore, since σ̂²_{−k} ~ σ²χ²_{n_t − p}/n_t,

    E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} ) = Σ_{k=1}^K E(1/σ̂²_{−k}) E(‖y_k − X_k β̂_{−k}‖²)
                                               = [n_t / ((n_t − p − 2)σ²)] Σ_{k=1}^K E(‖y_k − X_k β̂_{−k}‖²).

It therefore suffices to show that

    Σ_{k=1}^K E(‖y_k − X_k β̂_{−k}‖²) = σ² Σ_{k=1}^K tr[(I_{n_v} − H_kk)⁻¹].    (30)

To prove (30) we use the identity y_k − X_k β̂_{−k} = (I_{n_v} − H_kk)⁻¹(y_k − X_k β̂), the generalization to K-fold CV of a well-known result for leave-one-out CV (e.g., Reiss et al., 2010). Combining this with y − Xβ̂ = (I_n − H)ε implies that the left side of (30) equals E[εᵀ(I_n − H)B(I_n − H)ε] = σ² tr[B(I_n − H)], where B is the block-diagonal matrix

    B = diag[ (I_{n_v} − H_11)⁻², (I_{n_v} − H_22)⁻², ..., (I_{n_v} − H_KK)⁻² ].

Clearly tr[B(I_n − H)] = Σ_{k=1}^K tr[(I_{n_v} − H_kk)⁻¹], so (30) follows.

D.2 Theorem 2

Since E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} ) = E[ E( Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} | X_1, ..., X_K ) ], (21) implies that it suffices to prove

    Σ_{k=1}^K E[ tr{(I_{n_v} − H_kk)⁻¹} ] = n(n_t + 1)(n_t − 2) / [n_t(n_t − p − 2)].    (31)

By the Sherman–Morrison–Woodbury lemma (e.g., Harville, 2008) and the fact that H_kk = X_k(XᵀX)⁻¹X_kᵀ,

    tr[(I_{n_v} − H_kk)⁻¹] = tr[ I_{n_v} + X_k(X_{−k}ᵀX_{−k})⁻¹X_kᵀ ]
                           = n_v + Σ_{i ∈ V_k} x_iᵀ(X_{−k}ᵀX_{−k})⁻¹x_i
                           = n_v + Σ_{i ∈ V_k} (1  x̃_iᵀ) [ n_t , n_t x̄_{−k}ᵀ ; n_t x̄_{−k} , X̃_{−k}ᵀX̃_{−k} ]⁻¹ (1 ; x̃_i),

where V_k = {i : x_i belongs to the kth validation set}, X̃_{−k} comprises the last p − 1 columns of X_{−k}, and x̄_{−k} is the vector of column means of X̃_{−k}. By a standard formula for the inverse of a partitioned matrix, the above yields

    tr[(I_{n_v} − H_kk)⁻¹] = n_v + Σ_{i ∈ V_k} [ 1/n_t + x̄_{−k}ᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹x̄_{−k} − 2 x̃_iᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹x̄_{−k} + x̃_iᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹x̃_i ],

where X̃ᶜ_{−k} is the column-centered version of X̃_{−k}. If x̄_{−k} = μ + η̄_{−k} and x̃_i = μ + η_i, where μ is the mean of the predictor distribution, then some algebra leads to

    E[ tr{(I_{n_v} − H_kk)⁻¹} ] = n_v + Σ_{i ∈ V_k} [ 1/n_t + E{ η̄_{−k}ᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹η̄_{−k} } + E{ η_iᵀ(X̃ᶜ_{−k}ᵀX̃ᶜ_{−k})⁻¹η_i } ].

The two quadratic forms above are distributed as 1/[n_t(n_t − 1)] times the Hotelling T²(p, n_t − 1) distribution and as 1/(n_t − 1) times the T²(p, n_t − 1) distribution, respectively. Hence

    E[ tr{(I_{n_v} − H_kk)⁻¹} ] = n_v + Σ_{i ∈ V_k} [ 1/n_t + { 1/[n_t(n_t − 1)] + 1/(n_t − 1) } (n_t − 1)p/(n_t − p − 2) ].

Summing over k leads to (31).

Appendix E: Robustness of (22)

E.1 Expectation of n + Ĉ_CV in general

Our formula for C*(p) is based on (22), which gives the expectation of n + Ĉ_CV = Σ_{k=1}^K ‖y_k − X_k β̂_{−k}‖²/σ̂²_{−k} assuming IID multivariate normal predictor vectors. Here we consider the robustness of (22) to departures from this distribution, for the most popular form of CV: leave-one-out CV, i.e., K = n. In this case the right-hand expressions in (21) and (22) reduce to

    [(n − 1)/(n − p − 3)] Σ_{i=1}^n 1/(1 − h_ii)   and   n²(n − 3)/(n − p − 3)²,

respectively. Let r_{n,p}(X) be the ratio of these two values, i.e.,

    r_{n,p}(X) = [(n − 1)(n − p − 3) / (n²(n − 3))] Σ_{i=1}^n 1/(1 − h_ii).    (32)

The ratio of the actual expectation of n + Ĉ_CV to that given by Theorem 2 is then ∫ r_{n,p}(X) dX, where the integration is with respect to the distribution of the random matrix X (if one wishes to view X as fixed, this distribution becomes a point mass). We then have the following result.

Proposition 1  Let h_max = h_max(X) be the largest diagonal element of H. If the assumptions of Theorem 2 hold and h_max < 1, then

    (n − 1)(n − p − 3) / [(n − 3)(n − p)] ≤ r_{n,p}(X) ≤ c_X (n − 1)(n − p − 3) / [n(n − 3)],

where c_X = min{ 1/(1 − h_max), 1 + p/[n(1 − h_max)²], 1 + p/n + p h_max/[n(1 − h_max)³] }.

25 24 Philip T. Reiss et al. Lower bound Upper bound h max h max p/n p/n Fig. 9 Contour plots, for n = 00, of the bounds imposed by Proposition on the ratio r n,p(x). Note that points below the line of identity are excluded, since h max must be at least p/n. We can gain insight into the practical implications of Proposition by plotting the above bounds on r n,p(x) with respect to p/n and h max. The bounds depend only weakly on n; Figure 9 displays them for n = 00. For p not too large, the lower bound is just slightly below, and the random quantity h max will ordinarily have most of its mass in the region where the upper bound is also near ; hence r n,p(x)dx, i.e., (22) should provide a good approximation. The bounds deviate markedly from only when either p/n or h max is very high, i.e., either the model dimension is very large or there are inordinately high-leverage observations two scenarios in which the entire enterprise of linear modeling is generally unreliable. E.2 Proof of Proposition By (32), it suffices to show that p/n n n i= h ii c X. The first inequality holds since the first expression is the harmonic mean, whereas the second expression is the arithmetic mean, of,...,. The second inequality is derived h h nn from the following three inequalities: n i= n ; h ii h max n i= h ii = n i= < n + [ + h ii ( h ii )2 p ( h max) 2 ; ] where 0 < h ii < h ii for each i


More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Institute of Statistics and Econometrics Georg-August-University Göttingen Department of Statistics

More information

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17 Model selection I February 17 Remedial measures Suppose one of your diagnostic plots indicates a problem with the model s fit or assumptions; what options are available to you? Generally speaking, you

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Chapter 7: Model Assessment and Selection

Chapter 7: Model Assessment and Selection Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has

More information

Performance Evaluation and Comparison

Performance Evaluation and Comparison Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Model Choice Lecture 2

Model Choice Lecture 2 Model Choice Lecture 2 Brian Ripley http://www.stats.ox.ac.uk/ ripley/modelchoice/ Bayesian approaches Note the plural I think Bayesians are rarely Bayesian in their model selection, and Geisser s quote

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Multivariate Regression Generalized Likelihood Ratio Tests for FMRI Activation

Multivariate Regression Generalized Likelihood Ratio Tests for FMRI Activation Multivariate Regression Generalized Likelihood Ratio Tests for FMRI Activation Daniel B Rowe Division of Biostatistics Medical College of Wisconsin Technical Report 40 November 00 Division of Biostatistics

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection

Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection Junfeng Shang, and Joseph E. Cavanaugh 2, Bowling Green State University, USA 2 The University of Iowa, USA Abstract: Two

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Correction for Tuning Bias in Resampling Based Error Rate Estimation

Correction for Tuning Bias in Resampling Based Error Rate Estimation Correction for Tuning Bias in Resampling Based Error Rate Estimation Christoph Bernau & Anne-Laure Boulesteix Institute of Medical Informatics, Biometry and Epidemiology (IBE), Ludwig-Maximilians University,

More information

STAT 730 Chapter 5: Hypothesis Testing

STAT 730 Chapter 5: Hypothesis Testing STAT 730 Chapter 5: Hypothesis Testing Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 28 Likelihood ratio test def n: Data X depend on θ. The

More information

x 21 x 22 x 23 f X 1 X 2 X 3 ε

x 21 x 22 x 23 f X 1 X 2 X 3 ε Chapter 2 Estimation 2.1 Example Let s start with an example. Suppose that Y is the fuel consumption of a particular model of car in m.p.g. Suppose that the predictors are 1. X 1 the weight of the car

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is

More information

Tuning Parameter Selection in L1 Regularized Logistic Regression

Tuning Parameter Selection in L1 Regularized Logistic Regression Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2012 Tuning Parameter Selection in L1 Regularized Logistic Regression Shujing Shi Virginia Commonwealth University

More information

Bias Correction of Cross-Validation Criterion Based on Kullback-Leibler Information under a General Condition

Bias Correction of Cross-Validation Criterion Based on Kullback-Leibler Information under a General Condition Bias Correction of Cross-Validation Criterion Based on Kullback-Leibler Information under a General Condition Hirokazu Yanagihara 1, Tetsuji Tonda 2 and Chieko Matsumoto 3 1 Department of Social Systems

More information

Testing Some Covariance Structures under a Growth Curve Model in High Dimension

Testing Some Covariance Structures under a Growth Curve Model in High Dimension Department of Mathematics Testing Some Covariance Structures under a Growth Curve Model in High Dimension Muni S. Srivastava and Martin Singull LiTH-MAT-R--2015/03--SE Department of Mathematics Linköping

More information

Prediction & Feature Selection in GLM

Prediction & Feature Selection in GLM Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

An Unbiased C p Criterion for Multivariate Ridge Regression

An Unbiased C p Criterion for Multivariate Ridge Regression An Unbiased C p Criterion for Multivariate Ridge Regression (Last Modified: March 7, 2008) Hirokazu Yanagihara 1 and Kenichi Satoh 2 1 Department of Mathematics, Graduate School of Science, Hiroshima University

More information

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap University of Zurich Department of Economics Working Paper Series ISSN 1664-7041 (print) ISSN 1664-705X (online) Working Paper No. 254 Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and

More information

1 Overview. Coefficients of. Correlation, Alienation and Determination. Hervé Abdi Lynne J. Williams

1 Overview. Coefficients of. Correlation, Alienation and Determination. Hervé Abdi Lynne J. Williams In Neil Salkind (Ed.), Encyclopedia of Research Design. Thousand Oaks, CA: Sage. 2010 Coefficients of Correlation, Alienation and Determination Hervé Abdi Lynne J. Williams 1 Overview The coefficient of

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

An Akaike Criterion based on Kullback Symmetric Divergence in the Presence of Incomplete-Data

An Akaike Criterion based on Kullback Symmetric Divergence in the Presence of Incomplete-Data An Akaike Criterion based on Kullback Symmetric Divergence Bezza Hafidi a and Abdallah Mkhadri a a University Cadi-Ayyad, Faculty of sciences Semlalia, Department of Mathematics, PB.2390 Marrakech, Moroco

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

Resampling techniques for statistical modeling

Resampling techniques for statistical modeling Resampling techniques for statistical modeling Gianluca Bontempi Département d Informatique Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di Resampling techniques p.1/33 Beyond the empirical error

More information

The Nonparametric Bootstrap

The Nonparametric Bootstrap The Nonparametric Bootstrap The nonparametric bootstrap may involve inferences about a parameter, but we use a nonparametric procedure in approximating the parametric distribution using the ECDF. We use

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren 1 / 34 Metamodeling ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 1, 2015 2 / 34 1. preliminaries 1.1 motivation 1.2 ordinary least square 1.3 information

More information

Statistics 262: Intermediate Biostatistics Model selection

Statistics 262: Intermediate Biostatistics Model selection Statistics 262: Intermediate Biostatistics Model selection Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Today s class Model selection. Strategies for model selection.

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Econometrics Working Paper EWP0402 ISSN 1485-6441 Department of Economics TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Lauren Bin Dong & David E. A. Giles Department

More information

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny 008 by The University of Chicago. All rights reserved.doi: 10.1086/588078 Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny (Am. Nat., vol. 17, no.

More information

Selecting an Orthogonal or Nonorthogonal Two-Level Design for Screening

Selecting an Orthogonal or Nonorthogonal Two-Level Design for Screening Selecting an Orthogonal or Nonorthogonal Two-Level Design for Screening David J. Edwards 1 (with Robert W. Mee 2 and Eric D. Schoen 3 ) 1 Virginia Commonwealth University, Richmond, VA 2 University of

More information

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models

Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Statistics 203: Introduction to Regression and Analysis of Variance Penalized models Jonathan Taylor - p. 1/15 Today s class Bias-Variance tradeoff. Penalized regression. Cross-validation. - p. 2/15 Bias-variance

More information

Bias and variance reduction techniques for bootstrap information criteria

Bias and variance reduction techniques for bootstrap information criteria Ann Inst Stat Math (2010) 62:209 234 DOI 10.1007/s10463-009-0237-1 Bias and variance reduction techniques for bootstrap information criteria Genshiro Kitagawa Sadanori Konishi Received: 17 November 2008

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

INTRODUCTION TO PATTERN RECOGNITION

INTRODUCTION TO PATTERN RECOGNITION INTRODUCTION TO PATTERN RECOGNITION INSTRUCTOR: WEI DING 1 Pattern Recognition Automatic discovery of regularities in data through the use of computer algorithms With the use of these regularities to take

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Theorems. Least squares regression

Theorems. Least squares regression Theorems In this assignment we are trying to classify AML and ALL samples by use of penalized logistic regression. Before we indulge on the adventure of classification we should first explain the most

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

Appendix A: Review of the General Linear Model

Appendix A: Review of the General Linear Model Appendix A: Review of the General Linear Model The generallinear modelis an important toolin many fmri data analyses. As the name general suggests, this model can be used for many different types of analyses,

More information

2.2 Classical Regression in the Time Series Context

2.2 Classical Regression in the Time Series Context 48 2 Time Series Regression and Exploratory Data Analysis context, and therefore we include some material on transformations and other techniques useful in exploratory data analysis. 2.2 Classical Regression

More information

Geographically Weighted Regression as a Statistical Model

Geographically Weighted Regression as a Statistical Model Geographically Weighted Regression as a Statistical Model Chris Brunsdon Stewart Fotheringham Martin Charlton October 6, 2000 Spatial Analysis Research Group Department of Geography University of Newcastle-upon-Tyne

More information