On the Use of Marginal Likelihood in Model Selection

Peide Shi
Department of Probability and Statistics, Peking University, Beijing, P. R. China

Chih-Ling Tsai
Graduate School of Management, University of California, Davis, California, U. S. A.

SUMMARY

Based on the marginal likelihood approach, we develop a model selection criterion, MIC, for regression models with a general variance structure. These include weighted regression models, regression models with ARMA errors, growth curve models, and spatial correlation models. We show that MIC is a consistent criterion. For regression models with either constant or non-constant variance, simulation studies indicate that MIC not only provides better model order choices than the asymptotically efficient criteria AIC, AICC, FPE and C_p, but is also superior to the consistent criteria BIC and FIC in both small and large sample sizes. These results also hold for regression models with AR(1) errors, except that BIC performs slightly better than MIC in the large sample size case. Finally, Monte Carlo studies show that the effectiveness of model selection criteria decreases as the degree of heterogeneity (or the first order autocorrelation) increases.

Some key words: AIC; AICC; BIC; FIC; MIC; Marginal likelihood.

1. Introduction

Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsai, 1989), BIC (Schwarz, 1978), FIC (Wei, 1992), FPE (Akaike, 1970), and C_p (Mallows, 1973). All of these criteria are composed of two basic components: the first is a function of the scale parameter estimator, which measures goodness of fit, and the second is a function of the number of unknown parameters, which penalizes overfitting. Hence, when considering two candidate models with the same number of parameters (e.g., one including explanatory variables x_1, x_3 and x_5, and the other including x_2, x_4 and x_6), the model with the smaller value of the first component will be selected. This implies that the more interesting parameter for model selection is the scale parameter, and that the regression parameters can be viewed as nuisance parameters.

In applications, one may use a linear model with a general variance structure, which needs to be selected based on data. For this kind of model, we are interested in the parameters included in the variance structure, and the regression parameters are considered nuisance parameters. A widely used procedure for inference about parameters of interest in the presence of nuisance parameters is based on the profile log-likelihood function. However, this procedure may give inconsistent or inefficient estimates. In addition, the profile log-likelihood itself is not a log-likelihood. Thus we must consider the conditional log-likelihood (Cox & Hinkley, 1974), the modified profile log-likelihood (Barndorff-Nielsen, 1983, 1985), and the marginal log-likelihood (McCullagh & Nelder, 1989) as alternative approaches. For the sake of simplicity, in this paper we consider only the marginal log-likelihood to obtain a model selection criterion.
The aim of this paper is to consider model selection for regression models with a general variance structure and to obtain the marginal information criterion, MIC. We achieve this first by obtaining an unbiased estimator of the expected Kullback-Leibler information of the marginal log-likelihood function of the fitted model. We then derive the MIC criterion in Section 2, and show that it is a consistent criterion in Section 3. In Section 4, we compare MIC with other model selection criteria via Monte Carlo studies. The results indicate that MIC not only outperforms the efficient criteria AIC, AICC, FPE and C_p, but is also superior (or comparable) to the consistent criteria BIC and FIC in both small and large sample sizes. Finally, we give some concluding remarks in Section 5.

2. Derivation of the MIC Criterion

2.1. Marginal Log-likelihood Function

Consider the candidate family of models

    Y = Xβ + e,    (2.1)

where Y is an n×1 vector, X is an n×k matrix of known constants, β is a k×1 vector of unknown regression coefficients, and e = (e_1, …, e_n)′ has a multivariate normal distribution with mean E(e) = 0 and variance E(ee′) = σ²W(θ); σ² is an unknown scalar and θ is a q×1 unknown vector. In this model, (θ′, σ²)′ are the parameters of interest and the β are nuisance parameters. Next, by adopting McCullagh and Nelder's (1989, Section 7.2) results, we obtain the marginal log-likelihood of (θ′, σ²)′:

    L(θ, σ²; Y) = −(1/2) log|σ²W| − (1/2) log|X′W⁻¹X/σ²| − (1/2) Y*′(I − H*)Y*/σ²,    (2.2)

where Y* = W^{−1/2}Y, X* = W^{−1/2}X and H* = X*(X*′X*)⁻¹X*′.

Suppose, however, that the response was generated by the true model

    Y = X₀β₀ + ε,

where X₀ is an n×k₀ matrix of known constants, β₀ is a k₀×1 vector of unknown regression coefficients, and ε = (ε_1, …, ε_n)′ has a multivariate normal distribution with mean E(ε) = 0 and variance E(εε′) = σ₀²W(θ₀); σ₀² is an unknown scalar and θ₀ is an m×1 unknown vector. Then the marginal log-likelihood of (θ₀′, σ₀²)′ is

    L₀(θ₀, σ₀²; Y) = −(1/2) log|σ₀²W₀| − (1/2) log|X₀′W₀⁻¹X₀/σ₀²| − (1/2) Y₀*′(I − H₀*)Y₀*/σ₀²,    (2.3)

where Y₀* = W₀^{−1/2}Y, X₀* = W₀^{−1/2}X₀, H₀* = X₀*(X₀*′X₀*)⁻¹X₀*′ and W₀ = W(θ₀). Using equations (2.2) and (2.3), we can now obtain a model selection criterion, as shown in the next subsection.

2.2. Marginal Information Criterion

A useful measure of the discrepancy between the true and candidate marginal log-likelihood functions is the Kullback-Leibler information

    Δ(θ, σ²) = −2E₀{L(θ, σ²; Y)}
             = log|σ²W| + log|X′W⁻¹X/σ²| + β₀′X₀′W^{−1/2}(I − H*)W^{−1/2}X₀β₀/σ²
               + tr{(I − H*)W^{−1/2}W₀W^{−1/2}}σ₀²/σ²,    (2.4)

where E₀ denotes the expectation under the true model. We assume that the candidate family includes the true model, an assumption that is also used in the derivation of the Akaike information criterion, AIC (Linhart & Zucchini, 1986), and the corrected Akaike information criterion, AICC (Hurvich & Tsai, 1989). Under this assumption, the columns of X can be rearranged so that X₀β₀ = Xγ, where γ = (β₀′, 0′)′ and 0′ is a 1×(k − k₀) vector. Hence the third component on the right-hand side of (2.4) is 0. Furthermore, replacing W^{−1/2} by its approximation W₀^{−1/2} in the second trace component of equation (2.4), Δ(θ, σ²) can be approximated by

    (n − k) log(σ²) + log|W| + log|X′W⁻¹X| + (n − k)σ₀²/σ².    (2.5)

A reasonable criterion for judging the quality of the candidate family in the light of the data is E₀{Δ(θ̂, σ̂²)}, where σ̂² and θ̂ are the maximum likelihood estimates of σ² and θ in the candidate family, respectively. Note that when the candidate models include the true model, nσ̂²/σ₀² is approximately distributed as χ²_{n−k}, so E₀(σ₀²/σ̂²) ≈ n/(n − k − 2). This, together with equation (2.5), gives us

    Δ(θ̂, σ̂²) ≈ (n − k)E₀{log(σ̂²)} + E₀{log|Ŵ| + log|X′Ŵ⁻¹X|} + n(n − k)/(n − k − 2),

where Ŵ = W(θ̂). This leads to the corrected Akaike information criterion obtained from the marginal log-likelihood function, which is

    MIC = (n − k) log(σ̂²) + log|Ŵ| + log|X′Ŵ⁻¹X| + n(n − k)/(n − k − 2).    (2.6)

Three analytical examples are given below to illustrate the use of MIC.

Example 2.1 (Regression model with constant variance). Consider the case where W(θ) = I. Here (2.1) represents a multiple regression model with constant variance, and the resulting selection criterion is

    MIC = (n − k) log(σ̂²) + log|X′X| + n(n − k)/(n − k − 2).    (2.7)
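To make equations (2.2) and (2.6)–(2.7) concrete, the sketch below (our illustration, not part of the paper; function names are ours) evaluates the marginal log-likelihood and MIC for a given W, taking σ̂² as the residual quadratic form divided by n to match the χ²_{n−k} statement above, and checks that the general formula (2.6) reduces to (2.7) when W = I.

```python
import numpy as np

def marginal_loglik(y, X, W, s2):
    """Marginal log-likelihood (2.2): -1/2 log|s2 W| - 1/2 log|X'W^{-1}X/s2|
    - 1/2 Y*'(I - H*)Y*/s2, with Y* = W^{-1/2}Y and X* = W^{-1/2}X."""
    n, k = X.shape
    Wi = np.linalg.inv(W)
    _, logdetW = np.linalg.slogdet(W)
    XtWiX = X.T @ Wi @ X
    _, logdetXtWiX = np.linalg.slogdet(XtWiX)
    XtWiy = X.T @ Wi @ y
    # Y*'(I - H*)Y* = y'W^{-1}y - y'W^{-1}X (X'W^{-1}X)^{-1} X'W^{-1}y
    quad = y @ Wi @ y - XtWiy @ np.linalg.solve(XtWiX, XtWiy)
    return (-0.5 * (n * np.log(s2) + logdetW)
            - 0.5 * (logdetXtWiX - k * np.log(s2))
            - 0.5 * quad / s2)

def mic(y, X, W):
    """MIC (2.6), with sigma^2 estimated by the residual quadratic form over n."""
    n, k = X.shape
    Wi = np.linalg.inv(W)
    _, logdetW = np.linalg.slogdet(W)
    XtWiX = X.T @ Wi @ X
    _, logdetXtWiX = np.linalg.slogdet(XtWiX)
    XtWiy = X.T @ Wi @ y
    s2_hat = (y @ Wi @ y - XtWiy @ np.linalg.solve(XtWiX, XtWiy)) / n
    return (n - k) * np.log(s2_hat) + logdetW + logdetXtWiX + n * (n - k) / (n - k - 2)

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.standard_normal((n, k))
y = X @ np.ones(k) + rng.standard_normal(n)

# With W = I, (2.6) reduces to (2.7): (n-k) log(s2_hat) + log|X'X| + n(n-k)/(n-k-2)
rss = y @ y - (X.T @ y) @ np.linalg.solve(X.T @ X, X.T @ y)
mic_27 = (n - k) * np.log(rss / n) + np.linalg.slogdet(X.T @ X)[1] + n * (n - k) / (n - k - 2)
assert np.isclose(mic(y, X, np.eye(n)), mic_27)
```

Note that the marginal log-likelihood (2.2) is maximized over σ² at the residual quadratic form divided by n − k; whether the paper profiles that value or the quadratic form over n is not fully explicit in this transcription, so the divisor n used here is an assumption taken from the χ²_{n−k} approximation.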

It is interesting to note that the log|X′X| term also appears in Wei's (1992) FIC criterion (see the expression for FIC in Section 4).

Example 2.2 (Regression model with non-constant variance). Assume that W(θ) = diag(w_i(θ)), where w_i(θ) = w(z_i, θ), z_i is a known r×1 vector and θ is an unknown m×1 vector, for i = 1, …, n. Under this assumption, (2.1) is a multiple regression model with non-constant variance, and the resulting marginal log-likelihood has the same form as equation (2.2). Hence the selection criterion MIC is given by (2.6). For this weighted regression model, McCullagh & Tibshirani (1990) showed that its marginal log-likelihood is the first-order bias-corrected profile log-likelihood. In addition, Verbyla (1993) used the marginal likelihood to obtain a test statistic to assess the homogeneity of the errors, and recently Lyon & Tsai (1996) compared various tests for homogeneity obtained when using the log-likelihood, marginal log-likelihood, conditional log-likelihood, and modified profile log-likelihood functions.

Example 2.3 (Regression model with ARMA errors). Consider the regression model given by (2.1), where the random errors are generated by an ARMA(p, q) process defined as

    e_t − φ_1 e_{t−1} − ⋯ − φ_p e_{t−p} = a_t − ϕ_1 a_{t−1} − ⋯ − ϕ_q a_{t−q},

where a_t is a sequence of independent normal random variables having mean zero and variance σ². The likelihood function for e (ignoring the constant) is

    L(e | σ², θ) = |W|^{−1/2}(2π)^{−n/2} exp{−(1/2) e′W⁻¹e/σ²},    (2.8)

where θ = (φ_1, …, φ_p, ϕ_1, …, ϕ_q)′ and W is the variance matrix of e. Equation (2.8) can be found in Cooper & Thompson (1977) and Harvey & Phillips (1979). The resulting marginal log-likelihood has the same form as (2.2) (see Wilson, 1989), and hence the selection criterion MIC is given by (2.6). Note that MIC is used to select regression parameters by assuming that the orders p and q of the ARMA process are specified.
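As a concrete special case of Example 2.3 (our illustration, not the paper's; the grid search and function names are our own choices), the sketch below takes AR(1) errors, for which W(ρ) has entries ρ^{|i−j|}/(1 − ρ²), estimates ρ by maximizing the marginal log-likelihood (2.2) over a grid, and evaluates MIC (2.6) at the estimate.

```python
import numpy as np

def ar1_W(rho, n):
    # Covariance of an AR(1) process with unit innovation variance, scaled so
    # that E(ee') = sigma^2 W(rho): W_ij = rho^{|i-j|} / (1 - rho^2)
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - rho ** 2)

def profile_stats(y, X, W):
    """Profile the marginal log-likelihood (2.2) over sigma^2 analytically;
    returns (maximized marginal log-likelihood, MIC value from (2.6))."""
    n, k = X.shape
    Wi = np.linalg.inv(W)
    _, ldW = np.linalg.slogdet(W)
    XtWiX = X.T @ Wi @ X
    _, ldXtWiX = np.linalg.slogdet(XtWiX)
    XtWiy = X.T @ Wi @ y
    quad = y @ Wi @ y - XtWiy @ np.linalg.solve(XtWiX, XtWiy)
    s2 = quad / (n - k)          # maximizer of (2.2) in sigma^2
    ll = -0.5 * ((n - k) * np.log(s2) + ldW + ldXtWiX + quad / s2)
    mic_val = (n - k) * np.log(quad / n) + ldW + ldXtWiX + n * (n - k) / (n - k - 2)
    return ll, mic_val

rng = np.random.default_rng(1)
n, k, rho0 = 200, 3, 0.5
X = rng.standard_normal((n, k))
e = np.zeros(n)
e[0] = rng.standard_normal() / np.sqrt(1 - rho0 ** 2)   # stationary start
for t in range(1, n):
    e[t] = rho0 * e[t - 1] + rng.standard_normal()
y = X @ np.ones(k) + e

grid = np.arange(0.0, 0.95, 0.05)
lls = [profile_stats(y, X, ar1_W(r, n))[0] for r in grid]
rho_hat = grid[int(np.argmax(lls))]
_, mic_val = profile_stats(y, X, ar1_W(rho_hat, n))
```

For a fixed candidate design X this computes MIC at the estimated variance-structure parameter; in a model selection run the same computation would be repeated for each candidate column subset.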
This is different from the approach taken by Tsay (1984), who proposed a method to identify the orders p and q when the dimension of the regression parameters is known. Note that by using arguments similar to those in Examples 2.2 and 2.3, we can also obtain MIC for growth curve models and for regression models with spatially correlated errors

(Sen & Srivastava, 1990, p. 138 and p. 143). The theoretical justification for applying MIC to variable selection under the above conditions is given in the next section.

3. Asymptotic Results for MIC

Let A = {α : α is a nonempty subset of {1, …, k}} and let A₀ = {α ∈ A : E(Y) = X(α)β(α)} be the subset of A in which each α is associated with a model that includes the true model; here X(α) and β(α) contain the corresponding components of X and β, respectively. Finally, let α̂ denote the model selected by MIC, so that its MIC value is the smallest among all possible candidate models, and let α₀ be the model in A₀ with the smallest dimension. We first obtain asymptotic results for the multiple regression model with constant variance, and then extend the results to the general model setting given by (2.1). In order to state and prove our asymptotic results, we need the following two definitions and assumptions.

Definition 1. The selection procedure is said to be consistent if P{α̂ = α₀} → 1 as n tends to infinity, and strongly consistent if P{lim_n α̂ = α₀} = 1.

Definition 2. A random variable ξ is sub-Gaussian if there exists a constant φ > 0 such that, for all real d, E{exp(dξ)} ≤ exp(φd²). This definition is adopted from Zheng & Loh (1995).

Assumption 1. lim sup_{n→∞} max_{α∈A} |X(α)′X(α)/n| < ∞ and lim inf_{n→∞} min_{α∈A} |X(α)′X(α)/n| > 0.

Assumption 2. lim inf_{n→∞} n⁻¹ inf_{α∈A−A₀} ‖X₀β₀ − H(α)X₀β₀‖² > 0, where H(α) = X(α)(X(α)′X(α))⁻¹X(α)′.

We can now present the asymptotic results for the selection criterion MIC.
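Definition 2 can be illustrated with a simple bounded example (ours, not the paper's): for a Rademacher variable ξ taking values ±1 with probability 1/2 each, E{exp(dξ)} = cosh(d) ≤ exp(d²/2), so the sub-Gaussian bound holds with φ = 1/2. A quick numerical check:

```python
import math

# Rademacher xi: E exp(d*xi) = (e^d + e^{-d})/2 = cosh(d).
# Sub-Gaussian bound of Definition 2 with phi = 1/2: cosh(d) <= exp(d^2 / 2).
phi = 0.5
for d in [x / 10.0 for x in range(-100, 101)]:   # d in [-10, 10]
    assert math.cosh(d) <= math.exp(phi * d * d), d
print("sub-Gaussian bound holds on the grid")
```

The inequality holds exactly (term-by-term comparison of the series for cosh d and exp(d²/2)); the grid check is just a sanity test.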

Theorem 3.1. If Assumptions 1 and 2 are satisfied and the ε_i are assumed to be iid sub-Gaussian, then MIC is a consistent criterion.

The proof of Theorem 3.1 is given in the Appendix. Note that if the sub-Gaussian assumption is replaced by the moment condition E(ε_1^{4h}) < ∞ for some positive integer h, Theorem 3.1 remains true.

Theorem 3.2. Let

    MIC^{(s)}(α) = (n − k(α)) log(σ̂²(α)) + δ_n log|X(α)′X(α)| + 2(k(α) + 2)/(n − k(α) − 2),

where k(α) is the dimension of the submodel α ∈ A, j_s = max{j : j^{1/(2s)} < log(j)}, and

    δ_n = 1 when n ≤ j_s;  δ_n = n^{1/(2s)}/log(n) when n > j_s.

Under the same assumptions given for Theorem 3.1, MIC^{(s)} is a strongly consistent criterion for all finite s > 1.

The proof of Theorem 3.2 is not presented here, since the techniques used are similar to those for the proof of Theorem 3.1; the details can be obtained from the authors on request. Note that under the assumptions of Theorem 3.1, MIC is nearly strongly consistent since, except for a constant term n + 2, MIC and MIC^{(s)} are identical when n < j_s, and j_s is very large; for example, j_35 and j_54 are astronomically large. Theorem 3.1 shows that MIC is a consistent criterion for multiple regression models with a constant variance error structure. In fact, this result can be extended to the general model setting given by (2.1), as we see below.

Theorem 3.3. In model (2.1), assume that both θ̂ − θ₀ and σ̂²(α, θ̂) − σ̂²(α, θ₀) tend to zero in probability as n tends to infinity for all α ∈ A. In addition, assume that the elements of W(θ) are continuous functions of θ, that W(θ) is positive definite in a neighborhood of θ₀, and that Assumptions 1 and 2 are satisfied when X(α) is replaced by W(θ)^{−1/2}X(α). If the ε_i are iid sub-Gaussian, then MIC is a consistent criterion.

The proof of Theorem 3.3 is given in the Appendix. Its results are still valid if the sub-Gaussian assumption is replaced by the moment condition E(ε_1^{4h}) < ∞ for some positive integer h.
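The size of the threshold j_s = max{j : j^{1/(2s)} < log j} can be approximated numerically (a sketch of ours, assuming log denotes the natural logarithm): writing j = e^t, the boundary is the larger root of t = 2s log t, which a simple fixed-point iteration finds quickly because that root is attracting.

```python
import math

def js_magnitude(s, iters=200):
    """Approximate log10(j_s), where j_s = max{j : j^(1/(2s)) < log j}.
    With j = e^t the boundary solves t = 2*s*log(t); the larger root is an
    attracting fixed point of t <- 2*s*log(t), so we iterate from a large start."""
    t = 10.0 * s
    for _ in range(iters):
        t = 2.0 * s * math.log(t)
    return t / math.log(10.0)   # log10 of the approximate j_s

print(js_magnitude(35))   # about 184, i.e. j_35 is around 10^184
print(js_magnitude(54))
```

This confirms the point of Theorem 3.2's remark: for moderate s, the regime n > j_s where δ_n departs from 1 is never reached at any practical sample size.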

4. Monte Carlo Results

In this section we use simulation studies to compare the performance of MIC with AIC, AICC, BIC, FIC, C_p and FPE, defined as:

    AIC = n log(σ̂²) + 2k,
    AICC = n log(σ̂²) + 2n(k + 1)/(n − k − 2),
    BIC = n log(σ̂²) + k log(n),
    FIC = nσ̂² + σ̃² log|X′X|,
    C_p = nσ̂²/σ̃² − n + 2k,
    FPE = nσ̂²(n + k)/(n − k),

where σ̂² and σ̃² are the maximum likelihood estimates of σ² under the fitted model and the full model, respectively. We consider regression models with either constant variance, non-constant variance, or AR(1) errors. For each of five sample sizes (n = 15, 20, 40, 80, 160), 1000 realizations were generated from the true model (2.1), where β₀ = (1, …, 1)′ is a k₀×1 vector, σ₀ = 1, and x_{0i} ~ N(0, I_{k₀}) for i = 1, …, n. There are nine candidate variables, stored in an n×9 matrix of independent identically distributed normal random variables. The candidate models are linear and are listed in columns in a sequential nested fashion. Hence the set of candidate models includes the true model.

Example 4.1 (Regression model with constant variance). Here we assume that k₀ = 5 and that the ε_i are iid normal random variables with mean zero and variance 1. Table 1 presents the proportions of the true order selected by the various criteria. In the small sample sizes (n = 15 and 20), it is obvious that AICC and MIC outperform the other criteria, while FIC and C_p perform the worst. As the sample size increases, the proportions of correct model order chosen by BIC, FIC, and MIC also increase. This is not surprising, since all these criteria are consistent. MIC performs the best overall for both large and small sample sizes.

Example 4.2 (Regression model with non-constant variance). Consider the weighted regression model

    y_i = x_{0i}′β₀ + w_i ε_i,  i = 1, …, n,

where w_i = exp(γ₀ z_i), the z_i are iid standard normal, and γ₀ takes the values 0.2, 0.5 and 0.7.
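A miniature version of the constant-variance design of Example 4.1 can be sketched as follows (our illustration, with a smaller replication count and sample size than the paper's full study; function names are ours):

```python
import numpy as np

def fit(y, X):
    """OLS fit; returns the ML variance estimate RSS/n."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r / len(y)

def select(y, Xcand):
    """Evaluate the seven criteria over the nested models X[:, :k], k = 1..9,
    and return the order chosen by each criterion."""
    n, kmax = Xcand.shape
    s2_full = fit(y, Xcand)      # sigma-tilde^2 from the full model
    best = {c: (np.inf, 0) for c in ("AIC", "AICC", "BIC", "FIC", "Cp", "FPE", "MIC")}
    for k in range(1, kmax + 1):
        X = Xcand[:, :k]
        s2 = fit(y, X)
        ldXtX = np.linalg.slogdet(X.T @ X)[1]
        vals = {
            "AIC":  n * np.log(s2) + 2 * k,
            "AICC": n * np.log(s2) + 2 * n * (k + 1) / (n - k - 2),
            "BIC":  n * np.log(s2) + k * np.log(n),
            "FIC":  n * s2 + s2_full * ldXtX,
            "Cp":   n * s2 / s2_full - n + 2 * k,
            "FPE":  n * s2 * (n + k) / (n - k),
            "MIC":  (n - k) * np.log(s2) + ldXtX + n * (n - k) / (n - k - 2),
        }
        for c, v in vals.items():
            if v < best[c][0]:
                best[c] = (v, k)
    return {c: kv[1] for c, kv in best.items()}

rng = np.random.default_rng(2)
n, k0, reps = 80, 5, 200
hits = {c: 0 for c in ("AIC", "AICC", "BIC", "FIC", "Cp", "FPE", "MIC")}
for _ in range(reps):
    Xcand = rng.standard_normal((n, 9))
    y = Xcand[:, :k0] @ np.ones(k0) + rng.standard_normal(n)
    for c, k in select(y, Xcand).items():
        hits[c] += (k == k0)
props = {c: h / reps for c, h in hits.items()}
print(props)
```

The printed dictionary gives, for each criterion, the proportion of replications in which the true order k₀ = 5 was selected, the quantity tabulated in Tables 1-3.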
Table 2 shows that the performances of all the criteria are similar to those in Example 4.1; that is, MIC is superior to AICC in small sample sizes, is slightly better than FIC and BIC in large sample sizes, and is best overall. The criteria AIC, C_p and FPE

all perform comparably poorly. It is also interesting to note that the proportion of correct model order selections for each criterion decreases as γ₀ increases. This indicates that strong heterogeneity has a negative effect on model selection.

Example 4.3 (Regression model with AR(1) errors). Consider the linear regression with AR(1) errors

    y_i = x_{0i}′β₀ + ε_i,  i = 1, …, n,

where the ε_i are random errors satisfying

    ε_1 ~ N(0, 1/(1 − ρ₀²)),  ε_i = ρ₀ε_{i−1} + η_i,  η_i ~ N(0, 1),  for i = 2, 3, …, n,

and ρ₀ = 0.1, 0.5 and 0.9. Table 3 presents simulation results only for positive autocorrelation, since negative autocorrelation gives similar results. It shows that MIC outperforms the other criteria except when n = 80 and 160, where BIC performs slightly better. Not surprisingly, large positive autocorrelation reduces the effectiveness of model selection criteria, especially when ρ₀ is close to 1 and the sample size is small or moderate. We reach the same conclusion when the autocorrelation is negative.

The above three examples give us more insight into why AICC and MIC perform quite differently. For regression models with constant variance (Example 4.1), AICC and MIC are unbiased estimators of the Kullback-Leibler informations computed from the log-likelihood function and the marginal log-likelihood function, respectively. Hence both perform comparably in small samples. As the sample size increases, MIC, because it is a consistent criterion, outperforms AICC. For regression models with heteroscedastic errors (Example 4.2) or autocorrelated errors (Example 4.3), MIC is still an approximately unbiased estimator of the Kullback-Leibler information computed from the marginal log-likelihood function, but AICC is not an unbiased estimator of the Kullback-Leibler information computed from the log-likelihood function. Therefore, MIC outperforms AICC in both small and large samples.

5. Conclusions

We have applied the marginal log-likelihood approach to regression models with a general variance structure in order to obtain a good criterion, MIC. We show that MIC is a consistent criterion. Simulation studies indicate that MIC is not only superior to the efficient criteria AIC, AICC, C_p and FPE, but also performs better than (or comparably to) the consistent criteria BIC and FIC. This means that the marginal log-likelihood approach results in better variable selections than the log-likelihood approach under these conditions, and therefore it is natural to use the marginal log-likelihood approach to revise some existing model selection criteria. MIC can also be extended to other model settings, such as regression models with ARCH errors (Gourieroux, 1997) and multivariate regression models with a general variance structure (Greene, 1993, Chapters 16 & 17; Diggle, Liang & Zeger, 1994, Chapter 4; Bedrick & Tsai, 1994). Finally, in future work the conditional log-likelihood and modified profile log-likelihood approaches can be applied to obtain model selection criteria, which can then be compared with MIC.

ACKNOWLEDGMENTS

The research of Peide Shi was supported by the NSF of China. The research of Chih-Ling Tsai was supported by National Science Foundation grant DMS

REFERENCES

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math. 22.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, Ed. B. N. Petrov and F. Csaki. Budapest: Akademiai Kiado.
Barndorff-Nielsen, O. E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70.
Barndorff-Nielsen, O. E. (1985). Properties of modified profile likelihood. In Contributions to Probability and Statistics in Honour of Gunnar Blom, Ed. J. Lanke and G. Lindgren. Lund.
Bedrick, E. J. & Tsai, C.-L. (1994). Model selection for multivariate regression in small samples. Biometrics 50.
Chatterjee, S. & Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. New York: Wiley.
Cooper, D. M. & Thompson, R. (1977). A note on the estimation of the parameters of the autoregressive-moving average process. Biometrika 64.
Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. New York: Chapman and Hall.
Diggle, P. J., Liang, K. Y. & Zeger, S. L. (1994). Analysis of Longitudinal Data. New York: Oxford.
Gourieroux, C. (1997). ARCH Models and Financial Applications. New York: Springer.
Greene, W. H. (1993). Econometric Analysis. New York: Macmillan.
Harvey, A. C. & Phillips, G. D. A. (1979). Maximum likelihood estimation of regression models with autoregressive-moving average disturbances. Biometrika 66.
Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika 76.
Linhart, H. & Zucchini, W. (1986). Model Selection. New York: Wiley.
Lyon, J. & Tsai, C.-L. (1996). A comparison of tests for heteroscedasticity. The Statistician 45.
Mallows, C. L. (1973). Some comments on C_p. Technometrics 15.
McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. New York: Chapman and Hall.

McCullagh, P. & Tibshirani, R. (1990). A simple method for the adjustment of profile likelihoods. J. R. Statist. Soc. B 52.
Rao, C. R. & Kleffe, J. (1988). Estimation of Variance Components and Applications. Amsterdam: North-Holland.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6.
Sen, A. & Srivastava, M. (1990). Regression Analysis: Theory, Methods, and Applications. New York: Springer-Verlag.
Tsay, R. S. (1984). Regression models with time series errors. J. Amer. Statist. Assoc. 79.
Verbyla, A. P. (1993). Modelling variance heterogeneity: residual maximum likelihood and diagnostics. J. R. Statist. Soc. B 55.
Wei, C. Z. (1992). On predictive least squares principles. Ann. Statist. 20.
Wilson, G. T. (1989). On the use of marginal likelihood in time series model estimation. J. R. Statist. Soc. B 51.
Zheng, X. D. & Loh, W. Y. (1995). Consistent variable selection in linear models. J. Amer. Statist. Assoc. 90.

Table 1. Proportions of correct model order selection by AIC, AICC, BIC, FIC, C_p, FPE, and MIC in 1000 realizations for the regression model with constant variance. (Rows: AIC, AICC, BIC, FIC, C_p, FPE, MIC; columns: n = 15, 20, 40, 80, 160. The numerical entries did not survive transcription.)

Table 2. Proportions of correct model order selection by AIC, AICC, BIC, FIC, C_p, FPE, and MIC in 1000 realizations for the weighted regression model. (Blocks: γ₀ = 0.2, 0.5, 0.7; rows: AIC, AICC, BIC, FIC, C_p, FPE, MIC; columns: n = 15, 20, 40, 80, 160. The numerical entries did not survive transcription.)

Table 3. Proportions of correct model order selection by AIC, AICC, BIC, FIC, C_p, FPE, and MIC in 1000 realizations for the regression model with AR(1) errors. (Blocks: ρ₀ = 0.1, 0.5, 0.9; rows: AIC, AICC, BIC, FIC, C_p, FPE, MIC; columns: n = 15, 20, 40, 80, 160. The numerical entries did not survive transcription.)


Detection of Outliers in Regression Analysis by Information Criteria Detection of Outliers in Regression Analysis by Information Criteria Seppo PynnÄonen, Department of Mathematics and Statistics, University of Vaasa, BOX 700, 65101 Vaasa, FINLAND, e-mail sjp@uwasa., home

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

AR-order estimation by testing sets using the Modified Information Criterion

AR-order estimation by testing sets using the Modified Information Criterion AR-order estimation by testing sets using the Modified Information Criterion Rudy Moddemeijer 14th March 2006 Abstract The Modified Information Criterion (MIC) is an Akaike-like criterion which allows

More information

Asymptotic inference for a nonstationary double ar(1) model

Asymptotic inference for a nonstationary double ar(1) model Asymptotic inference for a nonstationary double ar() model By SHIQING LING and DONG LI Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong maling@ust.hk malidong@ust.hk

More information

Diagnostic tools for random eects in the repeated measures growth curve model

Diagnostic tools for random eects in the repeated measures growth curve model Computational Statistics & Data Analysis 33 (2000) 79 100 www.elsevier.com/locate/csda Diagnostic tools for random eects in the repeated measures growth curve model P.J. Lindsey, J.K. Lindsey Biostatistics,

More information

Alfredo A. Romero * College of William and Mary

Alfredo A. Romero * College of William and Mary A Note on the Use of in Model Selection Alfredo A. Romero * College of William and Mary College of William and Mary Department of Economics Working Paper Number 6 October 007 * Alfredo A. Romero is a Visiting

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Estimating the Marginal Odds Ratio in Observational Studies

Estimating the Marginal Odds Ratio in Observational Studies Estimating the Marginal Odds Ratio in Observational Studies Travis Loux Christiana Drake Department of Statistics University of California, Davis June 20, 2011 Outline The Counterfactual Model Odds Ratios

More information

Augustin: Some Basic Results on the Extension of Quasi-Likelihood Based Measurement Error Correction to Multivariate and Flexible Structural Models

Augustin: Some Basic Results on the Extension of Quasi-Likelihood Based Measurement Error Correction to Multivariate and Flexible Structural Models Augustin: Some Basic Results on the Extension of Quasi-Likelihood Based Measurement Error Correction to Multivariate and Flexible Structural Models Sonderforschungsbereich 386, Paper 196 (2000) Online

More information

1978 M.S. Mathematics, University of Illinois, Chicago Circle Distinguished Professor of Management, University of California, Davis

1978 M.S. Mathematics, University of Illinois, Chicago Circle Distinguished Professor of Management, University of California, Davis EDUCATION CHIH-LING TSAI Graduate School of Management One Shields Avenue University of California, Davis Davis, California 95616-8609 CLTSAI@UCDAVIS.EDU (530) 752-8565 1983 Ph.D. Statistics, University

More information

How the mean changes depends on the other variable. Plots can show what s happening...

How the mean changes depends on the other variable. Plots can show what s happening... Chapter 8 (continued) Section 8.2: Interaction models An interaction model includes one or several cross-product terms. Example: two predictors Y i = β 0 + β 1 x i1 + β 2 x i2 + β 12 x i1 x i2 + ɛ i. How

More information

Monte Carlo Methods for Statistical Inference: Variance Reduction Techniques

Monte Carlo Methods for Statistical Inference: Variance Reduction Techniques Monte Carlo Methods for Statistical Inference: Variance Reduction Techniques Hung Chen hchen@math.ntu.edu.tw Department of Mathematics National Taiwan University 3rd March 2004 Meet at NS 104 On Wednesday

More information

Comparison of New Approach Criteria for Estimating the Order of Autoregressive Process

Comparison of New Approach Criteria for Estimating the Order of Autoregressive Process IOSR Journal of Mathematics (IOSRJM) ISSN: 78-578 Volume 1, Issue 3 (July-Aug 1), PP 1- Comparison of New Approach Criteria for Estimating the Order of Autoregressive Process Salie Ayalew 1 M.Chitti Babu,

More information

Combining time series models for forecasting

Combining time series models for forecasting Statistics Preprints Statistics 10-001 Combining time series models for forecasting Hui Zou Stanford University Yuhong Yang Iowa State University Follow this and additional works at: http://lib.dr.iastate.edu/stat_las_preprints

More information

UNBIASED ESTIMATE FOR b-value OF MAGNITUDE FREQUENCY

UNBIASED ESTIMATE FOR b-value OF MAGNITUDE FREQUENCY J. Phys. Earth, 34, 187-194, 1986 UNBIASED ESTIMATE FOR b-value OF MAGNITUDE FREQUENCY Yosihiko OGATA* and Ken'ichiro YAMASHINA** * Institute of Statistical Mathematics, Tokyo, Japan ** Earthquake Research

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Obtaining simultaneous equation models from a set of variables through genetic algorithms

Obtaining simultaneous equation models from a set of variables through genetic algorithms Obtaining simultaneous equation models from a set of variables through genetic algorithms José J. López Universidad Miguel Hernández (Elche, Spain) Domingo Giménez Universidad de Murcia (Murcia, Spain)

More information

Lecture 7: Model Building Bus 41910, Time Series Analysis, Mr. R. Tsay

Lecture 7: Model Building Bus 41910, Time Series Analysis, Mr. R. Tsay Lecture 7: Model Building Bus 41910, Time Series Analysis, Mr R Tsay An effective procedure for building empirical time series models is the Box-Jenkins approach, which consists of three stages: model

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA STATISTICS IN MEDICINE, VOL. 17, 59 68 (1998) CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA J. K. LINDSEY AND B. JONES* Department of Medical Statistics, School of Computing Sciences,

More information

Projektpartner. Sonderforschungsbereich 386, Paper 163 (1999) Online unter:

Projektpartner. Sonderforschungsbereich 386, Paper 163 (1999) Online unter: Toutenburg, Shalabh: Estimation of Regression Coefficients Subject to Exact Linear Restrictions when some Observations are Missing and Balanced Loss Function is Used Sonderforschungsbereich 386, Paper

More information

On Autoregressive Order Selection Criteria

On Autoregressive Order Selection Criteria On Autoregressive Order Selection Criteria Venus Khim-Sen Liew Faculty of Economics and Management, Universiti Putra Malaysia, 43400 UPM, Serdang, Malaysia This version: 1 March 2004. Abstract This study

More information

2.2 Classical Regression in the Time Series Context

2.2 Classical Regression in the Time Series Context 48 2 Time Series Regression and Exploratory Data Analysis context, and therefore we include some material on transformations and other techniques useful in exploratory data analysis. 2.2 Classical Regression

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Weighted tests of homogeneity for testing the number of components in a mixture

Weighted tests of homogeneity for testing the number of components in a mixture Computational Statistics & Data Analysis 41 (2003) 367 378 www.elsevier.com/locate/csda Weighted tests of homogeneity for testing the number of components in a mixture Edward Susko Department of Mathematics

More information

Bootstrap prediction and Bayesian prediction under misspecified models

Bootstrap prediction and Bayesian prediction under misspecified models Bernoulli 11(4), 2005, 747 758 Bootstrap prediction and Bayesian prediction under misspecified models TADAYOSHI FUSHIKI Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569,

More information

Unbiased prediction in linear regression models with equi-correlated responses

Unbiased prediction in linear regression models with equi-correlated responses ') -t CAA\..-ll' ~ j... "1-' V'~ /'. uuo. ;). I ''''- ~ ( \ '.. /' I ~, Unbiased prediction in linear regression models with equi-correlated responses Shalabh Received: May 13, 1996; revised version: December

More information

Additive Outlier Detection in Seasonal ARIMA Models by a Modified Bayesian Information Criterion

Additive Outlier Detection in Seasonal ARIMA Models by a Modified Bayesian Information Criterion 13 Additive Outlier Detection in Seasonal ARIMA Models by a Modified Bayesian Information Criterion Pedro Galeano and Daniel Peña CONTENTS 13.1 Introduction... 317 13.2 Formulation of the Outlier Detection

More information

Comparison with RSS-Based Model Selection Criteria for Selecting Growth Functions

Comparison with RSS-Based Model Selection Criteria for Selecting Growth Functions TR-No. 14-07, Hiroshima Statistical Research Group, 1 12 Comparison with RSS-Based Model Selection Criteria for Selecting Growth Functions Keisuke Fukui 2, Mariko Yamamura 1 and Hirokazu Yanagihara 2 1

More information

DYNAMIC ECONOMETRIC MODELS Vol. 9 Nicolaus Copernicus University Toruń Mariola Piłatowska Nicolaus Copernicus University in Toruń

DYNAMIC ECONOMETRIC MODELS Vol. 9 Nicolaus Copernicus University Toruń Mariola Piłatowska Nicolaus Copernicus University in Toruń DYNAMIC ECONOMETRIC MODELS Vol. 9 Nicolaus Copernicus University Toruń 2009 Mariola Piłatowska Nicolaus Copernicus University in Toruń Combined Forecasts Using the Akaike Weights A b s t r a c t. The focus

More information

Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection

Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection Junfeng Shang, and Joseph E. Cavanaugh 2, Bowling Green State University, USA 2 The University of Iowa, USA Abstract: Two

More information

A NOTE ON THE SELECTION OF TIME SERIES MODELS. June 2001

A NOTE ON THE SELECTION OF TIME SERIES MODELS. June 2001 A NOTE ON THE SELECTION OF TIME SERIES MODELS Serena Ng Pierre Perron June 2001 Abstract We consider issues related to the order of an autoregression selected using information criteria. We study the sensitivity

More information

Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach

Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach By Shiqing Ling Department of Mathematics Hong Kong University of Science and Technology Let {y t : t = 0, ±1, ±2,

More information

Statistical Practice. Selecting the Best Linear Mixed Model Under REML. Matthew J. GURKA

Statistical Practice. Selecting the Best Linear Mixed Model Under REML. Matthew J. GURKA Matthew J. GURKA Statistical Practice Selecting the Best Linear Mixed Model Under REML Restricted maximum likelihood (REML) estimation of the parameters of the mixed model has become commonplace, even

More information

Comparison with Residual-Sum-of-Squares-Based Model Selection Criteria for Selecting Growth Functions

Comparison with Residual-Sum-of-Squares-Based Model Selection Criteria for Selecting Growth Functions c 215 FORMATH Research Group FORMATH Vol. 14 (215): 27 39, DOI:1.15684/formath.14.4 Comparison with Residual-Sum-of-Squares-Based Model Selection Criteria for Selecting Growth Functions Keisuke Fukui 1,

More information

Combining time series models for forecasting

Combining time series models for forecasting International Journal of Forecasting 20 (2004) 69 84 www.elsevier.com/locate/ijforecast Combining time series models for forecasting Hui Zou a, Yuhong Yang b, * a Department of Statistics, Sequoia Hall,

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC

Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC A Simple Approach to Inference in Random Coefficient Models March 8, 1988 Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC 27695-8203 Key Words

More information

Performance of lag length selection criteria in three different situations

Performance of lag length selection criteria in three different situations MPRA Munich Personal RePEc Archive Performance of lag length selection criteria in three different situations Zahid Asghar and Irum Abid Quaid-i-Azam University, Islamabad Aril 2007 Online at htts://mra.ub.uni-muenchen.de/40042/

More information

Nonlinear Autoregressive Processes with Optimal Properties

Nonlinear Autoregressive Processes with Optimal Properties Nonlinear Autoregressive Processes with Optimal Properties F. Blasques S.J. Koopman A. Lucas VU University Amsterdam, Tinbergen Institute, CREATES OxMetrics User Conference, September 2014 Cass Business

More information

Thomas J. Fisher. Research Statement. Preliminary Results

Thomas J. Fisher. Research Statement. Preliminary Results Thomas J. Fisher Research Statement Preliminary Results Many applications of modern statistics involve a large number of measurements and can be considered in a linear algebra framework. In many of these

More information

Re-sampling and exchangeable arrays University Ave. November Revised January Summary

Re-sampling and exchangeable arrays University Ave. November Revised January Summary Re-sampling and exchangeable arrays Peter McCullagh Department of Statistics University of Chicago 5734 University Ave Chicago Il 60637 November 1997 Revised January 1999 Summary The non-parametric, or

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

Geographically Weighted Regression as a Statistical Model

Geographically Weighted Regression as a Statistical Model Geographically Weighted Regression as a Statistical Model Chris Brunsdon Stewart Fotheringham Martin Charlton October 6, 2000 Spatial Analysis Research Group Department of Geography University of Newcastle-upon-Tyne

More information

Combining estimates in regression and. classication. Michael LeBlanc. and. Robert Tibshirani. and. Department of Statistics. cuniversity of Toronto

Combining estimates in regression and. classication. Michael LeBlanc. and. Robert Tibshirani. and. Department of Statistics. cuniversity of Toronto Combining estimates in regression and classication Michael LeBlanc and Robert Tibshirani Department of Preventive Medicine and Biostatistics and Department of Statistics University of Toronto December

More information

2 FRED J. HICKERNELL the sample mean of the y (i) : (2) ^ 1 N The mean square error of this estimate may be written as a sum of two parts, a bias term

2 FRED J. HICKERNELL the sample mean of the y (i) : (2) ^ 1 N The mean square error of this estimate may be written as a sum of two parts, a bias term GOODNESS OF FIT STATISTICS, DISCREPANCIES AND ROBUST DESIGNS FRED J. HICKERNELL Abstract. The Cramer{Von Mises goodness-of-t statistic, also known as the L 2 -star discrepancy, is the optimality criterion

More information

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models Optimum Design for Mixed Effects Non-Linear and generalized Linear Models Cambridge, August 9-12, 2011 Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

More information

ADJUSTED POWER ESTIMATES IN. Ji Zhang. Biostatistics and Research Data Systems. Merck Research Laboratories. Rahway, NJ

ADJUSTED POWER ESTIMATES IN. Ji Zhang. Biostatistics and Research Data Systems. Merck Research Laboratories. Rahway, NJ ADJUSTED POWER ESTIMATES IN MONTE CARLO EXPERIMENTS Ji Zhang Biostatistics and Research Data Systems Merck Research Laboratories Rahway, NJ 07065-0914 and Dennis D. Boos Department of Statistics, North

More information

A Study on the Bias-Correction Effect of the AIC for Selecting Variables in Normal Multivariate Linear Regression Models under Model Misspecification

A Study on the Bias-Correction Effect of the AIC for Selecting Variables in Normal Multivariate Linear Regression Models under Model Misspecification TR-No. 13-08, Hiroshima Statistical Research Group, 1 23 A Study on the Bias-Correction Effect of the AIC for Selecting Variables in Normal Multivariate Linear Regression Models under Model Misspecification

More information

Variable Selection in Multivariate Linear Regression Models with Fewer Observations than the Dimension

Variable Selection in Multivariate Linear Regression Models with Fewer Observations than the Dimension Variable Selection in Multivariate Linear Regression Models with Fewer Observations than the Dimension (Last Modified: November 3 2008) Mariko Yamamura 1, Hirokazu Yanagihara 2 and Muni S. Srivastava 3

More information

arxiv: v2 [stat.co] 6 Oct 2014

arxiv: v2 [stat.co] 6 Oct 2014 Model selection criteria in beta regression with varying dispersion Fábio M. Bayer Francisco Cribari Neto Abstract arxiv:1405.3718v [stat.co] 6 Oct 014 We address the issue of model selection in beta regressions

More information

Journal of Statistical Research 2007, Vol. 41, No. 1, pp Bangladesh

Journal of Statistical Research 2007, Vol. 41, No. 1, pp Bangladesh Journal of Statistical Research 007, Vol. 4, No., pp. 5 Bangladesh ISSN 056-4 X ESTIMATION OF AUTOREGRESSIVE COEFFICIENT IN AN ARMA(, ) MODEL WITH VAGUE INFORMATION ON THE MA COMPONENT M. Ould Haye School

More information

Gaussian processes. Basic Properties VAG002-

Gaussian processes. Basic Properties VAG002- Gaussian processes The class of Gaussian processes is one of the most widely used families of stochastic processes for modeling dependent data observed over time, or space, or time and space. The popularity

More information

Shrinkage Estimation of the Slope Parameters of two Parallel Regression Lines Under Uncertain Prior Information

Shrinkage Estimation of the Slope Parameters of two Parallel Regression Lines Under Uncertain Prior Information Shrinkage Estimation of the Slope Parameters of two Parallel Regression Lines Under Uncertain Prior Information Shahjahan Khan Department of Mathematics & Computing University of Southern Queensland Toowoomba,

More information

Modfiied Conditional AIC in Linear Mixed Models

Modfiied Conditional AIC in Linear Mixed Models CIRJE-F-895 Modfiied Conditional AIC in Linear Mixed Models Yuki Kawakubo Graduate School of Economics, University of Tokyo Tatsuya Kubokawa University of Tokyo July 2013 CIRJE Discussion Papers can be

More information

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation.

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation. Statistical Computation Math 475 Jimin Ding Department of Mathematics Washington University in St. Louis www.math.wustl.edu/ jmding/math475/index.html October 10, 2013 Ridge Part IV October 10, 2013 1

More information

STAT Financial Time Series

STAT Financial Time Series STAT 6104 - Financial Time Series Chapter 4 - Estimation in the time Domain Chun Yip Yau (CUHK) STAT 6104:Financial Time Series 1 / 46 Agenda 1 Introduction 2 Moment Estimates 3 Autoregressive Models (AR

More information

11. Bootstrap Methods

11. Bootstrap Methods 11. Bootstrap Methods c A. Colin Cameron & Pravin K. Trivedi 2006 These transparencies were prepared in 20043. They can be used as an adjunct to Chapter 11 of our subsequent book Microeconometrics: Methods

More information

BAYESIAN ESTIMATION OF LINEAR STATISTICAL MODEL BIAS

BAYESIAN ESTIMATION OF LINEAR STATISTICAL MODEL BIAS BAYESIAN ESTIMATION OF LINEAR STATISTICAL MODEL BIAS Andrew A. Neath 1 and Joseph E. Cavanaugh 1 Department of Mathematics and Statistics, Southern Illinois University, Edwardsville, Illinois 606, USA

More information

Hi-Stat Discussion Paper. Research Unit for Statistical and Empirical Analysis in Social Sciences (Hi-Stat)

Hi-Stat Discussion Paper. Research Unit for Statistical and Empirical Analysis in Social Sciences (Hi-Stat) Global COE Hi-Stat Discussion Paper Series 006 Research Unit for Statistical and Empirical Analysis in Social Sciences (Hi-Stat) Model Selection Criteria for the Leads-and-Lags Cointegrating Regression

More information

The New Palgrave Dictionary of Economics Online

The New Palgrave Dictionary of Economics Online The New Palgrave Dictionary of Economics Online model selection Jean-Marie Dufour From The New Palgrave Dictionary of Economics, Second Edition, 2008 Edited by Steven N. Durlauf and Lawrence E. Blume Abstract

More information

INTERNALLY STUDENTIZED RESIDUALS TEST FOR MULTIPLE NON-NESTED REGRESSION MODELS

INTERNALLY STUDENTIZED RESIDUALS TEST FOR MULTIPLE NON-NESTED REGRESSION MODELS International Journal of Physics and Mathematical Sciences ISSN: 77-111 (Online) 014 Vol. 4 (1) January-March, pp.49-56/naidu et al. INTERNALLY STUENTIZE RESIUALS TEST FOR MULTIPLE NON-NESTE REGRESSION

More information

MODEL SELECTION BASED ON QUASI-LIKELIHOOD WITH APPLICATION TO OVERDISPERSED DATA

MODEL SELECTION BASED ON QUASI-LIKELIHOOD WITH APPLICATION TO OVERDISPERSED DATA J. Jpn. Soc. Comp. Statist., 26(2013), 53 69 DOI:10.5183/jjscs.1212002 204 MODEL SELECTION BASED ON QUASI-LIKELIHOOD WITH APPLICATION TO OVERDISPERSED DATA Yiping Tang ABSTRACT Overdispersion is a common

More information

Econometric Forecasting

Econometric Forecasting Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 1, 2014 Outline Introduction Model-free extrapolation Univariate time-series models Trend

More information

Model selection for good estimation or prediction over a user-specified covariate distribution

Model selection for good estimation or prediction over a user-specified covariate distribution Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2010 Model selection for good estimation or prediction over a user-specified covariate distribution Adam Lee

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS Empirical Information Criteria for Time Series Forecasting Model Selection Md B Billah, R.J. Hyndman and A.B. Koehler Working

More information

2 T. Schneider and A. Neumaier m-dimensional random vectors with zero mean and covariance matrix C. The m- dimensional parameter vector w of intercept

2 T. Schneider and A. Neumaier m-dimensional random vectors with zero mean and covariance matrix C. The m- dimensional parameter vector w of intercept Algorithm: ARfit A Matlab Package for Estimation and Spectral Decomposition of Multivariate Autoregressive Models Tapio Schneider Princeton University and Arnold Neumaier Universitat Wien ARfit is a collection

More information

A characterization of consistency of model weights given partial information in normal linear models

A characterization of consistency of model weights given partial information in normal linear models Statistics & Probability Letters ( ) A characterization of consistency of model weights given partial information in normal linear models Hubert Wong a;, Bertrand Clare b;1 a Department of Health Care

More information

ON VARIANCE COVARIANCE COMPONENTS ESTIMATION IN LINEAR MODELS WITH AR(1) DISTURBANCES. 1. Introduction

ON VARIANCE COVARIANCE COMPONENTS ESTIMATION IN LINEAR MODELS WITH AR(1) DISTURBANCES. 1. Introduction Acta Math. Univ. Comenianae Vol. LXV, 1(1996), pp. 129 139 129 ON VARIANCE COVARIANCE COMPONENTS ESTIMATION IN LINEAR MODELS WITH AR(1) DISTURBANCES V. WITKOVSKÝ Abstract. Estimation of the autoregressive

More information