Some connections between Bayesian and non-Bayesian methods for regression model selection

Statistics & Probability Letters 57 (2002)

Faming Liang

Department of Statistics and Applied Probability, The National University of Singapore, 3 Science Drive 2, Singapore

E-mail address: stalfm@nus.edu.sg (F. Liang)

Received March 2001; received in revised form August 2001

Abstract

In this article, we study the connections between Bayesian methods and non-Bayesian methods for variable selection in multiple linear regression. We show that each of the non-Bayesian criteria, $\mathrm{FPE}_\lambda$, AIC, $C_p$ and adjusted $R^2$, has its Bayesian correspondence under an appropriate prior setting. The theoretical results are illustrated by numerical simulations. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Bayes factor; FPE criterion; Kullback–Leibler distance; MAP; Variable selection

1. Introduction

Consider a linear regression with a fixed number of potential predictors $x_1, \ldots, x_k$,

$$Y = X\beta + \varepsilon, \qquad (1)$$

where $Y$ is an $n$-vector of responses, $X = [1, x_1, \ldots, x_k]$ is an $n \times (k+1)$ design matrix, $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ is a $(k+1)$-vector of regression coefficients, $\varepsilon \sim N_n(0, \sigma^2 I)$, and $\beta$ and $\sigma^2$ are unknown. The problem of interest to us is to find a subset model of the form

$$Y = X_p \beta_p + \varepsilon, \qquad (2)$$

which is best under some criterion, where $0 \le p \le k$, $X_p = [1, x^*_1, \ldots, x^*_p]$, $x^*_1, \ldots, x^*_p$ are the selected predictors, and $\beta_p = (\beta^*_0, \beta^*_1, \ldots, \beta^*_p)'$ is the vector of regression coefficients of the subset model. When $p = k$, the model is called the full model; when $p = 0$, the model is called the null model.
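To make this setup concrete, the following sketch simulates data from a true subset model and computes the residual sum of squares of every subset model by OLS. The dimensions, coefficients and seed are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of models (1)-(2): simulate data whose true model uses
# only x_1 and x_2, then fit every subset model by OLS. All dimensions,
# coefficients and the seed are assumptions made for this example.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # [1, x_1, ..., x_k]
beta = np.array([1.0, 2.0, -1.5, 0.0, 0.0, 0.0])             # true model: x_1, x_2
y = X @ beta + rng.normal(size=n)

def rss(cols):
    """Residual sum of squares of the OLS fit on the given columns."""
    b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ b
    return float(r @ r)

# Every subset model keeps the intercept (column 0); there are 2^k of them.
models = [[0, *s] for r in range(k + 1)
          for s in itertools.combinations(range(1, k + 1), r)]
rss_by_model = {tuple(m): rss(m) for m in models}
```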

Throughout this article, we assume that for any subset model the intercept term is always included in $\beta_p$ and the design matrix $X_p$ is of full column rank, and that the true model is one of the subset models, consisting of $k^*$ predictors with $0 \le k^* \le k$.

During the last three decades, numerous methods have been developed for this problem from both the Bayesian and non-Bayesian perspectives. The non-Bayesian methods are usually criterion-based: they select a best model under some criterion and then make inferences as if the selected model were true. The best-known criteria include adjusted $R^2$, FPE (Akaike, 1969), $C_p$ (Mallows, 1973, 1995), AIC (Akaike, 1973; Sugiura, 1978; Hurvich and Tsai, 1989), PRESS (Allen, 1974; Stone, 1974), the Hannan–Quinn criterion (Hannan and Quinn, 1979), GIC (Nishii, 1984) and others (Shao, 1993, 1997; Zhang, 1993; Foster and George, 1994; Zheng and Loh, 1995). The model determination usually requires a comparison of all $2^k$ possible models, which is prohibitive when $k$ is large. To reduce the computational burden for a large value of $k$, some heuristic methods have been proposed that restrict the model space to a smaller number of potential subsets, e.g., branch and bound methods (Furnival and Wilson, 1974), stepwise procedures (Efroymson, 1966) and their variants. For details, see Miller (1990) and the references therein.

Bayesian methods include MAP (maximum a posteriori), the Bayes factor (Jeffreys, 1961; Kass and Raftery, 1995) and predictive criteria-based methods (San Martini and Spezzaferri, 1984; Laud and Ibrahim, 1995). An overview is given in George (1999). The MAP method selects the model with the maximum posterior probability in the model space. One famous example is BIC (Schwarz, 1978; Kass and Wasserman, 1995; Raftery, 1996; Pauler, 1998). Recently, other MAP examples have been proposed based on different prior settings (George and McCulloch, 1993, 1997; Phillips and Smith, 1995; Geweke, 1996), and Markov chain Monte Carlo (MCMC) methods are used to search for MAP models. The ergodicity of the Markov chains ensures that the MAP models will be found almost surely as the running time tends to infinity (Tierney, 1994). A defect in the philosophy of the MAP method is that the posterior distribution is sensitive to the prior distribution imposed on the model space by users. The Bayes factor gets around this difficulty by dividing the posterior odds by the prior odds, and thus compares the marginal probabilities of the models. It is defined as

$$B_{10} = \frac{P(M_1 \mid Y)/P(M_0 \mid Y)}{P(M_1)/P(M_0)} = \frac{P(Y \mid M_1)}{P(Y \mid M_0)},$$

where $M_1$ and $M_0$ denote the two models under comparison. When $B_{10} > 1$, model $M_1$ is supported; otherwise, model $M_0$ is supported. However, the Bayes factor is far from perfection. When improper priors are imposed on some model-specific parameters, the Bayes factor can only be determined up to a constant, and in this case it no longer makes sense for model comparison. A variety of approximate Bayes factors have been proposed to overcome this difficulty. Geisser and Eddy (1979) and Gelfand et al. (1992) proposed to replace the marginal distribution by a cross-validation predictive distribution, and the replacement yields the pseudo-Bayes factor $\tilde{B}_{10}$,

$$\tilde{B}_{10} = \prod_{i=1}^{n} \frac{P(y_i \mid Y_{(i)}, M_1)}{P(y_i \mid Y_{(i)}, M_0)},$$

where $Y_{(i)}$ denotes the data set with the $i$th case omitted. Other approximations and related references can be found in Spiegelhalter and Smith (1982), Pericchi (1984), Aitkin (1991), Gelfand and Dey (1994), O'Hagan (1995), Berger and Pericchi (1996) and Moreno, Bertolino and Racugno (1998).
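The pseudo-Bayes factor lends itself to a short numerical sketch. In the version below, the leave-one-out predictive density $P(y_i \mid Y_{(i)}, M)$ is approximated by a plug-in normal density from the OLS fit with case $i$ omitted, a simplification of the fully Bayesian predictive density; the helper functions are ours, not the paper's.

```python
# Hedged sketch of the log pseudo-Bayes factor: the leave-one-out predictive
# density is approximated by a plug-in normal density from the OLS fit
# without case i (the exact version would integrate the parameters out).
import numpy as np

def loo_log_predictive(X, y):
    """Sum of log plug-in predictive densities over leave-one-out splits."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = y[keep] - X[keep] @ b
        s2 = resid @ resid / (n - 1 - X.shape[1])   # plug-in variance estimate
        mu = X[i] @ b
        total += -0.5 * np.log(2 * np.pi * s2) - (y[i] - mu) ** 2 / (2 * s2)
    return total

def log_pseudo_bayes_factor(X1, X0, y):
    """log pseudo-Bayes factor of model M1 (design X1) versus M0 (design X0)."""
    return loo_log_predictive(X1, y) - loo_log_predictive(X0, y)
```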

Given the plethora of model selection criteria, a need exists for research that unifies existing criteria under common themes. This article contributes towards fulfilling this need by exploring some connections between Bayesian and non-Bayesian methods for variable selection in multiple linear regression. Under an appropriate prior setting, we show that the MAP methods correspond to the $\mathrm{FPE}_\lambda$ criteria, and that the pseudo-Bayes factor corresponds to the adjusted $R^2$ criterion; both minimize the Kullback–Leibler distance between the predictive likelihoods of the true and candidate models. This article is organized as follows. In Section 2 we study the connections between $\mathrm{FPE}_\lambda$ criteria and MAP methods. In Section 3 we study the connections between the pseudo-Bayes factor and the adjusted $R^2$ criterion. In Section 4 we present some simulation results to confirm the theoretical results of this article.

2. FPE criteria and MAP models

The original FPE criterion was proposed by Akaike (1969) to minimize the final prediction error (FPE). For a particular subset model $M_p$, the FPE is defined as

$$E(z_i - \hat{z}_i)^2 = \sigma^2\left(1 + \frac{p^*}{n}\right),$$

where $p^* = p + 1$ is the total number of explanatory variables included in the subset model (including the intercept term), $z_i$ denotes a new observation independent of the ones used for model determination, and $\hat{z}_i$ denotes the predicted value of $z_i$. Akaike's derivation estimated $\sigma^2$ with $\hat{\sigma}^2_p = \mathrm{RSS}_p/(n - p^*)$, and this substitution yields

$$\mathrm{FPE}(M_p) = \mathrm{RSS}_p\,\frac{n + p^*}{n - p^*}$$

by ignoring a constant factor $1/n$, or equivalently

$$\mathrm{FPE}(M_p) = \mathrm{RSS}_p + 2p^*\hat{\sigma}^2_p,$$

where $\mathrm{RSS}_p = Y'Y - Y'X_p(X_p'X_p)^{-1}X_p'Y$ is the residual sum of squares of model $M_p$. Later, Shibata (1984) suggested that $\hat{\sigma}^2_p$ be replaced by $\hat{\sigma}^2_k = \mathrm{RSS}_k/(n-k-1)$, and proposed a generalized form for the FPE criteria,

$$\mathrm{FPE}_\lambda(M_p) = \mathrm{RSS}_p + \lambda p^* \hat{\sigma}^2_k, \qquad (3)$$

where $\lambda$ is a penalty coefficient chosen by users. As is known, many criteria can be represented in the form of $\mathrm{FPE}_\lambda$ with different values of $\lambda$. For example, $\lambda = 2$ yields $C_p$ and, approximately, AIC and PRESS; $\lambda = 2\log k$ yields the risk inflation criterion (Foster and George, 1994); $\lambda = 2c\log\log n$ yields the Hannan–Quinn criterion (Hannan and Quinn, 1979); $\lambda = \log n$ yields BIC; and if $\lambda$ is a function of $n$ with $\lim_{n\to\infty} \lambda_n/n = 0$, it yields the GIC criterion (Nishii, 1984).
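A minimal sketch of the $\mathrm{FPE}_\lambda$ family on assumed synthetic data follows: with $\lambda = 2$ the criterion orders models like $C_p$ (and approximately AIC), while $\lambda = \log n$ gives a BIC-type penalty.

```python
# Illustrative computation of the FPE_lambda family on assumed synthetic
# data: lambda = 2 gives a C_p / AIC-like selection, lambda = log(n) a
# BIC-like one.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0, 0.0]) + rng.normal(size=n)

def rss(cols):
    b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ b
    return float(r @ r)

sigma2_k = rss(list(range(k + 1))) / (n - k - 1)      # full-model estimate

def fpe_lambda(cols, lam):
    """FPE_lambda(M_p) = RSS_p + lambda * p* * sigma_k^2, with p* = len(cols)."""
    return rss(cols) + lam * len(cols) * sigma2_k

models = [[0, *s] for r in range(k + 1)
          for s in itertools.combinations(range(1, k + 1), r)]
for lam, name in [(2.0, "C_p / AIC-like"), (np.log(n), "BIC-like")]:
    best = min(models, key=lambda m: fpe_lambda(m, lam))
    print(name, "selects predictors", best[1:])
```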

To study the connections between FPE criteria and MAP methods, we consider the following prior setting for model $M_p$. First, the model is reparameterized as

$$Y = Z_p \theta_p + \varepsilon, \qquad (4)$$

where a QR decomposition is performed on $X_p$, $X_p = Z_p R_p$. Thus, $Z_p$ is an $n \times (p+1)$ matrix with orthonormal columns, $R_p$ is upper triangular, and $\theta_p = R_p \beta_p$. The likelihood function of the model is

$$L_p(Y \mid X, M_p, \theta_p, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}(Y - Z_p\theta_p)'(Y - Z_p\theta_p)\right\}. \qquad (5)$$

We assume that all $k$ predictors are linearly independent, and that each has a prior probability $\gamma$ of being included in the model. Thus, the prior probability imposed on the model $M_p$ is

$$P(M_p) = \gamma^p (1-\gamma)^{k-p}, \qquad (6)$$

where $\gamma$ is a hyperparameter to be determined later. We further assume that $\theta_p$ and $\sigma^2$ are a priori independent and subject to the Jeffreys non-informative priors,

$$P(\theta_p \mid M_p) \propto 1, \qquad (7)$$

$$P(\sigma^2) \propto \frac{1}{\sigma^2}. \qquad (8)$$

Multiplying (5)–(8), we get the posterior distribution (up to a multiplicative constant),

$$P(M_p, \theta_p, \sigma^2 \mid Y) \propto P(Y \mid X, \theta_p, \sigma^2, M_p)\,P(\theta_p \mid M_p)\,P(\sigma^2)\,P(M_p) = \gamma^p (1-\gamma)^{k-p}\,(2\pi)^{-n/2}(\sigma^2)^{-(n/2+1)} \exp\left\{-\frac{\hat{\varepsilon}'\hat{\varepsilon}}{2\sigma^2}\right\} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=0}^{p}(\theta_{pi} - \hat{\theta}_{pi})^2\right\}, \qquad (9)$$

where $\hat{\varepsilon} = Y - X_p\hat{\beta}_p = Y - Z_p\hat{\theta}_p$ is the residual vector. In the derivation we used that

$$(Y - X_p\beta_p)'(Y - X_p\beta_p) = (Y - Z_p\theta_p)'(Y - Z_p\theta_p) = \hat{\varepsilon}'\hat{\varepsilon} + \sum_{i=0}^{p}(\theta_{pi} - \hat{\theta}_{pi})^2.$$

Integrating out $\theta_p$ and $\sigma^2$ from (9) and then taking logarithms, we get the log-posterior of model $M_p$ (up to an additive constant),

$$\log P(M_p \mid Y) = p\log\frac{\gamma}{1-\gamma} - \frac{n-p-1}{2}\log(2\pi) - \frac{n-p-1}{2}\log\frac{\mathrm{RSS}_p}{2} + \log\Gamma\left(\frac{n-p-1}{2}\right). \qquad (10)$$

This posterior distribution depends on only one tunable parameter, $\gamma$, whose value reflects our prior knowledge of the number of predictors in the regression. If we think more predictors should be included in the regression, $\gamma$ can be set to a large value; otherwise, it should be set to a small value. A particular form of $\gamma$ is of special interest,

$$\gamma = \frac{1}{1 + \sqrt{2\pi}\,\hat{\sigma}_k \exp\{\lambda/2 + 1/(2(n-1))\}},$$

where $\hat{\sigma}^2_k = \mathrm{RSS}_k/(n-k-1)$ and $\lambda$ is a value specified by users. The following theorem shows the correspondence between FPE criteria and MAP methods for some specific values of $\gamma$.

Theorem 2.1. Under the prior setting (6)–(8), when $\gamma = [1 + \sqrt{2\pi}\,\hat{\sigma}_k \exp\{\lambda/2 + 1/(2(n-1))\}]^{-1}$ and $n \gg p$, we have

$$\log P(M_p \mid Y) \approx \mathrm{constant} - \mathrm{FPE}_\lambda(p)/[2\hat{\sigma}^2_k] \qquad (11)$$

for the models with $\Delta_p \approx 0$, where $\Delta_p = (\hat{\sigma}^2_p - \hat{\sigma}^2_k)/\hat{\sigma}^2_k$.

Proof. By the Stirling approximation, when $n \gg p$ we have

$$\log\Gamma\left(\frac{n-p-1}{2}\right) \approx \frac{n-p-2}{2}\log\frac{n-p-1}{2} - \frac{n-p-1}{2} + \frac{1}{2}\log(2\pi). \qquad (12)$$

Substituting (12) into (10), we have

$$\log P(M_p \mid Y) \approx \mathrm{constant} + p\log\frac{\gamma}{1-\gamma} + \frac{p}{2}[1 + \log(2\pi)] - \frac{1}{2}\log\left(1 - \frac{p}{n-1}\right) + p\log\hat{\sigma}_k - \frac{n-p-1}{2}\log(\hat{\sigma}^2_p/\hat{\sigma}^2_k). \qquad (13)$$

The uniqueness of the full model allows us to regard $\hat{\sigma}^2_k$ as a constant in the derivation. With the Taylor expansion, we have $\log(\hat{\sigma}^2_p/\hat{\sigma}^2_k) = \log(1 + \Delta_p) \approx \Delta_p$ for $\Delta_p \approx 0$, and $\log(1 - p/(n-1)) \approx -p/(n-1)$. Hence,

$$\log P(M_p \mid Y) \approx \mathrm{constant} + p\log\frac{\gamma}{1-\gamma} + \frac{p}{2}[1 + \log(2\pi)] + \frac{p}{2(n-1)} + p\log\hat{\sigma}_k - \frac{n-p-1}{2}(\hat{\sigma}^2_p/\hat{\sigma}^2_k - 1) = \mathrm{constant} - \frac{1}{2\hat{\sigma}^2_k}[\mathrm{RSS}_p + (\mathrm{I})], \qquad (14)$$

where

$$(\mathrm{I}) = 2p\hat{\sigma}^2_k\left[\log\frac{1-\gamma}{\gamma} - \frac{1}{2(n-1)} - \frac{1}{2}\log(2\pi) - \log\hat{\sigma}_k\right].$$

When $\gamma$ takes the value given above, $(\mathrm{I}) = \lambda p\hat{\sigma}^2_k$, which differs from $\lambda p^*\hat{\sigma}^2_k$ only by the model-independent constant $\lambda\hat{\sigma}^2_k$. The proof is completed.
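Theorem 2.1 can be checked numerically. The sketch below evaluates the log-posterior (10) under the stated choice of $\gamma$ and compares the MAP model with the minimizer of $\mathrm{FPE}_\lambda$ over all subset models; the data and seed are assumptions made for the illustration.

```python
# Numerical check of Theorem 2.1 on assumed synthetic data: the exact
# log-posterior (10), with gamma chosen as in the theorem, should rank
# models like -FPE_lambda(p)/(2 sigma_k^2), so the MAP model and the
# minimum-FPE_lambda model should typically coincide.
import itertools
import math
import numpy as np

rng = np.random.default_rng(2)
n, k, lam = 60, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, -1.0, 0.0, 0.0]) + rng.normal(size=n)

def rss(cols):
    b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ b
    return float(r @ r)

sigma2_k = rss(list(range(k + 1))) / (n - k - 1)
gamma = 1.0 / (1.0 + math.sqrt(2 * math.pi * sigma2_k)
               * math.exp(lam / 2 + 1 / (2 * (n - 1))))

def log_post(cols):                      # equation (10), up to a constant
    p, z = len(cols) - 1, (n - len(cols)) / 2
    return (p * math.log(gamma / (1 - gamma)) - z * math.log(2 * math.pi)
            - z * math.log(rss(cols) / 2) + math.lgamma(z))

def fpe(cols):                           # FPE_lambda with p* = len(cols)
    return rss(cols) + lam * len(cols) * sigma2_k

models = [[0, *s] for r in range(k + 1)
          for s in itertools.combinations(range(1, k + 1), r)]
print("MAP model:    ", max(models, key=log_post))
print("min FPE model:", min(models, key=fpe))
```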

In Theorem 2.1, the condition $\Delta_p \approx 0$ is satisfied by many models, including the true model, all overfitting models, and part of the underfitting models. For the true and overfitting models, $\hat{\sigma}^2_p$ and $\hat{\sigma}^2_k$ are both consistent estimators of $\sigma^2$. For the underfitting models, $\hat{\sigma}^2_p$ is biased upward (Rencher, 2000, p. 157). If $\hat{\sigma}^2_p$ is far from $\hat{\sigma}^2_k$, $-\mathrm{FPE}_\lambda(p)/[2\hat{\sigma}^2_k]$ provides an under-approximation of $\log P(M_p \mid Y)$; fortunately, models of this kind usually carry a very small total mass in the posterior $P(M_p \mid Y)$. Hence, approximately we have

$$P(M_p \mid Y) \propto \exp\{-\mathrm{FPE}_\lambda(p)/[2\hat{\sigma}^2_k]\}. \qquad (15)$$

This is confirmed by our numerical results in Section 4. A similar relationship between MAP and the $C_p$ criterion was obtained by Liang, Truong and Wong (2001) under a quite different prior setting.

3. Predictive information, pseudo-Bayes factor and adjusted $R^2$

Under the Bayesian framework, the predictive likelihood of a candidate model $M$ can be written as

$$f = \prod_{i=1}^{m} f(z_i \mid Y, M), \qquad (16)$$

where $Y$ denotes the $n$ observations used for the model determination, and $z_1, \ldots, z_m$ denote $m$ new observations independent of $Y$. Similarly, the predictive likelihood of the true model $M^*$ can be written as

$$f^* = \prod_{i=1}^{m} f(z_i \mid Y, M^*). \qquad (17)$$

A useful measure of the discrepancy between $f$ and $f^*$ is the Kullback–Leibler distance (up to a constant)

$$D(M, M^*) = -\frac{1}{m} E^* \log(f), \qquad (18)$$

where $E^*$ denotes expectation with respect to $f^*$. $D(M, M^*)$ is called the predictive information in this article. It can be estimated through the cross-validation statistic

$$\hat{D}(M, M^*) = -\frac{1}{n}\sum_{i=1}^{n} \log f(y_i \mid Y_{(i)}, M), \qquad (19)$$

where $Y_{(i)}$ denotes the $(n-1)$-vector of observations with $y_i$ omitted. Comparing with $\tilde{B}_{10}$, it is easy to see the equivalence between the pseudo-Bayes factor and minimizing $D(M, M^*)$ for variable selection in multiple linear regression. To compute $\hat{D}(M, M^*)$, we have the following theorem.

Theorem 3.1. Under the prior setting (6)–(8), minimizing the predictive information $D(M, M^*)$ is approximately equivalent to minimizing the studentized residual sum of squares (SRSS), which for a particular model $M_p$ is

$$\mathrm{SRSS}(M_p) = \sum_{i=1}^{n} d_i^2,$$

where $d_i = e_i/[\hat{\sigma}_k\sqrt{1 - h_{ii}}]$, $e_i$ is the OLS residual for the $i$th case, $h_{ii}$ is the $i$th diagonal element of the hat matrix $H_p = X_p(X_p'X_p)^{-1}X_p'$, and $\hat{\sigma}^2_k = \mathrm{RSS}_k/(n-k-1)$.

Proof. With the identity

$$P(y_i \mid Y_{(i)}, M) = \frac{P(M \mid Y)}{P(M \mid Y_{(i)})}\cdot\frac{P(Y)}{P(Y_{(i)})},$$

we have

$$\hat{D}(M, M^*) = -\frac{1}{n}\sum_{i=1}^{n}[\log P(M \mid Y) - \log P(M \mid Y_{(i)})] - \frac{1}{n}\sum_{i=1}^{n}[\log P(Y) - \log P(Y_{(i)})].$$

Hence, minimizing $\hat{D}(M, M^*)$ is equivalent to minimizing

$$\hat{D}^*(M, M^*) = -\sum_{i=1}^{n}[\log P(M \mid Y) - \log P(M \mid Y_{(i)})].$$

Similar to (13), we have

$$\log P(M \mid Y) \approx \mathrm{constant} + p\log\frac{\gamma}{1-\gamma} + \frac{p}{2}[1 + \log(2\pi)] - \frac{1}{2}\log\left(1 - \frac{p}{n-1}\right) + p\log\hat{\sigma}_0 - \frac{n-p-1}{2}\log(\hat{\sigma}^2_p/\hat{\sigma}^2_0)$$

$$\approx \mathrm{constant} + p\log\frac{\gamma}{1-\gamma} + \frac{p}{2}[1 + \log(2\pi)] + \frac{p}{2(n-1)} + p\log\hat{\sigma}_0 - \frac{n-p-1}{2}(\hat{\sigma}^2_p/\hat{\sigma}^2_0 - 1) = \mathrm{constant} - \frac{1}{2\hat{\sigma}^2_0}[\mathrm{RSS}_p + (\mathrm{II})], \qquad (20)$$

where $\hat{\sigma}^2_0 = \sum_{i=1}^{n}(y_i - \bar{y})^2/(n-1)$ can be regarded as a constant in the derivation, and

$$(\mathrm{II}) = 2p\hat{\sigma}^2_0\left[\log\frac{1-\gamma}{\gamma} - \frac{1}{2(n-1)} - \frac{1}{2}\log(2\pi) - \log\hat{\sigma}_0\right].$$

Let $\gamma = [1 + \sqrt{2\pi}\,\hat{\sigma}_0\exp\{1/(2(n-1))\}]^{-1}$; then $(\mathrm{II}) = 0$. In the derivation of (20), we used the Taylor expansion $\log(\hat{\sigma}^2_p/\hat{\sigma}^2_0) \approx -1 + \hat{\sigma}^2_p/\hat{\sigma}^2_0$, ignoring the higher-order terms (note that $|(\hat{\sigma}^2_p - \hat{\sigma}^2_0)/\hat{\sigma}^2_0| \le 1$ for all $M_p$). Hence, with this special value of $\gamma$, we have

$$\hat{D}^*(M, M^*) \approx \sum_{i=1}^{n}\left[\frac{\mathrm{RSS}_p}{2\hat{\sigma}^2_0} - \frac{\mathrm{RSS}_{p(i)}}{2\hat{\sigma}^2_{0(i)}}\right], \qquad (21)$$

where $\hat{\sigma}^2_{0(i)}$ denotes the estimate of $\sigma^2$ from the null model with the $i$th case omitted. When $n$ is large, $\hat{\sigma}^2_{0(i)} \approx \hat{\sigma}^2_0$ for $i = 1, \ldots, n$. Thus

$$\hat{D}^*(M, M^*) \approx \sum_{i=1}^{n}\frac{\mathrm{RSS}_p - \mathrm{RSS}_{p(i)}}{2\hat{\sigma}^2_0},$$

or equivalently,

$$\hat{D}^*(M, M^*) \approx \sum_{i=1}^{n}\frac{\mathrm{RSS}_p - \mathrm{RSS}_{p(i)}}{2\hat{\sigma}^2_k},$$

where $\mathrm{RSS}_{p(i)}$ denotes the residual sum of squares of model $M_p$ with the $i$th case omitted. An algebraically equivalent expression for $\mathrm{RSS}_p - \mathrm{RSS}_{p(i)}$ is

$$\mathrm{RSS}_p - \mathrm{RSS}_{p(i)} = \frac{e_i^2}{1 - h_{ii}}.$$

An estimated variance of $e_i$ is $s^2\{e_i\} = \hat{\sigma}^2_k(1 - h_{ii})$. Hence,

$$\hat{D}^*(M, M^*) \approx \frac{1}{2}\sum_{i=1}^{n} d_i^2, \qquad (22)$$

where $d_i$ is the studentized residual for the $i$th observation.

The conclusion of Theorem 3.1 is independent of the prior setting (6); this is easily seen from the form of $\hat{D}(M, M^*)$. In the proof, a specific value of $\gamma$ is chosen only to facilitate the derivation. The theorem shows that $D(M, M^*)$ can be measured approximately from a single regression run, without requiring $n$ separate runs, each omitting one of the $n$ cases. Note that $\mathrm{trace}(H_p) = \sum_{i=1}^{n} h_{ii} = p^*$. When $n \gg k$, we have

$$\log\{\mathrm{SRSS}(M_p)\} \approx \log\{\mathrm{RSS}_p/(1 - p^*/n)\} - \log(\hat{\sigma}^2_k) = \log(\hat{\sigma}^2_p) - \log(\hat{\sigma}^2_k) + \log n. \qquad (23)$$

To study the connection between the pseudo-Bayes factor and the adjusted $R^2$ criterion, we first recall the adjusted $R^2$ statistic.

Definition 3.1 (Adjusted $R^2$ statistic).

$$\mathrm{Adj}\,R^2_p = 1 - \frac{\hat{\sigma}^2_p}{\mathrm{MS(total)}},$$

where $\hat{\sigma}^2_p = \mathrm{RSS}_p/(n - p^*)$ and $\mathrm{MS(total)} = \sum_{i=1}^{n}(y_i - \bar{y})^2/(n-1)$.

By comparing (23) with $\mathrm{Adj}\,R^2$, it is clear that minimizing SRSS is approximately equivalent to maximizing $\mathrm{Adj}\,R^2$. Summarizing this section, we have the following conclusion: under the Jeffreys non-informative priors, the pseudo-Bayes factor corresponds to the adjusted $R^2$ criterion, and they both minimize the Kullback–Leibler discrepancy between the predictive likelihoods of the true and candidate models.
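Both computational facts used in this section are easy to verify directly. The sketch below, on assumed synthetic data, checks the leave-one-out identity $\mathrm{RSS}_p - \mathrm{RSS}_{p(i)} = e_i^2/(1 - h_{ii})$ and computes SRSS and the adjusted $R^2$ from a single regression run.

```python
# Direct check (on assumed synthetic data) of the leave-one-out identity
# RSS_p - RSS_p(i) = e_i^2 / (1 - h_ii), plus single-run computation of
# SRSS and adjusted R^2.
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.0]) + rng.normal(size=n)

def fit(cols):
    Xp = X[:, cols]
    H = Xp @ np.linalg.solve(Xp.T @ Xp, Xp.T)      # hat matrix H_p
    e = y - H @ y                                  # OLS residuals
    return H, e, float(e @ e)

H, e, rss_k = fit([0, 1, 2, 3])                    # here [0,1,2,3] is the full model
sigma2_k = rss_k / (n - 4)                         # full-model variance estimate

# Leave-one-out identity for case i = 0 of the full model.
keep = np.arange(n) != 0
b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
rss_loo = float(np.sum((y[keep] - X[keep] @ b) ** 2))
assert np.isclose(rss_k - rss_loo, e[0] ** 2 / (1 - H[0, 0]))

# SRSS and adjusted R^2 for a subset model, each from one regression run.
Hp, ep, rss_p = fit([0, 1, 2])
srss = float(np.sum(ep ** 2 / (sigma2_k * (1 - np.diag(Hp)))))
adj_r2 = 1 - (rss_p / (n - 3)) / np.var(y, ddof=1)
print(srss, adj_r2)
```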

4. Numerical examples

The data come from Draper and Smith (1981). They consist of 9 predictors and 25 observations, taken at intervals from a steam plant at a large industrial concern. These data have been used by many authors to illustrate regression model selection (Miller, 1990).

The posterior distribution (10) was simulated using evolutionary Monte Carlo (EMC) (Liang and Wong, 2000). The Metropolis–Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) or reversible jump MCMC (Green, 1995) would also serve the same purpose for this example. In the simulation we set $\lambda = 2$. Fig. 1(b) is the histogram of the $C_p$ values of the models sampled in one run of EMC. For comparison, Fig. 1(a) shows the Boltzmann distribution $\Pr(M) \propto \exp\{-C_p(M)/2\}$. The high similarity of the two plots shows that $C_p$ provides a good approximation to the log-posterior when $\lambda = 2$. The result of Theorem 2.1 is confirmed.

Fig. 1. A comparison of the true Boltzmann distribution defined on the $C_p$ values (a) and the distribution estimated from posterior samples (b) for the steam plant data.
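For readers without an EMC implementation, a plain Metropolis–Hastings sampler over the model space reproduces the same comparison; as noted above, this suffices for this example. The sketch below runs on synthetic stand-in data (the steam plant data themselves are not reproduced here), targets $P(M) \propto \exp\{-\mathrm{FPE}_2(M)/(2\hat{\sigma}^2_k)\}$, and records the $C_p$ value of each sampled model, whose histogram should match the Boltzmann law.

```python
# Metropolis-Hastings sketch over the 2^k model space (the paper used EMC,
# noting that Metropolis-Hastings would also serve). Data are synthetic
# stand-ins for the steam plant data; with lambda = 2 the target is
# P(M) proportional to exp{-FPE_2(M)/(2 sigma_k^2)}, so sampled C_p values
# should follow the Boltzmann law Pr(M) ~ exp{-C_p(M)/2}.
import numpy as np

rng = np.random.default_rng(4)
n, k = 25, 9
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.concatenate([[1.0, 2.0, -1.5], np.zeros(k - 2)]) + rng.normal(size=n)

def rss(mask):
    cols = np.flatnonzero(np.concatenate([[True], mask]))  # intercept always in
    b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ b
    return float(r @ r)

sigma2_k = rss(np.ones(k, dtype=bool)) / (n - k - 1)

def log_target(mask):                    # -FPE_2(M) / (2 sigma_k^2)
    return -(rss(mask) + 2.0 * (mask.sum() + 1) * sigma2_k) / (2 * sigma2_k)

mask = np.zeros(k, dtype=bool)
cp_samples = []
for _ in range(10000):
    prop = mask.copy()
    prop[rng.integers(k)] ^= True        # flip one predictor in or out
    if np.log(rng.random()) < log_target(prop) - log_target(mask):
        mask = prop
    cp_samples.append(rss(mask) / sigma2_k - n + 2 * (mask.sum() + 1))  # C_p
```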

References

Aitkin, M., 1991. Posterior Bayes factors. J. Roy. Statist. Soc. B 53.
Akaike, H., 1969. Fitting autoregressive models for prediction. Ann. Inst. Statist. Math. 21.
Akaike, H., 1973. Information theory and the extension of the maximum likelihood principle. In: Petrov, B.N., Czaki, F. (Eds.), Proceedings of the International Symposium on Information Theory. Akademiai Kiado, Budapest.
Allen, D.M., 1974. The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16.
Berger, J.O., Pericchi, L.R., 1996. The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Assoc. 91.
Draper, N.R., Smith, H., 1981. Applied Regression Analysis, 2nd Edition. Wiley, New York.
Efroymson, M.A., 1966. Multiple regression analysis. In: Ralston, A., Wilf, H.S. (Eds.), Mathematical Methods for Digital Computers. Wiley, New York.
Foster, D.P., George, E.I., 1994. The risk inflation criterion for multiple regression. Ann. Statist. 22.
Furnival, G.M., Wilson, R.W., 1974. Regression by leaps and bounds. Technometrics 16.
Geisser, S., Eddy, W., 1979. A predictive approach to model selection. J. Amer. Statist. Assoc. 74.
Gelfand, A., Dey, D., 1994. Bayesian model choice: asymptotic and exact calculations. J. Roy. Statist. Soc. B 56.
Gelfand, A.E., Dey, D.K., Chang, H., 1992. Model determination using predictive distributions with implementation via sampling-based methods. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics, Vol. 4. Oxford University Press, Oxford.
George, E.I., 1999. Bayesian model selection. In: Kotz, S., Read, C., Banks, D. (Eds.), Encyclopedia of Statistical Sciences Update, Vol. 3. Wiley, New York.
George, E.I., McCulloch, R.E., 1993. Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88.
George, E.I., McCulloch, R.E., 1997. Approaches for Bayesian variable selection. Statist. Sinica 7.
Geweke, J., 1996. Variable selection and model comparison in regression. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics, Vol. 5. Oxford University Press, Oxford.
Green, P.J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82.
Hannan, E.J., Quinn, B.G., 1979. The determination of the order of an autoregression. J. Roy. Statist. Soc. B 41.
Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57.
Hurvich, C.M., Tsai, C.L., 1989. Regression and time series model selection in small samples. Biometrika 76.
Jeffreys, H., 1961. Theory of Probability, 3rd Edition. Oxford University Press, London.
Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Amer. Statist. Assoc. 90.
Kass, R.E., Wasserman, L., 1995. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Statist. Assoc. 90.
Laud, P.W., Ibrahim, J.G., 1995. Predictive model selection. J. Roy. Statist. Soc. B 57.
Liang, F., Wong, W.H., 2000. Evolutionary Monte Carlo: applications to C_p model sampling and change point problem. Statist. Sinica 10.
Liang, F., Truong, Y.K., Wong, W.H., 2001. Automatic Bayesian model averaging for linear regression and applications in Bayesian curve fitting. Statist. Sinica 11.
Mallows, C.L., 1973. Some comments on C_p. Technometrics 15.
Mallows, C.L., 1995. More comments on C_p. Technometrics 37.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21.
Miller, A.J., 1990. Subset Selection in Regression. Chapman and Hall, New York.
Moreno, E., Bertolino, F., Racugno, W., 1998. An intrinsic limiting procedure for model selection and hypotheses testing. J. Amer. Statist. Assoc. 93.
Nishii, R., 1984. Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist. 12.
O'Hagan, A., 1995. Fractional Bayes factor for model comparison (with discussion). J. Roy. Statist. Soc. B 57.
Pauler, D., 1998. The Schwarz criterion and related methods for the normal linear model. Biometrika 85.
Pericchi, L., 1984. An alternative to the standard Bayesian procedure for discrimination between normal linear models. Biometrika 71.
Phillips, D.B., Smith, A.F.M., 1995. Bayesian model comparison via jump diffusions. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (Eds.), Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
Raftery, A.E., 1996. Approximate Bayes factors and accounting for model uncertainty in generalized linear models. Biometrika 83.
Rencher, A.C., 2000. Linear Models in Statistics. Wiley, New York.
San Martini, A., Spezzaferri, F., 1984. A predictive model selection criterion. J. Roy. Statist. Soc. B 46.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6.
Shao, J., 1993. Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88.
Shao, J., 1997. An asymptotic theory for linear model selection. Statist. Sinica 7.

Shibata, R., 1984. Approximate efficiency of a selection procedure for the number of regression variables. Biometrika 71.
Spiegelhalter, D.J., Smith, A.F.M., 1982. Bayes factors for linear and log-linear models with vague prior information. J. Roy. Statist. Soc. B 44.
Stone, M., 1974. Cross-validation choice and assessment of statistical predictions. J. Roy. Statist. Soc. B 36.
Sugiura, N., 1978. Further analysis of the data by Akaike's information criterion and the finite corrections. Commun. Statist. A7.
Tierney, L., 1994. Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22.
Zhang, P., 1993. Model selection via multifold cross-validation. Ann. Statist. 21.
Zheng, X., Loh, W.Y., 1995. Consistent variable selection in linear models. J. Amer. Statist. Assoc. 90.
