Variable Selection for Bayesian Linear Regression Model in a Finite Sample Size


Satoshi KABE¹ and Yuichiro KANAZAWA², Graduate School of Systems and Information Engineering, University of Tsukuba

1 Introduction

Data analysis often involves a comparison of several candidate models. Because the true model is seldom known a priori, there is a need for simple, effective, and objective methods for selecting the best approximating model. Akaike (1973) proposed an information criterion, later known as Akaike's information criterion (AIC), based on the concept of minimizing the Kullback-Leibler Information, a measure of discrepancy between the true density (or model) and an approximating model: the discrepancy between two models, or two probability densities, is expressed by the expected $\log$-likelihood with respect to the true density. AIC is designed to be an approximately unbiased estimator of the expected $\log$-likelihood under the assumption that the true model is one of the candidate models being considered. Hurvich and Tsai (1989) illustrated that AIC can be dramatically biased when the sample size is small by comparing AIC with the finite-sample bias-corrected version of AIC ($AIC_C$) proposed by Sugiura (1978). This criterion adjusts AIC to be an exactly unbiased estimator of the expected $\log$-likelihood. They then extended $AIC_C$ so that it can be employed for autoregressive (AR) and autoregressive moving average (ARMA) models. Later, Bedrick and Tsai (1994) further extended Sugiura (1978) to multivariate regression cases where the response variable is expressed by a matrix.

In fully Bayesian data analysis, the deviance information criterion (DIC) is widely used for model selection. Spiegelhalter et al. (2002) proposed a Bayesian measure of model complexity (i.e., the effective number of parameters $p_D$) obtained from the difference between the posterior mean of the deviance and the deviance at the posterior mean of the parameters. When the number of data is sufficiently large, DIC is given by adding $p_D$ to the posterior mean of the deviance. Spiegelhalter et al. (2002) gave an asymptotic justification of DIC in cases where the number of observations is large relative to the number of parameters. In a discussion of Spiegelhalter et al. (2002), Burnham (2002) questioned: "[A] lesson that we learned was that, if sample size $n$ is small or the number of estimated parameters $p$ is large relative to $n$, a modified AIC should be used, such as $AIC_C = AIC + 2p(p+1)/(n-p-1)$. I wonder whether DIC needs such a modification or if it really automatically adjusts for a small sample size or large $p$, relative to $n$." We write this article to answer this question, at least partially, for the very important case of linear regression. More concretely,

¹E-mail: k @sk.tsukuba.ac.jp. ²E-mail: kanazawa@sk.tsukuba.ac.jp.

we propose a finite-sample bias-corrected information criterion for Bayesian linear regression models with normally distributed errors, and then implement simulation studies in which the sample size is relatively small.

The rest of this article is organized as follows: the next section briefly describes the Bayesian linear regression model. In Section 3, we propose our information criterion for the Bayesian linear regression model. Section 4 shows the results of simulation studies that demonstrate the validity of our proposed information criterion when the sample size is small. Finally, Section 5 draws some conclusions concerning our proposed criterion.

2 Bayesian Linear Regression Model

We consider the linear regression model

$y = X\beta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_N)$ (2.1)

where $y$ is an $N \times 1$ vector, $X$ is an $N \times K$ non-stochastic matrix, the parameter vector $\beta$ is $K \times 1$, and the error term $\epsilon$ follows an $N$-dimensional multivariate normal distribution $\mathcal{N}(0, \sigma^2 I_N)$. We assume that the prior distribution of $\beta$ is a $K$-dimensional multivariate normal distribution and that of $\sigma^{-2}$ is a gamma distribution:

$\beta \mid \sigma^{-2} \sim \mathcal{N}(b_0, \sigma^2 B_0)$ (2.2)

$\sigma^{-2} \sim \mathcal{G}\left(\frac{\nu_0}{2}, \frac{\lambda_0}{2}\right)$. (2.3)

Then the posterior distributions of the parameters $\beta$ and $\sigma^{-2}$ are expressed as

$\beta \mid \sigma^{-2}, y, X \sim \mathcal{N}(b_1, \sigma^2 B_1)$ (2.4)

$\sigma^{-2} \mid y, X \sim \mathcal{G}\left(\frac{\nu_1}{2}, \frac{\lambda_1}{2}\right)$ (2.5)

where $b_1 = B_1(X'y + B_0^{-1}b_0)$, $B_1 = (X'X + B_0^{-1})^{-1}$, $\nu_1 = \nu_0 + N$, $\lambda_1 = \lambda_0 + (y - X\hat{\beta}_N)'(y - X\hat{\beta}_N) + (b_0 - \hat{\beta}_N)'[(X'X)^{-1} + B_0]^{-1}(b_0 - \hat{\beta}_N)$, and $\hat{\beta}_N = (X'X)^{-1}X'y$.
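The conjugate updates in (2.4)-(2.5) are straightforward to compute. The following is a minimal numerical sketch in numpy; the function name and interface are ours, not the paper's, and it assumes the $\nu_1 = \nu_0 + N$ update stated above.

```python
import numpy as np

def posterior_update(y, X, b0, B0, nu0, lam0):
    """Conjugate update for the prior (2.2)-(2.3); returns the
    posterior parameters (b1, B1, nu1, lam1) of (2.4)-(2.5)."""
    N = X.shape[0]
    B0_inv = np.linalg.inv(B0)
    B1 = np.linalg.inv(X.T @ X + B0_inv)          # B1 = (X'X + B0^{-1})^{-1}
    b1 = B1 @ (X.T @ y + B0_inv @ b0)             # b1 = B1 (X'y + B0^{-1} b0)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS estimator beta_hat_N
    resid = y - X @ beta_hat
    d = b0 - beta_hat
    nu1 = nu0 + N
    lam1 = (lam0 + resid @ resid
            + d @ np.linalg.solve(np.linalg.inv(X.T @ X) + B0, d))
    return b1, B1, nu1, lam1
```

Posterior draws then follow by sampling $\sigma^{-2} \sim \mathcal{G}(\nu_1/2, \lambda_1/2)$ and, given each draw, $\beta \sim \mathcal{N}(b_1, \sigma^2 B_1)$.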

3 Finite-Sample Bias Correction and Variable Selection Criterion

Let us denote the unknown true density by $f_Y(\cdot)$ and the approximating model by $g(\cdot \mid \theta)$ with parameter vector $\theta$. Then the Kullback-Leibler Information between $f_Y(\cdot)$ and $g(\cdot \mid \theta)$ can be expressed as

$I(f_Y, g(\cdot \mid \theta)) = \int f_Y(z) \log\left\{\frac{f_Y(z)}{g(z \mid \theta)}\right\} dz$ (3.1)

and (3.1) can be rewritten as

$I(f_Y, g(\cdot \mid \theta)) = E_z[\log\{f_Y(z)\}] - E_z[\log\{g(z \mid \theta)\}]$. (3.2)

Even though the true density $f_Y(\cdot)$ is unknown, the first term on the right-hand side of the Kullback-Leibler Information in (3.2) can be regarded as a constant since the variable $z$ is integrated out. We therefore select the best approximating model by maximizing the expected $\log$-likelihood. From the Bayesian perspective, the parameters follow the posterior distributions estimated from the observed data $y$. Hence we consider the posterior mean of (3.2):

$E_{\theta \mid y}[I(f_Y, g(\cdot \mid \theta))] = E_z[\log\{f_Y(z)\}] - E_{\theta \mid y}[E_z[\log\{g(z \mid \theta)\}]]$ (3.3)

and, as in Spiegelhalter et al. (2002) and Ando (2007), our proposed information criterion is constructed from the posterior mean of the expected $\log$-likelihood.

In the Bayesian linear regression model (2.1), we use $y$ as observed data of sample size $N$ obtained from the unknown true density $f_Y(y)$ to estimate the posterior distributions of the parameters $\beta$ and $\sigma^{-2}$, while we use $z$ as replicate data of sample size $N$ generated from the unknown true density $f_Y(z)$ to evaluate the goodness of fit of the approximating model $g(z \mid X, \beta, \sigma^{-2})$. The posterior mean of the expected $\log$-likelihood $\mathcal{T}$ is then defined as

$\mathcal{T} \equiv E_{\beta, \sigma^{-2} \mid y, X}[E_z[\log\{g(z \mid X, \beta, \sigma^{-2})\}]]$ (3.4)

where the expectation $E_{\beta, \sigma^{-2} \mid y, X}[\cdot]$ can be calculated as $E_{\sigma^{-2} \mid y, X}[E_{\beta \mid \sigma^{-2}, y, X}[\cdot]]$ from the posterior distributions (2.4) and (2.5). To estimate $\mathcal{T}$ in (3.4), we use the posterior mean of the observed $\log$-likelihood $\hat{\mathcal{T}}_N$:

$\hat{\mathcal{T}}_N \equiv E_{\beta, \sigma^{-2} \mid y, X}[\log\{g(y \mid X, \beta, \sigma^{-2})\}]$. (3.5)

The bias-corrected estimator is obtained as $\hat{\mathcal{T}}_N - \hat{b}_N$, where $\hat{b}_N$ is an estimate of the bias $b_\theta \equiv E_y[\hat{\mathcal{T}}_N - \mathcal{T}] \neq 0$. We then propose an information criterion (IC) of the form

$IC \equiv -2\hat{\mathcal{T}}_N + 2\hat{b}_N$. (3.6)

Ignoring the constant term, we can express the $\log$-likelihood function for the replicate data $z$ as

$\log\{g(z \mid X, \beta, \sigma^{-2})\} = \frac{N}{2}\log\sigma^{-2} - \frac{\sigma^{-2}}{2}(z - X\beta)'(z - X\beta)$ (3.7)

where the parameters $\beta$ and $\sigma^{-2}$ follow the posterior distributions estimated from the observed data $y$ and $X$. The posterior mean of the expected $\log$-likelihood $\mathcal{T}$ in (3.4) is then rewritten as

$\mathcal{T} = E_{\beta, \sigma^{-2} \mid y, X}\left[E_z\left[\frac{N}{2}\log\sigma^{-2} - \frac{\sigma^{-2}}{2}(z - X\beta)'(z - X\beta)\right]\right]$
$= E_{\beta, \sigma^{-2} \mid y, X}\left[\frac{N}{2}\log\sigma^{-2} - \frac{\sigma^{-2}}{2}(y - X\beta)'(y - X\beta)\right]$
$\quad - E_{\beta, \sigma^{-2} \mid y, X}\left[E_z\left[\frac{\sigma^{-2}}{2}(z - X\beta)'(z - X\beta)\right]\right]$
$\quad + E_{\beta, \sigma^{-2} \mid y, X}\left[\frac{\sigma^{-2}}{2}(y - X\beta)'(y - X\beta)\right]$

$= \hat{\mathcal{T}}_N - C_1 + C_2$ (3.8)

and the bias with respect to the true density $f_Y(y)$ is obtained as $b_\theta \equiv E_y[\hat{\mathcal{T}}_N - \mathcal{T}] = E_y[C_1 - C_2]$.

First we evaluate $C_1$ in (3.8). The true density $f_Y(z)$ is seldom known in practice, so the expectation with respect to the true density cannot be obtained analytically. In previous studies, Kitagawa (1997) replaced the unknown true density by the prior predictive density to construct the predictive information criterion (PIC) for the Bayesian linear Gaussian model, while Laud and Ibrahim (1995), Gelfand and Ghosh (1998), and Ibrahim et al. (2001) considered using the posterior predictive density to generate the replicate data $z$ for model assessment. In this article, we use the posterior predictive density to evaluate the expectation with respect to the true density $f_Y(z)$ in $C_1$, because the prior predictive density is far more sensitive to the selection of the prior distribution.

To evaluate $C_1$ in (3.8), we replace the true density $f_Y(\cdot)$ with the (conditional) posterior predictive density

$z \mid \sigma^{-2}, y, X \sim \mathcal{N}(Xb_1, \sigma^2 \Sigma_0)$, (3.9)

where $\sigma^{-2}$ follows the posterior distribution (2.5) and $\Sigma_0 = I_N + XB_1X'$. From (3.9), we estimate $C_1$ in (3.8) as

$\hat{C}_1 = E_{\beta, \sigma^{-2} \mid y, X}\left[E_{z \mid \sigma^{-2}, y, X}\left[\frac{\sigma^{-2}}{2}(z - X\beta)'(z - X\beta)\right]\right]$
$= E_{\beta, \sigma^{-2} \mid y, X}\left[E_{z \mid \sigma^{-2}, y, X}\left[\frac{\sigma^{-2}}{2}(z - Xb_1)'(z - Xb_1)\right]\right]$
$\quad + E_{\sigma^{-2} \mid y, X}\left[\mathrm{tr}\left\{\frac{\sigma^{-2}}{2}(X'X) E_{\beta \mid \sigma^{-2}, y, X}[(\beta - b_1)(\beta - b_1)']\right\}\right]$. (3.10)

The first term on the right-hand side of (3.10) can be rewritten as

$E_{\beta, \sigma^{-2} \mid y, X}\left[E_{z \mid \sigma^{-2}, y, X}\left[\frac{\sigma^{-2}}{2}(z - Xb_1)'(z - Xb_1)\right]\right]$
$= E_{\sigma^{-2} \mid y, X}\left[\frac{\sigma^{-2}}{2}\,\mathrm{tr}\{E_{z \mid \sigma^{-2}, y, X}[(z - Xb_1)(z - Xb_1)']\}\right]$
$= E_{\sigma^{-2} \mid y, X}\left[\frac{\sigma^{-2}}{2}\,\mathrm{tr}\{\sigma^2 \Sigma_0\}\right]$
$= \frac{1}{2}\,\mathrm{tr}\{I_N + XB_1X'\}$
$= \frac{N}{2} + \frac{1}{2}\,\mathrm{tr}\{(X'X)B_1\}$. (3.11)

The second term on the right-hand side of (3.10) is similarly obtained from the posterior distribution (2.4):

$E_{\sigma^{-2} \mid y, X}\left[\mathrm{tr}\left\{\frac{\sigma^{-2}}{2}(X'X) E_{\beta \mid \sigma^{-2}, y, X}[(\beta - b_1)(\beta - b_1)']\right\}\right]$
$= E_{\sigma^{-2} \mid y, X}\left[\mathrm{tr}\left\{\frac{\sigma^{-2}}{2}(X'X)\sigma^2 B_1\right\}\right]$
$= \frac{1}{2}\,\mathrm{tr}\{(X'X)B_1\}$. (3.12)
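Replicate data can be drawn directly from (3.9); a small sketch, assuming the hypothetical `posterior_update` helper from Section 2 (all names ours):

```python
import numpy as np

def draw_replicate(X, b1, B1, nu1, lam1, rng=None):
    """One draw of z from the posterior predictive (3.9):
    z | sig2_inv ~ N(X b1, sig2 * (I_N + X B1 X'))."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    sig2_inv = rng.gamma(nu1 / 2.0, 2.0 / lam1)   # sigma^{-2} from (2.5)
    Sigma0 = np.eye(N) + X @ B1 @ X.T
    L = np.linalg.cholesky(Sigma0 / sig2_inv)     # Cholesky of sig2 * Sigma0
    return X @ b1 + L @ rng.standard_normal(N)
```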

From (3.11) and (3.12), $\hat{C}_1$ in (3.10) is evaluated as

$\hat{C}_1 = \frac{N}{2} + \frac{1}{2}\,\mathrm{tr}\{(X'X)B_1\} + \frac{1}{2}\,\mathrm{tr}\{(X'X)B_1\} = \frac{N}{2} + \mathrm{tr}\{(X'X)B_1\}$. (3.13)

We then notice that $\hat{C}_1$ does not depend on the data $y$, i.e., $E_y(\hat{C}_1) = \hat{C}_1$.

Supposing that the interchange of the order of integrations is valid, we can rewrite $E_y(C_2)$ as

$E_y(C_2) = E_y\left[E_{\beta, \sigma^{-2} \mid y, X}\left[\frac{\sigma^{-2}}{2}(y - X\beta)'(y - X\beta)\right]\right] = E_{\beta, \sigma^{-2}}\left[E_{y \mid X, \beta, \sigma^{-2}}\left[\frac{\sigma^{-2}}{2}(y - X\beta)'(y - X\beta)\right]\right]$ (3.14)

where $E_{\beta, \sigma^{-2}}[\cdot]$ is the expectation with respect to the joint prior distribution and $E_{y \mid X, \beta, \sigma^{-2}}[\cdot]$ is the expectation with respect to the $N$-dimensional multivariate normal distribution with mean vector $X\beta$ and variance-covariance matrix $\sigma^2 I_N$. Since $E_{y \mid X, \beta, \sigma^{-2}}[(y - X\beta)(y - X\beta)'] = \sigma^2 I_N$, $E_y(C_2)$ in (3.14) can be evaluated as

$E_y(C_2) = E_{\beta, \sigma^{-2}}\left[E_{y \mid X, \beta, \sigma^{-2}}\left[\mathrm{tr}\left\{\frac{\sigma^{-2}}{2}(y - X\beta)(y - X\beta)'\right\}\right]\right] = \frac{1}{2}\,\mathrm{tr}\{I_N\} = \frac{N}{2}$. (3.15)

Therefore $\hat{b}_N$ is given by

$\hat{b}_N = E_y[\hat{C}_1 - C_2] = \hat{C}_1 - E_y(C_2) = \frac{N}{2} + \mathrm{tr}\{(X'X)B_1\} - \frac{N}{2} = \mathrm{tr}\{(X'X)B_1\}$. (3.16)

The bias $\hat{b}_N$ in (3.16) can be regarded as a ratio of the variance-covariance matrices $\sigma^2(X'X)^{-1}$ and $\sigma^2 B_1$. Multiplying the bias-corrected estimator $\hat{\mathcal{T}}_N - \hat{b}_N$ by $-2$, our proposed information criterion for variable selection in the Bayesian linear regression model ($IC_{BL}$) is obtained as

$IC_{BL} = -2\hat{\mathcal{T}}_N + 2\,\mathrm{tr}\{(X'X)B_1\}$ (3.17)

where $B_1 = (X'X + B_0^{-1})^{-1}$.
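Since $\hat{\mathcal{T}}_N$ in (3.5) has no closed form here, it is naturally estimated by averaging the $\log$-likelihood (3.7) over posterior draws, while the bias term in (3.16) is available in closed form. A minimal sketch of (3.17), reusing the hypothetical `posterior_update` helper above; all names are ours:

```python
import numpy as np

def log_lik(y, X, beta, sig2_inv):
    """Log-likelihood (3.7), additive constant dropped."""
    r = y - X @ beta
    return 0.5 * len(y) * np.log(sig2_inv) - 0.5 * sig2_inv * (r @ r)

def ic_bl(y, X, b1, B1, nu1, lam1, n_draws=50_000, rng=None):
    """IC_BL in (3.17): -2 * T_hat_N + 2 * tr{(X'X) B1}, with T_hat_N
    estimated by Monte Carlo over the posterior (2.4)-(2.5)."""
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(B1)
    T_hat = 0.0
    for _ in range(n_draws):
        sig2_inv = rng.gamma(nu1 / 2.0, 2.0 / lam1)  # rate lam1/2 -> scale 2/lam1
        beta = b1 + L @ rng.standard_normal(len(b1)) / np.sqrt(sig2_inv)
        T_hat += log_lik(y, X, beta, sig2_inv)
    T_hat /= n_draws
    bias = np.trace((X.T @ X) @ B1)                  # b_hat_N in (3.16)
    return -2.0 * T_hat + 2.0 * bias
```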

For simplicity, let us set the prior parameter in (2.2) to $B_0 = \kappa_0 I_K$ ($\kappa_0 > 0$). Then, via the matrix inversion lemma,³ the bias term $\hat{b}_N$ can be rewritten as

$\hat{b}_N = K - \mathrm{tr}\{(X'X B_0 + I_K)^{-1}\}$. (3.18)

If the sample size $N \to \infty$, the last term in (3.18) is expected to vanish because $\mathrm{tr}\{(\kappa_0 X'X/N + I_K/N)^{-1}/N\} \to 0$ (i.e., $\hat{b}_N \to K$) as long as no element of $X'X/N$ diverges. Furthermore, if $\kappa_0$ is sufficiently large (i.e., a non-informative prior), we also have $\mathrm{tr}\{(\kappa_0 X'X + I_K)^{-1}\} \to 0$ in (3.18).

4 Simulation Experiments

4.1 Deviance Information Criterion (DIC)

We conduct two simulation studies to compare the small-sample performance of our proposed information criterion ($IC_{BL}$) in (3.17) with that of the deviance information criterion (DIC):

$DIC = -2\hat{\mathcal{T}}_N + p_D$, (4.1)

where Spiegelhalter et al. (2002) termed $p_D$ the effective number of parameters, defined as $p_D \equiv 2\log\{g(y \mid X, \bar{\beta}, \bar{\sigma}^{-2})\} - 2\hat{\mathcal{T}}_N$, evaluated at the posterior means of the parameters $\bar{\beta}\,(= b_1)$ and $\bar{\sigma}^{-2}\,(= \nu_1/\lambda_1)$ from (2.4) and (2.5).

4.2 Simulation studies

In this article we present two small-sample simulation studies: the first is designed to examine cases where the set of candidate models includes many over-specified models; the second examines performance with respect to under-specified candidate models.

Case 1. As in Hurvich and Tsai (1989), we consider nested candidate models built from seven explanatory variables $x_1, x_2, \ldots, x_7$. In our simulation study, $x_1$ is an $N \times 1$ vector whose elements are ones (i.e., the intercept term) and the other $N \times 1$ vectors $x_i$ ($2 \leq i \leq 7$) are generated from the uniform distribution $\mathcal{U}(-2, 2)$. These variables $x_1, x_2, \ldots, x_7$ are included in the candidate models in a sequentially nested fashion. The candidate models are linear regression models

$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_N)$.

The candidate model with $K = 1$ has only the intercept term, and that with $K = 7$ is the full model. In this simulation study, we determine the number of variables $K$ by using our proposed information criterion ($IC_{BL}$) in (3.17) and DIC in (4.1) in the small-sample cases $N = 25, 50, 100$ with informative ($\kappa_0 = 0.1$) and non-informative ($\kappa_0 = 100$) priors.

³For any matrices $A$ ($m \times m$), $B$ ($m \times n$), $C$ ($n \times m$), and $D$ ($n \times n$), where $A$ and $D$ are nonsingular, $(A - BD^{-1}C)^{-1} = A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1}$.
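For comparison, DIC in (4.1) can be computed from the same posterior draws. A sketch under the same assumptions as above, plugging in $\bar{\beta} = b_1$ and $\bar{\sigma}^{-2} = \nu_1/\lambda_1$; `log_lik` is the helper defined in the $IC_{BL}$ sketch:

```python
import numpy as np

def dic(y, X, b1, B1, nu1, lam1, n_draws=50_000, rng=None):
    """DIC in (4.1): -2 * T_hat_N + p_D, where
    p_D = 2 log g(y | X, beta_bar, sig2inv_bar) - 2 * T_hat_N."""
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(B1)
    T_hat = 0.0
    for _ in range(n_draws):
        sig2_inv = rng.gamma(nu1 / 2.0, 2.0 / lam1)
        beta = b1 + L @ rng.standard_normal(len(b1)) / np.sqrt(sig2_inv)
        T_hat += log_lik(y, X, beta, sig2_inv)
    T_hat /= n_draws
    p_d = 2.0 * log_lik(y, X, b1, nu1 / lam1) - 2.0 * T_hat
    return -2.0 * T_hat + p_d
```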

To examine the small-sample performance in the Bayesian linear regression cases, we generate samples of $y$ from the true model ($K = 3$):

$y = 1.0x_1 + 2.0x_2 + 3.0x_3 + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1.0\,I_N)$. (4.2)

In Table 1, we examine the performance of $IC_{BL}$ and DIC for the small-sample cases ($N = 25, 50, 100$). The parameters of the prior distributions in (2.2) and (2.3) are set to $b_0 = 0$, $B_0 = \kappa_0 I_K$ ($\kappa_0 = 0.1$ or $100$), and $\nu_0 = \lambda_0 = 0.1$. The simulation considers each combination of $N = 25, 50, 100$ and $\kappa_0 = 0.1, 100$. We draw 50,000 MCMC samples from the posterior distributions (2.4) and (2.5) to compute the posterior mean of the observed $\log$-likelihood $\hat{\mathcal{T}}_N$ in (3.5). For each combination of $(N, \kappa_0)$, we generate 100 observations of $IC_{BL}$ and DIC and record the number of times each candidate model is selected (i.e., attains the minimum value of each criterion).

Table 1 shows that our proposed information criterion ($IC_{BL}$) identifies the true model ($K = 3$) in the small-sample cases ($N = 25, 50, 100$) with the informative prior ($\kappa_0 = 0.1$) far better than DIC; DIC, on the other hand, tends to overfit the true model. For the non-informative prior ($\kappa_0 = 100$), both criteria tend to overfit the true model for sample sizes $N = 25$ and $50$, but our proposed criterion nevertheless far outperforms DIC at sample size $N = 100$.

In Tables 2 and 3, we show the averages of the criteria over the 100 observations for the small-sample cases ($N = 25, 50, 100$) with informative ($\kappa_0 = 0.1$) and non-informative ($\kappa_0 = 100$) priors. Both criteria selected the true model ($K = 3$) in all cases, but the difference between $2\hat{b}_N$ and $p_D$ becomes more apparent as the number of explanatory variables increases. Hence the effective number of parameters $p_D$ in DIC tends to underestimate the model complexity compared with the bias term $2\hat{b}_N$ in $IC_{BL}$.

Case 2. Next we consider cases where the explanatory variables in the candidate models are selected from the subsets of $\{x_1, x_2, x_3\}$ (i.e., $\{x_1\}$, $\{x_2\}$, $\{x_3\}$, $\{x_1, x_2\}$, $\{x_1, x_3\}$, $\{x_2, x_3\}$, and $\{x_1, x_2, x_3\}$). We then specify the true model as

$y = 1.0x_1 + 2.0x_2 + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1.0\,I_N)$. (4.3)

We implement a simulation study to examine the performance of $IC_{BL}$ in (3.17) and DIC in (4.1) for under-specified candidate models in small-sample cases. The parameters of the prior distributions in (2.2) and (2.3) are set to $b_0 = 0$, $B_0 = \kappa_0 I_K$ ($\kappa_0 = 0.1$ or $100$), and $\nu_0 = \lambda_0 = 0.1$. We draw 50,000 MCMC samples from the posterior distributions (2.4) and (2.5) and compute the posterior mean of the observed $\log$-likelihood $\hat{\mathcal{T}}_N$ in (3.5). For each combination of $(N, \kappa_0)$, we generate 100 observations of $IC_{BL}$ and DIC and record the number of selected models as in Case 1; the replication loop is sketched below.

Table 4 shows that DIC tends to select the full model (i.e., $\{x_1, x_2, x_3\}$) compared with $IC_{BL}$ in small-sample cases. On the other hand, the under-specified candidate models (i.e., $\{x_1\}$, $\{x_2\}$, $\{x_3\}$, $\{x_1, x_3\}$, $\{x_2, x_3\}$) are not selected at all. Hence, DIC shows a tendency to overfit the true model in this simulation study.
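The replication design just described can be sketched as a short driver for Case 2. This is our own illustrative reconstruction, assuming the hypothetical helpers above, that $x_2$ and $x_3$ are drawn from $\mathcal{U}(-2, 2)$ as in Case 1, and a reduced number of posterior draws for speed (the paper uses 50,000):

```python
import numpy as np
from itertools import combinations

def run_case2(N=25, kappa0=0.1, n_reps=100, rng=None):
    """Tally how often each candidate subset of {x1, x2, x3}
    minimizes IC_BL when the data come from the true model (4.3)."""
    rng = np.random.default_rng(rng)
    subsets = [s for r in (1, 2, 3) for s in combinations(range(3), r)]
    counts = dict.fromkeys(subsets, 0)
    for _ in range(n_reps):
        # x1 = intercept; x2, x3 ~ U(-2, 2); y from (4.3)
        X_full = np.column_stack([np.ones(N), rng.uniform(-2, 2, (N, 2))])
        y = 1.0 * X_full[:, 0] + 2.0 * X_full[:, 1] + rng.standard_normal(N)
        scores = {}
        for s in subsets:
            X = X_full[:, list(s)]
            K = X.shape[1]
            b1, B1, nu1, lam1 = posterior_update(
                y, X, np.zeros(K), kappa0 * np.eye(K), 0.1, 0.1)
            scores[s] = ic_bl(y, X, b1, B1, nu1, lam1, n_draws=5_000)
        counts[min(scores, key=scores.get)] += 1
    return counts
```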
In Tables 5 and 6, the average values of the model complexity terms in $IC_{BL}$ and DIC for the candidate models $\{x_1, x_2\}$, $\{x_1, x_3\}$, and $\{x_2, x_3\}$ are close to each other. However,

the true model (i.e., $\{x_1, x_2\}$) has the largest posterior mean of the observed $\log$-likelihood $\hat{\mathcal{T}}_N$ among the three candidate models. Therefore, in terms of average values, both $IC_{BL}$ and DIC successfully select the true model (4.3) in all cases.

5 Conclusion and Discussion

In Bayesian data analysis, DIC (Spiegelhalter et al., 2002) is widely used for model selection, since this criterion is relatively easy to calculate and applicable to a wide range of statistical models. Spiegelhalter et al. (2002) gave an asymptotic justification of DIC in cases where the number of observations is large relative to the number of parameters. However, Burnham (2002) questioned whether DIC needs a modification for small sample sizes. In this article, we proposed a variable selection criterion $IC_{BL}$ for Bayesian linear regression models in small-sample cases, much as Sugiura (1978) proposed $AIC_C$ in the frequentist framework. We then examined the performance of our proposed information criterion ($IC_{BL}$) relative to DIC in small-sample cases.

In our simulation studies, we found that DIC often showed a tendency to overfit the true model (see Tables 1 and 4), whereas our proposed information criterion ($IC_{BL}$) performs well in small-sample cases ($N = 25, 50, 100$). We also found that the measure of model complexity $2\hat{b}_N$ was mostly larger than $p_D$ (see Tables 2 and 3), leading us to conclude that the bias correction in DIC underestimates the model complexity in small-sample cases.

An important direction for further research would be to extend our information criterion to several other types of Bayesian regression models (e.g., hierarchical Bayesian regression models, Bayesian regression models with serially correlated errors, and Bayesian Markov-switching models), because empirical analysis is generally performed under the limitation of data availability.

[Tables 1-6: selection frequencies and average values of $IC_{BL}$ (with $2\hat{b}_N$) and DIC (with $p_D$) for each combination of $N \in \{25, 50, 100\}$ and $\kappa_0 \in \{0.1, 100\}$; the table bodies are not recoverable from the scanned source.]

References

Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, Akademiai Kiado, Budapest.

Ando, T. (2007). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika 94(2), 443-458.

Bedrick, E. J., and C.-L. Tsai (1994). Model selection for multivariate regression in small samples. Biometrics.

Burnham, K. P. (2002). Discussion of a paper by D. J. Spiegelhalter et al. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 64(4), 629.

Gelfand, A. E., and S. K. Ghosh (1998). Model choice: a minimum posterior predictive loss approach. Biometrika 85(1), 1-11.

Hurvich, C. M., and C.-L. Tsai (1989). Regression and time series model selection in small samples. Biometrika 76(2), 297-307.

Ibrahim, J. G., M.-H. Chen, and D. Sinha (2001). Criterion-based methods for Bayesian model assessment. Statistica Sinica 11(2).

Kitagawa, G. (1997). Information criteria for the predictive evaluation of Bayesian models. Communications in Statistics - Theory and Methods 26(9).

Laud, P. W., and J. G. Ibrahim (1995). Predictive model selection. Journal of the Royal Statistical Society, Series B (Methodological).

Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 64(4), 583-639.

Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics - Theory and Methods 7(1), 13-26.
