arxiv: v2 [stat.me] 13 Jun 2012


A Bayes factor with reasonable model selection consistency for ANOVA model

Yuzo Maruyama
The University of Tokyo

Abstract: For the ANOVA model, we propose a new g-prior based Bayes factor without integral representation, with reasonable model selection consistency for any asymptotic situation (the number of levels of the factor and/or the number of replications in each level goes to infinity). Exact analytic calculation of the marginal density under a special choice of the priors enables such a Bayes factor.

AMS 2000 subject classifications: Primary 62F15, 62F07; secondary 62A10.
Keywords and phrases: Bayesian model selection, model selection consistency, fully Bayes method, Bayes factor.

1. Introduction

In this paper, we consider Bayesian model selection for the ANOVA model. We start with one-way ANOVA with two possible models. In one model, all random variables have the same mean. In the other model, there are several levels, and the random variables in each level have different means. Formally, the independent observations $y_{ij}$ ($i = 1,\dots,p$, $j = 1,\dots,r_i$, $n = \sum_{i=1}^p r_i$) are assumed to arise from the linear model

$$y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0,\sigma^2), \tag{1.1}$$

where $\mu$, $\alpha_i$ ($i = 1,\dots,p$) and $\sigma^2$ are unknown. The two models described above are written as

$$M_1: \alpha_i = 0 \ \text{for all } 1 \le i \le p, \quad \text{vs} \quad M_{A+1}: \alpha_i \ne 0 \ \text{for some } i. \tag{1.2}$$

Model selection, which refers to using the data in order to decide on the plausibility of two or more competing models, is a common problem in modern statistical science. A natural Bayesian approach is via the Bayes factor (the ratio of the marginal densities of two models), which is a function of the posterior model probabilities (Kass and Raftery (1995)). From a theoretical viewpoint, one of the most important topics in Bayesian model selection is consistency. Consistency means that the true model will be chosen if enough data are observed, assuming that one of the competing models is true.

It is well known that BIC by Schwarz (1978) is consistent when the number of parameters is fixed. In our situation, when $p$ is fixed and

$$\bar r = \frac{n}{p} = \frac{\sum_{i=1}^p r_i}{p} \to \infty, \tag{1.3}$$

the BIC is consistent. As a high-dimensional problem of statistical inference, the consistency in the case where $p \to \infty$ in the one-way ANOVA setup has been addressed by Stone (1979) and Berger, Ghosh and Mukhopadhyay (2003). In the following, let CASE I and CASE II denote the cases where

I. $n \to \infty$, $p \to \infty$ and $\bar r$ is bounded,
II. $n \to \infty$, $p \to \infty$ and $\bar r \to \infty$,

respectively. When $\sigma^2$ is known and CASE I is assumed, Stone (1979) showed that BIC chooses the null model $M_1$ with probability 1 (that is, BIC is not consistent under $M_{A+1}$). This is reasonable because BIC is originally derived by the Laplace approximation under a fixed number of parameters. In the same situation as Stone (1979), Berger, Ghosh and Mukhopadhyay (2003) proposed a Bayesian criterion called GBIC, which is derived by the Laplace approximation under CASE I. They then showed that GBIC has model selection consistency under CASE I.

Recently Moreno, Girón and Casella (2010) showed that under CASE II with $p = O(n^b)$ for $0 < b < 1$, or equivalently $p = O(\bar r^{\,b/(1-b)})$, BIC is consistent. By considering $b$ very close to 1, BIC seems to have consistency under CASE II even if $\bar r$ approaches infinity much more slowly than $p$. In Theorem 3.1, however, we show that this is not so: the consistency of BIC under CASE II depends on how quickly $\bar r$ goes to infinity compared to $p$. As far as we know, this inconsistency of BIC has not yet been reported in the literature.

As an alternative to BIC, following Zellner (1986), Liang et al. (2008) and Maruyama and George (2011), we propose a new g-prior based Bayes factor with consistency under CASE II, which is given by

$$\mathrm{BF}^F_{(A+1):1} = \frac{\Gamma(p/2)\,\Gamma(\{n-p\}/2)}{\Gamma(1/2)\,\Gamma(\{n-1\}/2)} \left(1 + \frac{W_H}{W_E}\right)^{(n-p-1)/2}, \tag{1.4}$$

where

$$W_H = \sum_i r_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2, \quad W_E = \sum_{ij}(y_{ij} - \bar y_{i\cdot})^2, \quad \bar y_{i\cdot} = \frac{\sum_j y_{ij}}{r_i}, \quad \bar y_{\cdot\cdot} = \frac{\sum_{ij} y_{ij}}{\sum_i r_i}.$$

As shown in Theorem 3.1, $\mathrm{BF}^F_{(A+1):1}$ is consistent under CASE II even if $\bar r$ approaches infinity much more slowly than $p$.
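As an illustration of how (1.4) can be evaluated in practice, the following is a minimal sketch, not the author's implementation. It assumes the data arrive as a list of groups (one list of observations per level), that $n > p$ and $W_E > 0$, and it works on the log scale to avoid overflow in the gamma functions. The function names `sums_of_squares` and `log_bf_fully_bayes` are hypothetical.

```python
import math


def sums_of_squares(groups):
    """Return (W_H, W_E, n, p) for a one-way layout given as a list of groups."""
    n = sum(len(g) for g in groups)
    p = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    w_h = 0.0  # between-group sum of squares
    w_e = 0.0  # within-group sum of squares
    for g in groups:
        group_mean = sum(g) / len(g)
        w_h += len(g) * (group_mean - grand_mean) ** 2
        w_e += sum((y - group_mean) ** 2 for y in g)
    return w_h, w_e, n, p


def log_bf_fully_bayes(groups):
    """Log of the Bayes factor (1.4) of M_{A+1} against M_1 (assumes n > p, W_E > 0)."""
    w_h, w_e, n, p = sums_of_squares(groups)
    log_const = (math.lgamma(p / 2) + math.lgamma((n - p) / 2)
                 - math.lgamma(0.5) - math.lgamma((n - 1) / 2))
    return log_const + 0.5 * (n - p - 1) * math.log1p(w_h / w_e)


# Toy data (made up for illustration): two levels, three replications each.
toy = [[1, 2, 3], [2, 3, 4]]
print(math.exp(log_bf_fully_bayes(toy)))  # about 0.684, so M_1 is preferred here
```

Because (1.4) depends on the data only through $W_H/W_E$, the same function can be driven by the two sums of squares alone when the raw observations are not available.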
Under CASE I, where BIC is not consistent, the consistency of $\mathrm{BF}^F_{(A+1):1}$ is shown to depend on a kind of distance between $M_{A+1}$ and $M_1$, namely

$$c_A = \frac{\sum_{ij}\left(E[y_{ij}] - E[\bar y_{\cdot\cdot}]\right)^2}{n\sigma^2}. \tag{1.5}$$

Naturally $c_A = 0$ under $M_1$ and $c_A > 0$ under $M_{A+1}$. In Theorem 3.1, under CASE I with a small positive $c_A$, we show that $\mathrm{BF}^F_{(A+1):1}$ chooses $M_1$ with

probability 1. Such an inconsistency result has already been reported by Moreno, Girón and Casella (2010). In Remark 3.1, we demonstrate that the existence of an inconsistency region for small $c_A$ and large $p$ is reasonable from the prediction point of view.

The rest of the paper is organized as follows. In Section 2, we re-parameterize the ANOVA model given by (1.1) as a linear regression model and give priors for the regression model. Then we derive the marginal densities under $M_1$ and $M_{A+1}$ and eventually the Bayes factor given by (1.4), as the ratio of the marginal densities. In Section 3, we show that the Bayes factor has a reasonable model selection consistency for any asymptotic situation. In Section 4, the results derived in Sections 2 and 3 are extended to the two-way ANOVA model. The Appendix presents some of the more technical proofs.

2. A Bayes factor for one-way ANOVA

2.1. Re-parameterized ANOVA

Let $\alpha = (\alpha_1,\dots,\alpha_p)'$,

$$y = (y_{11},\dots,y_{1r_1},y_{21},\dots,y_{2r_2},\dots,y_{p1},\dots,y_{pr_p})', \quad \epsilon = (\epsilon_{11},\dots,\epsilon_{1r_1},\epsilon_{21},\dots,\epsilon_{2r_2},\dots,\epsilon_{p1},\dots,\epsilon_{pr_p})'.$$

Further let an $n \times p$ matrix $X_A = (x_{ij})$ satisfy

$$x_{ij} = \begin{cases} 1 & \text{if } \sum_{k=1}^{j-1} r_k + 1 \le i \le \sum_{k=1}^{j} r_k, \\ 0 & \text{otherwise.} \end{cases} \tag{2.1}$$

Then the linear model given by (1.1) is written as

$$y = \mu 1_n + X_A\alpha + \epsilon = \theta_1 1_n + \tilde X_A\alpha + \epsilon,$$

where $\theta_1 = \mu + r'\alpha/n$ with $r = (r_1,\dots,r_p)'$, and $\tilde X_A = X_A - n^{-1}1_n r'$ is the centered matrix of $X_A$. The matrix $\tilde X_A'$ is written as

$$\tilde X_A' = (\overbrace{\nu_1,\dots,\nu_1}^{r_1},\ \overbrace{\nu_2,\dots,\nu_2}^{r_2},\ \dots,\ \overbrace{\nu_p,\dots,\nu_p}^{r_p}), \tag{2.2}$$

where $\nu_i = e_i - n^{-1}r$ and $e_i = (0,\dots,0,1,0,\dots,0)'$ is the $i$-th unit vector. The $n \times p$ matrix $\tilde X_A$ is not of full rank: $\operatorname{rank}\tilde X_A = p-1$, since $\{\nu_1,\dots,\nu_{p-1}\}$ is linearly independent and $\sum_{i=1}^p r_i\nu_i = 0$. For identifiability of $\alpha$, the linear restriction

$$\nu_0'\alpha = 0 \tag{2.3}$$

is assumed, where $\nu_0$ is such that $\{\nu_0,\nu_1,\dots,\nu_{p-1}\}$ is linearly independent. Typically $1_p$ or $r$ is chosen as $\nu_0$. We will see that the choice of linear restriction does not affect our result. Let the singular value decomposition of $\tilde X_A$ with rank $p-1$ be

$$\tilde X_A = U_A D_A W_A',$$

where $U_A$ and $W_A$ are $n \times (p-1)$ and $p \times (p-1)$ matrices with orthonormal columns, respectively. Then the one-way ANOVA is re-parameterized as the linear regression model

$$y = \theta_1 1_n + U_A\theta_A + \epsilon, \tag{2.4}$$

where

$$\theta_A = D_A W_A'\alpha \in \mathbb{R}^{p-1}. \tag{2.5}$$

Under the restriction $\nu_0'\alpha = 0$, $\theta_A = 0_{p-1}$ is equivalent to $\alpha = 0_p$ because the $p \times p$ matrix $(W_A D_A, \nu_0)$ is non-singular and

$$\begin{pmatrix} D_A W_A' \\ \nu_0' \end{pmatrix}\alpha = \begin{pmatrix} \theta_A \\ 0 \end{pmatrix}. \tag{2.6}$$

Hence, under (2.4), the two models described in (1.2) are written as

$$M_1: \theta_A = 0_{p-1} \quad \text{and} \quad M_{A+1}: \theta_A \ne 0_{p-1}.$$

2.2. Priors and the Bayes factor

In Bayesian model selection, priors must be specified on the models and on the parameters in each model. For the former, let $\Pr(M_1) = \Pr(M_{A+1}) = 1/2$ as usual. For the latter, for the moment, we write the joint prior densities as $p(\theta_1,\sigma^2)$ for $M_1$ and $p(\theta_1,\theta_A,\sigma^2)$ for $M_{A+1}$. From Bayes' theorem, $M_{A+1}$ is chosen when $\Pr(M_{A+1} \mid y) > 1/2$, where

$$\Pr(M_{A+1} \mid y) = \frac{\mathrm{BF}_{(A+1):1}}{1 + \mathrm{BF}_{(A+1):1}}$$

and $\mathrm{BF}_{(A+1):1}$ is the Bayes factor given by

$$\mathrm{BF}_{(A+1):1} = m_{A+1}(y)/m_1(y). \tag{2.7}$$

In other words, $M_{A+1}$ is chosen if and only if $\mathrm{BF}_{(A+1):1} > 1$. In (2.7), $m_\gamma(y)$ is the marginal density under $M_\gamma$ for $\gamma = 1, A+1$:

$$m_1(y) = \iint p(y \mid \theta_1,\sigma^2)\,p(\theta_1,\sigma^2)\,d\theta_1\,d\sigma^2,$$
$$m_{A+1}(y) = \iiint p(y \mid \theta_1,\theta_A,\sigma^2)\,p(\theta_1,\theta_A,\sigma^2)\,d\theta_1\,d\theta_A\,d\sigma^2,$$

where $p(y \mid \theta_1,\sigma^2)$ and $p(y \mid \theta_1,\theta_A,\sigma^2)$ are the sampling densities of $y$ under $M_1$ and $M_{A+1}$, respectively.
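The relation between the Bayes factor and the posterior model probability above can be sketched as follows. This is only an illustrative helper under stated assumptions: the name `posterior_prob_alt` is hypothetical, the input is the log Bayes factor (so that very large or very small Bayes factors do not overflow), and the `prior_prob_alt` argument generalizes the equal prior probabilities $1/2$ used in the text, which remain the default.

```python
import math


def posterior_prob_alt(log_bf, prior_prob_alt=0.5):
    """Pr(M_{A+1} | y) from log BF_{(A+1):1}; reduces to BF / (1 + BF) for prior 1/2."""
    prior_log_odds = math.log(prior_prob_alt) - math.log1p(-prior_prob_alt)
    log_odds = log_bf + prior_log_odds
    # Numerically stable logistic function of the posterior log odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


print(posterior_prob_alt(math.log(3.0)))  # BF = 3 with equal priors gives 0.75 (up to rounding)
```

With equal prior probabilities, the decision rule $\Pr(M_{A+1} \mid y) > 1/2$ is the same as `log_bf > 0`.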

In this paper, we use the following joint prior densities:

$$p(\theta_1,\sigma^2) = p(\theta_1)\,p(\sigma^2) = \frac{1}{\sigma^2} \tag{2.8}$$

for $M_1$ and

$$p(\theta_1,\theta_A,\sigma^2) = p(\theta_1)\,p(\sigma^2)\,p(\theta_A \mid \sigma^2) = \frac{1}{\sigma^2}\int_0^\infty p(\theta_A \mid g,\sigma^2)\,p(g)\,dg \tag{2.9}$$

for $M_{A+1}$. Note that $p(\theta_1)p(\sigma^2) = \sigma^{-2}$ in both (2.8) and (2.9) is a popular non-informative prior. Since $\theta_1$ and $\sigma^2$ are location-scale parameters, the improper prior for them is the right-Haar prior from the location-scale invariance group. These are the reasons why the use of these improper priors for $\theta_1$ and $\sigma^2$ is formally justified. See Berger, Pericchi and Varshavsky (1998) and Berger, Bernardo and Sun (2009) for details. As $p(\theta_A \mid g,\sigma^2)$, we use the so-called g-prior of Zellner (1986),

$$p(\theta_A \mid \sigma^2, g) = N_{p-1}\left(0,\ g\sigma^2(U_A'U_A)^{-1}\right) = N_{p-1}\left(0,\ g\sigma^2 I_{p-1}\right). \tag{2.10}$$

As the prior of $g$, following Maruyama and George (2011), we use the Pearson Type VI distribution with density

$$p(g) = \frac{g^b(1+g)^{-a-b-2}}{B(a+1,b+1)} = \frac{g^{(n-p)/2-a-2}(1+g)^{-(n-p)/2}}{B(a+1,\ (n-p)/2-a-1)}, \tag{2.11}$$

where

$$-1 < a < (n-p)/2 - 1 \quad \text{and} \quad b = (n-p)/2 - a - 2. \tag{2.12}$$

Remark 2.1. For the choice of $a$, we recommend $a = -1/2$, for the following reason. For sufficiently large $g$, $p(g)$ given by (2.11) is proportional to $g^{-a-2}$. From the Tauberian theorem, which is well known for describing the asymptotic behavior of the Laplace transform, we have

$$p(\theta_A \mid \sigma^2) = \int_0^\infty p(\theta_A \mid \sigma^2, g)\,p(g)\,dg \approx (\sigma^2)^{a+1}\,\|\theta_A\|^{-(p+2a+1)} \tag{2.13}$$

for sufficiently large $\|\theta_A\|$ with $\theta_A \in \mathbb{R}^{p-1}$, $a > -1$ and $b > -1$. Hence the asymptotic tail behavior of $p(\theta_A \mid \sigma^2)$ for $a = -1/2$ is that of the multivariate Cauchy, $\|\theta_A\|^{-(p-1)-1}$, which has been recommended by Zellner and Siow (1980) and others in the objective Bayes context.

Before we proceed to give the Bayes factor with respect to the priors described above, we review statistical inference in ANOVA from the frequentist point of view. The key decomposition is given by

$$\begin{aligned} W_T = \|y - \bar y_{\cdot\cdot}1_n\|^2 &= \sum_{ij}(y_{ij} - \bar y_{\cdot\cdot})^2 \\ &= \|U_AU_A'\{y - \bar y_{\cdot\cdot}1_n\}\|^2 + \|(I - U_AU_A')\{y - \bar y_{\cdot\cdot}1_n\}\|^2 \\ &= \sum_i r_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 + \sum_{ij}(y_{ij} - \bar y_{i\cdot})^2 = W_H + W_E, \end{aligned} \tag{2.14}$$

where $W_H$ and $W_E$ are independent and

$$\frac{W_H}{\sigma^2} \sim \chi^2_{p-1}\left[\|\theta_A\|^2/\sigma^2\right], \quad \frac{W_E}{\sigma^2} \sim \chi^2_{n-p}.$$

In (2.14), $W_T$, $W_H$ and $W_E$ are called the total sum of squares, the between-group sum of squares and the within-group sum of squares, respectively. The hypothesis $H_0: \alpha = 0_p$ (or $\theta_A = 0_{p-1}$) is rejected if $W_H/W_E$ is relatively large. In the theorem below, the Bayes factor under our priors is an increasing function of $W_H/W_E$, which is reasonable from the frequentist point of view, too.

Theorem 2.1. Under the priors given by (2.8) under $M_1$ and by (2.9) with (2.10) and (2.11) under $M_{A+1}$, the Bayes factor is given by

$$\frac{m_{A+1}(y)}{m_1(y)} = \frac{\Gamma(p/2+a+1/2)\,\Gamma((n-p)/2)}{\Gamma(a+1)\,\Gamma(\{n-1\}/2)}\left(1 + \frac{W_H}{W_E}\right)^{(n-p-2)/2-a}.$$

Proof. See Appendix.

By Remark 2.1 and Theorem 2.1, the Bayes factor which we recommend is written as

$$\mathrm{BF}^F_{(A+1):1} = \frac{m_{A+1}(y)}{m_1(y)} = \frac{\Gamma(p/2)\,\Gamma(\{n-p\}/2)}{\Gamma(1/2)\,\Gamma(\{n-1\}/2)}\left(1 + \frac{W_H}{W_E}\right)^{(n-p-1)/2}, \tag{2.15}$$

where the superscript $F$ means Fully Bayes.

3. Model selection consistency

In this section, we consider the model selection consistency as $n = \sum_i r_i \to \infty$. Generally, the posterior consistency for model choice is defined as

$$\operatorname*{plim}_{n\to\infty}\Pr(M_\gamma \mid y) = 1 \quad \text{or} \quad \operatorname*{plim}_{n\to\infty}\mathrm{BF}_{\gamma':\gamma} = 0 \tag{3.1}$$

when $M_\gamma$ is the true model and $M_{\gamma'}$ is not. Here plim denotes convergence in probability, and the probability distribution in (3.1) is the sampling distribution under the true model $M_\gamma$. We will show that $\mathrm{BF}^F_{(A+1):1}$ given by (2.15) has a reasonable model selection consistency. As a natural competitor of $\mathrm{BF}^F_{(A+1):1}$, the BIC based Bayes factor

$$\mathrm{BF}^{BIC}_{(A+1):1} = \left(1 + \frac{W_H}{W_E}\right)^{n/2} n^{-(p-1)/2}$$

is considered. As remarked in Section 1, the consistency depends on $c_A$ given by (1.5), or equivalently on $\|\theta_A\|^2/(n\sigma^2)$. Let

$$\bar c_A = \lim_{n\to\infty} c_A = \lim_{n\to\infty}\frac{\|\theta_A\|^2}{n\sigma^2}. \tag{3.2}$$

Then we have the following result.

Theorem 3.1.

I. Assume $\bar r \to \infty$ and $p$ is fixed.
  (a) $\mathrm{BF}^F_{(A+1):1}$ and $\mathrm{BF}^{BIC}_{(A+1):1}$ are consistent whichever the true model is.
II. Assume $p \to \infty$.
  (a) $\mathrm{BF}^F_{(A+1):1}$ and $\mathrm{BF}^{BIC}_{(A+1):1}$ are consistent under $M_1$.
  (b) Assume $\bar r$ is fixed under $M_{A+1}$.
    i. $\mathrm{BF}^F_{(A+1):1}$ is consistent (inconsistent) if $\bar c_A > (<)\ h(\bar r)$, where
       $$h(r) = r^{1/(r-1)} - 1. \tag{3.3}$$
    ii. $\mathrm{BF}^{BIC}_{(A+1):1}$ is inconsistent.
  (c) Assume $\bar r \to \infty$ under $M_{A+1}$.
    i. $\mathrm{BF}^F_{(A+1):1}$ is consistent.
    ii. $\mathrm{BF}^{BIC}_{(A+1):1}$ is consistent (inconsistent) if $p \to \infty$ slower (faster) than $e\{1+\bar c_A\}^{\bar r}/\bar r$.

Note: $h(r)$ is a convex decreasing function of $r$ which satisfies $h(2) = 1$, $h(5) \doteq 0.5$, $h(10) \doteq 0.29$, and $h(\infty) = 0$.

Proof. See Appendix.

Note: Maruyama (2009), the first version of this paper, treated only the balanced ANOVA (that is, $r_i \equiv r$ is assumed). Wang and Sun (2012) extended the consistency results of Maruyama (2009) to the unbalanced case, although the proof itself essentially follows from Maruyama (2009).

Remark 3.1. We give some remarks on the inconsistency shown in Theorem 3.1.

I. As shown in II(b)ii of Theorem 3.1, BIC always chooses $M_1$ when $p \to \infty$ and $\bar r$ is fixed, even if $\bar c_A$ is very large. This is interpreted as the unknown-variance version of the result of Stone (1979).

II. As seen in II(b)i of Theorem 3.1, $\mathrm{BF}^F_{(A+1):1}$ has an inconsistency region. The existence of such an inconsistency region has also been reported by Moreno, Girón and Casella (2010). They proposed an intrinsic Bayes factor for the normal regression model, and their upper bound of the inconsistency region for the one-way balanced ANOVA is given by

$$\frac{r-1}{(r+1)^{(r-1)/r} - 1} - 1,$$

which is slightly smaller than $h(r)$ given by (3.3).

III. The existence of an inconsistency region for small $c_A$ and large $p$ is quite reasonable for the following reason. Assume new independent observations $z_{ij}$ ($i = 1,\dots,p$, $j = 1,\dots,r_i$) from the same model as $y_{ij}$. Then the difference of the scaled mean squared prediction errors of $\bar y_{i\cdot}$ and $\bar y_{\cdot\cdot}$ is given

by

$$\Delta[\bar y_{\cdot\cdot};\bar y_{i\cdot}] = \frac{E_{y,z}\left[\sum_{i,j}(z_{ij} - \bar y_{\cdot\cdot})^2\right]}{n\sigma^2} - \frac{E_{y,z}\left[\sum_{i,j}(z_{ij} - \bar y_{i\cdot})^2\right]}{n\sigma^2} = c_A - \frac{p-1}{n}$$

for any $p$ and $r$. First assume that $M_1$ is true. We see that $\Delta[\bar y_{\cdot\cdot};\bar y_{i\cdot}] = -(p-1)/n < 0$, which is reasonable. Then assume that $M_{A+1}$ is true. When $\bar r \to \infty$ and $p$ is fixed,

$$\lim_{\bar r\to\infty}\Delta[\bar y_{\cdot\cdot};\bar y_{i\cdot}] = \bar c_A > 0,$$

which is reasonable. On the other hand, if $p \to \infty$ and $\bar r$ is fixed,

$$\lim_{p\to\infty}\Delta[\bar y_{\cdot\cdot};\bar y_{i\cdot}] = \bar c_A - 1/\bar r.$$

Hence when $\bar c_A < 1/\bar r$, $\Delta[\bar y_{\cdot\cdot};\bar y_{i\cdot}]$ is negative even if $M_{A+1}$ is true. Therefore, from the prediction point of view, the existence of an inconsistency region for small $c_A$ and large $p$ is reasonable.

Table 1 shows the frequency of choice of the true model in some cases in a numerical experiment where the balanced case $r_1 = \dots = r_p$ is considered. We see that it clearly supports the validity of Theorem 3.1.

4. Two-way ANOVA

4.1. Re-parameterized two-way ANOVA

In this section, we extend the results in Sections 2 and 3 to the two-way ANOVA model. We have $n$ independent normal random variables $y_{ijk}$ ($i = 1,\dots,p$, $j = 1,\dots,q$, $k = 1,\dots,r_{ij}$, $n = \sum_{i,j} r_{ij}$) where

$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}, \quad \epsilon_{ijk} \sim N(0,\sigma^2). \tag{4.1}$$

In the following, for simplicity as in Christensen (2011), we assume that $r_{ij}$ is given by

$$r_{ij} = \frac{n}{pq}\,\nu_i\xi_j, \tag{4.2}$$

where $\nu_i$ for $i = 1,\dots,p$ and $\xi_j$ for $j = 1,\dots,q$ satisfy

$$\sum_{i=1}^p \nu_i = p, \quad \sum_{j=1}^q \xi_j = q$$

[Table 1. Frequency of choice of the true model. Rows: $p$; columns: $r$; panels: under $M_1$, and under $M_{A+1}$ with $c = 0.1, 0.5, 1, 2, 5$; methods compared: FB and BIC. Table entries not recoverable.]

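and that $\{n/(pq)\}\nu_i\xi_j$ is a positive integer for any $(i,j)$. Therefore we allow some kinds of unbalanced design as well as the balanced design, $r_{ij} \equiv n/(pq)$ for any $(i,j)$. Let

$$y = (\{y_{111},\dots,y_{11r_{11}}\},\{y_{121},\dots,y_{12r_{12}}\},\dots,\{y_{pq1},\dots,y_{pqr_{pq}}\})',$$
$$\epsilon = (\{\epsilon_{111},\dots,\epsilon_{11r_{11}}\},\{\epsilon_{121},\dots,\epsilon_{12r_{12}}\},\dots,\{\epsilon_{pq1},\dots,\epsilon_{pqr_{pq}}\})',$$

$\alpha = (\alpha_1,\dots,\alpha_p)'$, $\beta = (\beta_1,\dots,\beta_q)'$ and

$$\gamma = (\{\gamma_{11},\dots,\gamma_{1q}\},\{\gamma_{21},\dots,\gamma_{2q}\},\dots,\{\gamma_{p1},\dots,\gamma_{pq}\})'.$$

The $n \times (p+q+pq)$ design matrix is given by $(X_A, X_B, X_{AB})$, where the $n \times p$ matrix $X_A = (x^A_{ti})$, the $n \times q$ matrix $X_B = (X_{B1}',\dots,X_{Bp}')'$ with a $(\sum_{j=1}^q r_{ij}) \times q$ matrix $X_{Bi} = (x^{Bi}_{tj})$, and the $n \times pq$ matrix $X_{AB} = (x^{AB}_{tm})$ satisfy

$$x^A_{ti} = \begin{cases} 1 & \text{if } \sum_{m=1}^{i-1}\sum_{l=1}^q r_{ml} + 1 \le t \le \sum_{m=1}^{i}\sum_{l=1}^q r_{ml}, \\ 0 & \text{otherwise,} \end{cases}$$

$$x^{Bi}_{tj} = \begin{cases} 1 & \text{if } \sum_{m=1}^{j-1} r_{im} + 1 \le t \le \sum_{m=1}^{j} r_{im}, \\ 0 & \text{otherwise,} \end{cases}$$

and

$$x^{AB}_{tm} = \begin{cases} 1 & \text{if } \sum_{s<m} r_{i(s)j(s)} + 1 \le t \le \sum_{s\le m} r_{i(s)j(s)}, \\ 0 & \text{otherwise,} \end{cases}$$

where $i(s)$ and $j(s)$ are the unique integer solution of $s = (i-1)q + j$ with $i \in \{1,\dots,p\}$ and $j \in \{1,\dots,q\}$. Then the linear model of (4.1) is written as

$$y = \mu 1_n + X_A\alpha + X_B\beta + X_{AB}\gamma + \epsilon, \quad \epsilon \sim N_n(0_n,\sigma^2 I_n). \tag{4.3}$$

Note that $\{\mu,\alpha,\beta,\gamma\}$ are not identifiable since

$$\begin{aligned} 1_n &= X_A 1_p = X_B 1_q = X_{AB}1_{pq}, \\ X_A e_i &= X_{AB}(0_q',\dots,0_q',1_q',0_q',\dots,0_q')', \quad \text{for } i = 1,\dots,p, \\ X_B e_j &= X_{AB}(e_j',\dots,e_j')', \quad \text{for } j = 1,\dots,q. \end{aligned} \tag{4.4}$$

So we will re-parameterize the model (4.3) as a linear regression model with non-constrained parameters. The expectation $E[y]$ is re-written as

$$\mu 1_n + X_A\alpha + X_B\beta + X_{AB}\gamma = \theta_1(\mu,\alpha,\beta,\gamma)1_n + \tilde X_A\alpha + \tilde X_B\beta + \tilde X_{AB}\gamma$$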

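where

$$\theta_1(\mu,\alpha,\beta,\gamma) = \mu + \frac{\nu'\alpha}{p} + \frac{\xi'\beta}{q} + \frac{(r_{11},\dots,r_{pq})\gamma}{n}$$

and $\tilde X_A$, $\tilde X_B$, $\tilde X_{AB}$ are the centered matrices. As explained in Christensen (2011), under the condition (4.2) on $r_{ij}$, which includes the balanced design, $\tilde X_A$ and $\tilde X_B$ are orthogonal, that is, $\tilde X_A'\tilde X_B$ is the zero matrix. Let the singular value decompositions of $\tilde X_A$, $\tilde X_B$ and $(I_n - U_AU_A')(I_n - U_BU_B')\tilde X_{AB}$ be

$$\begin{aligned} \tilde X_A &= U_AD_AW_A', \quad \tilde X_B = U_BD_BW_B', \\ (I_n - U_AU_A')(I_n - U_BU_B')\tilde X_{AB} &= U_{AB\setminus(A+B)}D_{AB\setminus(A+B)}W_{AB\setminus(A+B)}', \end{aligned} \tag{4.5}$$

where the ranks of $\tilde X_A$, $\tilde X_B$ and $(I_n - U_AU_A')(I_n - U_BU_B')\tilde X_{AB}$ are $p-1$, $q-1$ and $(p-1)(q-1)$, respectively. Then

$$\begin{aligned} E[y] &= \theta_1(\mu,\alpha,\beta,\gamma)1_n + \tilde X_A\alpha + U_AU_A'\tilde X_{AB}\gamma + \tilde X_B\beta + U_BU_B'\tilde X_{AB}\gamma + (I_n - U_AU_A' - U_BU_B')\tilde X_{AB}\gamma \\ &= \theta_1(\mu,\alpha,\beta,\gamma)1_n + U_A\{D_AW_A'\alpha + U_A'\tilde X_{AB}\gamma\} + U_B\{D_BW_B'\beta + U_B'\tilde X_{AB}\gamma\} + (I_n - U_AU_A' - U_BU_B')\tilde X_{AB}\gamma \\ &= \theta_1(\mu,\alpha,\beta,\gamma)1_n + U_A\theta_A(\alpha,\gamma) + U_B\theta_B(\beta,\gamma) + U_{AB\setminus(A+B)}\theta_{AB\setminus(A+B)}(\gamma), \end{aligned}$$

where

$$\begin{aligned} \theta_A(\alpha,\gamma) &= D_AW_A'\alpha + U_A'\tilde X_{AB}\gamma \in \mathbb{R}^{p-1}, \\ \theta_B(\beta,\gamma) &= D_BW_B'\beta + U_B'\tilde X_{AB}\gamma \in \mathbb{R}^{q-1}, \\ \theta_{AB\setminus(A+B)}(\gamma) &= D_{AB\setminus(A+B)}W_{AB\setminus(A+B)}'\gamma \in \mathbb{R}^{(p-1)(q-1)}. \end{aligned} \tag{4.6}$$

Therefore the linear model with non-constrained parameters $\{\theta_1,\theta_A,\theta_B,\theta_{AB\setminus(A+B)}\}$ is given by

$$y = \theta_1 1_n + U_A\theta_A + U_B\theta_B + U_{AB\setminus(A+B)}\theta_{AB\setminus(A+B)} + \epsilon. \tag{4.7}$$

In the two-way ANOVA model given by (4.3), the following five submodels are important.

$M_1$: $E[y] = \mu 1_n$, ($\alpha = 0_p$, $\beta = 0_q$, $\gamma = 0_{pq}$)
$M_{A+1}$: $E[y] = \mu 1_n + X_A\alpha$, ($\beta = 0_q$, $\gamma = 0_{pq}$)
$M_{B+1}$: $E[y] = \mu 1_n + X_B\beta$, ($\alpha = 0_p$, $\gamma = 0_{pq}$)
$M_{A+B+1}$: $E[y] = \mu 1_n + X_A\alpha + X_B\beta$, ($\gamma = 0_{pq}$)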

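$M_{(A+1)(B+1)}$: $E[y] = \mu 1_n + X_A\alpha + X_B\beta + X_{AB}\gamma$.

Under suitable constraints on $\alpha$, $\beta$ and $\gamma$, for example,

$$\nu'\alpha = 0, \quad \xi'\beta = 0, \quad \begin{pmatrix} (r_{11},\dots,r_{pq}) \\ U_A'X_{AB} \\ U_B'X_{AB} \end{pmatrix}\gamma = 0_{p+q-1}, \tag{4.8}$$

$\{\mu,\alpha,\beta,\gamma\}$ are identifiable, and further, between (4.3) and (4.7), there are the following equivalences:

$M_1$: ($\alpha = 0_p$, $\beta = 0_q$, $\gamma = 0_{pq}$) $\Leftrightarrow$ ($\theta_A = 0_{p-1}$, $\theta_B = 0_{q-1}$, $\theta_{AB\setminus(A+B)} = 0_{(p-1)(q-1)}$),
$M_{A+1}$: ($\beta = 0_q$, $\gamma = 0_{pq}$) $\Leftrightarrow$ ($\theta_B = 0_{q-1}$, $\theta_{AB\setminus(A+B)} = 0_{(p-1)(q-1)}$),
$M_{B+1}$: ($\alpha = 0_p$, $\gamma = 0_{pq}$) $\Leftrightarrow$ ($\theta_A = 0_{p-1}$, $\theta_{AB\setminus(A+B)} = 0_{(p-1)(q-1)}$),
$M_{A+B+1}$: $\gamma = 0_{pq}$ $\Leftrightarrow$ $\theta_{AB\setminus(A+B)} = 0_{(p-1)(q-1)}$.

Note that the choice of constraints for $\{\alpha,\beta,\gamma\}$ such as (4.8) does not affect our results.

4.2. Priors and the Bayes factor

Following Section 2.2, we assume that $\pi(\theta_1,\sigma^2) = 1/\sigma^2$ for every model and

$\theta_A \mid \{g,\sigma^2\} \sim N_{p-1}(0_{p-1},\ g\sigma^2I_{p-1})$ for $M_{A+1}$, $M_{A+B+1}$, $M_{(A+1)(B+1)}$,
$\theta_B \mid \{g,\sigma^2\} \sim N_{q-1}(0_{q-1},\ g\sigma^2I_{q-1})$ for $M_{B+1}$, $M_{A+B+1}$, $M_{(A+1)(B+1)}$,
$\theta_{AB\setminus(A+B)} \mid \{g,\sigma^2\} \sim N_{(p-1)(q-1)}(0_{(p-1)(q-1)},\ g\sigma^2I_{(p-1)(q-1)})$ for $M_{(A+1)(B+1)}$.

Further we assume

$$p(g) = \frac{g^b(1+g)^{-a-b-2}}{B(a+1,b+1)}, \tag{4.9}$$

where $a = -1/2$ and

$$b(s) = (n-s-1)/2 - a - 2 \tag{4.10}$$

for

$$s = \begin{cases} p-1 & \text{under } M_{A+1}, \\ q-1 & \text{under } M_{B+1}, \\ p+q-2 & \text{under } M_{A+B+1}, \\ pq-1 & \text{under } M_{(A+1)(B+1)}. \end{cases}$$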

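Before we proceed to give the Bayes factor with respect to the priors described above, we review the decomposition of sums of squares for two-way ANOVA. As in one-way ANOVA, the total sum of squares $W_T = \sum_{ijk}(y_{ijk} - \bar y_{\cdot\cdot\cdot})^2 = \|y - \bar y_{\cdot\cdot\cdot}1_n\|^2$ can be decomposed as

$$W_T = W_A + W_B + W_{AB\setminus(A+B)} + W_E, \tag{4.11}$$

where the sums of squares are given by

$$\begin{aligned} W_A &= \sum_{ijk}(\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2 = \|U_AU_A'\{y - \bar y_{\cdot\cdot\cdot}1_n\}\|^2 = \|U_A'\{y - \bar y_{\cdot\cdot\cdot}1_n\}\|^2, \\ W_B &= \sum_{ijk}(\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2 = \|U_BU_B'\{y - \bar y_{\cdot\cdot\cdot}1_n\}\|^2 = \|U_B'\{y - \bar y_{\cdot\cdot\cdot}1_n\}\|^2, \\ W_{AB\setminus(A+B)} &= \sum_{ijk}(\bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot})^2 = \|U_{AB\setminus(A+B)}U_{AB\setminus(A+B)}'\{y - \bar y_{\cdot\cdot\cdot}1_n\}\|^2 = \|U_{AB\setminus(A+B)}'\{y - \bar y_{\cdot\cdot\cdot}1_n\}\|^2, \\ W_E &= \sum_{ijk}(y_{ijk} - \bar y_{ij\cdot})^2 = \|(I_n - U_AU_A')(I_n - U_BU_B')(I_n - U_{AB\setminus(A+B)}U_{AB\setminus(A+B)}')\{y - \bar y_{\cdot\cdot\cdot}1_n\}\|^2, \end{aligned}$$

and

$$\bar y_{\cdot\cdot\cdot} = \frac{1}{n}\sum_{ijk}y_{ijk}, \quad \bar y_{i\cdot\cdot} = \frac{1}{\sum_j r_{ij}}\sum_{jk}y_{ijk}, \quad \bar y_{\cdot j\cdot} = \frac{1}{\sum_i r_{ij}}\sum_{ik}y_{ijk}, \quad \bar y_{ij\cdot} = \frac{1}{r_{ij}}\sum_k y_{ijk}.$$

In (4.11), $W_A$, $W_B$, $W_{AB\setminus(A+B)}$ and $W_E$ are independently distributed as

$$\frac{W_A}{\sigma^2} \sim \chi^2_{p-1}\left(\|\theta_A\|^2/\sigma^2\right), \quad \frac{W_B}{\sigma^2} \sim \chi^2_{q-1}\left(\|\theta_B\|^2/\sigma^2\right), \quad \frac{W_{AB\setminus(A+B)}}{\sigma^2} \sim \chi^2_{(p-1)(q-1)}\left(\|\theta_{AB\setminus(A+B)}\|^2/\sigma^2\right), \quad \frac{W_E}{\sigma^2} \sim \chi^2_{n-pq}. \tag{4.12}$$

Note that $\|\theta_A\|^2$, $\|\theta_B\|^2$ and $\|\theta_{AB\setminus(A+B)}\|^2$ do not depend on the parameterization, since they are given by

$$\|\theta_A\|^2 = \|U_A'\{E[y] - E[\bar y_{\cdot\cdot\cdot}]1_n\}\|^2, \quad \|\theta_B\|^2 = \|U_B'\{E[y] - E[\bar y_{\cdot\cdot\cdot}]1_n\}\|^2, \quad \|\theta_{AB\setminus(A+B)}\|^2 = \|U_{AB\setminus(A+B)}'\{E[y] - E[\bar y_{\cdot\cdot\cdot}]1_n\}\|^2.$$

Based on the decomposition, Bayes factors for two-way ANOVA are given as follows.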

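Theorem 4.1. When $M_\gamma$ for $\gamma = A+1$, $B+1$, $A+B+1$ and $(A+1)(B+1)$ is compared pairwise with $M_1$, the corresponding Bayes factors are

$$\begin{aligned} \mathrm{BF}^F_{(A+1):1} &= \frac{\Gamma(p/2)\,\Gamma((n-p)/2)}{\Gamma(1/2)\,\Gamma(\{n-1\}/2)}\left(1 + \frac{W_A}{W_T - W_A}\right)^{(n-p)/2-1/2}, \\ \mathrm{BF}^F_{(B+1):1} &= \frac{\Gamma(q/2)\,\Gamma((n-q)/2)}{\Gamma(1/2)\,\Gamma(\{n-1\}/2)}\left(1 + \frac{W_B}{W_T - W_B}\right)^{(n-q)/2-1/2}, \\ \mathrm{BF}^F_{(A+B+1):1} &= \frac{\Gamma(\frac{p+q-1}{2})\,\Gamma(\frac{n-p-q+1}{2})}{\Gamma(1/2)\,\Gamma(\{n-1\}/2)}\left(1 + \frac{W_A + W_B}{W_T - W_A - W_B}\right)^{(n-p-q)/2}, \\ \mathrm{BF}^F_{(A+1)(B+1):1} &= \frac{\Gamma(pq/2)\,\Gamma((n-pq)/2)}{\Gamma(1/2)\,\Gamma(\{n-1\}/2)}\left(1 + \frac{W_T - W_E}{W_E}\right)^{(n-pq)/2-1/2}. \end{aligned}$$

4.3. Consistency

In this subsection, we consider the model selection consistency of the Bayes factors proposed in Theorem 4.1. We call $\{\mathrm{BF}^F\}$, the set of Bayes factors given in Theorem 4.1, consistent under $M_\gamma$ if and only if

$$\operatorname*{plim}_{n\to\infty}\frac{\mathrm{BF}^F_{\gamma':1}}{\mathrm{BF}^F_{\gamma:1}} = 0$$

for any $\gamma' \ne \gamma$ (let $\mathrm{BF}^F_{1:1} = 1$). As the competitor, the corresponding Bayes factors based on BIC are given as follows:

$$\begin{aligned} \mathrm{BF}^{BIC}_{(A+1):1} &= n^{-(p-1)/2}\left(1 + \frac{W_A}{W_T - W_A}\right)^{n/2}, \\ \mathrm{BF}^{BIC}_{(B+1):1} &= n^{-(q-1)/2}\left(1 + \frac{W_B}{W_T - W_B}\right)^{n/2}, \\ \mathrm{BF}^{BIC}_{(A+B+1):1} &= n^{-(p+q-2)/2}\left(1 + \frac{W_A + W_B}{W_T - W_A - W_B}\right)^{n/2}, \\ \mathrm{BF}^{BIC}_{(A+1)(B+1):1} &= n^{-(pq-1)/2}\left(1 + \frac{W_T - W_E}{W_E}\right)^{n/2}. \end{aligned}$$

Let

$$\bar c_A = \lim_{n\to\infty}\frac{\|\theta_A\|^2}{n\sigma^2}, \quad \bar c_B = \lim_{n\to\infty}\frac{\|\theta_B\|^2}{n\sigma^2}, \quad \bar c_{AB\setminus(A+B)} = \lim_{n\to\infty}\frac{\|\theta_{AB\setminus(A+B)}\|^2}{n\sigma^2}.$$

Then we have the following result on consistency.

Theorem 4.2.

I. Assume $\bar r \to \infty$ and $p$ and $q$ are fixed.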

  (a) Both $\{\mathrm{BF}^F\}$ and $\{\mathrm{BF}^{BIC}\}$ are consistent whichever the true model is.
II. Assume $p \to \infty$ and $q \to \infty$.
  (a) Both $\{\mathrm{BF}^F\}$ and $\{\mathrm{BF}^{BIC}\}$ are consistent except under $M_{(A+1)(B+1)}$.
  (b) Assume $\bar r$ is fixed.
    i. $\{\mathrm{BF}^F\}$ is consistent (inconsistent) under $M_{(A+1)(B+1)}$ when
       $$\bar c_{AB\setminus(A+B)} > (<)\ H(\bar r,\ \bar c_A + \bar c_B), \tag{4.13}$$
       where $H(r,c)$ with positive $c$ is the (unique) positive solution of
       $$(x+1)^r/r - (x+1) - c = 0. \tag{4.14}$$
    ii. $\{\mathrm{BF}^{BIC}\}$ is inconsistent under $M_{(A+1)(B+1)}$.
  (c) Assume $\bar r \to \infty$.
    i. $\{\mathrm{BF}^F\}$ is consistent under $M_{(A+1)(B+1)}$.
    ii. $\{\mathrm{BF}^{BIC}\}$ is consistent (inconsistent) under $M_{(A+1)(B+1)}$ when $pq \to \infty$ slower (faster) than $e\{1+\bar c_{AB\setminus(A+B)}\}^{\bar r}/\bar r$.

Proof. See Appendix.

The function $H(r,c)$ satisfies the following: $H(r,c)$ with fixed $r$ is increasing in $c$ and $H(r,0) = h(r)$, where $h(r)$ is given by (3.3); $H(r,c)$ with fixed $c$ is decreasing in $r$ and $\lim_{r\to\infty}H(r,c) = 0$.

Remark 4.1. We have results under fixed $q$ (and fixed $\bar r$ or $\bar r \to \infty$), too. We omit the details since they are somewhat complicated.

Appendix A: Proof of Theorem 2.1

First we derive the marginal density under $M_1$. Using the Pythagorean relation

$$\|y - \theta_11_n\|^2 = n(\bar y_{\cdot\cdot} - \theta_1)^2 + W_T,$$

where $\bar y_{\cdot\cdot} = n^{-1}\sum_{i,j}y_{ij}$ and $W_T = \|y - \bar y_{\cdot\cdot}1_n\|^2 = \sum_{ij}(y_{ij} - \bar y_{\cdot\cdot})^2$, we have

$$\begin{aligned} m_1(y) &= \int_{-\infty}^{\infty}\int_0^\infty \frac{1}{(2\pi)^{n/2}\sigma^{n+2}}\exp\left(-\frac{\|y - \theta_11_n\|^2}{2\sigma^2}\right)d\theta_1\,d\sigma^2 \\ &= \int_0^\infty \frac{n^{-1/2}}{(2\pi)^{n/2-1/2}\sigma^{n+1}}\exp\left(-\frac{W_T}{2\sigma^2}\right)d\sigma^2 \\ &= \frac{n^{-1/2}\,\Gamma(\{n-1\}/2)}{\pi^{(n-1)/2}\{W_T\}^{(n-1)/2}}. \end{aligned}$$

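Then we derive the marginal density under $M_{A+1}$. Using the relationship

$$\begin{aligned} &\|y - \theta_11_n - U_A\theta_A\|^2 + g^{-1}\|\theta_A\|^2 \\ &= n(\bar y_{\cdot\cdot} - \theta_1)^2 + \|y - \bar y_{\cdot\cdot}1_n - U_A\theta_A\|^2 + g^{-1}\|\theta_A\|^2 \\ &= n(\bar y_{\cdot\cdot} - \theta_1)^2 + \frac{g+1}{g}\left\|\theta_A - \frac{g}{g+1}U_A'(y - \bar y_{\cdot\cdot}1_n)\right\|^2 - \frac{g}{g+1}\|U_A'(y - \bar y_{\cdot\cdot}1_n)\|^2 + \|y - \bar y_{\cdot\cdot}1_n\|^2 \\ &= n(\bar y_{\cdot\cdot} - \theta_1)^2 + \frac{g+1}{g}\left\|\theta_A - \frac{g}{g+1}U_A'(y - \bar y_{\cdot\cdot}1_n)\right\|^2 + \frac{W_T + gW_E}{g+1}, \end{aligned}$$

where $W_E$ is given in (2.14), we have the conditional marginal density of $y$ given $g$ under $M_{A+1}$,

$$\begin{aligned} m_{A+1}(y \mid g) &= \iiint p(y \mid \theta_1,\theta_A,\sigma^2)\,p(\theta_A \mid \sigma^2,g)\,p(\theta_1)\,p(\sigma^2)\,d\theta_1\,d\theta_A\,d\sigma^2 \\ &= \int_{-\infty}^{\infty}\int_{\mathbb{R}^{p-1}}\int_0^\infty \frac{1}{(2\pi\sigma^2)^{n/2}(2\pi g\sigma^2)^{(p-1)/2}}\exp\left(-\frac{\|y - \theta_11_n - U_A\theta_A\|^2}{2\sigma^2} - \frac{\|\theta_A\|^2}{2g\sigma^2}\right)\frac{1}{\sigma^2}\,d\theta_1\,d\theta_A\,d\sigma^2 \\ &= \int_0^\infty \frac{n^{-1/2}(1+g)^{-(p-1)/2}}{(2\pi\sigma^2)^{(n-1)/2}}\exp\left(-\frac{W_T + gW_E}{2\sigma^2(g+1)}\right)\frac{1}{\sigma^2}\,d\sigma^2 \\ &= m_1(y)\,\frac{(1+g)^{(n-p)/2}}{(g\{W_E/W_T\} + 1)^{(n-1)/2}}. \end{aligned}$$

The fully marginal density

$$m_{A+1}(y) = \int_0^\infty m_{A+1}(y \mid g)\,p(g)\,dg$$

under the prior of $g$ given by (2.11) has a simple closed form:

$$\begin{aligned} m_{A+1}(y) &= m_1(y)\int_0^\infty \frac{(1+g)^{(n-p)/2}\,g^b(1+g)^{-a-b-2}}{(g\{W_E/W_T\} + 1)^{(n-1)/2}\,B(a+1,b+1)}\,dg \\ &= \frac{m_1(y)}{B(a+1,b+1)}\int_0^\infty \frac{g^b}{(g\{W_E/W_T\} + 1)^{(n-1)/2}}\,dg \\ &= m_1(y)\,\frac{\Gamma(p/2+a+1/2)\,\Gamma((n-p)/2)}{\Gamma(a+1)\,\Gamma(\{n-1\}/2)}\left(\frac{W_T}{W_E}\right)^{(n-p-2)/2-a}, \end{aligned} \tag{A.1}$$

which completes the proof. Here the second equality uses $a+b+2 = (n-p)/2$ from (2.12), and the third follows from the substitution $t = g\,W_E/W_T$ and the Beta integral $\int_0^\infty t^b(1+t)^{-(n-1)/2}\,dt = B(b+1,\ p/2+a+1/2)$.

Appendix B: Proof of Theorem 3.1

Preparation: random part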

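Recall that, for any $n$ and $p$, $W_H$ and $W_E$ are independent and

$$\frac{W_H}{\sigma^2} \sim \chi^2_{p-1}\left[\|\theta_A\|^2/\sigma^2\right], \quad \frac{W_E}{\sigma^2} \sim \chi^2_{n-p}.$$

In the following, $\xrightarrow[n]{P}$ denotes convergence in probability as $n$ grows. When $p$ is fixed and $\bar r\,(= n/p)$ goes to $\infty$, we have

$$\frac{W_E}{p\bar r\sigma^2} \xrightarrow[\bar r]{P} 1, \quad \frac{W_H}{\sigma^2} \sim \chi^2_{p-1} \ \text{under } M_1, \quad \frac{W_H}{p\bar r\sigma^2} \xrightarrow[\bar r]{P} \bar c_A \ \text{under } M_{A+1}.$$

Hence, for any $\epsilon > 0$ and any positive number $\bar r$, there exists an $l(\epsilon)$ such that

$$\Pr\left(1/l(\epsilon) < \{1 + W_H/W_E\}^{p\bar r} < l(\epsilon)\right) > 1 - \epsilon \tag{B.1}$$

under $M_1$. Under $M_{A+1}$, we have

$$1 + \frac{W_H}{W_E} \xrightarrow[\bar r]{P} 1 + \bar c_A. \tag{B.2}$$

When $p$ goes to infinity, we have

$$\frac{W_E}{p\sigma^2} \xrightarrow[p]{P} \bar r - 1, \quad \frac{W_H}{p\sigma^2} \xrightarrow[p]{P} 1 \ \text{under } M_1, \quad \frac{W_H}{p\sigma^2} \xrightarrow[p]{P} 1 + \bar r\bar c_A \ \text{under } M_{A+1}.$$

Hence we have

$$1 + \frac{W_H}{W_E} \xrightarrow[p]{P} \frac{\bar r}{\bar r - 1} \ \text{under } M_1, \quad 1 + \frac{W_H}{W_E} \xrightarrow[p]{P} \frac{\bar r\{1 + \bar c_A\}}{\bar r - 1} \ \text{under } M_{A+1}. \tag{B.3}$$

Preparation: non-random part

For the asymptotic behavior of the gamma function, Stirling's formula,

$$\Gamma(ax+b) \approx \sqrt{2\pi}\,e^{-ax}(ax)^{ax+b-1/2} \tag{B.4}$$

for sufficiently large $x$, is useful. Here $f \approx g$ means $\lim f/g = 1$. Using (B.4), we get

$$\frac{\Gamma(p/2)\,\Gamma(\{p\bar r - p\}/2)}{\Gamma(1/2)\,\Gamma(\{p\bar r - 1\}/2)} \approx \frac{\Gamma(p/2)}{\Gamma(1/2)\,(p/2)^{(p-1)/2}}\,\frac{1}{\bar r^{(p-1)/2}} \tag{B.5}$$

when $\bar r \to \infty$ and $p$ is fixed, and

$$\frac{\Gamma(p/2)\,\Gamma(\{p\bar r - p\}/2)}{\Gamma(\{p\bar r - 1\}/2)\,\Gamma(1/2)} \approx \frac{\sqrt{2}\,(\bar r - 1)^{p(\bar r-1)/2-1/2}}{\bar r^{p\bar r/2-1}} = \frac{\sqrt{2}\,\bar r}{(\bar r - 1)^{1/2}}\left\{\frac{\bar r - 1}{\bar r^{\bar r/(\bar r-1)}}\right\}^{p(\bar r-1)/2} \tag{B.6}$$

when $p \to \infty$.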

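Main part

First, we consider the consistency in the case where $\bar r \to \infty$ and $p$ is fixed. Using (B.1), (B.2) and (B.5), we have

$$\mathrm{BF}^F_{(A+1):1} \xrightarrow[\bar r]{P} 0 \ \text{under } M_1, \quad 1/\mathrm{BF}^F_{(A+1):1} \xrightarrow[\bar r]{P} 0 \ \text{under } M_{A+1},$$

which means that $\mathrm{BF}^F_{(A+1):1}$ is consistent under both $M_1$ and $M_{A+1}$. Similarly we see that $\mathrm{BF}^{BIC}_{(A+1):1}$ is consistent under both $M_1$ and $M_{A+1}$ when $\bar r \to \infty$ and $p$ is fixed.

Then we consider the consistency in the case where $p \to \infty$. Using (B.3) under $M_1$ and (B.6), we have

$$\mathrm{BF}^F_{(A+1):1} \xrightarrow[p]{P} 0, \quad \mathrm{BF}^{BIC}_{(A+1):1} \xrightarrow[p]{P} 0,$$

which means that $\mathrm{BF}^F_{(A+1):1}$ and $\mathrm{BF}^{BIC}_{(A+1):1}$ are consistent under $M_1$ when $p \to \infty$. Using (B.3) under $M_{A+1}$ and (B.6), we have

$$\frac{1}{\mathrm{BF}^F_{(A+1):1}}\,\frac{\sqrt{2}\,\bar r}{(\bar r - 1)^{1/2}}\left\{\frac{1 + \bar c_A}{\bar r^{1/(\bar r-1)}}\right\}^{p(\bar r-1)/2} \xrightarrow[p]{P} 1.$$

Hence when $\bar c_A > h(\bar r) = \bar r^{1/(\bar r-1)} - 1$, we have $1/\mathrm{BF}^F_{(A+1):1} \xrightarrow[p]{P} 0$, which means consistency of $\mathrm{BF}^F_{(A+1):1}$ under $M_{A+1}$. Since $h(\bar r) \to 0$ as $\bar r \to \infty$, $\mathrm{BF}^F_{(A+1):1}$ is consistent when both $p$ and $\bar r$ go to infinity. On the other hand, when $\bar c_A < h(\bar r)$, we have $\mathrm{BF}^F_{(A+1):1} \xrightarrow[p]{P} 0$, which means inconsistency of $\mathrm{BF}^F_{(A+1):1}$ under $M_{A+1}$.

Using (B.3) under $M_{A+1}$ and (B.6), we have

$$\mathrm{BF}^{BIC}_{(A+1):1}\left(p\bar r\left\{\frac{\bar r - 1}{\bar r(1 + \bar c_A)}\right\}^{\bar r}\right)^{p/2}(p\bar r)^{-1/2} \xrightarrow[p]{P} 1.$$

We see that, for any fixed $\bar r$,

$$\mathrm{BF}^{BIC}_{(A+1):1} \xrightarrow[p]{P} 0,$$

which means inconsistency of BIC under $M_{A+1}$. When $\bar r$ as well as $p$ goes to $\infty$,

$$\mathrm{BF}^{BIC}_{(A+1):1}\left(\frac{p\bar r}{e(1 + \bar c_A)^{\bar r}}\right)^{p/2}(p\bar r)^{-1/2} \xrightarrow[p]{P} 1.$$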

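Hence if $p$ approaches infinity faster than $e\{1 + \bar c_A\}^{\bar r}/\bar r$, we have

$$\mathrm{BF}^{BIC}_{(A+1):1} \xrightarrow[p,\bar r]{P} 0,$$

which means inconsistency of BIC even if $\bar r \to \infty$. On the other hand, if $p$ approaches infinity slower than $e\{1 + \bar c_A\}^{\bar r}/\bar r$, we have

$$1/\mathrm{BF}^{BIC}_{(A+1):1} \xrightarrow[p,\bar r]{P} 0,$$

which means consistency of BIC.

Appendix C: Proof of Theorem 4.2

By following the proof of Theorem 3.1 in Appendix B, all parts of the proof are straightforward. Here we give a sketch of the proof for part II(b)i only. When $p, q \to \infty$ and $\bar r$ is fixed,

$$\frac{W_A}{n\sigma^2} \xrightarrow[p,q]{P} \bar c_A, \quad \frac{W_B}{n\sigma^2} \xrightarrow[p,q]{P} \bar c_B, \quad \frac{W_E}{n\sigma^2} \xrightarrow[p,q]{P} 1 - 1/\bar r, \quad \frac{W_{AB\setminus(A+B)}}{n\sigma^2} \xrightarrow[p,q]{P} \bar c_{AB\setminus(A+B)} + 1/\bar r. \tag{C.1}$$

Using (C.1), Stirling's formula given in (B.4) and $\lim_{p\to\infty}p^{1/p} = 1$, we have

$$\begin{aligned} \left\{\mathrm{BF}^F_{(A+1):1}\right\}^{2/(pq)} &\xrightarrow[p,q]{P} \left(\frac{1 + \bar c_A + \bar c_B + \bar c_{AB\setminus(A+B)}}{1 + \bar c_B + \bar c_{AB\setminus(A+B)}}\right)^{\bar r}, \\ \left\{\mathrm{BF}^F_{(B+1):1}\right\}^{2/(pq)} &\xrightarrow[p,q]{P} \left(\frac{1 + \bar c_A + \bar c_B + \bar c_{AB\setminus(A+B)}}{1 + \bar c_A + \bar c_{AB\setminus(A+B)}}\right)^{\bar r}, \\ \left\{\mathrm{BF}^F_{(A+B+1):1}\right\}^{2/(pq)} &\xrightarrow[p,q]{P} \left(\frac{1 + \bar c_A + \bar c_B + \bar c_{AB\setminus(A+B)}}{1 + \bar c_{AB\setminus(A+B)}}\right)^{\bar r}, \\ \left\{\mathrm{BF}^F_{(A+1)(B+1):1}\right\}^{2/(pq)} &\xrightarrow[p,q]{P} \frac{(1 + \bar c_A + \bar c_B + \bar c_{AB\setminus(A+B)})^{\bar r-1}}{\bar r}. \end{aligned}$$

Therefore if

$$\frac{(1 + \bar c_{AB\setminus(A+B)})^{\bar r}}{\bar r} > 1 + \bar c_A + \bar c_B + \bar c_{AB\setminus(A+B)}, \tag{C.2}$$

the ratio

$$\left\{\frac{\mathrm{BF}^F_{\gamma:1}}{\mathrm{BF}^F_{(A+1)(B+1):1}}\right\}^{2/(pq)}, \tag{C.3}$$

for $\gamma = A+1$, $B+1$, $A+B+1$, approaches a positive constant strictly less than 1, which guarantees the consistency of $\mathrm{BF}^F_{(A+1)(B+1):1}$ under (C.2).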
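The thresholds $h(r)$ of (3.3) and $H(r,c)$ of (4.14), which condition (C.2) characterizes, are easy to evaluate numerically. The following is a minimal sketch under stated assumptions, not part of the original paper: `h` and `H` are hypothetical helper names, $r > 1$ and $c \ge 0$ are assumed, and the root of (4.14) is found by plain bisection, which is valid because $(x+1)^r/r - (x+1) - c$ is negative at $x = 0$ and increasing for $x > 0$ when $r > 1$.

```python
def h(r):
    """Threshold h(r) = r^{1/(r-1)} - 1 from (3.3); assumes r > 1."""
    return r ** (1.0 / (r - 1.0)) - 1.0


def H(r, c, iterations=200):
    """Positive root x of (x+1)^r / r - (x+1) - c = 0, see (4.14); assumes r > 1, c >= 0."""
    def f(x):
        return (x + 1.0) ** r / r - (x + 1.0) - c

    lo, hi = 0.0, 1.0
    while f(hi) <= 0.0:      # expand the bracket until the sign changes
        hi *= 2.0
    for _ in range(iterations):  # bisection on the increasing function f
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


for r in (2, 5, 10):
    print(r, round(h(r), 3), round(H(r, 0.0), 3))  # H(r, 0) should agree with h(r)
```

For $r = 2$ the equation (4.14) is quadratic, so the root is $\sqrt{1+2c}$ in closed form, which gives a quick check of the bisection. In the setting of Theorem 4.2 II(b)i, consistency under $M_{(A+1)(B+1)}$ then corresponds to comparing $\bar c_{AB\setminus(A+B)}$ with `H(r_bar, c_a + c_b)` for given (assumed known) limits.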

References

Berger, J. O., Bernardo, J. M. and Sun, D. (2009). The formal definition of reference priors. Ann. Statist.
Berger, J. O., Ghosh, J. K. and Mukhopadhyay, N. (2003). Approximations and consistency of Bayes factors as model dimension grows. J. Statist. Plann. Inference.
Berger, J. O., Pericchi, L. R. and Varshavsky, J. A. (1998). Bayes factors and marginal distributions in invariant situations. Sankhyā Ser. A.
Christensen, R. (2011). Plane answers to complex questions: The theory of linear models, fourth ed. Springer Texts in Statistics. Springer, New York.
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Amer. Statist. Assoc.
Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. J. Amer. Statist. Assoc.
Maruyama, Y. (2009). A Bayes factor with reasonable model selection consistency for ANOVA model. Arxiv v1 [stat.me].
Maruyama, Y. and George, E. I. (2011). Fully Bayes factors with a generalized g-prior. Ann. Statist.
Moreno, E., Girón, F. J. and Casella, G. (2010). Consistency of objective Bayes factors as the model dimension grows. Ann. Statist.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist.
Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz. J. R. Stat. Soc. Ser. B Stat. Methodol.
Wang, M. and Sun, X. (2012). Bayes factor consistency for unbalanced ANOVA models. Arxiv v1 [stat.me].
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian inference and decision techniques. Stud. Bayesian Econometrics Statist. North-Holland, Amsterdam.
Zellner, A. and Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In Bayesian Statistics: Proceedings of the First International Meeting held in Valencia (Spain) (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.). University of Valencia.
