An Extended BIC for Model Selection
1 An Extended BIC for Model Selection
Presented at the JSM meeting, Salt Lake City.
Surajit Ray, Boston University (Dept. of Mathematics and Statistics).
Joint work with James Berger, Duke University; Susie Bayarri, University of Valencia; Woncheol Jang, University of Georgia; Luis R. Pericchi, University of Puerto Rico.
JSM: July-August, slide #1
2 Motivation and Outline
Outline: Bayesian Information Criterion (BIC); First Ambiguity: What is n?; Second Ambiguity: Can n be different for different parameters?; Third Ambiguity: What is the number of parameters?; Example: Random effects group means; Example: Common mean, differing variances; Problems with n and p.
This work was initiated from a SAMSI Social Sciences working group on getting the model right in structural equation modeling, and was introduced in the Wald Lecture by Jim Berger.
Specific characteristics of the setting: independent cases (individuals); often multilevel or random effects models (p grows as n grows); regularity conditions may not be satisfied; problems with BIC. Can a generalization of BIC work?
Goal: use only standard software output and closed-form expressions. We present a generalization of BIC (extended BIC, or GBIC) and some examples in a linear regression scenario. Work in progress (comments are welcome!).
3 Bayesian Information Criterion: BIC
Let l(θ̂ | x) be the log-likelihood at the MLE θ̂ given x. Then
BIC = 2 l(θ̂ | x) − p log(n),
where p is the number of estimated parameters and n is the cardinality of x.
Schwarz's result: as n → ∞ (with p fixed), this is an approximation (up to a constant) to twice the log of the marginal likelihood of the model, m(x) = ∫ f(x | θ) π(θ) dθ, so that m(x) = c_π e^{BIC/2} (1 + o(1)). BIC-based approximations to Bayes factors further ignore the c_π's.
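As a concrete illustration (not from the slides), BIC in the convention above can be computed directly from the maximized log-likelihood. The sketch below scores a normal model with unknown mean and variance on simulated data; all names are illustrative.

```python
import math
import random

def bic(loglik, p, n):
    # BIC = 2*l(theta_hat | x) - p*log(n), in the slides' convention
    # (larger is better; many references report the negative of this quantity)
    return 2.0 * loglik - p * math.log(n)

random.seed(1)
data = [random.gauss(5.0, 2.0) for _ in range(200)]
n = len(data)
mu = sum(data) / n                            # MLE of the mean
s2 = sum((x - mu) ** 2 for x in data) / n     # MLE of the variance
loglik = -0.5 * n * (math.log(2 * math.pi * s2) + 1)  # normal log-likelihood at the MLE
print(bic(loglik, p=2, n=n))
```

Here p = 2 because both the mean and the variance are estimated; the entire penalty hinges on what "n" means, which is exactly the ambiguity the next slides take up.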
4 First Ambiguity: What is n?
Consider the time series y_it = β y_{i,t−1} + ε_it, for i = 1, ..., n, t = 1, ..., T, with ε_it ~ N(0, σ²). When the goal is to estimate the mean level ȳ, consider two extreme cases:
1. β = 0: the y_it are i.i.d., and the estimated parameter ȳ has sample size nT.
2. β = 1: clearly the effective sample size for ȳ equals n.
See Thiebaux & Zwiers (1984) for discussion of effective sample size in time series.
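The interpolation between these two extremes can be sketched with the standard AR(1) approximation Var(ȳ) ≈ (σ²/N)(1+β)/(1−β); this formula is not stated on the slides, but it is the usual effective-sample-size correction in the spirit of Thiebaux & Zwiers:

```python
def ar1_effective_n(n_total, beta):
    # Standard large-sample approximation for the mean of a stationary AR(1):
    # Var(ybar) ~ (sigma^2 / n_total) * (1 + beta) / (1 - beta),
    # so the effective sample size shrinks by the factor (1 - beta) / (1 + beta).
    return n_total * (1.0 - beta) / (1.0 + beta)

n, T = 10, 50
print(ar1_effective_n(n * T, 0.0))   # beta = 0: the full nT = 500 observations
print(ar1_effective_n(n * T, 0.9))   # beta near 1: far fewer effective observations
```

Note the approximation degrades at the boundary: at β = 1 the slides' answer is n (one effective observation per series), whereas this marginal formula goes to 0.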
5 Second Ambiguity: Can n be different for different parameters?
Consider data from r groups:
y_ip = μ_p + ε_ip, i = 1, ..., n, p = 1, ..., r, ε_ip ~ N(0, σ²),
with the μ_p and σ² both unknown. Group p contributes observations y_1p, y_2p, ..., y_np.
Different parameters can have different sample sizes:
- the sample size for estimating each μ_p is n;
- the sample size for estimating σ² is nr.
Computation for this example yields the information matrix
Î = diag( (n/σ̂²) I_r , nr/(2σ̂⁴) ),
where σ̂² = (1/(nr)) Σ_p Σ_i (y_ip − ȳ_p)².
8 Third Ambiguity: What is the number of parameters?
Example: Random effects group means. In the previous group means example, with p groups and r observations X_il per group, where X_il ~ N(μ_i, σ²) for l = 1, ..., r, consider the multilevel version in which
μ_i ~ N(ξ, τ²),
with ξ and τ² unknown. What is the number of parameters? (See also Pauler, 1998.)
(1) If τ² = 0, there are only two parameters: ξ and σ².
(2) If τ² is huge, is the number of parameters p + 3 (the p means along with ξ, τ² and σ²)?
Continued...
9 Example: Random effects group means (continued)
(3) But if one integrates out μ = (μ_1, ..., μ_p), then
f(x | σ², ξ, τ²) = ∫ f(x | μ, σ²) π(μ | ξ, τ²) dμ ∝ (1/σ^{p(r−1)}) exp(−p r σ̂²/(2σ²)) ∏_{i=1}^{p} (σ²/r + τ²)^{−1/2} exp(−(x̄_i − ξ)² / (2(σ²/r + τ²))),
where σ̂² = (1/(pr)) Σ_i Σ_l (X_il − X̄_i)², so p = 3 if one can work directly with f(x | σ², ξ, τ²).
Note: This seems to be common practice in the social sciences: latent variables are integrated out, so the remaining parameters are the only unknowns when applying BIC.
Note: In this example the effective sample sizes should be p·r for σ², p for ξ and τ², and r for the μ_i's.
10 Example: Common mean, differing variances (Cox mixtures)
Suppose each independent observation X_i, i = 1, ..., n, has probability 1/2 of arising from N(θ, 1), and probability 1/2 of arising from N(θ, 1000). Clearly half of the observations are worthless, so the effective sample size is roughly n/2.
11 Problems with n and p
Problems with n:
- Is n the number of vector observations or the number of real observations?
- Different θ_i can have different effective sample sizes.
- Some observations can be more informative than others, as in mixture models, or in models with mixed continuous and discrete observations.
Problems with p:
- What is p with random effects, latent variables, mixture models, etc.?
- Often p grows with n.
12 GBIC: a proposed solution
GBIC is based on a modified Laplace approximation to m(x) for good priors:
- The good priors can deal with growing p.
- The Laplace approximation retains the constant c_π in the expansion, and is often good even for moderate n.
- It implicitly uses effective sample size (discussion later).
- It is computable using standard software output: MLEs and observed information matrices.
15 Laplace approximation
1. Preliminary: a nice reparameterization.
2. Taylor expansion. By a Taylor series expansion of e^{l(θ)} about the MLE θ̂,
m(x) = ∫ f(x | θ) π(θ) dθ = ∫ e^{l(θ)} π(θ) dθ ≈ ∫ exp[ l(θ̂) + (θ − θ̂)ᵗ ∇l(θ̂) − ½ (θ − θ̂)ᵗ Î (θ − θ̂) ] π(θ) dθ,
where ∇ denotes the gradient and Î = (Î_jk) is the observed information matrix, with (j, k) entry Î_jk = −∂² l(θ)/∂θ_j ∂θ_k evaluated at θ = θ̂.
3. Approximation to the marginal m(x). If θ̂ occurs in the interior of the parameter space, so that ∇l(θ̂) = 0 (if not, the analysis must proceed as in Haughton 1991, 1993), mild conditions yield
m(x) = e^{l(θ̂)} ∫ e^{−½ (θ − θ̂)ᵗ Î (θ − θ̂)} π(θ) dθ (1 + o_n(1)).
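A minimal sketch of step 3 in action (illustrative, not from the slides): for a Bernoulli likelihood with a Uniform(0,1) prior, the exact marginal m(x) = B(s+1, n−s+1) can be compared with the one-dimensional Laplace approximation e^{l(θ̂)} π(θ̂) √(2π/Î).

```python
import math

# s successes in n Bernoulli trials, Uniform(0,1) prior on the success probability.
n, s = 100, 60
th = s / n                                    # MLE
loglik = s * math.log(th) + (n - s) * math.log(1 - th)
obs_info = n / (th * (1 - th))                # observed information at the MLE

# Laplace: log m(x) ~ l(theta_hat) + log pi(theta_hat) + 0.5*log(2*pi / I_hat);
# log pi(theta_hat) = 0 for the Uniform(0,1) prior.
log_laplace = loglik + 0.5 * math.log(2 * math.pi / obs_info)

# Exact: log B(s+1, n-s+1) via log-gamma.
log_exact = math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)
print(log_laplace, log_exact)  # agree to within a few hundredths for n = 100
```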
16 Choosing a good prior
Strategy: integrate out common parameters, and orthogonalize the parameters.
Let O be orthogonal and D = diag(d_i), i = 1, ..., p₁, such that Σ = Oᵗ D O.
Make the change of variables ξ = O θ₍₁₎, and let ξ̂ = O θ̂₍₁₎.
If there is no θ₍₂₎ to be integrated out, Σ = Î⁻¹ = Oᵗ D O.
We now use a prior that is independent in the ξ_i, i.e., π(ξ) = ∏_{i=1}^{p₁} π_i(ξ_i).
The approximation to m(x) is then
m(x) ≈ e^{l(θ̂)} (2π)^{p/2} |Î|^{−1/2} ∏_{i=1}^{p₁} ∫ (2π d_i)^{−1/2} e^{−(ξ_i − ξ̂_i)²/(2 d_i)} π_i(ξ_i) dξ_i,
where we recall that the d_i are the diagonal elements of D from Σ = Oᵗ D O, that is, Var(ξ_i). They will appear often in what follows.
22 Univariate priors
For the priors π_i(ξ_i), there are several possibilities:
1. Jeffreys recommended the Cauchy(0, b_i) density
π_i^C(ξ_i) = ∫₀^∞ N( ξ_i | 0, b_i/λ_i ) Ga( λ_i | 1/2, 1/2 ) dλ_i,
where (b_i)⁻¹ = (d_i)⁻¹ / n_i is the unit information for ξ_i, with n_i being the effective sample size for ξ_i (Kass & Wasserman, 1995). Note: for i.i.d. scenarios, d_i is like σ_i²/n_i, so b_i is like the variance of one observation.
2. Use other sensible (objective) testing priors, like the intrinsic prior.
3. The goal is to get closed-form expressions: we use priors proposed in Berger (1985), which are very close to both the Cauchy and the intrinsic priors and produce closed-form expressions for m(x) for normal likelihoods.
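The scale-mixture representation in item 1 can be checked numerically. Assuming b_i plays the role of a variance-like (squared-scale) parameter, consistent with the note that b_i is "like the variance of one observation", the mixture ∫ N(ξ | 0, b/λ) Ga(λ | 1/2, 1/2) dλ should match a Cauchy density with scale √b:

```python
import math

def norm_pdf(x, var):
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gamma_pdf(lam, shape, rate):
    return (rate ** shape) * lam ** (shape - 1) * math.exp(-rate * lam) / math.gamma(shape)

def mixture_density(x, b, steps=60000, umax=30.0):
    # Integrate N(x | 0, b/lam) * Ga(lam | 1/2, 1/2) dlam by midpoint rule,
    # substituting lam = u^2 to remove the integrable singularity at lam = 0.
    du = umax / steps
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * du
        lam = u * u
        total += norm_pdf(x, b / lam) * gamma_pdf(lam, 0.5, 0.5) * 2.0 * u * du
    return total

b, x = 2.0, 0.7
cauchy = 1.0 / (math.pi * math.sqrt(b) * (1.0 + x * x / b))  # Cauchy with scale sqrt(b)
print(mixture_density(x, b), cauchy)  # the two values agree closely
```

This is the classical fact that a Cauchy is a gamma scale mixture of normals (equivalently, a t with 1 degree of freedom).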
23 Berger's robust priors
For b_i ≥ d_i, the prior is given by
π_i^R(ξ_i) = ∫₀¹ N( ξ_i | 0, (d_i + b_i)/(2λ_i) − d_i ) dλ_i.
- Introduced by Berger in the context of robust analyses.
- No closed-form expression for the prior itself; not widely used in model selection.
- Behaves like a Cauchy prior with parameter b_i.
- Gives a closed-form expression for the approximation to m(x):
m(x) ≈ e^{l(θ̂)} (2π)^{p/2} |Î|^{−1/2} ∏_{i=1}^{p₁} (1 − e^{−ξ̂_i²/(d_i + b_i)}) / ( √(2π(d_i + b_i)) · 2 ξ̂_i²/(d_i + b_i) ),
where (d_i)⁻¹ is the global information about ξ_i, and (b_i)⁻¹ is the unit information.
26 Proposal for GBIC
Finally, we have, as the approximation to 2 log m(x),
GBIC = 2l(θ̂) + (p − p₁) log(2π) − log |Î₂₂| − Σ_{i=1}^{p₁} log(1 + n_i) + 2 Σ_{i=1}^{p₁} log( (1 − e^{−v_i}) / (2 v_i) ),
where v_i = ξ̂_i² / (b_i + d_i). (The error as an approximation to 2 log m(x) is o_n(1); exact for normals.)
If no parameters are integrated out (so that p₁ = p), then
GBIC = 2l(θ̂) − Σ_{i=1}^{p} log(1 + n_i) + 2 Σ_{i=1}^{p} log( (1 − e^{−v_i}) / (2 v_i) ).
If all n_i = n, the dominant terms in the expression (as n → ∞) are 2l(θ̂) − p log n. The third term (the constant ignored by Schwarz) is negative.
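A sketch of the p₁ = p case as code (illustrative; the inputs — the maximized log-likelihood, the orthogonalized estimates ξ̂_i, their variances d_i, and the effective sample sizes n_i — are assumed supplied by standard software, and the unit-information scale is taken as b_i = n_i·d_i per the earlier slide):

```python
import math

def gbic(loglik, xi_hat, d, n_eff):
    # GBIC (p1 = p case) = 2*l(theta_hat) - sum_i log(1 + n_i)
    #                      + 2 * sum_i log((1 - exp(-v_i)) / (2*v_i)),
    # with v_i = xi_hat_i^2 / (d_i + b_i) and b_i = n_i * d_i.
    total = 2.0 * loglik
    for xh, di, ni in zip(xi_hat, d, n_eff):
        bi = ni * di
        v = xh * xh / (di + bi)
        total -= math.log(1.0 + ni)
        if v < 1e-12:
            total += 2.0 * math.log(0.5)   # limit of (1 - e^{-v}) / (2v) as v -> 0
        else:
            total += 2.0 * math.log((1.0 - math.exp(-v)) / (2.0 * v))
    return total

loglik = -120.0
val = gbic(loglik, xi_hat=[1.2, 0.3], d=[0.04, 0.05], n_eff=[25.0, 25.0])
bic = 2.0 * loglik - 2.0 * math.log(25.0)
print(val, bic)  # here GBIC < BIC: log(1 + n_i) > log(n) and the prior term is negative
```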
27 Comparison of GBIC and BIC
Compare GBIC and BIC:
GBIC = 2l(θ̂) − Σ_{i=1}^{p} log(1 + n_i) − k_π
BIC = 2l(θ̂) − p log n,
where k_π > 0 is the ignored constant from π. Intuitively, BIC makes two "mistakes":
- It penalizes too much with p log n (note that n_i ≤ n).
- It penalizes too little by ignoring the prior (the third term, a new penalization, is absent).
The second mistake actually helps the first one (in this sense, BIC is "lucky"), but often the first mistake completely overwhelms this little help.
28 GBIC*: a modification more favorable to complex models
Priors centered at zero may penalize complex models too much; priors centered around the MLE (Raftery, 1996) may favor complex models too much. An attractive compromise: Cauchy-type priors centered at zero, but with the scales b_i chosen so as to maximize the marginal likelihood of the model. This is an empirical Bayes alternative, popularized in the robust Bayesian literature (Berger, 1994).
29 GBIC*: modification more favorable to complex models (continued)
The b_i that maximizes m(x) can easily be seen to be
b̂_i = max{ d_i, ξ̂_i²/w − d_i },
where w satisfies e^w = 1 + 2w, i.e., w ≈ 1.3.
Problem: when ξ_i = 0, this empirical Bayes choice can result in inconsistency as n_i → ∞. Solution: prevent b̂_i from being less than n_i d_i. This results, after an accurate approximation (for fixed p), in
GBIC* = 2l(θ̂) − Σ_{i=1}^{p} log(1 + n_i) − Σ_{i=1}^{p} log( 3v_i + 2 max{v_i, 1} ),
again a very simple expression.
30 Effective sample size
Calculate I = (I_jk), the expected information matrix, at the MLE:
I_jk = E_{X|θ}[ −∂² log f(X | θ) / ∂θ_j ∂θ_k ] evaluated at θ = θ̂.
For each case x_i, compute I_i = (I_{i,jk}), having (j, k) entry
I_{i,jk} = E_{X_i|θ}[ −∂² log f_i(X_i | θ) / ∂θ_j ∂θ_k ] evaluated at θ = θ̂.
Define I_jj = Σ_{i=1}^n I_{i,jj}, and define n_j as follows:
- Define information weights w_ij = I_{i,jj} / Σ_{k=1}^n I_{k,jj}.
- Define the effective sample size for θ_j as
n_j = I_jj / Σ_{i=1}^n w_ij I_{i,jj} = (I_jj)² / Σ_{i=1}^n (I_{i,jj})².
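For a single parameter θ_j this reduces to a one-liner; the sketch below (not from the slides) checks two sanity cases: equal per-case information recovers the actual sample size, and a Cox-mixture-like information pattern gives roughly n/2.

```python
def effective_sample_size(per_case_info):
    # n_j = (sum_i I_i)^2 / sum_i I_i^2, where I_i is case i's information
    # about the parameter (the I_{i,jj} of the slides for a fixed j).
    s = sum(per_case_info)
    s2 = sum(I * I for I in per_case_info)
    return s * s / s2

# Equal information per case: recovers the actual sample size.
print(effective_sample_size([2.5] * 100))        # 100.0

# Cox-mixture flavor: half the cases carry almost no information.
infos = [1.0] * 50 + [0.001] * 50
print(effective_sample_size(infos))              # close to 50
```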
33 Effective sample size: continued
Intuitively, Σ_i w_ij I_{i,jj} is a weighted measure of the information per observation, and dividing the total information about θ_j by this information per case seems plausible as an effective sample size.
This is just a tentative proposal; it works for most of the examples we discussed. Research in progress.
34 A small linear regression simulation
Y_{n×1} = X_{n×8} β + ε_{n×1}, ε ~ N(0, σ² I),
β = (3, 1.5, 0, 0, 2, 0, 0, 0)ᵗ, and the rows of the design matrix X are drawn from N(0, Σ_x), where Σ_x has 1's on the diagonal and ρ in every off-diagonal entry.
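One simple way to simulate this design (a sketch, not the authors' code) uses the fact that X_j = √ρ·Z₀ + √(1−ρ)·Z_j, with Z₀, Z₁, ... independent standard normals, has exactly the compound-symmetric covariance Σ_x:

```python
import math
import random

random.seed(7)
n, p, rho, sigma = 5000, 8, 0.5, 1.0
beta = [3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]

rows, ys = [], []
for _ in range(n):
    z0 = random.gauss(0.0, 1.0)  # shared factor induces correlation rho
    x = [math.sqrt(rho) * z0 + math.sqrt(1.0 - rho) * random.gauss(0.0, 1.0)
         for _ in range(p)]
    rows.append(x)
    ys.append(sum(b * v for b, v in zip(beta, x)) + random.gauss(0.0, sigma))

# Empirical correlation between the first two predictors.
c01 = sum(r[0] * r[1] for r in rows) / n
v0 = sum(r[0] ** 2 for r in rows) / n
v1 = sum(r[1] ** 2 for r in rows) / n
print(c01 / math.sqrt(v0 * v1))  # close to rho = 0.5
```

Each criterion in the comparison is then scored over the 2⁸ candidate subsets of predictors on data generated this way.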
35 Linear regression example
[Figure: percentage of true models selected under different situations. Methods compared: abf2, aic, bic, Gbic, Gbic*, hbf; panels indexed by σ ∈ {1, 2, 3}, n (e.g. n = 30), and ρ ∈ {0, .2, .5}.]
36 Linear regression example
[Figure: average number of 0's that were determined to be 0 (true model); same methods and panels.]
37 Linear regression example
[Figure: average number of non-zeroes that were determined to be zero (small is good); same methods and panels.]
38 Conclusions and future work
BIC can be improved by:
- keeping, rather than ignoring, the prior;
- using a sensible, objective prior producing closed-form expressions;
- carefully acknowledging the effective sample size for each parameter to guide the choice of the scale of the priors.
Work in progress:
- A satisfactory definition of effective sample size (if possible); the current proposal works in most cases.
- Extension to the non-independent case.
39 References
1. Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edition. Springer-Verlag.
2. Berger, J.O., Ghosh, J.K., and Mukhopadhyay (2003). Approximations and consistency of Bayes factors as model dimension grows. Journal of Statistical Planning and Inference 112.
3. Berger, J., Pericchi, L., and Varshavsky, J. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya A 60.
4. Bollen, K., Ray, S., and Zavisca, J. (2007). A scaled unit information prior approximation to the Bayes factor. In preparation.
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationLikelihood inference in the presence of nuisance parameters
Likelihood inference in the presence of nuisance parameters Nancy Reid, University of Toronto www.utstat.utoronto.ca/reid/research 1. Notation, Fisher information, orthogonal parameters 2. Likelihood inference
More informationIntroduction to Bayesian Methods
Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative
More informationEcon 2140, spring 2018, Part IIa Statistical Decision Theory
Econ 2140, spring 2018, Part IIa Maximilian Kasy Department of Economics, Harvard University 1 / 35 Examples of decision problems Decide whether or not the hypothesis of no racial discrimination in job
More informationEstimation and Model Selection in Mixed Effects Models Part I. Adeline Samson 1
Estimation and Model Selection in Mixed Effects Models Part I Adeline Samson 1 1 University Paris Descartes Summer school 2009 - Lipari, Italy These slides are based on Marc Lavielle s slides Outline 1
More informationNancy Reid SS 6002A Office Hours by appointment
Nancy Reid SS 6002A reid@utstat.utoronto.ca Office Hours by appointment Problems assigned weekly, due the following week http://www.utstat.toronto.edu/reid/4508s16.html Various types of likelihood 1. likelihood,
More informationStatistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach
Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score
More informationStatistical Inference
Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA. Asymptotic Inference in Exponential Families Let X j be a sequence of independent,
More informationNancy Reid SS 6002A Office Hours by appointment
Nancy Reid SS 6002A reid@utstat.utoronto.ca Office Hours by appointment Light touch assessment One or two problems assigned weekly graded during Reading Week http://www.utstat.toronto.edu/reid/4508s14.html
More informationHierarchical Models & Bayesian Model Selection
Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or
More informationMaximum Likelihood Estimation
Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More informationSTAT 425: Introduction to Bayesian Analysis
STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 1) Fall 2017 1 / 10 Lecture 7: Prior Types Subjective
More informationAsymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands
Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department
More informationThe Jeffreys Prior. Yingbo Li MATH Clemson University. Yingbo Li (Clemson) The Jeffreys Prior MATH / 13
The Jeffreys Prior Yingbo Li Clemson University MATH 9810 Yingbo Li (Clemson) The Jeffreys Prior MATH 9810 1 / 13 Sir Harold Jeffreys English mathematician, statistician, geophysicist, and astronomer His
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationLecture 2: Statistical Decision Theory (Part I)
Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical
More informationLinear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks.
Linear Mixed Models One-way layout Y = Xβ + Zb + ɛ where X and Z are specified design matrices, β is a vector of fixed effect coefficients, b and ɛ are random, mean zero, Gaussian if needed. Usually think
More informationStat 451 Lecture Notes Numerical Integration
Stat 451 Lecture Notes 03 12 Numerical Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 5 in Givens & Hoeting, and Chapters 4 & 18 of Lange 2 Updated: February 11, 2016 1 / 29
More informationg-priors for Linear Regression
Stat60: Bayesian Modeling and Inference Lecture Date: March 15, 010 g-priors for Linear Regression Lecturer: Michael I. Jordan Scribe: Andrew H. Chan 1 Linear regression and g-priors In the last lecture,
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2
Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationLecture 15. Hypothesis testing in the linear model
14. Lecture 15. Hypothesis testing in the linear model Lecture 15. Hypothesis testing in the linear model 1 (1 1) Preliminary lemma 15. Hypothesis testing in the linear model 15.1. Preliminary lemma Lemma
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationHigh-dimensional regression modeling
High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making
More informationLatent Variable Models and EM Algorithm
SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationFigure 36: Respiratory infection versus time for the first 49 children.
y BINARY DATA MODELS We devote an entire chapter to binary data since such data are challenging, both in terms of modeling the dependence, and parameter interpretation. We again consider mixed effects
More informationLecture 1: Introduction
Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One
More informationVariable selection for model-based clustering
Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition
More information5.2 Expounding on the Admissibility of Shrinkage Estimators
STAT 383C: Statistical Modeling I Fall 2015 Lecture 5 September 15 Lecturer: Purnamrita Sarkar Scribe: Ryan O Donnell Disclaimer: These scribe notes have been slightly proofread and may have typos etc
More informationGibbs Sampling in Linear Models #2
Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling
More informationRecap. Vector observation: Y f (y; θ), Y Y R m, θ R d. sample of independent vectors y 1,..., y n. pairwise log-likelihood n m. weights are often 1
Recap Vector observation: Y f (y; θ), Y Y R m, θ R d sample of independent vectors y 1,..., y n pairwise log-likelihood n m i=1 r=1 s>r w rs log f 2 (y ir, y is ; θ) weights are often 1 more generally,
More informationECE531 Lecture 10b: Maximum Likelihood Estimation
ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So
More informationModel-based cluster analysis: a Defence. Gilles Celeux Inria Futurs
Model-based cluster analysis: a Defence Gilles Celeux Inria Futurs Model-based cluster analysis Model-based clustering (MBC) consists of assuming that the data come from a source with several subpopulations.
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationGibbs Sampling in Endogenous Variables Models
Gibbs Sampling in Endogenous Variables Models Econ 690 Purdue University Outline 1 Motivation 2 Identification Issues 3 Posterior Simulation #1 4 Posterior Simulation #2 Motivation In this lecture we take
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 3 Stochastic Gradients, Bayesian Inference, and Occam s Razor https://people.orie.cornell.edu/andrew/orie6741 Cornell University August
More informationCS 195-5: Machine Learning Problem Set 1
CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More informationTesting Restrictions and Comparing Models
Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationLecture 5: Conventional Model Selection Priors
Lecture 5: Conventional Model Selection Priors Susie Bayarri University of Valencia CBMS Conference on Model Uncertainty and Multiplicity July 23-28, 2012 Outline The general linear model and Orthogonalization
More informationLinear and logistic regression
Linear and logistic regression Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Linear and logistic regression 1/22 Outline 1 Linear regression 2 Logistic regression 3 Fisher discriminant analysis
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationDavid Giles Bayesian Econometrics
9. Model Selection - Theory David Giles Bayesian Econometrics One nice feature of the Bayesian analysis is that we can apply it to drawing inferences about entire models, not just parameters. Can't do
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationCOS513 LECTURE 8 STATISTICAL CONCEPTS
COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions
More informationStatistics & Data Sciences: First Year Prelim Exam May 2018
Statistics & Data Sciences: First Year Prelim Exam May 2018 Instructions: 1. Do not turn this page until instructed to do so. 2. Start each new question on a new sheet of paper. 3. This is a closed book
More informationInvariant HPD credible sets and MAP estimators
Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in
More informationStandard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j
Standard Errors & Confidence Intervals β β asy N(0, I( β) 1 ), where I( β) = [ 2 l(β, φ; y) ] β i β β= β j We can obtain asymptotic 100(1 α)% confidence intervals for β j using: β j ± Z 1 α/2 se( β j )
More informationStat 451 Lecture Notes Monte Carlo Integration
Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:
More informationPhysics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester
Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability
More information