Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models


Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models

Dimitris Fouskakis, Department of Mathematics, School of Applied Mathematical and Physical Sciences, National Technical University of Athens, Athens, Greece.

Joint work with: Ioannis Ntzoufras (Department of Statistics, Athens University of Economics and Business, Athens, Greece) and David Draper (Department of Applied Mathematics and Statistics, University of California, Santa Cruz, USA).

Presentation is available at: fouskakis/presentations/irvine/presentation Irvine.pdf.

Synopsis

1. Bayesian Variable Selection
2. Prior Specification
3. Expected-Posterior Priors
4. Motivation
5. Power-Expected-Posterior (PEP) Methodology
6. MCMC for Sampling from the Posterior
7. Marginal Likelihood Computation
8. Simulated Example
9. Real-Life Example

1 Bayesian Variable Selection

Within the Bayesian framework, identifying the best set of predictors among the M = {0, 1}^p competitors is equivalent (assuming a zero-one loss function) to finding the model m with the highest posterior model probability, defined as

f(m | y) = f(y | m) f(m) / Σ_{m_l ∈ M} f(y | m_l) f(m_l),

where f(y | m) is the marginal likelihood under model m and f(m) is the prior probability of model m.

The marginal likelihood in the above calculation can be further expanded to include the effect of the model parameters:

f(y | m) = ∫ f(y | θ_m, m) f(θ_m | m) dθ_m,

where f(y | θ_m, m) is the likelihood under model m with parameters θ_m and f(θ_m | m) is the prior distribution of the model parameters given model m.
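For concreteness, here is a tiny numerical sketch (ours, not part of the original slides) of how posterior model probabilities follow from log marginal likelihoods and prior model probabilities; the numerical values are hypothetical.

```python
# Posterior model probabilities from (log) marginal likelihoods and prior probabilities.
import numpy as np
from scipy.special import logsumexp

def posterior_model_probs(log_marglik, log_prior):
    """log_marglik, log_prior: arrays of log f(y|m) and log f(m) over the model space."""
    log_post = log_marglik + log_prior
    return np.exp(log_post - logsumexp(log_post))   # normalize on the log scale

# Hypothetical values for three competing models under a uniform model prior.
print(posterior_model_probs(np.array([-120.3, -118.9, -125.0]), np.log(np.ones(3) / 3)))
```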

Posterior odds and Bayes factors

Based on the above posterior model probabilities, the pairwise comparison of any two models m_k and m_l is given by the posterior odds (PO)

PO_{m_k, m_l} ≡ f(m_k | y) / f(m_l | y) = [ f(y | m_k) / f(y | m_l) ] × [ f(m_k) / f(m_l) ] = B_{m_k, m_l} × O_{m_k, m_l},

which is a function of the Bayes factor B_{m_k, m_l} and the prior odds O_{m_k, m_l}. The posterior model probability can then be expressed entirely in terms of Bayes factors and prior odds as

f(m | y) = ( Σ_{m_l ∈ M} PO_{m_l, m} )^{-1} = [ Σ_{m_l ∈ M} B_{m_l, m} O_{m_l, m} ]^{-1}.

2 Prior Specification

Prior on the model space:

- Uniform prior on the model space: f(m) ∝ I[m ∈ M]. This prior is equivalent to assuming that each covariate has prior probability 0.5 of entering the model.
- Beta-binomial hierarchical prior on the model size: W | θ ~ Bin(p, θ) and θ ~ Beta(α, β). If α = β = 1 we obtain a uniform prior on the model size.
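A small sketch (ours) contrasting the two model-space priors above: the prior probability of one specific model with k out of p covariates under the uniform prior and under the beta-binomial prior on the model size (α = β = 1 gives a uniform prior on the size).

```python
import numpy as np
from scipy.special import betaln

def log_prior_uniform(k, p):
    return -p * np.log(2.0)                       # every one of the 2^p models gets 2^{-p}

def log_prior_betabinomial(k, p, alpha=1.0, beta=1.0):
    # integrate theta^k (1-theta)^(p-k) over theta ~ Beta(alpha, beta)
    return betaln(alpha + k, beta + p - k) - betaln(alpha, beta)

p = 15
for k in (0, 5, 10, 15):
    print(k, log_prior_uniform(k, p), log_prior_betabinomial(k, p))
```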

Prior on model parameters

- Proper prior distributions (conjugate if available). For example, in the case of Gaussian regression models a popular choice is Zellner's g-prior (Zellner, 1986). Main issue: specification of the hyperparameter g that controls the prior variance.
  - Large values of g lead to Bartlett's paradox (e.g., Bartlett, 1957, Biometrika).
  - For g = n we obtain the unit-information prior (Kass & Wasserman, 1995, JASA).
  - Beta prior on g/(g+1): mixtures of g-priors (e.g., Liang et al., 2008, JASA).
- Non-local priors (e.g., Johnson and Rossell, 2010, JRSS B). They have zero mass for values of the parameter under the null hypothesis. Products of independent normal moment priors.
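A quick numerical illustration (ours) of Bartlett's paradox: under the standard closed form for the g-prior Bayes factor against the null model (as in Liang et al., 2008, assuming an intercept with a Jeffreys prior on the intercept and error variance), the evidence for a model with genuine signal collapses towards the null as g grows; the simulated data are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = rng.standard_normal((n, k))
y = 1.0 + X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

Xc, yc = X - X.mean(axis=0), y - y.mean()          # center out the intercept
beta = np.linalg.lstsq(Xc, yc, rcond=None)[0]
r2 = 1.0 - np.sum((yc - Xc @ beta) ** 2) / np.sum(yc ** 2)

for g in (n, 1e4, 1e8, 1e12):                      # g = n is the unit-information choice
    log_bf = 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1 - r2))
    print(f"g = {g:.0e}:  log BF(model vs null) = {log_bf:.2f}")   # decreases as g grows
```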

Prior on model parameters (cont.)

- Shrinkage priors, e.g. the Bayesian lasso (Park and Casella, 2008, JASA), the horseshoe prior (Carvalho et al., 2010, Biometrika), etc.
- Improper (reference) priors (defined up to arbitrary constants). Objectivity; Jeffreys prior. Bayes factors cannot be determined.
- Priors defined via imaginary data:
  - Power prior (Ibrahim & Chen, 2000, Statistical Science).
  - Expected-posterior prior (Pérez & Berger, 2002, Biometrika).

3 Expected-Posterior Priors

Expected-posterior priors (EPPs) are defined as the posterior distribution of the parameter vector of the model under consideration, averaged over all possible imaginary samples y* coming from a suitable predictive distribution m*(y*). Hence the EPP for the parameters of any model m_l ∈ M, with M denoting the model space, is

π_l^E(θ_l) = ∫ f(θ_l | y*, m_l) m*(y*) dy*
           = ∫ π_l^N(θ_l | y*) m*(y*) dy*
           = ∫ [ f(y* | θ_l, m_l) π_l^N(θ_l) / m_l^N(y*) ] m*(y*) dy*,   (1)

where m_l^N(y*) = ∫ f(y* | θ_l, m_l) π_l^N(θ_l) dθ_l.
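A toy simulation sketch (ours, with an illustrative conjugate setup not taken from the slides) of the EPP construction: draw an imaginary sample y* from the reference model's predictive and then draw the parameter from its posterior given y*. Here y | θ ~ N(θ, 1) with a flat baseline prior on θ, and the reference model is N(0, 1); the resulting mixture is the EPP for θ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_star = 5                                          # size of the imaginary sample
draws = []
for _ in range(10_000):
    y_star = rng.normal(0.0, 1.0, size=n_star)      # y* ~ m*(y*) (reference model)
    theta = rng.normal(y_star.mean(), 1 / np.sqrt(n_star))  # theta | y*, flat baseline prior
    draws.append(theta)
print(np.var(draws))                                # close to 2/n_star = 0.4 in this toy case
```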

Expected-posterior priors (cont.)

The baseline prior π_l^N(θ_l) can be proper or improper. If improper, Bayes factors can still be defined.

Different choices of m* include:

- Base-model approach. Select a reference model m_0 for the training sample and define m*(y*) = m_0^N(y*) ≡ f(y* | m_0) to be the prior predictive distribution, evaluated at y*, of the reference model m_0 under the baseline prior π_0^N(θ_0). The reference model should be at least as simple as the other competing models; therefore a reasonable choice is to take m_0 to be the constant model (with no predictors).
- Use of the empirical distribution of the data for m*.
- Subjective specification of m*.

Under the base-model approach the resulting expected-posterior Bayes factors become essentially identical to the arithmetic intrinsic Bayes factors (IBF) of Berger and Pericchi (1996, JASA).

Imaginary data

We need to specify the imaginary responses y* (of size n*) and the imaginary design matrix X* (of size n* × d). The EPP then depends on X* but not on y*, since the latter is integrated out.

How large should such training samples be, i.e. how large should n* be? Minimal training sample: with p covariates and unknown error variance, we set n* = (p + 1) + 1 = d + 1. Then, to specify X*, we randomly select a sub-sample of the rows of the original matrix X (n* should be < n). When the sample size is small in comparison to the number of covariates, working with a minimal training sample can result in an influential prior.

How should the rows of X* be chosen?

- With (n, p) = (100, 50) there are about 10^29 possible choices.
- One option is the arithmetic mean of the Bayes factors over all possible X*.
- Alternatively, take a random sample from the set of all possible training samples, but this adds an extraneous layer of Monte-Carlo noise to the model-comparison process.
- If the data derive from a highly structured situation, such as a complete randomized-blocks experiment, any choice of a small part of the data to act as a training sample would be somewhat untypical.
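As a quick check of the combinatorial count mentioned above (a one-liner, ours):

```python
from math import comb

n, p = 100, 50
n_star = (p + 1) + 1          # minimal training-sample size d + 1 = p + 2
print(comb(n, n_star))        # about 9.3e28 possible training samples, i.e. roughly 10**29
```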

4 Motivation

1. Produce a less influential expected-posterior prior.
2. Diminish the effect of training samples on the expected-posterior prior methodology.

We combine ideas from the power-prior approach and the unit-information prior approach to obtain the power-expected-posterior (PEP) prior: we raise the likelihood involved in the expected-posterior prior distribution to the power 1/δ. For δ = n* this produces a prior information content equivalent to one data point.

The method is sufficiently insensitive to the size of n* that one may take n* = n and dispense with training samples altogether; this both removes the instability arising from the random choice of training samples and greatly reduces computing time.

Specification

We focus on variable-selection problems for Gaussian linear models. We consider two models m_l (l = 0, 1) with parameters θ_l = (β_l, σ_l²) and likelihood specified by

Y | X_l, β_l, σ_l², m_l ~ N_n(X_l β_l, σ_l² I_n),

where Y = (Y_1, ..., Y_n) is a vector containing the responses for all subjects, X_l is an n × d_l design matrix containing the values of the explanatory variables in its columns, I_n is the n × n identity matrix, β_l is a vector of length d_l summarizing the effects of the covariates on the response Y, and σ_l² is the error variance.

Two different baseline prior choices:

- Independence Jeffreys prior, i.e. π_l^N(β_l, σ_l² | m_l) ∝ σ_l^{-2}.
- Zellner's g-prior.

Base-model approach; the null model is the reference model.

5 Power-Expected-Posterior (PEP) Methodology

The PEP prior is

π_l^{PE}(β_l, σ_l² | X_l*, δ) = ∫ [ π_l^N(β_l, σ_l² | X_l*) f_{N_{n*}}(y*; X_l* β_l, δ σ_l² I_{n*}) / m_l^N(y* | X_l*, δ) ] m_0^N(y* | X_0*, δ) dy*.

The posterior under the PEP prior becomes

π_l^{PE}(β_l, σ_l² | y; X_l, X_l*, δ) ∝ ∫ f(β_l, σ_l² | y, y*, m_l; X_l, X_l*, δ) m_l^N(y | y*; X_l, X_l*, δ) m_0^N(y* | X_0*, δ) dy*,

where

- f(β_l, σ_l² | y, y*, m_l; X_l, X_l*, δ) is the posterior using data y and design matrix X_l under the prior f(β_l, σ_l² | y*, m_l; X_l*, δ) (i.e. the posterior under the powered normal likelihood and the baseline prior);
- m_l^N(y | y*; X_l, X_l*, δ) is the marginal likelihood of model m_l, using data y and design matrix X_l, under the prior f(β_l, σ_l² | y*, m_l; X_l*, δ).

PEP under the Jeffreys baseline

Baseline prior:

π_l^N(β_l, σ_l² | X_l*) = c_l / σ_l²,   (2)

where c_l is an unknown normalizing constant.

J-PEP prior:

π_l^{PE}(β_l, σ_l² | X_l*, δ) = ∫ f_{N_{d_l}}(β_l; β̂_l*, δ (X_l*^T X_l*)^{-1} σ_l²) f_{IG}(σ_l²; (n* − d_l)/2, RSS_l*/(2δ)) m_0^N(y* | X_0*, δ) dy*,

with β̂_l* = (X_l*^T X_l*)^{-1} X_l*^T y*, RSS_l* = y*^T ( I_{n*} − X_l* (X_l*^T X_l*)^{-1} X_l*^T ) y*, and

m_l^N(y* | X_l*, δ) = c_l π^{-(n* − d_l)/2} |X_l*^T X_l*|^{-1/2} Γ((n* − d_l)/2) (RSS_l*)^{-(n* − d_l)/2}.

PEP under the Jeffreys baseline (cont.)

The posterior under the J-PEP prior:

π_l^{PE}(β_l, σ_l² | y; X_l, X_l*, δ) ∝ ∫ f_{N_{d_l}}(β_l; β̃_N, Σ̃_N σ_l²) f_{IG}(σ_l²; ã_l^N, b̃_l^N) m_l^N(y | y*; X_l, X_l*, δ) m_0^N(y* | X_0*, δ) dy*,

with

β̃_N = Σ̃_N (X_l^T y + δ^{-1} X_l*^T y*),  Σ̃_N = [X_l^T X_l + δ^{-1} X_l*^T X_l*]^{-1},

ã_l^N = (n + n* − d_l)/2,  b̃_l^N = (SS_l^N + δ^{-1} RSS_l*)/2,  where

SS_l^N = (y − X_l β̂_l*)^T [ I_n + δ X_l (X_l*^T X_l*)^{-1} X_l^T ]^{-1} (y − X_l β̂_l*),

while

m_l^N(y | y*; X_l, X_l*, δ) = f_{St_n}( y; n* − d_l, X_l β̂_l*, (RSS_l* / (δ (n* − d_l))) [ I_n + δ X_l (X_l*^T X_l*)^{-1} X_l^T ] ).

PEP under the Zellner's g-prior baseline

Baseline prior:

π_l^N(β_l | σ_l²; X_l*) = f_{N_{d_l}}(β_l; 0, g (X_l*^T X_l*)^{-1} σ_l²)  and  π_l^N(σ_l²) = f_{IG}(σ_l²; a_l, b_l).   (3)

Z-PEP prior:

π_l^{PE}(β_l, σ_l² | X_l*, δ) = ∫ f_{N_{d_l}}(β_l; w β̂_l*, w δ (X_l*^T X_l*)^{-1} σ_l²) f_{IG}(σ_l²; a_l + n*/2, b_l + SS_l*/2) m_0^N(y* | X_0*, δ) dy*.

Here w = g/(g + δ), β̂_l* = (X_l*^T X_l*)^{-1} X_l*^T y*, SS_l* = y*^T Λ_l* y*, and

m_l^N(y* | X_l*, δ) = f_{St_{n*}}( y*; 2 a_l, 0, (b_l / a_l) Λ_l*^{-1} ),

with

Λ_l* = δ^{-1} [ I_{n*} − (g/(g + δ)) X_l* (X_l*^T X_l*)^{-1} X_l*^T ],  so that  Λ_l*^{-1} = δ I_{n*} + g X_l* (X_l*^T X_l*)^{-1} X_l*^T.

PEP under the Zellner's g-prior baseline (cont.)

Theorem 1. Under the baseline prior setup (3), the power-expected-posterior prior mean of β_l is equal to zero (i.e. E[β_l] = 0), while the power-expected-posterior prior variance is given by

V[β_l] = { [ δ w / (a_l − 1 + n*/2) ] [ b_l + (b_0 / (2 (a_0 − 1))) tr(Λ_l* Λ_0*^{-1}) ] I_{d_l} + w² (b_0 / (a_0 − 1)) (X_l*^T X_l*)^{-1} X_l*^T Λ_0*^{-1} X_l* } (X_l*^T X_l*)^{-1},

where tr(A) is the trace of matrix A.

Theorem 2. Under the baseline prior setup (3), the power-expected-posterior prior mean of σ_l² is given by

E[σ_l²] = (b_0 / (a_0 − 1)) [ tr(Λ_l* Λ_0*^{-1})/2 + (a_0 − 1) b_l / b_0 ] / (n*/2 + a_l − 1),

while the power-expected-posterior prior variance is given by

V[σ_l²] = E[σ_l⁴] − (E[σ_l²])²
 = [ b_l² + b_l b_0 tr(Λ_l* Λ_0*^{-1}) / (a_0 − 1) + b_0² { 2 tr(Λ_l* Λ_0*^{-1} Λ_l* Λ_0*^{-1}) + tr(Λ_l* Λ_0*^{-1})² } / (4 (a_0 − 1)(a_0 − 2)) ] / [ (n*/2 + a_l − 1)(n*/2 + a_l − 2) ]
 − [ (b_0 / (a_0 − 1)) ( tr(Λ_l* Λ_0*^{-1})/2 + (a_0 − 1) b_l / b_0 ) / (n*/2 + a_l − 1) ]².

PEP under the Zellner's g-prior baseline (cont.)

The posterior under the Z-PEP prior:

π_l^{PE}(β_l, σ_l² | y; X_l, X_l*, δ) ∝ ∫ f_{N_{d_l}}(β_l; β̃_N, Σ̃_N σ_l²) f_{IG}(σ_l²; ã_l^N, b̃_l^N) m_l^N(y | y*; X_l, X_l*, δ) m_0^N(y* | X_0*, δ) dy*,

with

β̃_N = Σ̃_N (X_l^T y + δ^{-1} X_l*^T y*),  Σ̃_N = [X_l^T X_l + (w δ)^{-1} X_l*^T X_l*]^{-1},

ã_l^N = (n + n*)/2 + a_l,  b̃_l^N = (SS_l^N + SS_l*)/2 + b_l,  where

SS_l^N = (y − w X_l β̂_l*)^T [ I_n + δ w X_l (X_l*^T X_l*)^{-1} X_l^T ]^{-1} (y − w X_l β̂_l*),

while

m_l^N(y | y*; X_l, X_l*, δ) = f_{St_n}( y; 2 a_l + n*, w X_l β̂_l*, ((2 b_l + SS_l*)/(2 a_l + n*)) [ I_n + w δ X_l (X_l*^T X_l*)^{-1} X_l^T ] ).

Hyper-parameters

The parameter g in the normal baseline prior is set to δ n*, so that with δ = n* we use g = (n*)². This choice makes the g-prior contribute information equal to one data point within the posterior f(β_l, σ_l² | y*, m_l; X_l*, δ). In this manner, the entire PEP prior accounts for information equal to (1 + 1/δ) data points.

We set the parameters a_l and b_l in the inverse-gamma baseline prior to 0.01, yielding a baseline prior mean of 1 and variance of 100 (i.e., a large amount of prior uncertainty) for the precision parameter. (If strong prior information about the model parameters is available, Theorems 1 and 2 can be used to guide the choice of a_l and b_l.)

By comparing the posterior distributions under the two different baseline schemes described above, it is straightforward to show that they coincide for large g (and therefore for w → 1), a_l = −d_l/2 and b_l = 0.

6 MCMC for Sampling from the Posterior

We consider the following augmented conditional distribution:

f(β_l, σ_l², y* | y; X_l, X_l*, δ) ∝ f_{N_{d_l}}(β_l; β̃_N, Σ̃_N σ_l²) f_{IG}(σ_l²; ã_l^N, b̃_l^N) f(y* | y; X_l, X_l*, δ),

with

f(y* | y; X_l, X_l*, δ) ∝ m_l^N(y* | y; X_l, X_l*, δ) m_0^N(y* | X_0*, δ) / m_l^N(y* | X_l*, δ),

where

m_l^N(y* | y; X_l, X_l*, δ) = ∫ f(y* | β_l, σ_l², m_l; X_l*, δ) f(β_l, σ_l² | y, m_l; X_l) dβ_l dσ_l².   (4)

For the Zellner's baseline prior, (4) becomes equal to

f_{St_{n*}}( y*; 2 a_l + n, (g/(g+1)) X_l* β̂_l, ((2 b_l + SS_l)/(2 a_l + n)) [ δ I_{n*} + (g/(g+1)) X_l* (X_l^T X_l)^{-1} X_l*^T ] ),

where SS_l = y^T [ I_n − (g/(g+1)) X_l (X_l^T X_l)^{-1} X_l^T ] y, while, for the Jeffreys baseline prior, (4) becomes

f_{St_{n*}}( y*; n − d_l, X_l* β̂_l, (SS_l/(n − d_l)) [ δ I_{n*} + X_l* (X_l^T X_l)^{-1} X_l*^T ] ),

with the posterior sum of squares now given by SS_l = y^T [ I_n − X_l (X_l^T X_l)^{-1} X_l^T ] y and β̂_l = (X_l^T X_l)^{-1} X_l^T y.

MCMC for Sampling from the Posterior (cont.)

Using the previous expressions, we can specify the following MCMC scheme:

(1) Generate y* from f(y* | y; X_l, X_l*, δ).
(2) Generate σ_l² from IG(ã_l^N, b̃_l^N).
(3) Generate β_l from N_{d_l}(β̃_N, Σ̃_N σ_l²).

In step (1), we can generate the imaginary data y* using a Metropolis-Hastings step with proposal q(y*) = m_l^N(y* | y; X_l, X_l*, δ), given in (4), and acceptance probability

α = min[ 1, ( m_0^N(y*ʹ | X_0*, δ) / m_l^N(y*ʹ | X_l*, δ) ) × ( m_l^N(y* | X_l*, δ) / m_0^N(y* | X_0*, δ) ) ],

where y*ʹ denotes the proposed and y* the current value of the imaginary data.
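The following is a minimal sketch (ours, not the authors' code) of the three-step sampler above for the Z-PEP prior, assuming scipy ≥ 1.6 for scipy.stats.multivariate_t. Function names such as pep_gibbs and student_pred are our own; the imaginary design X* is taken equal to X (n* = n), as suggested earlier, and g, δ would be supplied by the caller (e.g. δ = n*, g = δ n*).

```python
import numpy as np
from scipy.stats import multivariate_t, invgamma

def student_pred(y_obs, X_obs, X_new, a, b, g, delta):
    """St predictive of imaginary data y* given observed y under the g-prior baseline
    (the Zellner case of equation (4)): returns (location, scale matrix, df)."""
    n, _ = X_obs.shape
    P = X_new @ np.linalg.solve(X_obs.T @ X_obs, X_new.T)    # X*(X'X)^{-1}X*'
    bhat = np.linalg.solve(X_obs.T @ X_obs, X_obs.T @ y_obs)
    w1 = g / (g + 1.0)
    SS = y_obs @ y_obs - w1 * y_obs @ X_obs @ bhat           # y'[I - w1 P_l]y
    df = 2 * a + n
    return w1 * X_new @ bhat, (2 * b + SS) / df * (delta * np.eye(X_new.shape[0]) + w1 * P), df

def pep_gibbs(y, X, X0, g, delta, a=0.01, b=0.01, iters=2000, rng=None):
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    Xs, X0s = X, X0                                 # imaginary designs (here n* = n)
    ns = Xs.shape[0]
    # Proposal q(y*) = m_l^N(y* | y) and the two prior-predictive St densities.
    loc_q, S_q, df_q = student_pred(y, X, Xs, a, b, g, delta)
    q = multivariate_t(loc=loc_q, shape=S_q, df=df_q)
    Lam_l_inv = delta * np.eye(ns) + g * Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)
    Lam_0_inv = delta * np.eye(ns) + g * X0s @ np.linalg.solve(X0s.T @ X0s, X0s.T)
    m_l = multivariate_t(loc=np.zeros(ns), shape=b / a * Lam_l_inv, df=2 * a)
    m_0 = multivariate_t(loc=np.zeros(ns), shape=b / a * Lam_0_inv, df=2 * a)
    w = g / (g + delta)
    Sigma = np.linalg.inv(X.T @ X + Xs.T @ Xs / (w * delta))  # Sigma_N (constant here)
    ystar = q.rvs(random_state=rng)
    draws = []
    for _ in range(iters):
        # Step 1: Metropolis-Hastings update of the imaginary data y*.
        prop = q.rvs(random_state=rng)
        log_alpha = (m_0.logpdf(prop) - m_l.logpdf(prop)) - (m_0.logpdf(ystar) - m_l.logpdf(ystar))
        if np.log(rng.uniform()) < log_alpha:
            ystar = prop
        # Steps 2-3: conditional (sigma^2, beta) draws given y* (Z-PEP formulas).
        beta_t = Sigma @ (X.T @ y + Xs.T @ ystar / delta)
        bhat_s = np.linalg.solve(Xs.T @ Xs, Xs.T @ ystar)
        SS_star = ystar @ (ystar - w * Xs @ bhat_s) / delta   # y*' Lambda_l* y*
        V = np.eye(n) + delta * w * X @ np.linalg.solve(Xs.T @ Xs, X.T)
        r = y - w * X @ bhat_s
        SS_N = r @ np.linalg.solve(V, r)
        sigma2 = invgamma.rvs(a + (n + ns) / 2.0, scale=b + (SS_N + SS_star) / 2.0,
                              random_state=rng)
        beta = rng.multivariate_normal(beta_t, sigma2 * Sigma)
        draws.append((beta, sigma2))
    return draws
```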

7 Marginal Likelihood Computation

The marginal likelihood of any model m_l ∈ M is given by

m_l^{PE}(y | X_l, X_l*, δ) = ∫∫ f(y | β_l, σ_l², m_l; X_l) π_l^{PE}(β_l, σ_l² | X_l*, δ) dβ_l dσ_l²
 = m_l^N(y | X_l) ∫ [ m_l^N(y* | y; X_l, X_l*, δ) / m_l^N(y* | X_l*, δ) ] m_0^N(y* | X_0*, δ) dy*.

In the above expression m_l^N(y | X_l) is the marginal likelihood of model m_l for the actual data under the baseline prior. Under the Zellner's baseline prior it is given by

m_l^N(y | X_l) = f_{St_n}( y; 2 a_l, 0, (b_l / a_l) [ I_n + g X_l (X_l^T X_l)^{-1} X_l^T ] ),

while under the Jeffreys baseline prior it is given by

m_l^N(y | X_l) = c_l π^{-(n − d_l)/2} |X_l^T X_l|^{-1/2} Γ((n − d_l)/2) RSS_l^{-(n − d_l)/2},

with RSS_l = y^T ( I_n − X_l (X_l^T X_l)^{-1} X_l^T ) y.

Marginal likelihood computation (cont.)

(1) Generate y*^{(t)} (t = 1, ..., T) from m_0^N(y* | X_0*, δ) and estimate the marginal likelihood by

m̂_l^{PE}(y | X_l, X_l*, δ) = m_l^N(y | X_l) [ (1/T) Σ_{t=1}^T m_l^N(y*^{(t)} | y; X_l, X_l*, δ) / m_l^N(y*^{(t)} | X_l*, δ) ].   (5)

(2) Generate y*^{(t)} (t = 1, ..., T) from m_0^N(y* | y; X_0, X_0*, δ) and estimate the marginal likelihood by

m̂_l^{PE}(y | X_l, X_l*, δ) = m_0^N(y | X_0) [ (1/T) Σ_{t=1}^T m_l^N(y | y*^{(t)}; X_l, X_l*, δ) / m_0^N(y | y*^{(t)}; X_0, X_0*, δ) ].   (6)

(3) Generate y*^{(t)} (t = 1, ..., T) from m_l^N(y* | y; X_l, X_l*, δ) and estimate the marginal likelihood by

m̂_l^{PE}(y | X_l, X_l*, δ) = m_l^N(y | X_l) [ (1/T) Σ_{t=1}^T m_0^N(y*^{(t)} | X_0*, δ) / m_l^N(y*^{(t)} | X_l*, δ) ].   (7)

(4) Generate y*^{(t)} (t = 1, ..., T) from m_l^N(y* | y; X_l, X_l*, δ) and estimate the marginal likelihood by

m̂_l^{PE}(y | X_l, X_l*, δ) = m_0^N(y | X_0) [ (1/T) Σ_{t=1}^T ( m_l^N(y | y*^{(t)}; X_l, X_l*, δ) m_0^N(y*^{(t)} | y; X_0, X_0*, δ) ) / ( m_0^N(y | y*^{(t)}; X_0, X_0*, δ) m_l^N(y*^{(t)} | y; X_l, X_l*, δ) ) ].   (8)
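Below is a small numerical sketch (ours, not the authors' code) of estimator (5) for the Z-PEP setup: average m_l^N(y* | y) / m_l^N(y* | X_l*) over draws y* from the null model's prior predictive, computed on the log scale. It assumes scipy ≥ 1.6 for multivariate_t; the helper name is our own.

```python
import numpy as np
from scipy.stats import multivariate_t
from scipy.special import logsumexp

def log_marglik_zpep_scheme1(y, X_l, X_star, X0_star, g, delta, a=0.01, b=0.01,
                             T=1000, rng=None):
    rng = np.random.default_rng(rng)
    n = len(y)
    n_star = X_star.shape[0]

    def proj(A, B):                               # B (A'A)^{-1} B'
        return B @ np.linalg.solve(A.T @ A, B.T)

    # m_0^N(y* | X_0*, delta): prior predictive of the null model (used for draws).
    m0 = multivariate_t(np.zeros(n_star),
                        b / a * (delta * np.eye(n_star) + g * proj(X0_star, X0_star)), df=2 * a)
    # m_l^N(y* | X_l*, delta): prior predictive of model m_l (density in the denominator).
    ml = multivariate_t(np.zeros(n_star),
                        b / a * (delta * np.eye(n_star) + g * proj(X_star, X_star)), df=2 * a)
    # m_l^N(y* | y): Student predictive of the imaginary data given the actual data, eq. (4).
    bhat = np.linalg.solve(X_l.T @ X_l, X_l.T @ y)
    w1 = g / (g + 1.0)
    SS = y @ y - w1 * y @ X_l @ bhat
    ml_cond = multivariate_t(w1 * X_star @ bhat,
                             (2 * b + SS) / (2 * a + n) *
                             (delta * np.eye(n_star) + w1 * proj(X_l, X_star)), df=2 * a + n)
    # m_l^N(y | X_l): marginal likelihood of the actual data under the g-prior baseline.
    log_ml_y = multivariate_t(np.zeros(n), b / a * (np.eye(n) + g * proj(X_l, X_l)),
                              df=2 * a).logpdf(y)

    ystars = m0.rvs(size=T, random_state=rng)
    log_ratios = ml_cond.logpdf(ystars) - ml.logpdf(ystars)
    return log_ml_y + logsumexp(log_ratios) - np.log(T)
```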

Model search algorithm

We propose to modify the standard MC³ method by sampling a binary vector γ indicating the variables included in the model (see, e.g., George & McCulloch, JASA, 1993), using a Metropolis-within-Gibbs approach as follows.

(1) Generate y*^{(t)} (t = 1, ..., T) from m_0^N(y* | X_0*, δ) (first Monte-Carlo marginal-likelihood scheme) or from m_0^N(y* | y; X_0, X_0*, δ) (second scheme).

(2) For the current model m_l, corresponding to the vector of variable-inclusion indicators γ, repeat the following steps for j = 1, ..., p (selected in random order):
  (a) Propose γ'_j = 1 − γ_j with probability one.
  (b) Keep the remaining indicators unchanged: γ'_k = γ_k for all k ≠ j.
  (c) Identify the model m_l' corresponding to the vector γ' with elements γ'_k, k = 1, ..., p.
  (d) If m_l' has not been visited previously, calculate and store its estimated marginal likelihood m_l'^{PE}(y | X_l', X_l'*, δ), given by (5) (first scheme) or (6) (second scheme).
  (e) Set m_l = m_l' (i.e., accept the proposed model m_l') with probability

  α = min[ 1, f(m_l' | y) / f(m_l | y) ] = min[ 1, ( m_l'^{PE}(y | X_l', X_l'*, δ) f(m_l') ) / ( m_l^{PE}(y | X_l, X_l*, δ) f(m_l) ) ].   (9)

(3) Store m_l as the current model.

(4) Repeat steps (2)-(3) until a target number of models has been visited or a pre-specified CPU budget is exhausted.

For the third and fourth Monte-Carlo marginal-likelihood estimates, we start the above MC³ algorithm from step (2), and in step (2)(d) we generate y*^{(t)} (t = 1, ..., T) from m_l'^N(y* | y; X_l', X_l'*, δ), which now depends on the proposed model, and estimate the marginal likelihood of that model using expressions (7) and (8), respectively.
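A small sketch (ours) of the Metropolis-within-Gibbs model search above. For illustration only, the PEP marginal-likelihood estimate is replaced by a cheap closed-form stand-in (the standard g-prior log Bayes factor against the null model); in the authors' algorithm one would plug in estimators (5)-(8) instead. A uniform model-space prior is assumed, so the prior odds in (9) equal one.

```python
import numpy as np

def log_bf_gprior(y, X, gamma, g):
    """Log Bayes factor of model gamma vs the null model under a g-prior (stand-in)."""
    n = len(y)
    yc = y - y.mean()
    k = int(gamma.sum())
    if k == 0:
        return 0.0
    Xg = X[:, gamma.astype(bool)]
    Xg = Xg - Xg.mean(axis=0)
    beta = np.linalg.lstsq(Xg, yc, rcond=None)[0]
    r2 = 1.0 - np.sum((yc - Xg @ beta) ** 2) / np.sum(yc ** 2)
    return 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1 - r2))

def mc3(y, X, iters=5000, g=None, rng=None):
    rng = np.random.default_rng(rng)
    n, p = X.shape
    g = n if g is None else g
    gamma = np.zeros(p, dtype=int)
    cache = {}                                   # store marginal likelihoods of visited models

    def logml(gam):
        key = tuple(gam)
        if key not in cache:
            cache[key] = log_bf_gprior(y, X, gam, g)
        return cache[key]

    visits = np.zeros(p)
    for _ in range(iters):
        for j in rng.permutation(p):             # step (2): flip each indicator in random order
            prop = gamma.copy()
            prop[j] = 1 - prop[j]
            if np.log(rng.uniform()) < logml(prop) - logml(gamma):   # acceptance as in (9)
                gamma = prop
        visits += gamma                          # step (3): record the current model
    return visits / iters                        # estimated marginal inclusion probabilities
```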

8 Simulated Example

We illustrate the proposed methods by considering the simulated data set of Nott & Kohn (2005, Biometrika). This data set consists of n = 50 observations and p = 15 covariates. The first 10 covariates are generated from a standardized normal distribution, while

X_ij ~ N( 0.3 X_i1 + 0.5 X_i2 + 0.7 X_i3 + 0.9 X_i4 + 1.1 X_i5, 1 ) for j = 11, ..., 15, i = 1, ..., 50,

and the response from

Y_i ~ N( 4 + 2 X_i1 − X_i5 + 1.5 X_i7 + X_i11 + 0.5 X_i13, 2.5² ), for i = 1, ..., 50.
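A short simulation sketch (ours) following the generating mechanism described above; the response coefficients are taken from the reconstruction given in the text and should be treated as indicative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 15
X = np.empty((n, p))
X[:, :10] = rng.standard_normal((n, 10))                     # first 10 covariates
mean_extra = X[:, :5] @ np.array([0.3, 0.5, 0.7, 0.9, 1.1])  # X11-X15 depend on X1-X5
X[:, 10:] = mean_extra[:, None] + rng.standard_normal((n, 5))
beta = np.zeros(p)
beta[[0, 4, 6, 10, 12]] = [2.0, -1.0, 1.5, 1.0, 0.5]         # X1, X5, X7, X11, X13
y = 4.0 + X @ beta + rng.normal(scale=2.5, size=n)
```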

Monte-Carlo standard errors

In order to check the efficiency of the four Monte-Carlo estimates, we initially performed a small experiment. For Z-PEP, we estimated the logarithm of the marginal likelihood for the models X1 + X5 + X7 + X11 and X1 + X7 + X11 by running each Monte-Carlo scheme 100 times for 1,000 iterations, and we calculated the Monte-Carlo standard errors.

Table 1: Monte-Carlo standard errors of the estimates of the logarithm of the marginal likelihoods for models X1 + X5 + X7 + X11 and X1 + X7 + X11, when running the four Monte-Carlo schemes 100 times for 1,000 iterations, using the Z-PEP methodology.

As expected, the third and fourth Monte-Carlo schemes produce much lower Monte-Carlo standard errors. The third scheme (for Z-PEP) and the fourth scheme (for J-PEP) are therefore used in all the following runs, keeping the number of iterations equal to 1,000.

Posterior model probabilities

Table 2: Posterior model probabilities for the best models, together with Bayes factors of the MAP model (m_1) against m_j, j = 2, ..., 7, for the Z-PEP and the J-PEP prior methodologies. The seven best models under Z-PEP, with the corresponding J-PEP rank in parentheses, are:

m_1: X1 + X5 + X7 + X11 (2)
m_2: X1 + X7 + X11 (1)
m_3: X1 + X5 + X6 + X7 + X11 (3)
m_4: X1 + X6 + X7 + X11 (4)
m_5: X1 + X7 + X10 + X11 (5)
m_6: X1 + X5 + X7 + X10 + X11 (9)
m_7: X1 + X5 + X7 + X11 + X13 (10)

Sensitivity analysis on n* for Z-PEP

Figure 1: Posterior marginal inclusion probabilities for different values of n*, for the Z-PEP methodology (curves shown for covariates X7, X1, X11, X5, X6, X10 and X13; x-axis: imaginary-data sample size n*).

Sensitivity analysis on n* for Z-PEP (cont.)

Figure 2: Posterior model probabilities of the six best models obtained for each n*, for the Z-PEP methodology; bullets indicate models that entered the top six for at least one value of n*, and lines connect posterior model probabilities of models of the same rank over different values of n*. Models appearing in the plot: X1+X5+X6+X7+X11, X1+X5+X7+X10+X11, X1+X5+X7+X11, X1+X5+X7+X11+X13, X1+X6+X7+X11, X1+X7, X1+X7+X10+X11 and X1+X7+X11 (x-axis: imaginary-data sample size n*).

Comparisons

In the following we compare the Bayes factor between the two best models (X1 + X5 + X7 + X11 versus X1 + X5 + X7) according to Z-PEP and J-PEP with the corresponding Bayes factors using J-EPP (the EPP with no power and with the Jeffreys prior as baseline) and the IBF. For the IBF and J-EPP we randomly select 100 training samples of size n* = 6 (the minimal training sample for the estimation of these two models) and n* = 17 (the minimal training sample for the estimation of the full model when all p = 15 covariates are considered), while for Z-PEP and J-PEP we randomly select 100 training samples of sizes n* = 6, 17 and the larger sizes n* = 30, 40 and 50 shown in the following figure. Each marginal-likelihood estimate for Z-PEP is obtained with 1,000 iterations of the third Monte-Carlo scheme, and for J-PEP and J-EPP with 1,000 iterations of the fourth scheme.

Comparisons (cont.)

[Figure: log Bayes factor of X1 + X5 + X7 + X11 versus X1 + X5 + X7 over the 100 randomly selected training samples, for each method and training-sample size: IBF and J-EP at n* = 6 and 17, and Z-PEP and J-PEP at n* = 6, 17, 30, 40 and 50; x-axis: log Bayes factor.]

9 Real-Life Example

We use the ozone data (Breiman & Friedman, 1985, JASA) to implement our approaches. The data we used were slightly modified on the basis of an initial exploratory analysis. As the response we use a standardized version of the logarithm of the ozone-concentration variable of the original data set. The standardized versions of nine main effects, 9 quadratic terms, 2 cubic terms, and 36 two-way interactions (a total of 56 variables) were used as possible covariates. The main effects are:

X1: Day of year
X2: Wind speed (mph) at LAX
X3: 500 mb pressure height (m) at VAFB
X4: Humidity (%) at LAX
X5: Temperature (°F) at Sandburg
X6: Inversion base height (feet) at LAX
X7: Pressure gradient (mm Hg) from LAX to Daggett
X8: Inversion base temperature (°F) at LAX
X9: Visibility (miles) at LAX

Reduced model space

(1) First we used MC³ to identify variables with high posterior marginal inclusion probabilities, and we created a reduced model space consisting only of those variables whose marginal probabilities were above 0.3.

(2) Then we used the same model-search algorithm as in step (1) on the reduced space to estimate posterior model probabilities (and the corresponding odds).

We ran MC³ for 100,000 iterations for both the Z-PEP and the EIBF methods. For EIBF we used 30 randomly selected minimal training samples (n* = 58). The reduced model space was formed from those variables that in either run had posterior marginal inclusion probability above 0.3. With this approach we reduced the initial list of p = 56 available candidates down to 22 predictors.

We then ran MC³ for 220,000 iterations for the Z-PEP, J-PEP and EIBF approaches. For EIBF we used 30 randomly selected minimal training samples (n* = 24).
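A tiny follow-up sketch (ours) of the screening step in (1): keep the covariates whose estimated posterior marginal inclusion probability exceeds 0.3. The inclusion probabilities would come from an MC³ run such as the earlier sketch; here they are replaced by random stand-in values so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
incl_prob = rng.uniform(size=56)            # stand-in for MC3 marginal inclusion probabilities
keep = np.flatnonzero(incl_prob > 0.3)      # covariate indices retained in the reduced space
print(f"{keep.size} of 56 candidate terms retained")
```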

Table 3: Posterior odds (PO_1k) of the three best models within each analysis versus the current model k, for the reduced model space of the ozone data set. Variables common to all three analyses: X1 + X2 + X8 + X9 + X10 + X15 + X16 + X18 + X43. For each analysis the three best models are listed below with their additional variables and, in parentheses, their ranks under the other two methods.

J-PEP ranking (Z-PEP rank, EIBF rank) — additional variables:
1. (>3, >3) — no additional variable
2. (1, >3) — X7 + X12 + X13 + X
3. (>3, >3) — X7 + X13 + X

Z-PEP ranking (J-PEP rank, EIBF rank) — additional variables:
1. (2, >3) — X7 + X12 + X13 + X
2. (>3, >3) — X5 + X7 + X12 + X13 + X
3. (>3, 3) — X5 + X7 + X12 + X13 + X20 + X

EIBF ranking (J-PEP rank, Z-PEP rank) — additional variables:
1. (>3, >3) — X7 + X12 + X13 + X20 + X
2. (>3, >3) — X5 + X7 + X12 + X13 + X20 + X26 + X
3. (>3, 3) — X5 + X7 + X12 + X13 + X20 + X

Comparison of the predictive performance

We evaluate and compare the out-of-sample predictive performance of the highest a-posteriori (MAP) models indicated by the methods illustrated above. To do so, we divide the data in half, considering 50 randomly selected splits. For each split, we generate an MCMC sample of T iterations from the model of interest m_l and then calculate the average root mean square error

ARMSE_l = (1/T) Σ_{t=1}^T RMSE_l^{(t)},  with  RMSE_l^{(t)} = sqrt( (1/n_V) Σ_{i ∈ V} ( y_i − ŷ_{i m_l}^{(t)} )² )

being the root mean square error over the validation data set V of size n_V, calculated at iteration t of the MCMC, where ŷ_{i m_l}^{(t)} = X_{l(i)} β_l^{(t)} is the expected value of y_i according to the assumed model at iteration t, β_l^{(t)} are the model parameters at iteration t, and X_{l(i)} is the i-th row of the matrix X_l of model m_l.

Detailed results for the MAP models are provided in the next table; results for the full model are also included for comparison. For comparison purposes, we have also included the split-half RMSE measures for these models using predictions based on direct fitting of the model with the independence Jeffreys prior.
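A small sketch (ours) of the split-half predictive check above: the average RMSE on a validation set over T posterior draws of β for a given model.

```python
import numpy as np

def armse(beta_draws, X_val, y_val):
    """beta_draws: (T, d) array of MCMC draws; X_val: (n_V, d); y_val: (n_V,)."""
    preds = beta_draws @ X_val.T                     # (T, n_V) fitted values per draw
    rmse_t = np.sqrt(np.mean((y_val - preds) ** 2, axis=1))
    return rmse_t.mean()                             # average RMSE over the T draws
```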

Comparison of the predictive performance (cont.)

Table 4: Comparison of the predictive performance of the PEP and J-EP methods using the full and the MAP models in the reduced model space of the ozone data set. Entries are means (standard deviations) over 50 different split-half out-of-sample evaluations.

RMSE standard deviations over the 50 splits:

Model       | J-PEP    | Z-PEP    | J-EP     | Jeffreys prior
Full        | (0.0087) | (0.0097) | (0.0169) | (0.0104)
J-PEP MAP   | (0.0063) | (0.0051) | (0.0626) | (0.0052)
Z-PEP MAP   | (0.0071) | (0.0060) | (0.0734) | (0.0049)
EIBF MAP    | (0.0066) | (0.0072) | (0.0800) | (0.0061)

Comparison with the full model (percentage changes):

Model       | d_l   | R²      | R²_adj  | J-PEP   | Z-PEP   | J-EP    | Jeffreys prior
J-PEP MAP   | -59%  | -5.06%  | -4.48%  | -0.22%  | +3.81%  | +21.5%  | +3.23%
Z-PEP MAP   | -41%  | -1.50%  | -1.06%  | +0.10%  | +1.01%  | +12.7%  | +0.37%
EIBF MAP    | -36%  | -1.20%  | -0.78%  | +3.24%  | +0.44%  | +10.9%  | -0.23%

[Figure: split-half predictive comparison (RMSE over the 50 random splits) for (a) the full model, (b) the MAP model under the Z-PEP prior, (c) the MAP model under the J-EP prior and (d) the MAP model under the J-PEP prior; within each panel, results are shown for the J-PEP, Jeffreys, J-EP and Z-PEP approaches.]

Thank You Irvine!
