Supplementary materials for Scalable Bayesian model averaging through local information propagation
August 25, 2014

S1. Proofs

Proof of Theorem 1. The result follows immediately from the distributions of the decision variables and the fact that the pfs procedure stops in the first $t-1$ steps if and only if $|\gamma^{(t-1)}| < t-1$.

Proof of Theorem 2. Our proof strategy is to explicitly construct a pfs representation for any model space distribution $\pi$. To this end, we proceed by induction on the total number of potential predictors. First, we note that the conclusion holds for $p = 1$, or $\Omega = \{0, 1\}$. In this case there are but two models in the space: the null model and the model including $X_1$, written as $(0)$ and $(1)$ respectively. Let $\pi(\cdot)$ be any probability distribution on $\Omega$. It is easy to check that $\pi(\cdot)$ is the marginal distribution of the final model under the pfs procedure with $\rho(0) = \pi(0)$, $\rho(1) = 1$, and $\lambda_1(0) = 1$.

Now suppose the inductive claim holds for any model space involving up to $p - 1$ variables. We next show it must hold for the one with $p$ predictors, or $\Omega = \{0, 1\}^p$, as well. To this end, again let $\pi(\cdot)$ be any distribution on $\{0, 1\}^p$, and let $\Omega_{(-p)} = \{0, 1\}^{p-1} \times \{0\}$ be the collection of models that do not involve $X_p$. Let us define a new distribution $\pi^*(\cdot)$ on $\Omega_{(-p)}$ such that for each $\gamma \in \Omega_{(-p)}$,
$$\pi^*(\gamma) = \pi(\gamma) + \pi(\gamma^{+p}),$$
where $\gamma^{+p} \in \Omega$ is the model that adds an additional variable, $X_p$, into $\gamma$. It is easy to check that $\sum_{\gamma \in \Omega_{(-p)}} \pi^*(\gamma) = 1$.

Because $\Omega_{(-p)}$ is isomorphic to $\{0, 1\}^{p-1}$, $\pi^*(\cdot)$ can be considered a probability distribution on $\{0, 1\}^{p-1}$. Thus by the inductive hypothesis, $\pi^*$ has a pfs representation with parameter mappings $\rho^*$ and $\lambda^*$ defined on $\{0, 1\}^{p-1} \setminus \{(1, 1, \ldots, 1)\}$. Now for any $\gamma \in \{0, 1\}^p$, let $\gamma_{1:p-1} = (\gamma_1, \gamma_2, \ldots, \gamma_{p-1}) \in \{0, 1\}^{p-1}$, and let $\tilde{\rho}$ and $\tilde{\lambda}$ be mappings defined on $\Omega_{(-p)}$ such that for any $\gamma \in \Omega_{(-p)}$: if $|\gamma_{1:p-1}| < p - 1$,
$$\tilde{\rho}(\gamma) = \rho^*(\gamma_{1:p-1}), \quad \tilde{\lambda}_j(\gamma) = \lambda^*_j(\gamma_{1:p-1}) \ \text{for } j = 1, 2, \ldots, p-1, \quad \text{and} \quad \tilde{\lambda}_p(\gamma) = 0,$$
while if $|\gamma_{1:p-1}| = p - 1$, then
$$\tilde{\rho}(\gamma) = 1, \quad \tilde{\lambda}_j(\gamma) = 0 \ \text{for } j = 1, 2, \ldots, p-1, \quad \text{and} \quad \tilde{\lambda}_p(\gamma) = 1.$$

Now consider the pfs procedure with $p$ predictors with mappings $\rho$ and $\lambda$ defined such that

(i) If $\gamma \in \Omega_{(-p)}$ and $\pi(\gamma^{+p}) > 0$,
$$\rho(\gamma) = \tilde{\rho}(\gamma)\,\frac{\pi(\gamma)}{\pi^*(\gamma)}, \qquad \lambda_j(\gamma) = \begin{cases} \dfrac{1 - \tilde{\rho}(\gamma)}{1 - \rho(\gamma)}\,\tilde{\lambda}_j(\gamma) & \text{for } j = 1, 2, \ldots, p-1, \\[6pt] \dfrac{\pi(\gamma^{+p})}{\pi^*(\gamma)} \cdot \dfrac{\tilde{\rho}(\gamma)}{1 - \rho(\gamma)} & \text{for } j = p. \end{cases}$$

(ii) If $\gamma \in \Omega_{(-p)}$ and $\pi(\gamma^{+p}) = 0$, $\rho(\gamma) = \tilde{\rho}(\gamma)$ and $\lambda_j(\gamma) = \tilde{\lambda}_j(\gamma)$ for $j = 1, 2, \ldots, p$.

(iii) If $\gamma \in \Omega \setminus \Omega_{(-p)}$ and $|\gamma| < p$, $\rho(\gamma) = 1$ and $\lambda_j(\gamma) = \mathbf{1}_{\{\gamma_j = 0\}} / (p - |\gamma|)$.

Under this pfs procedure, the $p$th predictor is always the last to be added. Now let us check that the marginal distribution of the final model $\gamma^{(p)}$ is indeed $\pi$. For any $\gamma \in \Omega_{(-p)}$ such that $\pi(\gamma) > 0$, by (i), (ii), and (iii) we have
$$\sum_{\gamma^{(1)}, \ldots, \gamma^{(p)}} \prod_{t=1}^{|\gamma|} [1 - \rho(\gamma^{(t-1)})] \, \lambda_{j_t}(\gamma^{(t-1)}) \cdot \rho(\gamma) = \sum_{\gamma^{(1)}, \ldots, \gamma^{(p)}} \prod_{t=1}^{|\gamma|} [1 - \tilde{\rho}(\gamma^{(t-1)})] \, \tilde{\lambda}_{j_t}(\gamma^{(t-1)}) \cdot \tilde{\rho}(\gamma)\,\pi(\gamma)/\pi^*(\gamma) = \pi^*(\gamma) \cdot \pi(\gamma)/\pi^*(\gamma) = \pi(\gamma),$$
where $j_1, j_2, \ldots, j_{|\gamma|}$ are the values of the selection variables $J_1, J_2, \ldots, J_{|\gamma|}$ that correspond to the sequence of models $\gamma^{(1)}, \ldots, \gamma^{(|\gamma|-1)}, \gamma^{(|\gamma|)} = \gamma$.

Similarly, for any $\gamma \in \Omega \setminus \Omega_{(-p)}$ such that $\pi(\gamma) > 0$, by (i), (ii), and (iii), the marginal probability for $\gamma^{(p)}$ to be $\gamma$ is
$$\sum_{\gamma^{(1)}, \ldots, \gamma^{(p)} : \gamma^{(p)} = \gamma} \prod_{t=1}^{|\gamma|} [1 - \rho(\gamma^{(t-1)})] \, \lambda_{j_t}(\gamma^{(t-1)}) \cdot \rho(\gamma) = \sum_{\gamma^{(1)}, \ldots, \gamma^{(p)} : \gamma^{(p)} = \gamma} \prod_{t=1}^{|\gamma|-1} [1 - \tilde{\rho}(\gamma^{(t-1)})] \, \tilde{\lambda}_{j_t}(\gamma^{(t-1)}) \cdot [1 - \rho(\gamma^{(|\gamma|-1)})] \cdot \frac{\pi(\gamma)}{\pi^*(\gamma^{(|\gamma|-1)})} \cdot \frac{\tilde{\rho}(\gamma^{(|\gamma|-1)})}{1 - \rho(\gamma^{(|\gamma|-1)})} = \frac{\pi(\gamma)}{\pi^*(\gamma^{(|\gamma|-1)})} \cdot \pi^*(\gamma^{(|\gamma|-1)}) = \pi(\gamma).$$
The second equality follows because for $\gamma \in \Omega \setminus \Omega_{(-p)}$ such that $\pi(\gamma) > 0$, under (i), (ii), and (iii),
$$\sum_{\gamma^{(1)}, \ldots, \gamma^{(p)} : \gamma^{(p)} = \gamma} \prod_{t=1}^{|\gamma|-1} [1 - \tilde{\rho}(\gamma^{(t-1)})] \, \tilde{\lambda}_{j_t}(\gamma^{(t-1)}) \cdot \tilde{\rho}(\gamma^{(|\gamma|-1)}) = \pi^*(\gamma^{(|\gamma|-1)}),$$
and with probability 1, $\gamma^{(|\gamma|-1)}$ is the model with the $p$th predictor removed from $\gamma$.

Proof of Theorem 3. Let $S_1, J_1, S_2, J_2, \ldots, S_p, J_p$ be the latent decision variables of the pfs representation of $\pi$ under consideration. We let $(\Omega_d, \mathcal{F}_d)$ be the probability space on which these decision variables are jointly defined. The sequence of models $\gamma^{(1)}, \gamma^{(2)}, \ldots, \gamma^{(p)}$ are functions of the decision variables and thus also measurable with respect to $(\Omega_d, \mathcal{F}_d)$. Fixing the data $D$, the marginal likelihood under the final model, $p(D \mid \gamma^{(p)})$, is also a random variable on $(\Omega_d, \mathcal{F}_d)$. For any $\gamma \in \Omega$, we define an event $U_\gamma$ on $(\Omega_d, \mathcal{F}_d)$ that $\gamma$ is a submodel of the final model $\gamma^{(p)}$, that is, $\gamma^{(p)}$ contains all of the predictors included in $\gamma$. Mathematically, this event can be expressed as
$$U_\gamma := \{\omega \in \Omega_d : \gamma^{(t)}(\omega) = \gamma \ \text{for } t = |\gamma|\}.$$

Next, we define a mapping $\Phi : \Omega \to \mathbb{R}$ as follows. For each $\gamma \in \Omega$,
$$\Phi(\gamma) := E_{\gamma^{(p)}}\big[\, p(D \mid \gamma^{(p)}) \mid U_\gamma \,\big],$$
where the data $D$ is fixed and the expectation is taken over the final model $\gamma^{(p)}$, or equivalently the decision variables, conditional on the event $U_\gamma$. Now for any $\gamma \in \Omega$, we claim that
$$\Phi(\gamma) = \begin{cases} p(D \mid \gamma) & \text{if } |\gamma| = p, \\[4pt] \rho(\gamma)\, p(D \mid \gamma) + (1 - \rho(\gamma)) \sum_{j : \gamma_j = 0} \lambda_j(\gamma)\, \Phi(\gamma^{+j}) & \text{if } |\gamma| < p. \end{cases}$$

To see this, note that if $|\gamma| = p$, then conditional on $U_\gamma$ we have $\gamma^{(p)} = \gamma$, and so $E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid U_\gamma] = p(D \mid \gamma)$. Now if $|\gamma| = t < p$, then by the tower property,
$$E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid U_\gamma] = E_{\gamma^{(p)}}\big[\, E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid S_{t+1}, U_\gamma] \mid U_\gamma \,\big] = E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid S_{t+1} = 1, U_\gamma]\, P(S_{t+1} = 1 \mid U_\gamma) + \sum_{j : \gamma_j = 0} E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid J_{t+1} = j, S_{t+1} = 0, U_\gamma]\, P(J_{t+1} = j \mid S_{t+1} = 0, U_\gamma)\, P(S_{t+1} = 0 \mid U_\gamma).$$
Now note that $S_{t+1} = 1$ and $U_\gamma$ together imply that $\gamma^{(p)} = \gamma$, and so $E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid S_{t+1} = 1, U_\gamma] = p(D \mid \gamma)$. Also, for each $j$ such that $\gamma_j = 0$,
$$\{\omega \in \Omega_d : J_{t+1}(\omega) = j,\ S_{t+1}(\omega) = 0\} \cap U_\gamma \subset U_{\gamma^{+j}}.$$
Moreover, conditional on the event $U_{\gamma^{+j}}$, $\gamma^{(p)}$ is a function of $S_{t+2}, J_{t+2}, \ldots, S_p, J_p$ and so is independent of $S_1, J_1, \ldots, S_{t+1}, J_{t+1}$. Thus,
$$E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid J_{t+1} = j, S_{t+1} = 0, U_\gamma] = E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid J_{t+1} = j, S_{t+1} = 0, U_\gamma, U_{\gamma^{+j}}] = E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid U_{\gamma^{+j}}] = \Phi(\gamma^{+j}).$$
Finally, since $P(S_{t+1} = 1 \mid U_\gamma) = \rho(\gamma)$ and $P(J_{t+1} = j \mid S_{t+1} = 0, U_\gamma) = \lambda_j(\gamma)$, putting the pieces together we have
$$\Phi(\gamma) = \rho(\gamma)\, p(D \mid \gamma) + (1 - \rho(\gamma)) \sum_{j : \gamma_j = 0} \lambda_j(\gamma)\, \Phi(\gamma^{+j}).$$
This establishes the above claim about $\Phi$.
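The recursion above makes $\Phi$ computable by memoized backward recursion over the model lattice. The following sketch is purely illustrative: the function names, the toy $\rho$, $\lambda$, and the stand-in marginal likelihood are ours, not from the paper.

```python
from functools import lru_cache

p = 3  # number of candidate predictors (toy example)

def marginal_lik(gamma):
    """Stand-in for p(D | gamma); any positive function of the model works here."""
    return 1.0 + sum(gamma)  # hypothetical toy likelihood

def rho(gamma):
    """Prior stopping probability at model gamma (toy choice)."""
    return 1.0 if sum(gamma) == p else 0.5

def lam(gamma, j):
    """Prior selection probability of predictor j given we continue from gamma."""
    free = [k for k in range(p) if gamma[k] == 0]
    return 1.0 / len(free) if gamma[j] == 0 else 0.0

@lru_cache(maxsize=None)
def Phi(gamma):
    """Phi(gamma) = E[p(D | final model) | gamma is reached], via the recursion
    Phi(gamma) = rho(gamma) p(D|gamma) + (1 - rho(gamma)) sum_j lam_j(gamma) Phi(gamma+j)."""
    if sum(gamma) == p:
        return marginal_lik(gamma)
    cont = sum(
        lam(gamma, j) * Phi(gamma[:j] + (1,) + gamma[j + 1:])
        for j in range(p) if gamma[j] == 0
    )
    return rho(gamma) * marginal_lik(gamma) + (1 - rho(gamma)) * cont

print(Phi((0, 0, 0)))
```

Because every model with the same size has the same toy likelihood here, the recursion can be verified by hand: $\Phi$ at the null model averages $p(D\mid\gamma)$ over the stopping distribution of the walk.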
Given the mapping $\Phi$, we are now ready to establish the theorem. First, because under the pfs representation the data generative mechanism essentially forms an HMM by Theorem 1, the model space posterior has a pfs representation with the mappings $\rho(\cdot \mid D)$ and $\lambda(\cdot \mid D)$ determined by the posterior distributions of the decision variables $S_1, J_1, \ldots, S_p, J_p$. So our proof strategy now is simply to find the posterior distributions of these decision variables. For any model $\gamma \in \Omega$ with $|\gamma| = t < p$,
$$\rho(\gamma \mid D) = P(S_{t+1} = 1 \mid U_\gamma, D) = \frac{P(S_{t+1} = 1, D \mid U_\gamma)}{P(D \mid U_\gamma)} = \frac{E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid S_{t+1} = 1, U_\gamma]\, P(S_{t+1} = 1 \mid U_\gamma)}{E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid U_\gamma]} = \rho(\gamma)\, p(D \mid \gamma) / \Phi(\gamma),$$
which is equal to 1 when $\rho(\gamma) = 1$. Similarly, if $\rho(\gamma) \neq 1$, then
$$\lambda_j(\gamma \mid D) = P(J_{t+1} = j \mid U_\gamma, S_{t+1} = 0, D) = \frac{P(J_{t+1} = j, S_{t+1} = 0, D \mid U_\gamma)}{P(S_{t+1} = 0, D \mid U_\gamma)} = \frac{P(J_{t+1} = j, S_{t+1} = 0, D \mid U_\gamma)}{P(D \mid U_\gamma) - P(S_{t+1} = 1, D \mid U_\gamma)}$$
$$= \frac{E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid J_{t+1} = j, S_{t+1} = 0, U_\gamma]\, P(J_{t+1} = j \mid S_{t+1} = 0, U_\gamma)\, P(S_{t+1} = 0 \mid U_\gamma)}{E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid U_\gamma] - E_{\gamma^{(p)}}[p(D \mid \gamma^{(p)}) \mid S_{t+1} = 1, U_\gamma]\, P(S_{t+1} = 1 \mid U_\gamma)} = \frac{\Phi(\gamma^{+j})\, \lambda_j(\gamma)\, (1 - \rho(\gamma))}{\Phi(\gamma) - p(D \mid \gamma)\, \rho(\gamma)}.$$
On the other hand, for $\rho(\gamma) = 1$, given $U_\gamma$, $S_{t+1} = 0$ with probability 0 and the value of $J_t$ for $t > |\gamma|$ has no impact on the final model $\gamma^{(p)}$. So we can simply set $\lambda_j(\gamma \mid D) = \lambda_j(\gamma)$ for all $j$. The theorem now follows by letting $\phi(\gamma) = \Phi(\gamma) / p(D \mid \mathbf{0})$, where $\mathbf{0}$ denotes the null model.
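To make the object of these proofs concrete, the pfs generative procedure itself is a simple forward walk on the model space: stop at the current model with probability $\rho(\gamma)$, otherwise draw one of the excluded predictors with probabilities $\lambda_j(\gamma)$ and add it. A minimal sketch (function and argument names are ours; $\rho$ and $\lambda$ are user-supplied mappings):

```python
import random

def pfs_sample(p, rho, lam, rng=None):
    """Draw one model from the pfs representation defined by mappings rho and lam.

    gamma is a tuple of p inclusion indicators; the walk adds one predictor per
    step and stops with probability rho(gamma), yielding the final model gamma^(p)."""
    rng = rng or random.Random()
    gamma = (0,) * p
    while sum(gamma) < p:
        if rng.random() < rho(gamma):            # stopping decision S_{t+1}
            break
        free = [j for j in range(p) if gamma[j] == 0]
        weights = [lam(gamma, j) for j in free]  # selection decision J_{t+1}
        j = rng.choices(free, weights=weights)[0]
        gamma = gamma[:j] + (1,) + gamma[j + 1:]
    return gamma
```

With $\rho \equiv 1$ the walk returns the null model; with $\rho \equiv 0$ it always reaches the full model, matching the boundary cases used in the proofs.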
S2. Bayes factors under g and hyper-g priors

For many common priors on the regression coefficients, the BF term in the weight update can be computed either in closed form or well approximated numerically. Here let us consider two popular priors: the g-prior and the hyper-g prior. Given a particular model $\gamma$, Zellner's g-prior in its most popular form is the following prior on the regression coefficients and the noise variance:
$$p(\phi) \propto 1/\phi \quad \text{and} \quad \beta_\gamma \mid \phi, \gamma \sim N\big(\beta^0_\gamma,\; g (X^T X)^{-1} / \phi\big),$$
where $\beta^0_\gamma$ and $g$ are hyperparameters. Following the exposition in Liang et al. (2008), we assume without loss of generality that the predictor variables $X_1, X_2, \ldots, X_p$ have all been mean-centered at zero. Then we can place a common non-informative flat prior on the intercept $\alpha$ for all models, so $p(\alpha, \phi) \propto 1/\phi$. Under this prior setup, one can show that the BF for a model $\gamma$ versus the null model is given by
$$\mathrm{BF}_0(\gamma) = \frac{(1 + g)^{(n - 1 - |\gamma|)/2}}{\big(1 + g(1 - R^2_\gamma)\big)^{(n-1)/2}},$$
where $R^2_\gamma$ is the coefficient of determination for model $\gamma$.

To avoid undesirable features of the g-prior such as Bartlett's paradox and the information paradox (Berger and Pericchi, 2001), Liang et al. (2008) proposed the use of mixtures of g-priors. In particular, they introduced the hyper-g prior, which puts the following hyperprior on $g$:
$$\frac{g}{1 + g} \sim \mathrm{Beta}(1,\, a/2 - 1).$$
This prior also renders a closed-form representation for the model-specific marginal likelihood, and thus for the corresponding BFs. In particular, Liang et al. (2008) showed that the BF of a model $\gamma$ versus the null model is given by
$$\mathrm{BF}_0(\gamma) = \frac{a - 2}{|\gamma| + a - 2}\; {}_2F_1\big( (n-1)/2,\, 1;\, (|\gamma| + a)/2;\, R^2_\gamma \big),$$
where ${}_2F_1(\cdot, \cdot;\, \cdot;\, \cdot)$ is the Gaussian hypergeometric function. More specifically, in the notation of Liang et al. (2008),
$$ {}_2F_1(a, b;\, c;\, z) = \frac{\Gamma(c)}{\Gamma(b)\Gamma(c - b)} \int_0^1 \frac{t^{b-1}(1 - t)^{c-b-1}}{(1 - tz)^a}\, dt.$$
Therefore, with either the g-prior or the hyper-g prior, the BF in the weight update can be computed as
$$\mathrm{BF}\big(\gamma_i^{(t)}, \gamma_i^{(t-1)}\big) = \frac{\mathrm{BF}_0\big(\gamma_i^{(t)}\big)}{\mathrm{BF}_0\big(\gamma_i^{(t-1)}\big)}.$$
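Both null-based BFs above are straightforward to evaluate numerically. The sketch below implements the two displayed formulas (variable names are ours; it assumes SciPy's `hyp2f1` for the Gaussian hypergeometric function):

```python
from scipy.special import hyp2f1

def bf_g_prior(R2, n, k, g):
    """BF of model gamma vs the null model under Zellner's g-prior;
    k = |gamma| is the number of predictors included in the model."""
    return (1.0 + g) ** ((n - 1 - k) / 2.0) / (1.0 + g * (1.0 - R2)) ** ((n - 1) / 2.0)

def bf_hyper_g(R2, n, k, a=3.0):
    """BF of model gamma vs the null model under the hyper-g prior with parameter a."""
    return (a - 2.0) / (k + a - 2.0) * hyp2f1((n - 1) / 2.0, 1.0, (k + a) / 2.0, R2)

def bf_update(bf_new, bf_old):
    """BF for one pfs step: ratio of the null-based BFs of the two models."""
    return bf_new / bf_old
```

Both functions reduce to 1 at the null model ($k = 0$, $R^2_\gamma = 0$), as a quick sanity check.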
S3. Incorporating dilution under model space redundancy

In this section we show that the pfs representation affords us much flexibility in incorporating prior information, and we illustrate this through an interesting phenomenon called the dilution effect, first noted by George (1999). Dilution occurs when there is redundancy in the model space. More specifically, consider the scenario where there is strong correlation among some of the predictors, and any one of these predictors captures virtually all of the association between them and the response. In this case, models that contain different members of this class but are otherwise identical are essentially the same. As a result, if, say, a symmetric prior specification is adopted, these models will receive more prior probability than they properly should. At the same time, other models that do not include members of this class will be down-weighted in the prior. In real data, this phenomenon occurs to varying degrees depending on the underlying correlation structure among the predictors.

Next, we present a very simple specification of the model space prior under the pfs representation that can effectively address this phenomenon. We do not claim that this approach is the best way to deal with dilution, but rather use it as an example to illustrate the flexibility rendered by the pfs representation. The specification can most simply be described in two steps.

Step I. Pre-clustering the predictors based on their correlation. First, we carry out a hierarchical clustering over the predictor variables using the (absolute) correlation as the similarity metric, which divides the predictors into $K$ clusters $C_1, C_2, \ldots, C_K$. We recommend using complete linkage for this purpose, as this ensures that the variables within each cluster are all very close to each other. One needs to choose a correlation threshold $s$ for cutting the corresponding dendrogram into clusters; in the case of complete linkage, this is the minimum correlation for two variables to be in the same cluster. We recommend choosing a large $s$, such as 0.9, to place variables into the same basket only if they are very highly correlated.

Step II. Prior specification given the predictor clusters. Based on the predictor clusters, we assign prior selection probabilities for a model $\gamma$ to the variables not yet in the model in the following manner. First, we place equal total prior selection probability over each of the available clusters. Then within each cluster, we assign the selection probability evenly across the variables. For example, consider the situation where there are a total of 10 predictors $X_1$ through $X_{10}$, and following Step I they form four clusters $C_1 = \{X_1, X_2, X_3\}$, $C_2 = \{X_4, X_{10}\}$, $C_3 = \{X_5, X_7, X_9\}$ and $C_4 = \{X_6, X_8\}$. Let $\gamma$ be the model that contains variables $X_1$, $X_4$, $X_5$, $X_6$, and $X_8$, that is, $\gamma = (1, 0, 0, 1, 1, 1, 0, 1, 0, 0)$. If the pfs procedure reaches $\gamma$ and does not stop there, that is, $S(\gamma) = 0$, then five variables, $X_2, X_3, X_7, X_9, X_{10}$, from three clusters are available for further inclusion: $\{X_2, X_3\}$ from $C_1$, $\{X_{10}\}$ from $C_2$, and $\{X_7, X_9\}$ from $C_3$. In this case we choose the selection probabilities $\lambda(\gamma)$ to be
$$\lambda_1(\gamma) = \lambda_4(\gamma) = \lambda_5(\gamma) = \lambda_6(\gamma) = \lambda_8(\gamma) = 0, \quad \lambda_2(\gamma) = \lambda_3(\gamma) = \lambda_7(\gamma) = \lambda_9(\gamma) = \tfrac{1}{3} \cdot \tfrac{1}{2} = \tfrac{1}{6}, \quad \text{and} \quad \lambda_{10}(\gamma) = \tfrac{1}{3}.$$
Under such a specification, the predictors falling in the same cluster evenly share a fixed piece of the prior selection probability, which ensures that the prior weight on the other variables is not diluted.

References

Berger, J. O. and L. R. Pericchi (2001). Objective Bayesian methods for model selection: Introduction and comparison. Lecture Notes - Monograph Series 38.

George, E. I. (1999). Sampling considerations for model averaging and model search. Invited discussion of "Model averaging and model search" by M. Clyde. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 6. Oxford, UK: Oxford University Press.

Liang, F., R. Paulo, G. Molina, M. A. Clyde, and J. O. Berger (2008). Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association 103(481).
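As an aside, Steps I and II of the dilution-adjusted prior above can be sketched as follows (helper names are ours; the clustering uses SciPy's complete-linkage routine with distance $1 - |\text{correlation}|$, so cutting the dendrogram at height $1 - s$ matches the threshold described in Step I):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def correlation_clusters(X, s=0.9):
    """Step I: complete-linkage clustering with similarity |cor|; cutting at
    distance 1 - s means any two variables in a cluster have |cor| >= s."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dist = np.clip(1.0 - corr, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    return fcluster(Z, t=1.0 - s, criterion="distance")  # labels 1..K

def selection_probs(gamma, labels):
    """Step II: split total prior selection probability evenly across the
    clusters that still have variables outside gamma, then evenly within each."""
    p = len(gamma)
    free = [j for j in range(p) if gamma[j] == 0]
    open_clusters = sorted({labels[j] for j in free})
    lam = [0.0] * p
    for c in open_clusters:
        members = [j for j in free if labels[j] == c]
        for j in members:
            lam[j] = (1.0 / len(open_clusters)) / len(members)
    return lam

# Worked example from the text: with the four clusters C1..C4 encoded as labels,
# gamma = (1,0,0,1,1,1,0,1,0,0) yields lambda_2 = lambda_3 = lambda_7 = lambda_9 = 1/6
# and lambda_10 = 1/3.
labels = [1, 1, 1, 2, 3, 4, 3, 4, 3, 2]
print(selection_probs((1, 0, 0, 1, 1, 1, 0, 1, 0, 0), labels))
```

The selection probabilities always sum to one over the available variables, as required of a pfs selection mapping.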
More informationBayesian Hypothesis Testing in GLMs: One-Sided and Ordered Alternatives. 1(w i = h + 1)β h + ɛ i,
Bayesian Hypothesis Testing in GLMs: One-Sided and Ordered Alternatives Often interest may focus on comparing a null hypothesis of no difference between groups to an ordered restricted alternative. For
More informationFrailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P.
Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Melanie M. Wall, Bradley P. Carlin November 24, 2014 Outlines of the talk
More informationThe joint posterior distribution of the unknown parameters and hidden variables, given the
DERIVATIONS OF THE FULLY CONDITIONAL POSTERIOR DENSITIES The joint posterior distribution of the unknown parameters and hidden variables, given the data, is proportional to the product of the joint prior
More informationStat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 15-7th March Arnaud Doucet
Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 15-7th March 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Mixture and composition of kernels. Hybrid algorithms. Examples Overview
More information(1) Introduction to Bayesian statistics
Spring, 2018 A motivating example Student 1 will write down a number and then flip a coin If the flip is heads, they will honestly tell student 2 if the number is even or odd If the flip is tails, they
More informationLecture 16 : Bayesian analysis of contingency tables. Bayesian linear regression. Jonathan Marchini (University of Oxford) BS2a MT / 15
Lecture 16 : Bayesian analysis of contingency tables. Bayesian linear regression. Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 15 Contingency table analysis North Carolina State University
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationUsing Historical Experimental Information in the Bayesian Analysis of Reproduction Toxicological Experimental Results
Using Historical Experimental Information in the Bayesian Analysis of Reproduction Toxicological Experimental Results Jing Zhang Miami University August 12, 2014 Jing Zhang (Miami University) Using Historical
More informationST 740: Model Selection
ST 740: Model Selection Alyson Wilson Department of Statistics North Carolina State University November 25, 2013 A. Wilson (NCSU Statistics) Model Selection November 25, 2013 1 / 29 Formal Bayesian Model
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationDiscussion of Predictive Density Combinations with Dynamic Learning for Large Data Sets in Economics and Finance
Discussion of Predictive Density Combinations with Dynamic Learning for Large Data Sets in Economics and Finance by Casarin, Grassi, Ravazzolo, Herman K. van Dijk Dimitris Korobilis University of Essex,
More informationLearning to Learn and Collaborative Filtering
Appearing in NIPS 2005 workshop Inductive Transfer: Canada, December, 2005. 10 Years Later, Whistler, Learning to Learn and Collaborative Filtering Kai Yu, Volker Tresp Siemens AG, 81739 Munich, Germany
More informationLongitudinal Modeling with Logistic Regression
Newsom 1 Longitudinal Modeling with Logistic Regression Longitudinal designs involve repeated measurements of the same individuals over time There are two general classes of analyses that correspond to
More informationA REVERSE TO THE JEFFREYS LINDLEY PARADOX
PROBABILITY AND MATHEMATICAL STATISTICS Vol. 38, Fasc. 1 (2018), pp. 243 247 doi:10.19195/0208-4147.38.1.13 A REVERSE TO THE JEFFREYS LINDLEY PARADOX BY WIEBE R. P E S T M A N (LEUVEN), FRANCIS T U E R
More informationBayesian Linear Models
Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Department of Forestry & Department of Geography, Michigan State University, Lansing Michigan, U.S.A. 2 Biostatistics, School of Public
More informationHierarchical Modeling for Univariate Spatial Data
Hierarchical Modeling for Univariate Spatial Data Geography 890, Hierarchical Bayesian Models for Environmental Spatial Data Analysis February 15, 2011 1 Spatial Domain 2 Geography 890 Spatial Domain This
More informationBayes methods for categorical data. April 25, 2017
Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,
More informationBayesian Adjustment for Multiplicity
Bayesian Adjustment for Multiplicity Jim Berger Duke University with James Scott University of Texas 2011 Rao Prize Conference Department of Statistics, Penn State University May 19, 2011 1 2011 Rao Prize
More informationBayesian model selection for computer model validation via mixture model estimation
Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationSYDE 372 Introduction to Pattern Recognition. Probability Measures for Classification: Part I
SYDE 372 Introduction to Pattern Recognition Probability Measures for Classification: Part I Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 Why use probability
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationObjective Bayesian Hypothesis Testing
Objective Bayesian Hypothesis Testing José M. Bernardo Universitat de València, Spain jose.m.bernardo@uv.es Statistical Science and Philosophy of Science London School of Economics (UK), June 21st, 2010
More informationPubh 8482: Sequential Analysis
Pubh 8482: Sequential Analysis Joseph S. Koopmeiners Division of Biostatistics University of Minnesota Week 10 Class Summary Last time... We began our discussion of adaptive clinical trials Specifically,
More informationBayesian variable selection in high dimensional problems without assumptions on prior model probabilities
arxiv:1607.02993v1 [stat.me] 11 Jul 2016 Bayesian variable selection in high dimensional problems without assumptions on prior model probabilities J. O. Berger 1, G. García-Donato 2, M. A. Martínez-Beneito
More informationIntroduction to Bayesian Statistics
Bayesian Parameter Estimation Introduction to Bayesian Statistics Harvey Thornburg Center for Computer Research in Music and Acoustics (CCRMA) Department of Music, Stanford University Stanford, California
More information