1 Hierarchical models

Dr. Jarad Niemi
Iowa State University
August 31, 2017
2 Normal hierarchical model

Let $Y_{ig} \sim N(\theta_g, \sigma^2)$ for $i = 1, \ldots, n_g$, $g = 1, \ldots, G$, and $\sum_{g=1}^G n_g = n$. Now consider the following model assumptions:
$$\theta_g \sim N(\mu, \tau^2)$$
$$\theta_g \sim La(\mu, \tau)$$
$$\theta_g \sim t_v(\mu, \tau^2)$$
$$\theta_g \sim \pi \delta_0 + (1 - \pi) N(\mu, \tau^2)$$
$$\theta_g \sim \pi \delta_0 + (1 - \pi) t_v(\mu, \tau^2)$$
To perform a Bayesian analysis, we need a prior on $\mu$, $\tau^2$, and (in the case of the discrete mixtures) $\pi$.
3 Gibbs sampling: Normal hierarchical model

Consider the model
$$Y_{ig} \mid \theta_g \sim N(\theta_g, \sigma^2), \qquad \theta_g \sim N(\mu, \tau^2)$$
where $i = 1, \ldots, n_g$, $g = 1, \ldots, G$, and $n = \sum_{g=1}^G n_g$, with prior distribution
$$p(\mu, \sigma^2, \tau^2) = p(\mu)\, p(\sigma^2)\, p(\tau^2) \propto \frac{1}{\sigma^2}\, Ca^+(\tau; 0, C).$$
For background on why we are using these priors for the variances, see Gelman (2006) "Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper)."
4 Multi-step Gibbs sampler for the normal hierarchical model

Here is a possible Gibbs sampler for this model:
- For $g = 1, \ldots, G$, sample $\theta_g \sim p(\theta_g \mid \cdots)$.
- Sample $\sigma^2 \sim p(\sigma^2 \mid \cdots)$.
- Sample $\mu \sim p(\mu \mid \cdots)$.
- Sample $\tau^2 \sim p(\tau^2 \mid \cdots)$.

How many steps exist in this Gibbs sampler? $G + 3$? 4?
5 2-step Gibbs sampler for the normal hierarchical model

Here is a 2-step Gibbs sampler:
1. Sample $\theta = (\theta_1, \ldots, \theta_G) \sim p(\theta \mid \cdots)$.
2. Sample $(\mu, \sigma^2, \tau^2) \sim p(\mu, \sigma^2, \tau^2 \mid \cdots)$.

There is stronger theoretical support for a 2-step Gibbs sampler; thus, if we can, it is prudent to construct one.
6 Sampling $\theta$

The full conditional for $\theta$ is
$$p(\theta \mid \cdots) \propto p(\theta, \mu, \sigma^2, \tau^2 \mid y) \propto p(y \mid \theta, \sigma^2)\, p(\theta \mid \mu, \tau^2)\, p(\mu, \sigma^2, \tau^2) \propto p(y \mid \theta, \sigma^2)\, p(\theta \mid \mu, \tau^2) = \prod_{g=1}^G p(y_g \mid \theta_g, \sigma^2)\, p(\theta_g \mid \mu, \tau^2)$$
where $y_g = (y_{1g}, \ldots, y_{n_g g})$. We now know that the $\theta_g$ are conditionally independent of each other.
7 Sampling $\theta_g$

The full conditional for $\theta_g$ is
$$p(\theta_g \mid \cdots) \propto p(y_g \mid \theta_g, \sigma^2)\, p(\theta_g \mid \mu, \tau^2) = \left[\prod_{i=1}^{n_g} N(y_{ig}; \theta_g, \sigma^2)\right] N(\theta_g; \mu, \tau^2).$$
Notice that this does not include $\theta_{g'}$ for any $g' \ne g$. This is an alternative way to conclude that the $\theta_g$ are conditionally independent of each other. Thus
$$\theta_g \mid \cdots \sim N(\mu_g, \tau_g^2)$$
where
$$\tau_g^2 = \left[\tau^{-2} + n_g \sigma^{-2}\right]^{-1}, \qquad \mu_g = \tau_g^2 \left[\mu \tau^{-2} + \bar{y}_g\, n_g \sigma^{-2}\right], \qquad \bar{y}_g = \frac{1}{n_g} \sum_{i=1}^{n_g} y_{ig}.$$
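As a concrete sketch of this precision-weighted update (numpy assumed; the function name and arguments are illustrative, not from the slides):

```python
import numpy as np

def sample_theta_g(ybar_g, n_g, sigma2, mu, tau2, rng):
    """Draw theta_g from its full conditional N(mu_g, tau_g^2):
    a precision-weighted combination of the prior N(mu, tau2)
    and the group sample mean ybar_g."""
    tau_g2 = 1.0 / (1.0 / tau2 + n_g / sigma2)           # [tau^-2 + n_g sigma^-2]^-1
    mu_g = tau_g2 * (mu / tau2 + n_g * ybar_g / sigma2)  # tau_g^2 [mu tau^-2 + ybar_g n_g sigma^-2]
    return rng.normal(mu_g, np.sqrt(tau_g2))

rng = np.random.default_rng(0)
theta_g = sample_theta_g(ybar_g=2.0, n_g=50, sigma2=1.0, mu=0.0, tau2=1.0, rng=rng)
```

With a large $n_g$ the draw concentrates near $\bar{y}_g$, as the precision weighting suggests.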
8 Sampling $\mu, \sigma^2, \tau^2$

The full conditional for $\mu, \sigma^2, \tau^2$ is
$$p(\mu, \sigma^2, \tau^2 \mid \cdots) \propto p(y \mid \theta, \sigma^2)\, p(\theta \mid \mu, \tau^2)\, p(\mu)\, p(\sigma^2)\, p(\tau^2) = \left[p(y \mid \theta, \sigma^2)\, p(\sigma^2)\right] \left[p(\theta \mid \mu, \tau^2)\, p(\mu)\, p(\tau^2)\right].$$
So we know that $\sigma^2$ is conditionally independent of $\mu$ and $\tau^2$.
9 Sampling $\sigma^2$

Recall that $y_{ig} \sim N(\theta_g, \sigma^2)$ and $p(\sigma^2) \propto 1/\sigma^2$. Thus, we are in the scenario of normal data with a known mean and unknown variance, and the unknown variance has our default prior. Thus, we should know the full conditional is
$$\sigma^2 \mid \cdots \sim IG\left(\frac{n}{2}, \frac{1}{2} \sum_{g=1}^G \sum_{i=1}^{n_g} (y_{ig} - \theta_g)^2\right).$$
To derive the full conditional, use
$$p(\sigma^2 \mid \cdots) \propto \left[\prod_{g=1}^G \prod_{i=1}^{n_g} (\sigma^2)^{-1/2} \exp\left(-\frac{1}{2\sigma^2}(y_{ig} - \theta_g)^2\right)\right] \frac{1}{\sigma^2} = (\sigma^2)^{-n/2 - 1} \exp\left(-\frac{1}{2} \sum_{g=1}^G \sum_{i=1}^{n_g} (y_{ig} - \theta_g)^2 \Big/ \sigma^2\right)$$
which is the kernel of an $IG\left(\frac{n}{2}, \frac{1}{2} \sum_{g=1}^G \sum_{i=1}^{n_g} (y_{ig} - \theta_g)^2\right)$.
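A sketch of this draw in Python (an $IG(a, b)$ draw can be obtained as the reciprocal of a Gamma($a$, rate $= b$) draw; the helper name and data layout are mine, not the slides'):

```python
import numpy as np

def sample_sigma2(y, theta, group, rng):
    """Draw sigma^2 from its full conditional IG(n/2, SSE/2),
    where SSE = sum_g sum_i (y_ig - theta_g)^2 and group[k] gives
    the group index of observation y[k]."""
    resid = y - theta[group]               # y_ig - theta_g for every observation
    a = len(y) / 2.0                       # n/2
    b = np.sum(resid ** 2) / 2.0           # SSE/2
    return 1.0 / rng.gamma(shape=a, scale=1.0 / b)  # IG(a, b) via 1/Gamma(a, rate=b)
```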
10 Sampling $\mu, \tau^2$

Recall that $\theta_g \sim N(\mu, \tau^2)$ and $p(\mu, \tau^2) \propto Ca^+(\tau; 0, C)$. This is a non-standard distribution, but it is extremely close to a normal model with unknown mean and variance with the standard non-informative prior $p(\mu, \tau^2) \propto 1/\tau^2$ or the conjugate normal-inverse-gamma prior. Here are some options for sampling from this distribution:
- random-walk Metropolis (in 2 dimensions),
- independence Metropolis-Hastings using the posterior from the standard non-informative prior as the proposal, or
- rejection sampling using the posterior from the standard non-informative prior as the proposal.

The posterior under the standard non-informative prior is
$$\tau^2 \mid \cdots \sim \text{Inv-}\chi^2(G - 1, s_\theta^2) \quad \text{and} \quad \mu \mid \tau^2, \cdots \sim N(\bar\theta, \tau^2/G)$$
where $\bar\theta = \frac{1}{G}\sum_{g=1}^G \theta_g$ and $s_\theta^2 = \frac{1}{G-1} \sum_{g=1}^G (\theta_g - \bar\theta)^2$. What is the MH ratio?
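One way to work out that ratio, sketched in code: with the non-informative-prior posterior as an independence proposal, the likelihood cancels and only the two priors on $\tau^2$ remain, so the ratio reduces to $w(\tau^2_*)/w(\tau^2)$ with $w(\tau^2) = \sqrt{\tau^2}/(1 + \tau^2/C^2)$. That reduction is my derivation, not stated on the slide, and the function below is an illustrative sketch:

```python
import numpy as np

def mh_mu_tau2(theta, mu, tau2, C, rng):
    """One independence Metropolis-Hastings update of (mu, tau2).
    Proposal = posterior under p(mu, tau2) propto 1/tau2:
      tau2* ~ Inv-chi^2(G-1, s_theta^2), mu* | tau2* ~ N(thetabar, tau2*/G).
    Target prior on tau is half-Cauchy Ca+(0, C)."""
    G = len(theta)
    thetabar = theta.mean()
    s2 = theta.var(ddof=1)
    tau2_star = (G - 1) * s2 / rng.chisquare(G - 1)   # scaled inverse chi-square draw
    mu_star = rng.normal(thetabar, np.sqrt(tau2_star / G))
    w = lambda t2: np.sqrt(t2) / (1.0 + t2 / C**2)    # target/proposal weight in tau^2
    if rng.uniform() < min(1.0, w(tau2_star) / w(tau2)):
        return mu_star, tau2_star
    return mu, tau2
```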
11 Summary: Markov chain Monte Carlo for the normal hierarchical model

1. Sample $\theta \sim p(\theta \mid \cdots)$:
   a. For $g = 1, \ldots, G$, sample $\theta_g \sim N(\mu_g, \tau_g^2)$.
2. Sample $\mu, \sigma^2, \tau^2$:
   a. Sample $\sigma^2 \sim IG(n/2, SSE/2)$.
   b. Sample $(\mu, \tau^2)$ using independence Metropolis-Hastings with the posterior from the standard non-informative prior as the proposal.

What happens if $\theta_g \sim La(\mu, \tau)$ or $\theta_g \sim t_v(\mu, \tau^2)$?
12 Scale mixtures of normals

Recall that if
$$\theta \mid \phi \sim N(\phi, V) \quad \text{and} \quad \phi \sim N(m, C)$$
then
$$\theta \sim N(m, V + C).$$
This is called a location mixture. Now, if
$$\theta \mid \phi \sim N(m, C\phi)$$
and we assume a mixing distribution for $\phi$, we have a scale mixture. Since the top-level distributional assumption is normal, we refer to this as a scale mixture of normals.
13 t distribution

Let
$$\theta \mid \phi \sim N(m, \phi C) \quad \text{and} \quad \phi \sim IG(a, b),$$
then
$$p(\theta) = \int p(\theta \mid \phi)\, p(\phi)\, d\phi = \int (2\pi\phi C)^{-1/2} e^{-(\theta - m)^2/2\phi C}\, \frac{b^a}{\Gamma(a)}\, \phi^{-(a+1)} e^{-b/\phi}\, d\phi$$
$$= (2\pi C)^{-1/2} \frac{b^a}{\Gamma(a)} \int \phi^{-(a + 1/2 + 1)} e^{-[b + (\theta - m)^2/2C]/\phi}\, d\phi = (2\pi C)^{-1/2} \frac{b^a\, \Gamma(a + 1/2)}{\Gamma(a)} \left[b + (\theta - m)^2/2C\right]^{-(a + 1/2)}$$
$$= \frac{\Gamma([2a+1]/2)}{\Gamma(2a/2)\sqrt{2a\pi\, bC/a}} \left[1 + \frac{1}{2a} \frac{(\theta - m)^2}{bC/a}\right]^{-[2a+1]/2}.$$
Thus $\theta \sim t_{2a}(m, bC/a)$, i.e. $\theta$ has a t distribution with $2a$ degrees of freedom, location $m$, scale $bC/a$, and variance $\frac{bC}{a-1}$.
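A quick Monte Carlo check of this identity (a sketch; the particular values $a = b = 3$, $C = 1$, $m = 0$ are mine): draws built through the mixture should show the marginal t variance $bC/(a-1)$.

```python
import numpy as np

rng = np.random.default_rng(42)
m, C, a, b = 0.0, 1.0, 3.0, 3.0
n = 200_000

phi = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=n)  # phi ~ IG(a, b)
theta = rng.normal(m, np.sqrt(phi * C))                # theta | phi ~ N(m, phi C)

# Marginally theta ~ t_{2a}(m, bC/a), so Var[theta] = bC/(a-1) = 1.5 here.
print(theta.var())
```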
14 Hierarchical t distribution

Let $m = \mu$, $C = 1$, $a = v/2$, and $b = v\tau^2/2$, i.e.
$$\theta \mid \phi \sim N(\mu, \phi) \quad \text{and} \quad \phi \sim IG(v/2, v\tau^2/2).$$
Then we have $\theta \sim t_v(\mu, \tau^2)$, i.e. a t distribution with $v$ degrees of freedom, location $\mu$, and scale $\tau^2$. Notice that the parameterization has a redundancy between $C$ and $b/a$, i.e. we could have chosen $C = \tau^2$, $a = v/2$, and $b = v/2$ and we would have obtained the same marginal distribution for $\theta$.
15 Laplace distribution

Let
$$\theta \mid \phi \sim N(m, \phi C^2) \quad \text{and} \quad \phi \sim Exp(1/2b^2)$$
where $E[\phi] = 2b^2$ and $Var[\phi] = 4b^4$. Then, by an extension of equation (4) in Park and Casella (2008), we have
$$p(\theta) = \frac{1}{2Cb}\, e^{-\frac{|\theta - m|}{Cb}}.$$
This is the pdf for a Laplace (double exponential) distribution with location $m$ and scale $Cb$, which we write $\theta \sim La(m, Cb)$, with $E[\theta] = m$ and $Var[\theta] = 2[Cb]^2 = 2C^2b^2$.
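The analogous Monte Carlo check for the Laplace mixture (a sketch with illustrative values $m = 0$, $C = b = 1$): the mixture draws should show the Laplace variance $2(Cb)^2$ and mean absolute deviation $Cb$.

```python
import numpy as np

rng = np.random.default_rng(7)
m, C, b = 0.0, 1.0, 1.0
n = 200_000

phi = rng.exponential(scale=2 * b**2, size=n)  # phi ~ Exp(1/2b^2), so E[phi] = 2b^2
theta = rng.normal(m, np.sqrt(phi) * C)        # theta | phi ~ N(m, phi C^2)

# Marginally theta ~ La(m, Cb), so Var[theta] = 2 (Cb)^2 = 2 here.
print(theta.var())
```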
16 Hierarchical Laplace distribution

Let $m = \mu$, $C = 1$, and $b = \tau$, i.e.
$$\theta \mid \phi \sim N(\mu, \phi) \quad \text{and} \quad \phi \sim Exp(1/2\tau^2).$$
Then we have $\theta \sim La(\mu, \tau)$, i.e. a Laplace distribution with location $\mu$ and scale $\tau$. Notice that the parameterization has a redundancy between $C$ and $b$, i.e. we could have chosen $C = \tau$ and $b = 1$ and we would have obtained the same marginal distribution for $\theta$.
17 Normal hierarchical model

Recall our hierarchical model $Y_{ig} \sim N(\theta_g, \sigma^2)$ for $g = 1, \ldots, G$ and $i = 1, \ldots, n_g$. Now consider the following model assumptions:
$$\theta_g \mid \phi_g \sim N(\mu, \phi_g), \quad \phi_g = \tau^2 \implies \theta_g \sim N(\mu, \tau^2)$$
$$\theta_g \mid \phi_g \sim N(\mu, \phi_g), \quad \phi_g \sim Exp(1/2\tau^2) \implies \theta_g \sim La(\mu, \tau)$$
$$\theta_g \mid \phi_g \sim N(\mu, \phi_g), \quad \phi_g \sim IG(v/2, v\tau^2/2) \implies \theta_g \sim t_v(\mu, \tau^2)$$
For simplicity, let's assume $\sigma^2 \sim IG(a, b)$, $\mu \sim N(m, C)$, and $\tau \sim Ca^+(0, c)$, and that $\sigma^2$, $\mu$, and $\tau$ are a priori independent.
18 Gibbs sampling

The following Gibbs sampler will converge to the posterior $p(\theta, \sigma, \mu, \phi, \tau \mid y)$:
1. Independently, sample $\theta_g \sim p(\theta_g \mid \cdots)$.
2. Sample $\mu, \sigma, \tau \sim p(\mu, \sigma, \tau \mid \cdots)$:
   a. Sample $\mu \sim p(\mu \mid \cdots)$.
   b. Sample $\sigma \sim p(\sigma \mid \cdots)$.
   c. Sample $\tau \sim p(\tau \mid \cdots)$.
3. Independently, sample $\phi_g \sim p(\phi_g \mid \cdots)$.

Steps 1, 2a, and 2b will be the same for all models, but steps 2c and 3 will be different.
19 Sample $\theta$

$Y_{ig} \sim N(\theta_g, \sigma^2)$ and $\theta_g \sim N(\mu, \phi_g)$, so
$$p(\theta \mid \cdots) \propto \left[\prod_{g=1}^G \prod_{i=1}^{n_g} e^{-(y_{ig} - \theta_g)^2/2\sigma^2}\right] \left[\prod_{g=1}^G e^{-(\theta_g - \mu)^2/2\phi_g}\right] = \prod_{g=1}^G \left[\prod_{i=1}^{n_g} e^{-(y_{ig} - \theta_g)^2/2\sigma^2}\right] e^{-(\theta_g - \mu)^2/2\phi_g}.$$
Thus the $\theta_g$ are conditionally independent given everything else. It should be obvious that
$$\theta_g \mid \cdots \sim N\left(\left[\frac{1}{\phi_g} + \frac{n_g}{\sigma^2}\right]^{-1} \left[\frac{\mu}{\phi_g} + \frac{n_g \bar{y}_g}{\sigma^2}\right],\ \left[\frac{1}{\phi_g} + \frac{n_g}{\sigma^2}\right]^{-1}\right)$$
where $\bar{y}_g = \sum_{i=1}^{n_g} y_{ig} / n_g$.
20 Sample $\mu$

$\theta_g \sim N(\mu, \phi_g)$ and $\mu \sim N(m, C)$. Immediately, we should know that $\mu \mid \cdots \sim N(m', C')$ with
$$C' = \left(\frac{1}{C} + \sum_{g=1}^G \frac{1}{\phi_g}\right)^{-1}, \qquad m' = C'\left(\frac{m}{C} + \sum_{g=1}^G \frac{\theta_g}{\phi_g}\right).$$
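This conjugate update is one line of algebra in code (a sketch; the names are illustrative):

```python
import numpy as np

def sample_mu(theta, phi, m, C, rng):
    """Draw mu ~ N(m', C'), combining the prior N(m, C) with the
    G group-level observations theta_g ~ N(mu, phi_g)."""
    C_prime = 1.0 / (1.0 / C + np.sum(1.0 / phi))
    m_prime = C_prime * (m / C + np.sum(theta / phi))
    return rng.normal(m_prime, np.sqrt(C_prime))
```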
21 Sample $\sigma^2$

$Y_{ig} \sim N(\theta_g, \sigma^2)$ and $\sigma^2 \sim IG(a, b)$. This is just a normal data model with an unknown variance that has the conjugate prior. The only difficulty is that we have several groups here. But very quickly you should be able to determine that $\sigma^2 \mid \cdots \sim IG(a', b')$ where
$$a' = a + \frac{1}{2}\sum_{g=1}^G n_g = a + \frac{n}{2}, \qquad b' = b + \frac{1}{2}\sum_{g=1}^G \sum_{i=1}^{n_g} (y_{ig} - \theta_g)^2.$$
22 Distributional assumption for $\theta_g$

$Y_{ig} \sim N(\theta_g, \sigma^2)$ and $\theta_g \sim N(\mu, \phi_g)$ with
$$\phi_g = \tau^2, \qquad \phi_g \sim Exp(1/2\tau^2), \qquad \text{or} \qquad \phi_g \sim IG(v/2, v\tau^2/2).$$
The steps that are left are 1) sample $\phi$ and 2) sample $\tau^2$.
23 Sample $\phi$ for the normal model

For the normal model, $\phi_g = \tau^2$, so we will address this when we sample $\tau$.
24 Sample $\phi$ for the Laplace model

For the Laplace model, $\theta_g \sim N(\mu, \phi_g)$ and $\phi_g \sim Exp(1/2\tau^2)$, so the full conditional is
$$p(\phi \mid \cdots) \propto \prod_{g=1}^G N(\theta_g; \mu, \phi_g)\, Exp(\phi_g; 1/2\tau^2).$$
So the individual $\phi_g$ are conditionally independent with
$$p(\phi_g \mid \cdots) \propto N(\theta_g; \mu, \phi_g)\, Exp(\phi_g; 1/2\tau^2) \propto \phi_g^{-1/2}\, e^{-(\theta_g - \mu)^2/2\phi_g}\, e^{-\phi_g/2\tau^2}.$$
If we perform the transformation $\eta_g = 1/\phi_g$, we have
$$p(\eta_g \mid \cdots) \propto \eta_g^{-3/2}\, e^{-\frac{(\theta_g - \mu)^2}{2}\eta_g - \frac{1}{2\tau^2 \eta_g}}$$
which is the kernel of an inverse Gaussian distribution with mean $\sqrt{1/[\tau^2(\theta_g - \mu)^2]}$ and scale $1/\tau^2$, where the parameterization is such that the variance is $\mu^3/\lambda$ (different from the mgcv::rig parameterization).
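numpy's `wald` generator uses exactly this mean/scale parameterization (variance mean³/scale), so the update can be sketched as follows. The square-root form of the mean is my reading of the kernel matching above, consistent with the Bayesian-lasso update in Park and Casella (2008):

```python
import numpy as np

def sample_phi_laplace(theta, mu, tau, rng):
    """Laplace model: draw eta_g = 1/phi_g from an inverse Gaussian with
    mean 1/(tau |theta_g - mu|) and scale 1/tau^2, then return phi_g = 1/eta_g."""
    mean = 1.0 / (tau * np.abs(theta - mu))
    eta = rng.wald(mean, 1.0 / tau**2)
    return 1.0 / eta
```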
25 Sample $\phi$ for the t model

For the t model,
$$\theta_g \sim N(\mu, \phi_g) \quad \text{and} \quad \phi_g \sim IG(v/2, v\tau^2/2),$$
so we have
$$\phi_g \mid \cdots \sim IG\left(\frac{v + 1}{2}, \frac{v\tau^2 + (\theta_g - \mu)^2}{2}\right),$$
since this is just $G$ independent normal data models with a known mean and independent conjugate inverse gamma priors on the variance.
26 Sample $\tau$ for the normal model

Let $\theta_g \sim N(\mu, \tau^2)$ and $\tau \sim Ca^+(0, c)$, so the full conditional is
$$p(\eta \mid \cdots) \propto \eta^{-G/2}\, e^{-\sum_{g=1}^G (\theta_g - \mu)^2/2\eta} \left(1 + \eta/c^2\right)^{-1} \eta^{-1/2}$$
where we performed the transformation $\eta = \tau^2$ on the prior. Let's use Metropolis-Hastings with proposal distribution
$$\eta^* \sim IG\left(\frac{G - 1}{2}, \frac{1}{2}\sum_{g=1}^G (\theta_g - \mu)^2\right)$$
and acceptance probability $\min\{1, \rho\}$ where
$$\rho = \frac{(1 + \eta^*/c^2)^{-1}}{(1 + \eta^{(i)}/c^2)^{-1}} = \frac{1 + \eta^{(i)}/c^2}{1 + \eta^*/c^2}$$
where $\eta^{(i)}$ and $\eta^*$ are the current and proposed values, respectively. Then we calculate $\tau = \sqrt{\eta}$.
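This MH step can be sketched directly (working on $\eta = \tau^2$; the IG draw again uses the reciprocal-Gamma trick, and the function name is illustrative):

```python
import numpy as np

def sample_tau_normal(theta, mu, tau, c, rng):
    """One MH update of tau in the normal model, on the eta = tau^2 scale.
    Proposal: eta* ~ IG((G-1)/2, S/2) with S = sum_g (theta_g - mu)^2;
    accept with probability min(1, rho), rho = (1 + eta/c^2)/(1 + eta*/c^2)."""
    G = len(theta)
    S = np.sum((theta - mu) ** 2)
    eta = tau ** 2
    eta_star = 1.0 / rng.gamma(shape=(G - 1) / 2.0, scale=2.0 / S)  # IG draw
    rho = (1.0 + eta / c**2) / (1.0 + eta_star / c**2)
    if rng.uniform() < min(1.0, rho):
        return np.sqrt(eta_star)
    return tau
```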
27 Sample $\tau$ for the Laplace model

Let $\phi_g \sim Exp(1/2\tau^2)$ and $\tau \sim Ca^+(0, c)$, so the full conditional is
$$p(\eta \mid \cdots) \propto \eta^{-G}\, e^{-\sum_{g=1}^G \phi_g/2\eta} \left(1 + \eta/c^2\right)^{-1} \eta^{-1/2}.$$
Let's use Metropolis-Hastings with proposal distribution
$$\eta^* \sim IG\left(G - \frac{1}{2}, \frac{1}{2}\sum_{g=1}^G \phi_g\right)$$
and acceptance probability $\min\{1, \rho\}$ where again
$$\rho = \frac{1 + \eta^{(i)}/c^2}{1 + \eta^*/c^2}.$$
Then we calculate $\tau = \sqrt{\eta}$.
28 Sample $\tau$ for the t model

Let $\phi_g \sim IG(v/2, v\tau^2/2)$ and $\tau \sim Ca^+(0, c)$, so the full conditional is
$$p(\eta \mid \cdots) \propto \eta^{Gv/2}\, e^{-\frac{v\eta}{2}\sum_{g=1}^G \frac{1}{\phi_g}} \left(1 + \eta/c^2\right)^{-1} \eta^{-1/2}.$$
Let's use Metropolis-Hastings with proposal distribution
$$\eta^* \sim Ga\left(\frac{Gv + 1}{2}, \frac{v}{2}\sum_{g=1}^G \frac{1}{\phi_g}\right)$$
and acceptance probability $\min\{1, \rho\}$ where again
$$\rho = \frac{1 + \eta^{(i)}/c^2}{1 + \eta^*/c^2}.$$
Then we calculate $\tau = \sqrt{\eta}$.
29 Dealing with point-mass distributions

We would also like to consider models with
$$\theta_g \sim \pi \delta_0 + (1 - \pi) N(\mu, \phi_g)$$
where $\phi_g = \tau^2$ corresponds to a normal and $\phi_g \sim IG(v/2, v\tau^2/2)$ corresponds to a t distribution for the non-zero $\theta_g$. Similar to the previous, the $\theta_g$ are conditionally independent. To sample $\theta_g$, we calculate
$$\tilde\pi = \frac{\pi \prod_{i=1}^{n_g} N(y_{ig}; 0, \sigma^2)}{\pi \prod_{i=1}^{n_g} N(y_{ig}; 0, \sigma^2) + (1 - \pi) \prod_{i=1}^{n_g} N(y_{ig}; \mu, \phi_g + \sigma^2)},$$
$$\phi_g' = \left(\frac{1}{\phi_g} + \frac{n_g}{\sigma^2}\right)^{-1}, \qquad \mu_g' = \phi_g'\left(\frac{\mu}{\phi_g} + \frac{n_g \bar{y}_g}{\sigma^2}\right),$$
then set $\theta_g = 0$ with probability $\tilde\pi$ and otherwise draw $\theta_g \sim N(\mu_g', \phi_g')$.
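A sketch of the implied $\theta_g$ draw (following the slide's product form for the marginal likelihoods; the helper names are mine):

```python
import numpy as np

def norm_pdf(x, mean, var):
    """Normal density N(x; mean, var)."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def sample_theta_g_pointmass(y_g, pi, mu, phi_g, sigma2, rng):
    """Set theta_g = 0 with probability pi-tilde, otherwise draw from
    the non-zero component's full conditional N(mu_g', phi_g')."""
    n_g = len(y_g)
    null = pi * np.prod(norm_pdf(y_g, 0.0, sigma2))
    slab = (1.0 - pi) * np.prod(norm_pdf(y_g, mu, phi_g + sigma2))
    pi_tilde = null / (null + slab)
    if rng.uniform() < pi_tilde:
        return 0.0
    phi_p = 1.0 / (1.0 / phi_g + n_g / sigma2)
    mu_p = phi_p * (mu / phi_g + n_g * np.mean(y_g) / sigma2)
    return rng.normal(mu_p, np.sqrt(phi_p))
```

Groups whose data sit near zero should mostly return exact zeros, while groups whose data sit near $\mu$ should mostly return non-zero draws.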
30 Dealing with point-mass distributions (cont.)

Let
$$\theta_g \sim \pi \delta_0 + (1 - \pi) f(\theta_g) \quad \text{and} \quad \pi \sim Beta(s, f).$$
The full conditional for $\pi$ is
$$\pi \mid \cdots \sim Beta\left(s + \sum_{g=1}^G I(\theta_g = 0),\ f + \sum_{g=1}^G I(\theta_g \ne 0)\right)$$
and $\mu$ and $\phi_g$ get updated using only those $\theta_g$ that are non-zero.
31 Dealing with point-mass distributions (cont.)

Updating $\mu$, $\phi_g$, and $\tau$ will be exactly the same as it was before EXCEPT you will only use $g$ such that $\theta_g \ne 0$. So $\mu \mid \cdots \sim N(m', C')$ with
$$C' = \left(\frac{1}{C} + \sum_{g: \theta_g \ne 0} \frac{1}{\phi_g}\right)^{-1}, \qquad m' = C'\left(\frac{m}{C} + \sum_{g: \theta_g \ne 0} \frac{\theta_g}{\phi_g}\right),$$
$\tau$ will still require a MH step, but the proposal should only depend on $\{g : \theta_g \ne 0\}$, and $\phi_g$ will be exactly as before when $\theta_g \ne 0$; when $\theta_g = 0$, it actually doesn't matter what distribution you are sampling from (Carlin and Chib, 1995).
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationMCMC 2: Lecture 2 Coding and output. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham
MCMC 2: Lecture 2 Coding and output Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. General (Markov) epidemic model 2. Non-Markov epidemic model 3. Debugging
More informationDeblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J.
Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox fox@physics.otago.ac.nz Richard A. Norton, J. Andrés Christen Topics... Backstory (?) Sampling in linear-gaussian hierarchical
More information10. Exchangeability and hierarchical models Objective. Recommended reading
10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.
More informationBayesian model selection: methodology, computation and applications
Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program
More informationMCMC for Cut Models or Chasing a Moving Target with MCMC
MCMC for Cut Models or Chasing a Moving Target with MCMC Martyn Plummer International Agency for Research on Cancer MCMSki Chamonix, 6 Jan 2014 Cut models What do we want to do? 1. Generate some random
More informationCSC 2541: Bayesian Methods for Machine Learning
CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown
More informationCross-sectional space-time modeling using ARNN(p, n) processes
Cross-sectional space-time modeling using ARNN(p, n) processes W. Polasek K. Kakamu September, 006 Abstract We suggest a new class of cross-sectional space-time models based on local AR models and nearest
More informationMarkov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017
Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationMCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham
MCMC 2: Lecture 3 SIR models - more topics Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. What can be estimated? 2. Reparameterisation 3. Marginalisation
More informationMH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution
MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous
More informationSpring 2006: Examples: Laplace s Method; Hierarchical Models
36-724 Spring 2006: Examples: Laplace s Method; Hierarchical Models Brian Junker February 13, 2006 Second-order Laplace Approximation for E[g(θ) y] An Analytic Example Hierarchical Models Example of Hierarchical
More informationBayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence
Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns
More informationTheory of Stochastic Processes 8. Markov chain Monte Carlo
Theory of Stochastic Processes 8. Markov chain Monte Carlo Tomonari Sei sei@mist.i.u-tokyo.ac.jp Department of Mathematical Informatics, University of Tokyo June 8, 2017 http://www.stat.t.u-tokyo.ac.jp/~sei/lec.html
More informationIntroduction to Bayesian Methods. Introduction to Bayesian Methods p.1/??
to Bayesian Methods Introduction to Bayesian Methods p.1/?? We develop the Bayesian paradigm for parametric inference. To this end, suppose we conduct (or wish to design) a study, in which the parameter
More information7. Estimation and hypothesis testing. Objective. Recommended reading
7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing
More informationLecture 7 and 8: Markov Chain Monte Carlo
Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani
More informationBAYESIAN MODEL CRITICISM
Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationHierarchical Bayesian approaches for robust inference in ARX models
Hierarchical Bayesian approaches for robust inference in ARX models Johan Dahlin, Fredrik Lindsten, Thomas Bo Schön and Adrian George Wills Linköping University Post Print N.B.: When citing this work,
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationMCMC and Gibbs Sampling. Sargur Srihari
MCMC and Gibbs Sampling Sargur srihari@cedar.buffalo.edu 1 Topics 1. Markov Chain Monte Carlo 2. Markov Chains 3. Gibbs Sampling 4. Basic Metropolis Algorithm 5. Metropolis-Hastings Algorithm 6. Slice
More informationIntroduction to Markov Chain Monte Carlo & Gibbs Sampling
Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu
More information