Part 7: Hierarchical Modeling
1 Part 7: Hierarchical Modeling
2 Nested data It is common for data to be nested: i.e., observations on subjects are organized by a hierarchy. Such data are often called hierarchical or multilevel. For example: patients within several hospitals; genes within a group of animals; or people within counties within regions within countries.
3 Two levels The simplest type of multilevel data has two levels, in which one level consists of groups and the other consists of units within groups. In this case, we denote by y_{i,j} the data on the i-th unit within group j.
4 Hierarchical model The sampling model should reflect/acknowledge the hierarchy, so that we may distinguish between within-group variability and between-group variability. One typically uses the following hierarchical model, for j = 1,...,m groups with n_j observations in each group:
{Y_{1,j},...,Y_{n_j,j} | φ_j} ~ iid p(y | φ_j)   (within-group sampling variability)
{φ_1,...,φ_m | ψ} ~ iid p(φ | ψ)   (between-group sampling variability)
ψ ~ p(ψ)   (prior distribution)
5 Variability accounting It is important to recognize that the distributions p(y | φ) and p(φ | ψ) both represent sampling variability among populations of objects: p(y | φ) represents variability among measurements within a group, and p(φ | ψ) represents variability across groups. In contrast, p(ψ) represents information about a single fixed but unknown quantity. The first two are sampling distributions; the data are used to estimate the φ_j's and ψ; but p(ψ) is not estimated.
6 Hierarchical normal model A popular model for describing the heterogeneity of means across several populations is the hierarchical normal model, in which the within- and between-group sampling models are both normal:
φ_j = (μ_j, σ²),  p(y | φ_j) = N(μ_j, σ²)   (within-group model)
ψ = (μ, τ²),  p(μ_j | ψ) = N(μ, τ²)   (between-group model)
Note that p(μ_j | ψ) only describes heterogeneity across group means, and not any heterogeneity in group-specific variances. The within-group sampling variability σ² is assumed to be constant across groups.
7 Hierarchical diagram
(μ, τ²)
μ_1   μ_2   ···   μ_{m−1}   μ_m
Y_1   Y_2   ···   Y_{m−1}   Y_m
8 The priors The fixed but unknown parameters in the model are μ, σ², and τ². For convenience we will use the standard semiconjugate normal and inverse-gamma (IG) priors for these parameters:
σ² ~ IG(ν₀/2, ν₀σ₀²/2)
τ² ~ IG(η₀/2, η₀τ₀²/2)
μ ~ N(μ₀, γ₀²)
9 Posterior inference The full set of unknown quantities in our system includes the group-specific means {μ_1,...,μ_m}, the within-group sampling variability σ², and the mean and variance (μ, τ²) of the population of group-specific means. Posterior inference for these parameters can be made by Gibbs sampling (GS), which approximates the joint posterior distribution p(μ_1,...,μ_m, μ, τ², σ² | y_1,...,y_m). GS proceeds by iteratively sampling each parameter from its full conditional distribution.
10 Full conditionals Deriving the full conditional distributions in this highly parameterized system may seem like a daunting task. But it turns out that we have already worked out all of the necessary technical details. All that is required is that we recognize certain analogies between the current model and the univariate normal model.
11 Posterior essence The following factorization of the posterior distribution will be useful:
p(μ_1,...,μ_m, μ, τ², σ² | y_1,...,y_m)
∝ p(y_1,...,y_m | μ_1,...,μ_m, μ, τ², σ²) p(μ_1,...,μ_m | μ, τ², σ²) p(μ, τ², σ²)
= { ∏_{j=1}^m ∏_{i=1}^{n_j} p(y_{i,j} | μ_j, σ²) } { ∏_{j=1}^m p(μ_j | μ, τ²) } p(μ) p(τ²) p(σ²)
The first term is the result of an important conditional independence feature of our model. This relates back to our diagram...
12 Conditional independence
(μ, τ²)
μ_1   μ_2   ···   μ_{m−1}   μ_m
Y_1   Y_2   ···   Y_{m−1}   Y_m
The existence of a path from (μ, τ²) to each Y_j indicates that these parameters provide information about Y_j, but only indirectly through μ_j. Conditional on {μ_1,...,μ_m, μ, τ², σ²}, the random variables Y_{1,j},...,Y_{n_j,j} are independent with a distribution that depends only on μ_j and σ².
13 Full conditional for (μ, τ²) As a function of μ and τ², the posterior distribution is proportional to
∏_{j=1}^m p(μ_j | μ, τ²) p(μ) p(τ²)
And so the full conditional distributions of μ and τ² are also proportional to this quantity:
p(μ | μ_1,...,μ_m, τ², σ², y_1,...,y_m) ∝ ∏_j p(μ_j | μ, τ²) p(μ)
p(τ² | μ_1,...,μ_m, μ, σ², y_1,...,y_m) ∝ ∏_j p(μ_j | μ, τ²) p(τ²)
14 Full conditional for (μ, τ²) These conditionals are exactly the full conditional distributions from the one-sample normal problem. Therefore, by analogy, with μ̄ = (1/m) Σ_j μ_j:
{μ | μ_1,...,μ_m, τ²} ~ N( (m μ̄/τ² + μ₀/γ₀²) / (m/τ² + 1/γ₀²), [m/τ² + 1/γ₀²]^{−1} )
{τ² | μ_1,...,μ_m, μ} ~ IG( (η₀ + m)/2, (η₀τ₀² + Σ_j (μ_j − μ)²)/2 )
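The normal update above is just a precision-weighted combination of the prior and the group means. The slides use R, but as a minimal sketch (function and argument names are illustrative, not from the slides), the conditional mean and variance of μ can be computed as:

```python
import numpy as np

def mu_full_conditional(mus, tau2, mu0, gamma2_0):
    """Conditional posterior mean and variance of mu given group means and tau^2."""
    m = len(mus)
    var = 1.0 / (m / tau2 + 1.0 / gamma2_0)                 # [m/tau^2 + 1/gamma_0^2]^{-1}
    mean = var * (m * np.mean(mus) / tau2 + mu0 / gamma2_0)  # precision-weighted average
    return mean, var
```

With a flat enough prior (large γ₀²), the conditional mean approaches the average of the group means, as expected.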
15 Full conditional for μ_j Likewise, the full conditional distribution of μ_j must be proportional to
p(μ_j | μ, τ², σ², μ_{(−j)}, y_1,...,y_m) ∝ ∏_{i=1}^{n_j} p(y_{i,j} | μ_j, σ²) p(μ_j | μ, τ²)
μ_j is conditionally independent of {μ_k, y_k} for k ≠ j: while there is a path from each μ_j to every other μ_k in the diagram, the paths go through (μ, τ²).
16 Full conditional for μ_j We can think of this as meaning that the μ_j's contribute no information about each other beyond that contained in μ, τ², and σ². The terms in our conditional are
∏_{i=1}^{n_j} p(y_{i,j} | μ_j, σ²)   (product of normals)
p(μ_j | μ, τ²)   (normal)
Mathematically, this is the same setup as our one-sample normal model. So by analogy, the full conditional distribution is
{μ_j | μ, τ², σ², y_{1,j},...,y_{n_j,j}} ~ N( (n_j ȳ_j/σ² + μ/τ²) / (n_j/σ² + 1/τ²), [n_j/σ² + 1/τ²]^{−1} )
17 Full conditional for σ² By similar arguments, σ² is conditionally independent of {μ, τ²} given {y_1,...,y_m, μ_1,...,μ_m}. The derivation of the full conditional of σ² is similar to that in the one-sample normal model, except that now we have information about σ² from m separate groups:
p(σ² | μ_1,...,μ_m, y_1,...,y_m) ∝ ∏_{j=1}^m ∏_{i=1}^{n_j} p(y_{i,j} | μ_j, σ²) p(σ²)
∝ (σ²)^{−Σ_j n_j/2} e^{−Σ_j Σ_i (y_{i,j}−μ_j)²/(2σ²)} (σ²)^{−ν₀/2−1} e^{−ν₀σ₀²/(2σ²)}
18 Full conditional for σ² Adding powers of σ² and collecting terms in the exponent, we recognize that this is an IG:
{σ² | μ_1,...,μ_m, y_1,...,y_m} ~ IG( (ν₀ + Σ_{j=1}^m n_j)/2, (ν₀σ₀² + Σ_{j=1}^m Σ_{i=1}^{n_j} (y_{i,j} − μ_j)²)/2 )
Note that Σ_j Σ_i (y_{i,j} − μ_j)² is the sum of squared residuals across all groups, conditional on the within-group means. So the conditional distribution concentrates probability around a pooled-sample estimate of the variance.
19 Example: Math scores in US public schools Consider data that are part of the 2002 Educational Longitudinal Study (ELS), a survey of students from a large sample of schools in the United States. The data consist of math scores of 10th-grade students at 100 different urban public high schools, each with a 10th-grade enrollment of 400 or more. The scores are based on a national exam, standardized to produce a nationwide mean of 50 and standard deviation of 10.
20 Example: the data Scores for students within the same school are plotted along a common vertical bar. [Figure: math score versus rank of school-specific average math score.]
21 Example: the data The range of average scores (36, 65) is quite large. Extreme sample averages occur for schools with small sample sizes. This is a common relationship in hierarchical datasets. [Figures: histogram of the school averages ȳ_j; plot of ȳ_j versus sample size n_j.]
22 Example: priors The prior parameters that we need to specify are (μ₀, γ₀²) for p(μ), (ν₀, σ₀²) for p(σ²), and (η₀, τ₀²) for p(τ²). The exam was designed to give a variance of 100, both within and between schools, so the within-school variance should be at most 100, which we take as σ₀². This is likely an overestimate, so we take a weak prior concentration around this value with ν₀ = 1.
23 Example: priors, ctd. Similarly, the between-school variance should not be more than 100, so we likewise take (η₀, τ₀²) = (1, 100). Finally, the nationwide mean over all schools is 50. Although the mean for large urban public schools may be different than the nationwide average, it should not differ by too much. We shall take μ₀ = 50 and γ₀² = 25, so that the prior probability that μ ∈ (40, 60) is about 95%.
24 Example: Gibbs sampling Posterior approximation may now proceed by GS. Given a current state of the unknowns {μ_1^(s),...,μ_m^(s), μ^(s), τ²^(s), σ²^(s)}, a new state may be generated as follows:
1. sample μ^(s+1) ~ p(μ | μ_1^(s),...,μ_m^(s), τ²^(s))
2. sample τ²^(s+1) ~ p(τ² | μ_1^(s),...,μ_m^(s), μ^(s+1))
3. sample σ²^(s+1) ~ p(σ² | μ_1^(s),...,μ_m^(s), y_1,...,y_m)
4. for each j ∈ {1,...,m}, sample μ_j^(s+1) ~ p(μ_j | μ^(s+1), τ²^(s+1), σ²^(s+1), y_j)
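The four steps above can be put together in a short sampler. The course's own code is in R; this is a hedged Python/NumPy sketch (function name, data layout, and default hyperparameters are illustrative; the defaults mirror the priors chosen for the math-score example):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs(y_groups, S=2000, mu0=50, g2_0=25, nu0=1, s2_0=100, eta0=1, t2_0=100):
    """Gibbs sampler for the hierarchical normal model (sketch).
    y_groups: list of 1-d arrays, one per group."""
    m = len(y_groups)
    n = np.array([len(y) for y in y_groups])
    ybar = np.array([np.mean(y) for y in y_groups])
    # initialize at sample statistics
    mus = ybar.copy()
    s2 = np.mean([np.var(y, ddof=1) for y in y_groups])
    mu, t2 = ybar.mean(), np.var(ybar, ddof=1)
    out = {"mu": [], "s2": [], "t2": [], "mus": []}
    for _ in range(S):
        # 1. mu | mus, t2
        v = 1 / (m / t2 + 1 / g2_0)
        mu = rng.normal(v * (m * mus.mean() / t2 + mu0 / g2_0), np.sqrt(v))
        # 2. t2 | mus, mu   (inverse-gamma via 1/Gamma(shape, scale))
        t2 = 1 / rng.gamma((eta0 + m) / 2, 2 / (eta0 * t2_0 + np.sum((mus - mu) ** 2)))
        # 3. s2 | mus, y    (pooled sum of squared residuals)
        ss = sum(np.sum((y - mj) ** 2) for y, mj in zip(y_groups, mus))
        s2 = 1 / rng.gamma((nu0 + n.sum()) / 2, 2 / (nu0 * s2_0 + ss))
        # 4. mu_j | mu, t2, s2, y_j  (vectorized over groups)
        vj = 1 / (n / s2 + 1 / t2)
        mus = rng.normal(vj * (n * ybar / s2 + mu / t2), np.sqrt(vj))
        out["mu"].append(mu); out["s2"].append(s2)
        out["t2"].append(t2); out["mus"].append(mus.copy())
    return {k: np.array(v) for k, v in out.items()}
```

Note the inverse-gamma draws: sampling g ~ Gamma(a, scale = 1/b) and returning 1/g gives an IG(a, b) draw, which is why the scale arguments are 2/(...) when the IG rate is (...)/2.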
25 Example: Gibbs sampling The order in which the new parameters are updated does not matter. What does matter is that each parameter is updated conditional upon the most current value of the other parameters. Checking for mixing or convergence problems with so many parameters can be difficult. Calculating the effective sample size (ESS) and plotting traces is the state of the art.
26 Example: Posterior summaries [Figure: marginal posterior densities of μ, σ, and τ, with posterior means of roughly σ ≈ 9.1 and τ ≈ 4.97.] So 95% of the scores within a school are within about 4σ ≈ 36 points of each other, whereas 95% of the average school scores are within about 4τ ≈ 20 points of each other.
27 Example: Shrinkage One of the motivations behind hierarchical modeling is that information can be shared across groups. Recall that, conditional on μ, τ², σ², and the data, the expected value of μ_j is a weighted average of ȳ_j and μ:
E{μ_j | y_j, μ, τ², σ²} = (ȳ_j n_j/σ² + μ/τ²) / (n_j/σ² + 1/τ²)
As a result, the expected value of μ_j is pulled a bit from ȳ_j towards μ by an amount depending upon n_j. This effect is called shrinkage.
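The weighted-average form makes the role of n_j explicit. A small sketch in Python (names are illustrative) showing that the weight on the group sample mean grows with n_j:

```python
def shrunk_mean(ybar_j, n_j, s2, mu, t2):
    """Conditional posterior mean of mu_j: a precision-weighted average
    of the group sample mean ybar_j and the population mean mu."""
    w = (n_j / s2) / (n_j / s2 + 1 / t2)  # weight on the group sample mean
    return w * ybar_j + (1 - w) * mu
```

For a fixed sample mean, a group with n_j = 5 is pulled much closer to μ than a group with n_j = 100, which is exactly the school-ranking behavior discussed below.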
28 Example: Shrinkage Consider the relationship between ȳ_j and μ̂_j = E{μ_j | y_j, μ, τ², σ²} for j = 1,...,m, obtained via our MCMC method. [Figure: posterior means of the μ_j's (apply(mu, 2, mean) in R) plotted against ȳ_j.] Notice that the relationship follows a line with a slope that is less than one, indicating that high values of ȳ_j correspond to slightly less high values of μ̂_j, and vice-versa for low values.
29 Example: Shrinkage It is also interesting to observe the shrinkage as a function of the group-specific sample size. [Figure: ȳ_j minus the posterior mean of μ_j, plotted against n_j.] Groups with low sample sizes get shrunk the most, whereas groups with large sample sizes hardly get shrunk at all. This makes sense: the larger the sample size, the more information we have for that group, and the less information we need to borrow from the rest of the population.
30 Example: Ranking schools Suppose our task is to rank the schools according to what we think their performances would be if every student in each school took the math exam. In this case, it makes sense to rank the schools according to the school-specific posterior expectations {μ̂_1,...,μ̂_m}. Alternatively, we could ignore the results of the hierarchical model and just use the school-specific averages {ȳ_1,...,ȳ_m}. The two methods will give similar, but not identical, rankings.
31 Example: Ranking schools Consider the posterior distributions of μ_46 and μ_82. Both schools have exceptionally low sample means, in the bottom 10% of all schools. [Figure: posterior densities for school 46 (n_46 = 21) and school 82 (n_82 = 5), with the data averages marked.]
32 Example: Ranking schools So school 82 is ranked lower than school 46 by the data averages, but higher by the posterior group means. Does this make sense? The posterior density for school 46 is more peaked because it was derived from a much larger sample size. Therefore our degree of certainty about μ_46 is much higher than that for μ_82. How does this translate into lack of faith in the data averages, if we are going to justify using the posterior means instead?
33 Example: Hypothetical absence Suppose that on the day of the exam the student who got the lowest exam score in school 82 did not come to class. Then the sample mean for school 82 would have changed from 38.76 by more than three points (and a change in ranking). In contrast, if the lowest-performing student in school 46 had not shown up, then ȳ_46 would have been 40.90 as opposed to 40.18, a change of only 3/4 of a point.
34 Example: Hypothetical absence In other words, the low value of the sample mean for school 82 can be explained either by μ_82 being very low, or just by the possibility that a few of the 5 sampled students were among the poorer-performing students in the school. In contrast, for school 46 this latter possibility cannot explain the low value of the sample mean, because of the large sample size. Therefore it makes sense to shrink the expectation of school 82 towards the population expectation by a greater amount than for school 46.
35 Example: Is it fair? To some people this reversal of rankings may seem strange or unfair. While fairness may be debated, the hierarchical model reflects the objective fact that there is more evidence that μ_46 is exceptionally low than there is evidence that μ_82 is exceptionally low. There are many other real-life situations where differing amounts of evidence result in a switch of ranking.
36 Hierarchical variances If the population means vary across groups, shouldn't we allow for the possibility that the population variances also vary across groups? Letting σ²_j be the variance for group j, our sampling model would then become
Y_{1,j},...,Y_{n_j,j} ~ iid N(μ_j, σ²_j)
and our full conditional for each μ_j would be
{μ_j | y_{1,j},...,y_{n_j,j}, σ²_j, μ, τ²} ~ N( (n_j ȳ_j/σ²_j + μ/τ²) / (n_j/σ²_j + 1/τ²), [n_j/σ²_j + 1/τ²]^{−1} )
37 Estimating σ²_j How does σ²_j get estimated? Suppose we were to specify that
σ²_1,...,σ²_m ~ iid IG(ν₀/2, ν₀σ₀²/2)
Then we can use our familiar result that the full conditional distribution of σ²_j is
{σ²_j | y_{1,j},...,y_{n_j,j}, μ_j, ν₀, σ₀²} ~ IG( (ν₀ + n_j)/2, (ν₀σ₀² + Σ_i (y_{i,j} − μ_j)²)/2 )
and estimation for σ²_1,...,σ²_m can proceed by iteratively sampling their values along with the other parameters via GS.
38 Inefficient independence If ν₀ and σ₀² are fixed in advance at some particular values, then our iid IG prior on the variances implies that, e.g., p(σ²_m | σ²_1,...,σ²_{m−1}) = p(σ²_m), and so the information we may have about σ²_1,...,σ²_{m−1} is not used to help us estimate σ²_m. This seems inefficient: if n_m was small, and we saw that σ²_1,...,σ²_{m−1} were tightly concentrated around a particular value, then we would want to use this fact to improve our estimation of σ²_m.
39 Heterogeneity in variances In other words, we want to be able to learn about the sampling distribution of the σ²_j's and use this information to improve our estimation for groups that may have low sample sizes. This can be done by treating ν₀ and σ₀² as parameters to be estimated, by thinking of
σ²_1,...,σ²_m ~ iid IG(ν₀/2, ν₀σ₀²/2)
as a sampling model for across-group heterogeneity in population variances (not as a prior distribution).
40 Hierarchical diagram Putting this together with our model for heterogeneity in population means gives
(μ, τ²)                                  (ν₀, σ₀²)
μ_1  μ_2  ···  μ_{m−1}  μ_m      σ²_1  σ²_2  ···  σ²_{m−1}  σ²_m
              Y_1  Y_2  ···  Y_{m−1}  Y_m
41 Unknown parameters Now, the unknown parameters to be estimated include:
{(μ_1, σ²_1),...,(μ_m, σ²_m)}, representing the within-group sampling distributions
{μ, τ²}, representing the across-group heterogeneity in means
{ν₀, σ₀²}, representing the across-group heterogeneity in variances
42 Gibbs sampling As before, the joint posterior distribution for all of these parameters can be approximated by iteratively sampling each parameter from its full conditional distribution given the others. The full conditional distributions for μ and τ² are unchanged in this new setup. We just derived the full conditionals for μ_j and σ²_j. What remains is to specify prior distributions for ν₀ and σ₀², and obtain their full conditionals.
43 Conjugate prior for σ₀² A conjugate class of prior densities for σ₀² are the gamma densities. If σ₀² ~ Gamma(a, b), then it is straightforward to show that
{σ₀² | σ²_1,...,σ²_m, ν₀} ~ Gamma( a + mν₀/2, b + (ν₀/2) Σ_{j=1}^m 1/σ²_j )
Notice that for small values of a and b, the conditional mean of σ₀² is approximately the harmonic mean of σ²_1,...,σ²_m.
44 No conjugate prior for ν₀ A simple conjugate prior for ν₀ does not exist. We shall restrict ν₀ to be a positive integer, so that it may retain its interpretation as a prior sample size. If we let the prior on ν₀ be the geometric distribution on {1, 2, ...}, so that p(ν₀) ∝ e^{−αν₀}, then the full conditional distribution is proportional to
p(ν₀ | σ₀², σ²_1,...,σ²_m) ∝ p(σ²_1,...,σ²_m | ν₀, σ₀²) p(ν₀)
∝ ( (ν₀σ₀²/2)^{ν₀/2} / Γ(ν₀/2) )^m ( ∏_{j=1}^m 1/σ²_j )^{ν₀/2+1} exp( −ν₀ ( α + (σ₀²/2) Σ_j 1/σ²_j ) )
45 Awkward conditional for ν₀ While this is not pretty, this un-normalized probability density can be computed for a large range of ν₀-values and then sampled from. In fact, since the interpretation of ν₀ is that it is the prior sample size for the σ²_j's, it is reasonable to restrict it to be smaller than the data sample size N = Σ_{j=1}^m n_j. The expression for p(ν₀ | σ₀², σ²_1,...,σ²_m) may easily be evaluated (and re-normalized) over this range.
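Evaluating and re-normalizing the expression over a grid is a few lines of code. The slides do this with R's sample function; the sketch below does the same in Python/NumPy (function names and the grid cutoff nmax are illustrative), working in log space for numerical stability:

```python
import math
import numpy as np

def nu0_full_conditional(s2_groups, s2_0, alpha=1.0, nmax=100):
    """Un-normalized full conditional of nu_0 on the grid {1,...,nmax},
    returned as a normalized probability vector (sketch)."""
    s2 = np.asarray(s2_groups, dtype=float)
    m = len(s2)
    nus = np.arange(1, nmax + 1)
    lgam = np.array([math.lgamma(v / 2.0) for v in nus])
    # log of: [ (nu*s2_0/2)^{nu/2} / Gamma(nu/2) ]^m * (prod 1/s2_j)^{nu/2+1}
    #         * exp( -nu*(alpha + (s2_0/2) * sum 1/s2_j) )
    logp = (m * (nus / 2.0 * np.log(nus * s2_0 / 2.0) - lgam)
            + (nus / 2.0 + 1.0) * np.sum(np.log(1.0 / s2))
            - nus * (alpha + (s2_0 / 2.0) * np.sum(1.0 / s2)))
    p = np.exp(logp - logp.max())  # subtract the max before exponentiating
    return nus, p / p.sum()

def sample_nu0(s2_groups, s2_0, alpha=1.0, nmax=100, rng=None):
    """Draw one nu_0 value from the discrete full conditional."""
    rng = rng or np.random.default_rng(0)
    nus, p = nu0_full_conditional(s2_groups, s2_0, alpha, nmax)
    return int(rng.choice(nus, p=p))
```

Subtracting the maximum log-probability before exponentiating avoids overflow/underflow when the grid is wide, and the normalization constant cancels anyway.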
46 Example: a particular full conditional for ν₀ Consider the full conditional for ν₀, conditional upon σ₀² = 100 and σ²_j = 85 for all j = 1,...,m. [Figure: probability mass function over ν₀-values.]
47 Gibbs sampler Therefore, we may complete the GS algorithm by sampling from the full conditional for ν₀, which is a discrete distribution over a finite set of values {1, 2, ..., N}. This may be accomplished with the sample function in R.
48 Example: math score data Let us re-analyze the math score data with our hierarchical model for school-specific means and variances. We shall take the parameters in our prior distributions to be the same as before, with the addition of α = 1 and {a = 1, b = 1/100} for the prior distributions on ν₀ and σ₀², respectively. We shall run the GS algorithm for 5,000 iterations, as before, and compare our results to the hierarchical model with a common group variance.
49 Example: group means and variances [Figures: marginal posterior densities of μ and τ under the equal-variances and unequal-variances models; posterior mass function of ν₀; posterior density of σ₀².]
50 Example: group variances Our first model, in which all within-group variances are forced to be equal, is equivalent to a value of ν₀ = ∞ in our current hierarchical model. In contrast, a low value of ν₀ = 1 would indicate that the variances are quite unequal, and little information about variances could be shared across groups. The posterior summaries from the GS output indicate that neither of these extremes is appropriate, as the posterior distribution is concentrated around a moderate value of 14 or 15.
51 Example: group variances This estimated distribution of σ²_1,...,σ²_m is used to shrink extreme sample variances towards an across-group center. [Figures: posterior means of the σ²_j's (apply(sigma2, 2, mean) in R) against the sample variances s²_j; shrinkage s²_j minus posterior mean against n_j.] The larger amounts of shrinkage generally occur for groups with smaller sample sizes.
52 Hierarchical binomial model Another commonly used hierarchical model is the beta-binomial model, where
Y_j | θ_j ~ Bin(n_j, θ_j)
θ_1,...,θ_m | α, β ~ iid Beta(α, β)
(α, β) ~ p(α, β)
Conditional on α and β, the posterior conditional for θ_j is the familiar beta distribution:
θ_j | Y_j, α, β ~ Beta(α + y_j, β + n_j − y_j)
53 Hierarchical diagram
(α, β)
θ_1   θ_2   ···   θ_{m−1}   θ_m
Y_1   Y_2   ···   Y_{m−1}   Y_m
As usual, we treat n_1,...,n_m as known.
54 Hyperprior Unfortunately, there is no (semi-)conjugate prior for α and β. However, it is possible to set up a non-informative hyperprior that is dominated by the likelihood and yields a proper posterior distribution, which leads to a convenient Metropolis-within-Gibbs sampling method.
55 A hyperprior choice A reasonable choice of diffuse hyperprior for the beta-binomial hierarchical model is uniform on
( α/(α + β), (α + β)^{−1/2} )
A change of variables shows that this implies the following prior on the original scale:
p(α, β) ∝ (α + β)^{−5/2}
There are many other possibilities. Why is this a sensible choice?
56 A hyperprior choice
p(α, β) ∝ (α + β)^{−5/2}
It may be shown that it gives a roughly uniform prior distribution to the log standard deviation of the θ_j ~ Beta(α, β) sampling model. It may also be shown that the posterior marginal p(α, β | y) is proper with this choice, as long as 0 < y_j < n_j for at least one experiment j.
57 The posterior marginal How can we calculate the posterior marginal distribution?
p(α, β | y) = p(α, β, θ | y) / p(θ | α, β, y)   (cond. prob.)
= p(y | θ, α, β) p(θ | α, β) p(α, β) / [ p(y) p(θ | α, β, y) ]   (Bayes' rule)
∝ [ ∏_{j=1}^m Bin(y_j; n_j, θ_j) ] [ ∏_{j=1}^m Beta(θ_j; α, β) ] p(α, β) / ∏_{j=1}^m Beta(θ_j; α + y_j, β + n_j − y_j)   (cond. indep. & cond. prob.)
= p(α, β) [ Γ(α + β) / (Γ(α)Γ(β)) ]^m ∏_{j=1}^m Γ(α + y_j) Γ(β + n_j − y_j) / Γ(α + β + n_j)
58 Metropolis-within-Gibbs We may sample from the marginal p(α, β | y) by generating (α^(s+1), β^(s+1)) from (α^(s), β^(s)) via
1. propose α′ ~ Unif(α^(s)/2, 2α^(s))
2. if u < p(α′, β^(s) | y) / p(α^(s), β^(s) | y) for u ~ Unif(0, 1), then α^(s+1) ← α′, else α^(s+1) ← α^(s)
3. propose β′ ~ Unif(β^(s)/2, 2β^(s))
4. if u < p(α^(s+1), β′ | y) / p(α^(s+1), β^(s) | y) for u ~ Unif(0, 1), then β^(s+1) ← β′, else β^(s+1) ← β^(s)
(Strictly speaking, this uniform proposal is not symmetric, so a Metropolis-Hastings correction factor, e.g. α^(s)/α′ in step 2, should multiply the acceptance ratio.)
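The two-block update above is short to implement. Again the course uses R; this is a hedged Python sketch (function names and the starting values are illustrative), which evaluates the marginal on the log scale and includes the proposal-asymmetry correction mentioned above:

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def log_marg(a, b, y, n):
    """log p(alpha, beta | y) up to a constant, with p(a,b) proportional
    to (a+b)^(-5/2)."""
    if a <= 0 or b <= 0:
        return -np.inf
    lp = -2.5 * math.log(a + b)
    for yj, nj in zip(y, n):
        lp += (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
               + math.lgamma(a + yj) + math.lgamma(b + nj - yj)
               - math.lgamma(a + b + nj))
    return lp

def mwg(y, n, S=2000, a=1.0, b=1.0):
    """Metropolis-within-Gibbs over (alpha, beta) with multiplicative
    uniform proposals Unif(x/2, 2x); returns an (S, 2) array of samples."""
    out = []
    for _ in range(S):
        ap = rng.uniform(a / 2, 2 * a)  # propose alpha'
        # accept with the MH ratio, including the a/ap proposal correction
        if math.log(rng.uniform()) < (log_marg(ap, b, y, n) - log_marg(a, b, y, n)
                                      + math.log(a / ap)):
            a = ap
        bp = rng.uniform(b / 2, 2 * b)  # propose beta'
        if math.log(rng.uniform()) < (log_marg(a, bp, y, n) - log_marg(a, b, y, n)
                                      + math.log(b / bp)):
            b = bp
        out.append((a, b))
    return np.array(out)
```

Working with log_marg rather than the marginal itself keeps the gamma-function products from overflowing, and the log-uniform comparison is the standard log-scale accept/reject step.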
59 Completing the MC Conditional on samples (α^(s), β^(s)), we may complete the MC method for sampling from the joint posterior distribution by sampling from the conditional(s)
θ_j^(s) ~ Beta(α^(s) + y_j, β^(s) + n_j − y_j) for j = 1,...,m
We may also obtain samples from the posterior predictive p(ỹ | y) = p(ỹ_1,...,ỹ_m | y_1,...,y_m) as
ỹ_j^(s) ~ Bin(n_j, θ_j^(s)) for j = 1,...,m
60 Example: risk of tumors in a group of rats In the evaluation of drugs for possible clinical application, studies are routinely performed on rodents. In a particular study, the aim is to estimate the probability of a tumor in a population of female F344 rats that receive a zero dose of the drug (control group). The data show that 4/14 rats developed endometrial stromal polyps (a kind of tumor). Typically, the mean and standard deviation of underlying tumor risks are not available to form a prior.
61 Example: prior data Rather, historical data are available on previous experiments on similar groups of rats. Tarone (1982) provides data on the observations of tumor incidence in 70 groups of rats. [Figure: number of tumors versus number of rats in each group.]
62 Example: Bayesian analysis We shall model the m = 71 rat tumor data with a hierarchical beta-binomial sampling model, with MC(MC) inference as just described. First we must obtain samples from the marginal posterior of (α, β). [Figure: trace plots of the α and β samples, with their effective sample sizes (ESS).]
63 Example: The posterior marginal Once we have determined that the mixing is good, and we think the chain has achieved stationarity, we can inspect the marginal posterior in a number of ways. [Figures: samples of (α, β) on the original scale, and on a sensibly transformed scale, (log(α/β), log(α + β)).]
64 Example: Mean shrinkage As in the normal hierarchical model, we can assess the amount of shrinkage in the group-specific means, which may be obtained by direct MC, in a number of ways. [Figures: posterior means of the θ_j's versus the raw proportions y_j/n_j; shrinkage y_j/n_j minus posterior mean versus n_j.]
65 Example: Rat group F344 We can now present the posterior distribution for θ_71, our 71st rat group, and compare it to the population mean of tumor rates in the 70 prior rat groups: E{θ_71 | y} = 0.199. [Figure: posterior density of θ_71 for rat group F344, with the posterior for the population mean α/(α + β) overlaid, and the posterior probability P(θ_71 > α/(α + β) | y) reported.]
66 In summary We have seen how hierarchical models may be used to: model data which are nested or have a natural hierarchy; pool information about groups of similar populations, so that smaller groups may borrow information from larger ones (i.e., shrinkage); and provide an efficient way of using prior data in an appropriate way.
More informationIntroduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016
Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An
More informationInference for a Population Proportion
Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist
More informationmultilevel modeling: concepts, applications and interpretations
multilevel modeling: concepts, applications and interpretations lynne c. messer 27 october 2010 warning social and reproductive / perinatal epidemiologist concepts why context matters multilevel models
More informationGibbs Sampling in Linear Models #1
Gibbs Sampling in Linear Models #1 Econ 690 Purdue University Justin L Tobias Gibbs Sampling #1 Outline 1 Conditional Posterior Distributions for Regression Parameters in the Linear Model [Lindley and
More informationLecture 3. Univariate Bayesian inference: conjugate analysis
Summary Lecture 3. Univariate Bayesian inference: conjugate analysis 1. Posterior predictive distributions 2. Conjugate analysis for proportions 3. Posterior predictions for proportions 4. Conjugate analysis
More informationPubh 8482: Sequential Analysis
Pubh 8482: Sequential Analysis Joseph S. Koopmeiners Division of Biostatistics University of Minnesota Week 10 Class Summary Last time... We began our discussion of adaptive clinical trials Specifically,
More informationBayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling
Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation
More informationBayesian GLMs and Metropolis-Hastings Algorithm
Bayesian GLMs and Metropolis-Hastings Algorithm We have seen that with conjugate or semi-conjugate prior distributions the Gibbs sampler can be used to sample from the posterior distribution. In situations,
More informationA Primer on Statistical Inference using Maximum Likelihood
A Primer on Statistical Inference using Maximum Likelihood November 3, 2017 1 Inference via Maximum Likelihood Statistical inference is the process of using observed data to estimate features of the population.
More informationBayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida
Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:
More information1 Inference for binomial proportion (Matlab/Python)
Bayesian data analysis exercises from 2015 1 Inference for binomial proportion (Matlab/Python) Algae status is monitored in 274 sites at Finnish lakes and rivers. The observations for the 2008 algae status
More informationBayesian Model Comparison:
Bayesian Model Comparison: Modeling Petrobrás log-returns Hedibert Freitas Lopes February 2014 Log price: y t = log p t Time span: 12/29/2000-12/31/2013 (n = 3268 days) LOG PRICE 1 2 3 4 0 500 1000 1500
More informationModel Checking and Improvement
Model Checking and Improvement Statistics 220 Spring 2005 Copyright c 2005 by Mark E. Irwin Model Checking All models are wrong but some models are useful George E. P. Box So far we have looked at a number
More informationLecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1
Lecture 5 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,
More informationHierarchical Linear Models. Jeff Gill. University of Florida
Hierarchical Linear Models Jeff Gill University of Florida I. ESSENTIAL DESCRIPTION OF HIERARCHICAL LINEAR MODELS II. SPECIAL CASES OF THE HLM III. THE GENERAL STRUCTURE OF THE HLM IV. ESTIMATION OF THE
More informationCS281A/Stat241A Lecture 22
CS281A/Stat241A Lecture 22 p. 1/4 CS281A/Stat241A Lecture 22 Monte Carlo Methods Peter Bartlett CS281A/Stat241A Lecture 22 p. 2/4 Key ideas of this lecture Sampling in Bayesian methods: Predictive distribution
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2
Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationRegularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics
Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,
More informationIntroduc)on to Bayesian Methods
Introduc)on to Bayesian Methods Bayes Rule py x)px) = px! y) = px y)py) py x) = px y)py) px) px) =! px! y) = px y)py) y py x) = py x) =! y "! y px y)py) px y)py) px y)py) px y)py)dy Bayes Rule py x) =
More informationBayesian Graphical Models
Graphical Models and Inference, Lecture 16, Michaelmas Term 2009 December 4, 2009 Parameter θ, data X = x, likelihood L(θ x) p(x θ). Express knowledge about θ through prior distribution π on θ. Inference
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationBayesian Inference for Regression Parameters
Bayesian Inference for Regression Parameters 1 Bayesian inference for simple linear regression parameters follows the usual pattern for all Bayesian analyses: 1. Form a prior distribution over all unknown
More informationDefault Priors and Efficient Posterior Computation in Bayesian Factor Analysis
Default Priors and Efficient Posterior Computation in Bayesian Factor Analysis Joyee Ghosh Institute of Statistics and Decision Sciences, Duke University Box 90251, Durham, NC 27708 joyee@stat.duke.edu
More informationDavid Giles Bayesian Econometrics
David Giles Bayesian Econometrics 5. Bayesian Computation Historically, the computational "cost" of Bayesian methods greatly limited their application. For instance, by Bayes' Theorem: p(θ y) = p(θ)p(y
More informationStatistical Models with Uncertain Error Parameters (G. Cowan, arxiv: )
Statistical Models with Uncertain Error Parameters (G. Cowan, arxiv:1809.05778) Workshop on Advanced Statistics for Physics Discovery aspd.stat.unipd.it Department of Statistical Sciences, University of
More informationPetr Volf. Model for Difference of Two Series of Poisson-like Count Data
Petr Volf Institute of Information Theory and Automation Academy of Sciences of the Czech Republic Pod vodárenskou věží 4, 182 8 Praha 8 e-mail: volf@utia.cas.cz Model for Difference of Two Series of Poisson-like
More informationDAG models and Markov Chain Monte Carlo methods a short overview
DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex
More informationBayesian course - problem set 5 (lecture 6)
Bayesian course - problem set 5 (lecture 6) Ben Lambert November 30, 2016 1 Stan entry level: discoveries data The file prob5 discoveries.csv contains data on the numbers of great inventions and scientific
More informationSome slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2
Logistics CSE 446: Point Estimation Winter 2012 PS2 out shortly Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Last Time Random variables, distributions Marginal, joint & conditional
More informationBayesian Meta-analysis with Hierarchical Modeling Brian P. Hobbs 1
Bayesian Meta-analysis with Hierarchical Modeling Brian P. Hobbs 1 Division of Biostatistics, School of Public Health, University of Minnesota, Mayo Mail Code 303, Minneapolis, Minnesota 55455 0392, U.S.A.
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationGeneral Bayesian Inference I
General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for
More informationScaling Neighbourhood Methods
Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)
More informationBayesian Inference. Chapter 9. Linear models and regression
Bayesian Inference Chapter 9. Linear models and regression M. Concepcion Ausin Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in Mathematical Engineering
More informationExample using R: Heart Valves Study
Example using R: Heart Valves Study Goal: Show that the thrombogenicity rate (TR) is less than two times the objective performance criterion R and WinBUGS Examples p. 1/27 Example using R: Heart Valves
More informationData Mining Stat 588
Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationThe Metropolis Algorithm
16 Metropolis Algorithm Lab Objective: Understand the basic principles of the Metropolis algorithm and apply these ideas to the Ising Model. The Metropolis Algorithm Sampling from a given probability distribution
More informationEco517 Fall 2014 C. Sims MIDTERM EXAM
Eco57 Fall 204 C. Sims MIDTERM EXAM You have 90 minutes for this exam and there are a total of 90 points. The points for each question are listed at the beginning of the question. Answer all questions.
More informationBUGS Bayesian inference Using Gibbs Sampling
BUGS Bayesian inference Using Gibbs Sampling Glen DePalma Department of Statistics May 30, 2013 www.stat.purdue.edu/~gdepalma 1 / 20 Bayesian Philosophy I [Pearl] turned Bayesian in 1971, as soon as I
More informationIntroduction to Machine Learning. Lecture 2
Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for
More informationMaking rating curves - the Bayesian approach
Making rating curves - the Bayesian approach Rating curves what is wanted? A best estimate of the relationship between stage and discharge at a given place in a river. The relationship should be on the
More informationST 740: Linear Models and Multivariate Normal Inference
ST 740: Linear Models and Multivariate Normal Inference Alyson Wilson Department of Statistics North Carolina State University November 4, 2013 A. Wilson (NCSU STAT) Linear Models November 4, 2013 1 /
More informationMultivariate spatial modeling
Multivariate spatial modeling Point-referenced spatial data often come as multivariate measurements at each location Chapter 7: Multivariate Spatial Modeling p. 1/21 Multivariate spatial modeling Point-referenced
More information2 Inference for Multinomial Distribution
Markov Chain Monte Carlo Methods Part III: Statistical Concepts By K.B.Athreya, Mohan Delampady and T.Krishnan 1 Introduction In parts I and II of this series it was shown how Markov chain Monte Carlo
More informationMore Spectral Clustering and an Introduction to Conjugacy
CS8B/Stat4B: Advanced Topics in Learning & Decision Making More Spectral Clustering and an Introduction to Conjugacy Lecturer: Michael I. Jordan Scribe: Marco Barreno Monday, April 5, 004. Back to spectral
More informationMarkov Chain Monte Carlo methods
Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As
More informationEstimating a Piecewise Growth Model with Longitudinal Data that Contains Individual Mobility across Clusters
Estimating a Piecewise Growth Model with Longitudinal Data that Contains Individual Mobility across Clusters Audrey J. Leroux Georgia State University Piecewise Growth Model (PGM) PGMs are beneficial for
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see
More informationHierarchical models. Dr. Jarad Niemi. STAT Iowa State University. February 14, 2018
Hierarchical models Dr. Jarad Niemi STAT 544 - Iowa State University February 14, 2018 Jarad Niemi (STAT544@ISU) Hierarchical models February 14, 2018 1 / 38 Outline Motivating example Independent vs pooled
More informationINTRODUCTION TO BAYESIAN STATISTICS
INTRODUCTION TO BAYESIAN STATISTICS Sarat C. Dass Department of Statistics & Probability Department of Computer Science & Engineering Michigan State University TOPICS The Bayesian Framework Different Types
More informationECE285/SIO209, Machine learning for physical applications, Spring 2017
ECE285/SIO209, Machine learning for physical applications, Spring 2017 Peter Gerstoft, 534-7768, gerstoft@ucsd.edu We meet Wednesday from 5 to 6:20pm in Spies Hall 330 Text Bishop Grading A or maybe S
More information(I AL BL 2 )z t = (I CL)ζ t, where
ECO 513 Fall 2011 MIDTERM EXAM The exam lasts 90 minutes. Answer all three questions. (1 Consider this model: x t = 1.2x t 1.62x t 2 +.2y t 1.23y t 2 + ε t.7ε t 1.9ν t 1 (1 [ εt y t = 1.4y t 1.62y t 2
More informationMCMC Review. MCMC Review. Gibbs Sampling. MCMC Review
MCMC Review http://jackman.stanford.edu/mcmc/icpsr99.pdf http://students.washington.edu/fkrogsta/bayes/stat538.pdf http://www.stat.berkeley.edu/users/terry/classes/s260.1998 /Week9a/week9a/week9a.html
More information