STAT 425: Introduction to Bayesian Analysis


1 STAT 425: Introduction to Bayesian Analysis. Marina Vannucci, Rice University, USA. Fall 2017. Bayesian Analysis (Part 3).

2 Part 3: Hierarchical and Linear Models
- Hierarchical models
- Linear regression models
- Generalized linear models (logistic and Poisson)
- Hierarchical linear and mixed models

3 Data augmentation techniques for binary responses
- Binary response case. Basic idea: re-express discrete-data regression models in terms of unobserved (latent) continuous data.
- This aids interpretation and allows convenient MCMC sampling.
- Used for both logistic and probit regression.
- Albert and Chib (1993) demonstrated an auxiliary variable approach that simplifies binary probit regression.
- Introduce extra variables $z$ into the model such that $y = g(z)$, with $g$ any non-decreasing function, for interpretability.
- Can also be used for multinomial/ordinal data (see Hoff, Chapter 12, Section 12.1).

4 Binary regression model
We observe $y_i \in \{0, 1\}$ and a set of covariates $X_i$, $i = 1, \ldots, n$:
$y_i \sim \text{Bernoulli}(g^{-1}(\eta_i))$, $\eta_i = X_i\beta$, $\beta \sim \pi(\beta)$. (1)
- Probit regression: $g(u) = \Phi^{-1}(u)$, where $\Phi$ is the standard normal CDF.
- Logit regression: $g(u) = \text{logit}(u) = \log\frac{u}{1-u}$ (logit link).

5 Probit link for binary outcome (chapter 12)
The auxiliary variable formulation for binary outcomes assumes that a continuous latent variable $z_i$ exists and is related to the binary $y_i$ via
$y_i = 1$ if $z_i > 0$, $y_i = 0$ if $z_i \le 0$.
Associated with the $i$-th response, the values of $k$ covariates $x_{i1}, \ldots, x_{ik}$ are observed. The latent value $z_i$ is related to the $k$ covariates by the normal regression model
$z_i = x_{i1}\beta_1 + \cdots + x_{ik}\beta_k + \varepsilon_i$, $\varepsilon_i \sim N(0, 1)$.

6 Then we can show that
$p(y_i = 1 \mid \beta) = p(y_i = 1 \mid z_i > 0, \beta)\,p(z_i > 0 \mid \beta) + p(y_i = 1 \mid z_i \le 0, \beta)\,p(z_i \le 0 \mid \beta)$
$= 1 \cdot p(z_i > 0 \mid \beta) + 0 \cdot p(z_i \le 0 \mid \beta)$
$= p(z_i - \eta_i > -\eta_i \mid \beta) = \Phi(\eta_i)$,
with $\eta_i = x_{i1}\beta_1 + \cdots + x_{ik}\beta_k$ and where $\Phi(\cdot)$ is the cdf of a standard normal distribution.
The latent values $z_i$ are viewed as additional parameters. Gibbs sampling can be used to obtain posterior draws of $\beta$ and $z = (z_1, \ldots, z_n)$.
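The identity $p(y_i = 1 \mid \beta) = \Phi(\eta_i)$ can be checked by simulation. The sketch below is only an illustration: the value of `eta` and the number of draws are arbitrary assumptions, not values from the slides.

```r
set.seed(1)
eta <- 0.7                                # hypothetical linear predictor x_i' beta
n_draws <- 1e5
z <- rnorm(n_draws, mean = eta, sd = 1)   # latent z_i = eta + eps_i, eps_i ~ N(0, 1)
y <- as.integer(z > 0)                    # y_i = 1 if z_i > 0, otherwise 0
c(simulated = mean(y), exact = pnorm(eta))
```

The two numbers should agree up to Monte Carlo error, which shrinks as `n_draws` grows.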

7 If we place a uniform prior on $\beta$, $p(\beta) \propto 1$, the full conditionals are given by:
$\beta \mid z, X, y \sim N_k\big((X^T X)^{-1} X^T z,\ (X^T X)^{-1}\big)$
$z_i \mid \beta, X, y \sim N(x_i^T\beta, 1)\, I\{z_i > 0\}$ if $y_i = 1$
$z_i \mid \beta, X, y \sim N(x_i^T\beta, 1)\, I\{z_i \le 0\}$ if $y_i = 0$

8 If we want to specify, instead, a normal prior density for $\beta$, $\beta \sim N(0, S_0)$, its full conditional becomes
$\beta \mid z, X, y \sim N_k\big((X^T X + S_0^{-1})^{-1} X^T z,\ (X^T X + S_0^{-1})^{-1}\big)$
Note: Sampling from a truncated normal density, $y \sim N(\mu, \sigma^2)\, I(a < y < b)$, via the inverse CDF transformation method:
1. Set $u_1 = \Phi(a; \mu, \sigma^2)$ and $u_2 = \Phi(b; \mu, \sigma^2)$
2. Sample $u \sim U(u_1, u_2)$
3. Set $y = \Phi^{-1}(u; \mu, \sigma^2)$
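The three steps of the inverse CDF method translate almost line by line into base R. This is a minimal sketch; the function name `rtnorm_inv` and the example parameter values are assumptions made for illustration.

```r
# Draw n values from N(mu, sigma^2) truncated to (a, b) by inverting the CDF.
rtnorm_inv <- function(n, mu, sigma, a = -Inf, b = Inf) {
  u1 <- pnorm(a, mu, sigma)   # step 1: CDF at the lower bound
  u2 <- pnorm(b, mu, sigma)   #         and at the upper bound
  u  <- runif(n, u1, u2)      # step 2: uniform draw between the two
  qnorm(u, mu, sigma)         # step 3: map back through the inverse CDF
}

set.seed(1)
draws <- rtnorm_inv(1e4, mu = 0.5, sigma = 1, a = 0, b = Inf)
range(draws)                  # every draw lies above the lower bound 0
```

When the truncation region sits far in a tail, `pnorm` can round to 0 or 1 and the draws degrade, so this simple version is best suited to moderate truncation.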

9 Example: Donner Party (from Bayesian Computation with R by Jim Albert).
The Donner Party was a group of American pioneers who set out for California in a wagon train. They spent the winter of 1846-47 snowbound in the Sierra Nevada. The first relief party did not arrive until the middle of February 1847, almost four months after the wagon train became trapped. Forty-eight of the 87 members of the party survived to reach California.
The dataset donner.dat contains the age (in years), gender (MALE) and survival status (1 if survived) for 45 members of the Donner Party. We want to fit the probit model for $\pi_i = P(y_i = 1)$:
$\Phi^{-1}(\pi_i) = \beta_0 + \beta_1\,\text{MALE}_i + \beta_2\,\text{AGE}_i$

10
    donner = read.table("donner.dat", header=TRUE, sep="\t")
    y = donner$survival; n = length(y)
    # design matrix: intercept plus the first two columns of the data
    X = as.matrix(cbind(rep(1, n), donner[,1:2]))
    k = dim(X)[2]
    library(MASS)   # provides mvrnorm
    T = 10000; BETA = matrix(NA, T, k); Z = matrix(NA, T, n)
    set.seed(1)
    # initial value
    z = rnorm(n, 0, 1)

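11
    # rtruncnorm(n, mean, sd, lower, upper) is not defined on the slides.
    # A helper with this argument order is assumed here, built with the
    # inverse CDF method of slide 8 (truncnorm::rtruncnorm orders its
    # arguments differently: n, a, b, mean, sd).
    rtruncnorm = function(n, mean, sd, lower, upper) {
      u = runif(n, pnorm(lower, mean, sd), pnorm(upper, mean, sd))
      qnorm(u, mean, sd)
    }
    # Implement Gibbs sampler
    for(t in 1:T) {
      # Update beta
      vb = solve(t(X)%*%X); mb = vb%*%t(X)%*%z
      beta = mvrnorm(1, mb, vb)
      # Update z_i's
      for(i in 1:n) {
        if(y[i]==1) z[i] = rtruncnorm(1, X[i,]%*%beta, 1, 0, Inf)
        else z[i] = rtruncnorm(1, X[i,]%*%beta, 1, -Inf, 0)
      }
      BETA[t,] = beta; Z[t,] = z
    }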

12 Let's calculate some posterior summaries:
    nburn = 1000
    > apply(BETA[(nburn+1):T,], 2, mean)
    [1]
    > apply(BETA[(nburn+1):T,], 2, sd)
    [1]
    > apply(BETA[(nburn+1):T,], 2, quantile, c(0.025,0.975))
          [,1] [,2] [,3]
    2.5%
    97.5%

13 Let's compare the results with a maximum likelihood fit of the probit model:
    fit.probit = glm(survival ~ ., family=binomial(link=probit), data=donner)
    summary(fit.probit)

    Call:
    glm(formula = survival ~ ., family = binomial(link = probit), data = donner)

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                      *
    age                                              *
    male                                             *
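The comparison between Gibbs output and a maximum likelihood probit fit can also be tried on simulated data, without the Donner file or any add-on package. The sketch below uses the normal-prior full conditional from slide 8; the sample size, the "true" coefficients, and the prior variance of 100 are assumptions chosen only for illustration.

```r
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                  # intercept plus one simulated covariate
beta_true <- c(-0.3, 0.8)                # hypothetical true coefficients
y <- as.integer(X %*% beta_true + rnorm(n) > 0)

S0_inv <- diag(1 / 100, 2)               # assumed weak prior beta ~ N(0, 100 I)
V <- solve(crossprod(X) + S0_inv)        # (X'X + S0^{-1})^{-1}, fixed across iterations
L <- t(chol(V))                          # lower Cholesky factor, L %*% t(L) equals V

n_iter <- 3000; n_burn <- 500
BETA <- matrix(NA_real_, n_iter, 2)
beta <- c(0, 0)
for (s in 1:n_iter) {
  # z_i | beta, y_i: truncated normal, drawn by the inverse CDF method
  eta <- drop(X %*% beta)
  lo <- ifelse(y == 1, pnorm(0, eta, 1), 0)
  hi <- ifelse(y == 1, 1, pnorm(0, eta, 1))
  z <- qnorm(runif(n, lo, hi), eta, 1)
  # beta | z: multivariate normal with mean V X'z and covariance V
  beta <- drop(V %*% crossprod(X, z) + L %*% rnorm(2))
  BETA[s, ] <- beta
}

post_mean <- colMeans(BETA[(n_burn + 1):n_iter, ])
mle <- coef(glm(y ~ X[, 2], family = binomial(link = "probit")))
rbind(gibbs = post_mean, mle = unname(mle))
```

With a weak prior and a moderate sample size the posterior means and the maximum likelihood estimates should be close, which is the pattern the slide comparison is meant to show.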

14 Model Fit and Model Choice (Hoff, Chapter 9.3)
Good statisticians question whether their model is an adequate approximation to reality. A chosen model may not be the best model to fit the data. For example, a different set or combination of the available covariates should be considered. This is a problem of model selection.
A first attempt at model and prior criticism analyzes the adequacy of the fit with common regression diagnostic tools, e.g. by inspecting the residuals of the fit provided by the posterior mean:
$\hat\epsilon_i = y_i - x_i^T\hat\beta$, with $\hat\beta = E(\beta \mid y)$.
Alternatively, one could think of more formal, and perhaps more Bayesian, ways to compare models.

15 Why variable selection?
- Avoid the use of redundant variables (problems with interpretation).
- Including unnecessary terms yields less precise estimates, particularly if explanatory variables are highly correlated with each other.
- Reduced MSE: reduced variance but possibly higher bias.
- It is too expensive to use all variables.

16 Model selection criteria
Model selection criteria have been devised to compare different models. Kadane and Lazar (2004) review model selection from Bayesian and frequentist perspectives. In regression settings, the candidate models are distinguished by different covariate combinations or transformations of predictor variables:
1. Adjusted $R^2$
2. Stepwise regression
3. Regularization (Ridge, LASSO)
4. Akaike Information Criterion (AIC)
5. Bayesian Information Criterion (BIC)
6. Deviance Information Criterion (DIC)
7. Watanabe-Akaike Information Criterion (WAIC)
8. Log pseudo marginal likelihood (LPML)
9. Bayes Factors

17 Model Selection
In linear regression, one may want to understand which predictors (and models) fit the data best. For example, we can consider the 6 models below:

18 Variable Selection in Linear Models
A prior on the regression coefficients that is often considered is the spike-and-slab prior, which is a mixture prior:
$\beta_i \mid \gamma \sim \gamma\,\delta_0(\cdot) + (1 - \gamma)\,N(0, b)$ for large $b$.
This prior sets $\beta_i = 0$ if $\gamma = 1$ and draws $\beta_i \sim N(0, b)$ if $\gamma = 0$. The variable $\gamma$ is a latent auxiliary variable such that $\gamma \sim \text{Bern}(\pi)$, with $\pi \in (0, 1)$.
The spike-and-slab prior achieves dimension reduction: under this coding, a variable is included in the model if $P(\gamma = 0 \mid \text{data}) > \lambda$ for some threshold $\lambda$. (The Bayesian Model Choice slide uses the reverse, more common coding, in which $\gamma_j = 1$ means inclusion.)
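Drawing from the spike-and-slab prior makes the mixture structure concrete. The sketch below follows the coding on this slide (the point mass at zero is chosen when $\gamma = 1$); the values of `pi_spike` and `b` are assumptions chosen for illustration.

```r
set.seed(1)
p <- 10000        # number of coefficients to draw (illustrative)
pi_spike <- 0.8   # assumed P(gamma = 1), the prior weight of the spike
b <- 100          # assumed slab variance ("large b")

gamma <- rbinom(p, size = 1, prob = pi_spike)          # gamma ~ Bern(pi)
beta  <- ifelse(gamma == 1, 0, rnorm(p, 0, sqrt(b)))   # spike at 0 if gamma = 1, slab otherwise
c(share_exactly_zero = mean(beta == 0), slab_sd = sd(beta[gamma == 0]))
```

The share of exact zeros tracks `pi_spike`, while the non-zero draws spread out with standard deviation close to `sqrt(b)`.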

19 With the point masses, a subset of predictors can be excluded with positive probability, which can be treated as direct shrinkage to zero. The continuous components in the prior also pull the coefficients included in the model towards their prior centers, which are usually zero, to achieve another layer of shrinkage.
Indeed, instead of $N(0, b)$, one can use the g-prior
$\beta \mid g, \tau \sim \gamma\,\delta_0(\cdot) + (1 - \gamma)\,N\big(0,\ g\,\tau\,(X^T X)^{-1}\big)$

20 Bayes Factors
Suppose we have two models, $M_0$ and $M_1$, with sampling densities
$f(y \mid \theta_0, M_0)$ and $f(y \mid \theta_1, M_1)$.
The two models and the vectors $\theta_0$ and $\theta_1$ may not have anything in common; the two parameters need not even have the same dimension.
We have prior distributions on the parameters under the two models:
$p_{M_0}(\theta_0)$ and $p_{M_1}(\theta_1)$.
The Bayes factor ($B_{01}$) is calculated as the ratio of the marginal distributions of the data
$p(y \mid M_0) = \int_{\Theta_0} f(y \mid \theta_0, M_0)\, p_{M_0}(\theta_0)\, d\theta_0$
and
$p(y \mid M_1) = \int_{\Theta_1} f(y \mid \theta_1, M_1)\, p_{M_1}(\theta_1)\, d\theta_1$

21 Bayes Factors
By Bayes' theorem, the posterior probability of model $M_j$, $j = 0, 1$, is
$P(M_j \mid y) = \frac{p(y \mid M_j)\, p(M_j)}{p(y \mid M_0)\, p(M_0) + p(y \mid M_1)\, p(M_1)}$
We can then consider the posterior odds:
$\frac{P(M_0 \mid y)}{P(M_1 \mid y)} = \frac{p(y \mid M_0)}{p(y \mid M_1)} \cdot \frac{p(M_0)}{p(M_1)} = BF_{01} \cdot \frac{p(M_0)}{p(M_1)}$
If $p(M_0) = p(M_1) = \frac{1}{2}$, then $\frac{P(M_0 \mid y)}{P(M_1 \mid y)} = BF_{01}$.
Also, usually one looks at $LBF = \log(BF_{01})$ or $2\,LBF$.
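A small worked case shows how the marginal likelihoods, the Bayes factor, and the posterior model probability fit together. The setup is hypothetical and not from the slides: $y$ successes in $n$ Bernoulli trials, with $M_0$ fixing the success probability at 0.5 and $M_1$ placing a Uniform(0, 1) prior on it.

```r
n <- 20; y <- 15                       # hypothetical data: 15 successes in 20 trials
m0 <- dbinom(y, n, 0.5)                # p(y | M0): point hypothesis theta = 0.5
# p(y | M1): binomial likelihood integrated against the Uniform(0, 1) prior
m1 <- integrate(function(th) dbinom(y, n, th), 0, 1)$value
bf01 <- m0 / m1                        # Bayes factor of M0 against M1
prior0 <- 0.5                          # assumed equal prior model probabilities
post0 <- m0 * prior0 / (m0 * prior0 + m1 * (1 - prior0))
c(BF01 = bf01, two_log_BF10 = 2 * log(1 / bf01), P_M0_given_y = post0)
```

Under the uniform prior the integral has the closed form $1/(n+1)$, which gives a quick check on the numerical integration, and with equal prior probabilities the posterior odds equal the Bayes factor.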

22 Bayes Factors - Strength of evidence
The Bayes factor can also be used more generally to test two competing hypotheses, besides two competing models.
Traditionally, the strength of evidence for model $M_0$ (or hypothesis $M_0$) is decided based on the following table for BFs (Kass and Raftery, 1995):

2LBF       Strength of evidence
0 to 2     not really worth considering
2 to 6     positive
6 to 10    strong
> 10       very strong

23 Bayesian Model Choice
- Models for the variable selection problem are based on a subset of the variables $X_1, \ldots, X_p$.
- Encode models with a vector $\gamma = (\gamma_1, \ldots, \gamma_p)$, where $\gamma_j \in \{0, 1\}$ is an indicator for whether variable $X_j$ should be included in the model $M_\gamma$: $\gamma_j = 0 \Leftrightarrow \beta_j = 0$.
- Each value of $\gamma$ represents one of the $2^p$ models.
- Under model $M_\gamma$: $Y \mid \alpha, \beta, \sigma^2, \gamma \sim N(1\alpha + X_\gamma\beta_\gamma,\ \sigma^2 I)$, where $X_\gamma$ is the design matrix using the columns of $X$ for which $\gamma_j = 1$, and $\beta_\gamma$ is the subset of $\beta$ that is non-zero.

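24 Posterior Probabilities of Models
- Posterior model probabilities: $p(M_j \mid Y) = \frac{p(Y \mid M_j)\, p(M_j)}{\sum_k p(Y \mid M_k)\, p(M_k)}$
- The marginal likelihood of a model is proportional to $p(Y \mid M_\gamma) = \iint p(Y \mid \beta_\gamma, \sigma^2)\, p(\beta_\gamma \mid \gamma, \sigma^2)\, p(\sigma^2 \mid \gamma)\, d\beta\, d\sigma^2$
- Bayes factor $BF[i:j]$: $\frac{P(M_i \mid Y)}{P(M_j \mid Y)} = \frac{p(Y \mid M_i)}{p(Y \mid M_j)} \times \frac{P(M_i)}{P(M_j)}$, that is, posterior odds = Bayes factor $\times$ prior odds.
- Probability that $\beta_j \neq 0$: $\sum_{M_k:\, \beta_j \neq 0} p(M_k \mid Y)$ (marginal posterior inclusion probability).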

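25 Zellner's g-prior within Models
- Centered model: $Y = 1_n\alpha + X^c_\gamma\beta_\gamma + \epsilon$
- Common parameters: $p(\alpha, \phi) \propto \phi^{-1}$
- Model-specific parameters: $\beta_\gamma \mid \alpha, \phi, \gamma \sim N\big(0,\ g\,\phi^{-1}(X^{cT}_\gamma X^c_\gamma)^{-1}\big)$
- The marginal likelihood of $M_\gamma$ is proportional to $p(Y \mid M_\gamma) = C\,(1 + g)^{(n - p_\gamma - 1)/2}\,\big(1 + g(1 - R^2_\gamma)\big)^{-(n-1)/2}$, where $R^2_\gamma$ is the usual $R^2$ for model $M_\gamma$, $p_\gamma$ is the number of predictors in $M_\gamma$, and $C$ is a constant equal to $p(Y \mid M_0)$ (model with intercept alone).
- Uniform distribution over the space of models: $p(M_\gamma) = 1/2^p$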
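For small $p$ the closed-form marginal likelihood makes it possible to enumerate every model. The sketch below computes the log Bayes factor of each $M_\gamma$ against the intercept-only model from its $R^2_\gamma$, then normalizes under the uniform model prior. The simulated data and the choice $g = n$ are assumptions made for illustration, and `p_gam` stands for the number of predictors in the model.

```r
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)               # simulated predictors
y <- 1 + 2 * X[, 1] + rnorm(n)                # only the first predictor matters
g <- n                                        # assumed choice of g

models <- as.matrix(expand.grid(rep(list(0:1), p)))   # all 2^p inclusion vectors
log_bf <- apply(models, 1, function(gam) {
  p_gam <- sum(gam)
  if (p_gam == 0) return(0)                   # intercept-only model: BF against itself is 1
  r2 <- summary(lm(y ~ X[, gam == 1, drop = FALSE]))$r.squared
  0.5 * (n - p_gam - 1) * log(1 + g) - 0.5 * (n - 1) * log(1 + g * (1 - r2))
})

# Uniform prior over models: posterior probabilities are normalized Bayes factors
post <- exp(log_bf - max(log_bf))
post <- post / sum(post)
incl <- drop(post %*% models)                 # marginal posterior inclusion probabilities
round(cbind(models, post), 3)
```

Working on the log scale and subtracting the maximum before exponentiating avoids overflow, since the Bayes factors can be astronomically large when a predictor has a strong effect.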

26 Computing the Bayes Factors for large p
- In many cases, the Bayes factors can be computed from a single posterior sample. It is very easy to compute both the numerator and the denominator of the Bayes factor by using post-MCMC compositional sampling (Monte Carlo) techniques based on the output of the MCMC chains.
- One of the neat features of Bayes factors is their transitivity. If I know that Model A outperforms Model B by 3, and I know that Model B outperforms Model C by 4, then I know that Model A outperforms Model C by $3 \times 4 = 12$.
- On the other hand, they are not defined with improper priors.
- One criticism of Bayes factors is the (implicit) assumption that one of the competing models ($M_0$ or $M_1$) is correct.
- For complex models, the post-MCMC compositional sampling (Monte Carlo) may be very inefficient and computationally costly.
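The transitivity property follows directly from the definition of the Bayes factor as a ratio of marginal likelihoods, since the marginal likelihood of the middle model cancels. Writing $M_A$, $M_B$, $M_C$ for the three models (labels introduced here for the example):

```latex
BF_{AC}
  = \frac{p(y \mid M_A)}{p(y \mid M_C)}
  = \frac{p(y \mid M_A)}{p(y \mid M_B)} \cdot \frac{p(y \mid M_B)}{p(y \mid M_C)}
  = BF_{AB} \cdot BF_{BC}
  = 3 \times 4 = 12 .
```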

28 Hierarchical Linear and Mixed Models - Outline
- Hierarchical regression models
- Generalized linear mixed models
- Examples

29 Hierarchical Regression Models (chapter 11)
- Hierarchical/multilevel models are used with nested designs (multiple levels of sampling) or with clustered/correlated observations within groups. Example: data on an education system (students within schools within districts).
- Hierarchical linear models extend hierarchical models to situations where (i) a regression model describes within-group variation and (ii) a multivariate normal distribution captures heterogeneity among regressions (naive regressions unreasonable).
- Recall the example on math scores for 10th grade students from 100 schools: (i) we estimated school-specific expected math scores and (ii) assessed the variation of the estimates across schools.
- With hierarchical linear models we can model the relationship between math scores and other variables (SES), assuming the relationship is linear and that it varies from school to school.

30 Example: math score & SES
- Regress math scores on socioeconomic status (SES).
- Center SES scores within each school (the intercepts are then school-level average math scores).
- Figure: LS regression lines and plots of estimates versus group sample size.
- Individual regressions are not optimal (we want to borrow strength across schools, especially for small sample sizes).

31 Random Effects Model (simple case)
For $Y = (y_1, \ldots, y_n)$ falling into $m$ groups, we can write a simple random effects model as
$Y \mid \beta, \Sigma \sim N(X\beta,\ \sigma^2 I)$
$\beta \mid \theta, s^2 \sim N(\theta,\ s^2 I)$
- $s^2 \to 0$ implies all $\beta_i$'s are equal.
- $s^2 \to \infty$ implies all $\beta_i$'s are unrelated.
- Check that the posterior is not sensitive to the prior for $s^2$.

32 Random Effects Model (general case)
For $Y = (y_{ij})$, $i = 1, \ldots, n_j$ (observations) and $j = 1, \ldots, m$ (groups), we can write a general random effects model as
$y_{ij} = \beta_j^T x_{ij} + \epsilon_{ij}$, $\epsilon_{ij} \sim$ iid normal$(0, \sigma^2)$
$\beta_1, \ldots, \beta_m \sim N(\theta, \Sigma)$
With $Y_j = (y_{1j}, \ldots, y_{n_j j})$ we have $Y_j \sim N(X_j\beta_j,\ \sigma^2 I)$.
Notice the exchangeability assumptions.
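Simulating from the general random effects model and fitting a separate least squares regression to each group shows why borrowing strength matters. All numerical values below (number of groups, `theta`, `Sigma`, `sigma`) are assumptions chosen for illustration.

```r
set.seed(1)
m <- 30                                       # number of groups (illustrative)
n_j <- sample(5:40, m, replace = TRUE)        # unequal group sizes
theta <- c(50, 2)                             # assumed population intercept and slope
Sigma <- matrix(c(9, 0.5, 0.5, 1), 2, 2)      # assumed between-group covariance
sigma <- 5                                    # assumed within-group error sd

# beta_j ~ N(theta, Sigma), drawn with a Cholesky factor of Sigma (result is m x 2)
B <- t(theta + t(chol(Sigma)) %*% matrix(rnorm(2 * m), 2, m))

ols <- t(sapply(1:m, function(j) {
  x <- rnorm(n_j[j])                          # covariate values for group j
  y <- B[j, 1] + B[j, 2] * x + rnorm(n_j[j], 0, sigma)
  coef(lm(y ~ x))                             # separate least squares fit for group j
}))

# Squared error of the OLS slope, split by small versus large groups
err <- (ols[, 2] - B[, 2])^2
tapply(err, n_j <= 15, mean)
```

Groups with few observations tend to have much noisier slope estimates, which is the motivation for shrinking them toward the population value $\theta$ through the hierarchical prior.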

33 Mixed Effects Model
Linear mixed effects models have fixed effects (parameters for the entire population) and random effects (parameters for smaller units sampled from the population).
Reparameterize the previous model as $\beta_j = \theta + \gamma_j$, with $\gamma_1, \ldots, \gamma_m \sim N(0, \Sigma)$; then we have
$y_{ij} = \beta_j^T x_{ij} + \epsilon_{ij} = \theta^T x_{ij} + \gamma_j^T x_{ij} + \epsilon_{ij}$
with $\theta$ the fixed effect and $\gamma_1, \ldots, \gamma_m$ the random effects.

34 The mixed effects model in the more general Laird-Ware form is:
$y_{ij} = \theta^T x_{ij} + \gamma_j^T z_{ij} + \epsilon_{ij}$, $\gamma_j \sim N(0, \Sigma)$, $\epsilon_{ij} \sim N(0, \sigma^2)$
where $x_{ij}$ and $z_{ij}$ can be vectors of different length, with overlapping or non-overlapping variables.
Typically $x_{ij}$ contains group-specific predictors (constant within groups) while $z_{ij}$ contains effects specific to subunit $i$ that can be thought of as extra error terms inducing intra-cluster dependence.
Note: "Random" and "fixed" are confusing terms to a Bayesian (all parameters are random). Refer to fixed effect coefficients as those which are constant for all subjects, and to random effect coefficients as those which are subject-specific.

35 Examples
- Random intercepts: $(X\beta + Zu)_{ij} = \beta_0 + u_i + \beta_1 x_{ij}$, $\text{cov}(u) = \sigma^2$
- Random intercepts and slopes: $(X\beta + Zu)_{ij} = \beta_0 + u_i + (\beta_1 + v_i)\,x_{ij}$, $\text{cov}(u) = \Sigma_{2 \times 2}$
with $i = 1, \ldots, n$ and $j = 1, \ldots, n_i$.
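The random intercepts example shows how a shared random effect induces dependence within a cluster. As a sketch, write $\sigma_u^2$ for the variance of $u_i$ and $\sigma_\epsilon^2$ for the variance of an independent error term $\epsilon_{ij}$ (both symbols are introduced here for clarity and are assumptions about notation). Conditional on $\beta_0$ and $\beta_1$, two observations $j \neq j'$ in the same cluster $i$ share $u_i$, so:

```latex
\begin{aligned}
y_{ij} &= \beta_0 + u_i + \beta_1 x_{ij} + \epsilon_{ij},
  \qquad u_i \sim N(0, \sigma_u^2), \quad \epsilon_{ij} \sim N(0, \sigma_\epsilon^2), \\
\operatorname{cov}(y_{ij}, y_{ij'}) &= \operatorname{var}(u_i) = \sigma_u^2, \\
\operatorname{corr}(y_{ij}, y_{ij'}) &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\epsilon^2}.
\end{aligned}
```

Observations from different clusters remain uncorrelated because their random intercepts are independent.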

36 Prior Model
Semi-conjugate priors (see the multivariate normal model) lead to a straightforward Gibbs sampler:
$\theta \sim N(\mu_0, \Lambda_0)$
$\Sigma \sim IW(\eta_0, S_0^{-1})$
$\sigma^2 \sim IG(\nu_0/2,\ \nu_0\sigma_0^2/2)$
- Full conditional of $\beta_1, \ldots, \beta_m$: multivariate normal
- Full conditional of $\theta$: multivariate normal
- Full conditional of $\Sigma$: inverse Wishart
- Full conditional of $\sigma^2$: inverse gamma
The prior on $\theta$ is usually flat.
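For reference, the four full conditionals take the following standard forms under these semi-conjugate priors (as given in Hoff, Chapter 11). Here $\bar\beta$ is the average of $\beta_1, \ldots, \beta_m$, and $S_\theta$ and $\text{SSR}$ are shorthand defined in the last two lines:

```latex
\begin{aligned}
\beta_j \mid y_j, X_j, \theta, \Sigma, \sigma^2
  &\sim N\!\Big( V_j \big(\Sigma^{-1}\theta + X_j^T y_j / \sigma^2\big),\; V_j \Big),
  \quad V_j = \big(\Sigma^{-1} + X_j^T X_j / \sigma^2\big)^{-1}, \\
\theta \mid \beta_1, \ldots, \beta_m, \Sigma
  &\sim N(\mu_m, \Lambda_m),
  \quad \Lambda_m = \big(\Lambda_0^{-1} + m\Sigma^{-1}\big)^{-1},
  \quad \mu_m = \Lambda_m \big(\Lambda_0^{-1}\mu_0 + m\Sigma^{-1}\bar\beta\big), \\
\Sigma \mid \theta, \beta_1, \ldots, \beta_m
  &\sim IW\!\big(\eta_0 + m,\; [S_0 + S_\theta]^{-1}\big),
  \quad S_\theta = \sum_{j=1}^m (\beta_j - \theta)(\beta_j - \theta)^T, \\
\sigma^2 \mid \beta_1, \ldots, \beta_m, y
  &\sim IG\!\Big( \big[\nu_0 + \textstyle\sum_j n_j\big]/2,\; \big[\nu_0\sigma_0^2 + \text{SSR}\big]/2 \Big),
  \quad \text{SSR} = \sum_{j=1}^m \sum_{i=1}^{n_j} \big(y_{ij} - \beta_j^T x_{ij}\big)^2 .
\end{aligned}
```

A Gibbs sampler cycles through these four updates, drawing each $\beta_j$ separately within a sweep.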

37 Example (continued)
- Regress math scores on socioeconomic status.
- Center SES scores within each school (the intercepts are then school-level average math scores).

38 Generalized Linear Mixed Models
GLMMs combine GLMs with LMMs. They are used for data with a hierarchical structure where the normal model is not an appropriate within-group model (for example, counts or binary data).
For $m$ groups:
$\beta_1, \ldots, \beta_m \sim N(\theta, \Sigma)$
$f(y_j \mid X_j, \beta_j, \phi) = \prod_{i=1}^{n_j} f(y_{ij} \mid \beta_j^T x_{ij}, \phi)$
with $f(y \mid \beta^T x)$ a density whose mean depends on $\beta^T x$, and where we assume exchangeability across groups.
More generally:
$Y \mid \gamma \propto \exp\big( Y(\theta^T X + \gamma^T Z) - b(\theta^T X + \gamma^T Z) + c(Y) \big)$, $\gamma \sim N(0, \Gamma)$
where $b(\cdot)$ varies with the model (e.g. $b(x) = \exp(x)$ for Poisson).

39 Bayesian GLMM
- Assume diffuse or non-informative priors for the fixed effects $\beta$.
- MCMC is needed due to the intractable form of the posteriors and the marginals:
  - Full conditionals for $(\theta, \Sigma)$
  - Metropolis step for $\beta_j$ with a normal proposal centered at the previous value and with variance-covariance equal to a scaled version of the sampled $\Sigma^{(s)}$
- More readings: hierarchical centering of certain parameters (Gelfand, Sahu and Carlin 1995) and data-augmentation methods for non-conjugate priors (van Dyk and Meng 2001).
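The Metropolis step for a single $\beta_j$ can be sketched for a Poisson log-linear within-group model. This is an illustration under stated assumptions, not the course code: the function names, the simulated data, the fixed values of `theta` and `Sigma`, and the proposal `scale` are all hypothetical. In a full sampler `theta` and `Sigma` would be redrawn from their own full conditionals at every iteration.

```r
# Log full conditional of beta_j (up to a constant): Poisson log-likelihood
# plus the N(theta, Sigma) prior.
log_cond <- function(beta, y, X, theta, Sigma_inv) {
  eta <- drop(X %*% beta)
  d <- beta - theta
  sum(y * eta - exp(eta)) - 0.5 * drop(t(d) %*% Sigma_inv %*% d)
}

# One random-walk Metropolis update with proposal N(beta, scale * Sigma)
metropolis_step <- function(beta, y, X, theta, Sigma, scale = 0.5) {
  Sigma_inv <- solve(Sigma)
  prop <- drop(beta + t(chol(scale * Sigma)) %*% rnorm(length(beta)))
  log_r <- log_cond(prop, y, X, theta, Sigma_inv) -
           log_cond(beta, y, X, theta, Sigma_inv)
  if (log(runif(1)) < log_r) prop else beta   # accept the proposal or keep the current value
}

set.seed(1)
X <- cbind(1, rnorm(50))                      # simulated design for one group
y <- rpois(50, exp(X %*% c(0.5, 0.3)))        # simulated counts
theta <- c(0, 0); Sigma <- diag(2)            # assumed current values of theta and Sigma
beta <- c(0, 0); draws <- matrix(NA_real_, 2000, 2)
for (s in 1:2000) {
  beta <- metropolis_step(beta, y, X, theta, Sigma, scale = 0.05)
  draws[s, ] <- beta
}
colMeans(draws[-(1:500), ])
```

The proposal scale controls the acceptance rate: too large and most proposals are rejected, too small and the chain moves slowly, so in practice it is tuned during burn-in.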

40 Examples of GLMM
- Poisson regression. Example: Africa data
- Logistic regression model. Example: Seeds data

41 Seeds data
Tropical rain forests have up to 300 species of trees per hectare, which leads to difficulties when studying processes which occur at the community level. To gain insight into species responses, a sample of seeds was selected from a suite of eight species chosen to represent the range of regeneration types which occur in this community. This representative community was then placed in experimental plots manipulated to mimic the natural variation in light conditions found in rain forests. Mammals were excluded from one half of each plot in order to assess their effects on the regeneration of rain forest trees. Six seeds of each type were planted and an indicator of whether they germinated and survived was recorded.
Which variables are important in determining whether a seedling will survive? Are there interactions that influence survival probabilities?

42 Variables:
- SURV: survival (No = 0, Yes = 1) of seedling. Indicator of whether there was a seedling present at the end of the observation period.
- GAP: 0/1 indicator for understory versus clearing.
- CAGE: 0/1 (absent/present) enclosure to prevent mammals from eating the seeds.
- LITTER: different levels (0, 1, 2, 4).
- SPECIES: names on slides. Size = 1 (smallest) to 8 (largest). E = epigeal (cotyledons), H = hypogeal (food reserves in seed). Epigeal species rely on the cotyledons for photosynthesis and production of energy to become established. Seed size tends to be small, with few reserves in the seeds. Hypogeal species tend to have larger seeds and can rely on reserves in the seed to produce energy; thus, if initial leaves are lost to predators, there may still be reserves that can be used to produce additional leaves. Larger seeds are easier for predators to spot.
- LIGHT: measure of light levels at the forest floor.

43 The dataset seeds.txt includes columns in the following order:
- PLOT (number 1 to 8)
- SUBPLT (within the plot)
- SPECIES (character string with names above)
- IND (seeding number within plot/subplot)
- SURV (indicator of survival)
- GERM (indicator for germination)
- ESTAB (intermediate measure of survival germination)
- LIGHT (measure of light at the forest floor for the plot, observational)
- LITTER (ordered categorical variable, manipulated litter levels)
- CAGE (indicator of enclosure)
- GAP (indicator for clearing in forest, observational)

44 Graduate level courses on Bayesian Statistics
- More on MCMC
- Formal derivations of posterior distributions
- Nonparametric regression
- Bayesian survival analysis
- Bayesian spatial analysis
- Multicomparison testing
- Bayesian time series
- Graphical models
