Spatial Statistics, Chapter 4: Basics of Bayesian Inference and Computation


So far we have discussed types of spatial data, some basic modeling frameworks, and exploratory techniques. We have not discussed inference from applying models to data in any great detail (though some elements of maximum likelihood inference for random effects models have been mentioned). In this chapter we will outline Bayesian methods for statistical inference as well as the computational algorithms that are required for applying the methods to complex problems.

Once we have done this, we will discuss hierarchical spatial models and develop inferential techniques for such models in the Bayesian framework (Chapter 5, the core of the course). We will cover just the basic concepts of Bayesian inference and computation and get everyone up and running with Markov chain Monte Carlo using the WinBUGS software. Good references on Bayesian statistics: 1. Applied perspectives: Gelman, Carlin, Stern and Rubin (2004); Carlin and Louis (2000). 2. More theoretical: Robert (2001); Bernardo and Smith (1996).

Everyone is familiar with the maximum likelihood approach, which is based on a model for data f(y | θ), where θ is treated as a fixed unknown quantity of interest. Inference is then based on the notion of repeated sampling along with asymptotic arguments. In other words, we consider the distribution of the MLE θ̂ₙ induced by repeatedly sampling the data under identical conditions and, in addition, letting n → ∞. The Bayesian approach to inference is based on adding to the data model f(y | θ) a distribution for θ.

This distribution is meant to reflect external knowledge or prior opinion, for example from previous related studies. An example would be a pilot study in a clinical trial. These are usually conducted before the formal trial to obtain initial estimates of variability, which are then used in sample size calculations. This distribution for θ is known as a prior distribution and will have density π(θ | λ), where λ is a vector of parameters, known as hyperparameters, indexing a family of prior distributions.

In many situations λ is known: 1. determined from prior information; 2. deliberately set to a particular value that makes the prior vague. For example, in regression models one often sets the prior for a regression coefficient to β | λ ∼ N(0, λ) and then sets λ to a very large value, say λ = 10⁶, to provide an essentially non-informative prior.

In such cases, when λ is known, inference concerning θ is based on the posterior distribution π(θ | y, λ), which from Bayes' theorem takes the form

π(θ | y, λ) = p(y, θ | λ) / p(y | λ) = f(y | θ)π(θ | λ) / ∫ f(y | θ)π(θ | λ) dθ

a synthesis of the information contained in the likelihood and the prior. Note 1: There is no notion of repeated sampling of Y here. Inference is based strictly on the data y we have observed. Note 2: Asymptotics do not have a strong role to play. We consider the posterior distribution as determined by the sample size of the observed data. In this sense, inference is exact.
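To make the normalization concrete, here is a minimal numerical sketch (not from the notes): with a single observation, a N(θ, 1) likelihood, and a N(0, 2²) prior, the integral in the denominator can be approximated on a grid. All numerical choices here are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Grid approximation of a posterior: evaluate likelihood x prior on a grid
# and normalize numerically. Assumptions (mine): y = 1.5, sigma = 1, prior
# N(0, 2^2); the exact conjugate answer is known and serves as a check.
y = 1.5
grid = np.linspace(-10.0, 10.0, 4001)
d = grid[1] - grid[0]

unnorm = norm.pdf(y, loc=grid, scale=1.0) * norm.pdf(grid, loc=0.0, scale=2.0)
p_y = unnorm.sum() * d        # Riemann approximation to p(y) = ∫ f(y|θ)π(θ)dθ
post = unnorm / p_y           # values of π(θ|y) on the grid

post_mean = (grid * post).sum() * d
print(post_mean)              # ≈ 1.2 = y·τ²/(τ²+σ²) for these choices
```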

Note 3: Basing inference on a probability distribution for the unknowns, π(θ | y, λ), leads to straightforward inference on θ, where objects such as p-values and confidence intervals are replaced by quantities based on probability statements, which are more intuitive. If the hyperparameter λ is not known a priori, we can assign it a distribution, a hyperprior π(λ), in which case the posterior π(θ | y) is obtained as

π(θ | y) = ∫ f(y | θ)π(θ | λ)π(λ) dλ / ∫∫ f(y | θ)π(θ | λ)π(λ) dθ dλ

where we simply integrate λ out of the posterior.

Note 1: It is typical to assume Y and λ are conditionally independent given θ, so that f(y | θ, λ) = f(y | θ). Note 2: In general, Bayesian inference deals with nuisance parameters in a very straightforward way: we simply integrate them out of the posterior. In most complicated models, some components of the hyperparameter λ will be of interest as well. For example, when modeling random effects, the hyperparameter λ will be the associated variance component.

In this case we base inference for both θ and λ on their joint posterior

π(θ, λ | y) = f(y | θ)π(θ | λ)π(λ) / ∫∫ f(y | θ)π(θ | λ)π(λ) dθ dλ

Another alternative to dealing with unknown hyperparameters is to estimate them from the data. That is, we use the data to obtain λ̂, stuff this into the prior, and base inference on π(θ | y, λ̂): empirical Bayes. The estimate of λ is obtained as the maximizer of the likelihood function with θ integrated out,

λ̂ = argmax_λ p(y | λ), where p(y | λ) = ∫ f(y | θ)π(θ | λ) dθ

i.e., the usual MLE of λ. Any problems with this approach?

Replacing λ with λ̂ and proceeding as if it were known does not acknowledge the uncertainty associated with the estimation of λ̂. One of the difficulties that arises in applying Bayesian methods is the evaluation of the integrals in

π(θ | y, λ) = f(y | θ)π(θ | λ) / ∫ f(y | θ)π(θ | λ) dθ

In particular, the normalizing constant (also known as the marginal likelihood)

p(y | λ) = ∫ f(y | θ)π(θ | λ) dθ

is generally not available in closed form. This computational difficulty severely limited the application of Bayesian methods up until the early 1990s.

Since then, an increase in computing power along with the development of Markov chain Monte Carlo algorithms has led to a rapid increase in the use of Bayesian methods, especially in spatial statistics.

Example. Assume we observe a single data point Y with Y | θ ∼ N(θ, σ²), with σ² known. In this case the likelihood function is

f(y | θ) = (1/√(2πσ²)) exp(−(y − θ)²/(2σ²))

For the unknown θ, suppose we adopt a normal prior θ | µ, τ² ∼ N(µ, τ²), with µ and τ² known. The posterior in this case is

π(θ | y) = f(y | θ)π(θ | µ, τ²)/p(y | µ, τ²) ∝ f(y | θ)π(θ | µ, τ²) ∝ exp(−(y − θ)²/(2σ²)) exp(−(θ − µ)²/(2τ²))

= exp(−½ [(y − θ)²/σ² + (θ − µ)²/τ²])
= exp(−½ [τ²(y − θ)² + σ²(θ − µ)²] / (τ²σ²))
∝ exp(−½ [(τ² + σ²)θ² − 2(yτ² + µσ²)θ] / (τ²σ²))
= exp(−½ [θ² − 2θ(yτ² + µσ²)/(τ² + σ²)] / (τ²σ²/(τ² + σ²)))

Complete the square...

∝ exp(−(θ − (yτ² + µσ²)/(τ² + σ²))² / (2τ²σ²/(τ² + σ²)))

This is the kernel of a normal distribution, so

θ | y ∼ N((yτ² + µσ²)/(τ² + σ²), τ²σ²/(τ² + σ²))

Note that the posterior mean

E[θ | y] = (τ²/(τ² + σ²)) y + (σ²/(τ² + σ²)) µ

is a weighted average of the data y and the prior mean µ. The weights depend on the relative variability of the likelihood and the prior. Also note that the posterior precision (inverse variance),

1/V[θ | y] = 1/σ² + 1/τ²

is the sum of the precision of the data and the precision of the prior.

If we observe a sample y₁, ..., yₙ, then using the exact same argument and the fact that ȳ is sufficient for θ, we can show that the posterior for θ is once again normal, with

E[θ | y] = (nτ²/(nτ² + σ²)) ȳ + (σ²/(nτ² + σ²)) µ and V[θ | y] = σ²τ²/(σ² + nτ²)

What happens as n gets large? In general, for large datasets the likelihood will swamp the prior; in this way inference will be driven primarily by the observed data.
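As a quick illustration of this swamping effect, here is a sketch that evaluates the posterior mean and variance formulas above for increasing n. The hyperparameters and simulated data are invented for the example.

```python
import numpy as np

# Conjugate normal-normal update: the posterior mean is a precision-weighted
# average of ybar and the prior mean mu. Assumptions (mine): mu = 0, tau2 = 1,
# sigma2 = 4, true theta = 2.
rng = np.random.default_rng(42)
mu, tau2 = 0.0, 1.0            # prior N(mu, tau2)
sigma2, theta_true = 4.0, 2.0

for n in (1, 10, 100, 10_000):
    ybar = rng.normal(theta_true, np.sqrt(sigma2), size=n).mean()
    post_mean = (n * tau2 * ybar + sigma2 * mu) / (n * tau2 + sigma2)
    post_var = sigma2 * tau2 / (sigma2 + n * tau2)
    print(n, round(post_mean, 3), round(post_var, 5))
# As n grows the weight on ybar tends to 1 and the posterior variance to
# sigma2/n: the likelihood swamps the prior.
```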

Note 1: In the above example a conjugate prior was used for θ. A conjugate prior is a prior distribution which, when combined with a given likelihood, leads to a posterior distribution in the same family as the prior. Note 2: τ² = ∞ in the above example yields a noninformative prior, in which case π(θ | y) = N(ȳ, σ²/n); the prior has no effect on the posterior in this case. Note 3: τ² = ∞ corresponds to a flat prior π(θ) ∝ 1, which is improper.

Setting τ² to some very large value, say 10⁶, would not be strictly non-informative but would yield a vague but proper prior N(µ, 10⁶) for θ. Such vague but proper priors are often employed to achieve both approximate noninformativeness and the computational convenience of a conjugate prior. These sorts of priors have been criticized for various reasons. For example, with sparse data such a prior may not be as non-informative as one hopes. In general it is good practice to conduct a sensitivity analysis and obtain results corresponding to various forms for the prior.

Example (linear model): Suppose we observe data (Yᵢ, xᵢ), i = 1, ..., n, and we assume a linear regression model for Y of the form Y | β ∼ MVNₙ(Xβ, Σ), with Σ assumed known, and adopt the prior β ∼ MVN_p(β₀, V), with Σ, β₀ and V known. Here, interest lies in the regression coefficient β, which under the specified (conjugate) prior has posterior

β | y ∼ MVN_p(Dd, D)

where

D⁻¹ = X′Σ⁻¹X + V⁻¹ and d = X′Σ⁻¹y + V⁻¹β₀

With this posterior, one sensible estimate would be the posterior mean E[β | y] = Dd, with variation assessed using D. To obtain a non-informative prior for β we can set the prior precision to the zero matrix, V⁻¹ = 0, so that

D⁻¹ = X′Σ⁻¹X, d = X′Σ⁻¹y

If in addition we make the usual linear regression assumption Σ = σ²I, then we obtain the posterior

β | y ∼ MVN_p((X′X)⁻¹X′y, σ²(X′X)⁻¹)

The posterior mean in this case is the usual least squares estimator! In addition, the posterior covariance Cov[β | y], which we use to assess variability, is exactly the covariance of β̂ in the frequentist setting: β̂ ∼ MVN_p(β, σ²(X′X)⁻¹).

We see that adopting a non-informative prior in the Bayesian setting leads, in the linear model, to results that are formally equivalent to the usual frequentist results. The frequentist approach, in this case, is just a special case of the Bayesian approach corresponding to a particular prior.
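A short sketch of this result, using the notation above (D, d, β₀, V) on made-up data: with prior precision V⁻¹ = 0, the posterior mean Dd reproduces the least squares estimate.

```python
import numpy as np

# Conjugate linear-model posterior beta | y ~ MVN(Dd, D) with
# D^{-1} = X'Σ^{-1}X + V^{-1} and d = X'Σ^{-1}y + V^{-1}β0.
# The design, σ², and prior settings are invented for illustration.
rng = np.random.default_rng(0)
n, p, sigma2 = 50, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

Sigma_inv = np.eye(n) / sigma2        # Σ = σ² I
V_inv = np.zeros((p, p))              # non-informative: prior precision 0
beta0 = np.zeros(p)

D = np.linalg.inv(X.T @ Sigma_inv @ X + V_inv)
d = X.T @ Sigma_inv @ y + V_inv @ beta0
post_mean = D @ d

ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(post_mean, ols)                 # identical up to numerical error
```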

Bayesian Inference from the Posterior Distribution

As mentioned previously, in most real problems the posterior distribution π(θ | y) will not be available in closed form. If we do have the posterior distribution, or some estimate of it, inference regarding unknown parameters as well as prediction becomes fairly straightforward. For now, we will assume we have access to the posterior and discuss how one conducts estimation, tests hypotheses, and goes about prediction using it.

For ease of presentation I will also assume dim(θ) = 1; however, everything carries forward to the multidimensional setting in a straightforward manner. Once we have discussed the posterior quantities one uses for inference and prediction, we will discuss estimation of these quantities using Monte Carlo methods. Estimation: If we have π(θ | y) we can estimate θ using some measure of centrality derived from this distribution.
1. Posterior mean: θ̂ = E[θ | y]
2. Posterior median: θ̂ such that ∫ from −∞ to θ̂ of π(θ | y) dθ = 0.5
3. Posterior mode: θ̂ such that π(θ̂ | y) ≥ π(θ | y) for all θ ∈ Θ

Note that if π(θ | y) is symmetric and unimodal, all three are equivalent. When this is not the case we can choose one, or report all three as summaries of the posterior. Note that the posterior mode can be particularly bad in some cases, for example if the posterior distribution looks like an exponential distribution. The posterior mean is often used; however, it can be sensitive to heavy tails. The posterior median is the most robust of the three with respect to different distributional forms.

Note that in many cases, all three of these estimators will have good frequentist properties (small bias, etc.). Formally, starting with a loss function L(θ, θ̂), the Bayes estimator θ̂ is the estimator that minimizes the posterior expected value of the loss function. For squared error loss, L(θ, θ̂) = (θ − θ̂)², the Bayes estimator is the posterior mean. For absolute error loss, L(θ, θ̂) = |θ − θ̂|, the Bayes estimator is the posterior median.

Note that if one adopts a flat improper prior π(θ) ∝ 1 and the posterior is proper, then the posterior mode is exactly the maximum likelihood estimator. In the hierarchical spatial models we will consider in Chapter 5, the posterior median and mean will be used for estimation.

Interval estimation: We can use the posterior distribution to obtain interval estimates with a direct probability interpretation. Suppose q_L and q_U are two points in the parameter space such that

∫ from −∞ to q_L of π(θ | y) dθ = α/2 and ∫ from q_U to ∞ of π(θ | y) dθ = α/2

Then we have P(q_L < θ < q_U | y) = 1 − α. The interval (q_L, q_U) is called a 100(1 − α)% equal-tail credible interval for θ.

More generally, any subset C of the parameter space that satisfies

1 − α = P(θ ∈ C | y) = ∫_C π(θ | y) dθ

is called a 100(1 − α)% credible set for θ. Note that for a given continuous posterior distribution there are an infinite number of such sets C. Typically we want the one that encompasses the required 1 − α probability but also has the shortest length. Such a shortest-length set is achieved by restricting elements of C to have posterior density greater than some cutoff value k(α): highest posterior density sets.

The cutoff is chosen to be as large as possible while still maintaining the required coverage probability. When the posterior is approximately symmetric and unimodal, the equal-tail interval is typically used. Prediction: Prediction of a new observation Y₀ given data y is obtained via the posterior predictive distribution

p(y₀ | y) = ∫ f(y₀ | y, θ) π(θ | y) dθ

We can predict Y₀ using the mean of this distribution and easily form prediction intervals in the same way we formed credible intervals above.
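The following sketch illustrates the interval estimates just discussed from posterior draws (the skewed gamma-shaped "posterior" is a stand-in of my own): the equal-tail interval uses quantiles, while a simple HPD approximation takes the shortest window containing a fraction 1 − α of the sorted draws.

```python
import numpy as np

# Equal-tail vs (approximate) HPD interval from Monte Carlo samples.
rng = np.random.default_rng(1)
draws = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # skewed stand-in posterior
alpha = 0.05

eq_tail = np.quantile(draws, [alpha / 2, 1 - alpha / 2])

s = np.sort(draws)
k = int(np.ceil((1 - alpha) * len(s)))       # draws each candidate window must contain
widths = s[k - 1:] - s[: len(s) - k + 1]     # width of every window of k sorted draws
i = int(np.argmin(widths))
hpd = (s[i], s[i + k - 1])

print("equal tail:", eq_tail)
print("HPD approx:", hpd)   # shorter than the equal-tail interval for a skewed posterior
```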

Notice that this is much nicer than the frequentist method of estimating θ and plugging the estimate into the conditional distribution. As with other posterior quantities, we will use Monte Carlo methods to form predictions and prediction intervals. Hypothesis Testing: If we restrict our attention to a particular family of models f(y | θ) indexed by θ, a hypothesis is simply a statement about θ. For example, we may have H₀: β ∈ (−1, 1) for a regression coefficient, or H₀: µ₁ < µ₂ for a pair of normal means (µ₁, µ₂).

Given a posterior distribution π(θ | y), we can evaluate the hypothesis of interest by evaluating its posterior probability

P(H₀ | y) = ∫_{H₀} π(θ | y) dθ

with small values giving evidence in favor of the alternative. This posterior probability is easily conveyed and interpreted by subject-matter specialists and is far more meaningful than a p-value. There are issues, however, with the prior distribution. The posterior probability will necessarily be a function of the prior probability π(H₀). It may not be clear how to choose this; however, a default choice is the fair prior π(H₀) = 0.5.

In addition, testing point null hypotheses such as H₀: β = 0 requires special care, since such a hypothesis will have posterior probability zero under a continuous prior. An easier way of assessing such point-null hypotheses is to form the 100(1 − α)% credible interval for β and to simply check whether the null value is contained in the interval; this is what is usually done. Model selection: Choosing between competing models for data is a very important statistical task. In the likelihood setting, one usually uses a likelihood ratio or a score test to compare models. What are the limitations of these?

Models must be nested. That is, one of the competing models must be a submodel of the other, for example a submodel obtained by setting one of the parameters in the model to a particular value. In addition, the submodel cannot correspond to a value on the boundary of the parameter space. If these conditions don't hold, the usual asymptotics for these test statistics break down.

In general, we would like to compare nonnested models: 1. compare proportional hazards models to accelerated failure time models; 2. compare different link functions for generalized linear models; 3. compare different distributions for random effects in mixed models. Also, values of the parameter that lie on the boundary of the parameter space can sometimes correspond to very interesting submodels that we would like to test for. Unfortunately, even in the Bayesian framework there is no universally accepted technique for doing model selection.

The traditional Bayesian method for comparing models is through the use of posterior model probabilities and Bayes factors. Suppose we have two competing (not necessarily nested) models M₁ and M₂, with parameters θ₁ and θ₂ respectively and associated prior densities πᵢ(θᵢ), i = 1, 2. To complete the prior specification we assign to each model a prior probability π(Mᵢ), i = 1, 2. Again, a default choice is usually π(Mᵢ) = 0.5.

To choose between models we can calculate the posterior model probabilities, obtained from Bayes' theorem as

P(Mᵢ | y) = p(y | Mᵢ)π(Mᵢ) / [p(y | M₁)π(M₁) + p(y | M₂)π(M₂)]

In order to calculate these we need the marginal distributions for each model, p(y | Mᵢ), which are obtained as

p(y | Mᵢ) = ∫ f(y | θᵢ, Mᵢ)πᵢ(θᵢ) dθᵢ, i = 1, 2

where f(y | θᵢ, Mᵢ) is the likelihood function under model Mᵢ. The posterior probabilities are summarized using the Bayes factor.

The Bayes factor, BF, is an odds ratio: the ratio of the posterior odds of M₁ to the prior odds of M₁,

BF = [P(M₁ | y)/P(M₂ | y)] / [P(M₁)/P(M₂)]

which using Bayes' theorem is just

BF = p(y | M₁) / p(y | M₂)

a ratio of the marginal likelihoods under the two models. If BF > 1 then the posterior odds of M₁ are greater than the prior odds of M₁. That is, having observed the data, the odds of M₁ have increased: the data favor M₁.
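In the conjugate normal example from earlier, the marginal likelihood is available in closed form, p(y) = N(y; µ, σ² + τ²), so an exact Bayes factor can be computed. The sketch below (the two models and all numbers are my own illustration, not from the notes) compares a point null against a diffuse alternative.

```python
from scipy.stats import norm

# Bayes factor from closed-form marginal likelihoods. Assumptions (mine):
# M1 is the point null theta = 0, so p(y | M1) = N(y; 0, sigma2);
# M2 puts a N(0, tau2) prior on theta, so p(y | M2) = N(y; 0, sigma2 + tau2).
y, sigma2, tau2 = 1.8, 1.0, 4.0

m1 = norm.pdf(y, loc=0.0, scale=sigma2 ** 0.5)            # p(y | M1)
m2 = norm.pdf(y, loc=0.0, scale=(sigma2 + tau2) ** 0.5)   # p(y | M2)
print(m1 / m2)   # BF < 1 here: the data favor the diffuse alternative M2
```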

This is a very elegant way of comparing two models; unfortunately there are some difficulties that hamper the use of the BF for model comparison. First, the calculation of the marginal likelihoods p(y | Mᵢ) is difficult. This is the normalizing constant in the posterior distribution of θᵢ in model Mᵢ. A shortcut method for approximating the Bayes factor is the BIC, which for a given model Mᵢ is defined as

BIC = −2 l_{Mᵢ}(θ̂ᵢ) + p log(n)

a fit-plus-penalty model selection tool, where l_{Mᵢ}(θᵢ) is the log-likelihood for model Mᵢ and θ̂ᵢ is the MLE of θᵢ.

Models with lower BIC are preferred. In addition, the difference in the BIC scores of two models is asymptotically equal to minus twice the log of the Bayes factor comparing the two models,

ΔBIC ≈ −2 log BF

assuming both models are a priori equally likely. So the Bayes factor comparing two models can be approximated by calculating the corresponding BICs. Unfortunately, this approximation is not valid for random effects models.

Another serious problem with Bayes factors, besides their computation, is that they are not well defined when non-informative priors are employed, even if the corresponding posterior is proper. This is due to the fact that the marginal likelihood

p(y | Mᵢ) = ∫ f(y | θᵢ, Mᵢ)πᵢ(θᵢ) dθᵢ

is not well defined when πᵢ(θᵢ) is improper. Similar to the BIC, another fit-plus-penalty model selection tool that we have already seen is the AIC,

AIC = −2 l_{Mᵢ}(θ̂ᵢ) + 2p

which replaces log(n) in the penalty by 2 and is therefore more liberal whenever n > e².
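For concreteness, a tiny sketch of the two criteria as defined above; the log-likelihood values below are hypothetical.

```python
import numpy as np

# Fit-plus-penalty criteria: AIC = -2*loglik + 2p, BIC = -2*loglik + p*log(n).
def aic(loglik, p):
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    return -2.0 * loglik + np.log(n) * p

# two hypothetical fitted models on the same n = 200 observations
print(aic(-321.4, p=3), bic(-321.4, p=3, n=200))
print(aic(-318.9, p=6), bic(-318.9, p=6, n=200))
# AIC penalizes each extra parameter by 2, BIC by log(200) ≈ 5.3,
# so BIC is the more conservative criterion here (n > e²).
```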

Unfortunately the AIC, just like the BIC, is not appropriate for comparing models with random effects. Why are these tools inappropriate for such models? The raw number of parameters, p, that appears in the penalty of both the AIC and the BIC is not an appropriate way to measure complexity in random effects models, where the parameters are correlated and hence the effective number of parameters will generally be less than p. To address this concern, Spiegelhalter et al. (2002) proposed a generalization of the AIC known as the deviance information criterion (DIC), which is more appropriate for doing model selection with Bayesian hierarchical models.

This generalization is based on the deviance statistic, which for a given model takes the form

D(θ) = −2 log f(y | θ) + 2 log h(y)

where f(y | θ) is the likelihood function under the model and h(y) is some standardizing function of the data alone. Typically, we set h(y) = 1 for convenience. To assess the fit of the model we consider the posterior mean of the deviance, D̄ = E[D(θ) | y], which is smaller for better-fitting models. To penalize this measure of fit, we measure a given model's complexity by the effective number of parameters p_D, defined by

p_D = E[D(θ) | y] − D(θ̄)

where θ̄ is the posterior mean of θ.

Is p_D always positive? Why or why not? This penalty counts the number of parameters while also accounting for correlation or shrinkage in the parameters. For example, in a model with spatially correlated random effects, we may have n random effects contributing to the parameter count. Rather than simply adding n to the total parameter count, the p_D penalty accounts for the fact that the random effects are spatially correlated, so the effective number added to the penalty due to the random effects will be less than n.

Note that unlike the penalty term used in the AIC and BIC, this penalty will also depend on the observed data. If the data encourage a great deal of shrinkage in the random effects, the p_D value, and hence the penalty, will be lower. Adding p_D to the posterior mean of the deviance yields the DIC,

DIC = D̄ + p_D = 2E[D(θ) | y] − D(θ̄)

with smaller values being preferred. For non-hierarchical models (linear regression models, GLMs, parametric survival models), the value of p_D will be very close to the actual parameter count and the DIC will essentially be equal to the AIC.
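A sketch of the DIC computation from posterior draws, taking h(y) = 1 so that D(θ) = −2 log f(y | θ). To keep the snippet self-contained, the "MCMC output" is drawn directly from a known conjugate posterior; in practice these draws would come from a sampler.

```python
import numpy as np
from scipy.stats import norm

# DIC = Dbar + p_D = 2*Dbar - D(theta_bar), with p_D = Dbar - D(theta_bar).
rng = np.random.default_rng(7)
y = rng.normal(2.0, 1.0, size=50)          # data, sigma = 1 known
# stand-in posterior draws for theta: N(ybar, 1/n) under a vague prior
draws = rng.normal(y.mean(), np.sqrt(1.0 / len(y)), size=5_000)

def deviance(theta):
    return -2.0 * norm.logpdf(y, loc=theta, scale=1.0).sum()

dbar = np.mean([deviance(t) for t in draws])   # posterior mean deviance
dhat = deviance(draws.mean())                  # deviance at the posterior mean
p_d = dbar - dhat                              # effective number of parameters
print(p_d, dbar + p_d)                         # p_D ≈ 1 here; DIC = Dbar + p_D
```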

The DIC is a very popular model selection tool, primarily because it is easy to calculate (using output from the Gibbs sampler) and is automatically computed by the WinBUGS MCMC software. There is no theoretical justification for its use outside the exponential family class of models; nevertheless, it has been used for model selection in far more complicated situations and tends to work fairly well. The lack of a theoretical justification implies that the DIC, in many situations, can only be considered a quick-and-dirty model selection tool used to rank a set of candidate models. My feeling is that the DIC tends to be too liberal in model selection, generally choosing more complicated models than perhaps necessary. I use it a lot anyways...

Another problem with the DIC is that it is not invariant to re-parametrization. Change the parametrization of your model and the DIC for the model will change! Posterior predictive loss criteria: Another Bayesian model selection tool, developed by Gelfand and Ghosh (1998). The posterior predictive loss criterion focuses on model evaluation based on the notion of prediction. In particular, we consider a hypothetical replicate data set Y_rep drawn under the same conditions (i.e. same positions, same covariate values, etc.) as the observed data, and we assume that Y_rep is conditionally independent of Y given θ.

The criterion has a decision-theoretic foundation, where the action of interest is prediction of Y_rep using the posterior predictive distribution of Y_rep,

f(y_rep | y) = ∫ f(y_rep | θ)π(θ | y) dθ

One chooses a loss function that penalizes predictions in some way, and the selected models are those that perform well under the given loss function. Under squared error loss, the criterion penalizes actions (predictions of Y_rep) when the predictive mean E[Y_rep,i | y], under a given model, departs from the observed data (predictive bias), and also penalizes actions when the predictive variability V[Y_rep,i | y] for a given observation is high.

In other words, under squared error loss, the criterion chooses models that have small mean squared error of prediction with respect to the replicate data Y_rep. The criterion can be written as the sum of two terms, D = G + P, where

G = Σᵢ₌₁ⁿ (E[Y_rep,i | y] − y_obs,i)² and P = Σᵢ₌₁ⁿ V[Y_rep,i | y]

Models that have a small combination of predictive bias and variance (with respect to replicate data) will yield small D values; these are the preferred models under this criterion. For a given model, we can estimate E[Y_rep,i | y] and V[Y_rep,i | y] using the Monte Carlo methods we will discuss down the road. This criterion is not as popular as the DIC. This is likely due to the fact that it is not automatically calculated in WinBUGS (but it is easily calculated!) and requires the choice of a loss function.
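A sketch of estimating G and P from predictive samples, anticipating the Monte Carlo methods discussed later; the data and the stand-in posterior draws are simulated so the snippet runs on its own.

```python
import numpy as np

# Posterior predictive loss D = G + P under squared error loss.
# y_rep is a (J x n) array of replicate data sets, one per posterior draw.
rng = np.random.default_rng(3)
n, J = 40, 2_000
y_obs = rng.normal(1.0, 1.0, size=n)

theta_draws = rng.normal(y_obs.mean(), 1.0 / np.sqrt(n), size=J)  # stand-in posterior
y_rep = rng.normal(theta_draws[:, None], 1.0, size=(J, n))        # replicate data

G = np.sum((y_rep.mean(axis=0) - y_obs) ** 2)   # predictive bias term
P = np.sum(y_rep.var(axis=0))                   # predictive variance term
print(G, P, G + P)                              # smaller D = G + P is better
```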

Since it is based on the posterior predictive distribution and the notion of prediction, using this criterion for model selection is quite natural if prediction is an eventual goal. In general it seems to have a stronger theoretical foundation than the DIC.

The Basics of Bayesian Computation

Although the posterior π(θ | y) provides all of the information concerning θ, simply writing it down is not enough:

π(θ, λ | y) = f(y | θ)π(θ | λ)π(λ) / ∫∫ f(y | θ)π(θ | λ)π(λ) dθ dλ

The normalizing constant in this distribution will typically be analytically intractable. In addition, we will usually be interested in the marginal distributions π(θᵢ | y) of certain components of θ,

π(θᵢ | y) = ∫∫ f(y | θ)π(θ | λ)π(λ) dλ dθ₋ᵢ / ∫∫ f(y | θ)π(θ | λ)π(λ) dθ dλ

where θ₋ᵢ denotes the parameter vector θ with the i-th component θᵢ removed. For example, the marginal distribution π(θᵢ | y) is required to construct a 100(1 − α)% credible interval for θᵢ. The problem we have with obtaining the posterior and any of its marginals is essentially one of integration.

To implement Bayesian methods, we need ways of approximating high-dimensional integrals. There are several ways to approach this problem: 1. Asymptotic approximations: Laplace approximation, Bayesian CLT. 2. Quadrature: methods like Simpson's rule, etc. 3. Importance sampling: a method for generating independent realizations from the posterior; hard to implement in high-dimensional problems (many parameters). 4. Markov chain Monte Carlo (MCMC): methods for generating dependent realizations from the posterior; easy to implement in many high-dimensional problems.

We will only discuss MCMC, since it is essentially the only one of these methods applicable to spatial statistics. Note that Laplace approximations have been used in implementing frequentist techniques for spatial statistics (penalized quasi-likelihood), but we will not discuss these.

Posterior inference through simulation: Even though we cannot access the posterior distribution in its analytic form, if it were possible to simulate independent realizations θ^(1), θ^(2), ..., θ^(J) from the posterior distribution,

θ^(j) iid ∼ π(θ | y), j = 1, ..., J

we could use these simulated realizations, called Monte Carlo samples, to summarize the posterior distribution, estimate posterior quantities of interest, and hence conduct inference based on these simulations. To see this, it is helpful to consider the duality between a probability density function and a histogram of a set of random draws from that distribution.

Given a large enough sample θ^(1), ..., θ^(J), the histogram based on θ^(1), ..., θ^(J) can provide essentially complete information about the density. Consider approximating the density of a Weibull(ρ, λ) distribution with shape ρ = 2 and scale λ = 1. The density can be approximated by the histogram obtained from simulated draws from this distribution.

[Figure: four histograms of simulated Weibull(2, 1) draws (panels for 10 draws, 100 draws, and two larger sample sizes), each on a density scale; the histogram approaches the true density as the number of draws increases.]

Using a sample θ^(1), ..., θ^(J) simulated from the posterior distribution, we can estimate the various moments of the posterior, percentiles, and other summary statistics; indeed, we can estimate any aspect of the posterior distribution, to a level of precision which can itself be estimated.

For example:
1. Ê[θᵢ | y] = (1/J) Σⱼ₌₁ᴶ θᵢ^(j)
2. P̂(θᵢ ∈ D | y) = (1/J) Σⱼ₌₁ᴶ I{θᵢ^(j) ∈ D}
3. P̂(θᵢ > θₖ | y) = (1/J) Σⱼ₌₁ᴶ I{θᵢ^(j) > θₖ^(j)}
4. 95th percentile of π(θᵢ | y): use the 0.95J-th order statistic of θᵢ^(1), ..., θᵢ^(J)
5. The DIC for a model with deviance D(θ) = −2 log f(y | θ) is estimated as DIC^ = 2 (1/J) Σⱼ₌₁ᴶ D(θ^(j)) − D((1/J) Σⱼ₌₁ᴶ θ^(j))
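A sketch of these simulation-based summaries on a made-up matrix of posterior draws with two components, θ₁ and θ₂:

```python
import numpy as np

# Monte Carlo posterior summaries from a (J x 2) matrix of draws.
# The correlated bivariate normal "posterior" is invented for illustration.
rng = np.random.default_rng(11)
J = 50_000
draws = rng.multivariate_normal([0.0, 0.5], [[1.0, 0.3], [0.3, 1.0]], size=J)
th1, th2 = draws[:, 0], draws[:, 1]

print(th1.mean())                        # 1. E[theta_1 | y]
print(np.mean((th1 > -1) & (th1 < 1)))   # 2. P(theta_1 in (-1, 1) | y)
print(np.mean(th1 > th2))                # 3. P(theta_1 > theta_2 | y)
print(np.quantile(th1, 0.95))            # 4. 95th percentile of pi(theta_1 | y)
```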

In addition, the marginal posterior density π(θᵢ | y) can be estimated using the histogram of the simulated values θᵢ^(1), θᵢ^(2), ..., θᵢ^(J), or a smooth estimate can be obtained using a kernel density estimate based on the simulated values. The level of precision depends only on the Monte Carlo sample size and can be increased to an arbitrarily high level simply by increasing the size of the Monte Carlo sample. We usually drop the hat notation for these simulation-based estimators under the assumption that a very large Monte Carlo sample size has been used.

Note that provided the draws θ^(1), ..., θ^(J) are iid from π(θ | y), the weak law of large numbers guarantees that simulation-based estimators such as

Ê[θᵢ | y] = (1/J) Σⱼ₌₁ᴶ θᵢ^(j) and P̂(θᵢ ∈ D | y) = (1/J) Σⱼ₌₁ᴶ I{θᵢ^(j) ∈ D}

are all simulation consistent. In other words, they converge in probability to the actual posterior quantities as the Monte Carlo sample size J → ∞. Unfortunately, generating independent draws from a possibly high-dimensional and nonstandard distribution π(θ | y) can be very difficult.

This is where Markov chain Monte Carlo comes in... initially developed by Metropolis et al. (1953) for applications in statistical physics, extended by Hastings (1970), and applied to image restoration problems by Geman and Geman (1984), where the Gibbs sampler, a special case of Metropolis-Hastings, was developed. These algorithms remained virtually unnoticed by statisticians until Gelfand and Smith (1990), in a landmark paper, observed that they could be exploited in Bayesian statistics.

Idea behind MCMC: Simulate realizations from a Markov chain θ^(1) → θ^(2) → θ^(3) → θ^(4) → ... → θ^(k) → ... that converges to a unique stationary or limiting distribution that is the posterior distribution of interest. Once we are able to simulate realizations from a Markov process whose limiting distribution is π(θ | y), we run the simulation long enough, say for t₀ iterations, θ^(1) → θ^(2) → θ^(3) → θ^(4) → ... → θ^(t₀), so that future draws θ^(t₀+1), θ^(t₀+2), ..., θ^(t₀+J) constitute a sample of size J from the true posterior distribution.

For a given posterior π(θ | y), there are a variety of ways to construct such a chain; moreover, to do so, we only need to know the posterior distribution up to a normalizing constant. That is, the simulation of such a Markov chain requires only the ability to evaluate f(y | θ)π(θ), without knowledge of the normalizing constant ∫ f(y | θ)π(θ) dθ. Given the samples from the Markov chain (after it has reached its stationary distribution), θ^(t₀+1), θ^(t₀+2), ..., θ^(t₀+J), we can construct simulation-consistent estimates of posterior quantities as before.

Note 1: The sampled values {θ^(t₀+1), θ^(t₀+2), ..., θ^(t₀+J)} are not independent; they are draws from a Markov chain. Note that a dependent sample contains less information about the posterior than a sample of independent values. The precision of our Monte Carlo estimates will therefore depend on (I) the Monte Carlo sample size J, as in the case of independent samples, and (II) the level of dependence or autocorrelation in the sampled values. If the level of autocorrelation in the sampled values is high, then we will need a larger value of J to produce a reliable posterior summary.

Note 2: It is critical for us to be able to decide when the Markov chain has converged to its stationary distribution, π(θ | y). That is, we need some way of determining the value of the integer t₀ such that for every k > t₀ the sampled values θ^(k) are draws from the stationary distribution of the Markov chain. The initial set of sampled values {θ^(0), θ^(1), ..., θ^(t₀)} is a transient phase of the Markov chain simulation and is known as the burn-in period.

We do not use these initial sampled values for posterior inference, as we cannot be sure that they come from the posterior distribution; they are discarded. Examining the sequence of simulated values θ^(0), θ^(1), ... and determining when the chain has converged to its stationary distribution (in other words, determining the number of initial values to discard as burn-in) is known as convergence assessment.

Given the above notes, we will answer the following questions in order: 1. How exactly do I construct and simulate from a Markov chain that has π(θ | y) as its stationary distribution? Gibbs sampling, Metropolis, and Metropolis-Hastings algorithms. 2. How do I assess convergence of the chain to the posterior distribution? 3. Since the posterior samples are correlated, how do I assess the quality of my MCMC-based posterior estimates? In other words, how can I assess the Monte Carlo variance of the posterior summary I have produced?

Gibbs Sampling

The simplest and most widely used of all MCMC algorithms. Suppose our posterior distribution [θ | y] is k-dimensional, where k could be (and usually is in spatial models) large: θ = (θ₁, ..., θₖ). For any component θᵢ, we define the full conditional distribution as

[θᵢ | θ₁, ..., θᵢ₋₁, θᵢ₊₁, ..., θₖ, y] = [θᵢ | θ₋ᵢ, y]

the distribution of θᵢ conditional on y and all other components of θ.

The Markov chain constructed through Gibbs sampling generates realizations from the posterior [θ | y] by iteratively generating realizations from the full conditional distributions. Initialization: We begin the procedure by choosing an initial or starting value for θ, θ^(0) = (θ₁^(0), θ₂^(0), ..., θₖ^(0)). The starting value θ^(0) can be any point in the parameter space over which the posterior distribution is defined.

Simulation: T iterations of the Gibbs sampler produce realizations θ^(0) → θ^(1) → ... → θ^(T) by repeating the following process for i = 1, ..., T:
1. generate θ₁^(i) ∼ [θ₁ | θ₂^(i−1), θ₃^(i−1), ..., θₖ^(i−1), y]
2. generate θ₂^(i) ∼ [θ₂ | θ₁^(i), θ₃^(i−1), ..., θₖ^(i−1), y]
3. generate θ₃^(i) ∼ [θ₃ | θ₁^(i), θ₂^(i), θ₄^(i−1), ..., θₖ^(i−1), y]
...
k. generate θₖ^(i) ∼ [θₖ | θ₁^(i), θ₂^(i), ..., θ_{k−1}^(i), y]

At iteration i of the simulation, the vector θ^(i) is drawn in an iterative fashion by simulating each of its components θⱼ^(i), j = 1, ..., k, from the corresponding full conditional distribution

[θⱼ | θ₁^(i), ..., θ_{j−1}^(i), θ_{j+1}^(i−1), ..., θₖ^(i−1), y]

This procedure produces a Markovian sequence with [θ | y] as its stationary distribution. Convergence to the stationary distribution is guaranteed for a very wide class of models. Note 1: It is not obvious why this works.
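As a concrete example (my own, not one worked in the notes), here is a two-component Gibbs sampler for normal data with unknown mean and variance under semi-conjugate priors µ ∼ N(m₀, v₀) and σ² ∼ InvGamma(a, b); both full conditionals are standard, so every update is a direct draw.

```python
import numpy as np

# Gibbs sampler for (mu, sig2) in the model y_i ~ N(mu, sig2).
# Hyperparameters and data are illustrative only.
rng = np.random.default_rng(5)
y = rng.normal(2.0, 1.5, size=100)
n, ybar = len(y), y.mean()
m0, v0, a, b = 0.0, 100.0, 2.0, 2.0

T = 5_000
mu, sig2 = 0.0, 1.0                          # arbitrary starting values
samples = np.empty((T, 2))
for t in range(T):
    # [mu | sig2, y] is normal
    prec = 1.0 / v0 + n / sig2
    mean = (m0 / v0 + n * ybar / sig2) / prec
    mu = rng.normal(mean, np.sqrt(1.0 / prec))
    # [sig2 | mu, y] is inverse gamma: draw a gamma and invert
    sig2 = 1.0 / rng.gamma(a + n / 2.0, 1.0 / (b + 0.5 * np.sum((y - mu) ** 2)))
    samples[t] = mu, sig2

burn = 1_000                                 # discard the burn-in period
print(samples[burn:, 0].mean(), samples[burn:, 1].mean())
```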

Note 2: The Gibbs sampler reduces the problem of simulating from a high-dimensional distribution [θ | y] to that of iteratively simulating from a sequence of lower-dimensional (typically 1-dimensional) distributions. Each component θⱼ of θ need not be 1-dimensional; however, this is often the case. If θⱼ is not 1-dimensional, it is typically of low dimension. Simulating from low-dimensional full conditional distributions is typically easy. Since they are of low dimension, we can easily find the associated normalizing constant of each; however, knowledge of this normalizing constant is often not required.

Techniques used to simulate from full conditional distributions: 1. If the full conditional is a standard distribution (gamma, normal, etc.), we can simulate from it directly. 2. Rejection sampling: requires finding an envelope distribution. 3. Adaptive rejection sampling (ARS): requires log-concavity of the density associated with the full conditional distribution. 4. Slice sampling and Metropolis-Hastings, which we will discuss later.

Note 3: If X, Y, Z are random vectors, then in order to simulate realizations Z^(1), ..., Z^(J) from [Z | y], it is sufficient to simulate realizations (Z^(1), X^(1)), ..., (Z^(J), X^(J)) from [Z, X | y] and then simply consider the Z components of these realizations. Once we have a sample θ^(t₀+1), ..., θ^(t₀+J) from the posterior [θ | y], a sample from the marginal posterior of any component of θ, say [θⱼ | y], is obtained simply by considering the corresponding components of the simulated values: θⱼ^(t₀+1), ..., θⱼ^(t₀+J).

Note 4: The Gibbs sampler is guaranteed to converge for any starting value θ^(0). Regardless of the initial value, the chain will eventually forget its initial state and converge to the stationary distribution. The choice of initial value is still important... if the initial value is located in a region of high posterior mass, the chain will converge quickly; however, an initial value with low posterior mass may require a longer burn-in period. In practice one usually runs several chains initialized from different starting values and compares the resulting sequences to assess convergence.

Upon convergence, regardless of their starting values, all chains will be sampling from the same distribution. Note 5: Deriving full conditional distributions is straightforward. Suppose we want [θᵢ | θ₋ᵢ, y], which is, of course, characterized by the corresponding density (or pmf) f(θᵢ | θ₋ᵢ, y):

f(θᵢ | θ₋ᵢ, y) = f(θᵢ, θ₋ᵢ, y) / f(θ₋ᵢ, y) = f(θ, y) / f(θ₋ᵢ, y)

In the context of the full conditional distribution, both θ₋ᵢ and y are known, so that

f(θᵢ | θ₋ᵢ, y) ∝ f(θ, y) = f(y | θ)π(θ)

The full conditional density (pmf) is just proportional to likelihood × prior. So we write down f(y | θ)π(θ) and view it as a function of θᵢ to obtain the full conditional density (pmf) up to a normalizing constant. If we recognize this as the kernel of some standard distribution, we can simulate from it easily. Otherwise we can use one of the aforementioned techniques (rejection sampling, ARS, slice sampling, or MH), which only require knowledge of the full conditional density up to a normalizing constant.

Note 6: Situations where Gibbs sampling does not work well: 1. Multimodal posterior distributions: the Markov chain can spend too much time in one of the modes before tunneling into the other modes. After an initial burn-in sample has been discarded, Gibbs sampling in this setting will need very long simulation runs in order for the posterior space to be adequately explored. 2. Weakly identified models in conjunction with vague prior distributions and sparse data will usually produce weird results.

Note 7: Missing data is handled very nicely within a Gibbs sampling framework. Suppose our data y can be partitioned into two components, y = (y_obs, y_mis): one we have actually observed and the other we have not.

In this case y is not the observed data; rather, it is a hypothetical set of complete data, and there are two reasons for considering this: 1. The likelihood f(y | θ) corresponding to y = (y_obs, y_mis) is much simpler in form than the likelihood corresponding to the data we have observed,

f(y_obs | θ) = ∫ f(y | θ) dy_mis

and so we consider the complete data in an attempt to work with the simpler likelihood function: data augmentation. 2. We may be interested in conducting inference on the missing data itself.

In either case we use the Gibbs sampler to simulate from the joint posterior [θ, y_mis | y_obs], which then yields samples from the marginal posterior [θ | y_obs], which we use for conducting inference on θ, and also yields samples from [y_mis | y_obs], which we can use for conducting inference on the missing data (if this is of interest). Why is this easy to do within a Gibbs sampler? Consider the density of the joint posterior:

π(θ, y_mis | y_obs) ∝ π(θ, y_mis, y_obs) = f(y_mis, y_obs | θ)π(θ)

To simulate from this distribution using the Gibbs sampler, we work iteratively with the conditional distributions

θ ∼ [θ | y_mis, y_obs] = [θ | y] and y_mis ∼ [y_mis | y_obs, θ]

This allows us to work with the complete-data likelihood when drawing from [θ | y_mis, y_obs] = [θ | y], and y_mis is essentially treated as an additional parameter, drawn from its full conditional distribution at each iteration.

Note 8: Prediction within a Gibbs sampling framework is straightforward. Recall that we base prediction of Y_new on the posterior predictive distribution, whose density (pmf) can be written as

f(y_new | y) = ∫ f(y_new | θ, y)π(θ | y) dθ

which, under the assumption that Y_new is conditionally independent of Y given θ, is given by

f(y_new | y) = ∫ f(y_new | θ)π(θ | y) dθ

If we have samples θ^(1), θ^(2), ..., θ^(J) from the posterior distribution (obtained using the Gibbs sampler!), we can generate one-for-one samples:
1. generate y_new^(1) ∼ f(y_new | θ^(1))
2. generate y_new^(2) ∼ f(y_new | θ^(2))
3. generate y_new^(3) ∼ f(y_new | θ^(3))
...
J. generate y_new^(J) ∼ f(y_new | θ^(J))
In which case y_new^(1), ..., y_new^(J) are a sample from the predictive density f(y_new | y). We can use these predictive samples to form predictions, prediction intervals, etc.
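A sketch of this one-for-one predictive sampling; the posterior draws of (µ, σ²) below are stand-ins generated directly rather than taken from a Gibbs run, purely to keep the snippet self-contained.

```python
import numpy as np

# One predictive draw per posterior draw of theta = (mu, sig2).
rng = np.random.default_rng(9)
J = 10_000
mu_draws = rng.normal(2.0, 0.15, size=J)             # stand-in posterior draws
sig2_draws = 1.0 / rng.gamma(50.0, 1.0 / 100.0, size=J)

y_new = rng.normal(mu_draws, np.sqrt(sig2_draws))    # one y_new per theta draw

print(y_new.mean())                                  # point prediction
print(np.quantile(y_new, [0.025, 0.975]))            # 95% prediction interval
```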

The Metropolis-Hastings Algorithm

Implementing the Gibbs sampler is, in principle, fairly straightforward: we only need to sample from the full conditional distributions [θᵢ | θ₋ᵢ, y], i = 1, ..., k. When, for at least one i, the full conditional [θᵢ | θ₋ᵢ, y] is not of standard form, we need some way to sample from this distribution in order to implement the Gibbs sampler. Note that we will always have the density of the full conditional distribution up to a normalizing constant, as this is simply proportional to the portion of f(y | θ)π(θ) that involves θᵢ.

The Metropolis-Hastings algorithm will be useful in these situations. This is an MCMC algorithm that allows one to generate (dependent) samples from an arbitrary distribution and requires only knowledge of the corresponding density up to a normalizing constant. First I will describe the algorithm in its most general form, for simulating realizations of some random vector X from a multivariate distribution with density f(x) ∝ h(x); we call this the target distribution. Then I will describe how it is applied, very usefully, for dealing with non-standard full conditional distributions within a Gibbs sampler.

Note that the target distribution f(x) ∝ h(x) can represent virtually any distribution one is interested in: a posterior distribution, a full conditional distribution, or some other distribution that may be of interest in non-Bayesian applications. As before, the algorithm will generate a Markov chain X^(1) → X^(2) → ... → X^(t) → ... having f(x) ∝ h(x) as its stationary or limiting distribution. The algorithm is a rejection algorithm: we generate a proposed value X* from some candidate distribution and either accept or reject this proposed value with a certain probability.

To implement this algorithm we need to specify a candidate distribution from which to draw proposals. At iteration t, the density of the candidate distribution can depend on the previous value of the chain, X^(t−1), in some way, and we let q(x* | x^(t−1)) denote this density. Given a candidate (or proposal) distribution q(x* | x^(t−1)) and an initial value X^(0), the MH algorithm proceeds for t = 1, 2, 3, ... as follows:

1. Draw X* ∼ q(· | x^(t−1)).
2. Compute the acceptance ratio

r = [f(x*) q(x^(t−1) | x*)] / [f(x^(t−1)) q(x* | x^(t−1))]

3. Accept the candidate X* as the new state of the chain, X^(t), with probability p = min{1, r}. That is, if the acceptance ratio r ≥ 1 we set X^(t) = X*; otherwise we set X^(t) = X* with probability r and X^(t) = X^(t−1) with probability 1 − r.

Note that the normalizing constant associated with the density f(·) cancels in the acceptance ratio, so that we have

r = [h(x*) q(x^(t−1) | x*)] / [h(x^(t−1)) q(x* | x^(t−1))]

and we only require h(·) to implement the algorithm. Note that unlike the Gibbs sampler, the Metropolis-Hastings algorithm does not necessarily change its state in every iteration. If the proposal at iteration t is rejected, then X^(t−1) and X^(t) are identical. Implementation of the MH algorithm requires the choice of a candidate distribution q(x* | x^(t−1)) from which to draw proposals.

Amazingly, the algorithm will converge to the stationary distribution for virtually any choice of proposal distribution, in theory. In practice, the choice of proposal distribution has an important effect on the performance of the algorithm. Types of proposals useful in practice: 1. Symmetric, q(x* | x^(t−1)) = q(x^(t−1) | x*): the Metropolis algorithm. 2. Proposal independent of the current state, q(x* | x^(t−1)) = q(x*): the Hastings independence chain.

Of these, the most common option is to use what is known as a random walk Metropolis proposal. Assuming the sample space of the target distribution [X] is ℝᵏ, the proposal at iteration t is formed as

X* = X^(t−1) + e, where e ∼ MVN(0, Σ)

and Σ is pre-specified so that the sampler performs well in terms of its overall acceptance rate.

Note that this leads to a Metropolis algorithm, as the associated proposal distribution is X* ∼ MVN(X^(t−1), Σ), which has a symmetric density. If the sample space of the target [X] is not ℝᵏ, the above random walk proposal can be employed after applying a suitable transformation (homework). The performance of the MH algorithm is often measured by its empirical acceptance rate. That is, we run the algorithm for some fixed number of iterations and determine the proportion of proposed moves that were accepted.

What acceptance rate would we like the algorithm to have? An acceptance rate of 1 is definitely not optimal. A high acceptance rate, say above 0.9, may be indicative of an overly narrow candidate distribution. In this case the algorithm will only propose minor departures from the current state. These minor departures typically have high acceptance probability; however, the chain can take a very long time to explore the entire posterior, as it will move around the parameter space at a snail's pace (a diagram would be helpful here): high autocorrelation in the chain.

On the other hand, if our candidate distribution proposes very bold moves, the proposed values may lie in the tails of the target distribution and will almost certainly be rejected. In such cases, the chain can get stuck in the same state for a very long time, X^(j) = X^(j+1) = ... = X^(j+100) = ..., which is obviously undesirable. The ideal candidate distribution will yield an empirical acceptance rate that lies somewhere in the middle of these two extremes; between 20 and 50 percent seems to work well in practice (rough guidelines).

Such an ideal candidate distribution is often chosen adaptively: we tune the algorithm. For example, let's assume we want to sample from an arbitrary 1-dimensional distribution f(x) ∝ h(x) that has support on the entire real line. In this case, a random walk Metropolis algorithm will propose, at the t-th iteration,

X* = X^(t−1) + e, with e ∼ N(0, σ²)

and the corresponding acceptance ratio will be

r = f(x*)/f(x^(t−1)) = h(x*)/h(x^(t−1))

In this case, choosing an appropriate candidate distribution boils down to choosing an appropriate value for σ²; we tune the algorithm to determine this value. Start by picking some initial value of σ², and then keep track of the empirical proportion of candidates that are accepted over some small number of iterations, say 100. If this fraction is too high (say above 75 percent) we increase σ² (why?), and if it is too low (say below 20 percent) we decrease σ², and continue until a reasonable acceptance rate is obtained.
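A sketch of such a 1-dimensional random walk Metropolis sampler with its empirical acceptance rate; the target h(x) is invented for illustration, and working on the log scale avoids numerical underflow. The printed acceptance rate is what one would monitor while tuning σ.

```python
import numpy as np

# Random walk Metropolis for a 1-d target known only up to a constant:
# h(x) = exp(-x^2/2) * (2 + sin(5x)), a bumpy density chosen for illustration.
rng = np.random.default_rng(13)

def log_h(x):
    return -0.5 * x ** 2 + np.log(2.0 + np.sin(5.0 * x))

T, sigma = 50_000, 2.0        # sigma is the tuning parameter of the proposal
x = 0.0
chain = np.empty(T)
accept = 0
for t in range(T):
    x_prop = x + rng.normal(0.0, sigma)       # symmetric proposal
    log_r = log_h(x_prop) - log_h(x)          # Metropolis ratio (q cancels)
    if np.log(rng.uniform()) < log_r:         # accept with probability min(1, r)
        x, accept = x_prop, accept + 1
    chain[t] = x

print(accept / T)   # empirical acceptance rate; adjust sigma over trial runs
                    # until it lands in the rough 20-50 percent range
```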

The MH algorithm is typically run in two phases: 1. A first phase where the algorithm is tuned in order to select a candidate distribution that leads to reasonable acceptance rates. 2. A second phase where the candidate distribution is held fixed and the required samples are drawn.

Back to the Gibbs Sampler

Now that we know about the Metropolis-Hastings algorithm, we can discuss its utilization within a Gibbs sampler. If we encounter an awkward, non-standard full conditional [θᵢ | θ₋ᵢ, y] in implementing the Gibbs sampler, we could in principle sample from this full conditional distribution using the Metropolis-Hastings algorithm. That is, at the i-th update step of the Gibbs sampler we simulate from [θᵢ | θ₋ᵢ, y] by running another Markov chain (via the MH algorithm) that has the required full conditional [θᵢ | θ₋ᵢ, y] as its stationary distribution. Such an approach is known as Metropolis within Gibbs. Problems with this?

Fortunately we do not have to consider such overly bulky MCMC-within-MCMC type algorithms! It turns out that a single MH substep is sufficient to ensure convergence of the overall chain to its stationary distribution (the posterior). That is, when we encounter an awkward full conditional distribution [θᵢ | θ₋ᵢ, y], we simply draw one Metropolis-Hastings candidate, calculate the acceptance ratio, either accept or reject it, and move on to updating the next component θᵢ₊₁.

Given this, an overall strategy for drawing samples from a posterior distribution π(θ | y) is as follows: 1. Divide the parameter vector θ into k components, θ = (θ₁, θ₂, ..., θₖ). Each θᵢ need not be a scalar; however, this is often the case. 2. Starting with an initial value θ^(0), we generate each subsequent state of the chain θ^(t) by visiting and updating the corresponding components using the full conditional distributions [θᵢ | θ₋ᵢ, y]. 3. If the full conditional is some standard distribution, we draw θᵢ from this distribution: a Gibbs step.

4. If the full conditional is non-standard but has support on ℝ (or ℝ^dim(θᵢ)), we update θᵢ by taking a random walk Metropolis step. 5. If the full conditional is non-standard and has support on some D ⊂ ℝ (or D ⊂ ℝ^dim(θᵢ)), we take a random walk Metropolis step after transforming to ℝ (or ℝ^dim(θᵢ)): a Metropolis-Hastings step (homework).

6. We first run this algorithm in an adaptive phase, adjusting the tuning parameters associated with each MH step so that the corresponding acceptance rate is not too high and not too low. 7. We run a second phase with these values held fixed, assess convergence of the sampler to its limiting distribution (burn-in), and draw posterior samples from the rest. This simple MCMC algorithm, a hybrid of Gibbs and MH steps, will work very well for a broad class of models.

Assessing Convergence of an MCMC Sampler

As mentioned before, upon running an MCMC algorithm we need to decide when the chain has reached its stationary distribution. That is, we need to determine the value of some t₀ such that θ^(t₀+1) → θ^(t₀+2) → ... → θ^(t₀+J) constitute a dependent sample of size J from the posterior. The idea is to run the chain for some initial number of iterations and examine the output in some way to determine whether the initial state has been forgotten.

The simplest way to examine the output is through a trace plot. That is, for each θᵢ we plot θᵢ^(t) versus t and examine the evolution of the chain. [Figure: MCMC trace plot for θ, plotting θ^(t) against iteration t.]

Examining a single trace plot can be dangerous. Consider the following trace plot: [Figure: MCMC trace plot for θ from a single chain.] The chain looks fairly stable, as if it has reached its stationary distribution fairly quickly.

If we run another chain with a different starting value and compare the output to the original chain, we see: [Figure: trace plots of θ from two chains started at different values, which fail to overlap.]

We see that the chain had not converged; it was simply moving very slowly across the parameter space. Running two chains, each initialized at a different point, allowed us to detect this. In general, it is always best to run multiple chains in parallel and compare the output of all chains. The chains should be initialized at points that are overdispersed with respect to the posterior. The trace plots from all chains are then examined to determine if there is an identifiable point t₀ after which all chains seem to be overlapping.


Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

One-parameter models

One-parameter models One-parameter models Patrick Breheny January 22 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/17 Introduction Binomial data is not the only example in which Bayesian solutions can be worked

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

Introduction to Bayesian Methods

Introduction to Bayesian Methods Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY

MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY ECO 513 Fall 2008 MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY SIMS@PRINCETON.EDU 1. MODEL COMPARISON AS ESTIMATING A DISCRETE PARAMETER Data Y, models 1 and 2, parameter vectors θ 1, θ 2.

More information

Model comparison. Christopher A. Sims Princeton University October 18, 2016

Model comparison. Christopher A. Sims Princeton University October 18, 2016 ECO 513 Fall 2008 Model comparison Christopher A. Sims Princeton University sims@princeton.edu October 18, 2016 c 2016 by Christopher A. Sims. This document may be reproduced for educational and research

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Eco517 Fall 2014 C. Sims MIDTERM EXAM

Eco517 Fall 2014 C. Sims MIDTERM EXAM Eco57 Fall 204 C. Sims MIDTERM EXAM You have 90 minutes for this exam and there are a total of 90 points. The points for each question are listed at the beginning of the question. Answer all questions.

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 Nasser Sadeghkhani a.sadeghkhani@queensu.ca There are two main schools to statistical inference: 1-frequentist

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.

PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation. PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.. Beta Distribution We ll start by learning about the Beta distribution, since we end up using

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Department of Forestry & Department of Geography, Michigan State University, Lansing Michigan, U.S.A. 2 Biostatistics, School of Public

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Bayesian Networks in Educational Assessment

Bayesian Networks in Educational Assessment Bayesian Networks in Educational Assessment Estimating Parameters with MCMC Bayesian Inference: Expanding Our Context Roy Levy Arizona State University Roy.Levy@asu.edu 2017 Roy Levy MCMC 1 MCMC 2 Posterior

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Penalized Loss functions for Bayesian Model Choice

Penalized Loss functions for Bayesian Model Choice Penalized Loss functions for Bayesian Model Choice Martyn International Agency for Research on Cancer Lyon, France 13 November 2009 The pure approach For a Bayesian purist, all uncertainty is represented

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department

More information

Bayesian inference: what it means and why we care

Bayesian inference: what it means and why we care Bayesian inference: what it means and why we care Robin J. Ryder Centre de Recherche en Mathématiques de la Décision Université Paris-Dauphine 6 November 2017 Mathematical Coffees Robin Ryder (Dauphine)

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee September 03 05, 2017 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles Linear Regression Linear regression is,

More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

BAYESIAN MODEL CRITICISM

BAYESIAN MODEL CRITICISM Monte via Chib s BAYESIAN MODEL CRITICM Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

Bayesian search for other Earths

Bayesian search for other Earths Bayesian search for other Earths Low-mass planets orbiting nearby M dwarfs Mikko Tuomi University of Hertfordshire, Centre for Astrophysics Research Email: mikko.tuomi@utu.fi Presentation, 19.4.2013 1

More information

g-priors for Linear Regression

g-priors for Linear Regression Stat60: Bayesian Modeling and Inference Lecture Date: March 15, 010 g-priors for Linear Regression Lecturer: Michael I. Jordan Scribe: Andrew H. Chan 1 Linear regression and g-priors In the last lecture,

More information

Module 22: Bayesian Methods Lecture 9 A: Default prior selection

Module 22: Bayesian Methods Lecture 9 A: Default prior selection Module 22: Bayesian Methods Lecture 9 A: Default prior selection Peter Hoff Departments of Statistics and Biostatistics University of Washington Outline Jeffreys prior Unit information priors Empirical

More information

Inference for a Population Proportion

Inference for a Population Proportion Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist

More information

ST 740: Model Selection

ST 740: Model Selection ST 740: Model Selection Alyson Wilson Department of Statistics North Carolina State University November 25, 2013 A. Wilson (NCSU Statistics) Model Selection November 25, 2013 1 / 29 Formal Bayesian Model

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks (9) Model selection and goodness-of-fit checks Objectives In this module we will study methods for model comparisons and checking for model adequacy For model comparisons there are a finite number of candidate

More information

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017

Bayesian philosophy Bayesian computation Bayesian software. Bayesian Statistics. Petter Mostad. Chalmers. April 6, 2017 Chalmers April 6, 2017 Bayesian philosophy Bayesian philosophy Bayesian statistics versus classical statistics: War or co-existence? Classical statistics: Models have variables and parameters; these are

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

Bayesian Assessment of Hypotheses and Models

Bayesian Assessment of Hypotheses and Models 8 Bayesian Assessment of Hypotheses and Models This is page 399 Printer: Opaque this 8. Introduction The three preceding chapters gave an overview of how Bayesian probability models are constructed. Once

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment

Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment Ben Shaby SAMSI August 3, 2010 Ben Shaby (SAMSI) OFS adjustment August 3, 2010 1 / 29 Outline 1 Introduction 2 Spatial

More information

Overall Objective Priors

Overall Objective Priors Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University

More information

David Giles Bayesian Econometrics

David Giles Bayesian Econometrics David Giles Bayesian Econometrics 1. General Background 2. Constructing Prior Distributions 3. Properties of Bayes Estimators and Tests 4. Bayesian Analysis of the Multiple Regression Model 5. Bayesian

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31

Hierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31 Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

Hierarchical Models & Bayesian Model Selection

Hierarchical Models & Bayesian Model Selection Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

Markov Chain Monte Carlo and Applied Bayesian Statistics

Markov Chain Monte Carlo and Applied Bayesian Statistics Markov Chain Monte Carlo and Applied Bayesian Statistics Trinity Term 2005 Prof. Gesine Reinert Markov chain Monte Carlo is a stochastic simulation technique that is very useful for computing inferential

More information

Bayesian Inference: Concept and Practice

Bayesian Inference: Concept and Practice Inference: Concept and Practice fundamentals Johan A. Elkink School of Politics & International Relations University College Dublin 5 June 2017 1 2 3 Bayes theorem In order to estimate the parameters of

More information

General Bayesian Inference I

General Bayesian Inference I General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

David Giles Bayesian Econometrics

David Giles Bayesian Econometrics David Giles Bayesian Econometrics 5. Bayesian Computation Historically, the computational "cost" of Bayesian methods greatly limited their application. For instance, by Bayes' Theorem: p(θ y) = p(θ)p(y

More information

Bayesian model selection for computer model validation via mixture model estimation

Bayesian model selection for computer model validation via mixture model estimation Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation

More information