Molecular Epidemiology Workshop: Bayesian Data Analysis


1 Molecular Epidemiology Workshop: Bayesian Data Analysis. Jay Taylor and Ananias Escalante, School of Mathematical and Statistical Sciences and Center for Evolutionary Medicine and Informatics, Arizona State University.

2 Outline: 1. Probability and Uncertainty; 2. Bayesian Analysis; 3. MCMC.

3 Probability and Uncertainty. Probability: Interpretations and Basic Principles.
Frequentist interpretation: The probability of an event is equal to its limiting frequency in an infinite series of independent, identical trials.
Bayesian interpretations:
- Logical: The probability of a proposition is equal to the strength of the evidence in favor of the proposition.
- Subjective: The probability of a proposition is equal to the strength of an individual's belief in the proposition.

4 Probability and Uncertainty. Although probabilities can be interpreted in different ways, these interpretations are usually based on the same mathematical rules. To describe these, we will use P(E) to denote the probability of an event or proposition E.
Probability Axioms:
1. If S is certain to be true, then P(S) = 1.
2. 0 ≤ P(E) ≤ 1 for any proposition E.
3. If E and F are mutually exclusive propositions, then the probability that either E or F is true is equal to the sum of the probabilities of E and of F: P(E or F) = P(E) + P(F).

5 Probability and Uncertainty. The probability assigned to a proposition depends on the information or evidence available to us. This can be made explicit through conditional probability.
Conditional Probability: Suppose that E and F are propositions and that P(E) > 0. If we know E to be true, then the conditional probability of F given E is equal to
P(F|E) = P(E and F) / P(E),
where P(E and F) is the probability that both E and F are true.
[Figure: Venn diagram with P(E) = 0.46, P(F) = 0.33, and P(E and F) = 0.21, so that P(F|E) = 0.21/0.46 ≈ 0.46.]

6 Probability and Uncertainty. Joint probabilities can often be calculated by conditioning on one of the propositions.
Product Rule: P(E and F) = P(E) P(F|E).
Example: Suppose that two balls are sampled without replacement from an urn containing five red balls and five blue balls. If we let E be the event that the first ball sampled is red and F be the event that the second ball sampled is red, then the probability that both balls sampled are red is
P(E, F) = P(E) P(F|E) = (5/10)(4/9) = 2/9.
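A quick way to sanity-check this kind of calculation is by simulation. The following minimal Python sketch (not part of the original slides; the function name and trial count are illustrative choices) estimates the probability that both sampled balls are red by repeated sampling without replacement:

    import random

    def both_red(n_trials=100_000, seed=1):
        """Estimate P(both balls red) for an urn with 5 red and 5 blue balls."""
        rng = random.Random(seed)
        urn = ["red"] * 5 + ["blue"] * 5
        hits = 0
        for _ in range(n_trials):
            first, second = rng.sample(urn, 2)   # two draws without replacement
            hits += (first == "red" and second == "red")
        return hits / n_trials

    print(both_red())   # should be close to 2/9 = 0.222...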

7 Probability and Uncertainty. Because we can condition on either E or F, the joint probability of E and F can be decomposed in two different ways using the product rule:
P(E and F) = P(E) P(F|E)   (conditioning on E)
P(E and F) = P(F) P(E|F)   (conditioning on F).
It follows that the two expressions on the right-hand sides are equal, i.e., P(E) P(F|E) = P(F) P(E|F), and if we then divide both sides by P(E), we arrive at one of the most important formulas in probability theory:
Bayes Formula: P(F|E) = P(F) P(E|F) / P(E).

8 Probability and Uncertainty. Example: Reversed Sexual Size Dimorphism in Spotted Owls. Like many raptors, adult female Spotted Owls (Strix occidentalis) are larger, on average, than their male counterparts. For example, a study of a California population found that the wing chord distribution (in mm) is approximately N(329, 6) in females and N(320, 6) in males, where the second parameter is the standard deviation (Blakesley et al., 1990, J. Field Ornithology).
[Figure: wing chord densities for male and female Spotted Owls.]

9 Probability and Uncertainty. Problem: Suppose that an adult bird with a wing chord of 329 mm is randomly sampled from a population with a 1:1 adult sex ratio. What is the probability that this is a female?
Solution: Let F (M) be the event that the bird is female (male) and let W be the event that the wing chord is 329 mm. Then
P(F) = 0.5
p(W|F) = (1/(6 sqrt(2π))) e^(-(329 - 329)^2/72)
p(W) = P(F) p(W|F) + P(M) p(W|M)
     = 0.5 (1/(6 sqrt(2π))) e^(-(329 - 329)^2/72) + 0.5 (1/(6 sqrt(2π))) e^(-(329 - 320)^2/72),
and upon substituting these quantities into Bayes formula we find that
P(F|W) = P(F) p(W|F) / p(W) ≈ 0.755.
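The arithmetic above is easy to reproduce. Here is a minimal Python sketch (not part of the original slides; the helper name normal_pdf is an illustrative choice) that evaluates Bayes formula for this example:

    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        """Density of a normal distribution with mean mu and standard deviation sigma."""
        return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    w = 329.0                        # observed wing chord (mm)
    p_female = 0.5                   # prior: 1:1 adult sex ratio
    like_f = normal_pdf(w, 329, 6)   # p(W | female)
    like_m = normal_pdf(w, 320, 6)   # p(W | male)
    p_w = p_female * like_f + (1 - p_female) * like_m
    print(p_female * like_f / p_w)   # posterior P(female | W), about 0.755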

10 Bayesian Analysis. Bayesian Data Analysis: Overview. Bayesian data analysis is based on the following two principles:
1. Probability is interpreted as a measure of uncertainty, whatever the source. Thus, in a Bayesian analysis, it is standard practice to assign probability distributions not only to unseen data, but also to parameters, models, and hypotheses.
2. Uncertainty is quantified both before and after the collection of data, and Bayes formula is used to update our beliefs in light of the new data.

11 Bayesian Analysis. Suppose that our objective is to use some newly acquired data D to estimate the value of an unknown parameter Θ. A Bayesian treatment of this problem would proceed as follows:
1. We first need to formulate a statistical model which determines the conditional distribution of the data under each possible value of Θ. This is specified by the likelihood function, p(D | Θ = θ).
2. We then choose a prior distribution, p(Θ = θ), for the unknown parameter, which quantifies how strongly we believe that the true value is θ before we examine the new data.
3. We then collect and examine the new data D.
4. In light of this new data, we use Bayes formula to revise our beliefs concerning the value of the parameter Θ:
p(Θ = θ | D) = p(Θ = θ) p(D | Θ = θ) / p(D)
p(Θ = θ | D) is said to be the posterior distribution of the parameter Θ given the data D.

12 Bayesian Analysis. Example: Bayesian Estimation of Sex Ratios. Suppose that our objective is to estimate the birth sex ratio of a newly described species. To this end, we will count the numbers of male and female offspring in each of b broods. For the sake of the example, we will make the following assumptions:
- The probability of being born female does not vary within or between broods. In particular, there is no environmental or heritable variation in sex ratio.
- The sexes of the different members of a brood are determined independently of one another.
If we let θ denote the unknown probability that an individual is born female, then the sex ratio at birth will be θ/(1 - θ). We will let m and f denote the total numbers of male and female offspring contained in the b broods.

13 Bayesian Analysis. Likelihood Function: Under our assumptions about sex determination, the likelihood function depends only on the sex ratio and the total numbers of males and females in the broods. Conditional on Θ = θ, the total number of female offspring among the f + m offspring is binomially distributed with parameters f + m and θ:
P(f, m | Θ = θ) = (f + m choose f) θ^f (1 - θ)^m
The likelihood is maximized at θ_MLE = f/(f + m).
[Figure: likelihood function for f = 7, m = 3, with θ_MLE = 0.7.]

14 Bayesian Analysis. Prior Distribution: The prior distribution on the unknown parameter θ should reflect what we have previously learned about the birth sex ratio in this species. For example, this might be determined by prior observations of this species or by information about the sex ratio of closely-related species. Here I will consider prior distributions for three different scenarios:
- Uniform, Beta(1,1) (no prior information): p(Θ = θ) = 1
- Beta(5,5) (even sex ratio): p(Θ = θ) = 630 θ^4 (1 - θ)^4
- Beta(2,4) (male-biased): p(Θ = θ) = 20 θ (1 - θ)^3
[Figure: the three prior densities.]

15 Bayesian Analysis. We are now ready to use Bayes formula to calculate the posterior distribution of θ conditional on having observed f female offspring out of a total of n = f + m offspring. When the prior distribution is of type Beta(a, b), we have
p(Θ = θ | f, m) = p(Θ = θ) P(f, m | θ) / P(f, m)   (Bayes formula)
                = [1/β(a, b)] θ^(a-1) (1 - θ)^(b-1) (f + m choose f) θ^f (1 - θ)^m / P(f, m)
                = [1/β(f + a, m + b)] θ^(f+a-1) (1 - θ)^(m+b-1),
which shows that the posterior distribution is of type Beta(f + a, m + b).
Remark: Because the prior and the posterior distribution belong to the same family of distributions, we say that the beta distribution is a conjugate prior for the binomial likelihood function.
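Because the posterior has a closed form, it is easy to explore numerically. The following minimal Python sketch (not part of the original slides; it assumes scipy is available) computes the Beta posterior and a central 95% credible interval for the f = 7, m = 3 data under each of the three priors considered above:

    from scipy import stats

    f, m = 7, 3                        # observed numbers of female and male offspring
    priors = {"uniform": (1, 1), "even": (5, 5), "male-biased": (2, 4)}

    for name, (a, b) in priors.items():
        post = stats.beta(f + a, m + b)        # conjugate update: Beta(f + a, m + b)
        lo, hi = post.ppf([0.025, 0.975])      # central 95% credible interval
        print(f"{name:12s} mean = {post.mean():.3f}  95% CI = ({lo:.3f}, {hi:.3f})")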

16 Bayesian Analysis. The figures show the posterior distributions corresponding to each of these three priors for two different data sets: either f = 7, m = 3 (left plot) or f = 70, m = 30 (right plot).
[Figure: posterior densities under the Beta(1,1), Beta(5,5), and Beta(2,4) priors for each data set.]

17 Bayesian Analysis. Summaries of the Posterior Distribution. Although the posterior distribution p(Θ = θ | D) comprehensively describes the state of knowledge concerning an unknown parameter, it is common to summarize this distribution by a number or an interval.
- Point estimators of Θ include the mean and the median of the posterior distribution.
- If it exists, the mode of the posterior distribution can also be used to estimate Θ, in which case it is called the maximum a posteriori (MAP) estimate.
- A credible region is a region that contains a specified proportion of the probability mass of the posterior distribution, e.g., a 95% credible region will contain 95% of this mass. These can be chosen in several ways, including quantile-based intervals and highest probability density (HPD) regions.

18 Bayesian Analysis. Posterior Summary Statistics.
[Figure: credible intervals for two example distributions. For N(0,1): MAP = mean = median = 0, and the 95% HPD and 95% quantile-based intervals coincide at (-1.96, 1.96). For Exp(1): MAP = 0, median = 0.693, mean = 1; the 95% quantile-based interval is (0.025, 3.689), while the 95% HPD interval is (0, 2.996).]
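For a posterior that is available in closed form, these summaries are straightforward to compute. A minimal Python sketch (not part of the original slides; assumes numpy and scipy), using the Beta(8, 4) posterior from the sex ratio example and a simple grid search for the narrowest (HPD-style) 95% interval:

    import numpy as np
    from scipy import stats

    post = stats.beta(8, 4)                   # posterior for f = 7, m = 3 under a uniform prior
    print("mean   :", post.mean())
    print("median :", post.median())
    print("MAP    :", (8 - 1) / (8 + 4 - 2))  # mode of Beta(a, b) is (a - 1)/(a + b - 2)
    print("95% quantile interval:", post.ppf([0.025, 0.975]))

    # Narrowest interval containing 95% of the mass (the HPD interval for a unimodal density)
    lows = np.linspace(0.0, 0.05, 2001)
    widths = post.ppf(lows + 0.95) - post.ppf(lows)
    i = np.argmin(widths)
    print("95% HPD interval     :", (post.ppf(lows[i]), post.ppf(lows[i] + 0.95)))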

19 Bayesian Analysis. The table lists various summary statistics for the posterior distributions calculated in the sex ratio example.
[Table: posterior mean, median, standard deviation, and 2.5% and 97.5% quantiles for each combination of data set (f = 7, m = 3 or f = 70, m = 30) and prior (uniform, even, male-biased).]
Interpretation: Whereas the small data set does not provide strong evidence against the hypothesis that the sex ratio is 1:1, all three analyses of the large data set suggest that the true value of θ is greater than 0.5.

20 Bayesian Analysis. Acquiring New Data. One of the strengths of Bayesian statistics is that it can readily handle sequentially acquired data. This is done by using the posterior distribution from the most recent experiment as the prior distribution for the next experiment. Starting from the prior p_0(θ):
p_1(θ) = p_0(θ) p(D_1 | θ) / p(D_1)   (posterior after D_1; prior for D_2)
p_2(θ) = p_1(θ) p(D_2 | θ) / p(D_2)   (posterior after D_2; prior for D_3)
and so on.

21 Bayesian Analysis. Example: Sex Ratio Estimation, continued. Suppose that our prior distribution on Θ was uniform and that we initially collected three broods totaling 7 females and 3 males. We then used Bayes formula to determine that the posterior distribution of Θ is Beta(8, 4). Since this distribution is broad, we decide to collect additional data to refine our estimate of Θ. If the second data set contains 15 females and 5 males, then using Beta(8, 4) as the new prior distribution, we find that the new posterior distribution of Θ is
p(Θ = θ | f = 15, m = 5) = [θ^7 (1 - θ)^3 / β(8, 4)] (20 choose 15) θ^15 (1 - θ)^5 / P(f = 15, m = 5)
                         = θ^22 (1 - θ)^8 / β(23, 9),
i.e., Beta(23, 9).
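The sequential update is just the conjugate calculation applied twice. A minimal Python sketch (not part of the original slides; the helper name update_beta is illustrative) confirming that updating Beta(1, 1) with (f, m) = (7, 3) and then with (f, m) = (15, 5) gives Beta(23, 9):

    def update_beta(a, b, f, m):
        """Conjugate update of a Beta(a, b) prior after observing f females and m males."""
        return a + f, b + m

    a, b = 1, 1                       # uniform prior, Beta(1, 1)
    a, b = update_beta(a, b, 7, 3)    # first data set  -> Beta(8, 4)
    a, b = update_beta(a, b, 15, 5)   # second data set -> Beta(23, 9)
    print(a, b)                       # 23 9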

22 Bayesian Analysis. Choosing the Prior Distribution: Some Guidelines. Choosing a prior distribution is both useful and sometimes difficult, because it requires a careful assessment of our knowledge or beliefs before we perform an experiment.
- As a rule, the prior should be chosen independently of the new data.
- Cromwell's rule: the prior distribution should assign positive probability to any proposition that is not logically false, no matter how unlikely.
- It is sometimes useful to carry out multiple Bayesian analyses using different prior distributions to explore the sensitivity of the posterior distribution to different prior assumptions.
- When there is very little prior information, it may be appropriate to choose an uninformative prior. Examples include maximum entropy distributions and Jeffreys priors.

23 Bayesian Analysis. Bayesian Phylogenetics. Bayesian methods have proven to be especially useful in the analysis of genetic sequence data. In these problems, the unknown parameters can often be divided into three categories:
- the unknown tree, T
- the parameters of the demographic model, Θ_dem
- the parameters of the substitution model, Θ_subst.
Then, given a sequence alignment D, the analytical problem is to calculate the posterior distribution of all of the unknowns.
Bayes formula for phylogenetic inference, general case:
p(T, Θ_dem, Θ_subst | D) = p(T, Θ_dem, Θ_subst) p(D | T, Θ_dem, Θ_subst) / p(D)

24 Bayesian Analysis. If the substitution process is assumed to be neutral, then the prior distribution and the likelihood function can be simplified.
Bayes formula for phylogenetic inference with neutral data:
p(T, Θ_subst, Θ_dem | D) = p(Θ_subst) p(Θ_dem) p(T | Θ_dem) p(D | T, Θ_subst) / p(D)
This is a consequence of the following assumptions:
- Under the prior distribution, the parameters of the substitution process Θ_subst are independent of the tree T and the demographic parameters Θ_dem.
- The demographic parameters Θ_dem typically determine the conditional distribution of the genealogy T, e.g., through a coalescent model.
- Conditional on T and Θ_subst, the sequence data D is independent of Θ_dem.

25 Bayesian Analysis. Example: Bayesian Inference of Effective Population Size. Suppose that our data D consist of n randomly sampled individuals that have been sequenced at a neutral locus and that our objective is to estimate the effective population size N_e. For simplicity, we will use the Jukes-Cantor model for the substitution process with a strict molecular clock, and we will assume that the demography can be described by the constant population size coalescent. To carry out a Bayesian analysis, we need to specify p(μ), p(N_e) and p(T | N_e).
- p(μ) should be chosen to reflect what we know about the mutation rate; e.g., we could use a lognormal distribution with mean u and variance σ^2.
- Since N_e is a scale parameter in the coalescent, it is common practice to use the Jeffreys prior p(N_e) ∝ 1/N_e.
- p(T | N_e) is then determined by Kingman's coalescent.

26 Bayesian Analysis. Assuming that we are only interested in N_e, then μ and T are nuisance parameters, and so we need to calculate the marginal posterior distribution of N_e by integrating over μ and T:
p(N_e | D) = ∫∫ p(T, N_e, μ | D) dμ dT
           = [p(N_e)/p(D)] ∫∫ p(D | T, μ) p(μ) p(T | N_e) dμ dT.
However, unless n is quite small, integration over T is not feasible. For example, if our sample contains 20 sequences, then there are more than 10^20 possible tree topologies to be considered. Even with the fastest computers available, this is an impossible calculation.

27 MCMC. Markov Chain Monte Carlo Methods (MCMC). Until recently, Bayesian methods were regarded as impractical for many problems because of the computational difficulty of evaluating the posterior distribution. In particular, to use Bayes formula,
p(θ | D) = p(θ) p(D | θ) / p(D),
we need to evaluate the marginal probability of the data,
p(D) = ∫ p(D | θ) p(θ) dθ,
which requires integration of the likelihood function over the parameter space. Except in special cases, this integration must be performed numerically, and sometimes even this is very difficult.

28 MCMC. An alternative is to use Monte Carlo methods to sample from the posterior distribution. Here the idea is to generate a random sample from the distribution and then use the empirical distribution of that sample to approximate p(θ | D):
p(θ | D) ≈ (1/N) Σ_{i=1}^{N} δ_{Θ_i},   where Θ_1, ..., Θ_N ~ p(θ | D)
(the posterior on the left, the empirical distribution of the sample on the right).
[Figure: two histogram estimators of the Beta(8, 4) density generated using either 100 (left) or 1000 (right) independent samples.]

29 MCMC. In particular, the empirical distribution can be used to estimate probabilities and expectations under p(θ | D):
P(θ ∈ A | D) ≈ (1/N) Σ_{i=1}^{N} 1_A(Θ_i)
E[f(θ) | D] ≈ (1/N) Σ_{i=1}^{N} f(Θ_i).
What makes this approach difficult is the need to generate random samples from distributions that are only known up to a constant of proportionality (e.g., up to p(D)). This is where Markov chain Monte Carlo methods come in.
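To illustrate these Monte Carlo estimates, here is a minimal Python sketch (not part of the original slides; assumes numpy and scipy) that draws independent samples directly from the Beta(8, 4) posterior and approximates a posterior probability and a posterior expectation:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    theta = stats.beta(8, 4).rvs(size=10_000, random_state=rng)      # posterior samples

    print("P(theta > 0.5 | D)       ~", np.mean(theta > 0.5))        # average of indicators
    print("E[theta/(1 - theta) | D] ~", np.mean(theta / (1 - theta)))  # posterior mean sex ratio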

30 MCMC. Markov Chains. A discrete-time Markov chain is a stochastic process X_0, X_1, ... with the property that the future behavior of the process depends only on its current state: given the present, the future is independent of the past. More precisely, this means that for every set A and all t, s ≥ 0,
P(X_{t+s} ∈ A | X_t, X_{t-1}, ..., X_0) = P(X_{t+s} ∈ A | X_t).
Markov chains have numerous applications in biology. Some familiar examples include:
- random walks
- branching processes
- the Wright-Fisher model
- chain-binomial models (Reed-Frost).
[Figure: simulated allele frequency trajectories of the Wright-Fisher process with N = 500, μ = 0.001.]

31 MCMC. One consequence of the Markov property is that many Markov chains have a tendency to forget their initial state as time progresses. More precisely:
Asymptotic behavior of Markov chains. Many Markov chains have the property that there is a probability distribution π, called the stationary distribution of the chain, such that for large t the distribution of X_t approaches π, i.e., for all states x and sets A,
lim_{t → ∞} P(X_t ∈ A | X_0 = x) = π(A).
[Figure: stationary behavior of the Wright-Fisher process (N = 100, μ = 0.02): starting from p = 0.01, 0.5, or 0.9, the distribution of the allele frequency approaches the same stationary density by t = 100.]

32 MCMC. Some Markov chains satisfy an even stronger property, called ergodicity.
Ergodicity. A Markov chain with stationary distribution π is said to be ergodic if for every initial state X_0 = x and every set A, we have
lim_{T → ∞} (1/T) Σ_{t=1}^{T} 1_A(X_t) = π(A).
In other words, if we run the chain for a long time, then the proportion of time spent visiting the set A is approximately π(A).
[Figure: ergodic behavior of the neutral Wright-Fisher process (N = 100, μ = 0.02): a single allele frequency trajectory followed over tens of thousands of generations.]
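To see ergodicity in action, the following minimal Python sketch (not part of the original slides; a simplified haploid Wright-Fisher model with symmetric mutation, with illustrative parameter values) checks that the long-run fraction of time the allele frequency spends above 0.5 is close to 0.5, as the symmetry of the stationary distribution implies:

    import numpy as np

    def wright_fisher(N=100, mu=0.02, generations=200_000, p0=0.9, seed=0):
        """Simulate allele frequencies under a haploid Wright-Fisher model with symmetric mutation."""
        rng = np.random.default_rng(seed)
        p, freqs = p0, np.empty(generations)
        for t in range(generations):
            p_mut = p * (1 - mu) + (1 - p) * mu    # mutation step
            p = rng.binomial(N, p_mut) / N         # binomial resampling of N gene copies
            freqs[t] = p
        return freqs

    freqs = wright_fisher()
    print("time-average fraction with p > 0.5:", np.mean(freqs > 0.5))   # close to 0.5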

33 MCMC. Markov Chain Monte Carlo: General Approach. The central idea in Markov chain Monte Carlo is to use an ergodic Markov chain to generate a random sample from the target distribution π.
1. The first step is to select an ergodic Markov chain which has π as its stationary distribution. There are different methods for doing this, including the Metropolis-Hastings algorithm and the Gibbs sampler.
2. We then need to simulate the Markov chain until its distribution is close to the target distribution. This initial period, of length B generations say, is often called the burn-in period.
3. We continue simulating the chain, but because successive values are highly correlated, it is common practice to only collect a sample every T generations, so as to reduce the correlations between the sampled states (thinning).
4. We can then use these samples to approximate the target distribution:
(1/N) Σ_{n=1}^{N} δ_{X_{B+nT}} ≈ π.

34 MCMC. The Metropolis-Hastings Algorithm. Given a probability distribution π, the Metropolis-Hastings algorithm can be used to explicitly construct a Markov chain that has π as its stationary distribution. Implementation requires the following elements:
- the target distribution π, known up to a constant of proportionality;
- a family of proposal distributions Q(y | x), and a way to efficiently sample from these distributions.
The MH algorithm is based on a more general idea known as rejection sampling: instead of sampling directly from π, we propose values using a distribution Q(y | x) that we can easily sample from, but then we reject values that are unlikely under π.

35 MCMC. The Metropolis-Hastings algorithm consists of repeated application of the following three steps. Suppose that X_n = x is the current state of the Markov chain. Then the next state is chosen as follows:
Step 1: We first propose a new value for the chain by sampling y with probability Q(y | x).
Step 2: We then calculate the acceptance probability of the new state:
α(x; y) = min{ π(y) Q(x | y) / (π(x) Q(y | x)), 1 }
Step 3: With probability α(x; y), set X_{n+1} = y. Otherwise, set X_{n+1} = x.
Remark: Because π enters into α only through the ratio π(y)/π(x), we only need to know π up to a constant of proportionality. This is why the MH algorithm is so well suited to Bayesian analysis.
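Here is a minimal Python sketch of the algorithm (not part of the original slides), targeting the unnormalized Beta(8, 4) posterior θ^7 (1 - θ)^3 from the sex ratio example with a symmetric random-walk proposal, so that the proposal terms cancel in α; the step size, chain length, burn-in and thinning interval are illustrative choices:

    import numpy as np

    def log_target(theta):
        """Unnormalized log posterior for f = 7, m = 3 under a uniform prior."""
        if theta <= 0.0 or theta >= 1.0:
            return -np.inf
        return 7 * np.log(theta) + 3 * np.log(1 - theta)

    def metropolis(n_iter=60_000, step=0.15, x0=0.5, seed=0):
        rng = np.random.default_rng(seed)
        x, chain = x0, np.empty(n_iter)
        for i in range(n_iter):
            y = x + rng.normal(0.0, step)              # symmetric proposal: Q(x|y) = Q(y|x)
            if np.log(rng.uniform()) < log_target(y) - log_target(x):
                x = y                                  # accept the proposed state
            chain[i] = x                               # otherwise the chain stays where it is
        return chain

    chain = metropolis()
    samples = chain[10_000::50]                        # discard burn-in, then thin every 50 states
    print("posterior mean ~", samples.mean())          # close to 8/12 = 0.667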

36 MCMC. The Proposal Distribution. The choice of the proposal distribution Q can have a profound impact on the performance of the MH algorithm. While there is no universal procedure for selecting a good proposal distribution, the following considerations are important:
- Q should be chosen so that the chain rapidly converges to its stationary distribution.
- Q should also be chosen so that it is easy to sample from.
There is usually a tradeoff between these two conditions, and sometimes it is necessary to try out different proposal distributions to identify one with good properties. Many implementations of MH (e.g., BEAST, MIGRATE) offer the user some control over the proposal distribution.

37 MCMC. MCMC: Convergence and Mixing. One of the most challenging issues in MCMC is knowing how long to run the chain. There are two related considerations:
1. We need to run the chain until its distribution is sufficiently close to the target distribution (convergence).
2. We then need to collect a large enough number of samples that we can estimate any quantities of interest (e.g., the mean TMRCA) sufficiently accurately (mixing).
Unfortunately, there is no universally valid, fool-proof way to guarantee that either one of these conditions is satisfied. However, there are a number of convergence diagnostics that can indicate when there are problems.

38 MCMC. Convergence Diagnostics: Trace Plots. Trace plots show how the value of a parameter changes over the course of a simulation. In general, what we want to see is that the mean and the variance of the parameter are fairly constant over the duration of the trace plot, as in the two examples shown below.
[Figure: two well-behaved trace plots (P_vivax_gsr_geo1.log).]

39 MCMC. Problems with convergence or mixing may be revealed by trends or sudden changes in the behavior of the trace plot.
[Figure: two problematic trace plots (P_vivax_gsr_geo1.log).]
In the first plot, the increasing trend indicates that the chain has not yet converged. In the second, the sudden changes in the mean indicate that the chain is mixing poorly.

40 MCMC. Trace Plots: Some Guidelines.
1. You should examine the trace plot of every parameter of interest, including the likelihood and the posterior probability. If any of the trace plots looks problematic, then all of the results are suspect.
2. The fact that a trace plot appears to have converged is not conclusive proof that it has. Especially in high-dimensional problems, a chain that appears to be stationary for the first 500 million generations may well show a sudden change in behavior in the next.
3. The program Tracer can be used to display and analyze trace plots generated by BEAST, MrBayes and LAMARC.

41 MCMC. Convergence Diagnostics: Effective Sample Size. Because successive states visited by a Markov chain are correlated, an estimate derived using N values generated by such a chain will usually be less precise than an estimate derived using N independent samples. This motivates the following definition.
Effective Sample Size (ESS). The effective sample size of a sample of N correlated random variables is equal to the number of independent samples that would estimate the mean with the same variance. For a stationary Markov chain with autocorrelation coefficients ρ_k, the ESS of N successive samples is equal to
ESS = N / (1 + 2 Σ_{k=1}^{∞} ρ_k).
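The following minimal Python sketch (not part of the original slides) estimates the ESS directly from this formula using the sample autocorrelations, truncating the sum at the first non-positive term, which is one common convention; it is demonstrated on an AR(1) series whose true ESS is known:

    import numpy as np

    def effective_sample_size(x):
        """ESS = N / (1 + 2 * sum of positive-lag autocorrelations), truncated at the first non-positive term."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        x = x - x.mean()
        acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
        rho_sum = 0.0
        for rho in acf[1:]:
            if rho <= 0:                 # stop once the estimated autocorrelation is no longer positive
                break
            rho_sum += rho
        return n / (1 + 2 * rho_sum)

    # Demo: an AR(1) series with coefficient 0.9 has ESS of roughly N * (1 - 0.9)/(1 + 0.9) = N/19
    rng = np.random.default_rng(1)
    x = np.zeros(10_000)
    for t in range(1, len(x)):
        x[t] = 0.9 * x[t - 1] + rng.normal()
    print(effective_sample_size(x))      # roughly 10_000 / 19, i.e. a few hundred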

42 MCMC. Effective Sample Size: Guidelines.
1. Each parameter has its own ESS, and these can differ between parameters by more than an order of magnitude.
2. Parameters with small ESSs indicate that the chain either has not converged or is mixing slowly. As a rule of thumb, the ESS of every parameter should exceed 1000, and larger values are even better.
3. Thinning by itself will not increase the ESS. However, we can increase the ESS by simultaneously thinning and increasing the duration of the chain; e.g., collecting 1000 samples from a longer run (with a correspondingly larger thinning interval) is better than collecting 1000 samples from a shorter run.
4. The ESS of a parameter can usually only be estimated from its trace. For this reason, large ESSs do not guarantee that the chain has converged.

43 MCMC. Formal Tests of Convergence. There are several formal tests of stationarity that can be applied to MCMC.
- The Geweke diagnostic compares the mean of a parameter estimated from two non-overlapping parts of the chain and tests whether these are significantly different.
- The Raftery-Lewis diagnostic uses a pilot chain to estimate the burn-in and chain length required to estimate the q-th quantile of a parameter to within some tolerance.
Both of these methods suffer from the defect that "you've only seen where you've been" (Robert & Casella, 2004). In other words, these methods cannot detect that the chain has failed to visit part of the support of the target distribution. These and other diagnostic tests are implemented in the R package coda.

44 MCMC. Convergence Diagnostics: Multiple Chains. Another approach to testing the convergence of an MCMC analysis is to run multiple independent chains and compare the results.
- Large differences between the posterior distributions estimated by the different chains indicate problems with convergence or mixing.
- It is often useful to start the different chains from different, randomly-chosen initial conditions.
- If the different chains give similar results, then their traces can be combined using a program such as LogCombiner.
Best Practice: It is always a good idea to run at least two independent chains in any MCMC analysis.

45 MCMC. Metropolis-Coupled Markov Chain Monte Carlo (MC^3). MC^3 is a generalization of MCMC which attempts to improve mixing by running m chains in parallel while randomly exchanging their states.
- The first chain (the cold chain) is constructed so that the target distribution π is its stationary distribution.
- The i-th chain is constructed so that its stationary distribution π_i satisfies π_i(x) ∝ π(x)^(1/T_i), where T_i is said to be the temperature of the chain. As T_i increases, the distribution π_i(x) becomes flatter, which makes it easier for this chain to converge.
- The MH algorithm is used to swap the states occupied by different chains in such a way that the stationary distribution of each chain is maintained.
- The output from the cold chain is then used to approximate the target distribution.
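A minimal Python sketch of the idea (not part of the original slides), running a cold chain and a single heated chain on the same unnormalized target from the sex ratio example and occasionally proposing a state swap that is accepted with the usual MH ratio; the temperatures, step size and swap rate are illustrative choices:

    import numpy as np

    def log_target(theta):
        """Unnormalized log posterior for f = 7, m = 3 under a uniform prior (as above)."""
        return 7 * np.log(theta) + 3 * np.log(1 - theta) if 0.0 < theta < 1.0 else -np.inf

    def mc3(n_iter=50_000, temps=(1.0, 4.0), step=0.15, seed=0):
        rng = np.random.default_rng(seed)
        betas = [1.0 / t for t in temps]          # inverse temperatures; chain 0 is the cold chain
        x = [0.5 for _ in temps]
        cold = np.empty(n_iter)
        for i in range(n_iter):
            for c, b in enumerate(betas):         # one MH update per chain, targeting pi(x)^(1/T_c)
                y = x[c] + rng.normal(0.0, step)
                if np.log(rng.uniform()) < b * (log_target(y) - log_target(x[c])):
                    x[c] = y
            if rng.uniform() < 0.1:               # occasionally propose swapping the two chain states
                log_r = (betas[0] - betas[1]) * (log_target(x[1]) - log_target(x[0]))
                if np.log(rng.uniform()) < log_r:
                    x[0], x[1] = x[1], x[0]
            cold[i] = x[0]                        # only the cold chain is used to approximate pi
        return cold

    cold = mc3()
    print("cold-chain mean ~", cold[5_000:].mean())   # close to 8/12 = 0.667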

46 MCMC. References.
- M. A. Beaumont and B. Rannala, The Bayesian revolution in genetics, Nat. Rev. Genetics 5 (2004).
- A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, second ed., Chapman & Hall/CRC.
- P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press.
- P. Lemey, M. Salemi, and A.-M. Vandamme (eds.), The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing, second ed., Cambridge University Press.
- S. P. Otto and T. Day, A Biologist's Guide to Mathematical Modeling in Ecology and Evolution, Princeton University Press.
- C. P. Robert and G. Casella, Monte Carlo Statistical Methods, Springer-Verlag, 2004.
