Molecular Epidemiology Workshop: Bayesian Data Analysis


Jay Taylor and Ananias Escalante
School of Mathematical and Statistical Sciences
Center for Evolutionary Medicine and Informatics
Arizona State University
August 2014

Outline
1. Probability and Uncertainty
2. Bayesian Analysis
3. MCMC

Probability: Interpretations and Basic Principles

Frequentist interpretation: The probability of an event is equal to its limiting frequency in an infinite series of independent, identical trials.

Bayesian interpretations:
- Logical: The probability of a proposition is equal to the strength of the evidence in favor of the proposition.
- Subjective: The probability of a proposition is equal to the strength of an individual's belief in the proposition.

Although probabilities can be interpreted in different ways, these interpretations are usually based on the same mathematical rules. To describe these, we will use P(E) to denote the probability of an event or proposition E.

Probability Axioms
1. If S is certain to be true, then P(S) = 1.
2. 0 ≤ P(E) ≤ 1 for any proposition E.
3. If E and F are mutually exclusive propositions, then the probability that either E or F is true is equal to the sum of the probabilities of E and of F:

    P(E or F) = P(E) + P(F).

The probability assigned to a proposition depends on the information or evidence available to us. This can be made explicit through conditional probability.

Conditional Probability
Suppose that E and F are propositions and that P(E) > 0. If we know E to be true, then the conditional probability of F given E is equal to

    P(F | E) = P(E and F) / P(E),

where P(E and F) is the probability that both E and F are true.

[Venn diagram: with P(E) = 0.46, P(F) = 0.33, and P(E and F) = 0.21, we get P(F | E) = 0.21/0.46 ≈ 0.47.]

Joint probabilities can often be calculated by conditioning on one of the propositions.

Product Rule

    P(E and F) = P(E) · P(F | E).

Example: Suppose that two balls are sampled without replacement from an urn containing five red balls and five blue balls. If we let E be the event that the first ball sampled is red and F be the event that the second ball sampled is red, then the probability that both balls sampled are red is

    P(E and F) = P(E) P(F | E) = (5/10) · (4/9) = 2/9.

Because we can condition on either E or F, the joint probability of E and F can be decomposed in two different ways using the product rule:

    P(E and F) = P(E) · P(F | E)   (conditioning on E)
    P(E and F) = P(F) · P(E | F)   (conditioning on F).

It follows that the two expressions on the right-hand side are equal, i.e.,

    P(E) · P(F | E) = P(F) · P(E | F),

and if we then divide both sides by P(E), we arrive at one of the most important formulas in probability theory:

Bayes' Formula

    P(F | E) = P(F) · P(E | F) / P(E).

Example: Reversed Sexual Size Dimorphism in Spotted Owls

Like many raptors, adult female Spotted Owls (Strix occidentalis) are larger, on average, than their male counterparts. For example, a study of a California population found that the wing chord distribution (in mm) is approximately normal with mean 329 and standard deviation 6 in females, and mean 320 and standard deviation 6 in males (Blakesley et al., 1990, J. Field Ornithology).

[Figure: wing chord densities for male and female Spotted Owls.]

Problem: Suppose that an adult bird with a wing chord of 329 mm is randomly sampled from a population with a 1:1 adult sex ratio. What is the probability that this bird is female?

Solution: Let F (M) be the event that the bird is female (male) and let W be the event that the wing chord is 329 mm. Then

    P(F) = 0.5
    p(W | F) = (1/(6√(2π))) e^(−(329−329)²/72) ≈ 0.0665
    p(W) = P(F) p(W | F) + P(M) p(W | M)
         = 0.5 · (1/(6√(2π))) e^(−(329−329)²/72) + 0.5 · (1/(6√(2π))) e^(−(329−320)²/72)
         ≈ 0.0441,

and upon substituting these quantities into Bayes' formula we find that

    P(F | W) = P(F) p(W | F) / p(W) ≈ 0.75.
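
This calculation is easy to check numerically. Below is a minimal Python sketch (standard library only) that reproduces the Bayes' formula computation for the owl example:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and s.d. sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

w = 329.0                       # observed wing chord (mm)
p_F = 0.5                       # prior: 1:1 adult sex ratio
lik_F = normal_pdf(w, 329, 6)   # p(W | female) ≈ 0.0665
lik_M = normal_pdf(w, 320, 6)   # p(W | male)
p_W = p_F * lik_F + (1 - p_F) * lik_M   # marginal density ≈ 0.0441
posterior_F = p_F * lik_F / p_W         # Bayes' formula ≈ 0.75
print(f"P(F | W = {w} mm) = {posterior_F:.3f}")
```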

Bayesian Data Analysis: Overview

Bayesian data analysis is based on the following two principles:
1. Probability is interpreted as a measure of uncertainty, whatever the source. Thus, in a Bayesian analysis, it is standard practice to assign probability distributions not only to unseen data, but also to parameters, models, and hypotheses.
2. Uncertainty is quantified both before and after the collection of data, and Bayes' formula is used to update our beliefs in light of the new data.

Suppose that our objective is to use some newly acquired data D to estimate the value of an unknown parameter Θ. A Bayesian treatment of this problem proceeds as follows:

1. We first formulate a statistical model which determines the conditional distribution of the data under each possible value of Θ. This is specified by the likelihood function, p(D | Θ = θ).
2. We then choose a prior distribution, p(Θ = θ), for the unknown parameter, which quantifies how strongly we believe that the true value is θ before we examine the new data.
3. We then collect and examine the new data D.
4. In light of this new data, we use Bayes' formula to revise our beliefs concerning the value of the parameter Θ:

    p(Θ = θ | D) = p(Θ = θ) · p(D | Θ = θ) / p(D)
      (posterior)    (prior)    (likelihood)

p(Θ = θ | D) is said to be the posterior distribution of the parameter Θ given the data D.

Example: Bayesian Estimation of Sex Ratios

Suppose that our objective is to estimate the birth sex ratio of a newly described species. To this end, we will count the numbers of male and female offspring in each of b broods. For the sake of the example, we will make the following assumptions:
- The probability of being born female does not vary within or between broods. In particular, there is no environmental or heritable variation in sex ratio.
- The sexes of the different members of a brood are determined independently of one another.

If we let θ denote the unknown probability that an individual is born female, then the sex ratio at birth will be θ/(1 − θ). We will let m and f denote the total numbers of male and female offspring contained in the b broods.

Likelihood Function: Under our assumptions about sex determination, the likelihood function depends only on the sex ratio and the total numbers of males and females in the broods. Conditional on Θ = θ, the total number of female offspring among the f + m offspring is binomially distributed with parameters f + m and θ:

    P(f, m | Θ = θ) = C(f + m, f) · θ^f · (1 − θ)^m,

where C(f + m, f) is the binomial coefficient. The likelihood is maximized at the maximum likelihood estimate

    θ_MLE = f / (f + m).

[Figure: likelihood function for f = 7, m = 3, with θ_MLE = 0.7.]

Prior Distribution: The prior distribution on the unknown parameter θ should reflect what we have previously learned about the birth sex ratio in this species. For example, this might be determined by prior observations of this species or by information about the sex ratio of closely related species. Here I will consider prior distributions for three different scenarios:

- Beta(1, 1), uniform — no prior information: p(Θ = θ) = 1.
- Beta(5, 5) — even sex ratio: p(Θ = θ) = 630 θ⁴ (1 − θ)⁴.
- Beta(2, 4) — male-biased: p(Θ = θ) = 20 θ (1 − θ)³.

[Figure: densities of the Beta(1,1), Beta(5,5), and Beta(2,4) priors.]

We are now ready to use Bayes' formula to calculate the posterior distribution of θ conditional on having observed f female offspring out of a total of f + m offspring. When the prior distribution is Beta(a, b), we have

    p(Θ = θ | f, m) = p(Θ = θ) · P(f, m | θ) / P(f, m)                                 (Bayes' formula)
                    = [θ^(a−1) (1 − θ)^(b−1) / β(a, b)] · [C(f + m, f) θ^f (1 − θ)^m / P(f, m)]
                    = θ^(f+a−1) (1 − θ)^(m+b−1) / β(f + a, m + b),

which shows that the posterior distribution is Beta(f + a, m + b).

Remark: Because the prior and the posterior distribution belong to the same family of distributions, we say that the beta distribution is a conjugate prior for the binomial likelihood function.
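
Because the update is conjugate, the posterior is obtained by simply adding the observed counts to the prior parameters. A short Python sketch (scipy is used only to summarize the resulting beta distributions):

```python
from scipy.stats import beta

def beta_binomial_update(a, b, f, m):
    """Conjugate update: Beta(a, b) prior + binomial data (f females, m males)."""
    return a + f, b + m

f, m = 7, 3  # the small data set from the example
priors = {"uniform Beta(1,1)": (1, 1),
          "even Beta(5,5)": (5, 5),
          "male-biased Beta(2,4)": (2, 4)}
for name, (a, b) in priors.items():
    a_post, b_post = beta_binomial_update(a, b, f, m)
    post = beta(a_post, b_post)
    print(f"{name}: posterior Beta({a_post},{b_post}), mean {post.mean():.3f}")
```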

The figures show the posterior distributions corresponding to each of these three priors for two different data sets: either f = 7, m = 3 (left plot) or f = 70, m = 30 (right plot).

[Figure: posterior densities under the Beta(1,1), Beta(5,5), and Beta(2,4) priors, for f = 7, m = 3 and for f = 70, m = 30.]

Summaries of the Posterior Distribution

Although the posterior distribution p(Θ = θ | D) comprehensively describes the state of knowledge concerning an unknown parameter, it is common to summarize this distribution by a number or an interval.

- Point estimators of Θ include the mean and the median of the posterior distribution.
- If it exists, the mode of the posterior distribution can be used to estimate Θ, in which case it is called the maximum a posteriori (MAP) estimate.
- A credible region is a region that contains a specified proportion of the probability mass of the posterior distribution; e.g., a 95% credible region will contain 95% of this mass. Credible regions can be chosen in several ways, including quantile-based intervals and highest probability density (HPD) regions.

Posterior Summary Statistics

[Figure: credible intervals for two distributions.
N(0,1): MAP = mean = median = 0; the 95% quantile-based interval and the 95% HPD interval coincide at (−1.96, 1.96).
Exp(1): MAP = 0, median = ln 2 ≈ 0.693, mean = 1.0; 95% quantile-based interval = (0.025, 3.689); 95% HPD interval = (0, 2.996).]

The table lists various summary statistics for the posterior distributions calculated in the sex ratio example.

    data    prior        mean   median  sd     2.5%   97.5%
    7, 3    uniform      0.667  0.676   0.131  0.390  0.891
            even         0.600  0.603   0.107  0.384  0.798
            male-biased  0.563  0.565   0.120  0.323  0.787
    70, 30  uniform      0.697  0.697   0.045  0.604  0.781
            even         0.682  0.683   0.044  0.592  0.765
            male-biased  0.679  0.680   0.045  0.588  0.764

Interpretation: Whereas the small data set does not provide strong evidence against the hypothesis that the sex ratio is 1:1, all three analyses of the large data set suggest that the true value of θ is greater than 0.58.
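
These quantile-based summaries are straightforward to reproduce with scipy.stats; a sketch for the first row of the table (uniform prior, f = 7, m = 3, posterior Beta(8, 4)):

```python
from scipy.stats import beta

def summarize(a, b):
    """Posterior summaries for a Beta(a, b) distribution."""
    post = beta(a, b)
    lo, hi = post.ppf([0.025, 0.975])
    return post.mean(), post.median(), post.std(), lo, hi

mean, median, sd, lo, hi = summarize(8, 4)
print(f"mean {mean:.3f}, median {median:.3f}, sd {sd:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
# Matches the first table row: 0.667, 0.676, 0.131, (0.390, 0.891)
```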

Acquiring New Data

One of the strengths of Bayesian statistics is that it can readily handle sequentially acquired data. This is done by using the posterior distribution from the most recent experiment as the prior distribution for the next experiment:

    p₁(θ) = p₀(θ) · p(D₁ | θ) / p(D₁)
    p₂(θ) = p₁(θ) · p(D₂ | θ) / p(D₂)

Here p₀(θ) is the prior for the first data set D₁, and each posterior pᵢ(θ) serves as the prior for the next data set Dᵢ₊₁.

Example: Sex Ratio Estimation, continued

Suppose that our prior distribution on Θ was uniform and that we initially collected three broods totaling 7 females and 3 males. We then used Bayes' formula to determine that the posterior distribution of Θ is Beta(8, 4). Since this distribution is broad, we decide to collect additional data to refine our estimate of Θ. If the second data set contains 15 females and 5 males, then using Beta(8, 4) as the new prior distribution, we find that the new posterior distribution of Θ is

    p(Θ = θ | f = 15, m = 5) = [θ⁷ (1 − θ)³ / β(8, 4)] · C(20, 15) θ¹⁵ (1 − θ)⁵ / P(f = 15, m = 5)
                             = θ²² (1 − θ)⁸ / β(23, 9),

i.e., Beta(23, 9).
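
In the conjugate setting, sequential updating is just repeated addition of counts; a minimal sketch reusing beta_binomial_update from the earlier code:

```python
# Reusing beta_binomial_update from the sketch above:
a, b = 1, 1                               # uniform prior, Beta(1, 1)
a, b = beta_binomial_update(a, b, 7, 3)   # first data set  -> Beta(8, 4)
a, b = beta_binomial_update(a, b, 15, 5)  # second data set -> Beta(23, 9)
print(f"final posterior: Beta({a}, {b})")
```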

Choosing the Prior Distribution: Some Guidelines

Choosing a prior distribution is both useful and sometimes difficult because it requires a careful assessment of our knowledge or beliefs before we perform an experiment.

- As a rule, the prior should be chosen independently of the new data.
- Cromwell's rule: The prior distribution should assign positive probability to any proposition that is not logically false, no matter how unlikely.
- It is sometimes useful to carry out multiple Bayesian analyses using different prior distributions to explore the sensitivity of the posterior distribution to different prior assumptions.
- When there is very little prior information, it may be appropriate to choose an uninformative prior. Examples include maximum entropy distributions and Jeffreys priors.

Bayesian Phylogenetics

Bayesian methods have proven to be especially useful in the analysis of genetic sequence data. In these problems, the unknown parameters can often be divided into three categories:
- the unknown tree, T;
- the parameters of the demographic model, Θ_dem;
- the parameters of the substitution model, Θ_subst.

Then, given a sequence alignment D, the analytical problem is to calculate the posterior distribution of all of the unknowns:

Bayes' formula for phylogenetic inference, general case

    p(T, Θ_dem, Θ_subst | D) = p(T, Θ_dem, Θ_subst) · p(D | T, Θ_dem, Θ_subst) / p(D)

If the substitution process is assumed to be neutral, then the prior distribution and the likelihood function can be simplified:

Bayes' formula for phylogenetic inference with neutral data

    p(T, Θ_subst, Θ_dem | D) = p(Θ_subst) · p(Θ_dem) · p(T | Θ_dem) · p(D | T, Θ_subst) / p(D)

This is a consequence of the following assumptions:
- Under the prior distribution, the parameters of the substitution process Θ_subst are independent of the tree T and the demographic parameters Θ_dem.
- The demographic parameters Θ_dem typically determine the conditional distribution of the genealogy T, e.g., through a coalescent model.
- Conditional on T and Θ_subst, the sequence data D is independent of Θ_dem.

Example: Bayesian Inference of Effective Population Size

Suppose that our data D consist of n randomly sampled individuals that have been sequenced at a neutral locus, and that our objective is to estimate the effective population size N_e. For simplicity, we will use the Jukes-Cantor model for the substitution process with a strict molecular clock, and we will assume that the demography can be described by the constant-population-size coalescent. To carry out a Bayesian analysis, we need to specify p(µ), p(N_e), and p(T | N_e).

- p(µ) should be chosen to reflect what we know about the mutation rate; e.g., we could use a lognormal distribution with mean u and variance σ².
- Since N_e is a scale parameter in the coalescent, it is common practice to use the Jeffreys prior p(N_e) ∝ 1/N_e.
- p(T | N_e) is then determined by Kingman's coalescent.

Assuming that we are only interested in N_e, then µ and T are nuisance parameters, and so we need to calculate the marginal posterior distribution of N_e by integrating over µ and T:

    p(N_e | D) = ∫∫ p(T, N_e, µ | D) dµ dT
               = [p(N_e) / p(D)] ∫∫ p(D | T, µ) p(µ) p(T | N_e) dµ dT.

However, unless n is quite small, integration over T is not feasible. For example, if our sample contains 20 sequences, then there are approximately 8 × 10²¹ possible trees to be considered. Even with the fastest computers available, this is an impossible calculation.

Markov Chain Monte Carlo Methods (MCMC)

Until recently, Bayesian methods were regarded as impractical for many problems because of the computational difficulty of evaluating the posterior distribution. In particular, to use Bayes' formula,

    p(θ | D) = p(θ) · p(D | θ) / p(D),

we need to evaluate the marginal probability of the data,

    p(D) = ∫ p(D | θ) p(θ) dθ,

which requires integration of the likelihood function over the parameter space. Except in special cases, this integration must be performed numerically, but sometimes even this is very difficult.

An alternative is to use Monte Carlo methods to sample from the posterior distribution. Here the idea is to generate a random sample from the distribution and then use the empirical distribution of that sample to approximate p(θ | D):

    p(θ | D) ≈ (1/N) Σ_{i=1}^N δ_{Θᵢ},   where Θ₁, …, Θ_N ~ p(θ | D),

i.e., the posterior is approximated by the empirical distribution of the sample.

[Figure: histogram estimators for the Beta(8, 4) density generated using either 100 (left) or 1000 (right) independent samples.]
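
A short Python sketch reproducing these histogram estimators (numpy, scipy, and matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

rng = np.random.default_rng(1)
theta = np.linspace(0, 1, 200)

# Histogram estimators of the Beta(8, 4) density from N independent draws.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, n in zip(axes, (100, 1000)):
    sample = rng.beta(8, 4, size=n)
    ax.hist(sample, bins=20, density=True, alpha=0.5, label=f"sample: N = {n}")
    ax.plot(theta, beta(8, 4).pdf(theta), label="Beta(8,4)")
    ax.legend()
plt.show()
```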

In particular, the empirical distribution can be used to estimate probabilities and expectations under p(θ | D):

    P(θ ∈ A | D) ≈ (1/N) Σ_{i=1}^N 1_A(Θᵢ)
    E[f(θ) | D] ≈ (1/N) Σ_{i=1}^N f(Θᵢ).

What makes this approach difficult is the need to generate random samples from distributions that are only known up to a constant of proportionality (e.g., up to p(D)). This is where Markov chain Monte Carlo methods come in.
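
For example, a posterior probability and a posterior mean can be estimated directly from the draws; a sketch using independent Beta(8, 4) samples as a stand-in for draws from p(θ | D):

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.beta(8, 4, size=100_000)   # stand-in for draws from p(θ | D)

prob_gt_half = np.mean(sample > 0.5)    # estimates P(θ > 0.5 | D)
post_mean = np.mean(sample)             # estimates E[θ | D]; exact value 8/12 ≈ 0.667
print(prob_gt_half, post_mean)
```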

Markov Chains

A discrete-time Markov chain is a stochastic process X₀, X₁, … with the property that the future behavior of the process depends only on its current state. More precisely, this means that for every set A and all t, s ≥ 0,

    P(X_{t+s} ∈ A | X_t, X_{t−1}, …, X₀) = P(X_{t+s} ∈ A | X_t).

Markov chains have numerous applications in biology. Some familiar examples include:
- random walks
- branching processes
- the Wright-Fisher model
- chain-binomial models (Reed-Frost)

[Figure: simulations of the Wright-Fisher process, N = 500, µ = 0.001.]

One consequence of the Markov property is that many Markov chains have a tendency to forget their initial state as time progresses. More precisely:

Asymptotic Behavior of Markov Chains
Many Markov chains have the property that there is a probability distribution π, called the stationary distribution of the chain, such that for large t the distribution of X_t approaches π; i.e., for all states x and sets A,

    lim_{t→∞} P(X_t ∈ A | X₀ = x) = π(A).

[Figure: stationary behavior of the Wright-Fisher process (N = 100, µ = 0.02): allele-frequency distributions at t = 2, 20, and 100 for initial frequencies p = 0.01, 0.5, and 0.9.]

Some Markov chains satisfy an even stronger property, called ergodicity.

Ergodicity
A Markov chain with stationary distribution π is said to be ergodic if for every initial state X₀ = x and every set A, we have

    lim_{T→∞} (1/T) Σ_{t=1}^T 1_A(X_t) = π(A).

In other words, if we run the chain for a long time, then the proportion of time spent visiting the set A is approximately π(A).

[Figure: ergodic behavior of the neutral Wright-Fisher process (N = 100, µ = 0.02) over 10⁵ generations.]

Markov Chain Monte Carlo: General Approach

The central idea in Markov chain Monte Carlo is to use an ergodic Markov chain to generate a random sample from the target distribution π.

1. The first step is to select an ergodic Markov chain which has π as its stationary distribution. There are different methods for doing this, including the Metropolis-Hastings algorithm and the Gibbs sampler.
2. We then need to simulate the Markov chain until its distribution is close to the target distribution. This initial period is often called the burn-in period.
3. We continue simulating the chain, but because successive values are highly correlated, it is common practice to collect a sample only every T generations, so as to reduce the correlations between the sampled states (thinning).
4. We can then use these samples to approximate the target distribution:

    (1/N) Σ_{n=1}^N δ_{X_{B+nT}} ≈ π,

where B is the length of the burn-in period.

The Metropolis-Hastings Algorithm

Given a probability distribution π, the Metropolis-Hastings algorithm can be used to explicitly construct a Markov chain that has π as its stationary distribution. Implementation requires the following elements:
- the target distribution, π, known up to a constant of proportionality;
- a family of proposal distributions, Q(y | x), and a way to efficiently sample from these distributions.

The MH algorithm is based on a more general idea known as rejection sampling. Instead of sampling directly from π, we propose values using a distribution Q(y | x) that we can easily sample from, but then we reject values that are unlikely under π.

The Metropolis-Hastings algorithm consists of repeated application of the following three steps. Suppose that X_n = x is the current state of the Markov chain. Then the next state is chosen as follows:

Step 1: We first propose a new value y for the chain by sampling from Q(y | x).
Step 2: We then calculate the acceptance probability of the new state:

    α(x; y) = min{ [π(y) Q(x | y)] / [π(x) Q(y | x)], 1 }.

Step 3: With probability α(x; y), set X_{n+1} = y. Otherwise, set X_{n+1} = x.

Remark: Because π enters into α only through the ratio π(y)/π(x), we need to know π only up to a constant of proportionality. This is why the MH algorithm is so well suited for Bayesian analysis.
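
A minimal random-walk Metropolis sketch in Python, targeting the Beta(23, 9) sex-ratio posterior known only up to its normalizing constant (the symmetric normal proposal makes Q cancel in α); the burn-in and thinning steps from the previous slide are applied at the end. Variable names are illustrative, not from any particular package:

```python
import numpy as np

def log_target(theta):
    """Unnormalized log posterior: θ^22 (1-θ)^8, i.e. Beta(23, 9) up to a constant."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    return 22 * np.log(theta) + 8 * np.log(1 - theta)

def metropolis_hastings(log_pi, x0, n_steps, step=0.1, seed=0):
    """Random-walk Metropolis: symmetric normal proposal, so Q cancels in α."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    x, log_px = x0, log_pi(x0)
    for n in range(n_steps):
        y = x + step * rng.normal()                   # Step 1: propose y ~ Q(y | x)
        log_py = log_pi(y)
        if np.log(rng.uniform()) < log_py - log_px:   # Steps 2-3: accept w.p. α
            x, log_px = y, log_py
        chain[n] = x
    return chain

chain = metropolis_hastings(log_target, x0=0.5, n_steps=50_000)
samples = chain[5_000::10]   # discard burn-in, then thin every 10th state
print(f"posterior mean ≈ {samples.mean():.3f}  (exact: 23/32 = {23/32:.3f})")
```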

The Proposal Distribution

The choice of the proposal distribution Q can have a profound impact on the performance of the MH algorithm. While there is no universal procedure for selecting a good proposal distribution, the following considerations are important:
- Q should be chosen so that the chain rapidly converges to its stationary distribution.
- Q should also be chosen so that it is easy to sample from.

There is usually a tradeoff between these two conditions, and sometimes it is necessary to try out different proposal distributions to identify one with good properties. Many implementations of MH (e.g., BEAST, MIGRATE) offer the user some control over the proposal distribution.

MCMC: Convergence and Mixing

One of the most challenging issues in MCMC is knowing how long to run the chain. There are two related considerations:
1. We need to run the chain until its distribution is sufficiently close to the target distribution (convergence).
2. We then need to collect a large enough number of samples that we can estimate any quantities of interest (e.g., the mean TMRCA) sufficiently accurately (mixing).

Unfortunately, there is no universally valid, fool-proof way to guarantee that either one of these conditions is satisfied. However, there are a number of convergence diagnostics that can indicate when there are problems.

Convergence Diagnostics: Trace Plots

Trace plots show how the value of a parameter changes over the course of a simulation. In general, what we want to see is that the mean and the variance of the parameter are fairly constant over the duration of the trace plot, as in the two examples shown below.

[Figure: two well-behaved trace plots (P_vivax_gsr_geo1.log), plotted over 3 × 10⁷ states.]
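
For example, continuing from the Metropolis-Hastings sketch above (reusing its `chain` array), a trace plot takes only a few lines with matplotlib:

```python
import matplotlib.pyplot as plt

# 'chain' is the raw Metropolis-Hastings output from the earlier sketch.
plt.plot(chain, linewidth=0.5)
plt.xlabel("State")
plt.ylabel("θ")
plt.title("Trace plot")
plt.show()
```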

Problems with convergence or mixing may be revealed by trends or sudden changes in the behavior of the trace plot.

[Figure: two problematic trace plots. In the first, an increasing trend indicates that the chain has not yet converged. In the second, sudden changes in the mean indicate that the chain is mixing poorly.]

Trace Plots: Some Guidelines

1. You should examine the trace plot of every parameter of interest, including the likelihood and the posterior probability. If any of the trace plots look problematic, then all of the results are suspect.
2. The fact that a trace plot appears to have converged is not conclusive proof that it has. Especially in high-dimensional problems, a chain that appears to be stationary for the first 500 million generations may well show a sudden change in behavior in the next.
3. The program Tracer (http://tree.bio.ed.ac.uk/software/tracer/) can be used to display and analyze trace plots generated by BEAST, MrBayes, and LAMARC.

Convergence Diagnostics: Effective Sample Size

Because successive states visited by a Markov chain are correlated, an estimate derived using N values generated by such a chain will usually be less precise than an estimate derived using N independent samples. This motivates the following definition.

Effective Sample Size (ESS)
The effective sample size of a sample of N correlated random variables is equal to the number of independent samples that would estimate the mean with the same variance. For a stationary Markov chain with autocorrelation coefficients ρ_k, the ESS of N successive samples is equal to

    ESS = N / (1 + 2 Σ_{k=1}^∞ ρ_k).
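
A crude ESS estimate can be computed directly from the trace by truncating the autocorrelation sum at the first non-positive term; this heuristic sketch (reusing `chain` from the MH example) is simpler than the estimators used in packages such as coda or Tracer:

```python
import numpy as np

def effective_sample_size(x, max_lag=1000):
    """Naive ESS estimate: N / (1 + 2 Σ ρ_k), truncating the sum at the first
    non-positive autocorrelation (a common heuristic, not the exact formula)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var = np.dot(xc, xc) / n
    rho_sum = 0.0
    for k in range(1, min(max_lag, n - 1)):
        rho = np.dot(xc[:-k], xc[k:]) / ((n - k) * var)   # lag-k autocorrelation
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

print(f"ESS ≈ {effective_sample_size(chain):.0f} out of {len(chain)} states")
```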

Effective Sample Size: Guidelines

1. Each parameter has its own ESS, and these can differ between parameters by more than an order of magnitude.
2. Parameters with small ESSs indicate that a chain either has not converged or is mixing slowly. As a rule of thumb, the ESS of every parameter should exceed 1000, and larger values are even better.
3. Thinning by itself will not increase the ESS. However, we can increase the ESS by simultaneously thinning and increasing the duration of the chain; e.g., collecting 1000 samples from a chain run for 100,000 generations is better than collecting 1000 samples from a chain run for 10,000 generations.
4. The ESS of a parameter can usually only be estimated from its trace. For this reason, large ESSs do not guarantee that the chain has converged.

Formal Tests of Convergence

There are several formal tests of stationarity that can be applied to MCMC output.

- The Geweke diagnostic compares the mean of a parameter estimated from two non-overlapping parts of the chain and tests whether these are significantly different.
- The Raftery-Lewis diagnostic uses a pilot chain to estimate the burn-in and chain length required to estimate the q-th quantile of a parameter to within some tolerance.

Both of these methods suffer from the defect that "you've only seen where you've been" (Robert & Casella, 2004). In other words, these methods cannot detect that the chain has failed to visit part of the support of the target distribution. These and other diagnostic tests are implemented in the R package coda.
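
As an illustration only, a naive Geweke-style check can be coded in a few lines (reusing `chain` from the MH sketch). Note that the real diagnostic, as implemented in coda, estimates the variances from spectral densities rather than the plain sample variances used here:

```python
import numpy as np

def geweke_z(x, first=0.1, last=0.5):
    """Crude Geweke-style z-score comparing the means of the first 10% and the
    last 50% of a chain. Sample variances stand in for the spectral-density
    estimates of the real diagnostic; for illustration only."""
    a = x[: int(first * len(x))]
    b = x[-int(last * len(x)):]
    return (a.mean() - b.mean()) / np.sqrt(a.var() / len(a) + b.var() / len(b))

z = geweke_z(np.asarray(chain))
print(f"Geweke z ≈ {z:.2f}  (|z| > 2 suggests non-stationarity)")
```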

Convergence Diagnostics: Multiple Chains

Another approach to testing the convergence of an MCMC analysis is to run multiple independent chains and compare the results. Large differences between the posterior distributions estimated by the different chains indicate problems with convergence or mixing.

- It is often useful to start the different chains from different, randomly chosen initial conditions.
- If the different chains give similar results, then their traces can be combined using a program such as LogCombiner (http://beast2.org).

Best Practice: It is always a good idea to run at least two independent chains in any MCMC analysis.
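
One common way to quantify the agreement between chains, not named on this slide, is the Gelman-Rubin potential scale reduction factor; a minimal sketch reusing `metropolis_hastings` and `log_target` from the earlier example:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for a list of equal-length chains.
    Values near 1 suggest the chains agree; values well above ~1.1 do not."""
    chains = np.asarray(chains)             # shape (m, n)
    m, n = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * means.var(ddof=1)               # between-chain variance
    V = (n - 1) / n * W + B / n             # pooled variance estimate
    return np.sqrt(V / W)

# Two independent chains from different random starting points:
chains = [metropolis_hastings(log_target, x0=x0, n_steps=50_000, seed=s)
          for x0, s in [(0.1, 1), (0.9, 2)]]
print(f"R-hat ≈ {gelman_rubin(chains):.3f}")
```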

Metropolis-Coupled Markov Chain Monte Carlo (MC³)

MC³ is a generalization of MCMC which attempts to improve mixing by running m chains in parallel while randomly exchanging their states.

- The first chain (the "cold" chain) is constructed so that the target distribution π is its stationary distribution.
- The i-th chain is constructed so that its stationary distribution is proportional to π_i(x) ∝ π(x)^(1/T_i), where T_i is said to be the temperature of the chain. As T_i increases, the distribution π_i(x) becomes flatter, which makes it easier for this chain to converge.
- The MH algorithm is used to swap the states occupied by different chains in such a way that the stationary distribution of each chain is maintained.
- The output from the cold chain is then used to approximate the target distribution.
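
A toy illustration of the coupling scheme, reusing `log_target` from the MH sketch; this is a sketch under the assumptions above, not a production sampler:

```python
import numpy as np

def mc3(log_pi, temps, n_steps, step=0.1, seed=0):
    """Toy Metropolis-coupled MCMC: one random-walk MH update per chain per
    sweep, followed by a proposed swap between a random adjacent pair of
    chains. Returns the cold chain's trajectory."""
    rng = np.random.default_rng(seed)
    m = len(temps)
    x = np.full(m, 0.5)                    # current states, one per chain
    lp = np.array([log_pi(xi) for xi in x])
    cold = np.empty(n_steps)
    for n in range(n_steps):
        for i in range(m):                 # within-chain update, target π^(1/T_i)
            y = x[i] + step * rng.normal()
            lpy = log_pi(y)
            if np.log(rng.uniform()) < (lpy - lp[i]) / temps[i]:
                x[i], lp[i] = y, lpy
        i = rng.integers(m - 1)            # propose swapping chains i and i+1
        log_alpha = (1 / temps[i] - 1 / temps[i + 1]) * (lp[i + 1] - lp[i])
        if np.log(rng.uniform()) < log_alpha:
            x[i], x[i + 1] = x[i + 1], x[i]
            lp[i], lp[i + 1] = lp[i + 1], lp[i]
        cold[n] = x[0]                     # record the cold chain (T = 1)
    return cold

cold = mc3(log_target, temps=[1.0, 2.0, 4.0], n_steps=20_000)
print(f"cold-chain mean ≈ {cold[2_000:].mean():.3f}")
```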

References

M. A. Beaumont and B. Rannala, The Bayesian revolution in genetics, Nat. Rev. Genetics 5 (2004), 251-261.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, 2nd ed., Chapman & Hall/CRC, 2004.

P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press, 2010.

P. Lemey, M. Salemi, and A.-M. Vandamme (eds.), The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing, 2nd ed., Cambridge University Press, 2009.

S. P. Otto and T. Day, A Biologist's Guide to Mathematical Modeling in Ecology and Evolution, Princeton University Press, 2007.

C. P. Robert and G. Casella, Monte Carlo Statistical Methods, Springer-Verlag, 2004.