Bayesian inference & Markov chain Monte Carlo. Note 1: Many slides for this lecture were kindly provided by Paul Lewis and Mark Holder



Note 2: Paul Lewis has written nice software for demonstrating the Markov chain Monte Carlo idea. The software is called MCRobot and is freely available at http://hydrodictyon.eeb.uconn.edu/people/plewis/software.php (unfortunately, the software only works on the Windows operating system, but see the iMCMC program by John Huelsenbeck at http://cteg.berkeley.edu/software.html ... also try the MCMC robot iPad app).

Assume we want to estimate a parameter $\theta$ with data $X$. The maximum likelihood approach to estimating $\theta$ finds the value of $\theta$ that maximizes $\Pr(X \mid \theta)$.

Before observing data, we may have some idea of how plausible different values of $\theta$ are. This idea is called our prior distribution of $\theta$, and we'll denote it $\Pr(\theta)$.

Bayesians base their estimate of $\theta$ on the posterior distribution $\Pr(\theta \mid X)$.

$$\Pr(\theta \mid X) = \frac{\Pr(\theta, X)}{\Pr(X)} = \frac{\Pr(X \mid \theta)\,\Pr(\theta)}{\int_\theta \Pr(X, \theta)\,d\theta} = \frac{\Pr(X \mid \theta)\,\Pr(\theta)}{\int_\theta \Pr(X \mid \theta)\,\Pr(\theta)\,d\theta} = \frac{\text{likelihood} \times \text{prior}}{\text{difficult quantity to calculate}}$$

Often, determining the exact value of the integral in the denominator is difficult.

Problems with Bayesian approaches in general:
1. Disagreements about the philosophy of inference and disagreements over priors
2. Heavy computational requirements (problem 2 is rapidly becoming less noteworthy)

Potential advantages of Bayesian phylogeny inference

Interpretation of posterior probabilities of topologies is more straightforward than interpretation of bootstrap support.

If prior distributions for parameters are far from diffuse, very complicated and realistic models can be used and the problem of overparameterization can be simultaneously avoided.

MrBayes software for phylogeny inference is at: http://mrbayes.csit.fsu.edu/

Let $p$ be the probability of heads. Then $1-p$ is the probability of tails. Imagine a data set $X$ with these results from flipping a coin:

Toss     1  2  3  4  5  6
Result   H  T  H  T  T  T

Probability: $P(X \mid p) = p^2 (1-p)^4$

This has almost the form of a binomial distribution; only the binomial coefficient is missing.
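The coin likelihood above can be evaluated numerically. The following is a minimal sketch, assuming a simple grid search over $p$; the function names and the grid resolution are illustrative choices, not part of the original slides.

```python
import math

def likelihood(p, heads=2, tails=4):
    """Probability of one specific toss sequence with the given counts."""
    return p ** heads * (1.0 - p) ** tails

def log_likelihood(p, heads=2, tails=4):
    """Log of the same quantity; valid for 0 < p < 1."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Evaluate the likelihood on a grid of p values strictly inside (0, 1)
# and pick the grid point with the highest likelihood.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)

print("grid maximum-likelihood estimate of p:", p_hat)
```

The grid maximum sits next to the analytical maximum likelihood estimate, heads / (heads + tails) = 2/6.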

[Figure: Likelihood with 2 heads and 4 tails; axis label: likelihood $P(X \mid p)$]

[Figure: Log-likelihood with 2 heads and 4 tails; axis label: log-likelihood $\log P(X \mid p)$]

For integers $a$ and $b$, the Beta density $B(a,b)$ is

$$P(p) = \frac{(a+b-1)!}{(a-1)!\,(b-1)!}\; p^{a-1} (1-p)^{b-1}$$

where $p$ is between 0 and 1.

The expected value of $p$ is $a/(a+b)$.

The variance of $p$ is $ab / \left((a+b+1)(a+b)^2\right)$.

The Beta distribution is the conjugate prior for data from a binomial distribution.

[Figure: Uniform prior distribution (i.e., Beta(1,1) distribution); axis label: prior density $P(p)$]
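Conjugacy means that a Beta(a, b) prior combined with coin-toss data gives a Beta(a + heads, b + tails) posterior. The sketch below implements the density, mean, and variance formulas stated above for integer $a$ and $b$, and applies the update rule to the coin data sets used in these slides; the function names are illustrative.

```python
from math import factorial

def beta_density(p, a, b):
    """Beta(a, b) density for integer a, b and 0 <= p <= 1."""
    coef = factorial(a + b - 1) / (factorial(a - 1) * factorial(b - 1))
    return coef * p ** (a - 1) * (1 - p) ** (b - 1)

def beta_mean(a, b):
    return a / (a + b)

def beta_variance(a, b):
    return a * b / ((a + b + 1) * (a + b) ** 2)

def conjugate_update(a, b, heads, tails):
    """Posterior Beta parameters after observing the coin tosses."""
    return a + heads, b + tails

for prior in [(1, 1), (20, 20), (30, 10)]:
    for data in [(2, 4), (20, 40)]:
        post = conjugate_update(*prior, *data)
        print("prior Beta%s + data %s -> posterior Beta%s, mean %.3f"
              % (prior, data, post, beta_mean(*post)))
```

With little data (2 heads, 4 tails) the informative priors dominate the posterior; with ten times as much data the posterior means move toward the observed proportion of heads.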

[Figure: Beta(3,5) posterior from uniform prior + data (2 heads and 4 tails); axis label: posterior $P(p \mid X)$; posterior mean = 3/(3+5)]

[Figure: Beta(20,20) prior distribution; prior mean = 0.5; axis label: prior density $P(p)$]

[Figure: Beta(22,24) posterior from Beta(20,20) prior + data (2 heads and 4 tails); axis label: posterior $P(p \mid X)$]

[Figure: Beta(30,10) prior distribution; prior mean = 0.75; axis label: prior density $P(p)$]

[Figure: Beta(32,14) posterior from Beta(30,10) prior + data (2 heads and 4 tails); axis label: posterior density $P(p \mid X)$; posterior mean = 32/(32+14)]

[Figure: Likelihood with 20 heads and 40 tails; axis label: likelihood $P(X \mid p)$]

[Figure: Log-likelihood with 20 heads and 40 tails; axis label: log-likelihood $\log P(X \mid p)$]

[Figure: Uniform prior distribution (i.e., Beta(1,1) distribution); axis label: prior density $P(p)$]

[Figure: Beta(21,41) posterior from uniform prior + data (20 heads and 40 tails); axis label: posterior $P(p \mid X)$]

[Figure: Beta(20,20) prior distribution; prior mean = 0.5; axis label: prior density $P(p)$]

[Figure: Beta(40,60) posterior from Beta(20,20) prior + data (20 heads and 40 tails); axis label: posterior $P(p \mid X)$]

[Figure: Beta(30,10) prior distribution; prior mean = 0.75; axis label: prior density $P(p)$]

[Figure: Beta(50,50) posterior from Beta(30,10) prior + data (20 heads and 40 tails); axis label: posterior $P(p \mid X)$]

The Markov chain Monte Carlo (MCMC) idea approximates $\Pr(\theta \mid X)$ by sampling a large number of $\theta$ values from $\Pr(\theta \mid X)$. So, $\theta$ values with higher posterior probability are more likely to be sampled than $\theta$ values with low posterior probability.

Question: How is this sampling achieved?

Answer: A Markov chain is constructed and simulated. The states of this chain represent values of $\theta$. The stationary distribution of this chain is $\Pr(\theta \mid X)$.

In other words, we start the chain at some initial value of $\theta$. After running the chain for a long enough time, the probability of the chain being at some particular state will be approximately equal to the posterior probability of the state.

Let $\theta^{(t)}$ be the value of $\theta$ after $t$ steps of the Markov chain, where $\theta^{(0)}$ is the initial value.

Each step of the Markov chain involves randomly proposing a new value of $\theta$ based on the current value of $\theta$. Call the proposed value $\theta^*$.

We decide with some probability to either accept $\theta^*$ as our new state or to reject the proposed $\theta^*$ and remain at our current state.

The Hastings (Hastings 1970) algorithm is a way to make this decision and force the stationary distribution of the chain to be $\Pr(\theta \mid X)$.

According to the Hastings algorithm, what state should we adopt at step $t+1$ if $\theta^{(t)}$ is the current state and $\theta^*$ is the proposed state?

Let $J(\theta^* \mid \theta^{(t)})$ be the jumping distribution, i.e. the probability of proposing $\theta^*$ given that the current state is $\theta^{(t)}$.

Define $r$ as

$$r = \frac{\Pr(X \mid \theta^*)\,\Pr(\theta^*)\,J(\theta^{(t)} \mid \theta^*)}{\Pr(X \mid \theta^{(t)})\,\Pr(\theta^{(t)})\,J(\theta^* \mid \theta^{(t)})}$$

With probability equal to the minimum of $r$ and 1, we set $\theta^{(t+1)} = \theta^*$. Otherwise, we set $\theta^{(t+1)} = \theta^{(t)}$.

For the Hastings algorithm to yield the stationary distribution $\Pr(\theta \mid X)$, there are a few required conditions. The most important condition is that it must be possible to reach each state from any other in a finite number of steps. Also, the Markov chain can't be periodic.

MCMC implementation details:

The Markov chain should be run as long as possible.

We may have $T$ total samples after running our Markov chain. They would be $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(T)}$.

The first $B$ ($1 \le B < T$) of these samples are often discarded (i.e. not used to approximate the posterior). The period during which the chain produces these $B$ discarded samples is referred to as the burn-in period.

The reason for discarding these samples is that the early samples typically are largely dependent on the initial state of the Markov chain, and often the initial state of the chain is (either intentionally or unintentionally) atypical with respect to the posterior distribution.

The remaining samples $\theta^{(B+1)}, \theta^{(B+2)}, \ldots, \theta^{(T)}$ are used to approximate the posterior distribution. For example, the average among the sampled values for a parameter might be a good estimate of its posterior mean.
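The accept/reject rule and the burn-in step described above can be sketched for the coin example (uniform prior, 2 heads and 4 tails, so the exact posterior is Beta(3,5) with mean 0.375). This is an illustration under stated assumptions, not the lecture's implementation: the proposal is a symmetric uniform window, so the two jumping terms in the ratio cancel, and the window width, starting value, seed, and values of T and B are arbitrary choices.

```python
import math
import random

def log_unnormalized_posterior(p, heads=2, tails=4):
    """log[Pr(X | p) Pr(p)] with a uniform prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf  # zero prior density outside (0, 1)
    return heads * math.log(p) + tails * math.log(1.0 - p)

def hastings_sampler(n_steps, start=0.9, window=0.2, seed=1):
    rng = random.Random(seed)
    current = start
    samples = []
    for _ in range(n_steps):
        # Symmetric proposal: J(proposed | current) = J(current | proposed),
        # so the jumping terms cancel in the ratio r.
        proposed = current + rng.uniform(-window, window)
        log_r = (log_unnormalized_posterior(proposed)
                 - log_unnormalized_posterior(current))
        # Accept with probability min(r, 1); otherwise stay put.
        if log_r >= 0 or rng.random() < math.exp(log_r):
            current = proposed
        samples.append(current)
    return samples

T = 20000   # total number of samples
B = 2000    # burn-in samples to discard
samples = hastings_sampler(T)
kept = samples[B:]
posterior_mean_estimate = sum(kept) / len(kept)
print("estimated posterior mean:", posterior_mean_estimate)
```

Because only the ratio of unnormalized posteriors is needed, the difficult integral in the denominator of Bayes' theorem never has to be computed.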

Markov Chain Monte Carlo and Relatives (some important papers)

CARLIN, B.P., and T.A. LOUIS. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, London.

GELMAN, A., J.B. CARLIN, H.S. STERN, and D.B. RUBIN. Bayesian Data Analysis. Chapman and Hall, London.

GEYER, C. Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface. Keramidas, ed. Fairfax Station: Interface Foundation.

HASTINGS, W.K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57.

METROPOLIS, N., A.W. ROSENBLUTH, M.N. ROSENBLUTH, A.H. TELLER, and E. TELLER. Equations of state calculations by fast computing machines. J. Chem. Phys. 21.

Posterior predictive inference (notice the resemblance to the parametric bootstrap)

1. Via MCMC or some other technique, get $N$ sampled parameter sets $\theta^{(1)}, \ldots, \theta^{(N)}$ from the posterior distribution $p(\theta \mid X)$.
2. For each sampled parameter set $\theta^{(k)}$, simulate a new data set $X^{(k)}$ from $p(X \mid \theta^{(k)})$.
3. Calculate a test statistic value $T(X^{(k)})$ from each simulated data set and see where the test statistic value for the actual data, $T(X)$, is relative to the simulated distribution of test statistic values.
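The three posterior predictive steps above can be sketched for the coin data. This is a toy illustration under stated assumptions: parameter values are drawn directly from the exact Beta(3,5) posterior (uniform prior, 2 heads and 4 tails) instead of from an MCMC run, and the test statistic, the number of switches between H and T in the sequence, is a hypothetical choice made only for this example.

```python
import random

def count_switches(tosses):
    """Test statistic T: number of adjacent positions where the outcome changes."""
    return sum(1 for a, b in zip(tosses, tosses[1:]) if a != b)

observed = "HTHTTT"
t_observed = count_switches(observed)

rng = random.Random(1)
n_draws = 2000
simulated_stats = []
for _ in range(n_draws):
    # Step 1: draw a parameter value from the posterior.
    p = rng.betavariate(3, 5)
    # Step 2: simulate a new data set of the same size from Pr(X | p).
    tosses = "".join("H" if rng.random() < p else "T"
                     for _ in range(len(observed)))
    # Step 3: compute the test statistic on the simulated data.
    simulated_stats.append(count_switches(tosses))

# Where does the observed statistic fall in the simulated distribution?
tail_fraction = sum(1 for t in simulated_stats if t >= t_observed) / n_draws
print("T(X) =", t_observed, " fraction of simulations with T >= T(X):", tail_fraction)
```

A tail fraction very close to 0 or 1 would suggest that the model has trouble reproducing this feature of the data.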

From Huelsenbeck et al. Syst. Biol. 52(2).

Notation for the following pages:
$X$: data
$M_i, M_j$: models $i$ and $j$
$\theta_i, \theta_j$: parameters for models $i$ and $j$
$p(X \mid \theta_i, M_i),\; p(X \mid \theta_j, M_j)$: likelihoods

Bayes factor

$$\frac{p(M_i \mid X)}{p(M_j \mid X)} = \frac{p(M_i)\,p(X \mid M_i)/p(X)}{p(M_j)\,p(X \mid M_j)/p(X)} = \frac{p(M_i)}{p(M_j)} \times \frac{p(X \mid M_i)}{p(X \mid M_j)}$$

The left factor is called the prior odds and the right factor is called the Bayes factor. The Bayes factor is the ratio of the marginal likelihoods of the two models.

$$BF_{ij} = \frac{p(X \mid M_i)}{p(X \mid M_j)}$$

According to Wikipedia, Jeffreys' (1961) interpretation of $BF_{12}$ (1 representing one model and 2 being the other):

BF_12              Interpretation
< 1:1              Negative (supports M_2)
1:1 to 3:1         Barely worth mentioning
3:1 to 10:1        Substantial
10:1 to 30:1       Strong
30:1 to 100:1      Very strong
> 100:1            Decisive
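For the coin example, marginal likelihoods are available in closed form, so a Bayes factor can be computed exactly. The sketch below is a constructed illustration, not taken from the slides: it assumes two hypothetical models that differ only in the prior on $p$ ($M_1$: Beta(1,1), $M_2$: Beta(30,10)) and uses the 2 heads, 4 tails data. For a Beta(a, b) prior, the marginal likelihood of a specific toss sequence is Beta(a + heads, b + tails) / Beta(a, b), where Beta(., .) here denotes the beta function.

```python
from math import exp, lgamma

def log_beta_function(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(a, b, heads, tails):
    """log p(X | M) for a specific toss sequence under a Beta(a, b) prior."""
    return log_beta_function(a + heads, b + tails) - log_beta_function(a, b)

def jeffreys_label(bf):
    """Interpretation categories from the table above."""
    if bf < 1:
        return "Negative (supports M2)"
    if bf <= 3:
        return "Barely worth mentioning"
    if bf <= 10:
        return "Substantial"
    if bf <= 30:
        return "Strong"
    if bf <= 100:
        return "Very strong"
    return "Decisive"

heads, tails = 2, 4
log_m1 = log_marginal_likelihood(1, 1, heads, tails)    # M1: uniform prior
log_m2 = log_marginal_likelihood(30, 10, heads, tails)  # M2: Beta(30,10) prior
bf_12 = exp(log_m1 - log_m2)
print("BF_12 = %.3f (%s)" % (bf_12, jeffreys_label(bf_12)))
```

The two models share the same likelihood function, so the Bayes factor here is driven entirely by the priors.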

$$BF_{ij} = \frac{p(X \mid M_i)}{p(X \mid M_j)}$$

Bayes factors are hard to compute because marginal likelihoods are hard to compute:

$$p(X \mid M_i) = \int_{\theta_i} p(X \mid M_i, \theta_i)\, p(\theta_i \mid M_i)\, d\theta_i$$

Important point to note from the above: Bayes factors depend on the priors $p(\theta_i \mid M_i)$ because marginal likelihoods depend on the priors!

How to approximate/compute the marginal likelihood?

Harmonic mean estimator of the marginal likelihood (widely used but likely to be terrible and should be avoided):

$$\frac{1}{p(X \mid M_i)} \doteq \frac{1}{N} \sum_{k=1}^{N} \frac{1}{p(X \mid \theta_i^{(k)}, M_i)}$$

where the $\theta_i^{(k)}$ are sampled from the posterior $p(\theta_i \mid X, M_i)$.
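The harmonic mean estimator can be tried on the coin example, where the exact marginal likelihood is known (uniform prior, 2 heads and 4 tails, so $p(X \mid M) = 1/105$). This is a sketch under stated assumptions: posterior samples are drawn directly from the exact Beta(3,5) posterior instead of from an MCMC run, and the sample sizes and seeds are arbitrary.

```python
import math
import random

def log_likelihood(p, heads=2, tails=4):
    return heads * math.log(p) + tails * math.log(1.0 - p)

def harmonic_mean_log_marginal(samples):
    """log of N / sum_k [1 / likelihood_k], computed with log-sum-exp."""
    neg_logs = [-log_likelihood(p) for p in samples]
    biggest = max(neg_logs)
    log_sum = biggest + math.log(sum(math.exp(v - biggest) for v in neg_logs))
    return math.log(len(samples)) - log_sum

exact_log_marginal = math.log(1.0 / 105.0)

estimates = []
for seed in range(5):
    rng = random.Random(seed)
    posterior_samples = [rng.betavariate(3, 5) for _ in range(5000)]
    estimates.append(harmonic_mean_log_marginal(posterior_samples))

print("exact log marginal likelihood:", exact_log_marginal)
print("harmonic mean estimates:", estimates)
```

In this example the terms $1/p(X \mid \theta)$ have infinite variance under the posterior, so a few samples near $p = 0$ or $p = 1$ can dominate the sum. That instability is one reason the estimator is best avoided.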

Important papers regarding Bayesian model comparison

Posterior predictive inference in phylogenetics:
J.P. Bollback. Molecular Biology and Evolution 19.

Harmonic mean and other techniques for estimating Bayes factors:
Newton and Raftery. Journal of the Royal Statistical Society, Series B 56(1):3-48.

More reliable ways to approximate the marginal likelihood:
Thermodynamic integration to approximate Bayes factors (adapted to molecular evolution data): Lartillot and Philippe. Syst. Biol. 55.
Improving marginal likelihood estimation for Bayesian phylogenetic model selection. W. Xie, P.O. Lewis, Y. Fan, L. Kuo, M-H Chen. Syst. Biol. 60(2).
Choosing among partition models in Bayesian phylogenetics. Y. Fan, R. Wu, M-H Chen, L. Kuo, P.O. Lewis. Mol. Biol. Evol. 28(1).

Markov chain Monte Carlo without likelihoods. P. Marjoram, J. Molitor, V. Plagnol, and S. Tavare. PNAS USA 100(26).

H. Jeffreys. The Theory of Probability (3e). Oxford (1961), p. 432.

M.A. Beaumont, W. Zhang, D.J. Balding. Approximate Bayesian computation in population genetics. Genetics 162.

Paul Lewis MCMC Robot Demo

Target distribution:
Mixture of bivariate normal hills
Inner contours: 50% of the probability
Outer contours: 95%

Proposal scheme:
Random direction
Gamma-distributed step length (mean 45 pixels, s.d. 40 pixels)
Reflection at edges

MCMC robot rules

Slightly downhill steps are usually accepted.
Drastic "off the cliff" downhill steps are almost never accepted.
Uphill steps are always accepted.

With these rules, it is easy to see that the robot tends to stay near the tops of hills.

Burn-in

[Figure: first 100 steps of the robot, with the starting point marked]

Note that the first few steps are not at all representative of the distribution.

Problems with MCMC approaches:

1. They are difficult to implement. The implementation may need to be clever to be computationally tractable, and programming bugs are a serious possibility.
2. For the kinds of complicated situations that biologists face, it may be very difficult to know how fast the Markov chain converges to the desired posterior distribution. There are diagnostics for evaluating whether a chain has converged to the posterior distribution, but the diagnostics do not provide a guarantee of convergence.

A GOOD DIAGNOSTIC: MULTIPLE RUNS!!

Just how long is a long run?

[Figure: robot path from a run that was stopped early]

What would you conclude about the target distribution had you stopped the robot at this point?

One way to detect this mistake is to perform several independent runs.

Results different among runs? Probably none of them were run long enough!

History plots

[Figure: history plot; axis label: Ln Likelihood; annotations: "White noise" appearance is a good sign; burn-in is over right about here]

Important! This is a plot of the first 1000 steps, and there is no indication that anything is wrong (but we know for a fact that we didn't let this one run long enough).

Slow mixing

The chain is spending long periods of time stuck in one place. This indicates that the step size is too large, and most proposed steps would take the robot off the cliff.

The problem of co-linearity

[Figure: joint posterior density for two parameters; both axes labelled "Parameter"]

The joint posterior density for a model having two highly correlated parameters is a narrow ridge. If we have separate proposals for the two parameters, even small steps may be too large!

Some material on Bayesian model comparison and hypothesis testing

1. Some Bayesians dislike much hypothesis testing because null hypotheses often are known a priori to be false and the p-value depends both on how wrong the null is and on the amount of data.
2. Posterior predictive inference for assessing fit of models (see next pages)
3. Bayes factors for comparing models (see next pages)

The Tradeoff

Pro: Proposing big steps helps in jumping from one island in the posterior density to another.
Con: Proposing big steps often results in poor mixing.
Solution: Better proposals - MCMCMC

Huelsenbeck has found that a technique called Metropolis-Coupled Markov chain Monte Carlo (i.e., MCMCMC!! or MC^3), suggested by C.J. Geyer, is useful for getting convergence with phylogeny reconstruction.

The idea of MCMCMC is to run multiple Markov chains in parallel. One chain will have a stationary distribution that is the posterior of interest. The other chains will approximate posterior distributions that are, to various degrees, smoother than the posterior distribution of interest.

Each chain is run separately, except that occasionally 2 chains are randomly picked and a proposal to switch the states of these two chains is made. This proposal is randomly accepted or rejected with the appropriate probability.

Metropolis-coupled Markov chain Monte Carlo (MCMCMC, or MC^3)

MC^3 involves running several chains simultaneously.

The cold chain is the one that counts; the rest are heated chains.

What is a heated chain? Instead of using $R$ to make acceptance or rejection decisions, heated chains use $R^{1/(1+H)}$.

In MrBayes: H = Temperature * (Chain's index). The cold chain has index 0.

Heated chains explore the surface more freely.

Occasionally, you propose to switch the positions of 2 of the chains.
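The MC^3 scheme can be sketched on a one-dimensional toy target with two well-separated hills. Everything specific here is an assumption made for illustration: the target, the number of chains, the temperature value, the step size, and the swap frequency. Chain i works with the target raised to the power 1/(1 + temperature * i), so chain 0 is the cold chain, and only cold-chain states are recorded.

```python
import math
import random

def log_target(x):
    """Unnormalized log density: two normal hills centred at -3 and +3."""
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def accept(log_ratio, rng):
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)

def mc3(n_steps, n_chains=4, temperature=0.5, step=1.0, swap_every=10, seed=1):
    rng = random.Random(seed)
    # Heating power for each chain; index 0 gives power 1 (the cold chain).
    powers = [1.0 / (1.0 + temperature * i) for i in range(n_chains)]
    states = [-3.0] * n_chains
    cold_samples = []
    for t in range(1, n_steps + 1):
        # Ordinary within-chain update, using the heated acceptance ratio.
        for i, power in enumerate(powers):
            proposed = states[i] + rng.uniform(-step, step)
            log_r = log_target(proposed) - log_target(states[i])
            if accept(power * log_r, rng):
                states[i] = proposed
        # Occasionally propose to swap the states of two randomly picked chains.
        if t % swap_every == 0:
            i, j = rng.sample(range(n_chains), 2)
            log_swap = (powers[i] - powers[j]) * (
                log_target(states[j]) - log_target(states[i]))
            if accept(log_swap, rng):
                states[i], states[j] = states[j], states[i]
        cold_samples.append(states[0])
    return cold_samples

cold_samples = mc3(20000)
share_right = sum(1 for x in cold_samples if x > 0) / len(cold_samples)
print("share of cold-chain samples on the right-hand hill:", share_right)
```

The swap acceptance ratio compares the product of the two heated densities before and after exchanging states, which keeps the cold chain's stationary distribution equal to the target.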

Heated chains act as scouts

Phylogeny priors:

For phylogeny inference, parameters might represent topology, branch lengths, base frequencies, transition-transversion ratio, etc. Each parameter needs a specified prior distribution. For example:

1. All unrooted topologies can be considered equally probable a priori. Given the topology, all branch lengths between 0 and some big number could be considered equally likely a priori.
2. All combinations of base frequencies could be considered equally likely a priori.
3. The transition-transversion ratio could have a prior distribution that is uniform between 0 and some big number.

... and so on.

[Figures: Moving through tree space: Larget-Simon local move]


Moving through parameter space

Using $\kappa$ (the ratio of the transition rate to the transversion rate) as an example of a model parameter.

The proposal distribution is the uniform distribution on the interval $(\kappa - \delta, \kappa + \delta)$.

A larger $\delta$ means the sampler will attempt to make larger jumps on average.

Putting it all together

Start with an initial tree and model parameters (often chosen randomly).
Propose a new, randomly-selected move.
Accept or reject the move (Walking).
Every k generations, save the tree, branch lengths and all model parameters (Thinning).
After n generations, summarize the sample using histograms, means, credibility intervals, etc. (Summarizing).
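The walking, thinning, and summarizing steps above can be sketched for a single transition/transversion rate ratio parameter (called kappa in the code) with a sliding-window proposal. Several pieces are assumptions made only for illustration: the log posterior is a hypothetical gamma-shaped stand-in, not a phylogenetic likelihood; proposals below zero are reflected back to positive values, which keeps the proposal symmetric; and the window half-width (called delta in the code), the starting value, and the run lengths are arbitrary.

```python
import math
import random

def log_posterior_kappa(kappa):
    """Hypothetical stand-in for log[likelihood x prior] as a function of kappa."""
    if kappa <= 0:
        return -math.inf
    return 2.0 * math.log(kappa) - kappa

def propose(kappa, delta, rng):
    """Uniform draw from (kappa - delta, kappa + delta), reflected at zero."""
    return abs(kappa + rng.uniform(-delta, delta))

def run_chain(n_generations, k, delta=1.0, start=1.0, seed=1):
    rng = random.Random(seed)
    kappa = start
    saved = []
    for generation in range(1, n_generations + 1):
        # Walking: propose a move, then accept or reject it.
        proposed = propose(kappa, delta, rng)
        log_r = log_posterior_kappa(proposed) - log_posterior_kappa(kappa)
        if log_r >= 0 or rng.random() < math.exp(log_r):
            kappa = proposed
        # Thinning: save the current state every k generations.
        if generation % k == 0:
            saved.append(kappa)
    return saved

# Summarizing: mean and a central 95% interval from the thinned sample.
saved = run_chain(n_generations=10000, k=10)
ordered = sorted(saved)
mean_kappa = sum(saved) / len(saved)
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered))]
print("n = %d, mean = %.3f, 95%% interval = (%.3f, %.3f)"
      % (len(saved), mean_kappa, lower, upper))
```

A larger window half-width makes the chain attempt bigger jumps, at the cost of more rejections when the proposals land in low-density regions.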

Sampling the chain tells us:

Which tree has the highest posterior probability?
What is the probability that tree X is the true tree?
What values of the parameters are most probable?

What if we are only interested in one grouping?

Which of the trees in the MCMC run contained the clade (e.g. A + C)?

The proportion of trees with A and C together in our sample approximates the posterior probability that A and C are sister to each other.
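Estimating a clade's posterior probability from an MCMC sample amounts to counting how many sampled trees contain it. The sketch below assumes a deliberately simple, hypothetical representation in which each sampled tree is a frozenset of its clades and each clade is a frozenset of taxon names; the toy sample of ten trees is made up for illustration.

```python
from collections import Counter

def make_tree(*clades):
    """Represent a tree by the set of clades (groups of taxa) it contains."""
    return frozenset(frozenset(clade) for clade in clades)

def clade_frequency(sampled_trees, clade):
    """Proportion of sampled trees that contain the given clade."""
    clade = frozenset(clade)
    hits = sum(1 for tree in sampled_trees if clade in tree)
    return hits / len(sampled_trees)

# Hypothetical thinned MCMC sample over three topologies.
tree_ac = make_tree("AC", "BD")
tree_ab = make_tree("AB", "CD")
tree_ad = make_tree("AD", "BC")
sampled_trees = [tree_ac] * 7 + [tree_ab] * 2 + [tree_ad] * 1

prob_ac = clade_frequency(sampled_trees, "AC")
best_tree, best_count = Counter(sampled_trees).most_common(1)[0]

print("estimated posterior probability of clade A + C:", prob_ac)
print("most frequently sampled tree appears in",
      best_count / len(sampled_trees), "of the sample")
```

The same counting gives the estimated posterior probability of a whole tree: it is the proportion of the sample occupied by that topology.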

Split (a.k.a. clade) probabilities

A split is a partitioning of taxa that corresponds with a particular branch.

Splits are usually represented by strings: asterisks (*) show which taxa are on one side of the branch, and the hyphens (-) show the taxa on the other side.

Posteriors of model parameters

[Figure: histogram created from a sample of 1000 kappa values, with the lower and upper bounds of the credible interval marked; mean = 3.234. From: Lewis, L., & Flechtner, V. (2002). Taxon 51.]
