Accounting for Phylogenetic Uncertainty in Comparative Studies: MCMC and MCMCMC Approaches. Mark Pagel Reading University.


Mark Pagel, Reading University (m.pagel@rdg.ac.uk)

[Figure: Phylogeny of the Ascomycota fungi showing the evolution of lichen formation. Legend: lichen forming; ambiguous; not lichen forming.]

Mark Pagel, Reading Univ (ITP 5/14/01)

Phylogenetic Uncertainty

[Figure: the three possible rooted trees and the single unrooted tree for three species A, B and C.]

Number of Possible Phylogenetic Trees

Species   Unrooted trees   Rooted trees
3         1                3
4         3                15
5         15               105
6         105              945
10        2,027,025        34,459,425
50        2.83 x 10^74     2.75 x 10^76

No. of tips (species) N = 50
No. rooted = 27529213532835651545259729751524430639300973035816196098326553772152587890625
No. unrooted = 283806325080779912837729172696128150920628587998105114415737667754150390625

Accounting for Phylogenetic Uncertainty

Markov-Chain Monte Carlo (MCMC) methods:
- generate a long chain of phylogenetic trees (tree proposal and acceptance mechanisms)
- randomly sample from the converged chain
- calculate the event or evolutionary process of interest in each sampled tree

Tree acceptance mechanism: the Metropolis-Hastings algorithm
- accept the new tree with p = 1.0 if $L(T_{n+1}) > L(T_n)$
- otherwise accept it with probability $L(T_{n+1}) / L(T_n)$
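The counts in the table can be reproduced with the standard double-factorial formulas for strictly bifurcating trees: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n species. The slide does not state these formulas, so the sketch below is a supplementary check, and the function names are illustrative only.

```python
def odd_double_factorial(k):
    """Product 1 * 3 * 5 * ... * k for odd k; the empty product (k < 3) is 1."""
    result = 1
    for m in range(3, k + 1, 2):
        result *= m
    return result


def n_unrooted(n_species):
    """Number of unrooted bifurcating topologies: (2n - 5)!!, for n >= 3."""
    return odd_double_factorial(2 * n_species - 5)


def n_rooted(n_species):
    """Number of rooted bifurcating topologies: (2n - 3)!!, for n >= 2."""
    return odd_double_factorial(2 * n_species - 3)


if __name__ == "__main__":
    # Same rows as the table: species, unrooted, rooted.
    for n in (3, 4, 5, 6, 10, 50):
        print(n, n_unrooted(n), n_rooted(n))
```

Python integers have arbitrary precision, so the N = 50 values are computed exactly rather than in floating point.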

Metropolis-Hastings Algorithm

Accept the new tree T' according to:

$$R = \min\left[1,\; \frac{f(X \mid T')}{f(X \mid T)} \times \frac{f(T')}{f(T)} \times \frac{f(T \mid T')}{f(T' \mid T)}\right]$$

The three factors are, in order, the likelihood ratio, the prior ratio and the proposal ratio.

X = data (e.g., gene sequences)
T = tree (topology, branches, parameters)

MCMC Sampling
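The acceptance rule is usually evaluated on the log scale, because tree likelihoods are far too small to multiply directly. The sketch below is a minimal illustration, not the author's implementation: the five "trees" and their log-likelihoods are made-up toy values, the prior is flat, and the proposal (step to a neighbour on a ring) is symmetric, so the prior and proposal ratios are both 1.

```python
import math
import random


def mh_accept(log_lik_new, log_lik_old,
              log_prior_new=0.0, log_prior_old=0.0,
              log_q_old_given_new=0.0, log_q_new_given_old=0.0,
              rng=random):
    """Return True if the proposal is accepted.

    log R = log likelihood ratio + log prior ratio + log proposal ratio,
    and the proposal is accepted with probability min(1, R).
    """
    log_r = ((log_lik_new - log_lik_old)
             + (log_prior_new - log_prior_old)
             + (log_q_old_given_new - log_q_new_given_old))
    if log_r >= 0.0:
        return True
    return rng.random() < math.exp(log_r)


def run_toy_chain(log_liks, n_steps, seed=1):
    """Run a chain over a ring of toy states and count visits to each state."""
    rng = random.Random(seed)
    state = 0
    visits = [0] * len(log_liks)
    for _ in range(n_steps):
        # Symmetric proposal: move one step left or right on the ring.
        proposal = (state + rng.choice((-1, 1))) % len(log_liks)
        if mh_accept(log_liks[proposal], log_liks[state], rng=rng):
            state = proposal
        visits[state] += 1
    return visits


if __name__ == "__main__":
    toy_log_liks = [-3.0, -1.0, 0.0, -1.0, -3.0]  # hypothetical values
    print(run_toy_chain(toy_log_liks, 20000))
```

With a flat prior and a symmetric proposal the rule reduces to the simplified form "accept if the likelihood improves, otherwise accept with probability equal to the likelihood ratio".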

Primer of finding the likelihood of a phylogenetic tree

1. Aligned gene-sequence data

Sheep    ATGGTGAAAA GCCACATAGG CAGTTGGATC CTGGTTCTCT TTGTGGCCAT
Human    ATGGCGAA-- ----CCTTGG CTGCTGGATG CTGGTTCTCT TTGTGGCCAC
Gorilla  ATGGCGAA-- ----CCTTGG CTGCTGGATG CTGGTTCTCT TTGTGGCCAC
Mink     ATGGTGAAAA GCCACATAGG CAGCTGGCTC CTGGTTCTCT TTGTGGCCAC

2. Model of sequence evolution: the rate matrix Q
3. The probability of sequence substitutions in a given branch of the phylogeny: $P(t) = \exp[Qt]$
4. The likelihood of a given phylogenetic tree: $L = \prod_{\text{branches}} P(t) = \prod \exp[Qt]$
5. Search alternative topologies

[Figure: alternative topologies for the four tips H (Human), G (Gorilla), S (Sheep) and M (Mink).]

The probability of observing the ith nucleotide site is

$$p_i(x \mid T, v, Q) = \sum^{4^{s-1}} w_{\mathrm{root}(i)} \prod_{k=1}^{s} p_{n_k,\, x_{ki}}(v_k, Q) \prod_{k=1}^{s-2} p_{n'_k,\, n_k}(v_k, Q)$$

- $w_{\mathrm{root}(i)}$: prior weight, ith site
- sum: over the possible assignments of ancestral nodes (e.g., 64)
- first product: over the s branches leading to species
- second product: over the s-2 internal branches

The likelihood is the product over all i nucleotides:

$$L(x \mid T, v, Q) = \prod_i p_i(x \mid T, v, Q)$$
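The site-likelihood sum can be written out by brute force for a four-species tree. The sketch below makes several assumptions that are not in the slides: the Jukes-Cantor model stands in for Q (its P(t) = exp[Qt] has a simple closed form), the rooted topology is fixed as ((H,G),(S,M)), the branch lengths are arbitrary placeholder values, and gapped columns are not handled. Production software would use Felsenstein's pruning algorithm rather than enumerating all 64 ancestral assignments.

```python
import itertools
import math

STATES = "ACGT"


def jc69_p(t):
    """Closed-form P(t) = exp(Qt) under Jukes-Cantor.

    t is the branch length in expected substitutions per site.
    """
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]


def site_likelihood(column, v_tip=0.1, v_internal=0.05):
    """Likelihood of one alignment column on the rooted tree ((H,G),(S,M)).

    column: nucleotides for H, G, S, M in that order, e.g. "CCTT".
    v_tip and v_internal are placeholder branch lengths (assumptions).
    """
    h, g, s, m = (STATES.index(c) for c in column)
    p_tip, p_int = jc69_p(v_tip), jc69_p(v_internal)
    total = 0.0
    # Sum over the 4**(4-1) = 64 assignments of the three ancestral nodes:
    # the root, node a (parent of H and G) and node b (parent of S and M).
    for root, a, b in itertools.product(range(4), repeat=3):
        term = 0.25                              # root weight (JC69 frequency)
        term *= p_int[root][a] * p_int[root][b]  # the 2 internal branches
        term *= p_tip[a][h] * p_tip[a][g]        # the 4 branches leading
        term *= p_tip[b][s] * p_tip[b][m]        # to the species
        total += term
    return total


def log_likelihood(columns, **branch_lengths):
    """Sum of log site likelihoods, i.e. the log of the product over sites."""
    return sum(math.log(site_likelihood(c, **branch_lengths)) for c in columns)


if __name__ == "__main__":
    # Columns 4 to 6 of the alignment, read as Human, Gorilla, Sheep, Mink.
    print(log_likelihood(["GGGG", "CCTT", "GGGG"]))
```

Summing `site_likelihood` over all 256 possible columns gives 1, which is a quick sanity check that the enumeration covers every ancestral assignment exactly once.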

Convergence of a Markov Chain

[Figure: log-likelihood of topology against position in Markov chain, with an inset showing the converged portion of the chain in detail. Data: 20000Out1.lpd]

Tree Likelihoods from Markov-Chain Monte Carlo Simulation

n = 54 taxa, Ascomycota fungi; n = 10,000 trees

[Figure: histogram of frequency against log-likelihood of tree, spanning roughly -14800 to -14730. Data: 2911 best.lpd]

Character Transition Rates: gains and losses of lichenization

                Single tree*       MCMC integration**
gains (q01)     1.04 (0.05-4.5)    1.47 ± 0.32
losses (q10)    2.41 (0.7-5.6)     2.12 ± 0.31

*consensus tree
**n = 20,000 trees

MCMC: Some Issues

- Lack of convergence (poor mixing)
- Tree and parameter proposal mechanisms
- Tree and parameter updating schedules
- Detecting convergence

Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC)

Given m simultaneous Markov chains, update the chains, then swap states among a randomly chosen pair i and j each iteration according to:

$$R = \min\left[1,\; \frac{f_i(y_i)\, f_j(y_j)}{f_i(x_i)\, f_j(x_j)}\right]$$

that is, {likelihood ratio chain i × likelihood ratio chain j}.

[Diagram: chains labelled with states x_i, x_j, x_k and y_i, y_j, y_k.]

Probability of swapping with chain i = R × 2/m

Temperatures of heated chains

[Figure: temperature, from cold to hot, against number of chains i, with the cold chain marked and curves for 1/i and for 1/(1 + t(i-1)) with t = 0.2 and t = 0.5.]
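The heating schedule and the swap step can be sketched as follows. Two assumptions go beyond the slide: chain i is taken to sample the target raised to the power beta_i = 1/(1 + t(i-1)), and a swap means y_i = x_j and y_j = x_i, so that log R simplifies to (beta_i - beta_j)(log f(x_j) - log f(x_i)). The function names are illustrative, and plain log-posterior numbers stand in for the trees a real sampler would exchange.

```python
import math
import random


def chain_temperatures(m, t=0.2):
    """Heating values 1 / (1 + t * (i - 1)) for chains i = 1..m.

    Chain 1 has value 1.0 and is the cold chain.
    """
    return [1.0 / (1.0 + t * (i - 1)) for i in range(1, m + 1)]


def log_swap_ratio(beta_i, beta_j, log_post_i, log_post_j):
    """log R for swapping the states of chains i and j.

    ASSUMPTION: f_i(x) = f(x) ** beta_i, so that
    log[f_i(x_j) f_j(x_i) / (f_i(x_i) f_j(x_j))]
        = (beta_i - beta_j) * (log f(x_j) - log f(x_i)).
    """
    return (beta_i - beta_j) * (log_post_j - log_post_i)


def try_swap(betas, log_posts, rng=random):
    """Pick a random pair of chains and swap their states with probability min(1, R).

    log_posts holds the unheated log posterior of each chain's current state
    and is modified in place. Returns the swapped pair, or None if rejected.
    """
    i, j = rng.sample(range(len(betas)), 2)
    log_r = log_swap_ratio(betas[i], betas[j], log_posts[i], log_posts[j])
    if log_r >= 0.0 or rng.random() < math.exp(log_r):
        log_posts[i], log_posts[j] = log_posts[j], log_posts[i]
        return (i, j)
    return None


if __name__ == "__main__":
    betas = chain_temperatures(4, t=0.2)
    log_posts = [-37800.0, -37790.0, -37820.0, -37850.0]  # hypothetical values
    print(betas)
    print(try_swap(betas, log_posts), log_posts)
```

Under these assumptions a swap that moves a better state into the colder chain is always accepted, which is how heated chains help the cold chain escape local optima.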

[Figure: two panels of log-likelihood against generation. 54 taxa Ascomycota data, n = 858 nucleotides.]

Phylogeny of Human LINE-1 elements (92 elements, 4kb sequences)

[Figure: phylogeny of the LINE-1 elements. Tip labels take the form C1.18, C6.20, C21.19R, C22.18 and so on, together with possum, mouse3 and L1 gorilla. Scale bar: 0.1. Time axis marked ~120, ~90 and ~10-15 millions of years ago.]

MCMCMC Analysis of LINEs data

92 LINE elements, 4000 nucleotides

[Figure: log-likelihood against generation, comparing 4-chains with 1-chain.]

LINEs data: simultaneous chains with heating and swapping

[Figure: log-likelihood against generation for the cold, warm and hot chains, with chain swapping marked.]

LINEs data: log-likelihoods of trees from the cold chain (converged chain)

[Figure: histogram of count (0 to 45) against log-likelihood (-37820 to -37760) for pre-swap trees and post-swap trees.]

Phylogeny of Human LINE-1 elements (92 elements, 4kb sequences)

[Figure: the LINE-1 phylogeny shown again, here with three nodes labelled 100.]

Some topics and issues

MCMC (getting stuck and slow progress)
- Tree proposal algorithms (random, deep, shallow, ???)
- Alternation among suites of parameters (topology, branch lengths, model parameters): what schedule to use?
- Optimisation alternating with bouts of M-H selection?

MCMCMC (highly inefficient)
- Tree swapping: encourage early tree swapping?
- Once converged, cool heated chains and use in inference?