Accounting for Phylogenetic Uncertainty in Comparative Studies: MCMC and MCMCMC Approaches

Mark Pagel
Reading University
m.pagel@rdg.ac.uk

[Figure: Phylogeny of the Ascomycota fungi showing the evolution of lichen formation. Legend: lichen forming; ambiguous; not lichen forming.]

Mark Pagel, Reading Univ (ITP 5/14/01)

Phylogenetic Uncertainty

Three Species
[Figure: the single unrooted tree for species A, B, and C, and the three rooted trees for the same species.]
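
The three-species case generalises: for n species there are (2n-5)!! strictly bifurcating unrooted trees and (2n-3)!! rooted ones. The following is a minimal sketch of that count; the function names are illustrative and do not come from the slides.

```python
def odd_double_factorial(k: int) -> int:
    """Product of the odd integers 1 * 3 * ... * k; returns 1 when k < 3."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def count_unrooted(n_tips: int) -> int:
    """Number of strictly bifurcating unrooted trees on n_tips >= 3 species."""
    return odd_double_factorial(2 * n_tips - 5)


def count_rooted(n_tips: int) -> int:
    """Number of strictly bifurcating rooted trees on n_tips >= 2 species."""
    return odd_double_factorial(2 * n_tips - 3)


for n in (3, 4, 5, 6, 10, 50):
    print(n, count_unrooted(n), count_rooted(n))
```

Python integers have arbitrary precision, so the 50-species counts are computed exactly rather than in floating point.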

Number of Possible Phylogenetic Trees

Species   Unrooted        Rooted
3         1               3
4         3               15
5         15              105
6         105             945
10        2,027,025       34,459,425
50        2.83 × 10^74    2.75 × 10^76

[Figure: No. of Trees against No. of tips (species).]

N = 50
No. rooted = 27529213532835651545259729751524430639300973035816196098326553772152587890625
No. unrooted = 283806325080779912837729172696128150920628587998105114415737667754150390625

Accounting for Phylogenetic Uncertainty

Markov-Chain Monte Carlo (MCMC) Methods
- generate a long chain of phylogenetic trees (tree proposal and acceptance mechanisms)
- randomly sample from the converged chain
- calculate the event or evolutionary process in each sampled tree

Tree acceptance mechanism: the Metropolis-Hastings algorithm
Accept the new tree with p = 1.0 if $L(T_{n+1}) > L(T_n)$; otherwise accept it with probability $L(T_{n+1}) / L(T_n)$.
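
The chain-then-sample procedure can be illustrated on a toy problem. The sketch below assumes only five candidate "trees" with made-up likelihoods and a symmetric proposal that picks any tree uniformly; these values and names are hypothetical and stand in for a real tree proposal mechanism. It applies the likelihood-ratio acceptance rule, discards an initial burn-in, and then samples randomly from the remainder of the chain.

```python
import random

# Hypothetical likelihoods for five candidate trees (illustrative values only).
LIKELIHOODS = [1.0, 2.0, 4.0, 2.0, 1.0]


def run_chain(n_steps, rng):
    """Generate a chain of tree indices using the likelihood-ratio acceptance rule."""
    state = rng.randrange(len(LIKELIHOODS))
    chain = []
    for _ in range(n_steps):
        proposal = rng.randrange(len(LIKELIHOODS))  # symmetric, uniform proposal
        ratio = LIKELIHOODS[proposal] / LIKELIHOODS[state]
        # Accept with probability 1 if the likelihood does not decrease,
        # otherwise with probability equal to the likelihood ratio.
        if ratio >= 1.0 or rng.random() < ratio:
            state = proposal
        chain.append(state)
    return chain


def sample_converged(chain, burn_in, n_samples, rng):
    """Discard the burn-in, then draw a random sample from the rest of the chain."""
    return rng.sample(chain[burn_in:], n_samples)


rng = random.Random(1)
chain = run_chain(50_000, rng)
sample = sample_converged(chain, burn_in=5_000, n_samples=10_000, rng=rng)
freq = [sample.count(k) / len(sample) for k in range(len(LIKELIHOODS))]
print(freq)
```

If the chain has converged, each tree should appear in the sample roughly in proportion to its likelihood, so any quantity computed on the sampled trees is averaged over the uncertainty about which tree is correct.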

Metropolis-Hastings Algorithm

Accept the new tree according to:

$$R = \min\left[1,\ \frac{f(X \mid T')}{f(X \mid T)} \times \frac{f(T')}{f(T)} \times \frac{f(T \mid T')}{f(T' \mid T)}\right]$$

The three factors are, in order, the likelihood ratio, the prior ratio, and the proposal ratio.

X = data (e.g., gene sequences)
T = tree (topology, branches, parameters)

MCMC Sampling
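
Because tree likelihoods are extremely small numbers, the acceptance ratio is usually evaluated on the log scale. The sketch below shows one way to do this; the function and argument names are hypothetical, and the default arguments assume a flat prior and a symmetric proposal, in which case R reduces to the likelihood ratio alone.

```python
import math
import random


def mh_log_ratio(loglik_new, loglik_old,
                 logprior_new=0.0, logprior_old=0.0,
                 log_q_reverse=0.0, log_q_forward=0.0):
    """log of (likelihood ratio x prior ratio x proposal ratio).

    log_q_reverse is log f(T | T'), the density of proposing the old tree
    from the new one; log_q_forward is log f(T' | T).
    """
    return ((loglik_new - loglik_old)
            + (logprior_new - logprior_old)
            + (log_q_reverse - log_q_forward))


def mh_accept(log_ratio, rng=random):
    """Accept with probability R = min(1, exp(log_ratio))."""
    if log_ratio >= 0.0:
        return True
    return rng.random() < math.exp(log_ratio)


# Example with illustrative log-likelihoods: the new tree is 3 log units worse.
log_r = mh_log_ratio(loglik_new=-14763.0, loglik_old=-14760.0)
print(log_r, mh_accept(log_r, random.Random(0)))
```

Working with differences of log-likelihoods avoids the underflow that would occur if likelihoods near exp(-14760) were exponentiated directly.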

Primer on finding the likelihood of a phylogenetic tree

1. Aligned gene-sequence data

Sheep    ATGGTGAAAA GCCACATAGG CAGTTGGATC CTGGTTCTCT TTGTGGCCAT
Human    ATGGCGAA-- ----CCTTGG CTGCTGGATG CTGGTTCTCT TTGTGGCCAC
Gorilla  ATGGCGAA-- ----CCTTGG CTGCTGGATG CTGGTTCTCT TTGTGGCCAC
Mink     ATGGTGAAAA GCCACATAGG CAGCTGGCTC CTGGTTCTCT TTGTGGCCAC

2. Model of sequence evolution: Q

3. The probability of sequence substitutions in a given branch of the phylogeny:

$$P(t) = \exp[Qt]$$

4. The likelihood of a given phylogenetic tree:

$$L = \prod_{\text{branches}} P(t) = \prod_{\text{branches}} \exp[Qt]$$

5. Search alternative topologies

[Figure: alternative trees for the four species, with tips ordered H, G, S, M and H, M, S, G.]

The probability of observing the ith nucleotide site is

$$p_i(x \mid T, v, Q) = \sum_{n=1}^{4^{s-1}} w_{\mathrm{root}(i)} \prod_{k=1}^{s} p_{n_k, x_{ki}}(v_k, Q) \prod_{k=1}^{s-2} p_{n'_k, n_k}(v_k, Q)$$

where
- $w_{\mathrm{root}(i)}$ is the prior weight for the ith site,
- the sum runs over the $4^{s-1}$ possible assignments of the ancestral nodes (e.g., 64),
- the first product is over the s branches leading to species,
- the second product is over the s-2 internal branches.

The likelihood is the product over all i nucleotides:

$$L(x \mid T, v, Q) = \prod_i p_i(x \mid T, v, Q)$$
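
The site-likelihood sum can be written out directly for four species. The sketch below is illustrative only: it assumes the Jukes-Cantor model (for which exp[Qt] has a simple closed form), a rooted tree ((H,G),(S,M)), equal root weights of 1/4, and arbitrary branch lengths. None of these choices is stated on the slides. It enumerates all 4^3 = 64 assignments of the three ancestral nodes for the first eight gap-free columns of the alignment.

```python
import itertools
import math

BASES = "ACGT"


def jc69(a, b, v):
    """Jukes-Cantor probability of base a changing to base b over branch length v."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_likelihood(tips, v_tip, v_internal):
    """Sum over the 64 ancestral assignments for the tree ((H,G),(S,M))."""
    total = 0.0
    for root, hg, sm in itertools.product(BASES, repeat=3):
        term = 0.25  # assumed prior weight of the root state
        # Product over the s - 2 = 2 internal branches.
        term *= jc69(root, hg, v_internal) * jc69(root, sm, v_internal)
        # Product over the s = 4 branches leading to species.
        term *= jc69(hg, tips["H"], v_tip) * jc69(hg, tips["G"], v_tip)
        term *= jc69(sm, tips["S"], v_tip) * jc69(sm, tips["M"], v_tip)
        total += term
    return total


def log_likelihood(columns, v_tip=0.1, v_internal=0.05):
    """Log of the product of site likelihoods (branch lengths are illustrative)."""
    return sum(math.log(site_likelihood(c, v_tip, v_internal)) for c in columns)


# First eight (gap-free) alignment columns for Sheep, Human, Gorilla, Mink.
ALIGNMENT = {"S": "ATGGTGAA", "H": "ATGGCGAA", "G": "ATGGCGAA", "M": "ATGGTGAA"}
columns = [{name: seq[i] for name, seq in ALIGNMENT.items()} for i in range(8)]
print(log_likelihood(columns))
```

Brute-force enumeration grows as 4^(s-1) and is only feasible for a handful of species; it is used here because it mirrors the formula term by term.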

Convergence of a Markov Chain
[Figure: log-likelihood of topology against position in Markov chain (about 0 to 22,250), with an inset enlarging the range from about -14745 to -14825. Data: 20000Out1.lpd.]

Tree Likelihoods from Markov-Chain Monte Carlo Simulation
n = 54 taxa, Ascomycota fungi; n = 10,000 trees
[Figure: histogram of frequency against log-likelihood of tree (about -14800 to -14730). Data: 2911 best.lpd.]

Character Transition Rates: gains and losses of lichenization

                  Single Tree*        MCMC Integration**
gains (q01)       1.04 (0.05-4.5)     1.47 ± 0.32
losses (q10)      2.41 (0.7-5.6)      2.12 ± 0.31

*consensus tree
**n = 20,000 trees

MCMC: Some Issues
- Lack of convergence (poor mixing)
- Tree and parameter proposal mechanisms
- Tree and parameter updating schedules
- Detecting convergence

Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC)

Given m simultaneous Markov chains, update the chains, then swap states among a randomly chosen pair i and j each iteration according to:

$$R = \min\left[1,\ \frac{f_i(y_i)\, f_j(y_j)}{f_i(x_i)\, f_j(x_j)}\right]$$

{likelihood ratio chain i × likelihood ratio chain j}

[Diagram: chains with states x_i, x_j, x_k and y_i, y_j, y_k.]

Probability of swapping with chain i = R × 2/m

Temperatures of heated chains
[Figure: temperature (cold to hot) against number of chains, i, for the schemes 1/i and 1/(1 + t(i-1)) with t = 0.2 and t = 0.5; the cold chain is marked.]
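
The swap step can be sketched as follows. Two assumptions are made that the slides do not confirm: that chain i samples from the target raised to the power b_i = 1/(1 + t(i-1)), and that a swap exchanges the two states, so that y_i = x_j and y_j = x_i. Under those assumptions log R = (b_i - b_j)(log f(x_j) - log f(x_i)). All function names are hypothetical.

```python
import math
import random


def heats(m, t):
    """Heating values 1 / (1 + t * (i - 1)) for chains i = 1..m; chain 1 is cold."""
    return [1.0 / (1.0 + t * k) for k in range(m)]


def log_swap_ratio(beta_i, beta_j, logf_i, logf_j):
    """log R for exchanging the states of chains i and j (assumed power heating)."""
    return (beta_i - beta_j) * (logf_j - logf_i)


def try_swap(states, logfs, betas, rng=random):
    """Pick a random pair of chains and swap their states with probability R."""
    i, j = rng.sample(range(len(states)), 2)
    log_r = log_swap_ratio(betas[i], betas[j], logfs[i], logfs[j])
    if log_r >= 0.0 or rng.random() < math.exp(log_r):
        states[i], states[j] = states[j], states[i]
        logfs[i], logfs[j] = logfs[j], logfs[i]
        return True
    return False


betas = heats(4, 0.5)
print(betas)

# Illustrative values: the heated chain currently holds the better tree.
states = ["tree_a", "tree_b"]
logfs = [-37810.0, -37790.0]
print(try_swap(states, logfs, heats(2, 0.5), random.Random(0)), states)
```

Setting t = 1 in `heats` reproduces the 1/i scheme, and smaller t gives heated chains that stay closer to the cold chain. A swap that moves a higher-likelihood state onto a colder chain is always accepted under this rule.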

[Figure: two panels of log-likelihood against generation for the 54-taxa Ascomycota data (n = 858 nucleotides).]

Phylogeny of Human LINE-1 elements (92 elements, 4 kb sequences)
[Figure: phylogeny with element tips labelled C1.1 to C1.20, C6.1 to C6.20, C21.1 to C21.20, and C22.1 to C22.18, further labels including possum, mouse3, L1 gorilla, and B-globi; scale bar 0.1; time axis marked at ~120, ~90, and ~10-15 millions of years ago.]

MCMCMC Analysis of LINEs data
92 LINE elements, 4000 nucleotides
[Figure: log-likelihood against generation for 4-chain and 1-chain runs.]

LINEs data: simultaneous chains with heating and swapping
[Figure: log-likelihood against generation for the cold, warm, and hot chains, with chain swapping marked.]

LINEs data: log-likelihoods of trees from the cold chain (converged chain)
[Figure: histogram of count against log-likelihood (about -37820 to -37760) for pre-swap trees and post-swap trees.]

Phylogeny of Human LINE-1 elements (92 elements, 4 kb sequences)
[Figure: the LINE-1 phylogeny with the same tip labels, scale bar, and time axis as before, here with the value 100 marked at three points in the tree.]

Some topics and issues

MCMC (getting stuck and slow progress)
- Tree proposal algorithms (random, deep, shallow, ???)
- Alternation among suites of parameters (topology, branch lengths, model parameters): what schedule to use?
- Optimisation alternating with bouts of M-H selection?

MCMCMC (highly inefficient)
- Tree swapping: encourage early tree swapping?
- Once converged, cool heated chains and use in inference?