Mixture Models in Phylogenetic Inference. Mark Pagel and Andrew Meade Reading University.

Mixture Models in Phylogenetic Inference Mark Pagel and Andrew Meade Reading University m.pagel@rdg.ac.uk

Mixture models in phylogenetic inference!some background statistics relevant to phylogenetic inference!mixture models!a mixture model for pattern heterogeneity Q!performance and results!real data!a mixture model for heterotachy t!4-taxon simulation study!performance and results!application to real data!mixtures of topologies T

The growth of GenBank

Growth in the use of molecular phylogenies in scientific research 22000 20000 18000 16000 14000 12000 10000 8000 6000 4000 2000 0 Cumulative publications: molecular AND phylogen*24000 PCR made available Nobel Prize to Kerry Mullis 2 years 3 years 4 years 1984 1988 1992 1996 2000 2004 2008 Year projected 2007 source: Web of Science, April, 2007. Search terms molecular AND phylogen*. Line of best fit is a quadratic spline or quadratic polynomial

Mixture models in phylogenetic inference L(Q)! P(D Q) conventional likelihood model mixture model in Q P(D Q,T) = " i P(D i Q,T) P(D Q 1,Q 2, Q J,T) = " i #w j P(D i Q j,t) Let T = (S, t) mixture model in t P(D Q,t 1,t 2 t J ) = " i #w j P(D i Q,t j )

Overview: Phylogenetic Inference from gene sequences 1. Homogeneous Process Model: A C G T Sp 1 AACGTTGTCCCTT Sp 2 AACGTTCCTTGCT.. Sp n CCGGTTGCAAGCT A --- q ac q ag q at C --- q cg q ct G --- q gt =1 T --- 2. popular modifications to basic homogeneous model a) rate-heterogeneity (Yang, 1994): apply homogeneous model but allow rates to vary over sites according to a discrete gamma. Equivalent to fitting k rate matrices to each site, where the matrices differ from each other only by a proportional scalar b) partitioning data: apply a different rate matrix to different sites, chosen beforehand by the investigator

Limitations to homogeneous, rate-heterogeneity and partitioning approaches Homogeneous model: not all sites may evolve according to the same model Gamma model: rate variation may not be gamma or sites may vary in some other way Partitioning data: presumes investigator knows with certainty which model best applies to each site

Pattern-Heterogeneity Model of Gene-Sequence Evolution Allow for different sites to evolve in quantitatively or qualitatively different ways without prior partitioning by the investigator. Method: fit more than one model of evolution to each site, summing the likelihood at each site over all models. Allow the data to determine the best model for each site. Motivation for model: Pattern-heterogeneity model will always equal or better the performance of homogeneous or gamma rate heterogeneity models and normally improve on partitioning models. Frequently yields substantial improvements (100 s of log-units) Applications Phylogenetic inference Detecting regions within genes that evolve differently Detecting differences among genes

Applications of pattern-heterogeneity model Single gene alignment species 1 species n pattern 1 pattern 2 pattern 3 Concatenated gene alignment species 1 species 2 gene 1 gene 2 gene 3 gene k... species n Supermatrix alignment species 1 species 2 species n-k species n

Detecting pattern-heterogeneity in simulated data Method: I) generate simulated gene-sequence data on a random tree according to two different models of sequence evolution, creating two genes with different patterns of substitutions 2) analyse simulated data using homogeneous, gamma rates and pattern-heterogeneity model Tree used in simulations rate matrices A C G T A --- 4.97 2.11 3.41 3.13 C --- 0.35 3.87 0.82 0.34 2.82 1.49 G --- 1 1 T --- Data: two genes of length 1200 and 800 sites

Posterior likelihoods for data simulated from two qualitatively distinct rate matrices -100200-100300 -100400 log-likelihood -100500-100600 -100700-100800 Homogeneous Gamma (4) Pattern-Homogeneity -100900-101000 500000 1500000 2500000 3500000 4500000 Iteration of Markov Chain A <--> C A <--> G A <--> T C <--> G C <--> T G <--> T Q1 4.97 3.41 0.82 0.35 2.82 1 Q2 2.11 3.13 0.34 3.87 1.49 1

Pattern-heterogeneity model: Simulated and estimated values of the rate parameters 5 A C G T Estimated values of rate parameters 4 3 2 1 A --- 4.97 2.11 3.41 3.13 C --- 0.35 3.87 0.82 0.34 2.82 1.49 G --- 1 1 T --- Q2 Q1 0 0 1 2 3 4 5 Actual values of rate parameters

Detecting pattern-heterogeneity in simulated data 15 log(q1 likelihood) - log(q2 likelihood) 10 5 0-5 -10-15 gene 1 gene 2-20 0 250 500 750 1000 1250 1500 1750 2000 Site in simulated sequence

Detecting rate-heterogeneity in simulated data Method: I) generate simulated gene-sequence data on a random tree according to a gamma rate heterogeneity model (continuous-gamma, $=1.0) 2) analyse data using homogeneous model, gamma rates (4 categories) and patternheterogeneity model (4 rate matrices) Tree used in simulations rate matrices A C G T A --- 4.97 3.41 0.82 C --- 0.35 2.82 G --- 1 T --- Data: a gene of length 2000 sites

Posterior likelihoods for data simulated using four gamma-distributed rate categories -82000-83000 -84000-82840 -82860-85000 -82880-82900 log-likelihood -86000-87000 -88000-82920 -82940 0 2000000 4000000-89000 -90000 Q(4) Q(1) Gamma(4) -91000-92000 0 1000000 2000000 3000000 4000000 5000000 Iteration of Markov Chain

Average estimated transition rates from pattern-heterogeneity model applied to simulated gamma rate heterogeneity data: input values shown

Detecting secondary structure in Mitochondrial ribosomal 12s sequences n=57 mammals

Detecting secondary structure using the pattern-heterogeneity model Stem and loop structure leads to predictions about the pattern of nucleotide substitutions stems: dominated by Watson-Crick base pairings. Therefore expect compensatory substitutions to maintain Watson-Crick pairings. Non-compensatory changes have lower fitness. This predicts that transitional changes will occur at a far higher rate than transversional changes. Often observe Tr/Tv ratio of 10-20 in mitochondrial DNA loops: no a priori substitutional pattern expected

Detecting secondary structure; likelihoods of pattern-heterogeneity and Gamma models -22900-22950 -23000 log-likelihood -23050-23100 -23150 Gamma (4) Pattern-heterogeneity (4) -23200-23250 -23300 0 1E7 2E7 3E7 4E7 5E7 Iteration number

Mammal 12S data: analysis of rate matrices A <-> C A <-> G A <-> T C <-> G C <-> T G <-> T Tr/Tv Q1 13.75 10.52 7.89 1.69 59.65 3.34 5.26 Q2 44.25 66.37 35.57 10.64 88.46 3.59 3.29 Q3 1.60 45.29 1.77 1.59 23.30 1.62 20.84 Q4 0.30 0.80 0.18 0.20 1.62 0.10 6.26 Best fit to loop or stem Q1 Q2 Q3 Q4 Stem 57 13 122 272 Purines A G Pyrimidines C T Loop 102 150 57 236

Application of pattern heterogeneity to n=159 datasets (using a reversible-jump algorithm) RJ-algorithm allows GTR rate matirices to be gained and lost using split and merge moves Q split Q i merge Q Q j Jumps to higher or lower dimensions governed by: likelihood ratio % prior ratio % proposal ratio % jacobian Mean no. of taxa =39.9±32, median = 29 (range 7-178) Mean alignment size = 1983±2545 nucleotides, median = 1140 (range 204-16937) Mean no. of genes = 2.7±4.5, median = 1, (range 1-37) Allow RJ algorithm to run to apparent convergence Improvement over conventional modelling How many Qs?

Distribution of number of rate matrices (RJ-algorithm) and empirical weight 1 mean = 2.2±1.4 median =2 correlation with no. of taxa, r 2 =0.15 no. of sites in alignment, = r 2 =0.27 no. of genes, r 2 =0.25 no. of taxa has highest partial correlation 1 2 3 4 5 6 7 8 9 10 number of rate matrices Weight 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 2 3 4 5 6 7 8 9 Position

Improvement in log-likelihood over a single Q model 100 mean improvement = 265±608 log-units 75 mean improvement = 379± 698 for n=111 datasets with > 1 Q (light green) 50 25-500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Increase in log-likelihood over single Q model

Improvement in log-likelihood versus no. of rate matrices 4000 log-likelihood difference 3000 2000 1000 0 0 1 2 3 4 5 6 7 8 9 10 No Q's note: curve is a quadratic fit

Conclusions Mixture model can detect pattern-heterogeneity in simulated and real data and recover the parameters of the model of sequence evolution Can lead in general to large improvements in likelihood over single Q model Can improve topology and reduce long-branch attraction Can reduce the node density artefact Can be used to investigate gene evolution (such as secondary structure) or be applied to concatenated data sets of multiple genes.

Heterotachy mixture model in t Mixture model in branch lengths, t P(D Q,t 1,t 2 t J ) = " i #w j P(D i Q,t j ) A B C A B C A B C What does the model deliver?!variation in rates across the tree (heterotachy): a non-parametric covarion? basic covarion model (Tuffley and Steel, 1998) Q (off or slow) - s ij I s ij I Questions!does it work?!can we estimate sets of branch lengths?!are there a small number of sets in real data? Q= s ji I Q (on or fast)- s ji I

Does the branch-lengths model work? Variable evolutionary rates over time: Where Parsimony beats ML site pattern 1 site pattern 2 A C A C p r B q D B D Site-specific variation in rates of evolution, causes maximum likelihood but not parsimony to find the wrong tree. One-half of the sites vary their rates as in the left diagram, and the remaining sites vary according to the right diagram. Maximum likelihood tends to put A and C and B and D together, whereas parsimony tends to find the true tree (Kolaczkowski and Thornton, 2004)

Simulation!four taxon tree A site pattern 1 site pattern 2 C A C!two sets of site patterns p r!750 sites per pattern B q D B D!generate 5000 random data sets of a hard case: p=0.75, q=0.05, r={0,0.4}!analyse with MP, JC-MCMC, Covarion, BL-MCMC (blind) Do we get back the true tree? A B C D

Information ratio = xxyy/(xyxy + xyyx) A r C e.g., aacc/(acac+acca) B D 6 5 information ratio 4 3 2 1 0 0.1.2.3.4.5 r

A C 1 Probability of True Tree versus r B r D 0.9 0.8 p 50 scores in r 0.7 0.6 Parsimony = 0.220 p 50 Y 0.5 JC MCMC = 0.284 0.4 Covarion = 0.280 0.3 BL MCMC = 0.039 0.2 0.1 0 0.1.2.3.4 R r 0.039 0.284 BL MCMC: recovered two branch length sets of correct lengths

Probability of recovering true tree vs. information ratio: BL MCMC model 1.8 probability correct.6.4.2 0 0 1 2 3 4 5 information ratio

Can we estimate sets of branch lengths? Simulated branch-length sets data* Two branch length sets Analysis Likelihood Variance Weight BLS 1-131949 68.53649 1 BLS 2-131020 66.02778 0.491023 0.508977 BLS 3-131021 77.96875 0.003187 0.480406 0.516406 BLS 4-130933 93.67612 0.504338 0.003581 0.318459 0.173622 Three branch length sets Analysis Likelihood Variance Weights BLS 1-198450 52.2581 1 BLS 2-197717 53.65423 0.666638 0.333362 BLS 3-196898 73.11806 0.323386 0.355342 0.321271 BLS 4-196899 347.5093 0.322306 0.347741 0.31752 0.012432 Four branch length sets Analysis Likelihood Variance Weights BLS 1-266091 39.67153 1 BLS 2-265204 114.0733 0.24488 0.75512 BLS 3-264547 122.6783 0.268026 0.273999 0.457976 BLS 4-263799 58.81039 0.240158 0.239082 0.260894 0.259867 *Data simulated on a tree of about 60 tips. 1500 sites. Likelihood is mean from Markov chain; Variance is of the likelihood.

A reversible-jump algorithm for heterotachy -- allow each branch to choose one or more length -- need mechanism for splitting and merging branches -- test on simulated data 0.2 branch length set 2 0.15 0.1 0.05 0 0.05.1.15.2 branch length set 1

Data simulated to have two branch-length sets log-likelihoods returned by one and two branch-length sets models and RJ algorithm -130900-131000 -131100-131200 -131300-131400 Y -131500-131600 -131700-131800 Y log-likelihood BLS 1 log-likelihood BLS 2 log-likelihood RJ +925 log-units over BLS 1 138 extra branches +870 log-units 73±2.46 branches 94% of improvement in log-likelihood with 65 fewer parameters -131900-132000 -132100 0 10000000 20000000 30000000 40000000 50000000 iteration

Time series plot of no. of extra branches inferred from RJ algorithm No. branches 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 33000000 34000000 35000000 iteration

Eukaryote phylogeny of 48 species based on four genes (EF1-$, EF2- $, and two trna synthetases) One branch-length set (homogeneous model) log-likelihood= -229785 two branch-length set log-likelihood= -228891 894 log-units for 94 extra branches RJ-algorithm log-likelihood = -228923 862 log-units(96%) for 45± 1.4 extra branches

Summary Branch-lengths model seems to work Can estimate branch length sets RJ algorithm can identify in which branches the heterotachy is most pronounced Uses Improve phylogenetic inference (e.g., should reduce long branch attraction) Identify where in tree and for which sites in the alignment the heterotachy is occurring Sp 1 AACGTTGTCCCTT Sp 2 AACGTTCCTTGCT.. Sp n CCGGTTGCAAGCT

A mixture of topologies L(Q)! P(D Q) conventional likelihood model mixture model in Q (Pagel and Meade, 2004) P(D Q,T) = "P(D i Q,T) i P(D Q 1,Q 2, Q J,T) = "#w j P(D i Q j,t) i j mixture model in T P(D Q,T 1,T 2 T J ) = "#w j P(D i Q,T j ) i j

Some causes of conflicting phylogenetic signals -- gene trees vs species trees (coalescent events) -- convergent evolution (e.g., lysozyme in cows and monkeys) -- gene duplication -- lateral gene transfer gene-duplication lateral gene transfer

How does the multiple-topologies mixture model work? conventional (Bayesian) inference alignment T likelihood iteration Li=1=P(D T1) Li=2=P(D T2) Li=3=P(D T3) 1 2 3..... n multiple-topologies mixture model alignment.. Li=n=P(D Tn) k distinct Markov chains T1 T2..Tk combined likelihood iteration 1 2 3..... n Converged combined likelihood: Li=n=w1L1+w2L2+..wkLk combined likelihood allows tree to diverge if this increases L calculate posterior probabilities of each tree separately Li=1=w1L1+w2L2+..wkLk Li=2=w1L1+w2L2+..wkLk Li=3=w1L1+w2L2+..wkLk.. Li=n=w1L1+w2L2+..wkLk

Mixture model in T mixture model in T P(D Q,T 1,T 2 T J ) = "#w j P(D i Q,T j ) i j A B C A B C A C A C B B What does the model deliver?!variation in the topology!variation in rates across the tree: a non-parametric covarion? Questions!does it work?!can we estimate sets of topologies/branch lengths?!are there a small number of sets in real data?

Does the multiple-topology model work? Simulation of two-topologies A topology 1 topology 2 C A B!two four taxon trees!two topologies p B q r D p C q r D!generate 20,000 random alignments of 1000 sites sets. Vary the proportion of sites per topology!analyse with Bayesian MCMC using: one-tree model (conventional), two-tree model Can we detect two trees?

Converged Markov chains for one and two-topology models -3050 Data: 1000 sites simulated up two different trees (80/20 split) two-topology model -3060 log-likelihood -3070 one-topology model -3080 0 100000 300000 500000 700000 900000 1100000 Iteration of Markov chain

Proportion of sites favouring topology 1 topology 2 Outcome one-topology model two-topology model 100 0 A B C D A B C D 50 50 A D A C A B B C B D C D 0 100 A B A B C D C D

Performance of one and two-topology models given conflicting phylogenetic signal A B (1) C A (2) D C B D 1 0.9 Posterior support for topology 1 0.8 0.7 0.6 0.5 0.4 0.3 two-topology model Analysis one-tree model two-tree model topology 1 two-tree model topology 2 0.2 0.1 0 one-topology model 0.1.2.3.4.5.6.7.8.9 1 Proportion of sites favouring topology 1

significant differences detected with ~ 10% of sites 1 0.9 0.8 Posterior support for topology 1 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.1.2.3.4.5.6.7.8.9 1 Proportion of sites favouring topology 1 likelihood difference 240 220 200 180 160 140 120 100 80 60 40 20 0 0.1.2.3.4.5.6.7.8.9 1 Proportion of sites favouring topology 1

Estimating the number of topologies Data: 1000 sites simulated up two different trees of 50 tips, 60/40 split Model Log-likelihood s.d W 1 W 2 W 3 One-topology -85222.8 7.2 Two topologies -64920.4 9.9 0.599294 0.400706 Three topologies -64909.9 14.8 0.598558 0.399572 0.001871

Conflicting signal for ecdysozoa/coelomata phylogenies Nematodes (e.g., C. elegans) Arthropods (e.g., Drosophila) Chordates (e.g., humans) Nematodes (e.g., C. elegans) Arthropods (e.g., Drosophila) Chordates (e.g., humans) ecdysozoa coelomata dataset: 13,946 BP alignment of fourteen genes from four eukaryotic species, Homologene (NCBI) database Number Name Length 1 Beta tubulin - 2083 536 2 EEF1A1-68181 926 3 HSP40-4B-56013 700 4 HSP70-4-1624 1696 5 HSP70-5-3908 1310 6 HSP70-9B-39452 1304 7 HSP70-90 1162 8 HSP90-1A 1414 9 14s 302 10 18s 304 11 40s 522 12 RNA-Bind-Motif-19 1974 13 TUBA6 896 14 TUBB. 900

one-topology model: likelihood = -45435.9±2.58 data from Homologene database (NCBI) posterior support for coelomata = 87 tree length 1.01±0.03

One-topology model versus two-topology results likelihood = -45435.9±2.58 likelihood = -45382.8±3.01 weight = 0.57 weight = 0.43 coelomata coelomata ecdysozoa tree length 1.01±0.03 tree length 0.93±0.13 tree length 1.33±0.3 note: two recent papers support ecdysozoa!log-l = 53.1 log units

Simulation of null hypothesis distribution of differences in likelihood n=1000 simulated data sets of 13,946 sites. Fit one and two topology model to each. Find difference in likelihoods. n=1000 < 53, p<0.001 log(l2)-log(l1)

Prokaryote phylogenetics: placement of hyperthermophiles Log likelihood =-199,345

Prokaryotes: two topologies showing that hyperthermophiles become derived (longer tree) w 1 =0.55 w 2 =0.45 Log likelihood =-194,022!L = 5323

Summary Multiple topologies model seems to work Can estimate more than one topology and identify conflicting signal directly Some questions How to test? How many distinct trees in real data?

Acknowledgement BBSRC and NERC (UK)

Phylogeny of the Ascomycota Fungi showing the evolution of lichen-formation lichen forming ambiguous not lichen forming

log-likelihoods for combined LSU/SSU nrdna data set: 54 Ascomycota species n=800 sites -14400-14500 log-likelihood -14600-14700 -14800-14900 Q(2) Gamma(4) Q(4) -15000 0 250000 500000 750000 1000000 1250000 1500000 1750000 2000000 iteration number

Site by site analysis fitting two independent rate matrices: LSU/SSU nrdna Ascomycota combined data set SSU/LSU divide log-likelihood Q1 - log-likelihood Q2 15 10 5 0-5 -10-15 -20-25 -30-35 0 100 200 300 400 500 600 700 800 Note: data modelled with two independent rate matrices site in the alignment

Compensatory substitutions transitional changes transitional changes tranversinal changes low fitness Purines A G Pyrimidines C U

Detecting secondary structure: application of pattern-heterogeneity model to 12s mitochondrial DNA data on 57 mammals Tree here

Mixture models in phylogenetic inference!mixture models!a mixture model in T!simulation study!performance at estimating T s!application to ecdysozoa/coelomata and prokaryotes

Detecting Conflicting Phylogenetic Signals a mixture-model approach

Average estimated transition rates from gamma and pattern-heterogeneity models applied to simulated gamma rate heterogeneity data: input values shown 14 12 10 A C G T A --- 4.97 3.41 0.82 C --- 0.35 2.82 G --- 1 T --- average rate = 2.22 5.48 Average 8 6 2.19 4 2 0 0.2 5 0.9 4-2 Q1 Gamma1 Q2 Gamma2 Q3 Gamma3 Q4 Gamma4

Results for branch length sets RJ model for branch lengths RJ for each branch. How many branches? identifying where and when

Is the number of branch lengths sets small in real data? Two prokaryote data sets 1) rrna 751 sites 2) ribosomal + proteins ~ 6000 sites Conventional likelihood model Beta proteobacteria Gamma proteobacteria Alpha proteobacteria Epsilon proteobacteria Chlamydiae Spirochaete Cyanobacteria DeinococcusThermus Actinobacteria Firmicutes Fusobacteria Thermotoga Aquificae Nanoarchaeota Crenarchaeota Euryarchaeota Branch-lengths mixture model Beta Proteobacteria Gamma Proteobacteria Alpha Proteobacteria Epsilon Proteobacteria Chlamydiae Spirochaetes Actinobacteria Aquifex Thermotoga Fusobacterium Firmicutes DeinococcusThermus Cyanobacteria Nanoarchaeota Crenarchaeota Euryarchaeota