Mixture Models in Phylogenetic Inference. Mark Pagel and Andrew Meade Reading University.

Similar documents
Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Phylogenetics: Building Phylogenetic Trees

Concepts and Methods in Molecular Divergence Time Estimation

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

BINF6201/8201. Molecular phylogenetic methods

Phylogenetics. BIOL 7711 Computational Bioscience

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Dr. Amira A. AL-Hosary

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Constructing Evolutionary/Phylogenetic Trees

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Anatomy of a species tree

A Bayesian Approach to Phylogenetics

A (short) introduction to phylogenetics

Accounting for Phylogenetic Uncertainty in Comparative Studies: MCMC and MCMCMC Approaches. Mark Pagel Reading University.

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Phylogenetic Assumptions

Constructing Evolutionary/Phylogenetic Trees

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Genomics and Bioinformatics

The Importance of Data Partitioning and the Utility of Bayes Factors in Bayesian Phylogenetics

EVOLUTIONARY DISTANCES

BIOLOGICAL SCIENCE. Lecture Presentation by Cindy S. Malone, PhD, California State University Northridge. FIFTH EDITION Freeman Quillin Allison

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Inferring Molecular Phylogeny

Today s Lecture: HMMs

MOLECULAR SYSTEMATICS: A SYNTHESIS OF THE COMMON METHODS AND THE STATE OF KNOWLEDGE

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

1 ATGGGTCTC 2 ATGAGTCTC

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Estimating Evolutionary Trees. Phylogenetic Methods

Consensus Methods. * You are only responsible for the first two

Phylogenetic inference

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Intraspecific gene genealogies: trees grafting into networks

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Phylogenetic Tree Reconstruction

Today's project. Test input data Six alignments (from six independent markers) of Curcuma species

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Rooting Major Cellular Radiations using Statistical Phylogenetics

1. Can we use the CFN model for morphological traits?

Molecular Evolution & Phylogenetics

Phylogenetic Inference using RevBayes

Phylogenetic methods in molecular systematics

Graph Alignment and Biological Networks

Jed Chou. April 13, 2015

Supplementary material to Whitney, K. D., B. Boussau, E. J. Baack, and T. Garland Jr. in press. Drift and genome complexity revisited. PLoS Genetics.

MCMC: Markov Chain Monte Carlo

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species

Reconstruire le passé biologique modèles, méthodes, performances, limites

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

Evaluating the usefulness of alignment filtering methods to reduce the impact of errors on evolutionary inferences

8/23/2014. Phylogeny and the Tree of Life

Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them?

Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

Understanding relationship between homologous sequences

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria

Biology 211 (2) Week 1 KEY!

Inferring Speciation Times under an Episodic Molecular Clock

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley


Molecular phylogeny How to infer phylogenetic trees using molecular sequences

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2011 University of California, Berkeley

Lecture 4. Models of DNA and protein change. Likelihood methods

MODELING EVOLUTION AT THE PROTEIN LEVEL USING AN ADJUSTABLE AMINO ACID FITNESS MODEL

PHYLOGENOMICS AND THE RECONSTRUCTION OF THE TREE OF LIFE

Motivating the need for optimal sequence alignments...

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Macroevolution Part I: Phylogenies

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Molecular evidence for multiple origins of Insectivora and for a new order of endemic African insectivore mammals

Molecular Clocks. The Holy Grail. Rate Constancy? Protein Variability. Evidence for Rate Constancy in Hemoglobin. Given

Recovering evolutionary history by complex network modularity analysis

CLASSIFICATION. Why Classify? 2/18/2013. History of Taxonomy Biodiversity: variety of organisms at all levels from populations to ecosystems.

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Algorithms in Bioinformatics

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Lecture 3: Markov chains.

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Transcription:

Mixture Models in Phylogenetic Inference Mark Pagel and Andrew Meade Reading University m.pagel@rdg.ac.uk

Mixture models in phylogenetic inference!some background statistics relevant to phylogenetic inference!mixture models!a mixture model for pattern heterogeneity Q!performance and results!real data!a mixture model for heterotachy t!4-taxon simulation study!performance and results!application to real data!mixtures of topologies T

The growth of GenBank

Growth in the use of molecular phylogenies in scientific research 22000 20000 18000 16000 14000 12000 10000 8000 6000 4000 2000 0 Cumulative publications: molecular AND phylogen*24000 PCR made available Nobel Prize to Kerry Mullis 2 years 3 years 4 years 1984 1988 1992 1996 2000 2004 2008 Year projected 2007 source: Web of Science, April, 2007. Search terms molecular AND phylogen*. Line of best fit is a quadratic spline or quadratic polynomial

Mixture models in phylogenetic inference L(Q)! P(D Q) conventional likelihood model mixture model in Q P(D Q,T) = " i P(D i Q,T) P(D Q 1,Q 2, Q J,T) = " i #w j P(D i Q j,t) Let T = (S, t) mixture model in t P(D Q,t 1,t 2 t J ) = " i #w j P(D i Q,t j )

Overview: Phylogenetic Inference from gene sequences 1. Homogeneous Process Model: A C G T Sp 1 AACGTTGTCCCTT Sp 2 AACGTTCCTTGCT.. Sp n CCGGTTGCAAGCT A --- q ac q ag q at C --- q cg q ct G --- q gt =1 T --- 2. popular modifications to basic homogeneous model a) rate-heterogeneity (Yang, 1994): apply homogeneous model but allow rates to vary over sites according to a discrete gamma. Equivalent to fitting k rate matrices to each site, where the matrices differ from each other only by a proportional scalar b) partitioning data: apply a different rate matrix to different sites, chosen beforehand by the investigator

Limitations to homogeneous, rate-heterogeneity and partitioning approaches Homogeneous model: not all sites may evolve according to the same model Gamma model: rate variation may not be gamma or sites may vary in some other way Partitioning data: presumes investigator knows with certainty which model best applies to each site

Pattern-Heterogeneity Model of Gene-Sequence Evolution Allow for different sites to evolve in quantitatively or qualitatively different ways without prior partitioning by the investigator. Method: fit more than one model of evolution to each site, summing the likelihood at each site over all models. Allow the data to determine the best model for each site. Motivation for model: Pattern-heterogeneity model will always equal or better the performance of homogeneous or gamma rate heterogeneity models and normally improve on partitioning models. Frequently yields substantial improvements (100 s of log-units) Applications Phylogenetic inference Detecting regions within genes that evolve differently Detecting differences among genes

Applications of pattern-heterogeneity model Single gene alignment species 1 species n pattern 1 pattern 2 pattern 3 Concatenated gene alignment species 1 species 2 gene 1 gene 2 gene 3 gene k... species n Supermatrix alignment species 1 species 2 species n-k species n

Detecting pattern-heterogeneity in simulated data Method: I) generate simulated gene-sequence data on a random tree according to two different models of sequence evolution, creating two genes with different patterns of substitutions 2) analyse simulated data using homogeneous, gamma rates and pattern-heterogeneity model Tree used in simulations rate matrices A C G T A --- 4.97 2.11 3.41 3.13 C --- 0.35 3.87 0.82 0.34 2.82 1.49 G --- 1 1 T --- Data: two genes of length 1200 and 800 sites

Posterior likelihoods for data simulated from two qualitatively distinct rate matrices -100200-100300 -100400 log-likelihood -100500-100600 -100700-100800 Homogeneous Gamma (4) Pattern-Homogeneity -100900-101000 500000 1500000 2500000 3500000 4500000 Iteration of Markov Chain A <--> C A <--> G A <--> T C <--> G C <--> T G <--> T Q1 4.97 3.41 0.82 0.35 2.82 1 Q2 2.11 3.13 0.34 3.87 1.49 1

Pattern-heterogeneity model: Simulated and estimated values of the rate parameters 5 A C G T Estimated values of rate parameters 4 3 2 1 A --- 4.97 2.11 3.41 3.13 C --- 0.35 3.87 0.82 0.34 2.82 1.49 G --- 1 1 T --- Q2 Q1 0 0 1 2 3 4 5 Actual values of rate parameters

Detecting pattern-heterogeneity in simulated data 15 log(q1 likelihood) - log(q2 likelihood) 10 5 0-5 -10-15 gene 1 gene 2-20 0 250 500 750 1000 1250 1500 1750 2000 Site in simulated sequence

Detecting rate-heterogeneity in simulated data Method: I) generate simulated gene-sequence data on a random tree according to a gamma rate heterogeneity model (continuous-gamma, $=1.0) 2) analyse data using homogeneous model, gamma rates (4 categories) and patternheterogeneity model (4 rate matrices) Tree used in simulations rate matrices A C G T A --- 4.97 3.41 0.82 C --- 0.35 2.82 G --- 1 T --- Data: a gene of length 2000 sites

Posterior likelihoods for data simulated using four gamma-distributed rate categories -82000-83000 -84000-82840 -82860-85000 -82880-82900 log-likelihood -86000-87000 -88000-82920 -82940 0 2000000 4000000-89000 -90000 Q(4) Q(1) Gamma(4) -91000-92000 0 1000000 2000000 3000000 4000000 5000000 Iteration of Markov Chain

Average estimated transition rates from pattern-heterogeneity model applied to simulated gamma rate heterogeneity data: input values shown

Detecting secondary structure in Mitochondrial ribosomal 12s sequences n=57 mammals

Detecting secondary structure using the pattern-heterogeneity model Stem and loop structure leads to predictions about the pattern of nucleotide substitutions stems: dominated by Watson-Crick base pairings. Therefore expect compensatory substitutions to maintain Watson-Crick pairings. Non-compensatory changes have lower fitness. This predicts that transitional changes will occur at a far higher rate than transversional changes. Often observe Tr/Tv ratio of 10-20 in mitochondrial DNA loops: no a priori substitutional pattern expected

Detecting secondary structure; likelihoods of pattern-heterogeneity and Gamma models -22900-22950 -23000 log-likelihood -23050-23100 -23150 Gamma (4) Pattern-heterogeneity (4) -23200-23250 -23300 0 1E7 2E7 3E7 4E7 5E7 Iteration number

Mammal 12S data: analysis of rate matrices A <-> C A <-> G A <-> T C <-> G C <-> T G <-> T Tr/Tv Q1 13.75 10.52 7.89 1.69 59.65 3.34 5.26 Q2 44.25 66.37 35.57 10.64 88.46 3.59 3.29 Q3 1.60 45.29 1.77 1.59 23.30 1.62 20.84 Q4 0.30 0.80 0.18 0.20 1.62 0.10 6.26 Best fit to loop or stem Q1 Q2 Q3 Q4 Stem 57 13 122 272 Purines A G Pyrimidines C T Loop 102 150 57 236

Application of pattern heterogeneity to n=159 datasets (using a reversible-jump algorithm) RJ-algorithm allows GTR rate matirices to be gained and lost using split and merge moves Q split Q i merge Q Q j Jumps to higher or lower dimensions governed by: likelihood ratio % prior ratio % proposal ratio % jacobian Mean no. of taxa =39.9±32, median = 29 (range 7-178) Mean alignment size = 1983±2545 nucleotides, median = 1140 (range 204-16937) Mean no. of genes = 2.7±4.5, median = 1, (range 1-37) Allow RJ algorithm to run to apparent convergence Improvement over conventional modelling How many Qs?

Distribution of number of rate matrices (RJ-algorithm) and empirical weight 1 mean = 2.2±1.4 median =2 correlation with no. of taxa, r 2 =0.15 no. of sites in alignment, = r 2 =0.27 no. of genes, r 2 =0.25 no. of taxa has highest partial correlation 1 2 3 4 5 6 7 8 9 10 number of rate matrices Weight 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 2 3 4 5 6 7 8 9 Position

Improvement in log-likelihood over a single Q model 100 mean improvement = 265±608 log-units 75 mean improvement = 379± 698 for n=111 datasets with > 1 Q (light green) 50 25-500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Increase in log-likelihood over single Q model

Improvement in log-likelihood versus no. of rate matrices 4000 log-likelihood difference 3000 2000 1000 0 0 1 2 3 4 5 6 7 8 9 10 No Q's note: curve is a quadratic fit

Conclusions Mixture model can detect pattern-heterogeneity in simulated and real data and recover the parameters of the model of sequence evolution Can lead in general to large improvements in likelihood over single Q model Can improve topology and reduce long-branch attraction Can reduce the node density artefact Can be used to investigate gene evolution (such as secondary structure) or be applied to concatenated data sets of multiple genes.

Heterotachy mixture model in t Mixture model in branch lengths, t P(D Q,t 1,t 2 t J ) = " i #w j P(D i Q,t j ) A B C A B C A B C What does the model deliver?!variation in rates across the tree (heterotachy): a non-parametric covarion? basic covarion model (Tuffley and Steel, 1998) Q (off or slow) - s ij I s ij I Questions!does it work?!can we estimate sets of branch lengths?!are there a small number of sets in real data? Q= s ji I Q (on or fast)- s ji I

Does the branch-lengths model work? Variable evolutionary rates over time: Where Parsimony beats ML site pattern 1 site pattern 2 A C A C p r B q D B D Site-specific variation in rates of evolution, causes maximum likelihood but not parsimony to find the wrong tree. One-half of the sites vary their rates as in the left diagram, and the remaining sites vary according to the right diagram. Maximum likelihood tends to put A and C and B and D together, whereas parsimony tends to find the true tree (Kolaczkowski and Thornton, 2004)

Simulation!four taxon tree A site pattern 1 site pattern 2 C A C!two sets of site patterns p r!750 sites per pattern B q D B D!generate 5000 random data sets of a hard case: p=0.75, q=0.05, r={0,0.4}!analyse with MP, JC-MCMC, Covarion, BL-MCMC (blind) Do we get back the true tree? A B C D

Information ratio = xxyy/(xyxy + xyyx) A r C e.g., aacc/(acac+acca) B D 6 5 information ratio 4 3 2 1 0 0.1.2.3.4.5 r

A C 1 Probability of True Tree versus r B r D 0.9 0.8 p 50 scores in r 0.7 0.6 Parsimony = 0.220 p 50 Y 0.5 JC MCMC = 0.284 0.4 Covarion = 0.280 0.3 BL MCMC = 0.039 0.2 0.1 0 0.1.2.3.4 R r 0.039 0.284 BL MCMC: recovered two branch length sets of correct lengths

Probability of recovering true tree vs. information ratio: BL MCMC model 1.8 probability correct.6.4.2 0 0 1 2 3 4 5 information ratio

Can we estimate sets of branch lengths? Simulated branch-length sets data* Two branch length sets Analysis Likelihood Variance Weight BLS 1-131949 68.53649 1 BLS 2-131020 66.02778 0.491023 0.508977 BLS 3-131021 77.96875 0.003187 0.480406 0.516406 BLS 4-130933 93.67612 0.504338 0.003581 0.318459 0.173622 Three branch length sets Analysis Likelihood Variance Weights BLS 1-198450 52.2581 1 BLS 2-197717 53.65423 0.666638 0.333362 BLS 3-196898 73.11806 0.323386 0.355342 0.321271 BLS 4-196899 347.5093 0.322306 0.347741 0.31752 0.012432 Four branch length sets Analysis Likelihood Variance Weights BLS 1-266091 39.67153 1 BLS 2-265204 114.0733 0.24488 0.75512 BLS 3-264547 122.6783 0.268026 0.273999 0.457976 BLS 4-263799 58.81039 0.240158 0.239082 0.260894 0.259867 *Data simulated on a tree of about 60 tips. 1500 sites. Likelihood is mean from Markov chain; Variance is of the likelihood.

A reversible-jump algorithm for heterotachy -- allow each branch to choose one or more length -- need mechanism for splitting and merging branches -- test on simulated data 0.2 branch length set 2 0.15 0.1 0.05 0 0.05.1.15.2 branch length set 1

Data simulated to have two branch-length sets log-likelihoods returned by one and two branch-length sets models and RJ algorithm -130900-131000 -131100-131200 -131300-131400 Y -131500-131600 -131700-131800 Y log-likelihood BLS 1 log-likelihood BLS 2 log-likelihood RJ +925 log-units over BLS 1 138 extra branches +870 log-units 73±2.46 branches 94% of improvement in log-likelihood with 65 fewer parameters -131900-132000 -132100 0 10000000 20000000 30000000 40000000 50000000 iteration

Time series plot of no. of extra branches inferred from RJ algorithm No. branches 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 33000000 34000000 35000000 iteration

Eukaryote phylogeny of 48 species based on four genes (EF1-$, EF2- $, and two trna synthetases) One branch-length set (homogeneous model) log-likelihood= -229785 two branch-length set log-likelihood= -228891 894 log-units for 94 extra branches RJ-algorithm log-likelihood = -228923 862 log-units(96%) for 45± 1.4 extra branches

Summary Branch-lengths model seems to work Can estimate branch length sets RJ algorithm can identify in which branches the heterotachy is most pronounced Uses Improve phylogenetic inference (e.g., should reduce long branch attraction) Identify where in tree and for which sites in the alignment the heterotachy is occurring Sp 1 AACGTTGTCCCTT Sp 2 AACGTTCCTTGCT.. Sp n CCGGTTGCAAGCT

A mixture of topologies L(Q)! P(D Q) conventional likelihood model mixture model in Q (Pagel and Meade, 2004) P(D Q,T) = "P(D i Q,T) i P(D Q 1,Q 2, Q J,T) = "#w j P(D i Q j,t) i j mixture model in T P(D Q,T 1,T 2 T J ) = "#w j P(D i Q,T j ) i j

Some causes of conflicting phylogenetic signals -- gene trees vs species trees (coalescent events) -- convergent evolution (e.g., lysozyme in cows and monkeys) -- gene duplication -- lateral gene transfer gene-duplication lateral gene transfer

How does the multiple-topologies mixture model work? conventional (Bayesian) inference alignment T likelihood iteration Li=1=P(D T1) Li=2=P(D T2) Li=3=P(D T3) 1 2 3..... n multiple-topologies mixture model alignment.. Li=n=P(D Tn) k distinct Markov chains T1 T2..Tk combined likelihood iteration 1 2 3..... n Converged combined likelihood: Li=n=w1L1+w2L2+..wkLk combined likelihood allows tree to diverge if this increases L calculate posterior probabilities of each tree separately Li=1=w1L1+w2L2+..wkLk Li=2=w1L1+w2L2+..wkLk Li=3=w1L1+w2L2+..wkLk.. Li=n=w1L1+w2L2+..wkLk

Mixture model in T mixture model in T P(D Q,T 1,T 2 T J ) = "#w j P(D i Q,T j ) i j A B C A B C A C A C B B What does the model deliver?!variation in the topology!variation in rates across the tree: a non-parametric covarion? Questions!does it work?!can we estimate sets of topologies/branch lengths?!are there a small number of sets in real data?

Does the multiple-topology model work? Simulation of two-topologies A topology 1 topology 2 C A B!two four taxon trees!two topologies p B q r D p C q r D!generate 20,000 random alignments of 1000 sites sets. Vary the proportion of sites per topology!analyse with Bayesian MCMC using: one-tree model (conventional), two-tree model Can we detect two trees?

Converged Markov chains for one and two-topology models -3050 Data: 1000 sites simulated up two different trees (80/20 split) two-topology model -3060 log-likelihood -3070 one-topology model -3080 0 100000 300000 500000 700000 900000 1100000 Iteration of Markov chain

Proportion of sites favouring topology 1 topology 2 Outcome one-topology model two-topology model 100 0 A B C D A B C D 50 50 A D A C A B B C B D C D 0 100 A B A B C D C D

Performance of one and two-topology models given conflicting phylogenetic signal A B (1) C A (2) D C B D 1 0.9 Posterior support for topology 1 0.8 0.7 0.6 0.5 0.4 0.3 two-topology model Analysis one-tree model two-tree model topology 1 two-tree model topology 2 0.2 0.1 0 one-topology model 0.1.2.3.4.5.6.7.8.9 1 Proportion of sites favouring topology 1

significant differences detected with ~ 10% of sites 1 0.9 0.8 Posterior support for topology 1 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.1.2.3.4.5.6.7.8.9 1 Proportion of sites favouring topology 1 likelihood difference 240 220 200 180 160 140 120 100 80 60 40 20 0 0.1.2.3.4.5.6.7.8.9 1 Proportion of sites favouring topology 1

Estimating the number of topologies Data: 1000 sites simulated up two different trees of 50 tips, 60/40 split Model Log-likelihood s.d W 1 W 2 W 3 One-topology -85222.8 7.2 Two topologies -64920.4 9.9 0.599294 0.400706 Three topologies -64909.9 14.8 0.598558 0.399572 0.001871

Conflicting signal for ecdysozoa/coelomata phylogenies Nematodes (e.g., C. elegans) Arthropods (e.g., Drosophila) Chordates (e.g., humans) Nematodes (e.g., C. elegans) Arthropods (e.g., Drosophila) Chordates (e.g., humans) ecdysozoa coelomata dataset: 13,946 BP alignment of fourteen genes from four eukaryotic species, Homologene (NCBI) database Number Name Length 1 Beta tubulin - 2083 536 2 EEF1A1-68181 926 3 HSP40-4B-56013 700 4 HSP70-4-1624 1696 5 HSP70-5-3908 1310 6 HSP70-9B-39452 1304 7 HSP70-90 1162 8 HSP90-1A 1414 9 14s 302 10 18s 304 11 40s 522 12 RNA-Bind-Motif-19 1974 13 TUBA6 896 14 TUBB. 900

one-topology model: likelihood = -45435.9±2.58 data from Homologene database (NCBI) posterior support for coelomata = 87 tree length 1.01±0.03

One-topology model versus two-topology results likelihood = -45435.9±2.58 likelihood = -45382.8±3.01 weight = 0.57 weight = 0.43 coelomata coelomata ecdysozoa tree length 1.01±0.03 tree length 0.93±0.13 tree length 1.33±0.3 note: two recent papers support ecdysozoa!log-l = 53.1 log units

Simulation of null hypothesis distribution of differences in likelihood n=1000 simulated data sets of 13,946 sites. Fit one and two topology model to each. Find difference in likelihoods. n=1000 < 53, p<0.001 log(l2)-log(l1)

Prokaryote phylogenetics: placement of hyperthermophiles Log likelihood =-199,345

Prokaryotes: two topologies showing that hyperthermophiles become derived (longer tree) w 1 =0.55 w 2 =0.45 Log likelihood =-194,022!L = 5323

Summary Multiple topologies model seems to work Can estimate more than one topology and identify conflicting signal directly Some questions How to test? How many distinct trees in real data?

Acknowledgement BBSRC and NERC (UK)

Phylogeny of the Ascomycota Fungi showing the evolution of lichen-formation lichen forming ambiguous not lichen forming

log-likelihoods for combined LSU/SSU nrdna data set: 54 Ascomycota species n=800 sites -14400-14500 log-likelihood -14600-14700 -14800-14900 Q(2) Gamma(4) Q(4) -15000 0 250000 500000 750000 1000000 1250000 1500000 1750000 2000000 iteration number

Site by site analysis fitting two independent rate matrices: LSU/SSU nrdna Ascomycota combined data set SSU/LSU divide log-likelihood Q1 - log-likelihood Q2 15 10 5 0-5 -10-15 -20-25 -30-35 0 100 200 300 400 500 600 700 800 Note: data modelled with two independent rate matrices site in the alignment

Compensatory substitutions transitional changes transitional changes tranversinal changes low fitness Purines A G Pyrimidines C U

Detecting secondary structure: application of pattern-heterogeneity model to 12s mitochondrial DNA data on 57 mammals Tree here

Mixture models in phylogenetic inference!mixture models!a mixture model in T!simulation study!performance at estimating T s!application to ecdysozoa/coelomata and prokaryotes

Detecting Conflicting Phylogenetic Signals a mixture-model approach

Average estimated transition rates from gamma and pattern-heterogeneity models applied to simulated gamma rate heterogeneity data: input values shown 14 12 10 A C G T A --- 4.97 3.41 0.82 C --- 0.35 2.82 G --- 1 T --- average rate = 2.22 5.48 Average 8 6 2.19 4 2 0 0.2 5 0.9 4-2 Q1 Gamma1 Q2 Gamma2 Q3 Gamma3 Q4 Gamma4

Results for branch length sets RJ model for branch lengths RJ for each branch. How many branches? identifying where and when

Is the number of branch lengths sets small in real data? Two prokaryote data sets 1) rrna 751 sites 2) ribosomal + proteins ~ 6000 sites Conventional likelihood model Beta proteobacteria Gamma proteobacteria Alpha proteobacteria Epsilon proteobacteria Chlamydiae Spirochaete Cyanobacteria DeinococcusThermus Actinobacteria Firmicutes Fusobacteria Thermotoga Aquificae Nanoarchaeota Crenarchaeota Euryarchaeota Branch-lengths mixture model Beta Proteobacteria Gamma Proteobacteria Alpha Proteobacteria Epsilon Proteobacteria Chlamydiae Spirochaetes Actinobacteria Aquifex Thermotoga Fusobacterium Firmicutes DeinococcusThermus Cyanobacteria Nanoarchaeota Crenarchaeota Euryarchaeota