Preliminaries. Download PAUP* from: Tuesday, July 19, 16

Similar documents
Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Lab 9: Maximum Likelihood and Modeltest

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid

Phylogenetic Inference using RevBayes

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Inferring Molecular Phylogeny

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Homoplasy. Selection of models of molecular evolution. Evolutionary correction. Saturation

Pinvar approach. Remarks: invariable sites (evolve at relative rate 0) variable sites (evolves at relative rate r)

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

A Bayesian Approach to Phylogenetics

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Letter to the Editor. Department of Biology, Arizona State University

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Evolutionary Analysis of Viral Genomes

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Points of View JACK SULLIVAN 1 AND DAVID L. SWOFFORD 2

Mutation models I: basic nucleotide sequence mutation models

PHYLOGENY ESTIMATION AND HYPOTHESIS TESTING USING MAXIMUM LIKELIHOOD

Using models of nucleotide evolution to build phylogenetic trees

Constructing Evolutionary/Phylogenetic Trees

Maximum Likelihood in Phylogenetics

Introduction to MEGA

KaKs Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging

7. Tests for selection

EVOLUTIONARY DISTANCES

Constructing Evolutionary/Phylogenetic Trees

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Dr. Amira A. AL-Hosary

Lecture Notes: Markov chains

Molecular Evolution, course # Final Exam, May 3, 2006

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Phylogenetic Tree Reconstruction

arxiv: v1 [q-bio.pe] 4 Sep 2013

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Lecture 4. Models of DNA and protein change. Likelihood methods

Consensus Methods. * You are only responsible for the first two

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Reconstruire le passé biologique modèles, méthodes, performances, limites

Inferring Speciation Times under an Episodic Molecular Clock

The impact of sequence parameter values on phylogenetic accuracy

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17

Maximum Likelihood in Phylogenetics

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Are Guinea Pigs Rodents? The Importance of Adequate Models in Molecular Phylogenetics

1. Can we use the CFN model for morphological traits?

What Is Conservation?

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

C.DARWIN ( )

Introduction to Statistical modeling: handout for Math 489/583

Phylogenetics. BIOL 7711 Computational Bioscience

The Importance of Proper Model Assumption in Bayesian Phylogenetics

MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

In: P. Lemey, M. Salemi and A.-M. Vandamme (eds.). To appear in: The. Chapter 4. Nucleotide Substitution Models

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Phylogenetic Assumptions

Modeling Noise in Genetic Sequences

MOLECULAR SYSTEMATICS: A SYNTHESIS OF THE COMMON METHODS AND THE STATE OF KNOWLEDGE

Natural selection on the molecular level

Week 5: Distance methods, DNA and protein models

Chris Simon, 1,2 Thomas R. Buckley, 3 Francesco Frati, 4 James B. Stewart, 5,6 and Andrew T. Beckenbach 5. Key Words. Abstract

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny

Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

In: M. Salemi and A.-M. Vandamme (eds.). To appear. The. Phylogenetic Handbook. Cambridge University Press, UK.

Understanding relationship between homologous sequences

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection

Probabilistic modeling and molecular phylogeny

The Phylogenetic Handbook

The Causes and Consequences of Variation in. Evolutionary Processes Acting on DNA Sequences

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

RELATING PHYSICOCHEMMICAL PROPERTIES OF AMINO ACIDS TO VARIABLE NUCLEOTIDE SUBSTITUTION PATTERNS AMONG SITES ZIHENG YANG

Phylogenetic inference

Estimating Divergence Dates from Molecular Sequences

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Phylogenetic Inference using RevBayes

Underparameterized Model of Sequence Evolution Leads to Bias in the Estimation of Diversification Rates from Molecular Phylogenies

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Building trees of algae: some advances in phylogenetic and evolutionary analysis

An Evaluation of Different Partitioning Strategies for Bayesian Estimation of Species Divergence Times

The Phylo- HMM approach to problems in comparative genomics, with examples.

Minimum evolution using ordinary least-squares is less robust than neighbor-joining

Transcription:

Preliminaries Download PAUP* from: http://people.sc.fsu.edu/~dswofford/paup_test 1

A model of the Boston T System 1 Idea from Paul Lewis

A simpler model? 2

Why do models matter? Model-based methods including ML and Bayesian inference (typically) make a consistent estimate of the phylogeny (estimate converges to true tree as number of sites increases toward infinity)

Why do models matter? Model-based methods including ML and Bayesian inference (typically) make a consistent estimate of the phylogeny (estimate converges to true tree as number of sites increases toward infinity)... even when you re in the Felsenstein Zone A C (Felsenstein, 1978) B D

In the Felsenstein Zone 1.00 Proportion Correct 0.75 0.50 0.25 parsimony ML-GTR 0 0 2500 5000 7500 10000 Sequence Length Sequence Length Simulation model = GTR

Why do models matter (continued)?

Why do models matter (continued)? Parsimony is inconsistent in the Felsenstein zone (and other scenarios)

Why do models matter (continued)? Parsimony is inconsistent in the Felsenstein zone (and other scenarios) Likelihood is consistent in any zone (when certain requirements are met)

Why do models matter (continued)? Parsimony is inconsistent in the Felsenstein zone (and other scenarios) Likelihood is consistent in any zone (when certain requirements are met) But this guarantee requires that the model be specified correctly! Likelihood can also be inconsistent if the model is oversimplified

Why do models matter (continued)? Parsimony is inconsistent in the Felsenstein zone (and other scenarios) Likelihood is consistent in any zone (when certain requirements are met) But this guarantee requires that the model be specified correctly! Likelihood can also be inconsistent if the model is oversimplified Real data always evolve according to processes more complex than any computationally feasible model would permit, so we have to choose good rather than correct models

What is a good model?

What is a good model? A model that appropriately balances fit of the data with simplicity (parsimony, in a different sense)

What is a good model? A model that appropriately balances fit of the data with simplicity (parsimony, in a different sense) i.e., if a simpler model fits the data almost as well as a more complex model, prefer the simpler one

What is a good model? A model that appropriately balances fit of the data with simplicity (parsimony, in a different sense) i.e., if a simpler model fits the data almost as well as a more complex model, prefer the simpler one 100 80 60 y 40 B B B B B 120 80 40 y 0 B B B B B B B B 20 B B -40 B 0 0 25 50 75 100 x y =1.30 + 0.965x (r 2 = 0.963) -80 0 25 50 75 100 x y = - 330 +134x - 15.5x 2 +0.816x 3-0.0225x 4 + 0.000335x 5-0.00000255x 6 +0.00000000777x 7 (r 2 =1.000)

The Principle of Parsimony in the world of statistics Burnham and Anderson (1998): Model Selection and Inference Parsimony lies between the evils of underfitting and overfitting. The concept of parsimony has a long history in in the sciences. Often this has been expressed as Occam s razor shave away all that is not necessary. Parsimony in statistics represents a tradeoff between bias and variance as a function of the dimension of the model. A good model is a balance between under- and over-fitting. Welch, J. J. 2006. Estimating the genomewide rate of adaptive protein evolution in Drosophila. Genetics 173: 821 837. Model selection is a process of seeking the least inadequate model from a predefined set, all of which may be grossly inadequate as a representation of reality

Why models don t have to be perfect Assertion: In most situations, phylogenetic inference is relatively robust to model misspecification, as long as critical factors influencing sequence evolution are accommodated Caveat: There are some kinds of model misspecification that are very difficult to overcome (e.g., heterotachy ) E.g.: A C A C B D B D Half of sites Other half Likelihood can be consistent in Felsenstein zone, but will be inconsistent if a single set of branch lengths are assumed when there are actually two sets of branch lengths (Chang 1996) ( heterotachy )

GTR Family of Reversible DNA Substitution Models 3 substitution types (transversions, 2 transition classes) (general time-reversible) GTR Equal base frequencies (Tamura-Nei) 2 substitution types (transitions vs. transversions) TrN SYM 3 substitution types (transitions, 2 transversion classes) (Hasegawa-Kishino-Yano) (Felsenstein) Single substitution type HKY85 F84 Equal base frequencies K3ST (Kimura 3-subst. type) 2 substitution types (transitions vs. transversions) (Felsenstein) F81 K2P (Kimura 2-parameter) Equal base frequencies Single substitution type JC Jukes-Cantor

Among site rate heterogeneity Lemur AAGCTTCATAG TTGCATCATCCA TTACATCATCCA Homo AAGCTTCACCG TTGCATCATCCA TTACATCCTCAT Pan AAGCTTCACCG TTACGCCATCCA TTACATCCTCAT Goril AAGCTTCACCG TTACGCCATCCA CCCACGGACTTA Pongo AAGCTTCACCG TTACGCCATCCT GCAACCACCCTC Hylo AAGCTTTACAG TTACATTATCCG TGCAACCGTCCT Maca AAGCTTTTCCG TTACATTATCCG CGCAACCATCCT Proportion of invariable sites Some sites extremely unlikely to change due to strong functional or structural constraint (Hasegawa et al., 1985) Gamma-distributed rates Rate variation assumed to follow a gamma distribution with shape parameter α Site-specific rates (another way to model ASRV) Different relative rates assumed for pre-assigned subsets of sites

0.08 Modeling ASRV with gamma distribution 0.06 α=200 α=0.5 Frequency 0.04 α=2 α=50 0.02 0 0 1 2 Rate can also include a proportion of invariable sites (p inv )

Performance of ML when its model is violated Tree α = 0.5, pinv=0.5 α = 1.0, pinv=0.5 α = 1.0, pinv=0.2 1 0.9 0.8 1 0.9 0.8 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 GTRig HKYig GTRg HKYg GTRi HKYi GTRer HKYer Parsimony 100 1000 10000 100000 0.7 0.6 0.5 0.4 0.3 0.2 0.1 GTRig HKYig GTRg HKTg GTRi HKYi GTRer HKYer parsimony 0 100 1000 10000 100000 0.7 0.6 0.5 0.4 0.3 0.2 0.1 GTRig HYYig GTRg HKYg GTRi HKYi GRTer HKYer parsimony 0 100 1000 10000 100000 Proportion Correct 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 GTRig HKYig GTRg HKYg GTRi HKYi GTRer HKYer parsimony 0 100 1000 10000 100000 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 GTRig HKYig GTRg HKYg GTRi HKYi GTRer HKYer parsimony 100 1000 10000 100000 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 GTRig HKYig GTRg HKYg GTRi HKYi GTRer HKYer parsimony 100 1000 10000 100000 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 GTRig HKYig GTRg HKYg GTRi HKYi GTRer HKYer parsimony 100 1000 10000 100000 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 GTRig HKYig GTRg HKYg GTRi HKYi GTRer HKYer parsimony 100 1000 10000 100000 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 100 1000 10000 100000 Sequence Length

MODERATE Felsenstein zone α = 1.0, p inv =0.5 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 JCer JC+G JC+I JC+I+G GTRer GTR+G GTR+I GTR+I+G parsimony 0.1 0 100 1000 10000 100000

MODERATE Inverse- Felsenstein zone 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 JCer JC+G JC+I JC+I+G GTRer GTR+G GTR+I GTR+I+G parsimony 0.1 0 100 1000 10000 100000

Likelihood ratio tests Model selection criteria δ = 2( ln L 0 ln L 1 ) If model L 0 is nested within model L 1, δ is distributed as X 2 with degrees-of-freedom equal to difference in number of free parameters

Histogram of δ = 2( ln L 0 ln L 1 ) JC vs K80 models

Histogram of δ = 2( ln L 0 ln L 1 ) JC vs K80 models Ξ 1 2 0.05 and 0.01 critical values 3.84 6.64

Model selection criteria

Model selection criteria Akaike information criterion (AIC)

Model selection criteria Akaike information criterion (AIC) AIC i = 2ln L i + 2K where K is the number of free parameters estimated

Model selection criteria Akaike information criterion (AIC) AIC i = 2ln L i + 2K AICc (corrected AIC) where K is the number of free parameters estimated

Model selection criteria Akaike information criterion (AIC) AIC i = 2ln L i + 2K where K is the number of free parameters estimated AICc (corrected AIC) Bayesian information criterion (BIC)

Model selection criteria Akaike information criterion (AIC) AIC i = 2ln L i + 2K AICc (corrected AIC) where K is the number of free parameters estimated Bayesian information criterion (BIC) BIC i = 2ln L i + K lnn where K is the number of free parameters estimated and n is the sample size (typically number of sites)

AIC vs. BIC BIC performs well when true model is contained in model set, and among a set of simple models, AIC often selects a more complex model than the truth (indeed, AIC is formally statistically inconsistent) But in phylogenetics, no model is as complex as the truth, and the true model will never be contained in the model set. BIC often chooses models that seem too simple, however.

Par$$oned Models Many authors have emphasized the importance of modeling heterogeneity among genes or other subsets of the data appropriately

Par$$oned Models Many authors have emphasized the importance of modeling heterogeneity among genes or other subsets of the data appropriately...data par$$oning is more an art than a science, and it should rely on our knowledge of the biological system... Yang and Rannala (2012; Nature Rev. Genet. 13:303-314)

Ways to par$$on By gene By codon By gene/codon combina$on Stems vs. loops (probably not advisable e.g., Simon et al., 2006) Coding vs. noncoding

Naive par$$oning Run ModelTest/JModelTest; es$mate a model (from the GTR+I+G family) separately for each gene/subset Perform an ML/Bayesian analysis, assigning the chosen models to each

Naive par$$oning Run ModelTest/JModelTest; es$mate a model (from the GTR+I+G family) separately for each gene/subset Perform an ML/Bayesian analysis, assigning the chosen models to each Too many parameters! 1-10 parameters for each gene; amount of data available to es$mate each parameter does not increase

Over- Par$$oning Consider the following (contrived) example: Gene A: HKY+G, π = (0.26, 0.24, 0.23, 0.27), κ=1.1, α=3.0 Gene B: GTR, π = (0.25,0.24,0.25,0.26), (a,b,c,d,e)=(1.1, 1.2, 0.9, 1.1, 0.95) Gene C: JC+I (pinv=0.05)

Over- Par$$oning Consider the following (contrived) example: Gene A: HKY+G, π = (0.26, 0.24, 0.23, 0.27), κ=1.1, α=3.0 Gene B: GTR, π = (0.25,0.24,0.25,0.26), (a,b,c,d,e)=(1.1, 1.2, 0.9, 1.1, 0.95) Gene C: JC+I (pinv=0.05) These are all GTR models that are not far from the Jukes- Cantor model, but they all have different names Beher to es$mate one GTR model (even with 5+3+1+1=10 parameters, es$mated from all data) than 3 separate models with 2+5+1=8 parameters (but only one gene s worth of data for each model)

How to find op$mal par$$onings? Consider a data set with 3 genes, A, B, and C: A B A B A C B C A B For each par$$oning scheme, evaluate some set of models from the GTR+I+G (e.g., 56 models) according to AIC or BIC Choose a combina$on of par$$oning scheme and model for subsequent par$$oned- model analyses Rob Lanfear s Par$$onFinder (hhp://www.robertlanfear.com/par$$onfinder/) automates this process; method now also available in PAUP* test versions C C B A C

How many par$$onings? In general, the number of par$$onings on n subsets is a Bell number N 2 3 4 5 6 7 12 60 Bell number 2 5 52 203 877 4140 4 x 10 6 9.8 x 10 59 Obviously, there are too many par$$oning schemes to evaluate them all for more than a few subsets.

Greedy algorithm when there are too many Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D A D B D Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D A D B D B C A D Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D A D B D B C A D B D A C Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D A D B D B C A D B D A C C D A B Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D B D A C C D A B A D B D B C A D Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D B D A C C D A B A D B D B C A D Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D B D A C C D A B A D B D B C A D B C D A Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D B D A C C D A B A D B D B C A D B C A D B C D A Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D B D A C C D A B A D B D B C A D B C A D B C D A Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D B D A C C D A B A D B D B C A D B C A D B C D A Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

Greedy algorithm when there are too many A C B D B D A C C D A B 1 + n(n 2-1)/6 = 11 schemes A D B D B C A D B C A D B C D A For 1265 genes, there would s:ll be 337,380,561 schemes to evaluate! Lanfear, R., Calcott, B., Ho, S. Y. W., & Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution, 29(6), 1695 1701

How to par$$on thousands of genes (or other subsets)? Cluster analysis

How to par$$on thousands of genes (or other subsets)? Cluster analysis Li, Lu, and Or$ (2008) Es$mate model parameters on a shared model; similar subsets will have similar parameter es$mates and will cluster together. Problem? Similar models (in the sense of predic:ng similar site pajern frequencies), can have different parameter MLEs. Must use same model for all subsets.

How to par$$on thousands of genes (or other subsets)? Cluster analysis Li, Lu, and Or$ (2008) Es$mate model parameters on a shared model; similar subsets will have similar parameter es$mates and will cluster together. Problem? Similar models (in the sense of predic:ng similar site pajern frequencies), can have different parameter MLEs. Must use same model for all subsets. Lanfear et al. (most recent Par$$onFinder) Compute single- site likelihood values; cluster sites that have similar site likelihoods into larger subsets. Problem? Site likelihood is more determined by rate than anything else. Evolu:onary rates will have more to do with the clusterings than differences in subs:tu:on pajern, state frequencies, etc.