Pinvar approach: invariable sites (evolve at relative rate 0) and variable sites (evolve at relative rate r)

Pinvar approach
Unlike the site-specific rates approach, this approach does not require you to assign sites to rate categories. It assumes there are only two classes of sites:
  invariable sites (evolve at relative rate 0)
  variable sites (evolve at relative rate $r$)
Remarks:
  mean of relative rates $= (p_{invar})(0) + (1 - p_{invar})(r) = 1$
  this means that $r = 1/(1 - p_{invar})$
  if all sites are variable, $p_{invar} = 0$ and $r = 1$
Copyright 2007 Paul O. Lewis

Constant site: a site in which all of the taxa display the same character state.
Invariable site: a site in which only one character state is allowed, i.e. a site that cannot change state.
All invariable sites are constant, but not all constant sites have to be invariable.

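$$\Pr(i \to i \mid \text{invariable}) = \frac{1}{4} + \frac{3}{4} e^{-4(0)\nu/3} = \frac{1}{4} + \frac{3}{4} e^{0} = 1$$
$$\Pr(i \to j \mid \text{invariable}) = \frac{1}{4} - \frac{1}{4} e^{-4(0)\nu/3} = 0$$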

A site's likelihood under the JC+I model
$x_i$ is the data pattern for site $i$.
General form:
$$\Pr(x_i \mid JC+I) = p_{inv} \Pr(x_i \mid inv) + (1 - p_{inv}) \Pr\left(x_i \mid JC, \frac{\nu}{1 - p_{inv}}\right)$$
If $x_i$ is a variable site:
$$\Pr(x_i \mid JC+I) = (1 - p_{inv}) \Pr\left(x_i \mid JC, \frac{\nu}{1 - p_{inv}}\right)$$
If $x_i$ is a constant site:
$$\Pr(x_i \mid JC+I) = p_{inv} \Pr(x_i \mid inv) + (1 - p_{inv}) \Pr\left(x_i \mid JC, \frac{\nu}{1 - p_{inv}}\right)$$

Why $\frac{\nu}{1 - p_{inv}}$?
We want the mean rate of change to be 1.0 over all sites (so we can interpret the branch lengths in terms of the expected # of changes per site). If $r$ is the rate of change for the variable sites then:
$$1 = 0 \cdot p_{inv} + r (1 - p_{inv}) = r (1 - p_{inv})$$
$$r = \frac{1}{1 - p_{inv}}$$
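The JC+I site likelihood and the rate rescaling above can be checked numerically. The following is a minimal sketch, not the slides' own code: it assumes an alignment of only two sequences separated by a single branch of length nu, equal base frequencies of 1/4, and Pr(x_i | inv) = 1/4 for a constant site. All function names are hypothetical.

```python
import math

def jc_same(nu, rate=1.0):
    """JC probability that a site is in the same state after a branch of length nu."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * nu / 3.0)

def jc_diff(nu, rate=1.0):
    """JC probability that a site ends in one particular different state."""
    return 0.25 - 0.25 * math.exp(-4.0 * rate * nu / 3.0)

def site_prob_jc_i(is_constant, nu, p_inv):
    """Pr(x_i | JC+I) for one site pattern of a two-sequence alignment (assumed setup)."""
    r = 1.0 / (1.0 - p_inv)  # rate of the variable class, so that the mean rate is 1
    if is_constant:
        # invariable part (1/4 for the observed base) plus the variable-class part
        return p_inv * 0.25 + (1.0 - p_inv) * 0.25 * jc_same(nu, r)
    # a variable pattern cannot come from the invariable class
    return (1.0 - p_inv) * 0.25 * jc_diff(nu, r)

nu, p_inv = 0.3, 0.4  # illustrative values only
# 4 constant patterns (AA, CC, GG, TT) and 12 variable patterns
total = 4 * site_prob_jc_i(True, nu, p_inv) + 12 * site_prob_jc_i(False, nu, p_inv)
print("sum over all 16 patterns:", total)
print("rate 0 gives:", jc_same(nu, 0.0), jc_diff(nu, 0.0))
```

Summing over all 16 possible patterns should give 1 up to rounding, which is a quick sanity check that the mixture is a proper probability distribution. The sketch requires p_inv < 1.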

Variable (but unknown) rates
We expect more shades of grey rather than the on-or-off view of the pinvar model. A priori, we do not know which sites are fast and which are slow. We may, however, be able to characterize the distribution of rates across sites as having high variance or low variance.

Gamma distributions
[Figure: relative frequency of sites (vertical axis) against relative rate (horizontal axis, 0 to 5) for gamma distributions with α = 10, α = 1, and α = 0.1.]
The mean equals 1.0 for all three of these distributions. Larger α means less heterogeneity; smaller α means more heterogeneity.

Gamma distribution
$$f(r) = \frac{r^{\alpha - 1} \beta^{\alpha} e^{-\beta r}}{\Gamma(\alpha)}$$
mean $= \alpha/\beta$; in phylogenetics, mean $= 1$, so $\beta = \alpha$
variance $= \alpha/\beta^2$; in phylogenetics, variance $= 1/\alpha$
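The mean and variance stated above can be checked by simulation. This is a rough sketch using only the Python standard library; the function names, sample size, and seed are arbitrary choices for illustration. Note that `random.gammavariate` takes a shape and a scale, so the scale is set to 1/β = 1/α.

```python
import math
import random

def gamma_pdf(r, alpha):
    """Gamma density with beta = alpha, so that the mean rate is 1."""
    beta = alpha
    return r ** (alpha - 1) * beta ** alpha * math.exp(-beta * r) / math.gamma(alpha)

def sample_moments(alpha, n=100_000, seed=1):
    """Sample mean and variance of n gamma draws with shape alpha and scale 1/alpha."""
    rng = random.Random(seed)
    draws = [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / n
    return mean, var

for alpha in (0.5, 1.0, 10.0):
    mean, var = sample_moments(alpha)
    print(f"alpha={alpha}: sample mean={mean:.3f}, "
          f"sample variance={var:.3f}, 1/alpha={1 / alpha:.3f}")
```

The sample mean should sit close to 1 for every α, while the sample variance should track 1/α, which is why small α corresponds to strong rate heterogeneity.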

Using Gamma-distributed rates across sites
We usually use a discretized version of the gamma with 4-8 categories (the computation time increases linearly with the number of categories).
$$\Pr(x_i \mid JC+G) = \sum_{j=1}^{ncat} \Pr(x_i \mid JC, r_j \nu) \Pr(r_j)$$
where:
$$\sum_{j=1}^{ncat} r_j \Pr(r_j) = 1$$

Discrete gamma (continued)
We break up the continuous gamma into intervals, each of which has an equal probability, and use the mean rate within each interval as the representative rate for that rate category:
$$\Pr(r_j) = \frac{1}{ncat}$$
So:
$$\Pr(x_i \mid JC+G) = \frac{1}{ncat} \sum_{j=1}^{ncat} \Pr(x_i \mid JC, r_j \nu)$$
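The equal-probability mixture can be sketched in a few lines. As before, this assumes a toy alignment of two sequences joined by a single branch of length nu and equal base frequencies; the four relative rates are the ones quoted in these slides for the 4-category case, and the function names are hypothetical.

```python
import math

# Relative rates quoted in these slides for the 4-category case
RATES = [0.137, 0.477, 1.000, 2.386]

def jc_pattern_prob(is_constant, nu):
    """Two-sequence JC probability of one specific site pattern (assumed setup)."""
    e = math.exp(-4.0 * nu / 3.0)
    change = 0.25 + 0.75 * e if is_constant else 0.25 - 0.25 * e
    return 0.25 * change  # 1/4 for the base observed in the first sequence

def site_prob_jc_g(is_constant, nu, rates=RATES):
    """Pr(x_i | JC+G): average the JC probability over equally likely rate categories."""
    ncat = len(rates)
    return sum(jc_pattern_prob(is_constant, r * nu) for r in rates) / ncat

nu = 0.3  # illustrative branch length
total = 4 * site_prob_jc_g(True, nu) + 12 * site_prob_jc_g(False, nu)
print("sum over all 16 patterns:", total)
print("constant site, JC vs JC+G:", jc_pattern_prob(True, nu), site_prob_jc_g(True, nu))
```

Each category costs one extra likelihood evaluation per site, which is the linear increase in computation time mentioned for the discretized gamma.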

Relative rates in 4-category case
[Figure: gamma density divided into four rate categories, with the boundaries between the 1st and 2nd, 2nd and 3rd, and 3rd and 4th categories marked; horizontal axis from 0 to 2.5.]
Boundaries are placed so that each category represents 1/4 of the distribution (i.e. 1/4 of the area under the curve). Relative rates represent the mean of their category:
$r_1 = 0.137$, $r_2 = 0.477$, $r_3 = 1.000$, $r_4 = 2.386$
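The four category means can be reproduced by hand in one special case. The slide does not state which shape parameter was used, so the sketch below assumes α = 1, where the gamma reduces to an exponential density exp(-r) and both the boundaries and the within-category means have closed forms. The function name is hypothetical.

```python
import math

def exp_category_means(ncat=4):
    """Mean rate within each equal-probability slice of a gamma with alpha = beta = 1."""
    # For alpha = 1 the CDF is 1 - exp(-r), so the k/ncat quantile is -ln(1 - k/ncat)
    bounds = [-math.log(1.0 - k / ncat) for k in range(ncat)] + [math.inf]

    def tail(r):
        # integral of t * exp(-t) from r to infinity equals (r + 1) * exp(-r)
        return 0.0 if math.isinf(r) else (r + 1.0) * math.exp(-r)

    # Each slice has probability 1/ncat, so the conditional mean is ncat * integral
    return [ncat * (tail(lo) - tail(hi)) for lo, hi in zip(bounds, bounds[1:])]

means = exp_category_means(4)
print([round(m, 3) for m in means])
print("mean of category rates:", sum(means) / len(means))
```

Under this assumption the rounded values agree with the four rates listed above, and their average is 1, as required for branch lengths to stay in units of expected changes per site. Other values of α need the incomplete gamma function, which the standard library does not provide.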

Discrete gamma rate heterogeneity in PAUP*
To use gamma distributed rates with 4 categories:
    lset rates=gamma ncat=4;
To estimate the shape parameter:
    lset shape=estimate;
To combine pinvar with gamma:
    lset rates=gamma shape=0.2 pinvar=0.4;
Note: estimate, previous, or a specific value can be specified for both shape and pinvar.

Rate homogeneity in PAUP*
Just tell PAUP* that you want all rates to be equal and that you want all sites to be allowed to vary:
    lset rates=equal pinvar=0;
Note: these are the default settings, but it is useful to know how to go back to rate homogeneity after you have experimented with rate heterogeneity!

Rate heterogeneity summary
1. among-character rate heterogeneity is pervasive, and detectable;
2. failure to account for it can lead to biased branch length estimates (and hence incorrect tree inference);
3. distance corrections can use estimates of rate heterogeneity (but this introduces a lot of variance);
4. recognizing fast characters can be thought of as downweighting them.

[Figure: two histograms, each titled "Histogram of d$steps"; horizontal axis: # steps (0 to 40), vertical axis: Frequency.]

Rates of evolution and the reliability of characters
Character fit on a 130 taxon tree (simulated with rate heterogeneity):

             Char. 67            Char. 882
             Tree 1    Tree 2    Tree 1    Tree 2
Pscore       4         5         31        30

Based on these 2 characters, both trees have 35 steps.

Rates of evolution and the reliability of characters (continued)
Character fit on a 130 taxon tree (simulated with rate heterogeneity):

             Char. 67            Char. 882
             Tree 1    Tree 2    Tree 1    Tree 2
Pscore       4         5         31        30
K2P lnL      -23.92    -26.13    -117.22   -116.72

Preference for tree 1 (based on these two characters alone). Δ ln L = 1.7 under K2P.

Rates of evolution and the reliability of characters (continued)
Character fit on a 130 taxon tree (simulated with rate heterogeneity):

               Char. 67                 Char. 882
               Tree 1      Tree 2       Tree 1        Tree 2
Pscore         4           5            31            30
K2P lnL        -23.92      -26.13       -117.22       -116.72
K2P+G lnL      -23.538     -26.348      -98.46        -98.1
categ. 1 %     14.2%       3.94%        0%            0%
categ. 2 %     80.4%       84.2%        0%            0%
categ. 3 %     5.3%        11.84%       0.000017%     0.000019%
categ. 4 %     0.000039%   0.0002%      99.999983%    99.999981%

Stronger preference for tree 1 (based on these two characters alone). Δ ln L = 2.45 under K2P+Gamma.
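The quoted differences follow directly from the per-character scores. A small sketch that redoes the arithmetic, using only the numbers in the table above (the dictionary layout and function name are illustrative):

```python
# Per-character scores for (Char. 67, Char. 882), copied from the table above
pscore = {"tree1": [4, 31], "tree2": [5, 30]}
lnl = {
    "K2P":   {"tree1": [-23.92, -117.22], "tree2": [-26.13, -116.72]},
    "K2P+G": {"tree1": [-23.538, -98.46], "tree2": [-26.348, -98.1]},
}

def delta_lnl(model):
    """Summed lnL of tree 1 minus summed lnL of tree 2; positive favors tree 1."""
    return sum(lnl[model]["tree1"]) - sum(lnl[model]["tree2"])

print("parsimony steps:", sum(pscore["tree1"]), "vs", sum(pscore["tree2"]))
for model in lnl:
    print(model, "delta lnL =", round(delta_lnl(model), 2))
```

Parsimony ties because the one-step advantage on the slow character is cancelled by a one-step disadvantage on the fast one. Likelihood does not tie, and the preference for tree 1 grows once the gamma model lets the fast character sit almost entirely in the highest rate category.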

Rate heterogeneity and parsimony
1. Successive weighting of Farris (1989): iterative reweighting by the rescaled consistency index on the best tree:
   Consistency index: $CI = \dfrac{\text{min. \# steps}}{\text{obs. \# steps}}$
   Retention index: $RI = \dfrac{\text{max. \# steps} - \text{obs. \# steps}}{\text{max. \# steps} - \text{min. \# steps}}$
   Rescaled consistency index: $RC = (CI)(RI)$
2. Implied weights of Goloboff (1993): $\dfrac{K}{K + \text{obs. \# steps} - \text{min. \# steps}}$
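The four indices above are simple functions of a character's minimum, observed, and maximum step counts. A minimal sketch follows; the function names, the example step counts, and the value K = 3 are all hypothetical choices for illustration, not values taken from the slides.

```python
def consistency_index(min_steps, obs_steps):
    """CI = min. # steps / obs. # steps."""
    return min_steps / obs_steps

def retention_index(min_steps, obs_steps, max_steps):
    """RI = (max - obs) / (max - min); undefined when max equals min."""
    return (max_steps - obs_steps) / (max_steps - min_steps)

def rescaled_consistency_index(min_steps, obs_steps, max_steps):
    """RC = CI * RI, the weight used in successive weighting."""
    return (consistency_index(min_steps, obs_steps)
            * retention_index(min_steps, obs_steps, max_steps))

def implied_weight(min_steps, obs_steps, k=3.0):
    """Implied weight K / (K + obs - min); equals 1 when there is no extra step."""
    return k / (k + obs_steps - min_steps)

# Hypothetical character: at least 1 step, 4 observed on the tree, at most 10
lo, obs, hi = 1, 4, 10
print("CI:", consistency_index(lo, obs))
print("RI:", retention_index(lo, obs, hi))
print("RC:", rescaled_consistency_index(lo, obs, hi))
print("implied weight (K=3):", implied_weight(lo, obs, k=3.0))
```

Both schemes give less weight to characters that need many extra steps on the tree, which is the parsimony counterpart of treating them as fast-evolving.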

References
Goloboff, P. (1993). Estimating character weights during tree search. Cladistics, 9(1):83-91.