Questions we can ask. Recall. Accuracy and Precision. Systematics - Bio 615. Outline


Outline
1. Mechanistic comparison with Parsimony
   - branch lengths & parameters
2. Performance comparison with Parsimony
   - Desirable attributes of a method
   - The Felsenstein and Farris zones
   - Heterotachous data
3. Confidence - Assessment (part 1): CI, consensus trees

Derek S. Sikes, University of Alaska

Confidence - Assessment of the Strength of

Questions we can ask
- Are the data better than random - do they have signal?
- How much homoplasy is there?
- To what extent are particular elements of the trees (clades) supported?
- What alternative results can we reject?
- Do independent data sets corroborate or conflict with each other?

Recall: Stochastic error vs Systematic error
- These assessment methods help identify stochastic error
- How repeatable are the results?
- How strongly do the data support them?
- This is a measure of precision (which is hopefully related to accuracy)

Accuracy and Precision
- Accuracy: Accuracy is correctness. How close a measurement is to the true value. (Unless we know the true tree in advance we cannot measure this.)
- Precision: Precision is reproducibility. How closely two or more measurements agree with one another. (This we can measure!)
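The distinction between accuracy and precision can be made concrete with a small numerical sketch. The measurements and the "true value" below are invented for illustration only; in a real phylogenetic analysis the truth is normally unknown, which is exactly why only the spread can be measured.

```python
from statistics import mean, stdev

def accuracy_error(measurements, true_value):
    """Distance of the average measurement from the truth (smaller = more accurate)."""
    return abs(mean(measurements) - true_value)

def precision_spread(measurements):
    """Sample standard deviation of repeated measurements (smaller = more precise)."""
    return stdev(measurements)

# Hypothetical values, not taken from the lecture.
TRUE_VALUE = 10.0
precise_but_biased = [12.0, 12.1, 11.9, 12.0]  # tight cluster, far from the truth
accurate_but_noisy = [8.0, 12.0, 9.0, 11.0]    # centred on the truth, widely scattered

for name, data in [("precise but biased", precise_but_biased),
                   ("accurate but noisy", accurate_but_noisy)]:
    print(name,
          "error:", round(accuracy_error(data, TRUE_VALUE), 3),
          "spread:", round(precision_spread(data), 3))
```

The first data set plays the role of a well-supported but wrong result (systematic error); the second plays the role of a poorly supported but correct one (stochastic error).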

Recall: Stochastic error vs Systematic error
[Figure panels: High accuracy, High precision / High accuracy, Low precision / Low accuracy, High precision / Low accuracy, Low precision]
- All methods have assumptions - when violated they can produce systematic error
- Confidence measures cannot detect systematic error - must use other methods to identify it (compare methods that have different assumptions)

Branch Support Measures Precision - Not Accuracy*
- Random error +/- gone with huge dataset of 124,026 characters
- Systematic error evident in ME analysis (tree on right) (even with corrected distance data!)
- 100% branch support values indicate no stochastic error

In other words (keep in mind)
- These methods may show a high precision but the tree can still be wrong due to systematic error
- and these methods may show a low precision but the tree can still be correct
* Except possibly Bayesian posterior probabilities

1. Consistency Index
2. g1 statistic, PTP - test
3. Consensus trees
4. Decay index (Bremer Support)
5. Bootstrapping / Jackknifing
6. Posterior probability (see lecture on Bayesian)

Parsimony - tree scores are integers
- often leads to many equally most-parsimonious trees
  e.g. 27,000 MPTs all length = 25
In contrast, log-likelihoods are real numbers and rarely are two different trees found with equal log-likelihoods
  e.g. 1 tree of -lnL = 1242.058
       next best tree of -lnL = 1242.906
This leads to different approaches in assessing the strength of the phylogenetic signal for MP vs ML analyses

- Consistency Indices - interesting but less useful than other methods
- PTP-test, g1-statistic - rarely used
- Consensus trees - summary tree of all MPTs - more often used for MP than ML - also used for Bayesian

Consistency Indices
- If all the characters have the same signal then the tree is more trustworthy
- The more agreement there is, the less homoplasy (more consistency) the characters will show on the most parsimonious tree
- We need statistics to measure consistency
- CI - Kluge & Farris 1969

How much homoplasy is there?

Taxon 1   A  C  A  T  T  T  A
Taxon 2   A  C  G  A  T  T  A
Taxon 3   A  G  G  A  T  A  G
Taxon 4   G  A  A  A  A  C  ?
Taxon 5   G  A  T  A  ?  C  G
Obs L     1  2  3  1  1  2  1
Min L     1  2  2  1  1  2  1

Minimum length overall = 10
Length of MP tree = 11

Consistency Index (C.I.) = (minimum number of changes required by data set) / (number of changes on tree)
Higher CI means lower homoplasy
CI value for tree or character

Consistency index CI = Min L / Obs L = 10 / 11 = 0.91
Homoplasy index HI = 1 - CI = 0.09
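The CI arithmetic above can be reproduced in a few lines. This is a minimal sketch under two assumptions: characters are treated as unordered, so the minimum number of steps is the number of distinct observed states minus one, and "?" is missing data that adds no steps. The observed lengths are copied from the slide because computing them requires the tree itself. The names `min_steps` and `consistency_index` are illustrative, not from any package.

```python
# Character matrix from the slide; "?" marks missing data.
MATRIX = {
    "Taxon 1": "ACATTTA",
    "Taxon 2": "ACGATTA",
    "Taxon 3": "AGGATAG",
    "Taxon 4": "GAAAAC?",
    "Taxon 5": "GATA?CG",
}
# Observed steps per character on the most parsimonious tree (from the slide).
OBS_L = [1, 2, 3, 1, 1, 2, 1]

def min_steps(matrix):
    """Minimum steps per character: (distinct observed states) - 1, ignoring '?'."""
    columns = zip(*matrix.values())
    return [len(set(col) - {"?"}) - 1 for col in columns]

def consistency_index(min_l, obs_l):
    """Ensemble CI: total minimum length divided by total observed length."""
    return sum(min_l) / sum(obs_l)

MIN_L = min_steps(MATRIX)
CI = consistency_index(MIN_L, OBS_L)
HI = 1 - CI
print("Min L:", MIN_L)                    # [1, 2, 2, 1, 1, 2, 1]
print("CI =", round(CI, 2), "HI =", round(HI, 2))  # CI = 0.91 HI = 0.09
```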

How much homoplasy is there?
MacClade 4.0: characters colored by their CI
Red = CI of 1.0 (change once)
Blue = CI < 1.0 (change > 1)

Taxon 1   A    C    A     T    T    T    A
Taxon 2   A    C    G     A    T    T    A
Taxon 3   A    G    G     A    T    A    G
Taxon 4   G    A    A     A    A    C    ?
Taxon 5   G    A    T     A    ?    C    G
Obs L     1    2    3     1    1    2    1
Min L     1    2    2     1    1    2    1
CI        1.0  1.0  0.67  1.0  1.0  1.0  1.0

Tree number 1 (rooted using user-specified outgroup)
Tree length = 405
Consistency index (CI) = 0.6519
Homoplasy index (HI) = 0.3481
CI excluding uninformative characters = 0.6347
HI excluding uninformative characters = 0.3653
Retention index (RI) = 0.8102
Rescaled consistency index (RC) = 0.5281

[Tree diagram; branching structure not recoverable. Terminal taxa in order: nepalensis2, nepalensis3, nepalensis6, nepalensis12, podagricus2, podagricus1, melissae2, melissae3, melissae1, quadripuncta1, quadripuncta2, trumboi2, trumboi1, maculifrons2, maculifrons3, montivagus2, sayi1, sayi2, humator2, humator3]

Retention Index

Taxon 1   A  C  A  T  T  T  A
Taxon 2   A  C  G  A  T  T  A
Taxon 3   A  G  G  A  T  A  G
Taxon 4   G  A  A  A  A  C  ?
Taxon 5   G  A  T  A  ?  C  G
Min L     1  2  2  1  1  2  1
Max L     2  3  3  1  1  3  2

Maximum length overall = 15

Retention index (RI) = (Max L - Obs L) / (Max L - Min L) = (15 - 11) / (15 - 10) = 4 / 5 = 0.80

Farris (1989) - improvements over the CI
- Downweights homoplastic characters
- Excludes autapomorphies
- Goes to 0.0 if Max change = Observed change (CI doesn't go to 0.0, hard to interpret)

General trends observed with CIs and RIs
- Strong negative correlation between taxon number and CI and RI
- Data sets with few characters can show unexpectedly high CI and RI
- Not a very reliable measure of strength of signal
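The Max L row and the RI can be recomputed from the same matrix. A hedged sketch, assuming unordered characters: the maximum length of a character is the number of scored taxa minus the count of its most common state, and the minimum is the number of distinct states minus one. Observed lengths are again taken from the slide. The rescaled consistency index is computed here as CI × RI, which agrees with the PAUP* output above to three decimals. Function names are illustrative.

```python
from collections import Counter

ROWS = ["ACATTTA", "ACGATTA", "AGGATAG", "GAAAAC?", "GATA?CG"]
OBS_L = [1, 2, 3, 1, 1, 2, 1]  # observed steps per character, from the slide

def step_bounds(column):
    """Return (min steps, max steps) for one unordered character, ignoring '?'."""
    states = Counter(s for s in column if s != "?")
    min_l = len(states) - 1
    max_l = sum(states.values()) - max(states.values())
    return min_l, max_l

def retention_index(rows, obs_l):
    """Ensemble RI = (MaxL - ObsL) / (MaxL - MinL); None if undefined."""
    bounds = [step_bounds(col) for col in zip(*rows)]
    min_total = sum(lo for lo, _ in bounds)
    max_total = sum(hi for _, hi in bounds)
    if max_total == min_total:  # no character can show homoplasy
        return None
    return (max_total - sum(obs_l)) / (max_total - min_total)

MAX_L = [step_bounds(col)[1] for col in zip(*ROWS)]
PER_CHAR_CI = [round(step_bounds(col)[0] / obs, 2)
               for col, obs in zip(zip(*ROWS), OBS_L)]
RI = retention_index(ROWS, OBS_L)

print("Max L:", MAX_L)               # [2, 3, 3, 1, 1, 3, 2]
print("CI per character:", PER_CHAR_CI)
print("RI =", RI)                    # 0.8
print("RC from PAUP* values:", round(0.6519 * 0.8102, 3))  # 0.528
```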

How can we evaluate the significance of CI/RI?
- Permuting data removes phylogenetic signal
- CI depends directly on tree length
- We can compare the observed tree length with what we would obtain if there were no phylogenetic signal
- A permutation tail probability (PTP) test reports the proportion of permuted data sets with as good or better a measure of quality than the real data

Original data
Taxon 1   ACATTTA
Taxon 2   ACGATTA
Taxon 3   AGGATAG
Taxon 4   GAAAAC?
Taxon 5   GATA?CG

Randomize states within a character

Permuted data sets
Taxon 1   GAAA?AA
Taxon 2   ACAATC?
Taxon 3   GAGTATG
Taxon 4   AGTATCG
Taxon 5   ACGATTA

PTP test in PAUP*

permutation test = PTP
1000 permutation test replicates completed
Time used = 5.83 sec
Results of PTP test:

Tree length   Number of replicates
-----------   --------------------
379*          1
410           1
411           1
412           3
413           10
414           8
415           24
416           34
417           43
418           73
419           81
420           132
421           142
422           135
423           112
424           88
425           50
426           36
427           23
428           2
429           1

* = length for original (unpermuted) data
P = 0.001000

Example without signal

Tree length   Number of replicates
-----------   --------------------
1924          3
1926          1
1927          4
1928          1
1929          2
1930          8
1931          6
1932          5
1933          4
1934          4
1935          5
1936          1
1937          8
1938*         11
1939          7
1940          6
1941          7
1942          4
1943          2
1944          1
1945          1
1946          1
1947          1
1950          3
1952          1
1953          1
1955          1
1958          1

The permuted data are better than the real data!

g1 statistic - a measure of skewness, more skew = more signal
bell curve = random, noisy data, weak to no signal

A data set without signal
mean=599.182107 sd=4.944738 g1=-0.150922
(scale starts at 582.00000)

Tree length   Frequency
-----------   ---------
583.80000     5
585.60000     25
587.40000     71
589.20000     209
591.00000     161
592.80000     521
594.60000     883
596.40000     1132
598.20000     1469
600.00000     788
601.80000     1631
603.60000     1486
605.40000     1047
607.20000     567
609.00000     157
610.80000     171
612.60000     57
614.40000     11
616.20000     3
618.00000     1

A data set with signal
mean=611.572872 sd=31.049455 g1=-0.942643
(scale starts at 501.00000)

Tree length   Frequency
-----------   ---------
508.65000     15
516.30000     60
523.95000     84
531.60000     135
539.25000     21
546.90000     26
554.55000     96
562.20000     166
569.85000     290
577.50000     737
585.15000     1118
592.80000     665
600.45000     120
608.10000     268
615.75000     497
623.40000     796
631.05000     1337
638.70000     2031
646.35000     1610
654.00000     323
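The two steps of the PTP test described above (shuffle the states within each character, then ask what fraction of data sets yield a tree at least as short as the original) can be sketched as follows. Finding the tree length for each permuted matrix requires a parsimony search, which is not attempted here; instead the replicate counts from the first PAUP* table are reused to recover the reported P value. Function names are illustrative and are not PAUP* commands.

```python
import random

ROWS = ["ACATTTA", "ACGATTA", "AGGATAG", "GAAAAC?", "GATA?CG"]

def permute_within_characters(rows, rng):
    """Shuffle states among taxa independently for each character (column)."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(states) for states in zip(*columns)]

def ptp_value(length_counts, original_length):
    """Proportion of data sets (original included) at least as short as the original."""
    total = sum(length_counts.values())
    as_good = sum(n for length, n in length_counts.items()
                  if length <= original_length)
    return as_good / total

# Tree length -> number of replicates, copied from the first PTP table.
# The original (unpermuted) data give length 379.
PTP_COUNTS = {
    379: 1, 410: 1, 411: 1, 412: 3, 413: 10, 414: 8, 415: 24, 416: 34,
    417: 43, 418: 73, 419: 81, 420: 132, 421: 142, 422: 135, 423: 112,
    424: 88, 425: 50, 426: 36, 427: 23, 428: 2, 429: 1,
}

permuted = permute_within_characters(ROWS, random.Random(0))
print("\n".join(permuted))
print("P =", ptp_value(PTP_COUNTS, 379))  # 0.001
```

Each column keeps the same state frequencies after shuffling, so the base composition of every character is preserved while any congruence among characters is destroyed.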

Frequency distribution of tree scores:
mean=442.504629 sd=14.368220 g1=-0.556859 g2=0.042436

Tree length   Frequency
-----------   ---------
379           1
380           3
381           1
382           5
383           4
384           5
385           7
386           8
387           15
388           19
389           20
390           22
391           20
392           40
393           51
394           46
395           58
396           78
397           79
398           97
399           110
400           112
401           148
402           162
403           170
404           211
405           228
406           256
407           291
408           307
409           312
410           374
411           403
412           492
413           526
414           552
415           628
416           715
417           779
418           928
419           971
420           1024
421           1108
422           1165
423           1365
424           1507
425           1691
426           1742
427           1960
428           1958
429           2107
430           2178
431           2451
432           2603
433           2648
434           2810
435           3007
436           3050
437           2971
438           3038
439           3112
440           3131
441           3265
442           3128
443           3326
444           3475
445           3616
446           3566
447           3573
448           3661
449           3644
450           3567
451           3632
452           3616
453           3554
454           3274
455           3202
456           3074
457           2902
458           3056
459           2947
460           2739
461           2358
462           2181
463           2026
464           1678
465           1322
466           966
467           776
468           488
469           307
470           187
471           86
472           39
473           12
474           11
475           1

Tests for phylogenetic signal (g1 and PTP)
- Are sensitive to any signal in the data
- For example, g1 of permuted data = -0.04 (ns)
- Duplicate one taxon and g1 = -1.56**
- Useful for identifying truly useless data (very rare)
- But otherwise do not tell you much about data quality
- Thus, not in your text or Page & Holmes (1998)

Consensus & branch support
- CI & PTP methods seek to determine overall data quality as a guide to whether we should believe particular results
- We can, instead, evaluate particular results
  - Clade support measures: bootstrap/decay
  - Statistical tests of alternative hypotheses

Terms - from lecture & readings
- precision
- accuracy
- consistency index
- g1 statistic
- homoplasy index
- retention index
- PTP test

Study questions
- What do we mean when we say a method relaxes an assumption? [Compare how the JC69 and more complex models (e.g. K2P, HKY, or GTR) treat the Ts/Tv ratio parameter.]
- Why is quantifying the uncertainty of a phylogenetic estimate at least as important a goal as obtaining the phylogenetic estimate itself?
- Do assessment methods like bootstrapping attempt to measure accuracy or precision?
- Both stochastic and systematic error can affect accuracy and precision. How can we minimize one of these types of error? And by doing so what can we maximize: accuracy or precision?
- The g1 statistic and the PTP test are not often used for assessment. What is it that they can tell us about our data?
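As a supplement to the g1 material above, the skewness of a tree-length frequency table can be computed directly from its counts. This sketch assumes the standard moment-based definition, g1 = m3 / m2^(3/2), with divide-by-n central moments; the program that produced the slide output may apply a different small-sample correction, so exact agreement with its printed values is not claimed. The two small tables are invented to show the sign convention.

```python
def g1_skewness(length_counts):
    """Moment-based skewness of a {tree length: frequency} table."""
    n = sum(length_counts.values())
    mean = sum(x * c for x, c in length_counts.items()) / n
    m2 = sum(c * (x - mean) ** 2 for x, c in length_counts.items()) / n
    m3 = sum(c * (x - mean) ** 3 for x, c in length_counts.items()) / n
    return m3 / m2 ** 1.5

# Hypothetical distributions, not from the lecture.
symmetric = {10: 1, 11: 2, 12: 1}    # bell-like: no skew
left_tailed = {8: 1, 11: 3, 12: 6}   # a few unusually short trees: negative skew

print("symmetric g1:", g1_skewness(symmetric))
print("left-tailed g1:", round(g1_skewness(left_tailed), 3))
```

A strongly negative g1 corresponds to a long left tail of short trees, the pattern the slides associate with signal; a value near zero corresponds to the bell curve expected from random data.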