PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence

Are directed quartets the key for more reliable supertrees?
Patrick Kück
Department of Life Science, Vertebrates Division, The Natural History Museum, London
Bioinformatics 2016

Introduction: Tree Reliability & Long-Branch Attraction
Systematic errors in phylogenetics:
- Increasingly apparent as more data are analysed
- Yield maximal support for incorrect relationships
- Long-branch attraction (LBA) is a major source
Which topology is correct? Terminal nodes can consist of:
- single taxa
- multi-taxon clades

Introduction: Tree Reliability & Long-Branch Attraction
Maximum Likelihood Success (PhyML)
[Figure: ML reconstruction success (PhyML), three panels. y-axis: occurrences (0 to 100); x-axis: LB (0.3 to 1.5); curves for α = 0.5, 0.7, 1.0, 2.0. Labels recovered from the figure: 1.5, 0.01, 0.01-1.5.]
Simulation settings: GTR; α: 0.3, 0.5, 0.7, 1.0, 2.0; I: 0.3; L: 250,000 bp
4 rate categories instead of a continuous rate distribution for ML

ML reliability is further reduced by:
- alignment errors
- stochastic sampling errors
- stronger model misspecifications

Introduction: Tree Reliability & Long-Branch Attraction
Is it possible to develop alternative techniques that are less affected by extreme branch length asymmetries?
- Modern probabilistic substitution models assume time-reversibility
- Distinction between new (apomorphic) and old (plesiomorphic) homologies
PhyQuart:
- Quartet-based algorithm
- Considers 2 different directions of character alteration along the internal branch
- Allows old and new split-supporting site patterns to be told apart
- Allows ML estimation of the expected number of convergent split-supporting patterns
- The combination of Hennigian logic and ML estimation represents a completely new strategy for the evaluation of sequence data

PhyQuart Algorithm: PhyQuart - Quartet Based Algorithm for Phylogenetic Inference
3 possible quartet trees for a set of 4 taxa
15 different split patterns
[Figure: split patterns, labelled Symmetric, Directive, Asymmetric, W, Singleton]
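The two counts on this slide can be checked with a short sketch. It is not taken from PENGUIN: the function names are hypothetical, and it assumes that a "split pattern" is an alignment column of four taxa relabelled by the order in which its states first appear. Under that assumption the 15 patterns are the 15 ways to partition four taxa, and exactly three of them (two states, two taxa each) correspond to the three quartet trees.

```python
from itertools import product


def split_pattern(column):
    """Relabel a 4-character column by order of first appearance, e.g. 'AACC' -> 'xxyy'."""
    labels = {}
    out = []
    for ch in column:
        if ch not in labels:
            labels[ch] = "xyzw"[len(labels)]
        out.append(labels[ch])
    return "".join(out)


def quartet_trees(a, b, c, d):
    """Return the three unrooted quartet trees for four taxa as pairs of cherries."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


# Every distinct pattern that a column of four nucleotides can produce.
all_patterns = sorted({split_pattern(col) for col in product("ACGT", repeat=4)})

# The three two-state, two-by-two patterns and the split each one supports
# (taxa named A, B, C, D in column order).
SUPPORTED_SPLIT = {"xxyy": "AB|CD", "xyxy": "AC|BD", "xyyx": "AD|BC"}

if __name__ == "__main__":
    print(len(all_patterns), all_patterns)
    print(quartet_trees("A", "B", "C", "D"))
```

The slide's class names (Symmetric, Directive, Asymmetric, Singleton) are deliberately not assigned to patterns here, because the transcript does not say which patterns belong to which class.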

PhyQuart Algorithm: PhyQuart - Quartet Based Algorithm for Phylogenetic Inference
3 tree-supporting split patterns
N_ap = N_tot - ...
N_ap: Potentially phylogenetically informative split-pattern signal
N_tot: Total number of tree-supporting split patterns (observed in the alignment)

PhyQuart Algorithm: PhyQuart - Quartet Based Algorithm for Phylogenetic Inference
1 uninformative, old split pattern per tree direction
N_ap = N_tot - N_p - ...
N_ap: Potentially phylogenetically informative split-pattern signal
N_tot: Total number of tree-supporting split patterns (observed in the alignment)
N_p: Plesiomorphic character similarity, uninformative (observed in the alignment)

PhyQuart Algorithm: PhyQuart - Quartet Based Algorithm for Phylogenetic Inference
2 possibly convergently evolved split patterns per tree direction
[Figure: split patterns per tree direction; label: Constraint]
N_ap = N_tot - N_p - N_c
N_ap: Potentially phylogenetically informative split-pattern signal
N_tot: Total number of tree-supporting split patterns (observed in the alignment)
N_p: Plesiomorphic character similarity, uninformative (observed in the alignment)
N_c: Convergently evolved, uninformative (mean expected number, estimated by ML with P4)

PhyQuart Algorithm: PhyQuart - Quartet Based Algorithm for Phylogenetic Inference
Reduction of Support Underestimation
- Multiple hits may erode the support for the correct tree
- Correction of support values
- Frequency of singleton patterns as an indicator of terminal branch lengths

PhyQuart Algorithm: PhyQuart - Quartet Based Algorithm for Phylogenetic Inference
Reduction of Support Underestimation
Correction factor (CF): CF = (N_Sing,Smallest × 4) / N_Sing,Total
Corrected support values are closer to what would be expected if external branches were of equal length
2 correction factors: CF_obs (alignment) and CF_exp (ML)
N_ap = CF_obs * (N_tot - N_p) - CF_exp * N_c
(first term observed in the alignment; second term from ML with P4)
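As a worked illustration of the corrected signal, the sketch below evaluates the two formulas on made-up counts. It is not PENGUIN code. It assumes that the garbled correction factor reads as four times the smallest singleton count divided by the total singleton count, which equals 1 when all four terminal branches carry the same number of singletons. The function names and the fallback for a zero total are hypothetical.

```python
def correction_factor(singleton_counts):
    """CF = 4 * smallest singleton count / total singleton count (assumed reading)."""
    if len(singleton_counts) != 4:
        raise ValueError("expected one singleton count per terminal branch (4 values)")
    total = sum(singleton_counts)
    if total == 0:
        # Assumption: with no singletons at all there is nothing to correct.
        return 1.0
    return 4 * min(singleton_counts) / total


def corrected_signal(n_tot, n_p, n_c, cf_obs=1.0, cf_exp=1.0):
    """N_ap = CF_obs * (N_tot - N_p) - CF_exp * N_c.

    With both correction factors left at 1 this reduces to
    N_ap = N_tot - N_p - N_c.
    """
    return cf_obs * (n_tot - n_p) - cf_exp * n_c


if __name__ == "__main__":
    # Hypothetical counts for one polarised quartet.
    cf = correction_factor([50, 100, 100, 150])
    print(cf)
    print(corrected_signal(n_tot=120, n_p=30, n_c=20))
    print(corrected_signal(n_tot=120, n_p=30, n_c=20, cf_obs=cf, cf_exp=cf))
```

In the talk, CF_obs is derived from the alignment and CF_exp from the ML expectation, so the two factors generally differ; the example reuses one value only to keep the arithmetic easy to follow.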

PhyQuart Algorithm: PhyQuart - Quartet Based Algorithm for Phylogenetic Inference
Final Scoring
N_ap = CF_obs * (N_tot - N_p) - CF_exp * N_c
(first term observed in the alignment; second term from ML with P4)
[Figure: PhyQuart scores, six N_ap values ranked from lowest to highest]
PhyQuart score:
- For each quartet tree, it is the highest of the scores for its polarised quartets
- Normalised so that the scores of all three alternative trees sum to 1
- PhyQuart results carry information about both support and root position

[Figure: PhyQuart score network graph. High support: N_ap < N_ap << N_ap. High conflict: N_ap < N_ap < N_ap.]
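The final scoring rule (take the best polarised score per tree, then normalise over the three trees) can be sketched as follows. This is an illustration, not the PENGUIN implementation. The function name is hypothetical, and two choices are assumptions the slides do not cover: negative values are clamped to zero before normalising, and equal scores are returned if every tree scores zero.

```python
def phyquart_scores(polarised_signals):
    """Map each quartet tree to a normalised score.

    polarised_signals: dict of tree label -> list of N_ap values, one per
    polarised (directed) version of that tree.
    """
    # Highest polarised score per tree; negatives clamped to 0 (assumption).
    best = {tree: max(0.0, max(values)) for tree, values in polarised_signals.items()}
    total = sum(best.values())
    if total == 0:
        # Assumption: no signal at all means no preference between trees.
        return {tree: 1.0 / len(best) for tree in best}
    return {tree: value / total for tree, value in best.items()}


if __name__ == "__main__":
    # Hypothetical N_ap values: two directions for each of the three trees.
    signals = {
        "AB|CD": [60.0, 10.0],
        "AC|BD": [30.0, 5.0],
        "AD|BC": [10.0, -4.0],
    }
    for tree, score in phyquart_scores(signals).items():
        print(tree, round(score, 3))
```

Scores close to (1, 0, 0) would correspond to the "high support" case on the slide, while three similar scores would correspond to "high conflict".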

PhyQuart - Performance: Performance in Identifying Correct Quartets
PhyQuart Success
[Figure: PhyQuart reconstruction success, three panels. y-axis: occurrences (0 to 100); x-axis: LB (0.3 to 1.5); curves for α = 0.7, 1.0, 2.0. Labels recovered from the figure: 1.5, 0.01, 0.01-1.5.]
Simulation settings: GTR; α: 0.3, 0.5, 0.7, 1.0, 2.0; I: 0.3; L: 250,000 bp
4 rate categories instead of a continuous rate distribution for ML estimation

[Figure: ML reconstruction success (PhyML; α = 0.5, 0.7, 1.0, 2.0) next to PhyQuart reconstruction success (α = 0.7, 1.0, 2.0). y-axis: occurrences (0 to 100); x-axis: LB (0.3 to 1.5).]
PhyQuart:
- is quite successful in inferring correct quartet topologies from very heterogeneous sequence data
- can outperform ML in overcoming both long-branch attraction and long-branch repulsion
- is not recommended for shorter sequence lengths (<50 kbp)

PhyQuart - Application: Implementation of PhyQuart
PENGUIN
[Figure: PENGUIN manual]
- Command-line driven Perl script
- Runs on Windows, Mac OS, and Linux
- Extensive user options available
- Download link: https://github.com/patrickkueck/penguin

PhyQuart - Application: Applicability of PhyQuart (PENGUIN)
Divide & Conquer

Input files:
- Clan definition file: -p 'clan_file_61_taxa.txt'
- Alignment file: -i 'msa_file_61_taxa_30kbp.fas'
- Additional options, e.g. -l 1000

Analysis of possible split support of clan relationships, for each generated set of quartet sequences between defined clans
[Figure: the three possible relationships S1, S2, S3 among Clan 1, Clan 2, Clan 3 and Clan 4; S1 is shown as the correct relationship]

Triangle graphic result files about:
- summarised sequence split support values
- single quartet split support values
Panels: single split support per quartet; mean split support per taxon; median split support per taxon

Split graphic result files about summarised quartet split support values
Panels: N best quartet topologies; mean split support (overall); median split support (overall)

Single quartet analysis: SVG output result files

Analysis of:
- all quartets of larger trees
- predefined quartets of multi-taxon clans
Evaluation of:
- contradicting signals to assess the robustness of relationships within a more complex tree
Identification of:
- potentially rogue taxa
Used:
- in combination with quartet-based supertree methods
- for network development
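The two analysis modes listed on this slide differ in how the four-taxon sets are generated. The sketch below shows one plausible way to enumerate them. It is not PENGUIN code: the function names and the example clans are hypothetical, and it assumes that a quartet "between defined clans" takes exactly one sequence from each of four clans.

```python
from itertools import combinations, product


def all_quartets(taxa):
    """Every unordered set of four taxa from a larger taxon list."""
    return list(combinations(sorted(taxa), 4))


def clan_quartets(clans):
    """Every quartet that draws one taxon from each of four predefined clans.

    clans: dict of clan name -> list of taxon names (exactly four clans).
    """
    if len(clans) != 4:
        raise ValueError("expected exactly four clans")
    return list(product(*clans.values()))


if __name__ == "__main__":
    # Six taxa give C(6, 4) = 15 quartets.
    print(len(all_quartets(["t1", "t2", "t3", "t4", "t5", "t6"])))

    # Hypothetical clan definition: 2 * 1 * 3 * 2 = 12 quartets.
    example_clans = {
        "Clan 1": ["a1", "a2"],
        "Clan 2": ["b1"],
        "Clan 3": ["c1", "c2", "c3"],
        "Clan 4": ["d1", "d2"],
    }
    print(len(clan_quartets(example_clans)))
```

The number of clan quartets is the product of the clan sizes, so it grows quickly with the size of the clans; restricting the analysis to predefined clans still keeps it far below the number of all quartets of the full taxon set.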

PhyQuart - Outro: Publication
Submitted to Journal of Theoretical Biology
Thank you for your attention.