Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Similar documents
Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

CS 581 / BIOE 540: Algorithmic Computa<onal Genomics. Tandy Warnow Departments of Bioengineering and Computer Science hhp://tandy.cs.illinois.

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

Ultra- large Mul,ple Sequence Alignment

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Phylogenetic inference

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Phylogenetic Tree Reconstruction

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Constructing Evolutionary/Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees

Multiple Alignment. Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

The Mathema)cs of Es)ma)ng the Tree of Life. Tandy Warnow The University of Illinois

Phylogenetic analyses. Kirsi Kostamo

Phylogeny: building the tree of life

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Constructing Evolutionary/Phylogenetic Trees

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Dr. Amira A. AL-Hosary

Basics on bioinforma-cs Lecture 7. Nunzio D Agostino

TheDisk-Covering MethodforTree Reconstruction

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1. Ben Raphael January 21, Course Par3culars

Fast coalescent-based branch support using local quartet frequencies

Systematics - Bio 615

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

Multiple Sequence Alignment. Sequences

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Methods to reconstruct phylogene1c networks accoun1ng for ILS

Effects of Gap Open and Gap Extension Penalties

A Comparative Analysis of Popular Phylogenetic. Reconstruction Algorithms

Quantifying sequence similarity

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

Computa(onal Challenges in Construc(ng the Tree of Life

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Introduction to Bioinformatics Introduction to Bioinformatics

X X (2) X Pr(X = x θ) (3)

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Woods Hole brief primer on Multiple Sequence Alignment presented by Mark Holder

What is Phylogenetics

Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Phylogenetic Geometry

Tools and Algorithms in Bioinformatics

Thanks to Paul Lewis, Jeff Thorne, and Joe Felsenstein for the use of slides

Comparison of Cost Functions in Sequence Alignment. Ryan Healey

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetics. BIOL 7711 Computational Bioscience

Recent Advances in Phylogeny Reconstruction

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Symmetric Tree, ClustalW. Divergence x 0.5 Divergence x 1 Divergence x 2. Alignment length

Preliminaries. Download PAUP* from: Tuesday, July 19, 16

Phylogenetic inference: from sequences to trees

Isolating - A New Resampling Method for Gene Order Data

BINF6201/8201. Molecular phylogenetic methods

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

Sta$s$cal Significance Tes$ng In Theory and In Prac$ce

How to read and make phylogenetic trees Zuzana Starostová

Gene Regulatory Networks II Computa.onal Genomics Seyoung Kim

A (short) introduction to phylogenetics

Moreover, the circular logic

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Anatomy of a species tree

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Examining Phylogenetic Reconstruction Algorithms

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Computational methods for predicting protein-protein interactions

New methods for es-ma-ng species trees from genome-scale data. Tandy Warnow The University of Illinois

EVOLUTIONARY DISTANCES

Letter to the Editor. Department of Biology, Arizona State University

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Consistency Index (CI)

Phylogenetic hypotheses and the utility of multiple sequence alignment

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data

SUPPLEMENTARY INFORMATION

An Investigation of Phylogenetic Likelihood Methods

Transcription:

Mul$ple Sequence Alignment Methods Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Species Tree Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Constructing the Tree of Life: Hard Computational Problems NP-hard problems Large datasets 100,000+ sequences thousands of genes Big data complexity: model misspecifica$on fragmentary sequences errors in input data streaming data

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly iden$fy orthologs Compute mul$ple sequence alignments for each locus Compute species tree or network: Compute gene trees on the alignments and combine the es$mated gene trees, OR Es$mate a tree from a concatena$on of the mul$ple sequence alignments Get sta$s$cal support on each branch (e.g., bootstrapping) Es$mate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly iden$fy orthologs Compute mul$ple sequence alignments for each locus Compute species tree or network: Compute gene trees on the alignments and combine the es$mated gene trees, OR Es$mate a tree from a concatena$on of the mul$ple sequence alignments Get sta$s$cal support on each branch (e.g., bootstrapping) Es$mate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Phylogenetic reconstruction methods 1 Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood) Cost Local optimum Phylogenetic trees Global optimum 2 Polynomial time distance-based methods: Neighbor Joining, FastME, etc. 3. Bayesian methods

Solving maximum likelihood (and other hard optimization problems) is unlikely # of Taxa # of Unrooted Trees 4 3 5 15 6 105 7 945 8 10395 9 135135 10 2027025 20 2.2 x 10 20 100 4.5 x 10 190 1000 2.7 x 10 2900

The Real Problem! U V W X Y AGGGCATGA AGAT TAGACTT TGCACAA TGCGCTT U X Y V W

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA Input: unaligned sequences

Phase 1: Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

Two-phase estimation Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di-align T-Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc.

Simulation Studies S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA Unaligned Sequences S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3 True tree and alignment Compare S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA S1 S4 S2 S3 Estimated tree and alignment

Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

1000 taxon models, ordered by difficulty (Liu et al., 2009)

Major Challenges Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements) Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.

Multiple Sequence Alignment (MSA): another grand challenge 1 S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC Sn = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- Sn = -------TCAC--GACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

Generalized Tree Alignment Input: set S of sequences and func$on for gap costs. Output: Tree T and sequences at the internal nodes to minimize the total cost on the tree (sum of edit distances on the edges). Note that the output also defines a mul$ple sequence alignment! NP-hard to find the best solu$on!

So^ware for GTA (treelength op$miza$on) POY is the most well-known method for co-es$ma$ng alignments and trees using treelength criteria (however note that the developers of POY say to ignore the alignment and only use the tree). BeeTLe (Be?er Tree Length) is a heuris$c that is guaranteed to always be as least as accurate as POY for the treelength criterion. The accuracy of the final tree depends on the edit distance formula$on as noted by several studies. Affine gap penal$es are more biologically realis$c than simple gap penal$es.

Gap penal$es Simple gap penal$es: cost of a gap of length L is cl for some constant c>0 Affine gap penal$es: cost of a gap of length L is cl + c, for some pair of constants c and c Other gap penal$es are also possible (e.g., cost could be L + c log L)

Treelength ques$ons Is BeeTLe actually be?er than POY at the treelength problem (as promised)? Is it be?er to use affine than simple gap penal$es? How accurate are the alignments? How accurate are the trees, compared to Maximum Parsimony analyses of good alignments Maximum Likelihood analyses of good alignments

Treelength ques$ons Is BeeTLe actually be?er than POY at the treelength problem (as promised)? YES! Is it be?er to use affine than simple gap penal$es? How accurate are the alignments? How accurate are the trees, compared to Maximum Parsimony analyses of good alignments Maximum Likelihood analyses of good alignments

Alignment Error/Accuracy SPFN: percentage of homologies in the true alignment that are not recovered (false nega$ve homologies) SPFP: percentage of homologies in the es$mated alignment that are false (false posi$ve homologies) TC: total number of columns correctly recovered SP-score: percentage of homologies in the true alignment that are recovered Pairs score: 1-(avg of SP-FN and SP-FP)

How well do POY and BeeTLe do, compared to other MSA methods? We simulated sequences down evolu$onary trees with subs$tu$ons, inser$ons, and indels. We computed alignments on each dataset using mul$ple techniques (e.g., POY, BeeTLe, Muscle, Mag, etc.) We computed alignment errors using SPFN See Liu, K. and T. Warnow, PLOS One 7(3). 2012, "Treelength op$miza$on for phylogeny es$ma$on

Figure 5. Alignment SP-FN error of different methods on 100-taxon model conditions. Averages and standard error bars are shown; n~20 for each reported value. doi:10.1371/journal.pone.0033104.g005 Simulated 100-sequence DNA datasets with varying rates of evolu$on Results from Liu and Warnow, PLoS ONE 2012

Observa$ons Choice of gap penalty has big impact on BeeTLE with affine gap penal$es much be?er than simple gap penal$es. Even so, alignments produced by BeeTLe are not nearly as accurate as alignments produced by other methods. The best accuracy is obtained using Opal, MAFFT, SATe, or SATe-II.

Tree Es$ma$on using Treelength Beetle produce phylogene$c (evolu$onary) trees as well as alignments. Given an alignment, we can compute phylogene$c trees on mul$ple sequence alignments using many methods. Examples of tree es$ma$on methods: Maximum Parsimony Maximum Likelihood

FN FN: false negative (missing edge) FP: false positive (incorrect edge) FP 50% error rate

Maximum Parsimony (MP) on different alignments Figure 4. Missing branch rates of different methods on 100-taxon model conditions. We report missing branch rates for BeeTLe-Affin comparison to ML methods, SATé, and SATé-II (top chart) and in comparison to MP methods (middle chart). On model conditions marked wit ML(MAFFT) s missing Simulated branch rate 100-sequence significantly improved DNA upon datasets BeeTLe-Affine s with varying (using rates one-tailed of evolu$on pairwise t-tests with Benjamini-Hochberg correction for multiple tests, n~40 for each test, and a~0:05). On model conditions marked with $, BeeTLe-Affine s missing branch rate significa improved upon MP(MAFFT) s Results from (usingliu similar and statistical Warnow, tests). PLoS Averages ONE and2012 standard error bars are shown; n~20 for each reported value. doi:10.1371/journal.pone.0033104.g004

Observa$ons, MP trees MP trees computed using different alignment methods, compared to BeeTLe trees Choice of gap penalty s$ll has an impact, with affine gap penal$es be?er than simple gap penal$es. BeeTLe-affine be?er than the best MP trees Not examined: MP on BeeTLe alignments and MP on SATe or SATe-II alignments.

Maximum Likelihood (ML) on different alignments Treelength Optimization for Phylogeny Estima Simulated 100-sequence DNA datasets with varying rates of evolu$on Results from Liu and Warnow, PLoS ONE 2012

Observa$ons, ML trees ML trees computed using different alignment methods, compared to BeeTLe trees Choice of gap penalty s$ll has an impact, with affine gap penal$es be?er than simple gap penal$es. BeeTLe-affine worse than nearly all ML trees (excep$on is ClustalW). Best trees obtained using SATe, SATe-II, Opal and MAFFT Not examined: ML on BeeTLe alignments

Overall Observa$ons Maximum Likelihood (ML) be?er at es$ma$ng trees than Maximum Parsimony (MP). The best alignments are obtained using SATe or SATe-2. Opal and MAFFT are also good. The best trees are obtained using ML on good alignments. Generalized Tree Alignment not that good for either alignment or tree es$ma$on.

But It is intriguing that BeeTLe alignments are so bad, while trees aren t that bad The edit distance func$on has an impact, so maybe a be?er edit distance func$on would result in good alignments and trees. In other words, we don t know the real answer yet.