SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

Similar documents
SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Ultra- large Mul,ple Sequence Alignment

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

CS 581 / BIOE 540: Algorithmic Computa<onal Genomics. Tandy Warnow Departments of Bioengineering and Computer Science hhp://tandy.cs.illinois.

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Computa(onal Challenges in Construc(ng the Tree of Life

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

Woods Hole brief primer on Multiple Sequence Alignment presented by Mark Holder

Algorithms in Bioinformatics

Phylogenetic inference

CS 394C March 21, Tandy Warnow Department of Computer Sciences University of Texas at Austin

Recent Advances in Phylogeny Reconstruction

TheDisk-Covering MethodforTree Reconstruction

Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Phylogenetic Geometry

Estimating Evolutionary Trees. Phylogenetic Methods

Phylogenetic Tree Reconstruction

SUPPLEMENTARY INFORMATION

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Multiple Sequence Alignment. Sequences

Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

BIOINFORMATICS. Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data. Jijun Tang 1 and Bernard M.E. Moret 1

Phylogenetic analyses. Kirsi Kostamo

Phylogeny: building the tree of life

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Organisatorische Details

Molecular Evolution & Phylogenetics

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection

Dr. Amira A. AL-Hosary

Taxonomical Classification using:

Computational methods for predicting protein-protein interactions

SUPPLEMENTARY INFORMATION

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Constructing Evolutionary/Phylogenetic Trees

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

EVOLUTIONARY DISTANCES

Quantifying sequence similarity

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

Fast coalescent-based branch support using local quartet frequencies

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Bioinformatics. Dept. of Computational Biology & Bioinformatics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

PHYLOGENY AND SYSTEMATICS

Effects of Gap Open and Gap Extension Penalties

Mathematics of Evolution and Phylogeny. Edited by Olivier Gascuel

X X (2) X Pr(X = x θ) (3)

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence

Phylogenetic Analysis

Phylogenetic Analysis

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

MiGA: The Microbial Genome Atlas

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Tools and Algorithms in Bioinformatics

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016

Overview Multiple Sequence Alignment

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Computational Biology From The Perspective Of A Physical Scientist

BINF6201/8201. Molecular phylogenetic methods

How to read and make phylogenetic trees Zuzana Starostová

arxiv: v1 [q-bio.pe] 27 Oct 2011

Protein Complex Identification by Supervised Graph Clustering

Phylogenetics in the Age of Genomics: Prospects and Challenges

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego

Anatomy of a species tree

Phylogenetic Analysis

Constructing Evolutionary/Phylogenetic Trees

pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree

Probalign: Multiple sequence alignment using partition function posterior probabilities

Comparison of Cost Functions in Sequence Alignment. Ryan Healey

SPECIATION. REPRODUCTIVE BARRIERS PREZYGOTIC: Barriers that prevent fertilization. Habitat isolation Populations can t get together

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

One-minute responses. Nice class{no complaints. Your explanations of ML were very clear. The phylogenetics portion made more sense to me today.

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

Phylogenetic Networks, Trees, and Clusters

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Microbial analysis with STAMP

Chapter 19 Organizing Information About Species: Taxonomy and Cladistics

CS 6375 Machine Learning

Transcription:

SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas

Phylogeny (evolutionary tree) Orangutan From the Tree of the Life Website, University of Arizona Gorilla Chimpanzee Human

How did life evolve on earth? Courtesy of the Tree of Life project

Metagenomics: Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Computational Phylogenetics and Metagenomics Courtesy of the Tree of Life project

Metagenomic data analysis NGS data produce fragmentary sequence data Metagenomic analyses include unknown species Taxon identification: given short sequences, identify the species for each fragment Issues: accuracy and speed

Phylogenetic Placement Input: Backbone alignment and tree on fulllength sequences, and a set of query sequences (short fragments) Output: Placement of query sequences on backbone tree Phylogenetic placement can be used for taxon identification, but it has general applications for phylogenetic analyses of NGS data.

Major Challenges Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements) Metagenomic analyses: methods for species classification of short reads have poor sensitivity. Efficient high throughput is necessary (millions of reads).

Today s Talk SATé: Simultaneous Alignment and Tree Estimation (Liu et al., Science 2009, and Liu et al. Systematic Biology, 2011) SEPP: SATé-enabled Phylogenetic Placement (Mirarab, Nguyen and Warnow, Pacific Symposium on Biocomputing 2012) TIPP: Taxon Identification using Phylogenetic Placement (Nguyen, Mirarab, and Warnow, in preparation - TIPP+Metaphyler collaboration with Mihai Pop and Bo Liu)

Part 1: SATé Liu, Nelesen, Raghavan, Linder, and Warnow, Science, 19 June 2009, pp. 1561-1564. Liu et al., Systematic Biology, 2011, 61(1):90-106 Public software distribution (open source) through the University of Kansas, in use, world-wide

DNA Sequence Evolution AAGGCCT AAGACTT TGGACTT -3 mil yrs -2 mil yrs AGGGCAT TAGCCCT AGCACTT -1 mil yrs AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT today

Deletion Substitution ACGGTGCAGTTACCA Insertion ACCAGTCACCTA ACGGTGCAGTTACC-A AC----CAGTCACCTA The true multiple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree

U V W X Y AGGGCATGA AGAT TAGACTT TGCACAA TGCGCTT U X Y V W

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

Simulation Studies S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA Unaligned Sequences S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3 True tree and alignment Compare S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA S1 S4 S2 S3 Estimated tree and alignment

Two-phase estimation Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di-align T-Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. RAxML: heuristic for large-scale ML optimization

1000 taxon models, ordered by difficulty (Liu et al., 2009)

Problems Large datasets with high rates of evolution are hard to align accurately, and phylogeny estimation methods produce poor trees when alignments are poor. Many phylogeny estimation methods have poor accuracy on large datasets (even if given correct alignments) Potentially useful genes are often discarded if they are difficult to align. These issues seriously impact large-scale phylogeny estimation (and Tree of Life projects)

SATé Algorithm Obtain initial alignment and estimated ML tree Tree

SATé Algorithm Obtain initial alignment and estimated ML tree Tree Alignment Use tree to compute new alignment

SATé Algorithm Obtain initial alignment and estimated ML tree Estimate ML tree on new alignment Tree Alignment Use tree to compute new alignment

SATé Algorithm Obtain initial alignment and estimated ML tree Estimate ML tree on new alignment Tree Alignment Use tree to compute new alignment If new alignment/tree pair has worse ML score, realign using a different decomposition Repeat until termination condition (typically, 24 hours)

One SATé iteration (really 32 subsets) A B e C D Decompose based on input tree A C B D Align subproblems A C B D Estimate ML tree on merged alignment ABCD Merge subproblems

1000 taxon models, ordered by difficulty

1000 taxon models, ordered by difficulty 24 hour SATé analysis, on desktop machines (Similar improvements for biological datasets)

1000 taxon models ranked by difficulty

Part II: SEPP SEPP: SATé-enabled Phylogenetic Placement, by Mirarab, Nguyen, and Warnow Pacific Symposium on Biocomputing, 2012 (special session on the Human Microbiome)

Phylogenetic Placement Align each query sequence to backbone alignment Place each query sequence into backbone tree, using extended alignment

Align Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC S1 S4 S2 S3

Align Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S1 S4 S2 S3

Place Sequence S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S1 S4 Q1 S2 S3

Phylogenetic Placement Align each query sequence to backbone alignment HMMALIGN (Eddy, Bioinformatics 1998) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree Pplacer (Matsen et al., BMC Bioinformatics, 2011) EPA (Berger and Stamatakis, Systematic Biology 2011) Note: pplacer and EPA use maximum likelihood

HMMER vs. PaPaRa Alignments 0.0 Increasing rate of evolution

Insights from SATé

Insights from SATé

Insights from SATé

Insights from SATé

Insights from SATé

SEPP Parameter Exploration Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP 10% rule (subset sizes 10% of backbone) had best overall performance

SEPP (10%-rule) on simulated data 0.0 0.0 Increasing rate of evolution

SEPP (10%) on Biological Data 16S.B.ALL dataset, 13k curated backbone tree, 13k total fragments For 1 million fragments: PaPaRa+pplacer: ~133 days HMMALIGN+pplacer: ~30 days SEPP 1000/1000: ~6 days

SEPP (10%) on Biological Data 16S.B.ALL dataset, 13k curated backbone tree, 13k total fragments For 1 million fragments: PaPaRa+pplacer: ~133 days HMMALIGN+pplacer: ~30 days SEPP 1000/1000: ~6 days

Part III: Taxon Identification Objective: identify the taxonomy (species, genus, etc.) for each short read (a classification problem)

Taxon Identification Objective: identify species, genus, etc., for each short read Leading methods: Metaphyler (Univ Maryland), Phylopythia, PhymmBL, Megan

Megan vs MetaPhyler on 60bp rpsb gene

OBSERVATIONS MEGAN is very conservative MetaPhyler makes more correct predictions than MEGAN Other methods not as sensitive on these 31 marker genes as MetaPhyler (see MetaPhyler study in Liu et al, BMC Bioinformatics 2011) Thus, the best taxon identification methods have high precision (make accurate predictions), but low sensitivity (i.e., they fail to classify a large portion of reads) even at higher taxonomy levels.

TIPP: Taxon Identification using Phylogenetic Placement Fragmentary Unknown Reads: (60-200 bp long) ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG... ACCT AGG GCAT (species1) Estimated alignment and tree (gene tree or taxonomy) on known full length sequences TAGC...CCA (species2) (500-10,000 bp long) TAGA...CTT (species3) AGC...ACA (species4) ACT..TAGA..A (species5)

TIPP - Version 1 Given a set Q of query sequences for some gene, a taxonomy T*, and a set of full-length sequences for the gene, Compute backbone alignment/tree pair (T,A) on the full-length sequences, using SATé Use SEPP to place query sequence into T* Compute extended alignment for each query sequence, using (T,A) Place query sequence into T* using pplacer (maximizing likelihood score) But TIPP version 1 too aggressive (over-classifies)

TIPP - Version 1 Given a set Q of query sequences for some gene, a taxonomy T*, and a set of full-length sequences for the gene, Compute backbone alignment/tree pair (T,A) on the full-length sequences, using SATé Use SEPP to place query sequence into T* Compute extended alignment for each query sequence, using (T,A) Place query sequence into T* using pplacer (maximizing likelihood score) But TIPP version 1 too aggressive (over-classifies)

TIPP version 2 Consider uncertainty in each step of the algorithm. Use statistical support values from pplacer and from HMMER to move placements up towards the root of the tree. Classify each fragment at the LCA of all placements obtained for the fragment. TIPP version 2 dramatically reduces false positive rate with small reduction in true positive rate by considering uncertainty, using statistical techniques.

TIPP+Metaphyler Use Metaphyler to perform initial placement of read into taxonomy Use TIPP to modify the placement, moving the read further into the clade identified by Metaphyler

Results on rpsb gene (60 bp)

Summary SATé gives better alignments and trees than standard alignment estimation methods SEPP can enable alignment of short (fragmentary) sequences into alignments of full-length sequences, and phylogenetic placement into gene trees or taxonomies TIPP enables taxon identification of short reads -- not limited to 31 marker genes, and no training is needed.

Overall message When data are difficult to analyze, develop better methods - don t throw out the data.

Phylogenetic Boosters SATé: co-estimation of alignments and trees SEPP/TIPP: phylogenetic analysis of fragmentary data Algorithmic strategies: divide-and-conquer and iteration to improve the accuracy and scalability of a base method

Phylogenetic boosters (meta-methods) Goal: improve accuracy, speed, robustness, or theoretical guarantees of base methods Examples: DCM-boosting for distance-based methods (1999) DCM-boosting for heuristics for NP-hard problems (1999) SATé-boosting for alignment methods (2009) SuperFine-boosting for supertree methods (2011) SEPP-boosting for metagenomic analyses (2012) DACTAL-boosting for all phylogeny estimation methods (in prep)

Acknowledgments Guggenheim Foundation Fellowship, Microsoft Research New England, National Science Foundation: Assembling the Tree of Life (ATOL), ITR, and IGERT grants, and David Bruton Jr. Professorship NSERC support to Siavash Mirarab Collaborators: SATé: Kevin Liu, Serita Nelesen, Sindhu Raghavan, and Randy Linder SEPP: Siavash Mirarab and Nam Nguyen TIPP: Siavash Mirarab, Nam Nguyen, Bo Liu, and Mihai Pop