Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

Similar documents
CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Ultra- large Mul,ple Sequence Alignment

CS 581 / BIOE 540: Algorithmic Computa<onal Genomics. Tandy Warnow Departments of Bioengineering and Computer Science hhp://tandy.cs.illinois.

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Cladistics and Bioinformatics Questions 2013

Organizing Life s Diversity

Multiple Sequence Alignment. Sequences

A Phylogenetic Network Construction due to Constrained Recombination

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Chapter 18 Active Reading Guide Genomes and Their Evolution

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

Microbial analysis with STAMP

Chapter 19 Organizing Information About Species: Taxonomy and Cladistics

SUPPLEMENTARY INFORMATION

SCIENTIFIC EVIDENCE TO SUPPORT THE THEORY OF EVOLUTION. Using Anatomy, Embryology, Biochemistry, and Paleontology

Title ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109


Genomes and Their Evolution

Algorithms in Bioinformatics

Taxonomy. Content. How to determine & classify a species. Phylogeny and evolution

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Phylogenetic Geometry

10 Biodiversity Support. AQA Biology. Biodiversity. Specification reference. Learning objectives. Introduction. Background

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Taming the Beast Workshop

Computa(onal Challenges in Construc(ng the Tree of Life

Lecture 11 Friday, October 21, 2011

2018 Phylogenomics Software Symposium Institut des Sciences de l Evolution - Montpellier August 17, 2018 Abstracts

Phylogenetic inference

MiGA: The Microbial Genome Atlas

Quantifying sequence similarity

Warm-Up- Review Natural Selection and Reproduction for quiz today!!!! Notes on Evidence of Evolution Work on Vocabulary and Lab

Computational Biology From The Perspective Of A Physical Scientist

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

SPECIATION. REPRODUCTIVE BARRIERS PREZYGOTIC: Barriers that prevent fertilization. Habitat isolation Populations can t get together

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Emily Blanton Phylogeny Lab Report May 2009

thebiotutor.com AS Biology Unit 2 Classification, Adaptation & Biodiversity

Phylogenetic Trees. How do the changes in gene sequences allow us to reconstruct the evolutionary relationships between related species?

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

The practice of naming and classifying organisms is called taxonomy.

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

Effects of Gap Open and Gap Extension Penalties

COMPARING DNA SEQUENCES TO UNDERSTAND EVOLUTIONARY RELATIONSHIPS WITH BLAST

Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander

Name: Class: Date: ID: A

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Biology. Slide 1 of 24. End Show. Copyright Pearson Prentice Hall

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Microbiome: 16S rrna Sequencing 3/30/2018

Computational methods for predicting protein-protein interactions

Biodiversity. The Road to the Six Kingdoms of Life

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Evidences of Evolution (Clues)

Semantic Integration of Biological Entities in Phylogeny Visualization: Ontology Approach

Processes of Evolution

A phylogenomic toolbox for assembling the tree of life

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Predicting Protein Functions and Domain Interactions from Protein Interactions

9/19/2012. Chapter 17 Organizing Life s Diversity. Early Systems of Classification

PHYLOGENY AND SYSTEMATICS

Outline. I. Methods. II. Preliminary Results. A. Phylogeny Methods B. Whole Genome Methods C. Horizontal Gene Transfer

OMICS Journals are welcoming Submissions

Unsupervised Learning in Spectral Genome Analysis

K-means-based Feature Learning for Protein Sequence Classification

Postgraduate teaching for the next generation of taxonomists

A. Incorrect! In the binomial naming convention the Kingdom is not part of the name.

For Classroom Trial Testing

doi: / _25

#33 - Genomics 11/09/07

Figure 1. Consider this cladogram. Let s examine it with all three species concepts:

Phylogenetic analyses. Kirsi Kostamo

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Comparison of Cost Functions in Sequence Alignment. Ryan Healey

BIOINFORMATICS: An Introduction

Fast coalescent-based branch support using local quartet frequencies

Biodiversity. The Road to the Six Kingdoms of Life

Phylogenetic Networks, Trees, and Clusters

reconciling trees Stefanie Hartmann postdoc, Todd Vision s lab University of North Carolina the data

Transcription:

Phylogenomics, Multiple Sequence Alignment, and Metagenomics Tandy Warnow University of Illinois at Urbana-Champaign

Phylogeny (evolutionary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Phylogeny + genomics = genome-scale phylogeny estimation.

Multiple Sequence Alignment (MSA): a scientific grand challenge 1 S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC Sn = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- Sn = -------TCAC--GACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

Estimating the Tree of Life Figure from https://en.wikipedia.org/wiki/common_descent Basic Biology: How did life evolve? Applications of phylogenies and multiple sequence alignments to: protein structure and function population genetics human migrations metagenomics

Muir, 2016

Estimating the Tree of Life Large datasets! Millions of species thousands of genes NP-hard optimization problems Exact solutions infeasible Approximation algorithms Heuristics Multiple optima Figure from https://en.wikipedia.org/wiki/common_descent High Performance Computing: necessary but not sufficient

Computational Phylogenetics (2005) Current methods can use months to estimate trees on 1000 DNA sequences Our objective: More accurate trees and alignments on 500,000 sequences in under a week Courtesy of the Tree of Life web project, tolweb.org

Computational Phylogenetics (2018) 1997-2001: Distance-based phylogenetic tree estimation from polynomial length sequences 2012: Computing accurate trees (almost) without multiple sequence alignments 2009-2015: Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks 2014-2015: Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity Courtesy of the Tree of Life web project, tolweb.org 2016-2017: Scaling methods to very large heterogeneous datasets using novel machine learning and supertree methods.

Metagenomic taxonomic identification and phylogenetic profiling Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Basic Questions 1. What is this fragment? (Classify each fragment as well as possible.) 2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.) 3. What are the organisms in this metagenomic sample doing together?

Some of Our Methods and Software SEPP (phylogenetic placement), S. Mirarab et al. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258. TIPP (Taxonomic abundance profiling of metagenomic data), N. Nguyen et al. "TIPP:Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555. UPP (Ultra-large alignments), N. Nguyen et al. "Ultra-large alignments using phylogeny aware profiles". Proceedings RECOMB 2015 and Genome Biology (2015) 16:124 HIPPI (protein family classification), N. Nguyen et al., HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics (2016): 17 (Suppl 10):765 PASTA (co-estimation of alignments and trees), S. Mirarab et al., J. Comput. Biol. (2014), 22(5): 377-386. ASTRAL (species tree estimation from multi-locus data), S. Mirarab et al., Bioinformatics (2014), 30(17): i541-i548. ASTRID (species tree estimation from multi-locus data). P. Vachaspati and T. Warnow. BMC Genomics (2015), 16:S3. All software available in open-source form on github. See http://tandy.cs.illinois.edu/software.html for more information