Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego

Similar documents
ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

Fast coalescent-based branch support using local quartet frequencies

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Today's project. Test input data Six alignments (from six independent markers) of Curcuma species

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

Anatomy of a species tree

Taming the Beast Workshop

Jed Chou. April 13, 2015

WenEtAl-biorxiv 2017/12/21 10:55 page 2 #2

Phylogenetic Geometry

New methods for es-ma-ng species trees from genome-scale data. Tandy Warnow The University of Illinois

The Inference of Gene Trees with Species Trees

Quartet Inference from SNP Data Under the Coalescent Model

Phylogenomics of closely related species and individuals

Estimating phylogenetic trees from genome-scale data

Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign

Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting

In comparisons of genomic sequences from multiple species, Challenges in Species Tree Estimation Under the Multispecies Coalescent Model REVIEW

A Phylogenetic Network Construction due to Constrained Recombination

Estimating phylogenetic trees from genome-scale data

Maximum Likelihood Inference of Reticulate Evolutionary Histories

An introduction to phylogenetic networks

ALGORITHMIC STRATEGIES FOR ESTIMATING THE AMOUNT OF RETICULATION FROM A COLLECTION OF GENE TREES

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

Phylogenetics: Building Phylogenetic Trees

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Unified modeling of gene duplication, loss and coalescence using a locus tree

MiGA: The Microbial Genome Atlas

Phylogenetic Networks, Trees, and Clusters

Workshop III: Evolutionary Genomics

Supplementary Materials for

The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection

Species Tree Inference using SVDquartets

Coalescent Histories on Phylogenetic Networks and Detection of Hybridization Despite Incomplete Lineage Sorting

The impact of missing data on species tree estimation

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2012 University of California, Berkeley

Estimating Evolutionary Trees. Phylogenetic Methods

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Inference of Gene Tree Discordance and Recombination. Yujin Chung. Doctor of Philosophy (Statistics)

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Methods to reconstruct phylogene1c networks accoun1ng for ILS

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci

GATC: A Genetic Algorithm for gene Tree Construction under the Duplication Transfer Loss model of evolution

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109

Inference of Parsimonious Species Phylogenies from Multi-locus Data

Session 5: Phylogenomics

Species Tree Inference by Minimizing Deep Coalescences

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

On the variance of internode distance under the multispecies coalescent

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

Supplementary Information

8/23/2014. Phylogeny and the Tree of Life

Phylogenetics in the Age of Genomics: Prospects and Challenges

Efficient Bayesian Species Tree Inference under the Multispecies Coalescent

Dr. Amira A. AL-Hosary

SCIENTIFIC EVIDENCE TO SUPPORT THE THEORY OF EVOLUTION. Using Anatomy, Embryology, Biochemistry, and Paleontology

Fine-Scale Phylogenetic Discordance across the House Mouse Genome

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Gene Trees, Species Trees, and Species Networks

Phylogenetic inference

PhyloNet. Yun Yu. Department of Computer Science Bioinformatics Group Rice University

How should we organize the diversity of animal life?

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Processes of Evolution

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Properties of Consensus Methods for Inferring Species Trees from Gene Trees

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Phylogenetics. BIOL 7711 Computational Bioscience

The minimal prokaryotic genome. The minimal prokaryotic genome. The minimal prokaryotic genome. The minimal prokaryotic genome

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species

Algorithms in Bioinformatics

Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them?

Outline. I. Methods. II. Preliminary Results. A. Phylogeny Methods B. Whole Genome Methods C. Horizontal Gene Transfer

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Regular networks are determined by their trees

Techniques for generating phylogenomic data matrices: transcriptomics vs genomics. Rosa Fernández & Marina Marcet-Houben

Detection and Polarization of Introgression in a Five-Taxon Phylogeny

Non-binary Tree Reconciliation. Louxin Zhang Department of Mathematics National University of Singapore

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Chapter 19 Organizing Information About Species: Taxonomy and Cladistics

Phylogenetic analyses. Kirsi Kostamo

Warm-Up- Review Natural Selection and Reproduction for quiz today!!!! Notes on Evidence of Evolution Work on Vocabulary and Lab

2018 Phylogenomics Software Symposium Institut des Sciences de l Evolution - Montpellier August 17, 2018 Abstracts

Large-scale gene family analysis of 76 Arthropods

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Improved maximum parsimony models for phylogenetic networks

Transcription:

Upcoming challenges in phylogenomics Siavash Mirarab University of California, San Diego

Gene tree discordance The species tree gene1000 Causes of gene tree discordance include: Incomplete Lineage Sorting (ILS) Duplication and loss Horizontal Gene Transfer (HGT) Hybridization A gene tree 2

Gene tree discordance The species tree gene1000 Causes of gene tree discordance include: Incomplete Lineage Sorting (ILS) Duplication and loss Horizontal Gene Transfer (HGT) Hybridization 2 A gene tree c-gene : recombination-free orthologous stretches of the genome

Gene evolution model Sequence evolution model ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT

Incomplete Lineage Sorting (ILS) The coalescent process extended to multiple species Omnipresent; most likely for short branches or large population sizes Tracing alleles through generations 4

Incomplete Lineage Sorting (ILS) The coalescent process extended to multiple species Omnipresent; most likely for short branches or large population sizes Tracing alleles through generations 4

Incomplete Lineage Sorting (ILS) The coalescent process extended to multiple species Omnipresent; most likely for short branches or large population sizes Tracing alleles through generations Multi-species coalescent. The species tree defines the probability distribution on gene trees, and is identifiable from the distribution on gene tree topologies [Degnan and Salter, Int. J. Org. Evolution, 2005] 4

Multi-gene species tree estimation ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G gene 2 000 Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference anzee gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G Gene tree estimation Approach 2: Summary methods 000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene 2 000 Summary method anzee 5

Multi-gene species tree estimation ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G gene 2 000 Gene tree estimation Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods anzee Statistically inconsistent [Roch and Steel,2014] 000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene 2 000 Summary method anzee Can be statistically consistent given true gene trees 5

Multi-gene species tree estimation ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G supermatrix gene 2 000 Gene tree estimation Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods STAR, STELLS, BUCKy (population), anzee Summary method gene 2 000 anzee Statistically inconsistent [Roch and Steel,2014] MP-EST, NJst (ASTRID), ASTRAL, Can be statistically consistent given true gene trees 5

Multi-gene species tree estimation ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G 000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G gene 2 000 Gene tree estimation Approach 1: Concatenation CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods co-estimation (e.g., *BEAST), anzee site-based (e.g., Summary SVDQuartets) method anzee Statistically inconsistent [Roch and Steel,2014] There are also other approaches: STAR, STELLS, BUCKy (population), gene 2 000 MP-EST, NJst (ASTRID), ASTRAL, Can be statistically consistent given true gene trees 5

Challenges What is a gene or a species and how do we find them? Modeling: multiple evolutionary processes operate together, sometimes creating patterns that are hard to distinguish. How do we untangle them? Inference: phylogenetics is hard. Dealing with multi-locus datasets and complex evolutionary processes is often intractable. Reliability and interpretation Catching up with new data acquisition technologies

Recombination and gene boundaries For the coalescence theory to work, we need (c-)genes to be recombination-free regions. Should we try to find recombination free regions? How? Is the signal preserved through millions of years of evolution? Long enough to permit accurately reconstructing gene-specific trees? How robust or sensitive are various phylogenetic methods to presence of some recombination?

Species tree The definition of species and the delineation of boundaries between them is not trivial Trees are not always good models. Networks needed in the presence of hybridization, HGT and gene flow (migration) Are species trees the most useful entity to infer? Maybe gene trees are more useful for downstream analyses

Models of discordance Single-cause statistical models: ILS: multi-species coalescent Duplication+Loss (duploss): birth+death models Reticulation (HGT): (random models; Roch and Snir) ILS+Duploss (Rasmussen & Kellis) ILS+Hybridization (Yun et al., Luay s lab) Duploss+HGT (Tofigh et al., Szöllósi et al.)

Complex models Combining multiple causes of discordance results in complex (parameter-rich) models Inference is hard There are often identifiability issues See Szöllõsi et al., 2015 for a recent review

Inference Ideally, we combine sequence and gene evolution into a single hierarchical model and co-estimate gene and species trees Combining all processes is computationally intractable Pipeline: assemble reads > find orthologous genes (gene families) or genomic regions > multiple sequence alignment per gene > infer gene trees > infer species trees/networks Error propagates from one step to the next

Progress Co-estimation methods exist for substitutions+ils (e.g., *BEAST) substitutions+duploss (e.g., PHYLDOG) ILS+Hybridization (e.g., PhyloNet). Scalability limited to small numbers of genes (scalability is gradually improving) small numbers of species (e.g., tens) Sequence-based (gene tree free) methods of species tree estimation (e.g., SNAPP, SVDQuartets, etc.) Heuristic methods of improving gene trees (e.g., gene binning) HMM-based methods of scanning genomes (e.g., CoalHMM)

Interpretability, Data, Interpretation: truth is knowable in phylogenetics. How do we evaluate models, methods, results? Need good generative models (ideally more complex than inference models) Best ways to estimate support? How to interpret networks? Data visualization Biological events Evolution in new types of data (e.g., metagenomics, cancer, immunogenetics, HIV, etc.); data generation models.

Where to go? Can inference under existing models become scalable to hundreds of species and thousands of genes? Can we combine even more processes into a single model? For example, a model of ILS+Duploss+Transfer+Substitutions+Indel? Can smart scalable heuristic approaches be designed to sidestep some of the scalability challenges? What are the fundamental limits of a full inference of past evolutionary processes? Are there magic markers out there? Are large-scale processes such as rearrangements full of signal, waiting to be discovered?

Challenges What is a gene or a species and how do we find them? Modeling: multiple evolutionary processes operate together, sometimes creating patterns that are hard to distinguish. How do we untangle them? Inference: phylogenetics is hard. Dealing with multi-locus datasets and complex evolutionary processes is often intractable. Reliability and interpretation Catching up with new data acquisition technologies