ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

Siavash Mirarab, University of California, San Diego
Joint work with Tandy Warnow and Erfan Sayyari

Gene tree discordance
[Figure: the species tree next to gene trees for gene 1 through gene 1000]
Causes of gene tree discordance include: Incomplete Lineage Sorting (ILS), duplication and loss, and Horizontal Gene Transfer (HGT).
Here a "gene" means a recombination-free orthologous stretch of the genome.

Incomplete Lineage Sorting (ILS)
A random process related to the coalescence of alleles across various populations. [Figure: tracing alleles through generations]
Omnipresent; most likely for short branches or large population sizes.
The multi-species coalescent models ILS: the species tree defines the probability distribution on gene trees, and is identifiable from the distribution on gene tree topologies [Degnan and Salter, Int. J. Org. Evolution, 2005].

Multi-gene species tree estimation
[Figure: sequence alignments for gene 1, gene 2, through gene 1000 feed two alternative pipelines]
Approach 1: Concatenation. The gene alignments are joined into one supermatrix, followed by phylogeny inference. Statistically inconsistent [Roch and Steel, 2014].
Approach 2: Summary methods. A gene tree is estimated for each gene, and a summary method combines the gene trees into a species tree. Examples: STAR, STELLS, GLASS, BUCKy (population tree), MP-EST, NJst (ASTRID), ASTRAL, ASTRAL-II. Can be statistically consistent given true gene trees.
There are also other approaches: co-estimation (e.g., *BEAST) and site-based methods (SVDQuartets).
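To make Approach 1 concrete, here is a minimal Python sketch of building a supermatrix from per-gene alignments. The sequences for gene 1 and gene 2 are the ones shown on the slide; the taxon labels t1 to t4 and the function name concatenate are hypothetical, since the slide does not label the rows.

```python
# Alignments for gene 1 and gene 2 as shown on the slide.
# The taxon labels t1..t4 are hypothetical placeholders.
genes = {
    "gene1": {"t1": "ACTGCACACCG", "t2": "ACTGC-CCCCG",
              "t3": "AATGC-CCCCG", "t4": "-CTGCACACGG"},
    "gene2": {"t1": "CTGAGCATCG", "t2": "CTGAGC-TCG",
              "t3": "ATGAGC-TC-", "t4": "CTGA-CAC-G"},
}


def concatenate(alignments):
    """Join per-gene alignments into one supermatrix.

    A taxon missing from a gene is padded with gaps so that
    every row of the supermatrix has the same length.
    """
    taxa = sorted({t for aln in alignments.values() for t in aln})
    supermatrix = {t: "" for t in taxa}
    for aln in alignments.values():  # dicts keep insertion order
        length = len(next(iter(aln.values())))
        for t in taxa:
            supermatrix[t] += aln.get(t, "-" * length)
    return supermatrix


supermatrix = concatenate(genes)
for taxon, row in supermatrix.items():
    print(taxon, row)
```

A summary method would instead keep the two alignments separate, estimate one tree per gene, and combine the trees afterwards.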

Properties of quartet trees in presence of ILS
[Figure: the three possible quartet topologies with gene tree frequencies p_1 = 30%, p_2 = 30%, p_3 = 40%; the third is dominant]
For 4 species, the dominant quartet topology is the species tree [Allman, et al. 2010].
For >4 species, the dominant topology may be different from the species tree [Degnan and Rosenberg, 2006].
1. Break up each input gene tree into trees on 4 taxa (quartet trees)
2. Find all (n choose 4) dominant quartet topologies
3. Combine dominant quartet trees
Alternative: weight all 3 (n choose 4) quartet topologies by their frequency and find the optimal tree.
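Steps 1 and 2 above can be sketched in a few lines of Python. This is an illustration only, not ASTRAL's implementation: trees are written as nested tuples of taxon names (a hypothetical encoding), every gene tree is assumed to contain all taxa, and the ten toy gene trees are invented to mimic the 40/30/30 split on the slide.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set of tree, leaf sets of all its internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found


def quartet_topology(tree, quartet):
    """Return the split {ab, cd} the tree induces on four taxa, or None."""
    q = frozenset(quartet)
    for clade in clades(tree)[1]:
        side = clade & q
        if len(side) == 2:  # this clade separates two taxa from the other two
            return frozenset([side, q - side])
    return None  # unresolved (polytomy) on these four taxa


def quartet_frequencies(gene_trees, taxa):
    """Count, for every 4-taxon subset, how often each topology appears."""
    counts = {q: Counter() for q in combinations(sorted(taxa), 4)}
    for tree in gene_trees:
        for q in counts:
            topology = quartet_topology(tree, q)
            if topology is not None:
                counts[q][topology] += 1
    return counts


# Toy input: 4 + 3 + 3 gene trees, one block per quartet topology.
gene_trees = ([(("a", "b"), ("c", "d"))] * 4
              + [(("a", "c"), ("b", "d"))] * 3
              + [(("a", "d"), ("b", "c"))] * 3)
counts = quartet_frequencies(gene_trees, "abcd")
dominant, seen = counts[("a", "b", "c", "d")].most_common(1)[0]
print(sorted(map(sorted, dominant)), seen)
```

Listing every 4-taxon subset grows as n choose 4, which is why this brute-force form is only practical for small inputs.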

Maximum Quartet Support Species Tree [Mirarab, et al., ECCB, 2014]
Optimization Problem (suspected NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees.
Score(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|
where Q(T) is the set of quartet trees induced by T, t is a gene tree, and \mathcal{T} is the set of all input gene trees.
Theorem: Statistically consistent under the multispecies coalescent model when solved exactly.
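The score can be evaluated by brute force for a given candidate tree, which may help in reading the definition. The sketch below is not how ASTRAL computes it (ASTRAL avoids enumerating quartets); the nested-tuple tree encoding, the function names, and the three toy gene trees are assumptions made for illustration, and all trees are assumed to share the same taxa.

```python
from itertools import combinations


def leaf_sets(tree, out):
    """Append the leaf set under every internal node to out; return all leaves."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def induced_quartets(tree):
    """Q(tree): the set of resolved quartet topologies induced by the tree."""
    internal = []
    taxa = leaf_sets(tree, internal)
    quartets = set()
    for q in combinations(sorted(taxa), 4):
        q = frozenset(q)
        for clade in internal:
            side = clade & q
            if len(side) == 2:
                quartets.add(frozenset([side, q - side]))
                break
    return quartets


def quartet_score(candidate, gene_trees):
    """Sum over gene trees t of |Q(candidate) & Q(t)|."""
    q_candidate = induced_quartets(candidate)
    return sum(len(q_candidate & induced_quartets(t)) for t in gene_trees)


# Toy data: two gene trees agree with tree_1, one agrees with tree_2.
tree_1 = ((("a", "b"), "c"), ("d", "e"))
tree_2 = ((("a", "c"), "b"), ("d", "e"))
gene_trees = [tree_1, tree_1, tree_2]
print(quartet_score(tree_1, gene_trees), quartet_score(tree_2, gene_trees))
```

On five taxa each tree induces five quartet trees, and tree_1 and tree_2 share three of them, so the candidate supported by more gene trees receives the higher score.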

ASTRAL-I [Mirarab, et al., Bioinformatics, 2014]
ASTRAL solves the problem exactly using dynamic programming: exponential running time (feasible for <18 species).
Introduced a constrained version of the problem: it draws the set of branches in the species tree from a given set X = {all bipartitions in all gene trees}. Given many genes, each species tree branch likely appears in at least one of the gene trees.
Theorem: the constrained version remains statistically consistent.
Running time: O(n^2 k |X|^2) for n species and k genes.
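As an illustration of the constraint set, the following Python sketch collects the non-trivial bipartitions of a list of gene trees. It is a simplification under stated assumptions: trees are nested tuples of taxon names (a hypothetical encoding), every gene tree contains all taxa, and trivial bipartitions that split off a single leaf are left out. How ASTRAL itself stores X is not described on the slide.

```python
def leaf_set(tree):
    """All taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the tree, each as a pair of leaf sets."""
    taxa = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        above = taxa - below
        if len(below) > 1 and len(above) > 1:  # skip single-leaf splits
            found.add(frozenset([below, above]))
        return below

    visit(tree)
    return found


def constraint_set(gene_trees):
    """X: the union of the bipartitions of all gene trees."""
    X = set()
    for tree in gene_trees:
        X |= bipartitions(tree)
    return X


gene_trees = [
    ((("a", "b"), "c"), ("d", "e")),
    ((("a", "c"), "b"), ("d", "e")),
]
X = constraint_set(gene_trees)
print(len(X))
```

The two toy trees share the bipartition abc|de, so X holds three bipartitions rather than four; the dynamic programming then only considers species trees whose branches all come from X.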

ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]
1. Faster calculation of the score function inside DP: O(nk |X|^2) instead of O(n^2 k |X|^2) for n species and k genes
2. Add extra bipartitions to the set X using heuristic approaches: resolving consensus trees by subsampling taxa, using quartet-based distances to find likely branches, and completing incomplete gene trees
3. Ability to take as input gene trees with polytomies

Simulation study
[Figure: pipeline from the true (model) species tree to true gene trees, sequence data, estimated gene trees, and the estimated species tree]
Using SimPhy. Vary many parameters:
Number of species: 10 to 1000
Number of genes: 50 to 1000
Amount of ILS: low, medium, high
Deep versus recent speciation
11 model conditions (50 replicates each) with heterogeneous gene tree error.
Compare to NJst, MP-EST, and concatenation (CA-ML) using FastTree-II.
Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree.
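The FN rate defined above reduces to a set difference once each tree is written as its set of internal branches (bipartitions). The Python sketch below shows that calculation; the two five-taxon trees are invented for illustration and only reuse the bird names from the slide's figure, and the helper names split and fn_rate are hypothetical.

```python
TAXA = frozenset(["Finch", "Falcon", "Owl", "Eagle", "Pigeon"])


def split(*side):
    """A bipartition of TAXA, given the taxa on one side of the branch."""
    side = frozenset(side)
    return frozenset([side, TAXA - side])


def fn_rate(true_branches, estimated_branches):
    """Fraction of branches in the true tree missing from the estimated tree."""
    missing = true_branches - estimated_branches
    return len(missing) / len(true_branches)


# Hypothetical trees, each described by its two internal branches.
true_tree = {split("Finch", "Falcon"), split("Eagle", "Pigeon")}
estimated_tree = {split("Finch", "Owl"), split("Eagle", "Pigeon")}

rate = fn_rate(true_tree, estimated_tree)
print(f"FN rate: {rate:.0%}")
```

Because a bipartition is stored as an unordered pair of sides, it does not matter which side of a branch is passed to split.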

Tree error, varying level of ILS
[Figure: species tree topological error (FN), from 0% to 30%, for ASTRAL-II, NJst, and CA-ML in three panels (1000, 200, and 50 genes). The x-axis is tree length (10M, 2M, 500K), which controls the amount of ILS (low, medium, high); shorter trees mean more ILS. 200 species, recent ILS.]

Impact of gene tree error (using true gene trees)
[Figure: species tree topological error (FN), from 0.0 to 0.4, for ASTRAL-II, ASTRAL-II (true gt), and CA-ML under low, medium, and high ILS with 50, 200, and 1000 genes]
When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to be better with low error.

Impact of HGT [Davidson, et al., BMC-genomics, 2015]
[Figure: species tree error versus number of genes, for several HGT rates (expected transfers per gene)]
Random HGT, with 50 species, moderate levels of ILS.
ASTRAL does well on ILS+HGT.

Going beyond the topology [Sayyari and Mirarab, MBE, 2016]
Branch length (BL): ASTRAL can compute BL in coalescent units for internal branches. It is simply a function of the amount of discordance.
Branch support:
Multi-locus bootstrapping: slow (requires bootstrapping each gene) and inaccurate [Mirarab et al., Sys bio, 2014; Bayzid et al., PLoS One, 2015].
Local posterior probability: ASTRAL's own support values that don't require bootstrapping.

Branch support: idea
Observing quartet topologies in gene trees is just like tossing a three-sided unbalanced die.
[Figure: the three quartet topologies with observed counts f_1 = 80, f_2 = 57, f_3 = 63]
P(a quartet tree seen f_1 times in k gene trees is in the species tree) = P(a three-sided die tossed k times is biased towards a side that shows up f_1 times)
Can be analytically solved (Bayesian with a prior).
ML or MAP BL can be computed as a function of f_1.
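For the ML case, the branch length follows from the standard multispecies coalescent result that a quartet gene tree matches the species tree with probability 1 - (2/3) e^(-d) for an internal branch of length d in coalescent units. Setting that probability to the observed frequency f_1/k and solving for d gives the Python sketch below. The function name is hypothetical, and the MAP variant (which also uses the prior) is not shown.

```python
import math


def ml_branch_length(f1, k):
    """ML internal branch length in coalescent units.

    f1: number of gene trees whose quartet topology matches the species tree
    k:  total number of gene trees

    Solves f1/k = 1 - (2/3) * exp(-d) for d. Frequencies at or below 1/3
    carry no signal for this topology and give a zero-length branch.
    """
    freq = f1 / k
    if freq <= 1 / 3:
        return 0.0
    if freq >= 1:
        return math.inf  # no discordance observed at all
    return -math.log(1.5 * (1 - freq))


# Counts from the slide: f_1 = 80 out of k = 80 + 57 + 63 = 200 gene trees.
print(round(ml_branch_length(80, 200), 4))  # → 0.1054
```

The more discordance there is among gene trees (f_1/k closer to 1/3), the shorter the estimated branch.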

Prior
We assume a priori that all three topologies are equally likely:
\Pr(\theta_1 > 1/3) = \Pr(\theta_2 > 1/3) = \Pr(\theta_3 > 1/3) = 1/3
where \theta_i is the true frequency of quartet topology i.
We assume branch lengths are generated through a Yule process with rate λ.
Default: λ = 0.5, under which the prior on the quartet frequency becomes uniform.
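The posterior for the three-sided die can be sketched numerically in Python. The modeling choices here are assumptions of this sketch rather than statements from the slides: if topology i is the species tree, its frequency θ lies in (1/3, 1) and the other two topologies each have frequency (1 - θ)/2; the Yule prior is taken to give θ a density proportional to (1 - θ)^(2λ - 1) on that interval, which is flat for λ = 0.5. The integrals are approximated with Simpson's rule using only the standard library, so the sketch suits a few hundred gene trees and would underflow for much larger counts.

```python
def _weight(z, n, lam, steps=2000):
    """Unnormalized posterior weight of the topology observed z times.

    Computes 2^-(n-z) times the integral over theta in (1/3, 1) of
    theta^z * (1-theta)^(n-z+2*lam-1), using composite Simpson's rule.
    Assumes n - z + 2*lam - 1 >= 0 so the integrand is finite at theta = 1.
    """
    a, b = 1 / 3, 1.0
    h = (b - a) / steps
    exponent = n - z + 2 * lam - 1

    def f(theta):
        return theta ** z * (1.0 - theta) ** exponent

    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3 * 0.5 ** (n - z)


def local_pp(z1, z2, z3, lam=0.5):
    """Posterior probability that the topology seen z1 times is the true one."""
    n = z1 + z2 + z3
    w1, w2, w3 = (_weight(z, n, lam) for z in (z1, z2, z3))
    return w1 / (w1 + w2 + w3)


# Counts from the branch support slide: f_1 = 80, f_2 = 57, f_3 = 63.
print(local_pp(80, 57, 63))
```

Rotating the arguments gives the posterior of each alternative topology, and the three values sum to one. A closed form in terms of the beta function and the regularized incomplete beta function exists for these integrals; the numerical route is used here only to keep the example dependency-free.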

Quartet support versus posterior probability
[Figure: posterior probability plotted against quartet frequency]

Multiple quartets
Locality Assumption: all four clusters around a branch Q are correct. Compute support for each branch independently from others.
[Figure: branch Q surrounded by four clusters C_1, C_2, C_3, C_4 of sizes m_1, m_2, m_3, m_4, with leaves a through h]
S_Q = {ac|dh, ac|df, bc|ef, ...}
m = \#S_Q = m_1 m_2 m_3 m_4
F_i = (f_{i1}, f_{i2}, f_{i3}) for 1 \le i \le m
Assuming independence between quartets would be too liberal (m x k tosses).
Conservative approach: all quartet frequencies are noisy estimates of a single hidden true frequency. Simply average quartet frequencies: k tosses, each time reading the results m times with some noise.
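The conservative approach amounts to a componentwise mean over the m frequency triples. A minimal Python sketch, with an invented set of three triples (only the first one reuses the counts shown earlier in the talk) and a hypothetical function name:

```python
def average_quartet_frequencies(freqs):
    """Componentwise mean of the triples F_i = (f_i1, f_i2, f_i3)."""
    m = len(freqs)
    return tuple(sum(f[j] for f in freqs) / m for j in range(3))


# Hypothetical frequencies for m = 3 quartets around one branch, k = 200 genes.
F = [(80, 57, 63), (78, 60, 62), (82, 55, 63)]
averaged = average_quartet_frequencies(F)
print(averaged)
```

The averaged counts still sum to k = 200, so the support calculation behaves as if the die had been tossed k times, not m x k times.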

Speed
Can analytically compute local posterior probabilities and branch lengths (using results from Allman, et al., 2012 for BL).
We don't need to list all n choose 4 quartet frequencies. We can use extensions of algorithmic tricks in ASTRAL to compute average quartet frequencies for each branch in O(nk).

Simulation studies
Violations of assumptions: estimated gene trees instead of true gene trees; estimated species trees, where the locality assumption is not always correct.
Datasets: the 201-taxon ASTRAL-II dataset and an avian simulated dataset.
Support accuracy: the number of false positives and false negatives above a certain threshold of support.

Results (Avian, ROC)
[Figure: ROC curve (recall versus false positive rate) comparing MLBS and local PP]
Avian simulated dataset (48 taxa, 1000 genes)

Precision and recall at high support
201-taxon datasets (simphy)
FIG. 3. Evaluation of local PP on the A-200 dataset with ASTRAL species trees. See supplementary figures S2 to S4, Supplementary Material online for other species trees. (A) Precision and recall of branches with local PP above a threshold ranging from 0.9 to 1.0 using estimated gene trees (solid) or true gene trees (dotted). (B) ROC curve (recall vs. FPR) for varying thresholds (figure trimmed at 0.4 FPR). Columns show different levels of ILS.

Branch length accuracy

1KP dataset
[Figure label: rest of land plants]

Summary
ASTRAL-II can infer species trees from gene trees in datasets with 1000 genes and 1000 taxa in a day of single-CPU running time.
ASTRAL mostly dominates other summary methods. However, concatenation is better when gene trees have high error.
ASTRAL's way of computing support is more accurate than MLBS while being much faster.