Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

Similar documents
Fast coalescent-based branch support using local quartet frequencies

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

Anatomy of a species tree

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

Fragmentary Gene Sequences Negatively Impact Gene Tree and Species Tree Reconstruction

Unit 3 Insect Orders

Workshop III: Evolutionary Genomics

Phylogenetic Geometry

New methods for es-ma-ng species trees from genome-scale data. Tandy Warnow The University of Illinois

Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign

The impact of missing data on species tree estimation

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Jed Chou. April 13, 2015

Taming the Beast Workshop

Phylogenetics in the Age of Genomics: Prospects and Challenges

A Phylogenetic Network Construction due to Constrained Recombination

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Phylogenetic inference

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

The Mathema)cs of Es)ma)ng the Tree of Life. Tandy Warnow The University of Illinois

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Today's project. Test input data Six alignments (from six independent markers) of Curcuma species

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Estimating phylogenetic trees from genome-scale data

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

In comparisons of genomic sequences from multiple species, Challenges in Species Tree Estimation Under the Multispecies Coalescent Model REVIEW

On the variance of internode distance under the multispecies coalescent

Phylogenetic Tree Reconstruction

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting

Estimating phylogenetic trees from genome-scale data

Quartet Inference from SNP Data Under the Coalescent Model

Properties of Consensus Methods for Inferring Species Trees from Gene Trees

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

The Wonderful World of Insects. James A. Bethke University of California Cooperative Extension Farm Advisor Floriculture and Nursery San Diego County

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Estimating Evolutionary Trees. Phylogenetic Methods

Reconstruire le passé biologique modèles, méthodes, performances, limites

Species Tree Inference using SVDquartets

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

Many of the slides that I ll use have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Comparative Methods on Phylogenetic Networks

Dr. Amira A. AL-Hosary

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Understanding How Stochasticity Impacts Reconstructions of Recent Species Divergent History. Huateng Huang

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

Organizing Life s Diversity

Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets

Hokie Bugfest (October 17, 2015)

X X (2) X Pr(X = x θ) (3)

Non-binary Tree Reconciliation. Louxin Zhang Department of Mathematics National University of Singapore

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Efficient Bayesian Species Tree Inference under the Multispecies Coalescent

Biology ENTOMOLOGY Dr. Tatiana Rossolimo, Class syllabus

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

Hokie BugFest (October 20, 2018)

A phylogenomic toolbox for assembling the tree of life

Large-scale gene family analysis of 76 Arthropods

Phylogenetics: Building Phylogenetic Trees

arxiv: v1 [math.st] 22 Jun 2018

Fine-Scale Phylogenetic Discordance across the House Mouse Genome

A (short) introduction to phylogenetics

8/23/2014. Phylogeny and the Tree of Life

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Algorithms in Bioinformatics

To link to this article: DOI: / URL:

Techniques for generating phylogenomic data matrices: transcriptomics vs genomics. Rosa Fernández & Marina Marcet-Houben

Isolating - A New Resampling Method for Gene Order Data

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

Concepts and Methods in Molecular Divergence Time Estimation

WenEtAl-biorxiv 2017/12/21 10:55 page 2 #2

Phylogenetic Networks, Trees, and Clusters

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

Misconceptions on Missing Data in RAD-seq Phylogenetics with a Deep-scale Example from Flowering Plants

Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci

Hexapoda Origins: Monophyletic, Paraphyletic or Polyphyletic? Rob King and Matt Kretz

PhyloNet. Yun Yu. Department of Computer Science Bioinformatics Group Rice University

Methods to reconstruct phylogene1c networks accoun1ng for ILS

TheDisk-Covering MethodforTree Reconstruction

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Constructing Evolutionary/Phylogenetic Trees

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

The minimal prokaryotic genome. The minimal prokaryotic genome. The minimal prokaryotic genome. The minimal prokaryotic genome

Phylogenetic Assumptions

From Individual-based Population Models to Lineage-based Models of Phylogenies

Multiple Sequence Alignment. Sequences

Transcription:

Reconstruction of species trees from gene trees using ASTRAL Siavash Mirarab University of California, San Diego (ECE)

Phylogenomics Orangutan Chimpanzee gene 1 gene 2 gene 999 gene 1000 Gorilla Human ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT Inference error gene here refers to a portion of the genome (not a functional gene) More data (#genes) 2

Gene tree discordance gene 1 gene1000 Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. 3

Gene tree discordance The species tree gene 1 gene1000 Gorilla Human Chimp Orangutan Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. A gene tree 3

Gene tree discordance The species tree gene 1 gene1000 Gorilla Human Chimp Orangutan Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. A gene tree Causes of gene tree discordance include: Incomplete Lineage Sorting (ILS) Duplication and loss Horizontal Gene Transfer (HGT) 3

Incomplete Lineage Sorting (ILS) A random process related to the coalescence of alleles across various populations Tracing alleles through generations 4

Incomplete Lineage Sorting (ILS) A random process related to the coalescence of alleles across various populations Tracing alleles through generations 4

Incomplete Lineage Sorting (ILS) A random process related to the coalescence of alleles across various populations Tracing alleles through generations Omnipresent: possible for every tree Likely for short branches or large population sizes 4

MSC and Identifiability The statistical process multi-species coalescent (MSC) is used to model ILS. 5

MSC and Identifiability The statistical process multi-species coalescent (MSC) is used to model ILS. Any species tree defines a unique distribution on the set of all possible gene trees 5

MSC and Identifiability The statistical process multi-species coalescent (MSC) is used to model ILS. Any species tree defines a unique distribution on the set of all possible gene trees In principle, the species tree can be identified despite high discordance from the gene tree distribution 5

Multi-gene tree estimation pipelines gene 1 ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG gene 1 supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G gene 2 gene 1000 Approach 1: Concatenation Orangutan CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Gorilla Chimpanzee Human gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G Gene tree estimation Approach 2: Summary methods Chimp Gorilla gene 1000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene 1 gene 2 gene 1000 Human Gorilla Human Orang. Human Orang. Chimp Orang. Chimp Gorilla Summary method Orangutan Gorilla Chimpanzee Human There are also other very good approaches: co-estimation (e.g., *BEAST), site-based (e.g., SVDQuartets, SNAPP) 6

Multi-gene tree estimation pipelines gene 1 ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G gene 1 supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G gene 2 gene 1000 Gene tree estimation Approach 1: Concatenation Orangutan CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Gorilla Chimpanzee Statistically inconsistent [Roch and Steel, 2014] Human Chimp Gorilla Error gene 1000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene 1 gene 2 gene 1000 Human Gorilla Human Orang. Human Orang. Chimp Orang. Chimp Gorilla Summary method Orangutan Gorilla Chimpanzee Human Data Can be statistically consistent given true gene trees Error-free gene trees 6

Multi-gene tree estimation pipelines Error gene 1 ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G gene 1000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene 1 ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G gene 1 supermatrix gene 2 gene 1000 Gene tree estimation Chimp Approach 1: Concatenation Orangutan CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Gorilla Approach 2: Summary methods Summary method Phylogeny inference Gorilla Orangutan Human Orang. BUCKy (population tree), Gorilla Chimp gene 2 MP-EST, NJst (ASTRID), Human Orang. Gorilla Chimpanzee Statistically inconsistent [Roch and Steel, 2014] STAR, STELLS, GLASS, Human Chimpanzee Human ASTRAL, Orang. ASTRAL-II, ASTRAL-III Chimp gene 1000 Human Gorilla Data Can be statistically consistent given true gene trees Error-free gene trees 6

ASTRAL used by the biologists Plants: Wickett, et al., 2014, PNAS Birds: Prum, et al., 2015, Nature Xenoturbella, Cannon et al., 2016, Nature Xenoturbella, Rouse et al., 2016, Nature Flatworms: Laumer, et al., 2015, elife Shrews: Giarla, et al., 2015, Syst. Bio. Frogs: Yuan et al., 2016, Syst. Bio. Tomatoes: Pease, et al., 2016, PLoS Bio. ASTRAL-II ASTRAL Angiosperms: Huang et al., 2016, MBE Worms: Andrade, et al., 2015, MBE

This talk: ASTRAL Outlines of the method Accuracy in simulation studies With strong model violations The impact of Fragmentary sequence data Sampling multiple individuals Visualization 8

Unrooted quartets under MSC model For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010) Chimp Gorilla θ 1 =70% θ 2 =15% θ 3 =15% d=0.8 Chimp Gorilla Orang. Chimp Gorilla Chimp Human Orang. Human Orang. Human Gorilla Human Orang. 9

Unrooted quartets under MSC model For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010) Chimp Gorilla θ 1 =70% θ 2 =15% θ 3 =15% d=0.8 Chimp Gorilla Orang. Chimp Gorilla Chimp Human Orang. Human Orang. Human Gorilla Human Orang. The most frequent gene tree = The most likely species tree 9

Unrooted quartets under MSC model For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010) Chimp Gorilla θ 1 =70% θ 2 =15% θ 3 =15% d=0.8 Chimp Gorilla Orang. Chimp Gorilla Chimp Human Orang. Human Orang. Human Gorilla Human Orang. The most frequent gene tree = The most likely species tree shorter branches more discordance a harder species tree reconstruction problem 9

More than 4 species For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013) Gorilla Human Chimp Orang. Rhesus 10

More than 4 species For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013) Gorilla Human Chimp Orang. Rhesus 1. Break gene trees into ( n 4 ) quartets of species 2. Find the dominant tree for all quartets of taxa 3. Combine quartet trees Some tools (e.g.. BUCKy-p [Larget, et al., 2010]) 10

More than 4 species For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013) Gorilla Human Chimp Orang. Rhesus Alternative: 1. Break gene trees into ( n 4 ) quartets of species weight all 3( n 4 ) quartet topologies 2. Find the dominant tree for all quartets of taxa by their frequency and find the optimal tree 3. Combine quartet trees Some tools (e.g.. BUCKy-p [Larget, et al., 2010]) 10

Maximum Quartet Support Species Tree Optimization problem: Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees Score(T )= mx Q(T ) \ Q(t i ) 1 the set of quartet trees induced by T a gene tree 11

Maximum Quartet Support Species Tree Optimization problem: Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees Score(T )= mx Q(T ) \ Q(t i ) 1 the set of quartet trees induced by T a gene tree Theorem: Statistically consistent under the multispecies coalescent model when solved exactly 11

ASTRAL-I and ASTRAL-II [Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015] Solve the problem exactly using dynamic programming Constrains the search space to make large datasets feasible The constrained version remains statistically consistent Running time: polynomially increases with the number of genes and the number of species 12

ASTRAL-I and ASTRAL-II [Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015] Solve the problem exactly using dynamic programming Constrains the search space to make large datasets feasible The constrained version remains statistically consistent Running time: polynomially increases with the number of genes and the number of species ASTRAL-II: Increased the search space Improved the running time Can handle polytomies (lack of resolution) in input gene trees 12

Simulation study True (model) species tree True gene trees Sequence data Finch Falcon Owl Eagle Pigeon Finch Owl Falcon Eagle Pigeon Es mated species tree Es mated gene trees 13

Simulation study True (model) species tree True gene trees Sequence data Finch Falcon Owl Eagle Pigeon Finch Owl Falcon Eagle Pigeon Es mated species tree Es mated gene trees Evaluate using the FN rate: the percentage of branches in the true tree that are missing from the estimated tree 13

Number of species impacts estimation error in the species tree 1000 genes, medium levels of ILS, simulated species trees [S. Mirarab, T. Warnow, 2015] 14

ASTRAL: accurate and scalable Species tree topological error (FN) 16% 12% 8% 4% ASTRAL II MP EST 10 50 100 1000 genes, medium levels of ILS, simulated species trees [S. Mirarab, T. Warnow, 2015] 15

ASTRAL: accurate and scalable Species tree topological error (FN) 16% 12% 8% 4% ASTRAL II MP EST 10 50 100 200 500 1000 number of species 1000 genes, medium levels of ILS, simulated species trees [S. Mirarab, T. Warnow, 2015] 15

Comparison to concatenation: depends on the ILS level 30% 1000 genes 200 genes 50 genes Species tree topological error (FN) 20% 10% ASTRAL II NJst CA ML 0% L M H L M H L M H 10M 2M 500K 10M 2M 500K 10M 2M 500K tree length (controls the amount of ILS) more ILS more ILS more ILS 200 species, recent ILS 16

Comparison to concatenation: depends on the ILS level 30% 1000 genes 200 genes 50 genes Species tree topological error (FN) 20% 10% ASTRAL II NJst CA ML 0% L M H L M H L M H 10M 2M 500K 10M 2M 500K 10M 2M 500K tree length (controls the amount of ILS) more ILS more ILS more ILS 200 species, recent ILS 16

Horizontal Gene Transfer (HGT) [R. Davidson et al., BMC Genomics. 16 (2015)] Model violation: the simulated discordance is due to both ILS and randomly distributed HGT ~30% discordance ~35% ~55% ~70% 50 species, 10 genes

Horizontal Gene Transfer (HGT) [R. Davidson et al., BMC Genomics. 16 (2015)] Randomly distributed HGT is tolerated with enough genes # genes 50 species, varying # genes ~30% discordance ~55% ~35% ~70%

UCSD Work 1. Statistical support 2. ASTRAL-III Better running time GPU and CPU Parallelism Multi-Individual datasets 3. Impact of fragmentary data 19

Branch support Traditional approach: Multi-locus bootstrapping (MLBS) Slow: requires bootstrapping all genes (e.g., 100 m ML trees) Inaccurate and hard to interpret [Mirarab et al., Sys bio, 2014; Bayzid et al., PLoS One, 2015] We can do better! 20 [Mirarab et al., Sys bio, 2014]

Local posterior probability Recall quartet frequencies follow a multinomial distribution m 1 = 80 m 2 = 63 m 3 = 57 Chimp Gorilla Orang. Chimp Gorilla Chimp Human θ 1 Orang. Human Gorilla Human θ 2 θ 3 Orang. P ( topology seen in m 1 / m gene trees is the species tree ) = P ( θ 1 > 1/3 ) 21

Local posterior probability Recall quartet frequencies follow a multinomial distribution m 1 = 80 m 2 = 63 m 3 = 57 Chimp Gorilla Orang. Chimp Gorilla Chimp Human Orang. Gorilla Human Orang. P ( topology seen in m 1 / m gene trees is the species tree ) = P ( θ 1 > 1/3 ) θ 1 Human Can be analytically solved θ 2 θ 3 We implemented this idea in astral and called the measure the local posterior probability (localpp) 21

Quartet support v.s. localpp quartet frequency Increased number of genes (m) increased support Decreased discordance increased support 22

localpp is more accurate than bootstrapping 1.00 MLBS Local PP Recall 0.75 0.50 0.25 100X faster 0.00 0.00 0.25 0.50 0.75 1.0 False Positive Rate Avian simulated dataset (48 taxa, 1000 genes) [Sayyari and Mirarab, MBE, 2016] 23

ASTRAL can also estimate internal branch lengths estimated branch length (log scale) 2.5 0.0 2.5 5.0 7.5 True gene trees 7.5 5.0 2.5 0.0 2.5 true branch length (log scale) With true gene trees, ASTRAL correctly estimates BL 24

ASTRAL can also estimate internal branch lengths estimated branch length (log scale) low gene tree error 2.5 0.0 2.5 Medium g.t. error 5.0 7.5 Moderate g.t. error True gene trees High gene tree error 7.5 5.0 2.5 0.0 2.5 true branch length (log scale) With error-prone With true estimated gene trees, gene ASTRAL trees, correctly ASTRAL estimates underestimates BL BL 24

ASTRAL-III and beyond (unpublished work) Maryam Rabiee Hashemi John Yin 25 Chao Zhang

ASTRAL-III Chao Zhang Improved running time (to be released in July) Can handle polytomies without slowing down Can exploit similarities between gene trees to reduce running time 26

ASTRAL-III Chao Zhang Improved running time (to be released in July) Can handle polytomies without slowing down Can exploit similarities between gene trees to reduce running time To be released in August Handling datasets with multiple individuals per species GPU and CPU parallelism 26

Low support branches Does it help to contract branches with low support? Yes, but only for 1000 very low supports 16% Mean GT error Low (<50%) Theoretical justifications not clear Species tree error (FN ratio) 12% 8% High (>50%) Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping 4% non 0 3 5 7 10 20 33 50 75 contraction threshold 27

Low support branches Does it help to contract branches with low support? Yes, but only for very low supports 25% 16% 1000 200 Mean GT error Low (<50%) Theoretical justifications not clear Species tree error (FN ratio) 20% 12% 15% 8% 10% High (>50%) Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping 4% non 0 3 5 7 10 20 33 50 75 contraction threshold 27

Multiple individuals Maryam Rabiee Hashemi What if we sample multiple individuals from each species? In recently diverged species individuals may have different trees for each gene Sampling multiple individuals may provide extra signal 28

Multiple individuals helpful? Species tree error (RF distance) 0.3 0.2 0.1 0.0 1 ind. 2 ind. 5 ind. Variable effort 500 genes; 200 species Fixed effort genes x inds = 1000; 200 species Fixed effort 500 genes; species x inds = 200 Yes, it marginally helps accuracy 29

Multiple individuals helpful? Species tree error (RF distance) 0.3 0.2 0.1 0.0 1 ind. 2 ind. 5 ind. Variable effort 500 genes; 200 species Fixed effort genes x inds = 1000; 200 species Fixed effort 500 genes; species x inds = 200 Yes, it marginally helps accuracy But not if sequencing effort is kept fixed 29

Multiple individuals helpful? Species tree error (RF distance) 0.3 0.2 0.1 0.0 1 ind. 2 ind. 5 ind. Variable effort 500 genes; 200 species Fixed effort genes x inds = 1000; 200 species Fixed effort 500 genes; species x inds = 200 Yes, it marginally helps accuracy But not if sequencing effort is kept fixed 29

Parallelism 25 John Yin GPU (1000 species) 20 GPU (500 species) 15 Speed up 10 5 0 1000 species 500 species 1 2 4 8 12 16 20 24 # CPU cores (comet) Can infer trees with 10,000 species & 400 genes in a day 30

Impact of fragmentary data Erfan Sayyari James B. Whitfield 31

Two types of missing data Missing genes: a taxon misses the gene fully (taxon occupancy) Figure from Hosner et al, 2016 Fragmentary data: the species is partially sequenced for some genes 32

Does missing data matter? Missing genes: Evidence suggests summary methods are often robust 33

Does missing data matter? Missing genes: Evidence suggests summary methods are often robust Fragmentary data: Less extensively studied Hosner et al. indicated: Fragments matter They suggest removing genes with many fragments 33

Our study Empirical data: An insect dataset of 144 species and 1478 genes 600 million years of evolution Transcriptomics, so plenty of missing data of both types Simulated data: 100 taxon, medium ILS, 1000 genes Added fragmentation with patterns similar to the insect data 34

Fragmentary data hurt gene trees and species trees 1.00 Gene tree error NRF Distance 0.75 0.50 Species tree error: 2.8% error 0.25 no filtering 10 20 25 33 50 66 75 80 orig seq with fragments Filtering method without fragments 35

Fragmentary data hurt gene trees and species trees 1.00 Fragmentary data dramatically increase gene tree and the species tree error Gene tree error NRF Distance 0.75 0.50 Species tree error: 7.0% error Species tree error: 2.8% error 0.25 no filtering 10 20 25 33 50 66 75 80 orig seq with fragments Filtering method without fragments 35

How do we solve this? Hosner et al. suggested removing entire genes. This is too conservative in our opinion. 36

How do we solve this? Hosner et al. suggested removing entire genes. This is too conservative in our opinion. We simply remove the fragmentary sequence from the gene and keep the rest of the gene What is fragmentary? Anything that has at most X% of sites We study different X values 36

Filtering helps to some level (simulations) FastTree RAxML 0.6 NRF Distance 0.4 0.2 no filtering10 20 25 33 50 66 75 80 orig seq Filtering method Gene tree error 37

Filtering helps to some level (simulations) FastTree RAxML 0.06 NRF Distance 0.6 0.4 0.2 0.05 0.04 0.03 no filtering10 20 25 33 50 66 75 80 orig seq Filtering method no-filter 10 20 25 33 50 66 75 80 Gene tree error Species tree error 37

Filtering helps to some level (simulations) FastTree RAxML 0.06 NRF Distance 0.6 0.4 0.2 0.05 Novel results: RAxML is much better than 0.04 FastTree 0.03 no filtering10 20 25 33 50 66 75 80 orig seq Filtering method no-filter 10 20 25 33 50 66 75 80 Gene tree error Species tree error 37

How about real data? 38

Gene trees seem to improve (less gene tree discordance) 1.00 66% filtering 0.75 0.50 Proportion of relevant gene trees 0.25 0.00 1.00 0.75 0.50 no-filtering 0.25 0.00 Diptera Mecoptera Siphonaptera Lepidoptera Trichoptera Coleoptera Strepsiptera Neuropterida Hymenoptera Psocodea Hemiptera Thysanoptera Blattodea Mantodea Grylloblattodea Orthoptera Phasmida Embiidina Plecoptera Dermaptera Ephemeroptera Odonata Thysanura Archaeognatha Diplura Collembola Important Ingroup A B C B/C D D/B/C E F G H I J K L M N P Strongly Supported Weakly Supported Weakly Reject Strongly Reject 39

Gene trees seem to improve (less gene tree discordance) 1.00 66% filtering 0.75 0.50 Proportion of relevant gene trees 0.25 0.00 1.00 0.75 0.50 no-filtering 0.25 0.00 Diptera Mecoptera Siphonaptera Lepidoptera Trichoptera Coleoptera Strepsiptera Neuropterida Hymenoptera Psocodea Hemiptera Thysanoptera Blattodea Mantodea Grylloblattodea Orthoptera Phasmida Embiidina Plecoptera Dermaptera Ephemeroptera Odonata Thysanura Archaeognatha Diplura Collembola Important Ingroup A B C B/C D D/B/C E F G H I J K L M N P Strongly Supported Weakly Supported Weakly Reject Strongly Reject 39

The species tree also improves ASTRAL trees differed with concatenation initially Ingroup A B C B/C D D/B/C After filtering, ASTRAL and concatenation were very similar E F G H I J K L Differences on controversial nodes M N P no-filtering 20 25 33 50 66 75 80 no-filtering 20 25 33 50 66 75 80 RAxML+ASTRAL is more robust than FastTree+ASTRAL 40 FastTree Weakly rejected (<0.95) Supported Strongly rejected (>=0.95) 0% 50% 100% RAxML

Visualizing discordance: DiscoVista Erfan Sayyari James B. Whitfield 41

DiscoVista Simple and interpretable Easy to run Focus on important branches 42

Quartet-based gene tree discordance Input: A species tree A set of gene trees Mapping of species to groups of interest Output: Quartet frequency per interesting branches Dataset (plants): Wicket et al, PNAS, 2014 43

Compare different datasets Datasets (metazoa): Rouse et al, Nature, 2016, Cannon et al, Nature, 2016 44

More traditional gene tree discordance measures Input: Gene trees Hypothetical clades Output: What percent of gene trees are compatible with those clades Dataset (insects): Misof et al, Science, 2014 45

Species tree comparison Datasets (metazoa): Rouse et al, Nature, 2016, Cannon et al, Nature, 2016 46

Summarize ASTRAL reconstructs species trees from gene trees with both high accuracy scalability to large datasets robust to some levels of model violations Like any other summary method, ASTRAL remains sensitive to gene tree estimation error try best to obtain highly accurate gene trees removing fragmentary data seems to help Visualizing discordance may prove illuminating 47

Future of ASTRAL Can it be changed to use characters directly? More broadly, can alternative scalable methods be developed for better gene tree estimation? 48

Future of ASTRAL Can it be changed to use characters directly? More broadly, can alternative scalable methods be developed for better gene tree estimation? ASTRAL scales to 10K leaves. We have 90K bacterial genomes. Can we scale further? 48

Tandy Warnow James B. Whitfield Maryam Rabiee Hashemi Chao Zhang S.M. Bayzid Théo Zimmermann John Yin Erfan Sayyari Shubhanshu Shekhar

Theoretical sample complexity results How many genes are enough to reconstruct the tree? 60000 50000 ε 0.05 0.1 0.2 0.33 # genes 40000 Balanced Caterpillar Theoretical 30000 20000 10000 0 0 100 1 f 200 50

Gene trees seem to improve (bootstrap) d 50% 40% Percent 30% 20% 10% no filtering20 25 33 50 66 75 80 Fragment filtering threshold Bootstrap support =0 <=33 >=75 =100 51

Gene trees seem to improve (less gene tree discordance) ) FN distance between species and gene trees 1.00 0.75 0.50 0.25 0.00 occupancy 0.6 0.7 0.8 0.9 no-filtering 20 25 33 50 66 75 90 Filtering method 52

Impact of gene tree error (using true gene trees) ASTRAL II ASTRAL II (true gt) CA ML Species tree topological error (FN) Species tree topo 0.4 0.3 0.2 0.1 0.0 Low ILS Medium ILS High ILS 50 200 1000 50 200 1000 50 200 1000 genes 53

Impact of gene tree error (using true gene trees) ASTRAL II ASTRAL II (true gt) CA ML Species tree topological error (FN) Species tree topo 0.4 0.3 0.2 0.1 0.0 Low ILS Medium ILS High ILS 50 200 1000 50 200 1000 50 200 1000 genes When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to be better with low error 53

Running time (ho 3 0 12 9 6 3 0 ASTRAL-I versus ASTRAL-II 50 200 1000 50 200 1000 50 200 1000 genes 1e 07 ASTRAL I ASTRAL II ASTRAL II + true st Species tree topological error (FN) 0.8 igure S2: 0.6 Comparison of various variants of ASTRAL with 200 taxa and varying tree sha 0.0 Low ILS Medium ILS High ILS nd number 0.4 of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRALrue st shows 0.2the case where the true species tree is added to the search space; this is included to approxim n ideal (e.g. exact) solution to the quartet problem. 50 200 1000 50 200 1000 50 200 1000 genes s) 12 9 5 200 species, deep ILS 54 1e

Running time (ho 3 0 12 9 6 3 0 ASTRAL-I versus ASTRAL-II 50 200 1000 50 200 1000 50 200 1000 genes 1e 07 ASTRAL I ASTRAL II ASTRAL II + true st Species tree topological error (FN) 0.8 igure S2: 0.6 Comparison of various variants of ASTRAL with 200 taxa and varying tree sha 0.0 Low ILS Medium ILS High ILS nd number 0.4 of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRALrue st shows 0.2the case where the true species tree is added to the search space; this is included to approxim n ideal (e.g. exact) solution to the quartet problem. 50 200 1000 50 200 1000 50 200 1000 genes s) 12 9 5 200 species, deep ILS 54 1e

Dynamic programming L A L-A 55

Dynamic programming L A L-A A A-A A A-A A A-A Recursively break subsets of species into smaller subsets 55

Dynamic programming L A A L-A u L-A A-A A A-A A A-A A A-A Recursively break subsets of species into smaller subsets w(u) : Compare u against input gene trees and compute quartets from gene trees satisfied by u 55

Dynamic programming L Exponential A A L-A u L-A A-A A A-A A A-A A A-A Recursively break subsets of species into smaller subsets w(u) : Compare u against input gene trees and compute quartets from gene trees satisfied by u 55

Constrained version L A1 A2 A L-A A4 A10 A5 A6 A8 A9 A3 A A-A A A-A A A-A A7 56

Constrained version L A1 A2 A L-A A4 A10 A5 A6 A8 A9 A3 A A-A A A-A A A-A A7 Restrict branches in the species tree to a given constraint set X. 56

Deep ILS 30% 1000 genes 200 genes 50 genes Species tree topological error (FN) 20% 10% ASTRAL II NJst CA ML 10M 2M 500K 10M 2M 500K 10M 2M 500K tree length (controls the amount of ILS) 200 species, simulated species trees, deep ILS [Mirarab and Warnow, ISMB, 2015] 57