Reconstruction of species trees from gene trees using ASTRAL Siavash Mirarab University of California, San Diego (ECE)
Phylogenomics Orangutan Chimpanzee gene 1 gene 2 gene 999 gene 1000 Gorilla Human ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT Inference error gene here refers to a portion of the genome (not a functional gene) More data (#genes) 2
Gene tree discordance gene 1 gene1000 Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. 3
Gene tree discordance The species tree gene 1 gene1000 Gorilla Human Chimp Orangutan Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. A gene tree 3
Gene tree discordance The species tree gene 1 gene1000 Gorilla Human Chimp Orangutan Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. A gene tree Causes of gene tree discordance include: Incomplete Lineage Sorting (ILS) Duplication and loss Horizontal Gene Transfer (HGT) 3
Incomplete Lineage Sorting (ILS) A random process related to the coalescence of alleles across various populations Tracing alleles through generations 4
Incomplete Lineage Sorting (ILS) A random process related to the coalescence of alleles across various populations Tracing alleles through generations 4
Incomplete Lineage Sorting (ILS) A random process related to the coalescence of alleles across various populations Tracing alleles through generations Omnipresent: possible for every tree Likely for short branches or large population sizes 4
MSC and Identifiability The statistical process multi-species coalescent (MSC) is used to model ILS. 5
MSC and Identifiability The statistical process multi-species coalescent (MSC) is used to model ILS. Any species tree defines a unique distribution on the set of all possible gene trees 5
MSC and Identifiability The statistical process multi-species coalescent (MSC) is used to model ILS. Any species tree defines a unique distribution on the set of all possible gene trees In principle, the species tree can be identified despite high discordance from the gene tree distribution 5
Multi-gene tree estimation pipelines gene 1 ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG gene 1 supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G gene 2 gene 1000 Approach 1: Concatenation Orangutan CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Gorilla Chimpanzee Human gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G Gene tree estimation Approach 2: Summary methods Chimp Gorilla gene 1000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene 1 gene 2 gene 1000 Human Gorilla Human Orang. Human Orang. Chimp Orang. Chimp Gorilla Summary method Orangutan Gorilla Chimpanzee Human There are also other very good approaches: co-estimation (e.g., *BEAST), site-based (e.g., SVDQuartets, SNAPP) 6
Multi-gene tree estimation pipelines gene 1 ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G gene 1 supermatrix ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G gene 2 gene 1000 Gene tree estimation Approach 1: Concatenation Orangutan CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Phylogeny inference Approach 2: Summary methods Gorilla Chimpanzee Statistically inconsistent [Roch and Steel, 2014] Human Chimp Gorilla Error gene 1000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene 1 gene 2 gene 1000 Human Gorilla Human Orang. Human Orang. Chimp Orang. Chimp Gorilla Summary method Orangutan Gorilla Chimpanzee Human Data Can be statistically consistent given true gene trees Error-free gene trees 6
Multi-gene tree estimation pipelines Error gene 1 ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG gene 2 CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G gene 1000 CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene 1 ACTGCACACCG CTGAGCATCG ACTGC-CCCCG CTGAGC-TCG AATGC-CCCCG ATGAGC-TC- -CTGCACACGGCTGA-CAC-G gene 1 supermatrix gene 2 gene 1000 Gene tree estimation Chimp Approach 1: Concatenation Orangutan CAGAGCACGCACGAA AGCA-CACGC-CATA ATGAGCACGC-C-TA AGC-TAC-CACGGAT Gorilla Approach 2: Summary methods Summary method Phylogeny inference Gorilla Orangutan Human Orang. BUCKy (population tree), Gorilla Chimp gene 2 MP-EST, NJst (ASTRID), Human Orang. Gorilla Chimpanzee Statistically inconsistent [Roch and Steel, 2014] STAR, STELLS, GLASS, Human Chimpanzee Human ASTRAL, Orang. ASTRAL-II, ASTRAL-III Chimp gene 1000 Human Gorilla Data Can be statistically consistent given true gene trees Error-free gene trees 6
ASTRAL used by the biologists Plants: Wickett, et al., 2014, PNAS Birds: Prum, et al., 2015, Nature Xenoturbella, Cannon et al., 2016, Nature Xenoturbella, Rouse et al., 2016, Nature Flatworms: Laumer, et al., 2015, elife Shrews: Giarla, et al., 2015, Syst. Bio. Frogs: Yuan et al., 2016, Syst. Bio. Tomatoes: Pease, et al., 2016, PLoS Bio. ASTRAL-II ASTRAL Angiosperms: Huang et al., 2016, MBE Worms: Andrade, et al., 2015, MBE
This talk: ASTRAL Outlines of the method Accuracy in simulation studies With strong model violations The impact of Fragmentary sequence data Sampling multiple individuals Visualization 8
Unrooted quartets under MSC model For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010) Chimp Gorilla θ 1 =70% θ 2 =15% θ 3 =15% d=0.8 Chimp Gorilla Orang. Chimp Gorilla Chimp Human Orang. Human Orang. Human Gorilla Human Orang. 9
Unrooted quartets under MSC model For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010) Chimp Gorilla θ 1 =70% θ 2 =15% θ 3 =15% d=0.8 Chimp Gorilla Orang. Chimp Gorilla Chimp Human Orang. Human Orang. Human Gorilla Human Orang. The most frequent gene tree = The most likely species tree 9
Unrooted quartets under MSC model For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010) Chimp Gorilla θ 1 =70% θ 2 =15% θ 3 =15% d=0.8 Chimp Gorilla Orang. Chimp Gorilla Chimp Human Orang. Human Orang. Human Gorilla Human Orang. The most frequent gene tree = The most likely species tree shorter branches more discordance a harder species tree reconstruction problem 9
More than 4 species For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013) Gorilla Human Chimp Orang. Rhesus 10
More than 4 species For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013) Gorilla Human Chimp Orang. Rhesus 1. Break gene trees into ( n 4 ) quartets of species 2. Find the dominant tree for all quartets of taxa 3. Combine quartet trees Some tools (e.g.. BUCKy-p [Larget, et al., 2010]) 10
More than 4 species For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013) Gorilla Human Chimp Orang. Rhesus Alternative: 1. Break gene trees into ( n 4 ) quartets of species weight all 3( n 4 ) quartet topologies 2. Find the dominant tree for all quartets of taxa by their frequency and find the optimal tree 3. Combine quartet trees Some tools (e.g.. BUCKy-p [Larget, et al., 2010]) 10
Maximum Quartet Support Species Tree Optimization problem: Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees Score(T )= mx Q(T ) \ Q(t i ) 1 the set of quartet trees induced by T a gene tree 11
Maximum Quartet Support Species Tree Optimization problem: Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees Score(T )= mx Q(T ) \ Q(t i ) 1 the set of quartet trees induced by T a gene tree Theorem: Statistically consistent under the multispecies coalescent model when solved exactly 11
ASTRAL-I and ASTRAL-II [Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015] Solve the problem exactly using dynamic programming Constrains the search space to make large datasets feasible The constrained version remains statistically consistent Running time: polynomially increases with the number of genes and the number of species 12
ASTRAL-I and ASTRAL-II [Mirarab, et al., Bioinformatics, 2014] [Mirarab and Warnow, Bioinformatics, 2015] Solve the problem exactly using dynamic programming Constrains the search space to make large datasets feasible The constrained version remains statistically consistent Running time: polynomially increases with the number of genes and the number of species ASTRAL-II: Increased the search space Improved the running time Can handle polytomies (lack of resolution) in input gene trees 12
Simulation study True (model) species tree True gene trees Sequence data Finch Falcon Owl Eagle Pigeon Finch Owl Falcon Eagle Pigeon Es mated species tree Es mated gene trees 13
Simulation study True (model) species tree True gene trees Sequence data Finch Falcon Owl Eagle Pigeon Finch Owl Falcon Eagle Pigeon Es mated species tree Es mated gene trees Evaluate using the FN rate: the percentage of branches in the true tree that are missing from the estimated tree 13
Number of species impacts estimation error in the species tree 1000 genes, medium levels of ILS, simulated species trees [S. Mirarab, T. Warnow, 2015] 14
ASTRAL: accurate and scalable Species tree topological error (FN) 16% 12% 8% 4% ASTRAL II MP EST 10 50 100 1000 genes, medium levels of ILS, simulated species trees [S. Mirarab, T. Warnow, 2015] 15
ASTRAL: accurate and scalable Species tree topological error (FN) 16% 12% 8% 4% ASTRAL II MP EST 10 50 100 200 500 1000 number of species 1000 genes, medium levels of ILS, simulated species trees [S. Mirarab, T. Warnow, 2015] 15
Comparison to concatenation: depends on the ILS level 30% 1000 genes 200 genes 50 genes Species tree topological error (FN) 20% 10% ASTRAL II NJst CA ML 0% L M H L M H L M H 10M 2M 500K 10M 2M 500K 10M 2M 500K tree length (controls the amount of ILS) more ILS more ILS more ILS 200 species, recent ILS 16
Comparison to concatenation: depends on the ILS level 30% 1000 genes 200 genes 50 genes Species tree topological error (FN) 20% 10% ASTRAL II NJst CA ML 0% L M H L M H L M H 10M 2M 500K 10M 2M 500K 10M 2M 500K tree length (controls the amount of ILS) more ILS more ILS more ILS 200 species, recent ILS 16
Horizontal Gene Transfer (HGT) [R. Davidson et al., BMC Genomics. 16 (2015)] Model violation: the simulated discordance is due to both ILS and randomly distributed HGT ~30% discordance ~35% ~55% ~70% 50 species, 10 genes
Horizontal Gene Transfer (HGT) [R. Davidson et al., BMC Genomics. 16 (2015)] Randomly distributed HGT is tolerated with enough genes # genes 50 species, varying # genes ~30% discordance ~55% ~35% ~70%
UCSD Work 1. Statistical support 2. ASTRAL-III Better running time GPU and CPU Parallelism Multi-Individual datasets 3. Impact of fragmentary data 19
Branch support Traditional approach: Multi-locus bootstrapping (MLBS) Slow: requires bootstrapping all genes (e.g., 100 m ML trees) Inaccurate and hard to interpret [Mirarab et al., Sys bio, 2014; Bayzid et al., PLoS One, 2015] We can do better! 20 [Mirarab et al., Sys bio, 2014]
Local posterior probability Recall quartet frequencies follow a multinomial distribution m 1 = 80 m 2 = 63 m 3 = 57 Chimp Gorilla Orang. Chimp Gorilla Chimp Human θ 1 Orang. Human Gorilla Human θ 2 θ 3 Orang. P ( topology seen in m 1 / m gene trees is the species tree ) = P ( θ 1 > 1/3 ) 21
Local posterior probability Recall quartet frequencies follow a multinomial distribution m 1 = 80 m 2 = 63 m 3 = 57 Chimp Gorilla Orang. Chimp Gorilla Chimp Human Orang. Gorilla Human Orang. P ( topology seen in m 1 / m gene trees is the species tree ) = P ( θ 1 > 1/3 ) θ 1 Human Can be analytically solved θ 2 θ 3 We implemented this idea in astral and called the measure the local posterior probability (localpp) 21
Quartet support v.s. localpp quartet frequency Increased number of genes (m) increased support Decreased discordance increased support 22
localpp is more accurate than bootstrapping 1.00 MLBS Local PP Recall 0.75 0.50 0.25 100X faster 0.00 0.00 0.25 0.50 0.75 1.0 False Positive Rate Avian simulated dataset (48 taxa, 1000 genes) [Sayyari and Mirarab, MBE, 2016] 23
ASTRAL can also estimate internal branch lengths estimated branch length (log scale) 2.5 0.0 2.5 5.0 7.5 True gene trees 7.5 5.0 2.5 0.0 2.5 true branch length (log scale) With true gene trees, ASTRAL correctly estimates BL 24
ASTRAL can also estimate internal branch lengths estimated branch length (log scale) low gene tree error 2.5 0.0 2.5 Medium g.t. error 5.0 7.5 Moderate g.t. error True gene trees High gene tree error 7.5 5.0 2.5 0.0 2.5 true branch length (log scale) With error-prone With true estimated gene trees, gene ASTRAL trees, correctly ASTRAL estimates underestimates BL BL 24
ASTRAL-III and beyond (unpublished work) Maryam Rabiee Hashemi John Yin 25 Chao Zhang
ASTRAL-III Chao Zhang Improved running time (to be released in July) Can handle polytomies without slowing down Can exploit similarities between gene trees to reduce running time 26
ASTRAL-III Chao Zhang Improved running time (to be released in July) Can handle polytomies without slowing down Can exploit similarities between gene trees to reduce running time To be released in August Handling datasets with multiple individuals per species GPU and CPU parallelism 26
Low support branches Does it help to contract branches with low support? Yes, but only for 1000 very low supports 16% Mean GT error Low (<50%) Theoretical justifications not clear Species tree error (FN ratio) 12% 8% High (>50%) Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping 4% non 0 3 5 7 10 20 33 50 75 contraction threshold 27
Low support branches Does it help to contract branches with low support? Yes, but only for very low supports 25% 16% 1000 200 Mean GT error Low (<50%) Theoretical justifications not clear Species tree error (FN ratio) 20% 12% 15% 8% 10% High (>50%) Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping 4% non 0 3 5 7 10 20 33 50 75 contraction threshold 27
Multiple individuals Maryam Rabiee Hashemi What if we sample multiple individuals from each species? In recently diverged species individuals may have different trees for each gene Sampling multiple individuals may provide extra signal 28
Multiple individuals helpful? Species tree error (RF distance) 0.3 0.2 0.1 0.0 1 ind. 2 ind. 5 ind. Variable effort 500 genes; 200 species Fixed effort genes x inds = 1000; 200 species Fixed effort 500 genes; species x inds = 200 Yes, it marginally helps accuracy 29
Multiple individuals helpful? Species tree error (RF distance) 0.3 0.2 0.1 0.0 1 ind. 2 ind. 5 ind. Variable effort 500 genes; 200 species Fixed effort genes x inds = 1000; 200 species Fixed effort 500 genes; species x inds = 200 Yes, it marginally helps accuracy But not if sequencing effort is kept fixed 29
Multiple individuals helpful? Species tree error (RF distance) 0.3 0.2 0.1 0.0 1 ind. 2 ind. 5 ind. Variable effort 500 genes; 200 species Fixed effort genes x inds = 1000; 200 species Fixed effort 500 genes; species x inds = 200 Yes, it marginally helps accuracy But not if sequencing effort is kept fixed 29
Parallelism 25 John Yin GPU (1000 species) 20 GPU (500 species) 15 Speed up 10 5 0 1000 species 500 species 1 2 4 8 12 16 20 24 # CPU cores (comet) Can infer trees with 10,000 species & 400 genes in a day 30
Impact of fragmentary data Erfan Sayyari James B. Whitfield 31
Two types of missing data Missing genes: a taxon misses the gene fully (taxon occupancy) Figure from Hosner et al, 2016 Fragmentary data: the species is partially sequenced for some genes 32
Does missing data matter? Missing genes: Evidence suggests summary methods are often robust 33
Does missing data matter? Missing genes: Evidence suggests summary methods are often robust Fragmentary data: Less extensively studied Hosner et al. indicated: Fragments matter They suggest removing genes with many fragments 33
Our study Empirical data: An insect dataset of 144 species and 1478 genes 600 million years of evolution Transcriptomics, so plenty of missing data of both types Simulated data: 100 taxon, medium ILS, 1000 genes Added fragmentation with patterns similar to the insect data 34
Fragmentary data hurt gene trees and species trees 1.00 Gene tree error NRF Distance 0.75 0.50 Species tree error: 2.8% error 0.25 no filtering 10 20 25 33 50 66 75 80 orig seq with fragments Filtering method without fragments 35
Fragmentary data hurt gene trees and species trees 1.00 Fragmentary data dramatically increase gene tree and the species tree error Gene tree error NRF Distance 0.75 0.50 Species tree error: 7.0% error Species tree error: 2.8% error 0.25 no filtering 10 20 25 33 50 66 75 80 orig seq with fragments Filtering method without fragments 35
How do we solve this? Hosner et al. suggested removing entire genes. This is too conservative in our opinion. 36
How do we solve this? Hosner et al. suggested removing entire genes. This is too conservative in our opinion. We simply remove the fragmentary sequence from the gene and keep the rest of the gene What is fragmentary? Anything that has at most X% of sites We study different X values 36
Filtering helps to some level (simulations) FastTree RAxML 0.6 NRF Distance 0.4 0.2 no filtering10 20 25 33 50 66 75 80 orig seq Filtering method Gene tree error 37
Filtering helps to some level (simulations) FastTree RAxML 0.06 NRF Distance 0.6 0.4 0.2 0.05 0.04 0.03 no filtering10 20 25 33 50 66 75 80 orig seq Filtering method no-filter 10 20 25 33 50 66 75 80 Gene tree error Species tree error 37
Filtering helps to some level (simulations) FastTree RAxML 0.06 NRF Distance 0.6 0.4 0.2 0.05 Novel results: RAxML is much better than 0.04 FastTree 0.03 no filtering10 20 25 33 50 66 75 80 orig seq Filtering method no-filter 10 20 25 33 50 66 75 80 Gene tree error Species tree error 37
How about real data? 38
Gene trees seem to improve (less gene tree discordance) 1.00 66% filtering 0.75 0.50 Proportion of relevant gene trees 0.25 0.00 1.00 0.75 0.50 no-filtering 0.25 0.00 Diptera Mecoptera Siphonaptera Lepidoptera Trichoptera Coleoptera Strepsiptera Neuropterida Hymenoptera Psocodea Hemiptera Thysanoptera Blattodea Mantodea Grylloblattodea Orthoptera Phasmida Embiidina Plecoptera Dermaptera Ephemeroptera Odonata Thysanura Archaeognatha Diplura Collembola Important Ingroup A B C B/C D D/B/C E F G H I J K L M N P Strongly Supported Weakly Supported Weakly Reject Strongly Reject 39
Gene trees seem to improve (less gene tree discordance) 1.00 66% filtering 0.75 0.50 Proportion of relevant gene trees 0.25 0.00 1.00 0.75 0.50 no-filtering 0.25 0.00 Diptera Mecoptera Siphonaptera Lepidoptera Trichoptera Coleoptera Strepsiptera Neuropterida Hymenoptera Psocodea Hemiptera Thysanoptera Blattodea Mantodea Grylloblattodea Orthoptera Phasmida Embiidina Plecoptera Dermaptera Ephemeroptera Odonata Thysanura Archaeognatha Diplura Collembola Important Ingroup A B C B/C D D/B/C E F G H I J K L M N P Strongly Supported Weakly Supported Weakly Reject Strongly Reject 39
The species tree also improves ASTRAL trees differed with concatenation initially Ingroup A B C B/C D D/B/C After filtering, ASTRAL and concatenation were very similar E F G H I J K L Differences on controversial nodes M N P no-filtering 20 25 33 50 66 75 80 no-filtering 20 25 33 50 66 75 80 RAxML+ASTRAL is more robust than FastTree+ASTRAL 40 FastTree Weakly rejected (<0.95) Supported Strongly rejected (>=0.95) 0% 50% 100% RAxML
Visualizing discordance: DiscoVista Erfan Sayyari James B. Whitfield 41
DiscoVista Simple and interpretable Easy to run Focus on important branches 42
Quartet-based gene tree discordance Input: A species tree A set of gene trees Mapping of species to groups of interest Output: Quartet frequency per interesting branches Dataset (plants): Wicket et al, PNAS, 2014 43
Compare different datasets Datasets (metazoa): Rouse et al, Nature, 2016, Cannon et al, Nature, 2016 44
More traditional gene tree discordance measures Input: Gene trees Hypothetical clades Output: What percent of gene trees are compatible with those clades Dataset (insects): Misof et al, Science, 2014 45
Species tree comparison Datasets (metazoa): Rouse et al, Nature, 2016, Cannon et al, Nature, 2016 46
Summarize ASTRAL reconstructs species trees from gene trees with both high accuracy scalability to large datasets robust to some levels of model violations Like any other summary method, ASTRAL remains sensitive to gene tree estimation error try best to obtain highly accurate gene trees removing fragmentary data seems to help Visualizing discordance may prove illuminating 47
Future of ASTRAL Can it be changed to use characters directly? More broadly, can alternative scalable methods be developed for better gene tree estimation? 48
Future of ASTRAL Can it be changed to use characters directly? More broadly, can alternative scalable methods be developed for better gene tree estimation? ASTRAL scales to 10K leaves. We have 90K bacterial genomes. Can we scale further? 48
Tandy Warnow James B. Whitfield Maryam Rabiee Hashemi Chao Zhang S.M. Bayzid Théo Zimmermann John Yin Erfan Sayyari Shubhanshu Shekhar
Theoretical sample complexity results How many genes are enough to reconstruct the tree? 60000 50000 ε 0.05 0.1 0.2 0.33 # genes 40000 Balanced Caterpillar Theoretical 30000 20000 10000 0 0 100 1 f 200 50
Gene trees seem to improve (bootstrap) d 50% 40% Percent 30% 20% 10% no filtering20 25 33 50 66 75 80 Fragment filtering threshold Bootstrap support =0 <=33 >=75 =100 51
Gene trees seem to improve (less gene tree discordance) ) FN distance between species and gene trees 1.00 0.75 0.50 0.25 0.00 occupancy 0.6 0.7 0.8 0.9 no-filtering 20 25 33 50 66 75 90 Filtering method 52
Impact of gene tree error (using true gene trees) ASTRAL II ASTRAL II (true gt) CA ML Species tree topological error (FN) Species tree topo 0.4 0.3 0.2 0.1 0.0 Low ILS Medium ILS High ILS 50 200 1000 50 200 1000 50 200 1000 genes 53
Impact of gene tree error (using true gene trees) ASTRAL II ASTRAL II (true gt) CA ML Species tree topological error (FN) Species tree topo 0.4 0.3 0.2 0.1 0.0 Low ILS Medium ILS High ILS 50 200 1000 50 200 1000 50 200 1000 genes When we divide our 50 replicates into low, medium, or high gene tree estimation error, ASTRAL tends to be better with low error 53
Running time (ho 3 0 12 9 6 3 0 ASTRAL-I versus ASTRAL-II 50 200 1000 50 200 1000 50 200 1000 genes 1e 07 ASTRAL I ASTRAL II ASTRAL II + true st Species tree topological error (FN) 0.8 igure S2: 0.6 Comparison of various variants of ASTRAL with 200 taxa and varying tree sha 0.0 Low ILS Medium ILS High ILS nd number 0.4 of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRALrue st shows 0.2the case where the true species tree is added to the search space; this is included to approxim n ideal (e.g. exact) solution to the quartet problem. 50 200 1000 50 200 1000 50 200 1000 genes s) 12 9 5 200 species, deep ILS 54 1e
Running time (ho 3 0 12 9 6 3 0 ASTRAL-I versus ASTRAL-II 50 200 1000 50 200 1000 50 200 1000 genes 1e 07 ASTRAL I ASTRAL II ASTRAL II + true st Species tree topological error (FN) 0.8 igure S2: 0.6 Comparison of various variants of ASTRAL with 200 taxa and varying tree sha 0.0 Low ILS Medium ILS High ILS nd number 0.4 of genes. Species tree accuracy (top) and running times (bottom) are shown. ASTRALrue st shows 0.2the case where the true species tree is added to the search space; this is included to approxim n ideal (e.g. exact) solution to the quartet problem. 50 200 1000 50 200 1000 50 200 1000 genes s) 12 9 5 200 species, deep ILS 54 1e
Dynamic programming L A L-A 55
Dynamic programming L A L-A A A-A A A-A A A-A Recursively break subsets of species into smaller subsets 55
Dynamic programming L A A L-A u L-A A-A A A-A A A-A A A-A Recursively break subsets of species into smaller subsets w(u) : Compare u against input gene trees and compute quartets from gene trees satisfied by u 55
Dynamic programming L Exponential A A L-A u L-A A-A A A-A A A-A A A-A Recursively break subsets of species into smaller subsets w(u) : Compare u against input gene trees and compute quartets from gene trees satisfied by u 55
Constrained version L A1 A2 A L-A A4 A10 A5 A6 A8 A9 A3 A A-A A A-A A A-A A7 56
Constrained version L A1 A2 A L-A A4 A10 A5 A6 A8 A9 A3 A A-A A A-A A A-A A7 Restrict branches in the species tree to a given constraint set X. 56
Deep ILS 30% 1000 genes 200 genes 50 genes Species tree topological error (FN) 20% 10% ASTRAL II NJst CA ML 10M 2M 500K 10M 2M 500K 10M 2M 500K tree length (controls the amount of ILS) 200 species, simulated species trees, deep ILS [Mirarab and Warnow, ISMB, 2015] 57