New methods for estimating species trees from genome-scale data. Tandy Warnow, The University of Illinois

Phylogeny (evolutionary tree) [Figure: tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Phylogenomics = Species trees from whole genomes. "Nothing in biology makes sense except in the light of evolution" - Dobzhansky

The Tree of Life: Multiple Challenges
Scientific challenges: ultra-large multiple-sequence alignment; alignment-free phylogeny estimation; supertree estimation; estimating species trees from many gene trees; genome rearrangement phylogeny; reticulate evolution; visualization of large trees and alignments; data mining techniques to explore multiple optima; theoretical guarantees under Markov models of evolution
Applications: metagenomics; protein structure and function prediction; trait evolution; detection of co-evolution; systems biology
Techniques: graph theory (especially chordal graphs); probability theory and statistics; hidden Markov models; combinatorial optimization; heuristics; supercomputing

Phylogenomics [Figure: sequence alignments for gene 1, gene 2, ..., gene 999, gene 1000, each sampled from Orangutan, Gorilla, Chimpanzee, and Human]
"Gene" here refers to a portion of the genome (not a functional gene). I'll use the term "gene" to refer to "c-genes": recombination-free orthologous stretches of the genome.

Gene tree discordance. Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity. [Figure: gene 1 and gene 1000 have different trees on Gorilla, Human, Chimp, and Orangutan]

Gene trees inside the species tree (Coalescent Process) [Figure: gene tree drawn inside the species tree, from past to present. Courtesy James Degnan] Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Incomplete Lineage Sorting (ILS) confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc. There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.

Avian Phylogenomics Project
E Jarvis, HHMI; MTP Gilbert, Copenhagen; G Zhang, BGI; T. Warnow, UT-Austin; S. Mirarab, UT-Austin; Md. S. Bayzid, UT-Austin; plus many many other people
Approx. 50 species, whole genomes, 14,000 loci (Jarvis, Mirarab, et al., Science 2014)
Major challenges:
- Concatenation analysis took > 250 CPU years, and suggested a rapid radiation
- We observed massive gene tree heterogeneity consistent with incomplete lineage sorting
- Very poor resolution in the 14,000 gene trees (average bootstrap support 25%)
- Standard coalescent-based species tree estimation methods contradicted concatenation analysis and prior studies

1KP: Thousand Transcriptome Project
G. Ka-Shu Wong, U Alberta; J. Leebens-Mack, U Georgia; N. Wickett, Northwestern; N. Matasci, iPlant; T. Warnow, S. Mirarab, N. Nguyen, UT-Austin
- 103 plant transcriptomes, 400-800 single copy genes
- Next phase will be much bigger
- Wickett, Mirarab et al., PNAS 2014
Challenges:
- Massive gene tree heterogeneity consistent with ILS
- Could not use existing coalescent methods due to missing data (many gene trees could not be rooted) and large number of species

This talk
- Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC)
- Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error
- New methods in phylogenomics: Statistical binning (Science 2014) and Weighted Statistical Binning (PLOS One 2015), which improve gene trees; ASTRAL (Bioinformatics 2014, 2015), quartet-based estimation
- Open questions

Sampling multiple genes from multiple species [Figure: tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

A species tree defines a probability distribution on gene trees under the Multi-Species Coalescent (MSC) Model. [Figure: gene tree drawn inside the species tree, from past to present. Courtesy James Degnan] Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Statistical Consistency [Figure: plot with axes "error" and "Data"]

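Main competing approaches. Given gene 1, gene 2, ..., gene k sampled across the species, either combine all genes in a single analysis (Concatenation) or analyze each gene separately and combine the results with a Summary Method.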

Statistically consistent under MSC?
- CA-ML (Concatenation using unpartitioned maximum likelihood) - NO
- Most frequent gene tree - NO
- Minimize Deep Coalescences (MDC) - NO
- Greedy Consensus (GC) - NO
- Matrix Representation with Parsimony (MRP, supertree method) - NO
Hence, none of these standard approaches are proven to converge to the true species tree as the number of loci increases. Many of them are positively misleading (will converge to the wrong tree)!

Anomaly zone. The most probable gene tree on a set S of species may not be the species tree on S (the anomaly zone; ask James Degnan and Noah Rosenberg), except for:
- rooted three-species trees
- unrooted four-species trees
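The three-species exception can be checked numerically. The sketch below assumes the standard MSC result for a rooted species tree ((A,B),C) whose internal branch has length t in coalescent units: the matching gene tree topology has probability 1 - (2/3)e^(-t), and each of the two discordant topologies has probability (1/3)e^(-t). The function name is illustrative and does not come from the talk.

```python
import math

def triplet_probabilities(t):
    """Rooted gene tree topology probabilities under the MSC for the
    species tree ((A,B),C) with internal branch length t (coalescent units)."""
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Probability of each of the two discordant topologies.
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0):
        print(t, triplet_probabilities(t))
```

For any t > 0 the matching topology is strictly the most probable one, which is why no anomaly zone exists for rooted three-species trees.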

Summary Methods...

Summary Methods... Computing rooted species tree from rooted gene trees:
- For every three species {a,b,c}, record most frequent rooted gene tree on {a,b,c}
- Combine rooted three-leaf gene trees into rooted tree if they are compatible
Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
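The first step of the rooted algorithm can be sketched as follows. This is a minimal illustration, assuming rooted gene trees are given as nested Python tuples with string leaf names (a representation chosen here, not one used in the talk); all function names are hypothetical. Ties between equally frequent triplets are broken arbitrarily, and the combining step is left out (one standard option is the BUILD algorithm of Aho et al.).

```python
from collections import Counter
from itertools import combinations

def clades(tree):
    """Return (leaf set of tree, list of leaf sets of all internal nodes)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leafset, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leafset |= child_leaves
        found.extend(child_clades)
    found.append(leafset)
    return leafset, found

def rooted_triplet(tree_clades, a, b, c):
    """Return the sibling pair of the induced rooted triplet, or None if unresolved."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        # x and y are siblings in the triplet if some clade holds both but not z.
        if any(x in cl and y in cl and z not in cl for cl in tree_clades):
            return frozenset([x, y])
    return None

def most_frequent_triplets(gene_trees):
    """Map each species triple to the sibling pair seen most often across gene trees."""
    info = [clades(t) for t in gene_trees]
    species = set().union(*(leaves for leaves, _ in info))
    result = {}
    for trio in combinations(sorted(species), 3):
        tally = Counter()
        for leaves, cl in info:
            if set(trio) <= leaves:
                pair = rooted_triplet(cl, *trio)
                if pair is not None:
                    tally[pair] += 1
        if tally:
            result[trio] = tally.most_common(1)[0][0]
    return result

if __name__ == "__main__":
    trees = [((("H", "C"), "G"), "O"), ((("H", "C"), "G"), "O"), ((("H", "G"), "C"), "O")]
    print(most_frequent_triplets(trees))
```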

Summary Methods... Computing unrooted species tree from unrooted gene trees:
- For every four species {a,b,c,d}, record most frequent unrooted gene tree on {a,b,c,d}
- Combine unrooted four-leaf gene trees into unrooted tree if they are compatible (recursive algorithm based on finding sibling pairs and removing one sibling)
Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
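The recursive combining step can be sketched as below, assuming the dominant quartet tree has already been recorded for every four species. The data layout is an assumption made for illustration: `quartet_of` maps a frozenset of four species to a frozenset holding its two sibling pairs, and the output is a nested tuple whose top level is an unrooted node. Function names are hypothetical. The sketch returns None when no sibling pair can be found, that is, when the quartet trees are not compatible.

```python
from itertools import combinations

def quartet(a, b, c, d):
    """Encode the unrooted quartet tree ab|cd."""
    return frozenset((frozenset((a, b)), frozenset((c, d))))

def graft(tree, leaf, cherry):
    """Replace the given leaf in a nested-tuple tree by the subtree `cherry`."""
    if tree == leaf:
        return cherry
    if isinstance(tree, tuple):
        return tuple(graft(child, leaf, cherry) for child in tree)
    return tree

def tree_from_quartets(species, quartet_of):
    """Build an unrooted tree by repeatedly finding a sibling pair (a, b),
    removing b, recursing, and then re-attaching b next to a."""
    species = list(species)
    if len(species) <= 3:
        return tuple(species)  # at most three leaves: a single star node
    for a, b in combinations(species, 2):
        rest = [s for s in species if s not in (a, b)]
        pair = frozenset((a, b))
        # a and b are siblings iff every quartet containing both groups them together.
        if all(pair in quartet_of[frozenset((a, b, c, d))]
               for c, d in combinations(rest, 2)):
            smaller = tree_from_quartets([s for s in species if s != b], quartet_of)
            return None if smaller is None else graft(smaller, a, (a, b))
    return None

if __name__ == "__main__":
    qs = [quartet("A", "B", "C", "D"), quartet("A", "B", "C", "E"),
          quartet("A", "B", "D", "E"), quartet("A", "C", "D", "E"),
          quartet("B", "C", "D", "E")]
    quartet_of = {frozenset().union(*q): q for q in qs}
    print(tree_from_quartets("ABCDE", quartet_of))
```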

Statistically consistent under ILS?
Coalescent-based summary methods:
- MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree based on rooted triplet tree distribution - YES
- BUCKy-pop (Ané and Larget 2010): quartet-based Bayesian species tree estimation - YES
- And many others (ASTRAL, ASTRID, NJst, GLASS, etc.) - YES
Co-estimation methods:
- *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees - YES
- Co-estimation methods are too slow to use on most datasets, hence the debate is largely between concatenation (traditional approach) and summary methods.
Single-site methods (SMRT, SVDquartets, METAL, SNAPP, and others) - YES
CA-ML (Concatenation using unpartitioned maximum likelihood) - NO
MDC - NO
GC (Greedy Consensus) - NO

Results on 11-taxon datasets with weak ILS [Figure: average FN rate (0 to 0.25) for *BEAST, CA-ML, BUCKy-con, BUCKy-pop, MP-EST, Phylo-exact, MRP, and GC with 5, 10, 25, and 50 genes] *BEAST more accurate than summary methods (MP-EST, BUCKy, etc). CA-ML (concatenated analysis) most accurate. Datasets from Chung and Ané, 2011. Bayzid & Warnow, Bioinformatics 2013.

Results on 11-taxon datasets with weak ILS: *BEAST is MORE ACCURATE than summary methods, because *BEAST gets more accurate gene trees!

Results on 11-taxon datasets with weak ILS: Summary methods (BUCKy-pop, MP-EST) are both statistically consistent under the MSC but are impacted by gene tree estimation error.

Results on 11-taxon datasets with weak ILS: Concatenation (RAxML) is the best of all methods on these data! (However, for high enough ILS, concatenation is not as accurate as the best summary methods.)

Impact of Gene Tree Estimation Error on MP-EST [Figure: average FN rate of MP-EST on true versus estimated gene trees] MP-EST has no error on true gene trees, but MP-EST has 9% error on estimated gene trees. Datasets: 11-taxon strong ILS conditions with 50 genes. Similar results for other summary methods (MDC, Greedy, etc.).

TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees
- Summary methods combine estimated gene trees, not true gene trees.
- Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error.
- Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal.
- Some researchers also argue that gene trees should be based on very short alignments, to avoid intra-locus recombination.

Gene tree estimation error: key issue in the debate. Summary methods combine estimated gene trees, not true gene trees.

What is the impact of gene tree estimation error on species tree estimation?
Question: Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites?
Answers (Roch & Warnow, Syst Biol, March 2015):
- Strict molecular clock: Yes for some new methods, even for a single site per locus
- No clock: Unknown for all methods, including MP-EST, ASTRAL, etc.
S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods", Systematic Biology, 64(4):663-676, 2015.

Avian Phylogenomics Project (approx. 50 species, whole genomes, 14,000 loci)
Solution: Statistical Binning
- Improves coalescent-based species tree estimation by improving gene trees (Mirarab, Bayzid, Boussau, and Warnow, Science 2014); see also weighted statistical binning (Bayzid et al., PLOS One 2015)
- Avian species tree estimated using Statistical Binning with MP-EST (Jarvis, Mirarab, et al., Science 2014)

Ideas behind statistical binning
- Gene tree error tends to decrease with the number of sites in the alignment [Figure: gene tree error versus number of sites in an alignment]
- Concatenation (even if not statistically consistent) tends to be reasonably accurate when there is not too much gene tree heterogeneity

Statistical binning technique
[Figure from S. Mirarab et al., Science 346, 1250463 (2014), DOI: 10.1126/science.1250463. Traditional pipeline (unbinned): sequence data -> gene alignments -> estimated gene trees -> species tree. Statistical binning pipeline: incompatibility graph -> binned supergene alignments -> supergene trees -> species tree. Caption: The statistical binning pipeline for estimating species trees from gene trees. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees.]
Excerpt from the article summary: "... our simulation study, statistical binning reduced the topological error of species trees estimated using MP-EST and enabled a coalescent-based analysis that was more accurate than concatenation even when gene tree estimation error was relatively high. Statistical binning also reduced the error in gene tree topology and species tree branch length estimation, especially [...] data sets. Thus, statistical binning enables highly accurate species tree estimations, even on genome-scale data sets."
Note: Supergene trees computed using fully partitioned maximum likelihood. Vertex-coloring a graph with balanced color classes is NP-hard; we used a heuristic.
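To make the binning step concrete, here is a minimal sketch of one possible greedy heuristic for balanced vertex coloring of the incompatibility graph. It is not the heuristic used in the Science 2014 study, which the slide does not describe; it only illustrates the idea that genes joined by an edge (judged incompatible) must land in different bins while bin sizes stay roughly equal. The function name and input format are assumptions.

```python
def greedy_balanced_bins(num_genes, incompatible_pairs):
    """Assign genes 0..num_genes-1 to bins so that no bin contains an
    incompatible pair, always filling the smallest eligible bin first."""
    neighbors = {g: set() for g in range(num_genes)}
    for u, v in incompatible_pairs:
        neighbors[u].add(v)
        neighbors[v].add(u)
    bins = []
    # Place the most constrained genes (highest degree) first.
    for g in sorted(neighbors, key=lambda g: -len(neighbors[g])):
        eligible = [b for b in bins if not (b & neighbors[g])]
        if eligible:
            min(eligible, key=len).add(g)  # keep bin sizes balanced
        else:
            bins.append({g})  # every existing bin conflicts: open a new one
    return bins

if __name__ == "__main__":
    # Genes 0, 1, 2 are pairwise incompatible; genes 3, 4, 5 fit anywhere.
    print(greedy_balanced_bins(6, [(0, 1), (0, 2), (1, 2)]))
```

Each resulting bin would then be concatenated into a supergene alignment and analyzed with fully partitioned maximum likelihood, as noted on the slide.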

Theorem 3 (PLOS One, Bayzid et al. 2015): Unweighted statistical binning pipelines are not statistically consistent under GTR+MSC.
As the number of sites per locus increases:
- All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
- For each bin, with probability converging to 1, the genes in the bin have the same tree topology (but can have different numeric parameters), and there is only one bin for any given tree topology
- For each bin, a fully partitioned maximum likelihood (ML) analysis of its supergene alignment converges to a tree with the common gene tree topology
As the number of loci increases: every gene tree topology appears with probability converging to 1.
Hence as both the number of loci and number of sites per locus increase, with probability converging to 1, every gene tree topology appears exactly once in the set of supergene trees. It is impossible to infer the species tree from the flat distribution of gene trees!

Theorem 2 (PLOS One, Bayzid et al. 2015): WSB pipelines are statistically consistent under GTR+MSC.
Easy proof: As the number of sites per locus increases:
- All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014)
- For every bin, with probability converging to 1, the genes in the bin have the same tree topology
- Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin
Hence as the number of sites per locus and number of loci both increase, WSB followed by a statistically consistent summary method will converge in probability to the true species tree. Q.E.D.

Weighted Statistical Binning: empirical
WSB generally benign to highly beneficial:
- Improves accuracy of gene tree topology
- Improves accuracy of species tree topology
- Improves accuracy of species tree branch length
- Reduces incidence of highly supported false positive branches

Statistical binning vs. unbinned [Figure: average FN rate for unbinned versus statistical binning (75) with MP-EST, MDC*(75), MRP, MRL, and GC] Datasets: 11-taxon strong ILS datasets with 50 genes from Chung and Ané, Systematic Biology. Binning produces bins with approximately 5 to 7 genes each.

Statistical binning vs. Unbinned and Concatenation [Figure panels: (a) MP-EST on varying gene sequence length; (b) ASTRAL on varying gene sequence length] Species tree estimation error for MP-EST and ASTRAL, and also concatenation using ML, on avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length. Bayzid et al. (2015). PLoS ONE 10(6): e0129183

Comparing Binned and Un-binned MP-EST on the Avian Dataset [Figure: two avian species trees with branch support values, one for binned MP-EST (unweighted/weighted) and one for unbinned MP-EST; labelled clades include Australaves, Cursores, Otidimorphae, and Columbea; the figure carries the annotation "Conflict with other lines of strong evidence"] Unbinned MP-EST strongly rejects Columbea, a major finding by Jarvis, Mirarab, et al. Binned MP-EST is largely consistent with the ML concatenation analysis. The trees presented in Science 2014 were the ML concatenation and Binned MP-EST.

Running Time Comparison
- Concatenation analysis of the Avian dataset: ~250 CPU years and 1 TB memory
- Statistical binning analysis: ~5 CPU years, almost all of which was computing maximum likelihood gene trees, and much less memory usage
Species tree estimation using traditional approaches is more computationally expensive, and not as accurate as coalescent-based methods!

Summary (so far)
- Statistical binning (weighted or unweighted) improves gene trees, and leads to improved species trees in the presence of ILS compared to unbinned analyses.
- Statistical binning pipelines are also more accurate than concatenation under high ILS.
- Pipelines using the weighted version are statistically consistent under the multi-species coalescent model.
- Statistical binning pipelines are much faster than concatenation analyses (e.g. 5 years vs. 250 years for the avian dataset).

1KP: Thousand Transcriptome Project. Next phase will be much bigger (~1000 species and ~1000 genes).

1KP: Thousand Transcriptome Project. Solution: New coalescent-based method ASTRAL (Mirarab et al., ECCB/Bioinformatics 2014; Mirarab et al., ISMB/Bioinformatics 2015). ASTRAL is statistically consistent, polynomial time, and uses unrooted gene trees.

ASTRAL [Mirarab, et al., ECCB/Bioinformatics, 2014]
Optimization Problem (suspected NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees:
$$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|$$
where $Q(T)$ is the set of quartet trees induced by $T$, $t$ is a gene tree, and $\mathcal{T}$ is the set of all input gene trees.
Theorem: Statistically consistent under the multispecies coalescent model when solved exactly.
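The quartet score of a candidate species tree can be computed directly by brute force, as in the sketch below. This only evaluates the objective for one given tree; it is not ASTRAL's dynamic programming search, and it takes time proportional to the number of four-species subsets. Trees are assumed to be nested tuples of string leaf names (the position of the top-level node is ignored, since quartets are unrooted), and all function names are hypothetical.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set of tree, list of leaf sets below every internal node)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leafset, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leafset |= child_leaves
        found.extend(child_clades)
    found.append(leafset)
    return leafset, found

def induced_quartet(tree_clades, four):
    """Return the quartet tree induced on `four` as a frozenset of two pairs,
    or None if the tree does not resolve these four leaves."""
    four = frozenset(four)
    for cl in tree_clades:
        side = cl & four
        if len(side) == 2:  # this edge splits the four leaves two against two
            return frozenset([side, four - side])
    return None

def quartet_score(species_tree, gene_trees):
    """Sum over gene trees of the number of quartet trees shared with species_tree."""
    s_leaves, s_clades = clades(species_tree)
    gene_info = [clades(t) for t in gene_trees]
    score = 0
    for four in combinations(sorted(s_leaves), 4):
        q = induced_quartet(s_clades, four)
        if q is None:
            continue
        for g_leaves, g_clades in gene_info:
            if set(four) <= g_leaves and induced_quartet(g_clades, four) == q:
                score += 1
    return score

if __name__ == "__main__":
    species = ((("A", "B"), "C"), ("D", "E"))
    genes = [((("A", "B"), "C"), ("D", "E")), ((("A", "C"), "B"), ("D", "E"))]
    print(quartet_score(species, genes))
```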

Constrained Maximum Quartet Support Tree
Input: Set $\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$ of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions
Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to $\mathcal{T}$, subject to T drawing its bipartitions from X.
Theorems (Mirarab et al., 2014):
- If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC.
- The constrained MQST problem can be solved in $O(|X|^2 nk)$ time. (We use dynamic programming, and build the unrooted tree from the bottom-up, based on allowed clades, which are halves of the allowed bipartitions.)
Conjecture: MQST is NP-hard

Simulation study
[Figure: true (model) species tree -> true gene trees -> sequence data -> estimated gene trees -> estimated species tree, illustrated on Finch, Falcon, Owl, Eagle, and Pigeon]
Variable parameters:
- Number of species: 10 to 1000
- Number of genes: 50 to 1000
- Amount of ILS: low, medium, high
- Deep versus recent speciation
11 model conditions (50 replicas each) with heterogeneous gene tree error
Compare to NJst, MP-EST, concatenation (CA-ML)
Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree
Used SimPhy (Mallo and Posada, 2015)
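The FN rate defined above can be sketched as follows, treating each internal branch as the bipartition of the leaf set it induces. This is an illustration under assumptions: trees are nested tuples of string leaf names on the same leaf set, only internal (non-trivial) branches are counted, the rate is returned as a fraction rather than a percentage, and the function names are hypothetical.

```python
def leafsets(tree):
    """Return (leaf set of tree, list of leaf sets below every internal node)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = leafsets(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def bipartitions(tree):
    """Non-trivial bipartitions of the unrooted tree (both sides have >= 2 leaves)."""
    all_leaves, found = leafsets(tree)
    bips = set()
    for side in found:
        other = all_leaves - side
        if len(side) >= 2 and len(other) >= 2:
            bips.add(frozenset([side, other]))
    return bips

def fn_rate(true_tree, estimated_tree):
    """Fraction of internal branches of the true tree missing from the estimate."""
    true_bips = bipartitions(true_tree)
    if not true_bips:
        return 0.0
    missing = true_bips - bipartitions(estimated_tree)
    return len(missing) / len(true_bips)

if __name__ == "__main__":
    true_tree = ((("A", "B"), "C"), ("D", "E"))
    estimate = ((("A", "C"), "B"), ("D", "E"))
    print(fn_rate(true_tree, estimate))
```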

Tree accuracy when varying the number of species [Figure: species tree topological error (FN), 4% to 16%, for ASTRAL-II and MP-EST versus number of species (10, 50, 100, 200, 500, 1000)] 1000 genes, medium levels of recent ILS.

Accuracy in the presence of HGT + ILS [Figure: results on 200 estimated gene trees]
Data: fixed, moderate ILS rate; 50 replicates per HGT rate (1)-(6); 1 model species tree per replicate on 51 taxa; 1000 true gene trees; 1000 bp gene sequences simulated using INDELible (Fletcher, Yang 2009); 1000 gene trees estimated from GTR simulated sequences using FastTree-2 (Price, Dehal, Arkin 2015).
Davidson et al., RECOMB-CG, BMC Genomics 2015

Summary
- ASTRAL is a summary method that is statistically consistent in the presence of ILS, and that runs in polynomial time.
- ASTRAL can analyze very large datasets (1000 species and 1000 genes or more) with high accuracy. ASTRAL also performs well with ILS + HGT.
- Coalescent-based summary methods are much faster than traditional concatenation approaches, and they can provide improved accuracy in the presence of gene tree heterogeneity.
- Gene tree estimation error impacts accuracy of species trees, but statistical binning can reduce gene tree estimation error and lead to improved species tree estimations (topology, branch lengths, and incidence of false positives).

Future Directions
- Better coalescent-based summary methods (that are more robust to gene tree estimation error)
- Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees
- Better theory about robustness to gene tree estimation error (or lack thereof) for coalescent-based summary methods
- Better single site methods (see SMRT, SVDquartets, METAL, and SNAPP)

Acknowledgments
NSF grant DBI-1461364 (joint with Noah Rosenberg at Stanford and Luay Nakhleh at Rice): http://tandy.cs.illinois.edu/phylogenomicsproject.html
Papers available at http://tandy.cs.illinois.edu/papers.html
Software:
- ASTRAL and statistical binning: available at https://github.com/smirarab
- ASTRID: available at http://pranjalv123.github.io/astrid/
Other funding: David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), Grainger Foundation, and HHMI (to SM)

Running time when varying the number of species [Figure: running time in hours (0 to 20) of ASTRAL-II, NJst, and MP-EST versus number of species (10, 50, 100, 200, 500, 1000)] 1000 genes, medium levels of recent ILS.

ASTRID: Accurate species trees using internode distances. Vachaspati and Warnow, RECOMB-CG 2015 and BMC Genomics 2015
- Algorithmic design: Computes a matrix of average leaf-to-leaf topological distances, and then computes a tree using FastME (more accurate than neighbor joining, and faster, too).
- Related to NJst (Liu and Yu, 2010), which computes the same matrix but then computes the tree using neighbor joining (NJ).
- Statistically consistent under the MSC
- $O(kn^2 + n^3)$ time, where there are k gene trees and n species
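The distance-matrix step can be sketched as below. The sketch assumes that "topological distance" means the number of edges on the path between two leaves in the unrooted gene tree, that gene trees are nested tuples of string leaf names, and that a top-level node with exactly two children is only a rooting artifact (so its two edges count as one). Function names are hypothetical, and the tree-building step (FastME or neighbor joining) is deliberately left to external software.

```python
from collections import defaultdict

def leaf_distances(tree):
    """Map each unordered leaf pair to the number of edges separating it."""
    dist = {}

    def visit(node, root_is_binary):
        # Returns {leaf: number of edges from this node down to the leaf}.
        if not isinstance(node, tuple):
            return {node: 0}
        merged = {}
        for child in node:
            below = {leaf: d + 1 for leaf, d in visit(child, False).items()}
            for x, dx in merged.items():
                for y, dy in below.items():
                    # A degree-2 root is suppressed: its two edges count as one.
                    dist[frozenset((x, y))] = dx + dy - (1 if root_is_binary else 0)
            merged.update(below)
        return merged

    visit(tree, isinstance(tree, tuple) and len(tree) == 2)
    return dist

def average_distance_matrix(gene_trees):
    """Average each pairwise distance over the gene trees containing both leaves."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tree in gene_trees:
        for pair, d in leaf_distances(tree).items():
            totals[pair] += d
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

if __name__ == "__main__":
    genes = [((("A", "B"), "C"), "D"), ((("A", "C"), "B"), "D")]
    for pair, d in sorted(average_distance_matrix(genes).items(), key=lambda kv: sorted(kv[0])):
        print(sorted(pair), d)
```

Averaging only over the gene trees that contain both leaves is one simple way to tolerate missing taxa; it is an assumption of this sketch rather than a statement about the published software.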

Both ASTRAL and ASTRID substantially outperform MP-EST [Figure panels: Avian simulated dataset; Mammalian simulated dataset]

ASTRID is very fast [Figure: 48-taxon avian simulated dataset] On the ASTRAL-2 dataset with 1000 taxa and 1000 genes, ASTRID-FastME takes 33 minutes and ASTRAL takes 12 hours.

Scaling methods to large datasets
- BBCA: combining random binning with *BEAST to enable scalability to large numbers of loci (Zimmermann et al., BMC Genomics 2014)
- Using divide-and-conquer to scale MP-EST (and other methods) to large numbers of taxa (Bayzid et al., BMC Genomics 2014)

Tree accuracy when varying the number of species [Figure: species tree topological error (FN), 4% to 16%, for ASTRAL-II, NJst, and MP-EST versus number of species (10, 50, 100, 200, 500, 1000)] 1000 genes, medium levels of recent ILS.

Exact solution
- We developed a dynamic programming algorithm to solve the problem exactly
- Exponential running time (still feasible for <18 species)
Developed a constrained version of the problem that can be solved exactly in polynomial time
- Runs for 1000 species and 1000 genes in about a day
- Remains statistically consistent

ASTRAL-II on biological datasets (ongoing collaborations)
- 1200 plants with ~400 genes (1KP consortium)
- 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)
- 200 avian species with whole genomes (with Genome 10K, international)
- 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)
- 140 insects with 1400 genes (with U. Illinois at Urbana-Champaign)
- 50 hummingbird species with 2000 genes (with U. Copenhagen and Smithsonian)
- 40 raptor species (birds) with 10,000 genes (with U. Copenhagen and Berkeley)
- 38 mammalian species with 10,000 genes (with U. of Bristol, Cambridge, and Nat. Univ. of Ireland)

Computational Phylogenomics: NP-hard problems; large datasets; complex statistical estimation problems; metagenomics; protein structure and function prediction; medical forensics; systems biology; population genetics

Phylogenomics = Species trees from whole genomes. Multiple applications, including Metagenomics, Protein Structure and Function, Conservation Biology, and Adaptation

ASTRAL and ASTRAL-2
- Estimates the species tree from unrooted gene trees by finding the species tree that has the maximum quartet support, subject to input constraint set X.
- ASTRAL lets X be the set of bipartitions in the input gene trees; ASTRAL-2 includes additional bipartitions beyond this (to address missing data challenges).
- Theorem: ASTRAL is statistically consistent under the MSC, even when solved in constrained mode (drawing bipartitions from the input gene trees).
- The constrained version of ASTRAL runs in polynomial time
- Open source software at https://github.com/smirarab
- Published in Bioinformatics 2014 and 2015
- Used in Wickett, Mirarab et al. (PNAS 2014)

Statistical Consistency for summary methods [Figure: plot with axes "error" and "Data"] Data are gene trees, presumed to be randomly sampled true gene trees.

Fig 1. Pipeline for unbinned analyses, unweighted statistical binning, and weighted statistical binning. Bayzid MS, Mirarab S, Boussau B, Warnow T (2015) Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses. PLoS ONE 10(6): e0129183. doi:10.1371/journal.pone.0129183

Table 1. Model trees used in the Weighted Statistical Binning study. We show number of taxa, species tree branch length (relative to base model), and average topological discordance between true gene trees and true species tree. doi:10.1371/journal.pone.0129183.t001
Dataset          Species tree branch length scaling   Average Discordance (%)
Avian (48)       2X                                   35
Avian (48)       1X                                   47
Avian (48)       0.5X                                 59
Mammalian (37)   2X                                   18
Mammalian (37)   1X                                   32
Mammalian (37)   0.5X                                 54
10-taxon         "Lower ILS"                          40
10-taxon         "Higher ILS"                         84
15-taxon         "High ILS"                           82

Binning can improve species tree topology estimation [Figure panels: (a) MP-EST on varying gene sequence length; (b) ASTRAL on varying gene sequence length] Avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length. Bayzid et al. (2015). PLoS ONE 10(6): e0129183

Comparing Binned and Un-binned MP-EST on the Avian Dataset [Figure: avian species trees with branch support values for binned MP-EST (unweighted/weighted) and unbinned MP-EST] Binned MP-EST is largely consistent with the ML concatenation analysis. The trees presented in Science 2014 were the ML concatenation and Binned MP-EST. Bayzid et al. (2015). PLoS ONE 10(6): e0129183

Binning can reduce incidence of high support false positive edges. [Figure: cumulative distribution of the bootstrap support values of true positive (left) and false positive (right) edges] If a curve for method X is above the curve for method Y, then X has higher BS for true positives and lower BS for false positives. Values in the shaded area indicate false positive branches with support at 75% or higher. Results are shown for 1000 genes with 500bp, on the avian simulated datasets. Bayzid et al. (2015). PLoS ONE 10(6): e0129183

Weighted Statistical Binning: empirical
However, WSB can reduce accuracy under some conditions. Current simulations have only established this for model conditions that simultaneously have:
- Very small numbers of species (at most 10)
- Very high ILS (AD > 80%)
- Low bootstrap support for gene trees
Most likely there are other conditions as well.

Species tree estimation error for MP-EST and ASTRAL on 10-taxon datasets [Figure panels: AD=40%; AD=84%]
SimPhy model tree, 200 genes with 100bp (GTRGAMMA), 10 replicates per condition
Notes:
- Moderate ILS: binning neutral or beneficial using BS=50%
- Very high ILS: binning neutral for BS=50%, but increases MP-EST error with BS=75%
Bayzid MS, Mirarab S, Boussau B, Warnow T (2015). PLoS ONE 10(6): e0129183

Sketch of L&E argument
- For any sequence length L, there is a model species tree such that nearly all sites on nearly all genes evolve without any changes, and so nearly all gene trees have maximum bootstrap support below the threshold value.
- As the number of loci increases, the bins produced by WSB will have the same gene tree distribution as for the true species tree (or the deviation will not impact any downstream argument).
- On each bin, ML concatenation will converge to some tree that is not the species tree. (Hence, applying a coalescent-based method to these supergene trees will not produce the species tree, even as the number of loci increases.)

Liu and Edwards, Comment in Science, October 2015
Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus:
- The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML.
Simulation study: 5-taxon, strict molecular clock, very high ILS (AD=82%)
- Our re-analysis of their data produced better results for statistical binning (both weighted and unweighted) than they reported.
- They performed WSB using unpartitioned ML instead of fully partitioned ML (biasing against statistical binning).
- They had erroneous (ectopic) data in their supergene alignments, biasing against statistical binning.
- This model tree fits into the category of conditions described in Bayzid et al. PLOS One 2015, in which WSB reduced accuracy (very small numbers of taxa, very high ILS).
Figure of model tree from L&E, Science 9 October 2015: 171

Liu and Edwards, Comment in Science, October 2015
Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus:
- The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML.
Simulation study: 5-taxon, strict molecular clock, very high ILS (AD=82%)
Mirarab et al. response (Science, October 2015):
- Our re-analysis of their data (with correct supergene alignments) shows that WSB reduces accuracy, but not by as much as they report.
- Our analyses of slightly larger datasets with the same properties (pectinate, very high ILS, strict clock) showed WSB neutral to beneficial.
Figure of model tree from L&E, Science 9 October 2015: 171

Liu and Edwards, Comment in Science, October 2015
Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus:
- The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML.
Simulation study: 5-taxon, strict molecular clock, very high ILS (AD=82%)
All the conditions in which WSB has been shown to reduce accuracy have the following properties:
- High ILS (AD > 80%)
- Small numbers of taxa (at most 10)
- Low bootstrap support on gene trees
and most also obeyed the strict molecular clock. Bayzid et al. (PLOS One, March 2015) advises against the use of WSB under these conditions.
Figure of model tree from L&E, Science 9 October 2015: 171

L&E ask a good question: performance on bounded number of sites!
Question: Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites?
Answers (Roch & Warnow, Syst Biol, March 2015):
- Strict molecular clock: Yes for some new methods, even for a single site per locus
- No clock: Unknown for all methods, including MP-EST, ASTRAL, etc.

Conjecture
For any positive integer L, if all loci have at most L sites, then coalescent-based summary methods (e.g., MP-EST, ASTRAL, NJst, ASTRID, etc.) cannot be statistically consistent under the MSC.
Comments:
- All current proofs of statistical consistency for standard summary methods assume perfect gene trees, and so inherently assume that sequence lengths for all genes go to infinity.
- The challenge is that for some model gene trees, ML can be biased towards the wrong tree topology for small enough L (think long branch attraction, Felsenstein Zone).

Rephrasing L&E Technical Comment as a conjecture
For any positive integer L, if all loci have at most L sites, then WSB pipelines cannot be statistically consistent under the MSC.
Comments:
- Might well be true!
- The proof will not be easy.

Rephrasing L&E Technical Comment as a conjecture
For any positive integer L, if all loci have at most L sites, then WSB pipelines cannot be statistically consistent under the MSC.
Comments:
- Open
- Will be hard to settle either way

No theoretical difference between MP-EST, ASTRAL, and WSB (according to current knowledge)

Method                                                   Consistency, first kind   Consistency, second kind
-------------------------------------------------------  ------------------------  ------------------------
MP-EST                                                   YES                       UNKNOWN
ASTRAL                                                   YES                       UNKNOWN
Unpartitioned concatenated maximum likelihood            NO                        NO
Fully partitioned maximum likelihood                     UNKNOWN                   UNKNOWN
Unweighted statistical binning followed by consistent
  summary method (e.g., ASTRAL)                          NO                        NO
Weighted statistical binning followed by consistent
  summary method (e.g., ASTRAL)                          YES                       UNKNOWN
*BEAST                                                   YES                       UNKNOWN

Consistency first kind: both number of loci and number of sites go to infinity.
Consistency second kind: number of loci goes to infinity, number of sites bounded by L (arbitrary constant).
Table from PLOS Currents, Warnow 2015
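The two notions of consistency in the table can be written out formally. The notation below is assumed for illustration and does not appear on the slide: m is the number of loci, n_i the number of sites at locus i, T^* the model species tree, and \hat{T} the tree returned by the method.

```latex
% Notation (assumed): m loci, n_i sites at locus i,
% T^* = model species tree, \hat{T} = estimated species tree.
\begin{align*}
\text{First kind:}\quad
  & \lim_{\substack{m \to \infty \\ \min_i n_i \to \infty}} \Pr[\hat{T} = T^*] = 1 \\[4pt]
\text{Second kind:}\quad
  & \text{for every constant } L \ge 1,\ \text{if } n_i \le L \text{ for all } i,\ \text{then }
    \lim_{m \to \infty} \Pr[\hat{T} = T^*] = 1
\end{align*}
```

Read this way, the conjectures on the preceding slides assert that the second-kind limit fails for the listed methods, while the "UNKNOWN" entries record that neither the limit nor its failure has been proved.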

Species tree estimation error for MP-EST and ASTRAL on 15-taxon datasets
Model tree:
- Very high ILS: AD = 82%
- Strict molecular clock
- GTR+Gamma sequence evolution (Indelible)
- 10 replicates per condition
Notes:
- BS=75% often improved accuracy (p=0.04)
- BS=50% sometimes reduced accuracy, but differences were not statistically significant.
- MP-EST more accurate than ASTRAL on these data.
Bayzid MS, Mirarab S, Boussau B, Warnow T (2015). PLoS ONE 10(6): e0129183.