Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign

Similar documents
Computational Challenges in Constructing the Tree of Life

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 / BIOE 540: Algorithmic Computa<onal Genomics. Tandy Warnow Departments of Bioengineering and Computer Science hhp://tandy.cs.illinois.

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

Computa(onal Challenges in Construc(ng the Tree of Life

The Mathema)cs of Es)ma)ng the Tree of Life. Tandy Warnow The University of Illinois

Fast coalescent-based branch support using local quartet frequencies

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

New methods for es-ma-ng species trees from genome-scale data. Tandy Warnow The University of Illinois

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

Ultra- large Mul,ple Sequence Alignment

On the variance of internode distance under the multispecies coalescent

TheDisk-Covering MethodforTree Reconstruction

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Phylogenetic inference

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetic Geometry

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Dr. Amira A. AL-Hosary

Methods to reconstruct phylogene1c networks accoun1ng for ILS

CS 394C March 21, Tandy Warnow Department of Computer Sciences University of Texas at Austin

Workshop III: Evolutionary Genomics

Recent Advances in Phylogeny Reconstruction

The impact of missing data on species tree estimation

Anatomy of a species tree

Today's project. Test input data Six alignments (from six independent markers) of Curcuma species

CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1. Ben Raphael January 21, Course Par3culars

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Phylogenetic Tree Reconstruction

Taming the Beast Workshop

Phylogenetic Networks, Trees, and Clusters

Phylogenetics. BIOL 7711 Computational Bioscience

BINF6201/8201. Molecular phylogenetic methods

EVOLUTIONARY DISTANCES

A Phylogenetic Network Construction due to Constrained Recombination

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Disk-Covering, a Fast-Converging Method for Phylogenetic Tree Reconstruction ABSTRACT

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Molecular Evolution & Phylogenetics

Reconstruire le passé biologique modèles, méthodes, performances, limites

Non-binary Tree Reconciliation. Louxin Zhang Department of Mathematics National University of Singapore

BIOINFORMATICS. Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data. Jijun Tang 1 and Bernard M.E. Moret 1

Phylogenetics: Building Phylogenetic Trees

In comparisons of genomic sequences from multiple species, Challenges in Species Tree Estimation Under the Multispecies Coalescent Model REVIEW

Constructing Evolutionary/Phylogenetic Trees


Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1

A phylogenomic toolbox for assembling the tree of life

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

Isolating - A New Resampling Method for Gene Order Data

Regular networks are determined by their trees

Quartet Inference from SNP Data Under the Coalescent Model

Letter to the Editor. Department of Biology, Arizona State University

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Constructing Evolutionary/Phylogenetic Trees

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

WenEtAl-biorxiv 2017/12/21 10:55 page 2 #2

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Algorithms in Bioinformatics

Species Tree Inference by Minimizing Deep Coalescences

Estimating Evolutionary Trees. Phylogenetic Methods

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Multiple Sequence Alignment. Sequences

Phylogenetic inference: from sequences to trees

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Evolu&on, Popula&on Gene&cs, and Natural Selec&on Computa.onal Genomics Seyoung Kim

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetics in the Age of Genomics: Prospects and Challenges

A (short) introduction to phylogenetics

Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting

1 ATGGGTCTC 2 ATGAGTCTC

arxiv: v5 [q-bio.pe] 24 Oct 2016

Point of View. Why Concatenation Fails Near the Anomaly Zone

Evolutionary Models. Evolutionary Models

Phylogeny: building the tree of life

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection

ALGORITHMIC STRATEGIES FOR ESTIMATING THE AMOUNT OF RETICULATION FROM A COLLECTION OF GENE TREES

Transcription:

Constrained Exact Op1miza1on in Phylogene1cs Tandy Warnow The University of Illinois at Urbana-Champaign

Phylogeny (evolu1onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

DNA Sequence Evolution AAGGCCT AAGACTT TGGACTT -3 mil yrs -2 mil yrs AGGGCAT TAGCCCT AGCACTT -1 mil yrs AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT today

Phylogeny Problem U V W X Y AGTGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U X Y V W

Phylogeny + genomics = genome-scale phylogeny es1ma1on.

Main compe1ng approaches Species gene 1 gene 2... gene k... Concatenation Analyze separately... Summary Method

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly iden1fy orthologs Compute mul1ple sequence alignment and gene tree for each locus Compute species tree or phylogene1c network from the gene trees or alignments Get sta1s1cal support on each branch (e.g., bootstrapping) Es1mate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Just about everything worth doing is NP-hard Mul1ple sequence alignment Maximum likelihood gene tree es1ma1on Species tree es1ma1on by combining gene trees Supertree es1ma1on and hence divide-and-conquer strategies And Bayesian methods are even more computa1onally intensive

Just about everything worth doing is NP-hard Mul1ple sequence alignment Maximum likelihood gene tree es1ma1on Species tree es1ma1on by combining gene trees Supertree es1ma1on and hence divide-and-conquer strategies And Bayesian methods are even more computa1onally intensive

Local search heuris1cs (a) Tree bisection... (b) reconnection choice... (c) and reconnection Standard heuris1c search: TBR moves through treespace Randomness to exit local op1ma But there are (2n-5)!! trees on n leaves Figure from Huson 2010

Solving maximum likelihood (and other hard optimization problems) is unlikely # of Taxa # of Unrooted Trees 4 3 5 15 6 105 7 945 8 10395 9 135135 10 2027025 20 2.2 x 10 20 100 4.5 x 10 190 1000 2.7 x 10 2900

Avian Phylogenomics Project Erich Jarvis, HHMI MTP Gilbert, Copenhagen Guojie Zhang, BGI Siavash Mirarab, Tandy Warnow, Texas Texas and UIUC Approx. 50 species, whole genomes 14,000 loci Multi-national team (100+ investigators) 8 papers published in special issue of Science 2014 Biggest computational challenges: 1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1Tb of distributed memory, at supercomputers around world) 2. Constructing coalescent-based species tree from 14,000 different gene trees

Only 48 species, but heuris1c ML took ~300 CPU years on mul1ple supercomputers and used 1Tb of memory Jarvis,$Mirarab,$et$al.,$examined$48$ bird$species$using$14,000$loci$from$ whole$genomes.$two$trees$were$ presented.$ $ 1.$A$single$dataset$maximum$ likelihood$concatena,on$analysis$ used$~300$cpu$years$and$1tb$of$ distributed$memory,$using$tacc$and$ other$supercomputers$around$the$ world.$$ $ 2.$However,$every%locus%had%a% different%%tree$ $sugges,ve$of$ incomplete$lineage$sor,ng $ $and$ the$noisy$genomehscale$data$required$ the$development$of$a$new$method,$ sta,s,cal$binning.$ $ $ $ $ RESEARCH ARTICLE Whole-genome analyses resolve early branches in the tree of life of modern birds Erich D. Jarvis, 1 * Siavash Mirarab, 2 * Andre J. Aberer, 3 Bo Li, 4,5,6 Peter Houde, 7 Cai Li, 4,6 Simon Y. W. Ho, 8 Brant C. Faircloth, 9,10 Benoit Nabholz, 11 Jason T. Howard, 1 Alexander Suh, 12 Claudia C. Weber, 12 Rute R. da Fonseca, 6 Jianwen Li, 4 Fang Zhang, 4 Hui Li, 4 Long Zhou, 4 Nitish Narula, 7,13 Liang Liu, 14 Ganesh Ganapathy, 1 Bastien Boussau, 15 Md. Shamsuzzoha Bayzid, 2 Volodymyr Zavidovych, 1 Sankar Subramanian, 16 Toni Gabaldón, 17,18,19 Salvador Capella-Gutiérrez, 17,18 Jaime Huerta-Cepas, 17,18 Bhanu Rekepalli, 20 Kasper Munch, 21 Mikkel Schierup, 21 Bent Lindow, 6 Wesley C. Warren, 22 David Ray, 23,24,25 Richard E. Green, 26 Michael W. Bruford, 27 Xiangjiang Zhan, 27,28 Andrew Dixon, 29 Shengbin Li, 30 Ning Li, 31 Yinhua Huang, 31 Elizabeth P. Derryberry, 32,33 Mads Frost Bertelsen, 34 Frederick H. Sheldon, 33 Robb T. Brumfield, 33 Claudio V. Mello, 35,36 Peter V. Lovell, 35 Morgan Wirthlin, 35 Maria Paula Cruz Schneider, 36,37 Francisco Prosdocimi, 36,38 José Alfredo Samaniego, 6 Amhed Missael Vargas Velazquez, 6 Alonzo Alfaro-Núñez, 6 Paula F. Campos, 6 Bent Petersen, 39 Thomas Sicheritz-Ponten, 39 An Pas, 40 Tom Bailey, 41 Paul Scofield, 42 Michael Bunce, 43 David M. Lambert, 16 Qi Zhou, 44 Polina Perelman, 45,46 Amy C. Driskell, 47 Beth Shapiro, 26 Zijun Xiong, 4 Yongli Zeng, 4 Shiping Liu, 4 Zhenyu Li, 4 Binghang Liu, 4 Kui Wu, 4 Jin Xiao, 4 Xiong Yinqi, 4 Qiuemei Zheng, 4 Yong Zhang, 4 Huanming Yang, 48 Jian Wang, 48 Linnea Smeds, 12 Frank E. Rheindt, 49 Michael Braun, 50 Jon Fjeldsa, 51 Ludovic Orlando, 6 F. Keith Barker, 52 Knud Andreas Jønsson, 51,53,54 Warren Johnson, 55 Klaus-Peter Koepfli, 56 Stephen O Brien, 57,58 David Haussler, 59 Oliver A. Ryder, 60 Carsten Rahbek, 51,54 Eske Willerslev, 6 Gary R. Graves, 51,61 Travis C. Glenn, 62 John McCormack, 63 Dave Burt, 64 Hans Ellegren, 12 Per Alström, 65,66 Scott V. Edwards, 67 Alexandros Stamatakis, 3,68 David P. Mindell, 69 Joel Cracraft, 70 Edward L. Braun, 71 Tandy Warnow, 2,72 Wang Jun, 48,73,74,75,76 M. Thomas P. Gilbert, 6,43 Guojie Zhang 4,77 To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in

1KP: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD And many others l 103 plant transcriptomes, 400-800 single copy genes l Wicked, Mirarab et al., PNAS 2014 l Next phase will be much bigger l l ~1000 species and ~1000 genes, and will require the inference of mul?ple sequence alignments and trees on more than 100,000 sequences

Muir, 2016

This Talk Model-based tree es1ma1on and NP-hard problems How divide-and-conquer can improve tree es1ma1on Constrained op1miza1on Open problems

Markov Model of Site Evolu1on Simplest (Jukes-Cantor, 1969): The model tree T is binary and has subs1tu1on probabili1es p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleo1des) If a site (posi1on) changes on an edge, it changes with equal probability to each of the remaining states. The evolu1onary process is Markovian. The different sites are assumed to evolve independently and iden1cally down the tree (with rates that are drawn from a gamma distribu1on). More complex models (such as the General Markov model) are also considered, ojen with lidle change to the theory.

Ques1ons Is the model tree iden1fiable? Which es1ma1on methods are sta1s1cally consistent under this model? How much data does the method need to es1mate the model tree correctly (with high probability), and how well do the methods perform in prac1ce?

Sta1s1cal Consistency error Data

Answers? We know a lot about which site evolu1on models are iden1fiable, and which methods are sta1s1cally consistent. We know a lidle bit about the sequence length requirements for standard methods. Extensive studies show that even the best methods produce gene trees with some error.

Statistically consistent phylogenetic reconstruction methods 1 Hill-climbing heuristics for NP-hard optimization criteria (e.g., Maximum Likelihood) Cost Local optimum Phylogenetic trees Global optimum 2 Polynomial time distance-based methods: Neighbor Joining, FastME, etc. 3. Bayesian methods

Distance-based es1ma1on

Quan1fying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

Neighbor Joining (NJ) on large trees Error Rate 0.8 0.6 0.4 0.2 NJ Simula1on study based upon fixed edge lengths, K2P model of evolu1on, sequence lengths fixed to 1000 nucleo1des. Error rates reflect propor1on of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] 0 0 400 800 1200 1600 No. Taxa

In other words error Data Sta1s1cal consistency doesn t guarantee accuracy w.h.p. unless the sequences are long enough.

Boos1ng phylogeny reconstruc1on methods DCMs boost the performance of phylogeny reconstruc1on methods. Base method M DCM DCM-M

Divide-and-conquer for phylogeny es1ma1on

Markov Model of Site Evolu1on Simplest (Jukes-Cantor, 1969): The model tree T is binary and has subs1tu1on probabili1es p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleo1des) If a site (posi1on) changes on an edge, it changes with equal probability to each of the remaining states. The evolu1onary process is Markovian. The different sites are assumed to evolve independently and iden1cally down the tree (with rates that are drawn from a gamma distribu1on). More complex models (such as the General Markov model) are also considered, ojen with lidle change to the theory.

Sequence length requirements The sequence length (number of sites) that a phylogeny reconstruc1on method M needs to reconstruct the true tree with probability at least 1-ε depends on M (the method) ε f = min p(e), g = max p(e), and n, the number of leaves We fix everything but n.

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Distance-based es1ma1on

Neighbor Joining s sequence length requirement is exponen1al! Adeson 1999: Let T be a Jukes-Cantor model tree defining addi1ve matrix D. Then Neighbor Joining will reconstruct the true tree with high probability from sequences that are of length O(lg n e max Dij ). Lacey and Chang 2009: Matching lower bound

Divide-and-conquer for phylogeny es1ma1on Construct subset trees Supertree Step Refinement Step

DCM1 Decomposi1ons Input: Set S of sequences, distance matrix d, threshold value q { dij} 1. Compute threshold graph Gq = ( V, E), V = S, E = {( i, j) : d( i, j) q} 2. Perform minimum weight triangulation (note: if d is an additive matrix, then the threshold graph is provably triangulated). DCM1 decomposition : Compute maximal cliques

DCM1-boos1ng: Warnow, St. John, and Moret, SODA 2001 Exponentially converging (base) method DCM1 SQS Absolute fast converging (DCM1-boosted) method The DCM1 phase produces a collec1on of trees (one for each threshold), and the SQS phase picks the best tree. For a given threshold, the base method is used to construct trees on small subsets (defined by the threshold) of the taxa. These small trees are then combined into a tree on the full set of taxa.

Neighbor Joining on large diameter trees Error Rate 0.8 0.6 0.4 0.2 NJ Simula1on study based upon fixed edge lengths, K2P model of evolu1on, sequence lengths fixed to 1000 nucleo1des. Error rates reflect propor1on of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] 0 0 400 800 1200 1600 No. Taxa

DCM1-boos1ng distance-based methods [Nakhleh et al. ISMB 2001] Error Rate 0.8 0.6 0.4 NJ DCM1-NJ Theorem (Warnow et al., SODA 2001): DCM1-NJ converges to the true tree from polynomial length sequences 0.2 0 0 400 800 1200 1600 No. Taxa

DCM1-boos1ng DCM1-boos1ng: reducing sequence length requirements for gene tree accuracy from exponen1al to polynomial Key algorithmic ingredients: Construct triangulated graph Apply tree construc1on methods to subsets Combine subset trees together using supertree method Select best tree from a set of trees

DCM1-boos1ng DCM1-boos1ng: reducing sequence length requirements for gene tree accuracy from exponen1al to polynomial Key algorithmic ingredients: Construct triangulated graph Apply tree construc1on methods to subsets Combine subset trees together using supertree method Select best tree from a set of trees

DACTAL-boos1ng: Other applica1ons of divide-and-conquer almost alignment-free tree es1ma1on Improving mul1-locus species tree es1ma1on DCM2-boos1ng: improving heuris1c searches for maximum likelihood Superfine-boos1ng: improving supertree methods

Divide-and-conquer for phylogeny es1ma1on Construct subset trees Supertree Step Refinement Step

Supertree Problems Quartet Median Supertree Robinson-Foulds Supertree Matrix Representa1on with Parsimony Matrix Representa1on with Likelihood Etc. All are NP-hard because even tes1ng compa1bility of unrooted trees is NP-complete.

Supertree Problems Quartet Median Supertree Robinson-Foulds Supertree Matrix Representa1on with Parsimony Matrix Representa1on with Likelihood Etc. All are NP-hard because even tes1ng compa1bility of unrooted trees is NP-complete.

phylogenomics Orangutan Chimpanzee gene 1 gene 2 gene 999 gene 1000 Gorilla Human ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene here refers to a portion of the genome (not a functional gene) I ll use the term gene to refer to c-genes : recombination-free orthologous stretches of the genome 2

Gene tree discordance Incomplete Lineage Sor1ng (ILS) is a dominant cause of gene tree heterogeneity gene 1 gene1000 Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. 3

Gene trees inside the species tree (Coalescent Process) Past Present Courtesy James Degnan Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Incomplete Lineage Sor1ng (ILS) Confounds phylogene1c analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc. There is substan1al debate about how to analyze phylogenomic datasets in the presence of ILS, focused around sta1s1cal consistency guarantees (theory) and performance on data.

Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

ASTRAL [Mirarab, et al., ECCB/Bioinformatics, 2014] Optimization Problem (NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees Score(T )= X t2t Q(T ) \ Q(t) Set of quartet trees induced by T a gene tree all input gene trees Theorem: Statistically consistent under the multispecies coalescent model when solved exactly 15

Constrained Maximum Quartet Support Tree Input: Set T = {t 1,t 2,,t k } of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipar11ons Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T, subject to T drawing its bipar11ons from X. Theorems: Mirarab et al. 2014: If X contains the bipar11ons from the input gene trees (and perhaps others), then an exact solu1on to this problem is sta1s1cally consistent under the MSC. Mirarab and Warnow 2015: The constrained MQST problem can be solved in O( X 2 nk) 1me. (We use dynamic programming, and build the unrooted tree from the bodom-up, based on allowed clades halves of the allowed bipar11ons.)

Constrained Maximum Quartet Support Tree Input: Set T = {t 1,t 2,,t k } of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipar11ons Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T, subject to T drawing its bipar11ons from X. Theorems: Mirarab et al. 2014: If X contains the bipar11ons from the input gene trees (and perhaps others), then an exact solu1on to this problem is sta1s1cally consistent under the MSC. Mirarab and Warnow 2015: The constrained MQST problem can be solved in O( X 2 nk) 1me. (We use dynamic programming, and build the unrooted tree from the bodom-up, based on allowed clades halves of the allowed bipar11ons.)

Constrained Maximum Quartet Support Tree Input: Set T = {t 1,t 2,,t k } of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipar11ons Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T, subject to T drawing its bipar11ons from X. Theorems: Mirarab et al. 2014: If X contains the bipar11ons from the input gene trees (and perhaps others), then an exact solu1on to this problem is sta1s1cally consistent under the MSC. Mirarab and Warnow 2015: The constrained MQST problem can be solved in O( X 2 nk) 1me. (We use dynamic programming, and build the unrooted tree from the bodom-up, based on allowed clades halves of the allowed bipar11ons.)

Simulation study Variable parameters: Number of species: 10 1000 True (model) species tree True gene trees Sequence data Number of genes: 50 1000 Finch Falcon Owl Eagle Pigeon Amount of ILS: low, medium, high Deep versus recent speciation Finch Owl Falcon Eagle Pigeon Es mated species tree Es mated gene trees 11 model conditions (50 replicas each) with heterogenous gene tree error Compare to NJst, MP-EST, concatenation (CA-ML) Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree Used SimPhy, Mallo and Posada, 2015 14

Tree accuracy when varying the number of species Species tree topological error (FN) 16% 12% 8% 4% ASTRAL II MP EST 10 50 100 1000 genes, medium levels of recent ILS 16

Tree accuracy when varying the number of species Species tree topological error (FN) 16% 12% 8% 4% ASTRAL II MP EST 10 50 100 200 500 1000 number of species 1000 genes, medium levels of recent ILS 16

Other polynomial 1me algorithms for constrained op1miza1on problems Minimize Duplica1on/Loss Supertree (Halled and Lagergren 2000) Quartet Support (Bryant and Steel 2001) Constrained Minimize Deep Coalescence (Than and Nakhleh 2000, Yu, Warnow, and Nakhleh 2011). Gene tree es1ma1on under Duplica1on and Loss (Szöllősi et al. 2013) Constrained Robinson-Foulds Supertree (Vachaspa1 and Warnow 2016)

Summary NP-hard op1miza1on problems abound in phylogeny reconstruc1on, and in computa1onal biology in general, and need very accurate solu1ons. Divide-and-conquer techniques can provide greatly improved accuracy and scalability, and excellent sta1s1cal performance guarantees. But divide-andconquer methods depend on having good supertree methods. Constrained exact op1miza1on has been surprisingly beneficial for phylogenomic analysis.

Summary NP-hard op1miza1on problems abound in phylogeny reconstruc1on, and in computa1onal biology in general, and need very accurate solu1ons. Divide-and-conquer techniques can provide greatly improved accuracy and scalability, and excellent sta1s1cal performance guarantees. But divide-andconquer methods depend on having good supertree methods. Constrained exact op1miza1on has been surprisingly beneficial for phylogenomic analysis.

Summary NP-hard op1miza1on problems abound in phylogeny reconstruc1on, and in computa1onal biology in general, and need very accurate solu1ons. Divide-and-conquer techniques can provide greatly improved accuracy and scalability, and excellent sta1s1cal performance guarantees. But divide-andconquer methods depend on having good supertree methods. Constrained exact op1miza1on has been surprisingly beneficial for phylogenomic analysis.

Open Problems What other op1miza1on problems are tractable, given constraint sets? How should we define the set X of constraints from a set of source trees, especially when the source trees are incomplete? Are there other ways of constraining the search space that are effec1ve and useful? What are other effec1ve ways of approaching truly largescale phylogeny es1ma1on? Can divide-and-conquer be employed with Bayesian methods? (Or, more generally, how can we make Bayesian methods more scalable?)

Acknowledgments NSF grant DBI-1461364 (joint with Noah Rosenberg at Stanford and Luay Nakhleh at Rice): hdp://tandy.cs.illinois.edu/phylogenomicsproject.html Papers available at hdp://tandy.cs.illinois.edu/papers.html ASTRAL: Available at hdps://github.com/smirarab FastRFS: Available at hdps://github.com/pranjalv123/fastrfs

Constrained Robinson-Foulds Supertree Input: set T of source trees, and set X of allowed bipar11ons Output: tree T that minimizes the total Robinson- Foulds distance to the input source trees, and that draws its bipar11ons from X. Theorem (Vachaspa1 and Warnow 2016):The onstrained RF Supertree can be solved in O( X 2 nk) 1me, where there are k source trees and n species.

Avian Phylogenomics Project Erich Jarvis, HHMI MTP Gilbert, Copenhagen Guojie Zhang, BGI Siavash Mirarab, Tandy Warnow, Texas Texas and UIUC Approx. 50 species, whole genomes 14,000 loci Multi-national team (100+ investigators) 8 papers published in special issue of Science 2014 Biggest computational challenges: 1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1Tb of distributed memory, at supercomputers around world) 2. Constructing coalescent-based species tree from 14,000 different gene trees

Deletion Substitution ACGGTGCAGTTACCA Insertion ACCAGTCACCTA ACGGTGCAGTTACC-A AC----CAGTCACCTA The true mul1ple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree

1KP: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD And many others l 103 plant transcriptomes, 400-800 single copy genes l Wicked, Mirarab et al., PNAS 2014 l Next phase will be much bigger l l ~1000 species and ~1000 genes, and will require the inference of mul?ple sequence alignments and trees on more than 100,000 sequences

Constrained op1miza1on First proposed in Hallet and Lagergren 2000, for the duplica1on-loss species tree problem Algorithms for constrained op1miza1on use dynamic programming to find op1mal solu1ons, construc1ng the best tree from the bodom-up The set X is usually defined to be the bipar11ons from the source trees.

Unaligned Sequences BLASTbased precdcm3 DACTAL Overlapping subsets Existing Method: RAxML(MAFFT) A tree for the entire dataset New supertree method: SuperFine A tree for each subset

Results on three biological datasets 6000 to 28,000 sequences. We show results with 5 DACTAL iterakons

DACTAL Arbitrary input Decompose using tree Overlapping subsets Construct subset trees A tree for the entire dataset Construct supertree A tree for each subset

Triangulated Graphs Defini1on: A graph is triangulated if it has no simple cycles of size four or more.

Sampling mul1ple genes from mul1ple species Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Standard heuris1c search T Random perturba1on Hill-climbing T

Problems with current techniques for MP Shown here is the performance of the TNT software for maximum parsimony on a real dataset of almost 14,000 sequences. The required level of accuracy with respect to MP score is no more than 0.01% error (otherwise high topological error results). ( Optimal here means best score to date, using any method for any amount of time.) 0.2 0.18 0.16 Performance of TNT with time Average MP score above optimal, shown as a percentage of the optimal 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0 0 4 8 12 16 20 24 Hours

Solving NP-hard problems exactly is unlikely Number of (unrooted) binary trees on n leaves is (2n-5)!! If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia #leaves #trees 4 3 5 15 6 105 7 945 8 10395 9 135135 10 2027025 20 2.2 x 10 20 100 4.5 x 10 190 1000 2.7 x 10 2900

Gene tree discordance gene 1 gene1000 Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. 3

Avian Phylogenomics Project E Jarvis, HHMI MTP Gilbert, Copenhagen G Zhang, BGI T. Warnow UT-Aus1n Plus many many other people S. Mirarab Md. S. Bayzid, UT-Aus1n UT-Aus1n Approx. 50 species, whole genomes, 14,000 loci Challenges: Concatena1on analysis used >200 CPU years But also Massive gene tree conflict sugges1ve of ILS, and most gene trees had very low bootstrap support Coalescent-based analysis using MP-EST produced tree that conflicted with concatena1on analysis

Big datasets and hard problems Computa1onally intensive problems: Mul1ple sequence alignment Maximum likelihood gene tree es1ma1on Species tree es1ma1on from mul1ple gene trees Fast methods are generally not sufficiently accurate Accurate methods are generally computa1onally intensive

Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001] Error Rate 0.8 0.6 0.4 NJ Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining! 0.2 0 0 400 800 No. Taxa 1200 1600

Species Tree Estimation requires multiple genes! Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Species tree es1ma1on: difficult, even for small datasets! Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Major Challenges: large datasets, fragmentary sequences Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution. Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements). Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging. Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

Major Challenges: large datasets, fragmentary sequences Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution. Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements). Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging. Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

a Strict Consensus Merger (SCM) e b a e b c a f g b d c a c f g b d a c e h i j b f g d c h i j d h i j d

The Tree of Life: Multiple Challenges Large datasets: 100,000+ sequences 10,000+ genes BigData complexity Large-scale statistical phylogeny estimation Ultra-large multiple-sequence alignment Estimating species trees from incongruent gene trees Supertree estimation Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima

The Tree of Life: Mul?ple Challenges Scien1fic challenges: Ultra-large mul1ple-sequence alignment Alignment-free phylogeny es1ma1on Supertree es1ma1on Es1ma1ng species trees from many gene trees Genome rearrangement phylogeny Re1culate evolu1on Visualiza1on of large trees and alignments Data mining techniques to explore mul1ple op1ma Theore1cal guarantees under Markov models of evolu1on Applica1ons: metagenomics protein structure and func1on predic1on trait evolu1on detec1on of co-evolu1on systems biology Techniques: Graph theory (especially chordal graphs) Probability theory and sta1s1cs Hidden Markov models Combinatorial op1miza1on Heuris1cs Supercompu1ng

1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin Plus many many other people l Plant Tree of Life based on transcriptomes of ~1200 species l More than 13,000 gene families (most not single copy) l First paper: PNAS 2014 (~100 species and ~800 loci) Gene Tree Incongruence Upcoming Challenges (~1200 species, ~400 loci): Species tree es1ma1on under the mul1-species coalescent from hundreds of conflic1ng gene trees on >1000 species (we will use ASTRAL Mirarab et al. 2014, Mirarab & Warnow 2015) Mul1ple sequence alignment of >100,000 sequences (with lots of fragments!) we will use UPP (Nguyen et al., 2015)

phylogenomics Orangutan Chimpanzee gene 1 gene 2 gene 999 gene 1000 Gorilla Human ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene here refers to a portion of the genome (not a functional gene) I ll use the term gene to refer to c-genes : recombination-free orthologous stretches of the genome 2