Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign

Size: px
Start display at page:

Download "Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign"

Transcription

1 Constrained Exact Op1miza1on in Phylogene1cs Tandy Warnow The University of Illinois at Urbana-Champaign

2 Phylogeny (evolu1onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

3 DNA Sequence Evolution AAGGCCT AAGACTT TGGACTT -3 mil yrs -2 mil yrs AGGGCAT TAGCCCT AGCACTT -1 mil yrs AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT today

4 Phylogeny Problem U V W X Y AGTGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U X Y V W

5 Phylogeny + genomics = genome-scale phylogeny es1ma1on.

6 Main compe1ng approaches Species gene 1 gene 2... gene k... Concatenation Analyze separately... Summary Method

7 Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly iden1fy orthologs Compute mul1ple sequence alignment and gene tree for each locus Compute species tree or phylogene1c network from the gene trees or alignments Get sta1s1cal support on each branch (e.g., bootstrapping) Es1mate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

8 Just about everything worth doing is NP-hard Mul1ple sequence alignment Maximum likelihood gene tree es1ma1on Species tree es1ma1on by combining gene trees Supertree es1ma1on and hence divide-and-conquer strategies And Bayesian methods are even more computa1onally intensive

9 Just about everything worth doing is NP-hard Mul1ple sequence alignment Maximum likelihood gene tree es1ma1on Species tree es1ma1on by combining gene trees Supertree es1ma1on and hence divide-and-conquer strategies And Bayesian methods are even more computa1onally intensive

10 Local search heuris1cs (a) Tree bisection... (b) reconnection choice... (c) and reconnection Standard heuris1c search: TBR moves through treespace Randomness to exit local op1ma But there are (2n-5)!! trees on n leaves Figure from Huson 2010

11 Solving maximum likelihood (and other hard optimization problems) is unlikely # of Taxa # of Unrooted Trees x x x

12 Avian Phylogenomics Project Erich Jarvis, HHMI MTP Gilbert, Copenhagen Guojie Zhang, BGI Siavash Mirarab, Tandy Warnow, Texas Texas and UIUC Approx. 50 species, whole genomes 14,000 loci Multi-national team (100+ investigators) 8 papers published in special issue of Science 2014 Biggest computational challenges: 1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1Tb of distributed memory, at supercomputers around world) 2. Constructing coalescent-based species tree from 14,000 different gene trees

13 Only 48 species, but heuris1c ML took ~300 CPU years on mul1ple supercomputers and used 1Tb of memory Jarvis,$Mirarab,$et$al.,$examined$48$ bird$species$using$14,000$loci$from$ whole$genomes.$two$trees$were$ presented.$ $ 1.$A$single$dataset$maximum$ likelihood$concatena,on$analysis$ used$~300$cpu$years$and$1tb$of$ distributed$memory,$using$tacc$and$ other$supercomputers$around$the$ world.$$ $ 2.$However,$every%locus%had%a% different%%tree$ $sugges,ve$of$ incomplete$lineage$sor,ng $ $and$ the$noisy$genomehscale$data$required$ the$development$of$a$new$method,$ sta,s,cal$binning.$ $ $ $ $ RESEARCH ARTICLE Whole-genome analyses resolve early branches in the tree of life of modern birds Erich D. Jarvis, 1 * Siavash Mirarab, 2 * Andre J. Aberer, 3 Bo Li, 4,5,6 Peter Houde, 7 Cai Li, 4,6 Simon Y. W. Ho, 8 Brant C. Faircloth, 9,10 Benoit Nabholz, 11 Jason T. Howard, 1 Alexander Suh, 12 Claudia C. Weber, 12 Rute R. da Fonseca, 6 Jianwen Li, 4 Fang Zhang, 4 Hui Li, 4 Long Zhou, 4 Nitish Narula, 7,13 Liang Liu, 14 Ganesh Ganapathy, 1 Bastien Boussau, 15 Md. Shamsuzzoha Bayzid, 2 Volodymyr Zavidovych, 1 Sankar Subramanian, 16 Toni Gabaldón, 17,18,19 Salvador Capella-Gutiérrez, 17,18 Jaime Huerta-Cepas, 17,18 Bhanu Rekepalli, 20 Kasper Munch, 21 Mikkel Schierup, 21 Bent Lindow, 6 Wesley C. Warren, 22 David Ray, 23,24,25 Richard E. Green, 26 Michael W. Bruford, 27 Xiangjiang Zhan, 27,28 Andrew Dixon, 29 Shengbin Li, 30 Ning Li, 31 Yinhua Huang, 31 Elizabeth P. Derryberry, 32,33 Mads Frost Bertelsen, 34 Frederick H. Sheldon, 33 Robb T. Brumfield, 33 Claudio V. Mello, 35,36 Peter V. Lovell, 35 Morgan Wirthlin, 35 Maria Paula Cruz Schneider, 36,37 Francisco Prosdocimi, 36,38 José Alfredo Samaniego, 6 Amhed Missael Vargas Velazquez, 6 Alonzo Alfaro-Núñez, 6 Paula F. Campos, 6 Bent Petersen, 39 Thomas Sicheritz-Ponten, 39 An Pas, 40 Tom Bailey, 41 Paul Scofield, 42 Michael Bunce, 43 David M. Lambert, 16 Qi Zhou, 44 Polina Perelman, 45,46 Amy C. Driskell, 47 Beth Shapiro, 26 Zijun Xiong, 4 Yongli Zeng, 4 Shiping Liu, 4 Zhenyu Li, 4 Binghang Liu, 4 Kui Wu, 4 Jin Xiao, 4 Xiong Yinqi, 4 Qiuemei Zheng, 4 Yong Zhang, 4 Huanming Yang, 48 Jian Wang, 48 Linnea Smeds, 12 Frank E. Rheindt, 49 Michael Braun, 50 Jon Fjeldsa, 51 Ludovic Orlando, 6 F. Keith Barker, 52 Knud Andreas Jønsson, 51,53,54 Warren Johnson, 55 Klaus-Peter Koepfli, 56 Stephen O Brien, 57,58 David Haussler, 59 Oliver A. Ryder, 60 Carsten Rahbek, 51,54 Eske Willerslev, 6 Gary R. Graves, 51,61 Travis C. Glenn, 62 John McCormack, 63 Dave Burt, 64 Hans Ellegren, 12 Per Alström, 65,66 Scott V. Edwards, 67 Alexandros Stamatakis, 3,68 David P. Mindell, 69 Joel Cracraft, 70 Edward L. Braun, 71 Tandy Warnow, 2,72 Wang Jun, 48,73,74,75,76 M. Thomas P. Gilbert, 6,43 Guojie Zhang 4,77 To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in

14 1KP: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD And many others l 103 plant transcriptomes, single copy genes l Wicked, Mirarab et al., PNAS 2014 l Next phase will be much bigger l l ~1000 species and ~1000 genes, and will require the inference of mul?ple sequence alignments and trees on more than 100,000 sequences

15 Muir, 2016

16 This Talk Model-based tree es1ma1on and NP-hard problems How divide-and-conquer can improve tree es1ma1on Constrained op1miza1on Open problems

17 Markov Model of Site Evolu1on Simplest (Jukes-Cantor, 1969): The model tree T is binary and has subs1tu1on probabili1es p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleo1des) If a site (posi1on) changes on an edge, it changes with equal probability to each of the remaining states. The evolu1onary process is Markovian. The different sites are assumed to evolve independently and iden1cally down the tree (with rates that are drawn from a gamma distribu1on). More complex models (such as the General Markov model) are also considered, ojen with lidle change to the theory.

18 Ques1ons Is the model tree iden1fiable? Which es1ma1on methods are sta1s1cally consistent under this model? How much data does the method need to es1mate the model tree correctly (with high probability), and how well do the methods perform in prac1ce?

19 Sta1s1cal Consistency error Data

20 Answers? We know a lot about which site evolu1on models are iden1fiable, and which methods are sta1s1cally consistent. We know a lidle bit about the sequence length requirements for standard methods. Extensive studies show that even the best methods produce gene trees with some error.

21 Statistically consistent phylogenetic reconstruction methods 1 Hill-climbing heuristics for NP-hard optimization criteria (e.g., Maximum Likelihood) Cost Local optimum Phylogenetic trees Global optimum 2 Polynomial time distance-based methods: Neighbor Joining, FastME, etc. 3. Bayesian methods

22 Distance-based es1ma1on

23 Quan1fying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

24 Neighbor Joining (NJ) on large trees Error Rate NJ Simula1on study based upon fixed edge lengths, K2P model of evolu1on, sequence lengths fixed to 1000 nucleo1des. Error rates reflect propor1on of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] No. Taxa

25 In other words error Data Sta1s1cal consistency doesn t guarantee accuracy w.h.p. unless the sequences are long enough.

26 Boos1ng phylogeny reconstruc1on methods DCMs boost the performance of phylogeny reconstruc1on methods. Base method M DCM DCM-M

27 Divide-and-conquer for phylogeny es1ma1on

28 Markov Model of Site Evolu1on Simplest (Jukes-Cantor, 1969): The model tree T is binary and has subs1tu1on probabili1es p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleo1des) If a site (posi1on) changes on an edge, it changes with equal probability to each of the remaining states. The evolu1onary process is Markovian. The different sites are assumed to evolve independently and iden1cally down the tree (with rates that are drawn from a gamma distribu1on). More complex models (such as the General Markov model) are also considered, ojen with lidle change to the theory.

29 Sequence length requirements The sequence length (number of sites) that a phylogeny reconstruc1on method M needs to reconstruct the true tree with probability at least 1-ε depends on M (the method) ε f = min p(e), g = max p(e), and n, the number of leaves We fix everything but n.

30 Statistical consistency, exponential convergence, and absolute fast convergence (afc)

31 Distance-based es1ma1on

32 Neighbor Joining s sequence length requirement is exponen1al! Adeson 1999: Let T be a Jukes-Cantor model tree defining addi1ve matrix D. Then Neighbor Joining will reconstruct the true tree with high probability from sequences that are of length O(lg n e max Dij ). Lacey and Chang 2009: Matching lower bound

33 Divide-and-conquer for phylogeny es1ma1on Construct subset trees Supertree Step Refinement Step

34 DCM1 Decomposi1ons Input: Set S of sequences, distance matrix d, threshold value q { dij} 1. Compute threshold graph Gq = ( V, E), V = S, E = {( i, j) : d( i, j) q} 2. Perform minimum weight triangulation (note: if d is an additive matrix, then the threshold graph is provably triangulated). DCM1 decomposition : Compute maximal cliques

35 DCM1-boos1ng: Warnow, St. John, and Moret, SODA 2001 Exponentially converging (base) method DCM1 SQS Absolute fast converging (DCM1-boosted) method The DCM1 phase produces a collec1on of trees (one for each threshold), and the SQS phase picks the best tree. For a given threshold, the base method is used to construct trees on small subsets (defined by the threshold) of the taxa. These small trees are then combined into a tree on the full set of taxa.

36 Neighbor Joining on large diameter trees Error Rate NJ Simula1on study based upon fixed edge lengths, K2P model of evolu1on, sequence lengths fixed to 1000 nucleo1des. Error rates reflect propor1on of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] No. Taxa

37 DCM1-boos1ng distance-based methods [Nakhleh et al. ISMB 2001] Error Rate NJ DCM1-NJ Theorem (Warnow et al., SODA 2001): DCM1-NJ converges to the true tree from polynomial length sequences No. Taxa

38 DCM1-boos1ng DCM1-boos1ng: reducing sequence length requirements for gene tree accuracy from exponen1al to polynomial Key algorithmic ingredients: Construct triangulated graph Apply tree construc1on methods to subsets Combine subset trees together using supertree method Select best tree from a set of trees

39 DCM1-boos1ng DCM1-boos1ng: reducing sequence length requirements for gene tree accuracy from exponen1al to polynomial Key algorithmic ingredients: Construct triangulated graph Apply tree construc1on methods to subsets Combine subset trees together using supertree method Select best tree from a set of trees

40 DACTAL-boos1ng: Other applica1ons of divide-and-conquer almost alignment-free tree es1ma1on Improving mul1-locus species tree es1ma1on DCM2-boos1ng: improving heuris1c searches for maximum likelihood Superfine-boos1ng: improving supertree methods

41 Divide-and-conquer for phylogeny es1ma1on Construct subset trees Supertree Step Refinement Step

42 Supertree Problems Quartet Median Supertree Robinson-Foulds Supertree Matrix Representa1on with Parsimony Matrix Representa1on with Likelihood Etc. All are NP-hard because even tes1ng compa1bility of unrooted trees is NP-complete.

43 Supertree Problems Quartet Median Supertree Robinson-Foulds Supertree Matrix Representa1on with Parsimony Matrix Representa1on with Likelihood Etc. All are NP-hard because even tes1ng compa1bility of unrooted trees is NP-complete.

44 phylogenomics Orangutan Chimpanzee gene 1 gene 2 gene 999 gene 1000 Gorilla Human ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene here refers to a portion of the genome (not a functional gene) I ll use the term gene to refer to c-genes : recombination-free orthologous stretches of the genome 2

45 Gene tree discordance Incomplete Lineage Sor1ng (ILS) is a dominant cause of gene tree heterogeneity gene 1 gene1000 Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. 3

46 Gene trees inside the species tree (Coalescent Process) Past Present Courtesy James Degnan Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

47 Incomplete Lineage Sor1ng (ILS) Confounds phylogene1c analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc. There is substan1al debate about how to analyze phylogenomic datasets in the presence of ILS, focused around sta1s1cal consistency guarantees (theory) and performance on data.

48 Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

49 Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

50 Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

51 Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

52 Anomaly zone An anomalous gene tree (AGT) is one that is more probable than the true species tree under the mul1- species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.

53 ASTRAL [Mirarab, et al., ECCB/Bioinformatics, 2014] Optimization Problem (NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees Score(T )= X t2t Q(T ) \ Q(t) Set of quartet trees induced by T a gene tree all input gene trees Theorem: Statistically consistent under the multispecies coalescent model when solved exactly 15

54 Constrained Maximum Quartet Support Tree Input: Set T = {t 1,t 2,,t k } of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipar11ons Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T, subject to T drawing its bipar11ons from X. Theorems: Mirarab et al. 2014: If X contains the bipar11ons from the input gene trees (and perhaps others), then an exact solu1on to this problem is sta1s1cally consistent under the MSC. Mirarab and Warnow 2015: The constrained MQST problem can be solved in O( X 2 nk) 1me. (We use dynamic programming, and build the unrooted tree from the bodom-up, based on allowed clades halves of the allowed bipar11ons.)

55 Constrained Maximum Quartet Support Tree Input: Set T = {t 1,t 2,,t k } of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipar11ons Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T, subject to T drawing its bipar11ons from X. Theorems: Mirarab et al. 2014: If X contains the bipar11ons from the input gene trees (and perhaps others), then an exact solu1on to this problem is sta1s1cally consistent under the MSC. Mirarab and Warnow 2015: The constrained MQST problem can be solved in O( X 2 nk) 1me. (We use dynamic programming, and build the unrooted tree from the bodom-up, based on allowed clades halves of the allowed bipar11ons.)

56 Constrained Maximum Quartet Support Tree Input: Set T = {t 1,t 2,,t k } of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipar11ons Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T, subject to T drawing its bipar11ons from X. Theorems: Mirarab et al. 2014: If X contains the bipar11ons from the input gene trees (and perhaps others), then an exact solu1on to this problem is sta1s1cally consistent under the MSC. Mirarab and Warnow 2015: The constrained MQST problem can be solved in O( X 2 nk) 1me. (We use dynamic programming, and build the unrooted tree from the bodom-up, based on allowed clades halves of the allowed bipar11ons.)

57 Simulation study Variable parameters: Number of species: True (model) species tree True gene trees Sequence data Number of genes: Finch Falcon Owl Eagle Pigeon Amount of ILS: low, medium, high Deep versus recent speciation Finch Owl Falcon Eagle Pigeon Es mated species tree Es mated gene trees 11 model conditions (50 replicas each) with heterogenous gene tree error Compare to NJst, MP-EST, concatenation (CA-ML) Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree Used SimPhy, Mallo and Posada,

58 Tree accuracy when varying the number of species Species tree topological error (FN) 16% 12% 8% 4% ASTRAL II MP EST genes, medium levels of recent ILS 16

59 Tree accuracy when varying the number of species Species tree topological error (FN) 16% 12% 8% 4% ASTRAL II MP EST number of species 1000 genes, medium levels of recent ILS 16

60

61

62 Other polynomial 1me algorithms for constrained op1miza1on problems Minimize Duplica1on/Loss Supertree (Halled and Lagergren 2000) Quartet Support (Bryant and Steel 2001) Constrained Minimize Deep Coalescence (Than and Nakhleh 2000, Yu, Warnow, and Nakhleh 2011). Gene tree es1ma1on under Duplica1on and Loss (Szöllősi et al. 2013) Constrained Robinson-Foulds Supertree (Vachaspa1 and Warnow 2016)

63 Summary NP-hard op1miza1on problems abound in phylogeny reconstruc1on, and in computa1onal biology in general, and need very accurate solu1ons. Divide-and-conquer techniques can provide greatly improved accuracy and scalability, and excellent sta1s1cal performance guarantees. But divide-andconquer methods depend on having good supertree methods. Constrained exact op1miza1on has been surprisingly beneficial for phylogenomic analysis.

64 Summary NP-hard op1miza1on problems abound in phylogeny reconstruc1on, and in computa1onal biology in general, and need very accurate solu1ons. Divide-and-conquer techniques can provide greatly improved accuracy and scalability, and excellent sta1s1cal performance guarantees. But divide-andconquer methods depend on having good supertree methods. Constrained exact op1miza1on has been surprisingly beneficial for phylogenomic analysis.

65 Summary NP-hard op1miza1on problems abound in phylogeny reconstruc1on, and in computa1onal biology in general, and need very accurate solu1ons. Divide-and-conquer techniques can provide greatly improved accuracy and scalability, and excellent sta1s1cal performance guarantees. But divide-andconquer methods depend on having good supertree methods. Constrained exact op1miza1on has been surprisingly beneficial for phylogenomic analysis.

66 Open Problems What other op1miza1on problems are tractable, given constraint sets? How should we define the set X of constraints from a set of source trees, especially when the source trees are incomplete? Are there other ways of constraining the search space that are effec1ve and useful? What are other effec1ve ways of approaching truly largescale phylogeny es1ma1on? Can divide-and-conquer be employed with Bayesian methods? (Or, more generally, how can we make Bayesian methods more scalable?)

67 Acknowledgments NSF grant DBI (joint with Noah Rosenberg at Stanford and Luay Nakhleh at Rice): hdp://tandy.cs.illinois.edu/phylogenomicsproject.html Papers available at hdp://tandy.cs.illinois.edu/papers.html ASTRAL: Available at hdps://github.com/smirarab FastRFS: Available at hdps://github.com/pranjalv123/fastrfs

68 Constrained Robinson-Foulds Supertree Input: set T of source trees, and set X of allowed bipar11ons Output: tree T that minimizes the total Robinson- Foulds distance to the input source trees, and that draws its bipar11ons from X. Theorem (Vachaspa1 and Warnow 2016):The onstrained RF Supertree can be solved in O( X 2 nk) 1me, where there are k source trees and n species.

69 Avian Phylogenomics Project Erich Jarvis, HHMI MTP Gilbert, Copenhagen Guojie Zhang, BGI Siavash Mirarab, Tandy Warnow, Texas Texas and UIUC Approx. 50 species, whole genomes 14,000 loci Multi-national team (100+ investigators) 8 papers published in special issue of Science 2014 Biggest computational challenges: 1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1Tb of distributed memory, at supercomputers around world) 2. Constructing coalescent-based species tree from 14,000 different gene trees

70 Deletion Substitution ACGGTGCAGTTACCA Insertion ACCAGTCACCTA ACGGTGCAGTTACC-A AC----CAGTCACCTA The true mul1ple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree

71 1KP: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD And many others l 103 plant transcriptomes, single copy genes l Wicked, Mirarab et al., PNAS 2014 l Next phase will be much bigger l l ~1000 species and ~1000 genes, and will require the inference of mul?ple sequence alignments and trees on more than 100,000 sequences

72 Constrained op1miza1on First proposed in Hallet and Lagergren 2000, for the duplica1on-loss species tree problem Algorithms for constrained op1miza1on use dynamic programming to find op1mal solu1ons, construc1ng the best tree from the bodom-up The set X is usually defined to be the bipar11ons from the source trees.

73 Unaligned Sequences BLASTbased precdcm3 DACTAL Overlapping subsets Existing Method: RAxML(MAFFT) A tree for the entire dataset New supertree method: SuperFine A tree for each subset

74 Results on three biological datasets 6000 to 28,000 sequences. We show results with 5 DACTAL iterakons

75 DACTAL Arbitrary input Decompose using tree Overlapping subsets Construct subset trees A tree for the entire dataset Construct supertree A tree for each subset

76 Triangulated Graphs Defini1on: A graph is triangulated if it has no simple cycles of size four or more.

77

78 Sampling mul1ple genes from mul1ple species Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

79 Standard heuris1c search T Random perturba1on Hill-climbing T

80 Problems with current techniques for MP Shown here is the performance of the TNT software for maximum parsimony on a real dataset of almost 14,000 sequences. The required level of accuracy with respect to MP score is no more than 0.01% error (otherwise high topological error results). ( Optimal here means best score to date, using any method for any amount of time.) Performance of TNT with time Average MP score above optimal, shown as a percentage of the optimal Hours

81 Solving NP-hard problems exactly is unlikely Number of (unrooted) binary trees on n leaves is (2n-5)!! If each tree on 1000 taxa could be analyzed in seconds, we would find the best tree in 2890 millennia #leaves #trees x x x

82 Gene tree discordance gene 1 gene1000 Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. 3

83 Avian Phylogenomics Project E Jarvis, HHMI MTP Gilbert, Copenhagen G Zhang, BGI T. Warnow UT-Aus1n Plus many many other people S. Mirarab Md. S. Bayzid, UT-Aus1n UT-Aus1n Approx. 50 species, whole genomes, 14,000 loci Challenges: Concatena1on analysis used >200 CPU years But also Massive gene tree conflict sugges1ve of ILS, and most gene trees had very low bootstrap support Coalescent-based analysis using MP-EST produced tree that conflicted with concatena1on analysis

84 Big datasets and hard problems Computa1onally intensive problems: Mul1ple sequence alignment Maximum likelihood gene tree es1ma1on Species tree es1ma1on from mul1ple gene trees Fast methods are generally not sufficiently accurate Accurate methods are generally computa1onally intensive

85 Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

86 Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001] Error Rate NJ Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining! No. Taxa

87 Species Tree Estimation requires multiple genes! Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

88 Species tree es1ma1on: difficult, even for small datasets! Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

89 Major Challenges: large datasets, fragmentary sequences Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution. Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements). Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging. Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

90 Major Challenges: large datasets, fragmentary sequences Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution. Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements). Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging. Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

91 a Strict Consensus Merger (SCM) e b a e b c a f g b d c a c f g b d a c e h i j b f g d c h i j d h i j d

92 The Tree of Life: Multiple Challenges Large datasets: 100,000+ sequences 10,000+ genes BigData complexity Large-scale statistical phylogeny estimation Ultra-large multiple-sequence alignment Estimating species trees from incongruent gene trees Supertree estimation Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima

93 The Tree of Life: Mul?ple Challenges Scien1fic challenges: Ultra-large mul1ple-sequence alignment Alignment-free phylogeny es1ma1on Supertree es1ma1on Es1ma1ng species trees from many gene trees Genome rearrangement phylogeny Re1culate evolu1on Visualiza1on of large trees and alignments Data mining techniques to explore mul1ple op1ma Theore1cal guarantees under Markov models of evolu1on Applica1ons: metagenomics protein structure and func1on predic1on trait evolu1on detec1on of co-evolu1on systems biology Techniques: Graph theory (especially chordal graphs) Probability theory and sta1s1cs Hidden Markov models Combinatorial op1miza1on Heuris1cs Supercompu1ng

94 1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin Plus many many other people l Plant Tree of Life based on transcriptomes of ~1200 species l More than 13,000 gene families (most not single copy) l First paper: PNAS 2014 (~100 species and ~800 loci) Gene Tree Incongruence Upcoming Challenges (~1200 species, ~400 loci): Species tree es1ma1on under the mul1-species coalescent from hundreds of conflic1ng gene trees on >1000 species (we will use ASTRAL Mirarab et al. 2014, Mirarab & Warnow 2015) Mul1ple sequence alignment of >100,000 sequences (with lots of fragments!) we will use UPP (Nguyen et al., 2015)

95 phylogenomics Orangutan Chimpanzee gene 1 gene 2 gene 999 gene 1000 Gorilla Human ACTGCACACCG ACTGC-CCCCG AATGC-CCCCG -CTGCACACGG CTGAGCATCG CTGAGC-TCG ATGAGC-TC- CTGA-CAC-G AGCAGCATCGTG AGCAGC-TCGTG AGCAGC-TC-TG C-TA-CACGGTG CAGGCACGCACGAA AGC-CACGC-CATA ATGGCACGC-C-TA AGCTAC-CACGGAT gene here refers to a portion of the genome (not a functional gene) I ll use the term gene to refer to c-genes : recombination-free orthologous stretches of the genome 2

Computational Challenges in Constructing the Tree of Life

Computational Challenges in Constructing the Tree of Life Computational Challenges in Constructing the Tree of Life Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu Warnow Lab (PhD students)

More information

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign Phylogeny (evolutionary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website,

More information

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support Siavash Mirarab University of California, San Diego Joint work with Tandy Warnow Erfan Sayyari

More information

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign CS 581 Algorithmic Computational Genomics Tandy Warnow University of Illinois at Urbana-Champaign Today Explain the course Introduce some of the research in this area Describe some open problems Talk about

More information

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu Mul$ple Sequence Alignment Methods Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu Species Tree Orangutan Gorilla Chimpanzee Human From the Tree of the Life

More information

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign CS 581 Algorithmic Computational Genomics Tandy Warnow University of Illinois at Urbana-Champaign Course Staff Professor Tandy Warnow Office hours Tuesdays after class (2-3 PM) in Siebel 3235 Email address:

More information

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n From Gene Trees to Species Trees Tandy Warnow The University of Texas at Aus

More information

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees 394C, October 2, 2013 Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees Mul9ple Sequence Alignment Mul9ple Sequence Alignments and Evolu9onary Histories (the meaning of homologous

More information

Computa(onal Challenges in Construc(ng the Tree of Life

Computa(onal Challenges in Construc(ng the Tree of Life Computa(onal Challenges in Construc(ng the Tree of Life Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign h?p://tandy.cs.illinois.edu Phylogeny (evolueonary tree)

More information

The Mathema)cs of Es)ma)ng the Tree of Life. Tandy Warnow The University of Illinois

The Mathema)cs of Es)ma)ng the Tree of Life. Tandy Warnow The University of Illinois The Mathema)cs of Es)ma)ng the Tree of Life Tandy Warnow The University of Illinois Phylogeny (evolu9onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

More information

Fast coalescent-based branch support using local quartet frequencies

Fast coalescent-based branch support using local quartet frequencies Fast coalescent-based branch support using local quartet frequencies Molecular Biology and Evolution (2016) 33 (7): 1654 68 Erfan Sayyari, Siavash Mirarab University of California, San Diego (ECE) anzee

More information

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego Upcoming challenges in phylogenomics Siavash Mirarab University of California, San Diego Gene tree discordance The species tree gene1000 Causes of gene tree discordance include: Incomplete Lineage Sorting

More information

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE) Reconstruction of species trees from gene trees using ASTRAL Siavash Mirarab University of California, San Diego (ECE) Phylogenomics Orangutan Chimpanzee gene 1 gene 2 gene 999 gene 1000 Gorilla Human

More information

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012 CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012 Biology: 21st Century Science! When the human genome was sequenced seven years ago, scientists knew that most of the major scientific

More information

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana

More information

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

More information

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois Genome-scale Es-ma-on of the Tree of Life Tandy Warnow The University of Illinois Phylogeny (evolu9onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

More information

New methods for es-ma-ng species trees from genome-scale data. Tandy Warnow The University of Illinois

New methods for es-ma-ng species trees from genome-scale data. Tandy Warnow The University of Illinois New methods for es-ma-ng species trees from genome-scale data Tandy Warnow The University of Illinois Phylogeny (evolu9onary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website,

More information

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas Metagenomics: Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes

More information

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas Phylogeny (evolutionary tree) Orangutan From the Tree of the Life Website, University of Arizona Gorilla

More information

Ultra- large Mul,ple Sequence Alignment

Ultra- large Mul,ple Sequence Alignment Ultra- large Mul,ple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana- Champaign hep://tandy.cs.illinois.edu Phylogeny (evolu,onary tree) Orangutan

More information

On the variance of internode distance under the multispecies coalescent

On the variance of internode distance under the multispecies coalescent On the variance of internode distance under the multispecies coalescent Sébastien Roch 1[0000 000 7608 8550 Department of Mathematics University of Wisconsin Madison Madison, WI 53706 roch@math.wisc.edu

More information

TheDisk-Covering MethodforTree Reconstruction

TheDisk-Covering MethodforTree Reconstruction TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document

More information

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign Phylogenomics, Multiple Sequence Alignment, and Metagenomics Tandy Warnow University of Illinois at Urbana-Champaign Phylogeny (evolutionary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the

More information

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois WABI 2017 Factsheet 55 proceedings paper submissions, 27 accepted (49% acceptance rate) 9 additional poster-only submissions (+8 posters accompanying papers) 56 PC members Each paper reviewed by at least

More information

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees Concatenation Analyses in the Presence of Incomplete Lineage Sorting May 22, 2015 Tree of Life Tandy Warnow Warnow T. Concatenation Analyses in the Presence of Incomplete Lineage Sorting.. 2015 May 22.

More information

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science!1 Adapted from U.S. Department of Energy Genomic Science

More information

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki Phylogene)cs IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, 2016 Joyce Nzioki Phylogenetics The study of evolutionary relatedness of organisms. Derived from two Greek words:» Phle/Phylon: Tribe/Race» Genetikos:

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary CSCI1950 Z Computa4onal Methods for Biology Lecture 4 Ben Raphael February 2, 2009 hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary Parsimony Probabilis4c Method Input Output Sankoff s & Fitch

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Phylogenetic Geometry

Phylogenetic Geometry Phylogenetic Geometry Ruth Davidson University of Illinois Urbana-Champaign Department of Mathematics Mathematics and Statistics Seminar Washington State University-Vancouver September 26, 2016 Phylogenies

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Methods to reconstruct phylogene1c networks accoun1ng for ILS

Methods to reconstruct phylogene1c networks accoun1ng for ILS Methods to reconstruct phylogene1c networks accoun1ng for ILS Céline Scornavacca some slides have been kindly provided by Fabio Pardi ISE-M, Equipe Phylogénie & Evolu1on Moléculaires Montpellier, France

More information

CS 394C March 21, Tandy Warnow Department of Computer Sciences University of Texas at Austin

CS 394C March 21, Tandy Warnow Department of Computer Sciences University of Texas at Austin CS 394C March 21, 2012 Tandy Warnow Department of Computer Sciences University of Texas at Austin Phylogeny From the Tree of the Life Website, University of Arizona Orangutan Gorilla Chimpanzee Human Phylogeny

More information

Workshop III: Evolutionary Genomics

Workshop III: Evolutionary Genomics Identifying Species Trees from Gene Trees Elizabeth S. Allman University of Alaska IPAM Los Angeles, CA November 17, 2011 Workshop III: Evolutionary Genomics Collaborators The work in today s talk is joint

More information

Recent Advances in Phylogeny Reconstruction

Recent Advances in Phylogeny Reconstruction Recent Advances in Phylogeny Reconstruction from Gene-Order Data Bernard M.E. Moret Department of Computer Science University of New Mexico Albuquerque, NM 87131 Department Colloqium p.1/41 Collaborators

More information

The impact of missing data on species tree estimation

The impact of missing data on species tree estimation MBE Advance Access published November 2, 215 The impact of missing data on species tree estimation Zhenxiang Xi, 1 Liang Liu, 2,3 and Charles C. Davis*,1 1 Department of Organismic and Evolutionary Biology,

More information

Anatomy of a species tree

Anatomy of a species tree Anatomy of a species tree T 1 Size of current and ancestral Populations (N) N Confidence in branches of species tree t/2n = 1 coalescent unit T 2 Branch lengths and divergence times of species & populations

More information

Today's project. Test input data Six alignments (from six independent markers) of Curcuma species

Today's project. Test input data Six alignments (from six independent markers) of Curcuma species DNA sequences II Analyses of multiple sequence data datasets, incongruence tests, gene trees vs. species tree reconstruction, networks, detection of hybrid species DNA sequences II Test of congruence of

More information

CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1. Ben Raphael January 21, Course Par3culars

CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1. Ben Raphael January 21, Course Par3culars CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1 Ben Raphael January 21, 2009 Course Par3culars Three major topics 1. Phylogeny: ~50% lectures 2. Func3onal Genomics: ~25% lectures

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Taming the Beast Workshop

Taming the Beast Workshop Workshop and Chi Zhang June 28, 2016 1 / 19 Species tree Species tree the phylogeny representing the relationships among a group of species Figure adapted from [Rogers and Gibbs, 2014] Gene tree the phylogeny

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics. BIOL 7711 Computational Bioscience Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

More information

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Evolutionary trees. Describe the relationship between objects, e.g. species or genes Evolutionary trees Bonobo Chimpanzee Human Neanderthal Gorilla Orangutan Describe the relationship between objects, e.g. species or genes Early evolutionary studies The evolutionary relationships between

More information

Disk-Covering, a Fast-Converging Method for Phylogenetic Tree Reconstruction ABSTRACT

Disk-Covering, a Fast-Converging Method for Phylogenetic Tree Reconstruction ABSTRACT JOURNAL OF COMPUTATIONAL BIOLOGY Volume 6, Numbers 3/4, 1999 Mary Ann Liebert, Inc. Pp. 369 386 Disk-Covering, a Fast-Converging Method for Phylogenetic Tree Reconstruction DANIEL H. HUSON, 1 SCOTT M.

More information

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Phylogenetic Analysis Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Outline Basic Concepts Tree Construction Methods Distance-based methods

More information

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data Mary E. Cosner Dept. of Plant Biology Ohio State University Li-San Wang Dept.

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

Molecular Evolution & Phylogenetics

Molecular Evolution & Phylogenetics Molecular Evolution & Phylogenetics Heuristics based on tree alterations, maximum likelihood, Bayesian methods, statistical confidence measures Jean-Baka Domelevo Entfellner Learning Objectives know basic

More information

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

More information

Non-binary Tree Reconciliation. Louxin Zhang Department of Mathematics National University of Singapore

Non-binary Tree Reconciliation. Louxin Zhang Department of Mathematics National University of Singapore Non-binary Tree Reconciliation Louxin Zhang Department of Mathematics National University of Singapore matzlx@nus.edu.sg Introduction: Gene Duplication Inference Consider a duplication gene family G Species

More information

BIOINFORMATICS. Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data. Jijun Tang 1 and Bernard M.E. Moret 1

BIOINFORMATICS. Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data. Jijun Tang 1 and Bernard M.E. Moret 1 BIOINFORMATICS Vol. 1 no. 1 2003 Pages 1 8 Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data Jijun Tang 1 and Bernard M.E. Moret 1 1 Department of Computer Science, University of New

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

In comparisons of genomic sequences from multiple species, Challenges in Species Tree Estimation Under the Multispecies Coalescent Model REVIEW

In comparisons of genomic sequences from multiple species, Challenges in Species Tree Estimation Under the Multispecies Coalescent Model REVIEW REVIEW Challenges in Species Tree Estimation Under the Multispecies Coalescent Model Bo Xu* and Ziheng Yang*,,1 *Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China and Department

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1

Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1 Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S1 page 1 Smith, Stephen A., Jeremy M. Beaulieu, Alexandros Stamatakis, and Michael J. Donoghue. 2011. Understanding angiosperm

More information

A phylogenomic toolbox for assembling the tree of life

A phylogenomic toolbox for assembling the tree of life A phylogenomic toolbox for assembling the tree of life or, The Phylota Project (http://www.phylota.org) UC Davis Mike Sanderson Amy Driskell U Pennsylvania Junhyong Kim Iowa State Oliver Eulenstein David

More information

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization) STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization) Laura Salter Kubatko Departments of Statistics and Evolution, Ecology, and Organismal Biology The Ohio State University kubatko.2@osu.edu

More information

Isolating - A New Resampling Method for Gene Order Data

Isolating - A New Resampling Method for Gene Order Data Isolating - A New Resampling Method for Gene Order Data Jian Shi, William Arndt, Fei Hu and Jijun Tang Abstract The purpose of using resampling methods on phylogenetic data is to estimate the confidence

More information

Regular networks are determined by their trees

Regular networks are determined by their trees Regular networks are determined by their trees Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu February 17, 2009 Abstract. A rooted acyclic digraph

More information

Quartet Inference from SNP Data Under the Coalescent Model

Quartet Inference from SNP Data Under the Coalescent Model Bioinformatics Advance Access published August 7, 2014 Quartet Inference from SNP Data Under the Coalescent Model Julia Chifman 1 and Laura Kubatko 2,3 1 Department of Cancer Biology, Wake Forest School

More information

Letter to the Editor. Department of Biology, Arizona State University

Letter to the Editor. Department of Biology, Arizona State University Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona

More information

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Molecular phylogeny How to infer phylogenetic trees using molecular sequences Molecular phylogeny How to infer phylogenetic trees using molecular sequences ore Samuelsson Nov 2009 Applications of phylogenetic methods Reconstruction of evolutionary history / Resolving taxonomy issues

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Molecular phylogeny How to infer phylogenetic trees using molecular sequences Molecular phylogeny How to infer phylogenetic trees using molecular sequences ore Samuelsson Nov 200 Applications of phylogenetic methods Reconstruction of evolutionary history / Resolving taxonomy issues

More information

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

More information

WenEtAl-biorxiv 2017/12/21 10:55 page 2 #2

WenEtAl-biorxiv 2017/12/21 10:55 page 2 #2 WenEtAl-biorxiv 0// 0: page # Inferring Phylogenetic Networks Using PhyloNet Dingqiao Wen, Yun Yu, Jiafan Zhu, Luay Nakhleh,, Computer Science, Rice University, Houston, TX, USA; BioSciences, Rice University,

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

CSCI1950 Z Computa4onal Methods for Biology Lecture 5 CSCI1950 Z Computa4onal Methods for Biology Lecture 5 Ben Raphael February 6, 2009 hip://cs.brown.edu/courses/csci1950 z/ Alignment vs. Distance Matrix Mouse: ACAGTGACGCCACACACGT Gorilla: CCTGCGACGTAACAAACGC

More information

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline Phylogenetics Todd Vision iology 522 March 26, 2007 pplications of phylogenetics Studying organismal or biogeographic history Systematics ating events in the fossil record onservation biology Studying

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

More information

Species Tree Inference by Minimizing Deep Coalescences

Species Tree Inference by Minimizing Deep Coalescences Species Tree Inference by Minimizing Deep Coalescences Cuong Than, Luay Nakhleh* Department of Computer Science, Rice University, Houston, Texas, United States of America Abstract In a 1997 seminal paper,

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Multiple Sequence Alignment. Sequences

Multiple Sequence Alignment. Sequences Multiple Sequence Alignment Sequences > YOR020c mstllksaksivplmdrvlvqrikaqaktasglylpe knveklnqaevvavgpgftdangnkvvpqvkvgdqvl ipqfggstiklgnddevilfrdaeilakiakd > crassa mattvrsvksliplldrvlvqrvkaeaktasgiflpe

More information

Phylogenetic inference: from sequences to trees

Phylogenetic inference: from sequences to trees W ESTFÄLISCHE W ESTFÄLISCHE W ILHELMS -U NIVERSITÄT NIVERSITÄT WILHELMS-U ÜNSTER MM ÜNSTER VOLUTIONARY FUNCTIONAL UNCTIONAL GENOMICS ENOMICS EVOLUTIONARY Bioinformatics 1 Phylogenetic inference: from sequences

More information

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published

More information

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Evolutionary trees. Describe the relationship between objects, e.g. species or genes Evolutionary trees Bonobo Chimpanzee Human Neanderthal Gorilla Orangutan Describe the relationship between objects, e.g. species or genes Early evolutionary studies Anatomical features were the dominant

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Evolu&on, Popula&on Gene&cs, and Natural Selec&on Computa.onal Genomics Seyoung Kim

Evolu&on, Popula&on Gene&cs, and Natural Selec&on Computa.onal Genomics Seyoung Kim Evolu&on, Popula&on Gene&cs, and Natural Selec&on 02-710 Computa.onal Genomics Seyoung Kim Phylogeny of Mammals Phylogene&cs vs. Popula&on Gene&cs Phylogene.cs Assumes a single correct species phylogeny

More information

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz Phylogenetic Trees What They Are Why We Do It & How To Do It Presented by Amy Harris Dr Brad Morantz Overview What is a phylogenetic tree Why do we do it How do we do it Methods and programs Parallels

More information

Phylogenetics in the Age of Genomics: Prospects and Challenges

Phylogenetics in the Age of Genomics: Prospects and Challenges Phylogenetics in the Age of Genomics: Prospects and Challenges Antonis Rokas Department of Biological Sciences, Vanderbilt University http://as.vanderbilt.edu/rokaslab http://pubmed2wordle.appspot.com/

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting

Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting arxiv:1509.06075v3 [q-bio.pe] 12 Feb 2016 Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting Claudia Solís-Lemus 1 and Cécile Ané 1,2 1 Department of Statistics,

More information

1 ATGGGTCTC 2 ATGAGTCTC

1 ATGGGTCTC 2 ATGAGTCTC We need an optimality criterion to choose a best estimate (tree) Other optimality criteria used to choose a best estimate (tree) Parsimony: begins with the assumption that the simplest hypothesis that

More information

arxiv: v5 [q-bio.pe] 24 Oct 2016

arxiv: v5 [q-bio.pe] 24 Oct 2016 On the Quirks of Maximum Parsimony and Likelihood on Phylogenetic Networks Christopher Bryant a, Mareike Fischer b, Simone Linz c, Charles Semple d arxiv:1505.06898v5 [q-bio.pe] 24 Oct 2016 a Statistics

More information

Point of View. Why Concatenation Fails Near the Anomaly Zone

Point of View. Why Concatenation Fails Near the Anomaly Zone Point of View Syst. Biol. 67():58 69, 208 The Author(s) 207. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email:

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

Phylogeny: building the tree of life

Phylogeny: building the tree of life Phylogeny: building the tree of life Dr. Fayyaz ul Amir Afsar Minhas Department of Computer and Information Sciences Pakistan Institute of Engineering & Applied Sciences PO Nilore, Islamabad, Pakistan

More information

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection Mark T. Holder and Jordan M. Koch Department of Ecology and Evolutionary Biology, University of

More information

ALGORITHMIC STRATEGIES FOR ESTIMATING THE AMOUNT OF RETICULATION FROM A COLLECTION OF GENE TREES

ALGORITHMIC STRATEGIES FOR ESTIMATING THE AMOUNT OF RETICULATION FROM A COLLECTION OF GENE TREES ALGORITHMIC STRATEGIES FOR ESTIMATING THE AMOUNT OF RETICULATION FROM A COLLECTION OF GENE TREES H. J. Park and G. Jin and L. Nakhleh Department of Computer Science, Rice University, 6 Main Street, Houston,

More information