Genome-scale Estimation of the Tree of Life. Tandy Warnow, The University of Illinois
1 WABI 2017 Factsheet: 55 proceedings paper submissions, 27 accepted (49% acceptance rate); 9 additional poster-only submissions (+8 posters accompanying papers); 56 PC members; each paper reviewed by at least 3 PC members.
3 Genome-scale Estimation of the Tree of Life. Tandy Warnow, The University of Illinois
4 Phylogeny (evolutionary tree): Orangutan, Gorilla, Chimpanzee, Human. From the Tree of Life Website, University of Arizona
5 Phylogenomics = Species trees from whole genomes. "Nothing in biology makes sense except in the light of evolution" - Dobzhansky
6 The Tree of Life: Multiple Challenges. Scientific challenges: ultra-large multiple-sequence alignment; alignment-free phylogeny estimation; supertree estimation; estimating species trees from many gene trees; genome rearrangement phylogeny; reticulate evolution; visualization of large trees and alignments; data mining techniques to explore multiple optima; theoretical guarantees under Markov models of evolution. Applications: metagenomics; protein structure and function prediction; trait evolution; detection of co-evolution; systems biology. Techniques: graph theory (especially chordal graphs); probability theory and statistics; hidden Markov models; combinatorial optimization; heuristics; supercomputing.
7 Phylogenomics. [Figure: sequence alignments for gene 1, gene 2, ..., gene 999, gene 1000 across Orangutan, Gorilla, Chimpanzee, Human] "Gene" here refers to a portion of the genome (not a functional gene). I'll use the term "gene" to refer to "c-genes": recombination-free orthologous stretches of the genome.
8 Gene tree discordance. Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity. [Figure: trees for gene 1 and gene 1000 on Gorilla, Human, Chimp, Orangutan, with different topologies]
9 Gene trees inside the species tree (Coalescent Process). [Figure: a gene tree drawn inside the species tree, from Past to Present; courtesy James Degnan] Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.
10 Incomplete Lineage Sorting (ILS) confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc. There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.
11 Avian Phylogenomics Project. E. Jarvis, HHMI; MTP Gilbert, Copenhagen; G. Zhang, BGI; T. Warnow, UT-Austin; S. Mirarab, UT-Austin; Md. S. Bayzid, UT-Austin; plus many, many other people. Approx. 50 species, whole genomes, 14,000 loci. Jarvis, Mirarab, et al., Science 2014. Major challenges: Concatenation analysis took > 250 CPU years, and suggested a rapid radiation. Massive gene tree heterogeneity consistent with incomplete lineage sorting. Standard coalescent-based species tree estimation methods contradicted concatenation analysis and prior studies.
12 1KP: Thousand Transcriptome Project. G. Ka-Shu Wong, U Alberta; J. Leebens-Mack, U Georgia; N. Wickett, Northwestern; N. Matasci, iPlant; T. Warnow, S. Mirarab, N. Nguyen, UT-Austin. 103 plant transcriptomes, single copy genes. Next phase will be much bigger. Wickett, Mirarab et al., PNAS 2014. Challenges: Massive gene tree heterogeneity consistent with ILS. Could not use existing coalescent methods due to missing data (many gene trees could not be rooted) and large number of species.
13 This talk. Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC). Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error. New methods in phylogenomics: Statistical binning (Science 2014) and Weighted Statistical Binning (PLOS One 2015): improving gene trees; ASTRAL (Bioinformatics 2014, 2015): quartet-based estimation. Open questions.
14 Big Phylogenomic Data. Thousands to millions of sequences and species, millions of sites per species. Big phylogenomic data are not the same as the usual data: model misspecification; missing data; errors in the data (e.g., from NGS technologies); extreme rates of evolution. Relative performance of methods changes with dataset size and heterogeneity. Even moderate-sized inputs can create huge outputs.
15 Big Data Phylogenomics. Not just an issue of scaling existing methods to large datasets! We also need: new algorithms; new optimization problems; new statistical theory to deal with tradeoffs between dataset size and dataset quality; new machine learning and data mining techniques to explore solution space; new visualization tools (etc.).
16 Sampling multiple genes from multiple species. Orangutan, Gorilla, Chimpanzee, Human. From the Tree of Life Website, University of Arizona
17 A species tree defines a probability distribution on gene trees under the Multi-Species Coalescent (MSC) Model. [Figure: a gene tree drawn inside the species tree, from Past to Present; courtesy James Degnan] Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.
18 Statistical Consistency. [Figure: error versus amount of data]
19 Main competing approaches. [Figure: a species-by-gene data matrix (gene 1, gene 2, ..., gene k) analyzed either by Concatenation or by analyzing each gene separately and combining the results with a Summary Method]
21 Statistically consistent under MSC? CA-ML (concatenation using unpartitioned maximum likelihood): NO. Most frequent gene tree: NO. Minimize Deep Coalescences (MDC): NO. Greedy Consensus (GC): NO. Matrix Representation with Parsimony (MRP, supertree method): NO. Hence, none of these standard approaches is proven to converge to the true species tree as the number of loci increases. Many of them are positively misleading (will converge to the wrong tree)!
22 Anomaly zone. An anomalous gene tree (AGT) is one that is more probable than the true species tree under the multi-species coalescent model. Theorem (Hudson 1983): There are no rooted 3-leaf AGTs. Theorem (Allman et al. 2011, Degnan 2013): There are no unrooted 4-leaf AGTs. Theorem (Degnan 2013, Rosenberg 2013): For n>3, there are model species trees with rooted AGTs, and for n>4 there are model species trees with unrooted AGTs.
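The 3-leaf theorem can be made concrete with the standard coalescent calculation for three taxa. This is not on the slide; the symbol $T$ (length of the internal branch of the species tree, in coalescent units) is notation introduced here for illustration. For a rooted species tree ((A,B),C), the three rooted gene tree topologies have probabilities:

```latex
\begin{align*}
P\big(((A,B),C)\big) &= 1 - \tfrac{2}{3}\,e^{-T},\\
P\big(((A,C),B)\big) \;=\; P\big(((B,C),A)\big) &= \tfrac{1}{3}\,e^{-T}.
\end{align*}
```

Since $1 - \tfrac{2}{3}e^{-T} > \tfrac{1}{3}e^{-T}$ for every $T > 0$, the gene tree topology that matches the species tree is always the most probable one, so no rooted 3-leaf gene tree can be anomalous.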
23 Summary Methods...
24 Summary Methods... Computing a rooted species tree from rooted gene trees: For every three species {a,b,c}, record the most frequent rooted gene tree on {a,b,c}. Combine the rooted three-leaf gene trees into a rooted tree if they are compatible. Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
25 Summary Methods... Computing an unrooted species tree from unrooted gene trees: For every four species {a,b,c,d}, record the most frequent unrooted gene tree on {a,b,c,d}. Combine the unrooted four-leaf gene trees into an unrooted tree if they are compatible (recursive algorithm based on finding sibling pairs and removing one sibling). Theorem: This algorithm is statistically consistent under the MSC and runs in polynomial time.
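The rooted procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: trees are assumed to be nested tuples of taxon names, the helper names (`leaves`, `clades`, `dominant_triplets`, `build`) are hypothetical, and the "combine if compatible" step is filled in with the classical BUILD algorithm of Aho et al., which the slide does not name. Ties between triplet topologies are broken arbitrarily.

```python
from collections import Counter
from itertools import combinations

def leaves(tree):
    """Leaf set of a rooted tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Leaf sets below every internal node (including the root)."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found

def dominant_triplets(gene_trees):
    """For every 3 taxa, the most frequent rooted triplet as (sibling pair, outgroup)."""
    taxa = sorted(leaves(gene_trees[0]))
    tree_clades = [clades(t) for t in gene_trees]
    result = []
    for trio in combinations(taxa, 3):
        counts = Counter()
        for cl in tree_clades:
            for out in trio:
                pair = frozenset(trio) - {out}
                # pair|out holds if some clade contains the pair but not the outgroup
                if any(pair <= c and out not in c for c in cl):
                    counts[(pair, out)] += 1
        if counts:
            result.append(counts.most_common(1)[0][0])
    return result

def build(taxa, triplets):
    """BUILD (Aho et al.): return a tree displaying all triplets, or None if incompatible."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))
    comp = {x: {x} for x in taxa}            # connected components of the pair graph
    for pair, out in triplets:
        a, b = tuple(pair)
        if a in taxa and b in taxa and out in taxa and comp[a] is not comp[b]:
            merged = comp[a] | comp[b]
            for x in merged:
                comp[x] = merged
    parts = {frozenset(s) for s in comp.values()}
    if len(parts) == 1:
        return None                          # the triplets cannot all be displayed
    subtrees = [build(p, triplets) for p in sorted(parts, key=sorted)]
    if any(s is None for s in subtrees):
        return None
    return tuple(subtrees)

# Toy input: two gene trees agree, one is discordant.
gene_trees = [
    ((("H", "C"), "G"), "O"),
    ((("H", "C"), "G"), "O"),
    ((("H", "G"), "C"), "O"),
]
species_tree = build(leaves(gene_trees[0]), dominant_triplets(gene_trees))
print(species_tree)
```

On this toy input the majority triplet HC|G wins and the recovered tree groups H with C, then G, then O. The sketch enumerates all triples, so it is only meant for small examples.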
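The first step of the unrooted procedure, recording the most frequent quartet topology for every four species, can be sketched as follows. This is an illustrative sketch under stated assumptions: unrooted gene trees are written as nested tuples with an arbitrary rooting (the induced quartet topology does not depend on where the root is placed), all helper names are hypothetical, and the recursive sibling-pair assembly step from the slide is not implemented here.

```python
from collections import Counter
from itertools import combinations

def leaves(tree):
    """Leaf set of a tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Leaf sets below every internal node of the (arbitrarily rooted) tree."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found

def quartet_topology(tree_clades, four):
    """Return the split ab|cd induced on `four`, or None if unresolved."""
    four = frozenset(four)
    for c in tree_clades:
        inside = c & four
        if len(inside) == 2:
            # an edge separates exactly two of the four taxa from the other two
            return frozenset([inside, four - inside])
    return None

def dominant_quartets(gene_trees):
    """Most frequent quartet topology for every set of four taxa."""
    taxa = sorted(leaves(gene_trees[0]))
    tree_clades = [clades(t) for t in gene_trees]
    result = {}
    for four in combinations(taxa, 4):
        counts = Counter(quartet_topology(cl, four) for cl in tree_clades)
        counts.pop(None, None)               # ignore unresolved quartets
        if counts:
            result[four] = counts.most_common(1)[0][0]
    return result

gene_trees = [
    (("H", "C"), ("G", "O")),
    (("H", "C"), ("G", "O")),
    (("H", "G"), ("C", "O")),
]
print(dominant_quartets(gene_trees))
```

Because there are no unrooted 4-leaf anomalous gene trees, the dominant quartet topology is the species tree's quartet topology once enough loci are sampled, which is what makes this tally a sound starting point.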
26 Statistically consistent under ILS? Coalescent-based summary methods: MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree based on rooted triplet tree distribution: YES. BUCKy-pop (Ané and Larget 2010): quartet-based Bayesian species tree estimation: YES. And many others (ASTRAL, ASTRID, NJst, GLASS, etc.): YES. Co-estimation methods: *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees: YES. Co-estimation methods are too slow to use on most datasets, hence the debate is largely between concatenation (traditional approach) and summary methods. Single-site methods (SMRT, SVDquartets, METAL, SNAPP, and others): YES. CA-ML (concatenation using unpartitioned maximum likelihood): NO. MDC: NO. GC (Greedy Consensus): NO.
29 Results on 11-taxon datasets with weak ILS. [Figure: average FN rate of *BEAST, CA-ML, BUCKy-con, BUCKy-pop, MP-EST, Phylo-exact, MRP, and GC for 5, 10, 25, and 50 genes] *BEAST is more accurate than summary methods (MP-EST, BUCKy, etc.). CA-ML (concatenated analysis) is most accurate. Datasets from Chung and Ané, 2011. Bayzid & Warnow, Bioinformatics 2013
30 Results on 11-taxon datasets with weak ILS. *BEAST is MORE ACCURATE than summary methods, because *BEAST gets more accurate gene trees!
31 Results on 11-taxon datasets with weak ILS. Summary methods (BUCKy-pop, MP-EST) are both statistically consistent under the MSC but are impacted by gene tree estimation error.
32 Results on 11-taxon datasets with weak ILS. Concatenation (RAxML) is best of all methods on these data! (However, for high enough ILS, concatenation is not as accurate as the best summary methods.)
33 Impact of Gene Tree Estimation Error on MP-EST. [Figure: average FN rate of MP-EST on true versus estimated gene trees] MP-EST has no error on true gene trees, but MP-EST has 9% error on estimated gene trees. Datasets: 11-taxon strongILS conditions with 50 genes. Similar results for other summary methods (MDC, Greedy, etc.).
34 TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees. Summary methods combine estimated gene trees, not true gene trees. Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error. Genome-scale data include a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal. Some researchers also argue that gene trees should be based on very short alignments, to avoid intra-locus recombination.
35 Gene tree estimation error: key issue in the debate.
36 What is the impact of gene tree estimation error on species tree estimation? Question: Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites? Answers (Roch & Warnow, Syst Biol, March 2015): Strict molecular clock: Yes for some new methods, even for a single site per locus. No clock: Unknown for all methods, including MP-EST, ASTRAL, etc. S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods", Systematic Biology, 64(4), 2015.
37 Avian Phylogenomics Project. Erich Jarvis, HHMI; MTP Gilbert, Copenhagen; Guojie Zhang, BGI; Siavash Mirarab, Texas; Tandy Warnow, Texas and UIUC. Approx. 50 species, whole genomes, 14,000 loci. Multi-national team (100+ investigators). 8 papers published in special issue of Science 2014. Biggest computational challenges: 1. Multi-million site maximum likelihood analysis (~300 CPU years, and 1 TB of distributed memory, at supercomputers around the world). 2. Constructing a coalescent-based species tree from 14,000 different gene trees.
38 Only 48 species, but heuristic ML took ~300 CPU years on multiple supercomputers and used 1 TB of memory. Jarvis, Mirarab, et al., examined 48 bird species using 14,000 loci from whole genomes. Two trees were presented. 1. A single-dataset maximum likelihood concatenation analysis used ~300 CPU years and 1 TB of distributed memory, using TACC and other supercomputers around the world. 2. However, every locus had a different tree, suggestive of incomplete lineage sorting, and the noisy genome-scale data required the development of a new method, statistical binning. [Paper shown: Research Article, "Whole-genome analyses resolve early branches in the tree of life of modern birds", Erich D. Jarvis, Siavash Mirarab, et al.] Abstract excerpt: To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in ...
39 We used 100 CPU years (mostly on TACC) to develop and test this method. [Paper shown: Research Article Summary, Avian Genomics, "Statistical binning enables an accurate coalescent-based estimation of the avian tree", Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, Tandy Warnow. Science, 12 December 2014, vol. 346. Corresponding author: warnow@illinois.edu. Cite as S. Mirarab et al., Science 346 (2014).] INTRODUCTION: Reconstructing species trees for rapid radiations, as in the early diversification of birds, is complicated by biological processes such as incomplete lineage sorting (ILS) that can cause different parts of the genome to have different evolutionary histories. Statistical methods, based on the multispecies coalescent model and that combine gene trees, can be highly accurate even in the presence of massive ILS; however, these methods can produce species trees that are topologically far from the species tree when estimated gene trees have error. We have developed a statistical binning technique to address gene tree estimation error and have explored its use in genome-scale species tree estimation with MP-EST, a popular coalescent-based species tree estimation method. RATIONALE: In statistical binning, phylogenetic trees on different genes are estimated and then placed into bins, so that the differences between trees in the same bin can be explained by estimation error (see the figure). A new tree is then estimated for each bin by applying maximum likelihood to a concatenated alignment of the multiple sequence alignments of its genes, and a species tree is estimated using a coalescent-based species tree method from these supergene trees. RESULTS: Under realistic conditions in our simulation study, statistical binning reduced the topological error of species trees estimated using MP-EST and enabled a coalescent-based analysis that was more accurate than concatenation even when gene tree estimation error was relatively high. Statistical binning also reduced the error in gene tree topology and species tree branch length estimation, especially when the phylogenetic signal in gene sequence alignments was low. Species trees estimated using MP-EST with statistical binning on four biological data sets showed increased concordance with the biological literature. When MP-EST was used to analyze 14,446 gene trees in the avian phylogenomics project, it produced a species tree that was discordant with the concatenation analysis and conflicted with prior literature. However, the statistical binning analysis produced a tree that was highly congruent with the concatenation analysis and was consistent with the prior scientific literature. CONCLUSIONS: Statistical binning reduces the error in species tree topology and branch length estimation because it reduces gene tree estimation error. These improvements are greatest when gene trees have reduced bootstrap support, which was the case for the avian phylogenomics project. Because using unbinned gene trees can result in overestimation of ILS, statistical binning may be helpful in providing more accurate estimations of ILS levels in biological data sets. Thus, statistical binning enables highly accurate species tree estimations, even on genome-scale data sets. [Figure: The statistical binning pipeline for estimating species trees from gene trees. The traditional (unbinned) pipeline goes from sequence data to gene alignments to estimated gene trees to the species tree; the statistical binning pipeline adds an incompatibility graph, binned supergene alignments, and supergene trees before the species tree. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees.]
40 Ideas behind statistical binning. Gene tree error tends to decrease with the number of sites in the alignment. [Figure: gene tree error versus number of sites in an alignment] Concatenation (even if not statistically consistent) tends to be reasonably accurate when there is not too much gene tree heterogeneity.
41 Statistical binning technique. [Figure from S. Mirarab et al., Science 346 (2014): the traditional (unbinned) pipeline goes from sequence data to gene alignments to estimated gene trees to the species tree; the statistical binning pipeline adds an incompatibility graph, binned supergene alignments, and supergene trees before the species tree. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees.] Note: Supergene trees are computed using fully partitioned maximum likelihood. Vertex-coloring a graph with balanced color classes is NP-hard; we used a heuristic.
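To illustrate the binning step, here is a small greedy heuristic for coloring an incompatibility graph while keeping color classes (bins) roughly equal in size. It is only a sketch under stated assumptions: it is not the heuristic used in the published pipeline, the function name `balanced_greedy_coloring` is hypothetical, and the conflict edges are assumed to have been computed already by whatever combinability test is in use.

```python
def balanced_greedy_coloring(n_genes, conflicts):
    """Group genes 0..n_genes-1 into bins so that no bin holds two conflicting genes.

    conflicts: iterable of (u, v) pairs, the edges of the incompatibility graph.
    Greedy rule: visit genes by decreasing number of conflicts and put each one
    into the currently smallest bin that contains none of its neighbours,
    opening a new bin only when no existing bin is admissible.
    """
    adj = {g: set() for g in range(n_genes)}
    for u, v in conflicts:
        adj[u].add(v)
        adj[v].add(u)

    bins = []
    for g in sorted(adj, key=lambda x: -len(adj[x])):
        admissible = [b for b in bins if not (adj[g] & b)]
        if admissible:
            min(admissible, key=len).add(g)   # keep bin sizes balanced
        else:
            bins.append({g})
    return bins

# Genes 0, 1, 2 are pairwise incompatible; genes 3, 4, 5 conflict with nothing.
bins = balanced_greedy_coloring(6, [(0, 1), (0, 2), (1, 2)])
print(bins)
```

Each resulting bin would then be concatenated into a supergene alignment and analyzed with fully partitioned maximum likelihood, as the slide notes. A greedy rule like this gives no optimality guarantee, which is consistent with the problem being NP-hard.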
42 Statistical binning vs. unbinned. [Figure: average FN rate, unbinned versus statistical binning, for MP-EST, MDC*(75), MRP, MRL, and GC] Datasets: 11-taxon strongILS datasets with 50 genes from Chung and Ané, Systematic Biology. Binning produces bins with approximately 5 to 7 genes each.
43 Theorem 3 (PLOS One, Bayzid et al. 2015): Unweighted statistical binning pipelines are not statistically consistent under GTR+MSC. As the number of sites per locus increases: All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014). For each bin, with probability converging to 1, the genes in the bin have the same tree topology (but can have different numeric parameters), and there is only one bin for any given tree topology. For each bin, a fully partitioned maximum likelihood (ML) analysis of its supergene alignment converges to a tree with the common gene tree topology. As the number of loci increases: every gene tree topology appears with probability converging to 1. Hence as both the number of loci and number of sites per locus increase, with probability converging to 1, every gene tree topology appears exactly once in the set of supergene trees. It is impossible to infer the species tree from the flat distribution of gene trees!
45 Theorem 2 (PLOS One, Bayzid et al. 2015): Weighted statistical binning (WSB) pipelines are statistically consistent under GTR+MSC. Easy proof: As the number of sites per locus increases: All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014). For every bin, with probability converging to 1, the genes in the bin have the same tree topology. Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin. Hence as the number of sites per locus and number of loci both increase, WSB followed by a statistically consistent summary method will converge in probability to the true species tree. Q.E.D.
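The difference between the two theorems comes down to how often each supergene tree is counted. The sketch below assumes the weighting described in the PLOS One paper, where each supergene tree is repeated once per gene in its bin; the slides do not spell this out, and the function name and toy bins are hypothetical.

```python
from collections import Counter

def supergene_tree_counts(bins, weighted):
    """Frequency of each topology in the input handed to the summary method.

    bins: list of (supergene_topology, genes_in_bin) pairs.
    Unweighted binning counts every supergene tree once; weighted binning
    counts it once per gene in its bin.
    """
    counts = Counter()
    for topology, genes in bins:
        counts[topology] += len(genes) if weighted else 1
    return counts

# Ten loci: six support the species tree topology, two support each alternative.
bins = [
    ("((A,B),C)", ["g1", "g2", "g3", "g4", "g5", "g6"]),
    ("((A,C),B)", ["g7", "g8"]),
    ("((B,C),A)", ["g9", "g10"]),
]
print(supergene_tree_counts(bins, weighted=False))  # every topology counted once
print(supergene_tree_counts(bins, weighted=True))   # original gene tree frequencies
```

With one bin per topology, the unweighted counts are flat and carry no information about which topology is dominant, which is the failure described in Theorem 3. The weighted counts recover the original gene tree frequencies, so a statistically consistent summary method can still identify the species tree.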
46 Weighted Statistical Binning: empirical. WSB is generally benign to highly beneficial: improves accuracy of gene tree topology; improves accuracy of species tree topology; improves accuracy of species tree branch length; reduces incidence of highly supported false positive branches.
47 Statistical binning vs. Unbinned and Concatenation. (a) MP-EST on varying gene sequence length. (b) ASTRAL on varying gene sequence length. Species tree estimation error for MP-EST and ASTRAL, and also concatenation using ML, on avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length. Bayzid et al. (2015), PLoS ONE 10(6).
48 Comparing Binned and Un-binned MP-EST on the Avian Dataset. [Figure: the binned MP-EST tree (branch support shown as unweighted/weighted) next to the unbinned MP-EST tree on the avian species; labeled groups include Australaves, Cursores, Otidimorphae, and Columbea; annotation: "Conflict with other lines of strong evidence"] Unbinned MP-EST strongly rejects Columbea, a major finding by Jarvis, Mirarab, et al. Binned MP-EST is largely consistent with the ML concatenation analysis. The trees presented in Science 2014 were the ML concatenation and Binned MP-EST.
49 Running Time Comparison. Concatenation analysis of the Avian dataset: ~250 CPU years and 1 TB memory. Statistical binning analysis: ~5 CPU years, almost all of which was computing maximum likelihood gene trees, with much less memory usage. Species tree estimation using traditional approaches is more computationally expensive, and not as accurate as coalescent-based methods!
50 Summary (so far). Statistical binning (weighted or unweighted) improves gene trees, and leads to improved species trees in the presence of ILS compared to unbinned analyses. Statistical binning pipelines are also more accurate than concatenation under high ILS. Pipelines using the weighted version are statistically consistent under the multi-species coalescent model. Statistical binning pipelines are much faster than concatenation analyses (e.g., 5 CPU years vs. 250 CPU years for the avian dataset).
51 1KP: Thousand Transcriptome Project. G. Ka-Shu Wong, U Alberta; J. Leebens-Mack, U Georgia; N. Wickett, Northwestern; N. Matasci, iPlant; T. Warnow, S. Mirarab, N. Nguyen, UT-Austin. 103 plant transcriptomes, single copy genes. Wickett, Mirarab et al., PNAS 2014. Next phase will be much bigger (~1000 species and ~1000 genes). Challenges: Massive gene tree heterogeneity consistent with ILS. Could not use existing coalescent methods due to missing data (many gene trees could not be rooted) and large number of species.
52 1KP: Thousand Transcriptome Project. Solution: New coalescent-based method ASTRAL (Mirarab et al., ECCB/Bioinformatics 2014; Mirarab et al., ISMB/Bioinformatics 2015). ASTRAL is statistically consistent, runs in polynomial time, and uses unrooted gene trees.
53 ASTRAL [Mirarab et al., ECCB/Bioinformatics, 2014]. Optimization Problem (NP-hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees: $\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|$, where $Q(T)$ is the set of quartet trees induced by $T$, $t$ is a gene tree, and $\mathcal{T}$ is the set of all input gene trees. Theorem: Statistically consistent under the multispecies coalescent model when solved exactly.
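The quartet score of a candidate species tree can be evaluated directly from its definition on small inputs. The sketch below is a brute-force illustration, not ASTRAL's algorithm (which avoids enumerating quartets): trees are assumed to be nested tuples with an arbitrary rooting, all trees are assumed to share the same leaf set, and the helper names are hypothetical.

```python
from itertools import combinations

def leaves(tree):
    """Leaf set of a tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Leaf sets below every internal node of the (arbitrarily rooted) tree."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found

def quartets(tree):
    """Q(tree): the set of resolved quartet topologies ab|cd induced by the tree."""
    tree_clades = clades(tree)
    induced = set()
    for four in combinations(sorted(leaves(tree)), 4):
        four = frozenset(four)
        for c in tree_clades:
            inside = c & four
            if len(inside) == 2:
                induced.add(frozenset([inside, four - inside]))
                break
    return induced

def quartet_score(species_tree, gene_trees):
    """Score(T) = sum over gene trees t of |Q(T) ∩ Q(t)|."""
    q_species = quartets(species_tree)
    return sum(len(q_species & quartets(t)) for t in gene_trees)

candidate = ((("A", "B"), "C"), ("D", "E"))
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),   # identical to the candidate
    ((("A", "C"), "B"), ("D", "E")),   # swaps B and C
]
print(quartet_score(candidate, gene_trees))
```

On five taxa each tree induces five quartets. The first gene tree shares all five with the candidate and the second shares three, so the score is 8. Enumeration costs on the order of n^4 per tree, which is why it only serves as a reference definition.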
54 Constrained Maximum Quartet Support Tree. Input: Set $\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$ of unrooted gene trees, with each tree on set $S$ with $n$ species, and set $X$ of allowed bipartitions. Output: Unrooted tree $T$ on leafset $S$, maximizing the total quartet tree similarity to $\mathcal{T}$, subject to $T$ drawing its bipartitions from $X$. Theorems (Mirarab et al., 2014): If $X$ contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC. The constrained MQST problem can be solved in $O(|X|^2 nk)$ time. (We use dynamic programming, and build the unrooted tree from the bottom up, based on allowed clades: halves of the allowed bipartitions.)
57 Simulation study. Variable parameters: number of species; number of genes; amount of ILS (low, medium, high); deep versus recent speciation. [Figure: simulation pipeline from true (model) species tree to true gene trees to sequence data to estimated gene trees to estimated species tree, illustrated on Finch, Falcon, Owl, Eagle, Pigeon] 11 model conditions (50 replicates each) with heterogeneous gene tree error. Compare to NJst, MP-EST, concatenation (CA-ML). Evaluate accuracy using FN rate: the percentage of branches in the true tree that are missing from the estimated tree. Used SimPhy (Mallo and Posada).
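The FN rate defined above can be computed by comparing the non-trivial bipartitions (internal branches) of the two trees. The following is a minimal sketch assuming trees are nested tuples with an arbitrary rooting and identical leaf sets; the helper names and the two example topologies are illustrative and not taken from the study.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Leaf sets below every internal node of the (arbitrarily rooted) tree."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found

def bipartitions(tree):
    """Non-trivial bipartitions of the unrooted tree, one per internal branch."""
    all_taxa = leaves(tree)
    splits = set()
    for c in clades(tree):
        if 2 <= len(c) <= len(all_taxa) - 2:
            splits.add(frozenset([c, all_taxa - c]))
    return splits

def fn_rate(true_tree, estimated_tree):
    """Fraction of branches of the true tree that are missing from the estimate."""
    true_splits = bipartitions(true_tree)
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)

true_tree = ((("Finch", "Falcon"), "Owl"), ("Eagle", "Pigeon"))
estimated = ((("Finch", "Owl"), "Falcon"), ("Eagle", "Pigeon"))
print(fn_rate(true_tree, estimated))
```

The true tree has two internal branches and the estimate recovers only the one separating Eagle and Pigeon from the rest, so the FN rate here is 0.5. Multiply by 100 to report it as a percentage, as on the slide.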
59 Tree accuracy when varying the number of species. [Figure: species tree topological error (FN), axis from 4% to 16%, for ASTRAL-II and MP-EST versus number of species] 1000 genes, medium levels of recent ILS.
61 Accuracy in the presence of HGT + ILS. [Figure: results on 200 estimated gene trees] Data: fixed, moderate ILS rate; 50 replicates per HGT rate (1)-(6); 1 model species tree per replicate on 51 taxa; 1000 true gene trees; simulated 1000 bp gene sequences using INDELible (Fletcher, Yang); 1000 gene trees estimated from GTR simulated sequences using FastTree (Price, Dehal, Arkin). Davidson et al., RECOMB-CG, BMC Genomics 2015
64 Summary ASTRAL is a summary methods that is sta9s9cally consistent in the presence of ILS, and that run in polynomial 9me. ASTRAL can analyze very large datasets (1000 species and 1000 genes or more) with high accuracy. Coalescent-based summary methods are much faster than tradi9onal concatena9on approaches, and they can provide improved accuracy in the presence of gene tree heterogeneity. Gene tree es9ma9on error impacts accuracy of species trees but sta9s9cal binning can reduce gene tree es9ma9on error, and lead to improved species tree es9ma9ons (topology, branch lengths, and incidence of false posi9ves).
65 Future Directions Better coalescent-based summary methods (that are more robust to gene tree estimation error). Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees. Better theory about robustness to gene tree estimation error (or lack thereof) for coalescent-based summary methods. Better single-site methods (see SMRT, SVDquartets, METAL, and SNAPP).
66 The Tree of Life: Multiple Challenges Scientific challenges: Ultra-large multiple-sequence alignment; Alignment-free phylogeny estimation; Supertree estimation; Estimating species trees from many gene trees; Genome rearrangement phylogeny; Reticulate evolution; Visualization of large trees and alignments; Data mining techniques to explore multiple optima; Theoretical guarantees under Markov models of evolution. Applications: metagenomics; protein structure and function prediction; trait evolution; detection of co-evolution; systems biology. Techniques: Graph theory (especially chordal graphs); Probability theory and statistics; Hidden Markov models; Combinatorial optimization; Heuristics; Supercomputing.
67 Big Phylogenomic Data Thousands to millions of sequences and species, millions of sites per species. Big phylogenomic data are not the same as the usual data: the relative performance of methods changes with dataset size and heterogeneity, and even moderate-sized inputs can create huge outputs. We need new methods, new optimization problems, new statistical models, new statistical theory, compression methods, visualization methods, ...
68 Acknowledgments NSF grant DBI (joint with Noah Rosenberg at Stanford and Luay Nakhleh at Rice): http://tandy.cs.illinois.edu/phylogenomicsproject.html Papers available at http://tandy.cs.illinois.edu/papers.html Software: ASTRAL and statistical binning available at https://github.com/smirarab; others at http://tandy.cs.illinois.edu/software.html Other funding: David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), Grainger Foundation, and HHMI (to SM)
69 Computational Phylogenomics NP-hard problems; Large datasets; Complex statistical estimation problems; Metagenomics; Protein structure and function prediction; Medical forensics; Systems biology; Population genetics
70 1kp (http:// Gane Ka-Shu Wong (U Alberta), Jim Leebens-Mack (U Georgia), Norm Wickett (Northwestern), Naim Matasci (iPlant, U Arizona), Siavash Mirarab (Texas), Tandy Warnow (Texas and UIUC). Transcriptomes of approx. 1200 species. More than 13,000 gene families (most not single copy). Multi-institutional project (10+ universities). First paper published in PNAS 2014 (Wickett, Mirarab, et al.). Upcoming challenges: 1. estimating very large gene alignments and trees (100,000+ sequences); 2. concatenation maximum likelihood analysis with ~1000 species and more than 100,000 sites; 3. constructing species tree from incongruent gene trees.
71 Avian Phylogenomics Project E Jarvis (HHMI), MTP Gilbert (Copenhagen), G Zhang (BGI), T. Warnow (UT-Austin), S. Mirarab (UT-Austin), Md. S. Bayzid (UT-Austin), plus many many other people. Approx. 50 species, whole genomes, 14,000 loci. Solution: Statistical Binning. Improves coalescent-based species tree estimation by improving gene trees (Mirarab, Bayzid, Boussau, and Warnow, Science 2014); see also weighted statistical binning (Bayzid et al., PLOS One 2015). Avian species tree estimated using Statistical Binning with MP-EST (Jarvis, Mirarab, et al., Science 2014).
73 ASTRID ASTRID: Accurate species trees using internode distances, Vachaspati and Warnow, RECOMB-CG 2015 and BMC Genomics 2015. Algorithmic design: Computes a matrix of average leaf-to-leaf topological distances, and then computes a tree using FastME (more accurate than neighbor joining, and faster too). Related to NJst (Liu and Yu, 2010), which computes the same matrix but then computes the tree using neighbor joining (NJ). Statistically consistent under the MSC. O(kn^2 + n^3) time, where there are k gene trees and n species.
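The first step described on this slide, the matrix of average leaf-to-leaf topological distances, can be sketched as follows. This is an illustrative sketch only: it is not ASTRID's code, the function names are hypothetical, trees are nested Python tuples, and the distance between two leaves is taken to be the number of internal nodes on the path between them in the unrooted tree (an assumption about the exact convention). The tree-building step (FastME or neighbor joining) is not shown.

```python
from collections import defaultdict
from itertools import combinations


def internode_distances(tree):
    """Map each leaf pair to the number of internal nodes between them."""
    dist = {}

    def walk(node, is_root=False):
        if not isinstance(node, tuple):
            return {node: 0}
        # Depth of every leaf below this node, measured in edges.
        child_depths = [
            {leaf: depth + 1 for leaf, depth in walk(child).items()}
            for child in node
        ]
        # Internal nodes on a path = edges - 1. A binary root is an artifact
        # of the tuple notation, so it is not counted as a node.
        skip = 2 if (is_root and len(node) == 2) else 1
        for left, right in combinations(child_depths, 2):
            for a, da in left.items():
                for b, db in right.items():
                    dist[frozenset((a, b))] = da + db - skip
        merged = {}
        for depths in child_depths:
            merged.update(depths)
        return merged

    walk(tree, is_root=True)
    return dist


def average_internode_matrix(gene_trees):
    """Average each pairwise distance over the gene trees containing the pair."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for tree in gene_trees:
        for pair, d in internode_distances(tree).items():
            totals[pair] += d
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


if __name__ == "__main__":
    genes = [
        (("A", "B"), ("C", "D")),
        (("A", "B"), ("C", "D")),
        (("A", "C"), ("B", "D")),
    ]
    for pair, value in sorted(
        average_internode_matrix(genes).items(), key=lambda kv: sorted(kv[0])
    ):
        print(sorted(pair), value)
```

Each gene tree contributes one value per leaf pair, which matches the kn^2 term in the running time on the slide; the n^3 term would come from the distance-based tree construction applied to the resulting matrix.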
76 Both ASTRAL and ASTRID substantially outperform MP-EST [Figure panels: avian simulated dataset; mammalian simulated dataset]
77 ASTRID is very fast [Figure: 48-taxon avian simulated dataset] On the ASTRAL-2 dataset with 1000 taxa and 1000 genes, ASTRID-FastME takes 33 minutes; ASTRAL takes 12 hours.
78 Scaling methods to large datasets BBCA: combining random binning with *BEAST to enable scalability to large numbers of loci (Zimmermann et al., BMC Genomics 2014) Using divide-and-conquer to scale MP-EST (and other methods) to large numbers of taxa (Bayzid et al., BMC Genomics 2014)
80 Tree accuracy when varying the number of species [Figure: species tree topological error (FN) versus number of species, for ASTRAL-II, NJst, and MP-EST; 1000 genes, medium levels of recent ILS]
81 Exact solution We developed a dynamic programming algorithm to solve the problem exactly: exponential running time (still feasible for <18 species). We also developed a constrained version of the problem that can be solved exactly in polynomial time: it runs for 1000 species and 1000 genes in about a day, and remains statistically consistent.
82 ASTRAL-II on biological datasets (ongoing collaborations): 1200 plants with ~400 genes (1KP consortium); 250 avian species with 2000 genes (with LSU, UF, and Smithsonian); 200 avian species with whole genomes (with Genome 10K, international); 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane); 140 insects with 1400 genes (with U. Illinois at Urbana-Champaign); 50 hummingbird species with 2000 genes (with U. Copenhagen and Smithsonian); 40 raptor species (birds) with 10,000 genes (with U. Copenhagen and Berkeley); 38 mammalian species with 10,000 genes (with U. of Bristol, Cambridge, and Nat. Univ. of Ireland)
83 Phylogenomics = Species trees from whole genomes Multiple applications, including Metagenomics, Protein Structure and Function, Conservation Biology, Adaptation
84 ASTRAL and ASTRAL-2 Estimate the species tree from unrooted gene trees by finding the species tree that has the maximum quartet support, subject to input constraint set X. ASTRAL lets X be the set of bipartitions in the input gene trees; ASTRAL-2 includes additional bipartitions beyond this (to address missing data challenges). Theorem: ASTRAL is statistically consistent under the MSC, even when solved in constrained mode (drawing bipartitions from the input gene trees). The constrained version of ASTRAL runs in polynomial time. Open source software at https://github.com/smirarab Published in Bioinformatics 2014 and 2015. Used in Wickett, Mirarab et al. (PNAS 2014).
85 Statistical Consistency for summary methods [Figure: error versus amount of data] Data are gene trees, presumed to be randomly sampled true gene trees.
86 Fig 1. Pipeline for unbinned analyses, unweighted statistical binning, and weighted statistical binning. Bayzid MS, Mirarab S, Boussau B, Warnow T (2015) Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses. PLoS ONE 10(6).
87 Table 1. Model trees used in the Weighted Statistical Binning study. We show number of taxa, species tree branch length (relative to base model), and average topological discordance between true gene trees and true species tree.
Dataset | Species tree branch length scaling | Average Discordance (%)
Avian (48) | 2X | 35
Avian (48) | 1X | 47
Avian (48) | 0.5X | 59
Mammalian (37) | 2X | 18
Mammalian (37) | 1X | 32
Mammalian (37) | 0.5X |
taxon Lower ILS | |
taxon Higher ILS | |
taxon High ILS | | 82
88 Binning can improve species tree topology estimation [Figure panels: (a) MP-EST on varying gene sequence length; (b) ASTRAL on varying gene sequence length] Species tree estimation error for MP-EST and ASTRAL, and also concatenation using ML, on avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length. Bayzid et al. (2015). PLoS ONE 10(6).
89 Comparing Binned and Un-binned MP-EST on the Avian Dataset [Figure: avian species trees from binned MP-EST (support shown as unweighted/weighted) and unbinned MP-EST; legend: conflict with other lines of strong evidence] Binned MP-EST is largely consistent with the ML concatenation analysis. The trees presented in Science 2014 were the ML concatenation and Binned MP-EST. Bayzid et al. (2015). PLoS ONE 10(6).
90 Binning can reduce incidence of high-support false positive edges Cumulative distribution of the bootstrap support values of true positive (left) and false positive (right) edges. If a curve for method X is above the curve for method Y, then X has higher BS for true positives and lower BS for false positives. Values in the shaded area indicate false positive branches with support at 75% or higher. Results are shown for 1000 genes with 500bp, on the avian simulated datasets. Bayzid et al. (2015). PLoS ONE 10(6).
91 Weighted Statistical Binning: empirical However, WSB can reduce accuracy under some conditions. Current simulations have only established this for model conditions that simultaneously have: very small numbers of species (at most 10); very high ILS (AD > 80%); low bootstrap support for gene trees. Most likely there are other conditions as well.
92 Species tree estimation error for MP-EST and ASTRAL on 10-taxon datasets [Figure panels: AD=40%; AD=84%] SimPhy model tree; 200 genes with 100bp (GTRGAMMA); 10 replicates per condition. Notes: Moderate ILS: binning neutral or beneficial using BS=50%. Very high ILS: binning neutral for BS=50%, but increases MP-EST error with BS=75%. Bayzid MS, Mirarab S, Boussau B, Warnow T (2015). PLoS ONE 10(6).
93 Liu and Edwards, Comment in Science, October 2015 Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus: The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML. Simulation study: 5-taxon, strict molecular clock, very high ILS (AD=82%); performed WSB using unpartitioned ML instead of fully partitioned ML; erroneous (ectopic) data in supergene alignments, biasing against WSB. Our re-analysis of their data produced better results than they reported, but WSB did reduce accuracy on their data.
94 Sketch of L&E argument For any sequence length L, there is a model species tree such that nearly all sites on nearly all genes evolve without any changes, and so nearly all gene trees have maximum bootstrap support below the threshold value. As the number of loci increases, the bins produced by WSB will have the same gene tree distribution as for the true species tree (or the deviation will not impact any downstream argument). On each bin, ML concatenation will converge to some tree that is not the species tree. (Hence, applying a coalescent-based method to these supergene trees will not produce the species tree, even as the number of loci increases.)
99 Liu and Edwards, Comment in Science, October 2015 Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus: The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML. Simulation study: 5-taxon, strict molecular clock, very high ILS (AD=82%). Our re-analysis of their data produced better results for statistical binning (both weighted and unweighted) than they reported. They performed WSB using unpartitioned ML instead of fully partitioned ML (biasing against statistical binning). They had erroneous (ectopic) data in their supergene alignments, biasing against statistical binning. This model tree fits into the category of conditions described in Bayzid et al. PLOS One 2015, in which WSB reduced accuracy (very small numbers of taxa, very high ILS). Figure of model tree from L&E, Science 9 October 2015: 171
104 Liu and Edwards, Comment in Science, October 2015 Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus: The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML. Simulation study: 5-taxon, strict molecular clock, very high ILS (AD=82%). Mirarab et al. response (Science, October 2015): Our re-analysis of their data (with correct supergene alignments) shows that WSB reduces accuracy, but not by as much as they report. Our analyses of slightly larger datasets with the same properties (pectinate, very high ILS, strict clock) showed WSB neutral to beneficial. Figure of model tree from L&E, Science 9 October 2015: 171
105 Liu and Edwards, Comment in Science, October 2015 Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus: The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML. Simulation study: 5-taxon, strict molecular clock, very high ILS (AD=82%). All the conditions in which WSB has been shown to reduce accuracy have the following properties: high ILS (AD > 80%); small numbers of taxa (at most 10); low bootstrap support on gene trees; and most also obeyed the strict molecular clock. Bayzid et al. (PLOS One March 2015) advises against the use of WSB under these conditions. Figure of model tree from L&E, Science 9 October 2015: 171
106 L&E ask a good question: performance on bounded number of sites! Question: Do any summary methods converge to the species tree as the number of loci increases, but where each locus has only a constant number of sites? Answers: Roch & Warnow, Syst Biol, March 2015: Strict molecular clock: Yes for some new methods, even for a single site per locus. No clock: Unknown for all methods, including MP-EST, ASTRAL, etc.
107 Conjecture For any positive integer L, if all loci have at most L sites, then coalescent-based summary methods (e.g., MP-EST, ASTRAL, NJst, ASTRID, etc.) cannot be statistically consistent under the MSC. Comments: All current proofs of statistical consistency for standard summary methods assume perfect gene trees, and so inherently assume that sequence lengths for all genes go to infinity. The challenge is that for some model gene trees, ML can be biased towards the wrong tree topology for small enough L (think long branch attraction, Felsenstein Zone).
110 Rephrasing L&E Technical Comment as a conjecture For any positive integer L, if all loci have at most L sites, then WSB pipelines cannot be statistically consistent under the MSC. Comments: Might well be true! The proof will not be easy.
111 Rephrasing L&E Technical Comment as a conjecture For any positive integer L, if all loci have at most L sites, then WSB pipelines cannot be statistically consistent under the MSC. Comments: Open. Will be hard to settle either way.
112 No theoretical difference between MP-EST, ASTRAL, and WSB (according to current knowledge)
Method | Consistency first kind | Consistency second kind
MP-EST | YES | UNKNOWN
ASTRAL | YES | UNKNOWN
Unpartitioned concatenated maximum likelihood | NO | NO
Fully partitioned maximum likelihood | UNKNOWN | UNKNOWN
Unweighted statistical binning followed by consistent summary method (e.g., ASTRAL) | NO | NO
Weighted statistical binning followed by consistent summary method (e.g., ASTRAL) | YES | UNKNOWN
*BEAST | YES | UNKNOWN
Consistency first kind: both number of loci and number of sites go to infinity. Consistency second kind: number of loci goes to infinity, number of sites bounded by L (arbitrary constant). Table from PLOS Currents, Warnow 2015
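The two kinds of consistency named on this slide can be written out as follows. This is one possible formalization, not taken from the source: the symbols T^* (model species tree), \hat{T}_{m,L} (the tree a method returns from m loci with L sites each), m_0, L_0, and \varepsilon are introduced here only for illustration.

```latex
% First kind: the number of loci m and the number of sites per locus L both grow.
\forall \varepsilon > 0 \;\; \exists\, m_0, L_0 \;\; \forall m \ge m_0,\ L \ge L_0 :
\quad \Pr\big[\hat{T}_{m,L} = T^*\big] \ge 1 - \varepsilon .

% Second kind: the number of sites per locus is bounded by an arbitrary constant L.
\forall L \ge 1 : \quad \lim_{m \to \infty} \Pr\big[\hat{T}_{m,L} = T^*\big] = 1 .
```

Under this reading, the second kind is the stronger requirement, which is why a method can be marked YES for the first kind and UNKNOWN for the second.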
116 Species tree estimation error for MP-EST and ASTRAL on 15-taxon datasets Model tree: very high ILS (AD = 82%); strict molecular clock; GTR+Gamma sequence evolution (INDELible); 10 replicates per condition. Notes: BS=75% often improved accuracy (p=0.04). BS=50% sometimes reduced accuracy, but differences were not statistically significant. MP-EST more accurate than ASTRAL on these data. Bayzid MS, Mirarab S, Boussau B, Warnow T (2015). PLoS ONE 10(6).
More informationTaming the Beast Workshop
Workshop and Chi Zhang June 28, 2016 1 / 19 Species tree Species tree the phylogeny representing the relationships among a group of species Figure adapted from [Rogers and Gibbs, 2014] Gene tree the phylogeny
More informationWorkshop III: Evolutionary Genomics
Identifying Species Trees from Gene Trees Elizabeth S. Allman University of Alaska IPAM Los Angeles, CA November 17, 2011 Workshop III: Evolutionary Genomics Collaborators The work in today s talk is joint
More informationCS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012
CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012 Biology: 21st Century Science! When the human genome was sequenced seven years ago, scientists knew that most of the major scientific
More informationIn comparisons of genomic sequences from multiple species, Challenges in Species Tree Estimation Under the Multispecies Coalescent Model REVIEW
REVIEW Challenges in Species Tree Estimation Under the Multispecies Coalescent Model Bo Xu* and Ziheng Yang*,,1 *Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China and Department
More informationTheDisk-Covering MethodforTree Reconstruction
TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document
More informationPhylogenetics in the Age of Genomics: Prospects and Challenges
Phylogenetics in the Age of Genomics: Prospects and Challenges Antonis Rokas Department of Biological Sciences, Vanderbilt University http://as.vanderbilt.edu/rokaslab http://pubmed2wordle.appspot.com/
More informationUsing phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)
Using phylogenetics to estimate species divergence times... More accurately... Basics and basic issues for Bayesian inference of divergence times (plus some digression) "A comparison of the structures
More informationDr. Amira A. AL-Hosary
Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological
More informationA Phylogenetic Network Construction due to Constrained Recombination
A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer
More informationMethods to reconstruct phylogene1c networks accoun1ng for ILS
Methods to reconstruct phylogene1c networks accoun1ng for ILS Céline Scornavacca some slides have been kindly provided by Fabio Pardi ISE-M, Equipe Phylogénie & Evolu1on Moléculaires Montpellier, France
More informationSNPs versus sequences for phylogeography an explora:on using simula:ons and massively parallel sequencing in a non- model bird
SNPs versus sequences for phylogeography an explora:on using simula:ons and massively parallel sequencing in a non- model bird Michael G. Harvey, Brian T. Smith, Brant C. Faircloth, Travis C. Glenn, and
More informationAmira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut
Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological
More informationEvolu&on, Popula&on Gene&cs, and Natural Selec&on Computa.onal Genomics Seyoung Kim
Evolu&on, Popula&on Gene&cs, and Natural Selec&on 02-710 Computa.onal Genomics Seyoung Kim Phylogeny of Mammals Phylogene&cs vs. Popula&on Gene&cs Phylogene.cs Assumes a single correct species phylogeny
More informationBioinformatics tools for phylogeny and visualization. Yanbin Yin
Bioinformatics tools for phylogeny and visualization Yanbin Yin 1 Homework assignment 5 1. Take the MAFFT alignment http://cys.bios.niu.edu/yyin/teach/pbb/purdue.cellwall.list.lignin.f a.aln as input and
More informationSmith et al. American Journal of Botany 98(3): Data Supplement S2 page 1
Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S1 page 1 Smith, Stephen A., Jeremy M. Beaulieu, Alexandros Stamatakis, and Michael J. Donoghue. 2011. Understanding angiosperm
More informationPOPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics
POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the
More informationQuartet Inference from SNP Data Under the Coalescent Model
Bioinformatics Advance Access published August 7, 2014 Quartet Inference from SNP Data Under the Coalescent Model Julia Chifman 1 and Laura Kubatko 2,3 1 Department of Cancer Biology, Wake Forest School
More informationEstimating phylogenetic trees from genome-scale data
Ann. N.Y. Acad. Sci. ISSN 0077-8923 ANNALS OF THE NEW YORK ACADEMY OF SCIENCES Issue: The Year in Evolutionary Biology Estimating phylogenetic trees from genome-scale data Liang Liu, 1,2 Zhenxiang Xi,
More informationPhylogenetic inference
Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types
More informationConstructing Evolutionary/Phylogenetic Trees
Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood
More informationThe statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection
The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection Mark T. Holder and Jordan M. Koch Department of Ecology and Evolutionary Biology, University of
More informationPhylogenetics: Building Phylogenetic Trees
1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should
More informationPhylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University
Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary
More informationProperties of Consensus Methods for Inferring Species Trees from Gene Trees
Syst. Biol. 58(1):35 54, 2009 Copyright c Society of Systematic Biologists DOI:10.1093/sysbio/syp008 Properties of Consensus Methods for Inferring Species Trees from Gene Trees JAMES H. DEGNAN 1,4,,MICHAEL
More informationEstimating Evolutionary Trees. Phylogenetic Methods
Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent
More informationPhylogenomic analyses data of the avian phylogenomics project
Phylogenomic analyses data of the avian phylogenomics project The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Jarvis,
More informationSEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas Phylogeny (evolutionary tree) Orangutan From the Tree of the Life Website, University of Arizona Gorilla
More informationIncomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1-2010 Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci Elchanan Mossel University of
More informationBayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies
Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development
More informationPoint of View. Why Concatenation Fails Near the Anomaly Zone
Point of View Syst. Biol. 67():58 69, 208 The Author(s) 207. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email:
More informationEstimating phylogenetic trees from genome-scale data
Estimating phylogenetic trees from genome-scale data Liang Liu 1,2, Zhenxiang Xi 3, Shaoyuan Wu 4, Charles Davis 3, and Scott V. Edwards 4* 1 Department of Statistics, University of Georgia, Athens, GA
More informationWenEtAl-biorxiv 2017/12/21 10:55 page 2 #2
WenEtAl-biorxiv 0// 0: page # Inferring Phylogenetic Networks Using PhyloNet Dingqiao Wen, Yun Yu, Jiafan Zhu, Luay Nakhleh,, Computer Science, Rice University, Houston, TX, USA; BioSciences, Rice University,
More informationA phylogenomic toolbox for assembling the tree of life
A phylogenomic toolbox for assembling the tree of life or, The Phylota Project (http://www.phylota.org) UC Davis Mike Sanderson Amy Driskell U Pennsylvania Junhyong Kim Iowa State Oliver Eulenstein David
More informationLetter to the Editor. Department of Biology, Arizona State University
Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona
More informationPhylogenetic Tree Reconstruction
I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven
More informationCSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1. Ben Raphael January 21, Course Par3culars
CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1 Ben Raphael January 21, 2009 Course Par3culars Three major topics 1. Phylogeny: ~50% lectures 2. Func3onal Genomics: ~25% lectures
More informationSTEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)
STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization) Laura Salter Kubatko Departments of Statistics and Evolution, Ecology, and Organismal Biology The Ohio State University kubatko.2@osu.edu
More informationJed Chou. April 13, 2015
of of CS598 AGB April 13, 2015 Overview of 1 2 3 4 5 Competing Approaches of Two competing approaches to species tree inference: Summary methods: estimate a tree on each gene alignment then combine gene
More informationAnalysis of a Rapid Evolutionary Radiation Using Ultraconserved Elements: Evidence for a Bias in Some Multispecies Coalescent Methods
Syst. Biol. 65(4):612 627, 2016 The Author(s) 2016. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com
More informationPhylogenetic inference: from sequences to trees
W ESTFÄLISCHE W ESTFÄLISCHE W ILHELMS -U NIVERSITÄT NIVERSITÄT WILHELMS-U ÜNSTER MM ÜNSTER VOLUTIONARY FUNCTIONAL UNCTIONAL GENOMICS ENOMICS EVOLUTIONARY Bioinformatics 1 Phylogenetic inference: from sequences
More informationReconstruire le passé biologique modèles, méthodes, performances, limites
Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire
More informationCSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary
CSCI1950 Z Computa4onal Methods for Biology Lecture 4 Ben Raphael February 2, 2009 hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary Parsimony Probabilis4c Method Input Output Sankoff s & Fitch
More informationRegular networks are determined by their trees
Regular networks are determined by their trees Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu February 17, 2009 Abstract. A rooted acyclic digraph
More informationIdentifiability of the GTR+Γ substitution model (and other models) of DNA evolution
Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution Elizabeth S. Allman Dept. of Mathematics and Statistics University of Alaska Fairbanks TM Current Challenges and Problems
More informationConsistency Index (CI)
Consistency Index (CI) minimum number of changes divided by the number required on the tree. CI=1 if there is no homoplasy negatively correlated with the number of species sampled Retention Index (RI)
More informationSEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas Metagenomics: Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes
More informationPhylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center
Phylogenetic Analysis Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Outline Basic Concepts Tree Construction Methods Distance-based methods
More informationRecent Advances in Phylogeny Reconstruction
Recent Advances in Phylogeny Reconstruction from Gene-Order Data Bernard M.E. Moret Department of Computer Science University of New Mexico Albuquerque, NM 87131 Department Colloqium p.1/41 Collaborators
More informationEvaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets
Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets Xiaofan Zhou, 1,2 Xing-Xing Shen, 3 Chris Todd Hittinger, 4 and Antonis Rokas*,3 1 Integrative Microbiology
More informationTo link to this article: DOI: / URL:
This article was downloaded by:[ohio State University Libraries] [Ohio State University Libraries] On: 22 February 2007 Access Details: [subscription number 731699053] Publisher: Taylor & Francis Informa
More informationA (short) introduction to phylogenetics
A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field
More informationComparative Methods on Phylogenetic Networks
Comparative Methods on Phylogenetic Networks Claudia Solís-Lemus Emory University Joint Statistical Meetings August 1, 2018 Phylogenetic What? Networks Why? Part I Part II How? When? PCM What? Phylogenetic
More informationUnderstanding How Stochasticity Impacts Reconstructions of Recent Species Divergent History. Huateng Huang
Understanding How Stochasticity Impacts Reconstructions of Recent Species Divergent History by Huateng Huang A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor
More informationBINF6201/8201. Molecular phylogenetic methods
BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics
More informationPhylogenetics. BIOL 7711 Computational Bioscience
Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium
More informationPhylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University
Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X
More informationFine-Scale Phylogenetic Discordance across the House Mouse Genome
Fine-Scale Phylogenetic Discordance across the House Mouse Genome Michael A. White 1,Cécile Ané 2,3, Colin N. Dewey 4,5,6, Bret R. Larget 2,3, Bret A. Payseur 1 * 1 Laboratory of Genetics, University of
More informationC3020 Molecular Evolution. Exercises #3: Phylogenetics
C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from
More informationConcepts and Methods in Molecular Divergence Time Estimation
Concepts and Methods in Molecular Divergence Time Estimation 26 November 2012 Prashant P. Sharma American Museum of Natural History Overview 1. Why do we date trees? 2. The molecular clock 3. Local clocks
More informationPhylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline
Phylogenetics Todd Vision iology 522 March 26, 2007 pplications of phylogenetics Studying organismal or biogeographic history Systematics ating events in the fossil record onservation biology Studying
More informationPhylogenetic Networks, Trees, and Clusters
Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University
More informationAssessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition
Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition David D. Pollock* and William J. Bruno* *Theoretical Biology and Biophysics, Los Alamos National
More informationMolecular Evolution & Phylogenetics
Molecular Evolution & Phylogenetics Heuristics based on tree alterations, maximum likelihood, Bayesian methods, statistical confidence measures Jean-Baka Domelevo Entfellner Learning Objectives know basic
More informationNetworks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource
Networks Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource Networks in biology Protein-Protein Interaction Network of Yeast Transcriptional regulatory network of E.coli Experimental
More information7. Tests for selection
Sequence analysis and genomics 7. Tests for selection Dr. Katja Nowick Group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute for Brain Research www. nowicklab.info
More informationParameter Es*ma*on: Cracking Incomplete Data
Parameter Es*ma*on: Cracking Incomplete Data Khaled S. Refaat Collaborators: Arthur Choi and Adnan Darwiche Agenda Learning Graphical Models Complete vs. Incomplete Data Exploi*ng Data for Decomposi*on
More informationEfficient Bayesian Species Tree Inference under the Multispecies Coalescent
Syst. Biol. 66(5):823 842, 2017 The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com
More informationI. Short Answer Questions DO ALL QUESTIONS
EVOLUTION 313 FINAL EXAM Part 1 Saturday, 7 May 2005 page 1 I. Short Answer Questions DO ALL QUESTIONS SAQ #1. Please state and BRIEFLY explain the major objectives of this course in evolution. Recall
More informationConsensus Methods. * You are only responsible for the first two
Consensus Trees * consensus trees reconcile clades from different trees * consensus is a conservative estimate of phylogeny that emphasizes points of agreement * philosophy: agreement among data sets is
More informationConstructing Evolutionary/Phylogenetic Trees
Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood
More informationMolecular phylogeny How to infer phylogenetic trees using molecular sequences
Molecular phylogeny How to infer phylogenetic trees using molecular sequences ore Samuelsson Nov 2009 Applications of phylogenetic methods Reconstruction of evolutionary history / Resolving taxonomy issues
More informationBiology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):
Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Statistical estimation of models of sequence evolution Phylogenetic inference using maximum likelihood:
More information