Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1

Similar documents
Department of Computer Science, Technical University of Munich, Bolzmannstr. 3, 85747, Garching b. Mu nchen, Germany 2

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

COMPUTING LARGE PHYLOGENIES WITH STATISTICAL METHODS: PROBLEMS & SOLUTIONS

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Exam Dates. February 19 March 1 If those dates don't work ad hoc in Heidelberg

Molecular Evolution & Phylogenetics

A phylogenomic toolbox for assembling the tree of life

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

A (short) introduction to phylogenetics

a-dB. Code assigned:

Anatomy of a species tree

a-fB. Code assigned:

Constructing Evolutionary/Phylogenetic Trees

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets

MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods

A Fitness Distance Correlation Measure for Evolutionary Trees

Dr. Amira A. AL-Hosary

Session 5: Phylogenomics

SUPPLEMENTARY INFORMATION

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

PhyloNet. Yun Yu. Department of Computer Science Bioinformatics Group Rice University

Large-scale phylogenetic analysis on current HPC architectures

Symmetric Tree, ClustalW. Divergence x 0.5 Divergence x 1 Divergence x 2. Alignment length

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetics: Building Phylogenetic Trees

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Constructing Evolutionary/Phylogenetic Trees

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Phylogenetic Tree Reconstruction

Effects of Gap Open and Gap Extension Penalties

Consistency Index (CI)

Phylogenetics in the Age of Genomics: Prospects and Challenges

Consensus Methods. * You are only responsible for the first two

VMD: Visual Molecular Dynamics. Our Microscope is Made of...

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Impact of recurrent gene duplication on adaptation of plant genomes

M ultiple sequence alignment (MSA) has long been a standard stage in phylogenetic workflows1,2. In this

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

The Exelixis Lab, Dept. of Computer Science, Technische Universität München, Germany. Organismic Botany, Eberhard-Karls-University, Tübingen, Germany

An Evaluation of Different Partitioning Strategies for Bayesian Estimation of Species Divergence Times

Phylogenetic analyses. Kirsi Kostamo

林仲彥. Dec 4,

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Phylogenetic Inference using RevBayes

SUPPLEMENTARY INFORMATION

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2012 University of California, Berkeley

Estimating maximum likelihood phylogenies with PhyML

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

Consensus methods. Strict consensus methods

An Investigation of Phylogenetic Likelihood Methods

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Isolating - A New Resampling Method for Gene Order Data

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

High-Performance Algorithm Engineering for Large-Scale Graph Problems and Computational Biology

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Computational Biology Course Descriptions 12-14

Unsupervised Learning in Spectral Genome Analysis

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid

Phylogenetic inference

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano

Taming the Beast Workshop

Branch-Length Prior Influences Bayesian Posterior Probability of Phylogeny

molecular evolution and phylogenetics

Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week:

Supplemental Data. Perea-Resa et al. Plant Cell. (2012) /tpc

Life in an unusual intracellular niche a bacterial symbiont infecting the nucleus of amoebae

Phylogenomics. Jeffrey P. Townsend Department of Ecology and Evolutionary Biology Yale University. Tuesday, January 29, 13

Molecular evidence for multiple origins of Insectivora and for a new order of endemic African insectivore mammals

Systematics - Bio 615

Bayesian Models for Phylogenetic Trees

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

TheDisk-Covering MethodforTree Reconstruction

Preliminaries. Download PAUP* from: Tuesday, July 19, 16

SEQUENCING NUCLEAR MARKERS IN FRESHWATER GREEN ALGAE: CHARA SUBSECTION WILLDENOWIA

Parsimony via Consensus

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species

C.DARWIN ( )

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Introduction to Bioinformatics

Fast Hash-Based Algorithms for Analyzing Tens of Thousands of Evolutionary Trees

7. Tests for selection

Chapter 26 Phylogeny and the Tree of Life

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Tree Estimation With and Without Alignment: New Distance Methods and Benchmarking

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Detailed overview of the primer-free full-length SSU rrna library preparation.

Transcription:

Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S1 page 1 Smith, Stephen A., Jeremy M. Beaulieu, Alexandros Stamatakis, and Michael J. Donoghue. 2011. Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany 98(3): 404-414. Appendix S1. The consensus tree of the maximum likelihood phylogenies for 55,473 species of seed plants and the location of significant shifts in diversification rate across the tree. The data matrix was constructed using the megaphylogeny method and includes DNA sequences for six gene regions: atpb, matk, trnk, trnl, rbcl, and nuclear ribosomal ITS. Black branches denote inferred shifts in diversification rate based on i statistic.

Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S2 page 1 Smith, Stephen A., Jeremy M. Beaulieu, Alexandros Stamatakis, and Michael J. Donoghue. 2011. Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany 98(3): 404-414. Appendix S2. Phylogenetic construction. We constructed the phylogenetic dataset for 55,473 seed plants using PHLAWD and the methodology described in Smith et al., (2009). The dataset included flowering plants and Acrogymnospermae (for rooting) using available data from GenBank for the gene regions atpb, matk, trnk, trnl, rbcl, and ITS. These gene regions were concatenated creating a dataset with 9853 sites. Phylogenetic inference has been facing two major challenges over the last couple of years (i) the immense pace of accumulation of molecular data and (ii) the multi-core revolution, that is, the introduction of parallel computing at the laptop and desktop level for building ever more powerful computers. In general, work on phylogeny reconstruction methods over the past years can be regarded as a continuous struggle to maintain program scalability with input dataset growth, both with respect to increasing numbers of taxa and genes. In fact, the memory consumption of phylogenomic analyses, especially for memory-intensive likelihood-based (Maximum Likelihood and Bayesian) programs, is becoming an increasing concern (RAxML memory requirements of up to 190GB have been reported; Ziheng Yang, pers. comm., May 2010), in particular with the announcement of largescale genome sequencing projects such as the 10K Vertebrate project (http://www.genome10k.org/). With the increasing number of cores available on current multi-core and supercomputing systems, the fine-grained parallelization of the likelihood function is becoming more important for accelerating programs, handling their memory requirements, and exploiting modern computer and accelerator architectures such as Graphics Processing Units (GPUs) (Ott et al., 2008; Stamatakis and Ott, 2008; Pratas et al., 2009; Suchard and Rambaut, 2009). With respect to heuristic tree search approaches, there has been a convergence of the strategies deployed (Guindon et al., 2010) in programs such as GARLI

Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S2 page 2 (Zwickl, 2006), PHYML (Guindon et al., 2010), or RaxML (Stamatakis, 2006b) (and future versions of MrBayes (Ronquist and Huelsenbeck, 2003); John Huelsenbeck, pers. comm., May 2010) that are all based on some flavor of so-called lazy SPR moves (Stamatakis et al., 2005). Algorithmic progress has also been achieved with fast approaches for computing support values such as the RAxML rapid bootstrap algorithm (Stamatakis et al., 2008) or the alrt (Anisimova and Gascul, 2006) test (implemented in PHYML 3.0 [Guindon et al., 2010], FastTree 2.0 [Price et al., 2010], and RAxML v727). Finally, the community is also facing two technical challenges (i) the increasing code complexity of phylogeny reconstruction programs that offer a plethora of substitution models and can analyze binary, DNA, morphological, secondary structure, protein, and Codon data. This increase in code complexity poses several difficult software engineering challenges. (ii) The absence of checkpoint mechanisms in most programs, that is, the ability to stop and restart long-running analyses may hinder such analyses in the future. The personal view of the authors is that the most urgent issue that needs to be addressed is memory consumption though (see Stamatakis and Alachiotis, 2010 for a novel solution to reduce memory consumption for "gappy" phylogenomic datasets). Phylogenetic trees were inferred using the Pthreads-based and SSE3-vectorized RAxML (Stamatakis, 2006b) version 7.2.6. The post-analysis steps (consensus tree building, evaluating the final trees under the GTR+GAMMA model etc.) were carried out with RAxML v7.2.7. We used the standard RAxML search algorithm with the asymptotic stopping rule and the low memory consumption flag (-F and -D options) to infer 223 ML trees on the original alignment under the GTR+CAT approximation of rate heterogeneity (Stamatakis, 2006a) and a partitioned model (we estimated the GTR and alpha parameters separately for each gene) with a joint branch length estimate. The usage of the GAMMA model of rate heterogeneity was not possible on all multi-core systems we used for the analysis because of memory limitations (a run under GTR+CAT required approximately 30GB of main

Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S2 page 3 memory, a run under GAMMA requires approximately four times more memory). Branch lengths and likelihood scores under GTR+GAMMA for all 223 ML trees were computed using the -f n option. We also inferred 244 bootstrap trees using the RAxML rapid bootstrap algorithm (Stamatakis et al., 2008). We then plotted BS support values onto the best-scoring ML tree and also computed strict, majority-rule, and extended majority rule consensus trees for the bootstrap replicates and the ML trees on the original alignment. We also applied the bootstopping (bootstrap convergence) tests (Pattengale et al., 2010) a posteriori to the bootstrap trees. The test indicated that an insufficient number of BS replicates has been computed to guarantee stable support values. Finally we computed pair-wise Robinson Foulds (RF) distances between all ML trees (average relative RF: 21.79%) and all bootstrap trees (average relative RF: 53.32\%). For our analyses we used two 32-core Sun x4600 multi-core systems with 64GB of main memory and several AMD Barcelona 16-core systems with 64GB and 128GB of main memory at the Laboratory for Computational Biology and Bioinformatics at the Swiss Federal Institute of Technology and the Texas Advanced Computing Center (TACC). We executed one multi-threaded RAxML run per node at a time (note that, a single ML search required approximately 6 days on 16 cores). In order to make use of the resources at TACC which is a typical supercomputer system with job run-time restrictions, we deployed appropriate scripts and the newly integrated restart capability of RAxML that makes use of the dmtcp library (http://dmtcp.sourceforge.net/). To compute the likelihood scores and estimate the branch lengths of the final ML trees under GTR+GAMMA we completely re-designed the numerical scaling procedure in RAxML that is used to prevent numerical underflow in the likelihood function. For trees of this size it turned out that, the previous scaling procedure implemented in RAxML was not sufficient for obtaining numerically stable results. Since the adapted scaling procedure is more

Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S2 page 4 compute-intensive (more arithmetic operations are required for scaling) the RAxML version that contains this scaling procedure for large-scale datasets can be compiled using a separate Makefile and has been made available in RAxML v7.2.7. Finally, the consensus tree building step was also conducted with RAxML v7.2.7 that uses a faster tree parsing mechanism as well as some algorithmic optimizations for building extended majority rule consensus trees (Aberer et al., 2010). Supplementary References Aberer, A. J., N. D. Pattengale, and A. Stamatakis. 2010. Parallel computation of phylogenetic consensus trees. Procedia Computer Science 1:1059-1067. Anisimova, M. and O. Gascuel. 2006. Approximate likelihood-ratio test for branches: A fast, accurate, and powerful alternative. Systematic Biology 55: 539-552. Guindon, S., J. F. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, O. Gascuel. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology 59: 307. Ott, M., J. Zola, S. Aluru, A.D. Johnson, D. Janies and A. Stamatakis. 2008. Large-scale Phylogenetic Analysis on Current HPC Architectures. Scientific Programming 16: 255-270. Pattengale, N. D., M. Alipour, O. R. P. Bininda-Emonds, B. M. E. Moret, and A. Stamatakis. 2008. How many bootstrap replicates are necessary? Journal of Computational Biology 17: 337-354. Pratas, F., P. Trancoso, A. Stamatakis, and L. Sousa. 2009. Fine-grain Parallelism for the Phylogenetic Likelihood Functions on Multi-cores, Cell/BE, and GPUs. Proceedings of ICPP 2009. Price, M. N., P. S. Dehal and A. P. Arkin. 2010. FastTree 2--Approximately Maximum-Likelihood Trees for Large Alignments. PloS One. Ronquist, F. and J. P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574.

Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S2 page 5 Stamatakis, A. 2006a. Phylogenetic Models of Rate Heterogeneity: A High Performance Computing Perspective. Proceedings of IPDPS 2006. HICOMB Workshop, Proceedings on CD, Rhodos, Greece. Stamatakis, A. 2006b. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688-2690. Stamatakis, A. and N. Alachiotis. 2010. Time and memory efficient likelihood-based tree searches on phylogenomic alignments with missing data. Bioinformatics 26: 1132. Stamatakis, A. and M. Ott. 2008. Exploiting Fine-Grained Parallelism in the Phylogenetic Likelihood Function with MPI, Pthreads, and OpenMP: A Performance Study. PRIB eds. Madhu Chetty and Alioune Ngom and Shandar Ahmad. Pgs. 424-435. Lecture Notes in Computer Science. Stamatakis, A., T. Ludwig and H. Meier. 2005. RAxML-III: A Fast Program for Maximum Likelihood-based Inference of Large Phylogenetic Trees. Bioinformatics 21:456-463. Stamatakis, A., P. Hoover and J. Rougemont. 2008. A Rapid Bootstrap Algorithm for the RAxML Web Servers. Systematic Biology 57:758-771. Suchard, M. A. and A. Rambaut. 2009. Many-core algorithms for statistical phylogenetics. Bioinformatics 25: 1370. Zwickl, D. 2006. Genetic Algorithm Approaches for the Phylogenetic Analysis of Large Biological Sequence Datasets under the Maximum Likelihood Criterion. Ph. D. Thesis University of Texas at Austin.