SUPPLEMENTARY INFORMATION


doi:10.1038/nature11510

Supplementary Table 1. Indel Index Removal

Gene     Number of  Number of  Percentage of sequences   Number of      Number of
         starting   final      removed based on the      non-redundant  unique
         sequences  sequences  Indel Contribution Index  sequences      species
ATP6     22545      20433      9.36%                     5359           3021
ATP8     14810      13724      7.33%                     1638           1244
COX1     24557      19403      21.05%                    6327           4450
COX2     35381      24009      32.14%                    6357           4204
COX3     16171      15300      5.38%                     2822           2191
CYTB     54628      53066      2.86%                     20976          7654
ND1      16256      14590      10.24%                    2881           2056
ND2      38346      30853      19.54%                    11547          5963
ND3      19799      18296      7.59%                     3653           2852
ND4      16550      14926      9.81%                     2979           2041
ND4L     16595      15355      7.47%                     2102           1785
ND5      12250      10221      16.56%                    1849           949
ND6      13011      12471      4.15%                     1345           1015
eef1a1   3433       2653       22.72%                    1880           1743
H3.2     11098      8171       26.30%                    1315           1228
RuBisCO  22656      19441      14.19%                    16322          13912

Supplementary Table 2. Consistency of alignment length and comparison of amino acid usage across three different multiple sequence alignment methods.

Sites: number of sites occupied by amino acids in more than half of species in a KM-Coffee alignment. Length: protein sequence length in human (thale cress for RuBisCO).

Gene     Sites  Length  Clustal  MAFFT   KM-Coffee  KM-Coffee vs   KM-Coffee  Clustal Omega
                        Omega                       Clustal Omega  vs MAFFT   vs MAFFT
ATP6     227    226     9.260    9.246   9.386      99.05          98.75      99.1
ATP8     55     68      11.665   11.413  11.612     91.95          93.3       92.5
COX1     501    513     6.361    6.248   6.334      99.95          99.9       100.0
COX2     227    227     9.104    9.060   9.118      99.7           99.6       99.8
COX3     261    261     7.175    7.128   7.145      99.7           99.6       99.7
CYTB     379    380     10.697   10.647  10.683     99.65          99.6       99.7
ND1      323    318     8.407    8.391   8.413      99.1           98.85      98.9
ND2      346    347     10.694   10.616  10.659     99.1           99.1       99.4
ND3      116    115     10.221   10.126  10.238     97.8           98.3       98.4
ND4      460    459     8.935    8.905   8.995      96.7           96.45      97.1
ND4L     98     98      10.364   10.223  10.236     96.35          96.9       94.1
ND5      611    603     7.071    7.006   7.105      96.55          94.75      95.9
ND6      173    174     8.813    8.937   8.970      87.4           90.5       87.9
eef1a1   364    462     3.047    3.045   3.045      100            100        100.0
H3.2     109    136     3.583    3.581   3.594      99.7           99.7       100.0
RuBisCO  453    479     8.661    8.601   8.651      99.95          99.85      99.9

Supplementary Table 3. Fraction of species with data on the density of non-fixed states.

Gene     Number of  Number of species with  Fraction of species with
         species    2 or more sequences     2 or more sequences
ATP6     3021       2069                    0.68
ATP8     1244       865                     0.70
COX1     4450       1809                    0.41
COX2     4204       2044                    0.49
COX3     2191       1582                    0.72
CYTB     7954       4864                    0.61
ND1      2056       1148                    0.56
ND2      5963       2596                    0.44
ND3      2852       1811                    0.63
ND4      2041       1427                    0.70
ND4L     1785       1710                    0.96
ND5      949        678                     0.71
ND6      1015       706                     0.70
eef1a1   1743       207                     0.12
H3.2     1228       727                     0.59
RuBisCO  13912      2558                    0.18

Supplementary Table 4. Estimating amino acid usage without the contribution of non-fixed states through the elimination of rare amino acid states.

Gene     Amino acid usage  Expected dn/ds from (u-1)/19
ATP6     6.92              0.31
ATP8     9.07              0.42
COX1     3.85              0.15
COX2     6.00              0.26
COX3     6.21              0.27
CYTB     5.78              0.25
ND1      5.72              0.25
ND2      7.32              0.33
ND3      6.17              0.27
ND4      8.34              0.39
ND4L     7.06              0.32
ND5      5.70              0.25
ND6      8.03              0.37
eef1a1   2.25              0.07
H3.2     1.90              0.05
RuBisCO  2.74              0.09
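The last column of Supplementary Table 4 follows directly from the amino acid usage column through the relation (u-1)/19 given in its header. A minimal Python sketch of that arithmetic is shown here; the function name `expected_dnds` and the range check are illustrative assumptions, not part of the original analysis, and only three of the table's rows are reproduced.

```python
def expected_dnds(u: float) -> float:
    """Expected dn/ds from amino acid usage u, using the (u - 1) / 19
    relation stated in the header of Supplementary Table 4."""
    # Assumption for this sketch: u is an average number of amino acid
    # states per site, so it must lie between 1 and 20.
    if not 1.0 <= u <= 20.0:
        raise ValueError("amino acid usage must lie between 1 and 20")
    return (u - 1.0) / 19.0


# Three usage values copied from Supplementary Table 4.
usage = {"ATP6": 6.92, "COX1": 3.85, "H3.2": 1.90}

for gene, u in usage.items():
    print(f"{gene}: u = {u:.2f}, expected dn/ds = {expected_dnds(u):.2f}")
```

Rounded to two decimals, these three values agree with the corresponding table entries (0.31, 0.15 and 0.05).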

Supplementary Table 5. Estimating amino acid usage without the contribution of rare non-fixed states.

Gene     Species with  Number of  Fraction of species  Average amino acid usage  Average amino acid usage
         3 or more     species    with 3 or more       excluding rare            from 1000 replicates
         sequences                sequences            non-fixed states          of single sequences
ATP6     228           3021       0.075                5.0                       5.2
ATP8     51            1244       0.041                5.2                       5.3
COX1     259           4450       0.058                3.0                       3.2
COX2     346           4204       0.082                5.0                       5.4
COX3     55            2191       0.025                3.5                       3.6
CYTB     1585          7954       0.199                5.5                       6.5
ND1      73            2056       0.036                4.7                       4.8
ND2      641           5963       0.107                7.0                       7.3
ND3      123           2852       0.043                5.4                       5.6
ND4      112           2041       0.055                4.8                       4.9
ND4L     52            1785       0.029                4.8                       4.9
ND5      53            949        0.056                3.6                       3.6
ND6      31            1015       0.031                4.1                       4.2
eef1a1   21            1743       0.012                1.1                       1.2
H3.2     12            1228       0.001                0.9                       1.0
RuBisCO  353           13912      0.025                2.2                       2.5

Supplementary Table 6. Estimating average dn/ds in different genes.

Gene     Number of       Number of  Average   Standard deviation
         nonoverlapping  species    observed  of the observed
         clusters                   dn/ds     dn/ds
ATP6     245             1300       0.056     0.048
ATP8     100             781        0.224     0.158
COX1     326             1123       0.015     0.022
COX2     330             1214       0.025     0.024
COX3     173             622        0.036     0.031
CYTB     798             3992       0.039     0.029
ND1      177             569        0.040     0.032
ND2      623             3210       0.067     0.033
ND3      242             989        0.069     0.047
ND4      135             510        0.045     0.027
ND4L     135             441        0.076     0.078
ND5      97              370        0.057     0.028
ND6      104             406        0.073     0.068
eef1a1   94              1343       0.020     0.014
H3.2     73              670        0.037     0.065
RuBisCO  151             13546      0.072     0.067

Supplementary Figure 1. Frequency distribution of amino acid usage across sites in all genes in our dataset. [Plot: x-axis, amino acid usage (0 to 20); y-axis, frequency (0 to 0.08).]

Supplementary Figure 2. Frequency distribution of the number of times an amino acid state is observed across sites in all genes in our dataset. [Plot: x-axis, the number of times an amino acid state is observed at a site (0 to 20); y-axis, frequency (0 to 0.25).]

[Figure: phylogeny drawn against a time axis marked t_0 to t_4.]

Supplementary Figure 3. A simulated phylogeny with regularly spaced speciation events. The total time on the phylogeny between different depths is indicated by t_n; for example, the total evolutionary time on the phylogeny since the last speciation event is t_0. In this example t_0 is twofold larger than t_1, and in general t_n is twofold larger than t_{n+1}. If the rate of amino acid substitution is constant along the phylogeny, then the number of substitutions that happened within the timeframe of t_0 is also twofold higher than the number of substitutions that occurred within the timeframe of t_1. Therefore, given a multiple alignment of orthologues from the species represented on this tree, the number of amino acid states found only once is expected to be twofold larger than the number of states that are found twice. Thus, given a realistic phylogeny of many species, without an overwhelming bias of shorter branches close to the terminal areas of the phylogeny, the frequency distribution of amino acid states is expected to be an exponentially declining function closely resembling the relationship reported in Supplementary Figure 2.
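The halving argument in the caption of Supplementary Figure 3 can be illustrated with a few lines of Python. This is only a sketch under strong assumptions that are not taken from the original: a perfectly balanced binary tree, unit time between speciation rounds, a constant substitution rate, and no repeated substitutions at the same site. The function name `expected_state_counts` is hypothetical.

```python
def expected_state_counts(levels: int, rate: float = 1.0) -> dict[int, float]:
    """Expected number of derived amino acid states, keyed by the number
    of species (tips) that carry them.

    Assumptions of this sketch: a perfectly balanced binary tree with
    `levels` rounds of speciation, unit time between rounds, a constant
    substitution rate, and no repeated substitution at the same site.
    """
    counts = {}
    for n in range(levels):
        branches = 2 ** (levels - n)    # branches in the n-th layer above the tips
        total_time = branches * 1.0     # t_n: summed branch length in that layer
        tips_below = 2 ** n             # species inheriting a change made in layer n
        counts[tips_below] = rate * total_time
    return counts


counts = expected_state_counts(levels=4)
for copies, expected in counts.items():
    print(f"states seen in {copies} species: {expected}")
```

Under these assumptions each layer holds half the total branch length of the layer below it, so the expected number of states halves each time the number of species sharing a state doubles, which is the declining pattern the caption describes.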

[Supplementary Figure 4, panels A-N.]

[Supplementary Figure 4, panels O-P; one panel is labelled H3.2.]

Supplementary Figure 4. The relationship between amino acid usage, u, and the number of sequences included in the multiple alignment. From the multiple alignment we sampled a single sequence without replacement and placed it into a new alignment, calculating u in the new alignment at every step until we ran out of sequences. The procedure was repeated 100 times.
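The resampling procedure described for Supplementary Figure 4 can be sketched in Python as follows. Several things here are assumptions rather than the original implementation: u is taken to be the mean number of distinct amino acid states per alignment column with gaps ignored, the curve is averaged over replicates, and the function names, the seed and the four-sequence toy alignment are invented for illustration.

```python
import random


def amino_acid_usage(alignment):
    """Mean number of distinct amino acids per column, ignoring gaps ('-').

    Assumption of this sketch: this is what the caption calls u.
    """
    n_cols = len(alignment[0])
    total = 0
    for col in range(n_cols):
        states = {seq[col] for seq in alignment} - {"-"}
        total += len(states)
    return total / n_cols


def usage_curve(alignment, replicates=100, seed=0):
    """Average u after adding 1, 2, ..., N sequences in random order."""
    rng = random.Random(seed)
    n = len(alignment)
    sums = [0.0] * n
    for _ in range(replicates):
        # Drawing all N sequences without replacement gives a random order.
        order = rng.sample(alignment, n)
        for k in range(1, n + 1):
            sums[k - 1] += amino_acid_usage(order[:k])
    return [s / replicates for s in sums]


# Hypothetical toy alignment: four aligned sequences of three sites each.
toy = ["MKV", "MRV", "MKI", "LKV"]
for k, u in enumerate(usage_curve(toy), start=1):
    print(f"{k} sequences: mean u = {u:.2f}")
```

With a single sequence u is always 1, and it can only stay level or rise as sequences are added, so the curve shows how quickly the estimate of u approaches its value for the full alignment.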