Supplementary text and figures: Comparative assessment of methods for aligning multiple genome sequences


Xiaoyu Chen, Martin Tompa
Department of Computer Science and Engineering
Department of Genome Sciences
Box
University of Washington
Seattle, WA
U.S.A.

Nature Biotechnology: doi:10.1038/nbt.1637

S1 Comparing StatSigMA-w with Previous Accuracy Assessments

Margulies et al. [2] estimated the specificity of the ENCODE alignments using two measures: Alu exclusion, defined as the fraction of human Alu residues that are not aligned in other species, and coding sequence periodicity. By both of these measures, they found TBA and Pecan to be the most accurate alignments, with TBA slightly more accurate than Pecan. They performed these analyses only on the mammals, providing no information about the accuracies of nonmammal alignments.

Because Alu sequences occur mostly in intronic and intergenic regions [1], their analysis should correspond to our accuracy analysis in the intronic and intergenic categories. For the placental mammals, we do indeed find that TBA and Pecan are comparable and have the lowest suspicious% values. However, our accuracy results disagree sharply with theirs on the nonplacental species included in both analyses, namely monodelphis and platypus, in each of the four location categories. As Figure 3 in the main paper illustrates, suspicious% increases in these species in the order Pecan, MAVID, TBA, MLAGAN. In the intronic and intergenic categories, TBA's suspicious% is 5 times that of Pecan in platypus, and -2 times that of Pecan in monodelphis. In contrast, in terms of Alu exclusion for these two species, Margulies et al. [2] showed that TBA is best, with Pecan and MAVID close behind. Unlike in our analysis, in their analysis monodelphis and platypus do not exhibit significantly different patterns of alignment accuracy from the placental mammals cow, dog, armadillo, elephant, tenrec, shrew, bat, and rabbit.

Differences between the accuracy measures may have led to these differences in conclusions. First, Alu sequences represent only a narrow portion of the spectrum of genomic sequences. Even if no nonhuman sequences are aligned to human Alu bases, it remains unknown whether the sites that are aligned are aligned correctly.
Second, Alu exclusion is calculated with reference to the total number of human Alu residues, rather than the number of human residues aligned to a given species. As such, two species that have the same Alu exclusion may achieve very different alignment specificity. For example, in the TBA alignments, coverage of human intergenic/intronic residues is 2 Mbp for armadillo and 6 Mbp for monodelphis. These two species have a very similar level of Alu exclusion (96%) for TBA. These facts together suggest that the percentage of misaligned residues is twice as high in monodelphis as in armadillo.

S2 Correlation of Discordance and Alignment Agreement

This section describes a clear correlation between level of alignment agreement and suspicious regions. Namely, when compared to low discordance regions, suspicious regions are highly depleted in alignment-agreeing coordinates and highly enriched in alignment-unique coordinates. This correlation gives additional supporting evidence from the other alignments

that suspicious regions may be misaligned. If one alignment A (for example, TBA) misaligns a human coordinate h to a coordinate s from a given species S (for example, mouse), the other alignments are unlikely to align h to the same coordinate s, particularly if s is part of a longer region of S that is misaligned in A. Recall from the section on level of alignment agreement that this human coordinate h is said to agree (for target alignment A and target species S) if and only if there is at least one other alignment that aligns h to the same coordinate s of S. Therefore, given suspicious regions of the alignment A where species S is the worst aligned species, we expect the comparison percentage agree% for A and S to be very low in these regions.

Conversely, suppose we consider a region of alignment A that contains human and species S, and for which StatSigMA-w's reported discordance score is less than at all sites in the region. We call such regions low discordance for alignment A and species S. In such a low discordance region we expect the comparison percentage agree% for A and S to be high. Figure S1 plots the comparison percentage agree% in both suspicious regions and low discordance regions for all 22 nonhuman species and the three alignments for which we have computed comparison percentages.

[Figure S1 plot not reproduced. y-axis: agree%. x-axis: species (chimp, baboon, macaque, marmoset, galago, bat, armadillo, dog, elephant, cow, rabbit, mouse, rat, shrew, tenrec, monodelphis, platypus, chicken, xenopus, tetraodon, fugu, zebrafish). Series: susp_tba, susp_mavid, susp_mlagan, good_tba, good_mavid, good_mlagan.]

Figure S1: Agreement percentage in suspicious and low discordance regions for each species and three alignments. In the legend, susp denotes suspicious regions and good denotes low discordance regions. Note how much greater agree% is in low discordance regions than in suspicious regions, for each alignment and each species.
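The definition of an agreeing coordinate given above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: each alignment is modelled as a hypothetical dictionary from human coordinates to coordinates of species S, and agree% is assumed to be taken over the human coordinates that the target alignment aligns to S within the region. The names `agrees` and `agree_percent` are invented for this sketch.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical representation: human coordinate -> coordinate in species S.
# A missing key means the alignment does not align that human coordinate to S.
Mapping = Dict[int, int]


def agrees(h: int, target: Mapping, others: Iterable[Mapping]) -> bool:
    """Return True if the target aligns h to some s and at least one
    other alignment aligns h to that same coordinate s."""
    s = target.get(h)
    if s is None:
        return False
    return any(other.get(h) == s for other in others)


def agree_percent(region: Iterable[int], target: Mapping,
                  others: List[Mapping]) -> Optional[float]:
    """Percentage of target-aligned human coordinates in the region that agree.

    Assumption: the denominator is the number of human coordinates in the
    region that the target alignment aligns to S. Returns None if there are none.
    """
    aligned = [h for h in region if h in target]
    if not aligned:
        return None
    n_agree = sum(agrees(h, target, others) for h in aligned)
    return 100.0 * n_agree / len(aligned)


if __name__ == "__main__":
    # Toy data: three of the four target-aligned coordinates are confirmed
    # by at least one of the other two alignments.
    target = {1: 101, 2: 102, 3: 103, 4: 204}
    other_a = {1: 101, 2: 102, 3: 300}
    other_b = {3: 103}
    print(agree_percent(range(1, 5), target, [other_a, other_b]))
```

Running the same function once over the coordinates of suspicious regions and once over those of low discordance regions gives the two values that are compared per species and per alignment.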
The correlation of agree% with region type is as expected, and demonstrates a significant difference between the two types of regions, for any given species: agree% varies between .5% and 3% for suspicious regions, whereas it varies between 56% and 99% for low discordance regions. To make this difference even clearer, the ratio of agree% in low discordance regions to agree% in suspicious regions exceeds 5.8, for each species and each alignment. That is, suspicious regions are highly depleted in

alignment-agreeing coordinates compared to low discordance regions.

In suspicious regions of alignment A we would expect unique% for A to be high, and this is nearly always the case for all alignments and all species. Figure S2 plots the comparison percentage unique% in both suspicious regions and low discordance regions for all 22 nonhuman species and the three alignments for which we have computed comparison percentages. The correlation of unique% with region type is again as expected, and demonstrates a significant difference between the two types of regions: unique% varies between 39% and 9% for suspicious regions, whereas it varies between .2% and 2% for low discordance regions. To make this difference even clearer, the ratio of unique% in suspicious regions to unique% in low discordance regions exceeds 3, for each species and each alignment. That is, suspicious regions are highly enriched in alignment-unique coordinates compared to low discordance regions.

[Figure S2 plot not reproduced. y-axis: unique%. x-axis: species (chimp, baboon, macaque, marmoset, galago, bat, armadillo, dog, elephant, cow, rabbit, mouse, rat, shrew, tenrec, monodelphis, platypus, chicken, xenopus, tetraodon, fugu, zebrafish). Series: susp_tba, susp_mavid, susp_mlagan, good_tba, good_mavid, good_mlagan.]

Figure S2: Unique percentage in suspicious and low discordance regions for each species and three alignments. In the legend, susp denotes suspicious regions and good denotes low discordance regions. Note how much greater unique% is in suspicious regions than in low discordance regions, for each alignment and each species.

The general trend in all the curves of Figures S1 and S2 is that agree% decreases and unique% increases as species distance to human increases. This trend is in agreement with the more general trend observed for all the noncoding location categories in Figure 2 of the main paper.
This is not coincidental: since agree% is so low and unique% so high in noncoding regions of the distant species, there will tend to be fewer agreeing coordinates and more unique coordinates in nearly any subset of coordinates. What this says, though, is that the correlations shown in Figures S1 and S2 for the suspicious regions of placental mammals are all the more striking. Whereas agree% > 43% for all alignments, all placental mammals, and all location categories generally, agree% < 3% in suspicious regions for all alignments and

all placental mammals. Conversely, whereas unique% < 25% for all alignments, all placental mammals, and all location categories generally, unique% > 38% in suspicious regions for all alignments and all placental mammals.

S3 Improving Suspicious Alignments

Figure S3 shows scatter plots for three representative species (baboon, mouse, and zebrafish) and each of the four alignments as target alignment. Notice that, for any given species, the distribution of points is very similar for all four alignments. This suggests that StatSigMA-w is not biased toward any particular alignment method. The plots labeled Baboon (+1/-1) show baboon alignments using the same pairwise alignment scoring function used for mouse and zebrafish. The plots labeled Baboon (+1/-2) show baboon alignments with mismatch score -2, which better reflects the smaller divergence between human and baboon. The change in scoring function does not have much effect on the shapes of the scatter plots.

S4 Length Distribution of Gaps

Figure S4 shows the length distribution of gaps for the four alignments in ENm3, a representative ENCODE region. Although the figure only shows the distributions for two selected species, the same trends hold for all species. Note the small fraction of gaps exceeding 5 bp in TBA compared to the other three alignments.

S5 Assessing Whole-Genome Multiple Sequence Alignments

In the future, we plan to apply our analyses to whole-genome multiple sequence alignments, particularly when comparable whole-genome alignments (that is, using the same species and assemblies) are available to assess and compare. These analyses will guide alignment users in their choice of alignment and will warn them about regions that may be misaligned.
The only difficulty we envision in extending our analyses to whole-genome alignments is the amount of computation involved: we estimate that performing these analyses on a whole-genome alignment such as the MULTIZ vertebrate alignment currently available from the UCSC Genome Browser would require a few weeks on a few hundred processors. We expect the coverage and accuracy results presented here for 1% of the whole-genome alignment to be representative of what we will see when applied to the whole genome. It is possible that accuracy will be slightly worse in whole-genome alignments, because the challenge of identifying orthologous regions to align to whole human chromosomes is so much greater than it was in the ENCODE pilot project, where the aligners were given as input the orthologous sequences for each individual human ENCODE region. This predicted decrease in accuracy

[Figure S3 scatter plots not reproduced. Rows: Baboon (+1/-1), Baboon (+1/-2), Mouse, Zebrafish. Columns (x-axis): MAVID alignment score, MLAGAN alignment score, TBA alignment score, PECAN alignment score. y-axis: alternative alignment score.]

Figure S3: Alignment scores of suspicious regions versus scores for alternative alignments of the same human region. For three representative species S (baboon, mouse, and zebrafish) and four representative target alignments, scatter plots show all points (x, y), where x is the pairwise human-S alignment score of the target alignment region that is suspicious for species S, and y is the pairwise human-S alignment score of one of the other three alignments for the same human region that is not suspicious for S. Alignment scores are normalized by alignment length. The dashed black diagonal line has equation y = x. The solid blue line has equation y - x = µ, where µ is the mean value of y - x for all points (x, y) in the plot. The dotted blue lines have equations y - x = µ ± σ, where σ is the standard deviation of y - x for all points (x, y) in the plot.

is consistent with what we have seen when comparing the TBA ENCODE alignment to the 17-vertebrate MULTIZ alignment of human chromosome 1 analyzed by Prakash and Tompa [3], where the suspicious% figures of the former are about .7 times those of the latter, averaged over the nonprimate species common to both alignments.

References

[1] C. Chen, A. J. Gentles, J. Jurka, and S. Karlin. Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22. Proceedings of the National Academy of Sciences USA, 99(5), Mar. 2002.

[2] E. H. Margulies, G. M. Cooper, G. Asimenos, D. J. Thomas, C. N. Dewey, A. Siepel, E. Birney, D. Keefe, A. S. Schwartz, M. Hou, J. Taylor, S. Nikolaev, J. I. Montoya-Burgos, A. Löytynoja, S. Whelan, F. Pardi, T. Massingham, J. B. Brown, P. Bickel, I. Holmes, J. C. Mullikin, A. Ureta-Vidal, B. Paten, E. A. Stone, K. R. Rosenbloom, W. J. Kent, G. G. Bouffard, X. Guan, N. F. Hansen, J. R. Idol, V. V. Maduro, B. Maskeri, J. C. McDowell, M. Park, P. J. Thomas, A. C. Young, R. W. Blakesley, D. M. Muzny, E. Sodergren, D. A. Wheeler, K. C. Worley, H. Jiang, G. M. Weinstock, R. A. Gibbs, T. Graves, R. Fulton, E. R. Mardis, R. K. Wilson, M. Clamp, J. Cuff, S. Gnerre, D. B. Jaffe, J. L. Chang, K. Lindblad-Toh, E. S. Lander, A. Hinrichs, H. Trumbower, H. Clawson, A. Zweig, R. M. Kuhn, G. Barber, R. Harte, D. Karolchik, M. A. Field, R. A. Moore, C. A. Matthewson, J. E. Schein, M. A. Marra, S. E. Antonarakis, S. Batzoglou, N. Goldman, R. Hardison, D. Haussler, W. Miller, L. Pachter, E. D. Green, and A. Sidow. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Research, 17(6):760-774, June 2007.

[3] A. Prakash and M. Tompa. Measuring the accuracy of genome-size multiple alignments. Genome Biology, 8(6):R124, 2007.

[Figure S4 plots not reproduced. Panels: Mouse: short gaps (up to 5 bp); Mouse: long gaps (> 5 bp); Dog: short gaps (up to 5 bp); Dog: long gaps (> 5 bp). x-axis: gap size. Series: TBA, MAVID, MLAGAN, PECAN.]

Figure S4: Length distribution of gaps in the alignments of ENCODE region ENm3. The distributions are shown for four alignments and two representative species, mouse and dog. The left panel shows the distributions for gaps of length up to 5 bp, and the right panel shows the distributions for gaps of length exceeding 5 bp. Note the small fraction of gaps exceeding 5 bp in TBA compared to the other three alignments.
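A gap length distribution of the kind summarized in Figure S4 can be tabulated from the aligned rows directly. The Python sketch below is illustrative only: it assumes a gap is a maximal run of '-' characters in one species' row of the alignment, which may differ from how gaps were counted for the figure, and both the function names and the threshold value in the example are hypothetical.

```python
import re
from collections import Counter
from typing import List


def gap_lengths(aligned_row: str) -> List[int]:
    """Lengths of all maximal runs of '-' in one aligned row, left to right."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_row)]


def long_gap_fraction(aligned_row: str, threshold: int) -> float:
    """Fraction of gaps in the row whose length exceeds the threshold (in bp).

    Returns 0.0 for a row without gaps.
    """
    lengths = gap_lengths(aligned_row)
    if not lengths:
        return 0.0
    return sum(length > threshold for length in lengths) / len(lengths)


if __name__ == "__main__":
    row = "AC--GT-----A-C"            # toy aligned row with gaps of 2, 5 and 1 bp
    print(Counter(gap_lengths(row)))  # histogram: gap length -> count
    print(long_gap_fraction(row, threshold=3))
```

Computing `long_gap_fraction` per aligner for the same species and region gives a single number for the short-versus-long contrast that the two panels show.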

Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome

Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome Article Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome Elliott H. Margulies, 2,7,8,21 Gregory M. Cooper, 2,3,9 George Asimenos, 2,10 Daryl J. Thomas,

More information

Multiple Alignment of Genomic Sequences

Multiple Alignment of Genomic Sequences Ross Metzger June 4, 2004 Biochemistry 218 Multiple Alignment of Genomic Sequences Genomic sequence is currently available from ENTREZ for more than 40 eukaryotic and 157 prokaryotic organisms. As part

More information

Conservation of Human Microsatellites across 450 Million Years of Evolution

Conservation of Human Microsatellites across 450 Million Years of Evolution Conservation of Human Microsatellites across 450 Million Years of Evolution Emmanuel Buschiazzo*,1,2 and Neil J. Gemmell 1,3 1 School of Biological Sciences, University of Canterbury, Christchurch, New

More information

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Comparative genomics and proteomics Species available Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Vertebrates: human, chimpanzee, mouse, rat,

More information

Opinion Multi-species sequence comparison: the next frontier in genome annotation Inna Dubchak* and Kelly Frazer

Opinion Multi-species sequence comparison: the next frontier in genome annotation Inna Dubchak* and Kelly Frazer Opinion Multi-species sequence comparison: the next frontier in genome annotation Inna Dubchak* and Kelly Frazer Addresses: *Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720,

More information

Orthologous loci for phylogenomics from raw NGS data

Orthologous loci for phylogenomics from raw NGS data Orthologous loci for phylogenomics from raw NS data Rachel Schwartz The Biodesign Institute Arizona State University Rachel.Schwartz@asu.edu May 2, 205 Big data for phylogenetics Phylogenomics requires

More information

Phylogenomic Resources at the UCSC Genome Browser

Phylogenomic Resources at the UCSC Genome Browser 9 Phylogenomic Resources at the UCSC Genome Browser Kate Rosenbloom, James Taylor, Stephen Schaeffer, Jim Kent, David Haussler, and Webb Miller Summary The UC Santa Cruz Genome Browser provides a number

More information

Distribution and intensity of constraint in mammalian genomic sequence

Distribution and intensity of constraint in mammalian genomic sequence Article Distribution and intensity of constraint in mammalian genomic sequence Gregory M. Cooper, 1 Eric A. Stone, 2,3 George Asimenos, 4 NISC Comparative Sequencing Program, 5 Eric D. Green, 5 Serafim

More information

One of most striking discoveries to arise from comparative

One of most striking discoveries to arise from comparative A large family of ancient repeat elements in the human genome is under strong selection Michael Kamal*, Xiaohui Xie*, and Eric S. Lander* *Broad Institute of Massachusetts Institute of Technology and Harvard

More information

Evolution at the nucleotide level: the problem of multiple whole-genome alignment

Evolution at the nucleotide level: the problem of multiple whole-genome alignment Human Molecular Genetics, 2006, Vol. 15, Review Issue 1 doi:10.1093/hmg/ddl056 R51 R56 Evolution at the nucleotide level: the problem of multiple whole-genome alignment Colin N. Dewey 1, * and Lior Pachter

More information

Reconstruction of Human Genome Evolution Predicts an Extensively Changed Neurodevelopmental Gene

Reconstruction of Human Genome Evolution Predicts an Extensively Changed Neurodevelopmental Gene Reconstruction of Human Genome Evolution Predicts an Extensively Changed Neurodevelopmental Gene David Haussler Howard Hughes Medical Institute Center for Biomolecular Science and Engineering University

More information

28-Way vertebrate alignment and conservation track in the UCSC Genome Browser

28-Way vertebrate alignment and conservation track in the UCSC Genome Browser Resource 28-Way vertebrate alignment and conservation track in the UCSC Genome Browser Webb Miller, 1,11 Kate Rosenbloom, 2 Ross C. Hardison, 1 Minmei Hou, 1 James Taylor, 3 Brian Raney, 2 Richard Burhans,

More information

Comparative Genomics. Chapter for Human Genetics - Principles and Approaches - 4 th Edition

Comparative Genomics. Chapter for Human Genetics - Principles and Approaches - 4 th Edition Chapter for Human Genetics - Principles and Approaches - 4 th Edition Editors: Friedrich Vogel, Arno Motulsky, Stylianos Antonarakis, and Michael Speicher Comparative Genomics Ross C. Hardison Affiliations:

More information

Contact 1 University of California, Davis, 2 Lawrence Berkeley National Laboratory, 3 Stanford University * Corresponding authors

Contact 1 University of California, Davis, 2 Lawrence Berkeley National Laboratory, 3 Stanford University * Corresponding authors Phylo-VISTA: Interactive Visualization of Multiple DNA Sequence Alignments Nameeta Shah 1,*, Olivier Couronne 2,*, Len A. Pennacchio 2, Michael Brudno 3, Serafim Batzoglou 3, E. Wes Bethel 2, Edward M.

More information

Conserved noncoding elements (CNEs) represent 3.5% of

Conserved noncoding elements (CNEs) represent 3.5% of A family of conserved noncoding elements derived from an ancient transposable element Xiaohui Xie*, Michael Kamal*, and Eric S. Lander* *Broad Institute of Massachusetts Institute of Technology and Harvard

More information

Reconstructing the History of Large-scale Genomic Changes. Jian Ma

Reconstructing the History of Large-scale Genomic Changes. Jian Ma Reconstructing the History of Large-scale Genomic Changes Jian Ma The Human Genome: the blueprint of our body Initial sequencing and analysis of the human genome International Human Genome Sequencing Consortium*

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven) BMI/CS 776 Lecture #20 Alignment of whole genomes Colin Dewey (with slides adapted from those by Mark Craven) 2007.03.29 1 Multiple whole genome alignment Input set of whole genome sequences genomes diverged

More information

Supplementary Material

Supplementary Material Supplementary Material 1 Sequence Data and Multiple Alignments The five vertebrate, four insect, two worm, and seven yeast genomes used in the analysis are summarized in Table S1, and the four genome-wide

More information

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Title Automated whole-genome multiple alignment of rat, mouse, and human Permalink https://escholarship.org/uc/item/1z58c37n

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Reconstructing contiguous regions of an ancestral genome

Reconstructing contiguous regions of an ancestral genome Reconstructing contiguous regions of an ancestral genome Jian Ma, Louxin Zhang, Bernard B. Suh, Brian J. Raney, Richard C. Burhans, W. James Kent, Mathieu Blanchette, David Haussler and Webb Miller Genome

More information

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre PhD defense Chromosomal rearrangements in mammalian genomes : characterising the breakpoints Claire Lemaitre Laboratoire de Biométrie et Biologie Évolutive Université Claude Bernard Lyon 1 6 novembre 2008

More information

Biased amino acid composition in warm-blooded animals

Biased amino acid composition in warm-blooded animals Biased amino acid composition in warm-blooded animals Guang-Zhong Wang and Martin J. Lercher Bioinformatics group, Heinrich-Heine-University, Düsseldorf, Germany Among eubacteria and archeabacteria, amino

More information

NIH Public Access Author Manuscript Pac Symp Biocomput. Author manuscript; available in PMC 2009 October 6.

NIH Public Access Author Manuscript Pac Symp Biocomput. Author manuscript; available in PMC 2009 October 6. NIH Public Access Author Manuscript Published in final edited form as: Pac Symp Biocomput. 2009 ; : 162 173. SIMULTANEOUS HISTORY RECONSTRUCTION FOR COMPLEX GENE CLUSTERS IN MULTIPLE SPECIES * Yu Zhang,

More information

Adaptive Evolution of Conserved Noncoding Elements in Mammals

Adaptive Evolution of Conserved Noncoding Elements in Mammals Adaptive Evolution of Conserved Noncoding Elements in Mammals Su Yeon Kim 1*, Jonathan K. Pritchard 2* 1 Department of Statistics, The University of Chicago, Chicago, Illinois, United States of America,

More information

Computational Identification of Evolutionarily Conserved Exons

Computational Identification of Evolutionarily Conserved Exons Computational Identification of Evolutionarily Conserved Exons Adam Siepel Center for Biomolecular Science and Engr. University of California Santa Cruz, CA 95064, USA acs@soe.ucsc.edu David Haussler Howard

More information

Handling Rearrangements in DNA Sequence Alignment

Handling Rearrangements in DNA Sequence Alignment Handling Rearrangements in DNA Sequence Alignment Maneesh Bhand 12/5/10 1 Introduction Sequence alignment is one of the core problems of bioinformatics, with a broad range of applications such as genome

More information

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington

More information

Multiple Whole Genome Alignment

Multiple Whole Genome Alignment Multiple Whole Genome Alignment BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 206 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by

More information

Alignment Strategies for Large Scale Genome Alignments

Alignment Strategies for Large Scale Genome Alignments Alignment Strategies for Large Scale Genome Alignments CSHL Computational Genomics 9 November 2003 Algorithms for Biological Sequence Comparison algorithm value scoring gap time calculated matrix penalty

More information

Evolution and functional classification of vertebrate gene deserts

Evolution and functional classification of vertebrate gene deserts Chicken Special/Letter Evolution and functional classification of vertebrate gene deserts Ivan Ovcharenko, 1,7 Gabriela G. Loots, 2 Marcelo A. Nobrega, 3 Ross C. Hardison, 4 Webb Miller, 5,6 and Lisa Stubbs

More information

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) Content HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University,

More information

Reconstructing large regions of an ancestral mammalian genome in silico

Reconstructing large regions of an ancestral mammalian genome in silico Letter Reconstructing large regions of an ancestral mammalian genome in silico Mathieu Blanchette, 1,4,5 Eric D. Green, 2 Webb Miller, 3 and David Haussler 1,5 1 Howard Hughes Medical Institute, University

More information

Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON M5R 3G4 Canada

Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON M5R 3G4 Canada Multiple Whole Genome Alignments Without a Reference Organism Inna Dubchak 1,2, Alexander Poliakov 1, Andrey Kislyuk 3, Michael Brudno 4* 1 Genome Sciences Division, Lawrence Berkeley National Laboratory,

More information

Understanding relationship between homologous sequences

Understanding relationship between homologous sequences Molecular Evolution Molecular Evolution How and when were genes and proteins created? How old is a gene? How can we calculate the age of a gene? How did the gene evolve to the present form? What selective

More information

Computation and Analysis of Genomic Multi-Sequence Alignments

Computation and Analysis of Genomic Multi-Sequence Alignments Annu. Rev. Genomics Hum. Genet. 2007. 8:193 213 First published online as a Review in Advance on May 9, 2007. The Annual Review of Genomics and Human Genetics is online at genom.annualreviews.org This

More information

Supporting Text 1. Comparison of GRoSS sequence alignment to HMM-HMM and GPCRDB

Supporting Text 1. Comparison of GRoSS sequence alignment to HMM-HMM and GPCRDB Structure-Based Sequence Alignment of the Transmembrane Domains of All Human GPCRs: Phylogenetic, Structural and Functional Implications, Cvicek et al. Supporting Text 1 Here we compare the GRoSS alignment

More information

Supplemental Figure 1.

Supplemental Figure 1. Supplemental Material: Annu. Rev. Genet. 2015. 49:213 42 doi: 10.1146/annurev-genet-120213-092023 A Uniform System for the Annotation of Vertebrate microrna Genes and the Evolution of the Human micrornaome

More information

Multiple Genome Alignment by Clustering Pairwise Matches

Multiple Genome Alignment by Clustering Pairwise Matches Multiple Genome Alignment by Clustering Pairwise Matches Jeong-Hyeon Choi 1,3, Kwangmin Choi 1, Hwan-Gue Cho 3, and Sun Kim 1,2 1 School of Informatics, Indiana University, IN 47408, USA, {jeochoi,kwchoi,sunkim}@bio.informatics.indiana.edu

More information

Synteny Portal Documentation

Synteny Portal Documentation Synteny Portal Documentation Synteny Portal is a web application portal for visualizing, browsing, searching and building synteny blocks. Synteny Portal provides four main web applications: SynCircos,

More information

Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons

Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons Leming Zhou and Liliana Florea 1 Methods Supplementary Materials 1.1 Cluster-based seed design 1. Determine Homologous Genes.

More information
