Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons

Leming Zhou and Liliana Florea

Supplementary Materials

1 Methods

1.1 Cluster-based seed design

1. Determine Homologous Genes. We perform all-against-all two-way BLAST [1] comparisons between the full-length mRNA sequence sets of these species [2] and determine homologous gene pairs based on the alignment coverage ($C > 0.6$) and E-value ($E < 10^{-6}$) of the comparison.

2. Generate Pairwise Sequence Alignments. For each comparison, we randomly select 500 homologous pairs and align them with the program BLASTZ [3]. The resulting pairwise alignments are then used to train Markov models of comparisons [4].

3. Calculate KLD Distances and Cluster the Comparisons. We calculate KLD distances between Markov models and generate a distance profile for each comparison. (A profile is the vector of KLD distances of one comparison with all comparisons.) Profiles are then clustered using hierarchical clustering with the Pearson correlation coefficient, implemented in the R package, to create groups of similar comparisons.

4. Determine and Evaluate Optimal Seeds. We optimize seeds for each cluster and for each comparison in the clusters, and validate them by comparing their performance within their group and in the other groups. We use the methods in [4] to calculate the sensitivity of a seed, combined with a hill-climbing procedure to estimate optimal seeds [5].
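
Step 3 can be illustrated with a minimal Python sketch. It is not the authors' R code: the comparison labels `A` to `D`, the KLD values, the function names, and the choice of average linkage are all assumptions made for illustration (the text does not state which linkage was used). Each profile is one row of a made-up KLD distance matrix, and the dissimilarity between two profiles is taken as one minus their Pearson correlation.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length profiles (assumes non-constant input)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

def cluster_profiles(profiles, n_clusters):
    """Agglomerative clustering with dissimilarity 1 - r.

    Average linkage is an assumption of this sketch, not taken from the text.
    """
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Mean pairwise dissimilarity between the members of two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Hypothetical KLD distance matrix: row X is the profile of comparison X.
profiles = {
    "A": [0.0, 0.1, 0.9, 1.0],
    "B": [0.1, 0.0, 0.8, 0.9],
    "C": [0.9, 0.8, 0.0, 0.2],
    "D": [1.0, 0.9, 0.2, 0.0],
}
print(cluster_profiles(profiles, 2))
```

With these toy values, the profiles of A and B are strongly correlated, as are those of C and D, so asking for two clusters groups A with B and C with D.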

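1.2 Efficient calculation of the KLD distance for two Markov models

Let $M_1$ and $M_2$ be two Markov models of order $k = 3$, characterized by the probability distributions $P$ and $Q$ on the space of alignment words $X = \{0, 1, x\}^L$. To compute $\mathrm{KLD}(P, Q)$, we group the contributions of the words $w$ by the last three symbols of the word, and calculate the values recursively for increasing word lengths $|w| = m \le L$:

\begin{align*}
\mathrm{KLD}(P, Q) = \sum_{a_1 a_2 a_3 \in \{0,1,x\}^3} \mathrm{KLD}^{(L)}(P, Q; a_1 a_2 a_3) \tag{1}
\end{align*}

Let $\lambda \in \{0, 1, x\}^{m-3}$ and $b \in \{0, 1, x\}$, for $m \ge 3$. Writing $p$ and $q$ for the word probabilities under $P$ and $Q$, and $\mu_P$, $\mu_Q$ for the corresponding transition probabilities, we have

\begin{align*}
\mathrm{KLD}^{(m+1)}(P, Q; a_1 a_2 a_3)
&= \sum_{\lambda b} p(\lambda b a_1 a_2 a_3) \log \frac{p(\lambda b a_1 a_2 a_3)}{q(\lambda b a_1 a_2 a_3)} \tag{2} \\
&= \sum_{\lambda b} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \left( \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)} + \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \right) \\
&= \sum_{\lambda b} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)}
 + \sum_{\lambda b} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \\
&= \sum_{b} \left[ \mu_P(a_3 \mid b a_1 a_2) \sum_{\lambda} p(\lambda b a_1 a_2) \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)} \right]
 + \sum_{b} K(a_3; b a_1 a_2) \sum_{\lambda} p(\lambda b a_1 a_2) \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2)\, \mathrm{KLD}^{(m)}(P, Q; b a_1 a_2)
 + \sum_{b} K(a_3; b a_1 a_2)\, P^{(m)}(b a_1 a_2) \tag{3}
\end{align*}

where

\begin{align*}
K(a_3; b a_1 a_2) = \mu_P(a_3 \mid b a_1 a_2) \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)}
\end{align*}

is a constant, and

\begin{align*}
P^{(m)}(b a_1 a_2) = \sum_{\lambda} p(\lambda b a_1 a_2) \tag{4}
\end{align*}

Note that $P^{(m)}(b a_1 a_2)$ can be calculated a priori with the recurrences:

\begin{align*}
P^{(m+1)}(a_1 a_2 a_3)
&= \sum_{\lambda b} p(\lambda b a_1 a_2 a_3) \tag{5} \\
&= \sum_{b} \sum_{\lambda} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2) \left( \sum_{\lambda} p(\lambda b a_1 a_2) \right) \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2)\, P^{(m)}(b a_1 a_2)
\end{align*}

For $L = 64$, these recurrences will generate intermediate values, which are later used in the calculation of $\mathrm{KLD}^{(m)}$. Hence, the KLD distance can be computed efficiently in $O(L)$ time.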
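
The recurrence for $\mathrm{KLD}^{(m)}$ and $P^{(m)}$ can be checked numerically with a short Python sketch. The model representation (an initial distribution over 3-symbol words plus a transition table), the randomly generated parameters, and all function names are assumptions made for illustration; they are not taken from the authors' implementation. The sketch runs the recurrence in time linear in $L$ and compares it against direct enumeration of all words for a small $L$.

```python
import itertools
import math
import random

ALPHABET = "01x"
CONTEXTS = ["".join(t) for t in itertools.product(ALPHABET, repeat=3)]

def random_model(rng):
    """Hypothetical order-3 model: (initial 3-mer distribution, transition table)."""
    raw = [rng.random() + 0.1 for _ in CONTEXTS]
    init = {c: r / sum(raw) for c, r in zip(CONTEXTS, raw)}
    trans = {}
    for ctx in CONTEXTS:
        raw = [rng.random() + 0.1 for _ in ALPHABET]
        trans[ctx] = {a: r / sum(raw) for a, r in zip(ALPHABET, raw)}
    return init, trans

def kld_recursive(model_p, model_q, length):
    """KLD between word distributions of the given length, via the recurrence."""
    init_p, mu_p = model_p
    init_q, mu_q = model_q
    # Base case m = 3: each group holds exactly one word.
    kld = {c: init_p[c] * math.log(init_p[c] / init_q[c]) for c in CONTEXTS}
    mass = dict(init_p)  # P^(m): total probability of words ending in each 3-mer
    for _ in range(length - 3):
        new_kld = dict.fromkeys(CONTEXTS, 0.0)
        new_mass = dict.fromkeys(CONTEXTS, 0.0)
        for ctx in CONTEXTS:            # ctx = b a1 a2
            for a3 in ALPHABET:
                nxt = ctx[1:] + a3      # a1 a2 a3
                step = mu_p[ctx][a3]
                k_const = step * math.log(step / mu_q[ctx][a3])
                new_kld[nxt] += step * kld[ctx] + k_const * mass[ctx]
                new_mass[nxt] += step * mass[ctx]
        kld, mass = new_kld, new_mass
    return sum(kld.values())

def word_prob(model, word):
    """Probability of one word under an order-3 model."""
    init, mu = model
    prob = init[word[:3]]
    for i in range(3, len(word)):
        prob *= mu[word[i - 3:i]][word[i]]
    return prob

def kld_brute_force(model_p, model_q, length):
    """Direct sum over all 3**length words; only feasible for small lengths."""
    total = 0.0
    for symbols in itertools.product(ALPHABET, repeat=length):
        word = "".join(symbols)
        p = word_prob(model_p, word)
        total += p * math.log(p / word_prob(model_q, word))
    return total

rng = random.Random(0)
model_p, model_q = random_model(rng), random_model(rng)
print(abs(kld_recursive(model_p, model_q, 6) - kld_brute_force(model_p, model_q, 6)))
print(kld_recursive(model_p, model_q, 64))
```

Each step of the loop touches 27 contexts times 3 symbols, so the work per word length is constant and the total cost grows linearly with $L$, whereas the brute-force sum grows as $3^L$.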

2 Supplementary Tables

Table S1: Seeds optimized with the hill-climbing algorithm for four clusters.

Cluster   n_1   n_0   n_x   W   S_n   Optimal Seeds
L x111110x L x x L x110x1011x11x L x1100x011011x11x1100 L x x11x11x1100 L x110x1011x11111x1100 L x1100x0x1011x L x101100x0x1011x1100 L x1100x0x1011x L x11011x1100x011x1100 L x1100x011x11011x11 L x11011x110xx011x11x11 L x x L xx00x0x1011xx1011 L L x011011x11 L x x11 L x11011xx x11 L x x L x100x0x10110x1011 L L L x x11 L x x11

Table S2: Average of seed sensitivities when applying seeds optimized for each of the four clusters $L_1$, $L_2$, $L_3$, $L_4$ to the four clusters $L_1$, $L_2$, $L_3$, and $L_4$. Only results for seeds with weight 11 and 12 are shown. $L_{1,O}$, $L_{2,O}$, $L_{3,O}$, $L_{4,O}$ are the seeds optimized with hill-climbing for clusters $L_1$, $L_2$, $L_3$, $L_4$, respectively.

Cluster   W   L_{1,O}   L_{2,O}   L_{3,O}   L_{4,O}
L L L L L L L L

Table S3: Average of seed sensitivities when applying optimal seeds obtained from clusters $L_1$, $L_2$, $L_3$, $L_4$ on the comparisons in cluster $L_2$.

Comparisons   W   L_1   L_2   L_3   L_4
human.mouse human.mouse human.rat human.rat human.cow human.cow human.dog human.dog chimp.mouse chimp.mouse chimp.rat chimp.rat chimp.cow chimp.cow chimp.dog chimp.dog macaque.mouse macaque.mouse macaque.rat macaque.rat macaque.cow macaque.cow macaque.dog macaque.dog mouse.cow mouse.cow mouse.dog mouse.dog rat.cow rat.cow rat.dog rat.dog cow.dog cow.dog

Table S4: Average of seed sensitivities when applying optimal seeds obtained from individual comparisons in clusters $L_1$, $L_2$, $L_3$, $L_4$ to clusters $L_1$, $L_2$, $L_3$, $L_4$.

W   mouse.rat   cow.dog   chimp.chicken   frog.fugu
Apply seeds optimized with hill-climbing on cluster L
Apply seeds optimized with hill-climbing on cluster L
Apply seeds optimized with hill-climbing on cluster L
Apply seeds optimized with hill-climbing on cluster L

Table S5: Performance of seeds optimized for the four clusters when incorporated into sim4cc (Zhou and Florea, in prep.). $S_n$ and $S_p$ are the sensitivity and specificity at the nucleotide level. The Intron column shows the percentage of accurately detected introns, as a measure of the splice junction detection accuracy.

Cluster (W = 12)   Human-Mouse: S_n, S_p, Intron   Human-Zebrafish: S_n, S_p, Intron
L L L L

References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215(3).

[2] Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2007) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35 (Database issue), D

[3] Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D. and Miller, W. (2003) Human-mouse alignments with BLASTZ. Genome Res., 13(1).

[4] Zhou, L. and Florea, L. (2007) Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment. J. Comput. Biol., 14(2).

[5] Buhler, J., Keich, U. and Sun, Y. (2003) Designing seeds for similarity search in genomic DNA. In Proc. Seventh Annual Intl. Conference on Computational Molecular Biology (RECOMB 2003).


More information

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018 DATA ACQUISITION FROM BIO-DATABASES AND BLAST Natapol Pornputtapong 18 January 2018 DATABASE Collections of data To share multi-user interface To prevent data loss To make sure to get the right things

More information

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002 Cluster Analysis of Gene Expression Microarray Data BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002 1 Data representations Data are relative measurements log 2 ( red

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Computational methods for predicting protein-protein interactions

Computational methods for predicting protein-protein interactions Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational

More information

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting. Genome Annotation Bioinformatics and Computational Biology Genome Annotation Frank Oliver Glöckner 1 Genome Analysis Roadmap Genome sequencing Assembly Gene prediction Protein targeting trna prediction

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

A Practical Approach to Significance Assessment in Alignment with Gaps

A Practical Approach to Significance Assessment in Alignment with Gaps A Practical Approach to Significance Assessment in Alignment with Gaps Nicholas Chia and Ralf Bundschuh 1 Ohio State University, Columbus, OH 43210, USA Abstract. Current numerical methods for assessing

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

Truncated Profile Hidden Markov Models

Truncated Profile Hidden Markov Models Boise State University ScholarWorks Electrical and Computer Engineering Faculty Publications and Presentations Department of Electrical and Computer Engineering 11-1-2005 Truncated Profile Hidden Markov

More information

Example of Function Prediction

Example of Function Prediction Find similar genes Example of Function Prediction Suggesting functions of newly identified genes It was known that mutations of NF1 are associated with inherited disease neurofibromatosis 1; but little

More information

Jeremy Chang Identifying protein protein interactions with statistical coupling analysis

Jeremy Chang Identifying protein protein interactions with statistical coupling analysis Jeremy Chang Identifying protein protein interactions with statistical coupling analysis Abstract: We used an algorithm known as statistical coupling analysis (SCA) 1 to create a set of features for building

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

BLAST: Basic Local Alignment Search Tool

BLAST: Basic Local Alignment Search Tool .. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

Chapter 15 Active Reading Guide Regulation of Gene Expression

Chapter 15 Active Reading Guide Regulation of Gene Expression Name: AP Biology Mr. Croft Chapter 15 Active Reading Guide Regulation of Gene Expression The overview for Chapter 15 introduces the idea that while all cells of an organism have all genes in the genome,

More information

Comparative Bioinformatics Midterm II Fall 2004

Comparative Bioinformatics Midterm II Fall 2004 Comparative Bioinformatics Midterm II Fall 2004 Objective Answer, part I: For each of the following, select the single best answer or completion of the phrase. (3 points each) 1. Deinococcus radiodurans

More information