A metric approach for comparing DNA sequences


H. Mora-Mora
Department of Computer and Information Technology, University of Alicante, Alicante, Spain

M. Lloret-Climent
Department of Applied Mathematics, University of Alicante, Alicante, Spain

F. Vives-Macia
Department of Applied Mathematics, University of Alicante, Alicante, Spain

Abstract

Purpose: This article sets out to compare two biological sequences, and in general any two sequences, by analysing the differences between them.

Design/methodology/approach: We designed an algorithm based on a metric that enables us to calculate the distance between two sequences. Based on the distance obtained, we infer the type of mutations that have occurred.

Findings: The empirical analysis shows that the transformations introduced in the sequences were detected by the metric. We are aware that there are numerous, enormously powerful programmes able to trawl entire databases; our aim in this article was therefore not to compete with these, but to demonstrate a different way of comparing two given sequences.

Practical implications: Comparison of two DNA sequences, yielding a measure of the degree of similarity between them and of their possible mutations.

Originality/value: The metric presented is a generalisation of the Hamming distance to strings of different lengths. The software associated with the metric has enabled us to validate the results obtained.

Keywords: DNA sequences, algorithm, metric.

Paper type: Research paper

1. Introduction

Comparison of nucleotide (and amino acid) sequences is essential for research in molecular genetics, molecular biology and bioinformatics. Coding theory, programming and algorithm analysis face the same problem. The difficulty in comparing symbol sequences lies in introducing a true metric between two sequences. To be precise, a metric exists (the so-called Hamming distance), but it offers little insight into how two strings relate: it compares them only position by position, so each comparison yields one of two outcomes, coincidence or non-coincidence. For example, the Hamming distance between […] and […] is 2, and the Hamming distance between toned and roses is 3.

The most widespread methodology for establishing similarity in a family of symbol sequences (of differing origin) is the approach called alignment, or edit distance (Waterman, 1989; Alexandrov et al., 1990; Wootton and Federchen, 1996). The methodology is based on the idea of transforming one sequence into another using a fixed set of permitted operations (Sadovsky, 2003). An operation is understood to be the insertion, deletion or substitution of a character. For example, the distance between kitten and sitting is 3, because at least three basic editing operations are needed to change one into the other:

1. kitten → sitten (substitution of k by s)
2. sitten → sittin (substitution of e by i)
3. sittin → sitting (insertion of g at the end)

The edit distance is considered a generalisation of the Hamming distance, which applies only to strings of the same length and allows only the substitution operation. The metric presented here, like the edit distance, also applies to strings of different lengths; it may equally be applied to strings of the same length, in which case it coincides with the Hamming distance. The main difference is that it introduces the absolute value of the difference of the string lengths, and it achieves results similar to those obtained with other distances.

2. Set comparison model

Each gene g at an instant of time t that falls within the period of existence of a particular cell will be denoted g(t). Therefore $g(t) = (g_1, g_2, \ldots, g_n)$, where each $g_i$ is one of the four symbols (T, C, A, G). M represents the ordered set of bases associated with each gene; the order in which these bases appear is particularly relevant for our purposes. We define $d: (M \times T) \times (M \times T) \to \mathbb{R}^{+}$ such that, for $g(t_1) = (g_1, g_2, \ldots, g_n)$ and $h(t_2) = (h_1, h_2, \ldots, h_m)$,

$$
d(g(t_1), h(t_2)) =
\begin{cases}
\displaystyle\sum_{i=1}^{n} \bigl(1 - \delta_{g,h}(i)\bigr) & \text{if } \dim(g(t_1)) = \dim(h(t_2)) \\[1ex]
\displaystyle\sum_{i=1}^{p} \bigl(1 - \delta_{g,h}(i)\bigr) + \bigl|\dim(g(t_1)) - \dim(h(t_2))\bigr| & \text{if } \dim(g(t_1)) \neq \dim(h(t_2))
\end{cases}
$$

where

$$
\delta_{g,h}(i) =
\begin{cases}
1 & \text{if } g_i = h_i \\
0 & \text{otherwise}
\end{cases}
$$

and $p = \min(n, m)$. Then d is a metric (Lloret, 1999; Bonnet et al., 2004).

Note. In the specific case in which the genes have the same number of nucleotides (n) and the order in which the bases are situated matters, the above metric is expressed as follows: $d: (M \times T) \times (M \times T) \to \mathbb{R}^{+}$ such that, if $g(t_1) = (g_1, g_2, \ldots, g_n)$ and $h(t_2) = (h_1, h_2, \ldots, h_n)$, then

$$
d(g(t_1), h(t_2)) = \sum_{i=1}^{n} \bigl(1 - \delta_{g,h}(i)\bigr),
\qquad
\delta_{g,h}(i) =
\begin{cases}
1 & \text{if } g_i = h_i \\
0 & \text{if } g_i \neq h_i
\end{cases}
$$

(that is, $\delta_{g,h}(i)$ records whether or not both sequences have the same element in position i). In this case the metric coincides with the Hamming distance, so our metric generalises the Hamming distance to sequences of different lengths.

That is, d(g(t1), h(t2)) represents the number of elements of g that are not shared with h, taking into account the positions in which they appear. Clearly, the smaller the value of d, the greater the coincidence between the sequences g and h under consideration.

For example, consider the distance between g(t1) = (T, C, G, A, C, T, A, T, C, A) and h(t2) = (T, C, G, A, C, T, A, T, C, A, G). Here p, the smaller of the two string lengths, is 10, so the sum runs up to 10. Furthermore, the absolute value of the difference between the string lengths is |10 - 11| = 1. Therefore

$$
d(g(t_1), h(t_2)) = \sum_{i=1}^{10} \bigl(1 - \delta_{g,h}(i)\bigr) + \bigl|\dim(g(t_1)) - \dim(h(t_2))\bigr| = 0 + 1 = 1
$$

More complex and developed examples will be presented with the programme.

3. Software for the comparison of DNA sequences

This application is based on the metric above, and its aim is to analyse DNA strings by comparing their nucleotide sequences. The algorithm takes two DNA strings and produces a list of the differences found between the two sequences, an interpretation of these changes, and a numeric value measuring the alterations produced.
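The metric of Section 2 can be sketched in a few lines of Python. This is only an illustration of the definition, not the authors' software: the function name `metric_distance` is an assumption, and sequences are taken to be plain strings in a consistent letter case.

```python
def metric_distance(g: str, h: str) -> int:
    """Positional mismatches over the common prefix plus the length difference.

    Illustrative sketch of the metric d from Section 2; the name is hypothetical.
    """
    p = min(len(g), len(h))  # p = min(n, m)
    # Sum of (1 - delta(i)) for i = 1..p: count positions where the symbols differ.
    mismatches = sum(1 for i in range(p) if g[i] != h[i])
    # Add |dim(g) - dim(h)|, which is zero for strings of equal length.
    return mismatches + abs(len(g) - len(h))


# Worked example from Section 2: identical up to position 10, lengths 10 and 11.
print(metric_distance("TCGACTATCA", "TCGACTATCAG"))
# Equal lengths: the value coincides with the Hamming distance.
print(metric_distance("toned", "roses"))
```

For strings of equal length the second term vanishes, so the function reduces to a plain Hamming count, as the Note in Section 2 states.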

Interface

A detailed description of the software's inputs and outputs follows.

Inputs: The algorithm receives the two DNA sequences to be analysed. Each sequence is a string, of arbitrary length, of its basic components represented by the ASCII characters (a, t, g, c).

Outputs: The algorithm produces a report showing the results of the analysis carried out. The information it contains covers both quantitative and qualitative aspects. The contents of this report are as follows:

- Position, length and composition of each change found between the two sequences.
- Semantic determination of each change. The possible changes include substitution, insertion, deletion, duplication, inversion and transposition.
- Measurement of the distance between the two sequences.

The output is provided as a text file.

Function

The software is organised in a modular way and programmed in a structured style, which makes it flexible and adaptable to the conditions required by the analysis. The comparison criteria can be adjusted to the characteristics of the strings, so that the analysis can focus on different transformation operations in the sequences being studied.

The software works by applying comparison heuristics to strings of consecutive characters. It combines local comparison of substrings with an overall analysis of the complete string.
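As a rough illustration of the interface described above, the inputs and the report entries could be modelled as follows. The names `validate_sequence`, `Change` and `render` are hypothetical and do not come from the authors' software; the rendered line only imitates the report layout shown in the application example.

```python
from dataclasses import dataclass
from typing import Tuple

VALID_BASES = frozenset("ATGC")


def validate_sequence(seq: str) -> str:
    """Normalise an input sequence to upper case and reject unexpected symbols."""
    seq = seq.strip().upper()
    invalid = sorted(set(seq) - VALID_BASES)
    if invalid:
        raise ValueError(f"unexpected symbols in sequence: {invalid}")
    return seq


@dataclass(frozen=True)
class Change:
    """One report entry: type of change, positions affected, bases involved."""
    kind: str                    # e.g. "Insertion", "Deletion", "Duplication"
    positions: Tuple[int, ...]   # 1-based positions, as in the report
    bases: Tuple[str, ...]

    @property
    def length(self) -> int:
        return len(self.positions)

    def render(self) -> str:
        pos = ", ".join(str(p) for p in self.positions)
        bases = ", ".join(self.bases)
        return f"-{self.kind}: ({pos}) => ({bases})"


print(Change("Insertion", (7, 8), ("G", "T")).render())
```

Keeping position, length and composition in one record mirrors the three quantities the report lists for each change.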

In order to assess the degree of relationship between the sequences, three sequence proximity functions have been defined. These functions supply the information used to decide on the most probable transformation. The functions are as follows:

1. Measurement of the global correlation: counts the number of non-coincidences in a base-by-base comparison of the two strings.
2. Measurement of the local correlation: counts the number of non-coincidences in a base-by-base comparison in a neighbourhood around a specific position.
3. Measurement of consecutive positions: counts the number of consecutive coincident positions in a base-by-base comparison starting from a specific position.

The comparison procedure between the sequences combines the following heuristics:

1. Direct comparison of sequences: checks whether the bases situated in the same position of the two sequences coincide.
2. Best position of a subsequence: looks for the string of bases that obtains the most coincidences according to the local correlation and consecutive positions functions.
3. Correlation of least difference: looks for the relative position between the sequences that obtains the most coincidences according to the global correlation function. This heuristic generalises the previous one to the complete string.
4. Duplicity of subsequences: checks whether a subsequence of bases is equal to the subsequences adjacent to it.
5. Special cases: analyses transformation situations not detected by the previous heuristics.

The decision on which transformation has occurred is made by taking into account both the characteristics of the operation and its consequences for the complete string.

The order in which the heuristics are applied may alter the results of the analysis, since the same sequence can be reached via different operations. For this reason, combined strategies are established to determine the order. These strategies minimise the distance function, as they are based on information about the frequency and number of transformations. In any case, the order can be varied depending on the purposes of the study.

The following figure shows a flow diagram illustrating how the software works:

[Figure 1: Algorithm flowchart. Recoverable node labels: start; data acquisition; parameter adjustment; direct sequence comparison; amount of differences; transformation (yes/no); smaller difference correlation; better position of subsequence; particular case analysis; insertion (yes/no); subsequence duplicity; report generation; report; end.]

Results

The empirical analysis demonstrates that, in most of the cases analysed, the software detected the changes occurring between two sequences. However, a prior stage of training and calibrating the algorithm parameters to the nature of the analysed data was necessary.

4. Application Example

Our main problem is to compare the structure of a gene with the nucleotide changes formed in that gene at a specific time, using the gene database at the European Bioinformatics Institute (EBI_website, 2006). More specifically, we compare nucleotide changes of the following gene: Drosophila melanogaster, sequence length 4220 bp, Accession # AB095028, Entry name: EMBL: AB

In the first matrix, the algorithm displays the digits corresponding to each of the nucleotides of the gene; in the second matrix, the original gene whose nucleotide changes we want to analyse (it can be consulted in the database); and in the third matrix, the gene with the associated nucleotide changes. The software shows the nucleotide changes that have taken place, indicating the positions in which they occurred, together with a distance value indicative of the similarity between the two sequences.

For reasons of space, only a substring of the gene, its nucleotide changes and the result obtained are given in this paper. A more extensive account of the experiments carried out can be found on the authors' website.

->Processing: sub_mbl_ab txt

Original segment:
A T T T A T A C G A T C G G A A A C G G A A C G G A T T T G C T T G A G C C A G C A T C T G C A G C A T G T C C T G C A A C T G C A G C A G C A A C A C C A G C A G C A T G A A A C G T C G C C A A C A G T G T C G C A G C A G C A A C G T C A C A A T G C G A T G C A T C A T C T C C A T C A T C A G A C A G G T T A G T A T T A A G T G T G T A T T G T T T T T T A A G T T T C T G G C

Segment with nucleotide changes:
A T T T A T G T A C G A T C G G A A A C G G A A C G G A T T T G C T T G A G C C A G C A T C T G C A T G T C C T G C A A C T G C A G C A G C A A C A C C T T T A G C A T G A A A C G T C G C C A A C A G T G T C G C A G C A G C A A C G T C A C A A T G T A G C G C A T C A T C T C C A T C A T C A G A C A G G T T A G T A T T A A G T A G T G T G T A T T G T T T T T T A A G T T T C T G G C

Processing Report:
-Insertion: (7, 8) => (G, T)
-Deletion: (51, 52, 53) => (G, C, A)
-Duplication: (168, 169, 170) => (A, G, T)
-Inverted segment: (128, 129, 130, 131) => (T, A, G, C)
-Substituted segment: ({A, G, C} --> {T, T, T})

Distance: 15
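To make the first entry of the report above more concrete, the sketch below applies simplified versions of the three proximity functions from Section 3 to the opening bases of the two segments. It is an assumption-laden illustration rather than the authors' implementation: all function names, the window parameter `radius` and the helper `best_shift` (a crude stand-in for the "correlation of least difference" heuristic) are hypothetical.

```python
def global_mismatches(g: str, h: str) -> int:
    """Global correlation: non-coincidences over the common length of g and h."""
    return sum(a != b for a, b in zip(g, h))


def local_mismatches(g: str, h: str, pos: int, radius: int) -> int:
    """Local correlation: non-coincidences in a window around pos (0-based)."""
    p = min(len(g), len(h))
    lo, hi = max(0, pos - radius), min(p, pos + radius + 1)
    return sum(g[i] != h[i] for i in range(lo, hi))


def consecutive_matches(g: str, h: str, pos: int = 0) -> int:
    """Consecutive positions: length of the coincident run starting at pos."""
    run = 0
    for a, b in zip(g[pos:], h[pos:]):
        if a != b:
            break
        run += 1
    return run


def best_shift(g: str, h: str, start: int, max_shift: int) -> int:
    """Shift of h relative to g (from start onwards) with the fewest mismatches.

    Only the overlapping part is compared, so large shifts over short strings
    could be favoured; a fuller version would need to account for that.
    """
    return min(range(max_shift + 1),
               key=lambda k: global_mismatches(g[start:], h[start + k:]))


original = "ATTTATACGATCGG"    # first 14 bases of the original segment
changed = "ATTTATGTACGATCGG"   # first 16 bases of the segment with changes

first_diff = consecutive_matches(original, changed)  # 0-based index of first difference
print(first_diff, best_shift(original, changed, first_diff, 3))
```

On this excerpt the coincident run ends after six bases and a shift of two bases realigns the remainder without mismatches, which is consistent with the reported insertion of (G, T) at positions (7, 8).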

5. Conclusions

In this paper we have proposed a different means of comparing two sequences, both from a theoretical point of view, with the presentation of the metric, and from a practical perspective, with the introduction of the associated programme.

We have not attempted to assess our programme against the many programmes already in existence for the comparative analysis of DNA sequences, such as BLAST, FASTA and CLUSTALW. Many researchers already have these installed on their computers; they are able to trawl through entire databases (with millions of sequences), provide multiple sequence alignments and extensive reports of quantitative data, and are backed by enormous financial and human resources, so such a comparison would serve no purpose. Our metric represents another means of analysing the comparison between two sequences. The advantages gained with this software will need to be addressed in future studies, although at present we know that differences in the treatment of the sequences will appear, as our basic hypothesis is different.

References

Alexandrov, A. A., Alexandrov, V. V., Borodovsky, Yu. and Mironov, A. V. (1990), Computer Analysis of Genetic Texts, Moscow: Nauka.

Bonnet-Jerez, J. L. and Lloret-Climent, M. (2004), An approach to measuring change in muscular tissue contraction. Biosystems.

EBI_website (2006), European Bioinformatics Institute, URL:

Lloret-Climent, M. (1999), Measures of cellular change in systems theory. Kybernetes, 28, No 8/

Sadovsky, M. G. (2003), The method to compare nucleotide sequences based on the minimum entropy principle. Bulletin of Mathematical Biology.

Waterman, M. (ed.) (1989), Alignment of sequences, Boca Raton: CRC Press Inc.

Wootton, J. C. and Federchen, S. (1996), Alignment of sequences, in Methods of Enzymology, R. F. Doolittle (Ed.),

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance Jingbo Shang, Jian Peng, Jiawei Han University of Illinois, Urbana-Champaign May 6, 2016 Presented by Jingbo Shang 2 Outline

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

Lecture 5,6 Local sequence alignment

Lecture 5,6 Local sequence alignment Lecture 5,6 Local sequence alignment Chapter 6 in Jones and Pevzner Fall 2018 September 4,6, 2018 Evolution as a tool for biological insight Nothing in biology makes sense except in the light of evolution

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

More information

Sequence Comparison. mouse human

Sequence Comparison. mouse human Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES HOW CAN BIOINFORMATICS BE USED AS A TOOL TO DETERMINE EVOLUTIONARY RELATIONSHPS AND TO BETTER UNDERSTAND PROTEIN HERITAGE?

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Metric Learning. 16 th Feb 2017 Rahul Dey Anurag Chowdhury

Metric Learning. 16 th Feb 2017 Rahul Dey Anurag Chowdhury Metric Learning 16 th Feb 2017 Rahul Dey Anurag Chowdhury 1 Presentation based on Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data."

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming 20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

Phylogenetic analyses. Kirsi Kostamo

Phylogenetic analyses. Kirsi Kostamo Phylogenetic analyses Kirsi Kostamo The aim: To construct a visual representation (a tree) to describe the assumed evolution occurring between and among different groups (individuals, populations, species,

More information

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Hands-On Nine The PAX6 Gene and Protein

Hands-On Nine The PAX6 Gene and Protein Hands-On Nine The PAX6 Gene and Protein Main Purpose of Hands-On Activity: Using bioinformatics tools to examine the sequences, homology, and disease relevance of the Pax6: a master gene of eye formation.

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Introduction Bioinformatics is a powerful tool which can be used to determine evolutionary relationships and

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

More information

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE Manmeet Kaur 1, Navneet Kaur Bawa 2 1 M-tech research scholar (CSE Dept) ACET, Manawala,Asr 2 Associate Professor (CSE Dept) ACET, Manawala,Asr

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

Computational methods for predicting protein-protein interactions

Computational methods for predicting protein-protein interactions Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM. Liuling Gong, Nidhal Bouaynaya and Dan Schonfeld

INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM. Liuling Gong, Nidhal Bouaynaya and Dan Schonfeld INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM Liuling Gong, Nidhal Bouaynaya and Dan Schonfeld University of Illinois at Chicago, Dept. of Electrical

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

More information

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Using Bioinformatics to Study Evolutionary Relationships Instructions

Using Bioinformatics to Study Evolutionary Relationships Instructions 3 Using Bioinformatics to Study Evolutionary Relationships Instructions Student Researcher Background: Making and Using Multiple Sequence Alignments One of the primary tasks of genetic researchers is comparing

More information

Bioinformatics Exercises

Bioinformatics Exercises Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted

More information

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018 DATA ACQUISITION FROM BIO-DATABASES AND BLAST Natapol Pornputtapong 18 January 2018 DATABASE Collections of data To share multi-user interface To prevent data loss To make sure to get the right things

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

RGP finder: prediction of Genomic Islands

RGP finder: prediction of Genomic Islands Training courses on MicroScope platform RGP finder: prediction of Genomic Islands Dynamics of bacterial genomes Gene gain Horizontal gene transfer Gene loss Deletion of one or several genes Duplication

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

Collected Works of Charles Dickens

Collected Works of Charles Dickens Collected Works of Charles Dickens A Random Dickens Quote If there were no bad people, there would be no good lawyers. Original Sentence It was a dark and stormy night; the night was dark except at sunny

Conceptual Similarity: Why, Where, How

Conceptual Similarity: Why, Where, How Conceptual Similarity: Why, Where, How Michalis Sfakakis Laboratory on Digital Libraries & Electronic Publishing, Department of Archives and Library Sciences, Ionian University, Greece First Workshop on

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment. 2012 International Conference on Environment Science and Engineering, IPCBEE vol. 32 (2012), IACSIT Press, Singapore. Atoosa Ghahremani and Mahmood A.

Comparative Bioinformatics Midterm II Fall 2004

Comparative Bioinformatics Midterm II Fall 2004 Comparative Bioinformatics Midterm II Fall 2004 Objective Answer, part I: For each of the following, select the single best answer or completion of the phrase. (3 points each) 1. Deinococcus radiodurans

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

Application of Associative Matrices to Recognize DNA Sequences in Bioinformatics

Application of Associative Matrices to Recognize DNA Sequences in Bioinformatics Application of Associative Matrices to Recognize DNA Sequences in Bioinformatics 1. Introduction. Jorge L. Ortiz Department of Electrical and Computer Engineering College of Engineering University of Puerto

Genomes and Their Evolution

Genomes and Their Evolution Chapter 21 Genomes and Their Evolution PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

COMPARING DNA SEQUENCES TO UNDERSTAND EVOLUTIONARY RELATIONSHIPS WITH BLAST

COMPARING DNA SEQUENCES TO UNDERSTAND EVOLUTIONARY RELATIONSHIPS WITH BLAST Big Idea 1 Evolution INVESTIGATION 3 COMPARING DNA SEQUENCES TO UNDERSTAND EVOLUTIONARY RELATIONSHIPS WITH BLAST How can bioinformatics be used as a tool to determine evolutionary relationships and to

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

Sequence comparison by compression

Sequence comparison by compression Sequence comparison by compression Motivation similarity as a marker for homology. And homology is used to infer function. Sometimes, we are only interested in a numerical distance between two sequences.

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

On the Monotonicity of the String Correction Factor for Words with Mismatches

On the Monotonicity of the String Correction Factor for Words with Mismatches On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS 1 Prokaryotes and Eukaryotes 2 DNA and RNA 3 4 Double helix structure Codons Codons are triplets of bases from the RNA sequence. Each triplet defines an amino-acid.

Pattern Structures 1

Pattern Structures 1 Pattern Structures 1 Pattern Structures Models describe whole or a large part of the data Pattern characterizes some local aspect of the data Pattern is a predicate that returns true for those objects

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/14/07 CAP5510 1 CpG Islands Regions in DNA sequences with increased

Molecular evolution - Part 1. Pawan Dhar BII

Molecular evolution - Part 1. Pawan Dhar BII Molecular evolution - Part 1 Pawan Dhar BII Theodosius Dobzhansky Nothing in biology makes sense except in the light of evolution Age of life on earth: 3.85 billion years Formation of planet: 4.5 billion

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

BACHELOR OF GEOINFORMATION TECHNOLOGY (NQF Level 7) Programme Aims/Purpose:

BACHELOR OF GEOINFORMATION TECHNOLOGY (NQF Level 7) Programme Aims/Purpose: The Bachelor of Geoinformation Technology aims to provide a skilful and competent labour force for the growing Geographic Information Systems (GIS) industry

ARCGIS COURSE, BEGINNER LEVEL ONLINE TRAINING

ARCGIS COURSE, BEGINNER LEVEL ONLINE TRAINING Course.com TYC TRAINING OVERVIEW This course will qualify students to use ArcGIS Desktop 10 and in particular, ArcMap, ArcCatalog and ArcToolbox, focusing on the

Mutual information content of homologous DNA sequences

Mutual information content of homologous DNA sequences Mutual information content of homologous DNA sequences 55 Mutual information content of homologous DNA sequences Helena Cristina G. Leitão, Luciana S. Pessôa and Jorge Stolfi Instituto de Computação, Universidade

Copyright 2000 N. AYDIN. All rights reserved. 1

Copyright 2000 N. AYDIN. All rights reserved. 1 Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004 CSE 397-497: Computational Issues in Molecular Biology Lecture 6 Spring 2004-1 - Topics for today Based on premise that algorithms we've studied are too slow: Faster method for global comparison when sequences

1 Introduction. Abstract

1 Introduction. Abstract CBS 530 Assignment No 2 SHUBHRA GUPTA shubhg@asu.edu 993755974 Review of the papers: Construction and Analysis of a Human-Chimpanzee Comparative Clone Map and Intra- and Interspecific Variation in Primate

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Paulo Mologni 1, Ailton Akira Shinoda 2, Carlos Dias Maciel

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University Measures of Sequence Similarity Alignment with dot

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution" (Theodosius Dobzhansky). EVOLUTION: theory that groups of organisms change over time so that descendants differ structurally

Analysis and Design of Algorithms Dynamic Programming

Analysis and Design of Algorithms Dynamic Programming Analysis and Design of Algorithms Dynamic Programming Lecture Notes by Dr. Wang, Rui Fall 2008 Department of Computer Science Ocean University of China November 6, 2009 Introduction 2 Introduction..................................................................

Geosciences Data Digitize and Materialize, Standardization Based on Logical Inter- Domain Relationships GeoDMS

Geosciences Data Digitize and Materialize, Standardization Based on Logical Inter- Domain Relationships GeoDMS Geosciences Data Digitize and Materialize, Standardization Based on Logical Inter- Domain Relationships GeoDMS Somayeh Veiseh Iran, Corresponding author: Geological Survey of Iran, Azadi Sq, Meraj St,

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

Emily Blanton Phylogeny Lab Report May 2009

Emily Blanton Phylogeny Lab Report May 2009 Introduction It is suggested through scientific research that all living organisms are connected- that we all share a common ancestor and that, through time, we have all evolved from the same starting

Computational Structural Bioinformatics

Computational Structural Bioinformatics Computational Structural Bioinformatics ECS129 Instructor: Patrice Koehl http://koehllab.genomecenter.ucdavis.edu/teaching/ecs129 koehl@cs.ucdavis.edu Learning curve Math / CS Biology/ Chemistry Pre-requisite

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES Molecular Biology-2018 1 Definitions: RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES Heterologues: Genes or proteins that possess different sequences and activities. Homologues: Genes or proteins that

Place Syntax Tool (PST)

Place Syntax Tool (PST) Place Syntax Tool (PST) Alexander Ståhle To cite this report: Alexander Ståhle (2012) Place Syntax Tool (PST), in Angela Hull, Cecília Silva and Luca Bertolini (Eds.) Accessibility Instruments for Planning

Search. Search is a key component of intelligent problem solving. Get closer to the goal if time is not enough

Search. Search is a key component of intelligent problem solving. Get closer to the goal if time is not enough Search Search is a key component of intelligent problem solving Search can be used to Find a desired goal if time allows Get closer to the goal if time is not enough section 11 page 1 The size of the search

BLAST: Basic Local Alignment Search Tool

BLAST: Basic Local Alignment Search Tool .. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].

Overview Multiple Sequence Alignment

Overview Multiple Sequence Alignment Overview Multiple Sequence Alignment Inge Jonassen Bioinformatics group Dept. of Informatics, UoB Inge.Jonassen@ii.uib.no Definition/examples Use of alignments The alignment problem scoring alignments

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

Processes of Evolution

Processes of Evolution 15 Processes of Evolution Forces of Evolution Concept 15.4 Selection Can Be Stabilizing, Directional, or Disruptive Natural selection can act on quantitative traits in three ways: Stabilizing selection

Solving and Graphing a Linear Inequality of a Single Variable

Solving and Graphing a Linear Inequality of a Single Variable Chapter 3 Graphing Fundamentals Section 3.1 Solving and Graphing a Linear Inequality of a Single Variable TERMINOLOGY 3.1 Previously Used: Isolate a Variable Simplifying Expressions Prerequisite Terms:

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

Phylogenetic analysis. Characters

Phylogenetic analysis. Characters Typical steps: Phylogenetic analysis Selection of taxa. Selection of characters. Construction of data matrix: character coding. Estimating the best-fitting tree (model) from the data matrix: phylogenetic

Miller & Levine Biology 2010

Miller & Levine Biology 2010 A Correlation of 2010 to the Pennsylvania Assessment Anchors Grades 9-12 INTRODUCTION This document demonstrates how 2010 meets the Pennsylvania Assessment Anchors, grades 9-12. Correlation page references

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr Introduction to Bioinformatics Shifra Ben-Dor Irit Orr Lecture Outline: Technical Course Items Introduction to Bioinformatics Introduction to Databases This week and next week What is bioinformatics? A
