A metric approach for comparing DNA sequences


H. Mora-Mora
Department of Computer and Information Technology, University of Alicante, Alicante, Spain

M. Lloret-Climent
Department of Applied Mathematics, University of Alicante, Alicante, Spain

F. Vives-Macia
Department of Applied Mathematics, Universidad de Alicante, Alicante, Spain

Abstract

Purpose: This article sets out to compare two biological sequences, or more generally any two sequences, by analysing the differences between them.

Design/methodology/approach: We designed an algorithm based on a metric which enables us to calculate the distance between two sequences and, based on the distance obtained, to estimate the type of mutations which have occurred.

Findings: The empirical analysis shows that the transformations made to the sequences have been detected by the metric. We are aware that today there are numerous enormously powerful programmes able to trawl entire databases; our aim in this article was therefore not to compare against these, but to demonstrate a different way of comparing two given sequences.

Practical implications: Comparison of two DNA sequences, yielding a measure of the degree of similarity between them and their possible mutations.

Originality/value: The metric presented is a generalisation of the Hamming distance to strings of different lengths. The software associated with the metric has enabled us to validate the results obtained.

Keywords: DNA sequences, algorithm, metric.

Paper type: Research paper

1. Introduction

Comparison of nucleotide (and amino acid) sequences is essential for research in molecular genetics, molecular biology and bioinformatics. Coding theory, programming and algorithm analysis face the same problem. The difficulty in comparing symbol sequences lies in introducing a real metric between two sequences. To be precise, a metric does exist (the so-called Hamming distance), but it offers little insight into how two strings differ, since at each position it registers only two outcomes: the symbols coincide or they do not. For example, the Hamming distance between 1011101 and 1001001 is 2, and the Hamming distance between toned and roses is 3.

The most widespread methodology for establishing similarity in a family of symbol sequences (of different origin) is the approach called alignment, or edit distance (Waterman, 1989; Alexandrov et al., 1990; Wootton and Federhen, 1996). The methodology is based on the idea of transforming one sequence into another, where the transformation is performed with a fixed set of permitted operations (Sadovsky, 2003). An operation is understood to be the insertion, deletion or substitution of a character. For example, the distance between kitten and sitting is 3, because at least three basic editing operations are needed to change one into the other:

1. kitten -> sitten (substitution of k by s)
2. sitten -> sittin (substitution of e by i)
3. sittin -> sitting (insertion of g at the end)

The edit distance is considered to be a generalisation of the Hamming distance, which applies only to strings of the same length and allows only the substitution operation. The metric presented here, like the edit distance, also applies to strings of different lengths; however, it may equally be applied to strings of the same length (in fact it coincides with the Hamming distance for strings of the same length). The main difference is that it introduces the absolute value of the difference of the string lengths, and that it achieves results similar to those obtained with other distances.

2. Set comparison model

Each gene g at an instant of time t which falls within the period of existence of a particular cell will be denoted g(t). Therefore g(t) = (g1, g2, ..., gn), where each gi is one of the four symbols (T, C, A, G). M represents the ordered set of bases associated with each gene. In this case, the order in which these bases appear is particularly relevant for our purposes. We define

\[
d : (M \times T) \times (M \times T) \to \mathbb{R}^{+}
\]

such that, if g(t1) = (g1, g2, ..., gn) and h(t2) = (h1, h2, ..., hm), then

\[
d(g(t_1), h(t_2)) =
\begin{cases}
\sum_{i=1}^{n} \bigl(1 - \delta_{g,h}(i)\bigr) & \text{if } \dim(g(t_1)) = \dim(h(t_2)) \\[6pt]
\sum_{i=1}^{p} \bigl(1 - \delta_{g,h}(i)\bigr) + \bigl|\dim(g(t_1)) - \dim(h(t_2))\bigr| & \text{if } \dim(g(t_1)) \neq \dim(h(t_2))
\end{cases}
\]

where

\[
\delta_{g,h}(i) =
\begin{cases}
1 & \text{if } g_i(t_1) = h_i(t_2) \\
0 & \text{otherwise}
\end{cases}
\]

and p = min(n, m). Then d is a metric (Lloret, 1999; Bonnet et al., 2004).

Note. In the specific case in which the genes have the same number of nucleotides (n) and the order in which these bases are situated exerts an influence, the above metric is expressed as:

\[
d : (M \times T) \times (M \times T) \to \mathbb{R}^{+}
\]

such that, if g(t1) = (g1, g2, ..., gn) and h(t2) = (h1, h2, ..., hn), then

\[
d(g(t_1), h(t_2)) = \sum_{i=1}^{n} \bigl(1 - \delta_{g,h}(i)\bigr),
\qquad
\delta_{g,h}(i) =
\begin{cases}
1 & \text{if } g_i = h_i \\
0 & \text{if } g_i \neq h_i
\end{cases}
\]

(that is, according to whether or not both sequences have the same element in position i). This metric coincides with the Hamming distance, so our metric is a generalisation of the Hamming distance to sequences of different lengths. In other words, d(g(t1), h(t2)) represents the number of elements of g which are not shared with h (taking into account the positions in which they appear). Clearly, the smaller the value of d, the greater the coincidence between the sequences g and h under consideration.

For example, consider the distance between g(t1) = (T, C, G, A, C, T, A, T, C, A) and h(t2) = (T, C, G, A, C, T, A, T, C, A, G). Here p, the smaller of the dimensions of the two strings, is 10, so the sum is taken up to 10. Furthermore, the absolute value of the difference between the string lengths is 11 - 10 = 1. Therefore

\[
d(g(t_1), h(t_2)) = \sum_{i=1}^{10} \bigl(1 - \delta_{g,h}(i)\bigr) + \bigl|\dim(g(t_1)) - \dim(h(t_2))\bigr| = 0 + 1 = 1
\]

More complex and developed examples will be presented with the programme.

3. Software for the comparison of DNA sequences

This application is based on the previous metric, and its aim is to analyse strings of DNA by comparing their sequences of nucleotides. The algorithm takes two DNA strings and returns a list of the differences found between the two sequences, an interpretation of these changes, and a numeric value measuring the alterations produced.
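The metric of Section 2, on which the application is based, can be sketched in a few lines of Python. This is only an illustrative reading of the formula, not the authors' software; the function name metric_distance and the use of plain Python strings for sequences are assumptions made here.

```python
def metric_distance(g, h):
    """Sketch of the metric d from Section 2 (hypothetical helper name).

    Counts position-wise mismatches over the first p = min(n, m) symbols
    and adds the absolute difference of the two lengths.
    """
    p = min(len(g), len(h))
    mismatches = sum(1 for i in range(p) if g[i] != h[i])
    return mismatches + abs(len(g) - len(h))


# Worked example from Section 2: identical up to position 10, lengths 10 and 11.
g = "TCGACTATCA"
h = "TCGACTATCAG"
print(metric_distance(g, h))                  # 1

# For equal lengths the value coincides with the Hamming distance.
print(metric_distance("1011101", "1001001"))  # 2
print(metric_distance("toned", "roses"))      # 3

# Strings of different lengths from the Introduction.
print(metric_distance("kitten", "sitting"))   # 3
```

For kitten and sitting the value happens to equal the edit distance quoted in the Introduction (two mismatches in the first six positions plus a length difference of one), but the two measures are not the same in general, because this metric compares symbols strictly position by position.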

Interface

Below, a detailed description of the software inputs and outputs is provided.

Inputs: The algorithm receives two DNA sequences to analyse. Each sequence is a string, of arbitrary length, of its basic components represented by the ASCII characters (a, t, g, c).

Outputs: The algorithm produces a report showing the results of the analysis that has been carried out. The information it contains covers both quantitative and qualitative aspects. The contents of this report are as follows:

- Position, length and composition of each change found between both sequences.
- Semantic determination of each change. The changes may be any of the following: substitution, insertion, deletion, duplication, inversion and transposition.
- Measurement of the distance between both sequences.

Output is provided in a text file.

Function

The software is organised in a modular way and programmed structurally, which offers flexibility and adaptation to the conditions required by the analysis. The comparison criteria can be adjusted to the characteristics of the strings, so that the analysis can focus on different transformation operations in the sequences being studied. The function is based on the application of comparison heuristics to strings of consecutive characters. Use is made of combined comparison strategies between local substrings and overall analysis of the complete string.
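The inputs and the report described under Interface can be pictured with a small data-structure sketch. Everything named here (normalise, Change, render_report) is hypothetical and chosen for illustration; the paper does not publish the program's actual types, and writing the text to a file is left out.

```python
from dataclasses import dataclass

ALPHABET = set("atgc")  # the four ASCII symbols accepted as input


def normalise(sequence: str) -> str:
    """Lower-case a sequence, drop whitespace, and reject foreign symbols."""
    cleaned = "".join(sequence.split()).lower()
    invalid = set(cleaned) - ALPHABET
    if invalid:
        raise ValueError(f"unexpected symbols: {sorted(invalid)}")
    return cleaned


@dataclass(frozen=True)
class Change:
    """One reported change: its kind, 1-based start position and bases."""
    kind: str    # e.g. "Insertion", "Deletion", "Duplication"
    start: int   # 1-based position of the first affected base
    bases: str   # composition of the change

    @property
    def length(self) -> int:
        return len(self.bases)

    def describe(self) -> str:
        positions = ", ".join(str(self.start + k) for k in range(self.length))
        letters = ", ".join(self.bases.upper())
        return f"-{self.kind}: ({positions}) => ({letters})"


def render_report(changes, distance) -> str:
    """Assemble the report text: one line per change, then the distance."""
    lines = [change.describe() for change in changes]
    lines.append(f"Distance: {distance}")
    return "\n".join(lines)


print(normalise("A T g c"))                       # atgc
print(Change("Insertion", 7, "gt").describe())    # -Insertion: (7, 8) => (G, T)
```

Keeping each change as a record with position, length and composition mirrors the three items the report is said to contain, and makes it straightforward to add further change kinds later.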

In order to assess the degree of relationship between the sequences, three sequence proximity functions have been defined. These functions supply the information needed to decide on the most probable transformation. The functions are as follows:

1. Measurement of the global correlation: counts the number of non-coincidences in a base-to-base comparison of the two strings.
2. Measurement of the local correlation: counts the number of non-coincidences in a base-to-base comparison in an environment around a specific position.
3. Measurement of consecutive positions: counts the number of coincident consecutive positions in a base-to-base comparison from a specific position.

The comparison procedure between the sequences is a combination of the following heuristics:

1. Direct comparison of sequences: checks whether the bases of the sequences situated in the same position coincide.
2. Best position of a subsequence: looks for the string of bases which obtains the most coincidences in the functions of measurement of local correlation and of consecutive positions.
3. Correlation of least difference: looks for the relative position between the sequences that obtains the most coincidences in the measurement of the global correlation function. This heuristic is a generalisation of the previous one to the complete string.
4. Duplicity of subsequences: checks whether a subsequence of bases is equal to the subsequences next to it.
5. Special cases: transformation situations undetected by the previous heuristics are analysed.

The decision on which transformation has been produced is made by taking into account both the characteristics of the operation and the consequences for the complete string. The order in which the heuristics are applied may alter the results of the analysis, since the same sequence can be reached via different operations. For this reason, combined strategies are established to determine the order. These strategies minimise the distance function, as they are based on information about the frequency and number of transformations. In any case, this order can be varied depending on the purposes of the study.

The following figure shows a flow diagram that graphically illustrates how the software works:

[Figure 1: Algorithm flowchart. Labels recovered from the figure: start; data acquisition; parameter adjustment; direct comparison of sequences; amount of differences; transformation (yes/no); correlation of least difference; best position of a subsequence; special-case analysis; insertion (yes/no); duplicity of subsequences; report generation; report; end.]

Results

The empirical analysis demonstrates that, in most of the cases analysed, the changes occurring between two sequences have been detected by the software. However, a previous stage of training and calibration of the algorithm parameters to the nature of the analysed data was necessary.

4. Application Example

Our main problem will be to compare the structure of a gene and the nucleotide changes formed in that gene at a specific time, using the gene database at the European Bioinformatics Institute (EBI_website, 2006). More specifically, we will compare nucleotide changes of the following gene: Drosophila melanogaster, sequence length 4220 bp, Accession # AB095028, Entry name: EMBL: AB095028.

In the first matrix, the algorithm displays the digits corresponding to each of the nucleotides of the gene; in the second matrix, the original gene whose nucleotide changes we want to analyse (it can be consulted in the database); and in the third matrix, the gene with the associated nucleotide changes. The software shows the nucleotide changes that have taken place, indicating the positions in which they have occurred, and a distance value indicative of the similarity between the two sequences.

For reasons of space, only a substring of the previous gene, with its nucleotide changes and the result obtained, is given in this paper. A more extensive account of the experiments that have been carried out can be found on the authors' website.

->Processing: sub_mbl_ab095028.txt
--------------------------------------------------------------

Original segment:
A T T T A T A C G A T C G G A A A C G G A A C G G A T T T G C T T G A G C C A G C A T C T G C A G C A T G T C C T G C A A C T G C A G C A G C A A C A C C A G C A G C A T G A A A C G T C G C C A A C A G T G T C G C A G C A G C A A C G T C A C A A T G C G A T G C A T C A T C T C C A T C A T C A G A C A G G T T A G T A T T A A G T G T G T A T T G T T T T T T A A G T T T C T G G C

Segment with nucleotide changes:
A T T T A T G T A C G A T C G G A A A C G G A A C G G A T T T G C T T G A G C C A G C A T C T G C A T G T C C T G C A A C T G C A G C A G C A A C A C C T T T A G C A T G A A A C G T C G C C A A C A G T G T C G C A G C A G C A A C G T C A C A A T G T A G C G C A T C A T C T C C A T C A T C A G A C A G G T T A G T A T T A A G T A G T G T G T A T T G T T T T T T A A G T T T C T G G C

Processing Report:
-Insertion: (7, 8) => (G, T)
-Deletion: (51, 52, 53) => (G, C, A)
-Duplication: (168, 169, 170) => (A, G, T)
-Inverted segment: (128, 129, 130, 131) => (T, A, G, C)
-Substituted segment: 80-82 ({A, G, C} --> {T, T, T})
Distance: 15
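To make the first reported change concrete, the three proximity functions of Section 3 can be sketched and applied to the opening bases of the two segments above. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the window radius, the max_len bound and the guess_insertion helper are all hypothetical.

```python
def global_mismatches(a, b):
    """Global correlation: non-coincidences in a base-to-base comparison."""
    return sum(1 for x, y in zip(a, b) if x != y)


def local_mismatches(a, b, pos, radius):
    """Local correlation: non-coincidences in a window around index pos."""
    lo = max(0, pos - radius)
    hi = min(len(a), len(b), pos + radius + 1)
    return sum(1 for i in range(lo, hi) if a[i] != b[i])


def consecutive_matches(a, b, start_a, start_b):
    """Consecutive positions: length of the matching run from the two starts."""
    run = 0
    while (start_a + run < len(a) and start_b + run < len(b)
           and a[start_a + run] == b[start_b + run]):
        run += 1
    return run


def guess_insertion(a, b, max_len=5):
    """Hypothetical helper: find the first difference, then pick the insertion
    length (1..max_len) after which the longest matching run resumes.
    Returns the 1-based position and the inserted bases."""
    first_diff = consecutive_matches(a, b, 0, 0)
    best_len = max(
        range(1, max_len + 1),
        key=lambda k: consecutive_matches(a, b, first_diff, first_diff + k),
    )
    return first_diff + 1, b[first_diff:first_diff + best_len]


# First 16 bases of the original segment and first 18 of the changed one.
original = "ATTTATACGATCGGAA"
changed = "ATTTATGTACGATCGGAA"

print(consecutive_matches(original, changed, 0, 0))  # 6
print(global_mismatches(original, changed))          # 10
print(guess_insertion(original, changed))            # (7, 'GT')
```

On this prefix a direct comparison finds ten mismatches after the sixth base, whereas skipping two bases in the changed segment restores a run of ten matches, which agrees with the reported insertion of (G, T) at positions 7 and 8. Separately, the reported distance of 15 equals the total number of bases in the five listed changes (2 + 3 + 3 + 4 + 3); the paper does not state this, so it should be read as an observation rather than a description of how the software computes the value.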

5. Conclusions

In this paper we have proposed a different means of comparing two sequences, both from a theoretical point of view, with the presentation of the metric, and from a practical perspective, with the introduction of the associated programme. We have not attempted to analyse the advantages of our programme over the very large number of programmes that already exist for the comparative analysis of DNA sequences, such as BLAST, FASTA and CLUSTALW. Many researchers already have these installed on their computers; they are able to trawl through entire databases (with millions of sequences), provide multiple sequence alignments and extensive reports of quantitative data, and are backed by enormous financial and human resources, so such a comparison would serve no purpose. Our metric represents another means of comparing two sequences. The advantages gained with this software will need to be addressed in future studies, although at present we know that differences in the treatment of the sequences will appear, since our basic hypothesis is different.

References

Alexandrov, A. A., V. V. Alexandrov, Yu. Borodovsky and A. V. Mironov. 1990. Computer Analysis of Genetic Texts. Moscow: Nauka.

Bonnet-Jerez, J. L. and M. Lloret-Climent. 2004. An approach to measuring change in muscular tissue contraction. Biosystems, 74, 73-78.

EBI_website. 2006. European Bioinformatics Institute. URL: http://www.ebi.ac.uk/.

Lloret-Climent, M. 1999. Measures of cellular change in systems theory. Kybernetes, 28, No 8/9, 1016-1026.

Sadovsky, M. G. 2003. The method to compare nucleotide sequences based on the minimum entropy principle. Bulletin of Mathematical Biology, 65, 309-322.

Waterman, M. (ed.) 1989. Alignment of sequences. Boca Raton: CRC Press Inc.

Wootton, J. C. and S. Federhen. 1996. Alignment of sequences, in Methods in Enzymology, R. F. Doolittle (Ed.), 554-571.