A metric approach for comparing DNA sequences

H. Mora-Mora
Department of Computer and Information Technology, University of Alicante, Alicante, Spain

M. Lloret-Climent
Department of Applied Mathematics, University of Alicante, Alicante, Spain

F. Vives-Macia
Department of Applied Mathematics, University of Alicante, Alicante, Spain

Abstract

Purpose: This article compares two biological sequences, and more generally any two sequences, by analysing the differences between them.

Design/methodology/approach: We designed an algorithm based on a metric that calculates the distance between two sequences; from the distance obtained, we infer the type of mutations that have occurred.

Findings: The empirical analysis shows that the transformations applied to the sequences are detected by the metric. We are aware that there are many very powerful programmes able to trawl entire databases; our aim in this article was therefore not to compete with these, but to demonstrate a different way of comparing two given sequences.

Practical implications: Comparison of two DNA sequences, yielding a measure of the degree of similarity between them and of their possible mutations.

Originality/value: The metric presented is a generalisation of the Hamming distance to strings of different lengths. The software associated with the metric has enabled us to validate the results obtained.

Keywords: DNA sequences, algorithm, metric.

Paper type: Research paper

1. Introduction

Comparison of nucleotide (and amino acid) sequences is essential for research in molecular genetics, molecular biology and bioinformatics. Coding theory, programming and algorithm analysis face the same problem. The difficulty in comparing symbol sequences lies in introducing a true metric between two sequences. To be precise, a metric does exist (the so-called Hamming distance), but it gives little insight into how two strings differ: it is defined only for strings of the same length, and at each position it records only one of two outcomes, namely that the symbols coincide or that they do not. For example, the Hamming distance between 1011101 and 1001001 is 2, and the Hamming distance between toned and roses is 3.

The most widespread methodology for establishing similarity in a family of symbol sequences (of different origins) is the approach called alignment, or edit distance (Waterman, 1989; Alexandrov et al., 1990; Wootton and Federchen, 1996). The methodology is based on transforming one sequence into another using a fixed set of allowed operations (Sadovsky, 2003). An operation is an insertion, deletion or substitution of a character. For example, the distance between kitten and sitting is 3, because at least three basic editing operations are needed to change one into the other:

1. kitten to sitten (substitution of k by s)
2. sitten to sittin (substitution of e by i)
3. sittin to sitting (insertion of g at the end)

The edit distance is considered a generalisation of the Hamming distance, which is used for strings of the same length and allows only the substitution operation. The metric presented here, like the edit distance, also applies to strings of different lengths; however, it may also be applied to strings of the same length (in fact it

coincides with the Hamming distance for strings of the same length). The main difference is that it introduces the absolute value of the difference of the string lengths, and that it achieves results similar to those obtained with other distances.

2. Set comparison model

Each gene g at an instant of time t that falls within the period of existence of a particular cell will be denoted g(t). Therefore g(t) = (g1, g2, ..., gn), where each gi is one of the four symbols (T, C, A, G). M represents the ordered set of bases associated with each gene. In this case, the order in which these bases appear is particularly interesting for our purposes. We define $d : (M \times T) \times (M \times T) \to \mathbb{R}^{+}$ such that, if g(t1) = (g1, g2, ..., gn) and h(t2) = (h1, h2, ..., hm), then

$$
d(g(t_1), h(t_2)) =
\begin{cases}
\sum_{i=1}^{n} \bigl(1 - \delta_{g,h}(i)\bigr) & \text{if } \dim(g(t_1)) = \dim(h(t_2)), \\
\sum_{i=1}^{p} \bigl(1 - \delta_{g,h}(i)\bigr) + \bigl|\dim(g(t_1)) - \dim(h(t_2))\bigr| & \text{if } \dim(g(t_1)) \neq \dim(h(t_2)),
\end{cases}
$$

where

$$
\delta_{g,h}(i) =
\begin{cases}
1 & \text{if } g_i(t_1) = h_i(t_2), \\
0 & \text{otherwise,}
\end{cases}
$$

and p = min(n, m). Then d is a metric (Lloret, 1999; Bonnet et al., 2004).

Note. In the specific case where the genes have the same number of nucleotides (n) and the order in which these bases are situated matters, the above metric is expressed as:

$d : (M \times T) \times (M \times T) \to \mathbb{R}^{+}$ such that, if g(t1) = (g1, g2, ..., gn) and h(t2) = (h1, h2, ..., hn), then

$$
d(g(t_1), h(t_2)) = \sum_{i=1}^{n} \bigl(1 - \delta_{g,h}(i)\bigr),
\qquad
\delta_{g,h}(i) =
\begin{cases}
1 & \text{if } g_i = h_i, \\
0 & \text{if } g_i \neq h_i
\end{cases}
$$

(that is, according to whether or not both sequences have the same element in position i). This metric coincides with the Hamming distance, so our metric is a generalisation of the Hamming distance to sequences of different lengths. That is, d(g(t1), h(t2)) represents the number of elements of g that are not shared with h, taking into account the positions in which they appear. Clearly, the smaller the value of d, the greater the coincidence between the sequences g and h under consideration.

For example, consider the distance between g(t1) = (T, C, G, A, C, T, A, T, C, A) and h(t2) = (T, C, G, A, C, T, A, T, C, A, G). Here p, the smaller of the dimensions of the two strings, is 10, so the sum is taken up to 10. Furthermore, the absolute value of the difference between the string lengths is |10 - 11| = 1. Therefore

$$
d(g(t_1), h(t_2)) = \sum_{i=1}^{10} \bigl(1 - \delta_{g,h}(i)\bigr) + \bigl|\dim(g(t_1)) - \dim(h(t_2))\bigr| = 0 + 1 = 1.
$$

More complex and developed examples will be presented with the programme.

3. Software for the comparison of DNA sequences

This application is based on the previous metric, and its aim is to analyse strings of DNA by comparing their sequences of nucleotides. The algorithm takes two DNA strings and produces a list of the differences found between the two sequences, an interpretation of these changes, and a numeric value measuring the alterations produced.
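As a minimal sketch, the metric d of Section 2 can be written in a few lines of Python. The function name metric_distance is hypothetical and chosen here for illustration only; it is not taken from the authors' software, and sequences are assumed to be plain strings compared case-sensitively.

```python
def metric_distance(g, h):
    """Sketch of the metric d: positional mismatches over the first
    p = min(len(g), len(h)) positions, plus the difference in lengths.
    (Hypothetical helper, not the authors' implementation.)"""
    p = min(len(g), len(h))
    mismatches = sum(1 for i in range(p) if g[i] != h[i])
    return mismatches + abs(len(g) - len(h))


# Worked example from Section 2: identical on the first 10 bases, lengths 10 and 11.
print(metric_distance("TCGACTATCA", "TCGACTATCAG"))  # 1

# For equal lengths the value is the Hamming distance (examples from the Introduction).
print(metric_distance("1011101", "1001001"))  # 2
print(metric_distance("toned", "roses"))      # 3
```

When the two strings have the same length the length term vanishes, which is how the sketch reproduces the Hamming distance as a special case.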

Interface

A detailed description of the software inputs and outputs is provided below.

Inputs: The algorithm receives two DNA sequences to analyse. Each sequence is a string of arbitrary length made up of its basic components, represented by the ASCII characters (a, t, g, c).

Outputs: The algorithm produces a report showing the results of the analysis carried out. The information it contains covers both quantitative and qualitative aspects. The contents of this report are as follows:

- Position, length and composition of each change found between the two sequences.
- Semantic determination of each change. The changes detected can be of the following types: substitution, insertion, deletion, duplication, inversion and transposition.
- Measurement of the distance between the two sequences.

Output is provided in a text file.

Function

The software is organised in a modular way and programmed structurally, which offers flexibility and adaptation to the conditions required by the analysis. The comparison criteria can be adjusted to the characteristics of the strings, so that the analysis can focus on different transformation operations in the sequences being studied. The software works by applying comparison heuristics to strings of consecutive characters. It combines comparison strategies between local substrings with an overall analysis of the complete string.

In order to assess the degree of relationship between the sequences, three sequence proximity functions have been defined. These functions supply the information used to decide the most probable transformation. The functions are as follows:

1. Measurement of the global correlation: counts the number of non-coincidences in a base-to-base comparison of the two strings.
2. Measurement of the local correlation: counts the number of non-coincidences in a base-to-base comparison in an environment around a specific position.
3. Measurement of consecutive positions: counts the number of coincident consecutive positions in a base-to-base comparison from a specific position.

The comparison procedure between the sequences is a combination of the following heuristics:

1. Direct comparison of sequences: checks whether the bases of the sequences situated in the same position coincide.
2. Best position of a subsequence: looks for the string of bases that obtains the most coincidences in the local correlation and consecutive positions measurements.
3. Correlation of least difference: looks for the relative position between the sequences that obtains the most coincidences in the global correlation measurement. This heuristic is a generalisation of the previous one to the complete string.
4. Duplicity of subsequences: checks whether a subsequence of bases is equal to the subsequences next to it.

5. Special cases: transformation situations undetected by the previous heuristics are analysed.

The decision on which transformation has occurred is made by taking into account both the characteristics of the operation and its consequences for the complete string. The order in which the heuristics are applied may alter the results of the analysis, since the same sequence can be reached via different operations. For this reason, combined strategies are established to determine the order. These strategies minimise the distance function, as they are based on information about the frequency and number of transformations. In any case, this order can be varied depending on the purposes of the study.

The following figure shows a flow diagram that illustrates how the software works:

Figure 1: Algorithm flowchart (steps shown: start, data acquisition, parameter adjustment, direct sequence comparison, amount of differences, smaller-difference correlation, better subsequence position, subsequence duplicity, particular case analysis, report generation, end)

Results

The empirical analysis demonstrates that, in most of the cases analysed, the changes occurring between two sequences were detected by the software. However, a prior stage of training and calibration of the algorithm parameters to the nature of the analysed data was necessary.

4. Application Example

Our main problem is to compare the structure of a gene with the nucleotide changes formed in that gene at a specific time, using the gene database at the European Bioinformatics Institute (EBI_website, 2006). More specifically, we compare nucleotide changes of the following gene: Drosophila melanogaster, sequence length 4220 bp, Accession # AB095028, Entry name: EMBL: AB095028.

In the first matrix, the algorithm displays the digits corresponding to each of the nucleotides of the gene; in the second matrix, the original gene whose nucleotide changes we want to analyse (it can be consulted in the database); and in the third matrix, the gene with the associated nucleotide changes. The software shows the nucleotide changes that have taken place, indicating the positions in which they have occurred, and a distance value indicative of the similarity between the two sequences. For reasons of space, only a substring of the previous gene with its nucleotide changes and the result obtained are given in this paper. A more extensive account of the experiments that have been carried out can be found on the authors' website.

->Processing: sub_mbl_ab095028.txt
--------------------------------------------------------------

Original segment :
A T T T A T A C G A T C G G A A A C G G A A C G G A T T T G C T T G A G C C A G C A T C T G C A G C A T G T C C T G C A A C T G C A G C A G C A A C A C C A G C A G C A T G A A A C G T C G C C A A C A G T G T C G C A G C A G C A A C G T C A C A A T G C G A T G C A T C A T C T C C A T C A T C A G A C A G G T T A G T A T T A A G T G T G T A T T G T T T T T T A A G T T T C T G G C

Segment with nucleotide changes:
A T T T A T G T A C G A T C G G A A A C G G A A C G G A T T T G C T T G A G C C A G C A T C T G C A T G T C C T G C A A C T G C A G C A G C A A C A C C T T T A G C A T G A A A C G T C G C C A A C A G T G T C G C A G C A G C A A C G T C A C A A T G T A G C G C A T C A T C T C C A T C A T C A G A C A G G T T A G T A T T A A G T A G T G T G T A T T G T T T T T T A A G T T T C T G G C

Processing Report:
-Insertion: (7, 8) => (G, T)
-Deletion: (51, 52, 53) => (G, C, A)
-Duplication: (168, 169, 170) => (A, G, T)
-Inverted segment: (128, 129, 130, 131) => (T, A, G, C)
-Substituted segment: 80-82 ({A, G, C} --> {T, T, T})

Distance: 15
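To make the first entry of this report more concrete, the three proximity functions of Section 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the 0-based indexing and the radius parameter are hypothetical choices made here, and the code is not the authors' implementation. The two test strings are the opening bases of the segments shown above.

```python
def global_correlation(a, b):
    """Non-coincidences in a base-to-base comparison over the overlap of a and b."""
    return sum(1 for x, y in zip(a, b) if x != y)


def local_correlation(a, b, pos, radius):
    """Non-coincidences in the window [pos - radius, pos + radius], clipped to both strings.
    The window radius is an assumed parameter."""
    lo = max(0, pos - radius)
    hi = min(len(a), len(b), pos + radius + 1)
    return sum(1 for i in range(lo, hi) if a[i] != b[i])


def consecutive_matches(a, b, pos_a, pos_b):
    """Length of the run of coincident bases starting at a[pos_a] and b[pos_b]."""
    run = 0
    while (pos_a + run < len(a) and pos_b + run < len(b)
           and a[pos_a + run] == b[pos_b + run]):
        run += 1
    return run


original = "ATTTATACGATCGG"   # first 14 bases of the original segment
changed = "ATTTATGTACGATCGG"  # first 16 bases of the segment with changes

# Direct comparison: first index (0-based) at which the two strings differ.
first_diff = next(i for i, (x, y) in enumerate(zip(original, changed)) if x != y)
print(first_diff)                                                           # 6, i.e. position 7
print(global_correlation(original, changed))                                # 8
print(consecutive_matches(original, changed, first_diff, first_diff))       # 0
print(consecutive_matches(original, changed, first_diff, first_diff + 2))   # 8
```

Skipping two bases of the changed string at position 7 restores a full run of matches, which is consistent with the reported insertion of (G, T) at positions (7, 8). It may also be noted that the lengths of the five reported changes, 2 + 3 + 3 + 4 + 3, add up to the reported distance of 15; the text does not state how the software derives this value, so this reading is only an observation.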

5. Conclusions

In this paper we have proposed a different means of comparing two sequences, both from a theoretical point of view, with the presentation of the metric, and from a practical perspective, with the introduction of the associated programme. We have not attempted to analyse the advantages of our programme over the many programmes already in existence for the comparative analysis of DNA sequences, such as BLAST, FASTA and CLUSTALW. Many researchers already have these installed on their computers; they can trawl through entire databases (with millions of sequences), provide multiple sequence alignments and extensive reports on quantitative data, and are backed by enormous financial and human resources, so such a comparison would serve no purpose. Our metric represents another means of comparing two sequences. The advantages gained with this software will need to be addressed in future studies, although at present we know that differences in the treatment of the sequences will appear, since our basic hypothesis is different.

References

Alexandrov, A. A., V. V. Alexandrov, Yu. Borodovsky and A. V. Mironov. 1990. Computer Analysis of Genetic Texts. Moscow: Nauka.

Bonnet-Jerez, J. L. and M. Lloret-Climent. 2004. An approach to measuring change in muscular tissue contraction. Biosystems, 74, 73-78.

EBI_website. 2006. European Bioinformatics Institute. URL: http://www.ebi.ac.uk/.

Lloret-Climent, M. 1999. Measures of cellular change in systems theory. Kybernetes, 28, No 8/9, 1016-1026.

Sadovsky, M. G. 2003. The method to compare nucleotide sequences based on the minimum entropy principle. Bulletin of Mathematical Biology, 65, 309-322.

Waterman, M. (ed.). 1989. Alignment of sequences. Boca Raton: CRC Press Inc.

Wootton, J. C. and S. Federchen. 1996. Alignment of sequences, in Methods of Enzymology, R. F. Doolittle (Ed.), 554-571.