FLAG: Fast Local Alignment Generating Methodology. Abstract. Introduction

Romanian Biotechnological Letters, Vol. 18, No. 1, 2013
Copyright 2013 University of Bucharest. Printed in Romania. All rights reserved.
SHORT COMMUNICATION

FLAG: Fast Local Alignment Generating Methodology

Received for publication, August 5, 2012
Accepted, November 2, 2012

Faculty of Computer Science, University Goce Delcev - Štip, Republic of Macedonia
Email: done.stojanov@ugd.edu.mk

Abstract

A new, time and space efficient alignment methodology is presented, applicable to similar nucleotide sequences. Linear time complexity O(m) has been determined when aligning similar sequences of approximately the same size. The time complexity improvement is due to the methodology according to which alignments are generated, and to the significant reduction of the space in which the search for alignments is carried out.

Keywords: linear time and space, un-gapped, local alignment, methodology

Introduction

Time inefficiency has been the major disadvantage of local pairwise alignment techniques. The Smith-Waterman algorithm (T. SMITH & al. [1]) requires fixed O(nm) time, identifying one optimal (score-maximized) alignment and allowing gap insertion. Instead of finding one optimal alignment, M. WATERMAN and M. EGGERT [2] came up with the idea of identifying k suboptimal local alignments. The main disadvantage of the Waterman-Eggert algorithm is again its nonlinear time complexity. In order to reduce the space complexity of the Waterman-Eggert algorithm, X. HUANG and W. MILLER [3] in 1991 presented a linear-space solution, which was until then the most space-efficient local alignment technique. Newer heuristic ultrafast solutions, such as FASTA (D. LIPMAN & al. [4]) and BLAST (S. ALTSCHUL & al. [5]), are applicable for fast search of large genetic databases, identifying sequences similar to a reference sequence, although they do not always find the optimal solution.

Besides time complexity, space complexity is often a limiting factor when aligning large nucleotide sequences. In order to reduce the space complexity of an alignment, the methodology presented in (D. STOJANOV & al. [6]) represents each region of consecutive matching nucleotides with a triple, identifying the region's length and its starting positions in the two sequences. Based on this representation, the measurements performed in [6] clearly show that linear space is required, while the time complexity is $O(nm^2)$. Most of the time in [6] is spent on examining all combinations of un-gapped local alignments within overlapping sections, so the running time is driven by the number of alignments being examined.

When aligning similar nucleotide sequences, large regions of consecutive matching nucleotides are part of the optimal un-gapped local alignment. Also, the probability of finding an optimal un-gapped local alignment within m nucleotides long overlapping sections is higher than the probability of finding it in overlapping sections with fewer than m nucleotides, where m is the length of the smaller of the two nucleotide sequences being aligned. Based on these observations, a fast local alignment generating methodology is presented, requiring linear time and space O(m) when aligning similar nucleotide sequences of approximately the same size.
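To make the triple representation concrete, the following is a minimal C++ sketch of how the matching regions of one overlapping section could be collected. It is not the implementation from [6]: the names `Region` and `matchingRegions`, and the choice of 0-based positions stored in the order (start in b, start in a, length), are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One maximal run of consecutive matching nucleotides, stored as a triple:
// start position in b, start position in a, and run length (0-based positions).
// The struct name and field names are hypothetical.
struct Region {
    std::size_t pb;
    std::size_t pa;
    std::size_t len;
};

// Scans the overlapping section that begins at a[startA] and b[startB] and
// returns every maximal matching run as a triple, in left-to-right order.
std::vector<Region> matchingRegions(const std::string& a, const std::string& b,
                                    std::size_t startA, std::size_t startB) {
    assert(startA <= a.size() && startB <= b.size());
    const std::size_t sectionLen =
        std::min(a.size() - startA, b.size() - startB);
    std::vector<Region> regions;
    std::size_t i = 0;
    while (i < sectionLen) {
        if (a[startA + i] != b[startB + i]) {  // mismatch: skip this column
            ++i;
            continue;
        }
        const std::size_t runStart = i;        // first column of a matching run
        while (i < sectionLen && a[startA + i] == b[startB + i]) ++i;
        regions.push_back({startB + runStart, startA + runStart, i - runStart});
    }
    return regions;
}
```

Each section is scanned once, so the number of stored triples never exceeds the section length, which is what keeps the representation linear in space.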

Materials and methods

Methodology

As Table 1 shows, the smaller nucleotide sequence b overlaps n-m+1 different sections of sequence a with length m, as well as sections with length less than m. Overlapping sections shorter than the smaller sequence b are formed by shifting sequence b one place at a time to the left and to the right, beyond the ends of sequence a.

Table 1. Overlapping sections
Nucleotide sequences: $a = a_0 a_1 \dots a_{n-2} a_{n-1}$, $b = b_0 b_1 \dots b_{m-2} b_{m-1}$

m nucleotides long overlapping sections:
    $a_0 a_1 \dots a_{m-2} a_{m-1}$ against $b_0 b_1 \dots b_{m-2} b_{m-1}$
    ...
    $a_{n-m} a_{n-m+1} \dots a_{n-2} a_{n-1}$ against $b_0 b_1 \dots b_{m-2} b_{m-1}$
Sequence b left shifted overlapping sections:
    $a_0 a_1 \dots a_{m-3} a_{m-2}$ against $b_1 \dots b_{m-3} b_{m-2} b_{m-1}$
    $a_0 a_1 \dots a_{m-4} a_{m-3}$ against $b_2 \dots b_{m-2} b_{m-1}$
    ...
Sequence b right shifted overlapping sections:
    $a_{n-m+1} a_{n-m+2} \dots a_{n-2} a_{n-1}$ against $b_0 b_1 \dots b_{m-3} b_{m-2}$
    $a_{n-m+2} \dots a_{n-2} a_{n-1}$ against $b_0 b_1 \dots b_{m-4} b_{m-3}$
    ...

When comparing parallel nucleotides within overlapping sections, $\chi$ regions composed of consecutive matching nucleotides, each with at least one match, are found. As we have shown in [6], each matching region can be represented with a triple $R:(p_b, p_a, l)$, where $p_b$ is the region's starting position in sequence b, $p_a$ is the region's starting position in sequence a, and $l$ is the region's length. An un-gapped local alignment consists of one or more matching regions, separated by region(s) of mismatching nucleotides. The same alignment score metric that was used in [6] is also used here, awarding a positive score $\mu$ for each nucleotide match, while penalizing each nucleotide mismatch with a penalty $\delta$. Each alignment $A:(R_1 R_2 \dots R_{k-1} R_k)$ is assigned a unique score, computed with the formula presented in [6]:

$$f(A: R_1 R_2 \dots R_{k-1} R_k) = \mu \sum_{i=1}^{k} len(R_i) - \delta \sum_{j=2}^{k} dif(R_j, R_{j-1})$$

where $len(R_i)$ is the length of the matching region $R_i$, while $dif(R_j, R_{j-1})$ is the number of mismatching nucleotides separating matching regions $R_j$ and $R_{j-1}$.

Alignments are formed according to the following strategy, the fast local alignment generating methodology. For overlapping sections in which $\chi$ matching regions have been found, it outputs a score-maximized un-gapped local alignment $A$ with respect to the longest matching region:

Find the longest matching region: $R_\varsigma = \max len\{R_1, R_2, \dots, R_{\chi-1}, R_\chi\}$, $1 \le \varsigma \le \chi$.

Take region $R_\varsigma$ initially as both the extending and the local alignment: $A_e \leftarrow R_\varsigma$, $A \leftarrow R_\varsigma$.

If $\varsigma = 1$, extend $A_e$ by appending regions $R_\xi$, $A_e \leftarrow A_e \bowtie R_\xi$, consecutively for $\xi = 2, 3, \dots, \chi-1, \chi$. If $f(A_e) > f(A)$, then: $f(A) \leftarrow f(A_e)$, $A \leftarrow A_e$.

If $\varsigma = \chi$, extend $A_e$ by prepending regions $R_\xi$, $A_e \leftarrow R_\xi \bowtie A_e$, consecutively for $\xi = \chi-1, \chi-2, \dots, 2, 1$. If $f(A_e) > f(A)$, then: $f(A) \leftarrow f(A_e)$, $A \leftarrow A_e$.

If $1 < \varsigma < \chi$, extend $A_e$ by prepending the left positioned regions $R_\xi$, $A_e \leftarrow R_\xi \bowtie A_e$, consecutively for $\xi = \varsigma-1, \varsigma-2, \dots, 2, 1$. If $f(A_e) > f(A)$, then: $f(A) \leftarrow f(A_e)$, $A \leftarrow A_e$. Take the local alignment found at this stage as the extending alignment, $A_e \leftarrow A$, which now becomes subject to right positioned extension, appending regions $R_\xi$, $A_e \leftarrow A_e \bowtie R_\xi$, consecutively for $\xi = \varsigma+1, \varsigma+2, \dots, \chi-1, \chi$. If $f(A_e) > f(A)$, then: $f(A) \leftarrow f(A_e)$, $A \leftarrow A_e$.

If $f_{max}$ is the score of the highest scoring un-gapped local alignment found within m nucleotides long overlapping sections, it also has to be checked whether an alignment with a score higher than $f_{max}$ exists within overlapping sections with fewer than m nucleotides, formed by one-place left shifts of sequence b beyond the start of sequence a.

Proposition 1: An alignment with a score higher than $f_{max}$ could be found within sequence b left shifted overlapping sections with lengths ranging between $m-1$ and $[f_{max}/\mu]+1$, including those values.

Proof: Within overlapping sections of length $l$, the maximum possible score of an un-gapped local alignment is $l\mu$. Accordingly, an alignment with a score higher than $f_{max}$ could be found only if $l\mu > f_{max}$, from which we get $l > f_{max}/\mu \ge [f_{max}/\mu]$. Since $l$ is an integer, $l \ge [f_{max}/\mu]+1$.

According to Proposition 1, once an alignment with the highest score $f_{max}$ has been found within m nucleotides long overlapping sections, there is no need to search for a higher scoring alignment within sequence b left shifted overlapping sections with lengths less than $[f_{max}/\mu]+1$.

Proposition 2: An alignment with a score higher than $f_{max}$ could be found within sequence b right shifted overlapping sections with lengths ranging between $m-1$ and $[f_{max}/\mu]+1$, including those values, if $f_{max}$ is the score of the optimal (highest scoring) alignment found after examining alignments within m nucleotides long overlapping sections and sequence b left shifted overlapping sections, according to the fast local alignment generating methodology.

The proof of Proposition 2 is analogous to the proof of Proposition 1.

While searching for the optimal un-gapped local alignment within m nucleotides long overlapping sections and left and right shifted overlapping sections, a data vector identifying the current un-gapped local alignment with the highest score is kept in memory. The vector's content is updated dynamically whenever a new un-gapped local alignment is found that scores higher than the current highest one. The last update of this vector identifies the optimal un-gapped local alignment.

An example

The fast local alignment generating methodology will be demonstrated on a concrete example, taking the nucleotide sequences a: TGCTAACTTTGATTGCCTA and b: TGAATCCCTTGAATGAAC as samples. Since the length of sequence a is 19, while the length of sequence b is 18, sequence b overlaps n-m+1 = 19-18+1 = 2 different sections of sequence a with length 18. Alignments within the overlapping sections are generated according to the fast local alignment generating methodology (Table 2), awarding +2 for each nucleotide match, while penalizing each nucleotide mismatch with -1.

Table 2. Examining alignments within 18 nucleotides long overlapping sections

First 18 nucleotides long overlapping sections:
    a: TGCTAACTTTGATTGCCTA
    b: TGAATCCCTTGAATGAAC
    Matching regions found / region's score:
        $R_1:(0,0,2)$ / $f(R_1) = 4$
        $R_2:(6,6,1)$ / $f(R_2) = 2$
        $R_3:(8,8,4)$ / $f(R_3) = 8$
        $R_4:(13,13,2)$ / $f(R_4) = 4$
    Local alignment found according to the fast local alignment generating methodology / alignment's score:
        CTTTGATTG
        CCTTGAATG
        $f(A: R_2R_3R_4) = 12$

Second 18 nucleotides long overlapping sections:
    a: TGCTAACTTTGATTGCCTA
    b:  TGAATCCCTTGAATGAAC
    Matching regions found / region's score:
        $R_1:(3,4,1)$ / $f(R_1) = 2$
        $R_2:(5,6,1)$ / $f(R_2) = 2$
        $R_3:(8,9,1)$ / $f(R_3) = 2$
    Local alignment found according to the fast local alignment generating methodology / alignment's score:
        AAC
        ATC
        $f(A: R_1R_2) = 3$

When comparing parallel nucleotides within the first overlapping sections, four matching regions are found. There are two matching nucleotides within the first region, one within the second region, four within the third region, and two within the fourth region. The third matching region is the longest one, so it is initially taken as both the extending and the local alignment: $A_e \leftarrow R_3$, $A \leftarrow R_3$. $A_e$ is left extended, $A_e \leftarrow R_2 \bowtie A_e : R_2R_3$, resulting in an alignment with score 9. Since the extended alignment's score is higher than the score of the local alignment, the local alignment $A$ is updated with $A_e$, $A \leftarrow A_e : R_2R_3$. Further left positioned extension of $A_e$ results in the alignment $A_e \leftarrow R_1 \bowtie A_e : R_1R_2R_3$, with score 9. The extended alignment's score now equals the local alignment's score, so the local alignment and its score do not change. The optimal left extended alignment with respect to the longest matching region is $A: R_2R_3$. This alignment is now taken as the extending alignment, subject to right positioned extension, $A_e \leftarrow A : R_2R_3$. After appending $R_4$, the alignment $A_e \leftarrow A_e \bowtie R_4 : R_2R_3R_4$, with score 12, is obtained. The extended alignment's score is higher than the score of the local alignment $A$, so the local alignment is updated with $A_e$, $A \leftarrow A_e : R_2R_3R_4$.

Within the second overlapping sections, three matching regions, each with one matching nucleotide, are found. The local alignment according to the fast local alignment generating methodology is $A: R_1R_2$, with score 3. The local alignment within the second overlapping sections does not score higher than the local alignment found within the first overlapping sections, so we can conclude that the optimal un-gapped local alignment found within 18 nucleotides long overlapping sections is $A: R_2R_3R_4 = (6, 6, 1, 8, 8, 4, 13, 13, 2)$, with score 12.

According to Proposition 1, an alignment with a score higher than 12 might exist within left shifted overlapping sections with lengths between 7 and 17. Examining alignments within left shifted overlapping sections according to the fast local alignment generating methodology, no alignment with a higher score is found. Finally, according to Proposition 2, alignments within right shifted overlapping sections with lengths ranging between 7 and 17 are examined. Since no alignment with a higher score is found within those overlapping sections either, the un-gapped local alignment found within the first 18 nucleotides long overlapping sections, $A: R_2R_3R_4 = (6, 6, 1, 8, 8, 4, 13, 13, 2)$, is the optimal one.

Table 3. Sequence b left and right shifted overlapping sections

Sequence b left shifted overlapping sections:
     TGCTAACTTTGATTGCCTA
    TGAATCCCTTGAATGAAC

      TGCTAACTTTGATTGCCTA
    TGAATCCCTTGAATGAAC

Sequence b right shifted overlapping sections:
    TGCTAACTTTGATTGCCTA
      TGAATCCCTTGAATGAAC

    TGCTAACTTTGATTGCCTA
       TGAATCCCTTGAATGAAC
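The extension strategy and the worked example above can be sketched in C++ as follows. This is an illustrative sketch, not the authors' implementation: the names `Region`, `Alignment` and `flagAlign` are hypothetical, the penalty `delta` is assumed to be passed as a positive magnitude that is subtracted, and the three cases of the strategy are folded into one left pass followed by one right pass (for $\varsigma = 1$ or $\varsigma = \chi$ one of the passes is simply empty). Ties for the longest region are resolved in favour of the first one, which is what the second row of Table 2 suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Matching region as a triple (start in b, start in a, length), 0-based.
struct Region { long pb; long pa; long len; };

// An alignment is a run of consecutive regions r[first..last] plus its score.
struct Alignment { std::size_t first; std::size_t last; long score; };

// Number of mismatching nucleotides separating two neighbouring regions
// of the same overlapping section.
static long gapBetween(const Region& left, const Region& right) {
    return right.pa - (left.pa + left.len);
}

// Sketch of the FLAG strategy for one overlapping section.
// mu: score per match, delta: penalty per mismatch (positive, subtracted).
Alignment flagAlign(const std::vector<Region>& r, long mu, long delta) {
    assert(!r.empty());
    // Longest matching region; the first one wins on ties.
    std::size_t longest = 0;
    for (std::size_t i = 1; i < r.size(); ++i)
        if (r[i].len > r[longest].len) longest = i;

    Alignment best{longest, longest, mu * r[longest].len};

    // Left extension: prepend regions one by one, keep the best prefix seen.
    long running = best.score;
    for (std::size_t first = longest; first-- > 0;) {
        running += mu * r[first].len - delta * gapBetween(r[first], r[first + 1]);
        if (running > best.score) best = Alignment{first, longest, running};
    }

    // Right extension restarts from the best left-extended alignment.
    running = best.score;
    const std::size_t left = best.first;
    for (std::size_t last = longest + 1; last < r.size(); ++last) {
        running += mu * r[last].len - delta * gapBetween(r[last - 1], r[last]);
        if (running > best.score) best = Alignment{left, last, running};
    }
    return best;
}
```

Because the score is updated incrementally, each region is visited once, so the work per overlapping section is linear in the number of matching regions. Called with the four regions of the first overlapping section and $\mu = 2$, $\delta = 1$, the sketch selects regions with indices 1 to 3 (that is, $R_2R_3R_4$) with score 12.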

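As a worked instance of Propositions 1 and 2 for the sample sequences, using only values already given in the example ($f_{max} = 12$, $\mu = 2$, $m = 18$), the lengths of shifted overlapping sections that still need to be examined follow directly from the bound:

```latex
l\,\mu > f_{max}
\;\Longrightarrow\;
l > \frac{f_{max}}{\mu} = \frac{12}{2} = 6
\;\Longrightarrow\;
\left[\frac{f_{max}}{\mu}\right] + 1 = 7 \;\le\; l \;\le\; m - 1 = 17
```

Shifted sections with six or fewer nucleotides can score at most $6 \cdot 2 = 12$, which does not exceed $f_{max}$, so they can be skipped.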
Results and Discussion

An implementation

The fast local alignment generating methodology has been implemented in C++. While searching for the optimal solution, two data vectors whose content changes dynamically are kept in memory during execution: a data vector holding the set of triples that identify the matching regions found within the current overlapping sections, and a data vector holding the set of triples that identify the un-gapped local alignment with the highest score found so far. As previously explained, each triple is a unique identifier of a matching region, holding the region's length and its starting positions in the two sequences. For each set of matching regions found within the current overlapping sections, the fast local alignment generating function FLAG is called, generating as output an optimal un-gapped local alignment with respect to the longest local matching region. Afterwards, the alignment's score is compared with the score of the optimal alignment found so far. If a higher scoring alignment is found, the optimal alignment and its score are updated.

function FLAG(input: R_1, R_2, ..., R_{χ-1}, R_χ; output: A)
    if (χ != 0)
        if (χ == 1)
            score ← μ · length(R_1)
            A ← R_1
        else
            Find R_ς = max len{R_1, R_2, ..., R_{χ-1}, R_χ}
            A ← R_ς
            A_e ← R_ς
            score ← μ · length(A)
            if (ς == 1)
                for (ξ = 2; ξ <= χ; ξ++)
                    A_e ← A_e ⋈ R_ξ
                    if (f(A_e) > score)
                        score ← f(A_e)
                        A ← A_e
            else if (ς == χ)
                for (ξ = χ-1; ξ >= 1; ξ--)
                    A_e ← R_ξ ⋈ A_e
                    if (f(A_e) > score)
                        score ← f(A_e)
                        A ← A_e
            else
                for (ξ = ς-1; ξ >= 1; ξ--)
                    A_e ← R_ξ ⋈ A_e
                    if (f(A_e) > score)
                        score ← f(A_e)
                        A ← A_e
                A_e ← A
                for (ξ = ς+1; ξ <= χ; ξ++)
                    A_e ← A_e ⋈ R_ξ
                    if (f(A_e) > score)
                        score ← f(A_e)
                        A ← A_e

Test results

The implementation's running time has been measured by aligning ten pairs of nucleotide sequences of different lengths on a Fujitsu computer with a Core(TM) 2 Duo CPU at 2.67 GHz and 2 GB RAM. A score metric awarding +2 for each nucleotide match, while penalizing each mismatch with -1, has been used. Similar sequences of approximately the same size have been aligned. According to the results presented in Table 4, the implementation's running time grows linearly, O(m), when aligning similar sequences of approximately the same size. Following our previous space efficient implementation [6], two data vectors (the set of triples identifying the matching regions found within overlapping sections, and the current optimal alignment) are kept in memory during execution, resulting in linear space complexity O(m).

Table 4. Implementation's running time

Columns: sequence a | sequence a's length l_a | sequence b | sequence b's length l_b | running time t (sec)

Columnea latent viroid RNA | 374 | Columnea latent viroid clone -6 | 37 | 47
Cherry chlorotic rusty spot associated small satellite-like dsRNA B | 69 | Cherry chlorotic rusty spot associated small satellite-like dsRNA C | 66 | 25
Ageratum leaf curl Cameroon betasatellite, isolate SatB6 | 38 | Ageratum leaf curl Cameroon betasatellite, isolate SatB4 | 379 | 343
Cyclovirus Chimp | 75 | Cyclovirus Chimp2 | 747 | 28
Stachytarpheta leaf curl virus - [Hn6] | 275 | Stachytarpheta leaf curl virus - [Hn54] | 2748 | 437
Adeno-associated virus 3 | 4726 | Adeno-associated virus 3B | 4722 | 29
Mouse parvovirus 4 | 48 | Mouse parvovirus 4b | 4794 | 72
Banana streak IM virus | 7769 | Banana streak Imove virus strain IRFA9 | 7768 | 685
Gremmeniella abietina type B RNA virus XL | 375 | Gremmeniella abietina type B RNA virus XL2 | 374 | 35
O'nyong-nyong virus strain SG65 | 822 | Igbo Ora virus strain IBH964 | 82 | 2729

Figure 1. Columnea latent viroid RNA and Columnea latent viroid clone -6 alignment

Given the data set $S = \langle la_i, lb_i, t_i \rangle$, $i = 1, \dots, 10$, obtained during the experimental evaluation, Principal Components Analysis (PCA) was used to fit a linear regression that minimizes the perpendicular distances from the data to the fitted line. This problem is equivalent to searching for the linear sub-space which maximizes the variance of the projected points, the latter being obtained by eigen decomposition of the covariance matrix. Eigenvectors corresponding to large eigenvalues are the directions in which the data has a strong component, or equivalently a large variance. PCA finds an orthogonal basis that best represents the given data set. In our case the fitted line can be described with the following equation:

$$r = n + t \cdot p$$

where n = [7738;7775;852933e-4] is the point on the fitted line, p = [46356e+3; 46329e+3; 42] is the line direction vector, and $t \in R$. The fitted line, together with the orthogonal distances from each point to the line, is shown in Fig. 2. It shows a linear dependency of the aligning time on sequence length for similar nucleotide sequences.
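The orthogonal line fit described above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for the measurements: the names `Vec3`, `Line3` and `fitLinePCA` are hypothetical, the dominant eigenvector of the covariance matrix is approximated with plain power iteration instead of a full eigen decomposition, and the starting vector (1, 1, 1) is assumed not to be orthogonal to the dominant direction, which holds when lengths and running time grow together.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;  // one observation (l_a, l_b, t)

struct Line3 {
    Vec3 point;      // n: centroid of the data, which lies on the fitted line
    Vec3 direction;  // p: unit eigenvector of the largest eigenvalue
};

// Fits r = n + t * p by PCA: the line passes through the centroid and points
// along the direction of largest variance of the data.
Line3 fitLinePCA(const std::vector<Vec3>& data, int iterations = 200) {
    assert(data.size() >= 2);
    const double count = static_cast<double>(data.size());

    Vec3 mean{0.0, 0.0, 0.0};
    for (const Vec3& x : data)
        for (int k = 0; k < 3; ++k) mean[k] += x[k] / count;

    // Covariance matrix of the centred observations.
    double cov[3][3] = {};
    for (const Vec3& x : data)
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                cov[r][c] += (x[r] - mean[r]) * (x[c] - mean[c]) / count;

    // Power iteration: repeatedly multiply by cov and renormalise.
    Vec3 p{1.0, 1.0, 1.0};
    for (int it = 0; it < iterations; ++it) {
        Vec3 q{0.0, 0.0, 0.0};
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c) q[r] += cov[r][c] * p[c];
        const double norm = std::sqrt(q[0] * q[0] + q[1] * q[1] + q[2] * q[2]);
        if (norm == 0.0) break;  // degenerate input, keep the current vector
        for (int k = 0; k < 3; ++k) p[k] = q[k] / norm;
    }
    return Line3{mean, p};
}
```

The sketch is exercised only on synthetic points; no values from Table 4 are assumed. For points lying exactly on a line, the returned direction is parallel to that line and the returned point is their centroid.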

Figure 2. Fitted line with the orthogonal distances from each point to the line (axes: l_a, l_b, t)

Conclusions

A linear time and space alignment technique has been presented, applicable to similar nucleotide sequences. The time complexity improvement is due to the methodology according to which un-gapped local alignments are generated within overlapping sections, and to the reduced alignment search space. Also, the space complexity remains linear, based on the space efficient representation of matching regions.

References

1. T. SMITH, M. WATERMAN, Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197 (1981).
2. M. WATERMAN, M. EGGERT, A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. Journal of Molecular Biology, 197, 723-728 (1987).
3. X. HUANG, W. MILLER, A time-efficient, linear-space local similarity algorithm. Advances in Applied Mathematics, 12, 337-357 (1991).
4. D. LIPMAN, W. PEARSON, Rapid and sensitive protein similarity searches. Science, 227(4693), 1435-1441 (1985).
5. S. ALTSCHUL, W. GISH, W. MILLER, E. MYERS, D. LIPMAN, Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410 (1990).
6. D. STOJANOV, A. MILEVA, S. KOCESKI, A new, space-efficient local pairwise alignment methodology. Advanced Studies in Biology, 4(2), 85-93 (2012).