Improving an Algorithm for Approximate Pattern Matching. University of Chile. Blanco Encalada Santiago - Chile.


Gonzalo Navarro    Ricardo Baeza-Yates
Department of Computer Science, University of Chile
Blanco Encalada - Santiago - Chile
{gnavarro,rbaeza}@dcc.uchile.cl

This work has been supported in part by FONDECYT grant 96.

Abstract

We study a recent algorithm for fast on-line approximate string matching. This is the problem of searching a pattern in a text allowing errors in the pattern or in the text. The algorithm is based on a very fast kernel which is able to search short patterns using a nondeterministic finite automaton, which is simulated using bit-parallelism. A number of techniques to extend this kernel for longer patterns are presented in that work. However, the techniques can be integrated in many ways and the optimal interplay among them is by no means obvious. The solution to this problem starts at a very low level, by obtaining basic probabilistic information about the problem which was not previously known, and ends integrating analytical results with empirical data to obtain the optimal heuristic. The conclusions obtained via analysis are experimentally confirmed. We also improve many of the techniques and obtain a combined heuristic which is faster than the original work. This work shows an excellent example of a complex and theoretical analysis of algorithms used for design and for practical algorithm engineering, instead of the common practice of first designing an algorithm and then analyzing it.

1 Introduction

Approximate string matching is one of the main problems in classical string algorithms, with applications to text searching, computational biology, pattern recognition, etc. The problem can be formally stated as follows: given a (long) text of length $n$, a (short) pattern of length $m$, and a maximal number of errors allowed $k$, find all segments (called "occurrences" or "matches") whose edit distance to the pattern is at most $k$. Text and pattern are sequences of characters from an alphabet of size $\sigma$. We call $\alpha = k/m$ the error ratio or error level. The edit distance between two strings $a$ and $b$ is the minimum number of edit operations needed to transform $a$ into $b$. The allowed edit operations are deleting, inserting and replacing a character. Therefore, the problem is non-trivial for $0 < k < m$, i.e. $0 < \alpha < 1$.

The solutions to this problem differ if the algorithm is on-line (that is, the text is not known in advance) or off-line (the text can be preprocessed). In this work we focus on on-line algorithms, where the classical solution, involving dynamic programming, is $O(mn)$ time [, ].
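The classical $O(mn)$ dynamic programming solution can be sketched as follows. This is a minimal illustration written for this text, not the authors' code; the function name `dp_search` and the convention of reporting 0-based end positions of the matches are assumptions.

```python
def dp_search(pattern, text, k):
    """Return the 0-based text positions where a match with at most k errors ends."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and some
    # text segment ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        diag = col[0]  # value of col[i-1] before this text character
        # col[0] stays 0: a match may start at any text position.
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != ch),  # match or replacement
                         old + 1,                        # text character not in the pattern
                         col[i - 1] + 1)                 # pattern character not in the text
            diag = old
        if col[m] <= k:
            ends.append(pos)
    return ends
```

Each text character costs $O(m)$ operations, which gives the $O(mn)$ total mentioned above.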

In the last years several algorithms have been presented that achieve $O(kn)$ comparisons in the worst case [8,,, ] or in the average case [9,, 8], by taking advantage of the properties of the dynamic programming matrix (e.g. values in neighbor cells differ at most in one). The best average complexity achieved under this approach is $O(kn/\sqrt{\sigma})$ [8].

Other approaches attempt to filter the text, reducing the area in which dynamic programming needs to be used [6, 3,,, 9,, 8, 7, ]. The filtration is based on the fact that some portions of the pattern must appear with no errors even in an approximate occurrence. These algorithms achieve "sublinear" expected time in many cases for low error ratios (i.e. not all text characters are read, $O(kn \log_\sigma m / m)$ is a typical figure), but the filtration is not effective for larger ratios, and some algorithms are not practical if $m$ is not very large.

In [9], the use of a deterministic finite automaton (DFA) which recognizes the approximate occurrences of the pattern in the text is proposed. Although the search phase is $O(n)$, the DFA can be huge. In [9, 3] the automaton is computed in lazy form (i.e. only the states actually reached in the text are generated).

Yet other approaches use bit-parallelism [,, 33]. This technique simulates parallelism on a sequential processor using bit operations. It takes advantage of the fact that the processor operates on all the bits of the computer word in parallel. In a RAM machine of word length $w = \Theta(\log n)$ bits, this can reduce the number of real operations by a factor of $O(1/w)$. In [3] the cells of the dynamic programming matrix are packed in diagonals to achieve $O(mn \log(\sigma)/w)$ time complexity. In [3] a Four Russians approach is used to pack the matrix in machine words (they end up in fact with a DFA where they can trade the number of states for their internal complexity). In [33], a non-deterministic finite automaton (NFA) that recognizes the approximate occurrences of the pattern is used, which has only a few states and a regular structure. They achieve $O(kmn/w)$ time by parallelizing in bits the work of such automaton.

A recent work in this trend is [7], which parallelizes the dynamic programming algorithm to obtain $O(mn/w)$ cost in the worst case and $O(kn/w)$ on average. In [3, ] we proposed a new algorithm based on the bit-parallel simulation of the same NFA of [33]. The simulation, however, is completely different and a core algorithm which is $O(n)$ time for small patterns (independently of $k$) was obtained. The algorithm is the fastest one in that case. A number of techniques to extend that algorithm for longer patterns were shown:

- Automaton partitioning simulates the NFA using many computer words, trying not to work on inactive portions of the automaton.
- Pattern partitioning cuts the pattern in pieces and searches all of them with fewer errors, building up the occurrences of the complete pattern from the matches of the pieces.
- Superimposition searches many pieces using a single automaton which serves as a filter for the multipattern search. Its matches have to be verified to check which piece (if any) actually matched.

However, those techniques can be combined in non-trivial ways and their optimal interaction is not obvious and was not obtained in [3, ]. This optimal arrangement depends on the parameters of the search problem, the most important of which are:

- The error level tolerated ($\alpha$). In general, filtering techniques (such as pattern partitioning or superimposition) work well only for moderately low error levels. As we see in Section 3, the behavior of the problem changes drastically depending on the error level to tolerate.
- The pattern length ($m$). Our basic techniques work for short patterns and we need to extend them to work on longer ones.
- The alphabet size ($\sigma$). The larger the alphabet, the less probable it is that two random strings match, which improves the efficiency of filtration algorithms.
- The length in bits of the computer word ($w$). We simulate automata using the bits of computer words, so the longer it is, the longer patterns can be accommodated.

In this paper we study in depth the techniques, finding out the range of parameters where each one can be applied and the optimal way to combine them. The analysis starts at a very low level, finding the probability of an approximate occurrence. Each technique is analytically and experimentally studied to determine its expected behavior, and then optimized. At the end, we combine optimally the techniques. Throughout the work, experimental validation of the analytical results is provided and sometimes used as part of the heuristic results.

As a separate contribution obtained in part thanks to the analysis, we improve many of the techniques themselves. Those improvements translate into better execution times in the combined algorithm which were not possible to obtain using the original techniques, even using them in the optimal way. Moreover, due to the new techniques, most previous analyses are significantly modified.

As a side effect, we show the interplay between algorithm analysis and design and the feedback between theory and practice. We highlight the use of analysis for design as well as experimental results for design.

The main improvements obtained over our previous work of [3, ] are:

- We use our algorithms to verify potential matches instead of relying on classical dynamic programming. This improves the algorithms for high error levels when the patterns are not very long.
- We use a new technique to verify potential matches which is capable of early discarding uninteresting candidates. This makes all our algorithms more resistant to the error level, which translates into better execution times even for low error levels.
- We improve the search time when the pattern is long and the error level is not very high, by improving the register usage of the algorithm. This doubles the searching performance in some cases. More importantly, it allows to keep the same performance of the core algorithm in patterns up to 8 times longer.
- We improve search times for high error levels in up to % by using code that assumes a high error level. This is achieved by avoiding some expensive bookkeeping which pays off only for low error levels.

- We obtain a heuristic to automatically combine the techniques in an optimal way, based on analytical and empirical results.

A summary of the organization and results presented in the paper follows. In Section 2 we explain the main features of the core algorithm presented in [3, ]. In Section 3 we study the main statistics of the problem that drive all the average-case analysis that follows. We consider specifically the probability of matching when errors are allowed and the portion of the NFA which is active on average. In Section 4 we present the automaton partitioning technique, explain the general idea, optimize the way to partition the NFA and present new techniques to improve register usage. In Section 5 we explain the pattern partitioning technique, optimize the partitioning scheme and present a new technique to integrate the matches of the pattern pieces. In Section 6 we describe the technique of superimposing automata and show how to optimize the amount of superimposition. In Section 7 we build the complete heuristic based on the previous results, finding the optimal way to combine the above techniques. In Section 8 we compare experimentally the combined algorithm against the fastest algorithms we know. Finally, we present our conclusions and future work directions in Section 9. A compiled version of the complete algorithm is publicly available (see Section 7).

All the experimental results of this paper were obtained on a Sun UltraSparc-1 of 167 MHz running Solaris 2.5.1, with 64 Mb of RAM. This is a 32-bit machine, i.e. $w = 32$. All the times are measured in seconds of user (CPU) time per megabyte of text. Except otherwise stated, our experiments have a standard deviation of %.

2 A Bit-Parallel Core Algorithm

In this section we review the main points of the algorithm [3, ]. We refer the reader to the original articles for more details.

Consider the NFA for searching "patt" with at most $k = 2$ errors shown in Figure 1. Every row denotes the number of errors seen: the first one 0, the second one 1, and so on. Every column represents matching the pattern up to a given position. At each iteration, a new text character is considered and the automaton changes its states. Horizontal arrows represent matching a character (since we advance in the pattern and in the text, and they can only be followed if the corresponding match occurs), vertical arrows represent inserting a character in the pattern (since we advance in the text but not in the pattern, increasing the number of errors), solid diagonal arrows represent replacing a character (since we advance in the text and pattern, increasing the number of errors), and dashed diagonal arrows represent deleting a character of the pattern (they are empty transitions, since we delete the character from the pattern without advancing in the text, and increase the number of errors). Finally, the self-loop at the initial state allows to consider any character as a potential starting point of a match, and the automaton accepts a character (as the end of a match) whenever a rightmost state is active.

This NFA has $(m+1) \times (k+1)$ states. We assign number $(i, j)$ to the state at row $i$ and column $j$, where $i \in 0..k$, $j \in 0..m$. Initially, the active states at row $i$ are at the columns from 0 to $i$, to represent the deletion of the first $i$ characters of the pattern.
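To make the four kinds of arrows concrete, the automaton can be simulated directly with a set of active states. The sketch below is only an illustration under stated assumptions: it is not the bit-parallel simulation of the paper, the name `nfa_search` is hypothetical, and it starts from the initial state closed under empty transitions rather than from the whole lower-left triangle (those extra states should not change which text positions are reported).

```python
def nfa_search(pattern, text, k):
    """Return the 0-based text positions where the NFA accepts (a state of column m is active)."""
    m = len(pattern)

    def closure(states):
        # Dashed diagonal arrows: delete a pattern character without reading text.
        result = set(states)
        stack = list(states)
        while stack:
            i, j = stack.pop()
            if i < k and j < m and (i + 1, j + 1) not in result:
                result.add((i + 1, j + 1))
                stack.append((i + 1, j + 1))
        return result

    active = closure({(0, 0)})
    ends = []
    for pos, ch in enumerate(text):
        new = {(0, 0)}                       # self-loop at the initial state
        for i, j in active:
            if j < m and pattern[j] == ch:
                new.add((i, j + 1))          # horizontal arrow: match
            if i < k:
                new.add((i + 1, j))          # vertical arrow: insertion
                if j < m:
                    new.add((i + 1, j + 1))  # solid diagonal arrow: replacement
        active = closure(new)
        if any(j == m for _, j in active):
            ends.append(pos)
    return ends
```

This set-based version costs up to $O(mk)$ per text character; the point of the bit-parallel simulation is to pack these states in machine words so that the whole update takes a constant number of word operations.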

Figure 1: An example NFA for approximate string matching, for the pattern "patt" with $k = 2$ errors; its rows correspond to no errors, 1 error and 2 errors. The states that are active after processing a sample text are marked, besides those always active of the lower-left triangle. We enclose in dotted lines the states actually represented by the algorithm.

Many algorithms for approximate string matching consist fundamentally in simulating this automaton by rows or columns. The dependencies introduced by the diagonal empty transitions prevented the bit-parallel computation of the new values. In [3, ] we have shown that by simulating the automaton by diagonals (i.e. packing the bits of the diagonals in machine words), it is possible to compute all values in parallel (using bit-parallelism). Hence, when all the bits to represent fit in a computer word, the parallel update formula for each new text character read is $O(1)$ cost and very fast in practice.

For this simulation, it suffices to represent only the complete diagonals of the automaton (excluding the first one). The total number of bits needed to represent the automaton is $(m-k)(k+2)$. If we call $w$ the number of bits in the computer word, the core algorithm is $O(n)$ time in the worst and average case whenever $(m-k)(k+2) \le w$.

A central part of the algorithm is the definition of an $m$ bits long mask $t[c]$, representing match or mismatch against the pattern for each character $c$. That is, if the pattern is $pat = p_1 \ldots p_m$ (with $p_i \in \Sigma$), then $t[c] = b_1 \ldots b_m$, where the bit $b_i$ is set whenever $p_i \ne c$.

This $t[\,]$ table is similar to that used in the Shift-Or algorithm for exact string matching [], and it allows more sophisticated searching: at each position of the pattern, we can allow not only a single character, but a class of characters, at no additional search cost. This is expressed as an extended pattern, which has a set of characters at each position, i.e. it belongs to $\mathcal{P}(\Sigma)^m$ instead of $\Sigma^m$. Those patterns are denoted as $pat = C_1 \ldots C_m$, where $C_i \subseteq \Sigma$. To search an extended pattern it suffices to set $t[c]$ to "match" at position $i$ for every $c \in C_i$. For example, we can search case-insensitively by allowing each position to match the upper-case and lower-case versions of the letter. We show later other applications of this ability for our purposes.
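The mask table for an extended pattern can be built as in the following sketch. The function names and the bit order (bit $i-1$ of the integer stands for pattern position $i$) are assumptions made for the illustration; only the convention that a set bit means "mismatch" comes from the text.

```python
def build_masks(classes):
    """classes: one set of accepted characters per pattern position.

    Returns (masks, default): masks[c] has bit i set when position i does
    NOT accept c; default is the mask for characters not in any class.
    """
    m = len(classes)
    default = (1 << m) - 1          # every position mismatches
    masks = {}
    for i, cls in enumerate(classes):
        for c in cls:
            # Clear bit i: position i accepts character c.
            masks[c] = masks.get(c, default) & ~(1 << i)
    return masks, default


def case_insensitive(pattern):
    """Extended pattern accepting both cases of each letter."""
    return [{ch.lower(), ch.upper()} for ch in pattern]
```

For instance, `build_masks(case_insensitive("ab"))` gives the same mask to `'a'` and `'A'`, so the search cost is the same as for the plain pattern.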

A separate speedup technique considers that any occurrence of the pattern must begin with one of its $k+1$ initial letters (otherwise we will spend more than $k$ errors in inserting those letters). We can therefore traverse the text with a fast search for one of those characters, and start the automaton only when we find one. The filtering is resumed when the automaton runs out of active states. As shown in [] this can double in practice the performance of the automaton for low error levels, although its performance (and even its convenience) depends on $\alpha$, $\sigma$ and $m$. This technique can be used for other algorithms as well, as long as they are slower than the search for one character (which rules out most filtration algorithms), and they can easily be restarted and terminated.

3 The Statistics of the Problem

As we see later, a number of techniques to handle a long pattern rely on searching pieces of the pattern or even more complex constructions, and then performing a costly verification step each time a piece is found. Therefore, for an average case analysis and to compare different heuristics, it is essential to determine the probability of finding a pattern in a text position allowing errors.

Another statistical information which is necessary for our average-case analysis is related to which portion of the NFA has active states, as our algorithms try to simulate only the active portion of the NFA.

In all the average-case analysis of this paper we assume that the patterns are not extended. An easy way to consider extended patterns is to replace $\sigma$ by $\sigma/s$ in all the formulas, where $s$ is the size of the $C_i$ sets corresponding to pattern positions. This is because the probability of crossing a horizontal edge of the automaton is not $1/\sigma$ anymore, but $s/\sigma$.

3.1 Probability of Matching

Given a pattern of length $m$ which is searched in a text, both pattern and text being random sequences over an alphabet of size $\sigma$ (the letters are selected with uniform probability), we want to find the probability $f(m,k)$ of a match with $k$ errors or less at a given text position. Recall that we use $\alpha = k/m$.

As we show shortly, this probability grows very abruptly as a function of $\alpha$, being exponentially decreasing with $m$ for small $\alpha$. The importance of being exponentially decreasing with $m$ is that the cost to verify a text position is $O(m^2)$, and therefore if that event occurs with probability $O(\gamma^m)$ for some $\gamma < 1$ then the total cost of verifications is $O(m^2 \gamma^m) = o(1)$, which makes the verification cost negligible. On the other hand, as soon as the cost ceases to be exponentially decreasing it begins to be at least $1/m$, which yields a total verification cost of $O(mn)$. This is the same cost of plain dynamic programming.

In [3, ] it is shown that $f(m,k) \le \gamma^m$, where

$$\gamma = \left(\frac{1}{\sigma\,\alpha^{\frac{2\alpha}{1-\alpha}}\,(1-\alpha)^2}\right)^{1-\alpha} \le \left(\frac{e^2}{\sigma\,(1-\alpha)^2}\right)^{1-\alpha} \qquad (1)$$

and therefore $f(m,k)$ is exponentially decreasing with $m$ whenever $\gamma < 1$, i.e.

$$\alpha < 1 - \frac{e}{\sqrt{\sigma}} \qquad (2)$$

On the other hand, the only optimistic bound we can prove is based on considering that only replacements are allowed (i.e. Hamming distance). In this case, given a pattern of length $m$, the number of strings that are at distance $i$ from it is obtained by considering that we can freely determine the $i$ places of mismatch, and at those places we can put any character except that of the pattern, i.e.

$$\binom{m}{i}(\sigma-1)^i = \binom{m}{i}\,\sigma^i\,(1+O(1/\sigma))$$

Although we should sum the above probabilities for $i$ from zero to $k$, we use the largest $i = k$ as a (tight) lower bound. Hence, the probability of matching is obtained by dividing the above formula (with $i = k$) by $\sigma^m$ (the total number of possible text windows of length $m$), to obtain

$$f(m,k) \ge \binom{m}{k}\frac{\sigma^k}{\sigma^m}\,(1+O(1/\sigma)) = \frac{m^m}{k^k\,(m-k)^{m-k}\,\sigma^{m-k}\,\sqrt{m}}\;\Theta(1) = \Theta(m^{-1/2}) \left(\frac{1}{\sigma\,\alpha^{\frac{\alpha}{1-\alpha}}\,(1-\alpha)}\right)^{(1-\alpha)m}$$

(where we used Stirling's approximation to the factorial). Since $e^{-1} \le \alpha^{\alpha/(1-\alpha)} \le 1$, the above expression can be lower bounded by $f(m,k) \ge \delta^m m^{-1/2}$, where

$$\delta = \left(\frac{1}{(1-\alpha)\,\sigma}\right)^{1-\alpha}$$

Therefore an upper bound for the maximum allowed value for $\alpha$ is $\alpha \le 1 - 1/\sigma$, since otherwise we can prove that $f(m,k)$ is not exponentially decreasing on $m$ (i.e. it is $\Omega(m^{-1/2})$).

Hence, the limit $\alpha < 1 - e/\sqrt{\sigma}$ corresponds to the maximum error level up to where we can prove that the algorithms based on filtration can work well, and we can prove that they cannot work well for $\alpha > 1 - 1/\sigma$.

We now verify this analysis empirically. The experiment we performed consists of generating a large random text and running the search of a random pattern on that text, with $k = m$ errors. At each text character, we record the minimum $k$ for which that position would match the pattern. Finally, we analyze the histogram, finding how many text positions matched for each $k$ value. We consider that $k$ is safe up to where the histogram values become significant. The threshold is set to $n/m^2$, since $m^2$ is the cost to verify a match. However, the selection of this threshold is not very important, since the histogram is extremely concentrated. For example, it has five or six significant values for $m$ in the hundreds.

Figure 2 shows the results for $m$ = 3. The curve $\alpha = 1 - 1/\sqrt{\sigma}$ is included to show its closeness to the experimental data. Least squares give the approximation $\alpha = 1 - 1.09/\sqrt{\sigma}$, with a relative error smaller than %.

Figure 3 validates other theoretical assumptions. On the left we show that the matching probability undergoes a sharp increase at the limit value of $\alpha$. On the right we show that this point is essentially independent of $m$. Notice, however, that our assumptions are a bit optimistic since for short patterns the matching probability is somewhat higher.
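The experiment just described can be reproduced at a small scale with the sketch below. It is an illustration only: the function names, the use of integer symbols for the alphabet, the fixed random seed and the reading of "significant" as "a single histogram value above $n/m^2$" are assumptions, and plain dynamic programming is used to obtain, at each text position, the smallest $k$ for which that position would match.

```python
import math
import random
from collections import Counter


def min_errors_histogram(m, n, sigma, seed=0):
    """Histogram of the minimum k needed to match at each of n random text positions."""
    rng = random.Random(seed)
    pattern = [rng.randrange(sigma) for _ in range(m)]
    col = list(range(m + 1))
    hist = Counter()
    for _ in range(n):
        ch = rng.randrange(sigma)          # next random text character
        diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != ch), old + 1, col[i - 1] + 1)
            diag = old
        hist[col[m]] += 1                  # smallest k for which this position matches
    return hist


def largest_safe_k(hist, m, n):
    """Largest k before the histogram value exceeds the threshold n / m^2."""
    threshold = n / m ** 2
    for k in range(m + 1):
        if hist.get(k, 0) > threshold:
            return k - 1
    return m


def theoretical_limit(sigma):
    """Error level of Eq. (2): 1 - e / sqrt(sigma)."""
    return 1 - math.e / math.sqrt(sigma)
```

Dividing `largest_safe_k(...)` by `m` gives an empirical error level that can be compared against `theoretical_limit(sigma)` for different alphabet sizes.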

Figure 2: Theoretical and practical bounds for $\alpha$ as a function of $\sigma$, for $m$ = 3. The curves shown are the upper bound $1 - 1/\sigma$, the curve $1 - 1/\sqrt{\sigma}$, the experimental data, the exact lower bound with $\gamma = 1$ (Eq. (1)) and the conservative lower bound of Eq. (2).

3.2 Active Columns

A question related to the previous one is: what is the average number of automaton columns which have active states? That is, if we call $c_r$ the smallest row with active states in column $r$ of our NFA, which is the largest $r$ satisfying $c_r \le k$? Those columns satisfying $c_r \le k$ are called active, and columns past the last active one need not be updated. Since our simulation avoids working on the inactive portions of the automaton, the question of the active columns is important for the average-case analysis of our algorithm (especially for partitioned automata).

Ours is not the first algorithm profiting from active columns. Ukkonen defined active columns in [9], and modified the dynamic programming algorithm so that it does not work past the last active column. The algorithm keeps track of the current last active column. At the end of each iteration this last column may increase in one (if a horizontal automaton arrow is crossed from the last active column to the next one), or may decrease in one or more (if the last active column runs out of active states, the next-to-last may be well before it). In this case the algorithm goes backward in the matrix looking for the new last active column.

Ukkonen conjectured that the last active column was $O(k)$ on average and therefore his algorithm was $O(kn)$ on average. However, this was proved much later by Chang and Lampe [8]. We found in [3, ] a tighter bound, namely

$$\frac{k}{1 - e/\sqrt{\sigma}} + O(1) \qquad (3)$$

which is $O(k)$. The $e$ of the formula has the same source as before and hence can be replaced by 1.09 in practice. By using least squares on experimental data we find that a very accurate formula is

$$\frac{0.9\,k}{1 - 1.09/\sqrt{\sigma}} \qquad (4)$$

with a relative error smaller than 3.%.

Figure 3: On the left, probability of an approximate match as a function of the error level ($m$ = 3). On the right, maximum allowed error level as a function of the pattern length. Both cases correspond to random text with $\sigma$ = 3.

Figure 4 (left side) shows the last active column for random patterns of length 3 on random text, for different values of $\sigma$. Given the strong linearity, we take a fixed value of $k$ and use least squares to find the slope of the curves. From that we obtain the 0.9 above. The right side of the figure shows the experimental data and the fitted curve. The results are the same for any $k$ less than $m(1 - 1.09/\sqrt{\sigma})$.

Figure 4: On the left, last active column for $\sigma$ = 2, 4, 8, 16, 32 and 64 (curves read from left to right). On the right, last active column for the fixed value of $k$, experimental (solid line) and theoretical (dashed line). We used $m$ = 3.
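The idea of not working past the last active column can be sketched on top of plain dynamic programming as follows. This is a minimal illustration of the cut-off heuristic described above, not the code of [9]; the function name, the 0-based end positions and the returned cell counter are assumptions, and $0 \le k < m$ is assumed.

```python
def cutoff_search(pattern, text, k):
    """Dynamic programming that only updates cells up to the last active one.

    Returns (ends, cells): the 0-based end positions of the matches and the
    total number of cells updated, to compare against the m * n of plain DP.
    """
    m = len(pattern)
    col = list(range(m + 1))   # cells beyond lact always hold values > k
    lact = k                   # last active cell: largest i with col[i] <= k
    ends = []
    cells = 0
    for pos, ch in enumerate(text):
        last = min(lact + 1, m)            # the last active cell can grow by one at most
        diag = col[0]
        for i in range(1, last + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != ch), old + 1, col[i - 1] + 1)
            diag = old
        cells += last
        lact = last
        while col[lact] > k:               # go backward to the new last active cell
            lact -= 1                      # stops at col[0] == 0 in the worst case
        if lact == m:
            ends.append(pos)
    return ends, cells
```

The backward loop is amortized: the last active cell grows by at most one per text character, so it cannot decrease more than $n$ times in total.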

4 Automaton Partitioning

This technique, presented in [3, ], is the simplest way to extend the algorithm to handle longer patterns. We first present the general method and then optimize it.

4.1 General Method

If the automaton does not fit in a single word, we just partition it using a number of machine words for the simulation. Those subautomata behave differently than the simple one, since they must communicate their first and last diagonals with their lateral neighbors, as well as propagate active states down to the cells below (see Figure 5). We say that the automaton is partitioned in $I \times J$ "cells", arranged in $I$ "d-rows" (set of rows packed in a cell) and $J$ "d-columns" (set of automaton diagonals packed in a cell).

Figure 5: A partitioned automaton with $I$ d-rows and $J = 3$ d-columns, where $\ell_c = 3$. We selected a cell (bold edges) and shaded all the nodes of other cells affected by it. The bold-edged cell must communicate with those neighbors that own shaded nodes.

Let us suppose first that $k$ is small and $m$ is large. Then, the automaton can be "horizontally" split in as many subautomata as necessary, each one holding a number of diagonals. Note that we need that at least one automaton diagonal fits in a single machine word, i.e. $k + 2 \le w$.

Suppose, on the other hand, that $k$ is close to $m$, so that the width $m - k$ is small. In this case, the automaton is not wide but tall, and a vertical partitioning becomes necessary. These subautomata must propagate the $\varepsilon$-transitions down to all subsequent subautomata. In this case, we need that at least one automaton row fits in a machine word, i.e. $2(m-k) \le w$ (the 2 is because we need an overflow bit for each diagonal of each cell).

When none of the two previous conditions hold, we need a generalized partition in d-rows and d-columns. We use $I$ d-rows and $J$ d-columns, so that each cell contains $\ell_r$ bits of each one of $\ell_c$ diagonals. It must hold that $(\ell_r + 1)\,\ell_c \le w$.

There are many options to pick $(I, J)$ for a given problem. The correct choice is a matter of optimization. If we divide the automaton in $I \times J$ subautomata ($I$ d-rows and $J$ d-columns), we must update $I$ cells at each d-column. However, we use a heuristic similar to [9] (i.e. not processing the $m$ columns but only up to the last active one), so we work only on active automaton diagonals (see Section 3.2). The expense of working on fewer d-columns is having to keep account of the possible variation of the last active column for each text character.

4.2 Theoretical Analysis

Since automaton partitioning gives us some freedom to arrange the cells, we find out now the best arrangement.

In Section 3.2 we obtained the expected value for the last active column in the automaton (Eq. (3)). This measures active columns and we work on active diagonals. To obtain the last active diagonal we subtract $k$, to obtain that on average we work on $k e/(\sqrt{\sigma} - e)$ diagonals. This is because the last active column depends on the error level $k$. Hence, at automaton row $i$ (where only $i$ errors are allowed) the last active column is $lcol(i) = i/(1 - e/\sqrt{\sigma})$. Hence, the last active column defines a diagonal line across the automaton whose slope is $1/(1 - e/\sqrt{\sigma})$. Figure 6 illustrates the situation. All the active states of the automaton are to the left of the dashed diagonal. The number of diagonals affected from the first one (thick line) to the dashed one is $k/(1 - e/\sqrt{\sigma}) - k$.

Figure 6: Converting active columns to active diagonals. The shaded area represents the active states of the automaton. The last active column at row $i$ is $lcol(i) = i/(1 - e/\sqrt{\sigma})$, and the last active diagonal is $k/(1 - e/\sqrt{\sigma}) - k$.

Since we pack $(m-k)/J$ diagonals in a single cell, we work on average on $\frac{k e}{\sqrt{\sigma} - e} \cdot \frac{J}{m-k}$ d-columns. Each d-column must work on all its $I$ cells. On the other hand, there are only $J$ d-columns. Hence our total complexity is

$$I\,J\,\min\left(1,\; \frac{k\,e}{(m-k)(\sqrt{\sigma} - e)}\right) n$$

which shows that any choice for $I$ and $J$ is the same for a fixed $IJ$. Since $IJ \approx (m-k)(k+2)/w$ (total number of bits to place divided by the size of the computer word), the final cost expression is independent (up to round-offs) of $I$ and $J$:

$$\min\left(m-k,\; \frac{k\,e}{\sqrt{\sigma} - e}\right) \frac{k+2}{w}\; n \qquad (5)$$
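As a quick way to evaluate Eq. (5), the following sketch computes the expected number of cells updated per text character. The function name is hypothetical, and the fallback to the worst case when $\sqrt{\sigma} \le e$ (where the bound on active diagonals does not apply) is an assumption of this illustration.

```python
import math


def expected_cells_per_char(m, k, sigma, w):
    """Expected number of automaton cells updated per text character, after Eq. (5)."""
    width = m - k                          # number of complete diagonals
    root = math.sqrt(sigma)
    if root > math.e:
        active = min(width, k * math.e / (root - math.e))
    else:
        active = width                     # assumed fallback: all diagonals are active
    return active * (k + 2) / w
```

For example, with $m = 60$, $k = 6$, $\sigma = 32$ and $w = 32$ the estimate is about 1.39 cells per character, while for $m = 20$, $k = 15$ the worst-case branch $(m-k)(k+2)/w$ applies.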

This formula has two parts. First, for $\alpha < 1 - e/\sqrt{\sigma}$, it is $O(k^2 n/(\sqrt{\sigma}\, w))$ time. Second, if the error ratio is high ($\alpha \ge 1 - e/\sqrt{\sigma}$), it is $O((m-k)kn/w)$. This last complexity is also the worst case of this algorithm. Recall that in practice the value $e$ should be replaced by 1.09 and the average number of active columns is that of Eq. (4).

4.3 Practical Tuning

Since the gross analysis does not give us any clue about which is the optimal selection for $I$ and $J$, we perform more detailed considerations.

The automaton is partitioned into a matrix of $I$ rows and $J$ columns, each cell being a small subautomaton that stores $\ell_r$ rows of $\ell_c$ diagonals of the complete automaton. Because of the nature of the update formula, we need to store $(\ell_r + 1)\,\ell_c$ bits for each subautomaton. Thus, the conditions to meet are

$$(\ell_r + 1)\,\ell_c \le w\,; \qquad I = \left\lceil \frac{k+1}{\ell_r} \right\rceil\,; \qquad J = \left\lceil \frac{m-k}{\ell_c} \right\rceil$$

Notice that in some configurations the cells are better occupied than in others, due to round-offs. That is, once we select $\ell_r$ and $\ell_c$, the best possible packing leaves some bits unused, namely $w - (\ell_r + 1)\,\ell_c$.

Given the freedom that the above conditions give us, we compare now the alternatives we have, to find out the best one. One could, in fact, try every $I$ and $J$ and pick the configuration with fewer cells. Since we work proportionally to the number of cells, this seems to be a good criterion. Some configurations need more cells than others because, due to round-offs, they use fewer bits in each computer word (i.e. cell). In the worst possible configuration, $w/2 + 1$ bits can be used out of $w$, and in the best one all the $w$ bits can be used. It is clearly not possible to use as few as $w/2$ bits or less, since in that case there is enough room to pack the bits of two cells in one, and the above equations would not hold. Hence, the best we can obtain by picking a different $I$ and $J$ is to reduce the number of cells by a factor of 2.

However, it is shown in [3] that by selecting minimal $I$, the possible automata are: (a) horizontal ($I = 1$), (b) horizontal and with only one diagonal per cell ($I = 1$, $\ell_c = 1$), or (c) not horizontal nor vertical, and with only one diagonal per cell ($I > 1$, $J > 1$, $\ell_c = 1$). Those cases can be solved with a simpler update formula (up to 6 times faster than the general one), since some cases of communication with the neighbors are not present. Moreover, a more horizontal automaton makes the strategy of active columns work better. This much faster update formula is more important than the possible 2-fold gains due to round-offs. Hence, we prefer to take minimal $I$, i.e.

$$I = \left\lceil \frac{k+1}{w-1} \right\rceil\,; \quad \ell_r = \left\lceil \frac{k+1}{I} \right\rceil\,; \quad \ell_c = \left\lfloor \frac{w}{\ell_r + 1} \right\rfloor\,; \quad J = \left\lceil \frac{m-k}{\ell_c} \right\rceil$$

However, the three cases mentioned do not cover (d) a purely vertical partitioning (i.e. $J = 1$), which is applicable whenever $2(m-k) \le w$ and has also a simple update formula. The selection for vertical partitioning is $J^v = 1$, $\ell_c^v = m - k$, $\ell_r^v = \lfloor w/(m-k) \rfloor - 1$, $I^v = \lceil (k+1)/\ell_r^v \rceil$. Figure 7 shows an experimental comparison between (c) and (d).
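The two selections above are simple enough to compute directly. The sketch below only evaluates the formulas; the function names are hypothetical and $0 \le k < m$, $w \ge 2$ are assumed.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def minimal_rows_partition(m, k, w):
    """Partition with minimal I. Returns (I, J, l_r, l_c)."""
    I = ceil_div(k + 1, w - 1)     # fewest d-rows such that l_r + 1 <= w
    l_r = ceil_div(k + 1, I)       # automaton rows stored per cell
    l_c = w // (l_r + 1)           # diagonals per cell (one overflow bit each)
    J = ceil_div(m - k, l_c)       # d-columns needed for the m - k diagonals
    return I, J, l_r, l_c


def vertical_partition(m, k, w):
    """Purely vertical partition (J = 1), or None when 2(m - k) > w."""
    if 2 * (m - k) > w:
        return None
    l_c = m - k                    # all the diagonals in every cell
    l_r = w // l_c - 1
    I = ceil_div(k + 1, l_r)
    return I, 1, l_r, l_c
```

For example, with $w = 32$, a pattern with $m = 60$ and $k = 10$ yields a horizontal automaton (case (a)) of 25 cells with two diagonals each, while $m = 60$, $k = 40$ yields case (c) with one diagonal per cell.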

Figure 7: Time in seconds for vertical partitioning (dashed line) versus minimal rows partitioning (solid line). We use $w = 32$, $\sigma$ = 3, random text and patterns.

The mechanism we use to determine the optimal setup and predict its search cost integrates experimental and analytical results, as follows.

- We experimentally obtain the time that each type of automaton spends per text character (using least squares over real measures). We express those costs normalized so that the cost of the core algorithm is 1.0. These costs have the following parts:
  - A base cost that does not depend on the number of cells: (a) ., (b) .3, (c) ., (d) .66.
  - A cost per processed cell of the automaton: (a) ., (b) .83, (c) .7, (d) .36.
  - A cost spent in keeping account of which is the last active diagonal: (a) .68, (b) ., (c) .66. Notice that although at a given text position this work can be proportional to the number of active columns, the amortized cost is $O(1)$ per text position. To see this, consider that at each text character we can at most increment in one the last active column, and therefore no more than $n$ increments and $n$ decrements are possible in a text of size $n$. Hence the correct choice is to consider this cost as independent of the number of cells of the automaton.
- We analytically determine using Eq. (4) the expected number of active d-columns.
- Using the above information, we determine whether it is convenient to keep track of the last active column or just modify all columns (normally the last option is better for high error ratios). We also determine which is the most promising partition.

Since this strategy is based on very well-behaved experimental data, it is not surprising that it predicts very well the cost of automaton partitioning and that it selected the best strategy in almost all cases we tried (in some cases it selected a strategy % slower than the optimal, but not more).

Finally, notice that the worst case complexity of $O((m-k)k/w)$ per inspected character is worse than the $O(m)$ of dynamic programming when the pattern length gets large, i.e. $m > w/(\alpha(1-\alpha))$. Since $\alpha(1-\alpha) \le 1/4$, this ensures that automaton partitioning is better for $m \le 4w$, which is quite large. In fact, we should also account for the constants involved. The constant for partitioned automata is nearly twice as large as that of dynamic programming, which makes sure that this method is better for $m \le 2w$. We use therefore a partitioned automaton instead of dynamic programming as our verification engine for potential matches in the sections that follow.

Figure 8 shows an experimental comparison between plain dynamic programming, the Ukkonen cut-off variant [9] and our partitioned automaton for large patterns. In the worst moment of the partitioned automaton, it is still faster than dynamic programming up to $m$ = 6, which confirms our assumptions. The peaks in the right plot are not due to variance, but to integer round-offs which are inherent to our algorithm (this is explained in more detail in Section 8).

Figure 8: Time in seconds for partitioned automaton (thick line) versus dynamic programming (dashed line) and Ukkonen's improvement (solid thin line). The left and right plots correspond to two different pattern lengths. We use $w = 32$, $\sigma$ = 3, random text and patterns.

4.4 Improving Register Usage

We finish this section explaining an improvement in the engineering of the algorithms that leads to triplicating the performance in some cases. The improvement is based on better usage of the computer registers.

The main difference in the cost between the core algorithm and a horizontally partitioned automaton is that in the first case we can put in a register the machine word which simulates the automaton. This cannot be done in a partitioned automaton, since we use an array of words. The locality of accesses of those words is very low, i.e. if there are $a$ active d-columns, we update for each text character all the words from the first one to the $a$-th. Hence, we cannot keep them in registers.

An exception to the above statement is the case $a = 1$. This represents having active only the first cell of the horizontal automaton. We can, therefore, put that cell in a register and traverse the text updating it, until the last diagonal inside the cell becomes active. At that point, it is possible that the second cell will be activated at the next character and we must resume the normal searching with the array of cells. We can return to the one-cell mode when the second cell becomes inactive again.

With this technique, the search cost for a pattern is equal to that of the core algorithm until the second automaton is activated, which in some cases is a rare event. In fact, we must adjust the above prediction formulas, so that the horizontal automata cost the same as the core algorithm (1.0), and we add the above computed cost only whenever their last diagonal is activated. The probability of this event is $f(\ell_c + k, k)$.

This technique elegantly generalizes a (non-elegant) truncation heuristic proposed in earlier work. It stated that, for instance, if we had $m$ = , $k$ = , better than partitioning the automaton in two we could just truncate the pattern in one letter, use the core algorithm and verify each occurrence. With the present technique we would automatically achieve this, since the last letter will be isolated in the second cell of the horizontal automaton.

Notice that this idea cannot be applied to the case $I > 1$, since in that case we have always more than one active cell. In order to use the technique also for this case, and in order to extend the idea to not only the first cell, we could develop specialized code for two cells, for three cells, and so on, but the effort involved and the complexity of the code are not worth it.

Figure 9 shows the improvements obtained over the old version. The better register usage is more noticeable for low error levels (horizontal partitioning). This version of our partitioned automaton automatically determines whether or not to use the speedup technique of the end of Section 2.

Figure 9: Time in seconds for partitioned automata before (thin line) and after (thick line) improving register usage. We use $m$ = 6, $w = 32$, $\sigma$ = 3, random text and patterns.

5 Pattern Partitioning

We now present a different technique to cope with long patterns. This technique was developed in [3, ], and is improved here. We first explain the general method and then optimize it.

5.1 General Method

The following lemma, proved in [3, ], suggests a way to partition a large problem into smaller ones.

Lemma: If segm = Text[a..b] matches pat with $k$ errors, and pat = $P_1 \ldots P_j$ (a concatenation of subpatterns), then segm includes a segment that matches at least one of the $P_i$'s, with $\lfloor k/j \rfloor$ errors.

The Lemma allows us to reduce the number of errors if we divide the pattern, provided we search all the subpatterns. Each match of a subpattern must be checked to determine if it is in fact a complete match. Suppose we find at text position $i$ the end of a match for the subpattern ending at position $s$ in the pattern. Then, the potential match must be searched in the area between positions $i - s + 1 - k$ and $i - s + 1 + m + k$ of the text, an $(m + 2k)$-wide area. This checking must be done with an algorithm resistant to high error levels, such as our partitioned automaton.

To perform the partition, we pick an integer $j$, and split the pattern in $j$ subpatterns of length $m/j$ (more precisely, if $m = qj + r$, with $0 \le r < j$, $r$ subpatterns of length $\lceil m/j \rceil$ and $j - r$ of length $\lfloor m/j \rfloor$). Because of the lemma, it is enough to check if any of the subpatterns is present in the text with at most $\lfloor k/j \rfloor$ errors.

If we partition the pattern in $j$ parts, we have to perform $j$ searches. Moreover, those searches will together trigger more verifications as $j$ grows (i.e. a piece split in two will trigger all the verifications triggered by the original piece plus spurious ones). This fact is reflected in the formula for the match probability of Section 3, since the match probability is now $O(\gamma^{m/j})$, which may be much larger than $O(\gamma^m)$ even for a single piece. Therefore, we prefer to keep $j$ small.

A first alternative is to make $j$ just large enough for the subproblems to fit in a computer word, that is

\[ j^* = \min \left\{ j \;:\; \left( \lceil m/j \rceil - \lfloor k/j \rfloor \right) \left( \lfloor k/j \rfloor + 2 \right) \le w \;\wedge\; \lfloor m/j \rfloor > \lfloor k/j \rfloor \right\} \]

where the second guard avoids searching a subpattern of length $m'$ with $k' = m'$ errors (those of length $\lceil m/j \rceil$ are guaranteed to be longer than $\lfloor k/j \rfloor$ if $m > k$).
Such a $j^*$ always exists if $k < m$. Solving the above equation (disregarding roundoffs) we obtain

\[ j^* = \frac{m - k + \sqrt{(m-k)^2 + wk(m-k)}}{w} = m \, d(w, \alpha) \]

Footnote: In fact, we used plain dynamic programming in previous work, but as shown in Section 4 the partitioned automaton is faster except for very long patterns. As we see shortly, however, we elaborate more on this verification technique.

where

\[ d(w, \alpha) = \frac{1-\alpha}{w} \left( 1 + \sqrt{1 + \frac{w\alpha}{1-\alpha}} \right) \]

As a function of $\alpha$, $d(w, \alpha)$ is concave and is maximized for $\alpha = \frac{1}{2}\left(1 - 1/(\sqrt{w} - 1)\right)$, where it takes the value $1/(2(\sqrt{w} - 1))$. To give an idea of the reduction obtained, this maximum value is about 0.11 for w = 32 and 0.07 for w = 64.

Excluding verifications, the search cost is $O(j^* n)$. For very low error ratios ($\alpha < 1/w$), $j^* = O(m/w)$ and the cost is $O(mn/w)$. For higher error ratios, $j^* = O(\sqrt{mk/w})$ and then the search cost is $O(\sqrt{mk/w}\; n)$. Both cases can obviously be bounded by $O(mn/\sqrt{w})$.

A second alternative is to use a smaller $j$ (so that the automata still do not fit in a computer word) and combine this technique with automaton partitioning for the subpatterns. We consider this alternative next.

5.2 Optimal Selection for j

It is possible to use just automaton partitioning (Section 4) to solve a problem of any size. It is also possible to use just pattern partitioning, with $j$ large enough for the pieces to be tractable with the kernel algorithm directly (i.e. $j = j^*$).

It is also possible to merge both techniques: partition the pattern into pieces. Those pieces may or may not be small enough to use the kernel algorithm directly. If they are not, search them using automaton partitioning. This has the previous techniques as particular cases.

To obtain the optimal strategy, consider that if we partition in $j$ subpatterns, we must perform $j$ searches with $\lfloor k/j \rfloor$ errors. For $\alpha < 1 - e/\sqrt{\sigma}$, the cost of solving $j$ subproblems by partitioning the automaton is

\[ \frac{e \, k/j}{\sqrt{\sigma} - e} \; \frac{k/j + 2}{w} \; j \, n \;=\; \frac{e \, (k^2/j + 2k)}{(\sqrt{\sigma} - e) \, w} \; n \]

which shows that the lowest cost is obtained with the largest $j$ value, and therefore $j = j^*$ is the best choice.

However, this is just an asymptotic result. In practice the best option is more complicated due to simplifications in the analysis, constant factors, and integer roundoffs. For instance, a pattern with 4 pieces can be better searched with two horizontal automata of size ($I = 1$, $J = 2$) than with four simple automata (especially given the register-usage improvements of Section 4). The cost of each automaton depends heavily on its detailed structure.
Therefore, to determine the best option in practice we must check all the possible $j$ values, from 1 to $j^*$, and predict the cost of each strategy. This cost accounts for running $j$ automata of the required type (which depends on $j$), as well as for the cost to verify the potential matches multiplied by their probability of occurrence (estimated with the match probability formula).
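The choice of $j^*$ in Section 5.1 can be sketched in a few lines. The following Python fragment is only an illustration, not the authors' implementation: the function names (`fits_in_word`, `smallest_j`, `piece_lengths`, `d`) are hypothetical, and the closed form assumes $d(w,\alpha) = \frac{1-\alpha}{w}\left(1 + \sqrt{1 + w\alpha/(1-\alpha)}\right)$.

```python
import math

def fits_in_word(m, k, w):
    """Core-algorithm condition: (m - k)(k + 2) <= w."""
    return (m - k) * (k + 2) <= w

def smallest_j(m, k, w):
    """Smallest j whose longest piece fits in a w-bit word with floor(k/j)
    errors, under the guard floor(m/j) > floor(k/j). Assumes k < m, w >= 2."""
    for j in range(1, m + 1):
        longest = -(-m // j)      # ceil(m / j)
        shortest = m // j         # floor(m / j)
        errors = k // j           # floor(k / j)
        if fits_in_word(longest, errors, w) and shortest > errors:
            return j
    raise ValueError("no valid j: requires k < m and w >= 2")

def piece_lengths(m, j):
    """r pieces of length ceil(m/j) followed by j - r of length floor(m/j)."""
    q, r = divmod(m, j)
    return [q + 1] * r + [q] * (j - r)

def d(w, alpha):
    """Closed form (roundoffs ignored) such that j* is roughly m * d(w, alpha)."""
    return (1 - alpha) / w * (1 + math.sqrt(1 + w * alpha / (1 - alpha)))

if __name__ == "__main__":
    m, k, w = 60, 20, 32          # illustrative values, not taken from the paper
    j = smallest_j(m, k, w)
    print(j, piece_lengths(m, j), round(m * d(w, k / m), 2))
```

With these illustrative values the exact search gives a slightly larger $j$ than the closed form, which is the kind of roundoff effect the text mentions as a reason to evaluate every candidate $j$ in practice.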

5.3 A Hierarchical Verification Algorithm

The original proposal for pattern partitioning (presented in [3, ]) stopped working long before the limit $\alpha < 1 - 1.09/\sqrt{\sigma}$, as can be seen in the original references and in Figure 11. This was because the whole pattern was verified whenever any piece matched. Hence, the total cost for verifications for a single piece was $O(m^2 \gamma^{m/j})$. For that cost to be $O(1)$, we need $\gamma \le 1/m^{2j/m}$, i.e.

\[ \alpha \;\le\; 1 - \frac{e}{\sqrt{\sigma}} \, m^{\frac{j}{m(1-\alpha)}} \;=\; 1 - \frac{e}{\sqrt{\sigma}} \, m^{\frac{d(w,\alpha)}{1-\alpha}} \]

which clearly decreases as $m$ grows. Therefore, the original method degraded for longer patterns. This was caused mainly because a large pattern was verified although the probability to verify it increased with $j$ (i.e. with $m$).

We now propose a different verification technique which does not degrade as the pattern gets longer. The idea is to try to quickly determine that the match of the small piece is not in fact part of a complete match. A technique similar to this hierarchical verification was mentioned in [6], in the context of indexed searching.

First assume that $j$ is a power of 2. Then, we recursively split the pattern in two halves of size $\lfloor m/2 \rfloor$ and $\lceil m/2 \rceil$ (halving also the number of errors, i.e. $\lfloor k/2 \rfloor$) until the pieces are small enough to be searched with the core algorithm (i.e. $(m' - k')(k' + 2) \le w$, where $m'$ and $k'$ are the parameters for the subpatterns). Those pieces (leaves of the tree) are searched in the text. Each time a leaf reports an occurrence, its parent node checks the area looking for its pattern (whose size is close to twice the size of the leaf pattern). Only if the parent node finds the longer pattern does it report the occurrence to its parent, and so on. The occurrences reported by the root of the tree are the final answers.

This construction is correct because the partitioning lemma applies to each level of the tree, i.e. any occurrence reported by the root node must include an occurrence reported by one of the two halves, so we search both halves. The argument then applies recursively to each half.

Figure 10 illustrates this concept. If we search the pattern "aaabbbcccddd" with four errors in the text "xxxbbxxxxxxx", and split the pattern in four pieces, the piece "bbb" will be found in the text.
In the original approach, we would verify the complete pattern in the text area, while with the new approach we verify only its parent "aaabbb" and immediately determine that there cannot be a complete match.

In the Appendix (Eq. (8)) we analyze this method and show that the total amount of verification work for each piece is $O((m/j)^2 \gamma^{m/j})$. This is much better than $O(m^2 \gamma^{m/j})$, and in particular it is $O(1)$ whenever $\gamma < 1$. Hence, with this verification method the acceptable error level does not degrade as the pattern grows.

If $j$ is not a power of two we try to build the tree as well balanced as possible. This is because an unbalanced tree will force the verification of a long pattern because of the match of a short pattern (where the long pattern is more than twice as long as the short one). The same argument shows that it is not a good idea to use ternary or higher arity trees. Finally, we could increase $j$ to have a perfect binary partition, but the shorter pieces trigger more verifications.

In order to handle partitions which are not a power of two, we need a stronger version of the

partitioning lemma of Section 5.1. For instance, if we determine $j = 5$, we have to partition the tree in, say, a left child with three pieces and a right child with two pieces. The standard partitioning lemma tells us that each subtree could search its pattern with $\lfloor k/2 \rfloor$ errors, but this will increase the verifications of the subtree with the shorter pattern. In fact, we can search the left subtree with $\lfloor 3k/5 \rfloor$ errors and the right one with $\lfloor 2k/5 \rfloor$ errors. Continuing with this policy we arrive at the leaves, which are searched with $\lfloor k/5 \rfloor$ errors each, as expected. The stronger version of the Lemma follows.

Stronger Lemma: If segm = Text[a..b] matches pat with $k$ errors, and pat = $P_1 \ldots P_j$ (a concatenation of subpatterns), then segm includes a segment that matches at least one of the $P_i$'s, with $\lfloor a_i k / A \rfloor$ errors, where $A = \sum_{i=1}^{j} a_i$.

Proof: Otherwise, each $P_i$ matches with at least $1 + \lfloor a_i k / A \rfloor > a_i k / A$ errors. Summing up the errors of all the pieces we have more than $Ak/A = k$ errors and therefore a match is not possible.

aaabbbcccddd
aaabbb   cccddd
aaa   bbb   ccc   ddd

Figure 10: The hierarchical verification method. The boxes (leaves) are the elements which are really searched, and the root represents the whole pattern. At least one pattern at each level must match in any occurrence of the complete pattern. If the bold box is found, all the bold lines may be verified.

Although plain and hierarchical verification behave similarly when there are few matches (i.e. low error level), there is an important difference for medium error levels: hierarchical verification is more tolerant to errors. We illustrate this fact in Figure 11. As can be seen, both methods are eventually overwhelmed by verifications before reaching the limit $\alpha = 1 - 1.09/\sqrt{\sigma}$. This is because, as $j$ grows, the cost of verifications $O((m/j)^2 \gamma^{m/j})$ increases. In the case σ = 3, the theoretical limit is α = .83 (i.e. k = ), while the plain method ceases to be useful for k = 3 (i.e. α = .8) and the hierarchical one works well up to k = (i.e. α = .7). For English text the limit is α = .69, while the plain method works up to k = 3 (α = .) and the hierarchical one up to k = 3 (α = .8).
It is also noticeable that hierarchical verification works a little harder in the verifications once they become significant (very high error levels). This is because the hierarchy of verifications makes it check the same text area many times.

On the other hand, we notice that the use of partitioned automata instead of dynamic programming for the verification of possible matches is especially advantageous in combination with our hierarchical verification, since in most cases we verify only a short pattern, where the automaton is much faster than dynamic programming.
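The hierarchical scheme of this section can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the implementation of the paper: leaves and inner nodes are both checked with a classical column-wise dynamic programming search (swapped in for the bit-parallel core algorithm and the partitioned automaton), the names `match_ends`, `build_leaves`, `Node` and `hierarchical_search` are hypothetical, and error budgets follow the stronger lemma ($\lfloor a_i k / A \rfloor$ for a node covering $a_i$ of the $A = j$ pieces).

```python
def match_ends(pattern, k, text):
    """Yield each exclusive end index in text of an occurrence of pattern
    with at most k errors (column-wise dynamic programming)."""
    col = list(range(len(pattern) + 1))
    for pos, c in enumerate(text, 1):
        diag = 0                          # row 0 is always 0: a match may start anywhere
        for i, p in enumerate(pattern, 1):
            cur = min(diag + (p != c), col[i] + 1, col[i - 1] + 1)
            diag, col[i] = col[i], cur
        if col[-1] <= k:
            yield pos

class Node:
    def __init__(self, start, end, errors, parent):
        self.start, self.end = start, end     # pattern[start:end]
        self.errors, self.parent = errors, parent

def build_leaves(m, k, j):
    """Balanced binary tree over j pieces; returns the leaves, linked upwards."""
    q, r = divmod(m, j)
    bounds = [0]
    for length in [q + 1] * r + [q] * (j - r):
        bounds.append(bounds[-1] + length)
    leaves = []

    def build(lo, hi, parent):                # node covers pieces lo .. hi-1
        node = Node(bounds[lo], bounds[hi], (hi - lo) * k // j, parent)
        if hi - lo == 1:
            leaves.append(node)
        else:
            mid = lo + (hi - lo + 1) // 2
            build(lo, mid, node)
            build(mid, hi, node)

    build(0, j, None)
    return leaves

def hierarchical_search(pattern, k, j, text):
    """Return (areas, checks): text areas where the root pattern was verified,
    and the number of verifications performed above the leaves."""
    found, checks = set(), 0
    for leaf in build_leaves(len(pattern), k, j):
        piece = pattern[leaf.start:leaf.end]
        for i in match_ends(piece, leaf.errors, text):
            area = (max(0, i - len(piece) - leaf.errors), i)
            node = leaf.parent
            while node is not None:
                lo = max(0, i - (leaf.end - node.start) - node.errors)
                hi = min(len(text), i + (node.end - leaf.end) + node.errors)
                checks += 1
                sub = pattern[node.start:node.end]
                if not any(match_ends(sub, node.errors, text[lo:hi])):
                    break                     # ancestor absent: stop climbing
                area, node = (lo, hi), node.parent
            else:
                found.add(area)               # every ancestor up to the root matched
    return sorted(found), checks

if __name__ == "__main__":
    print(hierarchical_search("aaabbbcccddd", 4, 4, "xxxbbxxxxxxx"))
```

On the example of the text, the leaf "bbb" matches with one error, but each candidate is discarded by checking only the six-letter parent "aaabbb" with two errors, so the full twelve-letter pattern is never verified.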

Figure 11: Time in seconds for pattern partitioning using plain (thin line) and hierarchical (thick line) verification. We use m = 6, w = 32, and n = Mb. On the left, random text (σ = 3). On the right, English text.

6 Superimposed Automata

This technique was first presented in [] for multipattern approximate search, and integrated into the single-pattern algorithm in []. We first explain it and then find the optimal form to use it.

6.1 General Method

When we use pattern partitioning, the search is divided into a number of subsearches for smaller patterns $P_1, \ldots, P_j$. The aim of this technique is to avoid searching each subpattern separately, by collapsing a number $r$ of searches into a single one.

In pattern partitioning all the patterns have almost the same length. If they differ (at most by one), we truncate them to the shortest length. Hence, all the automata have the same structure, differing only in the labels of the horizontal arrows.

The superimposition is defined as follows: we build the t[ ] table for each pattern, and then take the bitwise-or of all the tables. The resulting t[ ] table matches in its position $i$ with the $i$-th character of any of the patterns involved. We then build the automaton as before using this table.

The resulting automaton accepts a text position if it ends an occurrence of a much more relaxed pattern (in fact, an extended pattern), namely $C_1 \ldots C_m$, with $C_i = \{P_1[i], \ldots, P_r[i]\}$. For example, if the search is for patt and wait, the string watt is accepted with zero errors (see Figure 12). Each occurrence reported by the automaton has to be verified for all the patterns involved.

For a moderate number of patterns, this still constitutes a good filtering mechanism, at the same cost of a single search. Clearly, the relaxed pattern triggers many more verifications than the simple

ones. This limits the amount of possible superimposition.

Figure 12: An NFA to filter the parallel search of patt and wait. The horizontal arrows are labeled "p or w", "a", "t or i" and "t"; the three rows correspond to no errors, 1 error and 2 errors.

If we use pattern partitioning in $j$ pieces and superimpose in groups of $r$ pieces, we must perform $\lceil j/r \rceil$ superimposed searches. We keep the groups of almost the same size, namely $\lfloor j/\lceil j/r \rceil \rfloor$ and $\lceil j/\lceil j/r \rceil \rceil$. We group subpatterns which are contiguous in the pattern. When an occurrence is reported we cannot know which of the superimposed subpatterns caused the match (since the mechanism does not allow us to know), so we check whether the concatenation of the subpatterns appears in the area. From that point on, we use the normal hierarchical verification mechanism.

6.2 Optimizing the Amount of Superimposition

Suppose we decide to superimpose $r$ patterns in a single search. We are limited in the amount of this superimposition because of the increase in the error level to tolerate, with the consequent increase in the cost of verifications. We now analyze how many patterns we can superimpose.

As shown in Section 3, the probability of a given text position matching a random pattern is $O(\gamma^m)$, where $\gamma$ depends on $\alpha$ and $\sigma$. This probability is exponentially decreasing with $m$ for $\alpha < 1 - e/\sqrt{\sigma}$, while if this condition does not hold the probability is very high.

In this formula, $1/\sigma$ stands for the probability of a character crossing a horizontal edge of the automaton (i.e. the probability of two random characters being equal). To extend this result, we notice that we have $r$ characters on each edge now, so the above mentioned probability is $1 - (1 - 1/\sigma)^r \approx r/\sigma$. The (pessimistic) approximation is tight for $r \ll \sigma$. We use the approximation because in practice $r$ will be quite modest compared to $\sigma$.

Hence, the value of $\gamma$ when superimposing $r$ patterns (which we call $\gamma'$ to keep unchanged the old

value) is

\[ \gamma' = \left( \frac{r}{\sigma \, \alpha^{2\alpha/(1-\alpha)} (1-\alpha)^2} \right)^{1-\alpha} = r^{1-\alpha} \gamma \]

and therefore the new limit for $\alpha$ is

\[ \alpha < 1 - e \sqrt{r/\sigma} \qquad (6) \]

or alternatively the limit for $r$ (i.e. the maximum amount of superimposition $r_{lim}$ that can be used given the error level) is

\[ r_{lim} = \frac{\sigma (1-\alpha)^2}{e^2} \]

which for constant error level is $O(\sigma)$, independent of $m$.

However, this is not the only restriction on $r$. If we use pattern partitioning in $j$ pieces and superimpose in groups of $r$ pieces, we must perform $j/r$ superimposed searches. In the last part of the Appendix (Eq. (9)) we show that the expected cost due to verifications is $O(\ell^2 r \gamma'^{\ell})$ per search, where $\ell = m/j$. For this cost to be $O(1)$ we need a new (stricter) condition on $r$. This is obtained by expanding $\gamma'$ using Eq. (6):

\[ \ell^2 \, r \left( \frac{e^2 r}{\sigma (1-\alpha)^2} \right)^{\ell(1-\alpha)} = O(1) \]

which yields (bounding the leading factor $r$ by $r_{lim}$)

\[ r = \frac{r_{lim}}{(r_{lim} \, \ell^2)^{\frac{1}{\ell(1-\alpha)}}} = \frac{r_{lim}^{\,1 - \frac{1}{\ell(1-\alpha)}}}{\ell^{\frac{2}{\ell(1-\alpha)}}} \]

which approaches $r_{lim}$ for large $\ell$ (i.e. partitioning the pattern into fewer pieces). In the analysis that follows we make the simplifying assumption $r = r_{lim}$.

Notice that superimposition may give further reasons to partition a pattern in $j < j^*$ pieces. On the other hand, thanks to the new verification mechanism of Section 5.3 we can superimpose more patterns than in the original work, which translates into better performance everywhere, not only when the error level is becoming high.

Considering the above limit, the total search cost becomes $1/r = O(1/(\sigma(1-\alpha)^2))$ times that of pattern partitioning. For instance, if we partition in $j^*$ pieces (so that they can be searched with the core algorithm), the search cost becomes

\[ O\left( \frac{m \, d(w, \alpha)}{\sigma (1-\alpha)^2} \; n \right) \]

which for $\alpha \le 1/w$ is $O(mn/(\sigma w))$, and for higher error level becomes $O(\sqrt{mk/(\sigma w)}\; n)$ (this is because $1 - \alpha$ is lower bounded by $e/\sqrt{\sigma}$). Again, a general bound is $O(mn/\sqrt{\sigma w})$.

Footnote: A recent work on multipattern approximate searching shows that by applying the idea of hierarchical verification to the number $r$ of patterns we achieve in fact $r = r_{lim}$, since the cost to verify $r$ superimposed patterns does not depend on $r$ anymore [6].
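The superimposition of Section 6.1 and the limit of Section 6.2 can be made concrete with a short Python sketch. It is illustrative only: it simulates just the zero-error row of the automaton with the classical Shift-And recurrence (the rows that allow errors are omitted), it uses the patterns patt and wait of the example, and the names `superimposed_table`, `filter_positions`, `r_lim` and `group_sizes` are hypothetical. The limit is taken as $r_{lim} = \sigma(1-\alpha)^2/e^2$.

```python
import math

def superimposed_table(patterns):
    """Bitwise-or of the per-pattern tables: bit i of table[c] is set when c
    is the i-th character of ANY pattern (patterns truncated to the shortest)."""
    m = min(len(p) for p in patterns)
    table = {}
    for p in patterns:
        for i, c in enumerate(p[:m]):
            table[c] = table.get(c, 0) | (1 << i)
    return table, m

def filter_positions(patterns, text):
    """Zero-error Shift-And over the superimposed table. Returns the exclusive
    end positions that would still have to be verified for each pattern."""
    table, m = superimposed_table(patterns)
    state, accept, ends = 0, 1 << (m - 1), []
    for pos, c in enumerate(text, 1):
        state = ((state << 1) | 1) & table.get(c, 0)
        if state & accept:
            ends.append(pos)
    return ends

def r_lim(sigma, alpha):
    """Maximum useful superimposition for error level alpha (assumed formula)."""
    return sigma * (1 - alpha) ** 2 / math.e ** 2

def group_sizes(j, r):
    """Split j contiguous pieces into ceil(j/r) groups of nearly equal size."""
    groups = -(-j // r)
    small, extra = divmod(j, groups)
    return [small + 1] * extra + [small] * (groups - extra)

if __name__ == "__main__":
    print(filter_positions(["patt", "wait"], "a watt of patience"))
    print(group_sizes(7, 3), round(r_lim(32, 0.2), 2))
```

The string watt is reported even though it is neither pattern, which is exactly the kind of spurious candidate that the subsequent verification step has to discard.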

A natural question is for which error level we can superimpose all the $j^*$ patterns to perform just one search, i.e. when $r_{lim} = j^*$ holds. That is

\[ \frac{\sigma (1-\alpha)^2}{m \, d(w, \alpha)} = e^2 \]

whose approximate solution is

\[ \alpha < \alpha_1 = 1 - \frac{e^2 m}{\sigma \sqrt{w}} \qquad (7) \]

where as always we must replace $e$ by 1.09 in practice. As we see in the experiments, this bound is pessimistic because of the roundoff factors which affect $j^*$ for medium-size patterns.

Notice that superimposition stops working when $r_{lim} = 1$, i.e. when $\alpha = 1 - e/\sqrt{\sigma}$. This is the same point at which pattern partitioning stops working.

We show in Figure 13 the effect of superimposition on the performance of the algorithm and its tolerance to the error level. As we see in Section 8, we achieve almost constant search time until the error level becomes medium. This is because we automatically superimpose as much as possible given the error level.

Figure 13: Times in seconds for superimposed automata. Superimposition is forced to r = (solid line), (dashed line) and 6 (dotted line). The larger r, the faster the algorithm, but it stops working for lower error levels. We use m = , w = 32, and n = Mb and random text and patterns with σ = 3.

6.3 Optimal Grouping and Aligning

Two final aspects allow further optimization. A first one is that it is possible to try to form the groups so that the patterns in each group are similar (e.g. they are at a small edit distance among them, or they share letters at the same position). This would decrease the probability of finding spurious matches in the text. A possible disadvantage of this heuristic is that, since the subpatterns are not contiguous, we cannot simply verify whether their concatenation appears, but we have to check if any of the corresponding leaves of the tree appears. The probability that the concatenation appears is much lower.

A second one is that, since we may have to prune the longer subpatterns of each group, we can determine whether to eliminate the first or the last character (the patterns differ at most by one), using the same idea of trying to make the patterns as similar as possible.

None of these heuristics have been tested yet.

7 Combining All the Techniques

At this point, a number of techniques have been described, analyzed and optimized. They can be used in many combinations for a single problem. A large pattern can be split in one or more subpatterns (the case of "one" meaning no splitting at all). Those subpatterns can be small enough to be searched with the kernel algorithm or they can be still large and need to be searched with a partitioned automaton. Moreover, we can group those automata (simple or partitioned) to speed up the search by using superimposition.

The analysis helped us to find more efficient verification techniques and to determine the cases where each technique can be used. However, a number of questions still arise. What is the correct trade-off between splitting the pattern and the size of the pieces? Is it better to have fewer pieces or smaller pieces? How does the superimposition affect this picture? Is it better to have more small pieces and superimpose more pieces per group, or is it better to have larger pieces and smaller groups?

We study the optimal combination in this section. We begin by showing the result of a theoretical analysis and then explain the heuristic we use.

7.1 A Theoretical Approach

The analysis recommends using the maximal possible superimposition (the limit on $r$ derived in Section 6.2) to reduce the number of searches. As proved in Section 5.2, it also recommends using the maximal $j = j^*$. This gives the following combined (simplified) average complexity for our algorithm, illustrated in Figure 14:

If the problem fits in a machine word (i.e. $(m - k)(k + 2) \le w$), the core algorithm is used at $O(n)$ average and worst-case search cost.

If the error level is so low that we can cut the pattern in $j^*$ pieces and superimpose all of them (i.e. $\alpha < \alpha_1$, Eq. (7)), then superimposed automata give $O(n)$ average search cost.
If the error level is not so low but it is not too high (i.e. $\alpha < 1 - e/\sqrt{\sigma}$), then we use pattern partitioning in $j^*$ parts, to obtain $O(\sqrt{mk/(\sigma w)}\; n)$ average search cost.

If the error level is too high (i.e. $\alpha > 1 - e/\sqrt{\sigma}$) we must use automaton partitioning at $O(k(m-k)n/w)$ average and worst-case search cost.

On the other hand, the worst-case search cost is $O(k(m-k)n/w)$ in all cases. This is the same worst-case cost of the search using the automaton. This is because we use such an automaton to verify the matches, and we never verify a text position twice with the same automaton. We keep the state of the search and its last text position visited to avoid backtracking in the text due to overlapping verification requirements. This argument is valid even with hierarchical verification.
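The four cases above can be read as a small decision procedure. The Python sketch below is a hedged illustration of that reading, not the heuristic actually used by the authors: `choose_strategy` and `d` are hypothetical names, the "superimpose everything" test is written directly as $r_{lim} \ge j^*$ with $r_{lim} = \sigma(1-\alpha)^2/e^2$ and $j^* \approx m\,d(w,\alpha)$ (roundoffs ignored), and the constant `e` is a parameter so that the practical value 1.09 can be substituted.

```python
import math

def d(w, alpha):
    """Closed form such that j* is roughly m * d(w, alpha) (assumed form)."""
    return (1 - alpha) / w * (1 + math.sqrt(1 + w * alpha / (1 - alpha)))

def choose_strategy(m, k, w, sigma, e=math.e):
    """Pick the technique suggested by the simplified average-case analysis."""
    alpha = k / m
    if (m - k) * (k + 2) <= w:
        return "core algorithm"                 # the problem fits in one word
    r_limit = sigma * (1 - alpha) ** 2 / e ** 2
    if r_limit >= m * d(w, alpha):
        return "superimposed automata"          # all j* pieces in one search
    if alpha < 1 - e / math.sqrt(sigma):
        return "pattern partitioning"
    return "automaton partitioning"

if __name__ == "__main__":
    for m, k, sigma in [(9, 1, 32), (60, 3, 64), (60, 3, 32), (60, 40, 32)]:
        print(m, k, sigma, choose_strategy(m, k, 32, sigma))
```

As the text points out, such an asymptotic rule ignores constant factors and integer roundoffs, so it should be seen only as a first approximation to a cost-based selection.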
