New and Faster Filters for Multiple Approximate String Matching


Ricardo Baeza-Yates, Gonzalo Navarro
Department of Computer Science, University of Chile
Blanco Encalada 2120 - Santiago - Chile
{rbaeza,gnavarro}@dcc.uchile.cl

This work has been supported in part by FONDECYT grant 1990627.

Abstract

We present three new algorithms for on-line multiple string matching allowing errors. These are extensions of previous algorithms that search for a single pattern. The average running time achieved is in all cases linear in the text size for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when each algorithm should be used, and show experimentally their performance. The only previous solution for this problem allows only one error. Our algorithms are the first to allow more errors, and are faster than previous work for a moderate number of patterns (e.g. less than 50-100 on English text, depending on the pattern length).

Key words: String matching, multipattern search, search allowing errors.

1 Introduction

Approximate string matching is one of the main problems in classical string algorithms, with applications to text searching, computational biology, pattern recognition, etc. Given a text $T_{1..n}$ of length $n$ and a pattern $P_{1..m}$ of length $m$ (both sequences over an alphabet of size $\sigma$), and a maximal number of errors allowed, $0 < k < m$, we want to find all text positions where the pattern matches the text with up to $k$ errors. Errors can be substituting, deleting or inserting a character. We use the term "error level" to refer to $\alpha = k/m$.

In this paper we are interested in the on-line problem (i.e. the text is not known in advance), where the classical solution for a single pattern is based on dynamic programming and has a running time of $O(mn)$ [26]. In recent years several algorithms have improved the classical one [22]. Some improve the worst or average case by using the properties of the dynamic programming matrix [3,, 6, 3, 9]. Others filter the text to quickly eliminate uninteresting parts [29, 28,, 4, 24], some of them being "sublinear" on average for moderate $\alpha$ (i.e. they do not inspect all the text characters). Yet other approaches use bit-parallelism [3] in a computer word of $w$ bits to reduce the number of operations [33, 35, 34, 6, 9].

The problem of approximately searching a set of patterns (i.e. the occurrences of any one of them) has been considered only recently. This problem has many applications, for instance:

- Spelling: many incorrect words can be searched in the dictionary at a time, in order to find their most likely variants. Moreover, we may even search the dictionary of correct words in the "text" of misspelled words, hopefully at much less cost.
- Information retrieval: when synonym or thesaurus expansion is done on a keyword and the text is error-prone, we may want to search all the variants allowing errors.
- Batched queries: if a system receives a number of queries to process, it may improve efficiency by searching all of them in a single pass.
- Single-pattern queries: some algorithms for a single pattern allowing errors (e.g. pattern partitioning [6]) reduce the problem to the search of many subpatterns allowing less errors, and they benefit from multipattern search algorithms.

A trivial solution to the multipattern search problem is to perform $r$ searches, where $r$ is the number of patterns. As far as we know, the only previous attempt to improve the trivial solution is due to Muth & Manber [7], who use hashing to search many patterns with one error, being efficient even for one thousand patterns.

In this work, we present three new algorithms that are extensions of previous ones to the case of multiple search. In Section 2 we explain some basic concepts necessary to understand the algorithms. Then we present the three new techniques. In Section 3 we present "automaton superimposition", which extends a bit-parallel simulation of a nondeterministic finite automaton (NFA) [6]. In Section 4 we present "exact partitioning", which extends a filter based on exact searching of pattern pieces [7, 6, 24]. In Section 5 we present "counting", based on counting pattern letters in a text window [4]. In Section 6 we analyze our algorithms and in Section 7 we compare them experimentally. Finally, in Section 8 we give our conclusions. Some detailed analyses are left for Appendices A and B.

Although [7] allows searching for many patterns, it is limited to only one error. Ours are the first algorithms for multipattern matching allowing more than one error. Moreover, even for one error, we improve [7] when the number of patterns is not very large (say, less than 50-100 on English text, depending on the pattern length). Our multipattern extensions improve over their sequential counterparts (i.e. one separate search per pattern using the base algorithm) when the error level is not very high (about 0.4 on English text). The filter based on exact searching is the fastest for small error levels, while the bit-parallel simulation of the NFA adapts better to more errors on relatively short patterns.

Previous partial and preliminary versions of this work appeared in [5, 2, 2].

2 Basic Concepts

We review in this section some basic concepts that are used in all the algorithms that follow. In the paper $S_i$ denotes the $i$-th character of string $S$ ($S_1$ being the first character), and $S_{i..j}$ stands for the substring $S_i S_{i+1} \ldots S_j$. In particular, if $i > j$, $S_{i..j} = \varepsilon$, the empty string.

2.1 Filtering Techniques

All the multipattern search algorithms that we consider in this work are based on the concept of filtering, and therefore it is useful to define it here.

Filtering is based on the fact that it is normally easier to tell that a text position does not match than to ensure that it matches. Therefore, a filter is a fast algorithm that checks for a simple necessary (though not sufficient) condition for an approximate match to occur. The text areas that do not satisfy the necessary condition can be safely discarded, and a more expensive algorithm has to be run on the text areas that passed the filter.

Since the filters can be much faster than approximate searching algorithms, filtering algorithms can be very competitive (in fact, they dominate on a large range of parameters). The performance of filtering algorithms, however, is very sensitive to the error level. Most filters work very well on low error levels and very badly with more errors. This is related to the amount of text that the filter is able to discard. When evaluating filtering algorithms, it is important to consider not only their time efficiency but also their tolerance to errors.

A term normally used when referring to filters is "sublinearity". A filter is said to be sublinear when it does not inspect all the characters of the text (like the Boyer-Moore [8] algorithms for exact searching, which can be at best $O(n/m)$).

Throughout this work we make use of the two following lemmas to derive filtering conditions.

Lemma 1 [6]: If $S = T_{a..b}$ matches $P$ with at most $k$ errors, and $P = p_1 \ldots p_j$ (a concatenation of sub-patterns), then some substring of $S$ matches at least one of the $p_i$'s, with at most $\lfloor k/j \rfloor$ errors.

Proof: Otherwise, the best match of each $p_i$ inside $S$ has at least $\lfloor k/j \rfloor + 1 > k/j$ errors. An occurrence of $P$ involves the occurrence of each of the $p_i$'s, and the total number of errors in the occurrences is at least the sum of the errors of the pieces. But here, just summing up the errors of all the pieces we have more than $j \cdot k/j = k$ errors and therefore a complete match is not possible. Notice that this does not even consider that the matches of the $p_i$ must be in the proper order, be disjoint, and that some deletions in $S$ may be needed to connect them.

In general, one can filter the search for a pattern of length $m$ with $k$ errors by the search of $j$ subpatterns of length $m/j$ with $k/j$ errors. Only the text areas surrounding occurrences of pieces must be checked for complete matches.
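Lemma 1 can be checked on a small case. The sketch below is illustrative only: the helper names `split_evenly` and `best_match_errors` are hypothetical, the even split is one possible way to cut the pattern, and the classical dynamic programming algorithm stands in for whatever search is actually used on the pieces.

```python
def split_evenly(pattern, j):
    """Split pattern into j contiguous pieces whose lengths differ by at most one."""
    base, extra = divmod(len(pattern), j)
    pieces, start = [], 0
    for i in range(j):
        length = base + (1 if i < extra else 0)
        pieces.append(pattern[start:start + length])
        start += length
    return pieces


def best_match_errors(piece, s):
    """Fewest errors with which `piece` matches some substring of `s`."""
    col = list(range(len(piece) + 1))   # column for the empty text
    best = col[-1]
    for ch in s:
        diag, col[0] = col[0], 0        # a match may start at any position
        for i in range(1, len(piece) + 1):
            cur = min(diag + (piece[i - 1] != ch),  # match or substitution
                      col[i] + 1,                   # extra text character
                      col[i - 1] + 1)               # missing pattern character
            diag, col[i] = col[i], cur
        best = min(best, col[-1])
    return best


# "survey" matches "surgery" with k = 2 errors, so for any j some piece
# must occur inside "surgery" with at most k // j errors.
k, s = 2, "surgery"
for j in (2, 3):
    pieces = split_evenly("survey", j)
    print(j, pieces, [best_match_errors(p, s) for p in pieces])
```

With j = 3 the allowed error count per piece drops to zero, which is the particular case the exact partitioning filter relies on.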
An important particular case of Lemma 1 arises when one considers $j = k + 1$, since in this case some pattern piece appears unaltered (zero errors).

Lemma 2 [32]: If there are $i \le j$ such that $ed(T_{i..j}, P) \le k$, then $T_{j-m+1..j}$ includes at least $m - k$ characters of $P$.

Proof: Suppose the opposite. If $j - i < m$, then we observe that there are less than $m - k$ characters of $P$ in $T_{i..j}$. Hence, more than $k$ characters must be deleted from $P$ to match the text. If $j - i \ge m$, we observe that there are more than $k$ characters in $T_{i..j}$ that are not in $P$, and hence we must insert more than $k$ characters in $P$ to match the text. A contradiction in both cases.

Note that in case of repeated characters in the pattern, they must be counted as different occurrences. For example, if we search "aaaa" with one error in the text, the last four letters of each occurrence must include at least three a's.

Lemma 2 (a simplification of that in [32]) says essentially that we can design a filter for approximate searching based on finding enough characters of the pattern in a text window (without regarding their ordering). For instance, the pattern "survey" cannot appear with one error in the text window "surge" because there are not five letters of the pattern in the text. However, the filter cannot discard the possibility that the pattern appears in the text window "yevrus".
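The necessary condition of Lemma 2 amounts to a multiset intersection between the pattern and the window. A minimal sketch follows, with hypothetical function names; it only evaluates the condition on a given window and does not slide the window over a text.

```python
from collections import Counter


def common_characters(pattern, window):
    """Pattern characters present in the window, counting multiplicities."""
    need = Counter(pattern)
    have = Counter(window)
    return sum(min(count, have[c]) for c, count in need.items())


def window_may_match(pattern, window, k):
    """Lemma 2 condition: at least m - k pattern characters in the window."""
    return common_characters(pattern, window) >= len(pattern) - k


print(window_may_match("survey", "yevrus", 1))  # cannot be discarded
print(window_may_match("aaaa", "abba", 1))      # discarded: only two a's
```

Repeated pattern characters are handled by the `min` over counts, so a window needs three separate a's to pass for "aaaa" with one error.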

2.2 Bit-Parallelism

Bit-parallelism is a technique of common use in string matching [3]. It was first proposed in [2, 4]. The technique consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By using this fact cleverly, the number of operations that an algorithm performs can be cut down by a factor of at most $w$, where $w$ is the number of bits in the computer word. Since in current architectures $w$ is 32 or 64, the speedup is very significant in practice (and improves with technological progress). In order to relate the behavior of bit-parallel algorithms to other works, it is normally assumed that $w = \Theta(\log n)$, as dictated by the RAM model of computation. We prefer, however, to keep $w$ as an independent value.

Some notation we use for bit-parallel algorithms is in order. We denote as $b_\ell \ldots b_1$ the bits of a mask of length $\ell$, which is stored somewhere inside the computer word. We use C-like syntax for operations on the bits of computer words, e.g. "|" is the bitwise-or and "<<" moves the bits to the left and enters zeros from the right, e.g. $b_m b_{m-1} \ldots b_2 b_1 << 3 = b_{m-3} \ldots b_2 b_1 000$. We can also perform arithmetic operations on the bits, such as addition and subtraction, which operate on the bits as if they formed a number. For instance, $b_\ell \ldots b_x 10000 - 1 = b_\ell \ldots b_x 01111$.

We explain now the first bit-parallel algorithm, since it is the basis of much of what follows in this work. The algorithm searches a pattern in a text (without errors) by parallelizing the operation of a non-deterministic finite automaton that looks for the pattern. Figure 1 illustrates this automaton.

[Figure 1: Nondeterministic automaton that searches "aloha" exactly.]

This automaton has $m + 1$ states, and can be simulated in its non-deterministic form in $O(mn)$ time. The Shift-Or algorithm achieves $O(mn/w)$ worst-case time (i.e. optimal speedup). Notice that if we convert the non-deterministic automaton to a deterministic one to have $O(n)$ search time, we get an improved version of the KMP algorithm [5]. However, KMP is twice as slow for $m \le w$.

The algorithm first builds a table $B[\,]$ which for each character $c$ stores a bit mask $B[c] = b_m \ldots b_1$. The mask $B[c]$ has the bit $b_i$ in zero if and only if $P_i = c$. The state of the search is kept in a machine word $D = d_m \ldots d_1$, where $d_i$ is zero whenever $P_{1..i}$ matches the end of the text read up to now (i.e. the state numbered $i$ in Figure 1 is active). Therefore, a match is reported whenever $d_m$ is zero.

$D$ is set to all ones originally, and for each new text character $T_j$, $D$ is updated using the formula

$$D \leftarrow (D << 1) \;|\; B[T_j]$$

The formula is correct because the $i$-th bit is zero if and only if the $(i-1)$-th bit was zero for the previous text character and the new text character matches the pattern at position $i$. In other words, $T_{j-i+1..j} = P_{1..i}$ if and only if $T_{j-i+1..j-1} = P_{1..i-1}$ and $T_j = P_i$. It is possible to relate this formula to the movement that occurs in the non-deterministic automaton for each new text character: each state gets the value of the previous state, but this happens only if the text character matches the corresponding arrow.

For patterns longer than the computer word (i.e. $m > w$), the algorithm uses $\lceil m/w \rceil$ computer words for the simulation (not all of them are active all the time). The algorithm is $O(mn/w)$ worst case time, and the preprocessing is $O(m + \sigma)$ time and $O(\sigma)$ space. On average, the algorithm is $O(n)$ even when $m > w$, since only the first $O(1)$ states of the automaton have active states on average (and hence only the first $O(1)$ computer words need to be updated on average).

It is easy to extend Shift-Or to handle classes of characters. In this extension, each position in the pattern matches with a set of characters rather than with a single character. The classical string matching algorithms are not so easily extended. In Shift-Or, it is enough to set the $i$-th bit of $B[c]$ to zero for every $c \in P_i$ ($P_i$ is a set now). For instance, to search for "survey" in case-insensitive form, we just set the first bit of B["s"] and of B["S"] to "match" (zero), and the same with the rest. Shift-Or can also search for multiple patterns (where the complexity is $O(mn/w)$ if we consider that $m$ is the total length of all the patterns) by arranging many masks $B$ and $D$ in the same machine word.

Shift-Or was later enhanced [34] to support a larger set of extended patterns and even regular expressions. Recently, in [25], Shift-Or was combined with a sublinear string matching algorithm, obtaining the same flexibility and an efficiency competitive against the best classical algorithms.

Many on-line text algorithms can be seen as implementations of clever automata (classically, in their deterministic form). Bit-parallelism has since its invention become a general way to simulate simple non-deterministic automata instead of converting them to deterministic. It has the advantage of being much simpler, in many cases faster (since it makes better usage of the registers of the computer word), and easier to extend to handle complex patterns than its classical counterparts. Its main disadvantage is the limitations it imposes with regard to the size of the computer word.
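The Shift-Or update described above can be written down directly. The following is a sketch rather than a tuned implementation: Python integers are unbounded, so the computer word is emulated by masking to $m$ bits, and the function name and the convention of returning 0-based end positions are choices made here. Each pattern position may be a single character or a collection of characters, which covers the class extension.

```python
def shift_or_search(pattern, text):
    """Return the 0-based end positions of exact occurrences of `pattern`.

    `pattern` is a sequence whose items are characters or collections of
    characters (classes); a plain string works as a sequence of characters.
    """
    m = len(pattern)
    ones = (1 << m) - 1
    B = {}
    for i, cls in enumerate(pattern):
        for c in cls:
            # bit i is zero in B[c] iff position i of the pattern accepts c
            B[c] = B.get(c, ones) & ~(1 << i)
    D = ones                      # all ones: no state is active
    ends = []
    for j, c in enumerate(text):
        D = ((D << 1) | B.get(c, ones)) & ones
        if D & (1 << (m - 1)) == 0:
            ends.append(j)        # the final state is active
    return ends


print(shift_or_search("aloha", "say aloha, aloha!"))
# Case-insensitive search for "survey" using two-character classes.
classes = [c.lower() + c.upper() for c in "survey"]
print(shift_or_search(classes, "A SurVey"))
```

Bit 0 of the emulated word plays the role of $d_1$; the zero shifted in from the right corresponds to the always-active initial state.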
In many cases the adaptations of bit-parallel algorithms to cope with longer patterns are not so efficient.

2.3 Bit-parallelism for Approximate Pattern Matching

We present now an application of bit-parallelism to approximate pattern matching, which is especially relevant for the present work.

Consider the NFA for searching "patt" with at most $k = 2$ errors shown in Figure 2. Every row denotes the number of errors seen: the first one 0, the second one 1, and so on. Every column represents matching the pattern up to a given position. At each iteration, a new text character is considered and the automaton changes its states. Horizontal arrows represent matching a character (they can only be followed if the corresponding match occurs). All the others represent errors, as they move to the next row. Vertical arrows represent inserting a character in the pattern (since they advance in the text and not in the pattern), solid diagonal arrows represent replacing a character (since they advance in the text and the pattern), and dashed diagonal arrows represent deleting a character of the pattern (since, as $\varepsilon$-transitions, they advance in the pattern but not in the text). The loop at the initial state allows considering any character as a potential starting point of a match. The automaton accepts a character (as the end of a match) whenever a rightmost state is active. Initially, the active states at row $i$ ($i \in 0..k$) are those at the columns from 0 to $i$, to represent the deletion of the first $i$ characters of the pattern $P_{1..m}$.

[Figure 2: An NFA for approximate string matching, with one row each for no errors, 1 error and 2 errors, and diagonals $D_1$ and $D_2$ marked. We show the active states after reading the text "pait".]

An interesting application of bit-parallelism is to simulate this automaton in its nondeterministic form. A first approach [34] obtained $O(k \lceil m/w \rceil n)$ time, by packing each automaton row in a machine word and extending the Shift-Or algorithm to account for the vertical and diagonal arrows. Note that even if all the states fit in a single machine word, the $k + 1$ rows have to be sequentially updated because of the $\varepsilon$-transitions. The same happens in the classical dynamic programming algorithm [26], which can be regarded as a column-wise simulation of this NFA.

In this paper we are interested in a more recent simulation technique [6], where we show that by packing diagonals of the automaton instead of rows or columns all the new values can be computed in one step if they fit in a computer word. We give a brief description of the idea.

Because of the $\varepsilon$-transitions, once a state in a diagonal is active, all the subsequent states in that diagonal become active too, so we can define the minimal active row of each diagonal, $D_i$ (diagonals are numbered by looking at the column they start at, e.g. $D_1$ and $D_2$ are enclosed in dotted lines in Figure 2). The new values $D'_i$ ($i \in 1..m-k$) after we read a new text character $c$ can be computed by

$$D'_i = \min(\, D_i + 1,\; D_{i+1} + 1,\; g(D_{i-1}, c) \,)$$

where

$$g(D_i, c) = \min(\, \{k + 1\} \,\cup\, \{\, j \;/\; j \ge D_i \,\wedge\, P_{i+j} = c \,\} \,)$$

where it always holds $D_0 = D'_0 = 0$ and we report a match whenever $D_{m-k} \le k$. The formula for $D'_i$ accounts for replacements, insertions and matches, respectively. Deletions are accounted for by keeping the minimum active row. All the interesting matches are caught by considering only the diagonals $D_1 \ldots D_{m-k}$.
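For comparison with the diagonal-wise idea, the row-wise simulation attributed above to [34] can be sketched as follows. Several details are assumptions of this sketch rather than statements of the original: active states are represented by 1 bits (a Shift-And style convention, the opposite of Shift-Or), column 0 is stored explicitly as bit 0, and the function returns pairs of 0-based end position and smallest row (number of errors) that accepts there.

```python
def nfa_rows_search(pattern, text, k):
    """Row-wise bit-parallel simulation of the NFA with rows 0..k (k < m)."""
    m = len(pattern)
    full = (1 << (m + 1)) - 1
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (i + 1))   # bit i+1 set iff P_{i+1} = c
    # Row i starts with columns 0..i active (deletion of the first i characters).
    R = [(1 << (i + 1)) - 1 for i in range(k + 1)]
    hits = []
    for j, c in enumerate(text):
        mask = B.get(c, 0)
        old_prev = R[0]
        R[0] = ((R[0] << 1) & mask) | 1       # horizontal arrows plus initial loop
        for i in range(1, k + 1):
            old_cur = R[i]
            R[i] = (((old_cur << 1) & mask)   # match
                    | old_prev                # insertion (vertical arrow)
                    | (old_prev << 1)         # replacement (solid diagonal)
                    | (R[i - 1] << 1)         # deletion (dashed diagonal, new row i-1)
                    | 1) & full
            old_prev = old_cur
        for i in range(k + 1):
            if (R[i] >> m) & 1:
                hits.append((j, i))
                break
    return hits


print(nfa_rows_search("patt", "pait", 2))
```

The rows are updated in increasing order because each new row depends on the already updated row above it, which is the sequential dependency the text attributes to the $\varepsilon$-transitions.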

We use bit-parallelism to represent the $D_i$'s in unary. Each one is held in $k + 1$ bits (plus an overflow bit) and stored sequentially inside a bit mask $D$. Interestingly, the effect is the same if we read the diagonals bottom-up and exchange $0 \leftrightarrow 1$, with each bit representing a state of the NFA. The update formula can be seen either as an arithmetic implementation of the previous formula in unary or as a logical simulation of the flow of bits across the arrows of the NFA.

As in Shift-Or, a table of ($m$ bits long) masks $b[\,]$ is built representing match or mismatch against the pattern. A table $B[c]$ is built by mapping the bits of $b[\,]$ to their appropriate positions inside $D$. Figure 3 shows how the states are represented inside the masks $D$ and $B$.

[Figure 3: Bit-parallel representation of the NFA of Figure 2, showing the masks $D$ and $B[\text{'t'}]$ with separator bits and the final state.]

This representation requires $k + 2$ bits per diagonal, so the total number of bits is $(m - k)(k + 2)$. If this number of bits does not exceed the computer word size $w$, the update can be done in $O(1)$ operations. The resulting algorithm is linear and very fast in practice.

For our purposes, it is important to realize that the only connection between the pattern and the algorithm is given by the $b[\,]$ table, and that the pattern can use classes of characters just as in the Shift-Or algorithm. We use this property next to search for multiple patterns.

3 Superimposed Automata

In this section we describe an approach based on the bit-parallel simulation of the NFA just described.

Suppose we have to search $r$ patterns $P^1, \ldots, P^r$. We are interested in the occurrences of any one of them, with at most $k$ errors. We can extend the previous bit-parallelism approach by building the automaton for each one, and then "superimpose" all the automata.

Assume that all patterns have the same length (otherwise, truncate them to the shortest one). Hence, all the automata have the same structure, differing only in the labels of the horizontal arrows.

The superimposition is defined as follows: we build the $b[\,]$ table for each pattern, and then take the bitwise-and of all the tables (recall that 0 means match and 1 means mismatch). The resulting $b[\,]$ table matches at position $i$ with the $i$-th character of any of the patterns. We then build the automaton as before using this table.

The resulting automaton accepts a text position if it ends an occurrence of a much more relaxed pattern with classes of characters, namely

$$\{P^1_1, \ldots, P^r_1\}\; \{P^1_2, \ldots, P^r_2\}\; \ldots\; \{P^1_m, \ldots, P^r_m\}$$

For example, if the search is for "patt" and "wait", as shown in Figure 4, the string "pait" is accepted with zero errors.

[Figure 4: An NFA to filter the search for "patt" and "wait". Its horizontal arrows are labeled "p or w", "a", "t or i" and "t" in each of the rows for no errors, 1 error and 2 errors.]

For a moderate number of patterns, the filter is strict enough at the same cost of a single search. Each occurrence reported by the automaton has to be verified for all the involved patterns (we use the single-pattern automaton for this step). That is, we have to retraverse the last $m + k = O(m)$ characters to determine if there is actually an occurrence of some of the patterns.

If the number of patterns is too large, the filter will be too relaxed and will trigger too many verifications. In that case, we partition the set of patterns into groups of $r'$ patterns each, build the automaton of each group and perform $\lceil r/r' \rceil$ independent searches. The cost of this search is $O(r/r' \; n)$, where $r'$ is small enough to make the cost of verifications negligible. This $r'$ always exists, since for $r' = 1$ we have a single pattern per automaton and no verification is needed.

When grouping, we use the heuristic of sorting the patterns and packing neighbors in the same group, trying to have the same first characters.

3.1 Hierarchical Verification

The simplest verification alternative (which we call "plain") is that, once a superimposed automaton reports a match, we try the individual patterns one by one in the candidate area. However, a smarter verification technique (which we call hierarchical) is possible.
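The bitwise-and of the mask tables can be illustrated on the exact-matching (zero errors) Shift-Or automaton, which is enough to reproduce the "pait" example. This is a simplified sketch under that assumption: it does not build the diagonal-packed NFA with errors, and the names `mask_table`, `superimpose` and `filter_hits` are hypothetical.

```python
def mask_table(pattern, m):
    """Shift-Or style masks for the first m characters: bit i is 0 iff P_{i+1} = c."""
    ones = (1 << m) - 1
    table = {}
    for i, c in enumerate(pattern[:m]):
        table[c] = table.get(c, ones) & ~(1 << i)
    return table


def superimpose(patterns):
    """Bitwise-and of the per-pattern tables, truncating to the shortest pattern."""
    m = min(len(p) for p in patterns)
    ones = (1 << m) - 1
    tables = [mask_table(p, m) for p in patterns]
    merged = {}
    for table in tables:
        for c in table:
            if c not in merged:
                mask = ones
                for t in tables:
                    mask &= t.get(c, ones)    # 0 (match) wins over 1 (mismatch)
                merged[c] = mask
    return m, merged


def filter_hits(m, table, text):
    """0-based end positions accepted by the (zero-error) automaton."""
    ones = (1 << m) - 1
    D, hits = ones, []
    for j, c in enumerate(text):
        D = ((D << 1) | table.get(c, ones)) & ones
        if not D & (1 << (m - 1)):
            hits.append(j)
    return hits


m, table = superimpose(["patt", "wait"])
print(filter_hits(m, table, "pait"))                   # the relaxed pattern accepts
print(filter_hits(*superimpose(["patt"]), "pait"))     # neither single pattern does
print(filter_hits(*superimpose(["wait"]), "pait"))
```

The hit reported by the superimposed table is spurious for both individual patterns, which is why every reported position needs a verification step.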

Assume first that $r$ is a power of two. Then, when the automaton reports a match, run two new automata over the candidate area: one which superimposes the first half of the patterns and another with the second half. Repeat the process recursively with each of the two automata that finds again a match. At the end, the automata will represent single patterns and if they find a match we know that their patterns have been really found (see Figure 5). Of course the automata for the required subsets of patterns are all preprocessed. Since they correspond to the internal nodes of a binary tree of $r$ leaves, they are $2r - 1 = O(r)$, so the space and preprocessing cost does not change. If $r$ is not a power of two then one of the halves may have one more pattern than the other.

[Figure 5: The hierarchical verification method for 4 patterns. The root covers patterns 1 2 3 4, its children cover 1 2 and 3 4, and the leaves are the single patterns 1, 2, 3 and 4. Each node of the tree represents a check (the root represents in fact the global filter). If a node passes the check, its two children are tested. If a leaf passes the check, its pattern has been found.]

The advantage of hierarchical verification is that it can remove a number of candidates from consideration in a single test. Moreover, it can even find that no pattern has really matched before actually checking any specific pattern (i.e. it may happen that none of the two halves match in a spurious match of the whole group). The worst-case overhead over plain verification is just a constant factor, that is, twice as many tests over the candidate area ($2r - 1$ instead of $r$). On average, as we show later analytically and experimentally, hierarchical verification is by far superior to plain verification.

3.2 Automaton Partitioning

Up to now we have considered short patterns, whose NFA fits into a computer word. If this is not the case (i.e. $(m - k)(k + 2) > w$), we partition the problem. In this subsection and the next we adapt the two partitioning techniques described in [6].

The simplest technique to cope with a large automaton is to use a number of machine words for the simulation. The idea is as follows: once the (large) automata have been superimposed, we partition the superimposed automaton into a matrix of subautomata, each one fitting in a computer word. Those subautomata behave slightly differently than the simple one, since they must propagate bits to their neighbors. Figure 6 illustrates.

Once the automaton is partitioned, we run it over the text updating its subautomata. Each step takes time proportional to the number of cells to update, i.e. $O(k(m - k)/w)$. Observe, however, that it is not necessary to update all the subautomata, since those on the right may not have any active state. Following [3], we keep track of up to where we need to update the matrix of subautomata, working only on the "active" cells.
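The hierarchical verification of Section 3.1 can be sketched as a short recursion. The sketch makes several simplifying assumptions that are not in the original: the check at each tree node is done with a dynamic programming matcher over character classes instead of a preprocessed bit-parallel automaton, the subsets are rebuilt on every call rather than precomputed, and all patterns are assumed to have the same length. The names `class_match` and `hierarchical_verify` are hypothetical.

```python
def class_match(patterns, area, k):
    """True if the superimposition of `patterns` matches in `area` with <= k errors."""
    m = min(len(p) for p in patterns)
    classes = [{p[i] for p in patterns} for i in range(m)]
    col = list(range(m + 1))
    for ch in area:
        diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            cur = min(diag + (ch not in classes[i - 1]), col[i] + 1, col[i - 1] + 1)
            diag, col[i] = col[i], cur
        if col[m] <= k:
            return True
    return False


def hierarchical_verify(patterns, area, k):
    """Patterns that really match in `area`, testing halves before single patterns."""
    if not class_match(patterns, area, k):
        return []                       # the whole subset is discarded in one test
    if len(patterns) == 1:
        return list(patterns)           # a leaf passed: its pattern has been found
    half = (len(patterns) + 1) // 2     # one half may hold one more pattern
    return (hierarchical_verify(patterns[:half], area, k)
            + hierarchical_verify(patterns[half:], area, k))


group = ["patt", "wait", "salt", "hint"]
print(hierarchical_verify(group, "pait", 0))
print(hierarchical_verify(group, "pait", 1))
```

In the second call the subset holding "salt" and "hint" fails its check, so neither of those patterns is ever tested individually.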

[Figure 6: A large NFA partitioned into a matrix of $I \times J$ computer words ($I$ rows, $J$ columns), satisfying $(\ell_r + 1)\,\ell_c \le w$. The figure marks the information flow and the affected area.]

3.3 Pattern Partitioning

This technique is based on Lemma 1 of Section 2.1. We can reduce the size of the problem if we divide the pattern in $j$ parts, provided we search all the sub-patterns with $\lfloor k/j \rfloor$ errors. Each match of a sub-pattern must be verified to determine if it is in fact a complete match.

To perform the partition, we pick the smallest $j$ such that the problem fits in a single computer word (i.e. $(\lceil m/j \rceil - \lfloor k/j \rfloor)(\lfloor k/j \rfloor + 2) \le w$). The limit of this method is reached for $j = k + 1$, since in that case we search with zero errors. The algorithm for this case is qualitatively different and is described in Section 4.

We divide each pattern in $j$ subpatterns as evenly as possible. Once we partition all the patterns, we are left with $jr$ subpatterns to be searched with $\lfloor k/j \rfloor$ errors. We simply group them as if they were independent patterns to search with the general method. The only difference is that, after determining that a subpattern has appeared, we have to verify its complete pattern.

Another kind of hierarchical verification, which we call "hierarchical piece verification", is applied in this case too. As shown in [23, 24], the single-pattern algorithm can verify hierarchically whether the complete pattern matches given that a piece matches (see Figure 7). That is, instead of checking the complete pattern we check the concatenation of two pieces containing the one that matched, and if it matches then we check the concatenation of four pieces, and so on. This works because Lemma 1 applies at each level of the tree of Figure 7.

The method is orthogonal to our hierarchical verification idea because hierarchical piece verification works bottom-up instead of top-down and operates on pieces of the pattern rather than on sets of patterns. As we are using our hierarchical verification on the sets of pattern pieces to determine which piece matched given that a superimposition of them matched, we are coupling two different hierarchical verification techniques in this case: we first use our new mechanism to determine which piece matched from the superimposed group and then use hierarchical piece verification to determine the occurrence of the complete pattern the piece belongs to. Figure 8 illustrates the whole process.
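The choice of $j$ in pattern partitioning is a direct search over the word-size condition stated above. A small sketch, where the function name is hypothetical and returning $k + 1$ is used here to signal that the zero-error method of Section 4 applies:

```python
def pieces_needed(m, k, w):
    """Smallest j with (ceil(m/j) - floor(k/j)) * (floor(k/j) + 2) <= w.

    Returns k + 1 when no j <= k satisfies the condition, i.e. when the
    pieces would have to be searched with zero errors.
    """
    for j in range(1, k + 1):
        piece_len = -(-m // j)          # ceil(m / j) with integer arithmetic
        piece_err = k // j              # floor(k / j)
        if (piece_len - piece_err) * (piece_err + 2) <= w:
            return j
    return k + 1


print(pieces_needed(30, 6, 32))   # j = 3: (10 - 2) * (2 + 2) = 32 bits
print(pieces_needed(10, 2, 32))   # j = 1: the whole NFA already fits
print(pieces_needed(60, 3, 32))   # no j <= k works, so k + 1 = 4
```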

[Figure 7: The hierarchical piece verification method for a pattern split in 4 parts. The root is the whole pattern "aaabbbcccddd", its children are "aaabbb" and "cccddd", and the leaves are "aaa", "bbb", "ccc" and "ddd". The boxes (leaves) are the elements which are actually searched, and the root represents the whole pattern. At least one pattern at each level must match in any occurrence of the complete pattern. If the bold box is found, all the bold lines may be verified.]

[Figure 8: The whole process of pattern partitioning with hierarchical verifications. Three patterns P1, P2, P3 are each split in 4 pieces; the pieces are arranged in 2 superimposed groups and searched; hierarchical verification finds piece p22; hierarchical piece verification finally finds P2.]

4 Partitioning into Exact Searching

This technique (called "exact partitioning" for short) is based on a single-pattern filter which reduces the problem of approximate searching to a problem of multipattern exact searching. The algorithm first appeared in [34], and was later improved in [7, 6, 24]. We first present the single-pattern version and then our extension to multiple patterns.

4.1 A Filter Based on Exact Searching

A particular case of Lemma 1 shows that if a pattern matches a text position with $k$ errors, and we split the pattern in $k + 1$ pieces, then at least one of the pieces must be present with no errors in each occurrence (this is a folklore property which has been used several times [34, 8, 2]). Searching with zero errors leads to a completely different technique.

Since there are efficient algorithms to search for a set of patterns exactly, we partition the pattern in $k + 1$ pieces (of similar length), and apply a multipattern exact search for the pieces. Each occurrence of a piece is verified to check if it is surrounded by a complete match. If there are not too many verifications, this algorithm is extremely fast.

From the many algorithms for multipattern search, an extension of Sunday's algorithm [27] gave us the best results. We build a trie with the sub-patterns. From each text position we search the text characters into the trie, until a leaf is found (match) or there is no path to follow (mismatch). The jump to the next text position is precomputed as the minimum of the jumps allowed in each sub-pattern by the Sunday algorithm.

As in [24], we use the same technique for hierarchical piece verification of a single pattern presented in Section 3.3.

4.2 Searching Multiple Patterns

Observe that we can easily add more patterns to this scheme. Suppose we have to search for $r$ patterns $P^1, \ldots, P^r$. We cut each one into $k + 1$ pieces and search in parallel for all the $r(k + 1)$ pieces. When a piece is found in the text, we use a classical algorithm to verify its pattern in the candidate area.

Note an important difference with superimposed automata. In this multipattern search we know which piece has matched. This is not the case in superimposed automata, where not only do we not know which piece matched, but it is even possible that no piece has really matched. The work to determine which is the matching piece (carried out by hierarchical verification in superimposed automata) is not necessary here. Moreover, we only detect real matches, so there are no more matches in the union of patterns than the sum of the individual matches.

Therefore, there is no point in separating the search for the $r(k + 1)$ pieces in groups. The only reason to superimpose less patterns is that the shifts of the Sunday algorithm are reduced as the number of patterns grows, but as we show in the experiments, this never justifies in practice splitting one search into two.

5 A Counting Filter

We present now a filter based on counting letters in common between the pattern and a text window. This filter was first presented in [4] (a simple variant of [3]), but we use a slightly different version here. Our variant uses a fixed-size instead of variable-size text window (a possibility already noted in [32]), which makes it better suited for parallelization. We first explain the single-pattern filter and then extend it to handle many patterns using bit-parallelism.
A Simple Coune This le is based in Lemma 2 of Secion 2.. I passes ove he ex examining an m-lees long window. I keeps ack of how many chaaces of P ae pesen in he cuen ex window (accouning fo mulipliciies oo). If, a a given ex posiion j, m? k o moe chaaces of P ae in he window T j?m+::j, he window aea is veied wih a classical algoihm. We implemen he leing algoihm as follows. We keep a coune coun of paen chaaces appeaing in he ex window. We also keep a able A[ ] whee, iniially, he numbe of imes ha each chaace c appeas in P is kep in A[c]. Thoughou he algoihm, each eny A[c] indicaes how many occuences of c can sill be aken as belonging o P. Fo example, if 'h' appeas once 2

in P, we coun only one of he 'h's of he ex window as belonging o P. When A[c] is negaive, i means ha c mus exi he ex window?a[c] imes befoe we ake i again as belonging o P. Fo example, if we un he paen "aloha" ove he ex "aaaaaaaa", i will hold A["a"] =?3, and he value of he coune will be 2. This is independen on k. To advance he window, we mus include he new chaace T j+ and exclude he las chaace, T j?m+. To include he new chaace, we subac one fom A[T j+ ]. If i was geae han zeo befoe being decemened, i is because he new chaace T j+ is in P, so we incemen coun. To exclude he old chaace T j?m+, we add one o A[T j?m+ ]. If is is geae han zeo afe being incemened, i is because T j?m+ was consideed o be in P, so we decemen coun. Wheneve coun eaches m? k we veify he peceding aea. As can be seen, he algoihm is no only linea (excluding veicaions), bu he numbe of opeaions pe chaace is vey small. 5.2 Keeping Many Counes in Paallel To seach paens in he same ex, we use bi-paallelism o keep all he counes in a single machine wod. We mus do ha fo he A[ ] able and fo coun. The values of he enies of A[ ] lie in he ange [?m::m], so we need exacly `+ = +dlog 2 (m+ )e bis o soe hem. This is also enough fo coun, since i is in he ange [::m]. Hence, we can pack bw=( + dlog 2 (m + )e)c w= log 2 m paens of lengh m in a single seach (ecall ha w is he numbe of bis in he compue wod). If he paens have dieen lenghs, we can eihe uncae hem o he shoes lengh o use a window size of he longes lengh. If we have moe paens, we mus divide he se in subses of maximal size and seach each subse sepaaely. We focus ou aenion on a single subse now. The algoihm simulaes he simple one as follows. We have a able MA[ ] ha packs all he A[ ] ables. Each eny of MA[ ] is divided in bi aeas of lengh ` +. In he aea of he machine wod coesponding o each paen, we soe is nomal A[ ] value, se o he mos signican bi of he aea, and subac (i.e. we soe 2`? + A[ ]). 
When, in the algorithm, we have to add or subtract 1 to all A[ ]'s, we can easily do it in parallel without causing overflow from an area to the next. Moreover, the corresponding A[ ] value is not positive if and only if the most significant bit of the area is zero.

We have also a parallel counter Mcount, where the areas are aligned with MA[ ]. It is initialized by setting to 1 the most significant bit of each area and then subtracting $m-k$ at each one, i.e. we store $2^\ell - (m-k)$. Later, we can add or subtract 1 in parallel without causing overflow. Moreover, the window must be verified for a pattern whenever the most significant bit of its area reaches 1. The condition can be checked in parallel, but when some of the most significant bits reach 1, we need to sequentially check which one it was.

Finally, observe that the counters that we want to selectively increment or decrement correspond exactly to the MA[ ] areas that have a 1 in their most significant bit (i.e. those whose A[ ] value is positive). This allows an obvious bit mask-shift-add mechanism to perform this operation in parallel on all the counters. Figure 9 illustrates.
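The single-pattern counter of Section 5.1 and its packed counterpart can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `simple_counts` and `packed_candidates` are hypothetical, all patterns are assumed to have the same length m with 0 <= k < m, and Python's unbounded integers stand in for the machine word, so the limit of about w/(l+1) patterns per word is not enforced. The outgoing character is removed before the incoming one is added, so that no area ever reflects a window of m + 1 characters.

```python
from collections import Counter, defaultdict


def simple_counts(pattern, text):
    """Value of `count` for every full window of len(pattern) characters."""
    m = len(pattern)
    a = defaultdict(int, Counter(pattern))  # A[c]: occurrences of c still creditable to P
    count, out = 0, []
    for j, c in enumerate(text):
        if j >= m:                          # exclude the character leaving the window
            old = text[j - m]
            a[old] += 1
            if a[old] > 0:                  # it had been counted as belonging to P
                count -= 1
        if a[c] > 0:                        # include the new character
            count += 1
        a[c] -= 1
        if j >= m - 1:
            out.append(count)
    return out


def packed_candidates(patterns, text, k):
    """(window end, pattern index) pairs whose counter reaches m - k."""
    m, r = len(patterns[0]), len(patterns)
    assert all(len(p) == m for p in patterns) and 0 <= k < m
    ell = m.bit_length()                    # equals ceil(log2(m + 1)) for m >= 1
    width = ell + 1                         # bits per area
    ones = sum(1 << (i * width) for i in range(r))  # lowest bit of every area
    top = ones << ell                       # most significant bit of every area
    base = ones * ((1 << ell) - 1)          # every area stores 2^ell - 1, i.e. A = 0
    ma = defaultdict(lambda: base)          # MA[c]: all the A[c] values, packed
    for i, p in enumerate(patterns):
        for c in p:
            ma[c] += 1 << (i * width)
    mcount = ones * ((1 << ell) - (m - k))  # every area stores 2^ell - (m - k)
    hits = []
    for j, c in enumerate(text):
        if j >= m:
            old = text[j - m]
            ma[old] += ones                 # A[old] += 1 in all areas at once
            mcount -= (ma[old] >> ell) & ones   # decrement where A[old] is now positive
        mcount += (ma[c] >> ell) & ones     # increment where A[c] is positive
        ma[c] -= ones                       # A[c] -= 1 in all areas at once
        if j >= m - 1 and mcount & top:     # some counter reached m - k
            for i in range(r):              # sequentially find which ones
                if (mcount >> (i * width + ell)) & 1:
                    hits.append((j, i))
    return hits
```

For the pattern "aloha" and the window "hello", `simple_counts("aloha", "hello")` returns `[3]`, the value of count given in the caption of Figure 9. The positions reported by `packed_candidates` are only candidates: each one still has to be verified with a classical algorithm.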

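[Figure 9 shows, for $m = 5$, $k = 1$, $r = 3$ and $\ell = 3$, the entries MA[a], MA[l], MA[o], MA[h] and MA[e], where each area stores A[c] $+ 2^\ell - 1$ and its highest bit answers whether A[c] $> 0$, and Mcount, where each area stores count $- (m-k) + 2^\ell$ and its highest bit answers whether count $\geq m-k$ (false in the example).]

Figure 9: The bit-parallel counters. The example corresponds to the pattern "aloha" searched with 1 error and the text window "hello". The A values are A['a'] = 2; A['l'] = A['e'] = $-1$; A['o'] = A['h'] = 0, and count = 3.

6 Analysis

We are interested in the complexity of the presented algorithms, as well as in the restrictions that $\alpha$ and $r$ must satisfy for each mechanism to be efficient in filtering most of the irrelevant part of the text.

To this effect, we define two concepts. First, we say that a multipattern search algorithm is optimal if it searches $r$ patterns in the same time it takes to search one pattern. If we call $C_{n,r}$ the cost to search $r$ patterns in a text of size $n$, then an algorithm is optimal if $C_{n,r} = C_{n,1}$. Second, we say that a multipattern search algorithm is useful if it searches $r$ patterns in less than the time it takes to search them one by one with the corresponding sequential algorithm, i.e. $C_{n,r} < r\,C_{n,1}$. As we work with filters, we are interested in the average case analysis, since in the worst case none is useful.

We compare in Table 1 the complexities and limits of applicability of all the algorithms. Muth & Manber are included for completeness. The analysis leading to these results is presented later in this section.

Algorithm | Complexity | Optimality | Usefulness
Simple Superimp. | $\frac{r}{\sigma(1-\alpha)^2}\, n$ | $\alpha < 1 - e\sqrt{r/\sigma}$ | $\alpha < 1 - e/\sqrt{\sigma}$
Automaton Part. | $\frac{\alpha m^2 r}{\sigma w (1-\alpha)}\, n$ | $\alpha < 1 - e\sqrt{r/\sigma}$ | $\alpha < 1 - e/\sqrt{\sigma}$
Pattern Part. | $\frac{m r}{\sigma \sqrt{w} (1-\alpha)}\, n$ | $\alpha < 1 - e\sqrt{r/\sigma}$ | $\alpha < 1 - e/\sqrt{\sigma}$
Part. Exact Search | $\left(1 + \frac{r m}{\alpha \sigma^{1/\alpha}}\right) n$ | $\alpha < \frac{1}{\log_\sigma(rm) + \Theta(\log_\sigma \log_\sigma(rm))}$ | $\alpha < \frac{1}{\log_\sigma m + \Theta(\log_\sigma \log_\sigma m)}$
Counting | $\frac{r \log m}{w}\, n$ | $\alpha < e^{-m/\sigma}$ | $\alpha < e^{-m/\sigma}$
Muth & Manber | $mn$ | $k = 1$ | $k = 1$

Table 1: Complexity, optimality and limit of applicability for the different algorithms.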

We present in Figure 10 a schematic representation of the areas where each algorithm is the best in terms of complexity. We show later how the experiments match those figures.

Exact partitioning is the fastest choice in most reasonable scenarios, for the error levels where it can be applied. First, it is faster than counting for $m/\log m < \alpha \sigma^{1/\alpha}/w$, which does not hold asymptotically but holds in practice for reasonable values of $m$. Second, it is faster than superimposing automata for $\min(\sqrt{w}, w/m) < \sigma^{1/\alpha - 1}/(1/\alpha - 1)$, which is true in most practical cases.

The only algorithm which can be faster than exact partitioning is that of Muth & Manber [17], namely for $r > \alpha \sigma^{1/\alpha}$. However, it is limited to $k = 1$.

For increasing $m$, counting is asymptotically the fastest algorithm since its cost grows as $O(\log m)$ instead of $O(m)$ thanks to its optimal use of the bits of the computer word. However, its applicability is reduced as $m$ grows, being useless at the point where it wins over exact partitioning.

When the error level is too high for exact partitioning, superimposing automata is the only remaining alternative. Automaton partitioning is better for $\alpha m \leq \sqrt{w}$, while pattern partitioning is asymptotically better. Both algorithms have the same limit of usefulness, and for higher error levels no filter can improve over a sequential search.

[Figure 10 shows two plots. Regions of the left plot: Automaton Partitioning, Pattern Partitioning, Partitioning into Exact Search, NONE USEFUL. Regions of the right plot: Superimposed Automata, Partitioning into Exact Search, Muth-Manber, NONE USEFUL.]

Figure 10: The areas where each algorithm is better, in terms of $\alpha$, $m$ and $r$. In the left plot (varying $m$), we have assumed a moderate $r$ (i.e. less than 50).

6.1 Superimposed Automata

Suppose that we search $r$ patterns. As explained before, we can partition the set in groups of $r'$ patterns each, and search each group separately (with its automata superimposed). The size of the groups should be as large as possible, but small enough for the verifications to be not significant. We analyze which is the optimal value for $r'$ and which is the complexity of the search.

In [6] we prove that the probability of a given text position matching a random pattern with error level $\alpha$ is $O(\gamma^m)$, where $\gamma = 1/\left(\sigma^{1-\alpha}\, \alpha^{2\alpha}\, (1-\alpha)^{2(1-\alpha)}\right)$. It is also proved that $\gamma < 1$ whenever $\alpha < 1 - e/\sqrt{\sigma}$, and experimentally shown that this holds very precisely in practice if we replace $e$ by 1.09. In fact, a very abrupt phenomenon occurs, since the matching probability is very low for $\alpha \leq 1 - 1.09/\sqrt{\sigma}$ and very high otherwise.

In this formula, $1/\sigma$ stands for the probability of a character crossing a horizontal edge of the automaton (i.e. the probability of two characters being equal). To extend this result, we note that we have $r$ characters on each edge now, so the above mentioned probability is $1 - (1 - 1/\sigma)^r$, which is smaller than $r/\sigma$. We use this upper bound as a pessimistic approximation (which stands for the case of all the $r$ characters being different, and is tight for $r \ll \sigma$).

As the single-pattern algorithm is $O(n)$ time, the multipattern algorithm is optimal on average whenever the total cost of verifications is $O(1)$ per character. Since each verification costs $O(m)$ (because we use a linear-time algorithm on an area of length $m + k = O(m)$), we need that the total number of verifications performed is $O(1/m)$ per character, on average. If we used the plain verification scheme, this would mean that the probability that a superimposed automaton matches a text position should be $O(1/(mr))$, as we have to perform $r$ verifications.

If hierarchical verification is not used we have that, as $r$ increases, matching becomes more probable (because it is easier to cross a horizontal edge of the automaton) and it costs more (because we have to check the $r$ patterns one by one). This results in two different limits on the maximum allowable $r$, one for each of the two facts just stated. The limit due to the increased cost of each verification is more stringent than that of increased matching probability.

The resulting analysis without hierarchical verification is very complex and is omitted here because hierarchical verification yields considerably better results and a simpler analysis. As we show in Appendix A, the average cost to verify a match of the superimposed automaton is $O(m)$ when hierarchical verification is used, instead of the $O(rm)$ cost of plain verification. That is, the cost does not grow as the number of patterns increases.
Hence, the only limit that prevents us from superimposing all the $r$ patterns is that the matching probability becomes higher. That is, if $\alpha > 1 - e\sqrt{r/\sigma}$, then the matching probability is too high and we will spend too much time verifying almost all text positions. On the other hand, we can superimpose as much as we like before that limit is reached. This tells that the best $r'$ (which we call $r^*$) is the maximum one not reaching the limit, i.e.

$$ r^* = \frac{\sigma (1-\alpha)^2}{e^2} \qquad (1) $$

Since we partition in sets small enough to make the verifications not significant, the cost is simply $O\left(\frac{r}{r^*}\, n\right) = O\left(\frac{r n}{\sigma (1-\alpha)^2}\right)$.

This means that the algorithm is optimal for $r = O(\sigma)$ (taking the error level as a constant), or alternatively $\alpha \leq 1 - e\sqrt{r/\sigma}$. On the other hand, for $\alpha > 1 - e/\sqrt{\sigma}$, the cost is $O(rn)$, not better than the trivial solution (i.e. $r^* = 1$ and hence no superimposition occurs and the algorithm is not useful). Figure 11 illustrates.

Figure 11: Behavior of superimposed automata. On the left, the cost increases linearly with $r$, with slope depending on $\alpha$. On the right, the cost of a parallel search ($t_p$) approaches that of $r$ single searches ($t_s$) when $\alpha$ grows.

Automaton Partitioning: the analysis for this case is similar to the simple one, except that each step of the large automaton takes time proportional to the total number of subautomata, i.e. $O(k(m-k)/w)$. In fact, this is a worst case since on average not all cells are active, but we use the worst case because we superimpose all the patterns we can until the worst case of the search is almost reached. Therefore, the cost formula is

$$ \frac{e^2 r}{\sigma (1-\alpha)^2}\, \frac{k(m-k)}{w}\, n = O\left(\frac{\alpha m^2 r}{\sigma w (1-\alpha)}\, n\right) $$

where the right-hand side follows from $k = \alpha m$. This is optimal for $r = O(\sigma w)$ (for constant $\alpha$), or alternatively for $\alpha \leq 1 - e\sqrt{r/\sigma}$. It is useful for $\alpha \leq 1 - e/\sqrt{\sigma}$.

Pattern Partitioning: we have now $jr$ patterns to search with $\lfloor k/j \rfloor$ errors. The error level is the same for subproblems (recall that the subpatterns are of length $m/j$).

To determine which piece matched from the superimposed group, we pay $O(m)$ independently of the number of pieces superimposed (thanks to the hierarchical verification). Hence the limit for our grouping is given by Eq. (1). In both the superimposed and in the single-pattern algorithm, we also pay to verify if the match of the piece is part of a complete match. As we show in [23], this cost is negligible for $\alpha < 1 - e/\sqrt{\sigma}$, which is less strict than the limit given by Eq. (1).

As we have $jr$ pieces to search, we need an analytical expression for $j$. Since $j$ is just large enough so that the subpatterns fit in a computer word, $j = (m-k)\, d(w,\alpha)$, for

$$ d(w,\alpha) = \frac{1 + \sqrt{1 + w\alpha/(1-\alpha)}}{w} $$

where $d(w,\alpha)$ can be shown to be $O(1/\sqrt{w})$ by maximizing it in terms of $\alpha$ (see [23]). Therefore, the complexity is

$$ \frac{j e^2 r}{\sigma (1-\alpha)^2}\, n = O\left(\frac{m r}{\sigma \sqrt{w} (1-\alpha)}\, n\right) $$

On the other hand, the search cost of the single-pattern algorithm is $O(jrn)$. With respect to the simple algorithm for short patterns, both costs have been multiplied by $j$, and therefore the limits for optimality and usefulness are the same.
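As a numeric illustration of the quantities used in this section, the following Python sketch evaluates the matching-probability base, the usefulness limit, the group size of Eq. (1) and the number of pieces j. It is a hedged sketch: the function names are hypothetical, the formulas are transcribed from the analysis above, and the constant `c` is a parameter so that e can be replaced by the experimentally observed 1.09.

```python
import math


def gamma(sigma, alpha):
    """Base of the matching probability O(gamma^m), for 0 < alpha < 1."""
    return 1.0 / (sigma ** (1 - alpha)
                  * alpha ** (2 * alpha)
                  * (1 - alpha) ** (2 * (1 - alpha)))


def usefulness_limit(sigma, c=math.e):
    """Error level 1 - c / sqrt(sigma) above which superimposition stops paying off."""
    return 1 - c / math.sqrt(sigma)


def best_group_size(sigma, alpha, c=math.e):
    """Eq. (1): r* = sigma (1 - alpha)^2 / c^2."""
    return sigma * (1 - alpha) ** 2 / c ** 2


def pieces_needed(m, k, w):
    """j = (m - k) d(w, alpha), rounded up, with alpha = k / m."""
    alpha = k / m
    d = (1 + math.sqrt(1 + w * alpha / (1 - alpha))) / w
    return math.ceil((m - k) * d)
```

For instance, with sigma = 15, m = 9, k = 1 and c = 1.09, `best_group_size(15, 1/9, 1.09)` is slightly below 10, so under these assumptions about ten patterns could be superimposed before the verifications dominate.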

If we compare the complexities of pattern versus automaton partitioning, we have that pattern partitioning is better for $k > \sqrt{w}$. This means that for constant $\alpha$ and increasing $m$, pattern partitioning is asymptotically better.

6.2 Partitioning into Exact Searching

In [6] we analyze this algorithm as follows. Except for verifications, the search time can be made $O(n)$ in the worst case by using an Aho-Corasick machine [1], and $O(\alpha n)$ in the best case if we use a multipattern Boyer-Moore algorithm. This is because we search pieces of length $m/(k+1) \approx 1/\alpha$.

We are interested in analyzing the cost of verifications. Since we cut the pattern in $k+1$ pieces, they are of length $\lfloor m/(k+1) \rfloor$ or $\lceil m/(k+1) \rceil$. The probability of each piece matching is at most $1/\sigma^{\lfloor m/(k+1) \rfloor}$. Hence, the probability of any piece matching is at most $(k+1)/\sigma^{\lfloor m/(k+1) \rfloor}$.

We can easily extend that analysis to the case of multiple search, since we have now $r(k+1)$ pieces of the same length. Hence, the probability of verifying is $r(k+1)/\sigma^{\lfloor m/(k+1) \rfloor}$. We check the matches using a classical algorithm such as dynamic programming. Note that in this case we know which pattern to verify, since we know which piece matched. As we show in [23], the total verification cost if the pieces are of length $\ell$ is $O(\ell^2)$ (in our case, $\ell = m/(k+1)$). Hence, the search cost is

$$ O\left(\left(1 + \frac{r m}{\alpha \sigma^{1/\alpha}}\right) n\right) $$

where the "1" must be changed to "$\alpha$" if we consider the best case.

We consider optimality and usefulness now. An optimal algorithm should pay $O(n)$ total search time, which holds for

$$ \alpha < \frac{1}{\log_\sigma(rm) + \log_\sigma(1/\alpha)} = \frac{1}{\log_\sigma(rm) + \Theta(\log_\sigma \log_\sigma(rm))} $$

The algorithm is always useful, since it searches at the same cost independently of the number of patterns, and the number of verifications triggered is exactly the same as if we searched each pattern separately. However, if $\alpha > 1/(\log_\sigma m + \Theta(\log_\sigma \log_\sigma m))$, then both algorithms (single and multipattern) work as much as dynamic programming and hence the multipattern search is not useful. The other case when the algorithm could not be useful is when the shifts of a Boyer-Moore search are shortened by having many patterns up to the point where it is better to perform separate searches. This never happens in practice.
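The filter analyzed above can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the implementation evaluated in this paper: the pieces are located one by one with `str.find` instead of a single multipattern Sunday-style scan over a trie, each candidate area is verified with plain dynamic programming over the whole pattern rather than hierarchically, and the names `split_pieces`, `ends_within_k` and `search` are hypothetical.

```python
def split_pieces(pattern, k):
    """Cut the pattern into k + 1 pieces of similar length: (offset, piece) pairs."""
    m = len(pattern)
    bounds = [i * m // (k + 1) for i in range(k + 2)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(k + 1)]


def ends_within_k(pattern, area, k):
    """Classical dynamic programming: positions of `area` where a match
    of `pattern` with at most k errors ends."""
    m = len(pattern)
    col = list(range(m + 1))      # col[i]: errors to match pattern[:i] ending here
    ends = []
    for pos, c in enumerate(area):
        prev_diag, col[0] = col[0], 0   # a match may start at any position
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                         # extra text character
                      col[i - 1] + 1,                     # missing pattern character
                      prev_diag + (pattern[i - 1] != c))  # match or substitution
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(pos)
    return ends


def search(patterns, text, k):
    """Sorted (pattern index, end position) pairs of occurrences with <= k errors."""
    owners = {}                   # piece -> [(pattern index, offset in the pattern)]
    for idx, p in enumerate(patterns):
        assert len(p) > k, "every piece must be non-empty"
        for off, piece in split_pieces(p, k):
            owners.setdefault(piece, []).append((idx, off))
    found = set()
    for piece, refs in owners.items():
        start = text.find(piece)
        while start != -1:        # every exact occurrence of a piece is a candidate
            for idx, off in refs:
                m = len(patterns[idx])
                lo = max(0, start - off - k)              # candidate area around it
                hi = min(len(text), start - off + m + k)
                for e in ends_within_k(patterns[idx], text[lo:hi], k):
                    found.add((idx, lo + e))
            start = text.find(piece, start + 1)
    return sorted(found)
```

Because the piece that matched identifies its pattern, only that pattern is verified in each candidate area, which is the property the analysis relies on. The result coincides with running `ends_within_k` for every pattern over the whole text, but the dynamic programming is restricted to the neighborhoods of piece occurrences.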
6.3 Counting

If the number of verifications is negligible, each pass of the algorithm is $O(n)$. In the case of multiple patterns, only $O(w/\log m)$ patterns can be packed in a single search, so the cost to search $r$ patterns is $O(r n \log(m)/w)$.

The difficult part of the analysis is the maximum error level $\alpha$ that the filtration scheme can tolerate while keeping the number of verifications low. We assume that we use dynamic programming to verify potential matches. Consider the probability of verifying. If it is at most $\log(m)/(w m^2)$, the algorithm keeps linear (i.e. optimal) on average. The algorithm is always useful since the number of verifications triggered with the multipattern search is the same as for the single-pattern version.

However, if the probability of verifying is at least $1/m$, both algorithms work $O(rmn)$ as for dynamic programming and hence the filter is not useful.

We derive in Appendix B a pessimistic bound for the limit of optimality and usefulness, namely $\alpha < e^{-m/\sigma}$ (Eq. (5)). Hence, as $m$ grows, we can tolerate smaller error levels. This limit holds for any condition of the type $O(1/m^c)$ on the probability of verifying, independently of the constant $c$. In our case, we need $c = 2$ for optimality and $c = 1$ for usefulness.

7 Experimental Results

We experimentally study our algorithms and compare them against previous work. We tested with [...] megabytes of lower-case English text. The patterns were randomly selected from the same text. We use a Sun UltraSparc-1 running Solaris 2.5.1, with 64 megabytes of RAM, and $w = 32$. Each data point was obtained by averaging the Unix user time over [...] trials. We present all the times in tenths of seconds per megabyte.

We do not present results on random text to avoid an excessively lengthy exposition. In general, all the filters improve as the alphabet size $\sigma$ grows. Lower-case English text behaves approximately as random text with $\sigma = 15$, which is the inverse of the probability that two random letters are equal.

Figure 12 (left) compares the plain and hierarchical verification methods against a sequential application of the $r$ searches, for the case of superimposed automata when the automaton fits in a computer word. We show the cases of increasing $r$ and of increasing $k$. It is clear that hierarchical verification outperforms plain verification in all cases. Moreover, the analysis for hierarchical verification is confirmed, since the maximum $r'$ up to where the cost of the parallel algorithm does not grow linearly is very close to $r^* = \sigma (1-\alpha)^2 / 1.09^2$. On the other hand, the algorithm with simple verification degrades sooner, since the verification cost grows with $r'$.

The mentioned maximum value $r^*$ is the point where the parallelism ratio is maximized. That is, if we have to search for $2r^*$ patterns, it is better to split them in two groups of size $r^*$ and search each group sequentially. To stress this point, Figure 12 (right) shows the quotient between the parallel and the sequential algorithms, where the optimum is clear for superimposed automata.
On the other hand, the parallelism ratio of exact partitioning keeps improving as $r$ grows, as predicted by the analysis (there is an optimum for large $m$, related to the Sunday shifts, but it still does not justify splitting a search in two).

When we compare our algorithms against the others, we consider only hierarchical verification and use this $r^*$ value to obtain the optimal grouping for the superimposed automata algorithms. The exact partitioning, on the other hand, performs all the searches in a single pass. In counting, it is clear that the speedup is optimal and we pack as many patterns as we can in a single search.

Notice that the plots which depend on $r$ show the point where $r^*$ should be selected. Those which depend on $k$ (for fixed $r$), on the other hand, just show how the parallelization works as the error level increases, which cannot be controlled by the algorithm.

[Figure 12 plots. Legend: Sequential NFA; Superimposed, plain verif.; Exact Partitioning; Superimposed, hierarchical verif.]

Figure 12: Comparison of sequential and multipattern algorithms for $m = 9$. The rows correspond to $k = 1$, $k = 3$ and $r = 5$, respectively. The left plots show search time and the right plots show the ratio between the parallel ($t_p$) and the sequential time ($t_s$).

We compare now our algorithms among them and against others. We begin with short patterns whose NFA fits in a computer word. Figure 13 shows the results for increasing $r$ and for increasing $k$. For low and moderate error levels, exact partitioning is the fastest algorithm. In particular, it is faster than previous work [17] when the number of patterns is below 50 (for English text). When the error level increases, superimposed automata is the best choice. This agrees with the analysis.

[Figure 13 plots. Legend: Exact Partitioning; Superimposed Automata; Counting; Muth & Manber ($k = 1$).]

Figure 13: Comparison among algorithms for $m = 9$. The top plots show increasing $r$ for $k = 1$ and $k = 3$. The bottom plots show increasing $k$ for $r = 8$ and $r = 16$.

We consider longer patterns now ($m = 30$). Figure 14 shows the results for increasing $r$ and for increasing $k$. As before, exact partitioning is the best where it can be applied, and improves over previous work [17] for $r$ up to 90–100. For these longer patterns the superimposed automata technique also degrades, and only rarely is it able to improve over exact partitioning. In most cases it only begins to be the best when it (and all the others) are no longer useful.

Figure 15 summarizes some of our experimental results, becoming a practical version of the theoretical Figure 10. The main differences are that exact partitioning is better in practice than what its complexity suggests, and that there is no clear winner between pattern and automaton partitioning.

[Figure 14 plots. Legend: Exact Partitioning; Pattern Partitioning; Automaton Partitioning; Counting; Muth & Manber ($k = 1$).]

Figure 14: Comparison among algorithms for $m = 30$. The top plots show, for increasing $r$, $k = 1$ and $k = 4$. The bottom plots show, for increasing $k$, $r = 8$ and $r = 16$. Pattern partitioning is not run for $k = 1$ because it would resort to exact partitioning.

[Figure 15 shows two plots with $\alpha$ on the vertical axis. Regions: NONE USEFUL, Superimposed Automata, Partitioning into Exact Search, Muth-Manber.]

Figure 15: The areas where each algorithm is better, in practice, on English text. In the right plot we assumed $m = 9$. Compare with Figure 10.

8 Conclusions

We have presented a number of different filtering algorithms for multipattern approximate searching. These are the only algorithms that allow an arbitrary number of errors. On the other hand, the only previous work allows just one error and we have outperformed it when the number of patterns to search is below 50–100 on English text, depending on the pattern length. We have explained, analyzed and experimentally tested our algorithms. We have also presented a map of the best algorithms for each case.

Many of the ideas we propose here can be used to adapt other single-pattern approximate searching algorithms to the case of multipattern searching. For instance, the idea of superimposing automata can be adapted to most bit-parallel algorithms, such as [19]. Another fruitful idea is that of exact partitioning, where a multipattern exact search is easily adapted to search the pieces of many patterns. There are many other filtering algorithms of the same type, e.g. [28]. On the other hand, other exact multipattern search algorithms may be better suited to other search parameters (e.g. working better on many patterns).

A number of practical optimizations to our algorithms are possible, for instance:

- If the patterns have different lengths, we truncate them to the shortest one when superimposing automata. We can select cleverly the substrings to use, since having the same character at the same position in two patterns improves the filtering mechanism.
- We used simple heuristics to group subpatterns in superimposed automata. These can be improved to maximize common letters too. A more general technique could group patterns which are similar in terms of the number of errors needed to convert one into the other (i.e. a clustering technique).
- We are free to partition each pattern in $k+1$ pieces as we like in exact partitioning. This is used in [24] to minimize the expected number of verifications when the letters of the alphabet do not have the same probability of occurrence (e.g. in English text). An $O(m^3)$ dynamic programming algorithm is presented there to select the best partition, and this could be applied to multipattern search.

Acknowledgements

We thank Robert Muth and Udi Manber for their implementation of [17]. We also thank the anonymous referees for their detailed comments that improved this work.

References

[1] A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. CACM, 18(6):333–340, June 1975.
[2] R. Baeza-Yates. Efficient Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Waterloo, May 1989. Also as Research Report CS-89-17.
[3] R. Baeza-Yates. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress, volume I, pages 465–476. Elsevier Science, September 1992.
[4] R. Baeza-Yates and G. Gonnet. A new approach to text searching. CACM, 35(10):74–82, October 1992.
[5] R. Baeza-Yates and G. Navarro. Multiple approximate string matching. In Proc. WADS'97, LNCS 1272, pages 174–184, 1997.
[6] R. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127–158, 1999.
[7] R. Baeza-Yates and C. Perleberg. Fast and practical approximate pattern matching. In Proc. CPM'92, pages 185–192, 1992. LNCS 644.
[8] R. S. Boyer and J. S. Moore. A fast string searching algorithm. CACM, 20(10):762–772, 1977.
[9] W. Chang and J. Lampe. Theoretical and empirical comparisons of approximate string matching algorithms. In Proc. CPM'92, pages 172–181, 1992. LNCS 644.
[10] W. Chang and E. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327–344, 1994.
[11] Z. Galil and K. Park. An improved algorithm for approximate string matching. SIAM J. of Computing, 19(6):989–999, 1990.
[12] D. Greene, M. Parnas, and F. Yao. Multi-index hashing for information retrieval. In Proc. FOCS'94, pages 722–731, 1994.
[13] R. Grossi and F. Luccio. Simple and efficient string matching with k mismatches. Information Processing Letters, 33(3):113–120, 1989.
[14] P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison of approximate string matching algorithms. Software Practice and Experience, 26(12):1439–1458, 1996.

[15] D. E. Knuth, J. H. Morris, Jr, and V. R. Pratt. Fast pattern matching in strings. SIAM J. on Computing, 6(1):323–350, 1977.
[16] G. Landau and U. Vishkin. Fast parallel and serial approximate string matching. J. of Algorithms, 10:157–169, 1989.
[17] R. Muth and U. Manber. Approximate multiple string search. In Proc. CPM'96, LNCS 1075, pages 75–86, 1996.
[18] E. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4/5):345–374, 1994.
[19] G. Myers. A fast bit-vector algorithm for approximate pattern matching based on dynamic programming. In Proc. CPM'98, LNCS 1448, pages 1–13, 1998.
[20] G. Navarro. Multiple approximate string matching by counting. In Proc. WSP'97, pages 125–139. Carleton University Press, 1997.
[21] G. Navarro. Approximate Text Searching. PhD thesis, Department of Computer Science, University of Chile, December 1998. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz. Also as Tech. Report TR/DCC-98-14.
[22] G. Navarro. A guided tour to approximate string matching. Technical Report TR/DCC-99-5, Dept. of Computer Science, Univ. of Chile, July 1999. Submitted. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survasm.ps.gz.
[23] G. Navarro and R. Baeza-Yates. Improving an algorithm for approximate pattern matching. Technical Report TR/DCC-98-5, Dept. of Computer Science, Univ. of Chile, 1998. Submitted. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/dexp.ps.gz.
[24] G. Navarro and R. Baeza-Yates. Very fast and simple approximate string matching. Information Processing Letters, 1999. To appear. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/hpexact.ps.gz.
[25] G. Navarro and M. Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. In Proc. CPM'98, LNCS 1448, pages 14–33, 1998.
[26] P. Sellers. The theory and computation of evolutionary distances: pattern recognition. J. of Algorithms, 1:359–373, 1980.
[27] D. Sunday. A very fast substring search algorithm. CACM, 33(8):132–142, August 1990.
[28] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In Proc. ESA'95, 1995. LNCS 979.
[29] J. Tarhio and E. Ukkonen. Approximate Boyer-Moore string matching. SIAM Journal on Computing, 22(2):243–260, 1993.