A Metric Index for Approximate String Matching

Size: px

Start display at page:

Download "A Metric Index for Approximate String Matching"

Allen Lester
6 years ago
Views:

1 A Metri Index for Approximate String Mathing Edgar Chávez 1 and Gonzalo Navarro 2 1 Esuela de Cienias Físio-Matemátias, Universidad Mihoaana. Edifiio B, Ciudad Universitaria, Morelia, Mih. Méxio elhavez@fismat.umih.mx 2 Depto. de Cienias de la Computaión, Universidad de Chile. Blano Enalada 2120, Santiago, Chile. gnavarro@d.uhile.l Abstrat. We present a radially new indexing approah for approximate string mathing. The sheme uses the metri properties of the edit distane and an be applied to any other metri between strings. We build a metri spae where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metri spae. This permits us finding the R ourrenes of a pattern of length m in a text of length n in average time O(m log 2 n+m 2 +R), using O(n log n) spaeando(n log 2 n) index onstrution time. This omplexity improves by far over all other previous methods. We also show a simpler sheme needing O(n) spae. 1 Introdution and Related Work Indexing text to permit effiient approximate searhing on it is one of the main open problems in ombinatorial pattern mathing. The approximate string mathing problem is: Given a long text T of length n, a (omparatively short) pattern P of length m, and a threshold r, retrieve all the pattern ourrenes, that is, text substrings whose edit distane to the pattern is at most r. Theedit distane between two strings is defined as the minimum number of harater insertions, deletions and substitutions needed to make them equal. This distane is used in many appliations, but several other distanes are of interest. In the on-line version of the problem, the pattern an be preproessed but the text annot. There are numerous solutions to this problem [25], but none is aeptable when the text is too long sine the searh time is proportional to the text length. Indexing text for approximate string mathing has reeived attention only reently. Despite some progress in the last deade, the indexing shemes for this problem are still rather immature. There exist some indexing shemes speialized to word-wise searhing on natural language text [21,3]. These indexes perform quite well in that ase but Supported by CYTED VII.19 RIBIDI projet (both authors), CONACyT grant (first author), and Fondeyt grant (seond author). On leave of absene at Universidad de Chile.

2 Edgar Chávez and Gonzalo Navarro they annot be extended to handle the general ase. Extremely important appliations suh as DNA, proteins, musi or oriental languages fall outside this ase. The indexes that solve the general problem an be divided into three lasses. Baktraking [17,34,11,15] uses the suffix tree [2], suffix array [20] or DAWG [12] of the text in order to fator out its repetitions. A sequential algorithm on the text is simulated by baktraking on the data struture. These algorithms take time exponential on m or r but in many ases independent of n, the text size. This makes them attrative when searhing for very short patterns. Partitioning [31,30,5] partitions the pattern into piees to ensure that some of the piees must appear without alterations inside every ourrene. An index able of exat searhing is used to detet the piees and the text areas that have enough evidene of ontaining an ourrene are heked with a sequential algorithm. These algorithms work well only when r/m is small. The third lass [24,6] is a hybrid between the other two. The pattern is divided into large piees that an still ontain (less) errors, they are searhed for using baktraking, and the potential text ourrenes are heked as in the partitioning methods. The hybrid algorithms are more effetive beause they an find the right point between length of the piees to searh for and error level permitted. Using the appropriate partition of the pattern, these methods ahieve on average O(n λ ) searh time, for some 0 <λ<1 that depends on r. They tolerate moderate error ratios r/m. We propose in this paper a brand new approah to the problem. We take into aount that the edit distane satisfies the triangle inequality and hene it defines a metri spae on the set of text substrings. We an re-express the approximate searh problem as a range searh problem on this metri spae. This approah has been attempted before [8,4], but in those ases the partiularities of the problem made it possible to index O(n) elements. In the general ase we have O(n 2 )textsubstrings. The main ontribution of this paper is to devise a method (based on the suffix tree of the text) to meaningfully ollapse the O(n 2 ) text substring into O(n) sets, and to find a way to build a metri spae out of those sets. The result is an indexing method that, at the ost of requiring on average O(n log n) spaeand O(n log 2 n) onstrution time, permits finding the R approximate ourrenes of the pattern in O(m log 2 n + m 2 + R) average time. This is a omplexity breakthrough over previous work, and it is easier than in other approahes to extend the idea to other distane funtions suh as reversals. Moreover, it represents an original approah to the problem that opens a vast number of possibilities for improvements. We onsider also a simpler version of the index needing O(n) spae and that, despite not involving a omplexity breakthrough, promises to be better in pratie. We use the following notation in the paper. Given a string s Σ we denote its length as s. Wealsodenotes i the i-th harater of s, for an integer i {1.. s }. Wedenotes i...j = s i s i+1...s j (whih is the empty string if i>j)and

3 AMetri Index for Approximate String Mathing s i... = s i... s. The empty string is denoted as ε. Astringx is said to be a prefix of xy, asuffix of yx and a substring of yxz. 2 Metri Spaes We desribe in this setion some onepts related to searhing metri spaes. We have onentrated only in the part that is relevant for this paper. There exist reent surveys if more omplete information is desired [10]. A metri spae is, informally, a set of blak-box objets and a distane funtion defined among them, whih satisfies the triangle inequality. The problem of proximity searhing in metri spaes onsists of indexing the set suh that later, given a query, all the elements of the set that are lose enough to the query an be quikly found. This has appliations in a vast number of fields, suh as non-traditional databases (where the onept of exat searh is of no use and we searh for similar objets, e.g. databases storing images, fingerprints or audio lips); mahine learning and lassifiation (where a new element must be lassified aording to its losest existing element); image quantization and ompression (where only some vetors an be represented and those that annot must be oded as their losest representable point); text retrieval (where we look for douments that are similar to a given query or doument); omputational biology (where we want to find a DNA or protein sequene in a database allowing some errors due to typial variations); funtion predition (where we want to searh for the most similar behavior of a funtion in the past so as to predit its probable future behavior); et. Formally, a metri spae is a pair (X,d), where X is a universe of objets and d : X X R + is a distane funtion defined on it that returns nonnegative values. This distane satisfies the properties of reflexivity (d(x, x) =0), strit positiveness (x y d(x, y) > 0), symmetry (d(x, y) =d(y, x)) and triangle inequality (d(x, y) d(x, z)+ d(z, y)). A finite subset U of X, ofsizen = U, is the set of objets we searh. Among the many queries of interest on a metri spae, we are interested in the so-alled range queries: Givenaqueryq X and a tolerane radius r, find the set of all elements in U that are at distane at most r to q. Formally, the outome of the query is (q, r) d = {u U, d(q, u) r}. The goal is to preproess the set so as to minimize the omputational ost of produing the answer (q, r) d. From the plethora of existing algorithms to index metri spaes, we fous on the so-alled pivot-based ones, whih are built on a single general idea: Selet k elements {p 1,...,p k } from U (alled pivots), and identify eah element u U with a k-dimensional point (d(u, p 1 ),...,d(u, p k )) (i.e. its distanes to the pivots). The index is basially the set of kn oordinates. At query time, map q to the k-dimensional point (d(q, p 1 ),...,d(q, p k )). With this information at hand, we an filter out using the triangle inequality any element u suh that d(q, p i ) d(u, p i ) >rfor some pivot p i, sine in that ase we know that d(q, u) > r without need to evaluate d(u, q). Those elements that annot be filtered out using this rule are diretly ompared against q.

4 Edgar Chávez and Gonzalo Navarro An interesting feature of pivot-based algorithms is that they an redue the number of final distane evaluations by inreasing the number of pivots. Define D k (x, y) =max 1 j k d(x, p j ) d(y, p j ). Usingthepivotsp 1,..., p k is equivalent to disarding elements u suh that D k (q, u) > r. As more pivots are added we need to perform more distane evaluations (exatly k) to ompute D k (q, ) (these are alled internal evaluations), but on the other hand D k (q, ) inreases its value and hene it has a higher hane of filtering out more elements (those omparisons against elements that annot be filtered out are alled external). It follows that there exists an optimum k. If one is not only interested in the number of distane evaluations performed but also in the total pu time required, then sanning all the n elements to filter out some of them may be unaeptable. In that ase, one needs multidimensional range searh methods, whih inlude data strutures suh as the kd-tree, R- tree, X-tree, et. [36,14]. Those strutures permit indexing a set of objets in k-dimensional spae in order to proess range queries. In this paper we are interested in a metri spae where the universe is the set of strings over some alphabet, i.e. X = Σ, and the distane funtion is the so-alled edit distane or Levenshtein distane. This is defined as the minimum number of harater insertions, deletions and substitutions neessary to make two strings equal [19,25]. The edit distane, and in fat any other distane defined as the best way to onvert one element into the other, is reflexive, stritly positive (as long as there are no zero-ost operations), symmetri (as long as the operations allowed are symmetri), and satisfies the triangle inequality. The algorithm to ompute the edit distane ed() is based on dynami programming. Imagine that we need to ompute ed(x, y). A matrix C 0.. x,0.. y is filled, where C i,j = ed(x 1..i,y 1..j ), so C x, y = ed(x, y). This is omputed as C i,0 = i, C 0,j = j, C i,j = if (x i = y j )thenc i 1,j 1 else 1 + min(c i 1,j,C i,j 1,C i 1,j 1 ) The algorithm takes O( x y ) time. The matrix an be filled olumn-wise or row-wise (there are more sophistiated ways as well). For reasons that will be made lear later, we prefer the row-wise filling. The spae required is only O( y ), sine only the previous row must be stored in order to ompute the new one, and therefore we just keep one row and update it. 3 Text Indexing Suffix trees are widely used data strutures for text proessing [2,1]. Any position i in a text T defines a suffix of T,namelyT i....asuffix trie is a trie data struture built over all the suffixes of T. At the leaf nodes the pointers to the suffixes are stored. Every substring of T an be found by traversing a path from the root. Roughly speaking, eah suffix trie leaf represents a suffix and eah internal node represents a different substring of T. To improve spae utilization, this trie is ompated into a Patriia tree [23] by ompressing unary paths. The edges that replae a ompressed path store

5 AMetri Index for Approximate String Mathing the whole string that they represent (via two pointers to their initial and final text position). One unary paths are not present the trie, now alled suffix tree, has O(n) nodes instead of the worst-ase O(n 2 ) of the trie. The suffix tree an be diretly built in O(n) time [22,35]. Any algorithm on a suffix trie an be simulated at the same ost in the suffix tree. We all expliit those suffix trie nodes that survive in the suffix tree, and impliit those that are ollapsed. Figure 1 shows the suffix trie and tree of the text "abraadabra". Note that a speial endmarker "$", smaller than any other harater, is appended to the text so that all the suffixes are external nodes a b r a a d a b r a Text 0 r a 8 d 7 5 b 5 r 9 3 $ 10 a 6 7 Suffix Trie 9 $ $ 3 10 ra d 7 5 bra $ Suffix Tree a 1 d 6 b $ r 3 a 4 $ 1 8 a 1 d 6 bra $ 11 4 $ Suffix Array Fig. 1. The suffix trie, suffix tree and suffix array of the text "abraadabra". The figure shows the internal nodes of the trie (numbered 0 to 9 in italis inside irles), whih represent text substrings that appear more than one, and the external nodes (numbered 1 to 11 inside squares), whih represent text substrings that appear just one. Those leaves do not only represent the unique substrings but all their extensions until the full suffix. In the suffix tree, only some internal nodes are left, and they represent the same substring as before plus the prefixes that may have been ollapsed. For example the internal node (7) of the suffix tree represents now the ompressed nodes (5) and (6), and hene the strings "b", "br" and "bra". The external node (1) represents "abra", but also "abraa", "abraad", et. until the full suffix "abraadabra".

6 Edgar Chávez and Gonzalo Navarro Finally, the suffix array [20] is a more ompat version of the suffix tree, whih requires muh less spae and poses a small penalty over the searh time. If the leaves of the suffix tree are traversed in left-to-right order, all the suffixes of the text are retrieved in lexiographial order. A suffix array is simply an array ontaining all the pointers to the text suffixes listed in lexiographial order, as shown in Figure 1. The suffix array stores one pointer per text position. The suffix array an be diretly built (without building the suffix tree) in O(n log n) worst ase time and O(n log log n) average time [20]. While suffix trees are searhed as tries, suffix arrays are binary searhed. However, almost every algorithm on suffix trees an be adapted to work on suffix arrays at an O(log n) penalty fator in the time ost. This is beause eah subtree of the suffix tree orresponds to an interval in the suffix array, namely the one ontaining all the leaves of the subtree. To follow an edge of the suffix trie, we use binary searh to find the new limits in the suffix array. For example, the internal node (7) in the suffix tree orresponds to the interval 6, 7 in the suffix array. Note that impliit nodes have the same interval than their representing expliit node. 4 Our Algorithm 4.1 Indexing A straightforward approah to text indexing for approximate string mathing using metri spaes tehniques has the problem that, in priniple, there are O(n 2 ) different substrings in a text, and therefore we should index O(n 2 ) objets, whih is unaeptable. The suffix tree provides a onise representation of all the substrings of a text in O(n) spae. So instead of indexing all the text substrings, we index only the (expliit) suffix tree nodes. Therefore, we have O(n) objets to be indexed as a metri spae under the edit distane. Now, eah expliit internal node represents itself and the nodes that desend to it by a unary path. Hene, eah expliit node that orresponds to a string xy and its parent orresponds to the string x represents the following set of strings x[y] = {xy 1,xy 1 y 2,..., xy} where x[y] is a notation we have just introdued. For example, the internal node (4) in Figure 1 represents the strings "a[bra]" = { ab, abr, abra }. The leaves of the suffix tree represent a unique text substring and all its extensions until the full text suffix is obtained. Hene, if T = zxy and x is a unique text substring (whose prefixes are not unique), then the orresponding suffix tree node is an expliit leaf, whih for us represents the set {x} x[y]. Table 1 shows the substrings represented by eah node in our running example. Note that the external nodes that desend by the terminator harater "$", i.e. e(8 11), represent a substring that is also represented at its parent and hene it an be disregarded.

7 AMetri Index for Approximate String Mathing Node Suffix trie Suffix tree Node Suffix trie/tree i(0) ε ε e(1) abra[adabra] i(1) a [a] e(2) bra[adabra] i(2) ab e(3) ra[adabra] i(3) abr e(4) a[adabra] i(4) abra a[bra] e(5) [adabra] i(5) b e(6) a[dabra] i(6) br e(7) [dabra] i(7) bra [bra] e(8) abra i(8) r e(9) bra i(9) ra [ra] e(10) ra e(11) a Table 1. The text substrings represented by eah node of the suffix trie and tree of Figure 1. Internal nodes are represented as i(x) andexternalsase(x). Hene, instead of indexing all the O(n 2 ) text substrings, we index O(n) sets of strings, whih are the sets represented by the expliit internal and the external nodes of the suffix tree. In our example, this set is U = {ε, [a], a[bra], [bra], [ra], abra[adabra], bra[adabra], ra[adabra], a[adabra], [adabra], a[dabra], [dabra]}. We have now to deide how to index this metri spae formed by O(n) sets of strings. Many options are possible, but we have onentrated on a pivot based approah. We selet at random k different text substrings that will be our pivots. For reasons that are made lear later, we hoose to selet pivots of lengths 0, 1, 2,, k 1. For eah expliit suffix tree node x[y] and eah pivot p i,we ompute the distane between p i and all the strings represented by x[y]. From the set of distanes from a node x[y] top i, we store the minimum and maximum ones. Sine all these strings are of the form {xy 1...y j, 1 j y }, alltheedit distanes an be omputed in O( p i xy ) time. Following our example, let us assume that we have seleted k =5pivots p 0 = "", p 1 = "a", p 2 = "br", p 3 = "ad" and p 4 = "raa". Figure 2 (left) shows the omputation of the edit distanes between i(4) = "a[bra]" and p 3 = "ad". The result shows that the minimum and maximum values of this node with respet to this pivot are 2 and 4, respetively. In the ase of external suffix tree nodes, the string y tends to be quite long (O(n) length on average), whih yields a very high omputation time for all the edit distanes and anyway a very large value for the maximum edit distane (note that ed(p i,xy) xy p i ). We solve this by pessimistially assuming that the maximum distane is n when the suffix tree node is external. The minimum edit distane an be found in O( p i max( p i, x )) time, beause it is not neessary to onsider arbitrarily long strings xy 1...y j : If we ompute the matrix row by row, then after having proessed x we have a minimum value seen up to now, v. Then there is no point in onsidering rows j suh that x + j p i >v. Hene we work until row j = v + p i x p i.

8 Edgar Chávez and Gonzalo Navarro a d a b r a a d abra a d a b r a Fig. 2. The dynami programming matrix to ompute the edit distane between "ad" and "a[bra]" (left) or "abra[adabra]" (right). The emphasized area is where the minima and maxima are taken from. Figure 2 (right) illustrates this ase with e(1) = "abra[adabra]" and the same p 4 = "ad". Note that to ompute the new set of edit distanes we have started from i(4), whih is the parent node of e(1) in the suffix tree. This an always be done in a depth first traversal of the suffix tree and saves onstrution time. Note also that it is not neessary to ompute the last 4 rows, sine they measure the edit distane between strings of length 8 or more against one of length 3. The distane annot be smaller than 5 and we have found at that point a minimum equal to 4. In fat we just assume that the maximum is 11, so the minimum and maximum value for this external node and this pivot are 4 and 11. In partiular, sine when indexing external nodes x[y] we always have ed(p i,x) already omputed, they an be indexed in O( p i 2 )time. One this is done for all the suffix tree nodes and all the pivots we have a set of k minimum and maximum values for eah expliit suffix tree node. This an be regarded as a hyperretangle in k dimensions: x[y] (min(ed(x[y],p 0 )),...,min(ed(x[y],p k 1 ))), (max(ed(x[y],p 0 )),...,max(ed(x[y],p k 1 ))) where we are sure that all the strings in x[y] lie inside the retangle. In our example, the minima and maxima for i(4) with respet to p 0 to p 4 are 2, 4, 1, 3, 1, 2, 2, 4 and 3, 3. Therefore i(4) is represented by the hyperretangle (2, 1, 1, 2, 3), (4, 3, 2, 4, 3). On the other hand, the ranges for e(1) are 5, 11, 4, 11, 3, 11, 4, 11 and 2, 11 and its hyperretangle is therefore (5, 4, 3, 4, 2), (11, 11, 11, 11, 11). 4.2 Searhing Let us now onsider a given query P searhed for with at most r errors. This is a range query with radius r inthemetrispaeofthesuffixtreenodes.as for pivot based algorithms, we ompare the pattern P against the k pivots and obtain a k-dimensional oordinate (ed(p, p 1 ),...,ed(p, p k )).

9 AMetri Index for Approximate String Mathing Let p i be a given pivot and x[y] a given node. If it holds that ed(p, p i )+r<min(ed(x[y],p i )) ed(p, p i ) r>max(ed(x[y],p i )) then, by the triangle inequality, we know that ed(p, xy ) >rfor any xy x[y]. The elimination an be done using any pivot p i. In fat, the nodes that are not eliminated are those whose retangle has nonempty intersetion with the retangle (ed(p, p 1 ) r,...,ed(p, p k ) r), (ed(p, p 1 )+r,...,ed(p, p k )+r). Figure 3 illustrates. The node ontains a set of points and we store its minimum and maximum distane to two pivots. These define a (2-dimensional) retangle where all the distanes from any substring of the node to the pivots lie. The query is a pattern P and a tolerane r, whih defines a irle around P. After taking the distanes from P to the pivots we reate a hyperube (a square in this ase) of width 2r + 1. If the square does not interset the retangle, then no substring in the node an be lose enough to P. p1 d min1 P d2 ed(node,p2) d2 max2 P 2r r max1 min2 max2 p2 min2 node node min1 Fig. 3. The elimination rule using two pivots. max1 d1 ed(node,p1) We have to solve the problem of finding all the k-dimensional retangles that interset a given query retangle. This is a lassial multidimensional range searh problem [36,14]. We ould for example use some variant of R-trees [16,7], whih would also yield a good data struture to work on seondary memory. Those nodes x[y] that annot be eliminated using any pivot must be diretly ompared against P. For those whose minimum distane to P is at most r, we report all their ourrenes, whose starting points are written in the leaves of the subtree rooted by the node that has mathed. In our running example, if we are searhing for "abr" with tolerane r =1,thennodei(4) qualifies, so we report the text positions in the orresponding tree leaves: 1 and 8. Observe that in order to ompare P against a given suffix tree node x[y], the edit distane algorithm fores us to ompare it against every prefix of x as well. Those prefixes orrespond to suffix tree nodes in the path from the root to x[y]. In order not to repeat work, we mark in the suffix tree the nodes that we have to ompare expliitly against P, and also mark every node in their path

10 Edgar Chávez and Gonzalo Navarro to the root. Then, we baktrak on the suffix tree entering every marked node and keeping trak of the edit distane between P and the node. The new row is omputed using the row of the parent, just as done with the pivots. This avoids reomputing the same prefixes for different suffix tree nodes, and inidentally is similar to the simplest baktraking approah [15], exept that in this ase we only follow marked paths. In this respet, our algorithm an be thought of as a preproessing to a baktraking algorithm, whih filters out some paths. As a pratial matter, note that this is the only step where the suffix tree is required. We an even print the text substrings that math the pattern without the help of the suffix tree, but we need it in order to report all their text positions. For this sake, a suffix array is muh heaper and does a better job (beause all the text positions are listed in a ontiguous interval). In fat, the suffix array an also replae the suffix tree at indexing time. 5 Analysis Let us first onsider the onstrution ost. The maximum length of a repeated text substring, or whih is the same, the average height of the suffix trie, is O(log n) [32,29]. Reall also that our pivots are of length O(k). Hene omputing all the kn minima and maxima takes time O(kn k log n) =O(k 2 n log n), where O(k log n) is the time to ompute eah distane. However, we start the omputation of the edit distanes of a node from the last row of its parent. This redues the average onstrution ost to O(k 2 n), sine for eah pivot we ompute one dynami programming row per suffix trie node, and on average the number of nodes in the suffix trie is O(n) [32,29].Then leaves are also omputed in O( p i 2 n)=o(k 2 n)time. The total spae required by the data struture is O(kn),sineweneedto store for eah expliit node a pointer to the suffix tree and its k oordinates. ThesuffixtreeitselftakesO(n) spae. It remains to determine the average searh time. A key element of the analysis is a onstant α, whih is the probability that, for a random hyperretangle of the set, along some fixed oordinate, the orresponding segment of the query hyperube intersets with the orresponding segment of the hyperretangle. Another way to put it is that, along that oordinate, the query point falls inside the hyperretangle projetion onto that oordinate after it is enlarged in r units along eah dimension. In operational terms, α is the probability that some given pivot (that orresponding to the seleted oordinate) does not permit disarding a given element. Note that α does not depend on k, onlyonr. The first part of the searh is the omputation of the edit distanes between the k pivots and the pattern P of length m. ThistakesO(k 2 m)time. The seond part is the searh for the retangles that interset the query retangle. Many analyses of the performane of R-trees exist in the literature [33,18,26,27,13]. Despite that most of them deal with the exat number of disk aesses, their abstrat result is that the expeted amount of work on the R-tree (and variants suh as the KDB-tree [28]) is O(nα k log n).

11 AMetri Index for Approximate String Mathing The third part, finally, is the diret hek of the pattern against the suffix tree nodes whose retangles interset the query retangle. Sine we disard using any of k random pivots, the probability of not disarding a node is α k.asthere are O(n) suffix tree nodes, we hek on average α k n nodes, with a total ost of O(α k n m 2 ). The m 2 is the ost to ompute the edit distane between a pattern of length m and a andidate whose length must be between m r and m + r. This is beause the pivot ε removes all shorter or longer andidates. At the end, we report the R results in O(R) time using a suffix tree traversal. Hene our total average ost is bounded by k 2 m + nα k log n + nα k m 2 + R for 0 α 1. This is optimized for k =log 1/α n + O(log log n) log 1/α n = Θ(log n). If we use log 1/α n pivots the searh ost beomes O(m log 2 n + m 2 + R) on average. Note that the influene of the searh radius r is embedded in α. This is muh better omplexity than all previous work, whih obtains O(mn λ )time for some 0 <λ<1. Moreover, muh of previous work requires m = Ω(log n) to obtain sublinearity, while our approah does not. The prie is in the onstrution time and spae, whih beome O(n log 2 n) and O(n log n), respetively. Espeially the latter an be prohibitive and we may have to ontent ourselves with a smaller k. There seems to be no good tradeoff between spae and time, e.g., to obtain O(n λ )timewealsoneedθ(log n) pivots. Most other indexes require O(n) spae and onstrution time. Finally, it is worth mentioning that, sine we automatially disard any internal node not in the length [m r, m + r] thanks to the pivot p 0 = ε, thereis a worst-ase limit σ m+r on the number of suffix tree nodes to onsider for the last phase. Although this limit is exponential on m and r, it is independent of n. Other indexing shemes based on the suffix tree share the same property. 6 Towards a Pratial Implementation Despite that we have obtained an important redution in time omplexity with respet to n and m, our result is hiding a multiplying fator that depends on the searh radius. It is possible that this onstant is too large (that is, α too lose to 1) and makes the whole approah useless. Also, the extra spae requirement (whih also inreases as α tends to 1) an be unmanageable. In this setion we onsider an alternative approah that is simpler and likely to obtain better results in pratie, despite not involving a omplexity breakthrough. 6.1 Indexing Only Suffixes A simpler index that derives from the same ideas of the paper onsiders only the n text suffixes and no internal nodes. Eah suffix [T j... ] represents all the text substrings starting at i, and it is indexed aording to the minimum distane between those substrings and eah pivot. The good point of the approah is redued spae. Not only the set U an have up to half the elements of the original approah, but also only k values (not 2k)

12 Edgar Chávez and Gonzalo Navarro are stored for eah element, sine all the maximum values are the same. This permits using up to four times the number of pivots of the previous approah at the same memory requirement. Note that we do not even need to build or store the suffix array: We just read the suffixes from the text and index them. Our only storage need is that of the metri index. The bad point is that the seletivity of the pivots is redued and some redundant work is done. The first is a onsequene of storing only minimum values, while the seond is a onsequene of not fatoring out repeated text substrings. That is, if some substring P of T is lose enough to P and it appears many times in T, we will have to hek all its ourrenes one by one. Without using a suffix tree struture, the onstrution of the index an be done in time O(k p i n) as follows. The algorithm depited in Setion 2 to ompute edit distane an be modified so as to make C 0,j =0,inwhihaseC i,j beomes the minimum edit distane between x 1...i and a suffix of y 1...j.Ifx is the reverse of p i and y the reverse of T,thenC pi,j will be the minimum edit distane between p i and a prefix of T n j+1..., whih is preisely min(ed(p i,t n j+1... )). So we need O( p i n) time per pivot. The spae to ompute this is just O( p i )by doing the omputation olumn-wise. 6.2 Using an Index for High Dimensions The spae of strings has a distane distribution that is rather onentrated around its mean µ. The same happens to the distanes between a pivot p i and suffixes [T j... ] or the pattern P. Sine we an only disard suffixes [T j... ] suh that ed(p i,p)+r < min(ed(p i, [T j... ])), only the suffixes with a large min(ed(p i, [T j... ])) value are likely to be disarded using p i. Storing all the other O(n) distanes to p i is likely to be a waste of spae. Moreover, we an use that memory to introdue more pivots. Figure 4 illustrates. The idea is to fix a number s and, for eah pivot p i, store only the s largest min(ed(p i, [T j... ])) values. Only those suffixes an be disarded using pivot p i. The spae of this index is O(ks) and its onstrution time is unhanged. We an still use an R-tree for the searh, although the retangles will over all the spae exept on s oordinates. The seletivity is likely to be similar sine we have disarded uninteresting oordinates, and we an tune number k versus seletivity s of the pivots for the same spae usage O(ks). One an go further to obtain O(n) spae as follows. Choose the first pivot and determine its s farthest suffixes. Store a list (in inreasing distane order) of those suffixes and their distane to the first pivot and remove them from further onsideration. Then hoose a seond pivot and find its s farthest suffixes from the remaining set. Continue until every suffix has been inluded in the list of some pivot. Note that every suffix appears exatly in one list. At searh time, ompare P against eah pivot p i,andifed(p, p i )+r is smaller than the smallest (first) distane in the list of p i, skip the whole list. Otherwise traverse the list until its end or until ed(p, p i )+r is smaller than the next element. Eah traversed suffix must be diretly ompared against P. A variant of this idea has proven extremely useful to deal with onentrated histograms [9]. It also permits

13 AMetri Index for Approximate String Mathing ed(p,p) +r ed(p,[t(j...)]) Fig. 4. The distane distribution to a pivot p, inluding that of pattern P.The grayed area represents the suffixes that an be disarded using p. effiient seondary storage implementation by paking the pivots in disk pages and storing the lists onseutively in the same order of the pivots. Sine we hoose k = n/s pivots, the onstrution time is high, namely O(n 2 p i /s). However, the spae is O(n), with a low onstant (lose to 5 in pratie) that makes it ompetitive against the most eonomial strutures for the problem. The searh time is O( p i mn/s) toomparep against the pivots, while the time to traverse the lists is diffiult to analyze. The pivots hosen must not be very short, beause their minimum distane to any [T j... ]isatmost p i. In fat, any pivot not longer than m + r is useless. 6.3 Using Speifi Strings Properties We an omplement the information given by the metri index with knowledge of the string properties we are indexing. For example, if suffix [T j... ]isproven to be at distane r + t from P, then we an also disard suffixes starting in the range j t +1...j+ t 1. Another idea is to ompute the edit distane between the reverse pivot and the reverse pattern. Although the result is the same, we learn also the distanes between the pivot and suffixes of the pattern. This an also be useful to disard suffixes at verifiation time: If d = ed(p 1...l,T i...i ) and we know from the index that ed(p l+1...,t i +1...) >r d, then a math is not possible. Other ideas, suh as hybrid algorithms that partition the pattern and searh for the piees permitting less errors [6], an be implemented over our metri index instead of over a suffix tree or array. Indeed, our data struture should ompete in the area of baktraking algorithms, as the others are orthogonal.

14 Edgar Chávez and Gonzalo Navarro 7 Conlusions We have presented a novel approah to the approximate string mathing problem. The idea is to give the set of text substrings the struture of a metri spae and then use an algorithm for range queries on metri spaes. The suffix tree is used as a oneptual devie to map the O(n 2 )textsubstringstoo(n) sets of strings. We show analytially that the searh omplexity is better than that obtained in previous work, at the prie of slightly higher spae usage. More preisely we an searh at an average ost of O(m log 2 n + m 2 + R) timeusing O(n log n) spae, while by using a suffix tree (the best known tehnique) one an searh in O(mn λ )timefor0 λ 1usingO(n) spae. Moreover, our tehnique an be extended to any other distane funtion among strings, some of whih, like the reversals distane, are problemati to handle with the previous approahes. The proposal opens a number of possibilities for future work. We plan to explore other methods to redue the number of substrings (we have used the suffix tree nodes and the suffixes), other metri spae indexing methods (we have used pivots), other multidimensional range searh tehniques (we have used R- trees), other pivot seletion tehniques (we took them at random), et. A more pratial setup needing O(n) spae has been desribed in Setion 6. Finally, the method promises an effiient implementation on seondary memory (e.g., with R-trees), whih is a weak point in most urrent approahes. Referenes 1. A. Apostolio. The myriad virtues of subword trees. In Combinatorial Algorithms on Words, NATO ISI Series, pages Springer-Verlag, A. Apostolio and Z. Galil. Combinatorial Algorithms on Words. Springer-Verlag, New York, R. Baeza-Yates and G. Navarro. Blok-addressing indies for approximate text retrieval. In Pro. ACM CIKM 97, pages 1 8, R. Baeza-Yates and G. Navarro. Fast approximate string mathing in a ditionary. In Pro. SPIRE 98, pages IEEE Computer Press, R. Baeza-Yates and G. Navarro. A pratial q-gram index for text retrieval allowing errors. CLEI Eletroni Journal, 1(2), R. Baeza-Yates and G. Navarro. A hybrid indexing method for approximate string mathing. J. of Disrete Algorithms (JDA), 1(1): , Speial issue on Mathing Patterns. 7. N. Bekmann, H. Kriegel, R. Shneider, and B. Seeger. The R -tree: an effiient and robust aess method for points and retangles. In Pro. ACM SIGMOD 90, pages , E. Bugnion, T. Roos, F. Shi, P. Widmayer, and F. Widmer. Approximate multiple string mathing using spatial indexes. In Pro. 1st South Amerian Workshop on String Proessing (WSP 93), pages 43 54, E. Chávez and G. Navarro. An effetive lustering algorithm to index high dimensional metri spaes. In Pro. SPIRE 2000, pages IEEE CS Press, E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searhing in metri spaes. ACM Comp. Surv., To appear.

15 AMetri Index for Approximate String Mathing 11. A. Cobbs. Fast approximate mathing using suffix trees. In Pro. CPM 95, pages 41 54, LNCS M. Crohemore. Transduers and repetitions. Theor. Comp. Si., 45:63 86, C. Faloutsos and I. Kamel. Beyond uniformity and independene: analysis of R- trees using the onept of fratal dimension. In Pro. ACM PODS 94, pages 4 13, V. Gaede and O. Günther. Multidimensional aess methods. ACM Comp. Surv., 30(2): , G. Gonnet. A tutorial introdution to Computational Biohemistry using Darwin. Tehnial report, Informatik E.T.H., Zuerih, Switzerland, A. Guttman. R-trees: a dynami index struture for spatial searhing. In Pro. ACM SIGMOD 84, pages 47 57, P. Jokinen and E. Ukkonen. Two algorithms for approximate string mathing in stati texts. In Pro. MFCS 91, volume 16, pages , I. Kamel and C. Faloutsos. On paking R-trees. In Pro. ACM CIKM 93, pages , V. Levenshtein. Binary odes apable of orreting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8 17, U. Manber and E. Myers. Suffix arrays: a new method for on-line string searhes. SIAM J. on Computing, pages , U. Manber and S. Wu. glimpse: A tool to searh through entire file systems. In Pro. USENIX Tehnial Conferene, pages 23 32, Winter E. MCreight. A spae-eonomial suffix tree onstrution algorithm. J. of the ACM, 23(2): , D. Morrison. PATRICIA pratial algorithm to retrieve information oded in alphanumeri. J. of the ACM, 15(4): , E. Myers. A sublinear algorithm for approximate keyword searhing. Algorithmia, 12(4/5): , Ot/Nov G. Navarro. A guided tour to approximate string mathing. ACM Comp. Surv., 33(1):31 88, B. Pagel, H. Six, H. Toben, and P. Widmayer. Towards an analysis of range queries. In Pro. ACM PODS 93, pages , D. Papadias, Y. Theodoridis, and E. Stefanakis. Multidimensional range queries with spatial relations. Geographial Systems, 4(4): , J. Robinson. The K-D-B-tree: a searh struture for large multidimensional dynami indexes. In Pro. ACM PODS 81, pages 10 18, R. Sedgewik and P. Flajolet. Analysis of Algorithms. Addison-Wesley, F. Shi. Fast approximate string mathing with q-bloks sequenes. In Pro. WSP 96, pages Carleton University Press, E. Sutinen and J. Tarhio. Filtration with q-samples in approximate string mathing. In Pro. CPM 96, LNCS 1075, pages 50 61, W. Szpankowski. Probabilisti analysis of generalized suffix trees. In Pro. CPM 92, LNCS 644, pages 1 14, Y. Thedoridis and T. Sellis. A model for the predition of R-tree performane. In Pro. ACM PODS 96, pages , E. Ukkonen. Approximate string mathing over suffix trees. In Pro. CPM 93, pages , E. Ukkonen. Construting suffix trees on-line in linear time. Algorithmia, 14(3): , D. White and R. Jain. Algorithms and strategies for similarity retrieval. Tehnial Report VCL , Visual Comp. Lab., Univ. of California, July 1996.

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fischer) January 21,

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fischer) January 21, Sequene Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fisher) January 21, 201511 9 Suffix Trees and Suffix Arrays This leture is based on the following soures, whih are all reommended