A Metric Index for Approximate String Matching

Size: px
Start display at page:

Download "A Metric Index for Approximate String Matching"

Transcription

1 A Metri Index for Approximate String Mathing Edgar Chávez 1 and Gonzalo Navarro 2 1 Esuela de Cienias Físio-Matemátias, Universidad Mihoaana. Edifiio B, Ciudad Universitaria, Morelia, Mih. Méxio elhavez@fismat.umih.mx 2 Depto. de Cienias de la Computaión, Universidad de Chile. Blano Enalada 2120, Santiago, Chile. gnavarro@d.uhile.l Abstrat. We present a radially new indexing approah for approximate string mathing. The sheme uses the metri properties of the edit distane and an be applied to any other metri between strings. We build a metri spae where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metri spae. This permits us finding the R ourrenes of a pattern of length m in a text of length n in average time O(m log 2 n+m 2 +R), using O(n log n) spaeando(n log 2 n) index onstrution time. This omplexity improves by far over all other previous methods. We also show a simpler sheme needing O(n) spae. 1 Introdution and Related Work Indexing text to permit effiient approximate searhing on it is one of the main open problems in ombinatorial pattern mathing. The approximate string mathing problem is: Given a long text T of length n, a (omparatively short) pattern P of length m, and a threshold r, retrieve all the pattern ourrenes, that is, text substrings whose edit distane to the pattern is at most r. Theedit distane between two strings is defined as the minimum number of harater insertions, deletions and substitutions needed to make them equal. This distane is used in many appliations, but several other distanes are of interest. In the on-line version of the problem, the pattern an be preproessed but the text annot. There are numerous solutions to this problem [25], but none is aeptable when the text is too long sine the searh time is proportional to the text length. Indexing text for approximate string mathing has reeived attention only reently. Despite some progress in the last deade, the indexing shemes for this problem are still rather immature. There exist some indexing shemes speialized to word-wise searhing on natural language text [21,3]. These indexes perform quite well in that ase but Supported by CYTED VII.19 RIBIDI projet (both authors), CONACyT grant (first author), and Fondeyt grant (seond author). On leave of absene at Universidad de Chile.

2 Edgar Chávez and Gonzalo Navarro they annot be extended to handle the general ase. Extremely important appliations suh as DNA, proteins, musi or oriental languages fall outside this ase. The indexes that solve the general problem an be divided into three lasses. Baktraking [17,34,11,15] uses the suffix tree [2], suffix array [20] or DAWG [12] of the text in order to fator out its repetitions. A sequential algorithm on the text is simulated by baktraking on the data struture. These algorithms take time exponential on m or r but in many ases independent of n, the text size. This makes them attrative when searhing for very short patterns. Partitioning [31,30,5] partitions the pattern into piees to ensure that some of the piees must appear without alterations inside every ourrene. An index able of exat searhing is used to detet the piees and the text areas that have enough evidene of ontaining an ourrene are heked with a sequential algorithm. These algorithms work well only when r/m is small. The third lass [24,6] is a hybrid between the other two. The pattern is divided into large piees that an still ontain (less) errors, they are searhed for using baktraking, and the potential text ourrenes are heked as in the partitioning methods. The hybrid algorithms are more effetive beause they an find the right point between length of the piees to searh for and error level permitted. Using the appropriate partition of the pattern, these methods ahieve on average O(n λ ) searh time, for some 0 <λ<1 that depends on r. They tolerate moderate error ratios r/m. We propose in this paper a brand new approah to the problem. We take into aount that the edit distane satisfies the triangle inequality and hene it defines a metri spae on the set of text substrings. We an re-express the approximate searh problem as a range searh problem on this metri spae. This approah has been attempted before [8,4], but in those ases the partiularities of the problem made it possible to index O(n) elements. In the general ase we have O(n 2 )textsubstrings. The main ontribution of this paper is to devise a method (based on the suffix tree of the text) to meaningfully ollapse the O(n 2 ) text substring into O(n) sets, and to find a way to build a metri spae out of those sets. The result is an indexing method that, at the ost of requiring on average O(n log n) spaeand O(n log 2 n) onstrution time, permits finding the R approximate ourrenes of the pattern in O(m log 2 n + m 2 + R) average time. This is a omplexity breakthrough over previous work, and it is easier than in other approahes to extend the idea to other distane funtions suh as reversals. Moreover, it represents an original approah to the problem that opens a vast number of possibilities for improvements. We onsider also a simpler version of the index needing O(n) spae and that, despite not involving a omplexity breakthrough, promises to be better in pratie. We use the following notation in the paper. Given a string s Σ we denote its length as s. Wealsodenotes i the i-th harater of s, for an integer i {1.. s }. Wedenotes i...j = s i s i+1...s j (whih is the empty string if i>j)and

3 AMetri Index for Approximate String Mathing s i... = s i... s. The empty string is denoted as ε. Astringx is said to be a prefix of xy, asuffix of yx and a substring of yxz. 2 Metri Spaes We desribe in this setion some onepts related to searhing metri spaes. We have onentrated only in the part that is relevant for this paper. There exist reent surveys if more omplete information is desired [10]. A metri spae is, informally, a set of blak-box objets and a distane funtion defined among them, whih satisfies the triangle inequality. The problem of proximity searhing in metri spaes onsists of indexing the set suh that later, given a query, all the elements of the set that are lose enough to the query an be quikly found. This has appliations in a vast number of fields, suh as non-traditional databases (where the onept of exat searh is of no use and we searh for similar objets, e.g. databases storing images, fingerprints or audio lips); mahine learning and lassifiation (where a new element must be lassified aording to its losest existing element); image quantization and ompression (where only some vetors an be represented and those that annot must be oded as their losest representable point); text retrieval (where we look for douments that are similar to a given query or doument); omputational biology (where we want to find a DNA or protein sequene in a database allowing some errors due to typial variations); funtion predition (where we want to searh for the most similar behavior of a funtion in the past so as to predit its probable future behavior); et. Formally, a metri spae is a pair (X,d), where X is a universe of objets and d : X X R + is a distane funtion defined on it that returns nonnegative values. This distane satisfies the properties of reflexivity (d(x, x) =0), strit positiveness (x y d(x, y) > 0), symmetry (d(x, y) =d(y, x)) and triangle inequality (d(x, y) d(x, z)+ d(z, y)). A finite subset U of X, ofsizen = U, is the set of objets we searh. Among the many queries of interest on a metri spae, we are interested in the so-alled range queries: Givenaqueryq X and a tolerane radius r, find the set of all elements in U that are at distane at most r to q. Formally, the outome of the query is (q, r) d = {u U, d(q, u) r}. The goal is to preproess the set so as to minimize the omputational ost of produing the answer (q, r) d. From the plethora of existing algorithms to index metri spaes, we fous on the so-alled pivot-based ones, whih are built on a single general idea: Selet k elements {p 1,...,p k } from U (alled pivots), and identify eah element u U with a k-dimensional point (d(u, p 1 ),...,d(u, p k )) (i.e. its distanes to the pivots). The index is basially the set of kn oordinates. At query time, map q to the k-dimensional point (d(q, p 1 ),...,d(q, p k )). With this information at hand, we an filter out using the triangle inequality any element u suh that d(q, p i ) d(u, p i ) >rfor some pivot p i, sine in that ase we know that d(q, u) > r without need to evaluate d(u, q). Those elements that annot be filtered out using this rule are diretly ompared against q.

4 Edgar Chávez and Gonzalo Navarro An interesting feature of pivot-based algorithms is that they an redue the number of final distane evaluations by inreasing the number of pivots. Define D k (x, y) =max 1 j k d(x, p j ) d(y, p j ). Usingthepivotsp 1,..., p k is equivalent to disarding elements u suh that D k (q, u) > r. As more pivots are added we need to perform more distane evaluations (exatly k) to ompute D k (q, ) (these are alled internal evaluations), but on the other hand D k (q, ) inreases its value and hene it has a higher hane of filtering out more elements (those omparisons against elements that annot be filtered out are alled external). It follows that there exists an optimum k. If one is not only interested in the number of distane evaluations performed but also in the total pu time required, then sanning all the n elements to filter out some of them may be unaeptable. In that ase, one needs multidimensional range searh methods, whih inlude data strutures suh as the kd-tree, R- tree, X-tree, et. [36,14]. Those strutures permit indexing a set of objets in k-dimensional spae in order to proess range queries. In this paper we are interested in a metri spae where the universe is the set of strings over some alphabet, i.e. X = Σ, and the distane funtion is the so-alled edit distane or Levenshtein distane. This is defined as the minimum number of harater insertions, deletions and substitutions neessary to make two strings equal [19,25]. The edit distane, and in fat any other distane defined as the best way to onvert one element into the other, is reflexive, stritly positive (as long as there are no zero-ost operations), symmetri (as long as the operations allowed are symmetri), and satisfies the triangle inequality. The algorithm to ompute the edit distane ed() is based on dynami programming. Imagine that we need to ompute ed(x, y). A matrix C 0.. x,0.. y is filled, where C i,j = ed(x 1..i,y 1..j ), so C x, y = ed(x, y). This is omputed as C i,0 = i, C 0,j = j, C i,j = if (x i = y j )thenc i 1,j 1 else 1 + min(c i 1,j,C i,j 1,C i 1,j 1 ) The algorithm takes O( x y ) time. The matrix an be filled olumn-wise or row-wise (there are more sophistiated ways as well). For reasons that will be made lear later, we prefer the row-wise filling. The spae required is only O( y ), sine only the previous row must be stored in order to ompute the new one, and therefore we just keep one row and update it. 3 Text Indexing Suffix trees are widely used data strutures for text proessing [2,1]. Any position i in a text T defines a suffix of T,namelyT i....asuffix trie is a trie data struture built over all the suffixes of T. At the leaf nodes the pointers to the suffixes are stored. Every substring of T an be found by traversing a path from the root. Roughly speaking, eah suffix trie leaf represents a suffix and eah internal node represents a different substring of T. To improve spae utilization, this trie is ompated into a Patriia tree [23] by ompressing unary paths. The edges that replae a ompressed path store

5 AMetri Index for Approximate String Mathing the whole string that they represent (via two pointers to their initial and final text position). One unary paths are not present the trie, now alled suffix tree, has O(n) nodes instead of the worst-ase O(n 2 ) of the trie. The suffix tree an be diretly built in O(n) time [22,35]. Any algorithm on a suffix trie an be simulated at the same ost in the suffix tree. We all expliit those suffix trie nodes that survive in the suffix tree, and impliit those that are ollapsed. Figure 1 shows the suffix trie and tree of the text "abraadabra". Note that a speial endmarker "$", smaller than any other harater, is appended to the text so that all the suffixes are external nodes a b r a a d a b r a Text 0 r a 8 d 7 5 b 5 r 9 3 $ 10 a 6 7 Suffix Trie 9 $ $ 3 10 ra d 7 5 bra $ Suffix Tree a 1 d 6 b $ r 3 a 4 $ 1 8 a 1 d 6 bra $ 11 4 $ Suffix Array Fig. 1. The suffix trie, suffix tree and suffix array of the text "abraadabra". The figure shows the internal nodes of the trie (numbered 0 to 9 in italis inside irles), whih represent text substrings that appear more than one, and the external nodes (numbered 1 to 11 inside squares), whih represent text substrings that appear just one. Those leaves do not only represent the unique substrings but all their extensions until the full suffix. In the suffix tree, only some internal nodes are left, and they represent the same substring as before plus the prefixes that may have been ollapsed. For example the internal node (7) of the suffix tree represents now the ompressed nodes (5) and (6), and hene the strings "b", "br" and "bra". The external node (1) represents "abra", but also "abraa", "abraad", et. until the full suffix "abraadabra".

6 Edgar Chávez and Gonzalo Navarro Finally, the suffix array [20] is a more ompat version of the suffix tree, whih requires muh less spae and poses a small penalty over the searh time. If the leaves of the suffix tree are traversed in left-to-right order, all the suffixes of the text are retrieved in lexiographial order. A suffix array is simply an array ontaining all the pointers to the text suffixes listed in lexiographial order, as shown in Figure 1. The suffix array stores one pointer per text position. The suffix array an be diretly built (without building the suffix tree) in O(n log n) worst ase time and O(n log log n) average time [20]. While suffix trees are searhed as tries, suffix arrays are binary searhed. However, almost every algorithm on suffix trees an be adapted to work on suffix arrays at an O(log n) penalty fator in the time ost. This is beause eah subtree of the suffix tree orresponds to an interval in the suffix array, namely the one ontaining all the leaves of the subtree. To follow an edge of the suffix trie, we use binary searh to find the new limits in the suffix array. For example, the internal node (7) in the suffix tree orresponds to the interval 6, 7 in the suffix array. Note that impliit nodes have the same interval than their representing expliit node. 4 Our Algorithm 4.1 Indexing A straightforward approah to text indexing for approximate string mathing using metri spaes tehniques has the problem that, in priniple, there are O(n 2 ) different substrings in a text, and therefore we should index O(n 2 ) objets, whih is unaeptable. The suffix tree provides a onise representation of all the substrings of a text in O(n) spae. So instead of indexing all the text substrings, we index only the (expliit) suffix tree nodes. Therefore, we have O(n) objets to be indexed as a metri spae under the edit distane. Now, eah expliit internal node represents itself and the nodes that desend to it by a unary path. Hene, eah expliit node that orresponds to a string xy and its parent orresponds to the string x represents the following set of strings x[y] = {xy 1,xy 1 y 2,..., xy} where x[y] is a notation we have just introdued. For example, the internal node (4) in Figure 1 represents the strings "a[bra]" = { ab, abr, abra }. The leaves of the suffix tree represent a unique text substring and all its extensions until the full text suffix is obtained. Hene, if T = zxy and x is a unique text substring (whose prefixes are not unique), then the orresponding suffix tree node is an expliit leaf, whih for us represents the set {x} x[y]. Table 1 shows the substrings represented by eah node in our running example. Note that the external nodes that desend by the terminator harater "$", i.e. e(8 11), represent a substring that is also represented at its parent and hene it an be disregarded.

7 AMetri Index for Approximate String Mathing Node Suffix trie Suffix tree Node Suffix trie/tree i(0) ε ε e(1) abra[adabra] i(1) a [a] e(2) bra[adabra] i(2) ab e(3) ra[adabra] i(3) abr e(4) a[adabra] i(4) abra a[bra] e(5) [adabra] i(5) b e(6) a[dabra] i(6) br e(7) [dabra] i(7) bra [bra] e(8) abra i(8) r e(9) bra i(9) ra [ra] e(10) ra e(11) a Table 1. The text substrings represented by eah node of the suffix trie and tree of Figure 1. Internal nodes are represented as i(x) andexternalsase(x). Hene, instead of indexing all the O(n 2 ) text substrings, we index O(n) sets of strings, whih are the sets represented by the expliit internal and the external nodes of the suffix tree. In our example, this set is U = {ε, [a], a[bra], [bra], [ra], abra[adabra], bra[adabra], ra[adabra], a[adabra], [adabra], a[dabra], [dabra]}. We have now to deide how to index this metri spae formed by O(n) sets of strings. Many options are possible, but we have onentrated on a pivot based approah. We selet at random k different text substrings that will be our pivots. For reasons that are made lear later, we hoose to selet pivots of lengths 0, 1, 2,, k 1. For eah expliit suffix tree node x[y] and eah pivot p i,we ompute the distane between p i and all the strings represented by x[y]. From the set of distanes from a node x[y] top i, we store the minimum and maximum ones. Sine all these strings are of the form {xy 1...y j, 1 j y }, alltheedit distanes an be omputed in O( p i xy ) time. Following our example, let us assume that we have seleted k =5pivots p 0 = "", p 1 = "a", p 2 = "br", p 3 = "ad" and p 4 = "raa". Figure 2 (left) shows the omputation of the edit distanes between i(4) = "a[bra]" and p 3 = "ad". The result shows that the minimum and maximum values of this node with respet to this pivot are 2 and 4, respetively. In the ase of external suffix tree nodes, the string y tends to be quite long (O(n) length on average), whih yields a very high omputation time for all the edit distanes and anyway a very large value for the maximum edit distane (note that ed(p i,xy) xy p i ). We solve this by pessimistially assuming that the maximum distane is n when the suffix tree node is external. The minimum edit distane an be found in O( p i max( p i, x )) time, beause it is not neessary to onsider arbitrarily long strings xy 1...y j : If we ompute the matrix row by row, then after having proessed x we have a minimum value seen up to now, v. Then there is no point in onsidering rows j suh that x + j p i >v. Hene we work until row j = v + p i x p i.

8 Edgar Chávez and Gonzalo Navarro a d a b r a a d abra a d a b r a Fig. 2. The dynami programming matrix to ompute the edit distane between "ad" and "a[bra]" (left) or "abra[adabra]" (right). The emphasized area is where the minima and maxima are taken from. Figure 2 (right) illustrates this ase with e(1) = "abra[adabra]" and the same p 4 = "ad". Note that to ompute the new set of edit distanes we have started from i(4), whih is the parent node of e(1) in the suffix tree. This an always be done in a depth first traversal of the suffix tree and saves onstrution time. Note also that it is not neessary to ompute the last 4 rows, sine they measure the edit distane between strings of length 8 or more against one of length 3. The distane annot be smaller than 5 and we have found at that point a minimum equal to 4. In fat we just assume that the maximum is 11, so the minimum and maximum value for this external node and this pivot are 4 and 11. In partiular, sine when indexing external nodes x[y] we always have ed(p i,x) already omputed, they an be indexed in O( p i 2 )time. One this is done for all the suffix tree nodes and all the pivots we have a set of k minimum and maximum values for eah expliit suffix tree node. This an be regarded as a hyperretangle in k dimensions: x[y] (min(ed(x[y],p 0 )),...,min(ed(x[y],p k 1 ))), (max(ed(x[y],p 0 )),...,max(ed(x[y],p k 1 ))) where we are sure that all the strings in x[y] lie inside the retangle. In our example, the minima and maxima for i(4) with respet to p 0 to p 4 are 2, 4, 1, 3, 1, 2, 2, 4 and 3, 3. Therefore i(4) is represented by the hyperretangle (2, 1, 1, 2, 3), (4, 3, 2, 4, 3). On the other hand, the ranges for e(1) are 5, 11, 4, 11, 3, 11, 4, 11 and 2, 11 and its hyperretangle is therefore (5, 4, 3, 4, 2), (11, 11, 11, 11, 11). 4.2 Searhing Let us now onsider a given query P searhed for with at most r errors. This is a range query with radius r inthemetrispaeofthesuffixtreenodes.as for pivot based algorithms, we ompare the pattern P against the k pivots and obtain a k-dimensional oordinate (ed(p, p 1 ),...,ed(p, p k )).

9 AMetri Index for Approximate String Mathing Let p i be a given pivot and x[y] a given node. If it holds that ed(p, p i )+r<min(ed(x[y],p i )) ed(p, p i ) r>max(ed(x[y],p i )) then, by the triangle inequality, we know that ed(p, xy ) >rfor any xy x[y]. The elimination an be done using any pivot p i. In fat, the nodes that are not eliminated are those whose retangle has nonempty intersetion with the retangle (ed(p, p 1 ) r,...,ed(p, p k ) r), (ed(p, p 1 )+r,...,ed(p, p k )+r). Figure 3 illustrates. The node ontains a set of points and we store its minimum and maximum distane to two pivots. These define a (2-dimensional) retangle where all the distanes from any substring of the node to the pivots lie. The query is a pattern P and a tolerane r, whih defines a irle around P. After taking the distanes from P to the pivots we reate a hyperube (a square in this ase) of width 2r + 1. If the square does not interset the retangle, then no substring in the node an be lose enough to P. p1 d min1 P d2 ed(node,p2) d2 max2 P 2r r max1 min2 max2 p2 min2 node node min1 Fig. 3. The elimination rule using two pivots. max1 d1 ed(node,p1) We have to solve the problem of finding all the k-dimensional retangles that interset a given query retangle. This is a lassial multidimensional range searh problem [36,14]. We ould for example use some variant of R-trees [16,7], whih would also yield a good data struture to work on seondary memory. Those nodes x[y] that annot be eliminated using any pivot must be diretly ompared against P. For those whose minimum distane to P is at most r, we report all their ourrenes, whose starting points are written in the leaves of the subtree rooted by the node that has mathed. In our running example, if we are searhing for "abr" with tolerane r =1,thennodei(4) qualifies, so we report the text positions in the orresponding tree leaves: 1 and 8. Observe that in order to ompare P against a given suffix tree node x[y], the edit distane algorithm fores us to ompare it against every prefix of x as well. Those prefixes orrespond to suffix tree nodes in the path from the root to x[y]. In order not to repeat work, we mark in the suffix tree the nodes that we have to ompare expliitly against P, and also mark every node in their path

10 Edgar Chávez and Gonzalo Navarro to the root. Then, we baktrak on the suffix tree entering every marked node and keeping trak of the edit distane between P and the node. The new row is omputed using the row of the parent, just as done with the pivots. This avoids reomputing the same prefixes for different suffix tree nodes, and inidentally is similar to the simplest baktraking approah [15], exept that in this ase we only follow marked paths. In this respet, our algorithm an be thought of as a preproessing to a baktraking algorithm, whih filters out some paths. As a pratial matter, note that this is the only step where the suffix tree is required. We an even print the text substrings that math the pattern without the help of the suffix tree, but we need it in order to report all their text positions. For this sake, a suffix array is muh heaper and does a better job (beause all the text positions are listed in a ontiguous interval). In fat, the suffix array an also replae the suffix tree at indexing time. 5 Analysis Let us first onsider the onstrution ost. The maximum length of a repeated text substring, or whih is the same, the average height of the suffix trie, is O(log n) [32,29]. Reall also that our pivots are of length O(k). Hene omputing all the kn minima and maxima takes time O(kn k log n) =O(k 2 n log n), where O(k log n) is the time to ompute eah distane. However, we start the omputation of the edit distanes of a node from the last row of its parent. This redues the average onstrution ost to O(k 2 n), sine for eah pivot we ompute one dynami programming row per suffix trie node, and on average the number of nodes in the suffix trie is O(n) [32,29].Then leaves are also omputed in O( p i 2 n)=o(k 2 n)time. The total spae required by the data struture is O(kn),sineweneedto store for eah expliit node a pointer to the suffix tree and its k oordinates. ThesuffixtreeitselftakesO(n) spae. It remains to determine the average searh time. A key element of the analysis is a onstant α, whih is the probability that, for a random hyperretangle of the set, along some fixed oordinate, the orresponding segment of the query hyperube intersets with the orresponding segment of the hyperretangle. Another way to put it is that, along that oordinate, the query point falls inside the hyperretangle projetion onto that oordinate after it is enlarged in r units along eah dimension. In operational terms, α is the probability that some given pivot (that orresponding to the seleted oordinate) does not permit disarding a given element. Note that α does not depend on k, onlyonr. The first part of the searh is the omputation of the edit distanes between the k pivots and the pattern P of length m. ThistakesO(k 2 m)time. The seond part is the searh for the retangles that interset the query retangle. Many analyses of the performane of R-trees exist in the literature [33,18,26,27,13]. Despite that most of them deal with the exat number of disk aesses, their abstrat result is that the expeted amount of work on the R-tree (and variants suh as the KDB-tree [28]) is O(nα k log n).

11 AMetri Index for Approximate String Mathing The third part, finally, is the diret hek of the pattern against the suffix tree nodes whose retangles interset the query retangle. Sine we disard using any of k random pivots, the probability of not disarding a node is α k.asthere are O(n) suffix tree nodes, we hek on average α k n nodes, with a total ost of O(α k n m 2 ). The m 2 is the ost to ompute the edit distane between a pattern of length m and a andidate whose length must be between m r and m + r. This is beause the pivot ε removes all shorter or longer andidates. At the end, we report the R results in O(R) time using a suffix tree traversal. Hene our total average ost is bounded by k 2 m + nα k log n + nα k m 2 + R for 0 α 1. This is optimized for k =log 1/α n + O(log log n) log 1/α n = Θ(log n). If we use log 1/α n pivots the searh ost beomes O(m log 2 n + m 2 + R) on average. Note that the influene of the searh radius r is embedded in α. This is muh better omplexity than all previous work, whih obtains O(mn λ )time for some 0 <λ<1. Moreover, muh of previous work requires m = Ω(log n) to obtain sublinearity, while our approah does not. The prie is in the onstrution time and spae, whih beome O(n log 2 n) and O(n log n), respetively. Espeially the latter an be prohibitive and we may have to ontent ourselves with a smaller k. There seems to be no good tradeoff between spae and time, e.g., to obtain O(n λ )timewealsoneedθ(log n) pivots. Most other indexes require O(n) spae and onstrution time. Finally, it is worth mentioning that, sine we automatially disard any internal node not in the length [m r, m + r] thanks to the pivot p 0 = ε, thereis a worst-ase limit σ m+r on the number of suffix tree nodes to onsider for the last phase. Although this limit is exponential on m and r, it is independent of n. Other indexing shemes based on the suffix tree share the same property. 6 Towards a Pratial Implementation Despite that we have obtained an important redution in time omplexity with respet to n and m, our result is hiding a multiplying fator that depends on the searh radius. It is possible that this onstant is too large (that is, α too lose to 1) and makes the whole approah useless. Also, the extra spae requirement (whih also inreases as α tends to 1) an be unmanageable. In this setion we onsider an alternative approah that is simpler and likely to obtain better results in pratie, despite not involving a omplexity breakthrough. 6.1 Indexing Only Suffixes A simpler index that derives from the same ideas of the paper onsiders only the n text suffixes and no internal nodes. Eah suffix [T j... ] represents all the text substrings starting at i, and it is indexed aording to the minimum distane between those substrings and eah pivot. The good point of the approah is redued spae. Not only the set U an have up to half the elements of the original approah, but also only k values (not 2k)

12 Edgar Chávez and Gonzalo Navarro are stored for eah element, sine all the maximum values are the same. This permits using up to four times the number of pivots of the previous approah at the same memory requirement. Note that we do not even need to build or store the suffix array: We just read the suffixes from the text and index them. Our only storage need is that of the metri index. The bad point is that the seletivity of the pivots is redued and some redundant work is done. The first is a onsequene of storing only minimum values, while the seond is a onsequene of not fatoring out repeated text substrings. That is, if some substring P of T is lose enough to P and it appears many times in T, we will have to hek all its ourrenes one by one. Without using a suffix tree struture, the onstrution of the index an be done in time O(k p i n) as follows. The algorithm depited in Setion 2 to ompute edit distane an be modified so as to make C 0,j =0,inwhihaseC i,j beomes the minimum edit distane between x 1...i and a suffix of y 1...j.Ifx is the reverse of p i and y the reverse of T,thenC pi,j will be the minimum edit distane between p i and a prefix of T n j+1..., whih is preisely min(ed(p i,t n j+1... )). So we need O( p i n) time per pivot. The spae to ompute this is just O( p i )by doing the omputation olumn-wise. 6.2 Using an Index for High Dimensions The spae of strings has a distane distribution that is rather onentrated around its mean µ. The same happens to the distanes between a pivot p i and suffixes [T j... ] or the pattern P. Sine we an only disard suffixes [T j... ] suh that ed(p i,p)+r < min(ed(p i, [T j... ])), only the suffixes with a large min(ed(p i, [T j... ])) value are likely to be disarded using p i. Storing all the other O(n) distanes to p i is likely to be a waste of spae. Moreover, we an use that memory to introdue more pivots. Figure 4 illustrates. The idea is to fix a number s and, for eah pivot p i, store only the s largest min(ed(p i, [T j... ])) values. Only those suffixes an be disarded using pivot p i. The spae of this index is O(ks) and its onstrution time is unhanged. We an still use an R-tree for the searh, although the retangles will over all the spae exept on s oordinates. The seletivity is likely to be similar sine we have disarded uninteresting oordinates, and we an tune number k versus seletivity s of the pivots for the same spae usage O(ks). One an go further to obtain O(n) spae as follows. Choose the first pivot and determine its s farthest suffixes. Store a list (in inreasing distane order) of those suffixes and their distane to the first pivot and remove them from further onsideration. Then hoose a seond pivot and find its s farthest suffixes from the remaining set. Continue until every suffix has been inluded in the list of some pivot. Note that every suffix appears exatly in one list. At searh time, ompare P against eah pivot p i,andifed(p, p i )+r is smaller than the smallest (first) distane in the list of p i, skip the whole list. Otherwise traverse the list until its end or until ed(p, p i )+r is smaller than the next element. Eah traversed suffix must be diretly ompared against P. A variant of this idea has proven extremely useful to deal with onentrated histograms [9]. It also permits

13 AMetri Index for Approximate String Mathing ed(p,p) +r ed(p,[t(j...)]) Fig. 4. The distane distribution to a pivot p, inluding that of pattern P.The grayed area represents the suffixes that an be disarded using p. effiient seondary storage implementation by paking the pivots in disk pages and storing the lists onseutively in the same order of the pivots. Sine we hoose k = n/s pivots, the onstrution time is high, namely O(n 2 p i /s). However, the spae is O(n), with a low onstant (lose to 5 in pratie) that makes it ompetitive against the most eonomial strutures for the problem. The searh time is O( p i mn/s) toomparep against the pivots, while the time to traverse the lists is diffiult to analyze. The pivots hosen must not be very short, beause their minimum distane to any [T j... ]isatmost p i. In fat, any pivot not longer than m + r is useless. 6.3 Using Speifi Strings Properties We an omplement the information given by the metri index with knowledge of the string properties we are indexing. For example, if suffix [T j... ]isproven to be at distane r + t from P, then we an also disard suffixes starting in the range j t +1...j+ t 1. Another idea is to ompute the edit distane between the reverse pivot and the reverse pattern. Although the result is the same, we learn also the distanes between the pivot and suffixes of the pattern. This an also be useful to disard suffixes at verifiation time: If d = ed(p 1...l,T i...i ) and we know from the index that ed(p l+1...,t i +1...) >r d, then a math is not possible. Other ideas, suh as hybrid algorithms that partition the pattern and searh for the piees permitting less errors [6], an be implemented over our metri index instead of over a suffix tree or array. Indeed, our data struture should ompete in the area of baktraking algorithms, as the others are orthogonal.

14 Edgar Chávez and Gonzalo Navarro 7 Conlusions We have presented a novel approah to the approximate string mathing problem. The idea is to give the set of text substrings the struture of a metri spae and then use an algorithm for range queries on metri spaes. The suffix tree is used as a oneptual devie to map the O(n 2 )textsubstringstoo(n) sets of strings. We show analytially that the searh omplexity is better than that obtained in previous work, at the prie of slightly higher spae usage. More preisely we an searh at an average ost of O(m log 2 n + m 2 + R) timeusing O(n log n) spae, while by using a suffix tree (the best known tehnique) one an searh in O(mn λ )timefor0 λ 1usingO(n) spae. Moreover, our tehnique an be extended to any other distane funtion among strings, some of whih, like the reversals distane, are problemati to handle with the previous approahes. The proposal opens a number of possibilities for future work. We plan to explore other methods to redue the number of substrings (we have used the suffix tree nodes and the suffixes), other metri spae indexing methods (we have used pivots), other multidimensional range searh tehniques (we have used R- trees), other pivot seletion tehniques (we took them at random), et. A more pratial setup needing O(n) spae has been desribed in Setion 6. Finally, the method promises an effiient implementation on seondary memory (e.g., with R-trees), whih is a weak point in most urrent approahes. Referenes 1. A. Apostolio. The myriad virtues of subword trees. In Combinatorial Algorithms on Words, NATO ISI Series, pages Springer-Verlag, A. Apostolio and Z. Galil. Combinatorial Algorithms on Words. Springer-Verlag, New York, R. Baeza-Yates and G. Navarro. Blok-addressing indies for approximate text retrieval. In Pro. ACM CIKM 97, pages 1 8, R. Baeza-Yates and G. Navarro. Fast approximate string mathing in a ditionary. In Pro. SPIRE 98, pages IEEE Computer Press, R. Baeza-Yates and G. Navarro. A pratial q-gram index for text retrieval allowing errors. CLEI Eletroni Journal, 1(2), R. Baeza-Yates and G. Navarro. A hybrid indexing method for approximate string mathing. J. of Disrete Algorithms (JDA), 1(1): , Speial issue on Mathing Patterns. 7. N. Bekmann, H. Kriegel, R. Shneider, and B. Seeger. The R -tree: an effiient and robust aess method for points and retangles. In Pro. ACM SIGMOD 90, pages , E. Bugnion, T. Roos, F. Shi, P. Widmayer, and F. Widmer. Approximate multiple string mathing using spatial indexes. In Pro. 1st South Amerian Workshop on String Proessing (WSP 93), pages 43 54, E. Chávez and G. Navarro. An effetive lustering algorithm to index high dimensional metri spaes. In Pro. SPIRE 2000, pages IEEE CS Press, E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searhing in metri spaes. ACM Comp. Surv., To appear.

15 AMetri Index for Approximate String Mathing 11. A. Cobbs. Fast approximate mathing using suffix trees. In Pro. CPM 95, pages 41 54, LNCS M. Crohemore. Transduers and repetitions. Theor. Comp. Si., 45:63 86, C. Faloutsos and I. Kamel. Beyond uniformity and independene: analysis of R- trees using the onept of fratal dimension. In Pro. ACM PODS 94, pages 4 13, V. Gaede and O. Günther. Multidimensional aess methods. ACM Comp. Surv., 30(2): , G. Gonnet. A tutorial introdution to Computational Biohemistry using Darwin. Tehnial report, Informatik E.T.H., Zuerih, Switzerland, A. Guttman. R-trees: a dynami index struture for spatial searhing. In Pro. ACM SIGMOD 84, pages 47 57, P. Jokinen and E. Ukkonen. Two algorithms for approximate string mathing in stati texts. In Pro. MFCS 91, volume 16, pages , I. Kamel and C. Faloutsos. On paking R-trees. In Pro. ACM CIKM 93, pages , V. Levenshtein. Binary odes apable of orreting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8 17, U. Manber and E. Myers. Suffix arrays: a new method for on-line string searhes. SIAM J. on Computing, pages , U. Manber and S. Wu. glimpse: A tool to searh through entire file systems. In Pro. USENIX Tehnial Conferene, pages 23 32, Winter E. MCreight. A spae-eonomial suffix tree onstrution algorithm. J. of the ACM, 23(2): , D. Morrison. PATRICIA pratial algorithm to retrieve information oded in alphanumeri. J. of the ACM, 15(4): , E. Myers. A sublinear algorithm for approximate keyword searhing. Algorithmia, 12(4/5): , Ot/Nov G. Navarro. A guided tour to approximate string mathing. ACM Comp. Surv., 33(1):31 88, B. Pagel, H. Six, H. Toben, and P. Widmayer. Towards an analysis of range queries. In Pro. ACM PODS 93, pages , D. Papadias, Y. Theodoridis, and E. Stefanakis. Multidimensional range queries with spatial relations. Geographial Systems, 4(4): , J. Robinson. The K-D-B-tree: a searh struture for large multidimensional dynami indexes. In Pro. ACM PODS 81, pages 10 18, R. Sedgewik and P. Flajolet. Analysis of Algorithms. Addison-Wesley, F. Shi. Fast approximate string mathing with q-bloks sequenes. In Pro. WSP 96, pages Carleton University Press, E. Sutinen and J. Tarhio. Filtration with q-samples in approximate string mathing. In Pro. CPM 96, LNCS 1075, pages 50 61, W. Szpankowski. Probabilisti analysis of generalized suffix trees. In Pro. CPM 92, LNCS 644, pages 1 14, Y. Thedoridis and T. Sellis. A model for the predition of R-tree performane. In Pro. ACM PODS 96, pages , E. Ukkonen. Approximate string mathing over suffix trees. In Pro. CPM 93, pages , E. Ukkonen. Construting suffix trees on-line in linear time. Algorithmia, 14(3): , D. White and R. Jain. Algorithms and strategies for similarity retrieval. Tehnial Report VCL , Visual Comp. Lab., Univ. of California, July 1996.

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fischer) January 21,

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fischer) January 21, Sequene Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson & J. Fisher) January 21, 201511 9 Suffix Trees and Suffix Arrays This leture is based on the following soures, whih are all reommended

More information

Complexity of Regularization RBF Networks

Complexity of Regularization RBF Networks Complexity of Regularization RBF Networks Mark A Kon Department of Mathematis and Statistis Boston University Boston, MA 02215 mkon@buedu Leszek Plaskota Institute of Applied Mathematis University of Warsaw

More information

Searching All Approximate Covers and Their Distance using Finite Automata

Searching All Approximate Covers and Their Distance using Finite Automata Searhing All Approximate Covers and Their Distane using Finite Automata Ondřej Guth, Bořivoj Melihar, and Miroslav Balík České vysoké učení tehniké v Praze, Praha, CZ, {gutho1,melihar,alikm}@fel.vut.z

More information

CMSC 451: Lecture 9 Greedy Approximation: Set Cover Thursday, Sep 28, 2017

CMSC 451: Lecture 9 Greedy Approximation: Set Cover Thursday, Sep 28, 2017 CMSC 451: Leture 9 Greedy Approximation: Set Cover Thursday, Sep 28, 2017 Reading: Chapt 11 of KT and Set 54 of DPV Set Cover: An important lass of optimization problems involves overing a ertain domain,

More information

Danielle Maddix AA238 Final Project December 9, 2016

Danielle Maddix AA238 Final Project December 9, 2016 Struture and Parameter Learning in Bayesian Networks with Appliations to Prediting Breast Caner Tumor Malignany in a Lower Dimension Feature Spae Danielle Maddix AA238 Final Projet Deember 9, 2016 Abstrat

More information

15.12 Applications of Suffix Trees

15.12 Applications of Suffix Trees 248 Algorithms in Bioinformatis II, SoSe 07, ZBIT, D. Huson, May 14, 2007 15.12 Appliations of Suffix Trees 1. Searhing for exat patterns 2. Minimal unique substrings 3. Maximum unique mathes 4. Maximum

More information

Developing Excel Macros for Solving Heat Diffusion Problems

Developing Excel Macros for Solving Heat Diffusion Problems Session 50 Developing Exel Maros for Solving Heat Diffusion Problems N. N. Sarker and M. A. Ketkar Department of Engineering Tehnology Prairie View A&M University Prairie View, TX 77446 Abstrat This paper

More information

Nonreversibility of Multiple Unicast Networks

Nonreversibility of Multiple Unicast Networks Nonreversibility of Multiple Uniast Networks Randall Dougherty and Kenneth Zeger September 27, 2005 Abstrat We prove that for any finite direted ayli network, there exists a orresponding multiple uniast

More information

Control Theory association of mathematics and engineering

Control Theory association of mathematics and engineering Control Theory assoiation of mathematis and engineering Wojieh Mitkowski Krzysztof Oprzedkiewiz Department of Automatis AGH Univ. of Siene & Tehnology, Craow, Poland, Abstrat In this paper a methodology

More information

The Laws of Acceleration

The Laws of Acceleration The Laws of Aeleration The Relationships between Time, Veloity, and Rate of Aeleration Copyright 2001 Joseph A. Rybzyk Abstrat Presented is a theory in fundamental theoretial physis that establishes the

More information

Assessing the Performance of a BCI: A Task-Oriented Approach

Assessing the Performance of a BCI: A Task-Oriented Approach Assessing the Performane of a BCI: A Task-Oriented Approah B. Dal Seno, L. Mainardi 2, M. Matteui Department of Eletronis and Information, IIT-Unit, Politenio di Milano, Italy 2 Department of Bioengineering,

More information

Robust Recovery of Signals From a Structured Union of Subspaces

Robust Recovery of Signals From a Structured Union of Subspaces Robust Reovery of Signals From a Strutured Union of Subspaes 1 Yonina C. Eldar, Senior Member, IEEE and Moshe Mishali, Student Member, IEEE arxiv:87.4581v2 [nlin.cg] 3 Mar 29 Abstrat Traditional sampling

More information

Maximum Entropy and Exponential Families

Maximum Entropy and Exponential Families Maximum Entropy and Exponential Families April 9, 209 Abstrat The goal of this note is to derive the exponential form of probability distribution from more basi onsiderations, in partiular Entropy. It

More information

Packing Plane Spanning Trees into a Point Set

Packing Plane Spanning Trees into a Point Set Paking Plane Spanning Trees into a Point Set Ahmad Biniaz Alfredo Garía Abstrat Let P be a set of n points in the plane in general position. We show that at least n/3 plane spanning trees an be paked into

More information

State Diagrams. Margaret M. Fleck. 14 November 2011

State Diagrams. Margaret M. Fleck. 14 November 2011 State Diagrams Margaret M. Flek 14 November 2011 These notes over state diagrams. 1 Introdution State diagrams are a type of direted graph, in whih the graph nodes represent states and labels on the graph

More information

Hankel Optimal Model Order Reduction 1

Hankel Optimal Model Order Reduction 1 Massahusetts Institute of Tehnology Department of Eletrial Engineering and Computer Siene 6.245: MULTIVARIABLE CONTROL SYSTEMS by A. Megretski Hankel Optimal Model Order Redution 1 This leture overs both

More information

Lightpath routing for maximum reliability in optical mesh networks

Lightpath routing for maximum reliability in optical mesh networks Vol. 7, No. 5 / May 2008 / JOURNAL OF OPTICAL NETWORKING 449 Lightpath routing for maximum reliability in optial mesh networks Shengli Yuan, 1, * Saket Varma, 2 and Jason P. Jue 2 1 Department of Computer

More information

Computer Science 786S - Statistical Methods in Natural Language Processing and Data Analysis Page 1

Computer Science 786S - Statistical Methods in Natural Language Processing and Data Analysis Page 1 Computer Siene 786S - Statistial Methods in Natural Language Proessing and Data Analysis Page 1 Hypothesis Testing A statistial hypothesis is a statement about the nature of the distribution of a random

More information

Estimating the probability law of the codelength as a function of the approximation error in image compression

Estimating the probability law of the codelength as a function of the approximation error in image compression Estimating the probability law of the odelength as a funtion of the approximation error in image ompression François Malgouyres Marh 7, 2007 Abstrat After some reolletions on ompression of images using

More information

A simple expression for radial distribution functions of pure fluids and mixtures

A simple expression for radial distribution functions of pure fluids and mixtures A simple expression for radial distribution funtions of pure fluids and mixtures Enrio Matteoli a) Istituto di Chimia Quantistia ed Energetia Moleolare, CNR, Via Risorgimento, 35, 56126 Pisa, Italy G.

More information

Coding for Random Projections and Approximate Near Neighbor Search

Coding for Random Projections and Approximate Near Neighbor Search Coding for Random Projetions and Approximate Near Neighbor Searh Ping Li Department of Statistis & Biostatistis Department of Computer Siene Rutgers University Pisataay, NJ 8854, USA pingli@stat.rutgers.edu

More information

The Second Postulate of Euclid and the Hyperbolic Geometry

The Second Postulate of Euclid and the Hyperbolic Geometry 1 The Seond Postulate of Eulid and the Hyperboli Geometry Yuriy N. Zayko Department of Applied Informatis, Faulty of Publi Administration, Russian Presidential Aademy of National Eonomy and Publi Administration,

More information

Advanced Computational Fluid Dynamics AA215A Lecture 4

Advanced Computational Fluid Dynamics AA215A Lecture 4 Advaned Computational Fluid Dynamis AA5A Leture 4 Antony Jameson Winter Quarter,, Stanford, CA Abstrat Leture 4 overs analysis of the equations of gas dynamis Contents Analysis of the equations of gas

More information

Millennium Relativity Acceleration Composition. The Relativistic Relationship between Acceleration and Uniform Motion

Millennium Relativity Acceleration Composition. The Relativistic Relationship between Acceleration and Uniform Motion Millennium Relativity Aeleration Composition he Relativisti Relationship between Aeleration and niform Motion Copyright 003 Joseph A. Rybzyk Abstrat he relativisti priniples developed throughout the six

More information

A Spatiotemporal Approach to Passive Sound Source Localization

A Spatiotemporal Approach to Passive Sound Source Localization A Spatiotemporal Approah Passive Sound Soure Loalization Pasi Pertilä, Mikko Parviainen, Teemu Korhonen and Ari Visa Institute of Signal Proessing Tampere University of Tehnology, P.O.Box 553, FIN-330,

More information

IMPEDANCE EFFECTS OF LEFT TURNERS FROM THE MAJOR STREET AT A TWSC INTERSECTION

IMPEDANCE EFFECTS OF LEFT TURNERS FROM THE MAJOR STREET AT A TWSC INTERSECTION 09-1289 Citation: Brilon, W. (2009): Impedane Effets of Left Turners from the Major Street at A TWSC Intersetion. Transportation Researh Reord Nr. 2130, pp. 2-8 IMPEDANCE EFFECTS OF LEFT TURNERS FROM THE

More information

Aharonov-Bohm effect. Dan Solomon.

Aharonov-Bohm effect. Dan Solomon. Aharonov-Bohm effet. Dan Solomon. In the figure the magneti field is onfined to a solenoid of radius r 0 and is direted in the z- diretion, out of the paper. The solenoid is surrounded by a barrier that

More information

Optimization of Statistical Decisions for Age Replacement Problems via a New Pivotal Quantity Averaging Approach

Optimization of Statistical Decisions for Age Replacement Problems via a New Pivotal Quantity Averaging Approach Amerian Journal of heoretial and Applied tatistis 6; 5(-): -8 Published online January 7, 6 (http://www.sienepublishinggroup.om/j/ajtas) doi:.648/j.ajtas.s.65.4 IN: 36-8999 (Print); IN: 36-96 (Online)

More information

A model for measurement of the states in a coupled-dot qubit

A model for measurement of the states in a coupled-dot qubit A model for measurement of the states in a oupled-dot qubit H B Sun and H M Wiseman Centre for Quantum Computer Tehnology Centre for Quantum Dynamis Griffith University Brisbane 4 QLD Australia E-mail:

More information

Likelihood-confidence intervals for quantiles in Extreme Value Distributions

Likelihood-confidence intervals for quantiles in Extreme Value Distributions Likelihood-onfidene intervals for quantiles in Extreme Value Distributions A. Bolívar, E. Díaz-Franés, J. Ortega, and E. Vilhis. Centro de Investigaión en Matemátias; A.P. 42, Guanajuato, Gto. 36; Méxio

More information

REFINED UPPER BOUNDS FOR THE LINEAR DIOPHANTINE PROBLEM OF FROBENIUS. 1. Introduction

REFINED UPPER BOUNDS FOR THE LINEAR DIOPHANTINE PROBLEM OF FROBENIUS. 1. Introduction Version of 5/2/2003 To appear in Advanes in Applied Mathematis REFINED UPPER BOUNDS FOR THE LINEAR DIOPHANTINE PROBLEM OF FROBENIUS MATTHIAS BECK AND SHELEMYAHU ZACKS Abstrat We study the Frobenius problem:

More information

Evaluation of effect of blade internal modes on sensitivity of Advanced LIGO

Evaluation of effect of blade internal modes on sensitivity of Advanced LIGO Evaluation of effet of blade internal modes on sensitivity of Advaned LIGO T0074-00-R Norna A Robertson 5 th Otober 00. Introdution The urrent model used to estimate the isolation ahieved by the quadruple

More information

McCreight s Suffix Tree Construction Algorithm. Milko Izamski B.Sc. Informatics Instructor: Barbara König

McCreight s Suffix Tree Construction Algorithm. Milko Izamski B.Sc. Informatics Instructor: Barbara König 1. Introution MCreight s Suffix Tree Constrution Algorithm Milko Izamski B.S. Informatis Instrutor: Barbara König The main goal of MCreight s algorithm is to buil a suffix tree in linear time. This is

More information

Counting Idempotent Relations

Counting Idempotent Relations Counting Idempotent Relations Beriht-Nr. 2008-15 Florian Kammüller ISSN 1436-9915 2 Abstrat This artile introdues and motivates idempotent relations. It summarizes haraterizations of idempotents and their

More information

The Effectiveness of the Linear Hull Effect

The Effectiveness of the Linear Hull Effect The Effetiveness of the Linear Hull Effet S. Murphy Tehnial Report RHUL MA 009 9 6 Otober 009 Department of Mathematis Royal Holloway, University of London Egham, Surrey TW0 0EX, England http://www.rhul.a.uk/mathematis/tehreports

More information

A Functional Representation of Fuzzy Preferences

A Functional Representation of Fuzzy Preferences Theoretial Eonomis Letters, 017, 7, 13- http://wwwsirporg/journal/tel ISSN Online: 16-086 ISSN Print: 16-078 A Funtional Representation of Fuzzy Preferenes Susheng Wang Department of Eonomis, Hong Kong

More information

arxiv:physics/ v4 [physics.gen-ph] 9 Oct 2006

arxiv:physics/ v4 [physics.gen-ph] 9 Oct 2006 The simplest derivation of the Lorentz transformation J.-M. Lévy Laboratoire de Physique Nuléaire et de Hautes Energies, CNRS - IN2P3 - Universités Paris VI et Paris VII, Paris. Email: jmlevy@in2p3.fr

More information

A Queueing Model for Call Blending in Call Centers

A Queueing Model for Call Blending in Call Centers A Queueing Model for Call Blending in Call Centers Sandjai Bhulai and Ger Koole Vrije Universiteit Amsterdam Faulty of Sienes De Boelelaan 1081a 1081 HV Amsterdam The Netherlands E-mail: {sbhulai, koole}@s.vu.nl

More information

Product Policy in Markets with Word-of-Mouth Communication. Technical Appendix

Product Policy in Markets with Word-of-Mouth Communication. Technical Appendix rodut oliy in Markets with Word-of-Mouth Communiation Tehnial Appendix August 05 Miro-Model for Inreasing Awareness In the paper, we make the assumption that awareness is inreasing in ustomer type. I.e.,

More information

Modal Horn Logics Have Interpolation

Modal Horn Logics Have Interpolation Modal Horn Logis Have Interpolation Marus Kraht Department of Linguistis, UCLA PO Box 951543 405 Hilgard Avenue Los Angeles, CA 90095-1543 USA kraht@humnet.ula.de Abstrat We shall show that the polymodal

More information

arxiv:math/ v1 [math.ca] 27 Nov 2003

arxiv:math/ v1 [math.ca] 27 Nov 2003 arxiv:math/011510v1 [math.ca] 27 Nov 200 Counting Integral Lamé Equations by Means of Dessins d Enfants Sander Dahmen November 27, 200 Abstrat We obtain an expliit formula for the number of Lamé equations

More information

Singular Event Detection

Singular Event Detection Singular Event Detetion Rafael S. Garía Eletrial Engineering University of Puerto Rio at Mayagüez Rafael.Garia@ee.uprm.edu Faulty Mentor: S. Shankar Sastry Researh Supervisor: Jonathan Sprinkle Graduate

More information

c-perfect Hashing Schemes for Binary Trees, with Applications to Parallel Memories

c-perfect Hashing Schemes for Binary Trees, with Applications to Parallel Memories -Perfet Hashing Shemes for Binary Trees, with Appliations to Parallel Memories (Extended Abstrat Gennaro Cordaso 1, Alberto Negro 1, Vittorio Sarano 1, and Arnold L.Rosenberg 2 1 Dipartimento di Informatia

More information

7 Max-Flow Problems. Business Computing and Operations Research 608

7 Max-Flow Problems. Business Computing and Operations Research 608 7 Max-Flow Problems Business Computing and Operations Researh 68 7. Max-Flow Problems In what follows, we onsider a somewhat modified problem onstellation Instead of osts of transmission, vetor now indiates

More information

Convergence of reinforcement learning with general function approximators

Convergence of reinforcement learning with general function approximators Convergene of reinforement learning with general funtion approximators assilis A. Papavassiliou and Stuart Russell Computer Siene Division, U. of California, Berkeley, CA 94720-1776 fvassilis,russellg@s.berkeley.edu

More information

Average Rate Speed Scaling

Average Rate Speed Scaling Average Rate Speed Saling Nikhil Bansal David P. Bunde Ho-Leung Chan Kirk Pruhs May 2, 2008 Abstrat Speed saling is a power management tehnique that involves dynamially hanging the speed of a proessor.

More information

A NETWORK SIMPLEX ALGORITHM FOR THE MINIMUM COST-BENEFIT NETWORK FLOW PROBLEM

A NETWORK SIMPLEX ALGORITHM FOR THE MINIMUM COST-BENEFIT NETWORK FLOW PROBLEM NETWORK SIMPLEX LGORITHM FOR THE MINIMUM COST-BENEFIT NETWORK FLOW PROBLEM Cen Çalışan, Utah Valley University, 800 W. University Parway, Orem, UT 84058, 801-863-6487, en.alisan@uvu.edu BSTRCT The minimum

More information

Sensitivity Analysis in Markov Networks

Sensitivity Analysis in Markov Networks Sensitivity Analysis in Markov Networks Hei Chan and Adnan Darwihe Computer Siene Department University of California, Los Angeles Los Angeles, CA 90095 {hei,darwihe}@s.ula.edu Abstrat This paper explores

More information

arxiv: v2 [math.pr] 9 Dec 2016

arxiv: v2 [math.pr] 9 Dec 2016 Omnithermal Perfet Simulation for Multi-server Queues Stephen B. Connor 3th Deember 206 arxiv:60.0602v2 [math.pr] 9 De 206 Abstrat A number of perfet simulation algorithms for multi-server First Come First

More information

Grasp Planning: How to Choose a Suitable Task Wrench Space

Grasp Planning: How to Choose a Suitable Task Wrench Space Grasp Planning: How to Choose a Suitable Task Wrenh Spae Ch. Borst, M. Fisher and G. Hirzinger German Aerospae Center - DLR Institute for Robotis and Mehatronis 8223 Wessling, Germany Email: [Christoph.Borst,

More information

JAST 2015 M.U.C. Women s College, Burdwan ISSN a peer reviewed multidisciplinary research journal Vol.-01, Issue- 01

JAST 2015 M.U.C. Women s College, Burdwan ISSN a peer reviewed multidisciplinary research journal Vol.-01, Issue- 01 JAST 05 M.U.C. Women s College, Burdwan ISSN 395-353 -a peer reviewed multidisiplinary researh journal Vol.-0, Issue- 0 On Type II Fuzzy Parameterized Soft Sets Pinaki Majumdar Department of Mathematis,

More information

Taste for variety and optimum product diversity in an open economy

Taste for variety and optimum product diversity in an open economy Taste for variety and optimum produt diversity in an open eonomy Javier Coto-Martínez City University Paul Levine University of Surrey Otober 0, 005 María D.C. Garía-Alonso University of Kent Abstrat We

More information

Flexible Word Design and Graph Labeling

Flexible Word Design and Graph Labeling Flexible Word Design and Graph Labeling Ming-Yang Kao Manan Sanghi Robert Shweller Abstrat Motivated by emerging appliations for DNA ode word design, we onsider a generalization of the ode word design

More information

arxiv:gr-qc/ v2 6 Feb 2004

arxiv:gr-qc/ v2 6 Feb 2004 Hubble Red Shift and the Anomalous Aeleration of Pioneer 0 and arxiv:gr-q/0402024v2 6 Feb 2004 Kostadin Trenčevski Faulty of Natural Sienes and Mathematis, P.O.Box 62, 000 Skopje, Maedonia Abstrat It this

More information

Normative and descriptive approaches to multiattribute decision making

Normative and descriptive approaches to multiattribute decision making De. 009, Volume 8, No. (Serial No.78) China-USA Business Review, ISSN 57-54, USA Normative and desriptive approahes to multiattribute deision making Milan Terek (Department of Statistis, University of

More information

Parallel disrete-event simulation is an attempt to speed-up the simulation proess through the use of multiple proessors. In some sense parallel disret

Parallel disrete-event simulation is an attempt to speed-up the simulation proess through the use of multiple proessors. In some sense parallel disret Exploiting intra-objet dependenies in parallel simulation Franeso Quaglia a;1 Roberto Baldoni a;2 a Dipartimento di Informatia e Sistemistia Universita \La Sapienza" Via Salaria 113, 198 Roma, Italy Abstrat

More information

Indian Institute of Technology Bombay. Department of Electrical Engineering. EE 325 Probability and Random Processes Lecture Notes 3 July 28, 2014

Indian Institute of Technology Bombay. Department of Electrical Engineering. EE 325 Probability and Random Processes Lecture Notes 3 July 28, 2014 Indian Institute of Tehnology Bombay Department of Eletrial Engineering Handout 5 EE 325 Probability and Random Proesses Leture Notes 3 July 28, 2014 1 Axiomati Probability We have learned some paradoxes

More information

DIGITAL DISTANCE RELAYING SCHEME FOR PARALLEL TRANSMISSION LINES DURING INTER-CIRCUIT FAULTS

DIGITAL DISTANCE RELAYING SCHEME FOR PARALLEL TRANSMISSION LINES DURING INTER-CIRCUIT FAULTS CHAPTER 4 DIGITAL DISTANCE RELAYING SCHEME FOR PARALLEL TRANSMISSION LINES DURING INTER-CIRCUIT FAULTS 4.1 INTRODUCTION Around the world, environmental and ost onsiousness are foring utilities to install

More information

Discrete Bessel functions and partial difference equations

Discrete Bessel functions and partial difference equations Disrete Bessel funtions and partial differene equations Antonín Slavík Charles University, Faulty of Mathematis and Physis, Sokolovská 83, 186 75 Praha 8, Czeh Republi E-mail: slavik@karlin.mff.uni.z Abstrat

More information

Multi-version Coding for Consistent Distributed Storage of Correlated Data Updates

Multi-version Coding for Consistent Distributed Storage of Correlated Data Updates Multi-version Coding for Consistent Distributed Storage of Correlated Data Updates Ramy E. Ali and Vivek R. Cadambe 1 arxiv:1708.06042v1 [s.it] 21 Aug 2017 Abstrat Motivated by appliations of distributed

More information

F = F x x + F y. y + F z

F = F x x + F y. y + F z ECTION 6: etor Calulus MATH20411 You met vetors in the first year. etor alulus is essentially alulus on vetors. We will need to differentiate vetors and perform integrals involving vetors. In partiular,

More information

Error Bounds for Context Reduction and Feature Omission

Error Bounds for Context Reduction and Feature Omission Error Bounds for Context Redution and Feature Omission Eugen Bek, Ralf Shlüter, Hermann Ney,2 Human Language Tehnology and Pattern Reognition, Computer Siene Department RWTH Aahen University, Ahornstr.

More information

Simplified Buckling Analysis of Skeletal Structures

Simplified Buckling Analysis of Skeletal Structures Simplified Bukling Analysis of Skeletal Strutures B.A. Izzuddin 1 ABSRAC A simplified approah is proposed for bukling analysis of skeletal strutures, whih employs a rotational spring analogy for the formulation

More information

On the Licensing of Innovations under Strategic Delegation

On the Licensing of Innovations under Strategic Delegation On the Liensing of Innovations under Strategi Delegation Judy Hsu Institute of Finanial Management Nanhua University Taiwan and X. Henry Wang Department of Eonomis University of Missouri USA Abstrat This

More information

A Characterization of Wavelet Convergence in Sobolev Spaces

A Characterization of Wavelet Convergence in Sobolev Spaces A Charaterization of Wavelet Convergene in Sobolev Spaes Mark A. Kon 1 oston University Louise Arakelian Raphael Howard University Dediated to Prof. Robert Carroll on the oasion of his 70th birthday. Abstrat

More information

A NONLILEAR CONTROLLER FOR SHIP AUTOPILOTS

A NONLILEAR CONTROLLER FOR SHIP AUTOPILOTS Vietnam Journal of Mehanis, VAST, Vol. 4, No. (), pp. A NONLILEAR CONTROLLER FOR SHIP AUTOPILOTS Le Thanh Tung Hanoi University of Siene and Tehnology, Vietnam Abstrat. Conventional ship autopilots are

More information

MultiPhysics Analysis of Trapped Field in Multi-Layer YBCO Plates

MultiPhysics Analysis of Trapped Field in Multi-Layer YBCO Plates Exerpt from the Proeedings of the COMSOL Conferene 9 Boston MultiPhysis Analysis of Trapped Field in Multi-Layer YBCO Plates Philippe. Masson Advaned Magnet Lab *7 Main Street, Bldg. #4, Palm Bay, Fl-95,

More information

An Adaptive Optimization Approach to Active Cancellation of Repeated Transient Vibration Disturbances

An Adaptive Optimization Approach to Active Cancellation of Repeated Transient Vibration Disturbances An aptive Optimization Approah to Ative Canellation of Repeated Transient Vibration Disturbanes David L. Bowen RH Lyon Corp / Aenteh, 33 Moulton St., Cambridge, MA 138, U.S.A., owen@lyonorp.om J. Gregory

More information

Determination of the reaction order

Determination of the reaction order 5/7/07 A quote of the wee (or amel of the wee): Apply yourself. Get all the eduation you an, but then... do something. Don't just stand there, mae it happen. Lee Iaoa Physial Chemistry GTM/5 reation order

More information

Lecture 7: Sampling/Projections for Least-squares Approximation, Cont. 7 Sampling/Projections for Least-squares Approximation, Cont.

Lecture 7: Sampling/Projections for Least-squares Approximation, Cont. 7 Sampling/Projections for Least-squares Approximation, Cont. Stat60/CS94: Randomized Algorithms for Matries and Data Leture 7-09/5/013 Leture 7: Sampling/Projetions for Least-squares Approximation, Cont. Leturer: Mihael Mahoney Sribe: Mihael Mahoney Warning: these

More information

max min z i i=1 x j k s.t. j=1 x j j:i T j

max min z i i=1 x j k s.t. j=1 x j j:i T j AM 221: Advaned Optimization Spring 2016 Prof. Yaron Singer Leture 22 April 18th 1 Overview In this leture, we will study the pipage rounding tehnique whih is a deterministi rounding proedure that an be

More information

Four-dimensional equation of motion for viscous compressible substance with regard to the acceleration field, pressure field and dissipation field

Four-dimensional equation of motion for viscous compressible substance with regard to the acceleration field, pressure field and dissipation field Four-dimensional equation of motion for visous ompressible substane with regard to the aeleration field, pressure field and dissipation field Sergey G. Fedosin PO box 6488, Sviazeva str. -79, Perm, Russia

More information

Modeling Probabilistic Measurement Correlations for Problem Determination in Large-Scale Distributed Systems

Modeling Probabilistic Measurement Correlations for Problem Determination in Large-Scale Distributed Systems 009 9th IEEE International Conferene on Distributed Computing Systems Modeling Probabilisti Measurement Correlations for Problem Determination in Large-Sale Distributed Systems Jing Gao Guofei Jiang Haifeng

More information

Sensor management for PRF selection in the track-before-detect context

Sensor management for PRF selection in the track-before-detect context Sensor management for PRF seletion in the tra-before-detet ontext Fotios Katsilieris, Yvo Boers, and Hans Driessen Thales Nederland B.V. Haasbergerstraat 49, 7554 PA Hengelo, the Netherlands Email: {Fotios.Katsilieris,

More information

Supporting Information

Supporting Information Supporting Information Olsman and Goentoro 10.1073/pnas.1601791113 SI Materials Analysis of the Sensitivity and Error Funtions. We now define the sensitivity funtion Sð, «0 Þ, whih summarizes the steepness

More information

Modeling of discrete/continuous optimization problems: characterization and formulation of disjunctions and their relaxations

Modeling of discrete/continuous optimization problems: characterization and formulation of disjunctions and their relaxations Computers and Chemial Engineering (00) 4/448 www.elsevier.om/loate/omphemeng Modeling of disrete/ontinuous optimization problems: haraterization and formulation of disjuntions and their relaxations Aldo

More information

Planning with Uncertainty in Position: an Optimal Planner

Planning with Uncertainty in Position: an Optimal Planner Planning with Unertainty in Position: an Optimal Planner Juan Pablo Gonzalez Anthony (Tony) Stentz CMU-RI -TR-04-63 The Robotis Institute Carnegie Mellon University Pittsburgh, Pennsylvania 15213 Otober

More information

Lecture 3 - Lorentz Transformations

Lecture 3 - Lorentz Transformations Leture - Lorentz Transformations A Puzzle... Example A ruler is positioned perpendiular to a wall. A stik of length L flies by at speed v. It travels in front of the ruler, so that it obsures part of the

More information

Relativistic Dynamics

Relativistic Dynamics Chapter 7 Relativisti Dynamis 7.1 General Priniples of Dynamis 7.2 Relativisti Ation As stated in Setion A.2, all of dynamis is derived from the priniple of least ation. Thus it is our hore to find a suitable

More information

The gravitational phenomena without the curved spacetime

The gravitational phenomena without the curved spacetime The gravitational phenomena without the urved spaetime Mirosław J. Kubiak Abstrat: In this paper was presented a desription of the gravitational phenomena in the new medium, different than the urved spaetime,

More information

Chapter 8 Hypothesis Testing

Chapter 8 Hypothesis Testing Leture 5 for BST 63: Statistial Theory II Kui Zhang, Spring Chapter 8 Hypothesis Testing Setion 8 Introdution Definition 8 A hypothesis is a statement about a population parameter Definition 8 The two

More information

arxiv:math/ v4 [math.ca] 29 Jul 2006

arxiv:math/ v4 [math.ca] 29 Jul 2006 arxiv:math/0109v4 [math.ca] 9 Jul 006 Contiguous relations of hypergeometri series Raimundas Vidūnas University of Amsterdam Abstrat The 15 Gauss ontiguous relations for F 1 hypergeometri series imply

More information

Physical Laws, Absolutes, Relative Absolutes and Relativistic Time Phenomena

Physical Laws, Absolutes, Relative Absolutes and Relativistic Time Phenomena Page 1 of 10 Physial Laws, Absolutes, Relative Absolutes and Relativisti Time Phenomena Antonio Ruggeri modexp@iafria.om Sine in the field of knowledge we deal with absolutes, there are absolute laws that

More information

23.1 Tuning controllers, in the large view Quoting from Section 16.7:

23.1 Tuning controllers, in the large view Quoting from Section 16.7: Lesson 23. Tuning a real ontroller - modeling, proess identifiation, fine tuning 23.0 Context We have learned to view proesses as dynami systems, taking are to identify their input, intermediate, and output

More information

Slenderness Effects for Concrete Columns in Sway Frame - Moment Magnification Method

Slenderness Effects for Concrete Columns in Sway Frame - Moment Magnification Method Slenderness Effets for Conrete Columns in Sway Frame - Moment Magnifiation Method Slender Conrete Column Design in Sway Frame Buildings Evaluate slenderness effet for olumns in a sway frame multistory

More information

SURFACE WAVES OF NON-RAYLEIGH TYPE

SURFACE WAVES OF NON-RAYLEIGH TYPE SURFACE WAVES OF NON-RAYLEIGH TYPE by SERGEY V. KUZNETSOV Institute for Problems in Mehanis Prosp. Vernadskogo, 0, Mosow, 75 Russia e-mail: sv@kuznetsov.msk.ru Abstrat. Existene of surfae waves of non-rayleigh

More information

LECTURE NOTES FOR , FALL 2004

LECTURE NOTES FOR , FALL 2004 LECTURE NOTES FOR 18.155, FALL 2004 83 12. Cone support and wavefront set In disussing the singular support of a tempered distibution above, notie that singsupp(u) = only implies that u C (R n ), not as

More information

Statistical physics analysis of the computational complexity of solving random satisfiability problems using backtrack algorithms

Statistical physics analysis of the computational complexity of solving random satisfiability problems using backtrack algorithms Eur. Phys. J. B, 55 53 ) THE EUROPEAN PHYSICAL JOURNAL B EDP Sienes Soietà Italiana di Fisia Springer-Verlag Statistial physis analysis of the omputational omplexity of solving random satisfiability problems

More information

EE 321 Project Spring 2018

EE 321 Project Spring 2018 EE 21 Projet Spring 2018 This ourse projet is intended to be an individual effort projet. The student is required to omplete the work individually, without help from anyone else. (The student may, however,

More information

FORMAL METHODS LECTURE VI BINARY DECISION DIAGRAMS (BDD S)

FORMAL METHODS LECTURE VI BINARY DECISION DIAGRAMS (BDD S) Alessandro Artale (FM First Semester 2009/2010) p. 1/38 FORMAL METHODS LECTURE VI BINARY DECISION DIAGRAMS (BDD S) Alessandro Artale Faulty of Computer Siene Free University of Bolzano artale@inf.unibz.it

More information

ELECTROMAGNETIC WAVES WITH NONLINEAR DISPERSION LAW. P. М. Меdnis

ELECTROMAGNETIC WAVES WITH NONLINEAR DISPERSION LAW. P. М. Меdnis ELECTROMAGNETIC WAVES WITH NONLINEAR DISPERSION LAW P. М. Меdnis Novosibirs State Pedagogial University, Chair of the General and Theoretial Physis, Russia, 636, Novosibirs,Viljujsy, 8 e-mail: pmednis@inbox.ru

More information

On the density of languages representing finite set partitions

On the density of languages representing finite set partitions On the density of languages representing finite set partitions Nelma Moreira Rogério Reis Tehnial Report Series: DCC-04-07 Departamento de Ciênia de Computadores Fauldade de Ciênias & Laboratório de Inteligênia

More information

Word of Mass: The Relationship between Mass Media and Word-of-Mouth

Word of Mass: The Relationship between Mass Media and Word-of-Mouth Word of Mass: The Relationship between Mass Media and Word-of-Mouth Roman Chuhay Preliminary version Marh 6, 015 Abstrat This paper studies the optimal priing and advertising strategies of a firm in the

More information

Maximum Likelihood Multipath Estimation in Comparison with Conventional Delay Lock Loops

Maximum Likelihood Multipath Estimation in Comparison with Conventional Delay Lock Loops Maximum Likelihood Multipath Estimation in Comparison with Conventional Delay Lok Loops Mihael Lentmaier and Bernhard Krah, German Aerospae Center (DLR) BIOGRAPY Mihael Lentmaier reeived the Dipl.-Ing.

More information

Variation Based Online Travel Time Prediction Using Clustered Neural Networks

Variation Based Online Travel Time Prediction Using Clustered Neural Networks Variation Based Online Travel Time Predition Using lustered Neural Networks Jie Yu, Gang-Len hang, H.W. Ho and Yue Liu Abstrat-This paper proposes a variation-based online travel time predition approah

More information

Sufficient Conditions for a Flexible Manufacturing System to be Deadlocked

Sufficient Conditions for a Flexible Manufacturing System to be Deadlocked Paper 0, INT 0 Suffiient Conditions for a Flexile Manufaturing System to e Deadloked Paul E Deering, PhD Department of Engineering Tehnology and Management Ohio University deering@ohioedu Astrat In reent

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilisti Graphial Models David Sontag New York University Leture 12, April 19, 2012 Aknowledgement: Partially based on slides by Eri Xing at CMU and Andrew MCallum at UMass Amherst David Sontag (NYU)

More information

Predicting the confirmation time of Bitcoin transactions

Predicting the confirmation time of Bitcoin transactions Prediting the onfirmation time of Bitoin transations D.T. Koops Korteweg-de Vries Institute, University of Amsterdam arxiv:189.1596v1 [s.dc] 21 Sep 218 September 28, 218 Abstrat We study the probabilisti

More information

Stochastic Combinatorial Optimization with Risk Evdokia Nikolova

Stochastic Combinatorial Optimization with Risk Evdokia Nikolova Computer Siene and Artifiial Intelligene Laboratory Tehnial Report MIT-CSAIL-TR-2008-055 September 13, 2008 Stohasti Combinatorial Optimization with Risk Evdokia Nikolova massahusetts institute of tehnology,

More information

The universal model of error of active power measuring channel

The universal model of error of active power measuring channel 7 th Symposium EKO TC 4 3 rd Symposium EKO TC 9 and 5 th WADC Workshop nstrumentation for the CT Era Sept. 8-2 Kosie Slovakia The universal model of error of ative power measuring hannel Boris Stogny Evgeny

More information