Fast index for approximate string matching

Dekel Tsur

Abstract

We present an index that stores a text of length $n$ such that given a pattern of length $m$, all the substrings of the text that are within Hamming distance (or edit distance) at most $k$ from the pattern are reported in $O(m + \log\log n + \#\mathrm{matches})$ time (for constant $k$). The space complexity of the index is $O(n^{1+\epsilon})$ for any constant $\epsilon > 0$.

1 Introduction

One of the fundamental problems in pattern matching is indexing a text $t$ such that given a query pattern $p$, all the occurrences of $p$ in $t$ can be reported efficiently. This can be solved optimally using suffix trees [12]: The construction time and space complexity of the index is $O(n)$, and the query time is $O(m + \#\mathrm{matches})$, where $n$ is the length of $t$, $m$ is the length of $p$, and $\#\mathrm{matches}$ is the number of times $p$ appears in $t$. For simplicity, we shall assume throughout the paper that the size of the alphabet is constant.

A natural extension of text indexing is to allow approximate search in the index. Formally, given a text $t$ and an integer $k$, the goal is to build an index for $t$ such that given a query string $p$, all the substrings of $t$ with Hamming distance (or edit distance) at most $k$ from $p$ can be reported efficiently. Again for simplicity, we assume throughout that $k$ is constant.

Building an approximate index with almost linear space and query time was a major open problem. The first efficient approximate index was obtained for the case $k = 1$ by Amir et al. [1]. The index of Amir et al. uses $O(n \log^2 n)$ space, and answers queries in time $O(m \log n \log\log n + \#\mathrm{matches})$. A faster query time is obtained using the data-structure of [2]. Linear space indices that support one error were given in [7,8]. A big breakthrough was obtained by Cole et al. [4], who presented an index that supports an arbitrary number of errors. The index of Cole et al. uses $O(n \log^k n)$ space and answers queries in time $O(m + \log^k n \log\log n + \#\mathrm{matches})$. Chan et al. [3] gave an $O(n)$-space index that answers queries in time $O(m + (\log n)^{k(k+1)} \log\log n + \#\mathrm{matches})$. Most of the results above work for both Hamming distance and edit distance.
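
To make the problem statement concrete, the following minimal Python sketch reports approximate occurrences by brute force under Hamming distance. It is only a reference baseline that takes $O(nm)$ time per query, not any of the indices discussed in this paper; the function names `hamming` and `approx_occurrences` are illustrative assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def approx_occurrences(t, p, k):
    """0-based start positions of the substrings of t of length len(p)
    whose Hamming distance from p is at most k (brute force scan)."""
    m = len(p)
    return [i for i in range(len(t) - m + 1) if hamming(t[i:i + m], p) <= k]


if __name__ == "__main__":
    print(approx_occurrences("banana", "nana", 1))
```

With the text "banana", the pattern "nana", and $k = 1$, the sketch reports the start positions 0 and 2. An approximate index aims to return the same set of positions without scanning the whole text.
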
We note that the query time complexity of the edit distance index in [4] is $O(m + \log^k n \log\log n + 3^k \cdot \#\mathrm{matches})$. However, as we assume here that $k$ is constant, the time complexity becomes $O(m + \log^k n \log\log n + \#\mathrm{matches})$.

The indices mentioned above have worst case performance guarantees. Indices with good performance on average were given in [5,6,9–11].

In this paper, we show how to speed up the query time in the index of Cole et al. This comes at a cost of increasing the space complexity of the index. More precisely, we show that for every integer $\alpha$ with $2 \le \alpha \le n/2$, there is an $O(n(\alpha \log\alpha \log n)^k)$-space index (for Hamming distance or edit distance) that answers queries in time $O(m + (\log_\alpha n)^k \log\log n + \#\mathrm{matches})$. In particular, for every fixed $\epsilon > 0$, one can take $\alpha = \log^{\epsilon/(2k)} n$ and get an index with space complexity $O(n \log^{k+\epsilon} n)$ and query time $O(m + \log^k n/(\log\log n)^{k-1} + \#\mathrm{matches})$ (recall that $k$ is assumed to be constant, so $\log_\alpha n = \Theta(\log n/\log\log n)$). To get a faster query time, one can take $\alpha = n^{\epsilon/(2k)}$ for some $\epsilon > 0$ and get an index with space complexity $O(n^{1+\epsilon})$ and query time $O(m + \log\log n + \#\mathrm{matches})$.

Author's address: Department of Computer Science, Ben-Gurion University of the Negev. Email: dekelts@cs.bgu.ac.il

2 Preliminaries

Let $s_1,\ldots,s_n$ be a collection of strings, where each string ends with the character $\$$, and $\$$ does not appear elsewhere in $s_1,\ldots,s_n$. A compressed trie for $s_1,\ldots,s_n$ is a rooted tree $T$ that has $n$ leaves and in which each internal vertex has at least two children. Every edge of $T$ is labeled by a string. Every string $s_i$ corresponds to a distinct leaf $v_i$ of $T$ such that the concatenation of the labels of the edges on the path from the root of $T$ to $v_i$ is exactly $s_i$.

A location $l$ on a compressed trie $T$ is a pair $(v,s)$ where $v$ is a vertex of $T$ and $s$ is an empty string or a proper prefix of the label of some edge between $v$ and a child of $v$. We will sometimes refer to a vertex $v$ as the location $(v,\epsilon)$ and vice versa. For a vertex $v$ in a compressed trie $T$, the string that corresponds to $v$ is the concatenation of the labels on the path from the root of $T$ to $v$. For a location $l = (v,s)$, the string that corresponds to $l$, denoted $\mathrm{str}(l)$, is the concatenation of the string that corresponds to $v$ and $s$. The weight of a vertex $v$ in a tree $T$ is the number of descendant leaves of $v$.
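
The definitions above (compressed trie, location, $\mathrm{str}(l)$, and weight) can be illustrated with a small Python sketch. The class `Node` and the functions `common_prefix`, `build_compressed_trie`, and `locate` are hypothetical names introduced here. The construction is a simple recursive one chosen for readability, not an efficient construction, and it assumes distinct input strings that each end with the terminator `$`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # first character of an edge label -> (edge label, child vertex)
    children: dict = field(default_factory=dict)
    weight: int = 0  # number of descendant leaves


def common_prefix(strings):
    """Longest common prefix of a non-empty list of strings."""
    first, last = min(strings), max(strings)
    n = 0
    while n < len(first) and first[n] == last[n]:
        n += 1
    return first[:n]


def build_compressed_trie(strings):
    """Compressed trie of distinct strings, each ending with '$'."""
    for s in strings:
        if not s.endswith("$") or s.count("$") != 1:
            raise ValueError("each string must contain '$' exactly once, at its end")

    def build(suffixes):
        if suffixes == [""]:            # the whole string was consumed: a leaf
            return Node(weight=1)
        node, groups = Node(), {}
        for s in suffixes:              # group the suffixes by first character
            groups.setdefault(s[0], []).append(s)
        for ch in sorted(groups):
            label = common_prefix(groups[ch])
            child = build([s[len(label):] for s in groups[ch]])
            node.children[ch] = (label, child)
            node.weight += child.weight
        return node

    return build(sorted(set(strings)))


def locate(root, p):
    """Deepest location l = (v, s) such that str(l) is a prefix of p.
    Returns (v, s, len(str(l)))."""
    node, matched = root, 0
    while matched < len(p) and p[matched] in node.children:
        label, child = node.children[p[matched]]
        common = 0
        while (common < len(label) and matched + common < len(p)
               and label[common] == p[matched + common]):
            common += 1
        if common < len(label):         # stopped strictly inside an edge
            return node, label[:common], matched + common
        node, matched = child, matched + common
    return node, "", matched
```

For the strings `banana$`, `band$`, `bat$`, and `cat$`, the root has weight 4, the edge leaving the root with first character `b` is labeled `ba`, and `locate(root, "bandit")` stops at a location whose string is `band`.
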
A path $[v_1,\ldots,v_d]$ in a tree $T$ is a heavy path if (1) $v_1$ is the root of $T$, (2) $v_d$ is a leaf, and (3) for every $i < d$, there is no child of $v_i$ with weight greater than the weight of $v_{i+1}$. A heavy path decomposition of a tree $T$ is a set $\mathcal{C}$ of paths in $T$ such that (1) $\mathcal{C}$ contains a heavy path $C$ of $T$, and (2) for every connected component $T'$ in $T - C$, $\mathcal{C}$ contains the paths in a heavy path decomposition of $T'$ ($T - C$ is the graph obtained from $T$ by removing the vertices of $C$). For a heavy path decomposition $\mathcal{C}$ define $T_{\mathcal{C}}$ to be a rooted tree whose set of vertices is $\mathcal{C}$, and there is an edge from $C$ to $C'$ in $T_{\mathcal{C}}$ if there is a vertex $v \in C$ such that the topmost vertex in $C'$ is a child of $v$ in $T$.

Given a heavy path decomposition $\mathcal{C}$ of a compressed trie $T$ and a location $l = (v,s)$ in $T$, $\mathrm{nextloc}(l)$ is the location reached when moving from $l$ one character along the path $C \in \mathcal{C}$ that contains $v$. Formally, $\mathrm{nextloc}(l)$ is the location $l' = (v',s')$ such that the string $\mathrm{str}(l')$ is the prefix of length $|\mathrm{str}(l)| + 1$ of the string that corresponds to the bottommost vertex of $C$. If there is no such location $l'$ then $\mathrm{nextloc}(l)$ is undefined. We also define $\mathrm{next}(l)$ to be the last character of the string that corresponds to $\mathrm{nextloc}(l)$.

For a vertex $v$ in a compressed trie $T$, $\mathrm{nextchars}(v)$ is the set of all first characters in the labels of the edges between $v$ and its children. For a character $c \in \mathrm{nextchars}(v)$, let $w$ be the child of $v$ such that the first character of the label of the edge $(v,w)$ is $c$. We define $\mathrm{Sub}(T,v,c)$ to be the tree obtained by first taking the subtree of $T$ induced by $v$, $w$, and all the descendants of $w$. Furthermore, if the label of $(v,w)$ contains only one character then the vertex $v$ and the edge $(v,w)$ are removed from $\mathrm{Sub}(T,v,c)$. Otherwise, the first character of the label of $(v,w)$ is erased.

Let $T_1,\ldots,T_d$ be compressed tries. The merge of $T_1,\ldots,T_d$ is a compressed trie whose set of strings is the union of the sets of strings of $T_1,\ldots,T_d$.

3 k-mismatches index

The following problem is a generalization of the indexing problem that was discussed in the introduction.

Input: A compressed trie $T$ over strings $s_1,\ldots,s_n$.
Query: A string $p$, and a location $l$ on $T$.
Output: All the strings $s_i$ such that $\mathrm{str}(l)$ is a prefix of $s_i$ and the Hamming distance between $p$ and $s_i[|\mathrm{str}(l)|+1\,..\,|\mathrm{str}(l)|+m]$ is exactly $k$, where $m$ is the length of $p$.

A data-structure that solves the problem above is called an unrooted k-mismatches index. A data-structure that solves a simpler variant of the problem in which $\mathrm{str}(l)$ is always empty is called a rooted k-mismatches index. To solve the indexing problem mentioned in the introduction, one can construct a rooted $k'$-mismatches index on all the suffixes of the input string $t$ for all $k' \le k$. We note that we use Hamming distance to simplify the presentation. The same techniques can also be used for edit distance.

We first describe the k-mismatches index of Cole et al. [4]. The main idea is to define new compressed tries called group trees, and recursively build a rooted $(k-1)$-mismatches index on each group tree (the recursion stops when $k$ is equal to 0). A k-mismatches query on $T$ is answered by making $(k-1)$-mismatches queries on $O(\log n)$ group trees. Let $T$ be a compressed trie of the strings $s_1,\ldots,s_n$, and let $\mathcal{C}$ be a heavy path decomposition of $T$.
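
Since the constructions that follow are organized around the heavy path decomposition $\mathcal{C}$ and the tree $T_{\mathcal{C}}$, here is a minimal Python sketch of how such a decomposition could be computed. It works on a plain rooted tree given as a dictionary from each vertex to its list of children; this representation and the name `heavy_path_decomposition` are assumptions made for illustration, and ties between equally heavy children are broken arbitrarily (here, by taking the first one).

```python
def heavy_path_decomposition(children, root):
    """Return (paths, parent_path, weight).

    paths[i] lists the vertices of the i-th heavy path from top to bottom.
    parent_path[i] is the index of the path that contains the parent of
    paths[i][0] (this encodes the tree T_C), or None for the root path.
    weight[v] is the number of descendant leaves of v.
    """
    weight = {}

    def compute_weight(v):
        kids = children.get(v, [])
        weight[v] = 1 if not kids else sum(compute_weight(c) for c in kids)
        return weight[v]

    compute_weight(root)

    paths, parent_path = [], []
    stack = [(root, None)]  # (topmost vertex of a new path, index of parent path)
    while stack:
        top, parent = stack.pop()
        index = len(paths)              # index that this path will receive
        path, v = [top], top
        while children.get(v):
            kids = children[v]
            heavy = max(kids, key=weight.__getitem__)  # no child is heavier
            for c in kids:
                if c != heavy:          # every other child starts its own path
                    stack.append((c, index))
            path.append(heavy)
            v = heavy
        paths.append(path)
        parent_path.append(parent)
    return paths, parent_path, weight
```

Each vertex ends up in exactly one path, and every path starts either at the root or at a child that was not chosen as the heavy child of its parent, which matches the recursive definition given in the preliminaries.
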
Consider some heavy path $C \in \mathcal{C}$, and let $v_1,\ldots,v_d$ be the vertices along the path $C$ (where $v_1$ is the topmost vertex in the path). We define error trees as follows: For every vertex $v_i$ and every $c \in \mathrm{nextchars}(v_i) \setminus \{\mathrm{next}(v_i)\}$, the error tree $\mathrm{Err}(T,v_i,c)$ is equal to $\mathrm{Sub}(T,v_i,c)$. The error tree $\mathrm{Err}(T,v_i)$ is the tree obtained by merging the trees $\mathrm{Sub}(T,v_i,c)$ for every $c \in \mathrm{nextchars}(v_i) \setminus \{\mathrm{next}(v_i)\}$. Then, if the root $u$ of the resulting tree has more than one child we add a new root $u'$ and an edge $(u',u)$ with label $s$, where $s$ is the string obtained by concatenating the labels of the edges on the path from $v_1$ to $v_i$, and the character $\mathrm{next}(v_i)$. If $u$ has only one child we prepend the string $s$ to the label of the edge between $u$ and its child. See Figure 1 for examples of the definitions above.

Figure 1: Example of error trees. Figure (a) shows a heavy path $v_1,v_2,\ldots$ and the vertices hanging from this path. The error trees $\mathrm{Err}(T,v_2,c)$ for the two characters $c \in \mathrm{nextchars}(v_2) \setminus \{\mathrm{next}(v_2)\}$ are shown in Figures (b) and (c), respectively. Figure (d) shows the error tree $\mathrm{Err}(T,v_2)$, which is obtained by merging these two trees and adding a new root $u'$.

The next step is to construct group trees from the error trees. Let $w_i$ be the number of leaves in the tree $\mathrm{Err}(T,v_i)$. For each vertex $v_i$ we assign an interval $I_i = [\sum_{j<i} w_j, \sum_{j \le i} w_j)$. For an interval $I = [a,b)$, we will denote $\mathrm{left}(I) = a$ and $\mathrm{right}(I) = b$. The merge of $\mathrm{Err}(T,v_i),\ldots,\mathrm{Err}(T,v_j)$ will be denoted $\mathrm{Group}_1(T,v_i,v_j)$ and will be called a type 1 group tree. We do not create $\mathrm{Group}_1(T,v_i,v_j)$ for all $i$ and $j$ (as this would take too much space). Instead, the type 1 group trees are constructed by the following procedure (an example is given in Figure 2).

1: For every $C \in \mathcal{C}$ which is not a leaf in $T_{\mathcal{C}}$ do
2:   Let $v_1,\ldots,v_d$ be the vertices of $C$ with intervals $I_1,\ldots,I_d$.
3:   $L_1 \leftarrow \{(1,d)\}$.
4:   $t \leftarrow 1$.
5:   While $L_t \ne \emptyset$ do
6:     $L_{t+1} \leftarrow \emptyset$.
7:     For every $(i,i') \in L_t$ do
8:       $a \leftarrow \mathrm{left}(I_i)$, $b \leftarrow \mathrm{right}(I_{i'})$.
9:       Let $j$ be the index such that $\frac{a+b}{2} \in I_j$.
10:      If $j \ge i+1$ then build the group tree $\mathrm{Group}_1(T,v_i,v_{j-1})$.
11:      Build the group tree $\mathrm{Group}_1(T,v_j,v_j)$.
12:      If $j \le i'-1$ then build the group tree $\mathrm{Group}_1(T,v_{j+1},v_{i'})$.
13:      If $j > i+1$ then add $(i,j-1)$ to $L_{t+1}$.
14:      If $j < i'-1$ then add $(j+1,i')$ to $L_{t+1}$.
15:    $t \leftarrow t+1$.

Figure 2: An example of type 1 group tree construction. The top line shows intervals $I_1,\ldots,I_7$ and the point $\frac{a+b}{2} \in I_3$. Thus, the first iteration creates the group trees $\mathrm{Group}_1(T,v_1,v_2)$, $\mathrm{Group}_1(T,v_3,v_3)$, and $\mathrm{Group}_1(T,v_4,v_7)$. In the next iteration, the following trees are created: $\mathrm{Group}_1(T,v_1,v_1)$, $\mathrm{Group}_1(T,v_2,v_2)$, $\mathrm{Group}_1(T,v_4,v_4)$, $\mathrm{Group}_1(T,v_5,v_5)$, and $\mathrm{Group}_1(T,v_6,v_7)$. In the final iteration, the group trees $\mathrm{Group}_1(T,v_6,v_6)$ and $\mathrm{Group}_1(T,v_7,v_7)$ are created.

For every vertex $v$ in $T$ we create group trees from the error trees $\mathrm{Err}(T,v,c)$ in a similar way. These trees will be called type 2 group trees. On every group tree (of type 1 or 2) we build a rooted $(k-1)$-mismatches index. Also, we build an unrooted $(k-1)$-mismatches index on $T$.

We now describe how to answer a rooted query $p$. This is done by performing $(k-1)$-mismatches queries on some group trees or on $T$. Let $l$ be the location in $T$ such that $\mathrm{str}(l)$ is a prefix of $p$, and $|\mathrm{str}(l)|$ is maximal. The path that corresponds to $p$ is the path from the root of $T$ to $l$. Let $C_1,\ldots,C_r$ be the paths of $\mathcal{C}$ through which the path that corresponds to $p$ passes, in order from top to bottom. For $t = 1,\ldots,r$, let $l_t$ be the last location on $C_t$ through which the path that corresponds to $p$ passes. Note that for $t < r$, $l_t$ must be a vertex. For every path $C_t$, let $v_1,\ldots,v_d$ be the vertices of the path, and let $j$ be the minimum index such that $|\mathrm{str}(v_j)| \ge |\mathrm{str}(l_t)|$. The following queries are performed:

1. If $l_t$ is not a leaf, do an unrooted $(k-1)$-mismatches query on $T$ with query string $p[|\mathrm{str}(l_t)|+2\,..\,m]$ and start position $\mathrm{nextloc}(l_t)$.
2. Identify the type 1 group trees whose merge includes precisely the error trees $\mathrm{Err}(T,v_1),\ldots,\mathrm{Err}(T,v_{j-1})$. On each group tree, do a $(k-1)$-mismatches query with query string $p[|\mathrm{str}(v_1)|+1\,..\,m]$.
3. If $l_t = v_j$ and $l_t$ is not a leaf, identify the type 2 group trees whose merge includes precisely the error trees $\mathrm{Err}(T,v_j,c)$ for all $c \ne p[|\mathrm{str}(v_j)|+1]$. On each group tree, do a $(k-1)$-mismatches query with query string $p[|\mathrm{str}(v_j)|+2\,..\,m]$.
Handling an unrooted query is done similarly: In this case the path that corresponds to $p$ starts at the query location $l$ instead of starting at the root. Handling the paths $C_2,\ldots,C_r$ is the same as before. For the path $C_1$, the type 1 group trees that are queried are the trees whose merge includes precisely the error trees $\mathrm{Err}(T,v_i),\ldots,\mathrm{Err}(T,v_{j-1})$, where $i$ is the minimum index such that $|\mathrm{str}(v_i)| \ge |\mathrm{str}(l)|$ and $j$ is defined as before.
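
The procedure of this section that decides which type 1 group trees are built for one heavy path can be simulated with a short Python sketch. It only tracks the index pairs $(i,j)$ of the trees $\mathrm{Group}_1(T,v_i,v_j)$, not the trees themselves. The function name `type1_group_ranges`, the assumption that all weights $w_i$ are positive, and the concrete weights in the usage note are illustrative choices that do not come from the paper.

```python
from bisect import bisect_right
from itertools import accumulate


def type1_group_ranges(w):
    """Simulate the type 1 group tree procedure on one heavy path.

    w[i-1] is the number of leaves w_i of Err(T, v_i), assumed positive.
    Returns a list with one entry per iteration of the while loop; each
    entry lists the 1-based pairs (i, j) of the trees Group_1(T, v_i, v_j)
    built in that iteration.
    """
    ends = list(accumulate(w))      # ends[i-1]   = right(I_i)
    starts = [0] + ends[:-1]        # starts[i-1] = left(I_i)

    def index_of(x):
        # 1-based index j such that x lies in I_j = [left(I_j), right(I_j))
        return bisect_right(ends, x) + 1

    rounds, current = [], [(1, len(w))]
    while current:
        built, following = [], []
        for i, i2 in current:
            a, b = starts[i - 1], ends[i2 - 1]
            j = index_of((a + b) / 2)
            if j >= i + 1:
                built.append((i, j - 1))
            built.append((j, j))
            if j <= i2 - 1:
                built.append((j + 1, i2))
            if j > i + 1:
                following.append((i, j - 1))
            if j < i2 - 1:
                following.append((j + 1, i2))
        rounds.append(built)
        current = following
    return rounds
```

With the hypothetical weights `[2, 2, 3, 1, 2, 1, 1]`, the midpoint of the first interval falls in $I_3$, and the three iterations build the pairs (1,2), (3,3), (4,7), then (1,1), (2,2), (4,4), (5,5), (6,7), and finally (6,6), (7,7), which is the same sequence as in the example of Figure 2.
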

4 New index

Our construction is similar to the construction of Cole et al. We build more group trees in order to reduce the number of group trees that are searched when answering a query. In particular, while in the construction of Cole et al. a group tree consists of error trees that come from one heavy path, in our construction some group trees (called type 3 group trees) consist of error trees from several heavy paths.

Let $\alpha$ be some integer with $2 \le \alpha \le n/2$. The type 1 group trees are built using procedure Build described below.

1: For every $C \in \mathcal{C}$ which is not a leaf in $T_{\mathcal{C}}$ do
2:   Let $v_1,\ldots,v_d$ be the vertices of $C$ with intervals $I_1,\ldots,I_d$.
3:   $L_1 \leftarrow \{(1,d)\}$.
4:   $t \leftarrow 1$.
5:   While $L_t \ne \emptyset$ do
6:     $L_{t+1} \leftarrow \emptyset$.
7:     For every $(i,i') \in L_t$ do
8:       $a \leftarrow \mathrm{left}(I_i)$, $b \leftarrow \mathrm{right}(I_{i'})$.
9:       $i_0 \leftarrow i-1$.
10:      For $j = 1,\ldots,\alpha-1$ do
11:        Let $i_j$ be the index such that $a + \frac{j}{\alpha}(b-a) \in I_{i_j}$.
12:        If $i_j > i_{j-1}$ then
13:          If $i_j \ge i+1$ then build the group tree $\mathrm{Group}_1(T,v_i,v_{i_j-1})$.
14:          Build the group tree $\mathrm{Group}_1(T,v_{i_j},v_{i_j})$.
15:          If $i_j \le i'-1$ then build the group tree $\mathrm{Group}_1(T,v_{i_j+1},v_{i'})$.
16:        If $i_j > i_{j-1}+2$ then add $(i_{j-1}+1,i_j-1)$ to $L_{t+1}$.
17:      If $i_{\alpha-1} < i'-1$ then add $(i_{\alpha-1}+1,i')$ to $L_{t+1}$.
18:    $t \leftarrow t+1$.

The type 2 group trees are built similarly. We also define type 3 group trees as follows. The weight of a path $C \in \mathcal{C}$ is the weight of the topmost vertex in $C$. A path $C' \in \mathcal{C}$ is called bad if $\mathrm{weight}(C') > \frac{1}{\alpha}\mathrm{weight}(C)$, where $C$ is the parent of $C'$ in $T_{\mathcal{C}}$. We scan the vertices of the tree $T_{\mathcal{C}}$ in preorder. When we reach a vertex $C$ that has at least one bad child, we build a set $B(C)$ containing the path $C$ and all paths $C' \in \mathcal{C}$ such that $C'$ is a descendant of $C$ in $T_{\mathcal{C}}$ and $\mathrm{weight}(C') > \frac{1}{\alpha}\mathrm{weight}(C)$. Note that every $C' \in B(C) \setminus \{C\}$ is a bad path. For every $C',C'' \in B(C)$ such that $C''$ is a descendant of $C'$ we create a type 3 group tree, denoted $\mathrm{Group}_3(T,C',C'')$, in the following way. Let $C' = C_1,C_2,\ldots,C_{r-1},C_r = C''$ be the path from $C'$ to $C''$ in $T_{\mathcal{C}}$. Let $u_i$ be the first vertex in the path $C_i$, and for $i < r$ let $v_i$ be the parent of $u_{i+1}$ in $T$ (note that $v_i \in C_i$). Let $c_i$ be the first character of the label of the edge $(v_i,u_{i+1})$.
Let $s_i$ be the concatenation of the labels of the edges on the path from $u_1$ to $u_i$, and let $s'_i$ be the concatenation of the labels of the edges on the path from $u_1$ to $v_i$, and the character $c_i$. The group tree $\mathrm{Group}_3(T,C',C'')$ is the merge of the following trees.

1. For every $i < r$ and every $v \in C_i$ which is an ancestor of $v_i$, the tree obtained by taking $\mathrm{Err}(T,v)$ and prepending the string $s_i$ to the label of the edge between the root of $\mathrm{Err}(T,v)$ and its only child.
2. For every $i < r$ and every $c \in \mathrm{nextchars}(v_i) \setminus \{c_i\}$ (note that this includes $c = \mathrm{next}(v_i)$), the tree obtained by taking $\mathrm{Sub}(T,v_i,c)$ and, if the root of this tree has only one child, prepending the string $s'_i$ to the edge between the root and its child. Otherwise, a new root is added and connected to the old root by an edge, where the label of the edge is $s'_i$.

An example is given in Figure 3.

Figure 3: Example of type 3 group trees. The paths $C' = C_1$, $C_2$, and $C'' = C_3$ are shown in Figure (a). Two of the trees that are merged when creating $\mathrm{Group}_3(T,C',C'')$ are shown in (c) and (d). The tree in (c) is obtained from $\mathrm{Err}(T,v)$ (shown in (b)) by adding the string $s_2$ to the label of the edge between the root and its child. The tree in (d) is obtained from $\mathrm{Sub}(T,v_2,c)$, for a character $c \in \mathrm{nextchars}(v_2) \setminus \{c_2\}$, by adding a new root, where the label of the new edge is $s'_2$.

Answering a rooted query $p$ is performed as follows. Let $C_1,\ldots,C_r$ be the paths of $\mathcal{C}$ through which the path that corresponds to $p$ in $T$ passes. Start with $t = 1$. At each iteration, if $t = r$ or $C_{t+1}$ is not a bad path, perform queries for $C_t$ as described in the previous section, and increase $t$ by 1. Otherwise, do a rooted $(k-1)$-mismatches query on $\mathrm{Group}_3(T,C_t,C_{t'})$ and set $t$ to $t'$, where $t' > t$ is the maximum index such that $C_{t'} \in B(C_t)$. In more detail, the algorithm is as follows (we omit the queries on type 2 group trees, which are handled similarly to the queries on type 1 group trees).

1: Let $C_1,\ldots,C_r$ be the paths of $\mathcal{C}$ through which the path that corresponds to $p$ in $T$ passes.
2: $t \leftarrow 1$.
3: While $t \le r$ do
4:   Let $v_1,\ldots,v_d$ be the vertices of $C_t$, with intervals $I_1,\ldots,I_d$.
5:   If $t < r$ and $C_{t+1}$ is a bad path
6:     Let $t' > t$ be the maximum index such that $C_{t'} \in B(C_t)$.
7:     Do a rooted $(k-1)$-mismatches query on $\mathrm{Group}_3(T,C_t,C_{t'})$ with query string $p[|\mathrm{str}(v_1)|+1\,..\,m]$.
8:     $t \leftarrow t'$.
9:   Else
10:    Let $l_t$ be the last location on $C_t$ through which the path that corresponds to $p$ passes.
11:    If $l_t$ is not a leaf then do an unrooted $(k-1)$-mismatches query on $T$ with query string $p[|\mathrm{str}(l_t)|+2\,..\,m]$ and start position $\mathrm{nextloc}(l_t)$.
12:    Let $j$ be the minimum index such that $|\mathrm{str}(v_j)| \ge |\mathrm{str}(l_t)|$.
13:    $p' \leftarrow p[|\mathrm{str}(v_j)|+1\,..\,m]$.
14:    $i \leftarrow 1$, $i' \leftarrow d$.
15:    While $i < j$ do
16:      $a \leftarrow \mathrm{left}(I_i)$, $b \leftarrow \mathrm{right}(I_{i'})$.
17:      Let $\beta$ be the maximum integer such that $a + \frac{\beta}{\alpha}(b-a) < \mathrm{right}(I_j)$.
18:      If $\beta > 0$ then let $j_1$ be the index such that $a + \frac{\beta}{\alpha}(b-a) \in I_{j_1}$ else $j_1 \leftarrow i-1$.
19:      If $\beta < \alpha-1$ then let $j_2$ be the index such that $a + \frac{\beta+1}{\alpha}(b-a) \in I_{j_2}$ else $j_2 \leftarrow i'+1$.
20:      If $j_1 \ge i+1$ then do a rooted $(k-1)$-mismatches query on $\mathrm{Group}_1(T,v_i,v_{j_1-1})$ with query string $p'$.
21:      If $i \le j_1 < j$ then do a rooted $(k-1)$-mismatches query on $\mathrm{Group}_1(T,v_{j_1},v_{j_1})$ with query string $p'$.
22:      $i \leftarrow j_1+1$, $i' \leftarrow j_2-1$.
23:    $t \leftarrow t+1$.

For an unrooted query, the path $C_1$ is handled as in the handling of unrooted queries described in the previous section. Then, $C_2,\ldots,C_r$ are handled using the algorithm above.

Theorem 1. The time for answering a query is $O(m + (\log_\alpha n)^k \log\log n + \#\mathrm{matches})$.

Proof. Let $t_1,\ldots,t_{r'}$ be the different values of $t$ during the run of the algorithm. We first give a bound on $r'$. We claim that for every $i \le r'-2$, $\mathrm{weight}(C_{t_{i+2}}) \le \frac{1}{\alpha}\mathrm{weight}(C_{t_i})$: If $C_{t_i+1}$ is not a bad path then $t_{i+1} = t_i+1$ and $\mathrm{weight}(C_{t_{i+1}}) \le \frac{1}{\alpha}\mathrm{weight}(C_{t_i})$. Since $\mathrm{weight}(C_1) > \mathrm{weight}(C_2) > \cdots > \mathrm{weight}(C_r)$ and $t_{i+2} > t_{i+1}$, we obtain that $\mathrm{weight}(C_{t_{i+2}}) \le \frac{1}{\alpha}\mathrm{weight}(C_{t_i})$. If $C_{t_i+1}$ is a bad path then $C_{t_{i+1}+1}$ is not in $B(C_{t_i})$. Therefore, $\mathrm{weight}(C_{t_{i+2}}) \le \mathrm{weight}(C_{t_{i+1}+1}) \le \frac{1}{\alpha}\mathrm{weight}(C_{t_i})$. Since $\mathrm{weight}(C_1) = n$ and $\mathrm{weight}(C_{t_{r'}}) \ge 1$, we conclude that $r' \le 2 + 2\log_\alpha n$. Therefore, the number of $(k-1)$-mismatches queries performed at lines 7 and 11 is at most $r' \le 2 + 2\log_\alpha n$.

We next bound the number of queries performed on type 1 group trees. During the execution of lines 15–22, we say that the current interval is the interval $I_i \cup I_{i+1} \cup \cdots \cup I_{i'}$. The sequence of current intervals during the execution of the algorithm (for all $t$) is decreasing in lengths. If for some $C_t$, lines 15–22 are executed $s$ times, then the length of the current interval decreases by a factor of at least $\alpha^{\max(1,s-1)}$. Thus, lines 15–22 are executed at most $2 + 2\log_\alpha n$ times, and the number of queries performed on type 1 group trees is at most $4 + 4\log_\alpha n$. Using similar analysis, the number of queries on type 2 group trees is at most $8 + 8\log_\alpha n$ (in each iteration of the search in the type 2 group trees, up to 4 queries can be made).

Combining the bounds above, we have that the total number of $(k-1)$-mismatches queries performed when answering rooted queries is at most $14 + 14\log_\alpha n$. When answering an unrooted query, at most $18 + 18\log_\alpha n$ $(k-1)$-mismatches queries are made (the additional $4 + 4\log_\alpha n$ queries are due to the special handling of the path $C_1$). Using induction, the total number of 0-mismatches queries performed for a rooted or unrooted query is at most $(18 + 18\log_\alpha n)^k = O((\log_\alpha n)^k)$.

Using the LCP data-structures of Cole et al. [4] we have that after a preprocessing stage that takes $O(m)$ time, the $i$-th 0-mismatches query takes $O(\log\log n + \#\mathrm{matches}_i)$ time, where $\#\mathrm{matches}_i$ is the number of matches returned by the query. Since each approximate match of $p$ in $t$ is reported exactly once, $\sum_i \#\mathrm{matches}_i = \#\mathrm{matches}$. Therefore, the total time complexity of a k-mismatches query is $O(m + (\log_\alpha n)^k \log\log n + \#\mathrm{matches})$.

Theorem 2. The space complexity of the index is $O(n(\alpha \log\alpha \log n)^k)$.

Proof.
First, we bound the total number of leaves in all type 1 group trees (the analysis is similar to the analysis of Cole et al.). Define $S_k(n) = (5\alpha \log\alpha \log n)^k$. We will show that the total number of leaves in all group trees that are built for a k-mismatches index over a compressed trie $T$ with $n$ leaves is at most $S_k(n) \cdot n$. The claim is proved using induction on $k$. The base $k = 0$ is trivial. Suppose we proved the claim for $k-1$, and consider some k-mismatches index over a compressed trie $T$ with $n$ leaves. Let $T_1,\ldots,T_d$ be all the type 1 group trees that are built for $T$ by procedure Build, and denote by $x_i$ the number of leaves in $T_i$. By induction, we have that the $(k-1)$-mismatches indices constructed on the trees $T_1,\ldots,T_d$ have at most $\sum_{i=1}^{d} S_{k-1}(x_i) \cdot x_i$ leaves. For a leaf $v$ of $T$, let $i(v,1),\ldots,i(v,d_v)$ denote the indices of group trees in which $v$ appears. Clearly, $\sum_{i=1}^{d} S_{k-1}(x_i) \cdot x_i = \sum_v \sum_{j=1}^{d_v} S_{k-1}(x_{i(v,j)})$. The function $S_{k-1}(x)$ is an increasing function of $x$. Therefore, $\sum_{i=1}^{d} S_{k-1}(x_i) \cdot x_i \le \sum_v \sum_{j=1}^{d_v} S_{k-1}(n) = S_{k-1}(n) \cdot \sum_v d_v$.

We now give a bound on $d_v$. Fix some leaf $v$ of $T$. We partition the group trees that contain $v$ into sets, where each set consists of all the trees that are generated during one execution of lines 10–16 of procedure Build. In each set the number of trees that contain $v$ is at most $\alpha-1$. Similarly to the proof of Theorem 1, the number of sets is at most $\log n + \log_\alpha n \le 2\log n$. Summing over the $n$ leaves of $T$, it follows that the number of leaves in the $(k-1)$-mismatches indices built on the type 1 group trees is at most $(\alpha-1) \cdot 2\log n \cdot S_{k-1}(n) \cdot n$. Similarly, the number of leaves in the indices built on the type 2 group trees is at most $(\alpha-1) \cdot 2\log n \cdot S_{k-1}(n) \cdot n$.

It remains to bound the number of leaves in the indices built on the type 3 group trees. We begin by bounding the size of $B(C)$ for some path $C$. Consider the subtree $T'$ of $T_{\mathcal{C}}$ that is induced by the vertices of $B(C)$. For every two leaves $C_1$ and $C_2$ in $T'$, the set of vertices of $T$ that are descendants of the topmost vertex in $C_1$ is disjoint with the set of vertices of $T$ that are descendants of the topmost vertex in $C_2$. It follows that the sum of weights of the leaves of $T'$ is less than or equal to $\mathrm{weight}(C)$. Since each leaf in $T'$ has weight greater than $\frac{1}{\alpha}\mathrm{weight}(C)$, we conclude that $T'$ has at most $\alpha$ leaves. By the definition of heavy path decomposition, we have that if $C_1$ is a child of $C_2$ in $T_{\mathcal{C}}$ then the weight of $C_1$ is less than half the weight of $C_2$. Therefore, for every leaf $C'$ in $T'$, the number of ancestors of $C'$ in $T'$ is at most $\log\alpha$. Thus, $|B(C)| \le \alpha\log\alpha$.

Using the same arguments as above, the number of leaves in the $(k-1)$-mismatches indices built on the type 3 group trees is at most $S_{k-1}(n) \cdot \sum_v d_v$, where $d_v$ is now the number of type 3 group trees that contain the leaf $v$. A type 3 group tree that contains $v$ must be of the form $\mathrm{Group}_3(T,C',C'')$ where $C'$ is a path through which the path from the root of $T$ to $v$ passes. The number of such paths is at most $\log n$. Moreover, for a fixed $C'$, there are at most $\alpha\log\alpha$ ways to choose $C''$. Therefore, $d_v \le \alpha\log\alpha\log n$.

We conclude that the total number of leaves in the indices built on all group trees is at most $(2 \cdot 2(\alpha-1)\log n + \alpha\log\alpha\log n) \cdot S_{k-1}(n) \cdot n \le 5\alpha\log\alpha\log n \cdot S_{k-1}(n) \cdot n = S_k(n) \cdot n$.

References

[1] A. Amir, D. Keselman, G. M. Landau, N. Lewenstein, M. Lewenstein, and M. Rodeh. Dictionary matching with one error. J. of Algorithms, 37(2):309–325, 2000.
[2] A. L. Buchsbaum, M. T. Goodrich, and J. R. Westbrook. Range searching over tree cross products. In Proc. 8th European Symposium on Algorithms (ESA), pages 120–131, 2000.
[3] H. Chan, T. W. Lam, W. Sung, S. Tam, and S. Wong. A linear size index for approximate pattern matching. In Proc. 17th Symposium on Combinatorial Pattern Matching (CPM), LNCS 4009, pages 49–59, 2006.
[4] R. Cole, L. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proc. 36th ACM Symposium on Theory Of Computing (STOC), pages 91–100, 2004.
[5] C. Epifanio, A. Gabriele, F. Mignosi, A. Restivo, and M. Sciortino. Languages with mismatches. Theoretical Computer Science, 385(1–3):152–166, 2007.

[6] A. Gabriele, F. Mignosi, A. Restivo, and M. Sciortino. Indexing structures for approximate string matching. In Proc. 5th Italian Conference on Algorithms and Complexity (CIAC), pages 140–151, 2003.
[7] T. N. D. Huynh, W. K. Hon, T. W. Lam, and W. K. Sung. Approximate string matching using compressed suffix arrays. In Proc. 15th Symposium on Combinatorial Pattern Matching (CPM), pages 434–444, 2004.
[8] T. W. Lam, W. K. Sung, and S. S. Wong. Improved approximate string matching using compressed suffix data structures. In Proc. 16th International Symposium on Algorithms and Computation (ISAAC), pages 339–348, 2005.
[9] M. G. Maaß and J. Nowak. Text indexing with errors. In Proc. 16th Symposium on Combinatorial Pattern Matching (CPM), pages 21–32, 2005.
[10] G. Navarro and R. Baeza-Yates. A hybrid indexing method for approximate string matching. J. of Discrete Algorithms, 1(1):205–239, 2000.
[11] G. Navarro and E. Chávez. A metric index for approximate string matching. Theoretical Computer Science, 352(1–3):266–279, 2006.
[12] P. Weiner. Linear pattern matching algorithm. In Proc. 14th IEEE Symposium on Switching and Automata Theory, pages 1–11, 1973.