Text Searching. Thierry Lecroq Laboratoire d Informatique, du Traitement de l Information et des

Size: px
Start display at page:

Download "Text Searching. Thierry Lecroq Laboratoire d Informatique, du Traitement de l Information et des"

Transcription

1 Text Searching Thierry Lecroq Laboratoire d Informatique, du Traitement de l Information et des Systèmes. International PhD School in Formal Languages and Applications Tarragona, November 20th and 21st, 2006

2 Plan 1 Introduction 2 Left-to-right scan shift 3 Right-to-left scan shift Thierry Lecroq Text Searching 2/77

3 Plan 1 Introduction 2 Left-to-right scan shift 3 Right-to-left scan shift Thierry Lecroq Text Searching 3/77

4 Pattern matching text position of occurrence pattern Problem Find all the occurrences of pattern x of length m inside the text y of length n Two types of solutions Fixed pattern Use of combinatorial properties Static text Solutions based on indexes preprocessing time O(m) searching time O(n) preprocessing time O(n) searching time O(m) Thierry Lecroq Text Searching 4/77

5 Searching for a fixed string String matching given a pattern x, find all its locations in any text y Pattern a string x of symbols, of length m t a t a Text a string y of symbols, of length n c a c g t a t a t a t g c g t t a t a a t Thierry Lecroq Text Searching 5/77

6 String matching Occurrences c a c g t a t a t a t g c g t t a t a a t t a t a t a t a t a t a Basic operation symbol comparison (=, ) Thierry Lecroq Text Searching 6/77

7 Interest Practical basic problem for search for various patterns lexical analysis approximate string matching comparisons of strings alignments... Theoretical design of algorithms analysis of algorithms complexity combinatorics on strings... Thierry Lecroq Text Searching 7/77

8 Sliding window strategy window text y shift scan scan scan pattern x shift Scan-and-shift mechanism put window at the beginning of text while window on text do scan: if window = pattern then report it shift: shift window to the right and memorize some information for use during next scans and shifts Thierry Lecroq Text Searching 8/77

9 Naive search Principles No memorization, shift by 1 position Complexity O(m n) time, O(1) extra space Number of symbol comparisons maximum m n expected 2 n on a two-letter alphabet, with equiprobablity and independence conditions Thierry Lecroq Text Searching 9/77

10 Naive string-searching algorithm text y 0 pos u b j n 1 Algorithm pattern x u a 0 i m 1 NaiveSearch(string x, y; integer m, n) pos 0 while pos n m do i 0 while i < m and x[i] = y[pos + i] do i i + 1 if i = m then output( x occurs in y at position, pos) pos pos + 1 Thierry Lecroq Text Searching 10/77

11 Complexity of the problem pattern x of length m text y of length n (n > m) Theorem The search can be done optimally in time O(n) and space O(1). [Galil and Seiferas, 1983] Theorem The search can be done in optimal expected time O( log m m n). [Yao, 1979] Theorem The maximal number of comparisons done during the search is n m (n m), and can be made n + 3(m+1) (n m). [Cole et alii, 1995] Thierry Lecroq Text Searching 11/77

12 Known bounds on symbol comparisons Lower bounds Upper bounds access to the whole text n [Folklore] 2n 1 [Morris and Pratt, 1970] n [Zwick and Paterson, 1991] 4 3 search with a sliding window of size m n n 1 3 n + 9 4m n [Apost. and Croch., 1991] n [Galil and Giancarlo, 1991] m [Galil and Giancarlo, 1992] 4 log m+2 n + (n m) m [Breslauer and Galil, 1993] (n m) 8 n + (n m) 3(m+1) [Cole et alii, 1995] Thierry Lecroq Text Searching 12/77

13 Known bounds on symbol comparisons (followed) Lower bounds Upper bounds search with a sliding window of size 1 2n 1 [Morris and Pratt, 1970] (2 1 )n (2 1 )n [Hancart, 1993] m m [Breslauer et alii, 1993] delay = maximum number of comp s on each text symbol m [Morris and Pratt, 1970] log Φ (m + 1) [Knuth, MP, 1977] min(log Φ (m + 1), c) [Simon, 1989] min(1 + log 2 m, c) min(1 + log 2 m, c) [Hancart, 1993] log min(1 + log 2 m, c) [Hancart, 1996] Thierry Lecroq Text Searching 13/77

14 Methods prefix of text in Σ pattern pattern sequential searches (window size = one symbol) adapted to telecommunications based on efficient implementations of automata [Knuth, Morris, Pratt, 1976], [Simon, 1989], [Hancart, 1993], [Breslauer, Colussi, Toniolo, 1993] Thierry Lecroq Text Searching 14/77

15 Methods (followed) time-space optimal searches mainly of theoretical interest based on combinatorial properties of strings [Galil, Seiferas, 1983], [Crochemore, Perrin, 1991], [Crochemore, 1992], [G asieniec, Plandowski, Rytter, 1995], [Crochemore, G asieniec, Rytter, 1999] practically-fast searches used in text editors, data retrieval software based on combinatorics + automata (+ heuristics) [Boyer, Moore, 1977], [Galil, 1979], [Apostolico, Giancarlo, 1986], [Crochemore et alii, 1994] Thierry Lecroq Text Searching 15/77

16 Plan 1 Introduction 2 Left-to-right scan shift 3 Right-to-left scan shift Thierry Lecroq Text Searching 16/77

17 Left-to-right scan shift text y u b pattern x u a period(u) border(u) Mismatch situation: a b period(u) = u border(u) Optimal shift length = period(ub) Valid if u = x Thierry Lecroq Text Searching 17/77

18 Periods and borders Non-empty string u, integer p, 0 < p u p is a period of u if any of these equivalent conditions is satisfied: u[i] = u[i + p], for 1 i u p u is a prefix of some y k, k > 0, y = p u = yw = wz, for some strings y, z, w with y = z = p String w is called a border of u The period of u, period(u), is its smallest period (can be u ) The border of u, border(u), is its longest border (can be empty) Thierry Lecroq Text Searching 18/77

19 Periods and borders (followed) Example Periods and borders of abacabacaba 4 abacaba 8 aba 10 a 11 empty string Thierry Lecroq Text Searching 19/77

20 Sequential search Simple online search Length of shift = period Memorization of borders Algorithm while window on text do u longest common prefix of window and pattern if u = pattern then report a match shift window period(u) places to right memorize border(u) Thierry Lecroq Text Searching 20/77

21 MP algorithm text y 0 pos u b j n 1 pattern x u a 0 i m 1 0 MPnext[i] m 1 Thierry Lecroq Text Searching 21/77

22 MP algorithm (followed) Algorithm MP(string x, y; integer m, n) i 0; j 0 while j < n do while (i = m) or (i 0 and x[i] y[j]) do i MPnext[i] i i + 1; j j + 1 if i = m then output( x occurs in y at position, j i) Thierry Lecroq Text Searching 22/77

23 Computing borders of prefixes A border of a border of u is a border of u A border of u is either border(u) or a border of it Algorithm Border[i] = border(x[0.. i 1]) j runs through decreasing lengths of borders ComputeBorders(string x; integer m) Border[0] 1 for i 0 to m 1 do j Border[i] while j 0 and x[i] x[j] do j Border[j] Border[i + 1] j + 1 return Border MPnext[i] = Border[i] for i = 0,..., m Thierry Lecroq Text Searching 23/77

24 Improvement x u σ x w τ p w Interrupted periods strict borders Changes only the preprocessing of MP algorithm Thierry Lecroq Text Searching 24/77

25 Improvement (followed) Algorithm while window on text do u longest common prefix of window and pattern if u = pattern then report a match shift window interruptpperiod(u) places to the right memorize strictborder(u) Thierry Lecroq Text Searching 25/77

26 Interrupted periods and strict borders Fixed string x, non-empty prefix u of x w is a strict border of u if both: w is a border of u wτ is a prefix of x, but uτ is not p is an interrupted period of u if p = u w for some strict border w of u Example Prefix abacabacaba of abacabacabacc Interrupted periods and strict borders of abacabacaba 10 a 11 empty string Thierry Lecroq Text Searching 26/77

27 KMP preprocessing k = MPnext[i] { k, if x[i] x[k] or if i = m, KMPnext[i] = KMPnext[k], if x[i] = x[k]. Algorithm ComputeKMPnext(string x; integer m); KMPnext[0] 1; j 0 for i 1 to m 1 do if x[i] = x[j] then KMPnext[i] KMPnext[j] else KMPnext[i] j do j KMPnext[j] while j 0 and x[i] x[j] j j + 1 KMPnext[m] j return KMPnext Thierry Lecroq Text Searching 27/77

28 Running times of MP and KMP Theorem On a text of length n, MP and KMP string-searching algorithm run in time O(n). They make less than 2n symbol comparisons. Proof. Positive comparisons increase the value of j Negative comparisons increase the value of j i (shift) Thierry Lecroq Text Searching 28/77

29 Running times of MP and KMP (followed) Delay = maximum number of comparisons on a text symbol Theorem Pattern of length m. The delay for MP algorithm is no more than m. The delay for KMP algorithm is no more than log Φ (m + 1), where Φ is the golden ratio, (1 + 5)/2. Proof. For KMP, use the periodicity theorem A worst-case pattern of length 19: abaababaabaababaaba Thierry Lecroq Text Searching 29/77

30 Periodicities Theorem (Fine & Wilf 1965) If p and q are periods of a word x and satisfy p + q GCD(p, q) x then GCD(p, q) is a period of x. Used in the analysis of KMP algorithm and in the analysis of many other pattern matching algorithms. Theorem (Weak version) If p and q are periods of a word x and satisfy p + q x then GCD(p, q) is a period of x. Proof. If p and q are periods of x, p > q, then p q is also a period of x. Thierry Lecroq Text Searching 30/77

31 Proof of the weak statement p and q periods of x with p + q x and p > q p q period of x because: p a b c p q q p c a b q p q rest of the proof like Euclid s induction Thierry Lecroq Text Searching 31/77

32 Searching with an automaton Uses the string-matching automaton SMA(x): smallest deterministic automaton accepting Σ x Example x = abaa Search for abaa in: b a b b a a b a a b a a b b a state Thierry Lecroq Text Searching 32/77

33 Searching algorithm Simple parsing of the text with the string-matching automaton SMA(x) Algorithm SEARCH(string x, y; integer m, n) (Q, Σ, initial, {terminal }, δ) is the automaton SMA(x) q initial state if q = terminal then report an occurrence of x in y while not end of y do σ next symbol of y q δ(q, σ) if q = terminal then report an occurrence of x in y Thierry Lecroq Text Searching 33/77

34 Construction of SMA(x) Unwinding arcs From SMA(abaa) to SMA(abaab) Thierry Lecroq Text Searching 34/77

35 Construction algorithm Algorithm automaton SMA(string x) let initial be a new state Q {initial } terminal initial for all σ in Σ do δ(initial, σ) initial while not end of x do τ next symbol of x r δ(terminal, τ) add new state s to Q δ(terminal, τ) s for all σ in Σ do δ(s, σ) δ(r, σ) terminal s return (Q, Σ, initial, {terminal }, δ) Thierry Lecroq Text Searching 35/77

36 Significant arcs Example Complete SMA(ananas) n, s a a a a 0 a 1 n 2 a 3 n 4 a 5 s 6 s n n, s s n, s n, s Thierry Lecroq Text Searching 36/77

37 Significant arcs (followed) Lemma Forward arcs: spell the pattern Backward arcs: arcs going backwards without reaching the initial state a a a a 0 a 1 n 2 a 3 n 4 a 5 s 6 n SMA(x) contains at most x backward arcs. Consequence The implementation of SMA(x) can be done in O( x ) time and space, independently of the alphabet size Thierry Lecroq Text Searching 37/77

38 Backward arcs in SMA States of SMA(x) identified with prefixes of x A backward arc is of the form (u, τ, vτ) (u, v prefixes of x, τ symbol) with vτ is the longest suffix of uτ that is a prefix of x, and u v Note: uτ is not a prefix of x Let p(u, τ) = u v (a period of u) Backward arcs to periods Two backward arcs (u, τ, vτ), (u, τ, v τ ) If p(u, τ) = p(u, τ ) = p, then vτ = v τ. Otherwise, if for instance vτ is a proper prefix of v τ, vτ occurs at position p like v does, uτ is a prefix of x, a contradiction Thus v = v, τ = τ, and then u = u Thierry Lecroq Text Searching 38/77

39 Backward arcs in SMA (followed) Each period p, 1 p x, corresponds to at most one backward arc, thus there are at most x such arcs A worst case SMA(ab m 1 ) has m backward arcs (a b) Thierry Lecroq Text Searching 39/77

40 Complexity of searching with SMA Pattern x of length m, text y of length n With complete SMA implemented by transition matrix Preprocessing on pattern x time O(m card Σ) space O(m card Σ) Search on text y time O(n) space O(m card Σ) Delay time constant With SMA implemented by lists of backward arcs Preprocessing on pattern x time O(m) space O(m) Search on text y time O(n) space O(m) Delay comparisons min(card Σ, log 2 m) Improves on KMP algorithm Thierry Lecroq Text Searching 40/77

41 Plan 1 Introduction 2 Left-to-right scan shift 3 Right-to-left scan shift Thierry Lecroq Text Searching 41/77

42 Right-to-left scan shift Example text c g c t c g c g c t a t c g pattern c g c t a g c c g c t a g c c g c t a g c Match shift: good-suffix rule (function d) Occurrence heuristics: bad-character rule Extra rules if memorization Thierry Lecroq Text Searching 42/77

43 Right-to-left scan shift (followed) Algorithm while window on text do u longest common suffix of window and pattern if u = pattern then report a match shift window d(u) places to the right Thierry Lecroq Text Searching 43/77

44 Match shift text y pos τ u pattern x text y pos σ i? τ u u u shift pattern x σ i u shift Precomputation of rightmost occurrences of u s: O(m) Second shift length = a period of x Table D implements the good-suffix rule: shift = d(u) = D[i] Thierry Lecroq Text Searching 44/77

45 BM algorithm text y τ u 0 pos j n 1 pattern x σ u 0 i m 1 No memorization of previous matches Table D implements the good-suffix rule Thierry Lecroq Text Searching 45/77

46 BM algorithm (followed) Algorithm BM(string x, y; integer m, n) pos 0 while pos n m do i m 1 while i 0 and x[i] = y[pos + i] do i i 1 if i = 1 then output( x occurs in y at position, pos) pos pos + D[i + 1] Thierry Lecroq Text Searching 46/77

47 Suffix displacement Displacement function d: d(u) = min{ z > 0 (x suffix of uz) or (τuz suffix of x and τu not suffix of x, for τ Σ)} Displacement table D: D[i] = d(x[i.. m 1]), for i = 0,.., m 1 D[m] = 1 τ u z x σ u u z Thierry Lecroq Text Searching 47/77

48 Suffix displacement (followed) Note 1: u is a (strict) border of uz Note 2: z is a period of uz (thus, z is a period of x) Lemma Table D can be computed in linear time. Thierry Lecroq Text Searching 48/77

49 Suffixes Elements for computing the table Suf i given: Suf [j] known for i < j m g = min{f Suf [f] i < f. m} pattern x τ u τ u σ u τ u σ u 0 g i f i + m f 1 m 1 If i Suf [i + m f 1] k: Suf [i] = min{suf [i + m f 1], i g} Thierry Lecroq Text Searching 49/77

50 Suffixes pattern x τ u τ u σ u τ u σ u 0 g i f i + m f 1 m 1 scan scan If i Suf [i + m f 1] = g: scan to find Suf [i] Yields new g and new f Thierry Lecroq Text Searching 50/77

51 Computing Suf Algorithm ComputeSUF(string x; integer m) Suf [m 1] m; g m 1 for i m 2 downto 0 do if i > g and Suf [i + m 1 f] i g then Suf [i] min{suf [i + m f 1], i g} else f i; g min(g, i) while g 0 and x[g] = x[g + m 1 f] do g g 1 Suf [i] f g return Suf Thierry Lecroq Text Searching 51/77

52 Computing Suf i x a z m j 1 x c z j Suf[j] D[i] = D[m 1 Suf [j]] = m j 1 Thierry Lecroq Text Searching 52/77

53 Computing Suf i x z m j 1 x j D[i] = m j 1 Thierry Lecroq Text Searching 53/77

54 BM preprocessing Algorithm ComputeD(string x; integer m; table Suf ) i 0 for j m 2 downto 1 do if j = 1 or Suf [j] = j + 1 then while i m 1 j do D[i] m 1 j i i + 1 for j 0 to m 2 do D[m 1 Suf [j]] m 1 j return D Thierry Lecroq Text Searching 54/77

55 Occurrence shift text y pos τ u pattern x σ τ u DA[τ] shift Table DA implements the bad-character rule: DA[σ] = min({ z > 0 σz suffix of x} { x }) shift = DA[τ] u = DA[τ] m + i Thierry Lecroq Text Searching 55/77

56 Occurrence shift (followed) Algorithm ComputeDA(string x; integer m) for all σ in Σ do DA[σ] = m for i 0 to m 2 do DA[x[i]] = m i 1 return DA Thierry Lecroq Text Searching 56/77

57 BM with occurrence shift text y τ u 0 pos j n 1 pattern x σ u 0 i m 1 Use of DA in addition to D Thierry Lecroq Text Searching 57/77

58 BM with occurrence shift (followed) Algorithm BM(string x, y; integer m, n); pos 0 while pos n m do i m 1 while i 0 and x[i] = y[pos + i] do i i 1 if i = 1 then output( x occurs in y at position, pos) pos pos + max(d[i + 1], DA[y[pos + i]] m + i + 1) Thierry Lecroq Text Searching 58/77

59 Complexity of BM Preprocessing phase match shift O(m) occurrence shift O(m + card Σ) Search phase (finding all occurrences) running time O(n m) minimum number of comparisons n/m maximum number of comparisons n m Extra space for shift functions O(m + card Σ) can be reduced to O(m) Thierry Lecroq Text Searching 59/77

60 Symbol comparisons of BM For finding the first occurrence [Knuth, Morris, Pratt, 1977] [Guibas, Odlysko, 1980] [Cole, 1994] 7 n 4 n 3 n Theorem If period(x) > m/2, BM searching algorithm performs at most 3n n/m symbol comparisons. The bound is tight. Proof. difficult Thierry Lecroq Text Searching 60/77

61 Symbol comparisons in variants of BM For finding all occurrences [Galil, 1979] prefix memorization [Apostolico, Giancarlo, 1986] all-suffix memorization [Crochemore et alii, 1994] last-suffix memorization Turbo-BM O(n) O(1) extra space 1.5 n O(m) extra space 2 n O(1) extra space Thierry Lecroq Text Searching 61/77

62 Turbo-BM method text y pattern x pos memory u u u match-shift z Features Stores the last match in case of match shift (memory) Jumps on memory Uses turbo-shifts Thierry Lecroq Text Searching 62/77

63 Turbo-BM method (followed) Preprocessing same as BM algorithm Search O(1) extra space to store memory: (length, right position) Note match-shift is a period of uz Thierry Lecroq Text Searching 63/77

64 Turbo-shift text y pos memory σ match τ pattern x σ turbo-shift σ turbo-shift = memory match Use in Turbo-BM: shift max(match-shift, occ-shift, turbo-shift); if (shift = match-shift) then set memory; else { shift max(shift, match+1); no memory; } Thierry Lecroq Text Searching 64/77

65 Tight number of comparisons for Turbo-BM Theorem The Turbo-BM searching algorithm runs in time O(n). It makes no more than 2n symbol comparisons. Proof. difficult Thierry Lecroq Text Searching 65/77

66 AG method text y previous matches pattern x current match Thierry Lecroq Text Searching 66/77

67 AG method Features Stores all previous matches (suffixes of pattern) Uses the table Suf Suf [i] = longest suffix of x ending at i in x pattern x Suf [i] i Search O(m) extra space to store the matches Thierry Lecroq Text Searching 67/77

68 Rules for shifts in AG match text y pattern x Suf [i] i current match mismatch text y pattern x Suf [i] i current match Thierry Lecroq Text Searching 68/77

69 Rules for shifts in AG mismatch text y pattern x Suf [i] i current match jump text y pattern x? Suf [i] i current match Thierry Lecroq Text Searching 69/77

70 Tight number of comparisons for AG Lemma The AG algorithm makes at most n 2 characters previously compared. comparisons on text Theorem The AG searching algorithm runs in time O(n). It makes no more than 1.5n symbol comparisons. Proof. rather difficult Thierry Lecroq Text Searching 70/77

71 Worst-case example Example a a b a a a b a a b a a a b a a b a a a b... a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b a a b a a a b pattern = a m 1 ba m b, text = (a m 1 ba m b) e number of comparisons = 2m (3m + 1)e = 3m+1 2m+1 n m 1.5n Thierry Lecroq Text Searching 71/77

72 References A. Apostolico and M. Crochemore. Optimal Canonization of All Substrings of a String. Inf. Comput., 95(1):76-95, A. Apostolico et Z. Galil, editors. Pattern matching algorithms. Oxford University Press, A. Apostolico and R. Giancarlo. The Boyer-Moore-Galil string searching strategies revisited. SIAM J. Comput., 15(1):98 105, R. S. Boyer and J. S. Moore. A fast string searching algorithm. Commun. ACM, 20(10): , D. Breslauer, L. Colussi, and L. Toniolo. Tight comparison bounds for the string prefix-matching problem. Inf. Process. Lett., 47(1):51 57, Thierry Lecroq Text Searching 72/77

73 References D. Breslauer and Z. Galil. Efficient Comparison Based String Matching. J. Complexity, 9(3): , M. Crochemore, A. Czumaj, L. G asieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5): , M. Crochemore, L. G asieniec, and W. Rytter. Constant-space string-matching in sublinear average time. Theor. Comput. Sci., 218(1): , M. Crochemore, C. Hancart and T. Lecroq. Algorithms on Strings. Cambridge University Press, 20017, to appear. M. Crochemore and T. Lecroq. Tight bounds on the complexity of the Apostolico-Giancarlo algorithm. Inf. Process. Lett., 63(4): , Thierry Lecroq Text Searching 73/77

74 References M. Crochemore and T. Lecroq. Text Searching and Indexing. Recent Advances in Formal Languages and Applications, Series: Studies in Computational Intelligence, Volume 25, (2006), Z. Esik, C. Martin-Vide and V. Mitrana editors, pages 43-80, Springer Verlag. M. Crochemore and D. Perrin. Two-way string-matching. J. Assoc. Comput. Mach., 38(3): , M. Crochemore et W. Rytter. Jewels of Stringology. World Scientific, R. Cole. Tight bounds on the complexity of the Boyer-Moore string matching algorithm. SIAM J. Comput., 23(5): , R. Cole, R. Hariharan, M. Paterson, and U. Zwick. Tighter lower bounds on the exact complexity of string matching. SIAM J. Comput., 24(1):30 45, Thierry Lecroq Text Searching 74/77

75 References Z. Galil. On improving the worst case running time of the Boyer-Moore string searching algorithm. Commun. ACM, 22(9): , Z. Galil and R. Giancarlo. On the exact complexity of string matching: lower bounds. SIAM J. Comput., 20(6): , Z. Galil and R. Giancarlo. On the Exact Complexity of String Matching: Upper Bounds. SIAM J. Comput., 21(3): , Z. Galil and J. Seiferas. Time-space optimal string matching. J. Comput. Syst. Sci., 26(3): , L. G asieniec, W. Plandowski, and W. Rytter. Constant-space string matching with smaller number of comparisons: sequential sampling. In Z. Galil and E. Ukkonen, editors, Proceedings of the 6th Annual Thierry Lecroq Text Searching 75/77

76 References L. J. Guibas and A. M. Odlyzko. A new proof of the linearity of the Boyer-Moore string searching algorithm. SIAM J. Comput., 9(4): , D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, C. Hancart. On Simon s string searching algorithm. Inf. Process. Lett., 47(2):95 99, C. Hancart, Personal communication. D. E. Knuth, J. H. Morris, Jr, and V. R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(1): , Thierry Lecroq Text Searching 76/77

77 References J. H. Morris, Jr and V. R. Pratt. A linear pattern-matching algorithm. Report 40, University of California, Berkeley, I. Simon. Sequence comparison: some theory and some practice. In M. Gross and D. Perrin, editors, Electronic, Dictionaries and Automata in Computational Linguistics, number 377 in Lecture Notes in Computer Science, pages Springer-Verlag, Berlin, G. A. Stephen. String searching algorithms. World Scientific Press, A. C. Yao. The complexity of pattern matching for a random string. SIAM J. Comput., 8(3): , U. Zwick and M. S. Paterson. Lower bounds for string matching in the sequential comparison model. Manuscript, Thierry Lecroq Text Searching 77/77

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries and String Matching CS 240 - Data Structures and Data Management Sajed Haque Veronika Irvine Taylor Smith Based on lecture notes by many previous cs240 instructors David R. Cheriton School

More information

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12 Algorithms Theory 15 Text search P.D. Dr. Alexander Souza Text search Various scenarios: Dynamic texts Text editors Symbol manipulators Static texts Literature databases Library systems Gene databases

More information

Average Complexity of Exact and Approximate Multiple String Matching

Average Complexity of Exact and Approximate Multiple String Matching Average Complexity of Exact and Approximate Multiple String Matching Gonzalo Navarro Department of Computer Science University of Chile gnavarro@dcc.uchile.cl Kimmo Fredriksson Department of Computer Science

More information

Finding all covers of an indeterminate string in O(n) time on average

Finding all covers of an indeterminate string in O(n) time on average Finding all covers of an indeterminate string in O(n) time on average Md. Faizul Bari, M. Sohel Rahman, and Rifat Shahriyar Department of Computer Science and Engineering Bangladesh University of Engineering

More information

2. Exact String Matching

2. Exact String Matching 2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.

More information

On Boyer-Moore Preprocessing

On Boyer-Moore Preprocessing On Boyer-Moore reprocessing Heikki Hyyrö Department of Computer Sciences University of Tampere, Finland Heikki.Hyyro@cs.uta.fi Abstract robably the two best-known exact string matching algorithms are the

More information

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer Algorithm Theory 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore Institut für Informatik Wintersemester 2007/08 Text Search Scenarios Static texts Literature databases Library systems Gene databases

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1 Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt

More information

String Regularities and Degenerate Strings

String Regularities and Degenerate Strings M. Sc. Thesis Defense Md. Faizul Bari (100705050P) Supervisor: Dr. M. Sohel Rahman String Regularities and Degenerate Strings Department of Computer Science and Engineering Bangladesh University of Engineering

More information

On-line String Matching in Highly Similar DNA Sequences

On-line String Matching in Highly Similar DNA Sequences On-line String Matching in Highly Similar DNA Sequences Nadia Ben Nsira 1,2,ThierryLecroq 1,,MouradElloumi 2 1 LITIS EA 4108, Normastic FR3638, University of Rouen, France 2 LaTICE, University of Tunis

More information

Three new strategies for exact string matching

Three new strategies for exact string matching Three new strategies for exact string matching Simone Faro 1 Thierry Lecroq 2 1 University of Catania, Italy 2 University of Rouen, LITIS EA 4108, France SeqBio 2012 November 26th-27th 2012 Marne-la-Vallée,

More information

Overview. Knuth-Morris-Pratt & Boyer-Moore Algorithms. Notation Review (2) Notation Review (1) The Kunth-Morris-Pratt (KMP) Algorithm

Overview. Knuth-Morris-Pratt & Boyer-Moore Algorithms. Notation Review (2) Notation Review (1) The Kunth-Morris-Pratt (KMP) Algorithm Knuth-Morris-Pratt & s by Robert C. St.Pierre Overview Notation review Knuth-Morris-Pratt algorithm Discussion of the Algorithm Example Boyer-Moore algorithm Discussion of the Algorithm Example Applications

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift

More information

SUFFIX TREE. SYNONYMS Compact suffix trie

SUFFIX TREE. SYNONYMS Compact suffix trie SUFFIX TREE Maxime Crochemore King s College London and Université Paris-Est, http://www.dcs.kcl.ac.uk/staff/mac/ Thierry Lecroq Université de Rouen, http://monge.univ-mlv.fr/~lecroq SYNONYMS Compact suffix

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

Simple Real-Time Constant-Space String Matching

Simple Real-Time Constant-Space String Matching Simple Real-Time Constant-Space String Matching Dany Breslauer 1, Roberto Grossi 2, and Filippo Mignosi 3 1 Caesarea Rothchild Institute, University of Haifa, Haifa, Israel 2 Dipartimento di Informatica,

More information

Linear-Time Computation of Local Periods

Linear-Time Computation of Local Periods Linear-Time Computation of Local Periods Jean-Pierre Duval 1, Roman Kolpakov 2,, Gregory Kucherov 3, Thierry Lecroq 4, and Arnaud Lefebvre 4 1 LIFAR, Université de Rouen, France Jean-Pierre.Duval@univ-rouen.fr

More information

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms Simone Faro Thierry Lecroq University of Catania, Italy University of Rouen, LITIS EA 4108, France Symposium on Eperimental Algorithms

More information

Number of occurrences of powers in strings

Number of occurrences of powers in strings Author manuscript, published in "International Journal of Foundations of Computer Science 21, 4 (2010) 535--547" DOI : 10.1142/S0129054110007416 Number of occurrences of powers in strings Maxime Crochemore

More information

Counting and Verifying Maximal Palindromes

Counting and Verifying Maximal Palindromes Counting and Verifying Maximal Palindromes Tomohiro I 1, Shunsuke Inenaga 2, Hideo Bannai 1, and Masayuki Takeda 1 1 Department of Informatics, Kyushu University 2 Graduate School of Information Science

More information

Computing repetitive structures in indeterminate strings

Computing repetitive structures in indeterminate strings Computing repetitive structures in indeterminate strings Pavlos Antoniou 1, Costas S. Iliopoulos 1, Inuka Jayasekera 1, Wojciech Rytter 2,3 1 Dept. of Computer Science, King s College London, London WC2R

More information

(Preliminary Version)

(Preliminary Version) Relations Between δ-matching and Matching with Don t Care Symbols: δ-distinguishing Morphisms (Preliminary Version) Richard Cole, 1 Costas S. Iliopoulos, 2 Thierry Lecroq, 3 Wojciech Plandowski, 4 and

More information

Reverse Engineering Prefix Tables

Reverse Engineering Prefix Tables Reverse Engineering Prefix Tables Julien Clément, Maxime Crochemore, Giuseppina Rindone To cite this version: Julien Clément, Maxime Crochemore, Giuseppina Rindone. Reverse Engineering Prefix Tables. Susanne

More information

Analysis of Algorithms Prof. Karen Daniels

Analysis of Algorithms Prof. Karen Daniels UMass Lowell Computer Science 91.503 Analysis of Algorithms Prof. Karen Daniels Spring, 2012 Tuesday, 4/24/2012 String Matching Algorithms Chapter 32* * Pseudocode uses 2 nd edition conventions 1 Chapter

More information

Partial words and the critical factorization theorem revisited. By: Francine Blanchet-Sadri, Nathan D. Wetzler

Partial words and the critical factorization theorem revisited. By: Francine Blanchet-Sadri, Nathan D. Wetzler Partial words and the critical factorization theorem revisited By: Francine Blanchet-Sadri, Nathan D. Wetzler F. Blanchet-Sadri and N.D. Wetzler, "Partial Words and the Critical Factorization Theorem Revisited."

More information

Shift-And Approach to Pattern Matching in LZW Compressed Text

Shift-And Approach to Pattern Matching in LZW Compressed Text Shift-And Approach to Pattern Matching in LZW Compressed Text Takuya Kida, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa Department of Informatics, Kyushu University 33 Fukuoka 812-8581, Japan {kida,

More information

String Search. 6th September 2018

String Search. 6th September 2018 String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)

More information

Improving the KMP Algorithm by Using Properties of Fibonacci String

Improving the KMP Algorithm by Using Properties of Fibonacci String Improving the KMP Algorithm by Using Properties of Fibonacci String Yi-Kung Shieh and R. C. T. Lee Department of Computer Science National Tsing Hua University d9762814@oz.nthu.edu.tw and rctlee@ncnu.edu.tw

More information

A Unifying Framework for Compressed Pattern Matching

A Unifying Framework for Compressed Pattern Matching A Unifying Framework for Compressed Pattern Matching Takuya Kida Yusuke Shibata Masayuki Takeda Ayumi Shinohara Setsuo Arikawa Department of Informatics, Kyushu University 33 Fukuoka 812-8581, Japan {

More information

INF 4130 / /8-2017

INF 4130 / /8-2017 INF 4130 / 9135 28/8-2017 Algorithms, efficiency, and complexity Problem classes Problems can be divided into sets (classes). Problem classes are defined by the type of algorithm that can (or cannot) solve

More information

State Complexity of Neighbourhoods and Approximate Pattern Matching

State Complexity of Neighbourhoods and Approximate Pattern Matching State Complexity of Neighbourhoods and Approximate Pattern Matching Timothy Ng, David Rappaport, and Kai Salomaa School of Computing, Queen s University, Kingston, Ontario K7L 3N6, Canada {ng, daver, ksalomaa}@cs.queensu.ca

More information

Partial Words and the Critical Factorization Theorem Revisited

Partial Words and the Critical Factorization Theorem Revisited Partial Words and the Critical Factorization Theorem Revisited F. Blanchet-Sadri and Nathan D. Wetzler Department of Mathematical Sciences University of North Carolina P.O. Box 26170 Greensboro, NC 27402

More information

Average Case Analysis of the Boyer-Moore Algorithm

Average Case Analysis of the Boyer-Moore Algorithm Average Case Analysis of the Boyer-Moore Algorithm TSUNG-HSI TSAI Institute of Statistical Science Academia Sinica Taipei 115 Taiwan e-mail: chonghi@stat.sinica.edu.tw URL: http://www.stat.sinica.edu.tw/chonghi/stat.htm

More information

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Searching Sear ( Sub- (Sub )Strings Ulf Leser Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf

More information

Online Computation of Abelian Runs

Online Computation of Abelian Runs Online Computation of Abelian Runs Gabriele Fici 1, Thierry Lecroq 2, Arnaud Lefebvre 2, and Élise Prieur-Gaston2 1 Dipartimento di Matematica e Informatica, Università di Palermo, Italy Gabriele.Fici@unipa.it

More information

Pattern-Matching for Strings with Short Descriptions

Pattern-Matching for Strings with Short Descriptions Pattern-Matching for Strings with Short Descriptions Marek Karpinski marek@cs.uni-bonn.de Department of Computer Science, University of Bonn, 164 Römerstraße, 53117 Bonn, Germany Wojciech Rytter rytter@mimuw.edu.pl

More information

A Note on the Number of Squares in a Partial Word with One Hole

A Note on the Number of Squares in a Partial Word with One Hole A Note on the Number of Squares in a Partial Word with One Hole F. Blanchet-Sadri 1 Robert Mercaş 2 July 23, 2008 Abstract A well known result of Fraenkel and Simpson states that the number of distinct

More information

Algorithms: COMP3121/3821/9101/9801

Algorithms: COMP3121/3821/9101/9801 NEW SOUTH WALES Algorithms: COMP3121/3821/9101/9801 Aleks Ignjatović School of Computer Science and Engineering University of New South Wales LECTURE 8: STRING MATCHING ALGORITHMS COMP3121/3821/9101/9801

More information

Succinct 2D Dictionary Matching with No Slowdown

Succinct 2D Dictionary Matching with No Slowdown Succinct 2D Dictionary Matching with No Slowdown Shoshana Neuburger and Dina Sokol City University of New York Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d

More information

Efficient Sequential Algorithms, Comp309

Efficient Sequential Algorithms, Comp309 Efficient Sequential Algorithms, Comp309 University of Liverpool 2010 2011 Module Organiser, Igor Potapov Part 2: Pattern Matching References: T. H. Cormen, C. E. Leiserson, R. L. Rivest Introduction to

More information

PERIODS OF FACTORS OF THE FIBONACCI WORD

PERIODS OF FACTORS OF THE FIBONACCI WORD PERIODS OF FACTORS OF THE FIBONACCI WORD KALLE SAARI Abstract. We show that if w is a factor of the infinite Fibonacci word, then the least period of w is a Fibonacci number. 1. Introduction The Fibonacci

More information

INF 4130 / /8-2014

INF 4130 / /8-2014 INF 4130 / 9135 26/8-2014 Mandatory assignments («Oblig-1», «-2», and «-3»): All three must be approved Deadlines around: 25. sept, 25. oct, and 15. nov Other courses on similar themes: INF-MAT 3370 INF-MAT

More information

Graduate Algorithms CS F-20 String Matching

Graduate Algorithms CS F-20 String Matching Graduate Algorithms CS673-2016F-20 String Matching David Galles Department of Computer Science University of San Francisco 20-0: String Matching Given a source text, and a string to match, where does the

More information

Maximal Unbordered Factors of Random Strings arxiv: v1 [cs.ds] 14 Apr 2017

Maximal Unbordered Factors of Random Strings arxiv: v1 [cs.ds] 14 Apr 2017 Maximal Unbordered Factors of Random Strings arxiv:1704.04472v1 [cs.ds] 14 Apr 2017 Patrick Hagge Cording 1 and Mathias Bæk Tejs Knudsen 2 1 DTU Compute, Technical University of Denmark, phaco@dtu.dk 2

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

Lecture 3: String Matching

Lecture 3: String Matching COMP36111: Advanced Algorithms I Lecture 3: String Matching Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2017 18 Outline The string matching problem The Rabin-Karp algorithm The Knuth-Morris-Pratt

More information

On the number of prefix and border tables

On the number of prefix and border tables On the number of prefix and border tables Julien Clément 1 and Laura Giambruno 1 GREYC, CNRS-UMR 607, Université de Caen, 1403 Caen, France julien.clement@unicaen.fr, laura.giambruno@unicaen.fr Abstract.

More information

Compressed Index for Dynamic Text

Compressed Index for Dynamic Text Compressed Index for Dynamic Text Wing-Kai Hon Tak-Wah Lam Kunihiko Sadakane Wing-Kin Sung Siu-Ming Yiu Abstract This paper investigates how to index a text which is subject to updates. The best solution

More information

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003 String Matching Jayadev Misra The University of Texas at Austin December 5, 2003 Contents 1 Introduction 1 2 Rabin-Karp Algorithm 3 3 Knuth-Morris-Pratt Algorithm 5 3.1 Informal Description.........................

More information

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool Constant-Space String-Matching in Sublinear Average Tie (Extended Abstract) Maxie Crocheore Universite de Marne-la-Vallee Leszek Gasieniec y Max-Planck Institut fur Inforatik Wojciech Rytter z Warsaw University

More information

Data Structure for Dynamic Patterns

Data Structure for Dynamic Patterns Data Structure for Dynamic Patterns Chouvalit Khancome and Veera Booning Member IAENG Abstract String matching and dynamic dictionary matching are significant principles in computer science. These principles

More information

On Pattern Matching With Swaps

On Pattern Matching With Swaps On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720

More information

String Regularities and Degenerate Strings

String Regularities and Degenerate Strings M.Sc. Engg. Thesis String Regularities and Degenerate Strings by Md. Faizul Bari Submitted to Department of Computer Science and Engineering in partial fulfilment of the requirments for the degree of Master

More information

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem Hsing-Yen Ann National Center for High-Performance Computing Tainan 74147, Taiwan Chang-Biau Yang and Chiou-Ting

More information

Text matching of strings in terms of straight line program by compressed aleshin type automata

Text matching of strings in terms of straight line program by compressed aleshin type automata Text matching of strings in terms of straight line program by compressed aleshin type automata 1 A.Jeyanthi, 2 B.Stalin 1 Faculty, 2 Assistant Professor 1 Department of Mathematics, 2 Department of Mechanical

More information

On the Number of Distinct Squares

On the Number of Distinct Squares Frantisek (Franya) Franek Advanced Optimization Laboratory Department of Computing and Software McMaster University, Hamilton, Ontario, Canada Invited talk - Prague Stringology Conference 2014 Outline

More information

String Matching with Variable Length Gaps

String Matching with Variable Length Gaps String Matching with Variable Length Gaps Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind Technical University of Denmark Abstract. We consider string matching with variable length

More information

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * 1 Jorma Tarhio and Esko Ukkonen Department of Computer Science, University of Helsinki Tukholmankatu 2, SF-00250 Helsinki,

More information

Implementing Approximate Regularities

Implementing Approximate Regularities Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National

More information

Alternative Algorithms for Lyndon Factorization

Alternative Algorithms for Lyndon Factorization Alternative Algorithms for Lyndon Factorization Suhpal Singh Ghuman 1, Emanuele Giaquinta 2, and Jorma Tarhio 1 1 Department of Computer Science and Engineering, Aalto University P.O.B. 15400, FI-00076

More information

Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT)

Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT) Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT) Disclaimer Students attending my lectures are often astonished that I present the material in a much livelier form than in this script.

More information

String Matching II. Algorithm : Design & Analysis [19]

String Matching II. Algorithm : Design & Analysis [19] String Matching II Algorithm : Design & Analysis [19] In the last class Simple String Matching KMP Flowchart Construction Jump at Fail KMP Scan String Matching II Boyer-Moore s heuristics Skipping unnecessary

More information

String Suffix Automata and Subtree Pushdown Automata

String Suffix Automata and Subtree Pushdown Automata String Suffix Automata and Subtree Pushdown Automata Jan Janoušek Department of Computer Science Faculty of Information Technologies Czech Technical University in Prague Zikova 1905/4, 166 36 Prague 6,

More information

On-Line Construction of Suffix Trees I. E. Ukkonen z

On-Line Construction of Suffix Trees I. E. Ukkonen z .. t Algorithmica (1995) 14:249-260 Algorithmica 9 1995 Springer-Verlag New York Inc. On-Line Construction of Suffix Trees I E. Ukkonen z Abstract. An on-line algorithm is presented for constructing the

More information

Theoretical Computer Science. Efficient algorithms to compute compressed longest common substrings and compressed palindromes

Theoretical Computer Science. Efficient algorithms to compute compressed longest common substrings and compressed palindromes Theoretical Computer Science 410 (2009) 900 913 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Efficient algorithms to compute compressed

More information

Exact Circular Pattern Matching Using the BNDM Algorithm

Exact Circular Pattern Matching Using the BNDM Algorithm Exact Circular Pattern Matching Using the BNDM Algorithm K. H. Chen, G. S. Huang and R. C. T. Lee Department of Computer Science and Information Engineering, National Chi Nan University, Puli, Nantou,

More information

The Number of Runs in a String: Improved Analysis of the Linear Upper Bound

The Number of Runs in a String: Improved Analysis of the Linear Upper Bound The Number of Runs in a String: Improved Analysis of the Linear Upper Bound Wojciech Rytter Instytut Informatyki, Uniwersytet Warszawski, Banacha 2, 02 097, Warszawa, Poland Department of Computer Science,

More information

Where to Use and How not to Use Polynomial String Hashing

Where to Use and How not to Use Polynomial String Hashing Olympiads in Informatics, 2013, Vol. 7, 90 100 90 2013 Vilnius University Where to Use and How not to Use Polynomial String Hashing Jakub PACHOCKI, Jakub RADOSZEWSKI Faculty of Mathematics, Informatics

More information

A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems

A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems Shunsuke Inenaga 1, Ayumi Shinohara 2,3 and Masayuki Takeda 2,3 1 Department of Computer Science, P.O. Box 26 (Teollisuuskatu 23)

More information

How many double squares can a string contain?

How many double squares can a string contain? How many double squares can a string contain? F. Franek, joint work with A. Deza and A. Thierry Algorithms Research Group Department of Computing and Software McMaster University, Hamilton, Ontario, Canada

More information

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching Ricardo Luis de Azevedo da Rocha 1, João José Neto 1 1 Laboratório de Linguagens

More information

Pattern Matching (Exact Matching) Overview

Pattern Matching (Exact Matching) Overview CSI/BINF 5330 Pattern Matching (Exact Matching) Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Pattern Matching Exhaustive Search DFA Algorithm KMP Algorithm

More information

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Domenico Cantone Simone Faro Emanuele Giaquinta Department of Mathematics and Computer Science, University of Catania, Italy 1 /

More information

arxiv: v2 [cs.ds] 5 Mar 2014

arxiv: v2 [cs.ds] 5 Mar 2014 Order-preserving pattern matching with k mismatches Pawe l Gawrychowski 1 and Przemys law Uznański 2 1 Max-Planck-Institut für Informatik, Saarbrücken, Germany 2 LIF, CNRS and Aix-Marseille Université,

More information

Overlapping factors in words

Overlapping factors in words AUSTRALASIAN JOURNAL OF COMBINATORICS Volume 57 (2013), Pages 49 64 Overlapping factors in words Manolis Christodoulakis University of Cyprus P.O. Box 20537, 1678 Nicosia Cyprus christodoulakis.manolis@ucy.ac.cy

More information

Transducers for bidirectional decoding of prefix codes

Transducers for bidirectional decoding of prefix codes Transducers for bidirectional decoding of prefix codes Laura Giambruno a,1, Sabrina Mantaci a,1 a Dipartimento di Matematica ed Applicazioni - Università di Palermo - Italy Abstract We construct a transducer

More information

OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS

OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS Parallel Processing Letters, c World Scientific Publishing Company OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS COSTAS S. ILIOPOULOS Department of Computer Science, King s College London,

More information

Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt Algorithm Knuth-Morris-Pratt Algorithm Jayadev Misra June 5, 2017 The Knuth-Morris-Pratt string matching algorithm (KMP) locates all occurrences of a pattern string in a text string in linear time (in the combined

More information

A Lower-Variance Randomized Algorithm for Approximate String Matching

A Lower-Variance Randomized Algorithm for Approximate String Matching A Lower-Variance Randomized Algorithm for Approximate String Matching Mikhail J. Atallah Elena Grigorescu Yi Wu Department of Computer Science Purdue University West Lafayette, IN 47907 U.S.A. {mja,egrigore,wu510}@cs.purdue.edu

More information

SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS

SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS Manolis Christodoulakis 1, Costas S. Iliopoulos 1, Yoan José Pinzón Ardila 2 1 King s College London, Department of Computer Science London WC2R

More information

Deleting and Testing Forbidden Patterns in Multi-Dimensional Arrays

Deleting and Testing Forbidden Patterns in Multi-Dimensional Arrays Deleting and Testing Forbidden Patterns in Multi-Dimensional Arrays Omri Ben-Eliezer Simon Korman Daniel Reichman March 27, 2017 Abstract Understanding the local behaviour of structured multi-dimensional

More information

An on{line algorithm is presented for constructing the sux tree for a. the desirable property of processing the string symbol by symbol from left to

An on{line algorithm is presented for constructing the sux tree for a. the desirable property of processing the string symbol by symbol from left to (To appear in ALGORITHMICA) On{line construction of sux trees 1 Esko Ukkonen Department of Computer Science, University of Helsinki, P. O. Box 26 (Teollisuuskatu 23), FIN{00014 University of Helsinki,

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

Including Interval Encoding into Edit Distance Based Music Comparison and Retrieval

Including Interval Encoding into Edit Distance Based Music Comparison and Retrieval Including Interval Encoding into Edit Distance Based Music Comparison and Retrieval Kjell Lemström; Esko Ukkonen Department of Computer Science; P.O.Box 6, FIN-00014 University of Helsinki, Finland {klemstro,ukkonen}@cs.helsinki.fi

More information

Approximate periods of strings

Approximate periods of strings Theoretical Computer Science 262 (2001) 557 568 www.elsevier.com/locate/tcs Approximate periods of strings Jeong Seop Sim a;1, Costas S. Iliopoulos b; c;2, Kunsoo Park a;1; c; d;3, W.F. Smyth a School

More information

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction. Microsoft Research Asia September 5, 2005 1 2 3 4 Section I What is? Definition is a technique for efficiently recurrence computing by storing partial results. In this slides, I will NOT use too many formal

More information

Optimal Superprimitivity Testing for Strings

Optimal Superprimitivity Testing for Strings Optimal Superprimitivity Testing for Strings Alberto Apostolico Martin Farach Costas S. Iliopoulos Fibonacci Report 90.7 August 1990 - Revised: March 1991 Abstract A string w covers another string z if

More information

How Many Holes Can an Unbordered Partial Word Contain?

How Many Holes Can an Unbordered Partial Word Contain? How Many Holes Can an Unbordered Partial Word Contain? F. Blanchet-Sadri 1, Emily Allen, Cameron Byrum 3, and Robert Mercaş 4 1 University of North Carolina, Department of Computer Science, P.O. Box 6170,

More information

By: F. Blanchet-Sadri, C.D. Davis, Joel Dodge, Robert Mercaş, Robert, Margaret Moorefield

By: F. Blanchet-Sadri, C.D. Davis, Joel Dodge, Robert Mercaş, Robert, Margaret Moorefield Unbordered partial words By: F. Blanchet-Sadri, C.D. Davis, Joel Dodge, Robert Mercaş, Robert, Margaret Moorefield F. Blanchet-Sadri, C.D. Davis, J. Dodge, R. Mercas and M. Moorefield, "Unbordered Partial

More information

arxiv: v1 [cs.cc] 4 Feb 2008

arxiv: v1 [cs.cc] 4 Feb 2008 On the complexity of finding gapped motifs Morris Michael a,1, François Nicolas a, and Esko Ukkonen a arxiv:0802.0314v1 [cs.cc] 4 Feb 2008 Abstract a Department of Computer Science, P. O. Box 68 (Gustaf

More information

Varieties of Regularities in Weighted Sequences

Varieties of Regularities in Weighted Sequences Varieties of Regularities in Weighted Sequences Hui Zhang 1, Qing Guo 2 and Costas S. Iliopoulos 3 1 College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, Zhejiang 310023,

More information

Hierarchical Overlap Graph

Hierarchical Overlap Graph Hierarchical Overlap Graph B. Cazaux and E. Rivals LIRMM & IBC, Montpellier 8. Feb. 2018 arxiv:1802.04632 2018 B. Cazaux & E. Rivals 1 / 29 Overlap Graph for a set of words Consider the set P := {abaa,

More information

Intrusion Detection and Malware Analysis

Intrusion Detection and Malware Analysis Intrusion Detection and Malware Analysis IDS feature extraction Pavel Laskov Wilhelm Schickard Institute for Computer Science Metric embedding of byte sequences Sequences 1. blabla blubla blablabu aa 2.

More information

Primitive partial words

Primitive partial words Discrete Applied Mathematics 148 (2005) 195 213 www.elsevier.com/locate/dam Primitive partial words F. Blanchet-Sadri 1 Department of Mathematical Sciences, University of North Carolina, P.O. Box 26170,

More information

Compror: On-line lossless data compression with a factor oracle

Compror: On-line lossless data compression with a factor oracle Information Processing Letters 83 (2002) 1 6 Compror: On-line lossless data compression with a factor oracle Arnaud Lefebvre a,, Thierry Lecroq b a UMR CNRS 6037 ABISS, Faculté des Sciences et Techniques,

More information

Deleting and Testing Forbidden Patterns in Multi-Dimensional Arrays

Deleting and Testing Forbidden Patterns in Multi-Dimensional Arrays Deleting and Testing Forbidden Patterns in Multi-Dimensional Arrays Omri Ben-Eliezer 1, Simon Korman 2, and Daniel Reichman 3 1 Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel

More information

Remarks on Separating Words

Remarks on Separating Words Remarks on Separating Words Erik D. Demaine, Sarah Eisenstat, Jeffrey Shallit 2, and David A. Wilson MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA 239, USA,

More information

A Pattern Matching Algorithm Using Deterministic Finite Automata with Infixes Checking. Jung-Hua Hsu

A Pattern Matching Algorithm Using Deterministic Finite Automata with Infixes Checking. Jung-Hua Hsu A Pattern Matching Algorithm Using Deterministic Finite Automata with Infixes Checking Jung-Hua Hsu A Pattern Matching Algorithm Using Deterministic Finite Automata with Infixes Checking Student:Jung-Hua

More information

arxiv: v1 [cs.dm] 18 May 2016

arxiv: v1 [cs.dm] 18 May 2016 A note on the shortest common superstring of NGS reads arxiv:1605.05542v1 [cs.dm] 18 May 2016 Tristan Braquelaire Marie Gasparoux Mathieu Raffinot Raluca Uricaru May 19, 2016 Abstract The Shortest Superstring

More information

On Stateless Multicounter Machines

On Stateless Multicounter Machines On Stateless Multicounter Machines Ömer Eğecioğlu and Oscar H. Ibarra Department of Computer Science University of California, Santa Barbara, CA 93106, USA Email: {omer, ibarra}@cs.ucsb.edu Abstract. We

More information