Three new strategies for exact string matching

Size: px
Start display at page:

Download "Three new strategies for exact string matching"

Transcription

1 Three new strategies for exact string matching Simone Faro 1 Thierry Lecroq 2 1 University of Catania, Italy 2 University of Rouen, LITIS EA 4108, France SeqBio 2012 November 26th-27th 2012 Marne-la-Vallée, France

2 Outline 1 Introduction to exact string matching 2 Multiple windows 3 Substring with no repetition 4 Multiple hashing SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

3 Outline 1 Introduction to exact string matching 2 Multiple windows 3 Substring with no repetition 4 Multiple hashing SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

4 Exact String Matching Definition Find all the occurrences of a pattern x of length m in a text y of length n. x, y Σ 2 instances x is given first y is given first SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

5 Exact String Matching Interests basic components of many softwares theoretical problems SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

6 Exact String Matching Theory linear time since Morris and Pratt 1970 linear time and constant space O((n log m)/m) in average [Yao 1979] SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

7 Exact String Matching Solutions Many!! see SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

8 Efficient solutions S. Faro and T. Lecroq The Exact Online String Matching Problem: a Review of the Most Recent Results ACM Computing Surveys 45(2) (2013) to appear. SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

9 Exact String Matching Classical solutions comparisons Knuth-Morris-Pratt (KMP) Boyer-Moore (BM) automata Backward DAWG Matching (with suffix automaton or oracle) (BDM) bit-parallelism Shift Or (SO) Backward Nondeterministic DAWG Matching (BNDM) filtering Karp-Rabin (KR) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

10 Sliding Window y window x beginning SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

11 Sliding Window y window x middle SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

12 Sliding Window y window x middle SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

13 Sliding Window y window x end SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

14 Boyer-Moore (1977) y x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

15 Boyer-Moore (1977) y x comparisons b v = a v SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

16 Boyer-Moore (1977) y x x comparisons b v = a v = c v good suffix shift (1) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

17 Boyer-Moore (1977) y x comparisons b v = a v = good suffix shift (2) v SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

18 Boyer-Moore (1977) y x comparisons b v = a v bad character shift b no b SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

19 Fast Search (Cantone-Faro, 2003) uses the bad character shift when a mismatch occurs with the pattern righmost character uses the good suffix shift otherwise SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

20 TVSBS (Thathoo-Virmani-Lakshmi-Balakrishnan-Sekar, 2006) compares first the rightmost character of the window then the leftmost, then all the others uses both the righmost character of the window and the following to perform the shift SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

21 TVSBS (Thathoo-Virmani-Lakshmi-Balakrishnan-Sekar, 2006) y x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

22 TVSBS (Thathoo-Virmani-Lakshmi-Balakrishnan-Sekar, 2006) y x a = a SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

23 TVSBS (Thathoo-Virmani-Lakshmi-Balakrishnan-Sekar, 2006) y x b = b a = a SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

24 TVSBS (Thathoo-Virmani-Lakshmi-Balakrishnan-Sekar, 2006) y x b = b comparisons u = u c d a = a SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

25 TVSBS (Thathoo-Virmani-Lakshmi-Balakrishnan-Sekar, 2006) forward character comparisons y x b = b u = u c d a = a e Berry-Ravindran shift a e no ae SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

26 BNDM (Navarro & Raffinot, 1998) x C A T A SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

27 BNDM (Navarro & Raffinot, 1998) x C A T A S A S C S G S T SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

28 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

29 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

30 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

31 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S T SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

32 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S T = SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

33 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S T = Shift SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

34 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

35 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S A SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

36 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S A = SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

37 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S A = Shift SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

38 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

39 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S C SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

40 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S C = SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

41 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S C = Shift SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

42 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

43 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S C SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

44 BNDM (Navarro & Raffinot, 1998) x C A T A y C C A T A C S A S C S G S T And with S C = SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

45 Outline 1 Introduction to exact string matching 2 Multiple windows 3 Substring with no repetition 4 Multiple hashing SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

46 Partition the text in k/2 substrings and use k windows m < n/k and k is even (A) s 0 }{{} w 0 (B) s 0 }{{} { w 1 }} { s 1 w 0 (C) s 0 }{{} w 0 s 1 w 2 {}}{ } {{ } w 1 s 2 { w 3 }} { s 3 A general scheme for the multiple sliding windows matcher with: (A) 1 window (B) 2 windows (C) 4 windows SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

47 A General Multiple Sliding Windows Approach Then process simultaneously the k different text windows where w 0, w 1,..., w k 1 w 2i = y[(2n/k)i.. (2n/k)i + m 1] and left windows, goes to right w 2i+1 = y[(2n/k)(i + 1) 1.. (2n/k)(i + 1) + m 2] right windows, goes to left for i = 0,..., (k 2)/2 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

48 Ending situation For each couple of windows (w 2i, w 2i+1 ) the sliding process ends when the window w 2i slides over the window w 2i+1 no occurrence can be missed due to the m 1 overlapping characters between adjacent substrings SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

49 SMART: String Matching Algorithm Research Tool SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

50 SMART: String Matching Algorithm Research Tool more than 80 string matching algorithms a corpus of 12 texts select/deselect string matching algorithms output experimental results in L A TEX, xml, html and txt formats easy to plug new algorithms SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

51 Experiments Random texts of length n = 4, 000, 000 on alphabets of size 16, 32 and 64 for m = 2, 4, 8, 16, 32, 64 with algorithms: Fast Search TVSBS SBNDM FSBNDM and k = 1, 2, 4, 6, 8 windows SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

52 Experimental results Fast Search TVSBS SBNDM FSBNDM SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

53 Experiments Random texts of length n = 4, 000, 000 on alphabets of size 16, 32 and 64 for m = 2, 4, 8, 16, 32, 64, 128, 256, 512 and with a protein and a natural language (English) files with algorithms: known EBOM HASH FSBNDM QF new FS-W FSBNDM-W SBNDM-W TVSBS-W SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

54 Experimental results Alphabet size 32, short patterns SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

55 Perspectives implement other known exact string matching algorithms in this multiple window framework truly parallelize apply it to other related problems SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

56 Reference S. Faro and T. Lecroq A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms In: (R. Klasing editor, Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2012), Bordeaux, France, 2012), Lecture Notes in Computer Science 7276, Springer-Verlag, Berlin, SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

57 Outline 1 Introduction to exact string matching 2 Multiple windows 3 Substring with no repetition 4 Multiple hashing SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

58 Problem with bit-parallel algorithms Solutions LBNDM (Peltola & Tarhio, 2003) BXS (Durian, Peltola, Salmela & Tarhio, 2010) Factorized BNDM (Cantone, Faro & Giaquinta, 2010) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

59 LBNDM (Long patterns BNDM, Peltola & Tarhio, 2003) partition x = x 0 x 1 x m/k 1 where x j = k = (m 1)/ω + 1 construct a new pattern x where x [j] is the set of characters of x j search for x in y using the BNDM algorithm by considering every k-th character SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

60 LBNDM (Long patterns BNDM, Peltola & Tarhio, 2003) x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

61 LBNDM (Long patterns BNDM, Peltola & Tarhio, 2003) x ω ω ω SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

62 LBNDM (Long patterns BNDM, Peltola & Tarhio, 2003) x ω ω ω k = 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

63 LBNDM (Long patterns BNDM, Peltola & Tarhio, 2003) x k ω ω ω k = 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

64 LBNDM (Long patterns BNDM, Peltola & Tarhio, 2003) k x k = 4 x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

65 LBNDM (Long patterns BNDM, Peltola & Tarhio, 2003) k x k = 4 x y SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

66 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) partition x = x 0 x 1 x m/ω where x j = ω construct a new pattern x where x [j] is the set of characters of x[j + ω] search for x in y using the BNDM algorithm by using a circular automaton SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

67 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

68 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) x ω ω ω SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

69 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) x x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

70 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) x x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

71 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) x x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

72 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) x x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

73 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) x x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

74 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) x x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

75 BXS (BNDM algorithm with Extended Shift, Durian, Peltola, Salmela & Tarhio, 2010) x x SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

76 Substring with no repetition Definition A substring s of x is a substring with no repetition (SNR) if any character c Σ appears at most once in s. SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

77 SNR SNR automaton Maximal SNR automaton SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

78 SNR SNR automaton Maximal SNR automaton SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

79 A New Fast Suffix Automaton Based Algorithm Properties s min{σ, m} an SNR admits a suffix automaton where there does not exist two states having incoming transitions labeled with the same symbol SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

80 SNR Suffix Automaton 5 u 4 t 3 o u 2 t m o 1 m a 0 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

81 The Preprocessing Phase Find a maximal SNR of the pattern, i.e. an SNR of maximal length. O(n) time Length of the maximal SNR In many practical cases the length of the maximal SNR is not large enough if compared with the pattern length. Average length of the maximal SNR m genome protein English SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

82 Longer SNRs In order to allow longer SNR it is convenient to use a condensed alphabet whose symbols are obtained by combining groups of q symbols. A hash function h : Σ q {0,..., MAX 1} can be used for combining the group of symbols, for a fixed constant value MAX. Let [u] q = h(u), for each u Σ q. Condensed alphabet x : agcctgatcg x 2 : [ag] 2 [gc] 2 [cc] 2 [ct] 2 [tg] 2 [ga] 2 [at] 2 [tc] 2 [cg] 2 x 3 : [agc] 3 [gcc] 3 [cct] 3 [ctg] 3 [tga] 3 [gat] 3 [atc] 3 [tcg] 3 x 4 : [agcc] 4 [gcct] 4 [cctg] 4 [ctga] 4 [tgat] 4 [gatc] 4 [atcg] 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

83 It turns out that the length of the maximal SNR, though quite less than m in most cases, is quite larger than the size of a computer word (which typically is 32 or 64). This leads to larger shift in a suffix automata based algorithm. Length of the maximal SNR using condensed characters The average length of the maximal SNR in patterns randomly extracted from a genome sequence. The SNR have been computed using condensed alphabets on q characters, where q ranges from 1 to 8. q/m SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

84 Experimental results on a genome sequence m BDM BNDM SBNDM BXS F-BNDM LBNDM BSDM BSDM BSDM BSDM BSDM BSDM BSDM BSDM SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

85 S. Faro and T. Lecroq A Fast Suffix Automata Based Algorithm for Exact Online String Matching In: (N. Moreira and R. Reis editors, Proceedings of the 17th International Conference on Implementation and Application of Automata (CIAA 2012), Porto, Portugal, 2012), Lecture Notes in Computer Science 7381, Springer-Verlag, Berlin, SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

86 Outline 1 Introduction to exact string matching 2 Multiple windows 3 Substring with no repetition 4 Multiple hashing SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

87 Multiple exact string matching Problem Find all the occurrences of the patterns in X = {x 0,..., x k 1 } in y. m = min{ x i } Classical solutions Aho & Corasick (1975) Commentz-Walter (1979) Wu & Manber (1994) Multi-Reverse-Factor (1999) MBNDM SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

88 Wu-Manber algorithm considers blocks of length l blocks are hashed using a function h into values less than maxvalue shift[h(x i [j l j])] = min{ x i, m l + 1} prefixes are also hashed implemented in agrep SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

89 Wu-Manber algorithm y x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

90 Wu-Manber algorithm y window x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

91 Wu-Manber algorithm window y B x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

92 Wu-Manber algorithm window y h(b) x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

93 Wu-Manber algorithm window y h(b) x 0 x 1 x 2 x 3 x 4 h(b) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

94 Wu-Manber algorithm window y h(b) x 0 x 1 x 2 x 3 x 4 h(b) h(b) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

95 Wu-Manber algorithm window y B x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

96 Wu-Manber algorithm window y h(b) x 0 x 1 x 2 x 3 x 4 h(b) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

97 Wu-Manber algorithm window y h(b) x 0 x 1 x 2 x 3 x 4 h(b) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

98 Wu-Manber algorithm window y B x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

99 Wu-Manber algorithm window y h(b) x 0 x 1 x 2 x 3 x 4 h(b) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

100 Wu-Manber algorithm window y P h(b) x 0 x 1 x 2 x 3 x 4 h(b) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

101 Wu-Manber algorithm window y h (P ) h(b) x 0 x 1 x 2 x 3 x 4 h(b) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

102 Wu-Manber algorithm window y h (P ) h(b) x 0 x 1 x 2 x 3 x 4 h (P ) h (P ) h (P ) h(b) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

103 Wu-Manber algorithm window y h (P ) h(b) x 0 x 1 x 2 x 3 x 4 h (P ) h (P ) h (P ) h(b) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

104 Our new method γ hashing functions SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

105 Our new method y x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

106 Our new method y window x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

107 Our new method window y B x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

108 Our new method window y h 1(B) x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

109 Our new method window y h 1(B) x 0 x 1 x 2 x 3 x 4 h 1(B) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

110 Our new method window y h 1(B) x 0 x 1 x 2 x 3 x 4 h 1(B) h 1(B) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

111 Our new method window y B x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

112 Our new method window y h 2(B) x 0 x 1 x 2 x 3 x 4 h 2(B) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

113 Our new method window y h 2(B) x 0 x 1 x 2 x 3 x 4 h 2(B) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

114 Our new method window y B x 0 x 1 x 2 x 3 x 4 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

115 Our new method window y h 1(B) x 0 x 1 x 2 x 3 x 4 h 1(B) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

116 Our new method window y h 2(B) x 0 x 1 x 2 x 3 x 4 h 1(B) h 2(B) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

117 Our new method window y h 2(B) x 0 x 1 x 2 x 3 x 4 F (B) F (B) h 1(B) h 2(B) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

118 Our new method window y h 2(B) x 0 x 1 x 2 x 3 x 4 F (B) F (B) h 1(B) h 2(B) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

119 Experimental results 10, 000 patterns on a genome sequence m MBNDM WM(2, 2) WM(2, 3) WM(3, 2) WM(3, 3) WM(4, 1) WM(4, 2) WM(4, 3) WM(6, 1) WM(6, 2) WM(6, 3) WM(8, 1) WM(8, 2) WM(8, 3) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

120 Experimental results Naive comparison per text symbol for 10, 000 patterns m MBNDM WM(2, 2) WM(2, 3) WM(3, 2) WM(3, 3) WM(4, 1) WM(4, 2) WM(4, 3) WM(6, 1) WM(6, 2) WM(6, 3) WM(8, 1) WM(8, 2) WM(8, 3) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

121 Experimental results Speed ups obtained via WM(q, γ) vs MBNDM k / m , SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

122 Reference S. Faro and T. Lecroq Fast Searching in Biological Sequences Using Multiple Hash Functions IEEE 12th International Conference on Bioinformatics and Bioengineering (BIBE 2012), Larnaca, Cyprus, 2012 SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

123 Thank you for your attention! SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

124 Multiple windows: experimental results Alphabet size 32, long patterns SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

125 Multiple windows: experimental results Protein alphabet (size 20), short patterns SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

126 Multiple windows: experimental results Protein alphabet (size 20), long patterns SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

127 Multiple windows: experimental results English text, short patterns SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

128 Multiple windows: experimental results English text, long patterns SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

129 BSDM Experimental results on a protein sequence m BDM BNDM SBNDM BXS F-BNDM LBNDM BSDM BSDM BSDM BSDM BSDM BSDM BSDM BSDM SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

130 BSDM Experimental results on a natural language text m BDM BNDM SBNDM BXS F-BNDM LBNDM BSDM BSDM BSDM BSDM BSDM BSDM BSDM BSDM SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

131 Multiple hashing: experimental results 100 patterns on a genome sequence m MBNDM WM(2, 2) WM(2, 3) WM(3, 2) WM(3, 3) WM(4, 1) WM(4, 2) WM(4, 3) WM(6, 1) WM(6, 2) WM(6, 3) WM(8, 1) WM(8, 2) WM(8, 3) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

132 Multiple hashing: experimental results 1000 patterns on a genome sequence m MBNDM WM(2, 2) WM(2, 3) WM(3, 2) WM(3, 3) WM(4, 1) WM(4, 2) WM(4, 3) WM(6, 1) WM(6, 2) WM(6, 3) WM(8, 1) WM(8, 2) WM(8, 3) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

133 Multiple hashing: experimental results Naive comparison per text symbol for 1000 patterns m MBNDM WM(2, 2) WM(2, 3) WM(3, 2) WM(3, 3) WM(4, 1) WM(4, 2) WM(4, 3) WM(6, 1) WM(6, 2) WM(6, 3) WM(8, 1) WM(8, 2) WM(8, 3) SF&TL (Catania & Rouen) Exact String Matching SeqBio / 63

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms Simone Faro Thierry Lecroq University of Catania, Italy University of Rouen, LITIS EA 4108, France Symposium on Eperimental Algorithms

More information

PATTERN MATCHING WITH SWAPS IN PRACTICE

PATTERN MATCHING WITH SWAPS IN PRACTICE International Journal of Foundations of Computer Science c World Scientific Publishing Company PATTERN MATCHING WITH SWAPS IN PRACTICE MATTEO CAMPANELLI Università di Catania, Scuola Superiore di Catania

More information

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries and String Matching CS 240 - Data Structures and Data Management Sajed Haque Veronika Irvine Taylor Smith Based on lecture notes by many previous cs240 instructors David R. Cheriton School

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1 Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt

More information

Lecture 5: The Shift-And Method

Lecture 5: The Shift-And Method Biosequence Algorithms, Spring 2005 Lecture 5: The Shift-And Method Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 5: Shift-And p.1/19 Seminumerical String Matching Most

More information

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Domenico Cantone Simone Faro Emanuele Giaquinta Department of Mathematics and Computer Science, University of Catania, Italy 1 /

More information

Pattern Matching (Exact Matching) Overview

Pattern Matching (Exact Matching) Overview CSI/BINF 5330 Pattern Matching (Exact Matching) Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Pattern Matching Exhaustive Search DFA Algorithm KMP Algorithm

More information

Algorithms: COMP3121/3821/9101/9801

Algorithms: COMP3121/3821/9101/9801 NEW SOUTH WALES Algorithms: COMP3121/3821/9101/9801 Aleks Ignjatović School of Computer Science and Engineering University of New South Wales LECTURE 8: STRING MATCHING ALGORITHMS COMP3121/3821/9101/9801

More information

Data Structure for Dynamic Patterns

Data Structure for Dynamic Patterns Data Structure for Dynamic Patterns Chouvalit Khancome and Veera Booning Member IAENG Abstract String matching and dynamic dictionary matching are significant principles in computer science. These principles

More information

Exact Circular Pattern Matching Using the BNDM Algorithm

Exact Circular Pattern Matching Using the BNDM Algorithm Exact Circular Pattern Matching Using the BNDM Algorithm K. H. Chen, G. S. Huang and R. C. T. Lee Department of Computer Science and Information Engineering, National Chi Nan University, Puli, Nantou,

More information

Text Searching. Thierry Lecroq Laboratoire d Informatique, du Traitement de l Information et des

Text Searching. Thierry Lecroq Laboratoire d Informatique, du Traitement de l Information et des Text Searching Thierry Lecroq Thierry.Lecroq@univ-rouen.fr Laboratoire d Informatique, du Traitement de l Information et des Systèmes. International PhD School in Formal Languages and Applications Tarragona,

More information

On-line String Matching in Highly Similar DNA Sequences

On-line String Matching in Highly Similar DNA Sequences On-line String Matching in Highly Similar DNA Sequences Nadia Ben Nsira 1,2,ThierryLecroq 1,,MouradElloumi 2 1 LITIS EA 4108, Normastic FR3638, University of Rouen, France 2 LaTICE, University of Tunis

More information

PSC Prague Stringology Club

PSC Prague Stringology Club Proceedings of the Prague Stringology Conference 2011 Edited by Jan Holub and Jan Žd árek August 2011 PSC Prague Stringology Club http://www.stringology.org/ Proceedings of the Prague Stringology Conference

More information

2. Exact String Matching

2. Exact String Matching 2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift

More information

New Inverted Lists-Multiple String Patterns Matching Algorithm

New Inverted Lists-Multiple String Patterns Matching Algorithm SSN 2348-1196 (print) nternational Journal of Computer Science and nformation Technology Research SSN 2348-120X (online) Vol. 2, ssue 4, pp: (254-264), Month: October - December 2014, Available at: www.researchpublish.com

More information

String Search. 6th September 2018

String Search. 6th September 2018 String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)

More information

INF 4130 / /8-2017

INF 4130 / /8-2017 INF 4130 / 9135 28/8-2017 Algorithms, efficiency, and complexity Problem classes Problems can be divided into sets (classes). Problem classes are defined by the type of algorithm that can (or cannot) solve

More information

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Searching Sear ( Sub- (Sub )Strings Ulf Leser Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf

More information

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12 Algorithms Theory 15 Text search P.D. Dr. Alexander Souza Text search Various scenarios: Dynamic texts Text editors Symbol manipulators Static texts Literature databases Library systems Gene databases

More information

INF 4130 / /8-2014

INF 4130 / /8-2014 INF 4130 / 9135 26/8-2014 Mandatory assignments («Oblig-1», «-2», and «-3»): All three must be approved Deadlines around: 25. sept, 25. oct, and 15. nov Other courses on similar themes: INF-MAT 3370 INF-MAT

More information

Lecture 3: String Matching

Lecture 3: String Matching COMP36111: Advanced Algorithms I Lecture 3: String Matching Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2017 18 Outline The string matching problem The Rabin-Karp algorithm The Knuth-Morris-Pratt

More information

Analysis of Algorithms Prof. Karen Daniels

Analysis of Algorithms Prof. Karen Daniels UMass Lowell Computer Science 91.503 Analysis of Algorithms Prof. Karen Daniels Spring, 2012 Tuesday, 4/24/2012 String Matching Algorithms Chapter 32* * Pseudocode uses 2 nd edition conventions 1 Chapter

More information

Average Case Analysis of the Boyer-Moore Algorithm

Average Case Analysis of the Boyer-Moore Algorithm Average Case Analysis of the Boyer-Moore Algorithm TSUNG-HSI TSAI Institute of Statistical Science Academia Sinica Taipei 115 Taiwan e-mail: chonghi@stat.sinica.edu.tw URL: http://www.stat.sinica.edu.tw/chonghi/stat.htm

More information

Computing repetitive structures in indeterminate strings

Computing repetitive structures in indeterminate strings Computing repetitive structures in indeterminate strings Pavlos Antoniou 1, Costas S. Iliopoulos 1, Inuka Jayasekera 1, Wojciech Rytter 2,3 1 Dept. of Computer Science, King s College London, London WC2R

More information

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer Algorithm Theory 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore Institut für Informatik Wintersemester 2007/08 Text Search Scenarios Static texts Literature databases Library systems Gene databases

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Masayuki Takeda 1,2, Shunsuke Inenaga 1,2, Hideo Bannai 3, Ayumi Shinohara 1,2, and Setsuo Arikawa 1 1 Department of Informatics,

More information

Overview. Knuth-Morris-Pratt & Boyer-Moore Algorithms. Notation Review (2) Notation Review (1) The Kunth-Morris-Pratt (KMP) Algorithm

Overview. Knuth-Morris-Pratt & Boyer-Moore Algorithms. Notation Review (2) Notation Review (1) The Kunth-Morris-Pratt (KMP) Algorithm Knuth-Morris-Pratt & s by Robert C. St.Pierre Overview Notation review Knuth-Morris-Pratt algorithm Discussion of the Algorithm Example Boyer-Moore algorithm Discussion of the Algorithm Example Applications

More information

Compror: On-line lossless data compression with a factor oracle

Compror: On-line lossless data compression with a factor oracle Information Processing Letters 83 (2002) 1 6 Compror: On-line lossless data compression with a factor oracle Arnaud Lefebvre a,, Thierry Lecroq b a UMR CNRS 6037 ABISS, Faculté des Sciences et Techniques,

More information

Alternative Algorithms for Lyndon Factorization

Alternative Algorithms for Lyndon Factorization Alternative Algorithms for Lyndon Factorization Suhpal Singh Ghuman 1, Emanuele Giaquinta 2, and Jorma Tarhio 1 1 Department of Computer Science and Engineering, Aalto University P.O.B. 15400, FI-00076

More information

Finding all covers of an indeterminate string in O(n) time on average

Finding all covers of an indeterminate string in O(n) time on average Finding all covers of an indeterminate string in O(n) time on average Md. Faizul Bari, M. Sohel Rahman, and Rifat Shahriyar Department of Computer Science and Engineering Bangladesh University of Engineering

More information

Average Complexity of Exact and Approximate Multiple String Matching

Average Complexity of Exact and Approximate Multiple String Matching Average Complexity of Exact and Approximate Multiple String Matching Gonzalo Navarro Department of Computer Science University of Chile gnavarro@dcc.uchile.cl Kimmo Fredriksson Department of Computer Science

More information

Online Computation of Abelian Runs

Online Computation of Abelian Runs Online Computation of Abelian Runs Gabriele Fici 1, Thierry Lecroq 2, Arnaud Lefebvre 2, and Élise Prieur-Gaston2 1 Dipartimento di Matematica e Informatica, Università di Palermo, Italy Gabriele.Fici@unipa.it

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

Graduate Algorithms CS F-20 String Matching

Graduate Algorithms CS F-20 String Matching Graduate Algorithms CS673-2016F-20 String Matching David Galles Department of Computer Science University of San Francisco 20-0: String Matching Given a source text, and a string to match, where does the

More information

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then String Matching Thanks to Piotr Indyk String Matching Input: Two strings T[1 n] and P[1 m], containing symbols from alphabet Σ Goal: find all shifts 0 s n-m such that T[s+1 s+m]=p Example: Σ={,a,b,,z}

More information

Samson Zhou. Pattern Matching over Noisy Data Streams

Samson Zhou. Pattern Matching over Noisy Data Streams Samson Zhou Pattern Matching over Noisy Data Streams Finding Structure in Data Pattern Matching Finding all instances of a pattern within a string ABCD ABCAABCDAACAABCDBCABCDADDDEAEABCDA Knuth-Morris-Pratt

More information

On Boyer-Moore Preprocessing

On Boyer-Moore Preprocessing On Boyer-Moore reprocessing Heikki Hyyrö Department of Computer Sciences University of Tampere, Finland Heikki.Hyyro@cs.uta.fi Abstract robably the two best-known exact string matching algorithms are the

More information

String Matching with Variable Length Gaps

String Matching with Variable Length Gaps String Matching with Variable Length Gaps Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind Technical University of Denmark Abstract. We consider string matching with variable length

More information

arxiv: v1 [cs.fl] 29 Jun 2013

arxiv: v1 [cs.fl] 29 Jun 2013 On a compact encoding of the swap automaton Kimmo Fredriksson 1 and Emanuele Giaquinta 2 arxiv:1307.0099v1 [cs.fl] 29 Jun 2013 1 School of Computing, University of Eastern Finland kimmo.fredriksson@uef.fi

More information

Fast String Searching. J Strother Moore Department of Computer Sciences University of Texas at Austin

Fast String Searching. J Strother Moore Department of Computer Sciences University of Texas at Austin Fast String Searching J Strother Moore Department of Computer Sciences University of Texas at Austin 1 The Problem One of the classic problems in computing is string searching: find the first occurrence

More information

Lecture 15. Saad Mneimneh

Lecture 15. Saad Mneimneh Computat onal Biology Lecture 15 DNA sequencing Shortest common superstring SCS An elegant theoretical abstraction, but fundamentally flawed R. Karp Given a set of fragments F, Find the shortest string

More information

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F,

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F, Computat onal Biology Lecture 15 DNA sequencing Shortest common superstring SCS An elegant theoretical abstraction, but fundamentally flawed R. Karp Given a set of fragments F, Find the shortest string

More information

String Regularities and Degenerate Strings

String Regularities and Degenerate Strings M. Sc. Thesis Defense Md. Faizul Bari (100705050P) Supervisor: Dr. M. Sohel Rahman String Regularities and Degenerate Strings Department of Computer Science and Engineering Bangladesh University of Engineering

More information

CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182

CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182 CSE182-L7 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding 10-07 CSE182 Bell Labs Honors Pattern matching 10-07 CSE182 Just the Facts Consider the set of all substrings

More information

CMP 309: Automata Theory, Computability and Formal Languages. Adapted from the work of Andrej Bogdanov

CMP 309: Automata Theory, Computability and Formal Languages. Adapted from the work of Andrej Bogdanov CMP 309: Automata Theory, Computability and Formal Languages Adapted from the work of Andrej Bogdanov Course outline Introduction to Automata Theory Finite Automata Deterministic Finite state automata

More information

Text Analytics. Searching Terms. Ulf Leser

Text Analytics. Searching Terms. Ulf Leser Text Analytics Searching Terms Ulf Leser A Probabilistic Interpretation of Relevance We want to compute the probability that a doc d is relevant to query q The probabilistic model determines this probability

More information

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching Ricardo Luis de Azevedo da Rocha 1, João José Neto 1 1 Laboratório de Linguagens

More information

State Complexity of Neighbourhoods and Approximate Pattern Matching

State Complexity of Neighbourhoods and Approximate Pattern Matching State Complexity of Neighbourhoods and Approximate Pattern Matching Timothy Ng, David Rappaport, and Kai Salomaa School of Computing, Queen s University, Kingston, Ontario K7L 3N6, Canada {ng, daver, ksalomaa}@cs.queensu.ca

More information

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * 1 Jorma Tarhio and Esko Ukkonen Department of Computer Science, University of Helsinki Tukholmankatu 2, SF-00250 Helsinki,

More information

More Speed and More Compression: Accelerating Pattern Matching by Text Compression

More Speed and More Compression: Accelerating Pattern Matching by Text Compression More Speed and More Compression: Accelerating Pattern Matching by Text Compression Tetsuya Matsumoto, Kazuhito Hagio, and Masayuki Takeda Department of Informatics, Kyushu University, Fukuoka 819-0395,

More information

A Pattern Matching Algorithm Using Deterministic Finite Automata with Infixes Checking. Jung-Hua Hsu

A Pattern Matching Algorithm Using Deterministic Finite Automata with Infixes Checking. Jung-Hua Hsu A Pattern Matching Algorithm Using Deterministic Finite Automata with Infixes Checking Jung-Hua Hsu A Pattern Matching Algorithm Using Deterministic Finite Automata with Infixes Checking Student:Jung-Hua

More information

Efficient Sequential Algorithms, Comp309

Efficient Sequential Algorithms, Comp309 Efficient Sequential Algorithms, Comp309 University of Liverpool 2010 2011 Module Organiser, Igor Potapov Part 2: Pattern Matching References: T. H. Cormen, C. E. Leiserson, R. L. Rivest Introduction to

More information

Inferring Strings from Graphs and Arrays

Inferring Strings from Graphs and Arrays Inferring Strings from Graphs and Arrays Hideo Bannai 1, Shunsuke Inenaga 2, Ayumi Shinohara 2,3, and Masayuki Takeda 2,3 1 Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1

More information

Bio nformatics. Lecture 16. Saad Mneimneh

Bio nformatics. Lecture 16. Saad Mneimneh Bio nformatics Lecture 16 DNA sequencing To sequence a DNA is to obtain the string of bases that it contains. It is impossible to sequence the whole DNA molecule directly. We may however obtain a piece

More information

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem Hsing-Yen Ann National Center for High-Performance Computing Tainan 74147, Taiwan Chang-Biau Yang and Chiou-Ting

More information

Improving the KMP Algorithm by Using Properties of Fibonacci String

Improving the KMP Algorithm by Using Properties of Fibonacci String Improving the KMP Algorithm by Using Properties of Fibonacci String Yi-Kung Shieh and R. C. T. Lee Department of Computer Science National Tsing Hua University d9762814@oz.nthu.edu.tw and rctlee@ncnu.edu.tw

More information

58093 String Processing Algorithms. Lectures, Fall 2010, period II

58093 String Processing Algorithms. Lectures, Fall 2010, period II 58093 String Processing Algorithms Lectures, Fall 2010, period II Juha Kärkkäinen 1 Who is this course for? Master s level course in Computer Science, 4 cr Subprogram of Algorithms and Machine Learning

More information

How do regular expressions work? CMSC 330: Organization of Programming Languages

How do regular expressions work? CMSC 330: Organization of Programming Languages How do regular expressions work? CMSC 330: Organization of Programming Languages Regular Expressions and Finite Automata What we ve learned What regular expressions are What they can express, and cannot

More information

All three must be approved Deadlines around: 21. sept, 26. okt, and 16. nov

All three must be approved Deadlines around: 21. sept, 26. okt, and 16. nov INF 4130 / 9135 29/8-2012 Today s slides are produced mainly by Petter Kristiansen Lecturer Stein Krogdahl Mandatory assignments («Oblig1», «-2», and «-3»): All three must be approved Deadlines around:

More information

A Unifying Framework for Compressed Pattern Matching

A Unifying Framework for Compressed Pattern Matching A Unifying Framework for Compressed Pattern Matching Takuya Kida Yusuke Shibata Masayuki Takeda Ayumi Shinohara Setsuo Arikawa Department of Informatics, Kyushu University 33 Fukuoka 812-8581, Japan {

More information

Clarifications from last time. This Lecture. Last Lecture. CMSC 330: Organization of Programming Languages. Finite Automata.

Clarifications from last time. This Lecture. Last Lecture. CMSC 330: Organization of Programming Languages. Finite Automata. CMSC 330: Organization of Programming Languages Last Lecture Languages Sets of strings Operations on languages Finite Automata Regular expressions Constants Operators Precedence CMSC 330 2 Clarifications

More information

Define M to be a binary n by m matrix such that:

Define M to be a binary n by m matrix such that: The Shift-And Method Define M to be a binary n by m matrix such that: M(i,j) = iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = iff P[.. i] T[j-i+.. j]

More information

The streaming k-mismatch problem

The streaming k-mismatch problem The streaming k-mismatch problem Raphaël Clifford 1, Tomasz Kociumaka 2, and Ely Porat 3 1 Department of Computer Science, University of Bristol, United Kingdom raphael.clifford@bristol.ac.uk 2 Institute

More information

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003 String Matching Jayadev Misra The University of Texas at Austin December 5, 2003 Contents 1 Introduction 1 2 Rabin-Karp Algorithm 3 3 Knuth-Morris-Pratt Algorithm 5 3.1 Informal Description.........................

More information

OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS

OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS Parallel Processing Letters, c World Scientific Publishing Company OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS COSTAS S. ILIOPOULOS Department of Computer Science, King s College London,

More information

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa CS:4330 Theory of Computation Spring 2018 Regular Languages Finite Automata and Regular Expressions Haniel Barbosa Readings for this lecture Chapter 1 of [Sipser 1996], 3rd edition. Sections 1.1 and 1.3.

More information

Simple Compression Code Supporting Random Access and Fast String Matching

Simple Compression Code Supporting Random Access and Fast String Matching Simple Compression Code Supporting Random Access and Fast String Matching Kimmo Fredriksson and Fedor Nikitin Department of Computer Science and Statistics, University of Joensuu PO Box 111, FIN 80101

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 443 (2012) 25 34 Contents lists available at SciVerse ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs String matching with variable

More information

Mechanized Operational Semantics

Mechanized Operational Semantics Mechanized Operational Semantics J Strother Moore Department of Computer Sciences University of Texas at Austin Marktoberdorf Summer School 2008 (Lecture 5: Boyer-Moore Fast String Searching) 1 The Problem

More information

A Repetitive Corpus Testbed

A Repetitive Corpus Testbed Chapter 3 A Repetitive Corpus Testbed In this chapter we present a corpus of repetitive texts. These texts are categorized according to the source they come from into the following: Artificial Texts, Pseudo-

More information

arxiv: v2 [cs.ds] 5 Mar 2014

arxiv: v2 [cs.ds] 5 Mar 2014 Order-preserving pattern matching with k mismatches Pawe l Gawrychowski 1 and Przemys law Uznański 2 1 Max-Planck-Institut für Informatik, Saarbrücken, Germany 2 LIF, CNRS and Aix-Marseille Université,

More information

String Matching II. Algorithm : Design & Analysis [19]

String Matching II. Algorithm : Design & Analysis [19] String Matching II Algorithm : Design & Analysis [19] In the last class Simple String Matching KMP Flowchart Construction Jump at Fail KMP Scan String Matching II Boyer-Moore s heuristics Skipping unnecessary

More information

Efficient Algorithms forregular Expression Constrained Sequence Alignment p. 1/35

Efficient Algorithms forregular Expression Constrained Sequence Alignment p. 1/35 Efficient Algorithms for Regular Expression Constrained Sequence Alignment Yun-Sheng Chung, Chin Lung Lu, and Chuan Yi Tang Department of Computer Science National Tsing Hua University, Taiwan Department

More information

Exact vs approximate search

Exact vs approximate search Text Algorithms (4AP) Lecture: Approximate matching Jaak Vilo 2008 fall Jaak Vilo MTAT.03.190 Text Algorithms 1 Exact vs approximate search In exact search we searched for a string or set of strings in

More information

Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt Algorithm Knuth-Morris-Pratt Algorithm Jayadev Misra June 5, 2017 The Knuth-Morris-Pratt string matching algorithm (KMP) locates all occurrences of a pattern string in a text string in linear time (in the combined

More information

On Pattern Matching With Swaps

On Pattern Matching With Swaps On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720

More information

Shift-And Approach to Pattern Matching in LZW Compressed Text

Shift-And Approach to Pattern Matching in LZW Compressed Text Shift-And Approach to Pattern Matching in LZW Compressed Text Takuya Kida, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa Department of Informatics, Kyushu University 33 Fukuoka 812-8581, Japan {kida,

More information

Linear-Time Computation of Local Periods

Linear-Time Computation of Local Periods Linear-Time Computation of Local Periods Jean-Pierre Duval 1, Roman Kolpakov 2,, Gregory Kucherov 3, Thierry Lecroq 4, and Arnaud Lefebvre 4 1 LIFAR, Université de Rouen, France Jean-Pierre.Duval@univ-rouen.fr

More information

Succinct 2D Dictionary Matching with No Slowdown

Succinct 2D Dictionary Matching with No Slowdown Succinct 2D Dictionary Matching with No Slowdown Shoshana Neuburger and Dina Sokol City University of New York Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d

More information

On the Decision Tree Complexity of String Matching. December 29, 2017

On the Decision Tree Complexity of String Matching. December 29, 2017 On the Decision Tree Complexity of String Matching Xiaoyu He 1 Neng Huang 1 Xiaoming Sun 2 December 29, 2017 arxiv:1712.09738v1 [cs.cc] 28 Dec 2017 Abstract String matching is one of the most fundamental

More information

SUBSTRING SEARCH BBM ALGORITHMS DEPT. OF COMPUTER ENGINEERING

SUBSTRING SEARCH BBM ALGORITHMS DEPT. OF COMPUTER ENGINEERING BBM 202 - LGORITHMS DEPT. OF OMPUTER ENGINEERING SUBSTRING SERH cknowledgement: The course slides are adapted from the slides prepared by R. Sedgewick and K. Wayne of Princeton University. 1 TODY Substring

More information

arxiv:cs/ v1 [cs.dm] 7 May 2006

arxiv:cs/ v1 [cs.dm] 7 May 2006 arxiv:cs/0605026v1 [cs.dm] 7 May 2006 Strongly Almost Periodic Sequences under Finite Automata Mappings Yuri Pritykin April 11, 2017 Abstract The notion of almost periodicity nontrivially generalizes the

More information

Hierarchical Overlap Graph

Hierarchical Overlap Graph Hierarchical Overlap Graph B. Cazaux and E. Rivals LIRMM & IBC, Montpellier 8. Feb. 2018 arxiv:1802.04632 2018 B. Cazaux & E. Rivals 1 / 29 Overlap Graph for a set of words Consider the set P := {abaa,

More information

Pattern matching in highly similar sequences

Pattern matching in highly similar sequences Pattern matching in highly similar sequences Thierry Lecroq Thierry.Lecroq@univ-rouen.fr joint work with N. Ben Nsira et É. Prieur-Gaston Laboratoire d Informatique, du Traitement de l Information et des

More information

Hashing Techniques For Finite Automata

Hashing Techniques For Finite Automata Hashing Techniques For Finite Automata Hady Zeineddine Logic Synthesis Course Project - Spring 2007 Professor Adnan Aziz 1. Abstract This report presents two hashing techniques - Alphabet and Power-Set

More information

The NFA Segments Scan Algorithm

The NFA Segments Scan Algorithm The NFA Segments Scan Algorithm Omer Barkol, David Lehavi HP Laboratories HPL-2014-10 Keyword(s): formal languages; regular expression; automata Abstract: We present a novel way for parsing text with non

More information

Lecture 4 : Adaptive source coding algorithms

Lecture 4 : Adaptive source coding algorithms Lecture 4 : Adaptive source coding algorithms February 2, 28 Information Theory Outline 1. Motivation ; 2. adaptive Huffman encoding ; 3. Gallager and Knuth s method ; 4. Dictionary methods : Lempel-Ziv

More information

String Matching Problem

String Matching Problem String Matching Problem Pattern P Text T Set of Locations L 9/2/23 CAP/CGS 5991: Lecture 2 Computer Science Fundamentals Specify an input-output description of the problem. Design a conceptual algorithm

More information

Nondeterministic Finite Automata. Nondeterminism Subset Construction

Nondeterministic Finite Automata. Nondeterminism Subset Construction Nondeterministic Finite Automata Nondeterminism Subset Construction 1 Nondeterminism A nondeterministic finite automaton has the ability to be in several states at once. Transitions from a state on an

More information

Java II Finite Automata I

Java II Finite Automata I Java II Finite Automata I Bernd Kiefer Bernd.Kiefer@dfki.de Deutsches Forschungszentrum für künstliche Intelligenz November, 23 Processing Regular Expressions We already learned about Java s regular expression

More information

Counting and Verifying Maximal Palindromes

Counting and Verifying Maximal Palindromes Counting and Verifying Maximal Palindromes Tomohiro I 1, Shunsuke Inenaga 2, Hideo Bannai 1, and Masayuki Takeda 1 1 Department of Informatics, Kyushu University 2 Graduate School of Information Science

More information

SUBSTRING SEARCH BBM ALGORITHMS DEPT. OF COMPUTER ENGINEERING ERKUT ERDEM. Apr. 28, 2015

SUBSTRING SEARCH BBM ALGORITHMS DEPT. OF COMPUTER ENGINEERING ERKUT ERDEM. Apr. 28, 2015 BBM 202 - LGORITHMS DEPT. OF OMPUTER ENGINEERING ERKUT ERDEM SUBSTRING SERH pr. 28, 2015 cknowledgement: The course slides are adapted from the slides prepared by R. Sedgewick and K. Wayne of Princeton

More information

UNIT-II. NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: SIGNIFICANCE. Use of ε-transitions. s t a r t. ε r. e g u l a r

UNIT-II. NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: SIGNIFICANCE. Use of ε-transitions. s t a r t. ε r. e g u l a r Syllabus R9 Regulation UNIT-II NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: In the automata theory, a nondeterministic finite automaton (NFA) or nondeterministic finite state machine is a finite

More information

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille 1, Rolf Fagerberg 2, and Inge Li Gørtz 3 1 IT University of Copenhagen. Rued Langgaards

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics A stochastic (probabilistic) model that assumes the Markov property Markov property is satisfied when the conditional probability distribution of future states of the process (conditional on both past

More information

Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)

Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version) Bulletin of Networking, Computing, Systems, and Software www.bncss.org, ISSN 2186-5140 Volume 7, Number 1, pages 28 32, January 2018 Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)

More information

Transposition invariant pattern matching for multi-track strings

Transposition invariant pattern matching for multi-track strings https://helda.helsinki.fi Transposition invariant pattern matching for multi-track strings Lemstrom, Kjell 2003 Lemstrom, K & Tarhio, J 2003, ' Transposition invariant pattern matching for multi-track

More information

Streaming Periodicity with Mismatches. Funda Ergun, Elena Grigorescu, Erfan Sadeqi Azer, Samson Zhou

Streaming Periodicity with Mismatches. Funda Ergun, Elena Grigorescu, Erfan Sadeqi Azer, Samson Zhou Streaming Periodicity with Mismatches Funda Ergun, Elena Grigorescu, Erfan Sadeqi Azer, Samson Zhou Periodicity A portion of a string that repeats ABCDABCDABCDABCD ABCDABCDABCDABCD Periodicity Alternate

More information