Module 9: Tries nd String Mtching CS 240 - Dt Structures nd Dt Mngement Sjed Hque Veronik Irvine Tylor Smith Bsed on lecture notes by mny previous cs240 instructors Dvid R. Cheriton School of Computer Science, University of Wterloo Spring 2017 Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 1 / 43 Pttern Mtching Serch for string (pttern) in lrge body of text T [0..n 1] The text (or hystck) being serched within P[0..m 1] The pttern (or needle) being serched for Strings over lphbet Σ Return the first i such tht P[j] = T [i + j] for 0 j m 1 This is the first occurrence of P in T If P does not occur in T, return FAIL Applictions: Informtion Retrievl (text editors, serch engines) Bioinformtics Dt Mining Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 2 / 43
Pttern Mtching Exmple: T = Where is he? P 1 = he P 2 = who Definitions: Substring T [i..j] 0 i j < n: string of length j i + 1 which consists of chrcters T [i],... T [j] in order A prefix of T : substring T [0..i] of T for some 0 i < n A suffix of T : substring T [i..n 1] of T for some 0 i n 1 Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 3 / 43 Generl Ide of Algorithms Pttern mtching lgorithms consist of guesses nd checks: A guess is position i such tht P might strt t T [i]. Vlid guesses (initilly) re 0 i n m. A check of guess is single position j with 0 j < m where we compre T [i + j] to P[j]. We must perform m checks of single correct guess, but my mke (mny) fewer checks of n incorrect guess. We will digrm single run of ny pttern mtching lgorithm by mtrix of checks, where ech row represents single guess. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 4 / 43
Brute-force Algorithm Ide: Check every possible guess. BruteforcePM(T [0..n 1], P[0..m 1]) T : String of length n (text), P: String of length m (pttern) 1. for i 0 to n m do 2. mtch true 3. j 0 4. while j < m nd mtch do 5. if T [i + j] = P[j] then 6. j j + 1 7. else 8. mtch flse 9. if mtch then 10. return i 11. return FAIL Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 5 / 43 Exmple Exmple: T = bbbbbbb, P = bb b b b b b b b b b b b b b Wht is the worst possible input? P = m 1 b, T = n Worst cse performnce Θ((n m + 1)m) m n/2 Θ(mn) Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 6 / 43
Pttern Mtching More sophisticted lgorithms Deterministic finite utomt (DFA) KMP, Boyer-Moore nd Rbin-Krp Do extr preprocessing on the pttern P We eliminte guesses bsed on completed mtches nd mismtches. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 7 / 43 String mtching with finite utomt There is string-mtching utomton for every pttern P. It is constructed from the pttern in preprocessing step before it cn be used to serch the text string. Exmple: Automton for the pttern P = bbc b,c b b b c 0 1 2 3 4 5 6 7 c b,c c b,c b,c c b Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 8 / 43
String mtching with finite utomt Let P the pttern to serch, of length m. Then where the sttes of the utomton re 0,..., m the trnsition function δ of the utomton is defined s follows, for stte q nd chrcter c in Σ: δ(q, c) = l(p[0..q 1]c), P[0..q 1]c is the conctention of P[0..q 1] nd c for string s, l(s) {0,..., m} is the length of the longest prefix of P tht is lso suffix of s. Grphiclly, this corresponds to q c δ(q, c) Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 9 / 43 String mtching with finite utomt Let T be the text string of length n, P the pttern to serch of length m nd δ the trnsition function of finite utomton for pttern P. FINITE-AUTOMATON-MATCHER(T, δ, m) n length[t ] q 0 for i 0 to n 1 do q δ(q, T [i]) if q = m then print Pttern occurs with shift i (m 1) Ide of proof: the stte fter reding T [i] is l(t [0..i]). Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 10 / 43
String mtching with finite utomt Mtching time on text string of length n is Θ(n) This does not include the preprocessing time required to compute the trnsition function δ. There exists n lgorithm with O(m Σ ) preprocessing time. Altogether, we cn find ll occurrences of length-m pttern in length-n text over finite lphbet Σ with O(m Σ ) preprocessing time nd Θ(n) mtching time. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 11 / 43 KMP Algorithm Knuth-Morris-Prtt lgorithm (1977) Compres the pttern to the text in left-to-right Shifts the pttern more intelligently thn the brute-force lgorithm When mismtch occurs, how much cn we shift the pttern (reusing knowledge from previous mtches)? T = b c d c b c??? b c d c b b c d c KMP Answer: this depends on the lrgest prefix of P[0..j] tht is suffix of P[1..j] Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 12 / 43
KMP Filure Arry T: b b c b c d P: b b c b wht next slide would mtch with the text? Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 13 / 43 KMP Filure Arry Suppose we hve mtch up to position T [i 1] = P[j 1], but not t the next position. Define F [j 1] s the index we will hve to check in P, fter we bring the pttern to its next possible position (previous exmple: j = 6, F [5] = 2). This cn be computed by trying ll sliding positions until finding the first one mtching the text (s in previous exmple). We cn do better: ny possible sliding position corresponds to prefix of P[0..j 1] tht is lso strict suffix of it = suffix of P[1..j 1] the next possible sliding position corresponds to the lrgest such prefix / suffix we let F [j 1] be the length of this prefix / suffix. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 14 / 43
KMP Filure Arry Schemticlly: T i P j 1 j F [j 1] F [j 1] slide no need to check for mtching with T compring with T strts from here Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 15 / 43 KMP Filure Arry F [0] = 0 F [j], for j > 0, is the length of the lrgest prefix of P[0..j] tht is lso suffix of P[1..j] Consider P = bcb j P[1..j] P F [j] 0 bcb 0 1 b bcb 0 2 b bcb 1 3 bc bcb 0 4 bc bcb 1 5 bcb bcb 2 6 bcb bcb 3 Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 16 / 43
Computing the Filure Arry filurearry(p) P: String of length m (pttern) 1. F [0] 0 2. i 1 3. j 0 4. while i < m do 5. if P[i] = P[j] then 6. F [i] j + 1 7. i i + 1 8. j j + 1 9. else if j > 0 then 10. j F [j 1] 11. else 12. F [i] 0 13. i i + 1 Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 17 / 43 KMP Algorithm KMP(T, P), to return the first mtch T : String of length n (text), P: String of length m (pttern) 1. F filurearry(p) 2. i 0 3. j 0 4. while i < n do 5. if T [i] = P[j] then 6. if j = m 1 then 7. return i j //mtch 8. else 9. i i + 1 10. j j + 1 11. else 12. if j > 0 then 13. j F [j 1] 14. else 15. i i + 1 16. return 1 // no mtch Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 18 / 43
KMP: Exmple P = bcb T = bxybcbbbbcb j 0 1 2 3 4 5 6 F [j] 0 0 1 0 1 2 3 0 1 2 3 4 5 6 7 8 9 10 11 b x y b c b b b c () b b c b () (b) Exercise: continue with T = bxybcbbbbcb Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 19 / 43 KMP: Anlysis filurearry At ech itertion of the while loop, t lest one of the following hppens: 1 i increses by one, or 2 the index i j increses by t lest one (F [j 1] < j) KMP There re no more thn 2m itertions of the while loop Running time: Θ(m) filurearry cn be computed in Θ(m) time At ech itertion of the while loop, t lest one of the following hppens: 1 i increses by one, or 2 the index i j increses by t lest one (F [j 1] < j) There re no more thn 2n itertions of the while loop Running time: Θ(n) Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 20 / 43
Boyer-Moore Algorithm Bsed on three key ides: Reverse-order serching: Compre P with subsequence of T moving bckwrds Bd chrcter jumps: When mismtch occurs t T [i] = c If P contins c, we cn shift P to lign the lst occurrence of c in P with T [i] Otherwise, we cn shift P to lign P[0] with T [i + 1] Good suffix jumps: If we hve lredy mtched suffix of P, then get mismtch, we cn shift P forwrd to lign with the previous occurence of tht suffix (with mismtch from the suffix we red). If none exists, look for the longest prefix of P tht is suffix of wht we red. Similr to filure rry in KMP. Cn skip lrge prts of T Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 21 / 43 Bd chrcter exmples P = l d o T = w h e r e i s w l d o o o l d o 6 comprisons (checks) P = m o o r e T = b o y e r m o o r e 7 comprisons (checks) e (r) e (m) o o r e Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 22 / 43
Good suffix exmples P = sells shells s h e i l s e l l s s h e l l s h e l l s s (e) (l) (l) (s) s h e l l s P = odetofood i l i k e f o o d f r o m m e x i c o o f o o d (e) (o) (d) d d Good suffix moves further thn bd chrcter for 2nd guess. Bd chrcter moves further thn good suffix for 3rd guess. This is out of rnge, so pttern not found. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 23 / 43 Lst-Occurrence Function Preprocess the pttern P nd the lphbet Σ Build the lst-occurrence function L mpping Σ to integers L(c) is defined s the lrgest index i such tht P[i] = c or 1 if no such index exists Exmple: Σ = {, b, c, d}, P = bcb c b c d L(c) 4 5 3-1 The lst-occurrence function cn be computed in time O(m + Σ ) In prctice, L is stored in size- Σ rry. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 24 / 43
Good Suffix rry Agin, we preprocess P to build tble. Suffix skip rry S of size m: for 0 i < m, S[i] is the lrgest index j such tht P[i + 1..m 1] = P[j + 1..j + m 1 i] nd P[j] P[i]. Note: in this clcultion, ny negtive indices re considered to mke the given condition true (these correspond to letters tht we might not hve checked yet). Similr to KMP filure rry, with n extr condition. Computed similrly to KMP filure rry in Θ(m) time. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 25 / 43 Good Suffix rry Exmple: P = bonobobo i 0 1 2 3 4 5 6 7 P[i] b o n o b o b o S[i] 6 5 4 3 2 1 2 6 Computed similrly to KMP filure rry in Θ(m) time. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 26 / 43
Boyer-Moore Algorithm boyer-moore(t,p) 1. L lst occurrence rry computed from P 2. S good suffix rry computed from P 3. i m 1, j m 1 4. while i < n nd j 0 do 5. if T [i] = P[j] then 6. i i 1 7. j j 1 8. else 9. i i + m 1 min(l[t [i]], S[j]) 10. j m 1 11. if j = 1 return i + 1 12. else return FAIL Exercise: Prove tht i j lwys increses on lines 9 10. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 27 / 43 Boyer-Moore lgorithm conclusion Worst-cse running time O(n + Σ ) This complexity is difficult to prove. Worst-cse running time O(nm) if we wnt to report ll occurrences On typicl English text the lgorithm probes pproximtely 25% of the chrcters in T Fster thn KMP in prctice on English text. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 28 / 43
Rbin-Krp Fingerprint Algorithm Ide: use hshing Compute hsh function for ech text position No explicit hsh tble: just compre with pttern hsh If mtch of the hsh vlue of the pttern nd text position found, then compres the pttern with the substring by nive pproch Exmple: Hsh tble size = 97 Serch Pttern P: 5 9 2 6 5 Serch Text T : 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 Hsh function: h(x) = x mod 97 nd h(p) = 95. 31415 mod 97 = 84 14159 mod 97 = 94 41592 mod 97 = 76 15926 mod 97 = 18 59265 mod 97 = 95 Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 29 / 43 Rbin-Krp Fingerprint Algorithm Gurnteeing correctness Need full compre on hsh mtch to gurd ginst collisions 59265 mod 97 = 95 59362 mod 97 = 95 Running time Hsh function depends on m chrcters Running time is Θ(mn) for serch miss (how cn we fix this?) Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 30 / 43
Rbin-Krp Fingerprint Algorithm The initil hshes re clled fingerprints. Rbin & Krp discovered wy to updte these fingerprints in constnt time. Ide: To go from the hsh of substring in the text string to the next hsh vlue only requires constnt time. Use previous hsh to compute next hsh O(1) time per hsh, except first one Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 31 / 43 Rbin-Krp Fingerprint Algorithm Exmple: Pre-compute: 10000 mod 97 = 9 Previous hsh: 41592 mod 97 = 76 Next hsh: 15926 mod 97 =?? Observtion: 15926 mod 97 = (41592 (4 10000 )) 10 + 6 = (76 (4 9 )) 10 + 6 = 406 = 18 Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 32 / 43
Rbin-Krp Fingerprint Algorithm Choose tble size t rndom to be huge prime Expected running time is O(m + n) Θ(mn) worst-cse, but this is (unbelievbly) unlikely Min dvntge: Extends to 2d ptterns nd other generliztions Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 33 / 43 Suffix Tries nd Suffix Trees Wht if we wnt to serch for mny ptterns P within the sme fixed text T? Ide: Preprocess the text T rther thn the pttern P Observtion: P is substring of T if nd only if P is prefix of some suffix of T. We will cll trie tht stores ll suffixes of text T suffix trie, nd the compressed suffix trie of T suffix tree. Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 34 / 43
Suffix Trees Build the suffix trie, i.e. the trie contining ll the suffixes of the text Build the suffix tree by compressing the trie bove (like in Ptrici trees) Store two indexes l, r on ech node v (both internl nodes nd leves) where node v corresponds to substring T [l..r] Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 35 / 43 Suffix Trie: Exmple T =bnnbn {bnnbn, nnbn, nnbn, nbn, nbn, bn, bn, n, n} b n b $ n $ n $ n b $ n $ n n b n $ b n b i 0 1 2 3 4 5 6 7 8 9 T [i] b n n b n $ $ b n n $ n $ $ Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 36 / 43
Suffix Tree (compressed suffix trie): Exmple T =bnnbn {bnnbn, nnbn, nnbn, nbn, nbn, bn, bn, n, n} Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 37 / 43 Suffix Trees: Pttern Mtching To serch for pttern P of length m: Similr to Serch in compressed trie with the difference tht we re looking for prefix mtch rther thn complete mtch If we rech lef with corresponding string length less thn m, then serch is unsuccessful Otherwise, we rech node v (lef or internl) with corresponding string length of t lest m It only suffices to check the first m chrcters ginst the substring of the text between indices of the node, to see if there indeed is mtch We cn then visit ll children of the node to report ll mtches Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 38 / 43
Suffix Tree: Exmple T = bnnbn P = n Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 39 / 43 Suffix Tree: Exmple T = bnnbn P = bn Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 40 / 43
Suffix Tree: Exmple T = bnnbn P = nn Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 41 / 43 Suffix Tree: Exmple T = bnnbn P = bbn not found Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 42 / 43
Pttern Mtching Conclusion Brute- Suffix DFA KMP BM RK Force trees O ( n 2) Preproc.: O (m Σ ) O (m) O (m + Σ ) O (m) ( O (n)) Serch O (n) Õ(n + m) O (nm) O (n) O (n) O (m) time: (often better) (expected) Extr O (m Σ ) O (m) O (m + Σ ) O (1) O (n) spce: Hque, Irvine, Smith (SCS, UW) CS240 - Module 9 Spring 2017 43 / 43