UMass Lowell Computer Science 91.503: Analysis of Algorithms
Prof. Karen Daniels, Spring 2012
Tuesday, 4/24/2012
String Matching Algorithms (Chapter 32*)
* Pseudocode uses 2nd edition conventions.
Chapter Dependencies
Dependency diagram: Automata, Ch 32 String Matching
You're responsible for material in Sections 32.1-32.4 of this chapter.
String Matching Algorithms: Motivation & Basics
String Matching Problem (Figure 32.1)
Motivations: text editing, pattern matching in DNA sequences
Text: array T[1..n]
Pattern: array P[1..m], with n ≥ m
Array element: character from a finite alphabet Σ
Pattern P occurs with shift s in T if P[1..m] = T[s+1..s+m], where 0 ≤ s ≤ n-m.
String Matching Algorithms: Worst-Case Execution Time
Text: array T[1..n]; Pattern: array P[1..m]

Algorithm            Preprocessing   Matching        Overall
Naive                0               O((n-m+1)m)     O((n-m+1)m)
Rabin-Karp           Θ(m)            O((n-m+1)m)     O((n-m+1)m)
Finite Automaton     O(m|Σ|)         Θ(n)            O(n + m|Σ|)
Knuth-Morris-Pratt   Θ(m)            Θ(n)            Θ(n + m)

Rabin-Karp is better than its worst-case bound on average and in practice.
Notation & Terminology
Σ* = set of all finite-length strings formed using characters from alphabet Σ
Empty string: ε
|x| = length of string x
w is a prefix of x: $w \sqsubset x$ (example: $ab \sqsubset abcca$)
w is a suffix of x: $w \sqsupset x$ (example: $cca \sqsupset abcca$)
The prefix and suffix relations are transitive.
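Overlapping Suffix Lemma (Lemma 32.1, illustrated in Figure 32.3)
Suppose x, y, and z are strings such that $x \sqsupset z$ and $y \sqsupset z$. If $|x| \le |y|$, then $x \sqsupset y$. If $|x| \ge |y|$, then $y \sqsupset x$. If $|x| = |y|$, then $x = y$.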
String Matching Algorithms: Naive Algorithm
Naive String Matching (Figure 32.4)
The test P[1..m] = T[s+1..s+m] is an implicit loop over up to m characters.
Worst-case running time is in Θ((n-m+1)m).
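The naive strategy can be sketched in Python as follows. This is a minimal illustration rather than the textbook pseudocode: the function name `naive_string_matcher` is an assumption, and strings are indexed from 0, so a returned shift s means the pattern equals T[s+1..s+m] in the slides' 1-based notation.

```python
def naive_string_matcher(text, pattern):
    """Return every shift s (0 <= s <= n - m) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # try all possible shifts
        # The slice comparison is the implicit loop: up to m character tests.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("abababacaba", "ababaca"))  # [2]
```

Each of the n-m+1 shifts may cost up to m comparisons, which is where the Θ((n-m+1)m) worst case comes from (for example, text a^n and pattern a^m).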
String Matching Algorithms: Rabin-Karp
Rabin-Karp Algorithm
Assume each character is a digit in radix-d notation (e.g. d = 10). Convert to a numeric representation for mod operations.
p = decimal value of the pattern
$t_s$ = decimal value of substring T[s+1..s+m], for s = 0, 1, ..., n-m
Strategy:
compute p in O(m) time (which is in O(n), since m ≤ n)
compute all $t_s$ values in a total of O(n) time
find all valid shifts s in O(n) time by comparing p with each $t_s$
Compute p in O(m) time using Horner's rule:
$p = P[m] + d(P[m-1] + d(P[m-2] + \cdots + d(P[2] + d\,P[1]) \cdots))$
Compute $t_0$ similarly from T[1..m] in O(m) time.
Compute the remaining $t_s$ values in O(n-m) time with a rolling window:
$t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]$
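Horner's rule and the rolling-window update can be sketched as below, without any modulus yet. The sketch assumes the text is given as a list of decimal digits; the helper names `horner_value` and `window_values` are hypothetical and chosen only for this illustration.

```python
def horner_value(digits, d=10):
    """Evaluate a digit sequence (most significant first) by Horner's rule."""
    value = 0
    for digit in digits:
        value = d * value + digit
    return value


def window_values(text_digits, m, d=10):
    """Return [t_0, ..., t_{n-m}] using the rolling-window update."""
    n = len(text_digits)
    if m < 1 or m > n:
        return []
    high = d ** (m - 1)                      # weight of the high-order digit
    t = horner_value(text_digits[:m], d)     # t_0 in O(m)
    values = [t]
    for s in range(n - m):
        # t_{s+1} = d * (t_s - d^(m-1) * T[s+1]) + T[s+m+1]
        # (0-based: T[s+1] is text_digits[s], T[s+m+1] is text_digits[s + m])
        t = d * (t - high * text_digits[s]) + text_digits[s + m]
        values.append(t)
    return values


print(horner_value([3, 1, 4, 1, 5]))            # 31415
print(window_values([3, 1, 4, 1, 5, 9, 2], 5))  # [31415, 14159, 41592]
```

Each update does a constant number of arithmetic operations, provided the numbers fit in a machine word; that proviso is what motivates working modulo q.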
Rabin-Karp Algorithm (Figure 32.5)
p and the $t_s$ values may be large, so compute them modulo a prime q.
Figure label: pattern match.
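Rabin-Karp Algorithm (continued)
Without the modulus: $t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]$
With the modulus: $t_{s+1} = (d(t_s - T[s+1]\,h) + T[s+m+1]) \bmod q$, where $h \equiv d^{m-1} \pmod{q}$   (32.2)
Example: p = 31415. The figure also marks a spurious hit.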
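The modular update (32.2) can be checked by hand on the slide's pattern p = 31415. The sketch below assumes d = 10 and q = 13; the modulus, the neighbouring window 14152, and the colliding window 67399 are illustrative choices, not values stated on the slide, and the helper name `roll` is hypothetical.

```python
def roll(t, old_digit, new_digit, d, h, q):
    """One application of (32.2): drop old_digit, append new_digit, reduce mod q."""
    return (d * (t - old_digit * h) + new_digit) % q


d, q, m = 10, 13, 5
h = pow(d, m - 1, q)           # d^(m-1) mod q, weight of the high-order digit
t0 = 31415 % q
t1 = roll(t0, 3, 2, d, h, q)   # slide the window from 31415 to 14152

print(h, t0, t1)               # 3 7 8
print(t1 == 14152 % q)         # True: the update agrees with direct reduction
print(67399 % q == 31415 % q)  # True: a different window with the same residue
```

The last line shows why a residue match alone is not enough: 67399 is not 31415, yet both reduce to 7 modulo 13, so an explicit comparison is still needed.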
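Rabin-Karp Algorithm (continued): annotations on RABIN-KARP-MATCHER
d is the radix; q is the modulus.
h is the high-order digit position for an m-digit window: Θ(m).
Preprocessing loop: Θ(m).
Matching loop tries all possible shifts: Θ((n-m+1)m).
Explicit comparison to rule out a spurious hit: Θ(m).
The test before updating $t_s$ is the stopping condition.
Matching loop invariant: whenever line 10 is executed, $t_s = T[s+1..s+m] \bmod q$.
Worst-case running time is in Θ((n-m+1)m).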
Rabin-Karp Algorithm (continued): average-case analysis
The annotations from the previous slide still apply: d is the radix, q is the modulus, preprocessing is Θ(m), the worst-case matching time is Θ((n-m+1)m), and whenever line 10 is executed, $t_s = T[s+1..s+m] \bmod q$.
Assume reducing mod q behaves like a random mapping from Σ* to $Z_q$ (Σ* is the set of all finite-length strings formed from Σ).
Estimate: the chance that $t_s \equiv p \pmod{q}$ is 1/q.
Expected number of spurious hits is in O(n/q).
Expected matching time = O(n) + O(m(v + n/q)), where v = number of valid shifts.
The O(n) term covers preprocessing plus $t_s$ updates; the O(m(v + n/q)) term is the time for explicit matching comparisons.
If v is in O(1) and q ≥ m, the average-case running time is in O(n+m).
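A compact Python sketch of the whole matcher follows, counting spurious hits so the analysis above can be observed directly. It is an illustration under assumptions, not the slide's pseudocode: the function name, the `value` parameter (which maps a character to its digit), the defaults d = 256 and q = 1000003, and the sample digit string are all choices made here.

```python
def rabin_karp_matcher(text, pattern, d=256, q=1000003, value=ord):
    """Return (valid_shifts, spurious_hits) for pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0
    h = pow(d, m - 1, q)                  # weight of the high-order digit, mod q
    p = t = 0
    for i in range(m):                    # preprocessing: Horner's rule mod q
        p = (d * p + value(pattern[i])) % q
        t = (d * t + value(text[i])) % q
    shifts, spurious = [], 0
    for s in range(n - m + 1):            # try all possible shifts
        if p == t:
            if text[s:s + m] == pattern:  # explicit check rules out spurious hits
                shifts.append(s)
            else:
                spurious += 1
        if s < n - m:                     # stopping condition for the update
            t = (d * (t - value(text[s]) * h) + value(text[s + m])) % q
    return shifts, spurious


# Small modulus on purpose, so that a collision is likely to show up.
print(rabin_karp_matcher("2359023141526739921", "31415", d=10, q=13, value=int))
# ([6], 1)
```

With q = 13 the window 67399 collides with the pattern, giving one spurious hit; with a large prime such as the default, spurious hits become rare, in line with the O(n/q) estimate.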
String Matching Algorithms: Finite Automata
Finite Automata (Figure 32.6; transition function δ)
Strategy: build an automaton for the pattern, then examine each text character exactly once.
Worst-case running time is in Θ(n) plus automaton creation time.
Finite Automata
Difference: our automaton will find all occurrences of the pattern.
String-Matching Automaton (Figure 32.7)
Pattern: P = ababaca
Absent arrows go to state 0.
The automaton accepts all strings ending in P, so it catches all matches.
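String-Matching Automaton
Suffix function for P: σ(x) = length of the longest prefix of P that is a suffix of x
$\sigma(x) = \max\{k : P_k \sqsupset x\}$   (32.3)
Transition function: $\delta(q, a) = \sigma(P_q a)$   (32.4)
Automaton's operational invariant (32.5): after reading $T_i$, the i-character prefix of T, the automaton is in state $\sigma(T_i)$. We will build up to the proof of this invariant.
At each step, the automaton keeps track of the longest pattern prefix that is a suffix of what has been read so far.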
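The suffix function can be computed directly from its definition. The brute-force sketch below is meant only to make σ concrete; the function name `suffix_function` and the sample strings are assumptions.

```python
def suffix_function(pattern, x):
    """sigma(x): length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):   # P_k is a suffix of x
            return k                  # k = 0 always qualifies (empty prefix)


print(suffix_function("ab", ""))       # 0
print(suffix_function("ab", "ccaca"))  # 1
print(suffix_function("ab", "ccab"))   # 2
```

Note that σ(x) = m exactly when the whole pattern is a suffix of x, which is the condition the automaton's accepting state encodes.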
String-Matching Automaton
Simulate the behavior of a string-matching automaton that finds occurrences of pattern P of length m in T[1..n].
We'll show that the automaton is in state $\sigma(T_i)$ after scanning character T[i]. Since $\sigma(T_i) = m$ iff $P \sqsupset T_i$, the machine is in accepting state m iff it has just scanned pattern P.
Assuming the automaton has already been created, the worst-case running time of matching is in Θ(n).
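The simulation step can be sketched as follows, assuming the transition function is supplied as a dictionary keyed by (state, character). The function name, the dictionary representation, and the hand-written table for the toy pattern "ab" are assumptions made for this example; missing entries are treated as transitions to state 0.

```python
def finite_automaton_matcher(text, delta, m):
    """Scan text once; report 0-based shifts at which accepting state m is reached."""
    q = 0
    shifts = []
    for i, ch in enumerate(text):
        q = delta.get((q, ch), 0)     # absent arrows go to state 0
        if q == m:                    # the automaton has just scanned the pattern
            shifts.append(i - m + 1)
    return shifts


# Transition table for the pattern "ab" (states 0, 1, 2), written out by hand.
delta_ab = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2, (2, "a"): 1}

print(finite_automaton_matcher("aabab", delta_ab, 2))  # [1, 3]
```

Each text character triggers exactly one table lookup, which gives the Θ(n) matching time once the table exists.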
String-Matching Automaton (continued)
Correctness of matching procedure (Theorem 32.4): the automaton keeps track of the longest pattern prefix that is a suffix of what has been read so far in the text.
The proof (board work) uses equation (32.4) and Lemma 32.3, $\sigma(xa) = \sigma(P_{\sigma(x)} a)$, which is to be proved next.
String-Matching Automaton (continued)
Correctness of matching procedure: Lemma 32.2 (Figure 32.8), to be used to prove Lemma 32.3.
Lemma 32.2 (suffix-function inequality): $\sigma(xa) \le \sigma(x) + 1$.
Figure label: $P_{\sigma(xa)}$.
String-Matching Automaton (continued)
Correctness of matching procedure: Lemma 32.3 (Figure 32.9); the proof uses Lemma 32.2 and Lemma 32.1.
Figure labels: $P_{\sigma(x)}$ and $P_{\sigma(xa)}$.
String-Matching Automaton (continued)
Correctness of the matching procedure is now established: Theorem 32.4 follows from equation (32.4) and Lemma 32.3, $\sigma(xa) = \sigma(P_{\sigma(x)} a)$.
String-Matching Automaton (continued)
The procedure COMPUTE-TRANSITION-FUNCTION computes the transition function δ from a given pattern P[1..m].
Worst-case running time of automaton creation is in O(m^3 |Σ|); this can be improved to O(m|Σ|).
Worst-case running time of the entire string-matching strategy is in O(m|Σ|) (automaton creation time) + O(n) (pattern matching time).
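A straightforward construction of δ in the O(m^3 |Σ|) style can be sketched as below: for each state q and character a, it searches downward for the largest k such that P_k is a suffix of P_q a. The function name, the dictionary representation, and the alphabet "abc" are assumptions for this example; it is not the improved O(m|Σ|) construction.

```python
def compute_transition_function(pattern, alphabet):
    """Return delta as a dict: delta[(q, a)] = sigma(P_q a)."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)                  # largest candidate prefix length
            # Decrease k until P_k is a suffix of P_q a (k = 0 always works).
            while not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


delta = compute_transition_function("ababaca", "abc")
for q in range(8):
    print(q, [delta[(q, a)] for a in "abc"])
```

There are (m+1)|Σ| table entries, each entry may try up to m+1 values of k, and each suffix test costs up to m character comparisons, which accounts for the cubic factor.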
String Matching Algorithms: Knuth-Morris-Pratt
Knuth-Morris-Pratt Overview
Achieve Θ(n+m) time by shortening automaton preprocessing time below O(m|Σ|).
Approach:
don't precompute the automaton's transition function
calculate enough transition data on the fly
obtain the data via alphabet-independent pattern preprocessing
pattern preprocessing compares the pattern against shifts of itself
Use amortization for the running time calculation.
Knuth-Morris-Pratt Algorithm (Figure 32.10)
Determine how the pattern matches against itself.
Knuth-Morris-Pratt Algorithm (see 32.6)
Equivalently, what is the largest k < q such that $P_k \sqsupset P_q$?
The prefix function π shows how the pattern matches against itself:
$\pi(q) = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$
π(q) is the length of the longest prefix of P that is a proper suffix of $P_q$.
Example: (figure)
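To make the definition concrete, the prefix function can be evaluated by brute force straight from the max formula. This sketch is deliberately naive and is not the efficient preprocessing routine; the function name is an assumption, and the pattern ababaca is reused from the automaton slide.

```python
def prefix_function_by_definition(pattern):
    """Return [pi(1), ..., pi(m)] using pi(q) = max{k : k < q and P_k is a suffix of P_q}."""
    m = len(pattern)
    result = []
    for q in range(1, m + 1):
        p_q = pattern[:q]
        # k = 0 always qualifies, so the max is taken over a non-empty set.
        result.append(max(k for k in range(q) if p_q.endswith(pattern[:k])))
    return result


print(prefix_function_by_definition("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

For instance, π(5) = 3 because aba is the longest prefix of the pattern that is also a proper suffix of ababa.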
Knuth-Morris-Pratt Algorithm: annotations on KMP-MATCHER
Somewhat similar in structure to FINITE-AUTOMATON-MATCHER.
Overall running time: Θ(m+n).
Prefix function computation: Θ(m), using amortized analysis (see next slide).
Matching loop: Θ(n), using amortized analysis*.
q: number of characters matched.
Scan the text left to right.
Annotations on the loop body: next character does not match; next character matches; is all of P matched?; look for the next match.
* The 2nd edition uses a potential function with Φ = q. The 3rd edition uses aggregate analysis.
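A Python sketch of the matcher, with the loop-body annotations attached as comments, is given below. It uses 0-based lists, so `pi[i]` holds the prefix-function value for the prefix of length i+1; the function names and this indexing are choices made for this example rather than the slide's 1-based pseudocode.

```python
def compute_prefix_function(pattern):
    """pi[i] = length of the longest proper prefix-suffix of pattern[:i + 1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text, pattern):
    """Return all 0-based shifts at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                                   # number of characters matched
    for i, ch in enumerate(text):           # scan text left to right
        while q > 0 and pattern[q] != ch:   # next character does not match
            q = pi[q - 1]
        if pattern[q] == ch:                # next character matches
            q += 1
        if q == m:                          # is all of P matched?
            shifts.append(i - m + 1)
            q = pi[q - 1]                   # look for the next match
    return shifts


print(kmp_matcher("abababacaba", "ababaca"))  # [2]
```

The text index never moves backward; only q falls back through π, which is what allows the Θ(n) amortized bound for the matching loop.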
Knuth-Morris-Pratt Algorithm: amortized analysis of COMPUTE-PREFIX-FUNCTION
Similar in structure to KMP-MATCHER. Total running time: Θ(m).
Potential method (2nd edition): Φ = k, where k represents the current state of the algorithm.
The initial potential value is set when k is initialized.
The potential decreases each time the while loop sets k to π[k].
The potential is never negative, since π[k] ≥ 0 for all k.
The potential increases by at most 1 in each execution of the for loop body.
The amortized cost of the loop body is in O(1), and there are Θ(m) loop iterations.
The 3rd edition uses aggregate analysis to show that the while loop executes O(m) times overall.
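The amortized argument can be observed by counting how often the while loop actually runs. The sketch below instruments a 0-based version of the prefix-function computation; the function name and the `fallbacks` counter are additions for illustration only.

```python
def prefix_function_with_count(pattern):
    """Return (pi, fallbacks), where fallbacks counts while-loop iterations."""
    m = len(pattern)
    pi = [0] * m
    k = 0                      # potential: starts at 0, never negative
    fallbacks = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]      # potential strictly decreases
            fallbacks += 1
        if pattern[k] == pattern[q]:
            k += 1             # potential increases by at most 1 per iteration
        pi[q] = k
    return pi, fallbacks


for pat in ["ababaca", "aaaab", "abcabcabd"]:
    pi, fallbacks = prefix_function_with_count(pat)
    print(pat, pi, fallbacks, fallbacks <= len(pat) - 1)
```

Since k rises by at most 1 in each of the m-1 for-loop iterations and every while-loop iteration lowers it by at least 1, the total number of fallbacks cannot exceed m-1.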
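Knuth-Morris-Pratt Algorithm: Correctness
Iterated prefix function: $\pi^*[q] = \{\pi[q], \pi^{(2)}[q], \pi^{(3)}[q], \ldots, \pi^{(t)}[q]\}$, where $\pi^{(0)}[q] = q$, $\pi^{(i)}[q] = \pi[\pi^{(i-1)}[q]]$ for $i \ge 1$, and the sequence stops upon reaching $\pi^{(t)}[q] = 0$.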
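The iterated prefix function can be explored numerically. The sketch below assumes the usual reading of π*[q] as the set obtained by applying π repeatedly to q until 0 is reached, and compares it with the set of all k < q for which P_k is a suffix of P_q. The function names are hypothetical, and `pi` is stored 1-based (with `pi[0]` unused) to match the slides.

```python
def prefix_function(pattern):
    """1-based prefix function: pi[q] for q = 1..m, with pi[0] unused."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:   # P[k+1] != P[q]
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def iterated_prefix(pi, q):
    """pi*[q]: the values pi[q], pi[pi[q]], ... down to and including 0."""
    result = set()
    while q > 0:
        q = pi[q]
        result.add(q)
    return result


def suffix_prefix_lengths(pattern, q):
    """{k : k < q and P_k is a suffix of P_q}, by brute force."""
    return {k for k in range(q) if pattern[:q].endswith(pattern[:k])}


pattern = "ababaca"
pi = prefix_function(pattern)
for q in range(1, len(pattern) + 1):
    print(q, sorted(iterated_prefix(pi, q)), sorted(suffix_prefix_lengths(pattern, q)))
```

On this pattern the two sets agree for every q (for q = 5 both are {0, 1, 3}), which is a spot check of one example and not a proof.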
StringMatch: Correctness of Compute-Prefix-Function (source: Textbook and Prof. Pecelli, 4/23/2012)
This is nontrivial.
Lemma 32.5 (Prefix-function iteration lemma). Let P be a pattern of length m with prefix function π. Then, for q = 1, 2, ..., m, we have $\pi^*[q] = \{k : k < q \text{ and } P_k \sqsupset P_q\}$.
Proof (using $\sqsupset$ for the suffix relation).
1. $\pi^*[q] \subseteq \{k : k < q \text{ and } P_k \sqsupset P_q\}$. Let $i = \pi^{(u)}[q]$ for some u > 0. We prove the inclusion by induction on u. For u = 1, we have $i = \pi[q]$, and the claim follows since $i < q$ and $P_{\pi[q]} \sqsupset P_q$. Assume the inclusion holds for $i = \pi^{(u)}[q]$. We need to prove it for $i = \pi^{(u+1)}[q] = \pi[\pi^{(u)}[q]]$. By the definition of π, $i < \pi^{(u)}[q]$ and $P_i \sqsupset P_{\pi^{(u)}[q]}$.
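By the induction assumption, $\pi^{(u)}[q] < q$ and $P_{\pi^{(u)}[q]} \sqsupset P_q$. Transitivity of the relations gives $i < q$ and $P_i \sqsupset P_q$, as desired.
2. $\{k : k < q \text{ and } P_k \sqsupset P_q\} \subseteq \pi^*[q]$. By contradiction. Suppose, to the contrary, that there is an integer in $\{k : k < q \text{ and } P_k \sqsupset P_q\} - \pi^*[q]$, and let j denote the largest such integer. Since $\pi[q]$ is the largest value in $\{k : k < q \text{ and } P_k \sqsupset P_q\}$, and $\pi[q] \in \pi^*[q]$, we must have $j < \pi[q]$. Let $j'$ denote the smallest integer in $\pi^*[q]$ such that $j' > j$. Now $j \in \{k : k < q \text{ and } P_k \sqsupset P_q\}$ implies $P_j \sqsupset P_q$, and $j' \in \pi^*[q]$ implies $P_{j'} \sqsupset P_q$. Lemma 32.1 (Overlapping Suffix) implies that $P_j \sqsupset P_{j'}$, and j is the largest value less than $j'$ with this property. (Any larger such value k would satisfy $P_k \sqsupset P_q$ by transitivity, so by the choice of j it would lie in $\pi^*[q]$, contradicting the choice of $j'$.)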
This, in turn, forces the conclusion that $\pi[j'] = j$ and, since $j' \in \pi^*[q]$, we must have $j \in \pi^*[q]$. Contradiction.
We now continue with another lemma. Since $\pi[1] = 0$, it is clear that line 2 of Compute-Prefix-Function provides the correct value. We need to extend this statement to all q > 1.
Lemma 32.6. Let P = P[1..m], and let π be the prefix function for P. For q = 1, 2, ..., m, if $\pi[q] > 0$, then $\pi[q] - 1 \in \pi^*[q-1]$.
Proof. If $r = \pi[q] > 0$, then $r < q$ and $P_r \sqsupset P_q$. Thus $r - 1 < q - 1$ and $P_{r-1} \sqsupset P_{q-1}$ (by dropping the last characters from $P_r$ and $P_q$). Lemma 32.5 implies that $\pi[q] - 1 = r - 1 \in \pi^*[q-1]$.
We now introduce a new set: for q = 2, 3, ..., m, define $E_{q-1} \subseteq \pi^*[q-1]$ by
$E_{q-1} = \{k \in \pi^*[q-1] : P[k+1] = P[q]\}$
$= \{k : k < q-1 \text{ and } P_k \sqsupset P_{q-1} \text{ and } P[k+1] = P[q]\}$
$= \{k : k < q-1 \text{ and } P_{k+1} \sqsupset P_q\}$.
In other words, $E_{q-1}$ consists of the values $k < q-1$ for which $P_k \sqsupset P_{q-1}$ and for which $P_{k+1} \sqsupset P_q$, because $P[k+1] = P[q]$. Equivalently, $E_{q-1}$ consists of those values $k \in \pi^*[q-1]$ for which we can extend $P_k$ to $P_{k+1}$ and still get a proper suffix of $P_q$.
Corollary 32.7. Let P be a pattern of length m, and let π be the prefix function for P. For q = 2, 3, ..., m,
$\pi[q] = \begin{cases} 0 & \text{if } E_{q-1} = \emptyset, \\ 1 + \max\{k \in E_{q-1}\} & \text{if } E_{q-1} \ne \emptyset. \end{cases}$
Proof.
Case 1: $E_{q-1}$ is empty. There is no $k \in \pi^*[q-1]$ (including k = 0) for which we can extend $P_k$ to $P_{k+1}$ and get a proper suffix of $P_q$. Thus $\pi[q] = 0$.
Case 2: $E_{q-1}$ is not empty.
1. Prove $\pi[q] \ge 1 + \max\{k \in E_{q-1}\}$. For each $k \in E_{q-1}$ we have $k + 1 < q$ and $P_{k+1} \sqsupset P_q$. The definition of $\pi[q]$ gives the inequality.
2. Prove that $\pi[q] \le 1 + \max\{k \in E_{q-1}\}$. Since $E_{q-1}$ is non-empty, $\pi[q] > 0$. Let $r = \pi[q] - 1$, hence $r + 1 = \pi[q]$. Since $r + 1 > 0$ and $P_{r+1} \sqsupset P_q$, we have $P[r+1] = P[q]$. By Lemma 32.6 we also have $r = \pi[q] - 1 \in \pi^*[q-1]$. Therefore $r \in E_{q-1}$, which implies $r \le \max\{k \in E_{q-1}\}$ and, immediately, the desired inequality.
Combining both inequalities, we have the result.
Now glue all these results together to obtain a proof of correctness.
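Corollary 32.7 can be spot-checked on a concrete pattern by building $E_{q-1}$ explicitly and comparing the formula with the computed prefix function. The sketch below uses a 1-based `pi` list with `pi[0]` unused; the function names and the test pattern are assumptions, and a passing check on one pattern illustrates the statement without proving it.

```python
def prefix_function(pattern):
    """1-based prefix function: pi[q] for q = 1..m, with pi[0] unused."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:   # P[k+1] != P[q]
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def iterated_prefix(pi, q):
    """pi*[q]: the values pi[q], pi[pi[q]], ... down to and including 0."""
    result = set()
    while q > 0:
        q = pi[q]
        result.add(q)
    return result


def extension_set(pattern, pi, q):
    """E_{q-1} = {k in pi*[q-1] : P[k+1] = P[q]} for 1-based q >= 2."""
    # In 0-based Python indexing, P[k+1] is pattern[k] and P[q] is pattern[q - 1].
    return {k for k in iterated_prefix(pi, q - 1) if pattern[k] == pattern[q - 1]}


def corollary_value(pattern, pi, q):
    """Right-hand side of Corollary 32.7."""
    e = extension_set(pattern, pi, q)
    return 1 + max(e) if e else 0


pattern = "ababaca"
pi = prefix_function(pattern)
for q in range(2, len(pattern) + 1):
    print(q, sorted(extension_set(pattern, pi, q)), corollary_value(pattern, pi, q), pi[q])
```

For q = 6 the set is empty (no candidate can be extended by the character c), so the formula gives 0; for q = 5 the set is {0, 2}, giving 1 + 2 = 3, which matches π[5].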