Analysis of Algorithms Prof. Karen Daniels


UMass Lowell Computer Science 91.503, Analysis of Algorithms, Prof. Karen Daniels, Spring 2012. Tuesday, 4/24/2012. String Matching Algorithms, Chapter 32* (*pseudocode uses 2nd edition conventions).

Chapter Dependencies: Chapter 32 (String Matching) builds on the automata background. You're responsible for the material in Sections 32.1-32.4 of this chapter.

String Matching Algorithms: Motivation & Basics

String Matching Problem. Motivations: text editing, pattern matching in DNA sequences (Figure 32.1). Text: array T[1..n]. Pattern: array P[1..m], with m ≤ n. Array element: a character from a finite alphabet Σ. Pattern P occurs with shift s in T if P[1..m] = T[s+1..s+m], where 0 ≤ s ≤ n-m.

String Matching Algorithms: Worst-Case Execution Time (Text: array T[1..n]; Pattern: array P[1..m])

Naive: preprocessing 0; matching O((n-m+1)m); overall O((n-m+1)m).
Rabin-Karp: preprocessing Θ(m); matching O((n-m+1)m); overall O((n-m+1)m) (better than this on average and in practice).
Finite Automaton: preprocessing O(m|Σ|); matching Θ(n); overall O(n + m|Σ|).
Knuth-Morris-Pratt: preprocessing Θ(m); matching Θ(n); overall Θ(n + m).

Notation & Terminology. Σ* = the set of all finite-length strings formed using characters from the alphabet Σ. Empty string: ε. |x| = length of string x. w is a prefix of x: w ⊏ x (e.g., ab ⊏ abcca). w is a suffix of x: w ⊐ x (e.g., cca ⊐ abcca). The prefix and suffix relations are transitive.

Overlapping Suffix Lemma (Lemma 32.1, illustrated in Figure 32.3): suppose x, y, and z are strings such that x ⊐ z and y ⊐ z. If |x| ≤ |y|, then x ⊐ y; if |x| ≥ |y|, then y ⊐ x; if |x| = |y|, then x = y.

String Matching Algorithms: Naive Algorithm

Naive String Matching. NAIVE-STRING-MATCHER (Figure 32.4): the test of whether the pattern matches at a given shift hides an implicit loop over the m pattern characters, so the worst-case running time is in Θ((n-m+1)m).
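As a quick illustration, here is a minimal Python sketch of the naive strategy (the function name and 0-based shifts are my own; the book's pseudocode is 1-based):

```python
def naive_string_matcher(T, P):
    """Report every shift s (0-based) at which P occurs in T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):          # try all n - m + 1 shifts
        if T[s:s + m] == P:             # implicit loop over m characters
            shifts.append(s)
    return shifts

# Example: P = "aab" occurs in T = "acaabc" with shift 2 (0-based).
print(naive_string_matcher("acaabc", "aab"))  # [2]
```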

String Matching Algorithms: Rabin-Karp

Rabin-Karp Algorithm. Assume each character is a digit in radix-d notation (e.g., d = 10), so strings can be converted to numeric values for mod operations. Let p = decimal value of the pattern and t_s = decimal value of the substring T[s+1..s+m], for s = 0, 1, ..., n-m. Strategy: compute p in O(m) time, compute all t_s values in a total of O(n) time, and find all valid shifts s in O(n) time by comparing p with each t_s. Compute p in O(m) time using Horner's rule: p = P[m] + d(P[m-1] + d(P[m-2] + ... + d(P[2] + dP[1]))). Compute t_0 similarly from T[1..m] in O(m) time. Compute the remaining t_s values in O(n-m) time with the rolling-window recurrence t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1].
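A small Python sketch of this preprocessing, assuming the text and pattern are strings of decimal digits (function names are mine):

```python
def decimal_value(S, d=10):
    """Horner's rule: numeric value of digit string S in radix d, in O(len(S))."""
    p = 0
    for c in S:
        p = d * p + int(c)
    return p

def rolling_values(T, m, d=10):
    """All window values t_s for T[s:s+m], computed in O(n) total."""
    h = d ** (m - 1)                    # weight of the high-order digit
    t = decimal_value(T[:m], d)
    values = [t]
    for s in range(len(T) - m):
        # drop T[s], shift the window left, bring in T[s+m]
        t = d * (t - h * int(T[s])) + int(T[s + m])
        values.append(t)
    return values

print(rolling_values("2359023141526739921", 5)[:3])  # [23590, 35902, 59023]
```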

Rabin-Karp Algorithm. The values p and t_s may be large, so compute them modulo a prime q. (Figure 32.5 illustrates a pattern match.)

Rabin-Karp Algorithm (continued). Working mod q, the rolling update becomes t_{s+1} = (d(t_s - T[s+1]·h) + T[s+m+1]) mod q, where h = d^{m-1} mod q (equation 32.2). In the figure's example, p = 31415; a window whose value agrees with p modulo q but is not an actual match is a spurious hit.

Rabin-Karp Algorithm (continued)

Rabin-Karp Algorithm (continued). In RABIN-KARP-MATCHER, d is the radix and q is the modulus; d^{m-1} mod q is the weight of the high-order digit position of an m-digit window. Preprocessing (computing p and t_0) takes Θ(m). The matching loop tries all possible shifts; its invariant is that whenever line 10 is executed, t_s = T[s+1..s+m] mod q. Each candidate shift is verified character by character in Θ(m) time to rule out spurious hits, so the worst-case running time is in Θ((n-m+1)m).
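Putting the pieces together, a minimal Python sketch of the matcher under these assumptions (digit-string text, radix d, prime modulus q; names and 0-based shifts are mine):

```python
def rabin_karp_matcher(T, P, d=10, q=101):
    """Report 0-based shifts where P occurs in T; T and P are digit strings."""
    n, m = len(T), len(P)
    h = pow(d, m - 1, q)                # high-order digit weight, mod q
    p = t = 0
    for i in range(m):                  # preprocessing: p and t_0, Theta(m)
        p = (d * p + int(P[i])) % q
        t = (d * t + int(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and T[s:s + m] == P:  # explicit check rules out spurious hits
            shifts.append(s)
        if s < n - m:                   # rolling update, equation (32.2)
            t = (d * (t - int(T[s]) * h) + int(T[s + m])) % q
    return shifts

print(rabin_karp_matcher("2359023141526739921", "31415"))  # [6]
```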

Rabin-Karp Algorithm (continued). Average-case analysis: assume that reducing mod q acts like a random mapping from Σ* (the set of all finite-length strings over Σ) to Z_q. Then the chance that t_s = p (mod q) at any given shift is estimated as 1/q, so the expected number of spurious hits is O(n/q). Expected matching time = O(n) + O(m(v + n/q)), where v = number of valid shifts; the first term covers preprocessing and the t_s updates, the second the explicit matching comparisons. If v is in O(1) and q ≥ m, the average-case running time is in O(n + m).

String Matching Algorithms: Finite Automata

Finite Automata (transition function δ; Figure 32.6). Strategy: build an automaton for the pattern, then examine each text character exactly once. Worst-case matching time is in Θ(n), plus the automaton creation time.

Finite Automata. Difference: our automaton will find all occurrences of the pattern.

String-Matching Automaton. Pattern P = ababaca (Figure 32.7). Arrows not shown go to state 0. The automaton accepts all strings ending in P, so it catches all matches.

String-Matching Automaton. Suffix function for P: σ(x) = length of the longest prefix of P that is a suffix of x; formally, σ(x) = max{k : P_k ⊐ x} (32.3). We will build up to the proof of the key identity σ(xa) = σ(P_{σ(x)} a) (32.4). Automaton's operational invariant (32.5): after reading the i-character prefix T_i of the text, the automaton keeps track of the longest pattern prefix that is a suffix of what has been read so far.

String-Matching Automaton. FINITE-AUTOMATON-MATCHER simulates the behavior of the string-matching automaton that finds occurrences of a pattern P of length m in T[1..n]. We'll show the automaton is in state σ(T_i) after scanning character T[i]. Since σ(T_i) = m iff P ⊐ T_i, the machine is in accepting state m iff it has just scanned the pattern P. Assuming the automaton has already been created, the worst-case running time of matching is in Θ(n).
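A minimal Python sketch of the matching loop (my own naming, 0-based shifts); the transition table here is written out by hand for the short pattern "aa" over the alphabet {a, b}, just to make the sketch self-contained:

```python
def finite_automaton_matcher(T, delta, m):
    """Follow the automaton once per text character: Theta(n) matching time."""
    q, shifts = 0, []
    for i, c in enumerate(T):
        q = delta[q][c]                 # one table lookup per character
        if q == m:                      # accepting state: pattern just scanned
            shifts.append(i - m + 1)
    return shifts

# Hand-built table for P = "aa" over {a, b}: delta[q][a] = sigma(P_q a).
delta = [
    {"a": 1, "b": 0},                   # state 0
    {"a": 2, "b": 0},                   # state 1
    {"a": 2, "b": 0},                   # state 2 (accepting)
]
print(finite_automaton_matcher("baaab", delta, 2))  # [1, 2]
```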

String-Matching Automaton (continued). Correctness of the matching procedure (Theorem 32.4): the automaton keeps track of the longest pattern prefix that is a suffix of what has been read so far in the text. The argument rests on Lemma 32.3 and the identity σ(xa) = σ(P_{σ(x)} a) (32.4), to be proved next (board work).

String-Matching Automaton (continued). Correctness of the matching procedure: Lemma 32.2 (suffix-function inequality, σ(xa) ≤ σ(x) + 1; Figure 32.8 depicts P_{σ(xa)}) will be used to prove Lemma 32.3.

String-Matching Automaton (continued). Correctness of the matching procedure: Lemma 32.3 (suffix-function recursion lemma) states that if σ(x) = q, then σ(xa) = σ(P_q a). Its proof uses Lemma 32.2 and Lemma 32.1; Figure 32.9 depicts P_{σ(x)} and P_{σ(xa)}.

String-Matching Automaton (continued). Correctness of the matching procedure is now established: Theorem 32.4 follows from Lemma 32.3 and the identity σ(xa) = σ(P_{σ(x)} a) (32.4).

String-Matching Automaton (continued). COMPUTE-TRANSITION-FUNCTION computes the transition function δ from a given pattern P[1..m]. The worst-case running time of automaton creation is in O(m³|Σ|), which can be improved to O(m|Σ|). The worst-case running time of the entire string-matching strategy is therefore O(m|Σ|) (automaton creation time) + O(n) (pattern matching time).
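A straightforward Python sketch of this construction, computing δ(q, a) = σ(P_q a) directly from the definition (this is the O(m³|Σ|) version; the names are mine, and the returned table plugs into the matching-loop sketch above):

```python
def compute_transition_function(P, alphabet):
    """Straightforward O(m^3 |Sigma|) construction: delta[q][a] = sigma(P_q a)."""
    m = len(P)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            # sigma(P_q a): largest k such that P_k is a suffix of P[:q] + a
            k = min(m, q + 1)
            s = P[:q] + a
            while k > 0 and P[:k] != s[len(s) - k:]:
                k -= 1
            delta[q][a] = k
    return delta

delta = compute_transition_function("ababaca", "abc")
print(delta[5])   # transitions out of state 5: {'a': 1, 'b': 4, 'c': 6}
```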

String Matching Algorithms: Knuth-Morris-Pratt

Knuth-Morris-Pratt Overview. Achieve Θ(n + m) time by shortening the automaton preprocessing time below O(m|Σ|). Approach: don't precompute the automaton's transition function; instead, calculate just enough transition data on the fly, obtained via alphabet-independent pattern preprocessing that compares the pattern against shifts of itself. Use amortization for the running-time calculation.

Knuth-Morris-Pratt Algorithm: determine how the pattern matches against shifts of itself (Figure 32.10).

Knuth-Morris-Pratt Algorithm. The prefix function π shows how the pattern matches against itself: π(q) = max{k : k < q and P_k ⊐ P_q} (32.6). Equivalently: what is the largest k < q such that P_k is a suffix of P_q? That is, π(q) is the length of the longest prefix of P that is a proper suffix of P_q. Example (P = ababaca): π(1..7) = 0, 0, 1, 2, 3, 0, 1.
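A Python sketch of COMPUTE-PREFIX-FUNCTION (0-based string indexing, 1-based q to mirror the pseudocode; the array name pi is mine):

```python
def compute_prefix_function(P):
    """pi[q] = length of the longest proper prefix of P[:q] that is also its suffix."""
    m = len(P)
    pi = [0] * (m + 1)                  # pi[0] unused; q runs 1..m as in the pseudocode
    k = 0
    for q in range(2, m + 1):
        while k > 0 and P[k] != P[q - 1]:
            k = pi[k]                   # fall back to the next-shorter border
        if P[k] == P[q - 1]:
            k += 1
        pi[q] = k
    return pi

print(compute_prefix_function("ababaca")[1:])  # [0, 0, 1, 2, 3, 0, 1]
```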

Knuth-Morris-Pratt Algorithm. KMP-MATCHER is somewhat similar in structure to FINITE-AUTOMATON-MATCHER. Total running time is Θ(m + n): Θ(m) for COMPUTE-PREFIX-FUNCTION plus Θ(n) for the matching loop, using amortized analysis* (see next slide). In the matching loop, q counts the number of characters matched as the text is scanned left to right; when the next character does not match, fall back via π; when it matches, increment q; when all of P is matched, report the shift and use π to look for the next match. (*The 2nd edition uses a potential function with Φ = q; the 3rd edition uses aggregate analysis.)
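A self-contained Python sketch of the matcher (it recomputes the prefix function inline so it runs on its own; names and 0-based shifts are mine):

```python
def kmp_matcher(T, P):
    """Report 0-based shifts where P occurs in T, in Theta(n + m) time."""
    n, m = len(T), len(P)
    pi, k = [0] * (m + 1), 0            # prefix function, as in the sketch above
    for q in range(2, m + 1):
        while k > 0 and P[k] != P[q - 1]:
            k = pi[k]
        if P[k] == P[q - 1]:
            k += 1
        pi[q] = k
    shifts, q = [], 0                   # q = number of characters matched so far
    for i in range(n):                  # scan the text left to right
        while q > 0 and P[q] != T[i]:
            q = pi[q]                   # mismatch: fall back via the prefix function
        if P[q] == T[i]:
            q += 1                      # next character matches
        if q == m:                      # all of P matched
            shifts.append(i - m + 1)
            q = pi[q]                   # look for the next (possibly overlapping) match
    return shifts

print(kmp_matcher("bacbababaabcbab", "ababa"))  # [4]
```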

Knuth-Morris-Pratt Algorithm: Amortized Analysis (Potential Method). For COMPUTE-PREFIX-FUNCTION, which is similar in structure to KMP-MATCHER, use the potential Φ = k, where k represents the current state of the algorithm. The initial potential value is 0, and the potential is never negative since π(k) ≥ 0 for all k. Each execution of the while-loop body decreases the potential, and each execution of the for-loop body increases it by at most 1, so the amortized cost of the loop body is in O(1). With Θ(m) loop iterations, the total time is Θ(m). (This is the 2nd-edition argument; the 3rd edition uses aggregate analysis to show the while loop executes O(m) times overall.)
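A compact version of the potential argument, written as a sketch; t_q below denotes the number of while-loop iterations in the q-th for-loop iteration (my notation), and c is the constant cost of the rest of the loop body:

```latex
% Potential method for COMPUTE-PREFIX-FUNCTION, with \Phi = k.
% Each while-loop iteration sets k <- \pi[k] < k, so k drops by at least 1
% per iteration; k then rises by at most 1 at the end of the for-loop body.
\begin{align*}
  \text{actual cost}_q &\le t_q + c, \\
  \Delta\Phi_q \;=\; k_{\text{after}} - k_{\text{before}} &\le 1 - t_q, \\
  \text{amortized cost}_q \;=\; \text{actual cost}_q + \Delta\Phi_q &\le c + 1 \;=\; O(1).
\end{align*}
% Summing over the \Theta(m) for-loop iterations, and using \Phi \ge 0 = \Phi_{\text{init}},
% the total running time of COMPUTE-PREFIX-FUNCTION is \Theta(m).
```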

Knuth-Morris-Pratt Algorithm: Correctness. Iterated prefix function: π*[q] = {π[q], π^(2)[q], π^(3)[q], ..., π^(t)[q]}, where π^(i)[q] is π applied i times to q and π^(t)[q] = 0.

Knuth-Morris-Pratt Algorithm: Correctness (continued).

StringMatch: Correctness of COMPUTE-PREFIX-FUNCTION. This is nontrivial. Lemma 32.5 (Prefix-function iteration lemma). Let P be a pattern of length m with prefix function π. Then, for q = 1, 2, ..., m, we have π*[q] = {k : k < q and P_k ⊐ P_q}. (Here ⊐ denotes "is a suffix of".)

Proof. (1) π*[q] ⊆ {k : k < q and P_k ⊐ P_q}. Let i = π^(u)[q] for some u > 0. We prove the inclusion by induction on u. For u = 1, we have i = π[q], and the claim follows since i < q and P_{π[q]} ⊐ P_q. Assume the inclusion holds for i = π^(u)[q]; we need to prove it for i = π^(u+1)[q] = π[π^(u)[q]]. By definition of π, i < π^(u)[q] and P_i ⊐ P_{π^(u)[q]}. (Source: textbook and Prof. Pecelli.)

StringMatch. By the induction assumption, P_{π^(u)[q]} ⊐ P_q. Transitivity of the suffix relation gives P_i ⊐ P_q, as desired.

(2) {k : k < q and P_k ⊐ P_q} ⊆ π*[q]. By contradiction. Suppose, to the contrary, that there is an integer in {k : k < q and P_k ⊐ P_q} - π*[q], and let j denote the largest such integer. Since π[q] is the largest value in {k : k < q and P_k ⊐ P_q}, and π[q] ∈ π*[q], we must have j < π[q]. Let j' denote the smallest integer in π*[q] such that j' > j. Now j ∈ {k : k < q and P_k ⊐ P_q} implies P_j ⊐ P_q, and j' ∈ π*[q] implies P_{j'} ⊐ P_q. Lemma 32.1 (Overlapping Suffix) then implies that P_j ⊐ P_{j'}, and j is the largest value less than j' with this property.

StringMatch. This, in turn, forces the conclusion that π[j'] = j and, since j' ∈ π*[q], we must have j ∈ π*[q]. Contradiction. ∎

We now continue with another lemma. It is clear that, since π[1] = 0, line 2 of COMPUTE-PREFIX-FUNCTION provides the correct value. We need to extend this statement to all q > 1.

StringMatch. Lemma 32.6. Let P = P[1..m], and let π be the prefix function for P. For q = 1, 2, ..., m, if π[q] > 0, then π[q] - 1 ∈ π*[q-1].

Proof. If r = π[q] > 0, then r < q and P_r ⊐ P_q. Thus r - 1 < q - 1 and P_{r-1} ⊐ P_{q-1} (by dropping the last characters from P_r and P_q). Lemma 32.5 then implies that π[q] - 1 = r - 1 ∈ π*[q-1]. ∎

StringMatch. We now introduce a new set: for q = 2, 3, ..., m, define E_{q-1} ⊆ π*[q-1] by
E_{q-1} = {k ∈ π*[q-1] : P[k+1] = P[q]}
        = {k : k < q-1 and P_k ⊐ P_{q-1} and P[k+1] = P[q]}
        = {k : k < q-1 and P_{k+1} ⊐ P_q}.
In other words, E_{q-1} consists of the values k < q-1 for which P_k ⊐ P_{q-1} and for which P_{k+1} ⊐ P_q, because P[k+1] = P[q]. Equivalently, E_{q-1} consists of those values k ∈ π*[q-1] for which we can extend P_k to P_{k+1} and still get a proper suffix of P_q.

StringMatch. Corollary 32.7. Let P be a pattern of length m, and let π be the prefix function for P. For q = 2, 3, ..., m,
π[q] = 0 if E_{q-1} = ∅, and π[q] = 1 + max{k ∈ E_{q-1}} if E_{q-1} ≠ ∅.

Proof. Case 1: E_{q-1} is empty. Then there is no k ∈ π*[q-1] (including k = 0) for which we can extend P_k to P_{k+1} and get a proper suffix of P_q. Thus π[q] = 0.

Case 2: E_{q-1} is not empty. (1) Prove π[q] ≥ 1 + max{k ∈ E_{q-1}}. For each k ∈ E_{q-1} we have k + 1 < q and P_{k+1} ⊐ P_q. The definition of π[q] gives the inequality.

StringMatch. (2) Prove that π[q] ≤ 1 + max{k ∈ E_{q-1}}. Since E_{q-1} is non-empty, π[q] > 0. Let r = π[q] - 1, so r + 1 = π[q]. Since r + 1 > 0, we have P[r+1] = P[q]. By Lemma 32.6 we also have r = π[q] - 1 ∈ π*[q-1]. Therefore r ∈ E_{q-1}, which implies r ≤ max{k ∈ E_{q-1}} and, immediately, the desired inequality. Combining both inequalities, we have the result. ∎

Now glue all these results together to obtain a proof of correctness of COMPUTE-PREFIX-FUNCTION.