Overview. Knuth-Morris-Pratt & Boyer-Moore Algorithms. Notation Review (2) Notation Review (1) The Kunth-Morris-Pratt (KMP) Algorithm

Similar documents
Analysis of Algorithms Prof. Karen Daniels

Efficient Sequential Algorithms, Comp309

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12

Graduate Algorithms CS F-20 String Matching

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Algorithms: COMP3121/3821/9101/9801

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Data Structures and Algorithm. Xiaoqing Zheng

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Module 9: Tries and String Matching

Lecture 3: String Matching

Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt Algorithm

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003

2. Exact String Matching

Text Searching. Thierry Lecroq Laboratoire d Informatique, du Traitement de l Information et des

Improving the KMP Algorithm by Using Properties of Fibonacci String

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then

String Matching II. Algorithm : Design & Analysis [19]

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem

String Search. 6th September 2018

Chapter 5 Arrays and Strings 5.1 Arrays as abstract data types 5.2 Contiguous representations of arrays 5.3 Sparse arrays 5.4 Representations of

Pattern Matching (Exact Matching) Overview

On Boyer-Moore Preprocessing

Searching Sear ( Sub- (Sub )Strings Ulf Leser

INF 4130 / /8-2017

Theoretical Computer Science. Efficient string-matching allowing for non-overlapping inversions

Algorithms Design & Analysis. String matching

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

INF 4130 / /8-2014

Module 9: Tries and String Matching

Module 9: Tries and String Matching

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

Finding all covers of an indeterminate string in O(n) time on average

PATTERN MATCHING WITH SWAPS IN PRACTICE

A Pattern Matching Algorithm Using Deterministic Finite Automata with Infixes Checking. Jung-Hua Hsu

Approximate Pattern Matching and the Query Complexity of Edit Distance

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *

Three new strategies for exact string matching

Hashing Techniques For Finite Automata

Text Analytics. Searching Terms. Ulf Leser

On the Decision Tree Complexity of String Matching. December 29, 2017

Average Case Analysis of the Boyer-Moore Algorithm

String Range Matching

STRING-SEARCHING ALGORITHMS* (! j)(1 =<j _<-n- k + 1)A(pp2 p s.isj+l"" Sj+k-1). ON THE WORST-CASE BEHAVIOR OF

String Regularities and Degenerate Strings

All three must be approved Deadlines around: 21. sept, 26. okt, and 16. nov

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17

Each internal node v with d(v) children stores d 1 keys. k i 1 < key in i-th sub-tree k i, where we use k 0 = and k d =.

Lecture Lecture 3 Tuesday Sep 09, 2014

Everything You Always Wanted to Know About Parsing

Samson Zhou. Pattern Matching over Noisy Data Streams

SUBSTRING SEARCH BBM ALGORITHMS DEPT. OF COMPUTER ENGINEERING

CS483 Design and Analysis of Algorithms

Designing 1-tape Deterministic Turing Machines

When is a number Fibonacci?

Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT)

Streaming and communication complexity of Hamming distance

Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)

SUBSTRING SEARCH BBM ALGORITHMS DEPT. OF COMPUTER ENGINEERING ERKUT ERDEM. Apr. 28, 2015

Define M to be a binary n by m matrix such that:

Supplement to Sampling Strategies for Epidemic-Style Information Dissemination 1

Algorithms and Data Structures

IN MUSICAL SEQUENCE ABSTRACT 1 INTRODUCTION 2 APPROXIMATE MATCHING AND MUSICAL SEQUENCES

Lecture 18: Oct 15, 2014

On-line String Matching in Highly Similar DNA Sequences

Algorithms. Algorithms 5.3 SUBSTRING SEARCH. introduction brute force Knuth Morris Pratt Boyer Moore Rabin Karp ROBERT SEDGEWICK KEVIN WAYNE

Counting and Verifying Maximal Palindromes

Multiple Pattern Matching

Fast String Searching. J Strother Moore Department of Computer Sciences University of Texas at Austin

Maximal Unbordered Factors of Random Strings arxiv: v1 [cs.ds] 14 Apr 2017

Data Structures and Algorithms Chapter 2

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Data Structures and Algorithms Chapter 3

Design and Analysis of Algorithms Recurrence. Prof. Chuhua Xian School of Computer Science and Engineering

Fingerprint idea. Assume:

CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182

About the impossibility to prove P NP or P = NP and the pseudo-randomness in NP

Lecture 5: The Shift-And Method

String Matching with Variable Length Gaps

On Pattern Matching With Swaps

Lecture 4. Quicksort

Simple Real-Time Constant-Space String Matching

Lecture 9. Greedy Algorithm

58093 String Processing Algorithms. Lectures, Fall 2010, period II

The NFA Segments Scan Algorithm

Efficient (δ, γ)-pattern-matching with Don t Cares

5.3 Substring Search

So far we have implemented the search for a key by carefully choosing split-elements.

A Uniform Proof Procedure for Classical and Non-Classical Logics

UNIT II REGULAR LANGUAGES

Intrusion Detection and Malware Analysis

Least Random Suffix/Prefix Matches in Output-Sensitive Time

Integer Sorting on the word-ram

Chapter 34: NP-Completeness

1. Problem I calculated these out by hand. It was tedious, but kind of fun, in a a computer should really be doing this sort of way.

SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS

Theory of Computation

Introduction to Algorithms

arxiv: v2 [cs.ds] 23 Feb 2012

Transcription:

Knuth-Morris-Pratt & s by Robert C. St.Pierre Overview Notation review Knuth-Morris-Pratt algorithm Discussion of the Algorithm Example Boyer-Moore algorithm Discussion of the Algorithm Example Applications in automatic proving Notation Review (1) T[1..n] text to search, length n Characters from Σ P[1..m] pattern text, length m Characters from Σ m <= n Σ finite alphabet δ transition function in a FA Notation Review (2) P k k-character prefix P[1..k] of P[1..m] w is a prefix of x, denoted w x, if x = wy for some string y Σ * w is a suffix of x, denoted w x, if x = yw for some string y Σ * ε - empty string Notation Review (3) s shift 0 s n m If T[s+1..s+m] = P[1..m] then s is called a valid shift else s is called an invalid shift. The string-matching problem is to find all valid shifts. The Kunth-Morris-Pratt (KMP) Algorithm Θ(m) preprocessing time Saves a factor of Σ over the preprocessing time for the FA Done by using an auxiliary function π, instead of the transition function δ. Θ(n) matching time Section 32.4 in text. 1

Discussion Prefix Function (1) Discussion Prefix Function (2) The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid testing useless shifts in the naive pattern matching algorithm or to avoid the precomputation of δ for a stringmatching automation. Discussion Prefix Function (3) Discussion Prefix Function (4) Formally: Given a pattern P[1..m], the prefix function for the pattern P is the function π : {1, 2,, m} -> {0, 1,, m-1} such that π[q] = max{k: k < q and P k P q }. π[q] is the length of the longest prefix of P that is a proper suffix of P q. Discussion Pseudocode (1) Compute-Prefix-Function(P) 1. m <- length[t] 2. π[1] <- 0 3. k <- 0 4. for q <- 2 to m 5. do while k > 0 and P[k+1]!= P[q] 6. do k <- π[k] 7. if P[k+1] = P[q] 8. then k <- k + 1 9. π[q] <- k 10. return π Discussion Pseudocode (2) KMP-Matcher(T, P) 1. n <- length[t] 2. m <- length[p] 3. π <- Compute-Prefix-Function(P) 4. q <- 0 ; Number of characters matched. 5. for i <- 1 to n ; Scan the text from left to right 6. do while q > 0 and P[q+1]!= T[i] 7. do q <- π[q] ;Next character does not match. 8. if P[q+1] = T[I] 9. then q <- q + 1 ;Next character matches 10. if q = m ;Is all of P matched? 11. then print Pattern occurs with shift i m 12. q <- π[q] ;Look for the next match 2

Discussion Correctness (1) Discussion Correctness (2) Lemma 32.5 (Prefix-function iteration lemma) Let P be a pattern of length m with prefix function π. Then, for q = 1, 2,, m, we have π [q] = {k: k < q and P k P q }. Lemma 32.6 Let P be a pattern of length m, and let π be the prefix function for P. For q = 1, 2,, m, if π[q] > 0, then π[q] 1 π [q - 1]. Examples (HTML based) http://www.cs.utexas.edu/users/moore/ best-ideas/string-searching/kpmexample.html http://www-igm.univmlv.fr/~lecroq/string/examples/exp8.ht ml Examples (Applets) http://www-igm.univmlv.fr/~lecroq/string/node8.html http://www-sr.informatik.unituebingen.de/~buehler/bm/bm.html The If the pattern P is relatively long and the alphabet Σ is reasonably large, then [this algorithm] is likely to be the most efficient string-matching algorithm. Matches right to left, unlike KMP. This algorithm is NOT in the second edition of our book, but is in section 34.5 of the first edition. Discussion Matcher Function (1) Boyer-Moore-Matcher(T, P, Σ) 1. n <- length[t] 2. m <- length[p] 3. λ <- Compute-Last-Occurrence-Function(P, m, Σ) 4. γ <- Compute-Good-Suffix-Function(P, m) 5. s <- 0 6. while s <= n m 7. do j <- m 8. while j > 0 and P[j] = T[s+j] 9. do j <- j 1 10. if j = 0 11. then print Pattern occurs at shift s 12. s <- s + γ[0] 13. else s <- s + max(γ[0], j λ[t[s+j]] ) 3

Discussion Matcher Function (2) The function Boyer-Moore-Matcher(T,P, Σ) looks remarkably like the naive stringmatching algorithm. Indeed, commenting out lines 3-4 and changing lines 12-13 to s <- s + 1, results in a version of the naive string-matching algorithm. The uses the greater of two heuristics to determine how much to shift next by. Discussion Heuristic Bad (1) The first heuristic, is the bad-character heuristic. In general, works as follows: P[j]!= T[s+j] for some j, where 1<= j <= m. Let k be the largest index in the range 1 <= k <= m such that T[s+j] = P[k], if any such k exists. Otherwise let k = 0. We can safely increase by j k, three cases to show this. Discussion Heuristic Bad (2) Case 1. k = 0, so increase by j. Discussion Heuristic Bad (3) Case 2. k < j, so increase by j k. Discussion Heuristic Bad (4) Case 3. k > j, resulting in a negative shift, but the good-suffix heuristic recommendation is ignored. Discussion Heuristic Bad (5) Compute-Last-Occurrence-Function(P,m,Σ) 1. for each character a Σ 2. do λ[a] = 0 3. for j <- 1 to m 4. do λ[p[j]] <- j 5. return λ The running time of this procedure is O( Σ + m). 4

Discussion Heuristic Good (1) Define the relation Q ~ R for strings Q and R to mean that Q R or R Q. If two strings are similar, then we can align them with their rightmost characters matched, and no pair of aligned characters will disagree. The relation ~ is symmetric. Q ~ R and S ~ R imply Q ~ S Discussion Heuristic Good (2) If P[j]!= T[s+j], where j < m, then the good-suffix heuristic says that we can safely advance by γ[j] = m max{k: 0 <= k < m and P[j+1..m] ~ P k } γ[j] is the least amount we can advance s and not cause any characters in the good suffix T[s + j + 1..s + m] to be mismatched against the new alignment of the pattern. γ[j] > 0 for all j = 1..m, which ensures that this algorithm makes progress. Discussion Heuristic Good (3) Discussion Heuristic Good (4) Compute-Good-Suffix-Function(P,m) 1. π <- Compute-Prefix-Function(P) 2. P = reverse(p) 3. π <- Compute-Prefix-Function(P ) 4. For j <- 0 to m 5. do γ[j] <- m π[m] 6. For i <- 1 to m 7. do j <- m π [i] 8. if γ[j] > i - π [i] 9. then γ[j] <- i - π [i] 10. return γ This function has running time O(m). Discussion Running Time Worst case is O((n m + 1)m + Σ ) Compute-Last-Occurrence-Function takes time O(m + Σ ). Compute-Good-Suffix-Function takes time O(m). O(m) time is spent validating each valid shift s. Examples (HTML based) http://www.cs.utexas.edu/users/moore/ best-ideas/string-searching/fstrposexample.html 5

Examples (Applets) http://www-igm.univmlv.fr/~lecroq/string/node14.html http://www.blarg.com/~doyle/bmi.html http://www-sr.informatik.unituebingen.de/~buehler/bm/bm.html References 1. Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms. 2 st Edition. MIT, 2001. 2. Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest. Introduction to Algorithms. 1 st Edition. MIT, 1990. 6