Lecture 3: String Matching

COMP36111: Advanced Algorithms I. Lecture 3: String Matching. Ian Pratt-Hartmann. Room KB2.38; email: ipratt@cs.man.ac.uk. 2017-18

Outline The string matching problem The Rabin-Karp algorithm The Knuth-Morris-Pratt algorithm

Suppose we are given some English text: "A blazing sun upon a fierce August day was no greater rarity in southern France then, than at any other time, before or since. Everything in Marseilles, and about Marseilles, had stared at the fervid sky, and been stared at in return, until a staring habit had become universal there." and a search string, say "Marseilles". We would like to find all instances (or just the first instance) of the search string in the text. What is the best way?

As usual, we start by modelling the data: let Σ be a finite non-empty set (the alphabet); let T = T[0], ..., T[n−1] be a string of length n over Σ; let P = P[0], ..., P[m−1] be a string of length m over Σ. We formalize the notion of an occurrence of one string in another: string P occurs with shift i in string T if P[j] = T[i + j] for all j (0 ≤ j < m). Then we have the following problem. MATCHING. Given: strings T and P over some fixed alphabet Σ. Return: the set of integers i such that P occurs in T with shift i.

Here is a really stupid algorithm:

begin naivematch(T, P)
    I ← ∅
    for i = 0 to |T| − |P|
        j ← 0
        until j = |P| or T[i + j] ≠ P[j]
            j++
        if j = |P|
            I ← I ∪ {i}
    return I
end

[Figure: the pattern (indexed by j) slides along the text (indexed by i), comparing character by character at each shift.]

Running time is O(|T| · |P|).
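The pseudocode above can be sketched as runnable Python (the function name is mine, not from the lecture):

```python
def naive_match(T, P):
    """Return every shift i at which P occurs in T, by direct comparison."""
    I = []
    for i in range(len(T) - len(P) + 1):
        j = 0
        # Advance j until we fall off the end of the pattern or hit a mismatch.
        while j < len(P) and T[i + j] == P[j]:
            j += 1
        if j == len(P):          # the whole pattern matched at shift i
            I.append(i)
    return I
```

For example, naive_match("ababa", "aba") returns [0, 2].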

Outline The string matching problem The Rabin-Karp algorithm The Knuth-Morris-Pratt algorithm

Let n = |T| and m = |P|. Think of the elements of Σ as digits in base b, where b = |Σ|. Then P is the number

    P[0] · b^(m−1) + ... + P[m−1] · b^0.

Similarly, T[i, ..., i+m−1] is

    T[i] · b^(m−1) + ... + T[i+m−1] · b^0.

To calculate T[i+1, ..., i+m] from T[i, ..., i+m−1], write:

    T[i+1, ..., i+m] = (T[i, ..., i+m−1] − T[i] · b^(m−1)) · b + T[i+m].

These numbers can get a bit large. However, we can work modulo q, for some constant q (usually a prime) such that b·q is about the size of a computer word. Of course, we then have

    T[i+1, ..., i+m] ≡ (T[i, ..., i+m−1] − T[i] · b^(m−1)) · b + T[i+m]  (mod q).

If T[i, ..., i+m−1] ≢ P (mod q), then we know we do not have a match at shift i. If T[i, ..., i+m−1] ≡ P (mod q), then we simply check explicitly whether T[i, ..., i+m−1] = P.
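The rolling computation modulo q can be sketched in Python; the choices b = 256 and q = 101 are illustrative, not fixed by the lecture:

```python
def rolling_hashes(T, m, b=256, q=101):
    """Yield the value of each window T[i..i+m-1], read base b, modulo q."""
    h = 0
    for c in T[:m]:                      # hash of the first window
        h = (h * b + ord(c)) % q
    yield h
    bm = pow(b, m - 1, q)                # b^(m-1) mod q, used to drop T[i]
    for i in range(len(T) - m):
        # Slide the window: remove T[i], shift up by one digit, append T[i+m].
        h = ((h - ord(T[i]) * bm) * b + ord(T[i + m])) % q
        yield h
```

Each yielded value agrees with the hash of the corresponding window computed from scratch, which is exactly what makes the constant-time update sound.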

The worst-case running time of this algorithm is also O(|T| · |P|). On average, however, it works much better: a rough estimate of the probability of a spurious match is 1/q, since this is the probability that a random number will take a given value modulo q. (Well, that's actually nonsense, but never mind.) A reasonable estimate of the number of matches is O(1), since patterns are basically rare. This leads to an expected performance of about O(n + m + m(n/q)). Thus, the expected running time will be about O(n + m), since presumably q > m.

Here is the algorithm:

begin Rabin-Karp(T, P, q, b)
    I ← ∅
    m ← |P|
    t ← T[0] · b^(m−1) + ... + T[m−1] · b^0 mod q
    p ← P[0] · b^(m−1) + ... + P[m−1] · b^0 mod q
    i ← 0
    while i ≤ |T| − m
        if p = t
            j ← 0
            while j < |P| and P[j] = T[i + j]
                j++
            if j = |P|
                I ← I ∪ {i}
        t ← (t − T[i] · b^(m−1)) · b + T[i + m] mod q
        i++
    return I
end
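A runnable Python sketch of the same algorithm; again b = 256 and q = 101 are illustrative parameter choices:

```python
def rabin_karp(T, P, b=256, q=101):
    """All shifts at which P occurs in T, using a rolling hash modulo q."""
    n, m = len(T), len(P)
    if m > n:
        return []
    p = t = 0
    for j in range(m):                       # hashes of P and of T[0..m-1]
        p = (p * b + ord(P[j])) % q
        t = (t * b + ord(T[j])) % q
    bm = pow(b, m - 1, q)                    # b^(m-1) mod q
    I = []
    for i in range(n - m + 1):
        # Compare explicitly only when the hashes agree (a possible match).
        if p == t and T[i:i + m] == P:
            I.append(i)
        if i < n - m:                        # roll the window forward
            t = ((t - ord(T[i]) * bm) * b + ord(T[i + m])) % q
    return I
```

On the running example, rabin_karp("Marseilles, and about Marseilles", "Marseilles") returns [0, 22].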

Outline The string matching problem The Rabin-Karp algorithm The Knuth-Morris-Pratt algorithm

The basic idea of this algorithm is as follows. Suppose we have a mismatch at some pattern position j. Call P[0, ..., j−1] the good prefix. Now let us look at the longest prefix of the good prefix which is also a proper suffix of the good prefix.

Suppose that the length of the longest such prefix is k. It should be clear that there cannot be any matches involving shifts of fewer than j − k positions (for then the good prefix would have a prefix that is also a proper suffix of length greater than k, and k would not be maximal). So we can shift the pattern up by j − k, so that its first k characters sit under the last k characters of the good prefix. Denote the length k of the longest prefix of P[0, ..., j−1] that is also a proper suffix of P[0, ..., j−1] by π(j), for all j (1 ≤ j ≤ |P|), and set π(0) = 0.

Don't do it! Don't shift the pattern! Instead, think about what would happen to j if you did: the length of the good prefix would change from j to π(j). Now repeat the process: either we get a match and make progress (incrementing j), or we get a mismatch and execute j ← π(j).

Here is the algorithm.

begin Knuth-Morris-Pratt(T, P)
    I ← ∅
    compute the function π
    i ← 0, j ← 0
    while i < |T|
        if P[j] = T[i]
            if j = |P| − 1
                I ← I ∪ {i − |P| + 1}
                i++, j ← π(|P|)
            else
                i++, j++
        else if j > 0
            j ← π(j)
        else
            i++
    return I
end

The following algorithm computes π, returned as an array f indexed by position rather than by prefix length (so π(j) = f(j − 1) for j ≥ 1):

begin compute-π(P)
    i ← 1, j ← 0
    f(0) ← 0
    while i < |P|
        if P[i] = P[j]
            f(i) ← j + 1
            i++, j++
        else if j > 0
            j ← f(j − 1)
        else
            f(i) ← 0
            i++
    return f
end
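Both routines can be sketched in runnable Python. Here pi is the position-indexed table (the lecture's f), so the fall-back on a mismatch reads j = pi[j − 1]; the while-loop form below repeats that jump until a match or j = 0, which is equivalent to the branch structure of the pseudocode:

```python
def compute_pi(P):
    """pi[i] = length of the longest proper prefix of P[0..i]
    that is also a suffix of P[0..i]."""
    pi = [0] * len(P)
    j = 0
    for i in range(1, len(P)):
        while j > 0 and P[i] != P[j]:
            j = pi[j - 1]          # fall back along the borders
        if P[i] == P[j]:
            j += 1
        pi[i] = j
    return pi

def kmp_match(T, P):
    """All shifts at which P occurs in T, in O(|T| + |P|) time."""
    pi = compute_pi(P)
    I, j = [], 0
    for i, c in enumerate(T):
        while j > 0 and c != P[j]:
            j = pi[j - 1]          # mismatch: shrink the good prefix
        if c == P[j]:
            j += 1                 # match: extend the good prefix
        if j == len(P):            # full occurrence ending at position i
            I.append(i - len(P) + 1)
            j = pi[j - 1]          # keep searching for overlapping matches
    return I
```

For instance, compute_pi("ababaca") is [0, 0, 1, 2, 3, 0, 1], and kmp_match("ababa", "aba") returns [0, 2].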

Here is the general idea behind compute-π(P): j points to the square after the longest confirmed good prefix for P[0] ... P[i − 1]. If we get a match, we can simply increase both i and j, and update π(i). If we get a mismatch, we do not have to start at the beginning of the pattern; we can use π(j − 1) to jump over characters we know will match.

The running time of Knuth-Morris-Pratt(T, P) (ignoring the construction of π) is O(|T|). Letting k = i − j, each iteration of the loop either increments i or increases k by at least 1, and neither quantity ever decreases. Hence, the while loop can execute at most 2|T| times. Note that π is one-off: it depends only on P and not on T, so its computation is not critical. In fact, however, the running time of compute-π(P) is O(|P|). Hence, the overall running time is O(|P| + |T|).