Algorithms Design & Analysis. String matching


Recap
Greedy algorithm

Today's topics
KMP algorithm
Suffix tree
Approximate string matching

String Matching Problem
Given a text string $T$ of length $n$ and a pattern string $P$ of length $m$, the exact string matching problem is to find all occurrences of $P$ in $T$.
Example: $T$ = "AGCTTGA", $P$ = "GCT"
Applications:
Searching keywords in a file
Search engines (like Google and Baidu)
Database searching (GenBank)

Terminologies
$S$ = "AGCTTGA"
$|S| = 7$, the length of $S$
Substring: $S_{i,j} = S_i S_{i+1} \cdots S_j$. Example: $S_{2,4}$ = "GCT"
Subsequence of $S$: obtained by deleting zero or more characters from $S$. "ACT" and "GCTT" are subsequences.
Prefix of $S$: $S_{1,k}$. "AGCT" is a prefix of $S$.
Suffix of $S$: $S_{h,|S|}$. "CTTGA" is a suffix of $S$.

A Brute-Force Algorithm
Time: $O(mn)$, where $m = |P|$ and $n = |T|$.
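The slide gives only the running time, so here is a minimal sketch of what a brute-force matcher could look like. The function name `brute_force_match` and the choice to report 1-based positions (to agree with the slide notation) are assumptions made for illustration.

```python
def brute_force_match(text, pattern):
    """Return all 1-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    positions = []
    for start in range(n - m + 1):            # every candidate alignment
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1                            # extend the match character by character
        if j == m:                            # all m characters matched
            positions.append(start + 1)
    return positions


print(brute_force_match("AGCTTGA", "GCT"))    # [2]
```

Each of the at most n alignments costs at most m comparisons, which is where the O(mn) bound comes from.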

Two-phase Algorithms
Phase 1: Generate an array to indicate the moving direction.
Phase 2: Make use of the array to move and match the string.
KMP algorithm: Proposed by Knuth, Morris and Pratt in 1977.
Boyer-Moore algorithm: Proposed by Boyer and Moore in 1977.

KMP Algorithm: First Case
The first symbol of $P$ does not appear in $P$ again.
Slide $P$ to $T_4$, since $T_4 \neq P_4$ in (a).

KMP Algorithm: Second Case
The first symbol of $P$ appears in $P$ again.
$T_7 \neq P_7$ in (a). We have to slide $P$ to $T_6$, since $P_6 = P_1 = T_6$.

KMP Algorithm: Third Case
The prefix of $P$ appears in $P$ again.
$T_8 \neq P_8$ in (a). We have to slide $P$ to $T_6$, since $P_{6,7} = P_{1,2} = T_{6,7}$.

Principle of KMP Algorithm
[Figure: principle of the KMP algorithm]

Prefix Function
$f(j)$ = the largest $k < j$ such that $P_{1,k} = P_{j-k+1,j}$
$f(j) = 0$ if no such $k$ exists
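The definition can be checked directly with a slow but literal sketch. The name `prefix_function_naive` is hypothetical, and the pattern "ATCACATCATCA" is an assumption: it is the string used on the later suffix slides, and it reproduces the values f(8) = 3, f(9) = 4 and f(10) = 2 quoted in this lecture.

```python
def prefix_function_naive(p):
    """f[j] for j = 1..m, computed straight from the definition.

    Index 0 is unused so that f[j] matches the 1-based slide notation.
    """
    m = len(p)
    f = [0] * (m + 1)
    for j in range(1, m + 1):
        for k in range(j - 1, 0, -1):         # try the largest k < j first
            if p[:k] == p[j - k:j]:           # P_{1,k} == P_{j-k+1,j}
                f[j] = k
                break
    return f


print(prefix_function_naive("ATCACATCATCA")[1:])
# [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 3, 4]
```

This takes cubic time in the worst case; it is only meant as a reference against which a faster computation can be compared.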

Prefix Function
To determine $f(5)$:
$f(4) = 1$; thus, if $P_5 = P_2$, then we get $f(5) = f(4) + 1$.
If $P_5 \neq P_2$, then we check whether $P_5 = P_1$.
Because $P_5 \neq P_1$, we get $f(5) = 0$.

Prefix Function
Suppose we have found $f(8) = 3$. To determine $f(9)$:
$f(8) = 3$ means $P_{6,8} = P_{1,3}$.
Now, $P_9 = P_4$.
Thus, we set $f(9) = f(8) + 1 = 4$.

Prefix Function
To determine $f(10)$:
$f(9) = 4$ because $P_9 = P_{f(9-1)+1} = P_4$ = "A".
$f(4) = 1$ because $P_4 = P_{f(4-1)+1} = P_1$ = "A".
$f(10) = 2$ because "T" $= P_{10} \neq P_{f(10-1)+1} = P_5$ = "C", while $P_{10} = P_{f(f(10-1))+1} = P_{f(4)+1} = P_2$ = "T".

Prefix Function
$$f(j) = \begin{cases} f^{k}(j-1) + 1 & \text{if } j > 1 \text{ and } k \ge 1 \text{ is the smallest integer such that } P_j = P_{f^{k}(j-1)+1} \\ 0 & \text{otherwise} \end{cases}$$
Here $f^{k}$ denotes $f$ applied $k$ times.
$k = 1$: $f(j) = f(j-1) + 1$
$k = 2$: $f(j) = f(f(j-1)) + 1$

Prefix Function
COMPUTE-PREFIX-FUNCTION(P)
    m ← length[P]
    f[1] ← 0
    k ← 0
    for q ← 2 to m do
        while k > 0 and P[k+1] ≠ P[q] do
            k ← f[k]
        if P[k+1] = P[q] then
            k ← k + 1
        f[q] ← k
    return f
Time complexity: O(m)
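A possible Python transcription of the pseudocode above is sketched here. The array keeps an unused slot at index 0 so that `f[q]` agrees with the 1-based slide notation; that layout, the function name, and the example pattern "ATCACATCATCA" (taken from the later suffix slides) are assumptions.

```python
def compute_prefix_function(p):
    """Linear-time prefix function; f[q] for q = 1..m (f[0] is unused)."""
    m = len(p)
    f = [0] * (m + 1)
    k = 0                                     # length of the current matched prefix
    for q in range(2, m + 1):
        # P[k+1] on the slide is p[k] here, and P[q] is p[q - 1].
        while k > 0 and p[k] != p[q - 1]:
            k = f[k]                          # fall back to the next shorter border
        if p[k] == p[q - 1]:
            k += 1
        f[q] = k
    return f


print(compute_prefix_function("ATCACATCATCA")[1:])
# [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 3, 4]
```

The O(m) bound follows because k increases by at most one per iteration of the outer loop, so the inner loop cannot decrease it more than m times in total.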

An Example for KMP Algorithm
[Figure: Phase 1 and Phase 2 of the KMP algorithm on an example. Annotations: f(4-1)+1 = f(3)+1 = 0+1 = 1; matched; f(12)+1 = 4+1 = 5]

KMP Algorithm
KMP-MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    f ← COMPUTE-PREFIX-FUNCTION(P)
    q ← 0
    for i ← 1 to n do
        while q > 0 and P[q+1] ≠ T[i] do
            q ← f[q]
        if P[q+1] = T[i] then
            q ← q + 1
        if q = m then
            print "Pattern occurs with shift" i - m
            q ← f[q]
Time complexity: O(m + n)
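The matcher can be sketched in Python along the same lines. This version assumes a non-empty pattern and returns the shifts i - m in a list instead of printing them; the function names and the return convention are illustrative choices, not part of the slides.

```python
def compute_prefix_function(p):
    """Prefix function with 1-based indexing (f[0] is unused)."""
    m = len(p)
    f = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = f[k]
        if p[k] == p[q - 1]:
            k += 1
        f[q] = k
    return f


def kmp_matcher(t, p):
    """Return every shift s such that p occurs in t starting at position s + 1."""
    n, m = len(t), len(p)                     # assumes m >= 1
    f = compute_prefix_function(p)
    shifts = []
    q = 0                                     # number of pattern characters matched
    for i in range(1, n + 1):
        while q > 0 and p[q] != t[i - 1]:     # P[q+1] != T[i]
            q = f[q]
        if p[q] == t[i - 1]:
            q += 1
        if q == m:                            # full match ending at T[i]
            shifts.append(i - m)
            q = f[q]                          # keep going to find overlapping matches
    return shifts


print(kmp_matcher("AGCTTGA", "GCT"))          # [1]
print(kmp_matcher("ATCACATCATCA", "TCA"))     # [1, 6, 9]
```

A shift s corresponds to the 1-based position s + 1, so the second call reports the occurrences at positions 2, 7 and 10.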

Multiple Strings Matching Problem
Given a text string $T$ of length $n$ and a set of pattern strings, the multiple strings matching problem is to find whether a pattern occurs in $T$ or not.
Application of KMP?
The time complexity to compute the prefix function is $O(m)$.
What happens when the set of patterns is large?

Suffixes
Suffixes for $S$ = "ATCACATCATCA":
S(1)   ATCACATCATCA
S(2)   TCACATCATCA
S(3)   CACATCATCA
S(4)   ACATCATCA
S(5)   CATCATCA
S(6)   ATCATCA
S(7)   TCATCA
S(8)   CATCA
S(9)   ATCA
S(10)  TCA
S(11)  CA
S(12)  A

Suffix Tree
[Figure: a suffix tree for $S$ = "ATCACATCATCA"]

Properties of a Suffix Tree
Each tree edge is labeled by a substring of $S$.
Each internal node has at least 2 children.
Each $S(i)$ has its corresponding labeled path from the root to a leaf, for $1 \le i \le n$.
There are $n$ leaves.
No two edges branching out from the same internal node can start with the same character.

Algorithm for Creating a Suffix Tree
Step 1: Divide all suffixes into distinct groups according to their starting characters (in lexicographic order) and create a node.
Step 2: For each group, if it contains only one suffix, create a leaf node and a branch with this suffix as its label; otherwise, find the longest common prefix among all suffixes of this group and create a branch out of the node with this longest common prefix as its label. Delete this prefix from all suffixes of the group.
Step 3: Repeat the above procedure for each node which is not terminated.
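The three steps can be sketched recursively in Python. Several things here are assumptions rather than part of the slides: the nested-dict representation (edge label mapped to either a child dict or a leaf's start position), the function names, and the "$" terminator appended to S so that a suffix such as "A" still ends at its own leaf. This naive version takes quadratic time and space in the worst case; it is not one of the linear-time constructions.

```python
from os.path import commonprefix              # character-wise longest common prefix


def build_node(suffixes):
    """suffixes: list of (remaining text, 1-based start position of the suffix)."""
    groups = {}
    for rest, pos in suffixes:                # Step 1: group by starting character
        groups.setdefault(rest[0], []).append((rest, pos))
    node = {}
    for first in sorted(groups):              # lexicographic order of the groups
        group = groups[first]
        if len(group) == 1:                   # Step 2a: a single suffix becomes a leaf
            rest, pos = group[0]
            node[rest] = pos
        else:                                 # Step 2b: branch on the longest common prefix
            lcp = commonprefix([rest for rest, _ in group])
            stripped = [(rest[len(lcp):], pos) for rest, pos in group]
            node[lcp] = build_node(stripped)  # Step 3: repeat on the new node
    return node


def build_suffix_tree(s):
    text = s + "$"                            # assumed terminator, must not occur in s
    return build_node([(text[i:], i + 1) for i in range(len(s))])


tree = build_suffix_tree("ATCACATCATCA")
print(sorted(tree))                           # ['A', 'CA', 'TCA']
print(tree["TCA"])                            # {'$': 10, 'CATCATCA$': 2, 'TCA$': 7}
```

The branch labeled "TCA" collects the suffixes starting at positions 2, 7 and 10, which agrees with the longest common prefix reported for that group in the example.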

Example for Creating a Suffix Tree
$S$ = "ATCACATCATCA". Starting characters: A, C, T.
In $N_3$: S(2) = TCACATCATCA, S(7) = TCATCA, S(10) = TCA.
The longest common prefix of $N_3$ is TCA.

Example for Creating a Suffix Tree
$S$ = "ATCACATCATCA". Second recursion:
[Figure: the tree after the second recursion]

Finding a Substring with the Suffix Tree
$S$ = "ATCACATCATCA"
$P$ = "TCAT" is at position 7 in $S$.
$P$ = "TCA" is at positions 2, 7 and 10 in $S$.
$P$ = "TCATT" is not in $S$.
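To make the lookup concrete, the following sketch builds a tree naively and then walks it. The nested-dict representation, the "$" terminator and all function names are assumptions for illustration. The search follows edge labels from the root and, once the pattern is exhausted, reports the start positions stored in the leaves below the point reached.

```python
from os.path import commonprefix


def build_node(suffixes):
    """Naive recursive construction: group by first character, branch on the LCP."""
    groups = {}
    for rest, pos in suffixes:
        groups.setdefault(rest[0], []).append((rest, pos))
    node = {}
    for group in groups.values():
        if len(group) == 1:
            node[group[0][0]] = group[0][1]   # leaf: label -> start position
        else:
            lcp = commonprefix([rest for rest, _ in group])
            node[lcp] = build_node([(rest[len(lcp):], pos) for rest, pos in group])
    return node


def build_suffix_tree(s):
    text = s + "$"                            # assumed terminator, must not occur in s
    return build_node([(text[i:], i + 1) for i in range(len(s))])


def leaves(node):
    """All start positions stored below node."""
    if isinstance(node, int):
        return [node]
    return [pos for child in node.values() for pos in leaves(child)]


def find_positions(tree, pattern):
    """Sorted 1-based positions of pattern, or [] if it does not occur."""
    node = tree
    while pattern:
        if isinstance(node, int):
            return []                         # ran past a leaf
        edge = next((lab for lab in node if lab[0] == pattern[0]), None)
        if edge is None:
            return []                         # no edge starts with the next character
        k = min(len(edge), len(pattern))
        if edge[:k] != pattern[:k]:
            return []                         # mismatch inside the edge label
        pattern = pattern[k:]
        node = node[edge]
    return sorted(leaves(node))


tree = build_suffix_tree("ATCACATCATCA")
print(find_positions(tree, "TCAT"))           # [7]
print(find_positions(tree, "TCA"))            # [2, 7, 10]
print(find_positions(tree, "TCATT"))          # []
```

Deciding whether the pattern occurs needs at most m character comparisons; listing every occurrence additionally costs a traversal of the subtree below the match.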

Time Complexity
A suffix tree for a text string $T$ of length $n$ can be constructed in $O(n)$ time (with a complicated algorithm):
Weiner (1973)
McCreight (1976)
Ukkonen (1995)
Searching for a pattern $P$ of length $m$ on a suffix tree needs $O(m)$ comparisons.
Exact string matching: $O(n + m)$ time

The Suffix Array
In a suffix array, all suffixes of $S$ are in non-decreasing lexical order.
For example, $S$ = "ATCACATCATCA":

i    A[i]   Suffix S(A[i])
1    12     A
2    4      ACATCATCA
3    9      ATCA
4    1      ATCACATCATCA
5    6      ATCATCA
6    11     CA
7    3      CACATCATCA
8    8      CATCA
9    5      CATCATCA
10   10     TCA
11   2      TCACATCATCA
12   7      TCATCA
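The example array can be reproduced with a short sketch that simply sorts the suffix start positions by the suffix they denote. The function name is hypothetical, and this direct sort is an assumption made for brevity: it is much slower than a linear-time construction.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(suffix_array("ATCACATCATCA"))
# [12, 4, 9, 1, 6, 11, 3, 8, 5, 10, 2, 7]
```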

Searching in a Suffix Array
If $T$ is represented by a suffix array, we can find $P$ in $T$ in $O(m \log n)$ time with a binary search.
A suffix array can be determined in $O(n)$ time by a lexicographic depth-first traversal of a suffix tree.
Total time: $O(n + m \log n)$
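The binary search can be sketched as two boundary searches over the first m characters of each suffix. The function names are hypothetical, and the array is built here by plain sorting (an assumption for brevity) rather than by the O(n) traversal mentioned above.

```python
def suffix_array(t):
    """1-based suffix start positions in lexicographic order (naive construction)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def search(t, sa, p):
    """Sorted 1-based positions of p in t, using binary search on the suffix array."""
    m = len(p)

    def head(rank):
        start = sa[rank] - 1
        return t[start:start + m]             # first m characters of that suffix

    lo, hi = 0, len(sa)
    while lo < hi:                            # first rank whose head is >= p
        mid = (lo + hi) // 2
        if head(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                            # first rank whose head is > p
        mid = (lo + hi) // 2
        if head(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])               # ranks first..lo-1 all start with p


t = "ATCACATCATCA"
sa = suffix_array(t)
print(search(t, sa, "TCA"))                   # [2, 7, 10]
print(search(t, sa, "TCAT"))                  # [7]
print(search(t, sa, "TCATT"))                 # []
```

Each of the O(log n) probes compares at most m characters, which gives the O(m log n) search bound.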

Approximate String Matching
Text string $T$, $|T| = n$
Pattern string $P$, $|P| = m$
$k$ errors, where an error can be substituting, deleting, or inserting a character.
Example: $T$ = "pttapa", $P$ = "patt", $k = 2$. $T_{1,2}$, $T_{1,3}$, $T_{1,4}$ and $T_{5,6}$ all match $P$ with at most 2 errors.

Suffix Edit Distance
Given two strings $S_1$ and $S_2$, the suffix edit distance is the minimum number of substitutions, insertions and deletions which will transform some suffix of $S_1$ into $S_2$.
Example:
$S_1$ = "ptt" and $S_2$ = "p". The suffix edit distance between $S_1$ and $S_2$ is 1.
$S_1$ = "pt" and $S_2$ = "patt". The suffix edit distance between $S_1$ and $S_2$ is 2.
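The definition can be checked by brute force: compute the ordinary edit distance from every suffix of S1 (including the empty one) to S2 and take the minimum. This is a sketch under that reading of the definition; the function names are hypothetical and no attempt is made at efficiency.

```python
def edit_distance(a, b):
    """Standard edit distance (substitution, insertion, deletion), row by row."""
    prev = list(range(len(b) + 1))            # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # match or substitute
        prev = cur
    return prev[-1]


def suffix_edit_distance(s1, s2):
    """Minimum edit distance from any suffix of s1 (possibly empty) to s2."""
    return min(edit_distance(s1[h:], s2) for h in range(len(s1) + 1))


print(suffix_edit_distance("ptt", "p"))       # 1
print(suffix_edit_distance("pt", "patt"))     # 2
```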

Suffix Edit Distance Used in Matching
Given $T$ and $P$, if at least one of the suffix edit distances between $T_{1,1}, T_{1,2}, \ldots, T_{1,n}$ and $P$ is not greater than $k$, then there is an approximate match with error not greater than $k$.
Example: $T$ = "pttapa", $P$ = "patt", $k = 2$
For $T_{1,1}$ = "p" and $P$ = "patt", the suffix edit distance is 3.
For $T_{1,2}$ = "pt" and $P$ = "patt", the suffix edit distance is 2.
For $T_{1,5}$ = "pttap" and $P$ = "patt", the suffix edit distance is 3.
For $T_{1,6}$ = "pttapa" and $P$ = "patt", the suffix edit distance is 2.

Approximate String Matching
Solved by dynamic programming.
Let $E(i,j)$ denote the suffix edit distance between $T_{1,j}$ and $P_{1,i}$, with boundary values $E(0,j) = 0$ and $E(i,0) = i$.
If $P_i = T_j$: $E(i,j) = E(i-1,j-1)$
If $P_i \neq T_j$: $E(i,j) = \min\{E(i,j-1),\ E(i-1,j),\ E(i-1,j-1)\} + 1$

Example for Approximate String Matching
Example: $T$ = "pttapa", $P$ = "patt", $k = 2$

         j     0  1  2  3  4  5  6
         T        p  t  t  a  p  a
i=0            0  0  0  0  0  0  0
i=1  p         1  0  1  1  1  0  1
i=2  a         2  1  1  2  1  1  0
i=3  t         3  2  1  1  2  2  1
i=4  t         4  3  2  1  2  3  2
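The table can be reproduced with a direct sketch of the dynamic program. The boundary values E(0, j) = 0 and E(i, 0) = i are read off the first row and column of the table; the function names are hypothetical.

```python
def suffix_edit_table(t, p):
    """E[i][j] = suffix edit distance between T_{1,j} and P_{1,i}."""
    n, m = len(t), len(p)
    E = [[0] * (n + 1) for _ in range(m + 1)]     # row 0: E(0, j) = 0
    for i in range(1, m + 1):
        E[i][0] = i                               # column 0: E(i, 0) = i
        for j in range(1, n + 1):
            if p[i - 1] == t[j - 1]:              # P_i = T_j
                E[i][j] = E[i - 1][j - 1]
            else:
                E[i][j] = 1 + min(E[i][j - 1], E[i - 1][j], E[i - 1][j - 1])
    return E


def approximate_match_ends(t, p, k):
    """End positions j (1-based) where some substring of t ending at j is within k errors of p."""
    last_row = suffix_edit_table(t, p)[len(p)]
    return [j for j in range(1, len(t) + 1) if last_row[j] <= k]


for row in suffix_edit_table("pttapa", "patt"):
    print(row)
print(approximate_match_ends("pttapa", "patt", 2))   # [2, 3, 4, 6]
```

The last row is what matters for matching: an entry of at most k in column j means an approximate occurrence ends at position j of T. Filling the table takes O(mn) time.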

Next Week
External memory algorithm