Algorithms for Approximate String Matching


Levenshtein Distance

Part I: Levenshtein Distance, Hamming Distance, Approximate String Matching with k Differences, Longest Common Subsequences.
Part II: A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequences Problem.

Yoan Pinzón
Universidad Nacional de Colombia, Facultad de Ingeniería, Departamento de Ingeniería de Sistemas e Industrial
ypinzon@unal.edu.co, www.pinzon.co.uk, August 26

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you cannot spell or pronounce Levenshtein, the metric is also called edit distance.

The edit distance δ(p, t) between two strings p (the pattern, m = |p|) and t (the text, n = |t|) is the minimum number of insertions, deletions and replacements needed to make p equal to t.

[Insertion] Insert a new letter a into x: an insertion on the string x = vw adds a letter a, converting x into x' = vaw.
[Deletion] Delete a letter a from x: a deletion on the string x = vaw removes the letter a, converting x into x' = vw.
[Replacement] Replace a letter a in x: a replacement on the string x = vaw substitutes a letter b for a, converting x into x' = vbw.

Example 1: p = "approximate matching", t = "appropriate meaning".

Position:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
String t:  a p p r o p r i a t  e  ␣  m  ǫ  e  a  n  i  n  g
String p:  a p p r o x i m a t  e  ␣  m  a  t  c  h  i  n  g

δ(t, p) = 7

Solution, using Dynamic Programming (DP): we compute a matrix D[0..m, 0..n], where D[i,j] is the minimum number of operations needed to match p_1..i to t_1..j. This is computed as follows:

D[i,0] = i
D[0,j] = j
D[i,j] = min{ D[i−1,j] + 1, D[i,j−1] + 1, D[i−1,j−1] + δ(p_i, t_j) }

where δ(a, b) = 0 if a = b and 1 otherwise, and δ(p, t) = D[m,n].

Example 2: p = "surgery", t = "survey", δ(t, p) = 2.

Position:  1 2 3 4 5 6 7
String t:  s u r v e ǫ y
String p:  s u r g e r y

Pseudo-code:

procedure ED(p, t)    {m = |p|, n = |t|}
begin
  for i ← 0 to m do D[i,0] ← i
  for j ← 0 to n do D[0,j] ← j
  for i ← 1 to m do
    for j ← 1 to n do
      if p_i = t_j then D[i,j] ← D[i−1,j−1]
      else D[i,j] ← min(D[i−1,j], D[i,j−1], D[i−1,j−1]) + 1
    od
  od
  return D[m,n]
end

Running time: O(nm)
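The recurrence translates directly into code. A minimal sketch in Python (the function name `edit_distance` is ours, not from the slides):

```python
def edit_distance(p: str, t: str) -> int:
    """Levenshtein distance via the standard DP table D[0..m][0..n]."""
    m, n = len(p), len(t)
    # D[i][j] = minimum number of edits to turn p[:i] into t[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of p[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # match / replacement
    return D[m][n]
```

For Example 2, edit_distance("surgery", "survey") returns 2.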

A Graph Reformulation for the Edit Distance Problem

Example: p = "survey", t = "surgery".

        ǫ  s  u  r  g  e  r  y
    ǫ   0  1  2  3  4  5  6  7
  1 s   1  0  1  2  3  4  5  6
  2 u   2  1  0  1  2  3  4  5
  3 r   3  2  1  0  1  2  3  4
  4 v   4  3  2  1  1  2  3  4
  5 e   5  4  3  2  2  1  2  3
  6 y   6  5  4  3  3  2  2  2

The dynamic programming matrix D can be seen as a graph where the nodes are the cells and the edges represent the operations. The cost (weight) of an edge corresponds to the cost of the operation. Then δ(t, p) is the length of the shortest path from node [0,0] to node [m,n] (here, from [0,0] to [6,7]).

Running time: O(nm log(nm))
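The shortest-path view can be sketched with Dijkstra's algorithm over the implicit grid graph (a didactic sketch; the name `edit_distance_dijkstra` is ours). Each cell has up to three outgoing edges: down (deletion, cost 1), right (insertion, cost 1), and diagonal (match or replacement, cost 0 or 1):

```python
import heapq

def edit_distance_dijkstra(p: str, t: str) -> int:
    """Edit distance as a shortest path from [0,0] to [m,n] in the edit graph."""
    m, n = len(p), len(t)
    dist = {(0, 0): 0}
    pq = [(0, 0, 0)]  # (distance, i, j)
    while pq:
        d, i, j = heapq.heappop(pq)
        if (i, j) == (m, n):
            return d
        if d > dist.get((i, j), float("inf")):
            continue  # stale queue entry
        moves = []
        if i < m:
            moves.append((i + 1, j, 1))                # deletion
        if j < n:
            moves.append((i, j + 1, 1))                # insertion
        if i < m and j < n:
            moves.append((i + 1, j + 1,
                          0 if p[i] == t[j] else 1))   # match / replacement
        for ni, nj, w in moves:
            if d + w < dist.get((ni, nj), float("inf")):
                dist[(ni, nj)] = d + w
                heapq.heappush(pq, (d + w, ni, nj))
    return dist[(m, n)]
```

With a binary heap this runs in the O(nm log(nm)) time quoted above, which is why the plain DP is preferred in practice.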

Hamming Distance

The Hamming distance H is defined only for strings of the same length. For two strings p and t, H(p, t) is the number of positions in which the two strings differ, i.e., have different characters.

Examples:
H("pinzon", "pinion") = 1
H("josh", "jose") = 1
H("here", "hear") = 2
H("kelly", "belly") = 1
H("AAT", "TAA") = 2
H("AGCAA", "ACATA") = 3
H("AGCACACA", "ACACACTA") = 6

Pseudo-code: too easy!!

Approximate String Matching with k Differences

Problem: The k-differences approximate string matching problem is to find all occurrences of the pattern string p in the text string t with at most k differences (substitutions, insertions, deletions).

Solution, using DP:

D[i,0] = i
D[0,j] = 0
D[i,j] = min{ D[i−1,j] + 1, D[i,j−1] + 1, D[i−1,j−1] + δ(p_i, t_j) }

If D[m,j] ≤ k then we say that p occurs ending at position j of t.

Running time: O(nm)
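The Hamming distance pseudo-code dismissed above as "too easy" really is a one-liner (a sketch; the name `hamming` is ours):

```python
def hamming(p: str, t: str) -> int:
    """Number of mismatching positions; defined only for equal-length strings."""
    if len(p) != len(t):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(p, t))
```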

Pseudo-code:

procedure KDifferences(p, t, k)    {m = |p|, n = |t|}
begin
  for i ← 0 to m do D[i,0] ← i
  for j ← 0 to n do D[0,j] ← 0
  for i ← 1 to m do
    for j ← 1 to n do
      if p_i = t_j then D[i,j] ← D[i−1,j−1]
      else D[i,j] ← min(D[i−1,j], D[i,j−1], D[i−1,j−1]) + 1
    od
  od
  for j ← 1 to n do
    if D[m,j] ≤ k then Output(j)
  od
end

Running time: O(nm)

Example: p = "CDDA", t = "CADDACDACDBACBA", k = 1.

         ǫ  C  A  D  D  A  C  D  A  C  D  B  A  C  B  A
         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
    ǫ    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  1 C    1  0  1  1  1  1  0  1  1  0  1  1  1  0  1  1
  2 D    2  1  1  1  1  2  1  0  1  1  0  1  2  1  1  2
  3 D    3  2  2  1  1  2  2  1  1  2  1  1  2  2  2  2
  4 A    4  3  2  2  2  1  2  2  1  2  2  2  1  2  3  2

p occurs in t ending at positions 5, 8 and 12.
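A direct Python rendering of KDifferences, using two rolling rows of D (a sketch; the function name is ours). Positions are reported 1-indexed, as in the example above:

```python
def k_differences(p: str, t: str, k: int) -> list[int]:
    """All end positions j (1-indexed) where p occurs in t with <= k differences."""
    m, n = len(p), len(t)
    # prev/cur hold consecutive rows of D; D[0][j] = 0 lets a match start anywhere
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        cur = [i] + [0] * n      # D[i][0] = i
        for j in range(1, n + 1):
            if p[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1]
            else:
                cur[j] = min(prev[j], cur[j - 1], prev[j - 1]) + 1
        prev = cur
    return [j for j in range(1, n + 1) if prev[j] <= k]
```

Usage: k_differences("CDDA", "CADDACDACDBACBA", 1) yields [5, 8, 12], matching the last row of the matrix above.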

Longest Common Subsequence

Preliminaries: For two sequences x = x_1 … x_m and y = y_1 … y_n (n ≥ m) we say that x is a subsequence of y (and, equivalently, y is a supersequence of x) if there exist indices i_1 < i_2 < … < i_m such that x_j = y_{i_j} for all j. Given a finite set of sequences S, a longest common subsequence (LCS) of S is a longest possible sequence s such that each sequence in S is a supersequence of s.

Example: y = "longest", x = "large", LCS(y, x) = "lge".

String y: l o n g e s t
String x: l a r g e

Problem: The Longest Common Subsequence (LCS) of two strings p and t is a subsequence of both p and t of maximum possible length.

Solution, using Dynamic Programming: we compute a matrix L[0..m, 0..n], where L[i,j] is the length of an LCS of p_1..i and t_1..j. This is computed as follows:

L[i,j] = 0,                        if i = 0 or j = 0
L[i,j] = L[i−1,j−1] + 1,           if p_i = t_j
L[i,j] = max{ L[i−1,j], L[i,j−1] }, if p_i ≠ t_j

Pseudo-code:

procedure LCS(p, t)    {m = |p|, n = |t|}
begin
  for i ← 0 to m do L[i,0] ← 0
  for j ← 0 to n do L[0,j] ← 0
  for i ← 1 to m do
    for j ← 1 to n do
      if p_i = t_j then L[i,j] ← L[i−1,j−1] + 1
      else if L[i,j−1] > L[i−1,j] then L[i,j] ← L[i,j−1]
      else L[i,j] ← L[i−1,j]
    od
  od
  return L[m,n]
end

Running time: O(nm)
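The table L also lets us recover an actual LCS by backtracking from L[m][n]. A sketch in Python (the function name `lcs` is ours):

```python
def lcs(p: str, t: str) -> str:
    """One longest common subsequence of p and t, via the DP table L."""
    m, n = len(p), len(t)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if p[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Backtrack from L[m][n]: follow matches diagonally, else the larger neighbour
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if p[i - 1] == t[j - 1]:
            out.append(p[i - 1])
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For p = "survey" and t = "surgery" this returns "surey", of length 5.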

Example 1: p = "survey", t = "surgery".

        ǫ  s  u  r  g  e  r  y
    ǫ   0  0  0  0  0  0  0  0
  1 s   0  1  1  1  1  1  1  1
  2 u   0  1  2  2  2  2  2  2
  3 r   0  1  2  3  3  3  3  3
  4 v   0  1  2  3  3  3  3  3
  5 e   0  1  2  3  3  4  4  4
  6 y   0  1  2  3  3  4  4  5

LCS(p, t) = "surey"
LLCS(p, t) = L[6,7] = 5

Example 2: p = "ttgatacatt", t = "gaataagacc".

        ǫ  g  a  a  t  a  a  g  a  c  c
    ǫ   0  0  0  0  0  0  0  0  0  0  0
  1 t   0  0  0  0  1  1  1  1  1  1  1
  2 t   0  0  0  0  1  1  1  1  1  1  1
  3 g   0  1  1  1  1  1  1  2  2  2  2
  4 a   0  1  2  2  2  2  2  2  3  3  3
  5 t   0  1  2  2  3  3  3  3  3  3  3
  6 a   0  1  2  3  3  4  4  4  4  4  4
  7 c   0  1  2  3  3  4  4  4  4  5  5
  8 a   0  1  2  3  3  4  5  5  5  5  5
  9 t   0  1  2  3  4  4  5  5  5  5  5
 10 t   0  1  2  3  4  4  5  5  5  5  5

LCS(p, t) = ?
LLCS(p, t) = L[10,10] = 5

Part II
A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequence Problem

Some More Definitions

The ordered pair of positions i and j of L, denoted [i, j], is a match iff x_i = y_j. If [i, j] is a match, and an LCS s_{i,j} of x_1 x_2 … x_i and y_1 y_2 … y_j has length k, then k is the rank of [i, j].

The match [i, j] is k-dominant if it has rank k and, for any other pair [i′, j′] of rank k, either i′ > i and j′ ≤ j, or i′ ≤ i and j′ > j.

Computing the k-dominant matches is all that is needed to solve the LCS problem, since the LCS of x and y has length p iff the maximum rank attained by a dominant match is p.

A match [i, j] precedes a match [i′, j′] if i < i′ and j < j′.

Let r be the total number of matches and d the total number of dominant matches (over all ranks). Then p ≤ d ≤ r ≤ nm.

Let R denote the partial order relation on the set of matches in L given by precedence. A set of matches such that in any pair one of the matches always precedes the other in R constitutes a chain relative to R. A set of matches such that in any pair neither element precedes the other in R constitutes an antichain.

Sankoff and Sellers (1973) observed that the LCS problem translates to finding a longest chain in the poset of matches induced by R.

A minimal decomposition of a poset into antichains partitions the poset into the minimum possible number of antichains.

Example: t = "aaagtgacctagcccg", p = "tccagatg".

[Figure: the grid of match points of p against t, showing a chain, the antichains for ranks k = 1..6, the k-dominant matches, and the k-dominant matches on the LCS.]

LCS(p, t) = "tccagg", k = 6.
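The dominant matches can be read off the DP table directly: a match [i, j] of rank k is k-dominant exactly when shrinking either prefix loses a unit of rank, i.e. L[i−1][j] = L[i][j−1] = k − 1. A sketch (function name ours) that groups dominant matches by rank, so that LLCS is the maximum rank:

```python
from collections import defaultdict

def dominant_matches(x: str, y: str) -> dict[int, list[tuple[int, int]]]:
    """Map k -> list of k-dominant matches [i, j] (1-indexed), via the LCS table."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    dom = defaultdict(list)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # a match whose rank drops when either prefix shrinks is dominant
            if (x[i - 1] == y[j - 1]
                    and L[i - 1][j] == L[i][j] - 1
                    and L[i][j - 1] == L[i][j] - 1):
                dom[L[i][j]].append((i, j))
    return dict(dom)
```

For p = "tccagatg" and t = "aaagtgacctagcccg" the maximum rank is 6 = LLCS(p, t), matching the figure.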

"aaagtgacctagcccg" "tccagatg" t 2 3 4 5 6 7 8 9 2 3 4 5 6 a a a g t g a c c t a g c c c g t c c a g a t g A Simple Bit-Vector Algorithm Here we will make use of word-level parallelism in order to compute the matrix L more efficiently. The algorithm is based on the O()-time computation of each column in L by using a bitparallel formula under the assumption that m w, where w is the number of bits in a machine word or O(nm/w)-time for the general case. An interesting property of the LCS allows to represent each column in L by using O()-space. That is, the values in the columns (rows) of L increase by at most one. i.e. L[i, j] = L[i, j] L[i, j] {,} for any (i, j) {..m} {..n}. In other words L will use the relative encoding of the dynamic programming table L. L is defined as NOT L. p t = p = 2 3 4 5 6 7 8 8 9

Example: y = "ttgatacatt" (written vertically) and x = "gaataagacc".

(a) Matrix L:

        ǫ  G  A  A  T  A  A  G  A  C  C
    ǫ   0  0  0  0  0  0  0  0  0  0  0
  1 T   0  0  0  0  1  1  1  1  1  1  1
  2 T   0  0  0  0  1  1  1  1  1  1  1
  3 G   0  1  1  1  1  1  1  2  2  2  2
  4 A   0  1  2  2  2  2  2  2  3  3  3
  5 T   0  1  2  2  3  3  3  3  3  3  3
  6 A   0  1  2  3  3  4  4  4  4  4  4
  7 C   0  1  2  3  3  4  4  4  4  5  5
  8 A   0  1  2  3  3  4  5  5  5  5  5
  9 T   0  1  2  3  4  4  5  5  5  5  5
 10 T   0  1  2  3  4  4  5  5  5  5  5

(b) Matrix ΔL (each entry is L[i,j] − L[i−1,j]):

        G  A  A  T  A  A  G  A  C  C
  1 T   0  0  0  1  1  1  1  1  1  1
  2 T   0  0  0  0  0  0  0  0  0  0
  3 G   1  1  1  0  0  0  1  1  1  1
  4 A   0  1  1  1  1  1  0  1  1  1
  5 T   0  0  0  1  1  1  1  0  0  0
  6 A   0  0  1  0  1  1  1  1  1  1
  7 C   0  0  0  0  0  0  0  0  1  1
  8 A   0  0  0  0  0  1  1  1  0  0
  9 T   0  0  0  1  0  0  0  0  0  0
 10 T   0  0  0  0  0  0  0  0  0  0

(c) Matrix ΔL′ = NOT ΔL (bitwise complement of each column; column 0 is L′_0 = 1^m).

For instance, reading column 6 of ΔL′ top to bottom (i = 1..10):

L′_6 = <0 1 1 0 0 0 1 0 1 1>
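The relative-encoding property is easy to verify mechanically. A sketch (function names ours) that builds L, takes vertical differences column by column, and checks that every increment is 0 or 1:

```python
def lcs_table(rows: str, cols: str) -> list[list[int]]:
    """Standard LCS DP table L[0..m][0..n] with `rows` vertical, `cols` horizontal."""
    m, n = len(rows), len(cols)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if rows[i - 1] == cols[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L

def delta_columns(rows: str, cols: str) -> list[list[int]]:
    """Columns j = 0..n of Delta-L: vertical increments, each 0 or 1."""
    L = lcs_table(rows, cols)
    m, n = len(rows), len(cols)
    return [[L[i][j] - L[i - 1][j] for i in range(1, m + 1)]
            for j in range(0, n + 1)]
```

For y = "ttgatacatt" and x = "gaataagacc", every entry of delta_columns(y, x) is 0 or 1, and complementing column 6 reproduces L′_6 = <0 1 1 0 0 0 1 0 1 1>.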

First we compute the array M of the bit-vectors that result for each possible text character. If the strings range over the alphabet Σ, then for each α ∈ Σ, M[α]_i = 1 if y_i = α, and 0 otherwise. M′[α] is the complement of M[α].

Example: y = "ttgatacatt" (x = "gaataagacc").

(a) Matrix M (matrix M′ is its bitwise complement):

        A  C  G  T
  1 t   0  0  0  1
  2 t   0  0  0  1
  3 g   0  0  1  0
  4 a   1  0  0  0
  5 t   0  0  0  1
  6 a   1  0  0  0
  7 c   0  1  0  0
  8 a   1  0  0  0
  9 t   0  0  0  1
 10 t   0  0  0  1

Basic steps of the algorithm:

1. Computation of M and M′.
2. Computation of matrix ΔL′, column by column:

   L′_0 = 1^m
   L′_j = (L′_{j−1} + (L′_{j−1} AND M[x_j])) OR (L′_{j−1} AND M′[x_j]),  for j ∈ {1..n}

3. LLCS is the number of times a carry took place in the addition.

Illustration of the L′_4 computation, for x = "gaataagacc" and y = "ttgatacatt":

[Figure: the column L′_3 of matrix ΔL′ and the column M[T] of matrix M (since x_4 = t), combined by the update formula to produce L′_4; the addition produces a carry out of bit m.]

Pseudo-code:

procedure LLCS(x, y)    {n = |x|, m = |y|}
begin
  { Preprocessing }
  for i ← 1 to m do
    for each α ∈ Σ do
      M[α](i) ← (y_i = α)
      M′[α](i) ← (y_i ≠ α)
  { Initialization }
  p ← 0
  L′_0 ← 1^m
  { Main step }
  for j ← 1 to n do
    L′_j ← (L′_{j−1} + (L′_{j−1} AND M[x_j])) OR (L′_{j−1} AND M′[x_j])
    if the addition produced a carry out of bit m then p ← p + 1
  od
  return p
end

The four steps of the update L′_4 ← (L′_3 + (L′_3 AND M[T])) OR (L′_3 AND M′[T]), writing bit i = 1 as the rightmost bit:

    L′_3                     = 1111010011
    M[T]                     = 1100010011
(1) L′_3 AND M[T]            = 1100010011
(2) L′_3 AND M′[T]           = 0011000000
(3) L′_3 + (1)               = 1 1011100110   (carry out of bit m: the LLCS counter is incremented)
(4) L′_4 = (3) OR (2), low m bits = 1011100110
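The whole algorithm fits in a few lines of Python, using arbitrary-precision integers as the bit-vectors (a sketch; the function name `llcs_bitvector` is ours). Bit i−1 of each vector corresponds to y_i:

```python
def llcs_bitvector(x: str, y: str) -> int:
    """Length of the LCS of x and y, counting carries in the bit-parallel update."""
    m = len(y)
    mask = (1 << m) - 1
    # M[alpha]: bit i-1 set iff y_i == alpha
    M = {}
    for i, c in enumerate(y):
        M[c] = M.get(c, 0) | (1 << i)
    Lp = mask                 # L'_0 = 1^m (all ones)
    llcs = 0
    for c in x:               # one O(1) column update per character of x
        Mc = M.get(c, 0)
        s = Lp + (Lp & Mc)    # the addition; may carry out of bit m
        if s >> m:
            llcs += 1         # LLCS = number of carries
        Lp = (s | (Lp & ~Mc)) & mask
    return llcs
```

Usage: llcs_bitvector("gaataagacc", "ttgatacatt") returns 5, and llcs_bitvector("surgery", "survey") returns 5, agreeing with the plain DP.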

Automata for Addition

[Figure: a two-state automaton over the states carry = 0 and carry = 1, describing how the carry propagates during the bit-vector addition.]

Experimental Results

[Figure: running time (in microseconds) versus text length, comparing the Allison-Dix algorithm with this algorithm.]