Approximation: Theory and Algorithms

Similar documents
Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009

Similarity Search. The String Edit Distance. Nikolaus Augsten. Free University of Bozen-Bolzano Faculty of Computer Science DIS. Unit 2 March 8, 2012

Similarity Search. The String Edit Distance. Nikolaus Augsten.

Outline. Similarity Search. Outline. Motivation. The String Edit Distance

Approximation: Theory and Algorithms

Algorithm Design and Analysis

CSE 202 Dynamic Programming II

6. DYNAMIC PROGRAMMING II

6.6 Sequence Alignment

Noisy Subsequence Recognition Using Constrained String Editing Involving Substitutions, Insertions, Deletions and Generalized Transpositions 1

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties

Sequence Comparison with Mixed Convex and Concave Costs

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

An Algorithm for Computing the Invariant Distance from Word Position

Chapter 6. Dynamic Programming. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Analysis and Design of Algorithms Dynamic Programming

Partha Sarathi Mandal

Sequence Comparison. mouse human

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

Dynamic Programming. Weighted Interval Scheduling. Algorithmic Paradigms. Dynamic Programming

Lecture 2: Pairwise Alignment. CG Ron Shamir

More Dynamic Programming

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.

Bio nformatics. Lecture 3. Saad Mneimneh

Computing a Longest Common Palindromic Subsequence

Implementing Approximate Regularities

Algorithms and Theory of Computation. Lecture 9: Dynamic Programming

CS 580: Algorithm Design and Analysis

Fundamentals of Similarity Search

Data Structures in Java

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

More Dynamic Programming

Dynamic Programming. Prof. S.J. Soni

TASM: Top-k Approximate Subtree Matching

Chapter 6. Dynamic Programming. CS 350: Winter 2018

Algebraic Dynamic Programming. Dynamic Programming, Old Country Style

A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence

Efficient High-Similarity String Comparison: The Waterfall Algorithm

Approximation: Theory and Algorithms

CSE 591 Foundations of Algorithms Homework 4 Sample Solution Outlines. Problem 1

Module 9: Tries and String Matching

Chapter 6. Weighted Interval Scheduling. Dynamic Programming. Algorithmic Paradigms. Dynamic Programming Applications

Pairwise sequence alignment

Dynamic Programming: Edit Distance

General Methods for Algorithm Design

Outline DP paradigm Discrete optimisation Viterbi algorithm DP: 0 1 Knapsack. Dynamic Programming. Georgy Gimel farb

arxiv: v1 [cs.ds] 9 Apr 2018

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Copyright 2000, Kevin Wayne 1

Algorithms for Approximate String Matching

Sequence analysis and Genomics

Longest Common Prefixes

Structure-Based Comparison of Biomolecules

Dynamic Programming. p. 1/43

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

EECS730: Introduction to Bioinformatics

Information Processing Letters. A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

String Search. 6th September 2018

Areas. ! Bioinformatics. ! Control theory. ! Information theory. ! Operations research. ! Computer science: theory, graphics, AI, systems,.

Samson Zhou. Pattern Matching over Noisy Data Streams

Lecture 4: September 19

6.S078 A FINE-GRAINED APPROACH TO ALGORITHMS AND COMPLEXITY LECTURE 1

String Matching with Variable Length Gaps

Algorithms Design & Analysis. String matching

(pp ) PDAs and CFGs (Sec. 2.2)

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

/463 Algorithms - Fall 2013 Solution to Assignment 4

INF 4130 / /8-2017

On Pattern Matching With Swaps

A Survey of the Longest Common Subsequence Problem and Its. Related Problems

Introduction. I Dynamic programming is a technique for solving optimization problems. I Key element: Decompose a problem into subproblems, solve them

(pp ) PDAs and CFGs (Sec. 2.2)

Algorithm Design and Analysis

Trace Reconstruction Revisited

Avoiding Approximate Squares

Objec&ves. Review. Dynamic Programming. What is the knapsack problem? What is our solu&on? Ø Review Knapsack Ø Sequence Alignment 3/28/18

Including Interval Encoding into Edit Distance Based Music Comparison and Retrieval

Dynamic programming. Curs 2017

Graduate Algorithms CS F-09 Dynamic Programming

Computer Science & Engineering 423/823 Design and Analysis of Algorithms

Edit Distance with Move Operations


Algorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser

arxiv: v2 [cs.cc] 2 Apr 2015

CS473 - Algorithms I

Robustness Analysis of Networked Systems

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Dynamic programming. Curs 2015

INF 4130 / /8-2014

Dynamic Programming 1

Lecture 7: Dynamic Programming I: Optimal BSTs

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Complexity Theory of Polynomial-Time Problems

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

2. Exact String Matching

Transcription:

Approximation: Theory and Algorithms The String Edit Distance Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 1 / 27

Outline 1 String Edit Distance Motivation and Definition Brute Force Algorithm Dynamic Programming Algorithm Edit Distance Variants 2 Conclusion Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 2 / 27

Outline String Edit Distance Motivation and Definition 1 String Edit Distance Motivation and Definition Brute Force Algorithm Dynamic Programming Algorithm Edit Distance Variants 2 Conclusion Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 3 / 27

Motivation String Edit Distance Motivation and Definition How different are hello and hello? hello and hallo? hello and hell? hello and shell? Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 4 / 27

String Edit Distance Motivation and Definition What is a String Distance Function? Definition (String Distance Function) Given a finite alphabet Σ, a string distance function, δ s, maps each pair of strings (x, y) Σ Σ to a positive real number (including zero). δ s : Σ Σ R + 0 Σ is the set of all strings over Σ, including the empty string ε. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 5 / 27

String Edit Distance The String Edit Distance Motivation and Definition Definition (String Edit Distance) The string edit distance between two strings, ed(x, y), is the minimum number of character insertions, deletions and replacements that transforms x to y. Example: hello hallo: replace e by a hello hell: delete o hello shell: delete o, insert s Also called Levenshtein distance. 1 1 Levenshtein introduced this distance for signal processing in 1965 [Lev65]. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 6 / 27

Outline String Edit Distance Brute Force Algorithm 1 String Edit Distance Motivation and Definition Brute Force Algorithm Dynamic Programming Algorithm Edit Distance Variants 2 Conclusion Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 7 / 27

Gap Representation String Edit Distance Brute Force Algorithm Gap representation of the string transformation x y: Place string x above string y with a gap in x for every insertion, with a gap in y for every deletion, with different characters in x and y for every replacement. Example: h a l l o s h e l l insert s replace a by e delete o Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 8 / 27

String Edit Distance Deriving the Recursive Formula Brute Force Algorithm Example: h a l l o s h e l l Given: Gap representation, gap(x, y), of the shortest edit distance between two strings x and y, such that gap(x, y) = ed(x, y). Claim: If we remove the last column, then the remaining columns represent the shortest edit distance, gap(x, y ) = ed(x, y ), between the remaining substrings, x and y. Proof (by contradiction): Last column contributes with c = 0 or c = 1 to gap(x, y), thus gap(x, y) = gap(x, y ) + c. If we assume ed(x, y ) < gap(x, y ), then we could find a new gap representation gap (x, y ) = ed(x, y ) such that gap (x, y) = gap (x, y ) + c < ed(x, y). Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 9 / 27

String Edit Distance Deriving the Recursive Formula Brute Force Algorithm Example: h a l l o s h e l l Notation: x[1... i] is the substring of the first i characters of x (x[1... 0] = ε) x[i] is the i-th character of x Recursive Formula: ed(ε, ε) = 0 ed(x[1..i], ε] = i ed(ε, y[1..j] = j ed(x[1..i], y[1..j]) = min(ed(x[1..i 1], y[1..j 1]) + c, ed(x[1..i 1], y[1..j]) + 1, ed(x[1..i], y[1..j 1]) + 1) where c = 0 if x[i] = y[i], otherwise c = 1. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 10 / 27

Brute Force Algorithm String Edit Distance Brute Force Algorithm ed-bf(x, y) m = x, n = y if m = 0 then return n if n = 0 then return m if x[m] = y[n] then c = 0 else c = 1 return min(ed-bf(x, y[1... n 1]) + 1, ed-bf(x[1... m 1], y) + 1, ed-bf(x[1... m 1], y[1... n 1]) + c) Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 11 / 27

Brute Force Algorithm String Edit Distance Brute Force Algorithm Recursion tree for ed-bf(ab, ax): ab,x ab,ε a,x a,ε a,ε ε,x ε,ε a,ε ab,xb a,xb a,x ε,xb ε,x ε,ε ε,x a,ε a,x ε,x ε,ε Exponential runtime in string length :-( Observation: Subproblems are computed repeatedly (e.g. ed-bf(a, x) is computed 3 times) Approach: Reuse previously computed results! Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 12 / 27

Outline String Edit Distance Dynamic Programming Algorithm 1 String Edit Distance Motivation and Definition Brute Force Algorithm Dynamic Programming Algorithm Edit Distance Variants 2 Conclusion Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 13 / 27

String Edit Distance Dynamic Programming Algorithm Dynamic Programming Algorithm Store distances between all prefixes of x and y Use matrix C 0..m,0..n with C i,j = ed(x[1... i], y[1... j]) where x[1..0] = y[1..0] = ε. Example: ε x b ε 0 1 2 a 1 1 2 b 2 2 1 ab,xb ab,x a,xb a,x ab,ε a,x a,ε a,x ε,xb ε,x a,ε ε,x ε,ε a,ε ε,x ε,ε a,ε ε,x ε,ε Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 14 / 27

String Edit Distance Dynamic Programming Algorithm Dynamic Programming Algorithm ed-dyn(x, y) C : array[0.. x ][0.. y ] for i = 0 to x do C[i, 0] = i for j = 1 to y do C[0, j] = j for j = 1 to y do for i = 1 to x do if x[i] = y[j] then c = 0 else c = 1 C[i, j] = min(c[i 1, j 1] + c, C[i 1, j] + 1, C[i, j 1] + 1) Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 15 / 27

String Edit Distance Understanding the Solution Dynamic Programming Algorithm Example: x = moon y = mond m o n d 0 1 2 3 4 m 1 0 1 2 3 o 2 1 0 1 2 o 3 2 1 1 2 n 4 3 2 1 2 m o o n m o n d m o o n m o n d m o o n m o n d Solution 1: replace n by d and (second) o by n in x Solution 2: insert d after n and delete (first) o in x Solution 3: insert d after n and delete (second) o in x Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 16 / 27

String Edit Distance Dynamic Programming Algorithm Dynamic Programming Algorithm Properties Complexity: O(mn) time (nested for-loop) O(mn) space (the (m+1) (n+1)-matrix C) Improving space complexity (assume m < n): we need only the previous column to compute the next column we can forget all other columns O(m) space complexity Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 17 / 27

String Edit Distance Dynamic Programming Algorithm Dynamic Programming Algorithm ed-dyn + (x, y) col 0 : array[0.. x ] col 1 : array[0.. x ] for i = 0 to x do col 0 [i] = i for j = 1 to y do col 1 [0] = j for i = 1 to x do if x[i] = y[j] then c = 0 else c = 1 col 1 [i] = min(col 0 [i 1] + c, col 1 [i 1] + 1, col 0 [i] + 1) col 0 = col1 Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 18 / 27

Outline String Edit Distance Edit Distance Variants 1 String Edit Distance Motivation and Definition Brute Force Algorithm Dynamic Programming Algorithm Edit Distance Variants 2 Conclusion Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 19 / 27

Distance Metric String Edit Distance Edit Distance Variants Definition (Distance Metric) A distance function δ is a distance metric if and only if for any x, y, z the following hold: δ(x, y) = 0 x = y (identity) δ(x, y) = δ(y, x) (symmetric) δ(x, y) + δ(y, z) δ(x, z) (triangle inequality) Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 20 / 27

Introducing Weights String Edit Distance Edit Distance Variants Look at the edit operations as a set of rules with a cost: α(ε, b) = ω ins (insert) α(a, ε) = ω del (delete) α(a, b) = ω rep (replace) where a, b Σ, and ω ins, ω del, ω rep R + 0. Edit script: sequence of rules that transform x to y Edit distance: edit script with minimum cost (adding up costs of single rules) Example: so far we assumed ω ins = ω del = ω rep = 1. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 21 / 27

String Edit Distance Weighted Edit Distance Edit Distance Variants Recursive formula with weights: C 0,0 = 0 C i,j = min(c i 1,j 1 + α(x[i], y[j]), C i 1,j + α(x[i], ε), C i,j 1 + α(ε, y[j])) where α(a, a) = 0 for all a Σ, and C 1,j = C i, 1 =. We can easily adapt the dynamic programming algorithm. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 22 / 27

String Edit Distance Variants of the Edit Distance Edit Distance Variants Unit cost edit distance (what we did up to now): ω ins = ω del = ω rep = 1 0 ed(x, y) max( x, y ) distance metric Hamming distance [Ham50, SK83]: called also string matching with k mismatches allows only replacements ω rep = 1, ω ins = ω del = 0 d(x, y) x if x = y, otherwise d(x, y) = distance metric Longest Common Subsequence (LCS) distance [NW70, AG87]: allows only insertions and deletions ω ins = ω del = 1, ω rep = 0 d(x, y) x + y distance metric Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 23 / 27

Allowing Transposition String Edit Distance Edit Distance Variants Transpositions switch two adjacent characters can be simulated by delete and insert typos are often transpositions New rule for transposition α(ab, ba) = ω trans allows us to assign a weight different from ω ins + ω del Recursive formula that includes transposition: C 0,0 = 0 C i,j = min(c i 1,j 1 + α(x[i], y[j]), C i 1,j + α(x[i], ε), C i,j 1 + α(ε, y[j]), C i 2,j 2 + α(x[i 1]x[i], y[j][j 1])) where α(ab, cd) = if a d or b c, α(a, a) = 0 for all a Σ, and C 1,j = C i, 1 = C 2,j = C j, 2 =. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 24 / 27

Text Searching String Edit Distance Edit Distance Variants Goal: search pattern p in text t ( p < t ) allow k errors match may start at any position of the text Difference to distance computation: C 0,j = 0 (instead of C 0,j = j, as text may start at any position) result: all C m,j k are endpoints of matches Example: p = survey t = surgery k = 2 ε s u r g e r y ε 0 0 0 0 0 0 0 0 s 1 0 1 1 1 1 1 1 u 2 1 0 1 2 2 2 2 r 3 2 1 0 1 2 2 3 v 4 3 2 1 1 2 3 3 e 5 4 3 2 2 1 2 3 y 6 5 4 3 3 2 2 2 3 matching positions with k 2 found. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 25 / 27

Summary Conclusion Edit distance between two strings: the minimum number of edit operations that transforms one string into the another Dynamic programming algorithm with O(mn) time and O(m) space complexity, where m n are the string lengths. Basic algorithm can easily be extended in order to: weight edit operations differently, support transposition, simulate Hamming distance and LCS, search pattern in text with k errors. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 26 / 27

What s Next? Conclusion q-gram distance for strings: O(n log n) algorithm positional q-grams q-gram filter for the edit distance implementation in a relational database Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 27 / 27

Alberto Apostolico and Zvi Galill. The longest common subsequence problem revisited. Algorithmica, 2(1):315 336, March 1987. Richard W. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 26(2):147 160, 1950. Vladimir I. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8 17, 1965. Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443 453, 1970. David Sankoff and Josef B. Kruskal, editors. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 27 / 27

Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983. Nikolaus Augsten (DIS) Approximation: Theory and Algorithms Unit 2 March 6, 2009 27 / 27