Outline. Similarity Search. Outline. Motivation. The String Edit Distance

Similar documents
Similarity Search. The String Edit Distance. Nikolaus Augsten.

Similarity Search. The String Edit Distance. Nikolaus Augsten. Free University of Bozen-Bolzano Faculty of Computer Science DIS. Unit 2 March 8, 2012

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009

Approximation: Theory and Algorithms

Algorithm Design and Analysis

Approximation: Theory and Algorithms

6. DYNAMIC PROGRAMMING II

6.6 Sequence Alignment

CSE 202 Dynamic Programming II

Sequence Comparison with Mixed Convex and Concave Costs

Noisy Subsequence Recognition Using Constrained String Editing Involving Substitutions, Insertions, Deletions and Generalized Transpositions 1

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17

An Algorithm for Computing the Invariant Distance from Word Position

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

More Dynamic Programming

Sequence Comparison. mouse human

Bio nformatics. Lecture 3. Saad Mneimneh

General Methods for Algorithm Design

More Dynamic Programming

Outline DP paradigm Discrete optimisation Viterbi algorithm DP: 0 1 Knapsack. Dynamic Programming. Georgy Gimel farb

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

Chapter 6. Dynamic Programming. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Partha Sarathi Mandal

Implementing Approximate Regularities

Fundamentals of Similarity Search

Data Structures in Java

CS 580: Algorithm Design and Analysis

Lecture 2: Pairwise Alignment. CG Ron Shamir

Algorithms and Theory of Computation. Lecture 9: Dynamic Programming

Dynamic Programming. Weighted Interval Scheduling. Algorithmic Paradigms. Dynamic Programming

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Dynamic Programming. p. 1/43

Computing a Longest Common Palindromic Subsequence

Analysis and Design of Algorithms Dynamic Programming

A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence

Algebraic Dynamic Programming. Dynamic Programming, Old Country Style

Efficient High-Similarity String Comparison: The Waterfall Algorithm

CSE 591 Foundations of Algorithms Homework 4 Sample Solution Outlines. Problem 1

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.

Information Processing Letters. A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

Chapter 6. Dynamic Programming. CS 350: Winter 2018

Algorithms for Approximate String Matching

arxiv: v1 [cs.ds] 9 Apr 2018

Dynamic Programming. Prof. S.J. Soni

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

Dynamic Programming: Edit Distance

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Longest Common Prefixes

A Survey of the Longest Common Subsequence Problem and Its. Related Problems

EECS730: Introduction to Bioinformatics

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Chapter 6. Weighted Interval Scheduling. Dynamic Programming. Algorithmic Paradigms. Dynamic Programming Applications

Module 9: Tries and String Matching

arxiv: v2 [cs.cc] 2 Apr 2015

Complexity Theory of Polynomial-Time Problems

Samson Zhou. Pattern Matching over Noisy Data Streams

Lecture 4: September 19

String Search. 6th September 2018

Sequence analysis and Genomics

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

Dynamic programming. Curs 2017

Structure-Based Comparison of Biomolecules

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties

/463 Algorithms - Fall 2013 Solution to Assignment 4

Pairwise sequence alignment

INF 4130 / /8-2017

On Pattern Matching With Swaps

CS473 - Algorithms I

Algorithm Design and Analysis

String Matching with Variable Length Gaps

Avoiding Approximate Squares

Copyright 2000, Kevin Wayne 1

Trace Reconstruction Revisited

Objec&ves. Review. Dynamic Programming. What is the knapsack problem? What is our solu&on? Ø Review Knapsack Ø Sequence Alignment 3/28/18

Including Interval Encoding into Edit Distance Based Music Comparison and Retrieval

State Complexity of Neighbourhoods and Approximate Pattern Matching

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Areas. ! Bioinformatics. ! Control theory. ! Information theory. ! Operations research. ! Computer science: theory, graphics, AI, systems,.

Algorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser

Edit Distance with Move Operations

Robustness Analysis of Networked Systems

Lecture 7: Dynamic Programming I: Optimal BSTs

6.S078 A FINE-GRAINED APPROACH TO ALGORITHMS AND COMPLEXITY LECTURE 1

INF 4130 / /8-2014

Dynamic programming. Curs 2015

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

Greedy. Outline CS141. Stefano Lonardi, UCR 1. Activity selection Fractional knapsack Huffman encoding Later:

Weighted Activity Selection

Introduction. I Dynamic programming is a technique for solving optimization problems. I Key element: Decompose a problem into subproblems, solve them

Soundex distance metric


TASM: Top-k Approximate Subtree Matching

Analysis of Algorithms Prof. Karen Daniels

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Linear Classifiers (Kernels)

Graduate Algorithms CS F-09 Dynamic Programming

Maximum sum contiguous subsequence Longest common subsequence Matrix chain multiplication All pair shortest path Kna. Dynamic Programming

Finding a longest common subsequence between a run-length-encoded string and an uncompressed string

Transcription:

Outline Similarity Search The Nikolaus Augsten nikolaus.augsten@sbg.ac.at Department of Computer Sciences University of Salzburg 1 http://dbresearch.uni-salzburg.at WS 2017/2018 Version March 12, 2018 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 1 / 27 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 2 / 27 Outline Motivation 1 How different are hello and hello? hello and hallo? hello and hell? hello and shell? Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 3 / 27 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 4 / 27

What is a String Distance Function? The Definition (String Distance Function) Given a finite alphabet Σ, a string distance function, δ s, maps each pair of strings (x, y) Σ Σ to a positive real number (including zero). δ s : Σ Σ R + 0 Σ is the set of all strings over Σ, including the empty string ε. Definition () The string edit distance between two strings, ed(x, y), is the minimum number of character insertions, deletions and replacements that transforms x to y. hello hallo: replace e by a hello hell: delete o hello shell: delete o, insert s Also called Levenshtein distance. 1 1 Levenshtein introduced this distance for signal processing in 1965 [Lev65]. Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 5 / 27 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 6 / 27 Outline Gap Representation 1 Gap representation of the string transformation x y: Place string x above string y with a gap in x for every insertion, with a gap in y for every deletion, with different characters in x and y for every replacement. Any sequence of edit operations can be represented with gaps. h a l l o s h e l l insert s replace a by e delete o Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 7 / 27 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 8 / 27

Deriving the Recursive Formula Deriving the Recursive Formula h a l l o s h e l l Given: Gap representation, gap(x, y), of the shortest edit distance between two strings x and y, such that gap(x, y) = ed(x, y). Claim: If we remove the last column, then the remaining columns represent the shortest edit distance, gap(x, y ) = ed(x, y ), between the remaining substrings, x and y. Proof (by contradiction): Last column contributes with c = 0 or c = 1 to gap(x, y), thus gap(x, y) = gap(x, y ) + c. If we assume ed(x, y ) < gap(x, y ), then we could find a new gap representation gap (x, y ) = ed(x, y ) < gap(x, y ) such that gap (x, y) = gap (x, y ) + c < gap(x, y ) + c = ed(x, y). Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 9 / 27 h a l l o s h e l l Notation: x[1... i] is the substring of the first i characters of x (x[1... 0] = ε) x[i] is the i-th character of x Recursive Formula: ed(ε, ε) = 0 ed(x[1..i], ε] = i ed(ε, y[1..j] = j ed(x[1..i], y[1..j]) = min(ed(x[1..i 1], y[1..j 1]) + c, ed(x[1..i 1], y[1..j]) + 1, ed(x[1..i], y[1..j 1]) + 1) where c = 0 if x[i] = y[j], otherwise c = 1. Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 10 / 27 ed-bf(x, y) m = x, n = y if m = 0 then return n if n = 0 then return m if x[m] = y[n] then c = 0 else c = 1 return min(ed-bf(x, y[1... n 1]) + 1, ed-bf(x[1... m 1], y) + 1, ed-bf(x[1... m 1], y[1... n 1]) + c) Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 11 / 27 Recursion tree for ed-bf(ab, xb): ab,xb ab,x b ab,ε b Exponential runtime in string length :-( Observation: Subproblems are computed repeatedly (e.g. ed-bf(a, x) is computed 3 times) Approach: Reuse previously computed results! Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 12 / 27

Outline 1 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 13 / 27 Store distances between all prefixes of x and y Use matrix C 0..m,0..n with where x[1..0] = y[1..0] = ε. ε x b ε 0 1 2 a 1 1 2 b 2 2 1 ab,xb C i,j = ed(x[1... i], y[1... j]) ab,x b ab,ε b Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 14 / 27 Understanding the Solution ed-dyn(x, y) C : array[0.. x ][0.. y ] for i = 0 to x do C[i, 0] = i for j = 1 to y do C[0, j] = j for j = 1 to y do for i = 1 to x do if x[i] = y[j] then c = 0 else c = 1 C[i, j] = min(c[i 1, j 1] + c, C[i 1, j] + 1, C[i, j 1] + 1) Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 15 / 27 x = moon y = mond ins ε m o n d ε 0 1 2 3 4 del m 1 0 1 2 3 o 2 1 0 1 2 o 3 2 1 1 2 n 4 3 2 1 2 Solution 1: replace n by d and (second) o by n in x Solution 2: insert d after n and delete (first) o in x m o o n m o n d m o o n m o n d m o o n m o Solution 3: insert d after n and delete (second) o in x Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 16 / 27 n d

Properties Complexity: O(mn) time (nested for-loop) O(mn) space (the (m+1) (n+1)-matrix C) Improving space complexity (assume m < n): we need only the previous column to compute the next column we can forget all other columns O(m) space complexity ed-dyn + (x, y) col 0 : array[0.. x ] col 1 : array[0.. x ] for i = 0 to x do col 0 [i] = i for j = 1 to y do col 1 [0] = j for i = 1 to x do if x[i] = y[j] then c = 0 else c = 1 col 1 [i] = min(col 0 [i 1] + c, col 1 [i 1] + 1, col 0 [i] + 1) col 0 = col1 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 17 / 27 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 18 / 27 Outline Distance Metric 1 Definition (Distance Metric) A distance function δ is a distance metric if and only if for any x, y, z the following hold: δ(x, y) = 0 x = y (identity) δ(x, y) = δ(y, x) (symmetric) δ(x, y) + δ(y, z) δ(x, z) (triangle inequality) Examples: the Euclidean distance is a metric d(a, b) = a b is not a metric (not symmetric) Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 19 / 27 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 20 / 27

Introducing Weights Weighted Edit Distance Look at the edit operations as a set of rules with a cost: α(ε, b) = ω ins (insert) α(a, ε) = ω { del (delete) ω rep if a b α(a, b) = (replace) 0 if a = b where a, b Σ, and ω ins, ω del, ω rep R + 0. Edit script: sequence of rules that transform x to y Edit distance: edit script with minimum cost (adding up costs of single rules) so far we assumed ω ins = ω del = ω rep = 1. Recursive formula with weights: C 0,0 = 0 C i,j = min(c i 1,j 1 + α(x[i], y[j]), C i 1,j + α(x[i], ε), C i,j 1 + α(ε, y[j])) where α(a, a) = 0 for all a Σ, and C 1,j = C i, 1 =. We can easily adapt the dynamic programming algorithm. Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 21 / 27 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 22 / 27 Variants of the Edit Distance Allowing Transposition Unit cost edit distance (what we did so far): ω ins = ω del = ω rep = 1 0 ed(x, y) max( x, y ) distance metric Hamming distance [Ham50, SK83]: called also string matching with k mismatches allows only replacements ω rep = 1, ω ins = ω del = 0 d(x, y) x if x = y, otherwise d(x, y) = distance metric Longest Common Subsequence (LCS) distance [NW70, AG87]: allows only insertions and deletions ω ins = ω del = 1, ω rep = 0 d(x, y) x + y distance metric LCS(x, y) = ( x + y d(x, y))/2 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 23 / 27 Transpositions switch two adjacent characters can be simulated by delete and insert typos are often transpositions New rule for transposition α(ab, ba) = ω trans allows us to assign a weight different from ω ins + ω del Recursive formula that includes transposition: C 0,0 = 0 C i,j = min(c i 1,j 1 + α(x[i], y[j]), C i 1,j + α(x[i], ε), C i,j 1 + α(ε, y[j]), C i 2,j 2 + α(x[i 1]x[i], y[j 1]y[j])) where α(ab, cd) = if a d or b c, α(a, a) = 0 for all a Σ, and C 1,j = C i, 1 = C 2,j = C j, 2 =. Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 24 / 27

Edit Distance with Transposition Text Searching Compute distance between x =meal and y =mael using the edit distance with transposition (ω ins = ω del = ω rep = ω trans = 1) ε m a e l ε 0 1 2 3 4 m 1 0 1 2 3 e 2 1 1 1 2 a 3 2 1 1 2 l 4 3 2 2 1 Goal: search pattern p in text t ( p < t ) allow k errors match may start at any position of the text Difference to distance computation: C 0,j = 0 (instead of C 0,j = j, as text may start at any position) result: all C m,j k are endpoints of matches The value in red results from the transposition of ea to ae. Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 25 / 27 Text Searching p = survey t = surgery k = 2 ε s u r g e r y ε 0 0 0 0 0 0 0 0 s 1 0 1 1 1 1 1 1 u 2 1 0 1 2 2 2 2 r 3 2 1 0 1 2 2 3 v 4 3 2 1 1 2 3 3 e 5 4 3 2 2 1 2 3 y 6 5 4 3 3 2 2 2 Solutions: 3 matching positions with k 2 found. s u r v e y s u r g e s u r v e y s u r g e r s u r v e y s u r g e r y Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 27 / 27 Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 26 / 27 Alberto Apostolico and Zvi Galill. The longest common subsequence problem revisited. Algorithmica, 2(1):315 336, March 1987. Richard W. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 26(2):147 160, 1950. Vladimir I. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8 17, 1965. Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443 453, 1970. David Sankoff and Josef B. Kruskal, editors. Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 27 / 27

Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983. Augsten (Univ. Salzburg) Similarity Search WS 2017/2018 27 / 27