Efficient High-Similarity String Comparison: The Waterfall Algorithm


Efficient High-Similarity String Comparison: The Waterfall Algorithm
Alexander Tiskin
Department of Computer Science, University of Warwick
http://go.warwick.ac.uk/alextiskin

1 Semi-local string comparison
2 The transposition network method

Semi-local LCS and edit distance

Consider strings (= sequences) over an alphabet of size σ.
Distinguish contiguous substrings and not necessarily contiguous subsequences.
Special cases of substring: prefix, suffix.
Notation: strings a, b of length m, n respectively.
Assume where necessary: m ≤ n, with m, n reasonably close.
The longest common subsequence (LCS) score: the length of the longest string that is a subsequence of both a and b; equivalently, the alignment score with score(match) = 1 and score(mismatch) = 0.
In biological terms, loss-free alignment (unlike lossy BLAST).

Semi-local LCS and edit distance

The LCS problem: give the LCS score for a vs b.

LCS running time:
O(mn)                          [Wagner, Fischer: 1974]
O(mn / log² n), σ = O(1)       [Masek, Paterson: 1980] [Crochemore+: 2003]
O(mn (log log n)² / log² n)    [Paterson, Dančík: 1994] [Bille, Farach-Colton: 2008]

Running time varies depending on the RAM model version. We assume the word-RAM with word size log n (where it matters).

Semi-local LCS and edit distance

LCS computation by dynamic programming:
lcs(a, ε) = 0
lcs(ε, b) = 0
lcs(aα, bβ) = max(lcs(aα, b), lcs(a, bβ))   if α ≠ β
lcs(aα, bβ) = lcs(a, b) + 1                 if α = β

      d  e  f  i  n  e
   0  0  0  0  0  0  0
d  0  1  1  1  1  1  1
e  0  1  2  2  2  2  2
s  0  1  2  2  2  2  2
i  0  1  2  2  3  3  3
g  0  1  2  2  3  3  3
n  0  1  2  2  3  4  4

lcs(define, design) = 4
LCS(a, b) can be traced back through the table at no extra asymptotic cost.
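The recurrence and the traceback translate directly into code. A minimal sketch (Python, not part of the talk; function names are mine):

```python
def lcs_table(a: str, b: str):
    """T[i][j] = lcs(a[:i], b[:j]), computed by the recurrence above in O(mn) time."""
    m, n = len(a), len(b)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                T[i][j] = max(T[i - 1][j], T[i][j - 1])
    return T

def lcs_traceback(a: str, b: str) -> str:
    """Recover one LCS by walking the table back from the bottom-right corner."""
    T = lcs_table(a, b)
    i, j, out = len(a), len(b), []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif T[i - 1][j] >= T[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For the slide's example, lcs_table("design", "define") reproduces the table above, and the traceback yields a common subsequence of length 4.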

Semi-local LCS and edit distance

LCS on the alignment graph (directed, acyclic)
[Alignment grid for a = BAABCBCA vs b = BAABCABCABACA; blue edges score 0, red edges score 1]
score(BAABCBCA, BAABCABCABACA) = len(BAABCBCA) = 8
LCS score = the highest score of a path from the top-left to the bottom-right corner.

Semi-local LCS and edit distance

LCS: dynamic programming [WF: 1974]
Sweep the cells in any order compatible with the cell dependencies.
Cell update: time O(1). Overall time: O(mn).

Semi-local LCS and edit distance

LCS: micro-block dynamic programming [MP: 1980; BF: 2008]
Sweep the cells in micro-blocks, in any order compatible with the block dependencies.
Micro-block size: t = O(log n) when σ = O(1); t = O(log n / log log n) otherwise.
Micro-block interface: O(t) characters, each of O(log σ) bits, which can be reduced to O(log t) bits; O(t) small integers, each of O(1) bits.
Micro-block update: time O(1), by precomputing all possible interfaces.
Overall time: O(mn / log² n) when σ = O(1), O(mn (log log n)² / log² n) otherwise.

Semi-local LCS and edit distance

"Begin at the beginning," the King said gravely, "and go on till you come to the end: then stop."
L. Carroll, Alice in Wonderland
The standard approach: dynamic programming.

Semi-local LCS and edit distance

Sometimes dynamic programming can be run from both ends for extra flexibility.
Is there a better, fully flexible alternative (e.g. for comparing compressed strings, or comparing strings dynamically or in parallel)?

Semi-local LCS and edit distance

The semi-local LCS problem: give the (implicit) matrix of O((m + n)²) LCS scores:
string-substring LCS: string a vs every substring of b
prefix-suffix LCS: every prefix of a vs every suffix of b
suffix-prefix LCS: every suffix of a vs every prefix of b
substring-string LCS: every substring of a vs string b
Cf.: dynamic programming gives prefix-prefix LCS.

Semi-local LCS and edit distance

Semi-local LCS on the alignment graph
[Alignment grid for a = BAABCBCA vs b = BAABCABCABACA; blue edges score 0, red edges score 1]
score(BAABCBCA, CABCABA) = len(ABCBA) = 5
String-substring LCS: all highest-score top-to-bottom paths.
Semi-local LCS: all highest-score boundary-to-boundary paths.

Score matrices and seaweed matrices

The score matrix H:

  0   1   2   3   4   5   6   6   7   8   8   8   8   8
 -1   0   1   2   3   4   5   5   6   7   7   7   7   7
 -2  -1   0   1   2   3   4   4   5   6   6   6   6   7
 -3  -2  -1   0   1   2   3   3   4   5   5   6   6   7
 -4  -3  -2  -1   0   1   2   2   3   4   4   5   5   6
 -5  -4  -3  -2  -1   0   1   2   3   4   4   5   5   6
 -6  -5  -4  -3  -2  -1   0   1   2   3   3   4   4   5
 -7  -6  -5  -4  -3  -2  -1   0   1   2   2   3   3   4
 -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   3   4
 -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4
-10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
-11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2
-12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1
-13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0

a = BAABCBCA, b = BAABCABCABACA
H(i, j) = score(a, b[i : j]); e.g. H(4, 11) = 5
H(i, j) = j − i if i > j
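A brute-force rendering of this definition makes H concrete (a Python sketch, not from the talk, with hypothetical helper names; the algorithms in the talk compute the same matrix far faster):

```python
def lcs_len(a: str, b: str) -> int:
    # plain dynamic programming, O(|a||b|) time, rolling rows
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def score_matrix(a: str, b: str):
    """H(i, j) = score(a, b[i:j]) for 0 <= i, j <= n, with H(i, j) = j - i for i > j."""
    n = len(b)
    return [[(j - i) if i > j else lcs_len(a, b[i:j])
             for j in range(n + 1)] for i in range(n + 1)]
```

On the slide's example, score_matrix("BAABCBCA", "BAABCABCABACA") reproduces the 14 × 14 matrix above, e.g. H[4][11] = 5.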

Score matrices and seaweed matrices

Semi-local LCS: output representation and running time

Representation size vs query time:
O(n²), query O(1): trivial
O(m^(1/2) n), query O(log n): string-substring [Alves+: 2003]
O(n), query O(n): string-substring [Alves+: 2005]
O(n log n), query O(log² n): [T: 2006] ... or any 2D orthogonal range counting data structure

Running time:
O(mn²): naive
O(mn): string-substring [Schmidt: 1998; Alves+: 2005]
O(mn): [T: 2006]
O(mn / log^(0.5) n): [T: 2006]
O(mn (log log n)² / log² n): [T: 2007]

Score matrices and seaweed matrices

The score matrix H and the seaweed matrix P
H(i, j): the number of matched characters for a vs substring b[i : j]
j − i − H(i, j): the number of unmatched characters
Properties of the matrix j − i − H(i, j): it is simple unit-Monge; therefore j − i − H(i, j) = P^Σ(i, j), the distribution matrix of a permutation matrix P.
P is the seaweed matrix, giving an implicit representation of H.
Range tree for P: memory O(n log n), query time O(log² n).
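The seaweed matrix can be read off a computed H as the density (cross-difference) of f(i, j) = j − i − H(i, j). A sketch (Python, using a naive H; the indexing convention is mine, not the talk's, and within this index window P is only a partial permutation, since some seaweeds leave through the boundary):

```python
def lcs_len(a: str, b: str) -> int:
    # plain dynamic programming, O(|a||b|) time
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def seaweed_matrix(a: str, b: str):
    """Density of f(i, j) = j - i - H(i, j): a 0/1 matrix P with
    f(i, j) = sum of P[i'][j'] over i' >= i, j' < j (by telescoping)."""
    n = len(b)
    H = [[(j - i) if i > j else lcs_len(a, b[i:j]) for j in range(n + 1)]
         for i in range(n + 1)]
    f = lambda i, j: j - i - H[i][j]
    return [[f(i, j + 1) - f(i, j) - f(i + 1, j + 1) + f(i + 1, j)
             for j in range(n)] for i in range(n)]
```

That the density has only 0/1 entries is exactly the simple unit-Monge property; summing P over i′ ≥ 4, j′ < 11 on the slide's example gives P^Σ(4, 11) = 2, hence H(4, 11) = 11 − 4 − 2 = 5.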

Score matrices and seaweed matrices

The score matrix H and the seaweed matrix P
[The same 14 × 14 score matrix H for a = BAABCBCA, b = BAABCABCABACA, annotated: blue where the difference in H is 0, red where the difference in H is 1, green cells where P(i, j) = 1]
H(i, j) = score(a, b[i : j]); e.g. H(4, 11) = 5
H(i, j) = j − i if i > j
H(i, j) = j − i − P^Σ(i, j)

Score matrices and seaweed matrices

The score matrix H and the seaweed matrix P
a = BAABCBCA, b = BAABCABCABACA
H(4, 11) = 11 − 4 − P^Σ(4, 11) = 11 − 4 − 2 = 5

Score matrices and seaweed matrices

The seaweed braid in the alignment graph
[Alignment grid for a = BAABCBCA vs b = BAABCABCABACA with the seaweed braid drawn through it]
H(4, 11) = 11 − 4 − P^Σ(4, 11) = 11 − 4 − 2 = 5
P(i, j) = 1 corresponds to a seaweed running from top position i to bottom position j.
Also define top→right, left→right and left→bottom seaweeds.
This gives a bijection between the top-left and bottom-right boundaries of the graph.

Score matrices and seaweed matrices

Seaweed braid: a highly symmetric object (an element of the 0-Hecke monoid of the symmetric group).
It can be built recursively by assembling subbraids from separate parts.
Highly flexible: local alignment, compression, parallel computation...

Weighted alignment

The LCS problem is a special case of the weighted alignment score problem, with weighted matches (w_M), mismatches (w_X) and gaps (w_G).
LCS score: w_M = 1, w_X = w_G = 0
Levenshtein score: w_M = 2, w_X = 1, w_G = 0
If w_M, w_X, w_G are rational numbers, the alignment score is equivalent to an LCS score on blown-up strings.
Edit distance: the minimum cost of transforming a into b by weighted character edits (insertion, deletion, substitution). It corresponds to a weighted alignment score with w_M = 0, insertion/deletion weight w_G, substitution weight w_X.
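The weights drop straight into the dynamic program. A sketch (Python, not from the talk; assumes the score is maximised and boundary gaps cost w_g each):

```python
def alignment_score(a: str, b: str, w_m: float, w_x: float, w_g: float) -> float:
    """Maximum weighted alignment score: matches score w_m, mismatches w_x, gaps w_g."""
    m, n = len(a), len(b)
    T = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):          # align a prefix of b against nothing
        T[0][j] = j * w_g
    for i in range(1, m + 1):
        T[i][0] = i * w_g
        for j in range(1, n + 1):
            sub = w_m if a[i - 1] == b[j - 1] else w_x
            T[i][j] = max(T[i - 1][j - 1] + sub,   # (mis)match
                          T[i - 1][j] + w_g,       # gap in b
                          T[i][j - 1] + w_g)       # gap in a
    return T[m][n]
```

With w_M = 1, w_X = w_G = 0 this is the LCS score; with w_M = 2, w_X = 1, w_G = 0 it is the Levenshtein score, e.g. 11 for BAABCBCA vs CABCABA as on the next slide.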

Weighted alignment

Weighted alignment graph
[Alignment grid for a = BAABCBCA vs b = BAABCABCABACA; blue edges score 0, solid red edges score 2, dotted red edges score 1]
Levenshtein score(BAABCBCA, CABCABA) = 11

Weighted alignment

Alignment graph for blown-up strings
[Alignment grid for the blown-up strings $B $A $A $B $C $B $C $A vs $B $A $A $B $C $A $B $C $A $B $A $C $A; blue edges score 0, red edges score 0.5 or 1]
Levenshtein score(BAABCBCA, CABCABA) = 2 × 5.5 = 11

Weighted alignment

Rational-weighted semi-local alignment reduces to semi-local LCS on blown-up strings.
[Alignment grid for the blown-up strings $B $A $A $B $C $B $C $A vs $B $A $A $B $C $A $B $C $A $B $A $C $A]
Let w_M = 1, w_X = µ/ν, w_G = 0.
The blow-up increases the complexity by a factor of ν² (this can be reduced to ν).

1 Semi-local string comparison
2 The transposition network method

The transposition network method

Transposition networks
Comparison network: a circuit of comparators; a comparator sorts its two inputs, outputting them in a prescribed order.
Comparison networks are traditionally used for non-branching merging/sorting.
Classical comparison networks (number of comparators):
merging: O(n log n) [Batcher: 1968]
sorting: O(n log² n) [Batcher: 1968]; O(n log n) [Ajtai+: 1983]
Comparison networks are visualised by wire diagrams.
Transposition network: all comparisons are between adjacent wires.

The transposition network method

Seaweed combing as a transposition network
[Wire diagram for comparing A C B C A B C A against the other string; input values ±1, ±3, ±5, ±7, anti-sorted along the boundary]
Character mismatches correspond to comparators.
The inputs are anti-sorted (sorted in reverse); each value traces a seaweed.

The transposition network method

Global LCS: transposition network with binary input
[Wire diagram as before, with binary inputs 0 0 0 0 and 1 1 1 1 along the boundary; outputs 0 0 1 0 0 1 1 1]
The inputs are still anti-sorted, but may no longer be distinct.
A comparison between equal values is indeterminate.
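The binary network can be simulated directly, one grid column at a time. A sketch (Python, not from the talk): horizontal bits start as 1s on the left boundary and vertical bits as 0s on top; a match swaps the two bits, a mismatch compare-exchanges them (OR continues horizontally, AND goes down), and the LCS score is the number of 0 bits leaving on the right.

```python
def lcs_by_network(a: str, b: str) -> int:
    """LCS score by tracing the binary transposition network over the grid."""
    s = [1] * len(a)            # horizontal wire values, one per character of a
    for ch in b:                # sweep the columns of the grid left to right
        c = 0                   # vertical wire value entering the column from the top
        for i, ai in enumerate(a):
            if ai == ch:        # match cell: swap
                s[i], c = c, s[i]
            else:               # mismatch cell: comparator
                s[i], c = s[i] | c, s[i] & c
    return len(a) - sum(s)      # LCS = number of 0 bits on the right boundary
```

This cell-by-cell simulation is exactly what the bit-parallel algorithm later in the talk packs into word operations.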

The transposition network method

Parameterised string comparison
String comparison sensitive e.g. to:
low similarity: small λ = LCS(a, b)
high similarity: small κ = dist_LCS(a, b) = m + n − 2λ
Can also use the weighted alignment score or edit distance.
Assume m = n; therefore κ = 2(n − λ).

The transposition network method

Low-similarity comparison (small λ): the set of matches is sparse, but we may need to look at all of them; preprocess the matches for fast searching, in time O(n log σ).
High-similarity comparison (small κ): the set of matches may be dense, but we only need to look at a small subset; no preprocessing needed, linear search is OK.
Flexible comparison: sensitive to both high and low similarity, e.g. by running both comparison types alongside each other.

The transposition network method

Parameterised string comparison: running time
Low-similarity, after preprocessing in O(n log σ):
O(nλ) [Hirschberg: 1977] [Apostolico, Guerra: 1985] [Apostolico+: 1992]
High-similarity, no preprocessing:
O(nκ) [Ukkonen: 1985] [Myers: 1986]
Flexible:
O(λκ log n), no preprocessing [Myers: 1986; Wu+: 1990]
O(λκ), after preprocessing [Rick: 1995]
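One way to see how the O(nκ) bound arises is Ukkonen-style band doubling: run the DP inside a diagonal band and double its width until the answer provably fits. A sketch (Python, not the waterfall algorithm itself, and for the LCS distance rather than general edit distance):

```python
def dist_lcs_banded(a: str, b: str) -> int:
    """dist_LCS(a, b) = m + n - 2*LCS(a, b), by DP restricted to a diagonal band
    of half-width t, doubling t until the result fits; total time O(n * kappa)."""
    m, n = len(a), len(b)
    t = max(1, abs(m - n))        # the band must at least reach cell (m, n)
    while True:
        INF = m + n + 1
        D = [[INF] * (n + 1) for _ in range(m + 1)]
        D[0][0] = 0
        for i in range(m + 1):
            for j in range(max(0, i - t), min(n, i + t) + 1):
                if i == 0 and j == 0:
                    continue
                best = INF
                if i and j and a[i - 1] == b[j - 1]:
                    best = D[i - 1][j - 1]          # free diagonal step on a match
                if i:
                    best = min(best, D[i - 1][j] + 1)   # delete from a
                if j:
                    best = min(best, D[i][j - 1] + 1)   # insert from b
                D[i][j] = best
        if D[m][n] <= t:          # an optimal path within distance t stays in the band
            return D[m][n]
        t *= 2
```

The successive bands cost O(nt) each, and the widths form a geometric series ending near κ, giving O(nκ) in total.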

The transposition network method

Parameterised string comparison: the waterfall algorithm
Low-similarity: O(nλ); high-similarity: O(nκ)
[Binary transposition network diagram, with the 0 and 1 values traced through the grid]
Trace the 0s through the network in contiguous blocks and gaps.

The transposition network method

The dynamic LCS problem: maintain the current LCS score under updates to one or both input strings.
Both input strings are streams, updated on-line: appending characters at the left or right; deleting characters at the left or right.
Assume for simplicity m = Θ(n).
Goal: linear time per update
O(n) per update of a (n = |b|)
O(m) per update of b (m = |a|)
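For the easiest of these update models, appending at the right of b only, maintaining a single DP column already meets the O(m)-per-update goal. A sketch (Python, not from the talk; the seaweed machinery below is what handles the harder models):

```python
class AppendLCS:
    """Maintain LCS(a, b) while characters are appended to the right of b.
    Keeps the current DP column (LCS of every prefix of a vs the whole b),
    so each append costs O(m) time, where m = |a|."""

    def __init__(self, a: str):
        self.a = a
        self.col = [0] * (len(a) + 1)   # col[i] = LCS(a[:i], b), b initially empty

    def append(self, ch: str) -> int:
        new = [0] * (len(self.a) + 1)
        for i, ai in enumerate(self.a, 1):
            if ai == ch:
                new[i] = self.col[i - 1] + 1
            else:
                new[i] = max(new[i - 1], self.col[i])
        self.col = new
        return self.col[-1]             # current LCS(a, b)
```

Feeding b = BAABCABCABACA character by character against a = BAABCBCA, the maintained score reaches 5 after BAABC and 8 at the end.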

The transposition network method

Dynamic LCS in linear time: update models
left end: —        right end: app+del   standard DP [Wagner, Fischer: 1974]
left end: app      right end: app       a fixed [Landau+: 1998] [Kim, Park: 2004]
left end: app      right end: app       [Ishida+: 2005]
left end: app+del  right end: app+del   [T: NEW]
Main idea: for append only, maintain the seaweed matrix P_{a,b}; for append + delete, maintain a partial seaweed layout by tracing a transposition network.

The transposition network method

Bit-parallel string comparison: string comparison using standard instructions on words of size w.
Running time: O(mn/w) [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001]

The transposition network method

Bit-parallel string comparison: the binary transposition network
In every cell: input bits s, c; output bits s′, c′; match/mismatch flag µ (µ = 1 on a match).
Mismatch (µ = 0), a comparator: s′ = s ∨ c, c′ = s ∧ c
Match (µ = 1), a swap: s′ = c, c′ = s
Both cases by arithmetic: 2c′ + s′ = (s + (s ∧ µ) + c) ∨ (s − (s ∧ µ))
At word level: S ← (S + (S ∧ M)) ∨ (S − (S ∧ M)), where S, M are the words of bits s, µ; the carries of the addition play the role of the vertical bits c.
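The word-level rule is the core of the bit-parallel LCS algorithm [Crochemore+: 2001]. A sketch (Python, with arbitrary-precision integers standing in for machine words, so the O(mn/w) bound is only conceptual here):

```python
def lcs_bitparallel(a: str, b: str) -> int:
    """LCS score via the word-level update S <- (S + (S & M)) | (S - (S & M))."""
    m = len(a)
    mask = (1 << m) - 1
    # match masks: bit i of M[ch] is set iff a[i] == ch
    M = {}
    for i, ch in enumerate(a):
        M[ch] = M.get(ch, 0) | (1 << i)
    S = mask                            # all ones: no matches seen yet
    for ch in b:                        # one word update per character of b
        U = S & M.get(ch, 0)
        S = ((S + U) | (S - U)) & mask  # carries act as the vertical bits
    return m - bin(S).count("1")        # LCS = number of cleared bits
```

Since U is a bitwise subset of S, S − U equals S ∧ ¬U, so this matches the comparator/swap truth tables cell for cell.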

The transposition network method

High-similarity bit-parallel string comparison
κ = dist_LCS(a, b); assume κ odd, m = n.
Run the waterfall algorithm within a diagonal band of width κ + 1: time O(nκ/w).
The band's waterfall is supported from below by separator matches.

The transposition network method

High-similarity bit-parallel multi-string comparison: a vs b_0, ..., b_{r−1}
κ_i = dist_LCS(a, b_i) ≤ κ for 0 ≤ i < r
Run waterfalls within r diagonal bands of width κ + 1: time O(nrκ/w).
Each band's waterfall is supported from below by separator matches.