Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Similar documents
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

String Matching with Variable Length Gaps

arxiv: v1 [cs.ds] 9 Apr 2018

A Faster Grammar-Based Self-Index

arxiv: v1 [cs.ds] 15 Feb 2012

Theoretical Computer Science

Space-Efficient Re-Pair Compression

Approximate String Matching with Lempel-Ziv Compressed Indexes

String Search. 6th September 2018

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile

1 Alphabets and Languages

INF 4130 / /8-2017

Approximate String Matching with Ziv-Lempel Compressed Indexes

A Unifying Framework for Compressed Pattern Matching

Small-Space Dictionary Matching (Dissertation Proposal)

On Compressing and Indexing Repetitive Sequences

String Range Matching

arxiv: v2 [cs.ds] 8 Apr 2016

Succinct 2D Dictionary Matching with No Slowdown

Compressed Index for Dynamic Text

Smaller and Faster Lempel-Ziv Indices

Alternative Algorithms for Lyndon Factorization

Analysis and Design of Algorithms Dynamic Programming

Sparse Text Indexing in Small Space

String Indexing for Patterns with Wildcards

INF 4130 / /8-2014

Exact vs approximate search

A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Algorithm Design and Analysis

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

CS 530: Theory of Computation Based on Sipser (second edition): Notes on regular languages(version 1.1)

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

Processing Compressed Texts: A Tractability Border

Optimal-Time Text Indexing in BWT-runs Bounded Space

Opportunistic Data Structures with Applications

Analysis of Algorithms Prof. Karen Daniels

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa

String Matching with Variable Length Gaps

Efficient High-Similarity String Comparison: The Waterfall Algorithm

Computing Techniques for Parallel and Distributed Systems with an Application to Data Compression. Sergio De Agostino Sapienza University di Rome

Information Processing Letters. A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings

Text Compression. Jayadev Misra The University of Texas at Austin December 5, A Very Incomplete Introduction to Information Theory 2

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Sequence comparison by compression

2. Exact String Matching

Shift-And Approach to Pattern Matching in LZW Compressed Text

A Dichotomy for Regular Expression Membership Testing

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *

Two Algorithms for LCS Consecutive Suffix Alignment

arxiv: v3 [cs.ds] 6 Sep 2018

State Complexity of Neighbourhoods and Approximate Pattern Matching

arxiv: v1 [cs.ds] 30 Nov 2018

Compressing Kinetic Data From Sensor Networks. Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park

A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence

CS4800: Algorithms & Data Jonathan Ullman

Multimedia. Multimedia Data Compression (Lossless Compression Algorithms)

Deterministic Finite Automaton (DFA)

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES

HKN CS/ECE 374 Midterm 1 Review. Nathan Bleier and Mahir Morshed

Data Structures in Java

Oblivious String Embeddings and Edit Distance Approximations

Internal Pattern Matching Queries in a Text and Applications

SUPPLEMENTARY INFORMATION

Theoretical Computer Science. Efficient algorithms to compute compressed longest common substrings and compressed palindromes

Practical Indexing of Repetitive Collections using Relative Lempel-Ziv

Hash tables. Hash tables

Faster Regular Expression Matching. Philip Bille Mikkel Thorup

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Efficient Sequential Algorithms, Comp309

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm

CS 455/555: Mathematical preliminaries

Università degli studi di Udine

arxiv: v1 [cs.ds] 24 Sep 2015

Module 9: Tries and String Matching

Suffix Array of Alignment: A Practical Index for Similar Data

Lecture 1 : Data Compression and Entropy

arxiv: v1 [cs.ds] 8 Sep 2018

Complementary Contextual Models with FM-index for DNA Compression

Algorithms Design & Analysis. String matching

Regular Languages. Problem Characterize those Languages recognized by Finite Automata.

Theory of Computation (II) Yijia Chen Fudan University

arxiv: v2 [cs.ds] 5 Mar 2014

Implementing Approximate Regularities

Integer Sorting on the word-ram

On Pattern Matching With Swaps

Motivation for Arithmetic Coding

CSE 105 THEORY OF COMPUTATION

Lecture 3: Nondeterministic Finite Automata

Finite State Transducers

CSE 105 THEORY OF COMPUTATION

More Speed and More Compression: Accelerating Pattern Matching by Text Compression

Theoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts

On Adaptive Integer Sorting

Average Complexity of Exact and Approximate Multiple String Matching

T U M I N S T I T U T F Ü R I N F O R M A T I K. Text Indexing with Errors. Moritz G. Maaß and Johannes Nowak

Sublinear Approximate String Matching

Transcription:

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille IT University of Copenhagen Rolf Fagerberg University of Southern Denmark Inge Li Gørtz Technical University of Denmark

Approximate String Matching The edit distance between two strings is the minimum number of insertions, deletions, and substitutions needed to convert one string to the other. E.g., edit-distance( cocoa, cola ) = 2. Let P and Q be strings and let k (integer > 0) be an error threshold. The approximate string matching problem is to find all ending positions of substrings in Q whose edit distance to P is at most k.

Results Time Space Reference O(um) O(m) [Sellers1980] O(uk) O(m) [LV1989] O ( uk 4 m + u ) O(m) [CH2002] P = m and Q = u

Ziv-Lempel 1978 compression Q = ananas a 0 n Z = (0,a)(0,n)(1,n)(1,s) s 1 2 n Z = z 0 z 1 z 2 z 3 z 4 4 3

Approximate String Matching on ZL78 compressed texts Let P be a string and Z be a ZL78 compressed representation of a string Q. Given P and Z, the compressed approximate string matching problem is to solve the approximate string matching for P and Q without decompressing Z. Goal: Do it more efficiently than decompressing Z and using the best (uncompressed) approximate string matching algorithm.

Applications Textual data bases (e.g. DNA sequence collections) issues: Save space = keep data in compressed form. Search efficiently. Solution: Compressed string matching algorithms.

Results Let P = m and Z = n. Kärkkäinen, Navarro, and Ukkonen [KNU2003]: O(nmk + occ) time and O(nmk) space. Our result (Theorem 1): For any parameter τ 1: O(n(τ + m + t(m, 2m + 2k, k)) + occ) expected time and O(n/τ + m + s(m, 2m + 2k, k)) + occ) space.

Example Results Time Space Reference O(nmk + occ) O(nmk) [KNU2003] O(nmk + occ) O ( n mk + m + occ ) LV + τ = mk This O(nk 4 + nm + occ) O ( ) n k 4 + m + m + occ CH + τ = k 4 + m paper P = m and Z = n

Selecting Compression Elements Z = For parameter τ 1, select a subset C of the compression elements of Z such that: C = O(n/τ). From any compression element z i, the distance (minimum number of references) to any compression element in C is at most 2τ.

Selecting Compression Elements Z = Maintain C using dynamic perfect hashing while scanning Z from left-to-right. Initially, set C = {z 0 }. z i+1 To process element follow references until we encounter y C: If the distance l from z i+1 to y is less than 2τ we are done. Otherwise ( l = 2τ), insert element the element at distance τ into C.

Selecting Compression Elements Z = Lemma: For any parameter τ 1, C is constructed in O(nτ) expected time and O(n/τ) space.

Computing Matches Q =... phrase(z i 1 ) phrase(z i )... Strategy: Process Z from left-to-right. z i At we compute all matches ending in the substring encoded by. z i

Computing Overlapping Matches rsuf(z i 1 ) rpre(z i ) Q =... phrase(z i 1 ) phrase(z i )... Let [u i, u i + l i 1] be the positions in Q of phrase(z i ). Goal: Find all overlapping matches for and ending in [u i, u i + l i 1]. z i, i.e., the matches starting before u i Decompress substrings rpre(z i ) and rsuf(z i 1 ) of length m + k around u i. Run favorite (uncompressed) approximate string matching algorithm to find matches of P in rsuf(z i 1 ) rpre(z i ). Add offset to these to get the overlapping matches for z i.

Computing the Relevant Prefix and Suffix z i m + k For parameter τ 1, select a subset C of the compression elements of Z according to Lemma 1. For each element in C at distance more than m + k from z 0 add shortcut to element at distance m + k.

Computing the Relevant Prefix z i m + k Follow references to nearest element in C. Follow shortcut if present. Compute the relevant prefix by decompressing length m + k substring.

Computing the Relevant Suffix z i m + k Follow references to decompress substring of length m + k. If the phrase is shorter than m + k, recursively apply to z i 1 until we have m + k characters.

Analysis Time = preprocess + n(find nearest element + decompress + match) = O(nτ + n(τ + m + t(m, 2m + 2k, k)) Space = preprocess + decompress + match = O(n/τ + m + s(m, 2m + 2k, k))

Computing Internal Matches rsuf (z i )... phrase(z i 1 ) phrase(z i )... z i Goal: Find all internal matches for, i.e., all matches starting and ending within [u i, u i + l i 1]. Compute and store all the internal match sets indexed by compression elements using dynamic perfect hashing. Decompress substring rsuf (z i ) of length min(l i, m + k) ending at u i + l i 1. Internal matches for z i = (internal matches for reference(z i )) (matches of P in rsuf (z i )

Analysis Time = n (decompress + match + internal matches) = O(n(m + t(m, m + k, k)) + occ) Space = decompress + match + total number of internal matches = O(m + s(m, m + k, k) + occ)

Putting the Pieces Together Merging overlapping and internal matches we get all matches for within [u i, u i + l i 1]. z i ending Implies Theorem 1: For any parameter τ 1: O(n(τ + m + t(m, 2m + 2k, k)) + occ) expected time and O(n/τ + m + s(m, 2m + 2k, k)) + occ) space. Does not hold for ZLW compressed texts, unless Ω(n) space is used. For Ω(n) space the bounds hold in the worst-case and work for both ZL78 and ZLW.

Regular Expression Matching A regular expression is a generalized pattern composed from simple characters using union, concatenation, and Kleene star. Given a regular expression R and a string Qthe regular expression matching problem is to find all ending positions of substrings in Qthat matches a string in the language generated by. R

Regular Expression Matching Let R = m and Q = u. Classic solution [Thompson1968]: O(um) time and O(m) space. Several improvements based on the Four-Russian technique or word-level parallelism [Myers1992, NR2004, BFC2005, Bille2006].

Compressed Regular Expression Matching Let R = m and Z = n. Navarro [Navarro2003] simplified and without word-level parallel techniques: O(nm 2 + occ m log m) time and O(nm 2 ) space. Our result (Theorem 2): For any parameter τ 1: O(nm(m + τ) + occ m log m) time and O(nm 2 /τ + nm) space. E.g. τ = m gives O(nm 2 + occ m log m) time and O(nm) space.

Remarks Compressed strings are large and therefore feasible for large texts. Ω(n) space space may not be Our result for compressed approximate string matching is one of the very few algorithms for compressed matching that uses o(n) space. More sublinear space compressed string matching algorithms are needed!