Small-Space Dictionary Matching (Dissertation Proposal)


Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012

Problem Definition: Dictionary Matching
Input: a dictionary $D = \{P_1, P_2, \ldots, P_d\}$ containing $d$ patterns, and a text $T$ of length $n$.
Output: all positions in the text at which a dictionary pattern occurs.

Applications of Dictionary Matching
* Searching for specific phrases in a book
* Scanning a file for virus signatures
* Network intrusion detection systems
* Searching a DNA sequence for a set of motifs

Small-Space Challenge: Limited storage capacity in devices. Massive proliferation of data. Goal: efficient algorithms with respect to both time and space.

Research Plan Thesis Goals: 1 Succinct 2D dictionary matching 2 Succinct dynamic 2D dictionary matching 3 for succinct 1D dictionary matching

Small-Space 1D
1D single pattern matching in linear time and O(1) working space:
* Galil and Seiferas (1981)
* Crochemore and Perrin (1991)
* Rytter (2003)
* Gasieniec and Kolpakov (2004)

Empirical Entropy
How is space measured in succinct data structures and algorithms? By empirical entropy.
Definition: The empirical entropy of a string, $H_0$ or $H_k$, describes the minimum number of bits needed to encode the string within context.
$$H_0(S) = -\sum_{i=1}^{\sigma} \frac{n_i}{n} \log \frac{n_i}{n}$$
where $n = |S|$, $\sigma$ is the alphabet size, and $n_i$ is the number of occurrences of the $i$-th symbol in $S$.

Empirical Entropy
$H_k(S)$: a lower bound on the compression achievable for each symbol using a code that depends on the $k$ symbols preceding it.
$$H_k(S) = \frac{1}{n} \sum_{w \in \Sigma^k} |w_S| \, H_0(w_S)$$
where $w_S$ is the sequence of characters following occurrences of $w$ in $S$.

Empirical Entropy
For any string $S$ and $k \ge 0$: $H_k(S) \le H_{k-1}(S) \le \cdots \le H_0(S) \le \log |\Sigma|$.
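The entropy definitions above can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names `h0` and `hk` are illustrative, and logarithms are taken base 2 so the result is in bits per symbol.

```python
import math
from collections import Counter, defaultdict


def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def hk(s, k):
    """k-th order empirical entropy of s, in bits per symbol.

    For every length-k context w, collect w_S (the symbols that follow
    occurrences of w in s) and add |w_S| * H_0(w_S); divide by n at the end.
    """
    if not s:
        return 0.0
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(len(s) - k):
        followers[s[i:i + k]].append(s[i + k])
    return sum(len(ws) * h0(ws) for ws in followers.values()) / len(s)


if __name__ == "__main__":
    text = "mississippi"
    for order in range(3):
        print(order, round(hk(text, order), 3))
```

On a repetitive string the higher-order values drop quickly, which is why bounds stated in terms of $H_k$ can be much smaller than $n \log |\Sigma|$ bits.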

Small-Space 1D
1D dictionary matching in small space:

Space (bits)                         | Search Time                   | Reference
O(l log l)                           | O(n + occ)                    | Aho-Corasick (1975)
O(l)                                 | O((n + occ) log^2 l)          | Chan et al. (2007)
l H_k(D) + o(l log σ) + O(d log l)   | O(n (log^ε l + log d) + occ)  | Hon et al. (2008)
l (H_0 + O(1)) + O(d log(l/d))       | O(n + occ)                    | Belazzougui (2010)
l H_k(D) + O(l)                      | O(n + occ)                    | Hon et al. (2010)

d is the number of patterns in D. l is the total size of the dictionary.

Small-Space 1D Dictionary
Compressed AC automaton [Hon et al. (2010)]: no slowdown!
Separates the three functions of the AC automaton and encodes each function differently:
* goto
* report
* fail
Space complexity meets $H_k(D)$, the kth order empirical entropy.
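For reference, here is a minimal Python sketch of the classical, uncompressed Aho-Corasick automaton with its three functions kept as three separate tables. It is not the compressed encoding of Hon et al. (2010); the names `build_automaton` and `search` are illustrative assumptions.

```python
from collections import deque


def build_automaton(patterns):
    """Return the goto, fail and report tables of the Aho-Corasick automaton."""
    goto = [{}]    # goto[state][ch] -> next state (trie edges)
    fail = [0]     # fail[state] -> state of the longest proper suffix in the trie
    report = [[]]  # report[state] -> indices of patterns ending at this state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                report.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        report[state].append(idx)
    # Breadth-first traversal: fail links of shallower states are ready first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            report[nxt] = report[nxt] + report[fail[nxt]]
            queue.append(nxt)
    return goto, fail, report


def search(text, patterns):
    """Return (start position, pattern index) for every occurrence in text."""
    goto, fail, report = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in report[state]:
            hits.append((i - len(patterns[idx]) + 1, idx))
    return hits


if __name__ == "__main__":
    print(search("ushers", ["he", "she", "his", "hers"]))
```

Storing the tables as Python dictionaries and lists costs far more than the O(l log l) bits of a pointer-based automaton; the sketch only illustrates the division of labor among goto, fail and report.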

Dynamic Dictionary Matching
1D dynamic dictionary matching:

Preprocess Time | Update Time             | Search Time                     | Ref.
O(l log l)      | O(p log l)              | O((n + occ) log l)              | AF91
O(l log σ)      | O(p log l)              | O((n + occ) log l)              | IS94
O(l log σ)      | O(p log l / log log l)  | O((n + occ) log l / log log l)  | AFIPS95
O(l)            | O(p)                    | O(n + occ)                      | SV96

d is the number of patterns in D. l is the total size of the dictionary. p is the size of a pattern to add / remove.

Dynamic Dictionary Matching
Previous dynamic algorithms require O(l log l) extra space.
1D dynamic dictionary matching in small space:
* Chan et al. (2007): update time O(p log^2 l); search time O((n + occ) log^2 l); space O(lσ).
* Hon et al. (2009): update time O(p log σ + log l); search time O(n log l + occ); space meets kth order empirical entropy.

Small-Space 2D
2D linear-time single pattern matching, Crochemore et al. (1995):
* Preprocessing: linear time within log space.
* Text scanning: linear time, O(1) extra space.
Using small-space 2D single pattern matching for a set of patterns requires several scans of the text.

2D Dictionary Matching
Existing 2D dictionary matching algorithms all require working space proportional to the input:
* Bird (1977) / Baker (1978): for rectangular patterns of the same size in one dimension. Linearize the dictionary and label the text, then use 1D dictionary matching. Linear time and space.
* Amir, Farach (1992): for square patterns of different sizes. Linearization around diagonals; uses a suffix tree and an AC automaton. Linear time and space.
* Giancarlo (1995): for square patterns of different sizes. Uses a 2D suffix tree. Slower than linear time; linear space.
* Idury, Schaffer (1995): for rectangular patterns of different sizes. Multidimensional range searching and dictionary matching. Slower than linear text processing; linear space.

Dynamic 2D Dictionary Matching
For square patterns:

Update Time        | Search Time                    | Ref.
O(p^2 log |D|)     | O((n^2 + occ) log |D|)         | AFIPS (1995)
O(p^2 log^2 |D|)   | O((n^2 log d + occ) log |D|)   | Giancarlo95
O(p^2 + p log d)   | O((n^2 + occ) log d)           | CL96

Pattern $P_i$ is a square of size $p_i \times p_i$, with $|D| = \sum_{i=1}^{d} p_i^2$ and $\|D\| = \sum_{i=1}^{d} p_i$. $P$, of size $p \times p$, is added / removed.

Dynamic 2D Dictionary Matching
For rectangular patterns of different sizes, Idury, Schaffer (1995):
* Groups patterns by size.
* Multidimensional range searching and dictionary matching.
* O(log^4 |D|) slowdown in all stages: preprocessing the dictionary, scanning the text, and updating the dictionary.

Research Plan
Thesis Goals:
1 Succinct 2D dictionary matching
  * Develop first efficient algorithm
  * Linear Time
  * Small-Space
2 Succinct dynamic 2D dictionary matching
3 for succinct 1D dictionary matching

Problem Definition: 2D Dictionary Matching
Input: a dictionary of $d$ patterns, each of size $m \times m$, and a text $T$ of size $n \times n$.
Output: all positions in the text at which a dictionary pattern occurs.

2D Dictionary Matching
Bird / Baker:
* Convert 2D data to a 1D representation.
* Name the pattern rows.
* Name the text positions.
* Use 1D dictionary matching to find pattern occurrences.
Our work: mimic the Bird/Baker algorithm in small space.
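The Bird/Baker steps can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the small-space algorithm of the proposal: all patterns are m × m lists of strings, the text is a list of equal-length strings, and both 1D matching steps are replaced by dictionary lookups on slices, so the sketch is not linear time. The function name `bird_baker` is hypothetical.

```python
def bird_baker(patterns, text):
    """Return (row, column, pattern index) for every 2D occurrence."""
    m = len(patterns[0])

    # Step 1: name the pattern rows; identical rows share one name.
    row_name = {}
    for pat in patterns:
        for row in pat:
            row_name.setdefault(row, len(row_name))

    # 1D representation of each pattern: its column of row names.
    by_names = {}
    for idx, pat in enumerate(patterns):
        key = tuple(row_name[row] for row in pat)
        by_names.setdefault(key, []).append(idx)

    # Step 2: name the text positions. labels[r][c] is the name of the
    # pattern row that starts at text position (r, c), or -1 if none does.
    n_rows, n_cols = len(text), len(text[0])
    labels = [
        [row_name.get(text[r][c:c + m], -1) for c in range(n_cols - m + 1)]
        for r in range(n_rows)
    ]

    # Step 3: 1D matching down each column of names.
    hits = []
    for r in range(n_rows - m + 1):
        for c in range(n_cols - m + 1):
            key = tuple(labels[r + k][c] for k in range(m))
            for idx in by_names.get(key, []):
                hits.append((r, c, idx))
    return hits


if __name__ == "__main__":
    pats = [["ab", "ba"], ["ba", "ab"]]
    print(bird_baker(pats, ["aba", "bab", "aba"]))
```

The `labels` table is as large as the text, which is exactly the working space proportional to the input that a small-space variant has to avoid.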

1D Periodicity
Definition: A string $p$ is periodic in $u$ if $p = u^k u'$, where $u'$ is a proper prefix of $u$, $u$ is primitive, and $k \ge 2$.
Example: aabcaabcaabcaa $= (aabc)^3\,aa$.
We divide patterns into 2 groups based on 1D periodicity. In each case, there are different difficulties to overcome.
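A short Python sketch of the periodicity test, assuming the standard border (KMP failure function) computation; the function names are illustrative and not taken from the proposal.

```python
def smallest_period(p):
    """Length of the shortest period of p, via the KMP failure function."""
    border = [0] * (len(p) + 1)  # border[i] = longest proper border of p[:i]
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = border[k]
        if p[i] == p[k]:
            k += 1
        border[i + 1] = k
    return len(p) - border[len(p)]


def periodic_decomposition(p):
    """Return (u, k, u_prime) with p == u * k + u_prime and k >= 2.

    Returns None when p is not periodic in the sense of the definition.
    Taking u as the prefix of smallest-period length makes u primitive.
    """
    if not p:
        return None
    period = smallest_period(p)
    k = len(p) // period
    if k < 2:
        return None
    return p[:period], k, p[k * period:]


if __name__ == "__main__":
    print(periodic_decomposition("aabcaabcaabcaa"))
    print(periodic_decomposition("abcab"))
```

The computation takes linear time, but the border array uses space linear in the length of p, so it illustrates the definition rather than a small-space method.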

Types of Patterns
Case I: Patterns with ALL rows periodic, period $\le m/4$. Problem: can have more candidates than the space we allow.
Case II: Patterns contain an aperiodic row or a row with period $> m/4$. Problem: several patterns can overlap in both directions.

Types of Patterns
Case I: Patterns with ALL rows periodic, period $\le m/4$. Problem: can have more candidates than the space we allow.

Lyndon Words
Definition: Two words $x$, $y$ are conjugate if $x = uv$ and $y = vu$ for some $u$, $v$.
Definition: A Lyndon word is a primitive string which is lexicographically smaller than any of its conjugates.
Canonization computes the least conjugate of a word.
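Canonization can be sketched in Python with a standard two-pointer least-rotation scan that runs in linear time with constant extra space. The slides do not say which canonization algorithm is used, so this choice and the function names are assumptions for illustration.

```python
def least_rotation_index(s):
    """Start index of the lexicographically least rotation of s."""
    n = len(s)
    i, j, k = 0, 1, 0  # two candidate starts and the length matched so far
    while i < n and j < n and k < n:
        a, b = s[(i + k) % n], s[(j + k) % n]
        if a == b:
            k += 1
            continue
        if a > b:
            i += k + 1  # no start in i..i+k can be the least rotation
        else:
            j += k + 1
        if i == j:
            j += 1
        k = 0
    return min(i, j)


def canonize(s):
    """Least conjugate (rotation) of s."""
    i = least_rotation_index(s)
    return s[i:] + s[:i]


def are_conjugate(x, y):
    """Two words are conjugate exactly when they share a least conjugate."""
    return len(x) == len(y) and canonize(x) == canonize(y)


def is_lyndon(s):
    """True if s is primitive and equal to its own least conjugate."""
    primitive = bool(s) and (s + s).find(s, 1) == len(s)
    return primitive and canonize(s) == s


if __name__ == "__main__":
    print(canonize("caab"), are_conjugate("aabc", "bcaa"), is_lyndon("aabc"))
```

Because conjugate words canonize to the same string, the canonical form can serve as a key when two rows should receive the same name whenever their periods are conjugate.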

Naming
New technique for naming rows: two rows get the same name if their periods are conjugate.
How is this done in linear time, yet small space? With a witness tree.
* The witness tree stores a distinction between two names.
* To name a new row, it is compared to only one other row.

Types of Patterns
Case II: Patterns contain an aperiodic row or a row with period $> m/4$. Problem: several patterns can overlap in both directions.
* Many 1D names can overlap in a text block row.
* Identification of candidates is simpler: identify candidates with the aperiodic row of each pattern.
* Difficulty: single-pass verification.
We introduce dynamic dueling, which eliminates vertically inconsistent candidates:
1 Use 1D representations in each column.
2 Use the witness tree to compare row names in constant time.
Then verify consistent candidates against text rows.

Research Plan
Thesis Goals:
1 Succinct 2D dictionary matching
2 Succinct dynamic 2D dictionary matching
  * Develop dynamic algorithm
  * Adapts to modifications of the dictionary: efficiently insert a pattern, efficiently delete a pattern
  * Linear Time
  * Small-Space
  * Dynamic succinct version of the Bird/Baker algorithm
3 for succinct 1D dictionary matching


1D Dictionary Matching
For 1D data, time-space efficient dictionary matching has been achieved, but only in the theoretical realm:
* Implementations lag behind.
* Existing solutions rely on complex data structures that have not been implemented.
Our goal: efficient software for succinct dictionary matching that relies on popular succinct data structures.

Suffix Tree
[Figure: suffix tree for the string Mississippi$]

Suffix Tree
Applications of Suffix Tree:
* Single pattern matching
* Longest common substring of suffixes
* Longest repeated substring
* Longest palindrome

Suffix Array
Sort suffixes alphabetically.
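A minimal Python sketch of a suffix array built by direct sorting, with single pattern matching by binary search. It is meant only to make the idea concrete: construction by sorting slices is far from linear time, the function names are illustrative, and the example uses lowercase "mississippi$" with 0-based positions.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, via two binary searches on sa."""
    m = len(pattern)
    # Leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    t = "mississippi$"
    sa = suffix_array(t)
    print(sa)
    print(find_occurrences(t, sa, "ssi"))
```

All suffixes that begin with the pattern are contiguous in the sorted order, which is why two binary searches are enough to delimit every occurrence.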

Suffix Tree
[Figure: suffix tree for Mississippi$, with leaves labeled s1 through s12]

Suffix Tree
Dynamic Suffix Tree, Choi, Lam (1997):
* Strings can be inserted or deleted efficiently.
* Update time: O(p) to insert/delete a string of size p.
* No edge labeled by a deleted string.
* Additional storage: two-way pointer for each edge.

Compressed Suffix Tree
Compressed suffix tree (CST):
* Compressed self-index
* Replaces input data and answers queries
* No more space than underlying data
* Minor slowdown in compressed suffix array as well

Compressed Suffix Tree
A compressed suffix tree (CST) is composed of several compressed data structures:
* Compressed Suffix Array
* Compressed LCP Array
* Navigation between nodes

Compressed Suffix Tree
Uncompressed Suffix Tree: O(n log n) bits of space.
Compressed Suffix Tree:
1 Sadakane (2007): O(n log σ) bits, O(polylog(n)) slowdown.
2 Russo et al. (2011): kth order empirical entropy, O(log n) slowdown.
3 Fischer et al. (2009): kth order empirical entropy, sub-logarithmic slowdown.
4 Ohlebusch, Fischer, Gog (2011): some queries in O(1) time.
All have been implemented.

Compressed Suffix Tree
Dynamic Compressed Suffix Tree:
1 Chan et al. (2007): adaptation of Sadakane (2007). O(n) bits, O(log^2 n) slowdown.
2 Russo et al. (2011): $nH_k + o(n \log \sigma)$ bits, O(log^2 n) slowdown.
Neither has been implemented.

Suffix Tree
Suffix tree for a set of patterns, either:
1 Concatenate patterns with a unique delimiter between them.
  * Problem: indexes artificial suffixes.
OR
2 Generalized suffix tree
  * Combines suffixes of each pattern in a single data structure.
  * Construction: Ukkonen's online suffix tree construction algorithm.

Suffix Links
Definition: A suffix link is a pointer from an internal node labeled $xS$ to another internal node labeled $S$, where $x$ is an arbitrary character and $S$ is a possibly empty substring.
Suffix links facilitate traversal of the suffix tree during and after construction.

Suffix Links
[Figure: suffix links in the suffix tree for Mississippi$, with leaves labeled s1 through s12]

Algorithm for 1D dictionary matching on a suffix tree:
* Generalized suffix tree: index of several strings
* Ukkonen can insert one string at a time
Our algorithm: modeled after Ukkonen's suffix tree construction algorithm
* Online processing of text
* Linear time: as if inserting a new pattern
* Skip-count trick uses suffix links
* Small-space: compressed suffix tree

Skip-Count Trick
[Figure: skip-count trick on the suffix tree for Mississippi$, with leaves labeled s1 through s12]

Research Plan
Thesis Goals:
1 Succinct 2D dictionary matching
2 Succinct dynamic 2D dictionary matching
3 for succinct 1D dictionary matching
  * Uses compressed suffix tree
  * Space meets entropy-compressed bounds
  * Linear time, with slowdown to query compressed self-index

Evaluation
Data Sets:
Biological
* Fly sequences: FlyBase
* Flu sequences: Influenza Virus Sequence Database
Security
* Network intrusion detection signatures: Snort
Virus detection
* Virus signatures: ClamAV

Timeline
Thesis Goals:
1 Succinct 2D dictionary matching
  * Case I: presented at CPM 2010
  * Case II: presented at WADS 2011, with no slowdown!
  * Complete algorithm accepted to Algorithmica
2 Succinct dynamic 2D dictionary matching
  * Algorithm has been developed
  * Complexity has been proven
  * Manuscript is in final editing stages
3 for succinct 1D dictionary matching
  * Dictionary matching over generalized Ukkonen suffix tree.
  * Ported code to run over compressed suffix tree, using SDSL.
  * Assumption: no pattern is a substring of another. Add lowest marked ancestor framework to overcome this.
  * Evaluate time/space trade-offs of CST implementations.
  * Run over large, realistic data sets.

Thank you!