Small-Space Dictionary Matching (Dissertation Proposal)
|
|
- Leo Greer
- 5 years ago
- Views:
Transcription
1 Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012
2 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length n. Output: All positions in text at which a dictionary pattern occurs.
3 Applications Dictionary Matching Search for specific phrases in a book Scanning file for virus signatures Network intrusion detection systems Searching DNA sequence for a set of motifs
4 Small-Space Challenge: Limited storage capacity in devices. Massive proliferation of data. Goal: efficient algorithms with respect to both time and space.
5 Research Plan Thesis Goals: 1 Succinct 2D dictionary matching 2 Succinct dynamic 2D dictionary matching 3 for succinct 1D dictionary matching
6 Small-Space 1D One-Dimensional Two-Dimensional 1D single pattern matching in linear time and O(1) working space: Galil and Seiferas (1981) Crochemore and Perrin (1991) Rytter (2003) Gasieniec and Kolpakov (2004)
7 Empirical Entropy One-Dimensional Two-Dimensional How is space measured in succinct data structures and algorithms? empirical entropy Definition Empirical entropy of a string, H 0 or H k describes the minimum number of bits that are needed to encode the string within context. H 0 (S) = σ i=1 n i n log n i n
8 Empirical Entropy One-Dimensional Two-Dimensional H 0( s) = σ i= 1 ni ni log n n
9 Empirical Entropy One-Dimensional Two-Dimensional H k (S): lower bound to the compression we can achieve for each symbol using a code which depends on the k symbols preceding it. H k (S) = 1 w n S H 0 (w S ) w Σ k w S is the sequence of characters following occurrences of w in S.
10 Empirical Entropy One-Dimensional Two-Dimensional For any string S and k 0, H k (S) H k 1 (S) H 0 (S) log Σ
11 Small-Space 1D One-Dimensional Two-Dimensional 1D dictionary matching in small space : Space (bits) Search Time Reference O(l log l) O(n + occ) Aho-Corasick (1975) O(l) O((n+occ)log 2 l) Chan et al. (2007) lh k (D)+o(llogσ)+O(d logl) O(n(log ǫ l +logd)+occ) Hon et al. (2008) l(h 0 +O(1))+O(d log(l/d)) O(n+occ) Belazzougui (2010) lh k (D)+O(l) O(n+occ) Hon et al. (2010) d is the number of patterns in D. l is the total size of the dictionary.
12 Small-Space 1D Dictionary One-Dimensional Two-Dimensional Compressed AC automaton [Hon et al. (2010)]: No slowdown! Separates the three functions of the AC automaton. * goto * report * fail Encodes each function differently. Space complexity meets H k (D), kth order empirical entropy.
13 Dynamic Dictionary Matching One-Dimensional Two-Dimensional 1D dynamic dictionary matching: Preprocess Time Update Time Search Time Ref. O(llogl) O(plogl) O((n+occ)logl) AF91 O(llogσ) O(plogl) O((n+occ)logl) IS94 O(llogσ) O(p logl loglogl ) logl O((n+occ) loglogl ) AFIPS95 O(l) O(p) O(n + occ) SV96 d is the number of patterns in D. l is the total size of the dictionary. p is the size of a pattern to add / remove.
14 Dynamic Dictionary Matching One-Dimensional Two-Dimensional Previous dynamic algorithms require O(l log l) extra space. 1D dynamic dictionary matching in small space : Chan et al. (2007) Update Time: O(plog 2 l) Search Time: O((n+occ)log 2 l) Space: O(lσ) Hon et al. (2009) Update Time: O(plogσ +logl) Search Time: O(nlogl+occ) Space: meets kth order empirical entropy
15 Small-Space 2D One-Dimensional Two-Dimensional 2D linear-time single pattern matching Crochemore et al. (1995): Preprocessing: linear time within log space. Text Scanning: linear time, O(1) extra space. Use small-space 2D single pattern matching for set of patterns * requires several scans of text.
16 2D Dictionary Matching One-Dimensional Two-Dimensional Existing 2D dictionary matching algorithms: Bird (1977) / Baker (1978) Amir, Farach (1992) Giancarlo (1995) Idury, Schaffer (1995) Require working space proportional to input.
17 2D Dictionary Matching One-Dimensional Two-Dimensional Existing 2D dictionary matching algorithms: Bird (1977) / Baker (1978) * For rectangular patterns same size in one dimension. * Linearize dictionary and label text. * Use 1D dictionary matching. * Linear time and space. Amir, Farach (1992) Giancarlo (1995) Idury, Schaffer (1995)
18 2D Dictionary Matching One-Dimensional Two-Dimensional Existing 2D dictionary matching algorithms: Bird (1977) / Baker (1978) Amir, Farach (1992) * For square patterns of different sizes. * Linearization around diagonals. * Use suffix tree and AC automaton. * Linear time and space. Giancarlo (1995) Idury, Schaffer (1995)
19 2D Dictionary Matching One-Dimensional Two-Dimensional Existing 2D dictionary matching algorithms: Bird (1977) / Baker (1978) Amir, Farach (1992) Giancarlo (1995) * For square patterns of different sizes. * Uses 2D suffix tree. * Slower than linear time. * Linear space. Idury, Schaffer (1995)
20 2D Dictionary Matching One-Dimensional Two-Dimensional Existing 2D dictionary matching algorithms: Bird (1977) / Baker (1978) Amir, Farach (1992) Giancarlo (1995) Idury, Schaffer (1995) * For rectangular patterns of different sizes. * Multidemensional range searching and dictionary matching. * Slower than linear text processing. * Linear space.
21 Dynamic 2D Dictionary Matching One-Dimensional Two-Dimensional For square patterns: Update Time Search Time Ref. O(p 2 log D ) O((n 2 +occ)log D ) AFIPS (1995) O(p 2 log 2 D ) O((n 2 logd +occ)log D ) Giancarlo95 O(p 2 +plogd) O((n 2 +occ)logd) CL96 Pattern P i is a square of size p i p i. d d D = pi 2 and D = p i. i=1 i=1 P, of size p p, is added / removed.
22 Dynamic 2D Dictionary Matching One-Dimensional Two-Dimensional For rectangular patterns of different sizes: Idury, Schaffer (1995) Groups patterns by size. Multidemensional range searching and dictionary matching. O(log 4 D ) slowdown in all stages * Preprocess dictionary * Scan text * Update dictionary
23 Research Plan One-Dimensional Two-Dimensional Thesis Goals: 1 Succinct 2D dictionary matching Develop first efficient algorithm Linear Time Small-Space 2 Succinct dynamic 2D dictionary matching 3 for succinct 1D dictionary matching
24 Problem Definition One-Dimensional Two-Dimensional 2D Dictionary Matching Input: Dictionary of d patterns, each is m m in size. Text T of size n n. Output: All positions in text at which a dictionary pattern occurs.
25 2D Dictionary Matching One-Dimensional Two-Dimensional Bird / Baker Convert 2D data to 1D representation. Name patterns rows. Name text positions. Use 1D dictionary matching to find pattern occurrences.
26 2D Dictionary Matching One-Dimensional Two-Dimensional Bird / Baker Convert 2D data to 1D representation. Name patterns rows. Name text positions. Use 1D dictionary matching to find pattern occurrences. Our work: mimic Bird/Baker algorithm in small space.
27 1D Periodicity One-Dimensional Two-Dimensional Definition A string p is periodic in u if p = u k u where u is a proper prefix of u, u is primitive, and k 2. aabcaabcaabcaa
28 1D Periodicity One-Dimensional Two-Dimensional Definition A string p is periodic in u if p = u k u where u is a proper prefix of u, u is primitive, and k 2. We divide patterns into 2 groups based on 1D periodicity. In each case, different difficulties to overcome.
29 Types of Patterns One-Dimensional Two-Dimensional Case I: Patterns with ALL rows periodic, period m/4. Problem: can have more candidates than the space we allow. Case II: Patterns contain aperiodic row or row with period > m/4. Problem: several patterns can overlap in both directions.
30 Types of Patterns One-Dimensional Two-Dimensional Case I: Patterns with ALL rows periodic, period m/4. Problem: can have more candidates than the space we allow.
31 Lyndon Words One-Dimensional Two-Dimensional Definition Two words x, y are conjugate if x = uv, y = vu for some u, v. Definition A Lyndon word is a primitive string which is lexicographically smaller than any of its conjugates. Canonization computes the least conjugate of a word.
32 Naming One-Dimensional Two-Dimensional New technique for naming rows: same name if periods are conjugate
33 Naming One-Dimensional Two-Dimensional New technique for naming rows: same name if periods are conjugate How is this done in linear time, yet small space? witness tree Witness tree stores a distinction between two names. To name a new row, it is compared to only one other row.
34 Types of Patterns One-Dimensional Two-Dimensional Case II: Patterns contain aperiodic row or row with period > m/4. Problem: several patterns can overlap in both directions.
35 Types of Patterns One-Dimensional Two-Dimensional Case II: Patterns contain aperiodic row or row with period > m/4. Problem: several patterns can overlap in both directions. Many 1D names can overlap in a text block row. Identification of candidates is simpler. Identify candidates with aperiodic row of each pattern.
36 Types of Patterns One-Dimensional Two-Dimensional Case II: Patterns contain aperiodic row or row with period > m/4. Problem: several patterns can overlap in both directions. Difficulty: single pass verification. We introduce dynamic dueling. Eliminate vertically inconsistent candidates. 1 Use 1D representations in each column. 2 Use witness tree to compare row names in constant time. Then verify consistent candidates against text rows.
37 Research Plan One-Dimensional Two-Dimensional Thesis Goals: 1 Succinct 2D dictionary matching 2 Succinct dynamic 2D dictionary matching Develop dynamic algorithm Adapts to modifications of dictionary * Efficiently insert pattern * Efficiently delete pattern Linear Time Small-Space Dynamic succinct version of Bird/Baker algorithm. 3 for succinct 1D dictionary matching
38 Small-Space 1D Indexing Dictionary Matching 1D dictionary matching in small space : Space (bits) Search Time Reference O(l log l) O(n + occ) Aho-Corasick (1975) O(l) O((n+occ)log 2 l) Chan et al. (2007) lh k (D)+o(llogσ)+O(d logl) O(n(log ǫ l +logd)+occ) Hon et al. (2008) l(h 0 +O(1))+O(d log(l/d)) O(n+occ) Belazzougui (2010) lh k (D)+O(l) O(n+occ) Hon et al. (2010) d is the number of patterns in D. l is the total size of the dictionary.
39 1D Dictionary Matching Indexing Dictionary Matching For 1D data, Time-Space efficient dictionary matching has been achieved. * Only in the theoretical realm. * Implementations lag behind. * Rely on complex data structures that have not been implemented. Our goal: Efficient software for succinct dictionary matching that relies on popular succinct data structures.
40 Suffix Tree Indexing Dictionary Matching M i s s i s s i p p i $
41 Suffix Tree Indexing Dictionary Matching Applications of Suffix Tree: Single pattern matching Longest common substring of suffixes Longest repeated substring Longest palindrome
42 Suffix Array Indexing Dictionary Matching sort suffixes alphabetically
43 Suffix Tree Indexing Dictionary Matching M i s s i s s i p p i $ s 12 s 11 s 8 s 3 s 6 s 5 s 2 s 1 s10 s 9 s 7 s 4
44 Suffix Tree Indexing Dictionary Matching Dynamic Suffix Tree Choi, Lam (1997) Strings can be inserted or deleted efficiently. Update time: O(p) to insert/delete string of size p. No edge labeled by a deleted string. Additional Storage: two-way pointer for each edge.
45 Compressed Suffix Tree Indexing Dictionary Matching Compressed suffix tree (CST) Compressed self-index Replaces input data and answers queries No more space than underlying data Minor slowdown in compressed suffix array as well
46 Compressed Suffix Tree Indexing Dictionary Matching Compressed suffix tree (CST) is composed of several compressed data structures: * Compressed Suffix Array * Compressed LCP Array * Navigation between nodes
47 Compressed Suffix Tree Indexing Dictionary Matching Uncompressed Suffix Tree: O(n log n) bits of space. Compressed Suffix Tree: 1 Sadakane (2007): O(n log σ) bits, O(polylog(n)) slowdown. 2 Russo et al. (2011): kth order empirical entropy, O(log n) slowdown. 3 Fischer et al. (2009): kth order empirical entropy, sub-logarithmic slowdown. 4 Ohlebusch, Fischer, Gog (2011): some queries in O(1) time. Have all been implemented.
48 Compressed Suffix Tree Indexing Dictionary Matching Dynamic Compressed Suffix Tree 1 Chan et al. (2007): Adaptation of Sadakane (2007). O(n) bits O(log 2 n) slowdown. 2 Russo et al. (2011): nh k +o(nlogσ) bits. O(log 2 n) slowdown. Have not been implemented.
49 Suffix Tree Indexing Dictionary Matching Suffix Tree for set of patterns: 1 Concatenate patterns with unique delimiter between them. OR * Problem: indexes artificial suffixes. 2 Generalized suffix tree * Combines suffixes of each pattern in a single data structure. * Construction: Ukkonen s online suffix tree construction algorithm.
50 Suffix Links Indexing Dictionary Matching Definition A suffix link is a pointer from an internal node labeled xs to another internal node labeled S, where x is an arbitrary character and S is a possibly empty substring. Suffix links facilitate traversal of suffix tree during and after construction.
51 Suffix Links Indexing Dictionary Matching M i s s i s s i p p i $ s 12 s 11 s 8 s 3 s 6 s 5 s 2 s 1 s10 s 9 s 7 s 4
52 Indexing Dictionary Matching Algorithm for 1D dictionary matching on suffix tree: Generalized suffix tree: index of several strings Ukkonen can insert one string at a time Our algorithm: modeled after Ukkonen s suffix tree construction algorithm Online processing of text Linear time: as if inserting new pattern Skip-count trick uses suffix links
53 Indexing Dictionary Matching Algorithm for 1D dictionary matching on suffix tree: Generalized suffix tree: index of several strings Ukkonen can insert one string at a time Our algorithm: modeled after Ukkonen s suffix tree construction algorithm Online processing of text Linear time: as if inserting new pattern Skip-count trick uses suffix links Small-Space: compressed suffix tree
54 Skip-Count Trick Indexing Dictionary Matching M i s s i s s i p p i $ s 12 s 11 s 8 s 3 s 6 s 5 s 2 s 1 s10 s 9 s 7 s 4
55 Research Plan Indexing Dictionary Matching Thesis Goals: 1 Succinct 2D dictionary matching 2 Succinct dynamic 2D dictionary matching 3 for succinct 1D dictionary matching Uses compressed suffix tree Space meets entropy-compressed bounds Linear time, with slowdown to query compressed self-index
56 Evaluation Indexing Dictionary Matching Data Sets: Biological * Fly sequences: FlyBase * Flu sequences: Influenza Virus Sequence Database Security * Network intrusion detection signatures: ClamAV Virus detection * Virus signatures: Snort
57 Timeline Indexing Dictionary Matching Thesis Goals: 1 Succinct 2D dictionary matching 2 Succinct dynamic 2D dictionary matching 3 for succinct 1D dictionary matching
58 Timeline Indexing Dictionary Matching Thesis Goals: 1 Succinct 2D dictionary matching Case I: presented at CPM 2010 Case II: presented at WADS 2011 * With no slowdown! * Complete algorithm accepted to Algorithmica 2 Succinct dynamic 2D dictionary matching 3 for succinct 1D dictionary matching
59 Timeline Indexing Dictionary Matching Thesis Goals: 1 Succinct 2D dictionary matching 2 Succinct dynamic 2D dictionary matching Algorithm has been developed Complexity has been proven Manuscript is in final editing stages 3 for succinct 1D dictionary matching
60 Timeline Indexing Dictionary Matching Thesis Goals: 1 Succinct 2D dictionary matching 2 Succinct dynamic 2D dictionary matching 3 for succinct 1D dictionary matching Dictionary matching over generalized Ukkonen suffix tree. Ported code to run over compressed suffix tree, using SDSL. Assumption: no pattern is substring of another. Add lowest marked ancestor framework to overcome. Evaluate time/space trade-offs of CST implementations. Run over large, realistic data sets.
61 Indexing Dictionary Matching Thank you!
Succinct 2D Dictionary Matching with No Slowdown
Succinct 2D Dictionary Matching with No Slowdown Shoshana Neuburger and Dina Sokol City University of New York Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d
More informationCompressed Index for Dynamic Text
Compressed Index for Dynamic Text Wing-Kai Hon Tak-Wah Lam Kunihiko Sadakane Wing-Kin Sung Siu-Ming Yiu Abstract This paper investigates how to index a text which is subject to updates. The best solution
More informationSmaller and Faster Lempel-Ziv Indices
Smaller and Faster Lempel-Ziv Indices Diego Arroyuelo and Gonzalo Navarro Dept. of Computer Science, Universidad de Chile, Chile. {darroyue,gnavarro}@dcc.uchile.cl Abstract. Given a text T[1..u] over an
More informationAlphabet Friendly FM Index
Alphabet Friendly FM Index Author: Rodrigo González Santiago, November 8 th, 2005 Departamento de Ciencias de la Computación Universidad de Chile Outline Motivations Basics Burrows Wheeler Transform FM
More informationLecture 18 April 26, 2012
6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 18 April 26, 2012 1 Overview In the last lecture we introduced the concept of implicit, succinct, and compact data structures, and
More informationAlphabet-Independent Compressed Text Indexing
Alphabet-Independent Compressed Text Indexing DJAMAL BELAZZOUGUI Université Paris Diderot GONZALO NAVARRO University of Chile Self-indexes are able to represent a text within asymptotically the information-theoretic
More informationTheoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts
Theoretical Computer Science 410 (2009) 4402 4413 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Dynamic rank/select structures with
More informationarxiv: v1 [cs.ds] 15 Feb 2012
Linear-Space Substring Range Counting over Polylogarithmic Alphabets Travis Gagie 1 and Pawe l Gawrychowski 2 1 Aalto University, Finland travis.gagie@aalto.fi 2 Max Planck Institute, Germany gawry@cs.uni.wroc.pl
More informationSpace-Efficient Construction Algorithm for Circular Suffix Tree
Space-Efficient Construction Algorithm for Circular Suffix Tree Wing-Kai Hon, Tsung-Han Ku, Rahul Shah and Sharma Thankachan CPM2013 1 Outline Preliminaries and Motivation Circular Suffix Tree Our Indexes
More informationText matching of strings in terms of straight line program by compressed aleshin type automata
Text matching of strings in terms of straight line program by compressed aleshin type automata 1 A.Jeyanthi, 2 B.Stalin 1 Faculty, 2 Assistant Professor 1 Department of Mathematics, 2 Department of Mechanical
More informationData Structure for Dynamic Patterns
Data Structure for Dynamic Patterns Chouvalit Khancome and Veera Booning Member IAENG Abstract String matching and dynamic dictionary matching are significant principles in computer science. These principles
More informationString Range Matching
String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings
More informationCompressed Representations of Sequences and Full-Text Indexes
Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Università di Pisa GIOVANNI MANZINI Università del Piemonte Orientale VELI MÄKINEN University of Helsinki AND GONZALO NAVARRO
More informationLongest Gapped Repeats and Palindromes
Discrete Mathematics and Theoretical Computer Science DMTCS vol. 19:4, 2017, #4 Longest Gapped Repeats and Palindromes Marius Dumitran 1 Paweł Gawrychowski 2 Florin Manea 3 arxiv:1511.07180v4 [cs.ds] 11
More informationarxiv: v2 [cs.ds] 8 Apr 2016
Optimal Dynamic Strings Paweł Gawrychowski 1, Adam Karczmarz 1, Tomasz Kociumaka 1, Jakub Łącki 2, and Piotr Sankowski 1 1 Institute of Informatics, University of Warsaw, Poland [gawry,a.karczmarz,kociumaka,sank]@mimuw.edu.pl
More informationString Matching with Variable Length Gaps
String Matching with Variable Length Gaps Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind Technical University of Denmark Abstract. We consider string matching with variable length
More informationImproved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille 1, Rolf Fagerberg 2, and Inge Li Gørtz 3 1 IT University of Copenhagen. Rued Langgaards
More informationIndexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile
Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:
More informationString Regularities and Degenerate Strings
M. Sc. Thesis Defense Md. Faizul Bari (100705050P) Supervisor: Dr. M. Sohel Rahman String Regularities and Degenerate Strings Department of Computer Science and Engineering Bangladesh University of Engineering
More informationStronger Lempel-Ziv Based Compressed Text Indexing
Stronger Lempel-Ziv Based Compressed Text Indexing Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile.
More informationOn-line String Matching in Highly Similar DNA Sequences
On-line String Matching in Highly Similar DNA Sequences Nadia Ben Nsira 1,2,ThierryLecroq 1,,MouradElloumi 2 1 LITIS EA 4108, Normastic FR3638, University of Rouen, France 2 LaTICE, University of Tunis
More informationA Unifying Framework for Compressed Pattern Matching
A Unifying Framework for Compressed Pattern Matching Takuya Kida Yusuke Shibata Masayuki Takeda Ayumi Shinohara Setsuo Arikawa Department of Informatics, Kyushu University 33 Fukuoka 812-8581, Japan {
More informationFinding all covers of an indeterminate string in O(n) time on average
Finding all covers of an indeterminate string in O(n) time on average Md. Faizul Bari, M. Sohel Rahman, and Rifat Shahriyar Department of Computer Science and Engineering Bangladesh University of Engineering
More informationShift-And Approach to Pattern Matching in LZW Compressed Text
Shift-And Approach to Pattern Matching in LZW Compressed Text Takuya Kida, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa Department of Informatics, Kyushu University 33 Fukuoka 812-8581, Japan {kida,
More informationCOMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES
COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES DO HUY HOANG (B.C.S. (Hons), NUS) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE SCHOOL OF COMPUTING NATIONAL
More informationarxiv: v1 [cs.ds] 19 Apr 2011
Fixed Block Compression Boosting in FM-Indexes Juha Kärkkäinen 1 and Simon J. Puglisi 2 1 Department of Computer Science, University of Helsinki, Finland juha.karkkainen@cs.helsinki.fi 2 Department of
More informationFast Fully-Compressed Suffix Trees
Fast Fully-Compressed Suffix Trees Gonzalo Navarro Department of Computer Science University of Chile, Chile gnavarro@dcc.uchile.cl Luís M. S. Russo INESC-ID / Instituto Superior Técnico Technical University
More informationCSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182
CSE182-L7 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding 10-07 CSE182 Bell Labs Honors Pattern matching 10-07 CSE182 Just the Facts Consider the set of all substrings
More informationDefine M to be a binary n by m matrix such that:
The Shift-And Method Define M to be a binary n by m matrix such that: M(i,j) = iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = iff P[.. i] T[j-i+.. j]
More informationA Faster Grammar-Based Self-Index
A Faster Grammar-Based Self-Index Travis Gagie 1 Pawe l Gawrychowski 2 Juha Kärkkäinen 3 Yakov Nekrich 4 Simon Puglisi 5 Aalto University Max-Planck-Institute für Informatik University of Helsinki University
More informationOn-Line Construction of Suffix Trees I. E. Ukkonen z
.. t Algorithmica (1995) 14:249-260 Algorithmica 9 1995 Springer-Verlag New York Inc. On-Line Construction of Suffix Trees I E. Ukkonen z Abstract. An on-line algorithm is presented for constructing the
More informationarxiv: v1 [cs.ds] 25 Nov 2009
Alphabet Partitioning for Compressed Rank/Select with Applications Jérémy Barbay 1, Travis Gagie 1, Gonzalo Navarro 1 and Yakov Nekrich 2 1 Department of Computer Science University of Chile {jbarbay,
More informationTheoretical aspects of ERa, the fastest practical suffix tree construction algorithm
Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Matevž Jekovec University of Ljubljana Faculty of Computer and Information Science Oct 10, 2013 Text indexing problem
More informationarxiv: v2 [cs.ds] 5 Mar 2014
Order-preserving pattern matching with k mismatches Pawe l Gawrychowski 1 and Przemys law Uznański 2 1 Max-Planck-Institut für Informatik, Saarbrücken, Germany 2 LIF, CNRS and Aix-Marseille Université,
More informationFast String Kernels. Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200
Fast String Kernels Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200 Alex.Smola@anu.edu.au joint work with S.V.N. Vishwanathan Slides (soon) available
More informationOpportunistic Data Structures with Applications
Opportunistic Data Structures with Applications Paolo Ferragina Giovanni Manzini Abstract There is an upsurging interest in designing succinct data structures for basic searching problems (see [23] and
More informationSuccinct Suffix Arrays based on Run-Length Encoding
Succinct Suffix Arrays based on Run-Length Encoding Veli Mäkinen Gonzalo Navarro Abstract A succinct full-text self-index is a data structure built on a text T = t 1 t 2...t n, which takes little space
More informationBreaking a Time-and-Space Barrier in Constructing Full-Text Indices
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Wing-Kai Hon Kunihiko Sadakane Wing-Kin Sung Abstract Suffix trees and suffix arrays are the most prominent full-text indices, and their
More informationSpace-Efficient Re-Pair Compression
Space-Efficient Re-Pair Compression Philip Bille, Inge Li Gørtz, and Nicola Prezza Technical University of Denmark, DTU Compute {phbi,inge,npre}@dtu.dk Abstract Re-Pair [5] is an effective grammar-based
More informationData Compression Techniques
Data Compression Techniques Part 2: Text Compression Lecture 5: Context-Based Compression Juha Kärkkäinen 14.11.2017 1 / 19 Text Compression We will now look at techniques for text compression. These techniques
More informationComputing Techniques for Parallel and Distributed Systems with an Application to Data Compression. Sergio De Agostino Sapienza University di Rome
Computing Techniques for Parallel and Distributed Systems with an Application to Data Compression Sergio De Agostino Sapienza University di Rome Parallel Systems A parallel random access machine (PRAM)
More informationSuccincter text indexing with wildcards
University of British Columbia CPM 2011 June 27, 2011 Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview
More informationAn on{line algorithm is presented for constructing the sux tree for a. the desirable property of processing the string symbol by symbol from left to
(To appear in ALGORITHMICA) On{line construction of sux trees 1 Esko Ukkonen Department of Computer Science, University of Helsinki, P. O. Box 26 (Teollisuuskatu 23), FIN{00014 University of Helsinki,
More informationOn the Number of Distinct Squares
Frantisek (Franya) Franek Advanced Optimization Laboratory Department of Computing and Software McMaster University, Hamilton, Ontario, Canada Invited talk - Prague Stringology Conference 2014 Outline
More informationFaster Compact On-Line Lempel-Ziv Factorization
Faster Compact On-Line Lempel-Ziv Factorization Jun ichi Yamamoto, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda Department of Informatics, Kyushu University, Nishiku, Fukuoka, Japan
More informationImproved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille IT University of Copenhagen Rolf Fagerberg University of Southern Denmark Inge Li Gørtz
More informationString Indexing for Patterns with Wildcards
MASTER S THESIS String Indexing for Patterns with Wildcards Hjalte Wedel Vildhøj and Søren Vind Technical University of Denmark August 8, 2011 Abstract We consider the problem of indexing a string t of
More informationCompressed Representations of Sequences and Full-Text Indexes
Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Dipartimento di Informatica, Università di Pisa, Italy GIOVANNI MANZINI Dipartimento di Informatica, Università del Piemonte
More informationarxiv: v1 [cs.ds] 22 Nov 2012
Faster Compact Top-k Document Retrieval Roberto Konow and Gonzalo Navarro Department of Computer Science, University of Chile {rkonow,gnavarro}@dcc.uchile.cl arxiv:1211.5353v1 [cs.ds] 22 Nov 2012 Abstract:
More informationA Space-Efficient Frameworks for Top-k String Retrieval
A Space-Efficient Frameworks for Top-k String Retrieval Wing-Kai Hon, National Tsing Hua University Rahul Shah, Louisiana State University Sharma V. Thankachan, Louisiana State University Jeffrey Scott
More informationCompact Indexes for Flexible Top-k Retrieval
Compact Indexes for Flexible Top-k Retrieval Simon Gog Matthias Petri Institute of Theoretical Informatics, Karlsruhe Institute of Technology Computing and Information Systems, The University of Melbourne
More informationSUFFIX TREE. SYNONYMS Compact suffix trie
SUFFIX TREE Maxime Crochemore King s College London and Université Paris-Est, http://www.dcs.kcl.ac.uk/staff/mac/ Thierry Lecroq Université de Rouen, http://monge.univ-mlv.fr/~lecroq SYNONYMS Compact suffix
More informationSelf-Indexed Grammar-Based Compression
Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca
More informationOn Compressing and Indexing Repetitive Sequences
On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv
More informationEfficient High-Similarity String Comparison: The Waterfall Algorithm
Efficient High-Similarity String Comparison: The Waterfall Algorithm Alexander Tiskin Department of Computer Science University of Warwick http://go.warwick.ac.uk/alextiskin Alexander Tiskin (Warwick)
More informationA Fully Compressed Pattern Matching Algorithm for Simple Collage Systems
A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems Shunsuke Inenaga 1, Ayumi Shinohara 2,3 and Masayuki Takeda 2,3 1 Department of Computer Science, P.O. Box 26 (Teollisuuskatu 23)
More informationA GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *
A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * 1 Jorma Tarhio and Esko Ukkonen Department of Computer Science, University of Helsinki Tukholmankatu 2, SF-00250 Helsinki,
More informationNumber of occurrences of powers in strings
Author manuscript, published in "International Journal of Foundations of Computer Science 21, 4 (2010) 535--547" DOI : 10.1142/S0129054110007416 Number of occurrences of powers in strings Maxime Crochemore
More informationDynamic Entropy-Compressed Sequences and Full-Text Indexes
Dynamic Entropy-Compressed Sequences and Full-Text Indexes VELI MÄKINEN University of Helsinki and GONZALO NAVARRO University of Chile First author funded by the Academy of Finland under grant 108219.
More informationSelf-Indexed Grammar-Based Compression
Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca
More informationBandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)
Compression Motivation Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Storage: Store large & complex 3D models (e.g. 3D scanner
More informationHierarchical Overlap Graph
Hierarchical Overlap Graph B. Cazaux and E. Rivals LIRMM & IBC, Montpellier 8. Feb. 2018 arxiv:1802.04632 2018 B. Cazaux & E. Rivals 1 / 29 Overlap Graph for a set of words Consider the set P := {abaa,
More informationPattern Matching (Exact Matching) Overview
CSI/BINF 5330 Pattern Matching (Exact Matching) Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Pattern Matching Exhaustive Search DFA Algorithm KMP Algorithm
More informationInternal Pattern Matching Queries in a Text and Applications
Internal Pattern Matching Queries in a Text and Applications Tomasz Kociumaka Jakub Radoszewski Wojciech Rytter Tomasz Waleń Abstract We consider several types of internal queries: questions about subwords
More informationConverting SLP to LZ78 in almost Linear Time
CPM 2013 Converting SLP to LZ78 in almost Linear Time Hideo Bannai 1, Paweł Gawrychowski 2, Shunsuke Inenaga 1, Masayuki Takeda 1 1. Kyushu University 2. Max-Planck-Institut für Informatik Recompress SLP
More informationBLAST: Basic Local Alignment Search Tool
.. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].
More informationSource Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria
Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal
More informationMotif Extraction from Weighted Sequences
Motif Extraction from Weighted Sequences C. Iliopoulos 1, K. Perdikuri 2,3, E. Theodoridis 2,3,, A. Tsakalidis 2,3 and K. Tsichlas 1 1 Department of Computer Science, King s College London, London WC2R
More informationOptimal Dynamic Sequence Representations
Optimal Dynamic Sequence Representations Gonzalo Navarro Yakov Nekrich Abstract We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on
More informationImplementing Approximate Regularities
Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National
More informationString Regularities and Degenerate Strings
M.Sc. Engg. Thesis String Regularities and Degenerate Strings by Md. Faizul Bari Submitted to Department of Computer Science and Engineering in partial fulfilment of the requirments for the degree of Master
More informationDictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line
Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VF Files On-line MatBio 18 Solon P. Pissis and Ahmad Retha King s ollege London 02-Aug-2018 Solon P. Pissis and Ahmad Retha
More informationSuffix Array of Alignment: A Practical Index for Similar Data
Suffix Array of Alignment: A Practical Index for Similar Data Joong Chae Na 1, Heejin Park 2, Sunho Lee 3, Minsung Hong 3, Thierry Lecroq 4, Laurent Mouchard 4, and Kunsoo Park 3, 1 Department of Computer
More informationPreview: Text Indexing
Simon Gog gog@ira.uka.de - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Text Indexing Motivation Problems Given a text
More informationAverage Complexity of Exact and Approximate Multiple String Matching
Average Complexity of Exact and Approximate Multiple String Matching Gonzalo Navarro Department of Computer Science University of Chile gnavarro@dcc.uchile.cl Kimmo Fredriksson Department of Computer Science
More informationA Simple Alphabet-Independent FM-Index
A Simple Alphabet-Independent -Index Szymon Grabowski 1, Veli Mäkinen 2, Gonzalo Navarro 3, Alejandro Salinger 3 1 Computer Engineering Dept., Tech. Univ. of Lódź, Poland. e-mail: sgrabow@zly.kis.p.lodz.pl
More informationSimple Real-Time Constant-Space String Matching
Simple Real-Time Constant-Space String Matching Dany Breslauer 1, Roberto Grossi 2, and Filippo Mignosi 3 1 Caesarea Rothchild Institute, University of Haifa, Haifa, Israel 2 Dipartimento di Informatica,
More informationHow many double squares can a string contain?
How many double squares can a string contain? F. Franek, joint work with A. Deza and A. Thierry Algorithms Research Group Department of Computing and Software McMaster University, Hamilton, Ontario, Canada
More informationBurrows-Wheeler Transforms in Linear Time and Linear Bits
Burrows-Wheeler Transforms in Linear Time and Linear Bits Russ Cox (following Hon, Sadakane, Sung, Farach, and others) 18.417 Final Project BWT in Linear Time and Linear Bits Three main parts to the result.
More informationSkriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT)
Skriptum VL Text Indexing Sommersemester 2012 Johannes Fischer (KIT) Disclaimer Students attending my lectures are often astonished that I present the material in a much livelier form than in this script.
More informationComputing Longest Common Substrings Using Suffix Arrays
Computing Longest Common Substrings Using Suffix Arrays Maxim A. Babenko, Tatiana A. Starikovskaya Moscow State University Computer Science in Russia, 2008 Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing
More informationForbidden Patterns. {vmakinen leena.salmela
Forbidden Patterns Johannes Fischer 1,, Travis Gagie 2,, Tsvi Kopelowitz 3, Moshe Lewenstein 4, Veli Mäkinen 5,, Leena Salmela 5,, and Niko Välimäki 5, 1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu
More informationLongest Lyndon Substring After Edit
Longest Lyndon Substring After Edit Yuki Urabe Department of Electrical Engineering and Computer Science, Kyushu University, Japan yuki.urabe@inf.kyushu-u.ac.jp Yuto Nakashima Department of Informatics,
More informationText Indexing: Lecture 6
Simon Gog gog@kit.edu - 0 Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Reviewing the last two lectures We have seen two top-k document retrieval frameworks. Question
More informationSequence comparison by compression
Sequence comparison by compression Motivation similarity as a marker for homology. And homology is used to infer function. Sometimes, we are only interested in a numerical distance between two sequences.
More informationComputing a Longest Common Palindromic Subsequence
Fundamenta Informaticae 129 (2014) 1 12 1 DOI 10.3233/FI-2014-860 IOS Press Computing a Longest Common Palindromic Subsequence Shihabur Rahman Chowdhury, Md. Mahbubul Hasan, Sumaiya Iqbal, M. Sohel Rahman
More informationOptimal-Time Text Indexing in BWT-runs Bounded Space
Optimal-Time Text Indexing in BWT-runs Bounded Space Travis Gagie Gonzalo Navarro Nicola Prezza Abstract Indexing highly repetitive texts such as genomic databases, software repositories and versioned
More informationarxiv: v1 [cs.ds] 8 Sep 2018
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space Travis Gagie 1,2, Gonzalo Navarro 2,3, and Nicola Prezza 4 1 EIT, Diego Portales University, Chile 2 Center for Biotechnology
More informationSkriptum VL Text-Indexierung Sommersemester 2010 Johannes Fischer (KIT)
1 Recommended Reading Skriptum VL Text-Indexierung Sommersemester 2010 Johannes Fischer (KIT) D. Gusfield: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. M. Crochemore,
More informationText Compression. Jayadev Misra The University of Texas at Austin December 5, A Very Incomplete Introduction to Information Theory 2
Text Compression Jayadev Misra The University of Texas at Austin December 5, 2003 Contents 1 Introduction 1 2 A Very Incomplete Introduction to Information Theory 2 3 Huffman Coding 5 3.1 Uniquely Decodable
More informationFast profile matching algorithms A survey
Theoretical Computer Science 395 (2008) 137 157 www.elsevier.com/locate/tcs Fast profile matching algorithms A survey Cinzia Pizzi a,,1, Esko Ukkonen b a Department of Computer Science, University of Helsinki,
More informationString Search. 6th September 2018
String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)
More informationImproved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts 3 PHILIP BILLE IT University of Copenhagen ROLF FAGERBERG University of Southern Denmark AND INGE LI
More informationON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION
ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced
More informationText Searching. Thierry Lecroq Laboratoire d Informatique, du Traitement de l Information et des
Text Searching Thierry Lecroq Thierry.Lecroq@univ-rouen.fr Laboratoire d Informatique, du Traitement de l Information et des Systèmes. International PhD School in Formal Languages and Applications Tarragona,
More informationIntrusion Detection and Malware Analysis
Intrusion Detection and Malware Analysis IDS feature extraction Pavel Laskov Wilhelm Schickard Institute for Computer Science Metric embedding of byte sequences Sequences 1. blabla blubla blablabu aa 2.
More informationarxiv: v1 [cs.ds] 30 Nov 2018
Faster Attractor-Based Indexes Gonzalo Navarro 1,2 and Nicola Prezza 3 1 CeBiB Center for Biotechnology and Bioengineering 2 Dept. of Computer Science, University of Chile, Chile. gnavarro@dcc.uchile.cl
More informationOPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS
Parallel Processing Letters, c World Scientific Publishing Company OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS COSTAS S. ILIOPOULOS Department of Computer Science, King s College London,
More informationRank and Select Operations on Binary Strings (1974; Elias)
Rank and Select Operations on Binary Strings (1974; Elias) Naila Rahman, University of Leicester, www.cs.le.ac.uk/ nyr1 Rajeev Raman, University of Leicester, www.cs.le.ac.uk/ rraman entry editor: Paolo
More informationTheoretical Computer Science
Theoretical Computer Science 443 (2012) 25 34 Contents lists available at SciVerse ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs String matching with variable
More information