A Faster Grammar-Based Self-Index
1 A Faster Grammar-Based Self-Index
Travis Gagie (Aalto University), Paweł Gawrychowski (Max-Planck-Institut für Informatik), Juha Kärkkäinen (University of Helsinki), Yakov Nekrich (University of Bonn), Simon Puglisi (King's College London)
LATA 2012
2 Motivation
We want to store repetitive texts (say, genomic databases) in compressed form, but in such a way that we can search them quickly. In other words, given a text, build a small structure that allows fast pattern matching.
Pattern matching? In this talk, pattern matching means exact pattern matching: given P[1..m], we want to find where it occurs exactly in the text S[1..n]. We might want the first occurrence, or all of them, or just a few.
Such a structure is called an index. If it also allows retrieving the original text, it is called a self-index.
3 Problem, more precisely
We are asked to build a self-index for a string S[1..n] whose LZ77 parse consists of z phrases.
What is LZ77? Given a text, we split it into z disjoint fragments called phrases. Each phrase is either a single letter or a substring of the already defined prefix. The number of phrases is believed to be the right measure of how repetitive the text is.
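The parse described on this slide can be illustrated with a minimal sketch. It assumes a greedy strategy (always take the longest phrase that occurs in the already parsed prefix) and uses Python's built-in `str.find` as a slow stand-in for a proper matching structure; the function names `lz77_parse` and `lz77_decode` are hypothetical, not taken from the talk.

```python
def lz77_parse(s):
    """Greedy parse: each phrase is a new letter or a copy from the prefix s[:i].

    Returns a list of ("lit", ch) and ("copy", src, length) tuples.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_src = 0, -1
        length = 1
        # Grow the candidate while it still occurs entirely inside s[:i].
        while i + length <= n:
            src = s.find(s[i:i + length], 0, i)
            if src == -1:
                break
            best_len, best_src = length, src
            length += 1
        if best_len == 0:
            phrases.append(("lit", s[i]))
            i += 1
        else:
            phrases.append(("copy", best_src, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_parse."""
    out = []
    for p in phrases:
        if p[0] == "lit":
            out.append(p[1])
        else:
            _, src, length = p
            out.extend(out[src:src + length])
    return "".join(out)


phrases = lz77_parse("abababab")
z = len(phrases)  # number of phrases: a, b, ab, abab
```

On a highly repetitive text z grows slowly with n, which is why the slide treats it as the measure of repetitiveness.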
4 Solution?
Straight-line program (SLP), or grammar representation: simply a context-free grammar with exactly one production per nonterminal.
It is known that, given an LZ77 parse consisting of z phrases, we can construct such a program consisting of just O(z log n) words.
The program can be assumed to be balanced, meaning that for each production A → BC we have |B| ≈ |C|.
Extracting an arbitrary substring of length l from a balanced SLP takes O(log n + l) time.
But how do we search?!
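Substring extraction from an SLP can be sketched as follows. This is an illustration under assumptions, not the construction from the talk: the grammar is a hand-written dictionary (`rules`) in which a nonterminal maps either to a single character or to a pair of nonterminals, and the names `expansion_lengths` and `extract` are hypothetical. The walk only descends into subtrees that overlap the requested range, so on a balanced grammar it visits roughly O(log n + l) nodes.

```python
def expansion_lengths(rules, start):
    """Length of the expansion of every nonterminal reachable from start."""
    lengths = {}

    def length(sym):
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):      # terminal rule: X -> 'a'
                lengths[sym] = 1
            else:                          # binary rule: X -> B C
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    length(start)
    return lengths


def extract(rules, lengths, start, i, l):
    """Return the l characters at 0-based positions i..i+l-1 of the expansion."""
    out = []

    def walk(sym, lo, hi):
        # Collect positions [lo, hi) of sym's expansion, skipping empty ranges.
        if lo >= hi:
            return
        rhs = rules[sym]
        if isinstance(rhs, str):
            out.append(rhs)
            return
        left, right = rhs
        mid = lengths[left]
        walk(left, lo, min(hi, mid))
        walk(right, max(lo, mid) - mid, hi - mid)

    walk(start, i, i + l)
    return "".join(out)


# A balanced SLP for "abababab": Z -> Y Y, Y -> X X, X -> A B.
rules = {"A": "a", "B": "b", "X": ("A", "B"), "Y": ("X", "X"), "Z": ("Y", "Y")}
lengths = expansion_lengths(rules, "Z")
piece = extract(rules, lengths, "Z", 3, 4)
```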
5 Framework (of Navarro)
[Figure: slide "A LZ77 Self-Index" from G. Navarro's talk "Indexing LZ77", illustrating the index structures on an example text.]
6 Old Idea
Secondary occurrence: an occurrence is secondary iff it is completely contained in some phrase.
Primary occurrence: an occurrence is primary iff it crosses some boundary.
Assuming we have all primary occurrences, all secondary occurrences can be found via 2-sided 2D range reporting.
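The step from primary to secondary occurrences can be sketched as below. The sketch makes several assumptions that are not in the slides: the pattern has length at least 2 (single-letter patterns inside literal phrases would need separate handling), each copied phrase is given as a hypothetical triple `(dst, src, length)`, and a plain scan over all copied phrases stands in for the 2-sided 2D range reporting structure. The two-sided condition is visible in the test `src <= s and s + m <= src + length`.

```python
from bisect import bisect_right


def crosses_boundary(s, m, phrase_starts, n):
    """True iff the occurrence text[s:s+m] is not contained in a single phrase."""
    k = bisect_right(phrase_starts, s) - 1
    end = phrase_starts[k + 1] if k + 1 < len(phrase_starts) else n
    return s + m > end


def secondary_occurrences(primary, m, copies):
    """Derive all secondary occurrences from the primary ones.

    copies: list of (dst, src, length), meaning text[dst:dst+length] was
    copied from text[src:src+length].
    """
    found = set(primary)
    stack = list(primary)
    while stack:
        s = stack.pop()
        for dst, src, length in copies:
            # Two-sided query: the source must start at or before s
            # and end at or after s + m.
            if src <= s and s + m <= src + length:
                t = dst + (s - src)
                if t not in found:
                    found.add(t)
                    stack.append(t)
    return sorted(found - set(primary))


text = "abababab"                      # phrases: a | b | ab | abab
phrase_starts = [0, 1, 2, 4]
copies = [(2, 0, 2), (4, 0, 4)]
m = 2
occurrences = [s for s in range(len(text) - m + 1) if text.startswith("ab", s)]
primary = [s for s in occurrences
           if crosses_boundary(s, m, phrase_starts, len(text))]
secondary = secondary_occurrences(primary, m, copies)
```

Every newly found occurrence is pushed back on the stack because it may itself lie inside the source of a later phrase.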
7 Old Idea, continued
[Figure: the pattern split at a phrase boundary into P[1..i] and P[i+1..m].]
To find all primary occurrences of P[1..m], for each 1 ≤ i ≤ m, we
1. search for P[i+1..m] in the Patricia tree of the suffixes starting at phrase boundaries,
2. search for (P[1..i])^R in the Patricia tree of the reversed phrases,
3. check the results via random access,
4. use range reporting to find all boundaries preceded by P[1..i] and followed by P[i+1..m].
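The four steps above can be sketched with much simpler stand-ins. This is an illustration under assumptions, not the data structures from the talk: sorted Python lists with binary search replace the two Patricia trees (so no separate verification step is needed), a scan over all boundary points replaces 2D range reporting, and the names `build_index`, `prefix_range` and `primary_occurrences` are hypothetical. The loop runs over every split 1 ≤ i ≤ m, so an occurrence that ends exactly at a phrase boundary (i = m) is reported as well; dropping that iteration keeps only occurrences that strictly cross a boundary. The sketch also assumes the text does not contain the character U+10FFFF.

```python
from bisect import bisect_left, bisect_right


def prefix_range(sorted_strings, p):
    """Half-open range of the entries of sorted_strings that start with p."""
    lo = bisect_left(sorted_strings, p)
    hi = bisect_right(sorted_strings, p + "\U0010ffff")
    return lo, hi


def build_index(text, phrase_ends):
    """phrase_ends: sorted exclusive end positions of the phrases."""
    starts = [0] + phrase_ends[:-1]
    rev = sorted((text[s:e][::-1], e) for s, e in zip(starts, phrase_ends))
    suf = sorted((text[e:], e) for e in phrase_ends)
    suf_rank = {e: j for j, (_, e) in enumerate(suf)}
    # One grid point per boundary: (rank of reversed phrase, rank of suffix).
    points = [(x, suf_rank[e], e) for x, (_, e) in enumerate(rev)]
    return [r for r, _ in rev], [s for s, _ in suf], points


def primary_occurrences(pattern, index):
    rev_keys, suf_keys, points = index
    m = len(pattern)
    result = set()
    for i in range(1, m + 1):
        x_lo, x_hi = prefix_range(rev_keys, pattern[:i][::-1])
        y_lo, y_hi = prefix_range(suf_keys, pattern[i:])
        # Brute-force stand-in for 2D range reporting on the grid.
        for x, y, e in points:
            if x_lo <= x < x_hi and y_lo <= y < y_hi:
                result.add(e - i)
    return sorted(result)


text = "abababab"                      # phrases: a | b | ab | abab
index = build_index(text, [1, 2, 4, 8])
found = primary_occurrences("ab", index)
```

Because P[1..i] must be a suffix of a single phrase, each reported occurrence is found for exactly one value of i, namely the distance from its start to the end of the phrase containing its first character.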
8 New Ideas
We don't use random access during search.
We need to extract only from the phrase boundaries, so we store bookmarks to them.
We also store a data structure for 1D range reporting (or a bitvector) at each node with depth at most log log z.
9 Bounds
Calculation shows that if we choose our data structures carefully and can extract from bookmarks in O(1) time per character, then with O(z log log z) extra words we can find all occ occurrences of P in O(m^2 + occ log log n) time.
10 Comparison
C & N '09: total space O(|SLP|) + r log n bits; search time O((m^2 + h(m + occ)) log r)
C & N '11: total space (2 + o(1))|CFG| + R log n + ε r log r bits; search time O((m^2/ε) log R + occ log r)
K & N '11: total space 2z log(n/z) + z log z + 5z log σ + O(z) + o(n) bits; search time O(m^2 d + (m + occ) log z)
Us: total space |BSLP| + O(z(log n + log z log log z)) bits; search time O(m^2 + occ log log n)
r: number of rules in the grammar
h: height of the parse tree
R: total length of the right-hand sides
d: depth of nesting in the parse
11 Bookmarking
Lemma: Given a balanced SLP for S with r rules and integers b and g, we can store 2 log r + O(log g) bits such that later, given l ≤ g, we can extract S[b − l..b + l] in O(l + log g) time.
Corollary: We can store O(log z) words such that, given any l, we can extract S[b − l..b + l] in O(l) time.
12 Parse Tree
[Figure: parse tree with nodes labelled u, v and w, a height annotation O(log g), and the positions i, k, k + 1, b = i + g and j = i + 2g marked on the text.]
13 Space Bounds (in words)
Patricia trees: O(z)
bookmarks: O(z log z)
1D range reporting: O(z log log z)
4-sided 2D range reporting: O(z log log z)
2-sided 2D range reporting: O(z)
Total: O(z log log z)
14 Time Bounds
searching in Patricia trees (with perfect hashing if necessary): O(m^2)
extracting from bookmarks: O(m^2)
1D or 4-sided 2D range reporting: O(m^2)
2-sided 2D range reporting: O(occ log log n)
Total: O(m^2 + occ log log n)
15 Final Result
Theorem: Given a balanced SLP for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words such that, given a pattern P[1..m], we can find all occ occurrences of P in O(m^2 + occ log log n) time.
16 News Flash!!! (not in the paper)
Theorem: Given a balanced SLP with r rules for a string S[1..n] whose LZ77 parse consists of z phrases, we can build an O(r + z log log z)-word data structure such that, given a pattern P[1..m], we can find all occ occurrences of P in S in O(m log z + occ log log n) time.
17 PAUSE FOR BREATH
18 Relative Lempel-Ziv
Kuruppu, Puglisi and Zobel proposed that, to store a genomic database, we
1. build an FM-index for the first genome G (or an artificial reference genome),
2. compress the rest with a version of LZ77 that allows phrases to be copied only from G.
Do, Jansson, Sadakane and Sung designed an RLZ-index before we did, but didn't publish it.
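The second step, parsing a sequence relative to a fixed reference, can be sketched as follows. The sketch assumes a greedy parse and uses `str.find` on the reference as a slow stand-in for the FM-index; the names `rlz_parse` and `rlz_decode`, and the choice to emit a character that does not occur in the reference as a literal, are assumptions made for illustration.

```python
def rlz_parse(reference, seq):
    """Greedy parse of seq into phrases copied only from reference.

    Returns a list whose items are (pos, length) pairs pointing into the
    reference, or a single character when that character is absent from it.
    """
    phrases = []
    i = 0
    while i < len(seq):
        pos, length = -1, 0
        # Extend the phrase while it still occurs somewhere in the reference.
        while i + length < len(seq):
            p = reference.find(seq[i:i + length + 1])
            if p == -1:
                break
            pos, length = p, length + 1
        if length == 0:
            phrases.append(seq[i])        # literal character
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the sequence from the reference and its phrases."""
    return "".join(p if isinstance(p, str) else reference[p[0]:p[0] + p[1]]
                   for p in phrases)


reference = "ACGTACGT"
parsed = rlz_parse(reference, "ACGTTCGTAN")
```

Since phrases never point into previously parsed sequences, each genome in the database can be decoded with access to the reference alone.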
19 So what happens in RLZ?
Theorem: We can store the database in O(n(h_k(G) + 1) + z(log n + log z log log z)) bits such that we can find all occ occurrences of P in O((m + occ) log^ε (n + z)) time.
The log^ε (n + z) factor in the query time comes from accessing the suffix array using Grossi, Gupta and Vitter's CSA. In real life, it should be log n.
20 Future work
1. Is it possible to construct from an LZ77 parse an SLP of size smaller than O(z log n)?
2. Can we achieve O(m + occ) query time (maybe with a slightly bigger self-index)?
21 QUESTIONS?
A Unifying Framework for Compressed Pattern Matching Takuya Kida Yusuke Shibata Masayuki Takeda Ayumi Shinohara Setsuo Arikawa Department of Informatics, Kyushu University 33 Fukuoka 812-8581, Japan {
More informationAlternative Algorithms for Lyndon Factorization
Alternative Algorithms for Lyndon Factorization Suhpal Singh Ghuman 1, Emanuele Giaquinta 2, and Jorma Tarhio 1 1 Department of Computer Science and Engineering, Aalto University P.O.B. 15400, FI-00076
More informationarxiv: v1 [cs.ds] 9 Apr 2018
From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract
More informationOBLIVIOUS STRING EMBEDDINGS AND EDIT DISTANCE APPROXIMATIONS
OBLIVIOUS STRING EMBEDDINGS AND EDIT DISTANCE APPROXIMATIONS Tuğkan Batu a, Funda Ergun b, and Cenk Sahinalp b a LONDON SCHOOL OF ECONOMICS b SIMON FRASER UNIVERSITY LSE CDAM Seminar Oblivious String Embeddings
More informationAlgorithm Design and Analysis
Algorithm Design and Analysis LECTURE 8 Greedy Algorithms V Huffman Codes Adam Smith Review Questions Let G be a connected undirected graph with distinct edge weights. Answer true or false: Let e be the
More informationPattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1
Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt
More informationON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION
ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced
More informationSIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding
SIGNAL COMPRESSION Lecture 7 Variable to Fix Encoding 1. Tunstall codes 2. Petry codes 3. Generalized Tunstall codes for Markov sources (a presentation of the paper by I. Tabus, G. Korodi, J. Rissanen.
More informationContext-Free Languages (Pre Lecture)
Context-Free Languages (Pre Lecture) Dr. Neil T. Dantam CSCI-561, Colorado School of Mines Fall 2017 Dantam (Mines CSCI-561) Context-Free Languages (Pre Lecture) Fall 2017 1 / 34 Outline Pumping Lemma
More informationPattern Matching of Compressed Terms and Contexts and Polynomial Rewriting
Pattern Matching of Compressed Terms and Contexts and Polynomial Rewriting Manfred Schmidt-Schauß 1 Institut für Informatik Johann Wolfgang Goethe-Universität Postfach 11 19 32 D-60054 Frankfurt, Germany
More information2. Exact String Matching
2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.
More informationTheoretical aspects of ERa, the fastest practical suffix tree construction algorithm
Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Matevž Jekovec University of Ljubljana Faculty of Computer and Information Science Oct 10, 2013 Text indexing problem
More informationSuccinct 2D Dictionary Matching with No Slowdown
Succinct 2D Dictionary Matching with No Slowdown Shoshana Neuburger and Dina Sokol City University of New York Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d
More informationEven More on Dynamic Programming
Algorithms & Models of Computation CS/ECE 374, Fall 2017 Even More on Dynamic Programming Lecture 15 Thursday, October 19, 2017 Sariel Har-Peled (UIUC) CS374 1 Fall 2017 1 / 26 Part I Longest Common Subsequence
More informationOpportunistic Data Structures with Applications
Opportunistic Data Structures with Applications Paolo Ferragina Giovanni Manzini Abstract There is an upsurging interest in designing succinct data structures for basic searching problems (see [23] and
More information