A Faster Grammar-Based Self-Index


1 A Faster Grammar-Based Self-Index
Travis Gagie (1), Paweł Gawrychowski (2), Juha Kärkkäinen (3), Yakov Nekrich (4), Simon Puglisi (5)
(1) Aalto University, (2) Max-Planck-Institut für Informatik, (3) University of Helsinki, (4) University of Bonn, (5) King's College, London
LATA '12

2 Motivation
We want to store repetitive texts (say, genomic databases) in compressed form, but such that we can search them quickly. In other words, given a text, build a small structure which allows fast pattern matching.
Pattern matching? In this talk pattern matching = exact pattern matching, i.e., given P[1..m] we want to find where it occurs exactly in the text S[1..n]. We might want the first occurrence, or all of them, or just a few...
Such a structure is called an index. If it also allows retrieving the original text, it is called a self-index.

3 Problem, more precisely
We are asked to build a self-index for a string S[1..n] whose LZ77 parse consists of z phrases.
What is LZ77? Given a text, we split it into z disjoint fragments called phrases. Each fragment is either a single letter or a substring of the already defined prefix. The number of phrases is believed to be the right measure of how repetitive the text is.
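The phrase definition above can be illustrated with a naive greedy parser. This is only a sketch under stated assumptions: the function name `lz77_parse` is hypothetical, each phrase is taken as the longest prefix of the remaining text that occurs entirely inside the already parsed prefix (no self-overlapping copies), and the quadratic-or-worse running time is irrelevant for the illustration.

```python
def lz77_parse(s):
    """Greedy parse of s into phrases.

    Each phrase is either a single new letter or the longest prefix of the
    remaining text that already occurs inside the parsed prefix s[:i].
    Naive illustration only; not the construction used in the talk.
    """
    phrases = []
    i = 0
    while i < len(s):
        length = 0
        # Extend the candidate phrase while it still occurs within s[:i].
        while i + length < len(s) and s.find(s[i:i + length + 1], 0, i) != -1:
            length += 1
        if length == 0:
            length = 1  # first occurrence of this letter: single-letter phrase
        phrases.append(s[i:i + length])
        i += length
    return phrases


print(lz77_parse("abababab"))  # ['a', 'b', 'ab', 'abab'], so z = 4
```

On a highly repetitive text the phrases roughly double in length, which is why z stays small compared with n.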

4 Solution?
Straight-line program (SLP), or grammar representation: simply a context-free grammar with exactly one production per nonterminal.
It is known that, given an LZ77 parse consisting of z phrases, we can construct such a program consisting of just O(z log n) words. The program can be assumed to be balanced, meaning that for each production A → BC we have |B| ≈ |C|. Extracting an arbitrary substring of length l from a balanced SLP takes O(log n + l) time.
But how do we search?!
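To make the extraction bound concrete, here is a minimal sketch of substring extraction from an SLP. The class name `SLP`, the rule encoding (a rule is either a single character or a pair of indices of earlier rules) and the method names are assumptions made for this illustration. The descent follows one root-to-leaf path and then visits only nodes inside the requested range, which gives O(height + l) time, i.e. O(log n + l) when the grammar is balanced.

```python
class SLP:
    """Straight-line program: rule k is a terminal character or a pair (left, right).

    Hypothetical encoding for illustration; pairs must refer to earlier rules,
    and the last rule is the start symbol.
    """

    def __init__(self, rules):
        self.rules = rules
        self.length = []  # length of the expansion of each rule
        for rule in rules:
            if isinstance(rule, str):
                self.length.append(1)
            else:
                left, right = rule
                self.length.append(self.length[left] + self.length[right])

    def extract(self, start, l):
        """Return up to l characters of the text, starting at 0-based offset start."""
        root = len(self.rules) - 1
        l = max(0, min(l, self.length[root] - start))
        out = []
        self._extract(root, start, l, out)
        return "".join(out)

    def _extract(self, sym, start, l, out):
        if l <= 0:
            return
        rule = self.rules[sym]
        if isinstance(rule, str):
            out.append(rule)
            return
        left, right = rule
        left_len = self.length[left]
        if start < left_len:
            take = min(l, left_len - start)
            self._extract(left, start, take, out)   # part inside the left child
            self._extract(right, 0, l - take, out)  # remainder spills to the right
        else:
            self._extract(right, start - left_len, l, out)


# Rules: 0 -> a, 1 -> b, 2 -> ab, 3 -> abab, 4 -> abababab
slp = SLP(["a", "b", (0, 1), (2, 2), (3, 3)])
print(slp.extract(3, 4))  # baba
```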

5 Framework (of Navarro)
[Figure: slide "A LZ77 Self-Index" from G. Navarro, "Indexing LZ77".]

6 Old Idea

8 Old Idea
Secondary occurrence: an occurrence is secondary iff it is completely contained in some phrase.

9 Old Idea
Primary occurrence: an occurrence is primary iff it crosses some boundary.

10 Old Idea
Primary occurrence: an occurrence is primary iff it crosses some boundary.
Assuming we have all primary occurrences, all secondary occurrences can be found via 2-sided 2D range reporting.
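The step from primary to secondary occurrences can be sketched as follows. Everything here is an illustrative assumption rather than the structure from the talk: the function name `secondary_occurrences`, the phrase encoding `(target_start, source_start, length)`, and above all the linear scan over phrases, which stands in for the 2-sided 2D range query. Each copied phrase is a point (source_start, source_end); an occurrence [o, o + m) is copied along with the phrase exactly when source_start <= o and source_end >= o + m.

```python
def secondary_occurrences(primary, m, copies):
    """Return all occurrences (sorted), given the primary ones.

    primary: start positions of primary occurrences of a pattern of length m
    copies:  (target_start, source_start, length) for every copied phrase
    The inner loop is a stand-in for 2-sided 2D range reporting.
    """
    found = set(primary)
    stack = list(primary)
    while stack:
        o = stack.pop()
        for target, source, length in copies:
            # Occurrence lies completely inside the source of this phrase.
            if source <= o and o + m <= source + length:
                copy = target + (o - source)
                if copy not in found:
                    found.add(copy)
                    stack.append(copy)  # a copy may itself be copied again
    return sorted(found)


# Text "abababab" parsed as a | b | ab | abab; pattern "ba" has primary
# occurrences at 1 and 3 (0-based), each crossing a phrase boundary.
print(secondary_occurrences([1, 3], 2, [(2, 0, 2), (4, 0, 4)]))  # [1, 3, 5]
```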

11 Old Idea, continued

12 Old Idea, continued
P[1..i] P[i+1..m]

13 Old Idea, continued
P[1..i] P[i+1..m]
To find all primary occurrences of P[1..m], for each 1 ≤ i ≤ m, we
1. search for P[i+1..m] in the Patricia tree of the suffixes starting at phrase boundaries,
2. search for (P[1..i])^R in the Patricia tree of the reversed phrases,
3. check the results via random access,
4. use range reporting to find all boundaries preceded by P[1..i] and followed by P[i+1..m].
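The four steps above can be sketched on plain strings. This is a simplified stand-in, not the index itself: sorted Python lists searched with `bisect` replace the two Patricia trees, a loop over one rank interval replaces the 2D range reporting structure, reversed text prefixes are used in place of reversed phrases, and no verification step is needed because the keys are stored explicitly. The names `prefix_range` and `primary_occurrences` are hypothetical, the sketch only tries splits where both parts are non-empty, and it assumes the text never contains the character U+10FFFF.

```python
from bisect import bisect_left


def prefix_range(keys, q):
    """Half-open rank interval of the sorted keys that start with q."""
    return bisect_left(keys, q), bisect_left(keys, q + "\U0010FFFF")


def primary_occurrences(s, boundaries, p):
    """Start positions (0-based) of occurrences of p that cross a boundary.

    boundaries: positions b with 0 < b < len(s) at which a new phrase starts.
    """
    # x-axis: boundaries ordered by the reversed text preceding them.
    by_prefix = sorted((s[:b][::-1], b) for b in boundaries)
    # y-axis: boundaries ordered by the suffix starting at them.
    by_suffix = sorted((s[b:], b) for b in boundaries)
    x_keys = [key for key, _ in by_prefix]
    y_keys = [key for key, _ in by_suffix]
    y_rank = {b: rank for rank, (_, b) in enumerate(by_suffix)}

    found = set()
    for i in range(1, len(p)):  # split p into p[:i] and p[i:]
        x_lo, x_hi = prefix_range(x_keys, p[:i][::-1])
        y_lo, y_hi = prefix_range(y_keys, p[i:])
        # Naive 2D range reporting: walk the x-interval, filter by y-rank.
        for x in range(x_lo, x_hi):
            b = by_prefix[x][1]
            if y_lo <= y_rank[b] < y_hi:
                found.add(b - i)
    return sorted(found)


# "abababab" parsed as a | b | ab | abab has phrase starts at 1, 2 and 4.
print(primary_occurrences("abababab", [1, 2, 4], "ba"))  # [1, 3]
```

The occurrence of "ba" at position 5 is not reported here because it lies inside the last phrase; it is a secondary occurrence.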

14 New Ideas
We don't use random access during search. We need to extract only from the phrase boundaries, so we store bookmarks to them. We also store a data structure for 1D range reporting (or a bitvector) at each node with depth at most log log z.

15 Bounds
Calculation shows that if we choose our data structures carefully and can extract from bookmarks in O(1) time per character, then with O(z log log z) extra words we can find all occ occurrences of P in O(m^2 + occ log log n) time.

16 Comparison
source     total space (bits)                                   search time
C & N 09   O(|SLP|) + r log n                                   O((m^2 + h(m + occ)) log r)
C & N 11   (2 + o(1))|CFG| + R log n + ε r log r                O((m^2/ε) log R + occ log r)
K & N 11   2z log(n/z) + z log z + 5z log σ + O(z) + o(n)       O(m^2 d + (m + occ) log z)
Us         |BSLP| + O(z(log n + log z log log z))               O(m^2 + occ log log n)
r: number of rules in the grammar
h: height of the parse tree
R: total length of the right-hand sides
d: depth of nesting in the parse

17 Bookmarking
Lemma: Given a balanced SLP for S with r rules and integers b and g, we can store 2 log r + O(log g) bits such that later, given l ≤ g, we can extract S[b - l..b + l] in O(l + log g) time.
Corollary: We can store O(log z) words such that, given any l, we can extract S[b - l..b + l] in O(l) time.

18 Parse Tree
[Figure: parse tree; labels: u, v, w, O(log g), i, k, k + 1, b = i + g, j = i + 2g.]

19 Space Bounds (in words)
Patricia trees: O(z)
bookmarks: O(z log z)
1D range reporting: O(z log log z)
4-sided 2D range reporting: O(z log log z)
2-sided 2D range reporting: O(z)
Total: O(z log log z)

20 Time Bounds
searching in Patricia trees: O(m^2) (with perfect hashing if necessary)
extracting from bookmarks: O(m^2)
1D or 4-sided 2D range reporting: O(m^2)
2-sided 2D range reporting: O(occ log log n)
Total: O(m^2 + occ log log n)

21 Final Result
Theorem: Given a balanced SLP for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words such that, given a pattern P[1..m], we can find all occ occurrences of P in O(m^2 + occ log log n) time.

22 News Flash!!! (not in the paper)
Theorem: Given a balanced SLP with r rules for a string S[1..n] whose LZ77 parse consists of z phrases, we can build an O(r + z log log z)-word data structure such that, given a pattern P[1..m], we can find all occ occurrences of P in S in O(m log z + occ log log n) time.

23 PAUSE FOR BREATH

24 Relative Lempel-Ziv
Kuruppu, Puglisi and Zobel proposed that, to store a genomic database, we
1. build an FM-index for the first genome G (or an artificial reference genome),
2. compress the rest with a version of LZ77 that allows phrases to be copied only from G.
Do, Jansson, Sadakane and Sung designed an RLZ-index before we did, but didn't publish it.
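The second step, copying phrases only from the reference G, can be sketched as below. The names `rlz_parse` and `rlz_decode` are hypothetical, `str.find` on the plain reference stands in for searching an FM-index of G, and emitting a literal for a character that never occurs in G is an assumption of this sketch rather than something stated on the slide.

```python
def rlz_parse(reference, target):
    """Greedy relative Lempel-Ziv parse of target against reference.

    Returns a list whose items are either (position, length) pairs pointing
    into the reference, or single characters absent from the reference.
    """
    phrases = []
    i = 0
    while i < len(target):
        pos, length = -1, 0
        # Extend the phrase while it still occurs somewhere in the reference.
        while i + length < len(target):
            hit = reference.find(target[i:i + length + 1])
            if hit == -1:
                break
            pos, length = hit, length + 1
        if length == 0:
            phrases.append(target[i])  # literal: character not in reference
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the target from the reference and its phrases."""
    return "".join(
        p if isinstance(p, str) else reference[p[0]:p[0] + p[1]] for p in phrases
    )


phrases = rlz_parse("acgtacgt", "gtacca")
print(phrases)                          # [(2, 4), (1, 1), (0, 1)]
print(rlz_decode("acgtacgt", phrases))  # gtacca
```

Because every phrase points into G only, any substring of a compressed genome can be decoded without touching the other genomes.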

25 So what happens in RLZ?
Theorem: We can store the database in O(n(H_k(G) + 1) + z(log n + log z log log z)) bits such that we can find all occ occurrences of P in O((m + occ) log^ε (n + z)) time.
The log^ε (n + z) factor in the query time comes from accessing the suffix array using Grossi, Gupta and Vitter's CSA. In real life, it should be log n.

26 Future work:
1. Is it possible to construct from an LZ77 parse an SLP of size smaller than O(z log n)?
2. Can we achieve O(m + occ) query time? (maybe with a slightly bigger self-index?)

27 QUESTIONS?


More information

String Matching with Variable Length Gaps

String Matching with Variable Length Gaps String Matching with Variable Length Gaps Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind Technical University of Denmark Abstract. We consider string matching with variable length

More information

Motivation for Arithmetic Coding

Motivation for Arithmetic Coding Motivation for Arithmetic Coding Motivations for arithmetic coding: 1) Huffman coding algorithm can generate prefix codes with a minimum average codeword length. But this length is usually strictly greater

More information

Rank and Select Operations on Binary Strings (1974; Elias)

Rank and Select Operations on Binary Strings (1974; Elias) Rank and Select Operations on Binary Strings (1974; Elias) Naila Rahman, University of Leicester, www.cs.le.ac.uk/ nyr1 Rajeev Raman, University of Leicester, www.cs.le.ac.uk/ rraman entry editor: Paolo

More information

Compressing Kinetic Data From Sensor Networks. Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park

Compressing Kinetic Data From Sensor Networks. Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park Compressing Kinetic Data From Sensor Networks Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park Motivation Motivation Computer Science Graphics: Image and video

More information

Solution. S ABc Ab c Bc Ac b A ABa Ba Aa a B Bbc bc.

Solution. S ABc Ab c Bc Ac b A ABa Ba Aa a B Bbc bc. Section 12.4 Context-Free Language Topics Algorithm. Remove Λ-productions from grammars for langauges without Λ. 1. Find nonterminals that derive Λ. 2. For each production A w construct all productions

More information

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VF Files On-line MatBio 18 Solon P. Pissis and Ahmad Retha King s ollege London 02-Aug-2018 Solon P. Pissis and Ahmad Retha

More information

Automata and Computability. Solutions to Exercises

Automata and Computability. Solutions to Exercises Automata and Computability Solutions to Exercises Spring 27 Alexis Maciel Department of Computer Science Clarkson University Copyright c 27 Alexis Maciel ii Contents Preface vii Introduction 2 Finite Automata

More information

Fast Fully-Compressed Suffix Trees

Fast Fully-Compressed Suffix Trees Fast Fully-Compressed Suffix Trees Gonzalo Navarro Department of Computer Science University of Chile, Chile gnavarro@dcc.uchile.cl Luís M. S. Russo INESC-ID / Instituto Superior Técnico Technical University

More information

Foundations of Informatics: a Bridging Course

Foundations of Informatics: a Bridging Course Foundations of Informatics: a Bridging Course Week 3: Formal Languages and Semantics Thomas Noll Lehrstuhl für Informatik 2 RWTH Aachen University noll@cs.rwth-aachen.de http://www.b-it-center.de/wob/en/view/class211_id948.html

More information

Optimal Dynamic Sequence Representations

Optimal Dynamic Sequence Representations Optimal Dynamic Sequence Representations Gonzalo Navarro Yakov Nekrich Abstract We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on

More information

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Roberto Grossi Dipartimento di Informatica Università di Pisa 56125 Pisa, Italy grossi@di.unipi.it Jeffrey

More information

Automata and Computability. Solutions to Exercises

Automata and Computability. Solutions to Exercises Automata and Computability Solutions to Exercises Fall 28 Alexis Maciel Department of Computer Science Clarkson University Copyright c 28 Alexis Maciel ii Contents Preface vii Introduction 2 Finite Automata

More information

Introduction to Theory of Computing

Introduction to Theory of Computing CSCI 2670, Fall 2012 Introduction to Theory of Computing Department of Computer Science University of Georgia Athens, GA 30602 Instructor: Liming Cai www.cs.uga.edu/ cai 0 Lecture Note 3 Context-Free Languages

More information

A Unifying Framework for Compressed Pattern Matching

A Unifying Framework for Compressed Pattern Matching A Unifying Framework for Compressed Pattern Matching Takuya Kida Yusuke Shibata Masayuki Takeda Ayumi Shinohara Setsuo Arikawa Department of Informatics, Kyushu University 33 Fukuoka 812-8581, Japan {

More information

Alternative Algorithms for Lyndon Factorization

Alternative Algorithms for Lyndon Factorization Alternative Algorithms for Lyndon Factorization Suhpal Singh Ghuman 1, Emanuele Giaquinta 2, and Jorma Tarhio 1 1 Department of Computer Science and Engineering, Aalto University P.O.B. 15400, FI-00076

More information

arxiv: v1 [cs.ds] 9 Apr 2018

arxiv: v1 [cs.ds] 9 Apr 2018 From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract

More information

OBLIVIOUS STRING EMBEDDINGS AND EDIT DISTANCE APPROXIMATIONS

OBLIVIOUS STRING EMBEDDINGS AND EDIT DISTANCE APPROXIMATIONS OBLIVIOUS STRING EMBEDDINGS AND EDIT DISTANCE APPROXIMATIONS Tuğkan Batu a, Funda Ergun b, and Cenk Sahinalp b a LONDON SCHOOL OF ECONOMICS b SIMON FRASER UNIVERSITY LSE CDAM Seminar Oblivious String Embeddings

More information

Algorithm Design and Analysis

Algorithm Design and Analysis Algorithm Design and Analysis LECTURE 8 Greedy Algorithms V Huffman Codes Adam Smith Review Questions Let G be a connected undirected graph with distinct edge weights. Answer true or false: Let e be the

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1 Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt

More information

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced

More information

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding SIGNAL COMPRESSION Lecture 7 Variable to Fix Encoding 1. Tunstall codes 2. Petry codes 3. Generalized Tunstall codes for Markov sources (a presentation of the paper by I. Tabus, G. Korodi, J. Rissanen.

More information

Context-Free Languages (Pre Lecture)

Context-Free Languages (Pre Lecture) Context-Free Languages (Pre Lecture) Dr. Neil T. Dantam CSCI-561, Colorado School of Mines Fall 2017 Dantam (Mines CSCI-561) Context-Free Languages (Pre Lecture) Fall 2017 1 / 34 Outline Pumping Lemma

More information

Pattern Matching of Compressed Terms and Contexts and Polynomial Rewriting

Pattern Matching of Compressed Terms and Contexts and Polynomial Rewriting Pattern Matching of Compressed Terms and Contexts and Polynomial Rewriting Manfred Schmidt-Schauß 1 Institut für Informatik Johann Wolfgang Goethe-Universität Postfach 11 19 32 D-60054 Frankfurt, Germany

More information

2. Exact String Matching

2. Exact String Matching 2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.

More information

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Matevž Jekovec University of Ljubljana Faculty of Computer and Information Science Oct 10, 2013 Text indexing problem

More information

Succinct 2D Dictionary Matching with No Slowdown

Succinct 2D Dictionary Matching with No Slowdown Succinct 2D Dictionary Matching with No Slowdown Shoshana Neuburger and Dina Sokol City University of New York Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d

More information

Even More on Dynamic Programming

Even More on Dynamic Programming Algorithms & Models of Computation CS/ECE 374, Fall 2017 Even More on Dynamic Programming Lecture 15 Thursday, October 19, 2017 Sariel Har-Peled (UIUC) CS374 1 Fall 2017 1 / 26 Part I Longest Common Subsequence

More information

Opportunistic Data Structures with Applications

Opportunistic Data Structures with Applications Opportunistic Data Structures with Applications Paolo Ferragina Giovanni Manzini Abstract There is an upsurging interest in designing succinct data structures for basic searching problems (see [23] and

More information