Succincter text indexing with wildcards

Size: px
Start display at page:

Download "Succincter text indexing with wildcards"

Transcription

1 University of British Columbia CPM 2011 June 27, 2011

2 Problem overview

3 Problem overview

4 Problem overview

5 Problem overview

6 Problem overview

7 Problem overview

8 Problem overview

9 Problem overview

10 Problem overview

11 Problem overview Problem Create a succinct index for: a text T of length n over an alphabet of size σ containing d wildcard positions to support efficient pattern matching queries.

12 Problem overview Problem Create a succinct index for: a text T of length n over an alphabet of size σ containing d wildcard positions to support efficient pattern matching queries. Assumption (in presentation) No adjacent wildcards in T.

13 Motivation SNPs in the human genome T 3 billion bases σ = 4 d 3 million positions also protein collections (σ = 20)

14 Our results in context Non succinct indexes Cole, Gottlieb, Lewenstein (2004) Lam et al., (2007) O(n log k n) words O(n) words

15 Our results in context Non succinct indexes Cole, Gottlieb, Lewenstein (2004) Lam et al., (2007) O(n log k n) words O(n) words Succint indexes Index space (bits) Query working space (bits) Tam et al., 2009 (3 + o(1))n log σ O(n log d)

16 Our results in context Non succinct indexes Cole, Gottlieb, Lewenstein (2004) Lam et al., (2007) O(n log k n) words O(n) words Succint indexes Index space (bits) Query working space (bits) Tam et al., 2009 (3 + o(1))n log σ O(n log d) (reduced working space by increasing query time) Tam et al., 2009 (3 + o(1))n log σ O(n log σ)

17 Our results in context Non succinct indexes Cole, Gottlieb, Lewenstein (2004) Lam et al., (2007) O(n log k n) words O(n) words Succint indexes Index space (bits) Query working space (bits) Tam et al., 2009 (3 + o(1))n log σ O(n log d) (reduced working space by increasing query time) Tam et al., 2009 (3 + o(1))n log σ O(n log σ) This work (2 + o(1))n log σ + O(n) O(m log n + md) (Ignoring smaller order terms)

18 3 Types of matches Type 1: pattern contained within a text segment [Lam et al., 2007, Tam et al., 2009]

19 3 Types of matches Type 1: pattern contained within a text segment Type 2: pattern spanning one wildcard [Lam et al., 2007, Tam et al., 2009]

20 3 Types of matches Type 1: pattern contained within a text segment Type 2: pattern spanning one wildcard Type 3: pattern contains at least one text segment [Lam et al., 2007, Tam et al., 2009]

21 Review: SAs and the Burrows Wheeler transform mississippi$ ississippi$m ssissippi$mi sissippi$mis issippi$miss ssippi$missi sippi$missis ippi$mississ ppi$mississi pi$mississip i$mississipp $mississippi Naïve BWT algorithm For a string S:

22 Review: SAs and the Burrows Wheeler transform mississippi$ ississippi$m ssissippi$mi sissippi$mis issippi$miss ssippi$missi sippi$missis ippi$mississ ppi$mississi pi$mississip i$mississipp $mississippi Naïve BWT algorithm For a string S: 1 append $ to S ($ < c, c Σ)

23 Review: SAs and the Burrows Wheeler transform mississippi$ ississippi$m ssissippi$mi sissippi$mis issippi$miss ssippi$missi sippi$missis ippi$mississ ppi$mississi pi$mississip i$mississipp $mississippi Naïve BWT algorithm For a string S: 1 append $ to S ($ < c, c Σ) 2 list all cyclic rotations of S

24 Review: SAs and the Burrows Wheeler transform $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Naïve BWT algorithm For a string S: 1 append $ to S ($ < c, c Σ) 2 list all cyclic rotations of S 3 sort rows of matrix (lexicographically)

25 Review: SAs and the Burrows Wheeler transform $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Naïve BWT algorithm For a string S: 1 append $ to S ($ < c, c Σ) 2 list all cyclic rotations of S 3 sort rows of matrix (lexicographically) 4 output last column of matrix

26 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [8, 11] [Ferragina & Manzini 2002]

27 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [8, 11] [Ferragina & Manzini 2002]

28 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [3, 4] [Ferragina & Manzini 2002]

29 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [3, 4] [Ferragina & Manzini 2002]

30 Review: backwards search of pattern P in string S $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi Example s s i s SA range: [9, 9] [Ferragina & Manzini 2002]

31 Type 1 matching Build F, a compressed suffix array of T. Lemma All occ 1 Type 1 matches of P can be reported using: Query time O( P log σ + occ 1 log 1+ɛ n) Index space (1 + o(1))n log σ bits Work space O(log n) bits

32 Matching prefixes of text segments

33 Matching prefixes of text segments rank (BWT, i) queries determine a lexicographic range of text segments

34 Type 2 matching Use F and R to find candidate text segments

35 Type 2 matching Use Q to find compatible candidate [Bose et al., 2009]

36 Type 2 matching Lemma All occ 2 Type 2 matches of P can be reported using: Query time Index space Work space O( P log σ + ( P + occ 2 ) log d (2 + o(1))n log σ bits O(m log n) bits log log d )

37 Type 3 matching We will enhance F to support dictionary queries calculate matching statistics for each suffix, return text segments contained as a prefix

38 Review: computing matching statistics suffix length SA range c 1 [15,19] cc 2 [19,19] acc 3 [14,14] cacc 4 [18,18] acacc 3 [13,13] [Ohlebusch, Gog & Kügel 2010]

39 Finding dictionary matches Case 1

40 Finding dictionary matches Case 1

41 Finding dictionary matches Case 1 [Munro & Ramen 2002]

42 Finding dictionary matches Case 1 [Ramen et al., 2002]

43 Finding dictionary matches Case 1 [Grossi & Vitter 2000]

44 Finding dictionary matches Case 1

45 Finding dictionary matches Case 1

46 Finding dictionary matches Case 1 if BP[pos] = ) then find matching (

47 Finding dictionary matches Case 1 if BP[pos] = ) then find matching (

48 Finding dictionary matches Case 1 if BP[pos] = ) then find matching (

49 Finding dictionary matches Case 2

50 Finding dictionary matches Case 2 if BP[pos] = ( then find parent interval

51 Finding dictionary matches Case 2 if BP[pos] = ( then find parent interval

52 Summary of the full-text dictionary Theorem A full-text dictionary using (1 + o(1))n log σ + O(n) + O(d log n) bits supports the following operations efficiently: report/count text segments containing P report/count text segments prefixed by P report/count text segments contained in P

53 Type 3 matching Preliminaries Every text segment contained within P is a candidate

54 Type 3 matching Preliminaries Every text segment contained within P is a candidate A candidate is valid if it satisfies: 1 suffix condition 2 prefix condition

55 Algorithm overview We design a dynamic programming algorithm: P phases of algorithm (i = P,..., 2, 1) phase i considers candidates prefixing P[i.. P ]

56 Algorithm overview We design a dynamic programming algorithm: P phases of algorithm (i = P,..., 2, 1) phase i considers candidates prefixing P[i.. P ] potential matches are tracked in a working array W

57 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j + 1

58 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j verify suffix condition

59 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j verify suffix condition 2 if P must begin in text segment j 1 then verify prefix condition

60 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j verify suffix condition 2 if P must begin in text segment j 1 then verify prefix condition

61 Type 3 matching Case 1 Text segment j matches P[i.. P ] (case 1) P must end within text segment j verify suffix condition 2 if P must begin in text segment j 1 then verify prefix condition 3 else record partial match

62 Type 3 matching Case 2 Text segment j matches P[i.. P ] (case 2) P must contain text segment j + 1

63 Type 3 matching Case 2 Text segment j matches P[i.. P ] (case 2) P must contain text segment j verify suffix condition by checking W

64 Type 3 matching Case 2 Text segment j matches P[i.. P ] (case 2) P must contain text segment j verify suffix condition by checking W 2 if P must begin in text segment j 1 then verify prefix condition 3 else record partial match

65 Type 3 matching Lemma All occ 3 Type 3 matches of P can be reported using: Query time O( P log σ + γ) Index space (2 + o(1))n log σ bits Work space O(m log n + md) bits where γ is number of text segments contained in P

66 Main result Theorem Given a text T of length n containing d wildcards, all matches of a pattern P can be reported using: Query time Index space Work space O( P log σ + m log d log log d + occ log d 1 log n + occ 2 log log d + γ) (2 + o(1))n log σ + O(n)+O(m log n) bits O(m log n + md) bits

67 Main result Theorem Given a text T of length n containing d wildcards, all matches of a pattern P can be reported using: Query time Index space Work space O( P log σ + m log d log log d + occ log d 1 log n + occ 2 log log d + γ) (2 + o(1))n log σ + O(n)+O(m log n) bits O(m log n + md) bits

68 Main result Theorem Given a text T of length n containing d wildcards, all matches of a pattern P can be reported using: Query time Index space Work space O( P log σ + m log d log log d + occ log d 1 log n + occ 2 log log d + γ) (2 + o(1))n log σ + O(n)+O(m log n) bits O(m log n + md) bits

69 Significance of working space reduction Short read alignment to human genome T 3 billion bases d 3 million bases m Working space reduced from GBs [Tam et al., 2009] to 10 s of MBs

70 Questions Future Work Can index space be reduced to (1 + o(1))n log σ bits? YES, but with query time: O(max(m 2 log σ log log σ, current query time)) Thanks to Anne Condon and anonymous reviewers.

Compact Data Strutures

Compact Data Strutures (To compress is to Conquer) Compact Data Strutures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search in Big Linked Data 23 TH AUGUST 2017 Agenda

More information

Compressed Representations of Sequences and Full-Text Indexes

Compressed Representations of Sequences and Full-Text Indexes Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Dipartimento di Informatica, Università di Pisa, Italy GIOVANNI MANZINI Dipartimento di Informatica, Università del Piemonte

More information

Preview: Text Indexing

Preview: Text Indexing Simon Gog gog@ira.uka.de - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Text Indexing Motivation Problems Given a text

More information

Alphabet Friendly FM Index

Alphabet Friendly FM Index Alphabet Friendly FM Index Author: Rodrigo González Santiago, November 8 th, 2005 Departamento de Ciencias de la Computación Universidad de Chile Outline Motivations Basics Burrows Wheeler Transform FM

More information

New Algorithms and Lower Bounds for Sequential-Access Data Compression

New Algorithms and Lower Bounds for Sequential-Access Data Compression New Algorithms and Lower Bounds for Sequential-Access Data Compression Travis Gagie PhD Candidate Faculty of Technology Bielefeld University Germany July 2009 Gedruckt auf alterungsbeständigem Papier ISO

More information

An Algorithmic Framework for Compression and Text Indexing

An Algorithmic Framework for Compression and Text Indexing An Algorithmic Framework for Compression and Text Indexing Roberto Grossi Ankur Gupta Jeffrey Scott Vitter Abstract We present a unified algorithmic framework to obtain nearly optimal space bounds for

More information

Lecture 18 April 26, 2012

Lecture 18 April 26, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 18 April 26, 2012 1 Overview In the last lecture we introduced the concept of implicit, succinct, and compact data structures, and

More information

Theoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts

Theoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts Theoretical Computer Science 410 (2009) 4402 4413 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Dynamic rank/select structures with

More information

A simpler analysis of Burrows Wheeler-based compression

A simpler analysis of Burrows Wheeler-based compression Theoretical Computer Science 387 (2007) 220 235 www.elsevier.com/locate/tcs A simpler analysis of Burrows Wheeler-based compression Haim Kaplan, Shir Landau, Elad Verbin School of Computer Science, Tel

More information

Smaller and Faster Lempel-Ziv Indices

Smaller and Faster Lempel-Ziv Indices Smaller and Faster Lempel-Ziv Indices Diego Arroyuelo and Gonzalo Navarro Dept. of Computer Science, Universidad de Chile, Chile. {darroyue,gnavarro}@dcc.uchile.cl Abstract. Given a text T[1..u] over an

More information

A Simpler Analysis of Burrows-Wheeler Based Compression

A Simpler Analysis of Burrows-Wheeler Based Compression A Simpler Analysis of Burrows-Wheeler Based Compression Haim Kaplan School of Computer Science, Tel Aviv University, Tel Aviv, Israel; email: haimk@post.tau.ac.il Shir Landau School of Computer Science,

More information

arxiv: v1 [cs.ds] 15 Feb 2012

arxiv: v1 [cs.ds] 15 Feb 2012 Linear-Space Substring Range Counting over Polylogarithmic Alphabets Travis Gagie 1 and Pawe l Gawrychowski 2 1 Aalto University, Finland travis.gagie@aalto.fi 2 Max Planck Institute, Germany gawry@cs.uni.wroc.pl

More information

Space-Efficient Construction Algorithm for Circular Suffix Tree

Space-Efficient Construction Algorithm for Circular Suffix Tree Space-Efficient Construction Algorithm for Circular Suffix Tree Wing-Kai Hon, Tsung-Han Ku, Rahul Shah and Sharma Thankachan CPM2013 1 Outline Preliminaries and Motivation Circular Suffix Tree Our Indexes

More information

A Simple Alphabet-Independent FM-Index

A Simple Alphabet-Independent FM-Index A Simple Alphabet-Independent -Index Szymon Grabowski 1, Veli Mäkinen 2, Gonzalo Navarro 3, Alejandro Salinger 3 1 Computer Engineering Dept., Tech. Univ. of Lódź, Poland. e-mail: sgrabow@zly.kis.p.lodz.pl

More information

Stronger Lempel-Ziv Based Compressed Text Indexing

Stronger Lempel-Ziv Based Compressed Text Indexing Stronger Lempel-Ziv Based Compressed Text Indexing Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile.

More information

COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES

COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES DO HUY HOANG (B.C.S. (Hons), NUS) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE SCHOOL OF COMPUTING NATIONAL

More information

Complementary Contextual Models with FM-index for DNA Compression

Complementary Contextual Models with FM-index for DNA Compression 2017 Data Compression Conference Complementary Contextual Models with FM-index for DNA Compression Wenjing Fan,WenruiDai,YongLi, and Hongkai Xiong Department of Electronic Engineering Department of Biomedical

More information

Compressed Representations of Sequences and Full-Text Indexes

Compressed Representations of Sequences and Full-Text Indexes Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Università di Pisa GIOVANNI MANZINI Università del Piemonte Orientale VELI MÄKINEN University of Helsinki AND GONZALO NAVARRO

More information

Alphabet-Independent Compressed Text Indexing

Alphabet-Independent Compressed Text Indexing Alphabet-Independent Compressed Text Indexing DJAMAL BELAZZOUGUI Université Paris Diderot GONZALO NAVARRO University of Chile Self-indexes are able to represent a text within asymptotically the information-theoretic

More information

Practical Indexing of Repetitive Collections using Relative Lempel-Ziv

Practical Indexing of Repetitive Collections using Relative Lempel-Ziv Practical Indexing of Repetitive Collections using Relative Lempel-Ziv Gonzalo Navarro and Víctor Sepúlveda CeBiB Center for Biotechnology and Bioengineering, Chile Department of Computer Science, University

More information

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:

More information

Suffix Array of Alignment: A Practical Index for Similar Data

Suffix Array of Alignment: A Practical Index for Similar Data Suffix Array of Alignment: A Practical Index for Similar Data Joong Chae Na 1, Heejin Park 2, Sunho Lee 3, Minsung Hong 3, Thierry Lecroq 4, Laurent Mouchard 4, and Kunsoo Park 3, 1 Department of Computer

More information

Read Mapping. Burrows Wheeler Transform and Reference Based Assembly. Genomics: Lecture #5 WS 2014/2015

Read Mapping. Burrows Wheeler Transform and Reference Based Assembly. Genomics: Lecture #5 WS 2014/2015 Mapping Burrows Wheeler and Reference Based Assembly Institut für Medizinische Genetik und Humangenetik Charité Universitätsmedizin Berlin Genomics: Lecture #5 WS 2014/2015 Today Burrows Wheeler FM index

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Wing-Kai Hon Kunihiko Sadakane Wing-Kin Sung Abstract Suffix trees and suffix arrays are the most prominent full-text indices, and their

More information

arxiv: v1 [cs.ds] 19 Apr 2011

arxiv: v1 [cs.ds] 19 Apr 2011 Fixed Block Compression Boosting in FM-Indexes Juha Kärkkäinen 1 and Simon J. Puglisi 2 1 Department of Computer Science, University of Helsinki, Finland juha.karkkainen@cs.helsinki.fi 2 Department of

More information

Text Indexing: Lecture 6

Text Indexing: Lecture 6 Simon Gog gog@kit.edu - 0 Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Reviewing the last two lectures We have seen two top-k document retrieval frameworks. Question

More information

Compressed Index for Dynamic Text

Compressed Index for Dynamic Text Compressed Index for Dynamic Text Wing-Kai Hon Tak-Wah Lam Kunihiko Sadakane Wing-Kin Sung Siu-Ming Yiu Abstract This paper investigates how to index a text which is subject to updates. The best solution

More information

A Faster Grammar-Based Self-Index

A Faster Grammar-Based Self-Index A Faster Grammar-Based Self-Index Travis Gagie 1 Pawe l Gawrychowski 2 Juha Kärkkäinen 3 Yakov Nekrich 4 Simon Puglisi 5 Aalto University Max-Planck-Institute für Informatik University of Helsinki University

More information

arxiv: v1 [cs.ds] 25 Nov 2009

arxiv: v1 [cs.ds] 25 Nov 2009 Alphabet Partitioning for Compressed Rank/Select with Applications Jérémy Barbay 1, Travis Gagie 1, Gonzalo Navarro 1 and Yakov Nekrich 2 1 Department of Computer Science University of Chile {jbarbay,

More information

Define M to be a binary n by m matrix such that:

Define M to be a binary n by m matrix such that: The Shift-And Method Define M to be a binary n by m matrix such that: M(i,j) = iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = iff P[.. i] T[j-i+.. j]

More information

Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes

Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes Enno Ohlebusch, Simon Gog, and Adrian Kügel Institute of Theoretical Computer Science, University of Ulm, D-89069

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

Advanced Text Indexing Techniques. Johannes Fischer

Advanced Text Indexing Techniques. Johannes Fischer Advanced ext Indexing echniques Johannes Fischer SS 2009 1 Suffix rees, -Arrays and -rays 1.1 Recommended Reading Dan Gusfield: Algorithms on Strings, rees, and Sequences. 1997. ambridge University Press,

More information

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Matevž Jekovec University of Ljubljana Faculty of Computer and Information Science Oct 10, 2013 Text indexing problem

More information

Approximate String Matching with Lempel-Ziv Compressed Indexes

Approximate String Matching with Lempel-Ziv Compressed Indexes Approximate String Matching with Lempel-Ziv Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2 and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,

More information

String Searching with Ranking Constraints and Uncertainty

String Searching with Ranking Constraints and Uncertainty Louisiana State University LSU Digital Commons LSU Doctoral Dissertations Graduate School 2015 String Searching with Ranking Constraints and Uncertainty Sudip Biswas Louisiana State University and Agricultural

More information

Forbidden Patterns. {vmakinen leena.salmela

Forbidden Patterns. {vmakinen leena.salmela Forbidden Patterns Johannes Fischer 1,, Travis Gagie 2,, Tsvi Kopelowitz 3, Moshe Lewenstein 4, Veli Mäkinen 5,, Leena Salmela 5,, and Niko Välimäki 5, 1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu

More information

Internal Pattern Matching Queries in a Text and Applications

Internal Pattern Matching Queries in a Text and Applications Internal Pattern Matching Queries in a Text and Applications Tomasz Kociumaka Jakub Radoszewski Wojciech Rytter Tomasz Waleń Abstract We consider several types of internal queries: questions about subwords

More information

A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform

A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform A Four-Stage Algorithm for Updating a Burrows-Wheeler ransform M. Salson a,1,. Lecroq a, M. Léonard a, L. Mouchard a,b, a Université de Rouen, LIIS EA 4108, 76821 Mont Saint Aignan, France b Algorithm

More information

New Lower and Upper Bounds for Representing Sequences

New Lower and Upper Bounds for Representing Sequences New Lower and Upper Bounds for Representing Sequences Djamal Belazzougui 1 and Gonzalo Navarro 2 1 LIAFA, Univ. Paris Diderot - Paris 7, France. dbelaz@liafa.jussieu.fr 2 Department of Computer Science,

More information

Opportunistic Data Structures with Applications

Opportunistic Data Structures with Applications Opportunistic Data Structures with Applications Paolo Ferragina Giovanni Manzini Abstract There is an upsurging interest in designing succinct data structures for basic searching problems (see [23] and

More information

Succinct 2D Dictionary Matching with No Slowdown

Succinct 2D Dictionary Matching with No Slowdown Succinct 2D Dictionary Matching with No Slowdown Shoshana Neuburger and Dina Sokol City University of New York Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d

More information

Dynamic Entropy-Compressed Sequences and Full-Text Indexes

Dynamic Entropy-Compressed Sequences and Full-Text Indexes Dynamic Entropy-Compressed Sequences and Full-Text Indexes VELI MÄKINEN University of Helsinki and GONZALO NAVARRO University of Chile First author funded by the Academy of Finland under grant 108219.

More information

Linear-Space Alignment

Linear-Space Alignment Linear-Space Alignment Subsequences and Substrings Definition A string x is a substring of a string x, if x = ux v for some prefix string u and suffix string v (similarly, x = x i x j, for some 1 i j x

More information

Succinct Suffix Arrays based on Run-Length Encoding

Succinct Suffix Arrays based on Run-Length Encoding Succinct Suffix Arrays based on Run-Length Encoding Veli Mäkinen Gonzalo Navarro Abstract A succinct full-text self-index is a data structure built on a text T = t 1 t 2...t n, which takes little space

More information

Università degli studi di Udine

Università degli studi di Udine Università degli studi di Udine Computing LZ77 in Run-Compressed Space This is a pre print version of the following article: Original Computing LZ77 in Run-Compressed Space / Policriti, Alberto; Prezza,

More information

Burrows-Wheeler Transforms in Linear Time and Linear Bits

Burrows-Wheeler Transforms in Linear Time and Linear Bits Burrows-Wheeler Transforms in Linear Time and Linear Bits Russ Cox (following Hon, Sadakane, Sung, Farach, and others) 18.417 Final Project BWT in Linear Time and Linear Bits Three main parts to the result.

More information

Rank and Select Operations on Binary Strings (1974; Elias)

Rank and Select Operations on Binary Strings (1974; Elias) Rank and Select Operations on Binary Strings (1974; Elias) Naila Rahman, University of Leicester, www.cs.le.ac.uk/ nyr1 Rajeev Raman, University of Leicester, www.cs.le.ac.uk/ rraman entry editor: Paolo

More information

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT COMP9319 Web Data Compression and Search Lecture 2: daptive Huffman, BWT 1 Original readings Login to your cse account:! cd ~cs9319/papers! Original readings of each lecture will be placed there. 2 Course

More information

arxiv: v1 [cs.ds] 22 Nov 2012

arxiv: v1 [cs.ds] 22 Nov 2012 Faster Compact Top-k Document Retrieval Roberto Konow and Gonzalo Navarro Department of Computer Science, University of Chile {rkonow,gnavarro}@dcc.uchile.cl arxiv:1211.5353v1 [cs.ds] 22 Nov 2012 Abstract:

More information

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Roberto Grossi Dipartimento di Informatica Università di Pisa 56125 Pisa, Italy grossi@di.unipi.it Jeffrey

More information

Succinct Data Structures for Text and Information Retrieval

Succinct Data Structures for Text and Information Retrieval Succinct Data Structures for Text and Information Retrieval Simon Gog 1 Matthias Petri 2 1 Institute of Theoretical Informatics Karslruhe Insitute of Technology 2 Computing and Information Systems The

More information

Efficient Accessing and Searching in a Sequence of Numbers

Efficient Accessing and Searching in a Sequence of Numbers Regular Paper Journal of Computing Science and Engineering, Vol. 9, No. 1, March 2015, pp. 1-8 Efficient Accessing and Searching in a Sequence of Numbers Jungjoo Seo and Myoungji Han Department of Computer

More information

Least Random Suffix/Prefix Matches in Output-Sensitive Time

Least Random Suffix/Prefix Matches in Output-Sensitive Time Least Random Suffix/Prefix Matches in Output-Sensitive Time Niko Välimäki Department of Computer Science University of Helsinki nvalimak@cs.helsinki.fi 23rd Annual Symposium on Combinatorial Pattern Matching

More information

A Space-Efficient Frameworks for Top-k String Retrieval

A Space-Efficient Frameworks for Top-k String Retrieval A Space-Efficient Frameworks for Top-k String Retrieval Wing-Kai Hon, National Tsing Hua University Rahul Shah, Louisiana State University Sharma V. Thankachan, Louisiana State University Jeffrey Scott

More information

Fast Fully-Compressed Suffix Trees

Fast Fully-Compressed Suffix Trees Fast Fully-Compressed Suffix Trees Gonzalo Navarro Department of Computer Science University of Chile, Chile gnavarro@dcc.uchile.cl Luís M. S. Russo INESC-ID / Instituto Superior Técnico Technical University

More information

A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings

A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings arxiv:1607.08342v1 [cs.ds] 28 Jul 2016 Paola Bonizzoni Gianluca Della Vedova Serena Nicosia Marco Previtali Raffaella

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1 Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt

More information

Jumbled String Matching: Motivations, Variants, Algorithms

Jumbled String Matching: Motivations, Variants, Algorithms Jumbled String Matching: Motivations, Variants, Algorithms Zsuzsanna Lipták University of Verona (Italy) Workshop Combinatorial structures for sequence analysis in bioinformatics Milano-Bicocca, 27 Nov

More information

Approximate String Matching with Ziv-Lempel Compressed Indexes

Approximate String Matching with Ziv-Lempel Compressed Indexes Approximate String Matching with Ziv-Lempel Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2, and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,

More information

Reducing the Space Requirement of LZ-Index

Reducing the Space Requirement of LZ-Index Reducing the Space Requirement of LZ-Index Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile {darroyue, gnavarro}@dcc.uchile.cl 2 Dept. of

More information

Introduction to Sequence Alignment. Manpreet S. Katari

Introduction to Sequence Alignment. Manpreet S. Katari Introduction to Sequence Alignment Manpreet S. Katari 1 Outline 1. Global vs. local approaches to aligning sequences 1. Dot Plots 2. BLAST 1. Dynamic Programming 3. Hash Tables 1. BLAT 4. BWT (Burrow Wheeler

More information

On Pattern Matching With Swaps

On Pattern Matching With Swaps On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720

More information

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT COMP9319 Web Data Compression and Search Lecture 2: daptive Huffman, BWT 1 Original readings Login to your cse account: cd ~cs9319/papers Original readings of each lecture will be placed there. 2 Course

More information

On-line String Matching in Highly Similar DNA Sequences

On-line String Matching in Highly Similar DNA Sequences On-line String Matching in Highly Similar DNA Sequences Nadia Ben Nsira 1,2,ThierryLecroq 1,,MouradElloumi 2 1 LITIS EA 4108, Normastic FR3638, University of Rouen, France 2 LaTICE, University of Tunis

More information

The Burrows-Wheeler Transform: Theory and Practice

The Burrows-Wheeler Transform: Theory and Practice The Burrows-Wheeler Transform: Theory and Practice Giovanni Manzini 1,2 1 Dipartimento di Scienze e Tecnologie Avanzate, Università del Piemonte Orientale Amedeo Avogadro, I-15100 Alessandria, Italy. 2

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 2: Text Compression Lecture 7: Burrows Wheeler Compression Juha Kärkkäinen 21.11.2017 1 / 16 Burrows Wheeler Transform The Burrows Wheeler transform (BWT) is a transformation

More information

Compact Indexes for Flexible Top-k Retrieval

Compact Indexes for Flexible Top-k Retrieval Compact Indexes for Flexible Top-k Retrieval Simon Gog Matthias Petri Institute of Theoretical Informatics, Karlsruhe Institute of Technology Computing and Information Systems, The University of Melbourne

More information

Alternative Algorithms for Lyndon Factorization

Alternative Algorithms for Lyndon Factorization Alternative Algorithms for Lyndon Factorization Suhpal Singh Ghuman 1, Emanuele Giaquinta 2, and Jorma Tarhio 1 1 Department of Computer Science and Engineering, Aalto University P.O.B. 15400, FI-00076

More information

On Compressing and Indexing Repetitive Sequences

On Compressing and Indexing Repetitive Sequences On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv

More information

A Faster Grammar-Based Self-index

A Faster Grammar-Based Self-index A Faster Grammar-Based Self-index Travis Gagie 1,Pawe l Gawrychowski 2,, Juha Kärkkäinen 3, Yakov Nekrich 4, and Simon J. Puglisi 5 1 Aalto University, Finland 2 University of Wroc law, Poland 3 University

More information

String Indexing for Patterns with Wildcards

String Indexing for Patterns with Wildcards MASTER S THESIS String Indexing for Patterns with Wildcards Hjalte Wedel Vildhøj and Søren Vind Technical University of Denmark August 8, 2011 Abstract We consider the problem of indexing a string t of

More information

Self-Indexed Grammar-Based Compression

Self-Indexed Grammar-Based Compression Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca

More information

Text Indexing, Suffix Sorting & Data Compression: Common Problems and Techniques

Text Indexing, Suffix Sorting & Data Compression: Common Problems and Techniques Text Indexing, uffix orting & Data Compression: Common Problems and Techniques Roberto Grossi Dipartimento di Informatica Università di Pisa Roadmap pace efficiency issues hort stories: suffix array permutations

More information

Succinct Data Structures for NLP-at-Scale

Succinct Data Structures for NLP-at-Scale Succinct Data Structures for NLP-at-Scale Matthias Petri Trevor Cohn Computing and Information Systems The University of Melbourne, Australia first.last@unimelb.edu.au November 20, 2016 Who are we? Trevor

More information

Suffix Sorting Algorithms

Suffix Sorting Algorithms Suffix Sorting Algorithms Timo Bingmann Text-Indexierung Vorlesung 2016-12-01 INSTITUTE OF THEORETICAL INFORMATICS ALGORITHMICS KIT University of the State of Baden-Wuerttemberg and National Research Center

More information

Streaming and communication complexity of Hamming distance

Streaming and communication complexity of Hamming distance Streaming and communication complexity of Hamming distance Tatiana Starikovskaya IRIF, Université Paris-Diderot (Joint work with Raphaël Clifford, ICALP 16) Approximate pattern matching Problem Pattern

More information

An External-Memory Algorithm for String Graph Construction

An External-Memory Algorithm for String Graph Construction An External-Memory Algorithm for String Graph Construction Paola Bonizzoni Gianluca Della Vedova Yuri Pirola Marco Previtali Raffaella Rizzi DISCo, Univ. Milano-Bicocca, Milan, Italy arxiv:1405.7520v2

More information

arxiv: v1 [cs.ds] 8 Sep 2018

arxiv: v1 [cs.ds] 8 Sep 2018 Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space Travis Gagie 1,2, Gonzalo Navarro 2,3, and Nicola Prezza 4 1 EIT, Diego Portales University, Chile 2 Center for Biotechnology

More information

Optimal-Time Text Indexing in BWT-runs Bounded Space

Optimal-Time Text Indexing in BWT-runs Bounded Space Optimal-Time Text Indexing in BWT-runs Bounded Space Travis Gagie Gonzalo Navarro Nicola Prezza Abstract Indexing highly repetitive texts such as genomic databases, software repositories and versioned

More information

Optimal Dynamic Sequence Representations

Optimal Dynamic Sequence Representations Optimal Dynamic Sequence Representations Gonzalo Navarro Yakov Nekrich Abstract We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on

More information

Simple Compression Code Supporting Random Access and Fast String Matching

Simple Compression Code Supporting Random Access and Fast String Matching Simple Compression Code Supporting Random Access and Fast String Matching Kimmo Fredriksson and Fedor Nikitin Department of Computer Science and Statistics, University of Joensuu PO Box 111, FIN 80101

More information

Converting SLP to LZ78 in almost Linear Time

Converting SLP to LZ78 in almost Linear Time CPM 2013 Converting SLP to LZ78 in almost Linear Time Hideo Bannai 1, Paweł Gawrychowski 2, Shunsuke Inenaga 1, Masayuki Takeda 1 1. Kyushu University 2. Max-Planck-Institut für Informatik Recompress SLP

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift

More information

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b)

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b) Introduction to Algorithms October 14, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Srini Devadas and Constantinos (Costis) Daskalakis Quiz 1 Solutions Quiz 1 Solutions Problem

More information

LZ77-like Compression with Fast Random Access

LZ77-like Compression with Fast Random Access -like Compression with Fast Random Access Sebastian Kreft and Gonzalo Navarro Dept. of Computer Science, University of Chile, Santiago, Chile {skreft,gnavarro}@dcc.uchile.cl Abstract We introduce an alternative

More information

arxiv: v3 [cs.ds] 6 Sep 2018

arxiv: v3 [cs.ds] 6 Sep 2018 Universal Compressed Text Indexing 1 Gonzalo Navarro 2 arxiv:1803.09520v3 [cs.ds] 6 Sep 2018 Abstract Center for Biotechnology and Bioengineering (CeBiB), Department of Computer Science, University of

More information

arxiv: v2 [cs.ds] 5 Mar 2014

arxiv: v2 [cs.ds] 5 Mar 2014 Order-preserving pattern matching with k mismatches Pawe l Gawrychowski 1 and Przemys law Uznański 2 1 Max-Planck-Institut für Informatik, Saarbrücken, Germany 2 LIF, CNRS and Aix-Marseille Université,

More information

Algorithms in Computational. Biology. More on BWT

Algorithms in Computational. Biology. More on BWT Algorithms in Computtionl Biology More on BWT tody Plese Lst clss! don't forget to submit And by next (vi emil, repo ) implementtion week or shre prgectfltw get Not I would like reding overview! Discuss

More information

Self-Indexed Grammar-Based Compression

Self-Indexed Grammar-Based Compression Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca

More information

2. Exact String Matching

2. Exact String Matching 2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.

More information

Tighter Bounds for the Sum of Irreducible LCP Values

Tighter Bounds for the Sum of Irreducible LCP Values Tighter Bounds for the Sum of Irreducible LCP Values Juha Kärkkäinen 1 Dominik Kempa 1 Marcin Piątkowski 2 1 University of Helsinki 2 Nicolaus Copernicus University Outline 1 Cyclic words 2 Irreducible

More information

Computing Longest Common Substrings Using Suffix Arrays

Computing Longest Common Substrings Using Suffix Arrays Computing Longest Common Substrings Using Suffix Arrays Maxim A. Babenko, Tatiana A. Starikovskaya Moscow State University Computer Science in Russia, 2008 Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing

More information

CRAM: Compressed Random Access Memory

CRAM: Compressed Random Access Memory CRAM: Compressed Random Access Memory Jesper Jansson 1, Kunihiko Sadakane 2, and Wing-Kin Sung 3 1 Laboratory of Mathematical Bioinformatics, Institute for Chemical Research, Kyoto University, Gokasho,

More information

The Number of Spanning Trees in a Graph

The Number of Spanning Trees in a Graph Course Trees The Ubiquitous Structure in Computer Science and Mathematics, JASS 08 The Number of Spanning Trees in a Graph Konstantin Pieper Fakultät für Mathematik TU München April 28, 2008 Konstantin

More information

Indexing a Dictionary for Subset Matching Queries

Indexing a Dictionary for Subset Matching Queries Indexing a Dictionary for Subset Matching Queries Gad M. Landau 1,2, Dekel Tsur 3, and Oren Weimann 4 1 Department of Computer Science, University of Haifa, Haifa - Israel. 2 Department of Computer and

More information

PATTERN MATCHING WITH SWAPS IN PRACTICE

PATTERN MATCHING WITH SWAPS IN PRACTICE International Journal of Foundations of Computer Science c World Scientific Publishing Company PATTERN MATCHING WITH SWAPS IN PRACTICE MATTEO CAMPANELLI Università di Catania, Scuola Superiore di Catania

More information

arxiv: v1 [cs.ds] 20 Dec 2017

arxiv: v1 [cs.ds] 20 Dec 2017 Text Indexing and Searching in Sublinear Time J. Ian Munro Gonzalo Navarro Yakov Nekrich arxiv:1712.07431v1 [cs.ds] 20 Dec 2017 Abstract We introduce the first index that can be built in o(n) time for

More information

Bloom Filters, Minhashes, and Other Random Stuff

Bloom Filters, Minhashes, and Other Random Stuff Bloom Filters, Minhashes, and Other Random Stuff Brian Brubach University of Maryland, College Park StringBio 2018, University of Central Florida What? Probabilistic Space-efficient Fast Not exact Why?

More information