Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

Similar documents
Run-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE

Lecture 4 : Adaptive source coding algorithms

Data Compression Techniques

2. Exact String Matching

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

Three new strategies for exact string matching

Lecture 1 : Data Compression and Entropy

PATTERN MATCHING WITH SWAPS IN PRACTICE

Huffman Coding. C.M. Liu Perceptual Lab, College of Computer Science National Chiao-Tung University

UNIT I INFORMATION THEORY. I k log 2

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Information Theory and Distribution Modeling

Module 9: Tries and String Matching

Lecture 16. Error-free variable length schemes (contd.): Shannon-Fano-Elias code, Huffman code

On-line String Matching in Highly Similar DNA Sequences

4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd

Multimedia. Multimedia Data Compression (Lossless Compression Algorithms)

Bandwidth: Communicate large complex & highly detailed 3D models through low-bandwidth connection (e.g. VRML over the Internet)

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding

17.1 Binary Codes Normal numbers we use are in base 10, which are called decimal numbers. Each digit can be 10 possible numbers: 0, 1, 2, ..., 9.

(c) Epstein, Carter and Bollinger Chapter 17: Information Science Page 1

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

SUPPLEMENTARY INFORMATION

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Introduction to information theory and coding

Overview. Knuth-Morris-Pratt & Boyer-Moore Algorithms. Notation Review (2) Notation Review (1) The Knuth-Morris-Pratt (KMP) Algorithm

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

On Pattern Matching With Swaps

Using an innovative coding algorithm for data encryption

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak

2018/5/3. YU Xiangyu

Slides for CIS 675. Huffman Encoding, 1. Huffman Encoding, 2. Huffman Encoding, 3. Encoding 1. DPV Chapter 5, Part 2. Encoding 2

Text Compression. Jayadev Misra The University of Texas at Austin December 5, A Very Incomplete Introduction to Information Theory 2

CSE 421 Greedy: Huffman Codes

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have

Entropy Coding. Connectivity coding. Entropy coding. Definitions. Lossles coder. Input: a set of symbols Output: bitstream. Idea

Static Huffman. Wrong probabilities. Adaptive Huffman. Canonical Huffman Trees. Algorithm for canonical trees. Example of a canonical tree

Data Compression Techniques

21. Dynamic Programming III. FPTAS [Ottman/Widmayer, Kap. 7.2, 7.3, Cormen et al, Kap. 15,35.5]

A Unifying Framework for Compressed Pattern Matching

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

3F1 Information Theory, Lecture 3

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

CMPT 365 Multimedia Systems. Lossless Compression

Compressed Index for Dynamic Text

BASIC COMPRESSION TECHNIQUES

CS4800: Algorithms & Data Jonathan Ullman

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

String Search. 6th September 2018

Converting SLP to LZ78 in almost Linear Time

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding

Greedy. Outline CS141. Stefano Lonardi, UCR 1. Activity selection Fractional knapsack Huffman encoding Later:

Shannon-Fano-Elias coding

Summary of Last Lectures

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Data Compression Techniques (Spring 2012) Model Solutions for Exercise 2

3 Greedy Algorithms. 3.1 An activity-selection problem

Exercises with solutions (Set B)

Chapter 3 Source Coding. 3.1 An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code

Autumn Coping with NP-completeness (Conclusion) Introduction to Data Compression

INF 4130 / /8-2017

Chapter 2 Data Compression: Source Coding. 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for

3F1 Information Theory, Lecture 3

A Mathematical Theory of Communication

Bloom Filters, general theory and variants

arxiv: v1 [cs.fl] 29 Jun 2013

COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT

Chapter 5 Arrays and Strings 5.1 Arrays as abstract data types 5.2 Contiguous representations of arrays 5.3 Sparse arrays 5.4 Representations of

Algorithm Design and Analysis

Chapter 2: Source coding

Information Theory, Statistics, and Decision Trees

On Universal Types. Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA. University of Minnesota, September 14, 2004

On Boyer-Moore Preprocessing

Theory of Computation

Coding of memoryless sources 1/35

INF 4130 / /8-2014

Computing the Entropy of a Stream

Lec 05 Arithmetic Coding

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003

CSCI 2570 Introduction to Nanocomputing

String Matching II. Algorithm : Design & Analysis [19]

Entropy as a measure of surprise

Transducers for bidirectional decoding of prefix codes

10-704: Information Processing and Learning Fall Lecture 10: Oct 3

Image and Multidimensional Signal Processing

Alternative Algorithms for Lyndon Factorization

Streaming and communication complexity of Hamming distance

String Range Matching

Implementing Approximate Regularities

Asymptotic redundancy and prolixity

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b)

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Simple Compression Code Supporting Random Access and Fast String Matching

An Approximation Algorithm for Constructing Error Detecting Prefix Codes

Midterm 2 for CS 170

Implementation of Lossless Huffman Coding: Image compression using K-Means algorithm and comparison vs. Random numbers and Message source

Lecture 18 April 26, 2012

Transcription:

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts
Domenico Cantone, Simone Faro, Emanuele Giaquinta
Department of Mathematics and Computer Science, University of Catania, Italy

String matching in compressed texts

The compressed matching problem (Amir & Benson, 1992). Given:
- an alphabet Σ
- a pattern P
- a compression system (E, D)
- an encoded text E(T)
find all the shifts of P in T using E(P) and E(T).

String matching in compressed texts

A static compression method is characterized by a system (E, D) of two complementary functions:

E : Σ → {0, 1}⁺
E(ε) = ε
E(T[1 .. l]) = E(T[1 .. l−1]) . E(T[l]), for all l with 1 ≤ l ≤ |T|
D(E(c)) = c, for all c ∈ Σ

Prefix property: for all distinct c₁, c₂ ∈ Σ, E(c₁) is not a prefix of E(c₂). Canonical Huffman coding is such a method.
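As a concrete illustration of the encoder E, here is a minimal C sketch using the toy code of the next slide; the codeword table and the helper names (codeword, encode) are our own, not the authors' code.

#include <stdio.h>
#include <string.h>

/* Toy prefix code used in the example slide below. */
static const char *codeword(char c) {
    switch (c) {
    case 't': return "00";   case 'e': return "01";
    case 'w': return "100";  case 'a': return "101";
    case 'n': return "110";  case 'y': return "1110";
    case 'b': return "1111"; default:  return "";
    }
}

/* E: encode a text by codeword concatenation, E(T[1..l]) = E(T[1..l-1]).E(T[l]). */
static void encode(const char *text, char *out) {
    out[0] = '\0';
    for (; *text; ++text)
        strcat(out, codeword(*text));
}

int main(void) {
    char buf[128];
    encode("tentwenty", buf);
    printf("E(tentwenty) = %s\n", buf);  /* 00011100010001110001110 */
    return 0;
}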

String matching in compressed texts

t : 00   e : 01   w : 100   a : 101   n : 110   y : 1110   b : 1111

E(ten twenty) = 00 01 110 00 100 01 110 00 1110

Searching for ten, E(ten) = 0001110 occurs three times in the 23-bit encoded text: at bit 0 (the real occurrence of ten) and at bits 10 and 16, which fall inside the codewords of w and n respectively.

Problem: false positives, occurrences of E(P) in E(T) which do not correspond to occurrences of P in T. An occurrence of E(P) which does not start on a codeword boundary is a false positive.
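The false positives above can be checked mechanically. The following C fragment (our own demo, reusing the 0/1 strings produced by the encoder sketched earlier) finds all three occurrences; only bit 0 lies on one of the codeword boundaries 0, 2, 4, 7, 9, 12, 14, 17, 19.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *etext = "00011100010001110001110";  /* E(tentwenty) */
    const char *epat  = "0001110";                  /* E(ten) */
    const char *p = etext;
    while ((p = strstr(p, epat)) != NULL) {
        /* prints bits 0, 10 and 16; 10 and 16 are false positives */
        printf("E(P) found at bit %ld\n", (long)(p - etext));
        ++p;
    }
    return 0;
}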

minimum redundancy codes

A binary prefix code can be represented by an ordered binary tree whose leaves are labeled with the characters of Σ and whose edges are labeled by 0 (left) and 1 (right). The codeword of a given character is the word labeling the path from the root to the leaf labeled by that character.
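A possible C representation of such a code tree, with a small decoding demo on the running example; the struct and function names are our own sketch, not the authors' data structures.

#include <stdio.h>
#include <stdlib.h>

/* Ordered binary code tree: edge 0 = left child, edge 1 = right child. */
typedef struct Node {
    struct Node *child[2];
    int symbol;                       /* -1 on internal nodes */
} Node;

static Node *new_node(void) {
    Node *x = calloc(1, sizeof *x);
    x->symbol = -1;
    return x;
}

/* Insert the codeword (a 0/1 string) of character c; the prefix property
   guarantees that no codeword path passes through another leaf. */
static void insert(Node *root, const char *code, int c) {
    Node *x = root;
    for (; *code; ++code) {
        int b = *code - '0';
        if (x->child[b] == NULL)
            x->child[b] = new_node();
        x = x->child[b];
    }
    x->symbol = c;
}

int main(void) {
    static const struct { const char *code; int c; } table[] = {
        {"00",'t'}, {"01",'e'}, {"100",'w'}, {"101",'a'},
        {"110",'n'}, {"1110",'y'}, {"1111",'b'},
    };
    Node *root = new_node();
    for (size_t i = 0; i < sizeof table / sizeof *table; ++i)
        insert(root, table[i].code, table[i].c);

    /* D: decode bit by bit, emitting a character at each leaf. */
    const char *bits = "00011100010001110001110";
    Node *x = root;
    for (; *bits; ++bits) {
        x = x->child[*bits - '0'];
        if (x->symbol != -1) { putchar(x->symbol); x = root; }
    }
    putchar('\n');                    /* prints: tentwenty */
    return 0;
}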

skeleton tree

The minimal prefix of a codeword which allows one to unambiguously determine the codeword's length corresponds to the minimum-depth node, on the path from the root induced by the codeword, which is the root of a complete subtree. The skeleton tree (Klein, 2000) is a pruned canonical Huffman tree whose leaves correspond to the minimum-depth nodes of the Huffman tree which are roots of complete subtrees. It is useful to store at each leaf of the skeleton tree the common length of the codeword(s) sharing the prefix which labels the path from the root to that leaf.

skeleton tree

b : 00   i : 01   d : 1000   t : 1001   a : 1010   r : 1011   l : 1100   c : 1101   g : 11100   k : 11101   u : 11110   e : 11111

For this code the skeleton tree has four leaves: prefix 0 determines codeword length 2 (b, i), prefix 10 determines length 4 (d, t, a, r), prefix 110 determines length 4 (l, c), and prefix 111 determines length 5 (g, k, u, e).

Previous work

sk-kmp (Daptardar & Shapira, 2006): a modified KMP in which prefix alignments always respect codeword boundaries. Decoding uses the skeleton tree: if no prefix of the pattern matches the border of the current window, the algorithm can skip the remaining bits of the current codeword.

skeleton tree verification

Filtering/verification paradigm:
- The filtering phase searches for the occurrences of E(P) in E(T)
- The verification phase uses an algorithm based on the skeleton tree to verify candidate matches

skeleton tree verification

Suppose that we have found a candidate shift s; we need to know whether s is codeword aligned. We maintain an offset ρ pointing to the start of the last window where a verification has taken place. Using the skeleton tree, we update ρ to the minimal codeword-aligned position ρ ≥ s. If ρ = s, then s is a valid shift.

skeleton tree verification

Sk-Align(root, t, ρ, s)
  x ← root, l ← 0
  while true do
    B ← B_t[⌊ρ/k⌋] ≪ (ρ mod k)
    if B < 2^(k−1) then x ← Left(x) else x ← Right(x)
    if Key(x) ≠ 0 then
      ρ ← ρ + Key(x) − l, l ← 0, x ← root
      if ρ ≥ s then break
    else ρ ← ρ + 1, l ← l + 1
  return ρ
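A C rendering of Sk-Align for byte blocks (k = 8); this is our own sketch, with the bit extraction written directly instead of the shift-and-compare of the pseudocode, and with SkNode/key standing in for the slide's Key(x).

#include <stdint.h>

/* Skeleton-tree node: key != 0 at leaves, holding the codeword length. */
typedef struct SkNode {
    struct SkNode *left, *right;
    int key;
} SkNode;

/* Advance rho to the smallest codeword-aligned position >= s;
   bt is the encoded text blocked into bytes (k = 8). */
static long sk_align(const SkNode *root, const uint8_t *bt, long rho, long s) {
    const SkNode *x = root;
    long l = 0;                          /* bits consumed in the current codeword */
    for (;;) {
        /* bit (rho mod 8) of block rho/8, most significant bit first */
        int bit = (bt[rho >> 3] >> (7 - (rho & 7))) & 1;
        x = bit ? x->right : x->left;
        if (x->key != 0) {               /* leaf: codeword length is now known */
            rho += x->key - l;           /* jump to the next codeword boundary */
            l = 0;
            x = root;
            if (rho >= s) break;
        } else {
            ++rho;
            ++l;
        }
    }
    return rho;
}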

skeleton tree verification

t : 00   e : 01   w : 100   a : 101   n : 110   y : 1110   b : 1111

E(ten twenty) = 00 01 110 00 100 01 110 00 1110

Candidate shift j = 3: Sk-Align(root, t, 0, 3) = 5 ≠ 3, so the candidate is rejected.
Candidate shift j = 9: Sk-Align(root, t, 5, 9) = 10 ≠ 9, so the candidate is rejected.

skeleton tree verification

Pros:
- Lazy decoding: in the best case (the pattern does not occur) no decoding is performed
- For every codeword the algorithm reads only the minimum number of bits necessary to infer its length

Cons:
- The number of bits processed depends on the positions of candidate matches

skeleton tree verification

Huffman-Matcher(P, m, T, n)
  Precompute-Globals(P)
  n ← |T|, m ← |P|
  s ← 0, ρ ← 0
  root ← Build-Sk-Tree(φ)
  for i ← 0 to n − 1 do
    j ← Check-Shift(s, P, T)
    if j = s + m − 1 then
      ρ ← Sk-Align(root, T, ρ, j)
      if ρ = j then Print(j)
    s ← s + Shift-Increment(s, P, T, j)
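The driver is generic: Check-Shift and Shift-Increment come from whichever Boyer-Moore-like matcher performs the filtering. Below is a compact C sketch of the same filter-then-verify loop, using plain strstr on 0/1 strings as a stand-in filter; is_aligned is a stub for the Sk-Align-based check shown earlier, and all names here are ours.

#include <stdio.h>
#include <string.h>

/* Stub for the verification step, e.g. sk_align(root, bt, rho, shift) == shift. */
static int is_aligned(long shift) { (void)shift; return 1; }

/* Filter with any exact matcher (here strstr), then verify codeword alignment. */
static void huffman_match(const char *etext, const char *epat) {
    const char *p = etext;
    while ((p = strstr(p, epat)) != NULL) {
        long shift = p - etext;
        if (is_aligned(shift))
            printf("occurrence of P at bit %ld\n", shift);
        ++p;
    }
}

int main(void) {
    huffman_match("00011100010001110001110", "0001110");
    return 0;
}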

string matching in binary strings

Adaptation of existing algorithms for binary strings:
- BINARY-HASH-MATCHING (Lecroq & Faro, 2009)
- FED (J. Kim & E. Kim & Park, 2007)

string matching in binary strings

Scanning T with bit granularity is slow, so the text is processed in blocks:
T[0, ..., n−1] → B_T[0, ..., ⌈n/k⌉−1], where B_T[i] is a block of k bits.
An occurrence of P can start at bit 1, 2, ..., k of a block, so k shifted copies of the pattern are precomputed:
P_i[0, ..., m+i−1] → Patt_i[0, ..., ⌈(m+i)/k⌉−1]

Example (k = 8): E(ent) = 10001110 occurs inside E(twenty) = 00100011-10001110 at a 2-bit offset; the corresponding shifted copy is P_2 = 00100011-10000000.

string matching in binary strings

E(P) = 110010110010110010110

(A) Patt
i = 0: 11001011 00101100 10110000
i = 1: 01100101 10010110 01011000
i = 2: 00110010 11001011 00101100
i = 3: 00011001 01100101 10010110
i = 4: 00001100 10110010 11001011 00000000
i = 5: 00000110 01011001 01100101 10000000
i = 6: 00000011 00101100 10110010 11000000
i = 7: 00000001 10010110 01011001 01100000

(B) Mask
i = 0: 11111111 11111111 11111000
i = 1: 01111111 11111111 11111100
i = 2: 00111111 11111111 11111110
i = 3: 00011111 11111111 11111111
i = 4: 00001111 11111111 11111111 10000000
i = 5: 00000111 11111111 11111111 11000000
i = 6: 00000011 11111111 11111111 11100000
i = 7: 00000001 11111111 11111111 11110000

string matching in binary strings

The pattern E(P) is aligned with the s-th bit of the text if

Patt_i[h] = B_T[j + h] & Mask_i[h], for h = 0, 1, ..., m_i − 1

where j = ⌊s/k⌋, i = s mod k, and m_i = ⌈(m + i)/k⌉.
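A C sketch of the table construction and of the alignment test above, for k = 8; the array names, the MAXB bound, and the 0/1-byte representation of the encoded pattern are our own choices, not the paper's code.

#include <stdint.h>
#include <string.h>

#define K 8                       /* block size in bits */
#define MAXB 64                   /* max blocks per shifted pattern (illustrative bound) */

/* Build Patt_i and Mask_i for i = 0..K-1 from E(P) given as 0/1 bytes ebits[0..m-1]. */
static void build_tables(const uint8_t *ebits, int m,
                         uint8_t patt[K][MAXB], uint8_t mask[K][MAXB], int mi[K]) {
    for (int i = 0; i < K; ++i) {
        mi[i] = (m + i + K - 1) / K;                /* m_i = ceil((m+i)/k) */
        memset(patt[i], 0, MAXB);
        memset(mask[i], 0, MAXB);
        for (int b = 0; b < m; ++b) {
            int pos = b + i;                        /* bit position after shifting by i */
            if (ebits[b])
                patt[i][pos / K] |= 0x80 >> (pos % K);
            mask[i][pos / K] |= 0x80 >> (pos % K);  /* 1 exactly on pattern bits */
        }
    }
}

/* Test whether E(P) is aligned with bit s of the blocked text B_T. */
static int aligned_at(const uint8_t *bt, long s,
                      uint8_t patt[K][MAXB], uint8_t mask[K][MAXB], const int mi[K]) {
    long j = s / K;
    int i = (int)(s % K);
    for (int h = 0; h < mi[i]; ++h)
        if (patt[i][h] != (bt[j + h] & mask[i][h]))
            return 0;
    return 1;
}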

string matching in binary strings

The bad character rule does not work well on binary alphabets: bc(c) ≤ 1 for all c ∈ Σ. Remedy: a super-alphabet whose size grows from 2 to 2^k; P is logically divided into its m − k + 1 overlapping k-grams.

BINARY-HASH-MATCHING

Shift with bit granularity:

Hs(B) = min({0 ≤ u < m : P[m − u − k .. m − u − 1] = B} ∪ {m}), for 0 ≤ B < 2^k

If Hs[B] = 0, the alignment (s − m + 1) mod k is checked, where s is the current position in T in bits.
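A minimal C sketch of the table construction, assuming Hs(B) records the distance u from the end of P of the rightmost k-gram equal to B (or m when B never occurs); the names and the 0/1-byte representation are illustrative.

#include <stdint.h>

/* Build Hs for BINARY-HASH-MATCHING: hs has 2^k entries, ebits[0..m-1] holds
   E(P) as 0/1 bytes. */
static void build_hs(const uint8_t *ebits, int m, int k, int *hs) {
    for (int B = 0; B < (1 << k); ++B)
        hs[B] = m;                       /* default: k-gram absent from P */
    for (int u = m - k; u >= 0; --u) {   /* decreasing u: the last write is the min */
        int B = 0;
        for (int b = 0; b < k; ++b)      /* pack P[m-u-k .. m-u-1] into an integer */
            B = (B << 1) | ebits[m - u - k + b];
        hs[B] = u;
    }
}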

FED

Shift with byte granularity:

δ_i(c) = min({⌊m_i/2⌋ + 1} ∪ {⌊m_i/2⌋ + 1 − h : Patt_i[h] = c and 1 ≤ h ≤ ⌊m_i/2⌋})
δ(c) = min{δ_i(c) : 0 ≤ i < k}

A hash table λ is used to find candidate alignments; each entry in the table points to a linked list of patterns. Pattern Patt_i is inserted into the slot with hash value Patt_i[⌊m_i/2⌋].
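A C sketch of the δ table under the definition above; the floors, the array shapes, and the 256-entry byte alphabet are our assumptions, not the authors' code.

#include <stdint.h>
#include <limits.h>

/* delta[c] = min over i of floor(m_i/2) + 1 - h for the largest h <= floor(m_i/2)
   with Patt_i[h] = c, or floor(m_i/2) + 1 when c does not occur in that range. */
static void build_delta(uint8_t patt[8][64], const int mi[8], int k, int delta[256]) {
    for (int c = 0; c < 256; ++c)
        delta[c] = INT_MAX;
    for (int i = 0; i < k; ++i) {
        int half = mi[i] / 2;
        for (int c = 0; c < 256; ++c) {
            int d = half + 1;                /* default when c does not occur */
            for (int h = 1; h <= half; ++h)
                if (patt[i][h] == (uint8_t)c)
                    d = half + 1 - h;        /* larger h overwrites, leaving the min */
            if (d < delta[c])
                delta[c] = d;
        }
    }
}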

Complexity

O(⌈m/k⌉ n) worst-case time complexity
O(m + 2^k) space complexity
k = 8 works well for a single pattern, and the space overhead is negligible

Experimental results

- Implementation in C, compiled with gcc 4, run on a PowerPC G4 at 1.5 GHz
- Natural language texts
- Sets of 100 patterns of fixed length m ∈ {4, 8, 16, 32, 64, 128, 256}, randomly extracted from the text
- Comparison between the following algorithms: Huffman-Kmp, Huffman-Hash-Matching, Huffman-Fed, and Decompress-and-Search with the 3-Hash algorithm

Experimental results - running times

[Plot: running time as a function of the pattern length m (4 to 256) for HKMP, HHM, HFED, and Decompress & Search.]

Experimental results - processed bits

[Plot: fraction of text bits processed as a function of the pattern length m (4 to 256) for HKMP, HHM, and HFED.]

Conclusions

- Our generic algorithm can skip many bits when decoding
- It is sublinear on average when combined with Boyer-Moore-like algorithms
- It is fast, especially when the pattern frequency is low