Data Compression Techniques
Part 2: Text Compression
Lecture 7: Burrows-Wheeler Compression
Juha Kärkkäinen
21.11.2017

1 / 16

Burrows-Wheeler Transform

The Burrows-Wheeler transform (BWT) is a transformation of the text that makes it easier to compress. It can be defined as follows:

Let T[0..n) be a circular text. For any i ∈ [0..n), T[i..n) T[0..i) is a rotation of T. Let M be the n × n matrix, where the rows are all the rotations of T in lexicographical order.

All columns of M are permutations of T. In particular:
- The first column F contains the text symbols in order.
- The last column L is the BWT of T.

The BWT of T = banana$ is L = annb$aa.

    F           L
    $ b a n a n a
    a $ b a n a n
    a n a $ b a n
    a n a n a $ b
    b a n a n a $
    n a $ b a n a
    n a n a $ b a

2 / 16
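For concreteness, here is a small Python sketch of this definition: the BWT computed by explicitly sorting all rotations (naive and quadratic, but fine for small examples). It relies on '$' sorting before the letters in Python's default string order, matching the convention used here.

    # BWT by sorting rotations; a direct rendering of the definition above.
    def bwt_naive(t):
        # All rotations T[i..n) T[0..i), in lexicographical order.
        rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
        # L is the last column of the matrix M.
        return "".join(row[-1] for row in rotations)

    print(bwt_naive("banana$"))  # annb$aa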

Surprisingly, the BWT is invertible, i.e., T can be reconstructed from the BWT L. We will later see how.

The compressibility of the BWT is based on sorting by context:
- Consider a symbol s that is followed by a string w in the text. (Due to circularity, the last symbol T[n − 1] is followed by the first symbol T[0].) The string w is called a right context of s.
- In the matrix M, there is a row beginning with w and ending with s. Because the rows are sorted, all symbols with the right context w appear consecutively in the BWT. This part of the BWT sharing the context w is called a w-context block and is denoted by L_w.

Context blocks are often highly compressible, as they consist of symbols occurring in the same context. In our example, L_n = L_na = aa.

3 / 16

Here we have right contexts, while earlier with higher order compression we considered left contexts. This makes no essential difference. Furthermore, the text T is often reversed into T^R before computing the BWT, which turns left contexts into right contexts and vice versa.

The context block L_ht for a reversed English text, which corresponds to the left context th in the unreversed text:

    oreeereoeeieeeeaooeeeeeaereeeeeeeeeeeeereeeeeeeeeeaaeeaeeeeeeee
    eaeeeeeeeeaeieeeeeeeeereeeeeeeeeeeeeeeeeeeeeeeeaeeieeeeeeaaieee
    eeeeeeeeeeeeeeeeeeeeeeeeeeeeeaeieeeeeeeeeeeeeeeeeeeeeeeeeeeeaee
    eeeeeeeeeeeeeeeeeeereeeeeeeeeeeieaeeeeieeeeaeeeeeeeeeieeeeeeeee
    eeeieeeeeeeeioaaeeaoereeeeeeeeeeaaeaaeeeeieeeeeeeieeeeeeeeaeeee
    eeaeeeeeereeeaeeeeeieeeeeeeeiieee. e eeeeiiiiii e, i o oo e eiiiiee,er,,,. iii

4 / 16

Some of those symbols in (unreversed) context:

    t raise themselves, and th e hunter, thankful and r
    ery night it flew round th e glass mountain keeping
    agon, but as soon as he th r ew an apple at it the b
    f animals, were resting th e mselves. "Halloa, comr
    ple below to life. All th o se who have perished on
    that the czar gave him th e beautiful Princess Mil
    ng of guns was heard in th e distance. The czar an
    cked magician put me in th i s jar, sealed it with t
    o acted as messenger in th e golden castle flew pas
    u have only to say, Go th e re, I know not where; b

5 / 16

For any k, the BWT L is the concatenation of all the length-k-context blocks in lexicographical order of the contexts:

    L = \prod_{w \in \Sigma^k} L_w    (the product denotes concatenation)

    k = 1:  L = L_$ · L_a · L_b · L_n = a · nnb · $ · aa
    k = 2:  L = L_{$b} · L_{a$} · L_{an} · L_{ba} · L_{na} = a · n · nb · $ · aa

As mentioned, the context blocks are often highly compressible, even using zeroth order compression. In fact, by compressing all length-k-context blocks separately to zeroth order, we achieve kth order compression of the text. This is known as compression boosting.

6 / 16
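The block structure is easy to see by grouping the BWT symbols by the first k symbols of their rows. The following Python sketch reproduces the k = 1 and k = 2 decompositions above:

    # Split the BWT into length-k-context blocks L_w.
    def context_blocks(t, k):
        rows = sorted(t[i:] + t[:i] for i in range(len(t)))
        blocks = {}
        for row in rows:
            # The row begins with the right context w and ends with the symbol.
            blocks.setdefault(row[:k], []).append(row[-1])
        return {w: "".join(b) for w, b in blocks.items()}

    for k in (1, 2):
        b = context_blocks("banana$", k)
        print(b, "".join(b[w] for w in sorted(b)))
    # k=1: {'$': 'a', 'a': 'nnb', 'b': '$', 'n': 'aa'} -> annb$aa
    # k=2: {'$b': 'a', 'a$': 'n', 'an': 'nb', 'ba': '$', 'na': 'aa'} -> annb$aa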

Theorem (Compression boosting)
For any text T and any k,

    \sum_{w \in \Sigma^k} |L_w| H_0(L_w) = |T| H_k(T).

Proof.
Recall that n_w is the number of occurrences of w. Then n_w = |L_w| and

    H_0(L_w) = - \sum_{s \in \Sigma} \frac{n_{sw}}{n_w} \log \frac{n_{sw}}{n_w}.

We can then use the symmetry of the kth order empirical entropy, best seen in the equation on slide 15 of Lecture 5 (proved in Exercise 4.2):

    \sum_{w \in \Sigma^k} |L_w| H_0(L_w)
        = - \sum_{w \in \Sigma^k} \sum_{s \in \Sigma} n_{sw} \log \frac{n_{sw}}{n_w}
        = n H_k(T^R)
        = \sum_{w \in \Sigma^k} n_w \log n_w - \sum_{w \in \Sigma^{k+1}} n_w \log n_w
        = n H_k(T).

7 / 16
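The identity can also be checked numerically. A sketch, assuming the circular occurrence counts used throughout this lecture and H_k defined with left contexts as in the earlier lectures:

    from collections import Counter
    from math import log2

    def h0(s):
        # Zeroth order empirical entropy of s, in bits per symbol.
        n = len(s)
        return -sum(c / n * log2(c / n) for c in Counter(s).values())

    t, k = "banana$", 2
    tt = t + t                     # doubling handles circularity
    left, right = {}, {}
    for i in range(len(t)):
        right.setdefault(tt[i + 1:i + 1 + k], []).append(t[i])  # blocks L_w
        left.setdefault(tt[i:i + k], []).append(tt[i + k])      # symbols after w
    lhs = sum(len(b) * h0(b) for b in right.values())  # sum of |L_w| H_0(L_w)
    rhs = sum(len(b) * h0(b) for b in left.values())   # n H_k(T), left contexts
    print(lhs, rhs)  # the two sums agree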

As with context-based compression, using a single context length k everywhere is not optimal in general. There exists a linear time algorithm for finding a sequence of variable length contexts w_1, w_2, ..., w_h such that L_{w_1} L_{w_2} ... L_{w_h} = L and the total compressed size is minimized. This is called optimal compression boosting.

For the best compression, we may need to take multiple context lengths into account. With the BWT this is fairly easy using adaptive methods:
- For most symbols s in the BWT, the nearest preceding symbols share a long context with s, symbols that are a little further away share a short context with s, and symbols far away share no context at all.
- Thus adaptive methods that forget, i.e., give higher weight to the nearest preceding symbols, are often good compressors of the BWT.
- Such methods can be context oblivious: they achieve good compression for all contexts without any knowledge about context blocks or context lengths.

8 / 16

Run-Length Encoding

An extreme form of forgetting is run-length encoding (RLE). RLE encodes a run s^k of k consecutive occurrences of a symbol s by the pair ⟨s, k⟩. Nothing beyond the run has an effect on the encoding.

The run-length encoding of L_ht from the example on slide 4 begins:

    ⟨o,1⟩ ⟨r,1⟩ ⟨e,3⟩ ⟨r,1⟩ ⟨e,1⟩ ⟨o,1⟩ ⟨e,2⟩ ⟨i,1⟩ ⟨e,4⟩ ⟨a,1⟩ ⟨o,2⟩ ⟨e,5⟩ ⟨a,1⟩ ⟨e,1⟩ ⟨r,1⟩ ⟨e,13⟩ ⟨r,1⟩ ⟨e,10⟩

RLE can be wasteful when there are many runs of length one. A simple optimization is to encode the (remaining) run length only after two same symbols in a row. This could be called lazy RLE. The lazy RLE of L_ht begins:

    oree1reoee0iee2aoo0ee3aeree11ree8

RLE alone does not compress the BWT very well in general but can be useful when combined with other methods.

9 / 16
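Both encoders are short to express with itertools.groupby. A sketch that reproduces the encodings above:

    from itertools import groupby

    def rle(s):
        # Encode each run s^k as the pair (s, k).
        return [(c, len(list(g))) for c, g in groupby(s)]

    def lazy_rle(s):
        # Emit the remaining run length only after two equal symbols in a row.
        out = []
        for c, g in groupby(s):
            k = len(list(g))
            out.append(c if k == 1 else c + c + str(k - 2))
        return "".join(out)

    l_ht = "oreeereoeeieeeeaooeeeeeaereeeeeeeeeeeeereeeeeeeeee"  # start of L_ht
    print(rle(l_ht)[:5])   # [('o', 1), ('r', 1), ('e', 3), ('r', 1), ('e', 1)]
    print(lazy_rle(l_ht))  # oree1reoee0iee2aoo0ee3aeree11ree8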

Move-to-front

Move-to-front (MTF) encoding works like this:
- Maintain a list of all symbols of the alphabet. A symbol is encoded by its position on the list. Note that the size of the alphabet stays the same.
- When a symbol occurs, it is moved to the front of the list. Frequent symbols tend to stay near the front of the list and are therefore encoded with small values.

The MTF encoding of L = annb$aa starting with the list anb$ (which is global frequency order) is 0102330.

    list    next symbol    code
    anb$    a              0
    anb$    n              1
    nab$    n              0
    nab$    b              2
    bna$    $              3
    $bna    a              3
    a$bn    a              0

10 / 16
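A direct sketch of the encoder, maintaining the list explicitly; decoding mirrors it (read a position i, output the i-th list symbol, move it to the front):

    # Move-to-front encoding.
    def mtf_encode(s, alphabet):
        lst = list(alphabet)
        codes = []
        for c in s:
            i = lst.index(c)           # code = current position on the list
            codes.append(i)
            lst.insert(0, lst.pop(i))  # move the symbol to the front
        return codes

    print(mtf_encode("annb$aa", "anb$"))  # [0, 1, 0, 2, 3, 3, 0]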

MTF does not compress, but it changes the global distribution, which makes entropy coding simpler:
- In the BWT, symbols are often frequent in some contexts but rare in other contexts. Effective entropy coding then needs an adaptive model that forgets quickly.
- In the MTF encoded BWT, small values tend to be frequent everywhere. Then entropy coding with a single global semistatic model achieves good compression.

A run of length l of any symbol in the BWT becomes a run of 0s of length l − 1 after MTF. Thus MTF is sometimes followed by RLE0, which run-length encodes 0s only.

The last stage of a good BWT compression is always entropy coding: Using a sophisticated adaptive model, plain BWT can be compressed very effectively. Intermediate processing (RLE, MTF, MTF+RLE0, ...) can significantly simplify and speed up the compression by allowing a simpler model (MTF) and by reducing the data (RLE).

11 / 16
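A possible RLE0 sketch; the exact output format is an assumption here, not fixed by the slides:

    from itertools import groupby

    def rle0(values):
        # Run-length encode runs of 0s only; other values are copied through.
        out = []
        for v, g in groupby(values):
            k = len(list(g))
            if v == 0:
                out.append((0, k))   # a run of k zeros
            else:
                out.extend([v] * k)
        return out

    print(rle0([3, 0, 0, 0, 0, 2, 0, 0, 1]))  # [3, (0, 4), 2, (0, 2), 1]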

Computing the BWT

Let us assume that the last symbol of the text, T[n − 1] = $, does not appear anywhere else in the text and is smaller than any other symbol. This simplifies the algorithms.

To compute the BWT, we need to sort the rotations. With the extra symbol at the end, sorting rotations is equivalent to sorting suffixes. The sorted array of all suffixes is called the suffix array (SA).

    F           L        SA
    $ b a n a n a        6  $
    a $ b a n a n        5  a $
    a n a $ b a n        3  a n a $
    a n a n a $ b        1  a n a n a $
    b a n a n a $        0  b a n a n a $
    n a $ b a n a        4  n a $
    n a n a $ b a        2  n a n a $

There are linear time algorithms for suffix sorting. The best ones are complicated but fairly fast in practice. We will not describe them here, but they are covered in the course String Processing Algorithms.

12 / 16
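A sketch of BWT computation via the suffix array, using plain comparison sorting rather than the linear time algorithms mentioned above:

    def suffix_array(t):
        # Sort suffix starting positions lexicographically.
        return sorted(range(len(t)), key=lambda i: t[i:])

    def bwt_from_sa(t):
        # L[i] is the symbol preceding the i-th smallest suffix; in Python,
        # t[-1] = '$' is exactly the circular predecessor of suffix 0.
        return "".join(t[i - 1] for i in suffix_array(t))

    print(suffix_array("banana$"))  # [6, 5, 3, 1, 0, 4, 2]
    print(bwt_from_sa("banana$"))   # annb$aa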

Inverse BWT

We will take a closer look at inverting the BWT, i.e., recovering the text T given its BWT L.

Let M′ be the matrix obtained by rotating the BWT matrix M one step to the right:
- The rows of M′ are the rotations of T in a different order.
- In M′ without the first column, the rows are sorted lexicographically.

If we sort the rows of M′ stably by the first column, we obtain M. This cycle

    M --rotate--> M′ --sort--> M

is the key to the inverse BWT.

    M                           M′
    $ b a n a n a               a $ b a n a n
    a $ b a n a n               n a $ b a n a
    a n a $ b a n    rotate     n a n a $ b a
    a n a n a $ b    ------>    b a n a n a $
    b a n a n a $    <------    $ b a n a n a
    n a $ b a n a     sort      a n a $ b a n
    n a n a $ b a               a n a n a $ b

13 / 16

In the cycle, each column moves to the right and is permuted. By repeating the cycle, we can reconstruct M from the BWT. To reconstruct T, it suffices to follow just one row through the cycles.

    - - - - - - a            a - - - - - -            $ - - - - - a
    - - - - - - n            n - - - - - -            a - - - - - n
    - - - - - - n   rotate   n - - - - - -    sort    a - - - - - n
    - - - - - - b   ----->   b - - - - - -   ----->   a - - - - - b
    - - - - - - $            $ - - - - - -            b - - - - - $
    - - - - - - a            a - - - - - -            n - - - - - a
    - - - - - - a            a - - - - - -            n - - - - - a

(rotating the last matrix above gives the first matrix below)

    a $ - - - - -            $ b - - - - a            $ b a - - - a
    n a - - - - -            a $ - - - - n            a $ b - - - n
    n a - - - - -    sort    a n - - - - n   rotate   a n a - - - n
    b a - - - - -   ----->   a n - - - - b   & sort   a n a - - - b
    $ b - - - - -            b a - - - - $   ----->   b a n - - - $
    a n - - - - -            n a - - - - a            n a $ - - - a
    a n - - - - -            n a - - - - a            n a n - - - a

    rotate & sort  ...  rotate & sort

    $ b a n a n a
    a $ b a n a n
    a n a $ b a n
    a n a n a $ b
    b a n a n a $
    n a $ b a n a
    n a n a $ b a

14 / 16

The permutation that transforms M′ into M is called the LF-mapping. The LF-mapping is the permutation that stably sorts the BWT L, i.e., F[LF[i]] = L[i]. Thus it is easy to compute from L.

Given the LF-mapping, we can easily follow a row through the permutations.

    i   F   L
    0   $   a
    1   a   n
    2   a   n
    3   a   b
    4   b   $
    5   n   a
    6   n   a

15 / 16
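A small sketch computing the LF-mapping from L with a stable sort of positions, then checking the defining property F[LF[i]] = L[i]:

    L = "annb$aa"
    order = sorted(range(len(L)), key=lambda i: L[i])  # stable sort of positions
    LF = [0] * len(L)
    for rank, i in enumerate(order):
        LF[i] = rank
    F = "".join(sorted(L))
    print(LF)                                            # [1, 5, 6, 4, 0, 2, 3]
    print(all(F[LF[i]] == L[i] for i in range(len(L))))  # True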

Here is the algorithm.

Algorithm Inverse BWT
Input: BWT L[0..n − 1]
Output: text T[0..n − 1]

Compute LF-mapping:
(1)  for i ← 0 to n − 1 do R[i] ← (L[i], i)
(2)  sort R (stably by first element)
(3)  for i ← 0 to n − 1 do
(4)      (·, j) ← R[i]; LF[j] ← i

Reconstruct text:
(5)  j ← position of $ in L
(6)  for i ← n − 1 downto 0 do
(7)      T[i] ← L[j]
(8)      j ← LF[j]
(9)  return T

The time complexity is linear if we sort R using linear time integer sorting.

16 / 16
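The algorithm translates directly into Python. A sketch using the built-in stable sort in place of linear time integer sorting:

    def inverse_bwt(L):
        n = len(L)
        # Steps (1)-(4): LF-mapping via a stable sort of (symbol, position) pairs.
        R = sorted((L[i], i) for i in range(n))   # sorted() is stable
        LF = [0] * n
        for i, (_, j) in enumerate(R):
            LF[j] = i
        # Steps (5)-(9): follow one row of M backwards through the cycles.
        T = [""] * n
        j = L.index("$")
        for i in range(n - 1, -1, -1):
            T[i] = L[j]
            j = LF[j]
        return "".join(T)

    print(inverse_bwt("annb$aa"))  # banana$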