CS 229r Information Theory in Computer Science                    Feb 12, 2019

Lecture 5

Instructor: Madhu Sudan                                  Scribe: Pranay Tankala

1 Overview

A universal compression algorithm is a single compression algorithm applicable to input sequences from an entire class of information sources, rather than a specific probability distribution. We will be particularly interested in the class of Markovian sources, formally defined in the previous lecture. Today's goal is to work toward a proof that the Lempel-Ziv algorithm is a universal compression algorithm for the class of Markovian sources.

1.1 Logistics

Problem Set 1 is currently being graded; expect feedback early next week. Problem Set 2 will be released soon.

1.2 Schedule

Universal Compression:
- Hidden Markov Models (HMM)
- A Universal Compressor
- Lempel-Ziv is Universal

2 Hidden Markov Models

We begin with some illustrative examples of hidden Markov models, which are closely related to Markov chains.

2.1 Outer Space

Consider a satellite listening for signals from outer space. Most of the time there is hardly any information transmission, but occasionally some highly entropic planetary activity is observed. We model the information source as a hidden Markov model with two internal states and small parameters $\delta \geq 0$ and $p \geq 0$:

[Diagram: two states. State $Z = 0$ ("Quiet") emits $X \sim \mathrm{Bern}(\delta)$; state $Z = 1$ ("Entropic") emits $X \sim \mathrm{Bern}(\tfrac{1}{2} - \delta)$. Each state moves to the other with probability $p$ and stays put with probability $1 - p$.]

We denote the successive states of this information source by $Z_1, Z_2, \ldots, Z_n$ and the bit sequence produced by $X_1, X_2, \ldots, X_n$. We have $X_i \sim \mathrm{Bern}(\delta)$ if $Z_i = 0$ and $X_i \sim \mathrm{Bern}(\tfrac{1}{2} - \delta)$ if $Z_i = 1$. Our goal is to compress the sequence $X_1, \ldots, X_n$.

2.2 Shannon's n-gram Model

In Shannon's foundational paper, he discusses a Markov chain model of English in which letters are drawn randomly according to a probability distribution conditioned on the previous $n - 1$ letters. Here is Shannon's example of a sequence produced using trigram frequencies:

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.

http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

2.3 General Markovian Sources

We now give a general definition of hidden Markov models that abstracts the key features of the previous two examples. Intuitively, a hidden Markov model consists of a Markov chain of hidden states $Z_1, Z_2, \ldots$ that can only be observed through a sequence of related variables $X_1, X_2, \ldots$ (which do not necessarily form a Markov chain themselves). More formally:

Definition 1. A hidden Markov model is characterized by:
- a Markov chain $M \in \mathbb{R}^{k \times k}$, where $M_{ij} = \Pr[\text{next state is } j \mid \text{current state is } i]$;
- $k$ distributions $P^{(1)}, \ldots, P^{(k)}$ supported on $\Omega$, where $P^{(i)}$ is the distribution of $X$ given $Z = i$.

When discussing a Markov chain $M$, we will always assume that $M$ is irreducible and aperiodic, so that $M$ has a unique stationary distribution $\Pi \in \mathbb{R}^k$ satisfying $\Pi M = \Pi$. We will also assume that $Z_1 \sim \Pi$, which implies that $Z_i \sim \Pi$ for every $i$.

Definition 2. The entropy rate of a hidden Markov model is
$$H(M) = \lim_{n \to \infty} H(X_n \mid X_1, \ldots, X_{n-1}).$$

Under our assumptions on $M$, the above limit always exists. Indeed, we have
$$H(X_n \mid X_1, \ldots, X_{n-1}) \leq H(X_n \mid X_2, \ldots, X_{n-1}) = H(X_{n-1} \mid X_1, \ldots, X_{n-2}),$$
where the inequality holds because conditioning reduces entropy and the equality holds because we have assumed a stationary process. So the values $H(X_n \mid X_1, \ldots, X_{n-1})$ are monotonically decreasing and hence converge to a limit $H(M)$ from above.

Remark. In the example of Section 2.1, we have $k = 2$,
$$M = \begin{pmatrix} 1 - p & p \\ p & 1 - p \end{pmatrix},$$
and distributions $\mathrm{Bern}(\delta)$ and $\mathrm{Bern}(\tfrac{1}{2} - \delta)$ for the quiet and entropic states, respectively. However, we lack a closed-form expression for the entropy rate $H(M)$ in terms of $p$ and $\delta$. In fact, the best approximation algorithms we have at our disposal for computing $H(M)$ are still quite coarse.

Remark. Without our assumptions on $M$, there would be no reason to expect the values $H(X_n \mid X_1, \ldots, X_{n-1})$ to converge to a limit. For example, if we took any Markov chain $M$ and placed an unavoidable intermediate node along each edge, the entropy $H(X_n \mid X_1, \ldots, X_{n-1})$ could be wildly different for even and odd $n$.
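Since no closed-form expression for $H(M)$ is known even in the two-state example, one natural way to get a feel for it is simulation. Below is a minimal sketch (an editor's illustration, not from the lecture; all parameter values and function names are assumptions) that samples the outer-space source of Section 2.1 and computes $-\frac{1}{n}\log_2 p(X_1, \ldots, X_n)$ via the standard scaled forward recursion for HMM likelihoods. By the asymptotic equipartition property discussed in the next section, this quantity converges to $H(M)$ as $n$ grows.

```python
import numpy as np

def sample_hmm(M, emit_p, pi, n, rng):
    """Sample n bits X_1..X_n from a 2-state HMM: Z_1 ~ pi, Z_{t+1} ~ M[Z_t, :],
    and X_t ~ Bern(emit_p[Z_t])."""
    z = rng.choice(len(pi), p=pi)
    xs = np.empty(n, dtype=int)
    for t in range(n):
        xs[t] = rng.random() < emit_p[z]
        z = rng.choice(len(pi), p=M[z])
    return xs

def neg_log2_likelihood_rate(xs, M, emit_p, pi):
    """Return -(1/n) log2 p(x_1,...,x_n) using the scaled forward recursion."""
    def emit(x):                       # vector of Pr[X = x | Z = z] over states z
        return emit_p if x == 1 else 1.0 - emit_p
    alpha = pi * emit(xs[0])           # alpha[z] proportional to p(x_1..x_t, Z_t = z)
    total = np.log2(alpha.sum())
    alpha = alpha / alpha.sum()
    for x in xs[1:]:
        alpha = (alpha @ M) * emit(x)  # one step of the forward recursion
        total += np.log2(alpha.sum())
        alpha = alpha / alpha.sum()
    return -total / len(xs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, delta = 0.05, 0.01                     # illustrative values only
    M = np.array([[1 - p, p], [p, 1 - p]])    # quiet <-> entropic transitions
    emit_p = np.array([delta, 0.5 - delta])   # Pr[X=1 | quiet], Pr[X=1 | entropic]
    pi = np.array([0.5, 0.5])                 # stationary distribution (M is symmetric)
    xs = sample_hmm(M, emit_p, pi, 200_000, rng)
    print("-(1/n) log2 p(x) ~", neg_log2_likelihood_rate(xs, M, emit_p, pi))
```

With these illustrative parameters the estimate comes out noticeably below 1 bit per symbol, reflecting that the source spends about half its time in the nearly deterministic quiet state.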

3 A Universal Compressor

In order to begin to describe universal compression algorithms for Markovian sources, we need the following theorem.

Theorem 3 (Asymptotic Equipartition Property for Markovian Sources). For every hidden Markov model $(M, P^{(1)}, \ldots, P^{(k)})$ and every $\varepsilon > 0$, for sufficiently large $\ell$ there exists a set $S \subseteq \Omega^\ell$ such that for every start state $Z_1$:
1. $\Pr[(X_1, \ldots, X_\ell) \notin S] \leq \varepsilon$;
2. for all $(\alpha_1, \ldots, \alpha_\ell) \in S$,
$$\frac{1}{|S|^{1+\varepsilon}} \leq \Pr[(X_1, \ldots, X_\ell) = (\alpha_1, \ldots, \alpha_\ell)] \leq \frac{1}{|S|^{1-\varepsilon}};$$
3. $|S| = 2^{(H(M) \pm \varepsilon)\ell}$.

The set $S$ is called the typical set.

Note that the AEP for Markovian sources is more general than the version of the AEP we saw before, which dealt only with i.i.d. sequences of random variables.

Exercise 4. Using the AEP as stated above, argue that
$$-\frac{1}{\ell} \log p(X_0, \ldots, X_{\ell-1}) \to H(M)$$
in probability as $\ell \to \infty$.

3.1 A Non-Universal Compressor

We'll first use the AEP to describe a remarkably simple compression algorithm for a known Markovian source $M$. Suppose we wish to encode a string $x_1 \cdots x_n$ produced by $M$. Take $\ell$ and $S$ as in the statement of the AEP, and suppose that $n \gg \ell$. We first divide the string into blocks of length $\ell$, namely $B_1 = x_1 \cdots x_\ell$, $B_2 = x_{\ell+1} \cdots x_{2\ell}$, and so on, up to $B_{n/\ell} = x_{n-\ell+1} \cdots x_n$.

To encode a block $B_i$, we first use a single bit to indicate whether or not $B_i \in S$. Depending on whether or not $B_i \in S$, the rest of the encoding consists of either $\log_2 |S|$ or $\log_2 |\Omega|^\ell$ bits, respectively. By the AEP, the probability that $B_i \notin S$ is at most $\varepsilon$. Thus, the expected length of a single block's encoding is at most
$$(1 + \log_2 |S|) + \varepsilon \, (1 + \log_2 |\Omega|^\ell) \approx \big(H(M) + \varepsilon + \varepsilon \log_2 |\Omega|\big)\,\ell,$$
which is very close to the optimal rate of $H(M)$ bits per input symbol. To encode $x$, we simply concatenate the encodings of $B_1, \ldots, B_{n/\ell}$.

3.2 Universal Huffman Algorithm

The previous algorithm made use of the values $\ell$ and $S$. This is problematic for two reasons. First, it limits the algorithm to a single Markovian source. Second, we generally do not know how to compute $\ell$ or $S$. We attempt to circumvent these issues with what we will call the universal Huffman algorithm. Suppose we are given $x_1, \ldots, x_n$ to encode, along with a parameter $\ell$. The algorithm proceeds in stages:

1. Divide the string into size-$\ell$ blocks $B_1, \ldots, B_{n/\ell}$ as before, and for every $w \in \Omega^\ell$ compute the empirical frequency
$$f_w = \text{frequency of } w \text{ among } B_1, \ldots, B_{n/\ell}.$$

2. Compute the Huffman code tree for the frequencies $\{f_w\}_{w \in \Omega^\ell}$, and use it to compress the sequence $B_1, \ldots, B_{n/\ell}$.

3. Finally, transmit any encoding of the list of frequencies $\{f_w\}$ computed in step 1, followed by the string $\mathrm{Huffman}(B_1, \ldots, B_{n/\ell}, \{f_w\}_w)$ computed in step 2.

Exercise 5. Prove that for all Markovian sources $M$ and $\varepsilon > 0$, as $\ell$ and $n$ grow, we have
$$\Pr\big[\,|\text{Univ-Huffman}(x_1, \ldots, x_n)| > n\,(H(M) + \varepsilon)\,\big] \to 0.$$

Remark. We don't know much about how large $n$ must be, as a function of $\ell$, in order for the above algorithm to consistently perform well. We know even less about how large $\ell$ must be as a function of the underlying Markov chain. If you were surprised by the seemingly inexpensive computations needed for the universal Huffman algorithm, be warned that the real complexity is hidden in the size of $n$, which may need to be exponentially large for the compression to work well.

Exercise 6. Prove, using principles of complexity theory, that the total compression time of a universal compression algorithm must be at least exponential in $k$, the number of states of the Markov chain.

4 Lempel-Ziv is Universal

We begin with a review of the Lempel-Ziv algorithm for compressing a string $x = x_1 \cdots x_n$ with $x_i \in \Omega$.

1. First, parse $x$ into consecutive phrases $x = s_0 s_1 s_2 \cdots s_t$ such that
(a) $s_0$ is the empty string;
(b) $s_i = s_{j_i} b_i$ for some $b_i \in \Omega$ and $j_i < i$;
(c) $s_i \neq s_j$ for $i \neq j$.

2. Encode each $j_i$ using $\log t + o(\log t)$ bits, and encode each $b_i$ using $\log |\Omega|$ bits.

3. Transmit the encoding of each $j_i$ and $b_i$.

The resulting compression length is $(t \log t)(1 + o(1))$ bits in total. (A short code sketch of this parsing appears at the end of Section 4.1.)

4.1 Analyzing Lempel-Ziv

Before analyzing the Lempel-Ziv algorithm, we need to introduce one more important construct: the finite state compressor.

Definition 7. A finite state compressor (FSC) is a finite state machine (FSM) that writes 0 or more symbols to an output tape during each state transition (see the diagram below).
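To make Definition 7 concrete, here is a toy finite state compressor (an editor's illustration, not from the lecture). It has three states (empty buffer, buffered 0, buffered 1), reads the input bits left to right in a single pass, and writes zero or more output bits on each transition: nothing when it buffers the first bit of a pair, and a fixed prefix-free codeword when it completes a pair. It is uniquely decodable but makes no optimality claim.

```python
# Toy finite state compressor (illustrative sketch, not from the notes).
# States: "" (empty buffer), "0" (buffered a 0), "1" (buffered a 1).
# The prefix-free pair code below favors 0-heavy inputs, as in the quiet state
# of the outer-space source; any prefix-free code over pairs would do.
PAIR_CODE = {"00": "0", "01": "10", "10": "110", "11": "111"}

def fsc_compress(bits: str) -> str:
    assert len(bits) % 2 == 0, "this sketch handles even-length inputs only"
    state, out = "", []
    for b in bits:
        if state == "":                 # transition writes nothing, just buffers b
            state = b
        else:                           # transition writes the codeword for the pair
            out.append(PAIR_CODE[state + b])
            state = ""
    return "".join(out)

print(fsc_compress("000100000011"))     # pairs 00,01,00,00,00,11 -> "010000111"
```

On 0-heavy inputs this machine compresses by up to a factor of two; Theorem 8 below says that, for any fixed hidden Markov source, some (much larger) finite state compressor gets within a factor $(1 + \varepsilon)$ of the entropy rate.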

[Diagram: a finite state machine reads the input $x_1 x_2 \cdots x_n$ left to right in a single pass and, on its transitions, writes bit strings (e.g., 101, 11) to a left-to-right, write-only output tape $y_1 \cdots y_m$.]

Finite state compressors have a very important property related to Markovian sources, which will be crucial in our analysis of Lempel-Ziv:

Theorem 8. For any hidden Markov model $M$ and $\varepsilon > 0$, there exists a finite state compressor that achieves a compression rate of $(1 + \varepsilon) H(M)$ bits per input symbol.

Remark. It turns out that the proof of the theorem involves constructing an FSC that reads its input sequence in blocks of size $\ell$. As a result, the number of states $S$ of the machine exceeds $|\Omega|^\ell$, which is exorbitantly large compared to the $k$-state machine $M$ used to produce the input!

Recall that we want to prove that Lempel-Ziv achieves a nearly optimal compression rate. By the previous theorem, it suffices to show that the performance of Lempel-Ziv is not much worse than that of any finite state compressor. To make this comparison precise, we define $C(x)$, for a string $x \in \Omega^*$, as the maximum integer $t$ such that $x$ can be written as the concatenation of $t$ distinct strings $Y_1, \ldots, Y_t \in \Omega^*$. One might interpret $C(x)$ as a measure of the inherent complexity of compressing the string $x$. Using $C(x)$, we can state the following important bounds:

1. In the Lempel-Ziv parsing $x = s_1 \cdots s_t$, we have $t \leq C(x)$.

2. The length of the Lempel-Ziv encoding of $x$ satisfies
$$\mathrm{LZ}(x_1, \ldots, x_n) \leq t \log t \, (1 + o(1)) \leq C(x) \log C(x) \, (1 + o(1)).$$

3. For a string $x$,
$$|x| \geq (1 - o(1))\, C(x) \log_{|\Omega|} C(x)$$
(this is Claim 9 in the addendum below).

4. Finally, for any $S$-state compressor $F$, the compression length satisfies
$$\mathrm{Comp}_F(x) \geq (1 - o(1))\, C(x) \log \frac{C(x)}{S^2}$$
(this is Claim 11 below).

In the next lecture, we will further discuss the inequalities above, which provide the relationship between Lempel-Ziv and FSCs needed to complete our analysis.
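As promised in Section 4, here is a minimal sketch of the Lempel-Ziv parsing and the resulting phrase count $t$ (an editor's illustration; the phrase-index encoding is simplified to fixed-width fields rather than the $\log t + o(\log t)$-bit encoding described above).

```python
from math import ceil, log2

def lz_parse(x: str):
    """Parse x into phrases, each equal to a previously seen phrase plus one new
    symbol; return the list of (j_i, b_i) pairs, with s_0 the empty string."""
    index = {"": 0}                # phrase -> its index j
    pairs, cur = [], ""
    for b in x:
        if cur + b in index:       # keep extending the current phrase
            cur += b
        else:                      # emit (index of longest seen prefix, new symbol)
            pairs.append((index[cur], b))
            index[cur + b] = len(index)
            cur = ""
    if cur:                        # leftover final phrase that repeats an earlier one
        pairs.append((index[cur[:-1]], cur[-1]))
    return pairs

x = "ABAABABAABAB" * 20
pairs = lz_parse(x)
t = len(pairs)
sym_bits = max(1, ceil(log2(len(set(x)))))      # bits per literal symbol
bits = t * (ceil(log2(t + 1)) + sym_bits)       # crude fixed-width cost per phrase
print(f"n = {len(x)}, t = {t} phrases, about {bits} bits = t log t (1 + o(1))")
```

Each phrase extends a previously seen phrase by one symbol, so (up to the possible leftover final phrase) the phrases are distinct and $t \leq C(x)$; the printed bit count is $t \log t\,(1 + o(1))$, matching bound 2 above.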

Addendum (inserted by Madhu)

The proofs of the three inequalities claimed above are neither hard nor crucial for future lectures, so instead of using half a lecture to go over them, let me just add them here and conclude why this implies that the Lempel-Ziv algorithm is a universal compressor. We start with the second one above, then turn to the first, and finally the last.

Claim 9. For every string $x \in \Sigma^*$, $|x| \geq (1 - o(1))\, C(x) \log_{|\Sigma|} C(x)$.

Proof. Let $x = y_1 y_2 \cdots y_t$ with the $y_i$'s being distinct elements of $\Sigma^*$. To minimize the length of $x$ we might as well assume that the $y_i$'s are the lexicographically first strings, shortest ones first. So there are $|\Sigma|$ strings of length 1, $|\Sigma|^2$ strings of length 2, etc., until we get to $t$ distinct strings. The claim now follows from some calculations (omitted here; a sketch of one way to do them is given at the end of these notes).

Claim 10. The length of the Lempel-Ziv encoding of $x = (x_1, \ldots, x_n)$, denoted $\mathrm{LZ}(x)$, satisfies
$$\mathrm{LZ}(x) \leq t \log t \,(1 + o(1)) \leq C(x) \log C(x)\,(1 + o(1)),$$
where the $o(1)$ term goes to zero as $n \to \infty$.

Proof. This follows from the fact that Lempel-Ziv produces a split of $x = y_1 \cdots y_t$ with distinct $y_i$'s. It follows that $t \leq C(x)$, and since each $y_i$ is encoded using at most $\log t\,(1 + o_t(1))$ bits, the compression length satisfies $\mathrm{LZ}(x) \leq t \log t\,(1 + o_t(1)) \leq C(x) \log C(x)\,(1 + o_t(1))$. It is left as an exercise to see that $o_t(1) = o_n(1)$ for Lempel-Ziv (i.e., that $t$ grows as $n$ grows).

Claim 11. For every $S$-state compressor $F$, the compression length satisfies
$$\mathrm{Comp}_F(x) \geq (1 - o(1))\, C(x) \log \frac{C(x)}{S^2} = (1 - o(1))\, C(x) \log C(x).$$

Proof. Fix a partition $x = y_1 \cdots y_t$ of $x$ with the $y_i$'s being distinct. For states $a, b$ of $F$, let $S_{a,b}$ be the set of indices $i \in [t]$ such that the compressor is in state $a$ at the beginning of parsing $y_i$ and ends in state $b$ after parsing $y_i$. Let $y_i' \in \{0,1\}^*$ denote the string output by $F$ while parsing $y_i$. The crux of this proof is the claim that $y_i' \neq y_j'$ for distinct $i, j \in S_{a,b}$; i.e., if the compressor starts two distinct phrases in the same state and ends them in the same state, then it must produce distinct outputs on them (or else the compression is not invertible). Once we have this, we have that $\sum_{i \in S_{a,b}} |y_i'| \geq (1 - o(1))\, |S_{a,b}| \log_2 |S_{a,b}|$ (from the first claim above). And thus
$$\mathrm{Comp}_F(x) = \sum_{a,b} \sum_{i \in S_{a,b}} |y_i'| \geq (1 - o(1)) \sum_{a,b} |S_{a,b}| \log_2 |S_{a,b}|.$$
Finally, by convexity of $u \mapsto u \log_2 u$ and the fact that $\sum_{a,b} |S_{a,b}| = t$ (there are at most $S^2$ pairs $(a,b)$), we get
$$\mathrm{Comp}_F(x) \geq (1 - o(1)) \sum_{a,b} |S_{a,b}| \log_2 |S_{a,b}| \geq (1 - o(1))\, t \log \frac{t}{S^2}.$$

Putting the last two claims together, we have that Lempel-Ziv compresses almost as well as any finite state compressor. Combining this with Theorem 8, we conclude that Lempel-Ziv achieves a compression length of at most $(1 + \varepsilon) H(M)$ bits per input symbol.
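For completeness, here is one way to carry out the calculation omitted in the proof of Claim 9 (an editor's sketch, not part of the original notes). Write $q = |\Sigma| \geq 2$ and let $x = y_1 \cdots y_t$ with the $y_i$ distinct. For any $m \geq 1$, the number of nonempty strings of length at most $m$ is
$$n_m = \sum_{j=1}^{m} q^j \leq 2 q^m,$$
and since the $y_i$ are distinct, at most $n_m$ of them have length at most $m$. Hence
$$|x| = \sum_{i=1}^{t} |y_i| \geq (t - n_m)(m + 1).$$
Choosing $m = \lfloor \log_q t - 2 \log_q \log_q t \rfloor$ (valid for large $t$) gives $q^m \leq t / (\log_q t)^2$, so $n_m \leq 2t/(\log_q t)^2$ and $m + 1 \geq \log_q t - 2 \log_q \log_q t$. Therefore
$$|x| \geq \left(t - \frac{2t}{(\log_q t)^2}\right)\left(\log_q t - 2 \log_q \log_q t\right) = (1 - o(1))\, t \log_q t.$$
Taking the parse that maximizes $t$, i.e., $t = C(x)$, yields $|x| \geq (1 - o(1))\, C(x) \log_{|\Sigma|} C(x)$, as claimed.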