Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria


Contents

- Introduction
- Asymptotic Equipartition Property
- Optimal Codes (Huffman Coding)
- Universal Codes (Lempel-Ziv Coding)

Source Coding

Let us consider a source (a discrete random variable X taking values in the alphabet X) that produces i.i.d. symbols generated according to a given pmf p(x). A source code is a mapping from X to a set of codewords, each one a finite-length string of bits.

[Block diagram: the source p(x) emits symbols X_1, X_2, ..., X_n, which the code maps to C(X_1), C(X_2), ..., C(X_n)]

Example: Let X in X = {1, 2, 3} be a r.v. with the following pmf and codeword assignment

Pr(X = 1) = 1/2,  C(1) = 0
Pr(X = 2) = 1/4,  C(2) = 10
Pr(X = 3) = 1/4,  C(3) = 11

Lossless source coding: the source symbols can be exactly recovered from the binary string.

Q: What is the minimum expected length (in bits/symbol), L(C), of any lossless source code?
A: The source's entropy H(X) (Shannon's source coding theorem)

For the example considered, the entropy of the source is

H(X) = -Σ_{x∈X} p(x) log p(x) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 1.5 bits

and the average length of the code is

L(C) = (1/2)(1) + (1/4)(2) + (1/4)(2) = 1.5 bits

We cannot find any lossless code with expected length < 1.5 bits/symbol.
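
As a quick sanity check, here is a minimal Python sketch (my addition, not part of the original slides) that computes the entropy and the expected codeword length for this three-symbol example:

    import math

    # pmf and codeword lengths of the example: C(1)=0, C(2)=10, C(3)=11
    pmf = {1: 0.5, 2: 0.25, 3: 0.25}
    lengths = {1: 1, 2: 2, 3: 2}

    # Entropy H(X) = -sum p(x) log2 p(x)
    H = -sum(p * math.log2(p) for p in pmf.values())

    # Expected length L(C) = sum p(x) l(x)
    L = sum(pmf[x] * lengths[x] for x in pmf)

    print(H, L)  # both equal 1.5 for this example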

General goal: to study the fundamental limits for the compression of information.

Outline:
- Typical sequences and the Asymptotic Equipartition Property
- Shannon's source coding theorem
- Optimal codes (Huffman coding)
- Universal codes (Lempel-Ziv coding)

Typicality: intuition

Let X be a Bernoulli random variable with parameter p = 0.2: Pr{X = 1} = 0.2 and Pr{X = 0} = 0.8. Let s = (x_1, x_2, ..., x_N) be a sequence of N independent realizations of X.

Let us consider two possible outcomes of s for N = 20:

s_1 = (00000000000000000000)
s_2 = (00100000001000010010)

Which one is more likely to have come from X?

The probabilities of each sequence are

Pr(s_1) = (1 - p)^N = 1.15 × 10^-2
Pr(s_2) = p^4 (1 - p)^(N-4) = 4.50 × 10^-5

Therefore, s_1 is more probable than s_2. However, one would expect that a string of N i.i.d. Bernoulli variables with parameter p should have (on average) Np ones (in our example Np = 4). In this sense, s_2 seems more typical than s_1. Why?

- How many sequences exist with all zeros? 1
- How many sequences exist with 4 ones and 16 zeros? C(20, 4) = 4845
- All of them with the same probability!

Let's take a look at the entropy. What is the entropy of X?

H(X) = -p log(p) - (1 - p) log(1 - p) = -0.2 log(0.2) - 0.8 log(0.8) ≈ 0.722

The average information or sample entropy of a sequence s = (x_1, x_2, ..., x_N) is

i(s) = -(1/N) log p(x_1, x_2, ..., x_N)

For the example considered, we have

i(s_1) = -(1/N) log (1 - p)^N = -log(1 - p) ≈ 0.322
i(s_2) = -(1/N) log (p^4 (1 - p)^(N-4)) ≈ 0.722
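
A short Python sketch (mine, not from the slides) that reproduces these numbers for a Bernoulli(0.2) source with N = 20:

    import math

    p, N = 0.2, 20

    def sample_entropy(num_ones):
        """Probability of a sequence with the given number of ones, and -(1/N) log2 of it."""
        prob = p**num_ones * (1 - p)**(N - num_ones)
        return prob, -math.log2(prob) / N

    pr_s1, i_s1 = sample_entropy(0)   # all-zeros sequence s_1
    pr_s2, i_s2 = sample_entropy(4)   # sequence s_2 with four ones

    print(pr_s1, pr_s2)   # ~1.15e-2 and ~4.50e-5
    print(i_s1, i_s2)     # ~0.322 and ~0.722 (the entropy H(X))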

The value of i(s_2) is identical to the entropy H(X), whereas the value for the non-typical sequence, i(s_1), is very different from H(X).

We can divide the set of all 2^N sequences into two sets:

1. The typical set, which contains all sequences whose sample entropy is close to the true entropy
2. The non-typical set, which contains the other sequences (note, for instance, that the most probable and the least probable sequences belong to the non-typical set)

Key findings as N grows large:
- The typical set has probability close to 1
- The typical set contains nearly 2^{NH(X)} elements (sequences); note that when H(X) << 1 this number can be a tiny fraction of the total number of 2^N sequences
- All elements in the typical set are nearly equiprobable

Typical set

Definition: The typical set A_ε^N is the set of sequences (x_1, x_2, ..., x_N) ∈ X^N satisfying

2^{-N(H(X)+ε)} ≤ p(x_1, x_2, ..., x_N) ≤ 2^{-N(H(X)-ε)}

or, equivalently, it is the set defined as

A_ε^N = { (x_1, x_2, ..., x_N) ∈ X^N : | -(1/N) log p(x_1, x_2, ..., x_N) - H(X) | < ε }

For N sufficiently large, the set A_ε^N has the following properties:

1. Pr{A_ε^N} > 1 - ε
2. The number of elements in A_ε^N satisfies (1 - ε) 2^{N(H(X)-ε)} ≤ |A_ε^N| ≤ 2^{N(H(X)+ε)}

An example

Consider a sequence of N i.i.d. Bernoulli random variables with parameter p:

- The probability of a sequence s_k with k ones is p(s_k) = p^k (1 - p)^{N-k}
- The sample entropy of all sequences with k ones is -(1/N) log p(s_k)
- The number of sequences with k ones is C(N, k)
- The pmf of the number of ones is C(N, k) p^k (1 - p)^{N-k}, k = 0, 1, ..., N
- The entropy is H = -p log p - (1 - p) log(1 - p)

What is the typical set A_ε^N for ε = 0.1?

Let's consider the case N = 8 and p = 0.4, for which H = 0.971. The typical set for ε = 0.1 is composed of all sequences whose sample entropy lies between H - ε = 0.871 and H + ε = 1.071.

k                         0      1      2      3      4      5      6      7      8
C(N,k)                    1      8      28     56     70     56     28     8      1
C(N,k) p^k (1-p)^(N-k)    0.017  0.090  0.209  0.279  0.232  0.124  0.041  0.008  0.001
-(1/N) log p(s_k)         0.737  0.810  0.883  0.956  1.029  1.103  1.176  1.249  1.322

The number of sequences in the typical set (k = 2, 3, 4) is 154:

(1 - ε) 2^{N(H(X)-ε)} = 112.6 ≤ |A_ε^N| = 154 ≤ 2^{N(H(X)+ε)} = 379.4

The probability of the typical set is 0.72. What happens when N increases?
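
The following Python sketch (my addition, under the same N = 8, p = 0.4, ε = 0.1 assumptions) reproduces the size and probability of the typical set:

    import math

    N, p, eps = 8, 0.4, 0.1
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)     # ~0.971

    size, prob = 0, 0.0
    for k in range(N + 1):
        seq_prob = p**k * (1 - p)**(N - k)                 # probability of one sequence with k ones
        sample_entropy = -math.log2(seq_prob) / N
        if abs(sample_entropy - H) < eps:                  # sequences with k ones are typical
            size += math.comb(N, k)
            prob += math.comb(N, k) * seq_prob

    print(size, prob)                                      # 154 and ~0.72
    print((1 - eps) * 2**(N*(H - eps)), 2**(N*(H + eps)))  # bounds ~112.6 and ~379.4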

N = 25: [plots of the sample entropy -(1/N) log p(s_k) versus k, with the H ± ε band, and of the pmf of k]
Number of sequences in the typical set = 26366510; probability of the typical set = 0.9362

N = 100: [same plots]
Number of sequences in the typical set = 1.18 × 10^30; probability of the typical set = 0.9997

N = 200: [same plots]
Number of sequences in the typical set = 1.57 × 10^60; probability of the typical set ≈ 1

AEP

This concentration-of-measure phenomenon is a consequence of the weak law of large numbers (convergence in probability), and is formalized in the Asymptotic Equipartition Property (AEP) theorem.

Theorem (AEP): If X_1, X_2, ... are i.i.d. with pmf p(x), then

-(1/N) log p(X_1, X_2, ..., X_N) → H(X)  in probability

Convergence in probability means that for every ε > 0

Pr{ | -(1/N) log p(X_1, X_2, ..., X_N) - H(X) | > ε } → 0

The proof follows from the independence of the random variables and the weak law of large numbers.
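
A small Monte Carlo sketch (my own illustration, not from the slides) showing the sample entropy of Bernoulli(0.2) sequences concentrating around H(X) ≈ 0.722 as N grows:

    import math
    import random

    p = 0.2
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def sample_entropy(N):
        """Draw one i.i.d. Bernoulli(p) sequence of length N and return -(1/N) log2 of its probability."""
        ones = sum(random.random() < p for _ in range(N))
        return -(ones * math.log2(p) + (N - ones) * math.log2(1 - p)) / N

    random.seed(0)
    for N in (20, 200, 2000):
        draws = [sample_entropy(N) for _ in range(1000)]
        frac_close = sum(abs(i - H) < 0.05 for i in draws) / len(draws)
        print(N, frac_close)   # fraction within 0.05 of H approaches 1 as N grows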

Consequences of the AEP: data compression

When H < 1, a consequence of the AEP is that a tiny fraction of the total number of sequences contains most of the probability: this can be used for data compression.

[Figure: for an alphabet with |X| = 2 elements, the set of all 2^N sequences split into the non-typical set and the typical set, which contains about 2^{N(H+ε)} elements]

A coding scheme

Let x^N = (x_1, x_2, ..., x_N) denote a sequence in X^N and let l(x^N) be the length of the codeword corresponding to x^N. Let us denote the typical set as A_ε^N and the non-typical set as its complement.

Proposed coding scheme (brute-force enumeration), sketched in code after the next figure:
- If x^N ∈ A_ε^N: a prefix bit 0 + at most 1 + N(H + ε) bits
- If x^N ∉ A_ε^N: a prefix bit 1 + at most 1 + N bits (recall that |X| = 2)

The code is one-to-one and easily decodable. Typical sequences have short codewords of length about NH. We have overestimated the size of the non-typical set.

[Figure: the non-typical set, whose sequences get a description of at most N + 2 bits, and the typical set, whose sequences get a description of at most N(H + ε) + 2 bits]
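
To make the scheme concrete, here is a small Python sketch (my illustration, with an assumed Bernoulli(0.3) source and N = 10) that indexes the typical set and encodes a sequence as a flag bit plus either the index or the raw bits:

    import itertools
    import math

    p, N, eps = 0.3, 10, 0.1
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def is_typical(seq):
        prob = p**sum(seq) * (1 - p)**(N - sum(seq))
        return abs(-math.log2(prob) / N - H) < eps

    typical = [s for s in itertools.product((0, 1), repeat=N) if is_typical(s)]
    index = {s: i for i, s in enumerate(typical)}
    idx_bits = max(1, math.ceil(math.log2(len(typical))))   # <= N(H + eps) bits by the AEP

    def encode(seq):
        seq = tuple(seq)
        if seq in index:                                # flag 0 + index within the typical set
            return "0" + format(index[seq], "b").zfill(idx_bits)
        return "1" + "".join(map(str, seq))             # flag 1 + the raw N bits

    print(len(typical), idx_bits)                       # 120 typical sequences, 7-bit index
    print(encode((0, 0, 1, 0, 0, 1, 0, 1, 0, 0)))       # 8 bits instead of 11

With N this small the typical set still captures only a modest fraction of the probability; the point of the sketch is the mechanism, and the guarantees of the theorem only kick in as N grows.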

If N is sufficiently large so that Pr{A_ε^N} > 1 - ε, the expected codeword length is

E[l(X^N)] = Σ_{x^N ∈ A_ε^N} p(x^N) l(x^N) + Σ_{x^N ∉ A_ε^N} p(x^N) l(x^N)
          ≤ Σ_{x^N ∈ A_ε^N} p(x^N) (2 + N(H + ε)) + Σ_{x^N ∉ A_ε^N} p(x^N) (2 + N)
          = Pr{A_ε^N} (2 + N(H + ε)) + Pr{(A_ε^N)^c} (2 + N)
          ≤ N(H + ε) + εN + 2 = N(H + ε')

where ε' = 2ε + 2/N can be made arbitrarily small.

Theorem (data compression): Let X^N = (X_1, X_2, ..., X_N) be i.i.d. with pmf p(x), and let ε > 0. Then there exists a code that maps sequences x^N of length N into binary strings such that the mapping is one-to-one (invertible) and

E[(1/N) l(X^N)] ≤ H(X) + ε

for N sufficiently large.

Shannon's source coding theorem (informal statement): N i.i.d. random variables with entropy H can be compressed into NH bits with negligible risk of information loss as N → ∞. Conversely, if they are compressed into fewer than NH bits, information will be lost.

In the following, we will study some practical codes for data compression.

Definitions

Definition: A source code C for a random variable X is a mapping from X, the alphabet of X, to D*, the set of finite-length strings of symbols from a D-ary alphabet D = {0, 1, ..., D-1}. Let C(x) denote the codeword corresponding to x and let l(x) denote the length of C(x).

Definition: The expected length of a source code C(x) for a random variable X with pmf p(x) is given by

L(C) = E[l(X)] = Σ_{x∈X} p(x) l(x)

Example

Let X in X = {1, 2, 3, 4} be a r.v. with the following pmf and codeword assignment from a binary alphabet D = {0, 1}

Pr(X = 1) = 1/2,  C(1) = 00
Pr(X = 2) = 1/4,  C(2) = 01
Pr(X = 3) = 1/8,  C(3) = 10
Pr(X = 4) = 1/8,  C(4) = 11

The sequence of symbols X = (1 1 3 1 2 1 4) is encoded as C(X) = (00 00 10 00 01 00 11). The bit string C(X) = (01 00 00 11 01 00 10) is decoded as X = (2 1 1 4 2 1 3).

The expected length of the code is L(C) = 2 bits.

There are some basic requirements for a useful code:

- Every element in X should map into a different string in D*: non-singular codes
- Any encoded string must have a unique decoding (that is, a unique sequence of symbols producing it): uniquely decodable codes
- The encoded string must be easy to decode (that is, we should be able to perform symbol-by-symbol decoding): prefix codes or instantaneous codes

In addition, the code should have minimum expected length (as close as possible to H). Our goal is to construct instantaneous codes of minimum expected length.

[Figure: nested classes of codes: all codes ⊃ nonsingular codes ⊃ uniquely decodable codes ⊃ instantaneous codes]

Example

Find the expected lengths of the following codes, and determine whether or not they are nonsingular, uniquely decodable and instantaneous.

X   pmf   C_1   C_2   C_3
1   1/2   0     10    0
2   1/4   010   00    10
3   1/8   01    11    110
4   1/8   10    110   111

Code trees

Any prefix code from a D-ary alphabet can be represented as a D-ary tree (D branches at each node). Consider the following prefix code: {00, 11, 100, 101}

[Figure: binary code tree with leaves 00, 100, 101 and 11]

Each node along the path to a leaf is a prefix of that leaf, so it cannot be a codeword (leaf) itself. Some leaves may be unused.
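
As a quick check of the prefix condition, here is a small Python helper (my own sketch, not from the slides) that tests whether a set of codewords is prefix-free, applied to the tree example above:

    def is_prefix_free(codewords):
        """True if no codeword is a prefix of another, i.e. the code is instantaneous."""
        words = sorted(codewords)   # sorting lexicographically puts a codeword right before its extensions
        return all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))

    print(is_prefix_free(["00", "11", "100", "101"]))   # True: the tree example above
    print(is_prefix_free(["0", "01", "11"]))            # False: '0' is a prefix of '01'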

Kraft inequality

Suppose we define a code by assigning a set of codeword lengths (integer numbers) (l_1, l_2, ..., l_m). If we restrict ourselves to instantaneous (prefix) codes: what is the limit or restriction on the set of integers {l_i}?

Intuitions:
- Suppose we have an instantaneous code with codewords {00, 01, 10, 11}: if we shorten one of the codewords (00 → 0), then the only way to retain instantaneous decodability is to lengthen other codewords
- There seems to be a constrained budget that we can spend on codewords

Suppose we build a code from codewords of length l = 3 using a binary alphabet (D = 2). How many codewords can we have and retain unique decodability?

D^l = 2^3 = 8: {000, 001, 010, 011, 100, 101, 110, 111}

Can we add another codeword of length l > 3 and retain instantaneous decodability? No.

Suppose now we fix the first codeword to 0 and complete the code with codewords of length 3. How many codewords can we have?

{0, 100, 101, 110, 111}

So a codeword of length 3 seems to have a cost that is 2^2 times smaller than a codeword of length 1. If the total budget is 1, the cost of a codeword of length l is 2^{-l}: the Kraft inequality.

Theorem (Kraft inequality): For any instantaneous code (prefix code) over an alphabet of D symbols, the codeword lengths l_1, l_2, ..., l_m must satisfy the inequality

Σ_i D^{-l_i} ≤ 1

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these codeword lengths.

If we represent a prefix code as a tree, Σ_i D^{-l_i} = 1 is achieved iff all leaves are utilized.

Example

Suppose a code with codeword lengths (2, 2, 3, 3, 3):

Σ_{i=1}^{5} 2^{-l_i} = 0.875 ≤ 1

It satisfies Kraft's inequality, so there exists an instantaneous code with these codeword lengths. Can we find one?

l_k   c_k = Σ_{i=1}^{k-1} 2^{-l_i}   Code
2     0.0   = 0.00 (binary)          00
2     0.25  = 0.01 (binary)          01
3     0.5   = 0.100 (binary)         100
3     0.625 = 0.101 (binary)         101
3     0.75  = 0.110 (binary)         110

Try with (2, 2, 3, 3, 3, 3) and (2, 2, 2, 3, 3, 3).
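
A minimal Python sketch (my addition) of this construction: check the Kraft inequality and, if it holds, build each codeword by writing the cumulative sum c_k in binary with l_k bits (the lengths are assumed to be sorted in non-decreasing order):

    def kraft_code(lengths):
        """Build a binary prefix code from sorted codeword lengths via cumulative sums."""
        if sum(2**-l for l in lengths) > 1:
            raise ValueError("lengths violate the Kraft inequality")
        codewords, c = [], 0.0
        for l in lengths:
            # write c in binary with l fractional bits: these bits are the codeword
            bits = format(int(round(c * 2**l)), "b").zfill(l)
            codewords.append(bits)
            c += 2**-l
        return codewords

    print(kraft_code([2, 2, 3, 3, 3]))        # ['00', '01', '100', '101', '110']
    print(kraft_code([2, 2, 3, 3, 3, 3]))     # adds '111'; (2, 2, 2, 3, 3, 3) would raise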

Codeword supermarket (from MacKay's textbook)

[Figure: the "codeword supermarket": all binary strings of lengths 1 to 4 (0, 1, 00, 01, ..., 1111) arranged as intervals of the total symbol code budget]

Optimal codes

So far, we have proved that:
- Any instantaneous code must satisfy Kraft's inequality
- Kraft's inequality is a sufficient condition for the existence of a code with the specified codeword lengths

Notice, however, that Kraft's inequality only involves the codeword lengths, not their probabilities. Now we consider the design of optimal codes:

1. Instantaneous or prefix codes (thus satisfying Kraft's inequality)
2. With minimum expected length L = Σ_i p_i l_i

Given a set of probabilities (p_1, ..., p_m), the optimization problem for designing an optimal code is as follows (we assume a binary alphabet for the codewords):

minimize   L(l_1, ..., l_m) = Σ_i p_i l_i
subject to Σ_i 2^{-l_i} ≤ 1, over the set of integers

We consider a relaxed version of the problem that neglects the integer constraint on the codeword lengths l_i, and solve the constrained minimization problem using Lagrange multipliers:

J = Σ_i p_i l_i + λ (Σ_i 2^{-l_i})

Differentiating with respect to l_i we obtain

∂J/∂l_i = p_i - λ 2^{-l_i} ln 2

Setting the derivative to zero we obtain

2^{-l_i} = p_i / (λ ln 2)

and substituting this in the constraint to find λ we find λ = 1/ln 2 and hence p_i = 2^{-l_i}, yielding the optimal code lengths

l_i* = -log p_i

With this solution the expected length of the code would be H(X); however, notice that l_i* might be noninteger.

We round up l_i* as l_i = ⌈l_i*⌉, where ⌈x⌉ denotes the smallest integer ≥ x. This choice of codeword lengths satisfies

-log p_i ≤ l_i < -log p_i + 1

and therefore we have the following theorem.

Theorem: Let (l_1*, ..., l_m*) be optimal integer-valued codeword lengths over a binary alphabet (D = 2) for a source distribution (p_1, ..., p_m), and let L* = Σ_i p_i l_i* be the associated expected length. Then

H(X) ≤ L* < H(X) + 1

So there is an overhead of at most 1 bit due to the fact that the l_i are constrained to be integer values.
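
A short Python sketch (mine) of this rounding step: compute the lengths ⌈-log2 p_i⌉ and check the bound H(X) ≤ L < H(X) + 1, using the five-symbol pmf of Example 2 below:

    import math

    pmf = [0.25, 0.25, 0.2, 0.15, 0.15]

    lengths = [math.ceil(-math.log2(p)) for p in pmf]       # l_i = ceil(-log2 p_i)
    H = -sum(p * math.log2(p) for p in pmf)                 # entropy
    L = sum(p * l for p, l in zip(pmf, lengths))            # expected length

    assert sum(2**-l for l in lengths) <= 1                 # Kraft inequality holds
    print(lengths, H, L)                                    # (2, 2, 3, 3, 3), H = 2.2855 <= L = 2.5 < H + 1

Note that the Huffman code of Example 2 does slightly better (L = 2.3 bits), which is expected since Huffman codes are optimal.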

Huffman coding

David A. Huffman discovered a simple algorithm for designing optimal (shortest expected length) prefix codes. No other code for the same alphabet can have a lower expected length than Huffman's code.

The basic idea is to assign short codewords to those input blocks with high probabilities and long codewords to those with low probabilities. A Huffman code is designed by merging together the two least probable characters (outcomes) of the random variable, and repeating this procedure until there is only one character remaining. A tree is thus generated, and the Huffman code is obtained from the labeling of the code tree.

Example 1

[Figure: Huffman code construction example]

Example 2

Construct a Huffman code for the following pmf, calculate the information content h(p_i) of each symbol and indicate the length of each codeword.

x_i   p_i    C(x_i)   h(p_i)   l_i
a     0.25
b     0.25
c     0.2
d     0.15
e     0.15

Example 2 (solution)

x_i   p_i    C(x_i)   h(p_i)   l_i
a     0.25   00       2.0      2
b     0.25   10       2.0      2
c     0.2    11       2.3      2
d     0.15   010      2.7      3
e     0.15   011      2.7      3

The entropy is H = 2.2855 bits. The expected length of the code is L = 2.3 bits.
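
A compact Python sketch (my addition, not the slides' code) of the Huffman procedure using a heap of (probability, subtree) pairs; it reproduces the codeword lengths of Example 2, although the exact bit labels may differ since Huffman codes are only unique up to relabeling of the tree branches:

    import heapq
    import itertools

    def huffman_code(pmf):
        """Return a dict symbol -> codeword for a binary Huffman code of the given pmf."""
        counter = itertools.count()            # tie-breaker so heapq never compares the dicts
        heap = [(p, next(counter), {s: ""}) for s, p in pmf.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, code0 = heapq.heappop(heap)     # the two least probable subtrees
            p1, _, code1 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in code0.items()}
            merged.update({s: "1" + c for s, c in code1.items()})
            heapq.heappush(heap, (p0 + p1, next(counter), merged))
        return heap[0][2]

    pmf = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}
    code = huffman_code(pmf)
    print(code)                                         # lengths (2, 2, 2, 3, 3)
    print(sum(pmf[s] * len(code[s]) for s in pmf))      # expected length 2.3 bits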

Huffman code for the English language

a_i      p_i      log2(1/p_i)   l_i   c(a_i)
a        0.0575   4.1           4     0000
b        0.0128   6.3           6     001000
c        0.0263   5.2           5     00101
d        0.0285   5.1           5     10000
e        0.0913   3.5           4     1100
f        0.0173   5.9           6     111000
g        0.0133   6.2           6     001001
h        0.0313   5.0           5     10001
i        0.0599   4.1           4     1001
j        0.0006   10.7          10    1101000000
k        0.0084   6.9           7     1010000
l        0.0335   4.9           5     11101
m        0.0235   5.4           6     110101
n        0.0596   4.1           4     0001
o        0.0689   3.9           4     1011
p        0.0192   5.7           6     111001
q        0.0008   10.3          9     110100001
r        0.0508   4.3           5     11011
s        0.0567   4.1           4     0011
t        0.0706   3.8           4     1111
u        0.0334   4.9           5     10101
v        0.0069   7.2           8     11010001
w        0.0119   6.4           7     1101001
x        0.0073   7.1           7     1010001
y        0.0164   5.9           6     101001
z        0.0007   10.4          10    1101000001
(space)  0.1928   2.4           2     01

[Figure: the corresponding Huffman code tree]

Universal codes

We have seen that Huffman coding produces optimal (minimal expected length) prefix codes. Do they have any disadvantage?

Huffman coding is optimal for a specific source distribution, which has to be known in advance. However, the probability distribution underlying the source may be unknown. To solve this problem we could first estimate the pmf of the source and then design the code, but this is not practical. We would rather have optimal lossless codes that do not depend on the source pmf. Such codes are called universal.

The cost of pmf mismatch

Assume we have a random variable X drawn from some parameterized pmf p_θ(x), which depends on a parameter θ ∈ {1, 2, ..., m}. If θ is known, we can construct an optimal code with codeword lengths (ignoring the integer constraint) and expected length

l(x) = -log p_θ(x) = log(1/p_θ(x)),   E[l(X)] = E[-log p_θ(X)] = H(p_θ)

If the true distribution is unknown, some questions arise:
- What is the cost of using a wrong (mismatched) pmf?
- What is the optimal pmf that minimizes this cost for a given family of distributions p_θ(x)?

Suppose we use a code with codeword lengths l(x). This code would be optimal for a source with pmf given by

q(x) = 2^{-l(x)},   i.e.,   l(x) = log(1/q(x))

The redundancy is defined as the difference between the expected length of the code that assumes the wrong distribution, q(x), and the expected length of the optimal code designed for p_θ(x):

R(p_θ(x), q(x)) = Σ_x p_θ(x) l(x) - Σ_x p_θ(x) log(1/p_θ(x))
               = Σ_x p_θ(x) (log p_θ(x) - log q(x))
               = Σ_x p_θ(x) log(p_θ(x)/q(x)) = D(p_θ(x) || q(x))

The cost (in bits) is the relative entropy between the true and the wrong distributions.
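
A small Python sketch (my illustration) computing this redundancy D(p || q) for a hypothetical true pmf p and a mismatched pmf q:

    import math

    def redundancy(p, q):
        """Extra bits per symbol paid for coding source p with lengths -log2 q(x): D(p || q)."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.125, 0.125]   # true source pmf (hypothetical example)
    q = [0.25, 0.25, 0.25, 0.25]    # mismatched pmf assumed by the code (2-bit codewords)

    H = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    print(H, redundancy(p, q))      # optimal 1.75 bits/symbol, plus 0.25 extra bits/symbol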

If we want to design a code that works well regardless of the true distribution, we could use a minimax criterion:

min_{q(x)} max_{p_θ(x)} R(p_θ(x), q(x)) = min_{q(x)} max_{p_θ(x)} D(p_θ(x) || q(x))

The solution of this problem is achieved by the distribution q*(x) that is at the "center" (in KL distance) of the family of distributions p_θ(x).

[Figure: q*(x) at the center of the family p_1(x), p_2(x), p_3(x), ..., p_m(x)]

Lempel-Ziv (LZ) coding

Discovered by Abraham Lempel and Jacob Ziv, LZ codes are a popular class of universal codes that are asymptotically optimal: their asymptotic compression rate approaches the entropy of the source. Most file compression programs (gzip, pkzip, compress) use different implementations of the basic LZ algorithm.

LZ coding is based on the use of adaptive dictionaries that contain substrings that have appeared already. When we find a substring that is already in the dictionary, we just have to encode a pointer to that substring, thus effectively compressing the source.

Basic Lempel-Ziv algorithm

As an example, let us consider the following binary string (we consider a binary alphabet without loss of generality)

source: 1 0 1 1 0 1 0 1 0 0 0 1 0 1 1 1

The first step is to parse the source into an ordered dictionary of substrings that have not appeared before

source: 1 | 0 | 11 | 01 | 010 | 00 | 10 | 111

In the second step we enumerate the substrings

source: 1 | 0 | 11 | 01 | 010 | 00 | 10 | 111
n:      1   2   3    4    5     6    7    8

The basic idea is to encode each substring by giving a pointer to the earliest occurrence of its prefix and sending the extra bit. The first pointer is empty; a pointer equal to 0 also refers to the empty substring.

source:         1      0      11     01     010    00     10     111
n:              1      2      3      4      5      6      7      8
(pointer, bit): (, 1)  (0, 0) (1, 1) (2, 1) (4, 0) (2, 0) (1, 0) (3, 1)

If we have already enumerated n substrings, the pointer can be encoded in ⌈log2 n⌉ bits.

source:          1      0      11      01      010      00       10       111
n:               1      2      3       4       5        6        7        8
(pointer, bit):  (, 1)  (0, 0) (1, 1)  (2, 1)  (4, 0)   (2, 0)   (1, 0)   (3, 1)
binary:          (, 1)  (0, 0) (01, 1) (10, 1) (100, 0) (010, 0) (001, 0) (011, 1)

Finally, the encoded string is

1 00 011 101 1000 0100 0010 0111

For this example the encoded sequence (25 bits) is actually larger than the original sequence (16 bits); this is because the source string is too short. What does the decoder look like?
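
A Python sketch (my own, following the scheme described above rather than any particular library) of this encoder: it parses the string into new substrings and emits, for each one, the pointer written with ⌈log2 n⌉ bits followed by the extra bit:

    import math

    def lz_encode(source):
        """Basic Lempel-Ziv (LZ78-style) encoder for a binary string, as described above."""
        dictionary = {"": 0}                    # substring -> index; index 0 is the empty substring
        output, phrase = [], ""
        for bit in source:
            if phrase + bit in dictionary:      # keep extending the current phrase
                phrase += bit
                continue
            n = len(dictionary)                 # substrings enumerated so far (including the empty one)
            ptr_bits = math.ceil(math.log2(n))  # 0 bits for the very first phrase
            ptr = format(dictionary[phrase], "b").zfill(ptr_bits) if ptr_bits else ""
            output.append(ptr + bit)            # pointer to the prefix + the extra bit
            dictionary[phrase + bit] = n        # store the new substring
            phrase = ""
        return "".join(output)                  # note: an incomplete final phrase is dropped here

    print(lz_encode("1011010100010111"))        # 1000111011000010000100111 (25 bits)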

Sending an uncompressed bit with each substring results in a loss of efficiency. The way to solve this issue is to consider the extra bit as part of the next substring. This modification, proposed by Terry Welch in 1984, is the basis of most practical implementations of LZ coding (LZW). In practice, the dictionary size is limited (typically to 4096 entries).