Chapter 5: Data Compression

Definition. A source code $C$ for a random variable $X$ is a mapping from the range of $X$ to the set of finite-length strings of symbols from a $D$-ary alphabet. Here $\mathcal{X}$ denotes the source alphabet and $\mathcal{D}$ the code alphabet; $C(x)$ is the codeword corresponding to $x$, and $l(x)$ is the length of $C(x)$.

Example: $\mathcal{X} = \{\text{Red}, \text{Blue}, \text{Yellow}\}$, $\mathcal{D} = \{0, 1\}$:
$$C(\text{Red}) = 0, \quad C(\text{Blue}) = 1, \quad C(\text{Yellow}) = 00$$
or
$$C(\text{Red}) = 0, \quad C(\text{Blue}) = 10, \quad C(\text{Yellow}) = 11.$$
Definition. The expected length $L(C)$ of a source code $C$ for a random variable $X$ with probability mass function $p(x)$ is
$$L(C) = \sum_{x \in \mathcal{X}} p(x)\, l(x).$$
Without loss of generality, we assume $\mathcal{D} = \{0, 1, \ldots, D-1\}$.
Example:
$$P(X=1) = \tfrac{1}{2},\; C(1) = 0; \quad P(X=2) = \tfrac{1}{4},\; C(2) = 10; \quad P(X=3) = \tfrac{1}{8},\; C(3) = 110; \quad P(X=4) = \tfrac{1}{8},\; C(4) = 111.$$
Here $H(X) = 1.75$ bits and $L(C) = E[l(X)] = 1.75$ bits.

Definition. A code is said to be nonsingular if $x_i \neq x_j$ implies $C(x_i) \neq C(x_j)$.

Definition. The extension $C^*$ of a code $C$ is the mapping from finite-length strings of $\mathcal{X}$ to finite-length strings of $\mathcal{D}$ defined by
$$C(x_1 x_2 \cdots x_n) = C(x_1) C(x_2) \cdots C(x_n),$$
the concatenation of the codewords $C(x_1), \ldots, C(x_n)$.
Example: $C(x_1) = 00$, $C(x_2) = 11$, so $C(x_1 x_2) = 0011$.

Definition. A code is called uniquely decodable if its extension is nonsingular.

Definition. A uniquely decodable code is said to be instantaneous if it is possible to decode each word in a sequence without referring to succeeding code symbols.

Definition. Let $x = (x_1, x_2, \ldots, x_n)$ be a sequence. A sequence $(x_1, x_2, \ldots, x_i)$ with $i \leq n$ is called a prefix of $x$. A necessary and sufficient condition for a code to be instantaneous is that no codeword be a prefix of any other codeword; this condition can be checked directly, as in the sketch below.
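As a small illustration (not part of the original notes), here is a minimal Python sketch that tests the prefix condition on the two codes for $\{\text{Red}, \text{Blue}, \text{Yellow}\}$ from the first example:

```python
def is_prefix_free(codewords):
    """Check the instantaneous (prefix) condition:
    no codeword is a prefix of any other codeword."""
    for i, c in enumerate(codewords):
        for j, d in enumerate(codewords):
            if i != j and d.startswith(c):
                return False
    return True

print(is_prefix_free(["0", "1", "00"]))   # False: "0" is a prefix of "00"
print(is_prefix_free(["0", "10", "11"]))  # True: an instantaneous code
```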
Code tree:
1. The code alphabet consists of $D$ symbols.
2. The maximum number of branches emanating from each node is $D$.
3. Each node, except the initial node, has exactly one branch entering it.
4. A node with no branch emanating from it is called a terminal node; otherwise it is called an intermediate node.
5. A path is defined as a sequence of consecutive branches starting from the initial node and entering a certain node.
6. Each branch is labeled by a code symbol $b_j$.
7. A path therefore corresponds to a sequence of code symbols.
8. A node is named by the sequence of code symbols corresponding to the path entering it.
9. A node entered by a path of $n$ consecutive branches is called an $n$th-order node.
Theorem 5.2.1 (Kraft inequality): For any instantaneous code over an alphabet of size $D$, the codeword lengths $l_1, l_2, \ldots, l_m$ must satisfy the inequality
$$\sum_{i} D^{-l_i} \leq 1.$$
Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.

Proof. ($\Rightarrow$) There are at most $D^l$ nodes of order $l$. A terminal node of order $l_i$ (with $l_i < l$) eliminates $D^{l - l_i}$ of the possible nodes of order $l$. Hence we have $\sum_{i=1}^{m} D^{l - l_i} \leq D^l$, which implies $\sum_{i=1}^{m} D^{-l_i} \leq 1$.
($\Leftarrow$) Let $L = \max_{1 \leq i \leq m} l_i$ and let $N_l$ be the number of codewords of length $l$. Then $\sum_{i=1}^{m} D^{-l_i} \leq 1$ can be written as $\sum_{l=1}^{L} N_l D^{-l} \leq 1$. Consider the codewords of length $t$, where $1 \leq t \leq L$. Then
$$\sum_{l=1}^{t-1} N_l D^{-l} + N_t D^{-t} + \sum_{l=t+1}^{L} N_l D^{-l} \leq 1,$$
or equivalently,
$$N_t \leq \left\{ D^t - \sum_{l=1}^{t-1} N_l D^{t-l} \right\} - \sum_{l=t+1}^{L} N_l D^{-(l-t)}.$$
Note that $D^t$ is the maximum number of nodes of order $t$, and $\sum_{l=1}^{t-1} N_l D^{t-l}$ is the number of nodes of order $t$ eliminated by the presence of terminal nodes of lower orders. Therefore $D^t - \sum_{l=1}^{t-1} N_l D^{t-l}$ is the number of available nodes of order $t$ in the tree. For every $t$ with $1 \leq t \leq L$, the number of available nodes of order $t$ is at least the number of required terminal nodes. Thus a tree with the required terminal nodes can always be constructed.
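The constructive direction can be made concrete. Below is a short Python sketch (an illustration added here, not from the notes) that builds an instantaneous code for any lengths satisfying the Kraft inequality: codeword $i$ is the first $l_i$ $D$-ary digits of $\sum_{j<i} D^{-l_j}$, with the lengths processed in increasing order.

```python
from fractions import Fraction

def kraft_code(lengths, D=2):
    """Construct an instantaneous code with the given codeword lengths,
    assuming they satisfy the Kraft inequality sum_i D^{-l_i} <= 1."""
    assert sum(Fraction(1, D ** l) for l in lengths) <= 1, "Kraft inequality violated"
    code, acc = [], Fraction(0)
    for l in sorted(lengths):
        # take the first l D-ary digits of acc as the next codeword
        digits, frac = [], acc
        for _ in range(l):
            frac *= D
            digits.append(str(int(frac)))
            frac -= int(frac)
        code.append("".join(digits))
        acc += Fraction(1, D ** l)
    return code

print(kraft_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```

Because each codeword starts at the running sum of the previous interval lengths, the associated intervals are disjoint and the prefix condition holds by construction.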
Theorem 5.2.2 (Extended Kraft inequality): For any countably infinite set of codewords that form a prefix code, the codeword lengths satisfy the extended Kraft inequality
$$\sum_{i=1}^{\infty} D^{-l_i} \leq 1.$$
Conversely, given any $l_1, l_2, \ldots$ satisfying the extended Kraft inequality, we can construct a prefix code with these codeword lengths.

Proof. Let the $D$-ary alphabet be $\{0, 1, \ldots, D-1\}$. Consider the $i$th codeword $\eta_1 \eta_2 \cdots \eta_{l_i}$. Let $0.\eta_1 \eta_2 \cdots \eta_{l_i}$ represent the real number with this $D$-ary expansion, i.e.,
$$0.\eta_1 \eta_2 \cdots \eta_{l_i} = \sum_{j=1}^{l_i} \eta_j D^{-j}.$$
This codeword corresponds to the interval
$$\left[\, 0.\eta_1 \eta_2 \cdots \eta_{l_i},\; 0.\eta_1 \eta_2 \cdots \eta_{l_i} + \frac{1}{D^{l_i}} \right),$$
i.e., the set of all real numbers whose $D$-ary expansion begins with $0.\eta_1 \eta_2 \cdots \eta_{l_i}$. This is a subinterval of $[0, 1]$. By the prefix condition, all the intervals are disjoint. Hence the sum of their lengths is less than or equal to 1, which means
$$\sum_{i=1}^{\infty} D^{-l_i} \leq 1.$$
We can reverse the procedure to construct a prefix code with lengths $l_1, l_2, \ldots$ (or use the code tree, which can be extended to infinite order).
Theorem 5.3.1: The expected length $L$ of an instantaneous $D$-ary code for a random variable $X$ is greater than or equal to the entropy $H_D(X)$, i.e.,
$$L \geq H_D(X),$$
with equality if and only if $D^{-l_i} = p_i$.

Proof. Let $r_i = D^{-l_i} / \sum_j D^{-l_j}$ and $c = \sum_i D^{-l_i} \leq 1$. Then
$$L - H_D(X) = \sum_i p_i l_i - \sum_i p_i \log_D \frac{1}{p_i} = -\sum_i p_i \log_D D^{-l_i} + \sum_i p_i \log_D p_i = \sum_i p_i \log_D \frac{p_i}{r_i} - \log_D c = D(p \,\|\, r) + \log_D \frac{1}{c} \geq 0,$$
since relative entropy is nonnegative and $c \leq 1$. Equality holds if and only if $p_i = D^{-l_i}$ and $\sum_i D^{-l_i} = 1$.
Definition. A probability distribution is called $D$-adic if each of the probabilities is equal to $D^{-n}$ for some integer $n$. Thus we have equality in the theorem if and only if the distribution of $X$ is $D$-adic.

Theorem 5.4.1: Let $l_1^*, l_2^*, \ldots, l_m^*$ be optimal codeword lengths for a source distribution $p_1, p_2, \ldots, p_m$ and a $D$-ary alphabet, and let $L^*$ be the associated expected length of an optimal code ($L^* = \sum p_i l_i^*$). Then
$$H_D(X) \leq L^* < H_D(X) + 1.$$

Proof. Let $l_i = \lceil \log_D \frac{1}{p_i} \rceil$, so that $\log_D \frac{1}{p_i} \leq l_i < \log_D \frac{1}{p_i} + 1$. Then
$$H_D(X) \leq \sum_i p_i l_i < H_D(X) + 1.$$
Since these $l_i$ satisfy the Kraft inequality, there exists an instantaneous code with these lengths. An optimal code can only do better: $L^* = \sum p_i l_i^* \leq \sum p_i l_i$. Since $L^* \geq H_D(X)$ from Theorem 5.3.1, we have $H_D(X) \leq L^* < H_D(X) + 1$.
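The lengths $l_i = \lceil \log_D 1/p_i \rceil$ used in the proof are easy to compute; a minimal Python sketch (added here for illustration, with an assumed example distribution) verifies the two-sided bound numerically:

```python
from math import ceil, log2

def shannon_lengths(p):
    """Codeword lengths l_i = ceil(log2(1/p_i)) from the proof of Theorem 5.4.1."""
    return [ceil(log2(1 / pi)) for pi in p]

p = [0.25, 0.25, 0.2, 0.15, 0.15]          # assumed example distribution
l = shannon_lengths(p)
H = -sum(pi * log2(pi) for pi in p)         # entropy H(X) in bits
L = sum(pi * li for pi, li in zip(p, l))    # expected codeword length
print(l)                                    # [2, 2, 3, 3, 3]
print(H <= L < H + 1)                       # True: H(X) <= L < H(X) + 1
```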
Extended source $X^n$: applying the theorem to blocks,
$$H(X_1, X_2, \ldots, X_n) \leq \sum p(x_1, \ldots, x_n)\, l(x_1, \ldots, x_n) < H(X_1, X_2, \ldots, X_n) + 1.$$
Suppose that $X_1, \ldots, X_n$ are i.i.d. Then $H(X_1, \ldots, X_n) = \sum H(X_i) = n H(X)$. Hence
$$H(X) \leq L_n = \frac{1}{n} \sum p(x_1, \ldots, x_n)\, l(x_1, \ldots, x_n) < H(X) + \frac{1}{n}.$$
Theorem 5.4.2: In general,
$$\frac{H(X_1, \ldots, X_n)}{n} \leq L_n < \frac{H(X_1, \ldots, X_n)}{n} + \frac{1}{n}.$$
Note that $H(X_1, \ldots, X_n)/n \to H(\mathcal{X})$, the entropy rate, if the random process is stationary.

Theorem 5.4.3: The expected length under $p(x)$ of the code assignment $l(x) = \lceil \log \frac{1}{q(x)} \rceil$ satisfies
$$H(p) + D(p \,\|\, q) \leq E_p[l(X)] < H(p) + D(p \,\|\, q) + 1.$$
Proof.
$$E_p[l(X)] = \sum_x p(x) \left\lceil \log \frac{1}{q(x)} \right\rceil < \sum_x p(x) \left( \log \frac{1}{q(x)} + 1 \right) = \sum_x p(x) \log \left( \frac{p(x)}{q(x)} \cdot \frac{1}{p(x)} \right) + 1 = \sum_x p(x) \log \frac{p(x)}{q(x)} + \sum_x p(x) \log \frac{1}{p(x)} + 1 = D(p \,\|\, q) + H(p) + 1.$$
The lower bound can be derived similarly.
Theorem 5.5.1 (McMillan inequality): The condition
$$\sum_{i=1}^{k} D^{-l_i} \leq 1$$
is necessary and sufficient for the existence of a base-$D$ uniquely decodable code with lengths $l_1, l_2, \ldots, l_k$.

Proof. Consider
$$\left( \sum_{i=1}^{k} D^{-l_i} \right)^m = \left( D^{-l_1} + \cdots + D^{-l_k} \right)^m.$$
There are $k^m$ terms, each of the form $D^{-l_{i_1} - l_{i_2} - \cdots - l_{i_m}} = D^{-l}$, where $l = l_{i_1} + l_{i_2} + \cdots + l_{i_m}$. Then
$$\left( \sum_{i=1}^{k} D^{-l_i} \right)^m = \sum_{l=m}^{mn} N_l D^{-l},$$
where $l_i \leq n$ for all $i$, and $N_l$ is the number of strings of $m$ codewords whose total length is exactly $l$ code symbols. If the code is uniquely decodable, $N_l$ cannot exceed $D^l$, the number of distinct $D$-ary sequences of length $l$. Thus
$$\left( \sum_{i=1}^{k} D^{-l_i} \right)^m \leq \sum_{l=m}^{mn} D^{-l} D^l = mn - m + 1 \leq mn.$$
If $x > 1$, then $x^m > mn$ when $m$ is large enough; hence we must have $\sum_{i=1}^{k} D^{-l_i} \leq 1$. (Sufficiency follows from the Kraft inequality, since every instantaneous code is uniquely decodable.)
Huffman Codes

Example: [binary Huffman code construction; figure omitted]
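In place of the worked figure, here is a compact Python sketch of the standard binary Huffman procedure (an illustration added here, not the notes' original example): repeatedly merge the two least likely symbols and prepend a distinguishing bit to each side of the merge.

```python
import heapq
from itertools import count

def huffman(probs):
    """Binary Huffman code for a dict {symbol: probability}."""
    tiebreak = count()  # unique counter so ties never compare the dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # least likely subtree
        p2, _, c2 = heapq.heappop(heap)  # second least likely subtree
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# The dyadic example from the start of the chapter:
print(huffman({1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}))
# e.g. {1: '0', 2: '10', 3: '110', 4: '111'}, up to relabeling of 0 and 1
```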
For a $D$-ary Huffman code, each merging step replaces $D$ symbols by one, so a complete code tree requires the total number of symbols to be of the form $D + k(D-1)$; if necessary, dummy symbols of probability 0 are added to reach this count, as in the sketch below.

Example: $D = 3$ [figure omitted]
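A one-line helper (my own sketch, assuming the padding convention above) computes how many dummy symbols are needed for $m$ source symbols:

```python
def num_dummies(m, D):
    """Dummy symbols of probability 0 needed so that
    m + dummies = D + k*(D-1) for some integer k >= 0."""
    if m <= D:
        return D - m
    return (D - m) % (D - 1)

print(num_dummies(6, 3))  # 1: pad 6 symbols to 7 = 3 + 2*(3-1)
print(num_dummies(5, 3))  # 0: 5 = 3 + 1*(3-1) already fits
```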
Optimality of Huffman Codes

Lemma 5.8.1: For any distribution, there exists an optimal instantaneous code (with minimum expected length) that satisfies the following properties:
1. If $p_j > p_k$, then $l_j \leq l_k$.
2. The two longest codewords have the same length.
3. The two longest codewords differ only in the last bit and correspond to the two least likely symbols.
Proof: Omitted.

Proof of the optimality of the Huffman code (binary case). Let $C_m$ be a Huffman code for $m$ symbols with probabilities $p_1, p_2, \ldots, p_m$. Then $C_m$ satisfies the properties of Lemma 5.8.1. Let $C_{m-1}$ be the reduced code of $C_m$: a code for $m-1$ symbols obtained by taking the common prefix of the two longest codewords and assigning it to a new symbol with probability $p_{m-1} + p_m$, while all the other codewords remain the same.
The correspondence between $C_{m-1}$ (codewords $w_i'$, lengths $l_i'$) and $C_m$ (codewords $w_i$, lengths $l_i$) is as follows. For $i = 1, \ldots, m-2$, the symbol with probability $p_i$ keeps its codeword: $w_i = w_i'$ and $l_i = l_i'$. The merged symbol with probability $p_{m-1} + p_m$ has codeword $w_{m-1}'$ of length $l_{m-1}'$ in $C_{m-1}$; in $C_m$ it splits into $w_{m-1} = w_{m-1}'0$ and $w_m = w_{m-1}'1$, so $l_{m-1} = l_m = l_{m-1}' + 1$.

The expected length of $C_m$ is
$$L(C_m) = \sum_{i=1}^{m} p_i l_i = \sum_{i=1}^{m-2} p_i l_i' + p_{m-1}(l_{m-1}' + 1) + p_m(l_{m-1}' + 1) = \sum_{i=1}^{m-2} p_i l_i' + (p_{m-1} + p_m)\, l_{m-1}' + p_{m-1} + p_m = L(C_{m-1}) + p_{m-1} + p_m.$$
We now show that optimality of $L(C_{m-1})$ implies optimality of $L(C_m)$. Suppose there were a code $\hat{C}_m$ with $L(\hat{C}_m) < L(C_m)$. Let the codewords of $\hat{C}_m$ be $\hat{w}_1, \ldots, \hat{w}_m$ with lengths $\hat{l}_1, \ldots, \hat{l}_m$, ordered so that $\hat{l}_1 \leq \hat{l}_2 \leq \cdots \leq \hat{l}_m$. By Lemma 5.8.1, one of the words, $\hat{w}_{m-1}$, must be identical to $\hat{w}_m$ except in its last digit. We form $\hat{C}_{m-1}$ by combining $\hat{w}_{m-1}$ and $\hat{w}_m$, dropping their last digit, and leaving all other words unchanged. Then $L(\hat{C}_m) = L(\hat{C}_{m-1}) + \hat{p}_{m-1} + \hat{p}_m$. Note that $\hat{p}_{m-1} = p_{m-1}$ and $\hat{p}_m = p_m$. Hence we would have $L(\hat{C}_{m-1}) < L(C_{m-1})$ for the distribution $(p_1, \ldots, p_{m-2}, p_{m-1} + p_m)$, which is a contradiction.

Example of a Huffman code over an extended source alphabet: let $\mathcal{X} = \{0, 1\}$ with $p(X=0) = 0.9$, $p(X=1) = 0.1$; a sketch of the computation follows.
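The following Python sketch (added here; it reuses `huffman()` from the earlier sketch) carries out this extended-alphabet example for block lengths $n = 1, 2, 3$, showing the rate per source symbol falling toward $H(X) \approx 0.469$ bits:

```python
import math
from itertools import product

p = {"0": 0.9, "1": 0.1}
H = -sum(q * math.log2(q) for q in p.values())  # H(X) ~ 0.469 bits
for n in (1, 2, 3):
    # probability of each length-n block under the i.i.d. source
    blocks = {"".join(b): math.prod(p[c] for c in b)
              for b in product("01", repeat=n)}
    code = huffman(blocks)                       # Huffman code on blocks
    L_n = sum(blocks[b] * len(code[b]) for b in blocks) / n
    print(f"n={n}: {L_n:.4f} bits/symbol (H(X)={H:.4f})")
# Prints 1.0, 0.645, ~0.533 bits/symbol -- approaching H(X) as n grows.
```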
Shannon code: codeword lengths $l_i = \lceil \log \frac{1}{p_i} \rceil$. The Shannon code is not optimal in general.

Example: $\mathcal{D} = \{0, 1\}$, $p(0) = 0.9999$, $p(1) = 0.0001$. The Huffman code uses one bit per symbol, e.g. $C(0) = 0$, $C(1) = 1$, whereas the Shannon code assigns $l(0) = \lceil \log_2 \frac{1}{0.9999} \rceil = 1$ bit but $l(1) = \lceil \log_2 \frac{1}{0.0001} \rceil = 14$ bits.
Shannon-Fano-Elias coding. Let $\mathcal{X} = \{1, 2, \ldots, m\}$ and $p(x) > 0$ for all $x$. Define the cumulative distribution function
$$F(x) = \sum_{a \leq x} p(a)$$
and the modified cumulative distribution function
$$\bar{F}(x) = \sum_{a < x} p(a) + \frac{1}{2} p(x).$$
Round off $\bar{F}(x)$ to $l(x)$ bits, denoted by $\lfloor \bar{F}(x) \rfloor_{l(x)}$.
We use the first $l(x)$ bits of $\bar{F}(x)$ as the codeword for $x$. Note that if $l(x) = \lceil \log \frac{1}{p(x)} \rceil + 1$, then
$$\bar{F}(x) - \lfloor \bar{F}(x) \rfloor_{l(x)} < \frac{1}{2^{l(x)}} \leq \frac{p(x)}{2} = \bar{F}(x) - F(x-1).$$
Hence $\lfloor \bar{F}(x) \rfloor_{l(x)}$ lies within the step corresponding to $x$. Let each codeword $z_1 z_2 \cdots z_l$ represent the interval
$$\left[\, 0.z_1 z_2 \cdots z_l,\; 0.z_1 z_2 \cdots z_l + \frac{1}{2^l} \right).$$
The code is prefix-free if and only if the intervals corresponding to the codewords are disjoint.
Note that the interval $\left[\, 0.z_1 z_2 \cdots z_l,\; 0.z_1 z_2 \cdots z_l + 2^{-l} \right)$ falls within the step $(F(x-1), F(x))$. Hence all the codeword intervals are disjoint, because the steps are disjoint.
The average length is
$$L = \sum_x p(x)\, l(x) = \sum_x p(x) \left( \left\lceil \log \frac{1}{p(x)} \right\rceil + 1 \right) < H(X) + 2.$$

Example:

x   p(x)   F(x)   F̄(x)    F̄(x) in binary   l(x)   codeword
1   0.25   0.25   0.125   0.001            3      001
2   0.25   0.5    0.375   0.011            3      011
3   0.2    0.7    0.6     0.10011...       4      1001
4   0.15   0.85   0.775   0.1100011...     4      1100
5   0.15   1.0    0.925   0.1110110...     4      1110
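The table can be reproduced with a few lines of Python (a sketch added here, not part of the notes): for each symbol, compute $\bar{F}(x)$ and truncate it to $l(x) = \lceil \log_2 \frac{1}{p(x)} \rceil + 1$ binary digits.

```python
from math import ceil, floor, log2

def sfe_code(p):
    """Shannon-Fano-Elias code: codeword = first l(x) bits of
    Fbar(x) = F(x-1) + p(x)/2, with l(x) = ceil(log2(1/p(x))) + 1."""
    F, code = 0.0, {}
    for x, px in enumerate(p, start=1):
        Fbar = F + px / 2
        l = ceil(log2(1 / px)) + 1
        bits = floor(Fbar * 2 ** l)       # truncate to l binary digits
        code[x] = format(bits, f"0{l}b")  # zero-padded binary string
        F += px
    return code

print(sfe_code([0.25, 0.25, 0.2, 0.15, 0.15]))
# {1: '001', 2: '011', 3: '1001', 4: '1100', 5: '1110'}
```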
For small source alphabets, we have efficient coding only if we use long blocks of source symbols. Hence it is desirable to have an efficient coding procedure that works for long blocks of source symbols. Huffman coding is not ideal for this situation, since it is a bottom-up procedure that requires calculating the probabilities of all source sequences of a particular block length and constructing the corresponding complete code tree. Arithmetic coding is a direct extension of Shannon-Fano-Elias coding that is suitable for long block lengths without having to redo all the calculations as the block grows. The essential idea of arithmetic coding is to calculate efficiently the probability mass function $p(x^n)$ and the cumulative distribution function $F(x^n)$ for the source sequence $x^n$.
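To illustrate the core idea (a simplified sketch of interval narrowing for an i.i.d. source, not a full arithmetic coder with renormalization), the following Python fragment maps a source sequence to its subinterval of $[0, 1)$; the interval width equals $p(x^n)$, so roughly $\log_2 \frac{1}{p(x^n)}$ bits suffice to identify a point inside it:

```python
def arithmetic_interval(seq, p):
    """Successively narrow [low, high) to the subinterval selected by
    each observed symbol; the final width equals p(x^n)."""
    symbols = sorted(p)
    low, high = 0.0, 1.0
    for s in seq:
        width = high - low
        cum = 0.0
        for t in symbols:                 # locate s's slice of the interval
            if t == s:
                low, high = low + cum * width, low + (cum + p[t]) * width
                break
            cum += p[t]
    return low, high

lo, hi = arithmetic_interval("0010", {"0": 0.9, "1": 0.1})
print(lo, hi, hi - lo)  # width = 0.9 * 0.9 * 0.1 * 0.9 = 0.0729
```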