Analytic Information Theory: Redundancy Rate Problems

1 Analytic Information Theory: Redundancy Rate Problems. W. Szpankowski, Department of Computer Science, Purdue University, W. Lafayette, IN. November 3, 2015. Research supported by the NSF Science & Technology Center and the Humboldt Foundation.

2 Outline
1. Shannon Information Theory: Three Theorems of Shannon
2. Glance at Channel Coding Theorem: Asymptotic Equipartition Theorem, Jointly Typical Sequences, Capacity
3. Source Coding
4. Redundancy: Known Sources: Shannon and Huffman Coding (sequences modulo 1); Non-Prefix Codes (saddle point method); Tunstall Code (Mellin transform)
5. Minimax Redundancy: Universal (Unknown) Sources: (a) Universal Memoryless Sources (tree-like generating functions); (b) Universal Markov Sources (balance matrices); (c) Universal Renewal Sources (combinatorial calculus)
6. Appendix A: Mellin Transform
7. Appendix B: Saddle Point Method

3 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem Asymptotic Equipartition Theorem Jointly typical Sequences Capacity 3. Source Coding 4. Redundancy: Known Sources 5. Minimax Redundancy: Universal (unknown) Sources

4 Three Jewels of Shannon
Theorem 1 & 3 [Shannon 1948; Lossless & Lossy Data Compression]. Lossless compression: compression bit rate ≥ source entropy H(X). Lossy compression: for distortion level D, lossy bit rate ≥ rate distortion function R(D).
Theorem 2 [Shannon 1948; Channel Coding]. In Shannon's words: "It is possible to send information at the capacity through the channel with as small a frequency of errors as desired by proper (long) encoding. This statement is not true for any rate greater than the capacity."

5 Theorem 1: AEP and Typical Sequences
Shannon-McMillan-Breiman: −(1/n) log P(X_1^n) → H(X) (pr.), where H(X) is the entropy rate.
Asymptotic Equipartition Property: sequences of length n can be partitioned into a good set G_n^ε, with P(w) ≈ 2^{−nH(X)} for w ∈ G_n^ε, and a bad set B_n^ε, with P(B_n^ε) < ε. Also, |G_n^ε| ≈ 2^{nH(X)}.

6 Theorem 2: Shannon Random Decoding Rule
There are 2^{nH(X)} X-typical sequences; there are 2^{nH(Y)} Y-typical sequences; there are 2^{nH(X,Y)} jointly (X,Y)-typical pairs of sequences.
Decoding rule: declare that the sequence sent, X, is the one that is jointly typical with the received sequence Y, provided there is a unique X satisfying this property!

7 Sketch of Proof: Channel Capacity Theorem
1. With high probability (whp), there is a jointly typical pair (X,Y).
2. The probability that there is another jointly typical pair is 2^{−nI(X;Y)}. Indeed: there are 2^{nH(X)} and 2^{nH(Y)} typical sequences X^n and Y^n, respectively, and 2^{nH(X,Y)} jointly typical pairs (X,Y); hence the probability that a fixed pair is jointly typical is 2^{nH(X,Y)} · 2^{−n(H(X)+H(Y))} = 2^{−nI(X;Y)}.
3. The probability of error when 2^{nR} messages are sent is approximately
min P(error) ≈ 2^{−n(sup_{P(X)} I(X;Y) − R)} = 2^{−n(C − R)}.
4. In conclusion: R < C implies P(error) ≤ 2^{−nδ}; R > C implies P(error) → 1.

8 Capacity of BSC
Capacity: I(X;Y) = H(Y) − H(Y|X) = H(Y) − H(p) ≤ 1 − H(p). The capacity is achieved for the uniform input distribution; thus C = 1 − H(p).
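A one-line numerical check of this formula (our illustration, not part of the slides):

```python
from math import log2

def H(p):
    """Binary entropy in bits; H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - H(p)

for p in (0.0, 0.11, 0.5):
    print(f"p = {p:.2f}  C = {bsc_capacity(p):.4f} bits/use")
```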

9 Temporal Capacity for BSC (Exercise)
1. Each bit incurs a delay T with known probability distribution F(t) = P(T < t). If a bit arrives after a given deadline τ, it is dropped.
2. The longer it takes to send a bit, the lower the probability of success, which we denote by Φ(ε,t) for t < τ (e.g., Φ(ε,t) = (1 − ε)^t).
3. Define P(x|x) = ∫_0^τ Φ(ε,t) dF(t), the probability of a successful transmission, and the channel
P(y|x) = α := 1 − F(τ) if y = erasure; P(x|x) if y = x; 1 − α − P(x|x) if y ≠ x.
Define α = 1 − F(τ) and ρ := P(x|x)/(1 − α). Then H(Y|X) = H(α) + (1 − α)H(ρ), H(Y) = H(α) + (1 − α)H(pρ + (1 − p)(1 − ρ)), and
C(τ) = [1 − P(T > τ)] [1 − H(ρ)].
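For concreteness, a small numerical sketch of the exercise (our illustration; the exponential delay with rate lam and the choice Φ(ε,t) = (1 − ε)^t are assumptions, not from the slides):

```python
# Evaluate C(tau) = [1 - P(T > tau)][1 - H(rho)] for an assumed exponential delay.
from math import exp, log2
from scipy.integrate import quad

def H(p):
    return 0.0 if p in (0.0, 1.0) else -p*log2(p) - (1-p)*log2(1-p)

def temporal_capacity(tau, eps, lam):
    F = lambda t: 1 - exp(-lam * t)              # F(t) = P(T < t)
    phi = lambda t: (1 - eps) ** t               # success probability for delay t
    # P(x|x) = int_0^tau Phi(eps,t) dF(t), with dF(t) = lam * exp(-lam t) dt
    p_xx, _ = quad(lambda t: phi(t) * lam * exp(-lam * t), 0, tau)
    alpha = 1 - F(tau)                           # erasure probability
    rho = p_xx / (1 - alpha)                     # success probability given not erased
    return (1 - alpha) * (1 - H(rho))

for tau in (0.5, 1.0, 2.0, 5.0):
    print(tau, round(temporal_capacity(tau, eps=0.05, lam=1.0), 4))
```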

12 Noisy Constrained Channel
Let S denote the set of binary constrained sequences of length n. Here: S_{d,k} = {(d,k) sequences}, i.e., no sequence contains a run of zeros shorter than d or longer than k. A sequence X ∈ S_{(d,k)} can be represented as a Markov process.
The noisy constrained capacity C(S,ε) is defined as
C(S,ε) = sup_{X∈S} I(X;Y) = lim_{n→∞} (1/n) sup_{X_1^n ∈ S} I(X_1^n; Y_1^n).
This is/was an open problem since Shannon.

13 Entropy of HMM
1. Mutual information I(X;Y) = H(Y) − H(Y|X), where Y_k = X_k ⊕ E_k, k ≥ 1 (⊕ is exclusive-or), and E = {E_k}_{k≥1} is a binary i.i.d. sequence representing noise such that P(E_i = 1) = ε.
2. Observe that H(Y|X) = H(E) = H(ε); hence we must find the entropy H(Y), that is, the entropy of a hidden Markov process, since a (d,k) sequence can be generated as the output of a kth-order Markov process.
How to compute the entropy H(Y)?

14 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources 5. Minimax Redundancy: Universal (unknown) Sources

15 Source Coding
A source code is a bijective mapping C : A* → {0,1}* from sequences over the alphabet A to the set {0,1}* of binary sequences. The basic problem of source coding (i.e., data compression) is to find codes with the shortest descriptions (lengths), either on average or for individual sequences.
Three basic types of source coding: Fixed-to-Variable (FV) length codes (e.g., Huffman and Shannon codes); Variable-to-Fixed (VF) length codes (e.g., Tunstall and Khodak codes); Variable-to-Variable (VV) length codes (e.g., Khodak VV code).

16 Preliminary Results
A prefix code is such that no codeword is a prefix of another codeword (tree and lattice representations).
Notation: for a source model S and a code C we let P(x) be the probability of x ∈ A*; L(C,x) be the code length for the source sequence x ∈ A*; and the entropy H(P) = −Σ_{x∈A*} P(x) lg P(x). Quantities are expressed in binary logarithms, written lg := log_2.

17 Prefix Codes
Kraft's Inequality. A binary code is a prefix code iff the code lengths l_1, l_2, ..., l_N satisfy
Σ_{i=1}^N 2^{−l_i} ≤ 1
(equivalently, counting descendants at level l_max = max_i l_i: Σ_i 2^{l_max − l_i} ≤ 2^{l_max}).
Barron's lemma. For any sequence a_n of positive constants satisfying Σ_n 2^{−a_n} < ∞,
Pr{L(X) < −log P(X) − a_n} ≤ 2^{−a_n}, and therefore L(X) ≥ −log P(X) − a_n (a.s.).
Proof: We argue as follows:
Pr{L(X) < −log_2 P(X) − a_n} = Σ_{x: P(x) < 2^{−L(x)−a_n}} P(x) ≤ Σ_{x: P(x) < 2^{−L(x)−a_n}} 2^{−L(x)−a_n} ≤ 2^{−a_n} Σ_x 2^{−L(x)} ≤ 2^{−a_n}.
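A tiny illustration (ours) of checking Kraft's inequality for a proposed set of code lengths:

```python
def kraft_sum(lengths):
    """Return sum_i 2^{-l_i}; a binary prefix code with these lengths exists iff the sum is <= 1."""
    return sum(2.0 ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> realizable, e.g. 0, 10, 110, 111
print(kraft_sum([1, 1, 2]))      # 1.25 -> no prefix code with these lengths
```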

18 Shannon Lower Bound
Shannon's First Theorem. For any prefix code, the average code length E[L(C,X)] cannot be smaller than the entropy of the source H(P), that is, E[L(C_n,X)] ≥ H(P).
Proof: Let K = Σ_x 2^{−L(x)} ≤ 1 and write L(x) := L(C,x). Then
E[L(C,X)] − H(P) = Σ_{x∈A*} P(x) L(x) + Σ_{x∈A*} P(x) log P(x) = Σ_{x∈A*} P(x) log[ P(x) / (2^{−L(x)}/K) ] − log K ≥ 0,
since log x ≤ x − 1 for 0 < x ≤ 1 (i.e., the divergence is nonnegative), while K ≤ 1 by Kraft's inequality.
Exercise: There exists at least one sequence x̃_1^n such that L(x̃_1^n) ≥ −log_2 P(x̃_1^n).

19 Redundancy: Known Source P
The pointwise redundancy R(x) and the average redundancy R̄:
R(x) = L(C,x) + lg P(x),   R̄ = E[L(C,X)] − H(P) ≥ 0.
Optimal code: min_L Σ_x L(x) P(x) subject to Σ_x 2^{−L(x)} ≤ 1. Solution: by Lagrangian multipliers we find L^{opt}(x) = −lg P(x).
The smaller the redundancy is, the better (closer to optimal) the code is.

20 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources Shannon and Huffman Coding Non-Prefix Codes Tunstall Code 5. Minimax Redundancy: Universal (unknown) Sources

21 Redundancy for Huffman's Code
We consider fixed-to-variable length codes; in particular, Huffman's code. For a known source P, we consider fixed-length sequences x_1^n = x_1 ... x_n.
Huffman code: the following optimization problem is solved by Huffman's code:
R̄_n = min_{C_n ∈ C} E_{x_1^n}[ L(C_n, x_1^n) + log_2 P(x_1^n) ].
We study the average redundancy for a binary memoryless source with p denoting the probability of generating 0 and q = 1 − p. In 1994 Stubley proposed the following expression for Huffman's average redundancy:
R̄_n^H = 2 − Σ_{k=0}^n C(n,k) p^k q^{n−k} ⟨αk + βn⟩ − 2 Σ_{k=0}^n C(n,k) p^k q^{n−k} 2^{−⟨αk+βn⟩} + o(1),
where α = log_2((1−p)/p), β = log_2(1/(1−p)), and ⟨x⟩ = x − ⌊x⌋ is the fractional part of x.

22 Main Result
Theorem 1 (W.S., 2000). Consider the Huffman block code of length n over a binary memoryless source with p < 1/2. Then, as n → ∞,
R̄_n^H = 3/2 − 1/ln 2 + o(1) ≈ 0.057   if α is irrational;
R̄_n^H = 3/2 − (1/M)( ⟨βMn⟩ − 1/2 ) − 2^{−⟨nβM⟩/M} / ( M(1 − 2^{−1/M}) ) + O(ρ^n)   if α = N/M,
where N, M are integers such that gcd(N,M) = 1 and ρ < 1.
Figure 1: The average redundancy of Huffman codes versus block size n for (a) irrational α = log_2((1−p)/p) with p = 1/π; (b) rational α = log_2((1−p)/p) with p = 1/9.

23 Why Two Modes: Shannon Code
Consider the Shannon code that assigns the length L(C_n^S, x_1^n) = ⌈−lg P(x_1^n)⌉ to the source sequence x_1^n. Observe that P(x_1^n) = p^k (1−p)^{n−k}, where p is the known probability of generating 0 and k is the number of 0s. The Shannon code redundancy is
R̄_n^S = Σ_{k=0}^n C(n,k) p^k (1−p)^{n−k} ( ⌈−log_2(p^k (1−p)^{n−k})⌉ + log_2(p^k (1−p)^{n−k}) )
= 1 − Σ_{k=0}^n C(n,k) p^k (1−p)^{n−k} ⟨αk + βn⟩,
where ⟨x⟩ = x − ⌊x⌋ is the fractional part of x, and α = log_2((1−p)/p), β = log_2(1/(1−p)).
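A short numerical illustration (ours) of the two modes: the sum above evaluated for p = 1/π (irrational α, redundancy converging) and p = 1/9 (rational α = 3, redundancy oscillating):

```python
# R_n^S = 1 - sum_k C(n,k) p^k q^(n-k) <alpha*k + beta*n>
from math import log2, pi, comb

def shannon_redundancy(p, n):
    q = 1 - p
    alpha = log2(q / p)            # alpha = log2((1-p)/p)
    beta = log2(1 / q)             # beta  = log2(1/(1-p))
    s = sum(comb(n, k) * p**k * q**(n - k) * ((alpha * k + beta * n) % 1.0)
            for k in range(n + 1))
    return 1 - s

for n in (100, 200, 400, 800):
    print(n, round(shannon_redundancy(1/pi, n), 4), round(shannon_redundancy(1/9, n), 4))
```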

24 Sketch of Proof
We need to understand the asymptotic behavior of the following sum (cf. Bernoulli distributed sequences modulo 1):
Σ_{k=0}^n C(n,k) p^k (1−p)^{n−k} f( ⟨αk + y⟩ )
for fixed p and some Riemann integrable function f : [0,1] → R.
Lemma 1. Let 0 < p < 1 be a fixed real number and α be an irrational number. Then for every Riemann integrable function f : [0,1] → R,
lim_{n→∞} Σ_{k=0}^n C(n,k) p^k (1−p)^{n−k} f( ⟨αk + y⟩ ) = ∫_0^1 f(t) dt,
where the convergence is uniform for all shifts y ∈ R.
Lemma 2. Let α = N/M be a rational number with gcd(N,M) = 1. Then for every bounded function f : [0,1] → R,
Σ_{k=0}^n C(n,k) p^k (1−p)^{n−k} f( ⟨αk + y⟩ ) = (1/M) Σ_{l=0}^{M−1} f( ⟨ l/M + ⟨My⟩/M ⟩ ) + O(ρ^n),
uniformly for all y ∈ R and some ρ < 1.

25 Uniformly Distributed Sequences Mod 1
1. Uniformly distributed sequences mod 1: A sequence x_n ∈ R is said to be Bernoulli uniformly distributed modulo 1 (in short: B-u.d. mod 1) if for 0 < p < 1,
lim_{n→∞} Σ_{k=0}^n C(n,k) p^k (1−p)^{n−k} χ_I( ⟨x_k⟩ ) = λ(I)
holds for every interval I ⊂ [0,1), where χ_I(x) is the characteristic function of I (i.e., it equals 1 if x ∈ I and 0 otherwise) and λ(I) is the Lebesgue measure of I.
2. Weyl's criterion: A sequence x_n is B-u.d. mod 1 if and only if
lim_{n→∞} Σ_{k=0}^n C(n,k) p^k (1−p)^{n−k} e^{2πi m x_k} = 0
holds for all nonzero m ∈ Z.
Proof. The proof is standard. Basically, it is based on the fact that, by Weierstrass's approximation theorem, every Riemann integrable function f of period 1 can be uniformly approximated by a trigonometric polynomial (i.e., a finite combination of functions of the type e^{2πimx}).

26 Shannon Code: The Irrational Case
3. Let us return to the Shannon code redundancy. Two cases: α irrational and α rational.
4. We first consider α irrational. To apply our previous results, we must show that ⟨αk⟩ is B-u.d. mod 1. By Weyl's criterion,
lim_{n→∞} Σ_{k=0}^n C(n,k) p^k q^{n−k} e^{2πim(kα)} = lim_{n→∞} ( p e^{2πimα} + q )^n = 0,
provided α is irrational. Hence, by the previous theorem, with f(t) = t and y = βn, we immediately obtain
lim_{n→∞} Σ_{k=0}^n C(n,k) p^k q^{n−k} ⟨αk + βn⟩ = ∫_0^1 t dt = 1/2.
This proves that for α irrational, R̄_n^S = 1/2 + o(1).

27 Shannon Redundancy: Rational Case
Assume α = N/M where gcd(N,M) = 1. Denote p_{n,k} = C(n,k) p^k q^{n−k}. Then
S_n = Σ_{k=0}^n p_{n,k} ⟨ kN/M + βn ⟩ = Σ_{l=0}^{M−1} ⟨ lN/M + βn ⟩ Σ_{m: k=l+mM ≤ n} p_{n,k}.
Lemma 3. For fixed l ≤ M and M, there exists ρ < 1 such that
Σ_{m: k=l+mM ≤ n} C(n,k) p^k (1−p)^{n−k} = 1/M + O(ρ^n).
Proof. Let ω_k = e^{2πik/M}, k = 0,1,...,M−1, be the Mth roots of unity. Then
(1/M) Σ_{k=0}^{M−1} ω_k^n = 1 if M | n, and 0 otherwise,
where M | n means that M divides n. Therefore
Σ_{m: k=l+mM ≤ n} C(n,k) p^k q^{n−k} = (1/M)[ 1 + ω_1^{−l}(pω_1 + q)^n + ... + ω_{M−1}^{−l}(pω_{M−1} + q)^n ] = 1/M + O(ρ^n),
since |pω_r + q| = √( p² + q² + 2pq cos(2πr/M) ) < 1 for r ≠ 0.

28 Finishing the Rational Case
We shall use the following Fourier series; for real x,
⟨x⟩ = 1/2 − Σ_{m=1}^∞ sin(2πmx)/(mπ) = 1/2 − Σ_{m∈Z∖{0}} c_m e^{2πimx},  c_m = 1/(2πim).
Continuing the derivation and using the above lemma we obtain
S_n = (1/M) Σ_{l=0}^{M−1} ⟨ lN/M + βn ⟩ + O(ρ^n)
= 1/2 − Σ_{m≠0} c_m e^{2πimβn} (1/M) Σ_{l=0}^{M−1} e^{2πimlN/M} + O(ρ^n)
= 1/2 − Σ_{k≠0} c_{kM} e^{2πikMβn} + O(ρ^n)
= 1/2 + (1/M)( ⟨Mnβ⟩ − 1/2 ) + O(ρ^n).
Now, having the above two results, we easily establish that
R̄_n^S = 1/2 + o(1) if α is irrational, and R̄_n^S = 1/2 − (1/M)( ⟨Mnβ⟩ − 1/2 ) + O(ρ^n) if α = N/M.
Exercise. Using Stubley's formula and the tools discussed above, derive the redundancy of the Huffman code.

29 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources Shannon and Huffman Coding Non-Prefix Codes Tunstall Code 5. Minimax Redundancy: Universal (unknown) Sources

31 Do We Really Need Prefix FV Codes?
As argued in the XXVIII Shannon Lecture, it is possible to attain average compression for FV codes lower than Huffman's code (P and n are known):
1. Avoiding symbol-by-symbol compression, we design a fixed-to-variable code for the whole file, hence eliminating the need for prefix codes.
2. Applying the prefix condition to symbol-by-symbol or n-block supersymbols is only optimal for the linear term, but not for the sublinear term.
3. The optimal FV code performs no blocking but encodes a table that lists sequences in decreasing probabilities. Define π_X(x) = l if x is the l-th most probable element according to distribution P_X. Then the minimum average code length is L̄(X) = E[ ⌊log_2 π_X(X)⌋ ].
Example: binary source with p < 1 − p =: q; sequence x_1^n = x_1 ... x_n. The probabilities, in decreasing order, are q^n (p/q)^0, q^n (p/q)^1, ..., q^n (p/q)^n, and the assigned code lengths are ⌊log_2(1)⌋, ⌊log_2(2)⌋, ..., ⌊log_2(2^n)⌋.

32 Bounds on the Average Rate of Fixed-to-Variable Codes
Upper bound: E[L(X)] ≤ H(X) (Wyner).
Lower bounds: Theorem 2 (W.S., Verdú, 2009). Define the monotonically increasing function ψ by ψ(x) = x + (1 + x) log_2(1 + x) − x log_2 x. Then ψ^{−1}(H(X)) ≤ L̄(X).
Looser bounds: Noticing that ψ(x) ≤ x + log_2(e + ex), the Alon & Orlitsky bound follows:
H(X) − log_2(H(X) + 1) − log_2 e ≤ L̄(X).
Since h(x) = (1 + x) log(1 + x) − x log x is monotonically increasing, we conclude
L̄(X) ≥ H(X) − (1 + H(X)) log_2(1 + H(X)) + H(X) log_2 H(X).
This can also be found in Blundo & de Prisco.

33 Binary Memoryless Source (W.S., 2005)
Consider again a binary memoryless source with p the probability of transmitting a 1. Let x_1^n = x_1 ... x_n. There are C(n,k) equal probabilities P(x_1^n) = p^k q^{n−k}, where k is the number of 1s. Define
A_k = C(n,0) + C(n,1) + ... + C(n,k),  A_{−1} = 0.
Starting from A_{k−1}, the next C(n,k) probabilities P(x_1^n) are the same. The average code length is
L̄_n = Σ_{k=0}^n p^k q^{n−k} Σ_{j=A_{k−1}+1}^{A_k} ⌊log_2 j⌋ = Σ_{k=0}^n p^k q^{n−k} Σ_{i=1}^{C(n,k)} ⌊log_2( A_{k−1} + i )⌋.
In 2005 we proved the following asymptotic results:
for p = 1/2: L̄_n = n − 2 + (n + 2) 2^{−n};
for p < 1/2: L̄_n = nH(p) − (1/2) log n + O(1).
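The exact formula above can be evaluated for moderate n (our sketch; it assumes p < 1/2, so that listing types by increasing k orders sequences by decreasing probability):

```python
# Exact average length of the optimal one-to-one (non-prefix) code for a binary
# memoryless source, compared with the nH(p) - (1/2)log2(n) approximation.
from math import comb, log2

def one_to_one_length(p, n):
    q = 1 - p
    total, A_prev = 0.0, 0
    for k in range(n + 1):           # k = number of 1s; all C(n,k) such strings tie
        c = comb(n, k)
        block = sum(int(log2(A_prev + i)) for i in range(1, c + 1))  # sum of floor(log2 j)
        total += p**k * q**(n - k) * block
        A_prev += c                  # A_k = C(n,0) + ... + C(n,k)
    return total

p = 0.3
H = -(p*log2(p) + (1-p)*log2(1-p))
for n in (8, 12, 16, 20):
    print(n, round(one_to_one_length(p, n), 3), round(n*H - 0.5*log2(n), 3))
```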

34 Binary Case: More Precise Main Result
Theorem 3 (W.S., 2005). For a binary memoryless source with p < 1/2,
L̄_n = nH(p) − (1/2) log_2 n + C(p) + F(n) + o(1),
where H(p) = −p log_2 p − (1−p) log_2(1−p) and C(p) is an explicit constant depending on p (it involves (3 + ln 2)/(2 ln 2), log_2[ (1−p)/((1−2p)√(2πp(1−p))) ], and (p/(1−2p)) log_2( p/(2(1−p)) )).
Moreover, F(n) = 0 if α = log_2((1−p)/p) is irrational, while for rational α = N/M, F(n) is a bounded oscillating function: an explicit combination, with coefficients built from (1−p)/(1−2p) and p/(1−2p), of the periodic functions
H_M(y)[f] := (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} ( f_M( y − log_2( (1−2p)√(2πpqn)/(1−p) ) − x²/(2 ln 2) ) − ∫_0^1 f(t) dt ) dx,
evaluated at y = nβ and y = nβ − α for f(x) = x and f(x) = 2^{−x}, where f_M denotes the 1/M-periodic average of f (cf. Lemma 2) and β = log_2(1/(1−p)).

35 Some Oscillations
Figure 2: Plots of L̄_n − nH(p) + 0.5 log(n) (y-axis) versus n (x-axis) for: (a) irrational α = log_2((1−p)/p) with p = 1/π; (b) rational α = log_2((1−p)/p) with p = 1/9.

36 Sketch of Proof
1. Using the following identity to handle floor functions (partial summation; cf. Knuth),
Σ_{j=1}^N a_j = N a_N − Σ_{j=1}^{N−1} j ( a_{j+1} − a_j ),
we can reduce L̄_n to sums of the following form:
S_n = Σ_{k=0}^n C(n,k) p^k q^{n−k} ⌊log_2 A_k⌋ = Σ_{k=0}^n C(n,k) p^k q^{n−k} log_2 A_k − Σ_{k=0}^n C(n,k) p^k q^{n−k} ⟨log_2 A_k⟩ = a_n − b_n,
where
a_n = Σ_{k=0}^n C(n,k) p^k q^{n−k} log_2 A_k,  b_n = Σ_{k=0}^n C(n,k) p^k q^{n−k} ⟨log_2 A_k⟩.

37 Asymptotics of A_n: Saddle Point Method
Lemma 5. For large n and p < 1/2,
A_{np} = ( (1−p)/(1−2p) ) · 2^{nH(p)} / √(2πnp(1−p)) · ( 1 + O(n^{−1/2}) ).
More precisely, for an ε > 0 and k = np + Θ(n^{1/2+ε}) we have (Exercise)
A_k = ( (1−p)/(1−2p) ) · (2πnp(1−p))^{−1/2} · p^{−k} (1−p)^{−(n−k)} · exp( −(k−np)²/(2p(1−p)n) ) · ( 1 + O(n^{−δ}) ).
Proof. Notice that A_n(z) = Σ_{k=0}^n A_k z^k = [ (1+z)^n − 2^n z^{n+1} ] / (1 − z). Apply the saddle point method to the Cauchy formula A_k = [z^k] A_n(z):
A_k = (1/(2πi)) ∮ [ (1+z)^n − 2^n z^{n+1} ] / (1 − z) · dz/z^{k+1} = (1/(2πi)) ∮ (1/(1 − z)) e^{n log(1+z) − (k+1) log z} dz + (negligible term).
The saddle point is z_0 = (k+1)/(n−k+1) ≈ p/(1−p), and
A_k = (1/(1 − z_0)) · e^{h(z_0)} / √(2π h''(z_0)) · ( 1 + O(n^{−1/2}) ),
where h(z) = n log(1+z) − (k+1) log z.
Heuristic: Note that C(n, k−i) ≈ (p/q)^i C(n,k) for k ≈ np, so A_k ≈ C(n,k) Σ_{i≥0} (p/q)^i = C(n,k) (1−p)/(1−2p).

38 Returning to b_n
3. We also need the asymptotics of b_n = Σ_{k=0}^n C(n,k) p^k q^{n−k} ⟨log_2 A_k⟩. From the previous lemma we conclude that
log_2 A_k = αk + nβ − log_2( ω √n ) − (k−np)²/(2pqn ln 2) + O(n^{−δ})
for some ω > 0, with α = log_2((1−p)/p). Thus we need the asymptotics of the following sum:
Σ_{k=0}^n C(n,k) p^k q^{n−k} ⟨ αk + nβ − log_2( ω √n ) − (k−np)²/(2pqn ln 2) ⟩.
We must now resort to the theory of Bernoulli sequences modulo 1.

39 Final Lemma
Lemma 6 (Drmota, Hwang, W.S., 2004). Let 0 < p < 1 be a fixed real number and f : [0,1] → R a Riemann integrable function.
(i) If α is irrational, then
lim_{n→∞} Σ_{k=0}^n C(n,k) p^k (1−p)^{n−k} f( ⟨ kα + y − (k−np)²/(2pqn ln 2) ⟩ ) = ∫_0^1 f(t) dt,
where the convergence is uniform for all shifts y ∈ R.
(ii) If α = N/M (rational) with gcd(N,M) = 1, then uniformly in y ∈ R,
Σ_{k=0}^n C(n,k) p^k (1−p)^{n−k} f( ⟨ kα + y − (k−np)²/(2pqn ln 2) ⟩ ) = ∫_0^1 f(t) dt + H_M(y),
where
H_M(y) := (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} ( f_M( y − x²/(2 ln 2) ) − ∫_0^1 f(t) dt ) dx,
with f_M the 1/M-periodic average of f over the residues l/M (cf. Lemma 2), is a periodic function with period 1/M.

40 General Memoryless Source: Main Results
We consider an m-ary alphabet A with symbol probabilities p_1 ≥ p_2 ≥ ... ≥ p_{m−1} ≥ p_m. Entropy: H(X) = −Σ_i p_i log p_i. Define also B_i = log(p_m/p_i).
Our main results are:
Theorem 4. For a memoryless source over a finite alphabet A, the minimum expected length of a lossless binary encoding is
L̄_n = n log_2 |A| + O(1) if the source is equiprobable, and
L̄_n = nH(X) − (1/2) log_2 n + O(1) if the source is not equiprobable.

41 Sketch of the Proof
1. Type: T_{n,m} = { (k_1,...,k_m) ∈ N^m : k_1 + ... + k_m = n }, where k_i is the number of symbols i in a sequence; all sequences of the same type have the same probability p^k = p_1^{k_1} ... p_m^{k_m}, k ∈ T_{n,m}.
2. Order among types: l ≼ k iff p^l ≥ p^k, sorted from the smallest index to the largest; equivalently
l_1 B_1 + ... + l_{m−1} B_{m−1} ≤ k_1 B_1 + ... + k_{m−1} B_{m−1}, where B_i = log(p_m/p_i).
3. There are C(n; k) = n!/(k_1! ... k_m!) sequences of type k. Define
A_k := Σ_{l ≼ k} C(n; l).
Starting from position A_k, the next C(n; k+1) sequences have the same probability p^{k+1}, where k+1 denotes the next type.

42 Sketch of Proof
4. The average code length is
L̄_n = Σ_{k∈T_{n,m}} p^k Σ_{i=A_{k−1}+1}^{A_k} ⌊log i⌋ = Σ_{k∈T_{n,m}} p^k Σ_{i=1}^{C(n;k)} ⌊log( A_{k−1} + i )⌋ = Σ_{k∈T_{n,m}} C(n;k) p^k log A_k + O(1) = log A_{np} + O(1),
where the last step uses the concentration of the multinomial sum around k = np.
5. We need to estimate A_{np} = Σ_{l: p^l ≥ p^{np}} C(n; l). For l_i = np_i + x_i,
A_{np} = Σ_{B_1 x_1 + ... + B_{m−1} x_{m−1} ≤ 0} C(n; np + x).

43 Sketch of the Proof
6. By Stirling's formula,
C(n; np + x) ≈ ( 1 + O(1/√n) ) · C · 2^{nH(X)} n^{−(m−1)/2} · exp( B_1 x_1 + ... + B_{m−1} x_{m−1} ) · exp( −(1/(2n)) x^T Σ^{−1} x ),
where Σ is an invertible (m−1) × (m−1) covariance matrix.
7. Observe that (with b = (B_1, ..., B_{m−1}))
A_{np} = C 2^{nH(X)} n^{−(m−1)/2} [ Σ_{b^T x = 0} exp( −(1/(2n)) x^T Σ^{−1} x ) + Σ_{b^T x < 0} exp( b^T x − (1/(2n)) x^T Σ^{−1} x ) ],
which leads to
log A_{np} = log( C 2^{nH(X)} n^{−(m−1)/2} n^{(m−2)/2} ( 1 + O(1) ) ) = nH(X) − (1/2) log n + O(1),
since Σ_{x∈D_{m−2}} exp( −(1/(2n)) x^T Σ^{−1} x ) = C' n^{(m−2)/2}.

44 Example for m = 3
Assume now m = 3. Then
A_{np} = Σ_{B_1 x + B_2 y ≤ 0} C(n; np_1 + x, np_2 + y, np_3 − x − y)
≈ ( 2^{nH(X)} / (2πn √(p_1 p_2 p_3)) ) Σ_{B_1 x + B_2 y = 0} exp( −x²/(2np_1) − y²/(2np_2) − (x+y)²/(2np_3) ) · (1 + geometric tail)
= O(√n) · 2^{nH(X)}/n = C 2^{nH(X)}/√n.
Figure 3: Illustration for m = 3 (in the (k_1, k_2) plane: normal behavior, of total weight O(√n), along the boundary line through (np_1, np_2); geometric decay away from it).

45 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources Shannon and Huffman Coding Non-Prefix Codes Tunstall Code 5. Minimax Redundancy: Universal (unknown) Sources

46 Variable-to-Fixed Codes
A VF coder consists of a parser and a dictionary (represented by a parsing tree).
1. A variable-to-fixed length encoder partitions the source string into a concatenation of variable-length phrases.
2. Each phrase belongs to a given dictionary D of source strings.
3. A dictionary can be represented by a complete parsing tree T; the dictionary entries d ∈ D correspond to the leaves of the parsing tree.
4. The encoder represents phrases by fixed-length binary codewords: a dictionary D of M entries requires log_2 M bits per entry.
Average redundancy rate:
r = lim_{n→∞} Σ_{|x|=n} P_S(x) ( L(x) + log P_S(x) ) / n = log M / E[D] − h,
where h is the entropy rate of the source.

47 Tunstall and Khodak Codes
(Example: p = 0.6, q = 0.4; Tunstall's construction with M = 5; Khodak's construction with r = 0.25.)
Tunstall code:
1. Start with a root and its leaves.
2. In the J-th iteration, select a leaf with the highest probability and grow children out of it. At the J-th step, the parsing tree has J internal nodes and M = J + 1 leaves corresponding to dictionary entries.
Khodak construction:
1. Pick a real number r and grow a complete parsing tree satisfying r · min{p, 1−p} ≤ P(d) < r for all d ∈ D_r.
2. The resulting parsing tree is exactly the same as the Tunstall tree.
3. If y is a proper prefix of entries of D_r, i.e., y is an internal node of T_r, then P(y) ≥ r.
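A small Python sketch of Tunstall's construction (our illustration, for a binary memoryless source; the max-heap keyed by phrase probability mirrors the rule "select the leaf with the highest probability"), also reporting the redundancy rate r = log_2 M / E[D] − h from the previous slide:

```python
import heapq
from math import log2

def tunstall(p, M):
    """Grow the parsing tree until it has M leaves; return the dictionary {phrase: prob}."""
    q = 1 - p
    heap = [(-p, "0"), (-q, "1")]                # leaves, keyed by -probability
    while len(heap) < M:
        neg_prob, phrase = heapq.heappop(heap)   # most probable leaf
        prob = -neg_prob
        heapq.heappush(heap, (-prob * p, phrase + "0"))
        heapq.heappush(heap, (-prob * q, phrase + "1"))
    return {phrase: -neg for neg, phrase in heap}

p, M = 0.6, 5
D = tunstall(p, M)
ED = sum(prob * len(phrase) for phrase, prob in D.items())   # E[D] = average phrase length
h = -(p * log2(p) + (1 - p) * log2(1 - p))                   # entropy rate in bits
print(D)
print("redundancy rate r =", log2(M) / ED - h)
```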

48 Phrase Length
We study the phrase length D = |d|, i.e., the path length in the parsing tree.
Moment generating functions: define
D(r, z) := E[z^D] = Σ_{d∈D_r} P(d) z^{|d|}
and the corresponding internal-nodes generating function
S(r, z) = Σ_{y: P(y) ≥ r} P(y) z^{|y|}.
Simple fact on trees: let D be a dictionary (leaves of T) and Ỹ the collection of proper prefixes of dictionary entries (internal nodes of T). Then (Exercise)
Σ_{d∈D} P(d) ( z^{|d|} − 1 )/( z − 1 ) = Σ_{y∈Ỹ} P(y) z^{|y|}.
Thus D(r, z) = 1 + (z − 1) S(r, z), and
E[D] = S(r, 1) = Σ_{y∈Ỹ} P(y),  E[D(D−1)] = 2 S'_z(r, 1) = 2 Σ_{y∈Ỹ} P(y) |y|.

49 Recurrences
Define v = 1/r, z complex, and S̃(v, z) = S(v^{−1}, z). Let A(v) = Σ_{y: P(y) ≥ 1/v} 1 be the number of strings of probability at least v^{−1}, i.e., the number of internal nodes. In fact, M_r = A(v) + 1. We have
A(v) = 0 for v < 1,  A(v) = 1 + A(vp) + A(vq) for v ≥ 1,
and
S̃(v, z) = 0 for v < 1,  S̃(v, z) = 1 + zp S̃(vp, z) + zq S̃(vq, z) for v ≥ 1,
since every binary string is either the empty string, a string starting with the first symbol, or a string starting with the second symbol.

50 Mellin Transform
The Mellin transform F*(s) of a function F(v) is
F*(s) = ∫_0^∞ F(v) v^{s−1} dv.
From the recurrence on S̃(v, z) we find
D*(s, z) = (1 − z) / ( s (1 − z p^{1−s} − z q^{1−s}) ) − 1/s,  R(s) < s_0(z),
where s_0(z) denotes the real solution of z p^{1−s} + z q^{1−s} = 1.
To find the asymptotics of D̃(v, z) as v → ∞ we compute the inverse transform of D*(s, z):
D̃(v, z) = (1/(2πi)) lim_{T→∞} ∫_{σ−iT}^{σ+iT} D*(s, z) v^{−s} ds,  where σ < s_0(z).
To determine the polar singularities of the meromorphic continuation of D*(s, z), we have to analyze the set
Z(z) = { s ∈ C : z p^{1−s} + z q^{1−s} = 1 }.

51 [Figure: the complex s-plane with the vertical integration line R(s) = σ, the real root s_0, and the zeros of 1 − z p^{1−s} − z q^{1−s} marked along a vertical line up to height T.]
Formally, moving the line of integration past the poles s ∈ Z(z) would express D̃(v, z) as a sum of residues of D*(s, z) v^{−s} plus a remainder contour integral (1/(2πi)) ∫ over the shifted line, provided that the series of residues converges and the last integral exists. But they don't!

52 Tauberian Rescue
Therefore, we analyze (as in analytic number theory; cf. also Vallée)
D̃_1(v, z) = ∫_0^v D̃(w, z) dw,
whose Mellin transform is
D̃_1*(s, z) = −D*(s + 1, z)/s = O(1/s²).
Lemma 7 (Tauberian). Let f(v, λ) be a non-negative increasing function such that
F(v, λ) = ∫_0^v f(w, λ) dw
has the asymptotic expansion
F(v, λ) = ( v^{λ+1}/(λ + 1) ) ( 1 + λ o(1) )
as v → ∞, uniformly in λ. Then, as v → ∞ and uniformly in λ,
f(v, λ) = v^λ ( 1 + λ^{1/2} o(1) ).

53 Main Results
Theorem 5 (Central Limit Theorem). For large M_r,
( D_r − (1/H) ln M_r ) / √( ((H_2 − H²)/H³) ln M_r ) → N(0,1),
the standard normal distribution, where H is the (natural) entropy and H_2 = p ln² p + q ln² q.
If ln p / ln q is irrational, then
M_r = A(v) + 1 = v/H + o(v),  E[D_r] = ln M_r/H + ln H/H + H_2/(2H²) + o(1);
if ln p / ln q is rational, then
M_r = Q_1(log v) v + O(v^{1−η}),  Q_1(x) = ( L/(H(1 − e^{−L})) ) e^{−L⟨x/L⟩},
E[D_r] = ln M_r/H + ln H/H + H_2/(2H²) + ( ln L + ln(1 − e^{−L}) + L/2 )/H + O(M_r^{−η}),
where L is the largest real number such that ln(1/p) and ln(1/q) are integer multiples of L.

54 Redundancy Rate
The average redundancy rate of the Tunstall/Khodak VF code is defined as
r̄_{M_r} = ln M_r / E[D] − h.
Case 1: ln p / ln q irrational:
r̄_{M_r} = (H/ln M_r) ( −ln H − H_2/(2H) ) + o( 1/ln M_r ).
Case 2: ln p / ln q rational:
r̄_{M_r} = (H/ln M_r) ( −ln H − H_2/(2H) − ln L − ln(e^L − 1) + L/2 ) + O( M_r^{−η} )
for some η > 0, where L > 0 is the largest real number for which ln(1/p) and ln(1/q) are integer multiples of L. (No oscillation!)

55 Random Walk
Consider a random walk that corresponds to a path in the associated parsing tree. For Khodak's code we studied sums of the form
Σ_{y: P(y) ≥ 1/v} f(v)
for some function f(v). But P(y) = p^k q^l (k, l ≥ 0); set v = 2^V, so that −log P(y) = k lg(1/p) + l lg(1/q) ≤ V. This corresponds to a random walk in the first quadrant with the linear boundary condition ak + bl = V, where a = log(1/p) and b = log(1/q). [Figure: the boundary line crosses the k-axis at log r / log p and the l-axis at log r / log q.] The phrase length coincides with the exit time of such a random walk.

56 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources 5. Minimax Redundancy: Universal (unknown) Sources Universal Memoryless Sources Universal Markov Sources Universal Renewal Sources

57 Minimax Redundancy: Unknown Source P
In practice, one can only hope to have some knowledge about a family of sources S that generates real data. Following Davisson, we define the average minimax redundancy R̄_n(S) and the worst case (maximal) minimax redundancy R*_n(S) for a family of sources S as
R̄_n(S) = min_{C_n} sup_{P∈S} E[ L(C_n, X_1^n) + lg P(X_1^n) ],
R*_n(S) = min_{C_n} sup_{P∈S} max_{x_1^n} [ L(C_n, x_1^n) + lg P(x_1^n) ].
In the minimax scenario we look for the best code for the worst source.
Source coding goal: Find data compression algorithms that match optimal redundancy rates, either on average or for individual sequences.

58 Maximal Minimax Redundancy
We consider the following classes of sources S:
Memoryless sources M_0 over an m-ary (finite) alphabet, that is, P(x_1^n) = p_1^{k_1} ... p_m^{k_m} with k_1 + ... + k_m = n, where the p_i are unknown.
Markov sources M_r over a binary alphabet of order r. Observe that for r = 1,
P(x_1^n) = p_{x_1} p_{00}^{k_{00}} p_{01}^{k_{01}} p_{10}^{k_{10}} p_{11}^{k_{11}},
where k_{ij} is the number of pairs (x_k, x_{k+1}) = (i,j) ∈ {0,1}², k_{00} + k_{01} + k_{10} + k_{11} = n − 1, and k_{01} = k_{10} if x_1 = x_n, while k_{01} = k_{10} ± 1 if x_1 ≠ x_n.
Renewal sources R_0, where a 1 is introduced after a run of 0s distributed according to some distribution.

59 Improved Shtarkov Bounds
For the maximal minimax redundancy define the maximum likelihood distribution
Q*(x_1^n) := sup_{P∈S} P(x_1^n) / Σ_{y_1^n∈A^n} sup_{P∈S} P(y_1^n).
Observe that
R*_n(S) = min_{C_n∈C} max_{x_1^n} ( L(C_n, x_1^n) + sup_{P∈S} lg P(x_1^n) )
= min_{C_n∈C} max_{x_1^n} [ L(C_n, x_1^n) + lg Q*(x_1^n) + lg Σ_{y_1^n∈A^n} sup_{P∈S} P(y_1^n) ]
= R_n^{GS}(Q*) + lg Σ_{y_1^n∈A^n} sup_{P∈S} P(y_1^n),
where R_n^{GS}(Q*) is the maximal redundancy of a generalized Shannon code built for the (known) distribution Q*. We also write
D_n(S) = lg Σ_{x_1^n∈A^n} sup_{P∈S} P(x_1^n) := lg d_n(S).

60 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources 5. Minimax Redundancy: Universal (unknown) Sources Universal Memoryless Sources Universal Markov Sources Universal Renewal Sources

61 Maximal Minimax for Memoryless Sources
We first consider the maximal minimax redundancy R*_n(M_0) for the class of memoryless sources over a finite m-ary alphabet. Observe that
d_n(M_0) = Σ_{x_1^n} sup_{p_1,...,p_m} p_1^{k_1} ... p_m^{k_m}
= Σ_{k_1+...+k_m=n} C(n; k_1,...,k_m) sup_{p_1,...,p_m} p_1^{k_1} ... p_m^{k_m}
= Σ_{k_1+...+k_m=n} C(n; k_1,...,k_m) (k_1/n)^{k_1} ... (k_m/n)^{k_m}.
The summation set is I(k_1,...,k_m) = { (k_1,...,k_m) : k_1 + ... + k_m = n }; the number N_k of sequences of type k = (k_1,...,k_m) is N_k = C(n; k_1,...,k_m); and the (unnormalized) maximized likelihood is
sup_{p_1,...,p_m} p_1^{k_1} ... p_m^{k_m} = (k_1/n)^{k_1} ... (k_m/n)^{k_m}.

62 Generating Function for d_n(M_0)
We write
d_n(M_0) = ( n!/n^n ) Σ_{k_1+...+k_m=n} ( k_1^{k_1}/k_1! ) ... ( k_m^{k_m}/k_m! ).
Let us introduce the tree-generating function
B(z) = Σ_{k≥0} (k^k/k!) z^k = 1/(1 − T(z)),
where T(z) satisfies T(z) = z e^{T(z)} (= −W(−z), Lambert's W function) and
T(z) = Σ_{k≥1} (k^{k−1}/k!) z^k
enumerates all rooted labeled trees. Let now
D_m(z) = Σ_{n≥0} (n^n/n!) d_n(M_0) z^n.
Then, by the convolution formula, D_m(z) = [B(z)]^m.

63 Asymptotics
The function B(z) has an algebraic singularity at z = e^{−1}, and
B(z) = 1/√(2(1 − ez)) + O(1).
Singularity analysis yields
[z^n] (1 − ez)^{−1/2} = ( e^n/√(πn) ) ( 1 − 1/(8n) + O(1/n²) ),
[z^n] √(1 − ez) = −( e^n/(2√(πn³)) ) ( 1 + O(1/n) ),
[z^n] (1 − ez)^{−1} = e^n,
n^n/n! = ( e^n/√(2πn) ) ( 1 − 1/(12n) + O(1/n²) ),
which leads to (cf. Clarke & Barron, 1990; W.S., 1998)
lg d_n(M_0) = ((m−1)/2) lg(n/2) + lg( √π/Γ(m/2) ) + ( Γ(m/2) m / (3 Γ(m/2 − 1/2)) ) · √2/√n + ( (3 + m(m−2)(2m+1))/36 − Γ²(m/2) m² / (9 Γ²(m/2 − 1/2)) ) · 1/n + ...
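As a sanity check (ours, not from the slides), for a binary alphabet (m = 2) the exact d_n(M_0) can be computed directly and compared with the leading terms, which reduce to (1/2) lg(nπ/2):

```python
# d_n = sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k), with 0^0 taken as 1.
from math import comb, log2, pi

def d_n(n):
    total = 0.0
    for k in range(n + 1):
        term = comb(n, k)
        if k > 0:
            term *= (k / n) ** k
        if k < n:
            term *= ((n - k) / n) ** (n - k)
        total += term
    return total

for n in (10, 100, 1000):
    exact = log2(d_n(n))
    approx = 0.5 * log2(n * pi / 2)      # (m-1)/2 lg(n/2) + lg(sqrt(pi)/Gamma(m/2)) for m = 2
    print(n, round(exact, 4), round(approx, 4))
```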

64 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources 5. Minimax Redundancy: Universal (unknown) Sources Universal Memoryless Sources Universal Markov Sources Universal Renewal Sources

65 Maximal Minimax for Markov Sources
Assume (i) M_1 is a Markov source of order r = 1, (ii) the transition matrix is P = {p_{ij}}_{i,j=1}^m, and (iii) we deal with circular sequences (the first symbol follows the last). Then
d_n(M_1) = Σ_{x_1^n} sup_P p_{11}^{k_{11}} ... p_{mm}^{k_{mm}} = Σ_{k∈F_n} M_k ( k_{11}/k_1 )^{k_{11}} ... ( k_{mm}/k_m )^{k_{mm}},
where k_{ij} is the number of pairs ij ∈ A² in x_1^n, k_i = Σ_{j=1}^m k_{ij}, and k = {k_{ij}}_{i,j=1}^m is an integer matrix such that
F_n : Σ_{i,j=1}^m k_{ij} = n and Σ_{j=1}^m k_{ij} = Σ_{j=1}^m k_{ji} for every i.
A matrix k satisfying the above conditions is called a frequency matrix or Markov type; M_k is the number of strings x_1^n of type k. We come back to Markov types soon.

66 Main Technical Tool
Let g_k be a sequence of scalars indexed by matrices k, let g(z) = Σ_k g_k z^k be its regular generating function, and let
Fg(z) = Σ_{k∈F} g_k z^k = Σ_{n≥0} Σ_{k∈F_n} g_k z^k
be the F-generating function of g_k, restricted to k ∈ F.
Lemma 8. Let g(z) = Σ_k g_k z^k. Then
Fg(z) := Σ_{n≥0} Σ_{k∈F_n} g_k z^k = ( 1/(2πi) )^m ∮ ... ∮ g( [z_{ij} x_j/x_i] ) dx_1/x_1 ... dx_m/x_m,
where [z_{ij} x_j/x_i] is the matrix whose ij-th entry is z_{ij} x_j/x_i.
Proof. It suffices to observe that
g( [z_{ij} x_j/x_i] ) = Σ_k g_k z^k Π_{i=1}^m x_i^{ Σ_j k_{ji} − Σ_j k_{ij} },
so Fg(z) is the coefficient of g( [z_{ij} x_j/x_i] ) at x_1^0 x_2^0 ... x_m^0.

67 Main Results
Theorem 6. Let M_1 be a Markov source over an m-ary alphabet. Then
d_n(M_1) = ( n/(2π) )^{m(m−1)/2} A_m ( 1 + O(1/n) ),
where A_m is a constant given by an explicit integral over K(1) = { y_{ij} ≥ 0 : Σ_{i,j} y_{ij} = 1 } of a polynomial expression F_m(y_{ij}) of degree m − 1. In particular, for m = 2,
A_2 = 16 × Catalan, where Catalan is Catalan's constant Σ_{i≥0} (−1)^i/(2i+1)².
Theorem 7. Let M_r be a Markov source of order r. Then
d_n(M_r) = ( n/(2π) )^{m^r(m−1)/2} A_m^r ( 1 + O(1/n) ),
where A_m^r is a constant defined in a similar fashion as A_m above.

68 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources 5. Minimax Redundancy: Universal (unknown) Sources Universal Memoryless Sources Universal Markov Sources Universal Renewal Sources

69 Renewal Sources
The renewal process is defined as follows: let T_1, T_2, ... be a sequence of i.i.d. positive-valued random variables with distribution Q(j) = Pr{T_i = j}. The process T_0, T_0 + T_1, T_0 + T_1 + T_2, ... is called a renewal process.
With a renewal process we associate a binary renewal sequence in which the positions of the 1s are at the renewal epochs T_0, T_0 + T_1, ... (runs of zeros in between). We start with x_0 = 1.
Csiszár and Shields (1996) proved that R*_n(R_0) = Θ(√n). We prove the following result.
Theorem 8 (Flajolet and W.S., 1998). Consider the class of renewal processes as defined above. Then
R*_n(R_0) = (2/log 2) √(cn) + O(log n), where c = π²/6 − 1.

70 Maximal Minimax Redundancy
For a sequence x_0^n = 1 0^{α_1} 1 0^{α_2} ... 1 0^{α_k} 0^{k*} (the last k* symbols form an incomplete run of 0s), let k_m be the number of i such that α_i = m. Then
P(x_1^n) = Q(0)^{k_0} Q(1)^{k_1} ... Q(n−1)^{k_{n−1}} Pr{T_1 > k*}.
It can be proved that
r_{n+1} − 1 ≤ d_n(R_0) ≤ Σ_{m=0}^n r_m,
where
r_n = Σ_{k=0}^n r_{n,k},  r_{n,k} = Σ_{P(n,k)} C(k; k_0,...,k_{n−1}) (k_0/k)^{k_0} (k_1/k)^{k_1} ... (k_{n−1}/k)^{k_{n−1}},
and P(n,k) is the summation set, which happens to be the set of partitions of n into k terms, i.e., n = k_0 + 2k_1 + ... + n k_{n−1} with k = k_0 + ... + k_{n−1}.
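For small n, r_n can be computed by brute force over integer partitions (our sketch; it uses SymPy's partition iterator and compares only against the leading term (2/log 2)√(cn), so lower-order corrections are still visible at these sizes):

```python
# r_n = sum over partitions of n of C(k; k_0, ...) * prod_i (k_i/k)^{k_i},
# where k_i is the multiplicity of part i+1 and k is the number of parts.
from math import factorial, log, log2, sqrt, pi
from sympy.utilities.iterables import partitions

def r_n(n):
    total = 0.0
    for part in partitions(n):              # dict: part size -> multiplicity
        k = sum(part.values())              # k = k_0 + ... + k_{n-1}
        multinom = factorial(k)
        prod = 1.0
        for mult in part.values():
            multinom //= factorial(mult)    # builds C(k; k_0, ..., k_{n-1})
            prod *= (mult / k) ** mult
        total += multinom * prod
    return total

c = pi**2 / 6 - 1
for n in (10, 20, 30, 40):
    print(n, round(log2(r_n(n)), 3), round(2 / log(2) * sqrt(c * n), 3))
```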

71 Main Results
Theorem 9 (Flajolet and W.S., 1998). Consider the class of renewal processes defined above. The quantity r_n attains the following asymptotics:
lg r_n = (2/log 2) √(cn) − (5/8) lg n + (1/2) lg lg n + O(1), where c = π²/6 − 1.
The asymptotic analysis is sophisticated and follows these steps:
first, we transform r_n into another quantity s_n that we know how to handle, and (using a probabilistic technique) we know how to read back results for r_n from s_n;
we use combinatorial calculus to find the generating function of s_n, which turns out to be an infinite product of the tree functions B(z) defined above;
we transform this product into a harmonic sum that can be analyzed asymptotically by the Mellin transform;
we obtain an asymptotic expansion of the generating function around z = 1, which is the starting point for extracting the asymptotics of the coefficients;
finally, we estimate R*_n(R_0) by the saddle point method.

72 Asymptotics: The Main Idea
The quantity r_n is too hard to analyze directly due to the factor k!/k^k, hence we define a new quantity s_n:
s_n = Σ_{k=0}^n s_{n,k},  s_{n,k} = e^{−k} Σ_{P(n,k)} ( k_0^{k_0}/k_0! ) ... ( k_{n−1}^{k_{n−1}}/k_{n−1}! ).
To analyze it, we introduce the random variable K_n defined by
Pr{K_n = k} = s_{n,k}/s_n.
Stirling's formula yields
r_n/s_n = Σ_{k=0}^n ( r_{n,k}/s_{n,k} ) ( s_{n,k}/s_n ) = E[ K_n! K_n^{−K_n} e^{K_n} ] = E[ √(2πK_n) ] + O( E[K_n^{−1/2}] ).

73 Fundamental Lemmas
Lemma 9. Let μ_n = E[K_n] and σ_n² = Var(K_n). Then
s_n = exp( 2√(cn) − (7/8) log n + d + o(1) ),
μ_n = (1/4) √(n/c) log(n/c) + o(√n),
σ_n² = O(n log n) = o(μ_n²),
where c = π²/6 − 1 and d = log 2 − (3/8) log c − (3/4) log π.
Lemma 10. For large n,
E[√K_n] = μ_n^{1/2} (1 + o(1)) and E[K_n^{−1/2}] = o(1),
where μ_n = E[K_n]. Thus
r_n = s_n E[ √(2πK_n) ] (1 + o(1)) = s_n √(2πμ_n) (1 + o(1)).

74 Sketch of a Proof: Generating Functions
1. Define the function β(z) as
β(z) = Σ_{k≥0} (k^k/k!) e^{−k} z^k.
One has (e.g., by Lagrange inversion or otherwise) β(z) = 1/(1 − T(z e^{−1})).
2. Define
S_n(u) = Σ_{k≥0} s_{n,k} u^k,  S(z, u) = Σ_{n≥0} S_n(u) z^n.
Since s_{n,k} involves convolutions of sequences of the form k^k/k!, we have
S(z, u) = Σ_{P(n,k)} z^{1·k_0 + 2·k_1 + ...} (u/e)^{k_0 + ... + k_{n−1}} ( k_0^{k_0}/k_0! ) ... ( k_{n−1}^{k_{n−1}}/k_{n−1}! ) = Π_{i≥1} β(z^i u).
We need to compute s_n = [z^n] S(z, 1), where [z^n] F(z) denotes the coefficient of z^n in F(z).

75 Mellin Asymptotics
3. Let L(z) = log S(z, 1) and z = e^{−t}, so that
L(e^{−t}) = Σ_{k≥1} log β(e^{−kt}).
Mellin transform techniques provide an expansion of L(e^{−t}) around t = 0 (or equivalently z = 1), since the sum falls under the harmonic sum paradigm.
4. The Mellin transform L*(s) = M(L(e^{−t}); s) of L(e^{−t}) is computed by the harmonic sum property (M3). For R(s) ∈ (1, ∞), the transform evaluates to
L*(s) = ζ(s) Λ(s),
where ζ(s) = Σ_{n≥1} n^{−s} is the Riemann zeta function and Λ(s) = ∫_0^∞ log β(e^{−t}) t^{s−1} dt.
This leads to the singular expansions
L*(s) ≍ Λ(1)/(s − 1) near s = 1, and L*(s) ≍ −1/(4s²) − (log π)/(4s) near s = 0.

76 What's Next?
5. An application of the converse mapping property (M4) allows us to come back to the original function:
L(e^{−t}) = Λ(1)/t + (1/4) log t − (1/4) log π + O(√t),
which translates into
L(z) = Λ(1)/(1 − z) + (1/4) log(1 − z) − (1/4) log π − (1/2) Λ(1) + O(√(1 − z)),
where
c = Λ(1) = −∫_0^1 log( 1 − T(x/e) ) dx/x = π²/6 − 1.
6. In summary, we have just proved that, as z → 1,
S(z, 1) = e^{L(z)} = a (1 − z)^{1/4} exp( c/(1 − z) ) ( 1 + o(1) ),
where a = exp( −(1/4) log π − (1/2) c ).
7. To extract the asymptotics we need to apply the saddle point method.

77 Saddle Point Method
Lemma 11. For positive A > 0 and reals B and C, define f(z) = f_{A,B,C}(z) as
f(z) = exp( A/(1 − z) + B log( 1/(1 − z) ) + C log( (1/(1 − z)) log(1/(1 − z)) ) ).
Then
[z^n] f_{A,B,C}(z) = exp( 2√(An) + (B/2 − 3/4) log(n/A) + C log log(n/A) + d_{A,B,C} + o(1) ),
for an explicit constant d_{A,B,C}.
Proof: We start with Cauchy's formula
[z^n] f(z) = (1/(2πi)) ∮ e^{h(z)} dz, where h(z) = log f_{A,B,C}(z) − (n + 1) log z.
The saddle point r, defined by h'(r) = 0, is asymptotically r = 1 − √(A/n) + O(1/n), and
h(r) = 2√(An) + (B/2) log(n/A) + C log log(n/A) + O(1).

78 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources 5. Minimax Redundancy: Universal (unknown) Sources 6. Appendix A: Mellin Transform 7. Appendix B: Saddle Point Method

79 Appendix A: Mellin Properties
(M1) DIRECT AND INVERSE MELLIN TRANSFORMS. Let c belong to the fundamental strip defined below. Then
f*(s) := M(f(x); s) = ∫_0^∞ f(x) x^{s−1} dx and f(x) = (1/(2πi)) ∫_{c−i∞}^{c+i∞} f*(s) x^{−s} ds.
(M2) FUNDAMENTAL STRIP. The Mellin transform of f(x) exists in the fundamental strip R(s) ∈ (−α, −β), where f(x) = O(x^α) as x → 0 and f(x) = O(x^β) as x → ∞.
(M3) HARMONIC SUM PROPERTY. By linearity and the scale rule M(f(ax); s) = a^{−s} M(f(x); s),
if f(x) = Σ_{k≥0} λ_k g(μ_k x), then f*(s) = g*(s) Σ_{k≥0} λ_k μ_k^{−s}.

80 (M4) MAPPING PROPERTIES (asymptotic expansion of f(x) and singularities of f*(s)). If
f(x) = Σ_{(ξ,k)∈A} c_{ξ,k} x^ξ (log x)^k + O(x^M),
then
f*(s) ≍ Σ_{(ξ,k)∈A} c_{ξ,k} (−1)^k k! / (s + ξ)^{k+1}.
(i) Direct mapping. Assume that f(x) admits, as x → 0^+, the asymptotic expansion above for some −M < −α and k > 0. Then for R(s) ∈ (−M, −β), the transform f*(s) satisfies the singular expansion above.
(ii) Converse mapping. Assume that f*(s) = O(|s|^{−r}) with r > 1 as |s| → ∞, and that f*(s) admits the singular expansion above for R(s) ∈ (−M, −α). Then f(x) satisfies the asymptotic expansion above at x = 0^+.
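For illustration (ours, not from the slides), SymPy can verify (M1) and the scale rule underlying (M3):

```python
from sympy import symbols, exp, mellin_transform

x, s = symbols('x s', positive=True)

# (M1): the Mellin transform of e^{-x} is Gamma(s), with fundamental strip Re(s) in (0, oo).
print(mellin_transform(exp(-x), x, s))      # (gamma(s), (0, oo), True)

# Scale rule behind (M3): M(f(a x); s) = a^{-s} f*(s), here with f(x) = e^{-x} and a = 3.
print(mellin_transform(exp(-3*x), x, s))    # (3**(-s)*gamma(s), (0, oo), True)
```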

81 Outline Update 1. Shannon Information Theory: Three Theorems of Shannon 2. Glance at Channel Coding Theorem 3. Source Coding 4. Redundancy: Known Sources 5. Minimax Redundancy: Universal (unknown) Sources 6. Appendix A: Mellin Transform 7. Appendix B: Saddle Point Method

82 Appendix B: Saddle Point Method
Input: A function g(z) analytic in |z| < R (0 < R ≤ +∞) with nonnegative Taylor coefficients and fast growth as z → R^−. Let h(z) := log g(z) − (n + 1) log z.
Output: An asymptotic formula for g_n := [z^n] g(z), derived from the Cauchy coefficient integral
g_n = (1/(2πi)) ∮_γ g(z) dz/z^{n+1} = (1/(2πi)) ∮_γ e^{h(z)} dz,
where γ is a loop around z = 0.
(S1) SADDLE POINT CONTOUR. Require that g'(z)/g(z) → +∞ as z → R^−. Let r = r(n) be the unique positive root of the saddle point equation
h'(r) = 0, i.e., r g'(r)/g(r) = n + 1,
so that r → R as n → ∞. The integral above is evaluated on γ = { z : |z| = r }.

83 (S2) BASIC SPLIT. Require that h'''(r)^{1/3}/h''(r)^{1/2} → 0. Define ϕ = ϕ(n), called the range of the saddle point, by
ϕ = h'''(r)^{−1/6} h''(r)^{−1/4},
so that ϕ → 0, h''(r)ϕ² → ∞, and h'''(r)ϕ³ → 0. Split γ = γ_0 ∪ γ_1, where
γ_0 = { z ∈ γ : |arg(z)| ≤ ϕ },  γ_1 = { z ∈ γ : |arg(z)| ≥ ϕ }.
(S3) ELIMINATION OF TAILS. Require that |g(re^{iθ})| ≤ |g(re^{iϕ})| on γ_1. Then the tail integral satisfies the trivial bound
| ∫_{γ_1} e^{h(z)} dz | = O( |e^{h(re^{iϕ})}| ).

84 (S4) LOCAL APPROXIMATION. Require that
h(re^{iθ}) − h(r) + (1/2) r²θ² h''(r) = O( h'''(r)ϕ³ ) on γ_0.
Then the central integral is asymptotic to a complete Gaussian integral, and
(1/(2πi)) ∫_{γ_0} e^{h(z)} dz = ( g(r) r^{−n} / √(2π h''(r)) ) ( 1 + O(h'''(r)ϕ³) ).
(S5) COLLECTION. Requirements (S1), (S2), (S3), (S4) imply the estimate
[z^n] g(z) = ( g(r) r^{−n} / √(2π h''(r)) ) ( 1 + O(h'''(r)ϕ³) ) ~ g(r) r^{−n} / √(2π h''(r)).
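A minimal worked instance of the recipe (ours, not from the slides) for g(z) = e^z, whose coefficients [z^n] e^z = 1/n! are known exactly; the saddle point r = n + 1 solves h'(r) = 0 for h(z) = z − (n+1) log z:

```python
from math import exp, log, sqrt, pi, factorial

def saddle_point_coeff(n):
    # h(z) = log g(z) - (n+1) log z = z - (n+1) log z;  h'(r) = 1 - (n+1)/r = 0  =>  r = n+1
    r = n + 1.0
    h = r - (n + 1) * log(r)
    h2 = (n + 1) / r**2                     # h''(r)
    return exp(h) / sqrt(2 * pi * h2)       # Gaussian (central) approximation of [z^n] e^z

for n in (5, 10, 50, 100):
    print(n, saddle_point_coeff(n) * factorial(n))   # ratio to the exact 1/n!; tends to 1
```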


More information

(Classical) Information Theory III: Noisy channel coding

(Classical) Information Theory III: Noisy channel coding (Classical) Information Theory III: Noisy channel coding Sibasish Ghosh The Institute of Mathematical Sciences CIT Campus, Taramani, Chennai 600 113, India. p. 1 Abstract What is the best possible way

More information

Lecture 15: Conditional and Joint Typicaility

Lecture 15: Conditional and Joint Typicaility EE376A Information Theory Lecture 1-02/26/2015 Lecture 15: Conditional and Joint Typicaility Lecturer: Kartik Venkat Scribe: Max Zimet, Brian Wai, Sepehr Nezami 1 Notation We always write a sequence of

More information

Communications Theory and Engineering

Communications Theory and Engineering Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 AEP Asymptotic Equipartition Property AEP In information theory, the analog of

More information

A Master Theorem for Discrete Divide and Conquer Recurrences

A Master Theorem for Discrete Divide and Conquer Recurrences A Master Theorem for Discrete Divide and Conquer Recurrences MICHAEL DRMOTA and WOJCIECH SZPANKOWSKI TU Wien and Purdue University Divide-and-conquer recurrences are one of the most studied equations in

More information

RENEWAL THEORY IN ANALYSIS OF TRIES AND STRINGS: EXTENDED ABSTRACT

RENEWAL THEORY IN ANALYSIS OF TRIES AND STRINGS: EXTENDED ABSTRACT RENEWAL THEORY IN ANALYSIS OF TRIES AND STRINGS: EXTENDED ABSTRACT SVANTE JANSON Abstract. We give a survey of a number of simple applications of renewal theory to problems on random strings, in particular

More information

10-704: Information Processing and Learning Fall Lecture 9: Sept 28

10-704: Information Processing and Learning Fall Lecture 9: Sept 28 10-704: Information Processing and Learning Fall 2016 Lecturer: Siheng Chen Lecture 9: Sept 28 Note: These notes are based on scribed notes from Spring15 offering of this course. LaTeX template courtesy

More information

Solutions to Homework Set #1 Sanov s Theorem, Rate distortion

Solutions to Homework Set #1 Sanov s Theorem, Rate distortion st Semester 00/ Solutions to Homework Set # Sanov s Theorem, Rate distortion. Sanov s theorem: Prove the simple version of Sanov s theorem for the binary random variables, i.e., let X,X,...,X n be a sequence

More information

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy Coding and Information Theory Chris Williams, School of Informatics, University of Edinburgh Overview What is information theory? Entropy Coding Information Theory Shannon (1948): Information theory is

More information

Lecture 11: Polar codes construction

Lecture 11: Polar codes construction 15-859: Information Theory and Applications in TCS CMU: Spring 2013 Lecturer: Venkatesan Guruswami Lecture 11: Polar codes construction February 26, 2013 Scribe: Dan Stahlke 1 Polar codes: recap of last

More information

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy Haykin_ch05_pp3.fm Page 207 Monday, November 26, 202 2:44 PM CHAPTER 5 Information Theory 5. Introduction As mentioned in Chapter and reiterated along the way, the purpose of a communication system is

More information

MTH4101 CALCULUS II REVISION NOTES. 1. COMPLEX NUMBERS (Thomas Appendix 7 + lecture notes) ax 2 + bx + c = 0. x = b ± b 2 4ac 2a. i = 1.

MTH4101 CALCULUS II REVISION NOTES. 1. COMPLEX NUMBERS (Thomas Appendix 7 + lecture notes) ax 2 + bx + c = 0. x = b ± b 2 4ac 2a. i = 1. MTH4101 CALCULUS II REVISION NOTES 1. COMPLEX NUMBERS (Thomas Appendix 7 + lecture notes) 1.1 Introduction Types of numbers (natural, integers, rationals, reals) The need to solve quadratic equations:

More information

Quiz 2 Date: Monday, November 21, 2016

Quiz 2 Date: Monday, November 21, 2016 10-704 Information Processing and Learning Fall 2016 Quiz 2 Date: Monday, November 21, 2016 Name: Andrew ID: Department: Guidelines: 1. PLEASE DO NOT TURN THIS PAGE UNTIL INSTRUCTED. 2. Write your name,

More information

Lecture 4 : Adaptive source coding algorithms

Lecture 4 : Adaptive source coding algorithms Lecture 4 : Adaptive source coding algorithms February 2, 28 Information Theory Outline 1. Motivation ; 2. adaptive Huffman encoding ; 3. Gallager and Knuth s method ; 4. Dictionary methods : Lempel-Ziv

More information

LECTURE 10. Last time: Lecture outline

LECTURE 10. Last time: Lecture outline LECTURE 10 Joint AEP Coding Theorem Last time: Error Exponents Lecture outline Strong Coding Theorem Reading: Gallager, Chapter 5. Review Joint AEP A ( ɛ n) (X) A ( ɛ n) (Y ) vs. A ( ɛ n) (X, Y ) 2 nh(x)

More information

Coding for Discrete Source

Coding for Discrete Source EGR 544 Communication Theory 3. Coding for Discrete Sources Z. Aliyazicioglu Electrical and Computer Engineering Department Cal Poly Pomona Coding for Discrete Source Coding Represent source data effectively

More information

Chapter 5: Data Compression

Chapter 5: Data Compression Chapter 5: Data Compression Definition. A source code C for a random variable X is a mapping from the range of X to the set of finite length strings of symbols from a D-ary alphabet. ˆX: source alphabet,

More information

4F5: Advanced Communications and Coding Handout 2: The Typical Set, Compression, Mutual Information

4F5: Advanced Communications and Coding Handout 2: The Typical Set, Compression, Mutual Information 4F5: Advanced Communications and Coding Handout 2: The Typical Set, Compression, Mutual Information Ramji Venkataramanan Signal Processing and Communications Lab Department of Engineering ramji.v@eng.cam.ac.uk

More information

Department of Mathematics, University of California, Berkeley. GRADUATE PRELIMINARY EXAMINATION, Part A Fall Semester 2016

Department of Mathematics, University of California, Berkeley. GRADUATE PRELIMINARY EXAMINATION, Part A Fall Semester 2016 Department of Mathematics, University of California, Berkeley YOUR 1 OR 2 DIGIT EXAM NUMBER GRADUATE PRELIMINARY EXAMINATION, Part A Fall Semester 2016 1. Please write your 1- or 2-digit exam number on

More information

Entropies & Information Theory

Entropies & Information Theory Entropies & Information Theory LECTURE I Nilanjana Datta University of Cambridge,U.K. See lecture notes on: http://www.qi.damtp.cam.ac.uk/node/223 Quantum Information Theory Born out of Classical Information

More information

CHAPTER 3. P (B j A i ) P (B j ) =log 2. j=1

CHAPTER 3. P (B j A i ) P (B j ) =log 2. j=1 CHAPTER 3 Problem 3. : Also : Hence : I(B j ; A i ) = log P (B j A i ) P (B j ) 4 P (B j )= P (B j,a i )= i= 3 P (A i )= P (B j,a i )= j= =log P (B j,a i ) P (B j )P (A i ).3, j=.7, j=.4, j=3.3, i=.7,

More information

PART III. Outline. Codes and Cryptography. Sources. Optimal Codes (I) Jorge L. Villar. MAMME, Fall 2015

PART III. Outline. Codes and Cryptography. Sources. Optimal Codes (I) Jorge L. Villar. MAMME, Fall 2015 Outline Codes and Cryptography 1 Information Sources and Optimal Codes 2 Building Optimal Codes: Huffman Codes MAMME, Fall 2015 3 Shannon Entropy and Mutual Information PART III Sources Information source:

More information

3F1: Signals and Systems INFORMATION THEORY Examples Paper Solutions

3F1: Signals and Systems INFORMATION THEORY Examples Paper Solutions Engineering Tripos Part IIA THIRD YEAR 3F: Signals and Systems INFORMATION THEORY Examples Paper Solutions. Let the joint probability mass function of two binary random variables X and Y be given in the

More information

EE5585 Data Compression January 29, Lecture 3. x X x X. 2 l(x) 1 (1)

EE5585 Data Compression January 29, Lecture 3. x X x X. 2 l(x) 1 (1) EE5585 Data Compression January 29, 2013 Lecture 3 Instructor: Arya Mazumdar Scribe: Katie Moenkhaus Uniquely Decodable Codes Recall that for a uniquely decodable code with source set X, if l(x) is the

More information

ECE Information theory Final (Fall 2008)

ECE Information theory Final (Fall 2008) ECE 776 - Information theory Final (Fall 2008) Q.1. (1 point) Consider the following bursty transmission scheme for a Gaussian channel with noise power N and average power constraint P (i.e., 1/n X n i=1

More information

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1 Kraft s inequality An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if N 2 l i 1 Proof: Suppose that we have a tree code. Let l max = max{l 1,...,

More information

Homework Set #2 Data Compression, Huffman code and AEP

Homework Set #2 Data Compression, Huffman code and AEP Homework Set #2 Data Compression, Huffman code and AEP 1. Huffman coding. Consider the random variable ( x1 x X = 2 x 3 x 4 x 5 x 6 x 7 0.50 0.26 0.11 0.04 0.04 0.03 0.02 (a Find a binary Huffman code

More information

An introduction to basic information theory. Hampus Wessman

An introduction to basic information theory. Hampus Wessman An introduction to basic information theory Hampus Wessman Abstract We give a short and simple introduction to basic information theory, by stripping away all the non-essentials. Theoretical bounds on

More information

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding

SIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding SIGNAL COMPRESSION Lecture 7 Variable to Fix Encoding 1. Tunstall codes 2. Petry codes 3. Generalized Tunstall codes for Markov sources (a presentation of the paper by I. Tabus, G. Korodi, J. Rissanen.

More information

7 Asymptotics for Meromorphic Functions

7 Asymptotics for Meromorphic Functions Lecture G jacques@ucsd.edu 7 Asymptotics for Meromorphic Functions Hadamard s Theorem gives a broad description of the exponential growth of coefficients in power series, but the notion of exponential

More information

Considering our result for the sum and product of analytic functions, this means that for (a 0, a 1,..., a N ) C N+1, the polynomial.

Considering our result for the sum and product of analytic functions, this means that for (a 0, a 1,..., a N ) C N+1, the polynomial. Lecture 3 Usual complex functions MATH-GA 245.00 Complex Variables Polynomials. Construction f : z z is analytic on all of C since its real and imaginary parts satisfy the Cauchy-Riemann relations and

More information

Conformal maps. Lent 2019 COMPLEX METHODS G. Taylor. A star means optional and not necessarily harder.

Conformal maps. Lent 2019 COMPLEX METHODS G. Taylor. A star means optional and not necessarily harder. Lent 29 COMPLEX METHODS G. Taylor A star means optional and not necessarily harder. Conformal maps. (i) Let f(z) = az + b, with ad bc. Where in C is f conformal? cz + d (ii) Let f(z) = z +. What are the

More information

or E ( U(X) ) e zx = e ux e ivx = e ux( cos(vx) + i sin(vx) ), B X := { u R : M X (u) < } (4)

or E ( U(X) ) e zx = e ux e ivx = e ux( cos(vx) + i sin(vx) ), B X := { u R : M X (u) < } (4) :23 /4/2000 TOPIC Characteristic functions This lecture begins our study of the characteristic function φ X (t) := Ee itx = E cos(tx)+ie sin(tx) (t R) of a real random variable X Characteristic functions

More information

Revision of Lecture 5

Revision of Lecture 5 Revision of Lecture 5 Information transferring across channels Channel characteristics and binary symmetric channel Average mutual information Average mutual information tells us what happens to information

More information

10-704: Information Processing and Learning Fall Lecture 10: Oct 3

10-704: Information Processing and Learning Fall Lecture 10: Oct 3 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 0: Oct 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy of

More information

Solutions to Homework Set #4 Differential Entropy and Gaussian Channel

Solutions to Homework Set #4 Differential Entropy and Gaussian Channel Solutions to Homework Set #4 Differential Entropy and Gaussian Channel 1. Differential entropy. Evaluate the differential entropy h(x = f lnf for the following: (a Find the entropy of the exponential density

More information

Notes by Zvi Rosen. Thanks to Alyssa Palfreyman for supplements.

Notes by Zvi Rosen. Thanks to Alyssa Palfreyman for supplements. Lecture: Hélène Barcelo Analytic Combinatorics ECCO 202, Bogotá Notes by Zvi Rosen. Thanks to Alyssa Palfreyman for supplements.. Tuesday, June 2, 202 Combinatorics is the study of finite structures that

More information

EE5585 Data Compression May 2, Lecture 27

EE5585 Data Compression May 2, Lecture 27 EE5585 Data Compression May 2, 2013 Lecture 27 Instructor: Arya Mazumdar Scribe: Fangying Zhang Distributed Data Compression/Source Coding In the previous class we used a H-W table as a simple example,

More information

LECTURE 3. Last time:

LECTURE 3. Last time: LECTURE 3 Last time: Mutual Information. Convexity and concavity Jensen s inequality Information Inequality Data processing theorem Fano s Inequality Lecture outline Stochastic processes, Entropy rate

More information

Lecture 8: Shannon s Noise Models

Lecture 8: Shannon s Noise Models Error Correcting Codes: Combinatorics, Algorithms and Applications (Fall 2007) Lecture 8: Shannon s Noise Models September 14, 2007 Lecturer: Atri Rudra Scribe: Sandipan Kundu& Atri Rudra Till now we have

More information

Lecture 14 February 28

Lecture 14 February 28 EE/Stats 376A: Information Theory Winter 07 Lecture 4 February 8 Lecturer: David Tse Scribe: Sagnik M, Vivek B 4 Outline Gaussian channel and capacity Information measures for continuous random variables

More information

18.310A Final exam practice questions

18.310A Final exam practice questions 18.310A Final exam practice questions This is a collection of practice questions, gathered randomly from previous exams and quizzes. They may not be representative of what will be on the final. In particular,

More information

EE376A - Information Theory Midterm, Tuesday February 10th. Please start answering each question on a new page of the answer booklet.

EE376A - Information Theory Midterm, Tuesday February 10th. Please start answering each question on a new page of the answer booklet. EE376A - Information Theory Midterm, Tuesday February 10th Instructions: You have two hours, 7PM - 9PM The exam has 3 questions, totaling 100 points. Please start answering each question on a new page

More information

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm 1. Feedback does not increase the capacity. Consider a channel with feedback. We assume that all the recieved outputs are sent back immediately

More information