Analytic Information Theory: From Shannon to Knuth and Back. Knuth80: Piteå, Sweden, 2018. Dedicated to Don E. Knuth


1 Analytic Information Theory: From Shannon to Knuth and Back
Wojciech Szpankowski, Center for Science of Information, Purdue University
January 7, 2018. Knuth80: Piteå, Sweden, 2018. Dedicated to Don E. Knuth.
Joint work with M. Drmota, P. Flajolet, P. Jacquet, M. Weinberger.

2 Outline
1. Shannon & Knuth Legacy
2. Huffman Code and Its Redundancy
3. Universal Codes and Lambert's Function
4. Graph Compression and Knuth's Recurrences
Algorithms: are at the heart of virtually all computing technologies;
Combinatorics: provides indispensable tools for finding patterns and structures;
Information: is a measure of distinguishability.

3 Shannon Legacy: Information Theory
Theorems 1 & 3. [Shannon 1948; Lossless & Lossy Data Compression]
compression bit rate ≥ source entropy $H(X)$; for distortion level $D$: lossy bit rate ≥ rate distortion function $R(D)$.
Theorem 2. [Shannon 1948; Channel Coding] In Shannon's words: It is possible to send information at the capacity through the channel with as small a frequency of errors as desired by proper (long) encoding. This statement is not true for any rate greater than the capacity.

4 Knuth's Legacy: Analytic Combinatorics
Following Hadamard's precept¹, analytic combinatorics applies techniques of complex analysis (e.g., generating functions, combinatorial calculus, Rice's formula, Mellin transform, Fourier series, sequences distributed modulo 1, saddle point methods, analytic poissonization and depoissonization, and singularity analysis) to analyze algorithms (and combinatorial structures). D. E. Knuth initiated it in the 1970s, while Flajolet and his followers developed it through the next three decades, culminating in the publication of the Flajolet-Sedgewick magnum opus in 2009, which defines the field and has stimulated a blossoming research area since.
In his 1997 Shannon Lecture, Jacob Ziv presented compelling arguments for backing off from first-order asymptotics in information theory. The program that applies complex-analytic tools to information theory constitutes analytic information theory.
¹ The shortest path between two truths on the real line passes through the complex plane.

5 Outline Update
1. Shannon & Knuth Legacy
2. Huffman Code and Its Redundancy
3. Universal Codes and Lambert's Function
4. Graph Compression and Knuth's Recurrences

6 Source Coding vel Data Compression
A source code is a bijective mapping $C : \mathcal{A}^* \to \{0,1\}^*$ from sequences over the alphabet $\mathcal{A}$ to the set $\{0,1\}^*$ of binary sequences. The basic problem of source coding (i.e., data compression) is to find codes with shortest descriptions either on average or for individual sequences.

7 Source Coding vel Data Compression
A source code is a bijective mapping $C : \mathcal{A}^* \to \{0,1\}^*$ from sequences over the alphabet $\mathcal{A}$ to the set $\{0,1\}^*$ of binary sequences. The basic problem of source coding (i.e., data compression) is to find codes with shortest descriptions either on average or for individual sequences.
For a probabilistic source model $S$ and a code $C_n$ we let:
$P(x_1^n)$ be the probability of $x_1^n = x_1 \ldots x_n$;
$L(C_n, x_1^n)$ be the code length for $x_1^n$;
Entropy $H_n(P) = -\sum_{x_1^n} P(x_1^n) \log_2 P(x_1^n)$.
Prefix Codes: no codeword is a prefix of another codeword (Kraft's inequality).

8 Source Coding vel Data Compression
A source code is a bijective mapping $C : \mathcal{A}^* \to \{0,1\}^*$ from sequences over the alphabet $\mathcal{A}$ to the set $\{0,1\}^*$ of binary sequences. The basic problem of source coding (i.e., data compression) is to find codes with shortest descriptions either on average or for individual sequences.
For a probabilistic source model $S$ and a code $C_n$ we let:
$P(x_1^n)$ be the probability of $x_1^n = x_1 \ldots x_n$;
$L(C_n, x_1^n)$ be the code length for $x_1^n$;
Entropy $H_n(P) = -\sum_{x_1^n} P(x_1^n) \log_2 P(x_1^n)$.
Prefix Codes: no codeword is a prefix of another codeword (Kraft's inequality).
Shannon First Theorem: For any prefix code the average code length $E[L(C_n, X_1^n)]$ cannot be smaller than the entropy of the source $H_n(P)$:
$$E[L(C_n, X_1^n)] \ge H_n(P).$$
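
To keep these definitions concrete, here is a small Python sketch (my own illustration, not part of the talk) that computes the block entropy $H_n(P)$ of a binary memoryless source and checks that the lengths $\lceil -\log_2 P(x_1^n)\rceil$ satisfy Kraft's inequality and average out between $H_n(P)$ and $H_n(P)+1$; the block length n = 8 and bias p = 0.3 are arbitrary choices.

```python
from itertools import product
from math import log2, ceil

def block_probs(n, p):
    """Probabilities of all binary blocks x_1^n for a memoryless source
    that emits a 1 with probability p."""
    return {x: p**sum(x) * (1 - p)**(n - sum(x)) for x in product((0, 1), repeat=n)}

n, p = 8, 0.3
P = block_probs(n, p)
H = -sum(q * log2(q) for q in P.values())            # block entropy H_n(P)
lengths = {x: ceil(-log2(q)) for x, q in P.items()}  # Shannon code lengths
kraft = sum(2.0 ** -l for l in lengths.values())     # Kraft sum, must be <= 1
avg_len = sum(P[x] * lengths[x] for x in P)
print(f"H_n(P) = {H:.4f} bits, E[L] = {avg_len:.4f}, Kraft sum = {kraft:.4f}")
```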

9 Redundancy: Rate of Convergence
Known Source $P$: Assume that $P$ is known to us. It is known that the shortest code length $L_{op}(x_1^n) \approx -\log P(x_1^n)$. Thus, the pointwise redundancy $R_n(C_n, P; x_1^n)$ and the average redundancy $\bar{R}_n(C_n, P)$ are defined as
$$R_n(C_n, P; x_1^n) = L(C_n, x_1^n) - (-\log_2 P(x_1^n)),$$
$$\bar{R}_n(C_n, P) = E[L(C_n, X_1^n)] - H_n(P) \ \ge 0.$$
The maximal or worst case redundancy is
$$R_n^*(C_n, P) = \max_{x_1^n}\{R_n(C_n, P; x_1^n)\} \ (\ge 0).$$
Huffman Code (1952):
$$\bar{R}_n(P) = \min_{C_n\in\mathcal{C}} E_{x_1^n}[L(C_n, x_1^n) + \log_2 P(x_1^n)].$$

10 Redundancy: Rate of Convergence
Known Source $P$: Assume that $P$ is known to us. It is known that the shortest code length $L_{op}(x_1^n) \approx -\log P(x_1^n)$. Thus, the pointwise redundancy $R_n(C_n, P; x_1^n)$ and the average redundancy $\bar{R}_n(C_n, P)$ are defined as
$$R_n(C_n, P; x_1^n) = L(C_n, x_1^n) - (-\log_2 P(x_1^n)),$$
$$\bar{R}_n(C_n, P) = E[L(C_n, X_1^n)] - H_n(P) \ \ge 0.$$
The maximal or worst case redundancy is
$$R_n^*(C_n, P) = \max_{x_1^n}\{R_n(C_n, P; x_1^n)\} \ (\ge 0).$$
Huffman Code (1952):
$$\bar{R}_n(P) = \min_{C_n\in\mathcal{C}} E_{x_1^n}[L(C_n, x_1^n) + \log_2 P(x_1^n)].$$
D. E. Knuth, Dynamic Huffman Coding, J. Algorithms, 1985.

11 Redundancy: Rate of Convergence
Known Source $P$: Assume that $P$ is known to us. It is known that the shortest code length $L_{op}(x_1^n) \approx -\log P(x_1^n)$. Thus, the pointwise redundancy $R_n(C_n, P; x_1^n)$ and the average redundancy $\bar{R}_n(C_n, P)$ are defined as
$$R_n(C_n, P; x_1^n) = L(C_n, x_1^n) - (-\log_2 P(x_1^n)),$$
$$\bar{R}_n(C_n, P) = E[L(C_n, X_1^n)] - H_n(P) \ \ge 0.$$
The maximal or worst case redundancy is
$$R_n^*(C_n, P) = \max_{x_1^n}\{R_n(C_n, P; x_1^n)\} \ (\ge 0).$$
Huffman Code (1952):
$$\bar{R}_n(P) = \min_{C_n\in\mathcal{C}} E_{x_1^n}[L(C_n, x_1^n) + \log_2 P(x_1^n)].$$
D. E. Knuth, Dynamic Huffman Coding, J. Algorithms, 1985.
Question: How does the average redundancy of the Huffman code behave asymptotically as $n \to \infty$?

12 Redundancy of the Huffman Code
[Figure 1: The average redundancy of Huffman codes versus block size n for: (a) irrational $\alpha = \log_2\frac{1-p}{p}$ with $p = 1/\pi$; (b) rational $\alpha = \log_2\frac{1-p}{p}$ with $p = 1/9$.]

13 Redundancy of the Huffman Code
[Figure 1: The average redundancy of Huffman codes versus block size n for: (a) irrational $\alpha = \log_2\frac{1-p}{p}$ with $p = 1/\pi$; (b) rational $\alpha = \log_2\frac{1-p}{p}$ with $p = 1/9$.]
Theorem 1 (W.S., 2000). Consider the Huffman block code of length $n$ over a binary memoryless source in which a 1 is transmitted with probability $p < \frac{1}{2}$. Then as $n \to \infty$
$$\bar{R}_n^H = \begin{cases} \dfrac{3}{2} - \dfrac{1}{\ln 2} + o(1) \approx 0.057, & \alpha = \log_2\frac{1-p}{p} \text{ irrational},\\[6pt] \dfrac{3}{2} - \dfrac{1}{M}\left(\langle \beta M n\rangle - \dfrac{1}{2}\right) - \dfrac{1}{M(1 - 2^{-1/M})}\, 2^{-\langle n\beta M\rangle/M} + O(\rho^n), & \alpha = \dfrac{N}{M}, \end{cases}$$
where $\gcd(N, M) = 1$, $\beta = -\log_2(1-p)$, $\langle x\rangle = x - \lfloor x\rfloor$, and $\rho < 1$.
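
The two modes of Theorem 1 can be probed numerically. The sketch below is my own illustration (the choice of p and the range of block sizes are arbitrary): it builds the Huffman code over all $2^n$ blocks of a memoryless source with $p = 1/\pi$ and prints the average redundancy next to the irrational-case limit $3/2 - 1/\ln 2 \approx 0.057$.

```python
import heapq
from math import log2, comb, log, pi

def huffman_avg_length(probs):
    """Average codeword length of a binary Huffman code for the given distribution."""
    heap = [(q, i) for i, q in enumerate(probs)]
    heapq.heapify(heap)
    total, next_id = 0.0, len(probs)
    while len(heap) > 1:
        q1, _ = heapq.heappop(heap)
        q2, _ = heapq.heappop(heap)
        total += q1 + q2              # each merge adds one bit to every leaf below it
        heapq.heappush(heap, (q1 + q2, next_id))
        next_id += 1
    return total

p = 1 / pi                            # alpha = log2((1-p)/p) is irrational
limit = 1.5 - 1 / log(2)              # 3/2 - 1/ln 2 = 0.0573..., the irrational-case limit
for n in range(4, 15):
    # 2^n block probabilities, grouped by the number of ones k
    probs = [p**k * (1 - p)**(n - k) for k in range(n + 1) for _ in range(comb(n, k))]
    Hn = -sum(q * log2(q) for q in probs)
    Rn = huffman_avg_length(probs) - Hn
    print(f"n={n:2d}  average redundancy = {Rn:.4f}   (limit {limit:.4f})")
```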

14 Why Two Modes: Shannon Code
To simplify, we consider the Shannon code that assigns the length
$$L(C_n^S, x_1^n) = \lceil -\log_2 P(x_1^n)\rceil \quad \text{where} \quad P(x_1^n) = p^k (1-p)^{n-k},$$
with $p$ being the known probability of generating 0 and $k$ the number of 0s. The Shannon code redundancy is
$$\bar{R}_n^S = \sum_{k=0}^{n}\binom{n}{k} p^k (1-p)^{n-k}\left(\lceil -\log_2 (p^k(1-p)^{n-k})\rceil + \log_2 (p^k(1-p)^{n-k})\right) = \sum_{k=0}^{n}\binom{n}{k} p^k (1-p)^{n-k}\langle \alpha k + \beta n\rangle$$
$$= \begin{cases} \dfrac{1}{2} + o(1), & \alpha = \log_2\frac{1-p}{p} \text{ irrational},\\[6pt] \dfrac{1}{2} - \dfrac{1}{M}\left(\langle M n\beta\rangle - \dfrac{1}{2}\right) + O(\rho^n), & \alpha = \dfrac{N}{M} \text{ rational}, \end{cases}$$
where $\langle x\rangle = x - \lfloor x\rfloor$ is the fractional part of $x$, and
$$\alpha = \log_2\frac{1-p}{p}, \qquad \beta = \log_2\frac{1}{1-p}.$$
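
A direct numerical evaluation of the Shannon-code redundancy (a sketch of my own, using the two values of p from Figure 1; nothing here is code from the talk) shows the two modes: convergence to 1/2 when $\alpha$ is irrational, persistent oscillation when $\alpha$ is rational.

```python
from math import comb, log2, ceil, pi

def shannon_code_redundancy(n, p):
    """Average redundancy of the Shannon code, E[ceil(-log2 P) + log2 P],
    for a binary memoryless source (P(x) depends only on the number of ones k)."""
    total = 0.0
    for k in range(n + 1):
        logP = k * log2(p) + (n - k) * log2(1 - p)
        total += comb(n, k) * 2.0**logP * (ceil(-logP) + logP)
    return total

for n in (10, 50, 200, 1000):
    print(f"n={n:5d}  p=1/pi (alpha irrational): {shannon_code_redundancy(n, 1/pi):.4f}"
          f"   p=1/9 (alpha rational): {shannon_code_redundancy(n, 1/9):.4f}")
```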

15 Sketch of Proof: Sequences Modulo 1
To analyze redundancy for known sources one needs to understand the asymptotic behavior of the following sum
$$\sum_{k=0}^{n}\binom{n}{k} p^k (1-p)^{n-k}\, f(\langle \alpha k + y\rangle)$$
for fixed $p$ and some Riemann integrable function $f : [0,1] \to \mathbb{R}$. The proof follows from the following two lemmas.
Lemma 1. Let $0 < p < 1$ be a fixed real number and $\alpha$ be an irrational number. Then for every Riemann integrable function $f : [0,1]\to\mathbb{R}$
$$\lim_{n\to\infty}\sum_{k=0}^{n}\binom{n}{k} p^k (1-p)^{n-k}\, f(\langle \alpha k + y\rangle) = \int_0^1 f(t)\,dt,$$
where the convergence is uniform for all shifts $y \in \mathbb{R}$.
Lemma 2. Let $\alpha = \frac{N}{M}$ be a rational number with $\gcd(N, M) = 1$. Then for every bounded function $f : [0,1]\to\mathbb{R}$
$$\sum_{k=0}^{n}\binom{n}{k} p^k (1-p)^{n-k}\, f(\langle \alpha k + y\rangle) = \frac{1}{M}\sum_{l=0}^{M-1} f\!\left(\frac{l}{M} + \frac{\langle M y\rangle}{M}\right) + O(\rho^n)$$
uniformly for all $y \in \mathbb{R}$ and some $\rho < 1$.

16 Outline Update
1. Shannon & Knuth Legacy
2. Huffman Code and Its Redundancy
3. Universal Codes and Lambert's Function
   Finite Alphabet
   Unbounded Alphabet
4. Graph Compression and Knuth's Recurrences

17 Minimax Redundancy For Unknown Sources
Following Davisson, the average and the maximal minimax redundancy for a family of sources $\mathcal{S}$ are:
$$\bar{R}_n(\mathcal{S}) = \min_{C_n}\sup_{P\in\mathcal{S}} E[L(C_n, X_1^n) + \log P(X_1^n)],$$
$$R_n^*(\mathcal{S}) = \min_{C_n}\sup_{P\in\mathcal{S}}\max_{x_1^n}[L(C_n, x_1^n) + \log P(x_1^n)].$$

18 Minimax Redundancy For Unknown Sources
Following Davisson, the average and the maximal minimax redundancy for a family of sources $\mathcal{S}$ are:
$$\bar{R}_n(\mathcal{S}) = \min_{C_n}\sup_{P\in\mathcal{S}} E[L(C_n, X_1^n) + \log P(X_1^n)],$$
$$R_n^*(\mathcal{S}) = \min_{C_n}\sup_{P\in\mathcal{S}}\max_{x_1^n}[L(C_n, x_1^n) + \log P(x_1^n)].$$
Shtarkov's Bound for the maximal minimax $R_n^*(\mathcal{S})$ (using the maximum likelihood distribution):
$$d_n(\mathcal{S}) := \log \sum_{x_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(x_1^n) \;\le\; R_n^*(\mathcal{S}) \;\le\; \log \underbrace{\sum_{x_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(x_1^n)}_{D_n(\mathcal{S})} + 1,$$
using the following maximum likelihood distribution
$$Q^*(x_1^n) := \frac{\sup_{P\in\mathcal{S}} P(x_1^n)}{\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n)}.$$
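
For the binary memoryless class the Shtarkov sum can be computed directly, since $\sup_P P(x_1^n)$ depends only on the number of ones in $x_1^n$. The following minimal sketch (illustration only, not from the talk) evaluates $\log_2 D_n$ and recalls the sandwich $\log_2 D_n \le R_n^* \le \log_2 D_n + 1$.

```python
from math import comb, log2

def shtarkov_sum_binary(n):
    """D_n for the class of binary memoryless sources:
    sup_p P(x_1^n) = (k/n)^k ((n-k)/n)^(n-k), where k is the number of ones."""
    return sum(comb(n, k) * (k / n)**k * ((n - k) / n)**(n - k) for k in range(n + 1))

for n in (10, 100, 1000):
    Dn = shtarkov_sum_binary(n)
    print(f"n={n:5d}  log2 D_n = {log2(Dn):.3f}  "
          f"(Shtarkov: log2 D_n <= R_n* <= log2 D_n + 1)")
```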

19 Maximal Minimax for Memoryless Sources
We shall analyze $D_n(\mathcal{M}_0)$ for the memoryless source class $\mathcal{M}_0$ over the alphabet $\mathcal{A} = \{1,2,\ldots,m\}$ with symbol probabilities $p_i$, $i = 1,\ldots,m$. Observe
$$P(x_1^n) = p_1^{k_1}\cdots p_m^{k_m}, \qquad k_1 + \cdots + k_m = n,$$
where $k_i$ is the number of occurrences of symbol $i$ in $x_1^n$. Since
$$\sup_{p_1,\ldots,p_m} P(x_1^n) = \sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m} = \left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m},$$
we find
$$D_n(\mathcal{M}_0) := \sum_{x_1^n}\sup P(x_1^n) = \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m} = \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m}.$$

20 Tree Generating Function for $D_n(\mathcal{M}_0)$
We write
$$D_n(\mathcal{M}_0) = \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m} = \frac{n!}{n^n}\sum_{k_1+\cdots+k_m=n}\frac{k_1^{k_1}}{k_1!}\cdots\frac{k_m^{k_m}}{k_m!}.$$
Let us introduce a tree-generating function
$$B(z) = \sum_{k=0}^{\infty}\frac{k^k}{k!}z^k = \frac{1}{1 - T(z)}, \qquad T(z) = \sum_{k=1}^{\infty}\frac{k^{k-1}}{k!}z^k,$$
where $T(z) = z e^{T(z)}$ ($= -W(-z)$, Lambert's $W$-function) enumerates all rooted labeled trees. Let now
$$D_m(z) = \sum_{n=0}^{\infty}\frac{z^n n^n}{n!}\, D_n(\mathcal{M}_0).$$
Then by the convolution formula: $D_m(z) = [B(z)]^m$.

21 Tree Generating Function for $D_n(\mathcal{M}_0)$
We write
$$D_n(\mathcal{M}_0) = \sum_{k_1+\cdots+k_m=n}\binom{n}{k_1,\ldots,k_m}\left(\frac{k_1}{n}\right)^{k_1}\cdots\left(\frac{k_m}{n}\right)^{k_m} = \frac{n!}{n^n}\sum_{k_1+\cdots+k_m=n}\frac{k_1^{k_1}}{k_1!}\cdots\frac{k_m^{k_m}}{k_m!}.$$
Let us introduce a tree-generating function
$$B(z) = \sum_{k=0}^{\infty}\frac{k^k}{k!}z^k = \frac{1}{1 - T(z)}, \qquad T(z) = \sum_{k=1}^{\infty}\frac{k^{k-1}}{k!}z^k,$$
where $T(z) = z e^{T(z)}$ ($= -W(-z)$, Lambert's $W$-function) enumerates all rooted labeled trees. Let now
$$D_m(z) = \sum_{n=0}^{\infty}\frac{z^n n^n}{n!}\, D_n(\mathcal{M}_0).$$
Then by the convolution formula: $D_m(z) = [B(z)]^m$.
D. E. Knuth and B. Pittel, A Recurrence Related to Trees, Proc. AMS, 1989.
D. E. Knuth, et al., On the Lambert W Function, Adv. Comp. Math., 1996.
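
The convolution identity can be checked numerically for small n and m. The sketch below is my own illustration (the function names are made up): it compares the direct multinomial sum for $D_n(\mathcal{M}_0)$ with $\frac{n!}{n^n}[z^n]B(z)^m$, obtained by convolving the coefficient sequence $k^k/k!$ with itself m times.

```python
from math import factorial
from itertools import product

def dn_direct(n, m):
    """D_n(M_0) = sum over k_1+...+k_m=n of multinomial(n;k) * prod (k_i/n)^k_i."""
    total = 0.0
    for ks in product(range(n + 1), repeat=m):
        if sum(ks) != n:
            continue
        mult, term = factorial(n), 1.0
        for k in ks:
            mult //= factorial(k)      # exact integer division at every step
            term *= (k / n)**k         # 0^0 = 1 by convention
        total += mult * term
    return total

def dn_via_tree_function(n, m):
    """D_n(M_0) = (n!/n^n) [z^n] B(z)^m with B(z) = sum_k (k^k/k!) z^k."""
    b = [k**k / factorial(k) for k in range(n + 1)]    # 0^0 = 1
    coeffs = [1.0] + [0.0] * n                         # B(z)^0 = 1
    for _ in range(m):                                 # multiply by B(z), truncated at z^n
        coeffs = [sum(coeffs[j] * b[i - j] for j in range(i + 1)) for i in range(n + 1)]
    return factorial(n) / n**n * coeffs[n]

for n, m in [(4, 2), (5, 3), (6, 4)]:
    print(n, m, round(dn_direct(n, m), 6), round(dn_via_tree_function(n, m), 6))
```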

22 Outline Update
1. Shannon & Knuth Legacy
2. Huffman Code and Its Redundancy
3. Universal Codes and Lambert's Function
   Finite Alphabet
   Unbounded Alphabet
4. Graph Compression and Knuth's Recurrences

23 Asymptotics for FINITE m
The function $B(z)$ has an algebraic singularity at $z = e^{-1}$, and
$$\beta(z) = B(z/e) = \frac{1}{\sqrt{2(1-z)}} + \frac{1}{3} + O(\sqrt{1-z}).$$
By Cauchy's coefficient formula
$$D_n(\mathcal{M}_0) = \frac{n!}{n^n}[z^n][B(z)]^m = \sqrt{2\pi n}\,(1 + O(1/n))\,\frac{1}{2\pi i}\oint \frac{\beta(z)^m}{z^{n+1}}\,dz.$$
For finite $m$, the singularity analysis of Flajolet and Odlyzko,
$$[z^n](1-z)^{-\alpha} \sim \frac{n^{\alpha-1}}{\Gamma(\alpha)}, \qquad \alpha \notin \{0,-1,-2,\ldots\},$$
finally yields (cf. W.S., 1998)
$$R_n^*(\mathcal{M}_0) = \frac{m-1}{2}\log\frac{n}{2} + \log\frac{\sqrt{\pi}}{\Gamma(m/2)} + \frac{\Gamma(m/2)\,m}{3\,\Gamma(\frac{m}{2}-\frac{1}{2})}\cdot\frac{\sqrt{2}}{\sqrt{n}} + \left(\frac{3 + m(m-2)(2m+1)}{36} - \frac{\Gamma^2(m/2)\,m^2}{9\,\Gamma^2(\frac{m}{2}-\frac{1}{2})}\right)\frac{1}{n} + \cdots$$
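
A quick check of this expansion is possible with a short sketch (mine, not from the talk): only the first two asymptotic terms are used below, and the exact value of $\log_2 D_n$ is computed through the tree-function convolution of the previous slide (in logarithms, to avoid underflow of $n!/n^n$).

```python
from math import factorial, lgamma, log2, log, pi

def log2_dn_exact(n, m):
    """Exact log2 D_n(M_0) via (n!/n^n)[z^n]B(z)^m, B(z) = sum_k (k^k/k!) z^k."""
    b = [k**k / factorial(k) for k in range(n + 1)]
    coeffs = [1.0] + [0.0] * n
    for _ in range(m):
        coeffs = [sum(coeffs[j] * b[i - j] for j in range(i + 1)) for i in range(n + 1)]
    return lgamma(n + 1) / log(2) - n * log2(n) + log2(coeffs[n])

def log2_dn_asym(n, m):
    """First two terms of the expansion: (m-1)/2 log(n/2) + log(sqrt(pi)/Gamma(m/2))."""
    return (m - 1) / 2 * log2(n / 2) + (0.5 * log(pi) - lgamma(m / 2)) / log(2)

for m in (2, 3, 5):
    for n in (50, 200, 500):
        print(f"m={m} n={n:4d}  exact={log2_dn_exact(n, m):7.3f}  "
              f"asymptotic={log2_dn_asym(n, m):7.3f}")
```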

24 Outline Update
1. Shannon & Knuth Legacy
2. Huffman Code and Its Redundancy
3. Universal Codes and Lambert's Function
   Finite Alphabet
   Unbounded Alphabet
4. Graph Compression and Knuth's Recurrences

25 Redundancy for UNBOUNDED m
Now assume that $m$ is unbounded and may vary with $n$. Then
$$D_{n,m}(\mathcal{M}_0) = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint \frac{\beta(z)^m}{z^{n+1}}\,dz = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint e^{g(z)}\,dz,$$
where $g(z) = m\ln\beta(z) - (n+1)\ln z$. The saddle point $z_0$ is a solution of $g'(z_0) = 0$, where $0 \le z_0 \le 1$.

26 Redundancy for UNBOUNDED m
Now assume that $m$ is unbounded and may vary with $n$. Then
$$D_{n,m}(\mathcal{M}_0) = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint \frac{\beta(z)^m}{z^{n+1}}\,dz = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint e^{g(z)}\,dz,$$
where $g(z) = m\ln\beta(z) - (n+1)\ln z$. The saddle point $z_0$ is a solution of $g'(z_0) = 0$, where $0 \le z_0 \le 1$.
[Figure: location of the saddle point $z_0$ in the three regimes m = o(n), m = n, n = o(m).]

27 Redundancy for UNBOUNDED m
Now assume that $m$ is unbounded and may vary with $n$. Then
$$D_{n,m}(\mathcal{M}_0) = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint \frac{\beta(z)^m}{z^{n+1}}\,dz = \sqrt{2\pi n}\,\frac{1}{2\pi i}\oint e^{g(z)}\,dz,$$
where $g(z) = m\ln\beta(z) - (n+1)\ln z$. The saddle point $z_0$ is a solution of $g'(z_0) = 0$, where $0 \le z_0 \le 1$.
[Figure: location of the saddle point $z_0$ in the three regimes m = o(n), m = n, n = o(m).]
D. Greene, D. E. Knuth, Mathematics for the Analysis of Algorithms, 1990.

28 Main Results for LARGE m
Theorem 2 (W.S. and Weinberger, 2010). For memoryless sources $\mathcal{M}_0$ over an $m$-ary alphabet, where $m\to\infty$ as $n$ grows, we have:
(i) For $m = o(n)$
$$R_{n,m}^*(\mathcal{M}_0) = \frac{m-1}{2}\log\frac{n}{m} + \frac{m}{2}\log e + \frac{m\log e}{3}\sqrt{\frac{m}{n}} - \frac{1}{2} + O\!\left(\sqrt{\frac{m}{n}}\right).$$
(ii) For $m = \alpha n + \ell(n)$, where $\alpha$ is a positive constant and $\ell(n) = o(n)$,
$$R_{n,m}^*(\mathcal{M}_0) = n\log B_\alpha + \ell(n)\log C_\alpha - \frac{1}{2}\log A_\alpha + O(\ell(n)^2/n),$$
where $C_\alpha := \frac{1}{2} + \frac{1}{2}\sqrt{1 + \frac{4}{\alpha}}$, $A_\alpha := C_\alpha + \frac{2}{\alpha}$, $B_\alpha := \alpha C_\alpha^{\alpha+2} e^{-\frac{1}{C_\alpha}}$.
(iii) For $n = o(m)$
$$R_{n,m}^*(\mathcal{M}_0) = n\log\frac{m}{n} + \frac{3n^2}{2m}\log e - \frac{3n}{2m}\log e + O\!\left(\frac{1}{n} + \frac{n^3}{m^2}\right).$$

29 Outline Update
1. Shannon & Knuth Legacy
2. Huffman Code and Its Redundancy
3. Universal Codes and Lambert's Function
4. Graph Compression and Knuth's Recurrences

30 Graph and Structural Entropies
A structure model $S$ of a graph $G$ is defined as its unlabeled version.
[Figure: eight labeled graphs $G_1,\ldots,G_8$ and the structures $S_1, S_2, \ldots$ they collapse to.]
The probability of a structure $S$ is $P(S) = N(S)\,P(G)$, where $N(S)$ is the number of labeled graphs with the same structure.
$$H_G = E[-\log P(G)] = -\sum_{G\in\mathcal{G}} P(G)\log P(G) \quad \text{(graph entropy)},$$
$$H_S = E[-\log P(S)] = -\sum_{S\in\mathcal{S}} P(S)\log P(S) \quad \text{(structural entropy)}.$$

31 Graph and Structural Entropies
A structure model $S$ of a graph $G$ is defined as its unlabeled version.
[Figure: eight labeled graphs $G_1,\ldots,G_8$ and the structures $S_1, S_2, \ldots$ they collapse to.]
The probability of a structure $S$ is $P(S) = N(S)\,P(G)$, where $N(S)$ is the number of labeled graphs with the same structure.
$$H_G = E[-\log P(G)] = -\sum_{G\in\mathcal{G}} P(G)\log P(G) \quad \text{(graph entropy)},$$
$$H_S = E[-\log P(S)] = -\sum_{S\in\mathcal{S}} P(S)\log P(S) \quad \text{(structural entropy)}.$$
Graph Automorphism: For a graph $G$, an automorphism is an adjacency-preserving permutation of the vertices of $G$.
$$H_S = H_G - \log n! + \sum_{S\in\mathcal{S}} P(S)\log|\mathrm{Aut}(S)|, \qquad N(S) = \frac{n!}{|\mathrm{Aut}(S)|}.$$
[Figure: a small graph on vertices a, b, c, d, e illustrating $N(S) = n!/|\mathrm{Aut}(S)|$.]
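
The identity $N(S) = n!/|\mathrm{Aut}(S)|$ is easy to verify by brute force for a tiny graph. In the sketch below the 5-vertex example is made up (the slide's own picture is not reproduced here): it counts adjacency-preserving permutations and the number of distinct labeled copies.

```python
from itertools import permutations
from math import factorial

def aut_size(n, edges):
    """Number of adjacency-preserving permutations of {0,...,n-1} (brute force)."""
    E = {frozenset(e) for e in edges}
    return sum(all(frozenset((s[u], s[v])) in E for u, v in edges)
               for s in permutations(range(n)))

def labeled_copies(n, edges):
    """Number of distinct labeled graphs isomorphic to the given one."""
    images = set()
    for s in permutations(range(n)):
        images.add(frozenset(frozenset((s[u], s[v])) for u, v in edges))
    return len(images)

# hypothetical example: a path 0-1-2-3 with a pendant edge 2-4
n, edges = 5, [(0, 1), (1, 2), (2, 3), (2, 4)]
a = aut_size(n, edges)
print(f"|Aut(S)| = {a},  n!/|Aut(S)| = {factorial(n) // a},  labeled copies = {labeled_copies(n, edges)}")
```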

32 Erdős–Rényi Graph Model
Erdős and Rényi model: $\mathcal{G}(n,p)$ generates graphs with $n$ vertices, where edges are chosen independently with probability $p$:
$$P(G) = p^k (1-p)^{\binom{n}{2} - k},$$
where $k$ is the number of edges of $G$.
Lemma (Kim et al., 2002). For Erdős–Rényi graphs, $P(|\mathrm{Aut}(G)| = 1) \to 1$.

33 Erdős–Rényi Graph Model
Erdős and Rényi model: $\mathcal{G}(n,p)$ generates graphs with $n$ vertices, where edges are chosen independently with probability $p$:
$$P(G) = p^k (1-p)^{\binom{n}{2} - k},$$
where $k$ is the number of edges of $G$.
Lemma (Kim et al., 2002). For Erdős–Rényi graphs, $P(|\mathrm{Aut}(G)| = 1) \to 1$.
Theorem 3 (Y. Choi and W.S., 2012). (i) For large $n$ and all $p$ satisfying $\frac{\ln n}{n} \ll p$ and $1 - p \gg \frac{\ln n}{n}$ (i.e., the graph is connected w.h.p.),
$$H_S = \binom{n}{2} h(p) - \log n! + o(1) = \binom{n}{2} h(p) - n\log n + n\log e - \frac{1}{2}\log n + O(1),$$
where $h(p) = -p\log p - (1-p)\log(1-p)$ is the entropy rate.
(ii) CONVERSE: There is an algorithm, called SZIP, whose code length $L(S)$ is upper bounded by
$$E[L(S)] \le \binom{n}{2} h(p) - n\log n + n(c + \Phi(\log n)) + o(n).$$

34 Erdős–Rényi Graph Model
Erdős and Rényi model: $\mathcal{G}(n,p)$ generates graphs with $n$ vertices, where edges are chosen independently with probability $p$:
$$P(G) = p^k (1-p)^{\binom{n}{2} - k},$$
where $k$ is the number of edges of $G$.
Lemma (Kim et al., 2002). For Erdős–Rényi graphs, $P(|\mathrm{Aut}(G)| = 1) \to 1$.
Theorem 3 (Y. Choi and W.S., 2012). (i) For large $n$ and all $p$ satisfying $\frac{\ln n}{n} \ll p$ and $1 - p \gg \frac{\ln n}{n}$ (i.e., the graph is connected w.h.p.),
$$H_S = \binom{n}{2} h(p) - \log n! + o(1) = \binom{n}{2} h(p) - n\log n + n\log e - \frac{1}{2}\log n + O(1),$$
where $h(p) = -p\log p - (1-p)\log(1-p)$ is the entropy rate.
(ii) CONVERSE: There is an algorithm, called SZIP, whose code length $L(S)$ is upper bounded by
$$E[L(S)] \le \binom{n}{2} h(p) - n\log n + n(c + \Phi(\log n)) + o(n).$$
Sketch of Proof: $N(S) = \frac{n!}{|\mathrm{Aut}(S)|}$ and $\sum_{S\in\mathcal{S}} P(S)\log|\mathrm{Aut}(S)| = o(1)$.
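
The relation $H_S = H_G - \log n! + \sum_S P(S)\log|\mathrm{Aut}(S)|$ holds exactly for $\mathcal{G}(n,p)$, since all labeled copies of a structure are equally likely. The following sketch (illustrative; n = 4 and p = 0.3 are arbitrary choices of mine) verifies it by enumerating all $2^6$ labeled graphs on four vertices.

```python
from itertools import combinations, permutations
from math import log2, factorial

n, p = 4, 0.3
pairs = list(combinations(range(n), 2))            # the 6 possible edges

def canon(edge_set):
    """Canonical form of a labeled graph: lexicographically smallest relabeling."""
    best = None
    for sigma in permutations(range(n)):
        img = tuple(sorted(tuple(sorted((sigma[u], sigma[v]))) for u, v in edge_set))
        if best is None or img < best:
            best = img
    return best

H_G, struct_prob = 0.0, {}
for mask in range(1 << len(pairs)):
    edges = [pairs[i] for i in range(len(pairs)) if mask >> i & 1]
    prob = p**len(edges) * (1 - p)**(len(pairs) - len(edges))   # P(G) = p^k (1-p)^(C(n,2)-k)
    H_G -= prob * log2(prob)
    c = canon(edges)
    struct_prob[c] = struct_prob.get(c, 0.0) + prob

H_S = -sum(q * log2(q) for q in struct_prob.values())

def aut_size(edges):
    """Brute-force |Aut(G)| for a graph given by a list of edges."""
    E = set(tuple(sorted(e)) for e in edges)
    return sum(all(tuple(sorted((s[u], s[v]))) in E for u, v in edges)
               for s in permutations(range(n)))

aut_term = sum(q * log2(aut_size(list(c))) for c, q in struct_prob.items())
print(f"H_G = {H_G:.4f}")
print(f"H_S = {H_S:.4f}")
print(f"H_G - log2(4!) + E[log|Aut|] = {H_G - log2(factorial(n)) + aut_term:.4f}")
```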

35 Structural Zip (SZIP) Algorithm

36 Recurrences for E[B₁] and E[B₂]
Let $N_x$ be the number of vertices that passed through node $x$ in $T_n$.
[Figure: the tree $T_n$ built from the vertex set {a,b,c,d,e,f,g,h,j}, with each node labeled by the subset of vertices passing through it.]
$$B_1 = \sum_{x\in T_n,\ N_x > 1}\log(N_x + 1), \qquad B_2 = \sum_{x\in T_n,\ N_x = 1}\log(N_x + 1) = \sum_{x\in T_n,\ N_x = 1} 1.$$

37 Recurrences for E[B₁] and E[B₂]
Let $N_x$ be the number of vertices that passed through node $x$ in $T_n$.
[Figure: the tree $T_n$ built from the vertex set {a,b,c,d,e,f,g,h,j}, with each node labeled by the subset of vertices passing through it.]
$$B_1 = \sum_{x\in T_n,\ N_x > 1}\log(N_x + 1), \qquad B_2 = \sum_{x\in T_n,\ N_x = 1}\log(N_x + 1) = \sum_{x\in T_n,\ N_x = 1} 1.$$
Both $E[B_1]$ and $E[B_2]$ satisfy two-dimensional recurrences: for some $d \ge 0$
$$b(n+1, 0) = n + \sum_{k=0}^{n}\binom{n}{k} p^k q^{n-k}\,[\,b(k,0) + b(n-k,k)\,],$$
$$b(n, d) = n + \sum_{k=0}^{n}\binom{n}{k} p^k q^{n-k}\,[\,b(k,d-1) + b(n-k,k+d-1)\,], \qquad \text{for } d \ge 1.$$
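
A direct implementation of this two-dimensional recurrence is straightforward with memoization. The sketch below is only an illustration: the boundary values b(0,d) = b(1,d) = 0 and the reading of the d = 0 case (which the slide states for b(n+1,0)) are my assumptions, and p = 0.3 is arbitrary.

```python
from functools import lru_cache
from math import comb

p, q = 0.3, 0.7   # assumed symbol probabilities, chosen for illustration only

@lru_cache(maxsize=None)
def b(n, d):
    """Two-dimensional recurrence from the slide.
    Assumptions (not stated on the slide): b(0,d) = b(1,d) = 0, and the d = 0
    case -- written on the slide for b(n+1,0) -- is evaluated here by shifting n."""
    if n <= 1:
        return 0.0
    if d == 0:
        m = n - 1   # b(m+1,0) = m + sum_k C(m,k) p^k q^(m-k) [b(k,0) + b(m-k,k)]
        return m + sum(comb(m, k) * p**k * q**(m - k) * (b(k, 0) + b(m - k, k))
                       for k in range(m + 1))
    return n + sum(comb(n, k) * p**k * q**(n - k) * (b(k, d - 1) + b(n - k, k + d - 1))
                   for k in range(n + 1))

for n in (5, 10, 20, 40):
    print(f"n={n:2d}   b(n,0) = {b(n, 0):10.3f}   b(n,2) = {b(n, 2):10.3f}")
```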

38 Regular Tries: d = ∞ (Knuth, 1968)
Regular Trie Recurrence: set $d = \infty$:
$$b(n,\infty) = n + \sum_{k=0}^{n}\binom{n}{k} p^k q^{n-k}\,[\,b(k,\infty) + b(n-k,\infty)\,].$$
Asymptotically (Knuth '70, Jacquet '88, W.S. '89):
$$b(n,\infty) = \frac{1}{h}\, n\log n + \frac{1}{h}\left[\gamma + \frac{h_2}{2h} + \Phi(\log_p n)\right] n + o(n),$$
where $\Phi(x)$ is the periodic function $\Phi(x) = \sum_{k=-\infty,\,k\ne 0}^{\infty}\Gamma\!\left(\frac{2k\pi i r}{\log p}\right) e^{2k\pi r i x}$; when $\log p/\log(1-p)$ is irrational, then $\Phi(x)\to 0$ as $x\to\infty$.

39 Regular Tries: d = ∞ (Knuth, 1968)
Regular Trie Recurrence: set $d = \infty$:
$$b(n,\infty) = n + \sum_{k=0}^{n}\binom{n}{k} p^k q^{n-k}\,[\,b(k,\infty) + b(n-k,\infty)\,].$$
Asymptotically (Knuth '70, Jacquet '88, W.S. '89):
$$b(n,\infty) = \frac{1}{h}\, n\log n + \frac{1}{h}\left[\gamma + \frac{h_2}{2h} + \Phi(\log_p n)\right] n + o(n),$$
where $\Phi(x)$ is the periodic function $\Phi(x) = \sum_{k=-\infty,\,k\ne 0}^{\infty}\Gamma\!\left(\frac{2k\pi i r}{\log p}\right) e^{2k\pi r i x}$; when $\log p/\log(1-p)$ is irrational, then $\Phi(x)\to 0$ as $x\to\infty$.
D. E. Knuth, The Art of Computer Programming, Vol. 3, Addison-Wesley, 1973.

40 Regular Tries: d = ∞ (Knuth, 1968)
Regular Trie Recurrence: set $d = \infty$:
$$b(n,\infty) = n + \sum_{k=0}^{n}\binom{n}{k} p^k q^{n-k}\,[\,b(k,\infty) + b(n-k,\infty)\,].$$
Asymptotically (Knuth '70, Jacquet '88, W.S. '89):
$$b(n,\infty) = \frac{1}{h}\, n\log n + \frac{1}{h}\left[\gamma + \frac{h_2}{2h} + \Phi(\log_p n)\right] n + o(n),$$
where $\Phi(x)$ is the periodic function $\Phi(x) = \sum_{k=-\infty,\,k\ne 0}^{\infty}\Gamma\!\left(\frac{2k\pi i r}{\log p}\right) e^{2k\pi r i x}$; when $\log p/\log(1-p)$ is irrational, then $\Phi(x)\to 0$ as $x\to\infty$.
D. E. Knuth, The Art of Computer Programming, Vol. 3, Addison-Wesley, 1973.
Define $\bar b(n,d) := b(n,d) - b(n,\infty)$. Then we have our main result.
Theorem 4. For $n\to\infty$ and $d = O(1)$ we have $\bar b(n,d) = O(\log^2 n)$; that is,
$$\bar b(n,d) = \frac{1}{2h\log p}\log^2 n + \frac{d}{h}\log n + \left[\frac{1}{2h} + \frac{1}{h\log p}\left(\gamma - \frac{h_2}{2h}\right) + \Psi(\log_p n)\right]\log n + O(1),$$
where $\Psi(\cdot)$ is a periodic function, as above.
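
For d = ∞ the recurrence can be solved exactly by pulling the k = 0 and k = n terms (which contain b(n,∞) itself) to the left-hand side. The sketch below is my own illustration (boundary b(0) = b(1) = 0 and p = 0.3 assumed); it compares the exact solution with the leading term (1/h) n log n from the slide.

```python
from math import comb, log2

def trie_cost(N, p):
    """Solve b(n) = n + sum_k C(n,k) p^k q^(n-k) [b(k) + b(n-k)], with b(0) = b(1) = 0,
    by moving the k = 0 and k = n terms (which contain b(n)) to the left-hand side."""
    q = 1 - p
    b = [0.0, 0.0]
    for n in range(2, N + 1):
        rhs = float(n)
        for k in range(n + 1):
            w = comb(n, k) * p**k * q**(n - k)
            rhs += w * ((b[k] if k < n else 0.0) + (b[n - k] if k > 0 else 0.0))
        b.append(rhs / (1.0 - p**n - q**n))
    return b

p = 0.3
h = -(p * log2(p) + (1 - p) * log2(1 - p))      # entropy of the source in bits
cost = trie_cost(200, p)
for n in (10, 50, 100, 200):
    print(f"n={n:3d}   b(n,inf) = {cost[n]:10.2f}   (1/h) n log2 n = {n * log2(n) / h:10.2f}")
```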

41 Sketch of Proof
1. We analyze $b^*(n,d)$ instead of $b(n,d)$, satisfying
$$b^*(n,d) = \sum_{k=0}^{n}\binom{n}{k} p^k q^{n-k}\, b^*(k, d-1).$$
2. The Poisson transform $A_d(z) = \sum_{n\ge 2} b^*(n,d)\frac{z^n}{n!}e^{-z}$ satisfies the functional recurrence
$$A_d(z) = A_{d-1}(pz),$$
which can be solved as $A_d(z) = A_0(p^d z)$.
3. Define the Mellin transform $M(s) = \int_0^{\infty} A_0(z)\, z^{s-1}\,dz$, which leads to the following functional equation:
$$(s-1)\,M(s-1) + (1 - p^{-s})\,M(s) = \frac{(s-1)\,\Gamma(s)}{1 - p^{1-s} - q^{1-s}}.$$
This can be solved explicitly for $M(s)$: the solution is an infinite sum over $i \ge 0$ whose terms involve the products $\prod_{k}(1 - p^{k-s})$ and the denominators $1 - p^{1+i-s} - q^{1+i-s}$. Residue theory and depoissonization complete the proof.

42 Standing on the Shoulders of Giants...
