Analytic Algorithmics, Combinatorics, and Information Theory


1 Analytic Algorithmics, Combinatorics, and Information Theory
W. Szpankowski
Department of Computer Science, Purdue University, W. Lafayette, IN
September 11, 2006
Research supported by NSF and NIH. Joint work with M. Drmota, P. Flajolet, and P. Jacquet.

2 Outline
1. Basic Facts about Source Coding
2. The Redundancy Rate Problem
(a) Known Sources (Sequences mod 1)
(b) Universal Memoryless Sources (Tree-like generating functions)
(c) Universal Markov Sources (Balance matrices, Eulerian paths)
(d) Universal Renewal Sources (Combinatorial calculus, Mellin, asymptotics)
3. Conclusions: Information Beyond Shannon
(a) Computer Science and Information Theory Interplay
(b) Today's Challenges
(c) Vision

3 Goals of Source Coding
The basic problem of source coding (i.e., data compression) is to find codes with the shortest descriptions (lengths), either on average or for individual sequences, when the source (i.e., the statistics of the underlying probability distribution) is unknown (the so-called universal source coding).
Goals:
Find a universal lower bound on the compression ratio (bit rate).
Construct universal source codes that achieve this lower bound up to second-order asymptotics (i.e., match the redundancy, which is basically a measure of the second-order term).
As pointed out by Rissanen, universal coding evolved into universal modeling, where the purpose is no longer restricted to coding but rather to finding optimal models for data.

4 Some Definitions
Definition: A block-to-variable (BV) length code $C_n : \mathcal{A}^n \to \{0,1\}^*$ is a bijective mapping from the set of all sequences of length $n$ over the alphabet $\mathcal{A}$ to the set $\{0,1\}^*$ of binary sequences.
For a probabilistic source model $S$ and a code $C_n$ we let:
$P(x_1^n)$ be the probability of $x_1^n = x_1 \cdots x_n$;
$L(C_n, x_1^n)$ be the code length for $x_1^n$;
Entropy: $H_n(P) = -\sum_{x_1^n} P(x_1^n)\, \lg P(x_1^n)$.
Information-theoretic quantities are expressed in binary logarithms, written $\lg := \log_2$.

5 Two Facts
A prefix code is a code in which no codeword is a prefix of another codeword.
Kraft's Inequality: a prefix code with codeword lengths $l_1, l_2, \ldots, l_m$ exists if and only if
$$\sum_{i=1}^{m} 2^{-l_i} \le 1.$$
Proof: an easy exercise on trees.
Shannon's First Theorem: for any prefix code the average code length $E[L(C_n, X_1^n)]$ cannot be smaller than the entropy of the source $H_n(P)$, that is,
$$E[L(C_n, X_1^n)] \ge H_n(P).$$
Proof: use $\log x \le x - 1$ for $0 < x \le 1$ and Kraft's inequality.
Optimal length: minimizing $E[L(C_n, X_1^n)]$ over codes $C_n$ subject to Kraft's inequality gives the (ideal) lengths $\tilde{l}(x_1^n) = -\log P(x_1^n)$.
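A quick numerical check of both facts (a minimal sketch; the Bernoulli source, the block length, and all helper names are illustrative assumptions, not part of the talk): Shannon-type lengths $\lceil \lg 1/P(x_1^n)\rceil$ satisfy Kraft's inequality, and their average stays within one bit of the block entropy $H_n(P)$.

```python
import itertools
import math

def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality: sum of 2^(-l_i)."""
    return sum(2.0 ** -l for l in lengths)

def shannon_lengths(p, n):
    """Lengths ceil(lg 1/P(x)) for all binary strings of length n
    under a memoryless Bernoulli(p) source (p = probability of a '1')."""
    probs, lengths = [], []
    for x in itertools.product("01", repeat=n):
        k = x.count("1")
        prob = (p ** k) * ((1 - p) ** (n - k))
        probs.append(prob)
        lengths.append(math.ceil(-math.log2(prob)))
    return probs, lengths

p, n = 0.3, 8                      # illustrative values
probs, lengths = shannon_lengths(p, n)
entropy = -sum(q * math.log2(q) for q in probs)
avg_len = sum(q * l for q, l in zip(probs, lengths))
print("Kraft sum          :", kraft_sum(lengths))   # <= 1, so a prefix code with these lengths exists
print("block entropy H_n  :", entropy)
print("average code length:", avg_len)              # H_n <= E[L] < H_n + 1
```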

6 Outline Update
1. Basic Facts of Source Coding
2. The Redundancy Rate Problem
(a) Known Sources (Sequences mod 1)
(b) Universal Memoryless Sources
(c) Universal Markov Sources
(d) Universal Renewal Sources
3. Conclusions

7 Redundancy: Known Source P
The pointwise redundancy $R_n(C_n, P; x_1^n)$ and the average redundancy $\bar{R}_n(C_n, P)$ are defined as
$$R_n(C_n, P; x_1^n) = L(C_n, x_1^n) + \lg P(x_1^n),$$
$$\bar{R}_n(C_n, P) = E[L(C_n, X_1^n)] - H_n(P) \ge 0.$$
The maximal or worst-case redundancy is
$$R_n^*(C_n, P) = \max_{x_1^n} R_n(C_n, P; x_1^n) \ (\ge 0).$$
The pointwise redundancy can be negative; the maximal and average redundancies cannot. The smaller the redundancy, the better (closer to optimal) the code.

8 Redundancy for Known Sources
Huffman Code. The following optimization problem is solved by Huffman's code:
$$\bar{R}_n(P) = \min_{C_n \in \mathcal{C}} E_{x_1^n}\big[L(C_n, x_1^n) + \log_2 P(x_1^n)\big].$$
Generalized Shannon Code. Drmota and W.S. (2001) proved that
$$R_n^*(P) = \min_{C_n} \max_{x_1^n}\big[L(C_n, x_1^n) + \lg P(x_1^n)\big]$$
is solved by the generalized Shannon code $C_n^{GS}$, defined as
$$L(C_n^{GS}, x_1^n) = \begin{cases} \lfloor \lg 1/P(x_1^n) \rfloor & \text{if } \langle \lg 1/P(x_1^n)\rangle \le s_0, \\ \lceil \lg 1/P(x_1^n) \rceil & \text{if } \langle \lg 1/P(x_1^n)\rangle > s_0, \end{cases}$$
where $s_0$ is a constant and $\langle x \rangle = x - \lfloor x \rfloor$ denotes the fractional part of $x$.

9 Two Modes of $\bar{R}_n^H(P)$ Behavior
Figure: the average redundancy of Huffman codes versus block size $n$ for (a) irrational $\alpha = \log_2((1-p)/p)$ with $p = 1/\pi$; (b) rational $\alpha = \log_2((1-p)/p)$ with $p = 1/9$.

10 Why Two Modes: Shannon Code
Consider the Shannon code that assigns the length $L(C_n^S, x_1^n) = \lceil \lg 1/P(x_1^n) \rceil$ to the source sequence $x_1^n$. Observe that $P(x_1^n) = p^k (1-p)^{n-k}$, where $p$ is the known probability of generating a 0 and $k$ is the number of 0s.
The Shannon code redundancy is
$$\bar{R}_n^S = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} \left( \big\lceil -\log_2\!\big(p^k(1-p)^{n-k}\big)\big\rceil + \log_2\!\big(p^k(1-p)^{n-k}\big) \right) = 1 - \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} \langle \alpha k + \beta n \rangle,$$
where $\langle x \rangle = x - \lfloor x \rfloor$ is the fractional part of $x$, and
$$\alpha = \log_2\!\left(\frac{1-p}{p}\right), \qquad \beta = \log_2\!\left(\frac{1}{1-p}\right).$$
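The last formula is easy to evaluate exactly, which makes the two modes visible numerically (a minimal sketch; the sample values of $p$ follow the figure on the previous slide, and the function name is an illustrative assumption):

```python
import math
from math import comb, ceil, log2

def shannon_avg_redundancy(p, n):
    """Average redundancy E[L] - H_n of the Shannon code L(x) = ceil(lg 1/P(x))
    for a binary memoryless source; p is the probability of a 0 and k counts 0s."""
    total = 0.0
    for k in range(n + 1):
        logP = k * log2(p) + (n - k) * log2(1 - p)
        total += comb(n, k) * (p ** k) * ((1 - p) ** (n - k)) * (ceil(-logP) + logP)
    return total

# p = 1/pi: alpha = lg((1-p)/p) is irrational, the redundancy converges;
# p = 1/9 : alpha = 3 is rational, the redundancy keeps oscillating with n.
for p in (1 / math.pi, 1 / 9):
    vals = [shannon_avg_redundancy(p, n) for n in range(50, 60)]
    print(f"p = {p:.4f}:", " ".join(f"{v:.4f}" for v in vals))
```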

11 Sequences Modulo 1
Asymptotic redundancy depends on the behavior of the sequence $x_k = \alpha k + \beta n$. One needs asymptotics of the following sum:
$$\sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}\, f(\langle x_k + y\rangle) \to \begin{cases}\displaystyle\int_0^1 f(t)\,dt & \text{if } x_k \text{ is a B.u.d. mod 1 sequence},\\[1mm] \text{oscillating} & \text{otherwise},\end{cases}$$
for fixed $p$ and some Riemann integrable function $f:[0,1]\to\mathbb{R}$.
For the Huffman code the following is known [W.S. (2000)]:
$$\bar{R}_n^H = \begin{cases}\dfrac{3}{2}-\dfrac{1}{\ln 2}+o(1) & \alpha \text{ irrational},\\[2mm] \dfrac{3}{2}-\dfrac{1}{M}\left(\langle \beta M n\rangle-\dfrac{1}{2}\right)-\dfrac{2^{-\langle \beta M n\rangle/M}}{M\left(1-2^{-1/M}\right)}+O(\rho^n) & \alpha=\dfrac{N}{M},\end{cases}$$
where $N, M$ are integers such that $\gcd(N, M)=1$ and $\rho<1$.
For the generalized Shannon code, Drmota and W.S. (2004) proved that, when $\alpha$ is irrational,
$$R_n^*(P) = \lg\frac{1}{\ln 2} + o(1) \approx 0.5287 + o(1).$$

12 Outline Update
1. Basic Facts of Source Coding
2. The Redundancy Rate Problem
(a) Known Sources
(b) Unknown Sources (memoryless, Markov, renewal)
3. Conclusions

13 Minimax Redundancy: Unknown Source P
In practice, one can only hope to have some knowledge about a family of sources $\mathcal{S}$ that generate real data.
Following Davisson, we define the average minimax redundancy $\bar{R}_n(\mathcal{S})$ and the worst-case (maximal) minimax redundancy $R_n^*(\mathcal{S})$ for a family of sources $\mathcal{S}$ as
$$\bar{R}_n(\mathcal{S}) = \min_{C_n} \sup_{P \in \mathcal{S}} E\big[L(C_n, X_1^n) + \lg P(X_1^n)\big],$$
$$R_n^*(\mathcal{S}) = \min_{C_n} \sup_{P \in \mathcal{S}} \max_{x_1^n} \big[L(C_n, x_1^n) + \lg P(x_1^n)\big].$$
In the minimax scenario we look for the best code for the worst source.
Source Coding Goal: Find data compression algorithms that match the optimal redundancy rates either on average or for individual sequences.

14 Another Representation for the Minimax Redundancy
For the maximal minimax redundancy define the maximum-likelihood distribution
$$Q^*(x_1^n) := \frac{\sup_{P\in\mathcal{S}} P(x_1^n)}{\sum_{y_1^n\in\mathcal{A}^n} \sup_{P\in\mathcal{S}} P(y_1^n)}.$$
Observe that
$$R_n^*(\mathcal{S}) = \min_{C_n\in\mathcal{C}} \sup_{P\in\mathcal{S}} \max_{x_1^n}\big(L(C_n, x_1^n) + \lg P(x_1^n)\big)$$
$$= \min_{C_n\in\mathcal{C}} \max_{x_1^n}\Big(L(C_n, x_1^n) + \sup_{P\in\mathcal{S}} \lg P(x_1^n)\Big)$$
$$= \min_{C_n\in\mathcal{C}} \max_{x_1^n}\Big[L(C_n, x_1^n) + \lg Q^*(x_1^n) + \lg\!\Big(\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n)\Big)\Big]$$
$$= R_n^{GS}(Q^*) + \lg\!\Big(\sum_{y_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(y_1^n)\Big),$$
with $R_n^{GS}(Q^*)$ being the maximal redundancy of a generalized Shannon code built for $Q^*$. Define
$$D_n(\mathcal{S}) = \lg\!\Big(\sum_{x_1^n\in\mathcal{A}^n}\sup_{P\in\mathcal{S}} P(x_1^n)\Big) =: \lg d_n(\mathcal{S}).$$

15 Maximal Minimax Redundancy
We consider the following classes of sources $\mathcal{S}$:
Memoryless sources $\mathcal{M}_0$ over an $m$-ary (finite) alphabet, that is, $P(x_1^n) = p_1^{k_1}\cdots p_m^{k_m}$ with $k_1+\cdots+k_m = n$, where the $p_i$ are unknown!
Markov sources $\mathcal{M}_r$ over a binary alphabet, of order $r$. Observe that for $r = 1$
$$P(x_1^n) = p_{x_1}\, p_{00}^{k_{00}}\, p_{01}^{k_{01}}\, p_{10}^{k_{10}}\, p_{11}^{k_{11}},$$
where $k_{ij}$ is the number of pairs $(x_k, x_{k+1}) = (i,j) \in \{0,1\}^2$ and $k_{00}+k_{01}+k_{10}+k_{11} = n-1$; note that $k_{01} = k_{10}$ if $x_1 = x_n$ and $k_{01} = k_{10} \pm 1$ if $x_1 \ne x_n$.
Renewal sources $\mathcal{R}_0$, where a 1 is introduced after a run of 0's whose length is distributed according to some distribution.

16 Outline Update
1. Basic Facts of Source Coding
2. The Redundancy Rate Problem
(a) Known Sources
(b) Universal Memoryless Sources
(c) Universal Markov Sources
(d) Universal Renewal Sources
3. Conclusions

17 Maximal Minimax for Memoryless Sources
We first consider the maximal minimax redundancy $R_n^*(\mathcal{M}_0)$ for the class of memoryless sources over a finite $m$-ary alphabet. Observe that
$$d_n(\mathcal{M}_0) = \sum_{x_1^n} \sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m} = \sum_{k_1+\cdots+k_m=n} \binom{n}{k_1,\ldots,k_m} \sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m} = \sum_{k_1+\cdots+k_m=n} \binom{n}{k_1,\ldots,k_m} \left(\frac{k_1}{n}\right)^{k_1}\!\!\cdots\left(\frac{k_m}{n}\right)^{k_m}.$$
The summation set is $I(k_1,\ldots,k_m) = \{(k_1,\ldots,k_m) : k_1+\cdots+k_m = n\}$.
The number $N_{\mathbf{k}}$ of sequences of type $\mathbf{k} = (k_1,\ldots,k_m)$ is $N_{\mathbf{k}} = \binom{n}{k_1,\ldots,k_m}$.
The (unnormalized) maximum likelihood is
$$\sup_{p_1,\ldots,p_m} p_1^{k_1}\cdots p_m^{k_m} = \left(\frac{k_1}{n}\right)^{k_1}\!\!\cdots\left(\frac{k_m}{n}\right)^{k_m}.$$
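For moderate $n$ the sum $d_n(\mathcal{M}_0)$ can be computed exactly by running over all types (a minimal sketch; the recursion, the function name, and the sample values of $n$ and $m$ are illustrative assumptions):

```python
from math import comb, log2

def d_n_memoryless(n, m):
    """Shtarkov sum d_n(M_0): sum over types (k_1,...,k_m) with k_1+...+k_m = n of
    multinomial(n; k_1,...,k_m) * prod_i (k_i/n)^{k_i}."""
    def rec(symbols_left, positions_left, weight):
        if symbols_left == 1:
            k = positions_left
            return weight * ((k / n) ** k if k > 0 else 1.0)
        total = 0.0
        for k in range(positions_left + 1):
            term = weight * comb(positions_left, k) * ((k / n) ** k if k > 0 else 1.0)
            total += rec(symbols_left - 1, positions_left - k, term)
        return total
    return rec(m, n, 1.0)

for m in (2, 3):
    for n in (10, 100, 500):
        print(f"m = {m}, n = {n}: lg d_n =", log2(d_n_memoryless(n, m)))
```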

18 Generating Function for $d_n(\mathcal{M}_0)$
We write
$$d_n(\mathcal{M}_0) = \frac{n!}{n^n}\sum_{k_1+\cdots+k_m=n} \frac{k_1^{k_1}}{k_1!}\cdots\frac{k_m^{k_m}}{k_m!}.$$
Let us introduce the tree-generating function
$$B(z) = \sum_{k=0}^{\infty}\frac{k^k}{k!}z^k = \frac{1}{1-T(z)},$$
where $T(z)$ satisfies $T(z) = z e^{T(z)}$ (i.e., $T(z) = -W(-z)$ with Lambert's $W$-function) and
$$T(z) = \sum_{k=1}^{\infty}\frac{k^{k-1}}{k!}z^k$$
enumerates all rooted labeled trees.
Let now
$$D_m(z) = \sum_{n=0}^{\infty}\frac{n^n}{n!}\, d_n(\mathcal{M}_0)\, z^n.$$
Then, by the convolution formula, $D_m(z) = [B(z)]^m$.
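A quick consistency check of the convolution formula for the binary case $m = 2$ (a minimal sketch; the truncation order $N$ and all names are illustrative assumptions, and the two columns printed below should agree):

```python
from math import comb, factorial

def tree_B_coeffs(N):
    """Coefficients of B(z) = sum_{k>=0} k^k/k! z^k (with 0^0 = 1), up to z^N."""
    return [1.0] + [k ** k / factorial(k) for k in range(1, N + 1)]

def series_mul(a, b, N):
    """Product of two power series (coefficient lists), truncated at order N."""
    c = [0.0] * (N + 1)
    for i, ai in enumerate(a):
        for j in range(min(len(b), N + 1 - i)):
            c[i + j] += ai * b[j]
    return c

N, m = 30, 2
B = tree_B_coeffs(N)
D = [1.0] + [0.0] * N
for _ in range(m):
    D = series_mul(D, B, N)                     # D_m(z) = [B(z)]^m

for n in (5, 10, 20, 30):
    d_from_gf = factorial(n) / n ** n * D[n]    # d_n = n!/n^n [z^n] B(z)^m
    d_direct = sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                   for k in range(n + 1))       # direct type sum for m = 2 (0^0 = 1)
    print(n, d_from_gf, d_direct)
```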

19 Asymptotics
The function $B(z)$ has an algebraic singularity at $z = e^{-1}$ (it becomes a multi-valued function), and one finds
$$B(z) = \frac{1}{\sqrt{2(1-ez)}} + \frac{1}{3} + O\big(\sqrt{1-ez}\big).$$
The singularity analysis of Flajolet & Odlyzko yields (cf. W.S., 1998)
$$\lg d_n(\mathcal{M}_0) = \frac{m-1}{2}\log\frac{n}{2} + \log\frac{\sqrt{\pi}}{\Gamma(\frac{m}{2})} + \frac{\Gamma(\frac{m}{2})\,m}{3\,\Gamma(\frac{m}{2}-\frac{1}{2})}\cdot\frac{\sqrt{2}}{\sqrt{n}} + \left(\frac{3 + m(m-2)(2m+1)}{36} - \frac{\Gamma^2(\frac{m}{2})\,m^2}{9\,\Gamma^2(\frac{m}{2}-\frac{1}{2})}\right)\frac{1}{n} + \cdots$$
To complete the analysis, we need $R_n^{GS}(Q^*)$. Drmota & W.S. (2001) proved that $R_n^{GS}(Q^*)$ converges to an explicitly computable constant depending only on $m$; in general, the $o(1)$ error term cannot be improved. Thus
$$R_n^*(\mathcal{M}_0) = \frac{m-1}{2}\log\frac{n}{2} + \log\frac{\sqrt{\pi}}{\Gamma(\frac{m}{2})} + \lim_{n\to\infty} R_n^{GS}(Q^*) + o(1).$$

20 Outline Update
1. Basic Facts of Source Coding
2. The Redundancy Rate Problem
(a) Known Sources
(b) Universal Memoryless Sources
(c) Universal Markov Sources
(d) Universal Renewal Sources
3. Conclusions

21 Maximal Minimax for Markov Sources
(i) Consider now the class $\mathcal{M}_1$ of Markov sources of order $r = 1$, (ii) with transition matrix $\mathbf{P} = \{p_{ij}\}_{i,j=1}^m$, (iii) and circular sequences (the first symbol follows the last). Then
$$d_n(\mathcal{M}_1) = \sum_{x_1^n} \sup_{\mathbf{P}} p_{11}^{k_{11}}\cdots p_{mm}^{k_{mm}} = \sum_{\mathbf{k}\in\mathcal{F}_n} M_{\mathbf{k}} \left(\frac{k_{11}}{k_1}\right)^{k_{11}}\!\!\cdots\left(\frac{k_{mm}}{k_m}\right)^{k_{mm}},$$
where $M_{\mathbf{k}}$ is the number of strings $x_1^n$ of type $\mathbf{k}$, $k_{ij}$ is the number of pairs $ij\in\mathcal{A}^2$ in $x_1^n$, and $k_i = \sum_{j=1}^m k_{ij}$.
The matrix $\mathbf{k} = \{k_{ij}\}_{i,j=1}^m$ is an integer matrix satisfying the balance property
$$\mathcal{F}_n: \qquad \sum_{i,j=1}^m k_{ij} = n, \qquad \sum_{j=1}^m k_{ij} = \sum_{j=1}^m k_{ji} \ \text{ for every } i.$$
A matrix $\mathbf{k}$ satisfying the above conditions is called the frequency matrix or the Markov type matrix.

22 Markov Types
For a binary Markov source, $P(x_1^n) = p_{00}^{k_{00}}\, p_{01}^{k_{01}}\, p_{10}^{k_{10}}\, p_{11}^{k_{11}}$; thus all strings $x_1^n$ with the same matrix $\mathbf{k} = \{k_{ij}\}_{i,j\in\mathcal{A}}$ have the same empirical distribution. These sequences belong to the same type $\mathbf{k}$.
Let $N_{\mathbf{k}}^a$ be the number of (cyclic) strings $x_1^n$ belonging to the same type $\mathbf{k}$ and starting with the symbol $a$.
Observe: $N_{\mathbf{k}}^a$ is equal to the number of Eulerian paths (starting from vertex $a\in\mathcal{A}$) in a multigraph built on $|\mathcal{A}|$ vertices with $k_{ij}$ edges between the $i$th and the $j$th vertices.
Example: Let $\mathcal{A} = \{0, 1\}$ and $\mathbf{k} = \ldots$
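For small $n$ the type classes can be enumerated by brute force, which makes the correspondence between $N_{\mathbf{k}}^a$ and Eulerian paths easy to inspect (a minimal sketch; the length $n = 8$ and the helper names are illustrative assumptions):

```python
from collections import Counter
from itertools import product

def markov_type(x):
    """Frequency (Markov type) matrix of a cyclic binary string x:
    counts of pairs (x_t, x_{t+1}), with the first symbol following the last."""
    pairs = Counter((x[t], x[(t + 1) % len(x)]) for t in range(len(x)))
    return tuple(pairs[(i, j)] for i in (0, 1) for j in (0, 1))   # (k00, k01, k10, k11)

n = 8
counts = Counter()                                  # N_k^a, indexed by (first symbol a, type k)
for x in product((0, 1), repeat=n):
    counts[(x[0], markov_type(x))] += 1

for (a, k), N in sorted(counts.items())[:6]:
    print(f"a = {a}, (k00,k01,k10,k11) = {k}: N_k^a = {N}")
print("distinct binary Markov types for n =", n, ":", len({k for (_, k) in counts}))
```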

23 Main Technical Tool
Let $g_{\mathbf{k}}$ be a sequence of scalars indexed by matrices $\mathbf{k}$, let $g(\mathbf{z}) = \sum_{\mathbf{k}} g_{\mathbf{k}}\, \mathbf{z}^{\mathbf{k}}$ be its regular generating function, and let
$$\mathcal{F}g(\mathbf{z}) = \sum_{\mathbf{k}\in\mathcal{F}} g_{\mathbf{k}}\, \mathbf{z}^{\mathbf{k}} = \sum_{n\ge 0} \sum_{\mathbf{k}\in\mathcal{F}_n} g_{\mathbf{k}}\, \mathbf{z}^{\mathbf{k}}$$
be the $\mathcal{F}$-generating function of $g_{\mathbf{k}}$ restricted to $\mathbf{k}\in\mathcal{F}$.
Lemma 1. Let $g(\mathbf{z}) = \sum_{\mathbf{k}} g_{\mathbf{k}}\,\mathbf{z}^{\mathbf{k}}$. Then
$$\mathcal{F}g(\mathbf{z}) := \sum_{n\ge 0}\sum_{\mathbf{k}\in\mathcal{F}_n} g_{\mathbf{k}}\,\mathbf{z}^{\mathbf{k}} = \frac{1}{(2\pi i)^m}\oint\cdots\oint \frac{dx_1}{x_1}\cdots\frac{dx_m}{x_m}\; g\!\left(\left[z_{ij}\frac{x_j}{x_i}\right]\right),$$
where the $ij$-th entry of $[z_{ij}\, x_j/x_i]$ is $z_{ij}\, x_j / x_i$.
Proof. It suffices to observe that
$$g\!\left(\left[z_{ij}\frac{x_j}{x_i}\right]\right) = \sum_{\mathbf{k}} g_{\mathbf{k}}\, \mathbf{z}^{\mathbf{k}} \prod_{i=1}^m x_i^{\sum_j k_{ji} - \sum_j k_{ij}}.$$
Thus $\mathcal{F}g(\mathbf{z})$ is the coefficient of $g([z_{ij}\, x_j/x_i])$ at $x_1^0 x_2^0\cdots x_m^0$.

24 Main Results
Theorem 1 (Jacquet and W.S., 2004). (i) Let $\mathcal{M}_1$ be the class of Markov sources over an $m$-ary alphabet. Then
$$d_n(\mathcal{M}_1) = \left(\frac{n}{2\pi}\right)^{m(m-1)/2} A_m \left(1 + O\!\left(\frac{1}{n}\right)\right)$$
with
$$A_m = \int_{\mathcal{K}(1)} \prod_i \frac{\sqrt{\sum_j y_{ij}}}{\prod_j \sqrt{y_{ij}}}\; F_m(y_{ij})\, d[y_{ij}],$$
where $\mathcal{K}(1) = \{y_{ij} : \sum_{ij} y_{ij} = 1\}$ and $F_m(\cdot)$ is a polynomial expression of degree $m-1$.
(ii) Let $\mathcal{M}_r$ be the class of Markov sources of order $r$. Then
$$d_n(\mathcal{M}_r) = \left(\frac{n}{2\pi}\right)^{m^r(m-1)/2} A_m^r \left(1 + O\!\left(\frac{1}{n}\right)\right),$$
where $A_m^r$ is a constant defined in a similar fashion as $A_m$ above.
In particular, for $m = 2$, $A_2 = 16\, C$, where $C$ is Catalan's constant $\sum_{i\ge 0} \frac{(-1)^i}{(2i+1)^2}$.
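For $m = 2$ the sum $d_n(\mathcal{M}_1)$ over circular strings can be checked by brute force against the leading term $(n/2\pi)\, A_2$ (a minimal sketch; the string lengths and the truncation used for Catalan's constant are illustrative assumptions, and the $O(1/n)$ correction is still visible at these sizes):

```python
from itertools import product
from math import pi

def sup_markov_prob(x):
    """sup over binary Markov transition matrices of prod_{ij} p_ij^{k_ij}
    for a circular string x, i.e. prod_{ij} (k_ij / k_i)^{k_ij}."""
    n = len(x)
    k = [[0, 0], [0, 0]]
    for t in range(n):
        k[x[t]][x[(t + 1) % n]] += 1              # cyclic pair counts
    p = 1.0
    for i in (0, 1):
        k_i = k[i][0] + k[i][1]
        for j in (0, 1):
            if k[i][j]:
                p *= (k[i][j] / k_i) ** k[i][j]
    return p

catalan = sum((-1) ** i / (2 * i + 1) ** 2 for i in range(200000))  # Catalan's constant
for n in (8, 12, 16):
    d_n = sum(sup_markov_prob(x) for x in product((0, 1), repeat=n))
    print(n, d_n, "   leading term (n/2pi)*16*C =", (n / (2 * pi)) * 16 * catalan)
```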

25 Outline Update
1. Basic Facts of Source Coding
2. The Redundancy Rate Problem
(a) Known Sources
(b) Universal Memoryless Sources
(c) Universal Markov Sources
(d) Universal Renewal Sources
3. Conclusions

26 Renewal Sources
The renewal process is defined as follows: let $T_1, T_2, \ldots$ be a sequence of i.i.d. positive-valued random variables with distribution $Q(j) = \Pr\{T_i = j\}$. The process $T_0,\ T_0 + T_1,\ T_0 + T_1 + T_2, \ldots$ is called the renewal process.
With a renewal process we associate a binary renewal sequence in which the positions of the 1's are at the renewal epochs $T_0,\ T_0 + T_1, \ldots$ (the gaps between them are runs of zeros).
Csiszár and Shields (1996) proved that $R_n^*(\mathcal{R}_0) = \Theta(\sqrt{n})$.
Theorem 2 (Flajolet and W.S., 1998). Let $c = \frac{\pi^2}{6} - 1$. Then
$$R_n^*(\mathcal{R}_0) = \frac{2}{\log 2}\sqrt{cn} + O(\log n).$$

27 Maximal Minimax Redundancy
For a sequence
$$x_0^n = 1\, 0^{\alpha_1}\, 1\, 0^{\alpha_2} \cdots 1\, 0^{\alpha_k}\underbrace{0\cdots 0}_{k^*},$$
let $k_m$ be the number of $i$ such that $\alpha_i = m$. Then
$$P(x_1^n) = Q(0)^{k_0}\, Q(1)^{k_1}\cdots Q(n-1)^{k_{n-1}}\,\Pr\{T_1 > k^*\}.$$
It can be proved that
$$r_{n+1} - 1 \le d_n(\mathcal{R}_0) \le \sum_{m=0}^{n} r_m,$$
where
$$r_n = \sum_{k=0}^{n} r_{n,k}, \qquad r_{n,k} = \sum_{\mathcal{P}(n,k)} \binom{k}{k_0, \ldots, k_{n-1}} \left(\frac{k_0}{k}\right)^{k_0}\left(\frac{k_1}{k}\right)^{k_1}\!\!\cdots\left(\frac{k_{n-1}}{k}\right)^{k_{n-1}},$$
and $\mathcal{P}(n,k)$ is the summation set, which happens to describe the partitions of $n$ into $k$ terms, i.e.,
$$n = k_0 + 2k_1 + \cdots + n k_{n-1}, \qquad k = k_0 + \cdots + k_{n-1}.$$
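For small $n$ the partition sum can be evaluated directly (a minimal sketch; the partition generator, the function names, and the range of $n$ are illustrative assumptions, and only small $n$ are practical since the number of partitions grows quickly):

```python
from math import comb

def partitions(n, k, max_part=None):
    """Partitions of n into exactly k positive parts, in non-increasing order."""
    if max_part is None:
        max_part = n
    if k == 0:
        if n == 0:
            yield []
        return
    for p in range(min(n - k + 1, max_part), 0, -1):
        for rest in partitions(n - p, k - 1, p):
            yield [p] + rest

def r(n):
    """r_n = sum_k sum over partitions of n into k parts of
    multinomial(k; k_0,...,k_{n-1}) * prod_i (k_i/k)^{k_i},
    where k_i is the multiplicity of part size i+1."""
    total = 1.0 if n == 0 else 0.0               # n = 0: only the empty partition
    for k in range(1, n + 1):
        for part in partitions(n, k):
            mult = {}
            for p in part:
                mult[p] = mult.get(p, 0) + 1
            coeff, left, weight = 1, k, 1.0
            for c in mult.values():
                coeff *= comb(left, c)           # builds k!/(k_0! ... k_{n-1}!)
                left -= c
                weight *= (c / k) ** c           # (k_i/k)^{k_i}
            total += coeff * weight
    return total

for n in range(1, 16):
    print(n, r(n))
```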

28 Main Asymptotic Results
Theorem 3 (Flajolet and W.S., 1998). The quantity $r_n$ attains the following asymptotics:
$$\lg r_n = \frac{2}{\log 2}\sqrt{cn} - \frac{5}{8}\lg n + \frac{1}{2}\lg\log n + O(1),$$
where $c = \frac{\pi^2}{6} - 1$.
The asymptotic analysis is sophisticated and follows these steps:
first, we transform $r_n$ into another quantity $s_n$; using a probabilistic technique we later recover $r_n$ from $s_n$;
by combinatorial calculus we find the generating function of $s_n$, which turns out to be an infinite product of tree functions $B(z)$;
we transform this product into a harmonic sum that can be analyzed asymptotically by the Mellin transform;
we find an asymptotic expansion of the generating function around $z = 1$;
finally, we estimate $R_n^*(\mathcal{R}_0)$ by the saddle point method.

29 Asymptotics: The Main Idea
The quantity $r_n$ is too hard to analyze directly due to the factor $k!/k^k$, hence we define a new quantity $s_n$ as
$$s_n = \sum_{k=0}^{n} s_{n,k}, \qquad s_{n,k} = e^{-k}\sum_{\mathcal{P}(n,k)} \frac{k_0^{k_0}}{k_0!}\cdots\frac{k_{n-1}^{k_{n-1}}}{k_{n-1}!}.$$
To analyze it, we introduce the random variable $K_n$ with
$$\Pr\{K_n = k\} = \frac{s_{n,k}}{s_n}.$$
Stirling's formula yields
$$\frac{r_n}{s_n} = \sum_{k=0}^{n}\frac{r_{n,k}}{s_{n,k}}\cdot\frac{s_{n,k}}{s_n} = E\!\left[K_n!\, K_n^{-K_n} e^{K_n}\right] = E\big[\sqrt{2\pi K_n}\big] + O\big(E[K_n^{-1/2}]\big).$$

30 Fundamental Lemmas
Lemma 2. Let $\mu_n = E[K_n]$ and $\sigma_n^2 = \mathrm{Var}(K_n)$. Then
$$s_n = \exp\!\left(2\sqrt{cn} - \frac{7}{8}\log n + d + o(1)\right),$$
$$\mu_n = \frac{1}{4}\sqrt{\frac{n}{c}}\,\log\frac{n}{c} + o(\sqrt{n}),$$
$$\sigma_n^2 = O(n\log n) = o(\mu_n^2),$$
where $c = \pi^2/6 - 1$ and $d = \frac{3}{8}\log c - \log 2 - \frac{3}{4}\log\pi$.
Lemma 3. For large $n$,
$$E\big[\sqrt{K_n}\big] = \mu_n^{1/2}(1 + o(1)), \qquad E[K_n^{-1/2}] = o(1),$$
where $\mu_n = E[K_n]$. Thus
$$r_n = s_n\, E\big[\sqrt{2\pi K_n}\big]\,(1 + o(1)) = s_n\sqrt{2\pi\mu_n}\,(1 + o(1)).$$

31 Sketch of a Proof: Generating Functions
1. Define $\beta(z)$ as
$$\beta(z) = \sum_{k=0}^{\infty}\frac{k^k}{k!}e^{-k}z^k.$$
By Lagrange inversion,
$$\beta(z) = \frac{1}{1 - T(ze^{-1})},$$
where the tree generating function satisfies $T(z) = ze^{T(z)}$.
2. Define
$$S_n(u) = \sum_{k=0}^{\infty} s_{n,k}\, u^k, \qquad S(z,u) = \sum_{n=0}^{\infty} S_n(u)\, z^n.$$
Since $s_{n,k}$ involves convolutions of sequences of the form $k^k/k!$, we have
$$S(z,u) = \sum_{n,k} z^{1\cdot k_0 + 2\cdot k_1 + \cdots}\left(\frac{u}{e}\right)^{k_0+\cdots+k_{n-1}}\frac{k_0^{k_0}}{k_0!}\cdots\frac{k_{n-1}^{k_{n-1}}}{k_{n-1}!} = \prod_{i=1}^{\infty}\beta(z^i u).$$
We need $s_n = [z^n]S(z,1)$, where $[z^n]F(z)$ denotes the coefficient of $z^n$ in $F(z)$.
3. Let $L(z) = \log S(z,1)$ and $z = e^{-t}$, so that
$$L(e^{-t}) = \sum_{k=1}^{\infty}\log\beta(e^{-kt}),$$
which falls under the harmonic sum paradigm $L(t) = \sum_k \lambda_k f(\mu_k t)$.
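The product $S(z,1) = \prod_{i\ge 1}\beta(z^i)$ is easy to truncate and expand numerically, which gives the exact $s_n$ for moderate $n$ and lets one compare $\log s_n$ with the exponent from Lemma 2 (a minimal sketch; the truncation order $N$ and the variable names are illustrative assumptions, and the constant $d$ is omitted from the comparison):

```python
from math import exp, factorial, log, pi, sqrt

N = 40                                                    # truncation order (illustrative)
beta = [1.0] + [k ** k * exp(-k) / factorial(k) for k in range(1, N + 1)]

def series_mul(a, b):
    """Product of two power series (coefficient lists), truncated at order N."""
    c = [0.0] * (N + 1)
    for i, ai in enumerate(a):
        if ai:
            for j in range(min(len(b), N + 1 - i)):
                c[i + j] += ai * b[j]
    return c

# S(z,1) = prod_{i>=1} beta(z^i); factors with i > N contribute only 1 modulo z^{N+1}
S = [1.0] + [0.0] * N
for i in range(1, N + 1):
    factor = [0.0] * (N + 1)
    for k in range(N // i + 1):
        factor[k * i] = beta[k]                           # beta(z^i): substitute z -> z^i
    S = series_mul(S, factor)

c = pi ** 2 / 6 - 1
for n in (10, 20, 40):
    s_n = S[n]
    print(n, s_n, "  log s_n =", log(s_n),
          "  2*sqrt(c*n) - (7/8)*log(n) =", 2 * sqrt(c * n) - 7 / 8 * log(n))
```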

32 Mellin Asymptotics
4. Computing the Mellin transform $L^*(s) = \mathcal{M}(L(e^{-t}); s)$ of the above harmonic sum we find
$$L^*(s) = \zeta(s)\Lambda(s),$$
where $\zeta(s) = \sum_{n\ge 1} n^{-s}$ is the Riemann zeta function and
$$\Lambda(s) = \int_0^{\infty}\log\beta(e^{-t})\, t^{s-1}\, dt.$$
This leads to the singular expansion (at $s = 1$ and $s = 0$)
$$L^*(s) \asymp \frac{\Lambda(1)}{s-1} - \frac{1}{4s^2} - \frac{\log\pi}{4s}.$$
5. By the converse mapping property of the Mellin transform,
$$L(e^{-t}) = \frac{\Lambda(1)}{t} + \frac{1}{4}\log t - \frac{1}{4}\log\pi + O(\sqrt{t}),$$
which translates into
$$L(z) = \frac{\Lambda(1)}{1-z} + \frac{1}{4}\log(1-z) - \frac{1}{4}\log\pi - \frac{1}{2}\Lambda(1) + O\big(\sqrt{1-z}\big),$$
where $c = \Lambda(1) = \pi^2/6 - 1$.
6. In summary, we just proved that, as $z \to 1$,
$$S(z,1) = e^{L(z)} = a\,(1-z)^{1/4}\exp\!\left(\frac{c}{1-z}\right)(1 + o(1)),$$
where $a = \exp\!\left(-\frac{1}{4}\log\pi - \frac{1}{2}c\right)$.
7. To extract the asymptotics of $s_n$ we apply the saddle point method.

33 Outline Update
1. Basic Facts of Source Coding
2. The Redundancy Rate Problem
(a) Universal Memoryless Sources
(b) Universal Markov Sources
(c) Universal Renewal Sources
3. Conclusions

34 Analytic Information Theory
In his 1997 Shannon Lecture, Jacob Ziv presented compelling arguments for backing off from first-order asymptotics in order to predict the behavior of real systems with finite-length descriptions.
To overcome these difficulties, we propose replacing first-order analyses by full asymptotic expansions and more accurate analyses (e.g., large deviations, central limit laws).
Following Knuth's and Hadamard's precept¹, we study information theory problems using techniques of complex analysis such as generating functions, combinatorial calculus, Rice's formula, the Mellin transform, Fourier series, sequences distributed modulo 1, saddle point methods, analytic poissonization and depoissonization, and singularity analysis.
This program, which applies complex-analytic tools to information theory, constitutes analytic information theory.
¹ The shortest path between two truths on the real line passes through the complex plane.

35 Information Theory and Computer Science Interface
The interplay between IT and CS dates back to the founding father of information theory, Claude E. Shannon. In 2003, the first Workshop on the Information Theory and Computer Science Interface was held in Chicago.
Examples of IT and CS interplay:
Lempel-Ziv schemes (Ziv, Lempel, Louchard, Jacquet, Szpankowski);
LDPC coding, Tornado and Raptor codes (Gallager, Luby, Mitzenmacher, Shokrollahi, Urbanke);
List-decoding algorithms for error-correcting codes (Gallager, Sudan, Guruswami, Koetter, Vardy);
Kolmogorov complexity (Kolmogorov, Cover, Li, Vitanyi, Lempel, Ziv);
Analytic information theory (Jacquet, Flajolet, Drmota, Savari, Szpankowski);
Quantum computing and information (Shor, Grover, Schumacher, Bennett, Deutsch, Calderbank);
Network coding and wireless computing (Kumar, Yang, Effros, Bruck, Ephremides, Shroff, Verdu).

36 Information Beyond Shannon
Participants of the Information Beyond Shannon workshop (Orlando, 2005) listed the following research issues:
Delay: In computer networks, the delay incurred is a nontrivial issue not yet addressed in information theory (e.g., complete information arriving late may be useless).
Space: In networks, the spatially distributed components raise fundamental issues of limitations in information exchange, since the available resources must be shared, allocated and re-used.
Information and Control: Again in networks, our objective is to reliably send data with high bit rate and small delay (control). For example, in wireless/ad-hoc networks, information is exchanged in space and time for decision making; thus timeliness of information delivery, along with reliability and complexity, constitutes the basic objective.
Dynamic information: In a complex network in a space-time-control environment (e.g., in the human brain, information is not simply communicated but also processed), how can the consideration of such dynamical sources be incorporated into the Shannon-theoretic model?

37 Today's Challenges
We still lack measures and meters to define and appraise the amount of structure and organization embodied in artifacts and natural objects.
Information accumulates at a rate faster than it can be sifted through, so that the bottleneck, traditionally represented by the medium, is drifting towards the receiving end of the channel.
Timeliness, space and control are important dimensions of Information. Time- and space-varying situations are rarely studied in Shannon Information Theory.
In a growing number of situations, the overhead in accessing Information makes information itself practically unattainable or obsolete.
Microscopic systems do not seem to obey Shannon's postulates of Information. In the quantum world and on the level of living cells, traditional Information often fails to accurately describe reality.

38 Vision
Perhaps it is time to initiate an Information Science Institute integrating research and teaching activities aimed at investigating the role of information from various viewpoints: from the fundamental theoretical underpinnings of information to the science and engineering of novel information substrates, biological pathways, communication networks, economics, and complex social systems.
The specific means and goals for the Center are:
initiate the Science Lecture Series on Information to collectively ponder short- and long-term goals;
study dynamic information theory that extends information theory to time- and space-varying situations;
advance information algorithmics that develops new algorithms and data structures for the application of information;
encourage and facilitate interdisciplinary collaborations;
provide scholarships and fellowships for the best students, and support the development of new interdisciplinary courses.
