Algorithms, Combinatorics, Information, and Beyond. Dedicated to PHILIPPE FLAJOLET

1 Algorithms, Combinatorics, Information, and Beyond Wojciech Szpankowski Purdue University October 3, 2011 Concordia University, Montreal, 2011 Dedicated to PHILIPPE FLAJOLET Research supported by NSF Science & Technology Center, and Humboldt Foundation.

2 Outline 1. Shannon Legacy 2. Analytic Combinatorics + IT = Analytic Information Theory 3. The Redundancy Rate Problem (a) Universal Memoryless Sources (b) Universal Markov Sources 4. Post-Shannon Challenges: Beyond Shannon 5. Science of Information: NSF Science and Technology Center
Algorithms: are at the heart of virtually all computing technologies;
Combinatorics: provides indispensable tools for finding patterns and structures;
Information: permeates every corner of our lives and shapes our universe.

3 Shannon Legacy The Information Revolution started in 1948, with the publication of Claude Shannon's A Mathematical Theory of Communication. Shannon information quantifies the extent to which a recipient of data can reduce its statistical uncertainty. Shannon, 1948: These semantic aspects of communication are irrelevant. Shannon, 1953, The Lattice Theory of Information: It is hardly to be expected that one single concept of information would satisfactorily account for... (all) possible applications. Applications Enabler/Driver: CD, iPod, DVD, video games, computer communication, Internet,... Design Driver: universal data compression, voiceband modems, discrete denoising,...

4 Three Theorems of Shannon
Theorem 1 & 3. [Shannon 1948; Lossless & Lossy Data Compression] compression bit rate ≥ source entropy H(X); for distortion level D: lossy bit rate ≥ rate distortion function R(D).
Theorem 2. [Shannon 1948; Channel Coding] In Shannon's words: It is possible to send information at the capacity through the channel with as small a frequency of errors as desired by proper (long) encoding. This statement is not true for any rate greater than the capacity.
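As a concrete numerical companion to Theorems 1 & 3 (a minimal sketch of mine, not part of the slides), the snippet below evaluates the lossless limit H(X) and the rate-distortion function of a Bernoulli(p) source under Hamming distortion, using the standard closed form R(D) = h(p) - h(D) for 0 <= D <= min(p, 1-p); the value p = 0.11 is an arbitrary illustrative choice.

```python
import math

def h(x):
    """Binary entropy in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

p = 0.11                       # Bernoulli source parameter (illustrative)
H = h(p)                       # lossless compression limit (Theorem 1), bits/symbol
for D in (0.0, 0.02, 0.05, 0.10):
    # rate-distortion function of the binary source with Hamming distortion (Theorem 3)
    RD = h(p) - h(D) if D < min(p, 1 - p) else 0.0
    print(f"D = {D:.2f}:  H(X) = {H:.4f},  R(D) = {RD:.4f} bits/symbol")
```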

6 Theorem 2: Shannon Random Decoding Rule
There are 2^{nH(X)} X-typical sequences; there are 2^{nH(Y)} Y-typical sequences; there are 2^{nH(X,Y)} jointly (X,Y)-typical pairs of sequences, where H(X,Y) = -E[\log P(X,Y)] is the joint entropy.
Decoding Rule: Declare that the sequence sent, X, is the one that is jointly typical with the received sequence Y, provided it is unique.
The probability of an error when 2^{nR} messages are sent is approximately
\min P(\mathrm{error}) \approx 2^{nR} \cdot 2^{-n \sup_{P(X)} I(X,Y)} = 2^{-n(C-R)},
where I(X;Y) = H(X) + H(Y) - H(X,Y) and C := \sup_{P(X)} I(X,Y).
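To make the exponent 2^{-n(C-R)} tangible, here is a small Python illustration (my own example, not from the slides) for a binary symmetric channel with crossover probability p and a uniform input, which achieves C = 1 - h(p); the values of p, R and n are arbitrary choices.

```python
import math

def H(probs):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

p = 0.1                                   # BSC crossover probability (illustrative)
joint = [0.5 * (1 - p), 0.5 * p,          # P(X=0,Y=0), P(X=0,Y=1)
         0.5 * p, 0.5 * (1 - p)]          # P(X=1,Y=0), P(X=1,Y=1)

HX = H([0.5, 0.5])
HY = H([joint[0] + joint[2], joint[1] + joint[3]])
HXY = H(joint)
I = HX + HY - HXY                         # mutual information = capacity C for uniform input
print(f"C = {I:.4f} bits per channel use")

R, n = 0.4, 1000                          # rate below capacity, block length (illustrative)
print(f"random-coding error estimate ~ 2^(-n(C-R)) = {2 ** (-n * (I - R)):.3e}")
```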

7 Post-Shannon Challenges
1. Back off from infinity (Ziv '97): Extend Shannon findings to finite-size data structures (i.e., sequences, graphs), that is, develop information theory of various data structures beyond first-order asymptotics. Claim: Many interesting information-theoretic phenomena appear in the second-order terms.
2. Science of Information: Information Theory needs to meet new challenges of current applications in biology, communication, knowledge extraction, economics,... to understand new aspects of information such as structure, time, space, and semantics, and others such as dynamic information, limited resources, complexity, representation-invariant information, and cooperation & dependency.

8 Outline Update 1. Shannon Legacy 2. Analytic Information Theory 3. Source Coding: The Redundancy Rate Problem 4. Post-Shannon Information 5. NSF Science and Technology Center

9 Analytic Combinatorics + Information Theory = Analytic Information Theory
In the 1997 Shannon Lecture Jacob Ziv presented compelling arguments for backing off from first-order asymptotics in order to predict the behavior of real systems with finite length description. To overcome these difficulties, one may replace first-order analyses by non-asymptotic analysis; however, we propose to develop full asymptotic expansions and more precise analysis (e.g., large deviations, CLT).
Following Hadamard's precept¹, we study information theory problems using techniques of complex analysis such as generating functions, combinatorial calculus, Rice's formula, Mellin transform, Fourier series, sequences distributed modulo 1, saddle point methods, analytic poissonization and depoissonization, and singularity analysis. This program, which applies complex-analytic tools to information theory, constitutes analytic information theory.²
¹ The shortest path between two truths on the real line passes through the complex plane.
² Andrew Odlyzko: Analytic methods are extremely powerful and when they apply, they often yield estimates of unparalleled precision.

10 Some Successes of Analytic Information Theory
Wyner-Ziv Conjecture concerning the longest match in the WZ'89 compression scheme (W.S., 1993).
Ziv's Conjecture on the distribution of the number of phrases in LZ'78 (Jacquet & W.S., 1995, 2011).
Redundancy of LZ'78 (Savari, 1997; Louchard & W.S., 1997).
Steinberg-Gutman Conjecture regarding lossy pattern matching compression (Luczak & W.S., 1997; Kieffer & Yang, 1998; Kontoyiannis, 2003).
Precise redundancy of Huffman's code (W.S., 2000) and redundancy of a fixed-to-variable non-prefix-free code (W.S. & Verdu, 2010).
Minimax redundancy for memoryless sources (Xie & Barron, 1997; W.S., 1998; W.S. & Weinberger, 2010), Markov sources (Rissanen, 1998; Jacquet & W.S., 2004), and renewal sources (Flajolet & W.S., 2002; Drmota & W.S., 2004).
Analysis of variable-to-fixed codes such as Tunstall and Khodak codes (Drmota, Reznik, Savari, & W.S., 2006, 2008, 2010).
Entropy of hidden Markov processes and the noisy constrained capacity (Jacquet, Seroussi, & W.S., 2004, 2007, 2010; Han & Marcus, 2007)....

11 Outline Update 1. Shannon Legacy 2. Analytic Information Theory 3. Source Coding: The Redundancy Rate Problem (a) Universal Memoryless Sources (b) Universal Markov Sources

12 Source Coding and Redundancy
Source coding aims at finding codes C : A^* → {0,1}^* of the shortest length L(C, x), either on average or for individual sequences.
Known source P: the pointwise and maximal redundancy are
R_n(C_n, P; x_1^n) = L(C_n, x_1^n) + \log P(x_1^n),
R^*(C_n, P) = \max_{x_1^n} R_n(C_n, P; x_1^n) \quad (\ge 0),
where P(x_1^n) is the probability of x_1^n = x_1 \cdots x_n.
Unknown source P: following Davisson, the maximal minimax redundancy R^*_n(S) for a family of sources S is
R^*_n(S) = \min_{C_n} \sup_{P \in S} \max_{x_1^n} [L(C_n, x_1^n) + \log P(x_1^n)].
Shtarkov's Bound:
d_n(S) := \log \sum_{x_1^n \in A^n} \sup_{P \in S} P(x_1^n) \;\le\; R^*_n(S) \;\le\; \log \underbrace{\sum_{x_1^n \in A^n} \sup_{P \in S} P(x_1^n)}_{D_n(S)} + 1.
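The Shtarkov sum can be evaluated by brute force for tiny instances; the sketch below (an illustration of the definition only, with arbitrary small n) does so for the class of binary memoryless (Bernoulli) sources, where the supremum for each sequence is attained at the empirical frequency. By the bound above, R^*_n(S) then lies within one bit of log_2 D_n(S).

```python
import math
from itertools import product

def sup_prob(x):
    """sup over Bernoulli(theta) of P(x); attained at the ML estimate theta = k/n."""
    n, k = len(x), sum(x)
    theta = k / n
    return (theta ** k) * ((1 - theta) ** (n - k))   # 0**0 == 1 handles k in {0, n}

for n in (4, 8, 12):
    D_n = sum(sup_prob(x) for x in product((0, 1), repeat=n))   # Shtarkov's sum
    print(f"n = {n:2d}:  D_n = {D_n:.4f},  log2 D_n = {math.log2(D_n):.4f}")
```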

16 Learnable Information and Redundancy
1. S := M_k = \{P_\theta : \theta \in \Theta\} is a set of k-dimensional parameterized distributions. Let \hat{\theta}(x^n) = \arg\max_{\theta \in \Theta} P_\theta(x^n) be the ML estimator.
[Figure: the parameter space Θ is covered by C_n(Θ) balls of radius proportional to 1/\sqrt{n}; parameters θ, θ' within one ball around \hat{\theta} are indistinguishable models, and \log C_n(Θ) measures the useful bits.]
2. Two models, say P_\theta(x^n) and P_{\theta'}(x^n), are indistinguishable if the ML estimator \hat{\theta} with high probability declares both models to be the same.
3. The number of distinguishable distributions (i.e., distinct \hat{\theta}), C_n(Θ), summarizes the learnable information, which can be viewed as I(Θ) = \log_2 C_n(Θ).
4. Consider the following expansion of the Kullback-Leibler (KL) divergence:
D(P_{\hat{\theta}} \| P_\theta) := E[\log P_{\hat{\theta}}(X^n)] - E[\log P_\theta(X^n)] \approx \frac{1}{2}(\theta - \hat{\theta})^T I(\hat{\theta})(\theta - \hat{\theta}) =: d_I^2(\theta, \hat{\theta}),
where I(θ) = \{I_{ij}(\theta)\}_{ij} is the Fisher information matrix and d_I(θ, \hat{\theta}) is a rescaled Euclidean distance known as the Mahalanobis distance.
5. Balasubramanian proved that the number of distinguishable balls C_n(Θ) of radius O(1/\sqrt{n}) is asymptotically equal to the minimax redundancy:
Learnable Information = \log C_n(\Theta) = \inf_{\theta \in \Theta} \max_{x^n} \log \frac{P_{\hat{\theta}}(x^n)}{P_\theta(x^n)} = R^*_n(M_k).
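A quick numerical check of the quadratic KL approximation in item 4 for the one-dimensional Bernoulli family (a hypothetical example chosen here, not taken from the slides): treating I as the Fisher information of the whole n-sample, I_n(θ) = n/(θ(1-θ)), the exact divergence D(P_{\hat{\theta}}^n \| P_\theta^n) should be close to \frac{1}{2} I_n(\hat{\theta})(\theta - \hat{\theta})^2 when θ is near \hat{\theta}.

```python
import math

def kl_bernoulli(a, b):
    """KL divergence D(Bern(a) || Bern(b)) in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

n, theta_hat = 1000, 0.30                       # sample size and ML estimate (illustrative)
fisher_n = n / (theta_hat * (1 - theta_hat))    # Fisher information of the n-sample

for theta in (0.29, 0.31, 0.33):
    exact = n * kl_bernoulli(theta_hat, theta)             # D(P_theta_hat^n || P_theta^n)
    approx = 0.5 * fisher_n * (theta - theta_hat) ** 2     # (1/2) d_I^2(theta, theta_hat)
    print(f"theta = {theta:.2f}:  exact KL = {exact:.4f},  quadratic approx = {approx:.4f}")
```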

17 Outline Update 1. Shannon Legacy 2. Analytic Information Theory 3. Source Coding: The Redundancy Rate Problem (a) Universal Memoryless Sources (b) Universal Markov Sources

18 Maximal Minimax for Memoryless Sources
For a memoryless source over the alphabet A = {1, 2, ..., m} we have P(x_1^n) = p_1^{k_1} \cdots p_m^{k_m}, with k_1 + \cdots + k_m = n. Then
D_n(M_0) := \sum_{x_1^n} \sup_P P(x_1^n) = \sum_{x_1^n} \sup_{p_1,\ldots,p_m} p_1^{k_1} \cdots p_m^{k_m}
= \sum_{k_1+\cdots+k_m=n} \binom{n}{k_1,\ldots,k_m} \sup_{p_1,\ldots,p_m} p_1^{k_1} \cdots p_m^{k_m}
= \sum_{k_1+\cdots+k_m=n} \binom{n}{k_1,\ldots,k_m} \left(\frac{k_1}{n}\right)^{k_1} \cdots \left(\frac{k_m}{n}\right)^{k_m},
since the (unnormalized) likelihood is maximized at the empirical distribution:
\sup_P P(x_1^n) = \sup_{p_1,\ldots,p_m} p_1^{k_1} \cdots p_m^{k_m} = \left(\frac{k_1}{n}\right)^{k_1} \cdots \left(\frac{k_m}{n}\right)^{k_m}.
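The collapse of the 2^n-term sum into a sum over types is easy to check numerically; the sketch below (an illustration with arbitrary small n and m, and a hypothetical helper name D_n) evaluates D_n(M_0) directly from the multinomial formula above. For m = 2 the values agree with the brute-force Shtarkov sum of the earlier sketch.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

def D_n(n, m):
    """Shtarkov sum for memoryless sources over an m-ary alphabet, via the sum over types."""
    total = 0.0
    # each multiset of n symbols from {0,...,m-1} corresponds to one type (k_1,...,k_m)
    for assignment in combinations_with_replacement(range(m), n):
        counts = Counter(assignment)
        coeff = math.factorial(n)
        for k in counts.values():
            coeff //= math.factorial(k)                    # multinomial coefficient
        total += coeff * math.prod((k / n) ** k for k in counts.values())
    return total

for n, m in [(8, 2), (8, 3), (6, 4)]:
    d = D_n(n, m)
    print(f"n={n}, m={m}:  D_n(M_0) = {d:.4f},  log2 = {math.log2(d):.4f}")
```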

19 Generating Function for D_n(M_0)
We write
\sum_{k_1+\cdots+k_m=n} \binom{n}{k_1,\ldots,k_m} \left(\frac{k_1}{n}\right)^{k_1} \cdots \left(\frac{k_m}{n}\right)^{k_m} = \frac{n!}{n^n} \sum_{k_1+\cdots+k_m=n} \frac{k_1^{k_1}}{k_1!} \cdots \frac{k_m^{k_m}}{k_m!}.
Let us introduce the tree-generating function
B(z) = \sum_{k=0}^{\infty} \frac{k^k}{k!} z^k = \frac{1}{1 - T(z)}, \qquad T(z) = \sum_{k=1}^{\infty} \frac{k^{k-1}}{k!} z^k,
where T(z) = z e^{T(z)} (i.e., T(z) = -W(-z) in terms of Lambert's W-function) enumerates all rooted labeled trees. Let now
D_m(z) = \sum_{n=0}^{\infty} z^n \frac{n^n}{n!} D_n(M_0).
Then by the convolution formula D_m(z) = [B(z)]^m.
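The convolution identity can be sanity-checked with truncated power series; the sketch below (purely illustrative) extracts \frac{n!}{n^n}[z^n][B(z)]^m numerically, and the printed values can be compared against the type-sum D_n(n, m) from the previous sketch.

```python
import math

def poly_mul(a, b, N):
    """Multiply two truncated power series (coefficient lists) up to degree N."""
    c = [0.0] * (N + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j <= N:
                c[i + j] += ai * bj
    return c

N, m = 10, 3
B = [k ** k / math.factorial(k) for k in range(N + 1)]   # B(z) coefficients; 0**0 == 1
Bm = [1.0] + [0.0] * N
for _ in range(m):
    Bm = poly_mul(Bm, B, N)                               # coefficients of B(z)^m

for n in (4, 8, 10):
    via_gf = math.factorial(n) / n ** n * Bm[n]           # n!/n^n [z^n] B(z)^m
    print(f"n={n}, m={m}: via generating function = {via_gf:.6f}")
    # should match D_n(n, m) computed from the type sum in the previous sketch
```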

21 Asymptotics for FINITE m
The function B(z) has an algebraic singularity at z = e^{-1}, and
\beta(z) = B(z/e) = \frac{1}{\sqrt{2(1-z)}} + \frac{1}{3} + O(\sqrt{1-z}).
By Cauchy's coefficient formula,
D_n(M_0) = \frac{n!}{n^n}[z^n][B(z)]^m = \sqrt{2\pi n}\,(1 + O(1/n)) \frac{1}{2\pi i} \oint \frac{\beta(z)^m}{z^{n+1}}\, dz.
For finite m, the singularity analysis of Flajolet and Odlyzko,
[z^n](1-z)^{-\alpha} \sim \frac{n^{\alpha-1}}{\Gamma(\alpha)}, \qquad \alpha \notin \{0, -1, -2, \ldots\},
finally yields (cf. Clarke & Barron, 1990; W.S., 1998)
R^*_n(M_0) = \frac{m-1}{2}\log\frac{n}{2} + \log\frac{\sqrt{\pi}}{\Gamma(m/2)} + \frac{\Gamma(m/2)\, m}{3\,\Gamma(\frac{m}{2}-\frac{1}{2})} \cdot \frac{\sqrt{2}\,\log e}{\sqrt{n}} + \left(\frac{3 + m(m-2)(2m+1)}{36} - \frac{\Gamma^2(m/2)\, m^2}{9\,\Gamma^2(\frac{m}{2}-\frac{1}{2})}\right) \frac{\log e}{n} + \cdots
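A quick numerical comparison for the binary case m = 2 (an illustration; only the first two terms of the expansion are used, and the exact value is computed from the type sum in the log domain to avoid overflow):

```python
import math

def log2_Dn_binary(n):
    """log2 of the Shtarkov sum D_n(M_0) for a binary alphabet, via the type sum."""
    logs = []
    for k in range(n + 1):
        lt = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        if 0 < k < n:
            lt += k * math.log(k / n) + (n - k) * math.log((n - k) / n)
        logs.append(lt)
    mx = max(logs)
    return (mx + math.log(sum(math.exp(x - mx) for x in logs))) / math.log(2)

m = 2
for n in (100, 1000, 10000):
    exact = log2_Dn_binary(n)
    approx = (m - 1) / 2 * math.log2(n / 2) + math.log2(math.sqrt(math.pi) / math.gamma(m / 2))
    print(f"n={n}:  exact R*_n = {exact:.4f},  first two asymptotic terms = {approx:.4f}")
```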

22 Redundancy for LARGE m
Now assume that m is unbounded and may vary with n. Then
D_{n,m}(M_0) = \sqrt{2\pi n}\, \frac{1}{2\pi i} \oint \frac{\beta(z)^m}{z^{n+1}}\, dz = \sqrt{2\pi n}\, \frac{1}{2\pi i} \oint e^{g(z)}\, dz,
where g(z) = m \ln \beta(z) - (n+1) \ln z. The saddle point z_0 is a solution of g'(z_0) = 0, and
g(z) = g(z_0) + \frac{(z - z_0)^2}{2} g''(z_0) + O(g'''(z_0)(z - z_0)^3).
Under mild conditions satisfied by our g(z) (e.g., z_0 is real and unique), the saddle point method leads to
D_{n,m}(M_0) = \frac{e^{g(z_0)}}{\sqrt{2\pi\, g''(z_0)}} \left(1 + O\!\left(\frac{g'''(z_0)}{(g''(z_0))^{\rho}}\right)\right)
for some ρ < 3/2.
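To illustrate the saddle-point recipe e^{g(z_0)}/\sqrt{2\pi g''(z_0)} on a textbook integrand rather than on \beta(z)^m itself (a minimal sketch of the method, not of the slide's specific computation): estimating [z^n] e^z = 1/n! with g(z) = z - (n+1)\ln z gives the saddle point z_0 = n+1 and recovers Stirling's formula.

```python
import math

def saddle_point_coeff(n):
    """Saddle-point estimate of [z^n] e^z via e^{g(z0)} / sqrt(2*pi*g''(z0))."""
    z0 = n + 1.0                          # solves g'(z) = 1 - (n+1)/z = 0
    g = z0 - (n + 1) * math.log(z0)       # g(z)  = z - (n+1) ln z
    g2 = (n + 1) / z0 ** 2                # g''(z) = (n+1)/z^2
    return math.exp(g) / math.sqrt(2 * math.pi * g2)

for n in (5, 10, 20):
    exact = 1 / math.factorial(n)         # [z^n] e^z = 1/n!
    print(f"n={n}: exact = {exact:.3e},  saddle-point estimate = {saddle_point_coeff(n):.3e}")
```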

23 In Pictures...
[Figure: complex plots of \beta(z)^m / z^{n+1} in the three regimes m = o(n), m = n, and n = o(m).]

24 Main Results for Large m
Theorem 1 (Orlitsky and Santhanam, 2004, and W.S. and Weinberger, 2010). For memoryless sources M_0 over an m-ary alphabet, with m → ∞ as n grows, we have:
(i) For m = o(n),
R^*_{n,m} = \frac{m-1}{2}\log\frac{n}{m} + \frac{m}{2}\log e + \frac{m \log e}{3}\sqrt{\frac{m}{n}} - \frac{1}{2} + O\!\left(\sqrt{\frac{m}{n}}\right).
(ii) For m = \alpha n + \ell(n), where α is a positive constant and \ell(n) = o(n),
R^*_{n,m} = n \log B_\alpha + \ell(n) \log C_\alpha - \frac{1}{2}\log A_\alpha + O(\ell(n)^2/n),
where C_\alpha := \frac{1}{2} + \frac{1}{2}\sqrt{1 + \frac{4}{\alpha}}, A_\alpha := C_\alpha + \frac{2}{\alpha}, and B_\alpha := \alpha\, C_\alpha^{\alpha+2}\, e^{-\frac{1}{C_\alpha}}.
(iii) For n = o(m),
R^*_{n,m} = n \log\frac{m}{n} + \frac{3 n^2}{2m}\log e - \frac{3 n}{2m}\log e + O\!\left(\frac{1}{n} + \frac{n^3}{m^2}\right).
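Part (iii) can be checked against exact values for small n and large m by grouping the type sum by integer partitions of n (a sketch of mine; the closed form in the comparison is the reconstruction of part (iii) given above, so treat the last column as indicative only):

```python
import math

def partitions(n, max_part=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for p in range(min(n, max_part), 0, -1):
        for rest in partitions(n - p, p):
            yield (p,) + rest

def log2_Dnm(n, m):
    """log2 of the Shtarkov sum for an m-ary memoryless class, summing over partitions of n."""
    total = 0.0
    for lam in partitions(n):
        r = len(lam)
        mult = {}
        for c in lam:
            mult[c] = mult.get(c, 0) + 1
        n_types = math.perm(m, r)                     # ordered choice of r distinct symbols...
        for cnt in mult.values():
            n_types //= math.factorial(cnt)           # ...modulo equal-count multiplicities
        seqs_per_type = math.factorial(n)
        for c in lam:
            seqs_per_type //= math.factorial(c)       # sequences sharing this type
        p_sup = math.prod((c / n) ** c for c in lam)  # ML probability of one such sequence
        total += n_types * seqs_per_type * p_sup
    return math.log2(total)

n = 8
for m in (100, 400, 1600):
    exact = log2_Dnm(n, m)
    approx = n * math.log2(m / n) + (3 * n * n / (2 * m) - 3 * n / (2 * m)) * math.log2(math.e)
    print(f"m={m}:  exact = {exact:.3f},  regime (iii) approximation = {approx:.3f}")
```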

25 Outline Update 1. Shannon Legacy 2. Analytic Information Theory 3. Source Coding: The Redundancy Rate Problem (a) Universal Memoryless Sources (b) Universal Markov Sources

27 Maximal Minimax for Markov Sources
Markov source over an alphabet A of size m (FINITE) with transition matrix P = \{p_{ij}\}_{i,j=1}^m:
P(x_1^n) = p_{11}^{k_{11}} \cdots p_{mm}^{k_{mm}} (e.g., for a binary string with transition counts k_{00}=1, k_{01}=2, k_{10}=2, k_{11}=2 one gets P(x_1^n) = p_{00}^{1} p_{01}^{2} p_{10}^{2} p_{11}^{2}),
where k_{ij} represents the number of pairs (ij) in x_1^n. Then
D_n(M_1) = \sum_{x_1^n} \sup_P p_{11}^{k_{11}} \cdots p_{mm}^{k_{mm}} = \sum_{k \in P_n(m)} |T_n(k)| \left(\frac{k_{11}}{k_1}\right)^{k_{11}} \cdots \left(\frac{k_{mm}}{k_m}\right)^{k_{mm}},
where P_n(m) denotes the set of Markov types (i.e., k = [k_{ij}]), T_n(k) := T_n(x_1^n) is the set of sequences of type k = [k_{ij}], and k_i = \sum_j k_{ij}.
For circular strings the balance matrix k = [k_{ij}] satisfies
\sum_{1 \le i,j \le m} k_{ij} = n, \qquad \sum_{j=1}^m k_{ij} = \sum_{j=1}^m k_{ji} \text{ for all } i: \qquad F_n(m).
For example, for m = 3:
k_{11}+k_{12}+k_{13}+k_{21}+k_{22}+k_{23}+k_{31}+k_{32}+k_{33} = n,
k_{12}+k_{13} = k_{21}+k_{31},
k_{12}+k_{32} = k_{21}+k_{23},
k_{13}+k_{23} = k_{31}+k_{32}.
Matrices satisfying F_n(m) are called balanced matrices.
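A tiny illustration for the binary alphabet (my own enumeration sketch, with hypothetical helper names): listing the transition-count matrices of all circular binary strings of length n yields the Markov types P_n(2), and every one of them satisfies the balance (flow-conservation) condition, which for m = 2 reduces to k_{01} = k_{10}.

```python
from itertools import product

def circular_type(x):
    """2x2 matrix of transition counts k_ij of the circular binary string x."""
    n = len(x)
    k = [[0, 0], [0, 0]]
    for t in range(n):
        k[x[t]][x[(t + 1) % n]] += 1
    return tuple(map(tuple, k))

for n in (4, 6, 8, 10):
    types = {circular_type(x) for x in product((0, 1), repeat=n)}
    assert all(k[0][1] == k[1][0] for k in types)   # balance: k_01 == k_10
    print(f"n = {n}:  |P_n(2)| = {len(types)} Markov types")
```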

29 Markov Types and Eulerian Cycles
Example: Let A = {0,1} and k = [ ... ] (the example balance matrix and its digraph appear in a figure omitted here).
P_n(m): Markov types, but also... the set of all connected Eulerian digraphs G = (V(G), E(G)) such that V(G) ⊆ A and |E(G)| = n.
E_n(m): the set of connected Eulerian digraphs on A.
F_n(m): balanced matrices, but also... the set of (not necessarily connected) Eulerian digraphs on A.
Asymptotic equivalence: |P_n(m)| = |F_n(m)| + O(n^{m^2-3m+3}) \sim |E_n(m)|.

31 Markov Types: Main Results
Theorem 2 (Knessl, Jacquet, and W.S., 2010).
(i) For fixed m, as n → ∞,
|P_n(m)| = d(m) \frac{n^{m^2-m}}{(m^2-m)!} + O(n^{m^2-m-1}),
where
d(m) = \frac{1}{(2\pi)^{m-1}} \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{(m-1)\text{-fold}} \prod_{j=1}^{m-1} \frac{1}{1+\varphi_j^2} \prod_{k<l} \frac{1}{1+(\varphi_k - \varphi_l)^2}\, d\varphi_1\, d\varphi_2 \cdots d\varphi_{m-1}.
(ii) When m → ∞, provided that m^4 = o(n),
|P_n(m)| \sim \frac{\sqrt{2}\, m^{3m/2}\, e^{m^2}}{m^{2m^2}\, 2^m\, \pi^{m/2}}\, n^{m^2-m}.
Some numerics: |P_n(3)| \sim \frac{1}{12}\frac{n^6}{6!}, \quad |P_n(4)| \sim \frac{1}{96}\frac{n^{12}}{12!}, \quad |P_n(5)| \sim d(5)\,\frac{n^{20}}{20!}.
Open Question: Counting types and sequences of a given type in a Markov field. (What is the analogue of |P_n(2)| there?)

32 Outline Update 1. Shannon Information Theory 2. Source Coding 3. The Redundancy Rate Problem 4. Post-Shannon Information 5. NSF Science and Technology Center

33 Post-Shannon Challenges: Science of Information
Classical Information Theory needs a recharge to meet the new challenges of today's applications in biology, modern communication, knowledge extraction, economics and physics,.... We need to extend Shannon information theory to include new aspects of information such as structure, time, space, and semantics, and others such as dynamic information, limited resources, complexity, physical information, representation-invariant information, and cooperation & dependency.

34 Time & Space
1. Coding Rate for Finite Blocklength (Polyanskiy & Verdu):
\frac{1}{n}\log M^*(n, \varepsilon) \approx C - \sqrt{\frac{V}{n}}\, Q^{-1}(\varepsilon),
where C is the capacity, V is the channel dispersion, n is the block length, ε is the error probability, and Q is the complementary Gaussian distribution function.
Time & Space: Classical Information Theory is at its weakest in dealing with problems of delay (e.g., information arriving late may be useless or have less value).
2. Speed of Information in DTN: Jacquet et al. Real-Time Communication over Unreliable Wireless Channels with Hard Delay Constraints: P.R. Kumar & Hou, 2011.
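As a worked instance of the normal approximation (my own illustrative choice of channel; the slide states only the general formula), the sketch below evaluates it for a binary symmetric channel with crossover probability p, using the standard expressions C = 1 - h(p) and V = p(1-p)\log^2\frac{1-p}{p} bits^2 per use.

```python
import math
from statistics import NormalDist

def bsc_normal_approx(p, n, eps):
    """Normal approximation (1/n) log2 M*(n, eps) ~ C - sqrt(V/n) * Qinv(eps) for a BSC(p)."""
    C = 1 + p * math.log2(p) + (1 - p) * math.log2(1 - p)        # capacity, bits/use
    V = p * (1 - p) * (math.log2((1 - p) / p)) ** 2              # dispersion, bits^2/use
    Qinv = NormalDist().inv_cdf(1 - eps)                         # Q^{-1}(eps)
    return C - math.sqrt(V / n) * Qinv

p, eps = 0.11, 1e-3
for n in (100, 500, 2000):
    print(f"n = {n}:  achievable rate ~ {bsc_normal_approx(p, n, eps):.4f} bits/use")
```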

35 Limited Resources, Representation, and Cooperation
Limited Computational Resources: In many scenarios, information is limited by available computational resources (e.g., cell phone, living cell). (Hellman & Cover, 1970, Learning with Limited Memory.)
Representation-invariance of information: How to know whether two representations of the same information are information equivalent?
Cooperation: Often subsystems may be in conflict (e.g., denial of service) or in collusion (e.g., price fixing). How does cooperation impact information? (In wireless networks nodes should cooperate in their own self-interest.) (Cuff & Cover, IT, 2010.)

37 Structure
Structure: Measures are needed for quantifying information embodied in structures (e.g., material structures, nanostructures, biomolecules, gene regulatory networks, protein interaction networks, social networks, financial transactions).
Information Content of Unlabeled Graphs: A random structure model S of a graph G is defined for an unlabeled version. Some labeled graphs have the same structure.
[Figure: labeled graphs G_1, ..., G_8 grouped into their unlabeled structures S_1, ..., S_4.]
H_G = E[-\log P(G)] = -\sum_{G \in \mathcal{G}} P(G)\log P(G), \qquad H_S = E[-\log P(S)] = -\sum_{S \in \mathcal{S}} P(S)\log P(S).

39 Automorphism and Erdös-Rényi Graph Model
Graph Automorphism: For a graph G, an automorphism is an adjacency-preserving permutation of the vertices of G. [Figure: a small example graph on vertices a, b, c, d, e.]
Erdös and Rényi model: G(n, p) generates graphs with n vertices, where edges are chosen independently with probability p. If G has k edges, then P(G) = p^k (1-p)^{\binom{n}{2} - k}.
Theorem 3 (Y. Choi and W.S., 2008). For large n and all p satisfying \frac{\ln n}{n} \ll p and 1 - p \gg \frac{\ln n}{n} (i.e., the graph is connected w.h.p.),
H_S = \binom{n}{2} h(p) - \log n! + o(1) = \binom{n}{2} h(p) - n\log n + n\log e - \frac{1}{2}\log n + O(1),
where h(p) = -p\log p - (1-p)\log(1-p) is the binary entropy.
AEP for structures: 2^{-\binom{n}{2}(h(p)+\varepsilon) + \log n!} \le P(S) \le 2^{-\binom{n}{2}(h(p)-\varepsilon) + \log n!}.
Sketch of Proof:
1. H_S = H_G - \log n! + \sum_{S \in \mathcal{S}} P(S)\log |\mathrm{Aut}(S)|.
2. \sum_{S \in \mathcal{S}} P(S)\log |\mathrm{Aut}(S)| = o(1) by asymmetry of G(n,p).
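The identity in step 1 of the proof sketch can be verified exactly by exhaustive enumeration for a tiny graph size; the brute-force check below (not the SZIP algorithm, and with n = 4, p = 0.3 chosen arbitrarily) groups all labeled graphs into isomorphism classes, computes H_G, H_S and the automorphism term, and confirms that H_S = H_G - \log n! + \sum_S P(S)\log|\mathrm{Aut}(S)|.

```python
import math
from itertools import combinations, permutations, product

n, p = 4, 0.3                          # tiny illustrative size and edge probability
V = range(n)
pairs = list(combinations(V, 2))

def relabel(edges, perm):
    return frozenset(frozenset((perm[u], perm[v])) for u, v in edges)

structures = {}                        # canonical form -> list of labeled graphs
for bits in product((0, 1), repeat=len(pairs)):
    edges = frozenset(frozenset(e) for e, b in zip(pairs, bits) if b)
    canon = min(tuple(sorted(tuple(sorted(x)) for x in relabel(edges, perm)))
                for perm in permutations(V))
    structures.setdefault(canon, []).append(edges)

def P_graph(edges):
    k = len(edges)
    return p ** k * (1 - p) ** (len(pairs) - k)

H_G = -sum(P_graph(g) * math.log2(P_graph(g))
           for graphs in structures.values() for g in graphs)
H_S, aut_term = 0.0, 0.0
for graphs in structures.values():
    PS = sum(P_graph(g) for g in graphs)
    H_S -= PS * math.log2(PS)
    aut = math.factorial(n) // len(graphs)    # |Aut(S)| = n! / (number of labeled copies)
    aut_term += PS * math.log2(aut)

print(f"H_G = {H_G:.4f},  H_S = {H_S:.4f}")
print(f"H_G - log n! + sum P(S) log|Aut(S)| = "
      f"{H_G - math.log2(math.factorial(n)) + aut_term:.4f}")
```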

40 Structural Zip (SZIP) Algorithm

41 Asymptotic Optimality of SZIP
Theorem 4 (Choi, W.S., 2008). Let L(S) be the length of the code.
(i) For large n,
E[L(S)] \le \binom{n}{2} h(p) - n\log n + n(c + \Phi(\log n)) + o(n),
where h(p) = -p\log p - (1-p)\log(1-p), c is an explicitly computable constant, and Φ(x) is a fluctuating function with a small amplitude or zero.
(ii) Furthermore, for any ε > 0, P(L(S) - E[L(S)] \le \varepsilon n\log n) \ge 1 - o(1).
(iii) The algorithm runs in O(n + e) time on average, where e is the number of edges.
Table 1: The length of encodings (in bits) for real-world networks; columns: # of nodes, # of edges, our algorithm, adjacency matrix, adjacency list, arithmetic coding.
US Airports: 332 nodes, 2,126 edges; our algorithm 8,118; adjacency matrix 54,946; adjacency list 38,268; arithmetic coding 12,991.
Protein interaction (Yeast): 2,361 nodes, 6,646 edges; our algorithm 46,912; adjacency matrix 2,785,980; adjacency list 231,504; arithmetic coding 67,488.
Collaboration (Geometry): 6,167 nodes, 21,... edges; our algorithm 115,365; adjacency matrix 19,012,861; adjacency list ...; arithmetic coding 241,811.
Collaboration (Erdös): 6,935 nodes, 11,857 edges; our algorithm 62,617; adjacency matrix 24,043,645; adjacency list ...; arithmetic coding 147,377.
Genetic interaction (Human): 8,605 nodes, 26,... edges; our algorithm 221,199; adjacency matrix 37,018,710; adjacency list ...; arithmetic coding 310,569.
Internet (AS level): 25,881 nodes, 52,... edges; our algorithm ...; adjacency matrix 334,900,140; adjacency list 1,572,...; arithmetic coding 396,060.

42 Outline Update 1. Shannon Information Theory 2. Source Coding 3. The Redundancy Rate Problem 4. Post-Shannon Information 5. NSF Science and Technology Center on Science of Information

43 NSF Center for Science of Information
In 2010 the National Science Foundation established the $25M Science and Technology Center for Science of Information (http://soihub.org) to advance science and technology through a new quantitative understanding of the representation, communication and processing of information in biological, physical, social and engineering systems. The center is located at Purdue University, and partner institutions include Berkeley, MIT, Princeton, Stanford, UIUC, Bryn Mawr and Howard U.
Specific Center's Goals: define core theoretical principles governing transfer of information; develop meters and methods for information; apply to problems in physical and social sciences, and engineering; offer a venue for multi-disciplinary long-term collaborations; transfer advances in research to education and industry.

44

45 That's IT. THANK YOU.
