Rooted Trees with Probabilities Revisited
Rooted Trees with Probabilities Revisited

Georg Böcherer
Institute for Communications Engineering
Technische Universität München

arXiv: v1 [cs.IT] 4 Feb 2013
February 5, 2013

Abstract

Rooted trees with probabilities are convenient for representing a class of random processes with memory. They allow one to describe and analyze variable-length codes for data compression and distribution matching. In this work, the Leaf-Average Node-Sum Interchange Theorem (LANSIT) and its well-known applications to path length and leaf entropy are re-stated. The LANSIT is then applied to informational divergence. Next, the differential LANSIT is derived, which allows normalized functionals of leaf distributions to be written as averages of functionals of branching distributions. Joint distributions of random variables and the corresponding conditional distributions are special cases of leaf distributions and branching distributions. Using the differential LANSIT, Pinsker's inequality is formulated for rooted trees with probabilities, with an application to the approximation of product distributions. In particular, it is shown that if the normalized informational divergence between a distribution and a product distribution approaches zero, then the entropy rate approaches the entropy rate of the product distribution.

1 / 34 - 2 / 34
Probability notation

- Random variable X, taking values in the alphabet X.
- Distribution P_X: for each a ∈ X: P_X(a) := Pr(X = a).
- Support: supp P_X := {a ∈ X : P_X(a) > 0}.

3 / 34

Rooted Trees with Probabilities [1, 2, 3]

- L: set of leaves. L: random variable over L.
- We identify supp P_L = L, i.e., a node is a leaf of a tree if it has no successors and is generated with non-zero probability.
- N: set of all nodes on paths through the tree; root 0 ∈ N.
- B := N \ L: set of branching nodes.
- L_j: leaves below node j ∈ N. For j ∈ L, L_j = {j}.
- S_j: successors of node j. S_j: random variable over the successors of node j. We identify S_j = supp P_{S_j}.

4 / 34
Node Probabilities

We associate with each node j ∈ N a probability

Q_j = Σ_{i ∈ L_j} P_L(i).   (1)

The probabilities of the successors of node j are given by

Q_i = Q_j P_{S_j}(i),  i ∈ S_j.   (2)

5 / 34

Example

[Tree figure: root 0 with successors 1 and 2; node 1 with successor 3; node 3 with successors 5 and 6. Leaf probabilities Q_5 = P_L(5) = 1/2, Q_6 = P_L(6) = 1/4, Q_2 = P_L(2) = 1/4; node probabilities Q_0 = 1, Q_1 = Q_3 = 3/4.]

supp P_L = L = {2, 5, 6}
N = {0, 1, 2, 3, 5, 6}
B = N \ L = {0, 1, 3}
supp P_{S_1} = S_1 = {3},  L_1 = {5, 6}
Q_1 = Σ_{i ∈ L_1} P_L(i) = 1/2 + 1/4 = 3/4
Q_3 = Q_1 P_{S_1}(3) = 3/4 · 1 = 3/4

6 / 34
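The relations (1) and (2) are easy to check numerically. The following Python sketch rebuilds the example tree above; the dictionary representation and the derived successor probabilities are an illustrative reconstruction of the figure, not code from the paper.

```python
from fractions import Fraction as F

# Reconstructed example tree from Slide 6: branching node -> list of successors.
children = {0: [1, 2], 1: [3], 3: [5, 6]}
P_L = {2: F(1, 4), 5: F(1, 2), 6: F(1, 4)}   # leaf distribution P_L

def leaves_below(j):
    """Return L_j, the set of leaves below node j."""
    if j not in children:                     # j has no successors, so it is a leaf
        return {j}
    return set().union(*(leaves_below(i) for i in children[j]))

# Node probabilities, eq. (1): Q_j is the sum of P_L(i) over the leaves i below j.
nodes = {0, 1, 2, 3, 5, 6}
Q = {j: sum(P_L[i] for i in leaves_below(j)) for j in nodes}
print(Q)        # Q_0 = 1, Q_1 = Q_3 = 3/4, Q_5 = 1/2, Q_2 = Q_6 = 1/4

# Successor distributions, eq. (2) rearranged: P_{S_j}(i) = Q_i / Q_j.
P_S = {j: {i: Q[i] / Q[j] for i in children[j]} for j in children}
print(P_S[3])   # {5: 2/3, 6: 1/3}
```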
Leaf-Average Node-Sum Interchange Theorem (LANSIT)

LANSIT [1, Theorem 1]

Let f be a function that assigns to each node j ∈ N a real value f(j). For each j ∈ N \ {0}, define

Δf(j) := f(j) - f(predecessor of j).

Proposition 1 (LANSIT)

E[f(L)] - f(0) = Σ_{j ∈ B} Q_j E[Δf(S_j)].   (3)

8 / 34
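As a sanity check, Proposition 1 can be verified numerically on the example tree; the sketch below (using the same assumed tree representation as before) compares both sides of (3) for arbitrary node functions f.

```python
from fractions import Fraction as F

children = {0: [1, 2], 1: [3], 3: [5, 6]}    # reconstructed example tree (Slide 6)
Q = {0: F(1), 1: F(3, 4), 2: F(1, 4), 3: F(3, 4), 5: F(1, 2), 6: F(1, 4)}
leaves = [2, 5, 6]

def lansit_both_sides(f):
    """Return (E[f(L)] - f(0), sum over branching nodes j of Q_j E[Delta f(S_j)])."""
    lhs = sum(Q[i] * f(i) for i in leaves) - f(0)
    rhs = sum(Q[j] * sum((Q[i] / Q[j]) * (f(i) - f(j)) for i in children[j])
              for j in children)
    return lhs, rhs

# Both sides of (3) agree for arbitrary node functions f.
print(lansit_both_sides(lambda j: F(j * j)))
print(lansit_both_sides(lambda j: F(1) if j in leaves else F(0)))
```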
Proof of LANSIT

Consider a tree with leaves L. Let S_j ⊆ L be a set of leaves with a common predecessor j.

Σ_{i ∈ S_j} P_L(i) f(i)
  (a)= Σ_{i ∈ S_j} Q_j P_{S_j}(i) [f(i) - f(j) + f(j)]   (4)
     = Q_j f(j) Σ_{i ∈ S_j} P_{S_j}(i) + Q_j Σ_{i ∈ S_j} P_{S_j}(i) Δf(i)   (5)
     = Q_j f(j) + Q_j E[Δf(S_j)]   (6)

where (a) follows from (2) and where the first sum in (5) equals 1. Replacing the leaves S_j by their predecessor j yields a new tree with leaf set {j} ∪ L \ S_j, i.e., with a reduced number of leaves, and with P_L(j) = Q_j. Repeat the procedure until j is the root node 0. Then Q_j = 1 and Q_j f(j) = f(0).

9 / 34

Path Length Lemma [3, Lemma 2.1]

Function w(j) := length of the path from the root to node j. For each j ∈ N \ {0}: Δw(j) = 1, and w(0) = 0.

Proposition 2 (Path Length Lemma)

E[w(L)] = Σ_{j ∈ B} Q_j.   (7)

10 / 34
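The path length lemma (7) is a direct instance of the LANSIT with f = w; a quick numerical check on the assumed example tree:

```python
from fractions import Fraction as F

children = {0: [1, 2], 1: [3], 3: [5, 6]}    # reconstructed example tree (Slide 6)
Q = {0: F(1), 1: F(3, 4), 2: F(1, 4), 3: F(3, 4), 5: F(1, 2), 6: F(1, 4)}
leaves = [2, 5, 6]

def depth(j):
    """w(j): length of the path from the root to node j."""
    for parent, successors in children.items():
        if j in successors:
            return depth(parent) + 1
    return 0                                  # j is the root

expected_path_length = sum(Q[i] * depth(i) for i in leaves)   # E[w(L)]
branching_sum = sum(Q[j] for j in children)                   # sum of Q_j, j in B
print(expected_path_length, branching_sum)                    # both equal 5/2
```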
Leaf Entropy Lemma [3, Lemma 2.2]

Function f(i) = -log2 Q_i.

Proposition 3 (Leaf Entropy Lemma)

H(P_L) = Σ_{j ∈ B} Q_j H(P_{S_j}).   (8)

Proof. Note that f(0) = -log2 Q_0 = 0, so

H(P_L) = E[-log2 P_L(L)] = E[f(L)]
  (a)= Σ_{j ∈ B} Q_j E[Δf(S_j)]   (9)
     = Σ_{j ∈ B} Q_j E[-log2 (Q_{S_j} / Q_j)]   (10)
  (b)= Σ_{j ∈ B} Q_j E[-log2 P_{S_j}(S_j)]   (11)
     = Σ_{j ∈ B} Q_j H(P_{S_j})   (12)

where (a) follows by the LANSIT and (b) by (2).

11 / 34

Informational Divergence

Let P'_L be a second leaf distribution on the same tree, with node probabilities Q'_j. Function f(i) = log2 (Q_i / Q'_i).

Proposition 4

D(P_L ‖ P'_L) = Σ_{j ∈ B} Q_j D(P_{S_j} ‖ P'_{S_j}).   (13)

Proof.

D(P_L ‖ P'_L) = E[log2 (P_L(L) / P'_L(L))]
  (a)= Σ_{j ∈ B} Q_j E[Δf(S_j)]   (14)
     = Σ_{j ∈ B} Q_j E[log2 ((Q_{S_j} Q'_j) / (Q_j Q'_{S_j}))]   (15)
  (b)= Σ_{j ∈ B} Q_j E[log2 (P_{S_j}(S_j) / P'_{S_j}(S_j))]   (16)
     = Σ_{j ∈ B} Q_j D(P_{S_j} ‖ P'_{S_j})   (17)

where (a) follows from the LANSIT and (b) by (2).

12 / 34
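Both decompositions (8) and (13) can be confirmed numerically. In the sketch below, P'_L is an arbitrary second leaf distribution chosen for illustration, on the same assumed example tree.

```python
import math
from fractions import Fraction as F

children = {0: [1, 2], 1: [3], 3: [5, 6]}    # reconstructed example tree (Slide 6)
leaves = [2, 5, 6]
P_L  = {2: F(1, 4), 5: F(1, 2), 6: F(1, 4)}  # leaf distribution P_L
P_L2 = {2: F(1, 2), 5: F(1, 4), 6: F(1, 4)}  # an arbitrary second leaf distribution P'_L

def node_probs(P):
    """Compute all node probabilities Q_j from a leaf distribution, eq. (1)."""
    Q = dict(P)
    def fill(j):
        if j in children:
            Q[j] = sum(fill(i) for i in children[j])
        return Q[j]
    fill(0)
    return Q

Q, Q2 = node_probs(P_L), node_probs(P_L2)

# Leaf Entropy Lemma, eq. (8): H(P_L) = sum_j Q_j H(P_{S_j}).
H_leaf = -sum(p * math.log2(p) for p in P_L.values())
H_branch = sum(Q[j] * -sum((Q[i] / Q[j]) * math.log2(Q[i] / Q[j]) for i in children[j])
               for j in children)
print(H_leaf, H_branch)          # both 1.5 bits

# Informational divergence, eq. (13): D(P_L || P'_L) = sum_j Q_j D(P_{S_j} || P'_{S_j}).
D_leaf = sum(P_L[i] * math.log2(P_L[i] / P_L2[i]) for i in leaves)
D_branch = sum(Q[j] * sum((Q[i] / Q[j]) * math.log2((Q[i] / Q[j]) / (Q2[i] / Q2[j]))
                          for i in children[j])
               for j in children)
print(D_leaf, D_branch)          # both 0.25 bits
```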
Random Vectors

Remark. If all paths in a tree have the same length n, then P_L can be thought of as a joint distribution P_{X^n} of a random vector X^n = (X_1, ..., X_n). In this case, Prop. 3 and Prop. 4 are the chain rules for entropy and informational divergence, respectively.

13 / 34

Differential LANSIT
Differential LANSIT

B: random variable over the branching nodes B. Define

P_B(j) = Q_j / E[w(L)],  j ∈ B.   (18)

By the path length lemma,

Σ_{j ∈ B} P_B(j) = Σ_{j ∈ B} Q_j / E[w(L)] = 1,   (19)

so P_B defines a distribution over B.

Proposition 5 (Differential LANSIT)

(E[f(L)] - f(0)) / E[w(L)] = E[Δf(S_B)].   (20)

Note that the expectation on the right-hand side is over both B ~ P_B and the successor variable S_B.

15 / 34

Example

Consider the path length function w. By the Differential LANSIT,

E[Δw(S_B)] = (E[w(L)] - w(0)) / E[w(L)] = 1.   (21)

16 / 34
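The branching-node distribution P_B and the differential LANSIT (20) can again be checked on the assumed example tree; the last line reproduces the path length identity (21).

```python
from fractions import Fraction as F

children = {0: [1, 2], 1: [3], 3: [5, 6]}    # reconstructed example tree (Slide 6)
Q = {0: F(1), 1: F(3, 4), 2: F(1, 4), 3: F(3, 4), 5: F(1, 2), 6: F(1, 4)}
leaves = [2, 5, 6]

E_w = sum(Q[j] for j in children)            # E[w(L)] by the path length lemma
P_B = {j: Q[j] / E_w for j in children}      # branching-node distribution, eq. (18)
assert sum(P_B.values()) == 1                # eq. (19)

def differential_lansit(f):
    """Return both sides of eq. (20) for a node function f."""
    lhs = (sum(Q[i] * f(i) for i in leaves) - f(0)) / E_w
    rhs = sum(P_B[j] * sum((Q[i] / Q[j]) * (f(i) - f(j)) for i in children[j])
              for j in children)
    return lhs, rhs

print(differential_lansit(lambda j: F(j)))   # both sides agree for arbitrary f

# Path length example, eq. (21): with f = w, both sides equal 1.
depth = {0: 0, 1: 1, 2: 1, 3: 2, 5: 3, 6: 3}
print(differential_lansit(lambda j: F(depth[j])))
```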
Entropy Rate

Function f(i) = -log2 Q_i.

Proposition 6

H(P_L) / E[w(L)] = E[H(P_{S_B})].   (22)

Proof.

H(P_L) / E[w(L)] = E[-log2 P_L(L)] / E[w(L)]
  (a)= E[-log2 Q_{S_B} + log2 Q_B]   (23)
     = E[-log2 (Q_{S_B} / Q_B)]   (24)
  (b)= E[-log2 P_{S_B}(S_B)]   (25)
     = E[H(P_{S_B})]   (26)

where (a) follows by the differential LANSIT and (b) by (2).

17 / 34

Normalized Informational Divergence

Function f(i) = log2 (Q_i / Q'_i).

Proposition 7

D(P_L ‖ P'_L) / E[w(L)] = E[D(P_{S_B} ‖ P'_{S_B})].   (27)

Proof.

D(P_L ‖ P'_L) / E[w(L)]
  (a)= E[log2 ((Q_{S_B} Q'_B) / (Q_B Q'_{S_B}))]   (28)
  (b)= E[log2 (P_{S_B}(S_B) / P'_{S_B}(S_B))]   (29)
     = E[D(P_{S_B} ‖ P'_{S_B})]   (30)

where (a) follows by the differential LANSIT and (b) by (2).

18 / 34
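Proposition 6 says that the leaf entropy, normalized by the expected path length, equals the P_B-average of the branching entropies. A short numerical check on the assumed example tree:

```python
import math
from fractions import Fraction as F

children = {0: [1, 2], 1: [3], 3: [5, 6]}    # reconstructed example tree (Slide 6)
Q = {0: F(1), 1: F(3, 4), 2: F(1, 4), 3: F(3, 4), 5: F(1, 2), 6: F(1, 4)}
leaves, E_w = [2, 5, 6], F(5, 2)             # E[w(L)] = 5/2 for this tree

def H(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(float(p) * math.log2(p) for p in dist.values() if p > 0)

P_B = {j: Q[j] / E_w for j in children}      # branching-node distribution

# Prop. 6: the normalized leaf entropy equals the P_B-average branching entropy.
lhs = H({i: Q[i] for i in leaves}) / E_w
rhs = sum(P_B[j] * H({i: Q[i] / Q[j] for i in children[j]}) for j in children)
print(lhs, rhs)                              # both 0.6 bits per tree step
```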
Pinsker's Inequality for Trees

Variational distance

P_X, P_Y: two distributions on X. The variational distance d(P_X, P_Y) is

d(P_X, P_Y) := Σ_{a ∈ X} |P_X(a) - P_Y(a)|.   (31)

Bounds:

d(P_X, P_Y) ≥ 0, with equality iff ∀ a ∈ X: P_X(a) = P_Y(a).   (32)
d(P_X, P_Y) ≤ 2, with equality iff supp P_X ∩ supp P_Y = ∅.   (33)

20 / 34
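A minimal helper for the variational distance (31), together with checks of the bounds (32) and (33); the example distributions are arbitrary.

```python
def variational_distance(P, Q):
    """d(P, Q) = sum over the alphabet of |P(a) - Q(a)|, eq. (31)."""
    alphabet = set(P) | set(Q)
    return sum(abs(P.get(a, 0.0) - Q.get(a, 0.0)) for a in alphabet)

P = {'a': 0.5, 'b': 0.5}
print(variational_distance(P, P))                        # 0.0, lower bound (32)
print(variational_distance(P, {'c': 1.0}))               # 2.0, disjoint supports (33)
print(variational_distance(P, {'a': 0.25, 'b': 0.75}))   # 0.5
```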
Approximating Distributions

Set of distributions over X: P_X.

Proposition 8

i. Pinsker's Inequality:

   D(P_X ‖ P_Y) ≥ (1 / (2 ln 2)) d²(P_X, P_Y).   (34)

ii. Let {P_{X_k}}_{k=1}^∞ be a sequence of distributions in P_X. Then

   D(P_{X_k} ‖ P_Y) → 0 (k → ∞)  ⇒  d(P_{X_k}, P_Y) → 0 (k → ∞).   (35)

iii. Let g be a function on P_X that is continuous in P_Y. Then

   D(P_{X_k} ‖ P_Y) → 0 (k → ∞)  ⇒  |g(P_{X_k}) - g(P_Y)| → 0 (k → ∞).   (36)

21 / 34

Example: Entropy

By [4, Lemma 2.7], entropy is continuous in any distribution P_Y ∈ P_X. Thus

D(P_{X_k} ‖ P_Y) → 0 (k → ∞)  ⇒  |H(P_{X_k}) - H(P_Y)| → 0 (k → ∞).   (37)

22 / 34
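The sketch below illustrates Proposition 8 numerically for an assumed sequence of binary distributions P_{X_k} approaching P_Y: Pinsker's bound (34) holds in every row, and both the variational distance (35) and the entropy gap (37) vanish along with the divergence.

```python
import math

def D(P, Q):      # informational divergence in bits
    return sum(p * math.log2(p / Q[a]) for a, p in P.items() if p > 0)

def d(P, Q):      # variational distance, eq. (31)
    return sum(abs(P.get(a, 0) - Q.get(a, 0)) for a in set(P) | set(Q))

def H(P):         # entropy in bits
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

P_Y = {'0': 0.5, '1': 0.5}
for k in range(1, 6):
    eps = 0.5 ** k
    P_Xk = {'0': 0.5 + eps / 2, '1': 0.5 - eps / 2}
    # Pinsker (34): first column >= second column; (35) and (37): all columns shrink.
    print(D(P_Xk, P_Y), d(P_Xk, P_Y) ** 2 / (2 * math.log(2)), abs(H(P_Xk) - H(P_Y)))
```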
Product Distributions

Consider a tree and let P_S be a branching distribution. Assign¹ P_{S_j} = P_S for all branching nodes j ∈ B. We call the resulting node probabilities the product distribution P_S⁺.

For any complete tree with leaves L, P_S⁺ defines a leaf distribution, i.e., Σ_{i ∈ L} P_S⁺(i) = 1.

For any (possibly non-complete) tree with leaves L, we define the informational divergence between the leaf distribution P_L and P_S⁺ as

D(P_L ‖ P_S⁺) := Σ_{i ∈ L} P_L(i) log2 (P_L(i) / P_S⁺(i)).   (38)

¹ This is a slight abuse of notation, since for j ≠ i, S_j ≠ S_i. However, we can think of P_{S_j} as a distribution over branch labels. For example, for a binary tree, P_{S_j} is then a distribution over the labels {0, 1} for all j, and the assignment P_{S_j} = P_S is meaningful.

23 / 34

Approximating Distributions on Trees

Proposition 9

i. Pinsker's Inequality for Trees:

   D(P_L ‖ P'_L) / E[w(L)] ≥ (1 / (2 ln 2)) E[d²(P_{S_B}, P'_{S_B})].   (39)

ii. For any ε > 0,

   D(P_L ‖ P'_L) / E[w(L)] → 0  ⇒  Pr[d(P_{S_B}, P'_{S_B}) ≥ ε] → 0.   (40)

iii. Let P_S be a branching distribution and let g be a function on the branching distributions that is bounded and continuous in P_S. Then

   D(P_L ‖ P_S⁺) / E[w(L)] → 0  ⇒  |E[g(P_{S_B})] - g(P_S)| → 0.   (41)

Proof. See Slides 27-33.

24 / 34
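The construction of P_S⁺ multiplies the same branching distribution along each path. Below is a small sketch for an assumed complete binary tree of depth 2, where leaves are identified with their branch-label strings; the particular P_S and P_L are arbitrary choices for illustration.

```python
import math

# Assumed example: a complete binary tree of depth 2, leaves identified with
# their branch-label strings. P_S is a branching distribution over the labels
# {0, 1} (Slide 23); P_L is an arbitrary leaf distribution for illustration.
P_S = {'0': 0.75, '1': 0.25}
P_L = {'00': 0.5, '01': 0.2, '10': 0.2, '11': 0.1}

def P_S_plus(leaf):
    """Product distribution P_S^+ of a leaf: product of P_S along its path."""
    p = 1.0
    for label in leaf:
        p *= P_S[label]
    return p

# Informational divergence between P_L and P_S^+, eq. (38).
D = sum(p * math.log2(p / P_S_plus(leaf)) for leaf, p in P_L.items() if p > 0)
print(D)   # all paths have length 2 here, so this equals D(P_{X^2} || P_S^2)
```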
Entropy Rate

By Prop. 6,

H(P_L) / E[w(L)] = E[H(P_{S_B})].   (42)

H is continuous and bounded. Thus, by Prop. 9iii, we have the following proposition.

Proposition 10

D(P_L ‖ P_S⁺) / E[w(L)] → 0  ⇒  |H(P_L) / E[w(L)] - H(P_S)| → 0.   (43)

25 / 34

Random Vectors

Remark (see also Slide 13). If all paths in a tree have the same length n, then P_L can be thought of as a joint distribution P_{X^n} of a random vector X^n = (X_1, ..., X_n). The (tree) product distribution P_S⁺ is then the conventional product distribution P_S^n. Prop. 9 applies and, in particular, Prop. 10 becomes

D(P_{X^n} ‖ P_S^n) / n → 0 (n → ∞)  ⇒  |H(P_{X^n}) / n - H(P_S)| → 0 (n → ∞).   (44)

26 / 34
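For the random-vector case (44), a simple illustration: take an assumed family P_{X^n} that is itself a product of Bernoulli(p_n) marginals with p_n → 1/2, and P_S = Bernoulli(1/2). Then the normalized divergence reduces to the single-letter divergence and vanishes, and the entropy rate approaches H(P_S) = 1 bit.

```python
import math

def D_binary(p, q):   # divergence between Bernoulli(p) and Bernoulli(q), in bits
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def H_binary(p):      # binary entropy function, in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Assumed family: P_{X^n} = Bernoulli(p_n)^n with p_n = 1/2 + 1/(2(n+1)).
# Then D(P_{X^n} || P_S^n)/n = D_binary(p_n, 1/2) -> 0 and
# H(P_{X^n})/n = H_binary(p_n) -> H(P_S) = 1 bit, illustrating (44).
for n in (1, 10, 100, 1000):
    p_n = 0.5 + 1 / (2 * (n + 1))
    print(n, D_binary(p_n, 0.5), abs(H_binary(p_n) - 1.0))
```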
Proof of Prop. 9i

D(P_L ‖ P'_L) / E[w(L)]
  (a)= E[D(P_{S_B} ‖ P'_{S_B})]   (45)
  (b)≥ (1 / (2 ln 2)) E[d²(P_{S_B}, P'_{S_B})]   (46)

where (a) follows by Prop. 7 and where (b) follows by Pinsker's inequality.

27 / 34

Proof of Prop. 9ii

Suppose E[d(P_{S_B}, P'_{S_B})] < ε² for some ε > 0. Then

Pr[d(P_{S_B}, P'_{S_B}) ≥ ε]
  (a)≤ E[d(P_{S_B}, P'_{S_B})] / ε   (47)
     < ε² / ε   (48)
     = ε   (49)

where (a) follows by Markov's inequality [3, Theo. A.2]. Together with statement i., statement ii. follows.

28 / 34
Proof of Prop. 9iii (1)

By assumption, g is bounded and continuous in P_S. By boundedness, there exists a value g_max < ∞ such that

∀ j: |g(P_{S_j}) - g(P_S)| ≤ g_max.   (50)

By continuity, we know that

∀ δ > 0 ∃ ε_δ ∀ ε < ε_δ:  d(P_{S_j}, P_S) < ε  ⇒  |g(P_{S_j}) - g(P_S)| < δ.   (51)

Define

ε = min{ε_δ, δ}.   (52)

29 / 34

Proof of Prop. 9iii (2)

Suppose E[d(P_{S_B}, P_S)] < ε². We write

|E[g(P_{S_B})] - g(P_S)| = |Σ_{j ∈ B} P_B(j) [g(P_{S_j}) - g(P_S)]|
  ≤ Σ_{j ∈ B} P_B(j) |g(P_{S_j}) - g(P_S)|   (53)
  = Σ_{j: d(P_{S_j}, P_S) < ε} P_B(j) |g(P_{S_j}) - g(P_S)| + Σ_{j: d(P_{S_j}, P_S) ≥ ε} P_B(j) |g(P_{S_j}) - g(P_S)|.   (54)

We next bound the two sums in (54).

30 / 34
Proof of Prop. 9iii (3)

The first sum in (54) is bounded as

Σ_{j: d(P_{S_j}, P_S) < ε} P_B(j) |g(P_{S_j}) - g(P_S)|  (a)≤  Σ_{j: d(P_{S_j}, P_S) < ε} P_B(j) δ  ≤  δ   (55)

where (a) follows by (51) and (52).

31 / 34

Proof of Prop. 9iii (4)

The second sum in (54) is bounded as

Σ_{j: d(P_{S_j}, P_S) ≥ ε} P_B(j) |g(P_{S_j}) - g(P_S)|
  (a)≤ Σ_{j: d(P_{S_j}, P_S) ≥ ε} P_B(j) g_max
  (b)≤ ε g_max
  (c)≤ δ g_max   (56)

where (a) follows by (50), where (b) follows by our assumption E[d(P_{S_B}, P_S)] < ε² and Slide 28, and where (c) follows by (52).

32 / 34
Proof of Prop. 9iii (5)

Using (55) and (56) in (54), we get

|E[g(P_{S_B})] - g(P_S)| ≤ δ + δ g_max = δ(1 + g_max).   (57)

For δ → 0, the error bound on the right-hand side goes to zero, which proves part iii. of Prop. 9.

33 / 34

References

[1] R. A. Rueppel and J. L. Massey, "Leaf-average node-sum interchanges in rooted trees with applications," in Communications and Cryptography: Two Sides of One Tapestry, R. E. Blahut, D. J. Costello Jr., U. Maurer, and T. Mittelholzer, Eds. Kluwer Academic Publishers.
[2] J. L. Massey, "Applied digital information theory I," lecture notes, ETH Zurich. [Online]. Available: scr/adit1.pdf
[3] G. Kramer, "Information theory," lecture notes, TU Munich, edition WS 2012/2013.
[4] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press.

34 / 34