Information Theory and Communication
Ritwik Banerjee
rbanerjee@cs.stonybrook.edu
General Chain Rules

Definition
The conditional mutual information of random variables $X$ and $Y$, given the random variable $Z$, is the reduction in the uncertainty of $X$ due to knowledge of $Y$, given $Z$. It is defined as
\[ I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = E_{p(x,y,z)} \left[ \log \frac{p(X, Y \mid Z)}{p(X \mid Z)\, p(Y \mid Z)} \right] \]

Chain Rule for Information:
\[ I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, X_2, \ldots, X_{i-1}) \]

Exercise
Prove the above theorem. Hint: $I(X_1, X_2, \ldots, X_n; Y) = H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n \mid Y)$. Then, write the joint entropy expressions as sums of conditional entropies (use the chain rule for entropy here).
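A minimal numerical sketch of these identities (not from the lecture: the joint pmf below is an arbitrary random example, and the helper entropy is hypothetical). It computes $I(X_1, X_2; Y)$ and checks it against $I(X_1; Y) + I(X_2; Y \mid X_1)$, the $n = 2$ case of the chain rule.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a pmf given as a flat array; 0 log 0 := 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Arbitrary joint pmf p(x1, x2, y) over three binary variables (illustrative only).
p = np.random.default_rng(0).random((2, 2, 2))
p /= p.sum()

# I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y)
I_joint = entropy(p.sum(axis=2).ravel()) + entropy(p.sum(axis=(0, 1))) - entropy(p.ravel())

# I(X1; Y) = H(X1) + H(Y) - H(X1, Y)
p_x1y = p.sum(axis=1)
I_1 = entropy(p_x1y.sum(axis=1)) + entropy(p_x1y.sum(axis=0)) - entropy(p_x1y.ravel())

# I(X2; Y | X1) = H(X2 | X1) - H(X2 | X1, Y)
#              = H(X1, X2) - H(X1) + H(X1, Y) - H(X1, X2, Y)
I_2_given_1 = (entropy(p.sum(axis=2).ravel()) - entropy(p.sum(axis=(1, 2)))
               + entropy(p_x1y.ravel()) - entropy(p.ravel()))

print(I_joint, I_1 + I_2_given_1)  # the two values agree (up to float error)
```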
General Chain Rules

Definition
For joint probability mass functions $p(x, y)$ and $q(x, y)$, the conditional relative entropy, denoted by $D(p(y \mid x) \,\|\, q(y \mid x))$, is the average of the relative entropies between the conditional probability mass functions $p(y \mid x)$ and $q(y \mid x)$, taken over the probability mass function $p(x)$. That is,
\[ D(p(y \mid x) \,\|\, q(y \mid x)) = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)} = E_{p(x,y)} \left[ \log \frac{p(Y \mid X)}{q(Y \mid X)} \right] \]

The KL divergence (i.e., relative entropy) between two joint distributions on a pair of random variables can be expressed as the sum of a relative entropy and a conditional relative entropy.
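The definition translates directly into a double sum. A minimal sketch (the pmfs px, p_y_given_x, and q_y_given_x are made-up illustrative values, not from the lecture):

```python
import numpy as np

# Hypothetical example: a marginal p(x) and two conditional pmfs p(y|x), q(y|x),
# stored as rows indexed by x.
px = np.array([0.3, 0.7])
p_y_given_x = np.array([[0.9, 0.1],
                        [0.4, 0.6]])
q_y_given_x = np.array([[0.5, 0.5],
                        [0.2, 0.8]])

# D(p(y|x) || q(y|x)) = sum_x p(x) sum_y p(y|x) log p(y|x)/q(y|x)
D_cond = np.sum(px[:, None] * p_y_given_x * np.log2(p_y_given_x / q_y_given_x))
print(D_cond)  # non-negative, as Corollary 2 later confirms
```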
General Chain Rules

Chain Rule for Divergence:
\[ D(p(x, y) \,\|\, q(x, y)) = D(p(x) \,\|\, q(x)) + D(p(y \mid x) \,\|\, q(y \mid x)) \]

Exercise
Prove the above theorem. Hint: Write the joint distributions inside the logarithm as conditionals, and remember to sum out variables that do not occur outside the probability mass function.
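Continuing the numerical sketch, the chain rule can be checked by building the joints $p(x,y) = p(x)\,p(y \mid x)$ and $q(x,y) = q(x)\,q(y \mid x)$ from hypothetical marginals and conditionals (all values below are made up for illustration):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

px = np.array([0.3, 0.7]); qx = np.array([0.6, 0.4])
p_y_given_x = np.array([[0.9, 0.1], [0.4, 0.6]])
q_y_given_x = np.array([[0.5, 0.5], [0.2, 0.8]])

p_xy = px[:, None] * p_y_given_x   # p(x, y) = p(x) p(y|x)
q_xy = qx[:, None] * q_y_given_x   # q(x, y) = q(x) q(y|x)

D_joint = kl(p_xy.ravel(), q_xy.ravel())
D_marg = kl(px, qx)
D_cond = np.sum(px[:, None] * p_y_given_x * np.log2(p_y_given_x / q_y_given_x))

print(D_joint, D_marg + D_cond)  # the two values agree
```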
Some important questions

How do we define inequality of information?
When is the entropy of a random variable maximised?
Are there any bounds on the entropy of a random variable?
Does conditioning always reduce entropy (i.e., is more information really always a good thing)?

Next, we will look at some properties of entropy (and other related definitions) and answer these questions.
Convex and Concave functions

Definition
A function $f(x)$ is said to be convex over an interval $(a, b)$ if for all $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$,
\[ f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2) \]
If the equality holds only when $\lambda$ is 0 or 1, then the function is said to be strictly convex.

A function $f(x)$ is said to be concave over an interval $(a, b)$ if $-f$ is convex over that interval.

If a function $f$ has a second derivative that is non-negative (positive) over an interval, then it is convex (strictly convex) over that interval.
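An illustrative check of the second-derivative criterion (not part of the lecture): $f(x) = x \log x$ has $f''(x) = 1/x > 0$ on $(0, \infty)$, so it is strictly convex there, and the defining inequality can be sampled at random points:

```python
import numpy as np

f = lambda x: x * np.log(x)  # strictly convex on (0, inf): f''(x) = 1/x > 0

rng = np.random.default_rng(1)
x1, x2 = rng.uniform(0.1, 10, size=(2, 100_000))
lam = rng.uniform(0, 1, size=100_000)

# Defining inequality: f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)
lhs = f(lam * x1 + (1 - lam) * x2)
rhs = lam * f(x1) + (1 - lam) * f(x2)
print(np.all(lhs <= rhs + 1e-12))  # True
```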
Jensen's Inequality

Jensen's Inequality: Given any convex function $f$ and random variable $X$,
\[ E[f(X)] \ge f(E[X]) \]

Proof.
By induction (on the number of mass points) for discrete distributions. By using arguments of limits and continuity, this proof can be extended to continuous distributions as well.

Consequences of Jensen's inequality
The relation between per capita income and well-being is a concave function. Jensen's inequality implies that the maximum average well-being of a society is attained when income is spread evenly (i.e., a uniform distribution).
KL divergence is always non-negative.
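For a discrete distribution, Jensen's inequality is easy to verify directly. A minimal sketch with an arbitrary pmf and the convex function $f(x) = x^2$ (both chosen purely for illustration):

```python
import numpy as np

# Arbitrary discrete distribution: values and their probabilities (illustrative).
x = np.array([1.0, 2.0, 5.0, 9.0])
p = np.array([0.1, 0.4, 0.3, 0.2])

f = lambda t: t ** 2  # a convex function

Ef = np.sum(p * f(x))  # E[f(X)]
fE = f(np.sum(p * x))  # f(E[X])
print(Ef >= fE)        # True: Jensen's inequality for a convex f
```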
Information Inequality

Let $x \in \mathcal{X}$ and let $p(x)$, $q(x)$ be two probability mass functions. Then, the divergence of $p$ from $q$ is non-negative, with equality if and only if $p(x) = q(x)$ for all $x$. That is,
\[ D(p(x) \,\|\, q(x)) \ge 0 \]
Hint: it is easier to show that the negative of the KL divergence is always non-positive.

Corollary 1: For any two random variables $X$ and $Y$, $I(X; Y) \ge 0$.
Corollary 2: Conditional relative entropy is non-negative.
Corollary 3: Conditional mutual information is non-negative: $I(X; Y \mid Z) \ge 0$. It is equal to zero if and only if $X$ and $Y$ are conditionally independent given $Z$.
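A quick numerical sanity check of the information inequality and Corollary 1 (the pmfs below are random draws, purely illustrative):

```python
import numpy as np

def kl(p, q):
    """D(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(2)
p = rng.random(8); p /= p.sum()
q = rng.random(8); q /= q.sum()

print(kl(p, q) >= 0)  # True for every random draw
print(kl(p, p))       # 0.0: equality exactly when p = q

# Corollary 1: I(X;Y) = D(p(x,y) || p(x)p(y)) >= 0 for any joint pmf.
pxy = rng.random((4, 4)); pxy /= pxy.sum()
prod = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))  # product of the marginals
print(kl(pxy.ravel(), prod.ravel()) >= 0)  # True
```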