Information Theory Primer: Entropy, KL Divergence, Mutual Information, Jensen's Inequality

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
http://mlg.postech.ac.kr/~seungjin
Outline

- Entropy (Shannon entropy, differential entropy, and conditional entropy)
- Kullback-Leibler (KL) divergence
- Mutual information
- Jensen's inequality and Gibbs' inequality
Information Theory

Information theory answers two fundamental questions in communication theory:
- What is the ultimate data compression? The entropy H.
- What is the ultimate transmission rate of communication? The channel capacity C.

In the early 1940s, it was thought that increasing the transmission rate of information over a communication channel increased the probability of error. This is not true: Shannon surprised the communication theory community by proving that it does not, as long as the communication rate is below the channel capacity.

Although information theory was developed for communications, it is also important for explaining the ecological theory of sensory processing, and it plays a key role in elucidating the goal of unsupervised learning.
Shannon Entropy
Information and Entropy

Information can be thought of as surprise, uncertainty, or unexpectedness. Mathematically it is defined by I_i = -log p_i, where p_i is the probability that the event labelled i occurs. A rare event gives large information, while a frequent event gives small information.

Entropy is the average information, i.e.,
    H = E[I] = -Σ_{i=1}^N p_i log p_i.
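The short Python sketch below (not part of the original slides) computes self-information and entropy for an assumed toy distribution, using log base 2 so that the results are in bits.

```python
# A minimal sketch: self-information and Shannon entropy of a discrete distribution, in bits.
import numpy as np

def self_information(p):
    """I_i = -log2(p_i) for each outcome probability p_i."""
    return -np.log2(p)

def entropy(p):
    """H = -sum_i p_i log2 p_i, with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

p = np.array([0.5, 0.25, 0.25])
print(self_information(p))   # [1. 2. 2.] bits: rarer outcomes carry more information
print(entropy(p))            # 1.5 bits
```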
Example: Horse Race

Suppose we have a horse race with eight horses taking part. Assume that the probabilities of winning for the eight horses are
    (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).
Suppose that we wish to send a message to another person indicating which horse won the race. How many bits are required to describe this for each of the horses? 3 bits for any of the horses?
No! The win probabilities are not uniform. It makes sense to use shorter descriptions for the more probable horses and longer descriptions for the less probable ones, so that we achieve a lower average description length. For example, we can use the following strings to represent the eight horses:
    0, 10, 110, 1110, 111100, 111101, 111110, 111111.
The average description length in this case is 2 bits, as opposed to 3 bits for the uniform code.

We calculate the entropy:
    H = -(1/2) log_2(1/2) - (1/4) log_2(1/4) - (1/8) log_2(1/8) - (1/16) log_2(1/16) - 4 * (1/64) log_2(1/64) = 2 bits.

The entropy of a random variable is a lower bound on the average number of bits required to represent the random variable, and also on the average number of questions needed to identify the variable in a game of twenty questions.
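As a quick check of the horse-race example, the sketch below (not from the slides) verifies that the entropy of the win distribution equals the average length of the given code, both 2 bits.

```python
# Verify: entropy of the win distribution == average codeword length of the given code.
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

H = -np.sum(p * np.log2(p))                               # Shannon entropy in bits
avg_len = np.sum(p * np.array([len(c) for c in codes]))   # expected codeword length

print(H)        # 2.0 bits
print(avg_len)  # 2.0 bits
```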
Shannon Entropy

Given a discrete random variable X with image X, the (Shannon) entropy is the average information (a measure of uncertainty), defined by
    H(X) = -Σ_{x∈X} p(x) log p(x) = E_p[-log p(X)].

Properties
- H(X) ≥ 0 (since each term in the summation is nonnegative).
- H(X) = 0 if and only if P[X = x] = 1 for some x ∈ X.
- The entropy is maximal if all the outcomes are equally likely.
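The following sketch (not from the slides, distributions are assumed for illustration) shows the last two properties numerically: the uniform distribution over N outcomes attains the maximum log_2(N), while a degenerate distribution has zero entropy.

```python
# Entropy is maximal for the uniform distribution and zero for a degenerate one.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

N = 4
uniform = np.full(N, 1.0 / N)
skewed = np.array([0.7, 0.1, 0.1, 0.1])
degenerate = np.array([1.0, 0.0, 0.0, 0.0])

print(entropy(uniform), np.log2(N))  # 2.0  2.0  (the maximum)
print(entropy(skewed))               # ~1.357 bits
print(entropy(degenerate))           # 0.0 bits (no uncertainty)
```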
Differential Entropy

Given a continuous random variable X, the differential entropy is defined as
    H(p) = -∫ p(x) log p(x) dx = E_p[-log p(x)].

Properties
- It can be negative.
- Given a fixed variance, the Gaussian distribution achieves the maximal differential entropy.
- For x ~ N(μ, σ²), H(x) = (1/2) log(2πeσ²).
- For x ~ N(μ, Σ), H(x) = (1/2) log det(2πeΣ).
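A small sketch (not from the slides) comparing the closed-form Gaussian differential entropy (1/2) log(2πeσ²), in nats, against a Monte Carlo estimate of E[-log p(x)]; the sample size and parameters are assumptions for illustration.

```python
# Closed-form differential entropy of a Gaussian vs. a Monte Carlo estimate of E[-log p(x)].
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5

closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

x = rng.normal(mu, sigma, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
monte_carlo = -np.mean(log_p)

print(closed_form)   # ~0.7258 nats (it becomes negative for small enough sigma)
print(monte_carlo)   # close to the closed-form value
```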
Conditional Entropy

Given two discrete random variables X and Y with images X and Y, respectively, we expand the joint entropy:
    H(X,Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) log(1/p(x,y))
           = Σ_{x∈X} Σ_{y∈Y} p(x,y) log(1/(p(x) p(y|x)))
           = Σ_{x∈X} Σ_{y∈Y} p(x,y) log(1/p(x)) + Σ_{x∈X} Σ_{y∈Y} p(x,y) log(1/p(y|x))
           = Σ_{x∈X} p(x) log(1/p(x)) + Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log(1/p(y|x))
           = H(X) + Σ_{x∈X} p(x) H(Y|X = x),
where the second term is the conditional entropy H(Y|X).

Chain rule: H(X,Y) = H(X) + H(Y|X).
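A minimal sketch (not from the slides) verifying the chain rule H(X,Y) = H(X) + H(Y|X) on a small, assumed joint probability table.

```python
# Check the chain rule H(X,Y) = H(X) + H(Y|X) on a toy joint distribution.
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# rows index x, columns index y; entries are p(x, y)
pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px = pxy.sum(axis=1)

H_joint = H(pxy)
H_x = H(px)
# H(Y|X) = sum_x p(x) H(Y | X = x)
H_y_given_x = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))

print(H_joint, H_x + H_y_given_x)  # the two values agree (~1.846 bits)
```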
Further properties:
    H(X,Y) ≤ H(X) + H(Y),
    H(Y|X) ≤ H(Y).
Try to prove these by yourself!
KL Divergence
Relative Entropy (Kullback-Leibler Divergence)

Introduced by Solomon Kullback and Richard Leibler in 1951. It is a measure of how one probability distribution q diverges from another distribution p, written D_KL[p‖q] (the KL divergence of q(x) from p(x)).

For discrete probability distributions p and q:
    D_KL[p‖q] = Σ_x p(x) log(p(x)/q(x)).
For probability distributions p and q of continuous random variables:
    D_KL[p‖q] = ∫ p(x) log(p(x)/q(x)) dx.

Properties of KL divergence
- Divergence is not symmetric: D_KL[p‖q] ≠ D_KL[q‖p].
- Divergence is always nonnegative: D_KL[p‖q] ≥ 0 (Gibbs' inequality).
- Divergence is a convex function on the domain of probability distributions.
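A minimal sketch (not from the slides, with assumed example distributions) of the discrete KL divergence, illustrating nonnegativity and asymmetry.

```python
# Discrete KL divergence: nonnegative, and in general D_KL[p||q] != D_KL[q||p].
import numpy as np

def kl(p, q):
    """D_KL[p || q] = sum_i p_i * log(p_i / q_i), in nats; terms with p_i = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.8, 0.1, 0.1])
q = np.array([1/3, 1/3, 1/3])

print(kl(p, q))  # ~0.460 nats
print(kl(q, p))  # ~0.511 nats: a different value, so the divergence is not symmetric
print(kl(p, p))  # 0.0: zero iff the two distributions are equal
```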
Theorem (Convexity of divergence)
Let p_1, q_1 and p_2, q_2 be probability distributions over a random variable X, and for λ ∈ (0, 1) define
    p = λ p_1 + (1 - λ) p_2,   q = λ q_1 + (1 - λ) q_2.
Then,
    D_KL[p‖q] ≤ λ D_KL[p_1‖q_1] + (1 - λ) D_KL[p_2‖q_2].

Proof. It is deferred to the end of this lecture.
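Before the formal proof, the theorem can be sanity-checked numerically. The sketch below (not from the slides) draws random distribution pairs and confirms the convexity inequality in every trial.

```python
# Numerical check of the convexity of KL divergence on random distribution pairs.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)

def random_dist(n):
    w = rng.random(n)
    return w / w.sum()

for _ in range(5):
    p1, q1, p2, q2 = (random_dist(4) for _ in range(4))
    lam = rng.random()
    p = lam * p1 + (1 - lam) * p2
    q = lam * q1 + (1 - lam) * q2
    lhs = kl(p, q)
    rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    print(lhs <= rhs + 1e-12)  # True in every trial
```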
Entropy and Divergence

The entropy of a random variable X with a probability distribution p(x) is related to how much p(x) diverges from the uniform distribution on the support of X:
    H(X) = Σ_{x∈X} p(x) log(1/p(x))
         = Σ_{x∈X} p(x) log(|X| / (p(x)|X|))
         = log|X| - Σ_{x∈X} p(x) log(p(x) / (1/|X|))
         = log|X| - D_KL[p‖unif].
The more p(x) diverges from the uniform distribution, the smaller its entropy, and vice versa.
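The identity H(X) = log|X| - D_KL[p‖unif] can be checked directly; the sketch below (not from the slides, with an assumed distribution) does so in nats.

```python
# Check H(X) = log|X| - D_KL[p || unif] on an arbitrary distribution (nats).
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.25, 0.125, 0.125])
unif = np.full_like(p, 1.0 / len(p))

print(entropy(p))                      # ~1.2130 nats
print(np.log(len(p)) - kl(p, unif))    # the same value
```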
Recall
    D_KL[p‖q] = ∫ p(x) log(p(x)/q(x)) dx.

Characterizing KL divergence
- Where p and q are both high, we are happy.
- Where p is high but q isn't, we pay a price.
- Where p is low, we do not care.
- If D_KL = 0, then the two distributions are equal.
KL Divergence of Two Gaussians

Two univariate Gaussians (x ∈ R): p(x) = N(μ_1, σ_1²) and q(x) = N(μ_2, σ_2²). The divergence is calculated as
    D_KL[p‖q] = ∫ p(x) log(p(x)/q(x)) dx
              = ∫ p(x) log p(x) dx - ∫ p(x) log q(x) dx
              = (1/2) σ_1²/σ_2² + (1/2) (μ_2 - μ_1)²/σ_2² + log(σ_2/σ_1) - 1/2.

Two multivariate Gaussians (x ∈ R^D): p(x) = N(μ_1, Σ_1) and q(x) = N(μ_2, Σ_2). The divergence is calculated as
    D_KL[p‖q] = (1/2) [ tr(Σ_2⁻¹ Σ_1) + (μ_2 - μ_1)ᵀ Σ_2⁻¹ (μ_2 - μ_1) - D + log(|Σ_2|/|Σ_1|) ].
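A minimal sketch (not from the slides, with assumed Gaussian parameters) comparing the univariate closed form above with a Monte Carlo estimate of E_p[log p(x) - log q(x)].

```python
# Closed-form KL between two univariate Gaussians vs. a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(0)
mu1, s1 = 0.0, 1.0   # p = N(mu1, s1^2)
mu2, s2 = 1.0, 2.0   # q = N(mu2, s2^2)

closed_form = (0.5 * s1**2 / s2**2
               + 0.5 * (mu2 - mu1)**2 / s2**2
               + np.log(s2 / s1)
               - 0.5)

x = rng.normal(mu1, s1, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * s1**2) - (x - mu1)**2 / (2 * s1**2)
log_q = -0.5 * np.log(2 * np.pi * s2**2) - (x - mu2)**2 / (2 * s2**2)
monte_carlo = np.mean(log_p - log_q)

print(closed_form)   # ~0.4432 nats
print(monte_carlo)   # close to the closed-form value
```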
Mutual Information
Mutual Information

Mutual information is the relative entropy between the joint distribution and the product of the marginal distributions:
    I(x, y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) log[ p(x,y) / (p(x)p(y)) ]
            = D_KL[ p(x,y) ‖ p(x)p(y) ]
            = E_{p(x,y)}[ log( p(x,y) / (p(x)p(y)) ) ].

Mutual information can be interpreted as the reduction in the uncertainty of x due to the knowledge of y, i.e.,
    I(x, y) = H(x) - H(x|y),
where H(x|y) = -E_{p(x,y)}[log p(x|y)] is the conditional entropy.
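A minimal sketch (not from the slides) computing mutual information from an assumed joint table and checking that the KL form and the entropy-reduction form agree.

```python
# I(x,y) = D_KL[p(x,y) || p(x)p(y)] = H(x) - H(x|y), checked on a toy joint table.
import numpy as np

def H(p):
    p = np.asarray(p, float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px = pxy.sum(axis=1)
py = pxy.sum(axis=0)

mi_kl = np.sum(pxy * np.log2(pxy / np.outer(px, py)))                      # KL form
mi_h = H(px) - sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))   # H(x) - H(x|y)

print(mi_kl, mi_h)  # the two values agree (~0.125 bits)
```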
Convexity, Jensen's Inequality, and Gibbs' Inequality
Convex Sets and Functions

Definition (Convex Set)
Let C be a subset of R^m. C is called a convex set if
    αx + (1 - α)y ∈ C, for all x, y ∈ C and α ∈ [0, 1].

Definition (Convex Function)
Let C be a convex subset of R^m. A function f : C → R is called a convex function if
    f(αx + (1 - α)y) ≤ αf(x) + (1 - α)f(y), for all x, y ∈ C and α ∈ [0, 1].
Jensen's Inequality

Theorem (Jensen's Inequality)
If f(x) is a convex function and x is a random vector, then
    E[f(x)] ≥ f(E[x]).

Note: Jensen's inequality can also be written for a concave function, with the direction of the inequality reversed.
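A minimal sketch (not from the slides, with an assumed sample distribution) illustrating E[f(x)] ≥ f(E[x]) for the convex function f(x) = x².

```python
# Jensen's inequality, numerically: E[f(x)] >= f(E[x]) for convex f(x) = x^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # any random sample works

f = np.square                # a convex function
lhs = np.mean(f(x))          # E[f(x)]
rhs = f(np.mean(x))          # f(E[x])

print(lhs, rhs, lhs >= rhs)  # e.g. ~8.0  ~4.0  True
```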
Proof of Jensen's Inequality

We need to show that Σ_{i=1}^N p_i f(x_i) ≥ f(Σ_{i=1}^N p_i x_i). The proof is based on recursion, working from the right-hand side of this inequality:
    f(Σ_{i=1}^N p_i x_i) = f(p_1 x_1 + Σ_{i=2}^N p_i x_i)
      ≤ p_1 f(x_1) + (Σ_{i=2}^N p_i) f( (Σ_{i=2}^N p_i x_i) / (Σ_{i=2}^N p_i) )      (choose α = p_1 / Σ_{i=1}^N p_i)
      ≤ p_1 f(x_1) + (Σ_{i=2}^N p_i) { α f(x_2) + (1 - α) f( (Σ_{i=3}^N p_i x_i) / (Σ_{i=3}^N p_i) ) }      (choose α = p_2 / Σ_{i=2}^N p_i)
      = p_1 f(x_1) + p_2 f(x_2) + (Σ_{i=3}^N p_i) f( (Σ_{i=3}^N p_i x_i) / (Σ_{i=3}^N p_i) ),
and so forth.
Gibbs' Inequality

Theorem
D_KL[p‖q] ≥ 0, with equality iff p = q.

Proof: Consider the Kullback-Leibler divergence for discrete distributions:
    D_KL[p‖q] = Σ_i p_i log(p_i/q_i)
              = -Σ_i p_i log(q_i/p_i)
              ≥ -log[ Σ_i p_i (q_i/p_i) ]      (by Jensen's inequality)
              = -log[ Σ_i q_i ] = 0.
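A minimal sketch (not from the slides) checking Gibbs' inequality numerically: the divergence is nonnegative for random distribution pairs and zero when the two distributions coincide.

```python
# Gibbs' inequality: D_KL[p || q] >= 0, with equality when p = q.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(2)

def random_dist(n):
    w = rng.random(n)
    return w / w.sum()

for _ in range(5):
    p, q = random_dist(6), random_dist(6)
    print(kl(p, q) >= 0.0)        # True in every trial

p = random_dist(6)
print(np.isclose(kl(p, p), 0.0))  # True: equality holds iff p = q
```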
More on Gibbs' Inequality

In order to find the distribution p which minimizes D_KL[p‖q], we consider a Lagrangian
    E = D_KL[p‖q] + λ(1 - Σ_i p_i) = Σ_i p_i log(p_i/q_i) + λ(1 - Σ_i p_i).
Compute the partial derivative ∂E/∂p_k and set it to zero:
    ∂E/∂p_k = log p_k - log q_k + 1 - λ = 0,
which leads to p_k = q_k e^{λ-1}. It follows from Σ_i p_i = 1 that Σ_i q_i e^{λ-1} = 1, which leads to λ = 1. Therefore p_i = q_i.
The Hessian,
    ∂²E/∂p_i² = 1/p_i,   ∂²E/∂p_i∂p_j = 0 (i ≠ j),
is positive definite, which shows that p_i = q_i is a genuine minimum.
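As a final numerical illustration (not from the slides, with an assumed q), the sketch below perturbs p away from q on the probability simplex and confirms that D_KL[p‖q] only increases, consistent with p = q being the minimizer found above.

```python
# With q fixed, KL(p || q) over the probability simplex is smallest at p = q.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(3)
q = np.array([0.1, 0.2, 0.3, 0.4])

at_q = kl(q, q)                        # 0.0 at the minimizer p = q
for _ in range(5):
    noise = rng.normal(scale=0.05, size=q.shape)
    p = np.clip(q + noise, 1e-6, None)
    p = p / p.sum()                    # project back onto the simplex
    print(kl(p, q) > at_q)             # True in every trial
```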