Machine Learning Lecture 02.2: Basics of Information Theory

Nevin L. Zhang (lzhang@cse.ust.hk)
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Outline

1. Jensen's Inequality
2. Entropy
3. Divergence
4. Mutual Information

1. Jensen's Inequality
Concave functions

A function $f$ is concave on an interval $I$ if for any $x, y \in I$ and any $\lambda \in [0, 1]$,
$$\lambda f(x) + (1 - \lambda) f(y) \le f(\lambda x + (1 - \lambda) y).$$
The weighted average of the function values is upper bounded by the function of the weighted average. $f$ is strictly concave if equality holds only when $x = y$.
Jensen's Inequality

Theorem (1.1): Suppose the function $f$ is concave on an interval $I$. Then for any $p_i \in [0, 1]$ with $\sum_{i=1}^n p_i = 1$ and any $x_i \in I$,
$$\sum_{i=1}^n p_i f(x_i) \le f\Big(\sum_{i=1}^n p_i x_i\Big).$$
The weighted average of the function values is upper bounded by the function of the weighted average. If $f$ is strictly concave, equality holds iff $p_i p_j \ne 0$ implies $x_i = x_j$.

Exercise: Prove this (using induction).
Logarithmic function

The logarithmic function is concave on the interval $(0, \infty)$. Hence
$$\sum_{i=1}^n p_i \log x_i \le \log\Big(\sum_{i=1}^n p_i x_i\Big).$$
In words, exchanging $\sum_i p_i$ with $\log$ increases the quantity.

[Figure: plot of the logarithm curve, illustrating its concavity.]
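As a quick numeric sanity check (a sketch added here, not part of the original slides), the following Python snippet draws random weights $p_i$ and points $x_i$ and verifies the inequality for the logarithm:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=5)   # points x_i in (0, infinity)
p = rng.dirichlet(np.ones(5))        # weights p_i >= 0 summing to 1

lhs = np.sum(p * np.log2(x))         # weighted average of log(x_i)
rhs = np.log2(np.sum(p * x))         # log of the weighted average
assert lhs <= rhs                    # Jensen: sum_i p_i log x_i <= log(sum_i p_i x_i)
print(f"sum p_i log x_i = {lhs:.4f} <= log(sum p_i x_i) = {rhs:.4f}")
```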
2. Entropy
Entropy

The entropy of a random variable $X$:
$$H(X) = \sum_X P(X) \log \frac{1}{P(X)} = -E_P[\log P(X)],$$
with the convention that $0 \log(1/0) = 0$. The base of the logarithm is 2, and the unit is the bit. $H(X)$ is sometimes also called the entropy of the distribution.
Entropy

$H(X)$ measures the uncertainty about $X$. For a binary $X$, the chart below shows $H(X)$ as a function of $p = P(X = 1)$; the higher $H(X)$ is, the more uncertainty there is about the value of $X$.

[Figure: the binary entropy function $H(X)$ plotted against $p = P(X = 1)$.]
Entropy

Another example:
- $X$: result of a coin toss
- $Y$: result of a die throw
- $Z$: result of randomly picking a card from a deck of 54

Which one has the highest uncertainty? Entropy:
$$H(X) = \tfrac{1}{2} \log 2 + \tfrac{1}{2} \log 2 = \log 2 = 1 \text{ bit}$$
$$H(Y) = \tfrac{1}{6} \log 6 + \dots + \tfrac{1}{6} \log 6 = \log 6$$
$$H(Z) = \tfrac{1}{54} \log 54 + \dots + \tfrac{1}{54} \log 54 = \log 54$$
Indeed we have $H(X) < H(Y) < H(Z)$.
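A minimal Python sketch (not from the original slides) that reproduces these three entropies in bits:

```python
import numpy as np

def entropy(p):
    """Entropy in bits; terms with p = 0 contribute 0 by convention."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

coin = np.full(2, 1 / 2)     # uniform over 2 outcomes
die  = np.full(6, 1 / 6)     # uniform over 6 outcomes
deck = np.full(54, 1 / 54)   # uniform over 54 outcomes

print(entropy(coin))   # 1.0 bit  = log2(2)
print(entropy(die))    # ~2.585   = log2(6)
print(entropy(deck))   # ~5.755   = log2(54)
```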
Entropy

Proposition (1.2):
- $H(X) \ge 0$, with equality iff $P(X = x) = 1$ for some $x \in \Omega_X$, i.e. iff there is no uncertainty.
- $H(X) \le \log |\Omega_X|$, with equality iff $P(X = x) = 1/|\Omega_X|$ for all $x$. Uncertainty is highest in the case of the uniform distribution.

Proof: Because $\log$ is concave, by Jensen's inequality:
$$H(X) = \sum_X P(X) \log \frac{1}{P(X)} \le \log \sum_X P(X) \cdot \frac{1}{P(X)} = \log |\Omega_X|.$$
Conditional entropy

The conditional entropy of $X$ given the event $Y = y$ is the entropy of the conditional distribution $P(X \mid Y{=}y)$:
$$H(X \mid Y{=}y) = \sum_X P(X \mid Y{=}y) \log \frac{1}{P(X \mid Y{=}y)}.$$
It is the uncertainty that remains about $X$ when $Y$ is known to be $y$. It is possible that $H(X \mid Y{=}y) > H(X)$: intuitively, $Y = y$ might contradict our prior knowledge about $X$ and increase our uncertainty about $X$.

Exercise: Give an example.
Conditional entropy

The conditional entropy of $X$ given the variable $Y$:
$$H(X \mid Y) = \sum_{y \in \Omega_Y} P(Y{=}y)\, H(X \mid Y{=}y)
= \sum_Y P(Y) \sum_X P(X \mid Y) \log \frac{1}{P(X \mid Y)}
= \sum_{X,Y} P(X, Y) \log \frac{1}{P(X \mid Y)}
= -E[\log P(X \mid Y)].$$
It is the average uncertainty that remains about $X$ when $Y$ is known.
Joint entropy

The joint entropy of $X$ and $Y$:
$$H(X, Y) = \sum_{X,Y} P(X, Y) \log \frac{1}{P(X, Y)}.$$
Chain rule:
$$H(X, Y) = H(X) + H(Y \mid X) = H(Y, X) = H(Y) + H(X \mid Y).$$
Proof:
$$\sum_{X,Y} P(X, Y) \log \frac{1}{P(X, Y)}
= \sum_{X,Y} P(X, Y) \log \frac{1}{P(X) P(Y \mid X)}
= \sum_{X,Y} P(X, Y) \log \frac{1}{P(X)} + \sum_{X,Y} P(X, Y) \log \frac{1}{P(Y \mid X)}
= \sum_X P(X) \log \frac{1}{P(X)} + H(Y \mid X)
= H(X) + H(Y \mid X).$$
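The chain rule is easy to check numerically. Below is a small Python sketch on a made-up joint distribution $P(X, Y)$; the table of probabilities is an arbitrary example, not from the slides:

```python
import numpy as np

# A hypothetical joint distribution P(X, Y): rows index X, columns index Y.
P_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

def H(p):
    """Entropy in bits of a distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_x = P_xy.sum(axis=1)   # marginal P(X)

# H(Y | X) = sum_x P(x) H(Y | X = x), averaging entropies of the rows
H_y_given_x = sum(P_x[i] * H(P_xy[i] / P_x[i]) for i in range(len(P_x)))

# Chain rule: H(X, Y) = H(X) + H(Y | X)
assert np.isclose(H(P_xy.ravel()), H(P_x) + H_y_given_x)
```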
3. Divergence
Kullback-Leibler divergence

Relative entropy, or Kullback-Leibler divergence, measures how much a distribution $Q(X)$ differs from a true probability distribution $P(X)$. The KL divergence of $Q$ from $P$ is defined as:
$$KL(P \| Q) = \sum_X P(X) \log \frac{P(X)}{Q(X)} = E_P[\log P(X)] - E_P[\log Q(X)],$$
with the conventions $0 \log \frac{0}{0} = 0$ and $p \log \frac{p}{0} = \infty$ if $p \ne 0$.
It is not symmetric, so it is not a distance measure mathematically.
The second term is called the cross entropy: $H(P, Q) = -E_P[\log Q(X)]$. Hence
$$H(P, Q) = KL(P \| Q) + H(P).$$
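A minimal Python sketch of these definitions, assuming $Q > 0$ wherever $P > 0$ so the divergence is finite; the distributions P and Q below are arbitrary examples:

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) in bits, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def cross_entropy(p, q):
    """H(P, Q) = -E_P[log Q(X)] in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def H(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P = np.array([0.5, 0.25, 0.25])
Q = np.array([1 / 3, 1 / 3, 1 / 3])

# H(P, Q) = KL(P || Q) + H(P)
assert np.isclose(cross_entropy(P, Q), kl(P, Q) + H(P))
print(kl(P, Q), kl(Q, P))   # generally unequal: KL is not symmetric
```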
Kullback-Leibler divergence

Theorem (1.2) (Gibbs' inequality): $KL(P \| Q) \ge 0$, with equality iff $P$ is identical to $Q$. In other words, the KL divergence between $P$ and $Q$ is larger than 0 unless $P$ and $Q$ are identical.

Proof:
$$-\sum_X P(X) \log \frac{P(X)}{Q(X)} = \sum_X P(X) \log \frac{Q(X)}{P(X)} \le \log \sum_X P(X) \frac{Q(X)}{P(X)} = \log \sum_X Q(X) = 0,$$
where the inequality is Jensen's.
A corollary

Corollary (1.1) (Gibbs' inequality): $H(P, Q) \ge H(P)$, or
$$\sum_X P(X) \log Q(X) \le \sum_X P(X) \log P(X).$$
In general, let $f(X)$ be a non-negative function. Then
$$\sum_X f(X) \log Q(X) \le \sum_X f(X) \log P^*(X), \quad \text{where } P^*(X) = f(X) \Big/ \sum_X f(X).$$
Jensen-Shannon divergence

KL is not symmetric: $KL(P \| Q)$ is usually not equal to the reverse KL, $KL(Q \| P)$. The Jensen-Shannon divergence is a symmetrized version of KL:
$$JS(P \| Q) = \tfrac{1}{2} KL(P \| M) + \tfrac{1}{2} KL(Q \| M), \quad \text{where } M = \frac{P + Q}{2}.$$
Properties:
- $0 \le JS(P \| Q) \le \log 2$
- $JS(P \| Q) = 0$ if $P = Q$
- $JS(P \| Q) = \log 2$ if $P$ and $Q$ have disjoint supports.
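A short Python sketch (made-up example distributions) checking the two extreme cases; with base-2 logarithms, the upper bound $\log 2$ equals 1 bit:

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) in bits; terms with p = 0 contribute 0."""
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def js(p, q):
    """Jensen-Shannon divergence in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2   # mixture M; m > 0 wherever p > 0 or q > 0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = np.array([1.0, 0.0])   # disjoint supports
Q = np.array([0.0, 1.0])
print(js(P, Q))            # 1.0 bit = log2(2), the maximum
print(js(P, P))            # 0.0: JS vanishes when the arguments coincide
```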
4. Mutual Information
Mutual information

The mutual information of $X$ and $Y$:
$$I(X; Y) = H(X) - H(X \mid Y).$$
It is the average reduction in uncertainty about $X$ from learning the value of $Y$, or the average amount of information $Y$ conveys about $X$.
Mutual information and KL divergence

Note that:
$$I(X; Y) = \sum_X P(X) \log \frac{1}{P(X)} - \sum_{X,Y} P(X, Y) \log \frac{1}{P(X \mid Y)}
= \sum_{X,Y} P(X, Y) \log \frac{1}{P(X)} - \sum_{X,Y} P(X, Y) \log \frac{1}{P(X \mid Y)}
= \sum_{X,Y} P(X, Y) \log \frac{P(X \mid Y)}{P(X)}
= \sum_{X,Y} P(X, Y) \log \frac{P(X, Y)}{P(X) P(Y)}
= KL\big(P(X, Y) \,\|\, P(X) P(Y)\big).$$
Since the last expression is symmetric in $X$ and $Y$, we get the equivalent definition:
$$I(X; Y) = H(X) - H(X \mid Y) = I(Y; X) = H(Y) - H(Y \mid X).$$
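The identity $I(X; Y) = KL(P(X, Y) \| P(X) P(Y))$ translates directly into code; the joint distribution below is a made-up example:

```python
import numpy as np

# A hypothetical joint distribution P(X, Y): rows index X, columns index Y.
P_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
P_x = P_xy.sum(axis=1, keepdims=True)   # marginal P(X), as a column
P_y = P_xy.sum(axis=0, keepdims=True)   # marginal P(Y), as a row

# I(X; Y) = KL(P(X, Y) || P(X) P(Y)), in bits
prod = P_x * P_y                        # outer product of the marginals
nz = P_xy > 0
mi = np.sum(P_xy[nz] * np.log2(P_xy[nz] / prod[nz]))
print(mi)                               # > 0 here, since X and Y are dependent
```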
Property of mutual information

Theorem (1.3): $I(X; Y) \ge 0$, with equality iff $X \perp Y$.
Interpretation: $X$ and $Y$ are independent iff $X$ contains no information about $Y$ and vice versa.
Proof: Follows from the previous slide and Theorem 1.2.
Conditional entropy revisited

Theorem (1.4): $H(X \mid Y) \le H(X)$, with equality iff $X \perp Y$.
Observation reduces uncertainty on average, except in the case of independence.
Proof: Follows from Theorem 1.3.
Mutual information and entropy

From the definition of mutual information, $I(X; Y) = H(X) - H(X \mid Y)$, and the chain rule, $H(X, Y) = H(Y) + H(X \mid Y)$, we get
$$H(X) + H(Y) = H(X, Y) + I(X; Y), \qquad I(X; Y) = H(X) + H(Y) - H(X, Y).$$
Consequently,
$$H(X, Y) \le H(X) + H(Y),$$
with equality iff $X \perp Y$.
Mutual information and entropy

[Venn diagram: the relationships among joint entropy, conditional entropy, and mutual information.]
$$H(X) + H(Y) = H(X, Y) + I(X; Y)$$
$$I(X; Y) = H(X) - H(X \mid Y)$$
$$I(Y; X) = H(Y) - H(Y \mid X)$$
Conditional mutual information

The conditional mutual information of $X$ and $Y$ given $Z$:
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z).$$
It is the average amount of information $Y$ conveys about $X$ given $Z$.
Conditional mutual information and KL divergence

Note:
$$I(X; Y \mid Z) = \sum_{X,Z} P(X, Z) \log \frac{1}{P(X \mid Z)} - \sum_{X,Y,Z} P(X, Y, Z) \log \frac{1}{P(X \mid Y, Z)}
= \sum_{X,Y,Z} P(X, Y, Z) \log \frac{1}{P(X \mid Z)} - \sum_{X,Y,Z} P(X, Y, Z) \log \frac{1}{P(X \mid Y, Z)}
= \sum_{X,Y,Z} P(X, Y, Z) \log \frac{P(X \mid Y, Z)}{P(X \mid Z)}
= \sum_Z P(Z) \sum_{X,Y} P(X, Y \mid Z) \log \frac{P(X, Y \mid Z)}{P(X \mid Z) P(Y \mid Z)}
= \sum_Z P(Z)\, KL\big(P(X, Y \mid Z) \,\|\, P(X \mid Z) P(Y \mid Z)\big) \ge 0.$$
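A minimal Python sketch of the last line, averaging a per-slice KL over $Z$ for a randomly generated joint $P(X, Y, Z)$ (an arbitrary example, not from the slides):

```python
import numpy as np

# A hypothetical joint P(X, Y, Z) with axes (x, y, z).
rng = np.random.default_rng(1)
P = rng.random((2, 2, 2))
P /= P.sum()

def cond_mi(P):
    """I(X; Y | Z) = sum_z P(z) KL(P(X,Y|z) || P(X|z) P(Y|z)), in bits."""
    total = 0.0
    for z in range(P.shape[2]):
        Pz = P[:, :, z].sum()                  # P(Z = z)
        Pxy_z = P[:, :, z] / Pz                # P(X, Y | Z = z)
        prod = Pxy_z.sum(1, keepdims=True) * Pxy_z.sum(0, keepdims=True)
        nz = Pxy_z > 0
        total += Pz * np.sum(Pxy_z[nz] * np.log2(Pxy_z[nz] / prod[nz]))
    return total

print(cond_mi(P))   # always >= 0 (Theorem 1.5)
```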
Property of conditional mutual information

Theorem (1.5): $I(X; Y \mid Z) \ge 0$, with equality iff $X \perp Y \mid Z$.
Interpretation: $H(X \mid Z) \ge H(X \mid Y, Z)$: more observations reduce uncertainty on average, except in the case of conditional independence.
$X$ and $Y$ are independent given $Z$ iff $X$ contains no information about $Y$ given $Z$ and vice versa:
$$X \perp Y \mid Z \iff I(X; Y \mid Z) = 0.$$
This is another characterization of conditional independence.