ECE 587 / STA 563: Lecture 2 - Measures of Information
Information Theory, Duke University, Fall 2017

Author: Galen Reeves
Last Modified: August 3, 2017

Outline of lecture:
- 2.1 Quantifying Information
- 2.2 Entropy and Mutual Information
  - 2.2.1 Entropy
  - 2.2.2 Mutual Information
  - 2.2.3 Example: Testing for a disease
  - 2.2.4 Relative Entropy
- 2.3 Convexity & Concavity
- 2.4 Data Processing Inequality
- 2.5 Fano's Inequality
- 2.6 Summary of Basic Inequalities
- 2.7 Axiomatic Derivation of Mutual Information [Optional]
  - 2.7.1 Proof of uniqueness of i(x, y)

2.1 Quantifying Information

How much information do the answers to the following questions provide?

(i) Will it rain today in Durham? (two possible answers)
(ii) Will it rain today in Death Valley? (two possible answers)
(iii) What is today's winning lottery number? (for the Mega Millions Jackpot, there are 258,890,850 possible answers)

The amount of information is linked to the number of possible answers. In 1928, Ralph Hartley gave the following definition:

    Hartley Information = \log(\# \text{answers})

Hartley's measure of information is additive. The number of possible answers for two questions corresponds to the product of the number of answers for each question, and taking the logarithm turns the product into a sum.

What is today's winning lottery number?

    \log_2(258,890,850) \approx 28 \text{ bits}

What are the winning lottery numbers for today and tomorrow?

    \log_2(258,890,850^2) = \log_2(258,890,850) + \log_2(258,890,850) \approx 56 \text{ bits}
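As a quick numeric check of the additivity argument above, the short Python sketch below evaluates the Hartley information of the lottery example in bits (the count 258,890,850 is the Mega Millions figure quoted above; the variable names are only for illustration).

    from math import log2

    n_answers = 258_890_850                  # possible Mega Millions outcomes (from the text)

    one_drawing = log2(n_answers)            # Hartley information of a single drawing
    two_drawings = log2(n_answers ** 2)      # two independent drawings: the answer counts multiply

    print(f"one drawing:  {one_drawing:.2f} bits")
    print(f"two drawings: {two_drawings:.2f} bits = {one_drawing:.2f} + {one_drawing:.2f}")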

But Hartley's information does not distinguish between likely and unlikely answers (e.g. rain in Durham vs. rain in Death Valley). In 1948, Shannon introduced measures of information which depend on the probabilities of the answers.

2.2 Entropy and Mutual Information

2.2.1 Entropy

Let X be a discrete random variable with pmf p(x) and finite support \mathcal{X}. The entropy of X is

    H(X) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}

The entropy can also be expressed as the expected value of the random variable \log(1/p(X)), i.e.

    H(X) = E\left[ \log \frac{1}{p(X)} \right]

Binary Entropy: If X is a Bernoulli(p) random variable (i.e. P[X = 1] = p and P[X = 0] = 1 - p), then its entropy is given by the binary entropy function

    H_b(p) = -p \log p - (1 - p) \log(1 - p)

H_b(p) is concave, attains its maximum at H_b(1/2) = \log(2), and attains its minimum at H_b(0) = H_b(1) = 0.

Example: Two Questions
- Will it rain today in Durham? Rain is a reasonably likely outcome, so the answer carries close to a full bit of entropy.
- Will it rain today in Death Valley? Rain is very unlikely, so H_b(p) is close to 0 bits.

Fundamental Inequality: For any base b > 1 and any x > 0,

    \left(1 - \frac{1}{x}\right) \log_b(e) \le \log_b(x) \le (x - 1) \log_b(e)

with equalities on both sides if, and only if, x = 1.

Proof:

For the natural log, this simplifies to

    1 - \frac{1}{x} \le \ln(x) \le x - 1

To prove the upper bound, note that equality is attained at x = 1. For x > 1,

    (x - 1) - \ln(x) = \int_1^x \left(1 - \frac{1}{t}\right) dt > 0

since the integrand is strictly positive, and for x < 1,

    (x - 1) - \ln(x) = \int_x^1 \left(\frac{1}{t} - 1\right) dt > 0

since the integrand is again strictly positive.

To prove the lower bound, let y = 1/x, so that

    \ln(y) \le y - 1  \implies  \ln\left(\frac{1}{x}\right) \le \frac{1}{x} - 1  \implies  \ln(x) \ge 1 - \frac{1}{x}

Theorem: 0 \le H(X) \le \log|\mathcal{X}|

The left side follows from 0 \le p(x) \le 1. For the right side, observe that

    H(X) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}
         = \log|\mathcal{X}| + \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x) |\mathcal{X}|}
       \le \log|\mathcal{X}| + \sum_{x \in \mathcal{X}} p(x) \left( \frac{1}{p(x) |\mathcal{X}|} - 1 \right) \log(e)    [Fundamental Inequality]
         = \log|\mathcal{X}| + \log(e) - \log(e)
         = \log|\mathcal{X}|

The entropy of an n-dimensional random vector X = (X_1, X_2, ..., X_n) with pmf p(x) is defined as

    H(X) = H(X_1, X_2, ..., X_n) = \sum_{x} p(x) \log \frac{1}{p(x)}

where the sum is over all vectors x in the support of X. The joint entropy of random variables X and Y is simply the entropy of the vector (X, Y):

    H(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)}
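To make the definitions above concrete, here is a minimal Python sketch (the pmfs are made-up examples, not from the notes) that computes H(X) in bits, checks the bounds 0 \le H(X) \le \log|\mathcal{X}|, and evaluates the binary entropy H_b(p).

    from math import log2

    def entropy(pmf):
        """H(X) = sum_x p(x) log2(1/p(x)) in bits; zero-probability terms contribute 0."""
        return sum(p * log2(1 / p) for p in pmf if p > 0)

    def binary_entropy(p):
        """H_b(p) = -p log2(p) - (1-p) log2(1-p)."""
        return entropy([p, 1 - p])

    uniform = [1 / 4] * 4                 # uniform pmf on 4 symbols: H = log2(4) = 2 bits
    skewed = [0.7, 0.1, 0.1, 0.1]         # an arbitrary example pmf on the same alphabet

    for pmf in (uniform, skewed):
        H = entropy(pmf)
        assert 0 <= H <= log2(len(pmf)) + 1e-12      # 0 <= H(X) <= log|X|
        print(f"H = {H:.3f} bits   (log2 |X| = {log2(len(pmf)):.3f})")

    print(binary_entropy(0.5), binary_entropy(0.01))  # 1.0 bit vs. roughly 0.08 bits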

Conditional Entropy: The entropy of a random variable Y conditioned on the event {X = x} is a function of the conditional distribution p_{Y|X}(. | x) and is given by:

    H(Y | X = x) = \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{1}{p(y|x)}

The conditional entropy of Y given X is a function of the joint distribution p(x, y):

    H(Y | X) = \sum_{x \in \mathcal{X}} p(x) H(Y | X = x) = \sum_{x, y} p(x, y) \log \frac{1}{p(y|x)}

Warning: Note that H(Y|X) is not a random variable! This differs from the usual convention for conditioning where, for example, E[Y|X] is a random variable.

Chain Rule: The joint entropy of X and Y can be decomposed as

    H(X, Y) = H(X) + H(Y | X)

and more generally,

    H(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1)

Proof of chain rule:

    H(X, Y) = \sum_{x, y} p(x, y) \log \frac{1}{p(x, y)}
            = \sum_{x, y} p(x, y) \log \frac{1}{p(x) p(y|x)}
            = \sum_{x, y} p(x, y) \left[ \log \frac{1}{p(x)} + \log \frac{1}{p(y|x)} \right]
            = H(X) + H(Y | X)

2.2.2 Mutual Information

Mutual information is a measure of the amount of information that one random variable contains about another random variable:

    I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}

It can also be expressed as an expectation:

    I(X; Y) = E[ i(X, Y) ]    where    i(x, y) = \log \frac{p(x, y)}{p(x) p(y)}

Mutual information between X and Y can be expressed as the amount by which knowledge of X reduces the entropy of Y:

    I(X; Y) = H(Y) - H(Y | X)
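The following sketch checks the chain rule and the identity I(X;Y) = H(Y) - H(Y|X) numerically on a small made-up joint pmf (the table is an arbitrary example, not taken from the notes).

    from math import log2

    # An arbitrary example joint pmf p(x, y) on {0,1} x {0,1}.
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

    def H(probs):
        """Entropy in bits of a collection of probabilities."""
        return sum(p * log2(1 / p) for p in probs if p > 0)

    p_x = {x: p_xy[(x, 0)] + p_xy[(x, 1)] for x in (0, 1)}
    p_y = {y: p_xy[(0, y)] + p_xy[(1, y)] for y in (0, 1)}

    # H(Y|X) = sum_x p(x) H(Y | X = x), with p(y|x) = p(x, y) / p(x)
    H_Y_given_X = sum(p_x[x] * H([p_xy[(x, y)] / p_x[x] for y in (0, 1)]) for x in (0, 1))

    I_XY = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

    assert abs(H(p_xy.values()) - (H(p_x.values()) + H_Y_given_X)) < 1e-12  # chain rule
    assert abs(I_XY - (H(p_y.values()) - H_Y_given_X)) < 1e-12              # I(X;Y) = H(Y) - H(Y|X)
    print(f"H(Y|X) = {H_Y_given_X:.4f} bits,  I(X;Y) = {I_XY:.4f} bits")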

By symmetry, we also have

    I(X; Y) = H(X) - H(X | Y)

Proof of the first identity (the second follows by swapping the roles of X and Y):

    \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
        = \sum_{x, y} p(x, y) \log \frac{p(y|x)}{p(y)}
        = \underbrace{\sum_{x, y} p(x, y) \log \frac{1}{p(y)}}_{H(Y)} - \underbrace{\sum_{x, y} p(x, y) \log \frac{1}{p(y|x)}}_{H(Y|X)}

The mutual information between X and itself is equal to the entropy:

    I(X; X) = H(X) - H(X | X) = H(X)

Thus entropy is sometimes known as self-information.

(Venn diagram illustrating the relationship between H(X), H(Y), H(X|Y), H(Y|X), and I(X;Y).)

The conditional mutual information between X and Y given Z is

    I(X; Y | Z) = \sum_{x, y, z} p(x, y, z) \log \frac{p(x, y | z)}{p(x|z) p(y|z)}

or equivalently

    I(X; Y | Z) = E[ i(X, Y | Z) ],    i(x, y | z) = \log \frac{p(x, y | z)}{p(x|z) p(y|z)}

Chain Rule for mutual information:

    I(X; Y_1, Y_2) = I(X; Y_1) + I(X; Y_2 | Y_1)

and more generally

    I(X; Y_1, Y_2, ..., Y_n) = \sum_{i=1}^{n} I(X; Y_i | Y_1, Y_2, ..., Y_{i-1})

2.2.3 Example: Testing for a disease

There is a 1% chance I have a certain disease. There exists a test for this disease which is 90% accurate (i.e. P[test is positive | I have the disease] = P[test is negative | I don't have the disease] = 0.9). Let

    X = 1 if I have the disease, 0 if I don't have the disease
    Y_i = 1 if the ith test is positive, 0 if the ith test is negative

Assume the test outcomes Y = (Y_1, Y_2) are conditionally independent given X.
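The calculations in this example (worked out by hand below) can be reproduced with a short script. This is a sketch that assumes only the setup above: prior P[X = 1] = 0.01, test accuracy 0.9, and two tests that are conditionally independent given X; the helper names are made up for illustration.

    from math import log2
    from itertools import product

    prior, acc = 0.01, 0.9        # P[X = 1] and P[correct test result | X]

    def p_test(y, x):
        """P[Y_i = y | X = x]: the test agrees with X with probability `acc`."""
        return acc if y == x else 1 - acc

    # Joint pmf p(x, y1, y2) under conditional independence of the tests given X.
    p = {(x, y1, y2): (prior if x == 1 else 1 - prior) * p_test(y1, x) * p_test(y2, x)
         for x, y1, y2 in product((0, 1), repeat=3)}

    def H(probs):
        return sum(q * log2(1 / q) for q in probs if q > 0)

    def marginal(keep):
        """Marginal pmf over the coordinates listed in `keep` (0 = X, 1 = Y1, 2 = Y2)."""
        out = {}
        for key, q in p.items():
            sub = tuple(key[i] for i in keep)
            out[sub] = out.get(sub, 0.0) + q
        return out

    H_X = H(marginal([0]).values())                                    # = H_b(0.01)
    H_X_given_Y1 = H(marginal([0, 1]).values()) - H(marginal([1]).values())
    H_X_given_Y1Y2 = H(p.values()) - H(marginal([1, 2]).values())

    print(f"I(X;Y1)      = {H_X - H_X_given_Y1:.4f} bits")
    print(f"I(X;Y1,Y2)   = {H_X - H_X_given_Y1Y2:.4f} bits")
    print(f"I(X;Y2 | Y1) = {H_X_given_Y1 - H_X_given_Y1Y2:.4f} bits")

The conditional entropies here are obtained via the chain rule, H(X|Y) = H(X, Y) - H(Y), which matches the decompositions used in the hand calculation that follows.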

The probability mass functions can be computed as

    p(x, y)    y = (0,0)   y = (0,1)   y = (1,0)   y = (1,1)   p(x)
    x = 0        0.8019      0.0891      0.0891      0.0099     0.99
    x = 1        0.0001      0.0009      0.0009      0.0081     0.01

and

    p(y)       y = (0,0)   y = (0,1)   y = (1,0)   y = (1,1)
                 0.8020      0.0900      0.0900      0.0180

    p(y_1)     y_1 = 0     y_1 = 1
                 0.892       0.108

The individual entropies are

    H(X) = H_b(0.01),    H(Y_1) = H(Y_2) = H_b(0.108)

The conditional entropy of X given Y_1 is computed as follows:

    H(X | Y_1 = 1) = H_b(0.9167),    H(X | Y_1 = 0) = H_b(0.0011)

and so

    H(X | Y_1) = P[Y_1 = 1] H(X | Y_1 = 1) + P[Y_1 = 0] H(X | Y_1 = 0)

Furthermore,

    H(X | Y_1, Y_2) = H(X, Y_1, Y_2) - H(Y_1, Y_2)
    H(Y_1 | Y_2) = H(Y_1, Y_2) - H(Y_2)

The mutual information is

    I(X; Y_1) = H(X) - H(X | Y_1)
    I(X; Y_1, Y_2) = H(X) - H(X | Y_1, Y_2)

The conditional mutual information is

    I(X; Y_2 | Y_1) = H(X | Y_1) - H(X | Y_1, Y_2)

2.2.4 Relative Entropy

The relative entropy between distributions p and q is defined by

    D(p || q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}

This is also known as the Kullback-Leibler divergence. It can be expressed as the expectation of the log-likelihood ratio

    D(p || q) = E[ \Lambda(X) ],    X ~ p,    \Lambda(x) = \log \frac{p(x)}{q(x)}
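Here is a minimal sketch of the relative entropy just defined, using two made-up distributions on a three-letter alphabet; it returns infinity when q(x) = 0 while p(x) > 0, which is exactly the case noted next.

    from math import log2

    def kl_divergence(p, q):
        """D(p||q) = sum_x p(x) log2(p(x)/q(x)), with 0 log(0/q) taken to be 0."""
        total = 0.0
        for px, qx in zip(p, q):
            if px == 0:
                continue
            if qx == 0:
                return float("inf")       # p puts mass where q does not
            total += px * log2(px / qx)
        return total

    p = [0.5, 0.3, 0.2]                   # arbitrary example distributions
    q = [0.4, 0.4, 0.2]

    print(kl_divergence(p, q), kl_divergence(q, p))   # both nonnegative, but not equal
    print(kl_divergence(p, p))                        # 0 when the distributions agree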

Note that if there exists x such that p(x) > 0 and q(x) = 0, then D(p || q) = \infty.

Warning: D(p || q) is not a metric since it is not symmetric and it does not satisfy the triangle inequality.

The mutual information between X and Y is equal to the relative entropy between p_{X,Y}(x, y) and p_X(x) p_Y(y):

    I(X; Y) = D( p_{X,Y}(x, y) || p_X(x) p_Y(y) )

Theorem: Relative entropy is nonnegative, i.e. D(p || q) >= 0, with equality if and only if p = q. (The equality condition is addressed in a [C&T] problem.)

Proof:

    -D(p || q) = \sum_{x} p(x) \log \frac{q(x)}{p(x)}
             \le \sum_{x} p(x) \left( \frac{q(x)}{p(x)} - 1 \right) \log(e)    [Fundamental Inequality]
               = \sum_{x} q(x) \log(e) - \sum_{x} p(x) \log(e)
               = \log(e) - \log(e) = 0

and hence D(p || q) >= 0.

Important consequences of the nonnegativity of relative entropy:

- Mutual information is nonnegative, I(X; Y) >= 0, with equality if and only if X and Y are independent.
- This means that H(X) - H(X|Y) >= 0, and thus conditioning cannot increase entropy: H(X|Y) <= H(X).

Warning: Although conditioning cannot increase entropy (in expectation), it is possible that the entropy of X conditioned on a specific event, say {Y = y}, is greater than H(X), i.e. H(X | Y = y) > H(X).

2.3 Convexity & Concavity

A function f(x) is convex over an interval (a, b) \subseteq R if for every x_1, x_2 \in (a, b) and 0 \le \lambda \le 1,

    f( \lambda x_1 + (1 - \lambda) x_2 ) \le \lambda f(x_1) + (1 - \lambda) f(x_2)

The function is strictly convex if equality holds only if \lambda = 0 or \lambda = 1.

(Illustration of convexity: the chord between (x_1, f(x_1)) and (x_2, f(x_2)) lies above the graph at \bar{x} = \lambda x_1 + (1 - \lambda) x_2.)
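As a quick numeric illustration of this definition, the sketch below checks the convexity inequality on a grid of \lambda values for f(x) = x^2, used here purely as an assumed example of a convex function.

    def f(x):
        return x * x          # an example convex function (an assumption for illustration)

    x1, x2 = -1.0, 3.0
    for k in range(11):
        lam = k / 10
        mix = lam * x1 + (1 - lam) * x2
        assert f(mix) <= lam * f(x1) + (1 - lam) * f(x2) + 1e-12
    print("f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2) holds on the grid")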

Theorem: H(X) is a concave function of p(x), i.e.

    H( \underbrace{\lambda p_1 + (1 - \lambda) p_2}_{\bar{p}} ) \ge \lambda H(p_1) + (1 - \lambda) H(p_2)

This can be proved using the fundamental inequality (try it yourself). Here is an alternative proof which uses the fact that conditioning cannot increase entropy. Let Z be Bernoulli(\lambda) and let

    X ~ p_1 if Z = 1,    X ~ p_2 if Z = 0

Then

    H(X) = H( \lambda p_1 + (1 - \lambda) p_2 )

Since conditioning cannot increase entropy,

    H(X) \ge H(X | Z) = \lambda H(X | Z = 1) + (1 - \lambda) H(X | Z = 0) = \lambda H(p_1) + (1 - \lambda) H(p_2)

Combining the displays completes the proof.

Jensen's Inequality: If f(.) is a convex function over an interval I and X is a random variable with support \mathcal{X} \subseteq I, then

    E[ f(X) ] \ge f( E[X] )

Moreover, if f(.) is strictly convex, equality occurs if and only if X = E[X] is a constant. The proof of Jensen's Inequality is given in [C&T].

Example: For any set of positive numbers {x_i}_{i=1}^{n}, the geometric mean is no greater than the arithmetic mean:

    \left( \prod_{i=1}^{n} x_i \right)^{1/n} \le \frac{1}{n} \sum_{i=1}^{n} x_i

Proof: Let Z be uniformly distributed on {x_i} so that P[Z = x_i] = 1/n. Since the logarithm is concave, Jensen's inequality gives

    \log \left( \prod_{i=1}^{n} x_i \right)^{1/n} = \frac{1}{n} \sum_{i=1}^{n} \log x_i = E[ \log(Z) ] \le \log( E[Z] ) = \log\left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)

2.4 Data Processing Inequality

Markov Chain: Random variables X, Y, Z form a Markov chain, denoted X -> Y -> Z, if X and Z are independent conditioned on Y:

    p(x, z | y) = p(x | y) p(z | y)

Alternatively,

    p(x, y, z) = p(x) p(y, z | x)             (always true)
               = p(x) p(y | x) p(z | x, y)    (always true)
               = p(x) p(y | x) p(z | y)       (if Markov chain)
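A small numeric sketch of these factorizations: it builds a joint pmf of the form p(x) p(y|x) p(z|y) from made-up binary kernels, checks the conditional independence p(x, z|y) = p(x|y) p(z|y), and verifies that I(X;Y) >= I(X;Z), anticipating the data processing inequality stated next.

    from math import log2
    from itertools import product

    # Made-up binary distributions: prior p(x), channels p(y|x) and p(z|y).
    px = {0: 0.3, 1: 0.7}
    py_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # py_x[x][y] = p(y|x)
    pz_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}   # pz_y[y][z] = p(z|y)

    # Markov chain X -> Y -> Z: p(x, y, z) = p(x) p(y|x) p(z|y)
    p = {(x, y, z): px[x] * py_x[x][y] * pz_y[y][z]
         for x, y, z in product((0, 1), repeat=3)}

    def marg(keep):
        out = {}
        for key, q in p.items():
            sub = tuple(key[i] for i in keep)
            out[sub] = out.get(sub, 0.0) + q
        return out

    # Conditional independence: p(x, z | y) = p(x | y) p(z | y) for every (x, y, z).
    p_Y, p_XY, p_YZ = marg([1]), marg([0, 1]), marg([1, 2])
    for (x, y, z), q in p.items():
        lhs = q / p_Y[(y,)]
        rhs = (p_XY[(x, y)] / p_Y[(y,)]) * (p_YZ[(y, z)] / p_Y[(y,)])
        assert abs(lhs - rhs) < 1e-12

    def mutual_info(p_ab, p_a, p_b):
        return sum(q * log2(q / (p_a[(a,)] * p_b[(b,)])) for (a, b), q in p_ab.items() if q > 0)

    p_X, p_Z = marg([0]), marg([2])
    I_XY = mutual_info(p_XY, p_X, p_Y)
    I_XZ = mutual_info(marg([0, 2]), p_X, p_Z)
    assert I_XY >= I_XZ                      # data processing inequality
    print(f"I(X;Y) = {I_XY:.4f} >= I(X;Z) = {I_XZ:.4f}")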

Note that X -> Y -> Z implies Z -> Y -> X. If Z = f(Y), then X -> Y -> Z.

Theorem (Data Processing Inequality): If X -> Y -> Z, then

    I(X; Y) \ge I(X; Z)

In particular, for any function g(.), we have X -> Y -> g(Y) and so I(X; Y) >= I(X; g(Y)). No clever manipulation of Y can increase the mutual information!

Proof: By the chain rule, we can expand the mutual information in two different ways:

    I(X; Y, Z) = I(X; Z) + I(X; Y | Z) = I(X; Y) + I(X; Z | Y)

Since X and Z are conditionally independent given Y, we have I(X; Z | Y) = 0. Since I(X; Y | Z) >= 0, we have I(X; Y) >= I(X; Z).

2.5 Fano's Inequality

Suppose we want to estimate a random variable X from an observation Y. The probability of error for an estimator \hat{X} = \phi(Y) is

    P_e = P[ \hat{X} \ne X ]

Theorem (Fano's Inequality): For any estimator \hat{X} such that X -> Y -> \hat{X},

    H_b(P_e) + P_e \log(|\mathcal{X}|) \ge H(X | Y)

and thus

    P_e \ge \frac{ H(X | Y) - 1 }{ \log(|\mathcal{X}|) }

Remark: Fano's Inequality provides a lower bound on P_e for any possible function of Y!

Proof: Let E be a random variable that indicates whether an error has occurred:

    E = 1 if \hat{X} \ne X,    E = 0 if \hat{X} = X

By the chain rule, the entropy of (E, X) given \hat{X} can be expanded two different ways:

    H(E, X | \hat{X}) = H(X | \hat{X}) + \underbrace{ H(E | X, \hat{X}) }_{= 0}
                      = \underbrace{ H(E | \hat{X}) }_{\le H_b(P_e)} + \underbrace{ H(X | E, \hat{X}) }_{\le P_e \log|\mathcal{X}|}

- H(E | \hat{X}) \le H(E) = H_b(P_e) since conditioning cannot increase entropy.
- H(E | X, \hat{X}) = 0 since E is a deterministic function of X and \hat{X}.
- By the data processing inequality, H(X | \hat{X}) \ge H(X | Y).
- Furthermore,

    H(X | E, \hat{X}) = P[E = 1] \underbrace{ H(X | \hat{X}, E = 1) }_{\le \log|\mathcal{X}|} + P[E = 0] \underbrace{ H(X | \hat{X}, E = 0) }_{= 0} \le P_e \log|\mathcal{X}|

Putting everything together proves the desired result.

2.6 Summary of Basic Inequalities

Jensen's inequality: If f is a convex function, then

    E[ f(X) ] \ge f( E[X] )

and if f is a concave function, then

    E[ f(X) ] \le f( E[X] )

Data Processing Inequality: If X -> Y -> Z form a Markov chain, then

    I(X; Y) \ge I(X; Z)

Fano's Inequality: If X -> Y -> \hat{X} forms a Markov chain, then

    P[ X \ne \hat{X} ] \ge \frac{ H(X | Y) - 1 }{ \log(|\mathcal{X}|) }

2.7 Axiomatic Derivation of Mutual Information [Optional]

This section is based on lecture notes from Toby Berger.

Let X, Y denote discrete random variables with respective alphabets \mathcal{X} and \mathcal{Y}. (Assume |\mathcal{X}| < \infty and |\mathcal{Y}| < \infty.)

Let i(x, y) be the amount of information about the event {X = x} conveyed by learning {Y = y}. Let i(x, y | z) be the amount of information about the event {X = x} conveyed by learning {Y = y}, conditioned on the event {Z = z}.

Consider the four postulates:

(A) Bayesianness: i(x, y) depends only on p(x) and p(x|y), i.e.

    i(x, y) = f(\alpha, \beta)    with    \alpha = p(x),  \beta = p(x|y)

for some function f : [0, 1]^2 -> R.

(B) Smoothness: The partial derivatives of f(., .) exist:

    f_1(\alpha, \beta) = \frac{\partial f(\alpha, \beta)}{\partial \alpha},    f_2(\alpha, \beta) = \frac{\partial f(\alpha, \beta)}{\partial \beta}

(C) Successive revelation: Let y = (w, z). Then i(x, y) = i(x, w) + i(x, z | w), where

    i(x, w) = f( p(x), p(x|w) )    and    i(x, z | w) = f( p(x|w), p(x|z, w) )

and so the function f(., .) must obey

    f(\alpha, \gamma) = f(\alpha, \beta) + f(\beta, \gamma),    0 \le \alpha, \beta, \gamma \le 1

(D) Additivity: If (X, Y) and (U, V) are independent, i.e. p(x, y, u, v) = p(x, y) p(u, v), then i((x, u), (y, v)) = i(x, y) + i(u, v), where

    i((x, u), (y, v)) = f( p(x, u), p(x, u | y, v) ) = f( p(x) p(u), p(x|y) p(u|v) )

and so the function f(., .) must obey

    f(\alpha\gamma, \beta\delta) = f(\alpha, \beta) + f(\gamma, \delta),    0 \le \alpha, \beta, \gamma, \delta \le 1

Theorem: The function

    i(x, y) = \log \frac{p(x, y)}{p(x) p(y)}

is the only function which satisfies the four postulates above.

2.7.1 Proof of uniqueness of i(x, y)

Because of (B), we can apply \partial/\partial\beta to the left and right sides of (C):

    0 = f_2(\alpha, \beta) + f_1(\beta, \gamma)    \implies    f_2(\alpha, \beta) = -f_1(\beta, \gamma)

Thus f_2(\alpha, \beta) must be a function only of \beta, say g'(\beta). Integrating with respect to \beta gives

    \int f_2(\alpha, \beta) \, d\beta = f(\alpha, \beta) + c(\alpha)

i.e.

    \int g'(\beta) \, d\beta = g(\beta) = f(\alpha, \beta) + c(\alpha)

and so

    f(\alpha, \beta) = g(\beta) - c(\alpha)

Put this back into (C):

    f(\alpha, \gamma) = g(\gamma) - c(\alpha) = [ g(\beta) - c(\alpha) ] + [ g(\gamma) - c(\beta) ]    \implies    c(\beta) = g(\beta)

and therefore

    f(\alpha, \beta) = g(\beta) - g(\alpha)

Next, write (D) in terms of g(.):

    g(\beta\delta) - g(\alpha\gamma) = g(\beta) - g(\alpha) + g(\delta) - g(\gamma)

Take the derivative with respect to \delta of both sides to get

    \beta g'(\beta\delta) = g'(\delta)

Set \delta = 1/2 (it could be \delta = 1, but we are scared to try):

    \beta g'(\beta/2) = g'(1/2) = K, a constant

and so g'(\beta/2) = K/\beta. Take the integral of both sides with respect to \beta to get

    g(\beta/2) = K \ln(\beta) + C

(absorbing the factor of 1/2 from the chain rule into the arbitrary constant K). So

    g(x) = K \ln(2x) + C    or    g(x) = K \ln(x) + C'

Thus

    f(\alpha, \beta) = g(\beta) - g(\alpha) = K \ln(\beta) - K \ln(\alpha) = K \ln(\beta / \alpha)

By (A),

    i(x, y) = K \ln \frac{p(x|y)}{p(x)}

Choosing K is equivalent to choosing the log base:

- K = 1 corresponds to measuring information in nats
- K = \log_2(e) corresponds to measuring information in bits
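As a small closing check (with made-up numbers), the sketch below evaluates i(x, y) = K \ln(p(x|y)/p(x)) for the two choices of K and confirms that it agrees with \log(p(x, y) / (p(x) p(y))) in the corresponding base.

    from math import e, log, log2

    # Made-up example values for a single pair (x, y).
    p_x, p_y, p_xy = 0.2, 0.5, 0.15           # p(x), p(y), p(x, y)
    p_x_given_y = p_xy / p_y                  # p(x|y) = 0.3

    i_nats = 1.0 * log(p_x_given_y / p_x)            # K = 1
    i_bits = log2(e) * log(p_x_given_y / p_x)        # K = log2(e)

    assert abs(i_nats - log(p_xy / (p_x * p_y))) < 1e-12
    assert abs(i_bits - log2(p_xy / (p_x * p_y))) < 1e-12
    print(f"i(x, y) = {i_nats:.4f} nats = {i_bits:.4f} bits")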