ECE 6980: An Algorithmic and Information-Theoretic Toolbox for Massive Data
Instructor: Jayadev Acharya
Lecture #4, 6th September, 2016
Scribe: Xiao Xu
Please send errors to the instructor and the scribe.

We began with a brief recap of the previous lecture, then outlined the three topics for today:

- Basics of information theory
- Proof of Fano's inequality
- A simple algorithm to learn many classes almost optimally

1 Basic Information Theory

1.1 Entropy

Definition 1. The entropy of a discrete distribution $P$ over $\mathcal{X}$ is defined as
$$H(P) = \sum_{x \in \mathcal{X}} P(x) \log \frac{1}{P(x)}.$$

Claim 2. Let $P$ be a discrete distribution over $\mathcal{X}$. Then
$$H(P) \le \log |\mathcal{X}|.$$

Proof. We use Jensen's inequality and the concavity of $\log(\cdot)$:
$$H(P) = \sum_{x \in \mathcal{X}} P(x) \log \frac{1}{P(x)} \le \log \Big( \sum_{x \in \mathcal{X}} P(x) \cdot \frac{1}{P(x)} \Big) = \log |\mathcal{X}|.$$

To understand entropy, consider the problem of identifying a number in a set. Suppose $\mathcal{X} = \{0, 1, 2, \ldots, 127\}$ and $x$ is chosen from $\mathcal{X}$ uniformly at random. We would like to identify $x$ by asking yes/no questions: what is the smallest number of questions needed to determine the exact value of $x$? The answer is $7 = \log 128$, achieved by binary search: first ask whether $x < 64$; if yes, ask whether $x < 32$, and otherwise ask whether $x < 96$; keep halving the remaining range until $x$ is identified exactly. In fact, the entropy $H$ characterizes the shortest expected description length needed to identify a random variable.
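To make the definition concrete, here is a minimal Python sketch (our own illustration, not from the lecture) that computes entropy in bits and checks Claim 2 on two arbitrary example distributions:

```python
import math

def entropy(p):
    """Entropy of a discrete distribution (list of probabilities), in bits."""
    return sum(px * math.log2(1 / px) for px in p if px > 0)

uniform = [1 / 128] * 128            # uniform over {0, 1, ..., 127}
skewed = [0.5, 0.25, 0.125, 0.125]

print(entropy(uniform))              # 7.0 = log2(128), matching the 7-question binary search
print(entropy(skewed))               # 1.75 <= log2(4) = 2, consistent with Claim 2
```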

1.2 Joint Entropy

Definition 3. For a joint discrete distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, the joint entropy is defined as
$$H(P) = \sum_{x,y} P(x,y) \log \frac{1}{P(x,y)}.$$

Definition 4. Suppose $P$ is a joint distribution over $\mathcal{X} \times \mathcal{Y}$. The marginal distributions of $P$ are defined as
$$P_X(x) = \sum_y P(x,y), \qquad P_Y(y) = \sum_x P(x,y).$$

Definition 5. Suppose $P$ is a joint distribution over $\mathcal{X} \times \mathcal{Y}$. We say $P$ is a product distribution if
$$P(x,y) = P_X(x) \cdot P_Y(y) \quad \text{for all } x, y.$$

We consider the following example. Table 1 records the weather in San Diego, where $\mathcal{X} = \{\text{Sunny}, \text{Not Sunny}\}$ and $\mathcal{Y} = \{\text{Hot}, \text{Cold}\}$.

[Table 1: Number of days of each kind of weather, indexed by $X \in \{\text{Sunny}, \text{Not Sunny}\}$ and $Y \in \{\text{Hot}, \text{Cold}\}$]

The question is: is the resulting probability distribution over weather a product distribution? The answer is no, since the counts in the table give
$$\Pr(X = \text{Sunny} \mid Y = \text{Hot}) \ne \Pr(X = \text{Sunny} \mid Y = \text{Cold}).$$
In fact, we can change the numbers in the table appropriately to make it a product distribution.
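The product-distribution test of Definition 5 is easy to run in code. The sketch below uses hypothetical day counts standing in for Table 1 (the original numbers did not survive transcription); only the comparison logic matters:

```python
# Hypothetical day counts standing in for Table 1; illustrative values only.
counts = {("Sunny", "Hot"): 120, ("Sunny", "Cold"): 60,
          ("Not Sunny", "Hot"): 30, ("Not Sunny", "Cold"): 90}

total = sum(counts.values())
P = {xy: c / total for xy, c in counts.items()}        # empirical joint distribution

# Marginals P_X and P_Y (Definition 4).
PX, PY = {}, {}
for (x, y), p in P.items():
    PX[x] = PX.get(x, 0.0) + p
    PY[y] = PY.get(y, 0.0) + p

# Product test (Definition 5): does P(x,y) = P_X(x) * P_Y(y) everywhere?
is_product = all(abs(p - PX[x] * PY[y]) < 1e-12 for (x, y), p in P.items())
print(is_product)   # False: Pr(Sunny | Hot) = 0.8 while Pr(Sunny | Cold) = 0.4
```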

Claim 6. If $P$ over $\mathcal{X} \times \mathcal{Y}$ is a product distribution, then
$$H(P) = H(P_X) + H(P_Y).$$

Proof.
$$H(P) = \sum_{x,y} P(x,y) \log \frac{1}{P(x,y)} = \sum_{x,y} P_X(x) P_Y(y) \log \frac{1}{P_X(x) P_Y(y)}$$
$$= \sum_{x,y} P_X(x) P_Y(y) \Big( \log \frac{1}{P_X(x)} + \log \frac{1}{P_Y(y)} \Big)$$
$$= \sum_x P_X(x) \log \frac{1}{P_X(x)} + \sum_y P_Y(y) \log \frac{1}{P_Y(y)} = H(P_X) + H(P_Y).$$

Definition 7. If $X$ is a random variable drawn from a distribution $P$ over $\mathcal{X}$, we define the entropy of the random variable $X$ as
$$H(X) = H(P).$$

Similar to Claim 6, we also have the conclusion that if $X, Y$ are independent random variables, then
$$H(X, Y) = H(X) + H(Y).$$
More generally, we have the following claim.

Claim 8. For any two random variables $X, Y$, the following inequality holds:
$$H(X, Y) \le H(X) + H(Y).$$

Proof. By definition,
$$H(X, Y) = \sum_{x,y} P(x,y) \log \frac{1}{P(x,y)},$$
$$H(X) = \sum_x P_X(x) \log \frac{1}{P_X(x)} = \sum_{x,y} P(x,y) \log \frac{1}{P_X(x)},$$
$$H(Y) = \sum_y P_Y(y) \log \frac{1}{P_Y(y)} = \sum_{x,y} P(x,y) \log \frac{1}{P_Y(y)}.$$
Thus, we have
$$H(X) + H(Y) - H(X, Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P_X(x) P_Y(y)} = D(P \| P_X P_Y) \ge 0,$$
where the last step is the non-negativity of the KL divergence.
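The proof identifies the gap $H(X) + H(Y) - H(X,Y)$ with the divergence $D(P \| P_X P_Y)$. A quick numeric check on a toy joint distribution (values chosen arbitrarily by us):

```python
import math

def entropy(p):
    """Entropy in bits of a distribution given as a dict of probabilities."""
    return sum(v * math.log2(1 / v) for v in p.values() if v > 0)

# A small joint distribution P over {0,1} x {0,1} (arbitrary example values).
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
PX = {x: sum(p for (a, _), p in P.items() if a == x) for x in (0, 1)}
PY = {y: sum(p for (_, b), p in P.items() if b == y) for y in (0, 1)}

gap = entropy(PX) + entropy(PY) - entropy(P)     # H(X) + H(Y) - H(X,Y)
kl = sum(p * math.log2(p / (PX[x] * PY[y])) for (x, y), p in P.items())

print(gap >= 0)                    # Claim 8: subadditivity of entropy
print(abs(gap - kl) < 1e-12)       # the gap is exactly D(P || P_X P_Y)
```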

1.3 Conditional Entropy

Definition 9. Consider two random variables $X, Y$ taking values in $\mathcal{X}, \mathcal{Y}$ respectively, with joint distribution $P$. The conditional entropy of $X$ given $Y = y$ is defined as
$$H(X \mid Y = y) = \sum_x \Pr(X = x \mid Y = y) \log \frac{1}{\Pr(X = x \mid Y = y)},$$
and the conditional entropy of $X$ given $Y$ is
$$H(X \mid Y) = \sum_y P_Y(y) \, H(X \mid Y = y) = \sum_{x,y} P(x,y) \log \frac{1}{\Pr(X = x \mid Y = y)}.$$

Exercise. Show the chain rule of entropy:
$$H(X, Y) = H(Y) + H(X \mid Y) = H(X) + H(Y \mid X).$$
More generally, suppose $X_1, \ldots, X_n$ are $n$ random variables; show that
$$H(X_1, \ldots, X_n) = H(X_1) + \sum_{i=2}^n H(X_i \mid X_1, \ldots, X_{i-1}).$$

Remark. Combining the chain rule of entropy with Claim 8, we can derive
$$H(X \mid Y) \le H(X).$$
Intuitively, observing $Y$ provides information about $X$, so the remaining uncertainty about $X$ can only shrink.

Definition 10. The mutual information of two random variables $X, Y$ is defined as
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$
Intuitively, $I(X; Y)$ characterizes the information provided by $Y$ (or $X$) to reduce the uncertainty of $X$ (or $Y$), and it is always non-negative.
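These quantities are straightforward to compute. The following sketch (same arbitrary toy joint distribution as before) verifies that conditioning reduces entropy and that the expressions for $I(X;Y)$ agree:

```python
import math

def entropy(p):
    return sum(v * math.log2(1 / v) for v in p.values() if v > 0)

# Same toy joint distribution as in the previous sketch (arbitrary values).
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
PX = {x: sum(p for (a, _), p in P.items() if a == x) for x in (0, 1)}
PY = {y: sum(p for (_, b), p in P.items() if b == y) for y in (0, 1)}

# H(X|Y) = sum_{x,y} P(x,y) log(1 / Pr(X=x | Y=y)), with Pr(x|y) = P(x,y) / P_Y(y).
H_X_given_Y = sum(p * math.log2(PY[y] / p) for (x, y), p in P.items() if p > 0)
I = entropy(PX) - H_X_given_Y

print(H_X_given_Y <= entropy(PX))          # conditioning reduces entropy
print(abs(I - (entropy(PX) + entropy(PY) - entropy(P))) < 1e-12)  # definitions agree
print(I >= 0)                              # mutual information is non-negative
```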

2 Multiway Classification and Fano's Inequality

2.1 Multiway Classification

Suppose there are $M$ different distributions $P_1, \ldots, P_M$. Consider the following steps:

1. Choose $X$ uniformly at random, $X \sim U[M]$, which selects the distribution $P_X$.
2. Observe $Y$ drawn from the distribution $P_X$.
3. Use the observation $Y$ to output a prediction $\hat{X}$ of $X$.

For the process described above, we have the following claim.

Claim 11.
$$I(X; Y) \ge \Pr(\text{correct}) \cdot \log M - \log 2,$$
where $\Pr(\text{correct}) = \Pr(\hat{X} = X)$.

Proof. Define
$$Z = \begin{cases} 0, & \text{if } X \ne \hat{X}, \\ 1, & \text{if } X = \hat{X}. \end{cases}$$
Since $Z$ is a function of $(X, \hat{X})$, clearly $H(Z \mid X, \hat{X}) = 0$. Thus, by the chain rule of entropy,
$$H(X, Z \mid \hat{X}) = H(X \mid \hat{X}) + H(Z \mid X, \hat{X}) = H(X \mid \hat{X}).$$
On the other hand, we have
$$H(X, Z \mid \hat{X}) = H(Z \mid \hat{X}) + H(X \mid Z, \hat{X}) \le H(Z) + \Pr(Z = 1) \, H(X \mid \hat{X}, Z = 1) + \Pr(Z = 0) \, H(X \mid \hat{X}, Z = 0) \le \log 2 + \Pr(Z = 0) \log M.$$
The last inequality holds because $H(X \mid \hat{X}, Z = 1) = 0$ (given $Z = 1$ we know $X = \hat{X}$) and
$$H(X \mid \hat{X}, Z = 0) = H(X \mid \hat{X}, X \ne \hat{X}) \le \log M.$$
Thus, we get
$$H(X \mid \hat{X}) \le \log 2 + \Pr(\text{error}) \cdot \log M.$$
Since $X$ is uniform over $[M]$, $H(X) = \log M$, and therefore
$$I(X; \hat{X}) = H(X) - H(X \mid \hat{X}) \ge \log M - \log 2 - \Pr(\text{error}) \log M = \Pr(\text{correct}) \cdot \log M - \log 2.$$
In this probability model, $X \to Y \to \hat{X}$ forms a Markov chain, so the data processing inequality gives
$$I(X; Y) \ge I(X; \hat{X}) \ge \Pr(\text{correct}) \cdot \log M - \log 2.$$
We will use this result to prove Fano's inequality.
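As a sanity check of Claim 11, the sketch below (our own illustration: $M = 4$ Bernoulli distributions and a maximum-likelihood predictor, all chosen by us) computes $I(X;Y)$ exactly and estimates $\Pr(\text{correct})$ by simulation. With only one observed bit the bound is loose, but it holds:

```python
import math, random

M = 4
biases = [0.1, 0.35, 0.65, 0.9]            # P_j = Bernoulli(biases[j]); arbitrary choices

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Exact I(X;Y) = H(Y) - H(Y|X) for X uniform on [M] and Y a single flip of P_X.
p_bar = sum(biases) / M
I = h(p_bar) - sum(h(b) for b in biases) / M

# Estimate Pr(correct) for the maximum-likelihood predictor by simulation.
random.seed(0)
trials = 100_000
correct = 0
for _ in range(trials):
    x = random.randrange(M)
    y = 1 if random.random() < biases[x] else 0
    xhat = max(range(M), key=lambda j: biases[j] if y == 1 else 1 - biases[j])
    correct += (xhat == x)
p_correct = correct / trials

# Claim 11 (logs in bits, so log 2 = 1): I(X;Y) >= Pr(correct) * log2(M) - 1.
print(I >= p_correct * math.log2(M) - 1)   # True
```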

2.2 Fano's Inequality

Theorem 12 (Fano's inequality). Suppose there are $M$ different distributions $P_1, \ldots, P_M$ such that
$$D(P_i \| P_j) \le \beta \quad \text{for all } i, j.$$
Then for the multiway classification problem defined in Section 2.1, the following inequality holds:
$$\Pr(\text{correct}) \le \frac{\beta + \log 2}{\log M}.$$

Proof. For the multiway classification problem, it is not hard to see that
$$\Pr(X = j) = \frac{1}{M}, \qquad \Pr(Y = y) = \frac{1}{M} \sum_j P_j(y) =: \bar{P}(y).$$
By Claim 11, it suffices to show that $I(X; Y) \le \beta$. Consider
$$I(X; Y) = H(X) - H(X \mid Y) = \sum_{j,y} \Pr(X = j, Y = y) \log \frac{\Pr(X = j \mid Y = y)}{\Pr(X = j)} = \sum_{j,y} \Pr(X = j, Y = y) \log \frac{\Pr(X = j, Y = y)}{\Pr(X = j) \Pr(Y = y)}$$
$$= \sum_{j,y} \frac{1}{M} P_j(y) \log \frac{P_j(y)}{\bar{P}(y)} = \frac{1}{M} \sum_j D(P_j \| \bar{P}).$$
So we only need to prove that $D(P_i \| \bar{P}) \le \beta$ for every $i$. For any distribution $P$ and distributions $Q_1, \ldots, Q_M$,
$$\sum_{j=1}^M D(P \| Q_j) = \sum_{j=1}^M \sum_x P(x) \log \frac{P(x)}{Q_j(x)} = M \sum_x P(x) \log \frac{P(x)}{\big( \prod_{j=1}^M Q_j(x) \big)^{1/M}} \ge M \sum_x P(x) \log \frac{P(x)}{\frac{1}{M} \sum_{j=1}^M Q_j(x)} = M \cdot D\Big( P \,\Big\|\, \frac{1}{M} \sum_{j=1}^M Q_j \Big).$$
The inequality comes from the convexity of $\exp(\cdot)$:
$$\Big( \prod_{j=1}^M Q_j(x) \Big)^{1/M} = \exp\Big( \frac{1}{M} \sum_{j=1}^M \log Q_j(x) \Big) \le \frac{1}{M} \sum_{j=1}^M \exp(\log Q_j(x)) = \frac{1}{M} \sum_{j=1}^M Q_j(x).$$
Thus,
$$D(P_i \| \bar{P}) \le \frac{1}{M} \sum_j D(P_i \| P_j) \le \beta.$$
Therefore $I(X; Y) \le \beta$, and the conclusion follows.
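The key step of the proof, bounding the divergence to a mixture by the average of pairwise divergences, can be checked numerically. A small sketch of ours on random distributions:

```python
import math, random

def kl(p, q):
    """KL divergence D(p || q) in bits, distributions given as probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rand_dist(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

random.seed(1)
k, M = 5, 8
P = rand_dist(k)
Qs = [rand_dist(k) for _ in range(M)]
P_bar = [sum(q[i] for q in Qs) / M for i in range(k)]   # the mixture (1/M) sum_j Q_j

lhs = kl(P, P_bar)
rhs = sum(kl(P, q) for q in Qs) / M
print(lhs <= rhs + 1e-12)   # True: D(P || mean of Q_j) <= (1/M) sum_j D(P || Q_j)
```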

7 The inequality comes from conveity of ep( : /M M Q j ( = ep M j= M = M log(q j ( j= ep(log(q j ( j= Q j ( Thus, D(P i P D(P i P j β M Thus, I(X; Y β and then we get the conclusion. j j= (33 3 Learning Distributions Definition 3. Consider a collection of distributions P and a distance measure d : P P R, define an ε cover of P as a set of distributions P, P 2,..., P N P, s.t. P P, there eists i N s.t. d(p, P i < ε. Claim 4. For any collection of distributions P, we use the total variation distance as the distance measure, i.e. d = d T V. Let N ε be the smallest size of the ε cover of P. Then for any distribution P P, we need only samples to learn ˆP s.t. d T V ( ˆP, P < ε with probability at least 3/4. log(n ε ε 2 (34 To prove this claim, we first introduce the problem of finding the closest distribution. Consider a collection of distributions P and N distributions P, P 2,..., P N P. Suppose there is another distribution P P and we observe n samples X,..., X n from P. Our goal is to output the closest distribution to P among {P i } N based on the distance measure d = d T V. Theorem 5. With C log(n ε 2 (35 samples, with probability at least 3/4 we can learn P j s.t. where = min j d T V (P, P j d T V (P, P j 8 + O(ε (36 In the net lecture, we will show how to prove this theorem and therefore prove the previous claim. Also, we will give a simple algorithm to learn distributions optimally. 7
