ECE 4400:693 - Information Theory

1 ECE 4400:693 - Information Theory
Dr. Nghi Tran
Lecture 8: Differential Entropy
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43

2 Outline
1. Review: Entropy of discrete RVs
2. Differential Entropy: Motivation
3. Continuous RVs: A Review
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 2 / 43

5 Review: Entropy of discrete RVs - Self-Information of a RV
What is entropy? Are there intuitive notions behind it? We first define the so-called self-information.
Consider a discrete RV X with PMF P(X = x). For convenience, hereafter we denote the PMF by p(x); p(x) and p(y) refer to two different RVs with different PMFs. Note that x is the outcome of an experiment, not necessarily a number.
For a RV X, the self-information of the event X = x is defined as
I(x) = \log \frac{1}{p(x)} = -\log p(x)
If the base of the logarithm is e, it is measured in nats. Unless otherwise stated, we take the logarithm to base 2 and the measurement is in bits.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 3 / 43

6 Review: Entropy of discrete RVs - Self-Information of a RV (Continued)
I(x) = \log \frac{1}{p(x)} = -\log p(x)
Let us see a very simple example: suppose a discrete information source emits binary bits 0 and 1, each with probability 1/2. The self-information of each bit is 1 bit. If the source emits a block of k bits over k time intervals, the self-information is k bits. So we already have an appropriate measure of information.
Observe that a high-probability event conveys less information than a low-probability one.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 4 / 43
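
To make the definition concrete, here is a small Python sketch (my addition, not part of the slides) that evaluates I(x) = -log p(x) in a chosen base; the fair-bit case recovers the 1-bit value above.

```python
import math

def self_information(p, base=2.0):
    """Self-information I(x) = -log_base p(x) of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(p) / math.log(base)

# A fair bit (p = 1/2) carries 1 bit of self-information;
# a rarer outcome (p = 1/8) carries 3 bits.
print(self_information(0.5))               # 1.0 bit
print(self_information(0.125))             # 3.0 bits
print(self_information(0.5, base=math.e))  # ~0.693 nats
```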

7 Review: Entropy of discrete RVs - Entropy of a RV
So what is the entropy of a RV? Simply speaking, it is the average self-information, or equivalently a measure of the uncertainty of a RV.
Definition: The entropy H(X) of a discrete RV X with PMF p(x) is given by
H(X) = -\sum_x p(x) \log p(x)
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 5 / 43

8 Review: Entropy of discrete RVs - Entropy of a RV (Continued)
H(X) = -\sum_x p(x) \log p(x)
Note that H(X) is a functional of the distribution of X: it does not depend on the actual values taken by X, only on the probabilities.
The entropy H(X) can therefore be interpreted as the expected value of the RV \log \frac{1}{p(X)}: an average self-information.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 6 / 43
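
As a quick illustration (my own sketch, not from the slides), the entropy of a finite PMF can be computed directly from the definition:

```python
import math

def entropy(pmf, base=2.0):
    """H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log(p) / math.log(base) for p in pmf if p > 0)

# A fair coin has 1 bit of entropy; a biased coin has less.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
print(entropy([1.0, 0.0]))   # 0.0 (no uncertainty)
```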

9 Differential Entropy: Motivation
What we considered earlier applies to discrete RVs. For continuous RVs, we also need to define entropy and mutual information. In fact, most of the time we need to work with continuous RVs, e.g., the Gaussian channel.
Given H(X) = -\sum_x p(x) \log p(x) for discrete X, can you guess what it would be for a continuous one?
H(X) = -\int_S f_X(x) \log f_X(x)\, dx
where f_X(x), or for simplicity f(x), is the probability density function of RV X and S is the support set of X.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 7 / 43

12 Continuous RVs: A Review - CDF and PDF
The cumulative distribution function (CDF) gives a complete description of the random variable:
F_X(x) = P(X \le x)
The probability density function (PDF) is defined as the derivative of the CDF:
f_X(x) = \frac{dF_X(x)}{dx}
Then one has the following relationship:
P(x_1 \le X \le x_2) = P(X \le x_2) - P(X \le x_1) = F_X(x_2) - F_X(x_1) = \int_{x_1}^{x_2} f_X(x)\, dx
Some properties: f_X(x) \ge 0 (the set of x where f_X(x) > 0 is referred to as the support set), and \int_{-\infty}^{+\infty} f_X(x)\, dx = 1.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 8 / 43
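
A small numerical illustration of the CDF/PDF relationship, assuming NumPy/SciPy are available (my addition, not from the slides):

```python
from scipy.integrate import quad
from scipy.stats import norm

# P(x1 <= X <= x2) computed two ways for a standard Gaussian X:
# via the CDF difference F(x2) - F(x1), and by integrating the PDF.
x1, x2 = -1.0, 2.0
p_cdf = norm.cdf(x2) - norm.cdf(x1)
p_pdf, _ = quad(norm.pdf, x1, x2)
print(p_cdf, p_pdf)   # both ~0.8186
```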

13 Continuous RVs: A Review - Joint PDF
For two RVs X and Y defined on a sample space Ω, one has the joint CDF:
F_{X,Y}(x, y) = P(X \le x, Y \le y)
The joint PDF is:
f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x\, \partial y}
The marginal PDFs can be obtained from the joint PDF as:
f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x, y)\, dy;   f_Y(y) = \int_{-\infty}^{+\infty} f_{X,Y}(x, y)\, dx
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 9 / 43

14 Continuous RVs: A Review - The Conditional PDF
The conditional PDF of the RV Y, given that the RV X takes the value x, is defined as
f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} if f_X(x) \ne 0, and 0 otherwise.
Two RVs X and Y are statistically independent if and only if
f_{Y|X}(y|x) = f_Y(y), or equivalently, f_{X,Y}(x, y) = f_X(x) f_Y(y)
This means that knowledge of X does not affect the statistics of Y, and vice versa. As we will see later, if X and Y are independent, then X provides no information about Y and vice versa.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 10 / 43

16 Quantized Random Variables
Before proceeding further with differential entropy, let us first consider quantizing a RV with a continuous PDF. We divide the range of x into bins of width Δ (quantization).
For the i-th bin, by the mean-value theorem there exists x_i such that
f(x_i) \Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx
We then define the following (discrete) RV:
X^\Delta = x_i with probability p_i = f(x_i) \Delta
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 11 / 43

19 Quantized Random Variable
[Figure: a continuous PDF f(x) and its quantized approximation with bin width Δ]
X^\Delta = x_i with probability p_i = f(x_i) \Delta
It is a scaled, quantized version of f(x), with unevenly spaced x_i.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 12 / 43

20 Entropy of Quantized Random Variable
X^\Delta = x_i with probability p_i = f(x_i) \Delta
H(X^\Delta) = -\sum_i f(x_i)\Delta \log(f(x_i)\Delta) = -\log\Delta - \sum_i f(x_i) \log(f(x_i))\, \Delta
Now, as \Delta \to 0, we have:
H(X^\Delta) \approx -\log\Delta - \int_{-\infty}^{+\infty} f(x) \log f(x)\, dx
The quantity
h(X) = -\int_{-\infty}^{+\infty} f(x) \log f(x)\, dx
is defined as the differential entropy of a continuous RV X.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 13 / 43
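
A numerical sanity check of H(X^Δ) ≈ h(X) - log Δ, using a standard Gaussian and bin width Δ = 0.01 (my own sketch; NumPy assumed):

```python
import numpy as np

# h(X) = 0.5*log2(2*pi*e*sigma^2) bits for a Gaussian; here sigma = 1.
sigma = 1.0
h_true = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

delta = 0.01
edges = np.arange(-10, 10, delta)
x_mid = edges + delta / 2
pdf = np.exp(-x_mid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
p = pdf * delta                      # bin probabilities f(x_i) * Delta
H_quant = -np.sum(p * np.log2(p))    # entropy of the quantized RV, in bits

print(H_quant)                  # ~8.69 = h(X) - log2(delta)
print(h_true - np.log2(delta))  # ~2.047 + 6.644
```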

23 Differential Entropy: Definition
Definition: The differential entropy h(X) of a continuous random variable X with density f(x) is defined as
h(X) = -\int_S f(x) \log f(x)\, dx = -E\{\log f(X)\}
where S is the support set of the random variable.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 14 / 43

24 Differential Entropy
With \Delta \to 0, H(X^\Delta) \approx -\log\Delta + h(X), where
h(X) = -\int_{-\infty}^{+\infty} f(x) \log f(x)\, dx = -E\{\log f(X)\}
h(X) does not give the absolute amount of information in X, and it is not necessarily positive. However, one can still compare the uncertainty of two continuous RVs (quantized to the same precision). Relative entropy and mutual information still work well.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 15 / 43

26 Example: Uniform Distribution
Uniform distribution X ~ U(a, b): f(x) = \frac{1}{b-a} for x \in (a, b), and 0 elsewhere.
h(X) = -\int_a^b \frac{1}{b-a} \log \frac{1}{b-a}\, dx = \log(b-a)
We can observe that h(X) < 0 when (b-a) < 1.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 16 / 43
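
A quick check of h(X) = log(b - a), including a case where it is negative (my sketch; SciPy's quad assumed):

```python
import numpy as np
from scipy.integrate import quad

# h(X) = -integral of f log2 f over (a, b) for X ~ U(a, b) equals log2(b - a);
# it is negative whenever b - a < 1.
def h_uniform(a, b):
    f = 1.0 / (b - a)
    val, _ = quad(lambda x: -f * np.log2(f), a, b)
    return val

print(h_uniform(0, 2), np.log2(2))        # 1.0, 1.0
print(h_uniform(0, 0.25), np.log2(0.25))  # -2.0, -2.0 (negative)
```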

29 Example: Gaussian Distribution
Gaussian distribution X ~ N(\mu, \sigma^2):
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 17 / 43
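
The differential entropy of this Gaussian is \frac{1}{2}\log 2\pi e\sigma^2 (see the summary slide). A Monte Carlo estimate of -E[log f(X)] agrees with the closed form (my sketch, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)
# log f(X) for the Gaussian density, then h(X) = -E[log f(X)] converted to bits
log_f = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
h_mc = -np.mean(log_f) / np.log(2)
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(h_mc, h_closed)   # both ~3.047 bits
```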

30 Joint and Conditional Entropy
Definition (Joint Entropy): The differential entropy of a set X_1, ..., X_n of random variables with joint pdf f(x_1, ..., x_n) is defined as
h(X_1, ..., X_n) = -\int f(x^n) \log f(x^n)\, dx^n
Definition (Conditional Entropy): If X and Y have a joint density function f(x, y), we can define the conditional entropy h(X|Y) as
h(X|Y) = -\int f(x, y) \log f(x|y)\, dx\, dy
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 18 / 43

32 Multivariate Gaussian
Theorem (Entropy of a multivariate Gaussian distribution): Let X_1, ..., X_n have a multivariate Gaussian distribution with mean \mu and covariance matrix Q, denoted N_n(\mu, Q). Then
h(X_1, ..., X_n) = \frac{1}{2} \log\left((2\pi e)^n |Q|\right)
where |Q| is the determinant of Q.
Proof.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 19 / 43
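
A direct evaluation of the theorem's formula with NumPy (my sketch; the matrix Q below is just an illustrative choice):

```python
import numpy as np

def gaussian_entropy_bits(Q):
    """h(N_n(mu, Q)) = 0.5 * log2((2*pi*e)^n * det(Q)) bits (independent of mu)."""
    Q = np.asarray(Q, dtype=float)
    n = Q.shape[0]
    sign, logdet = np.linalg.slogdet(Q)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(gaussian_entropy_bits(Q))                 # joint entropy in bits
print(gaussian_entropy_bits(np.eye(1) * 4.0))   # matches 0.5*log2(2*pi*e*4)
```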

33 Relative Entropy and Mutual Information
Definition (Relative Entropy): The relative entropy (or Kullback-Leibler distance) D(f||g) between two densities f and g is defined by
D(f||g) = \int f \log \frac{f}{g}
Note that D(f||g) is finite only if the support set of f is contained in the support set of g.
Definition (Mutual Information): The mutual information I(X; Y) between two random variables with joint density f(x, y) is defined as
I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)}\, dx\, dy
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 20 / 43
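
A small check of D(f||g) by numerical integration for two Gaussian densities, compared with the standard closed-form Gaussian-vs-Gaussian KL divergence (my sketch; SciPy assumed):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# D(f || g) = integral of f log(f/g), in nats, computed numerically and compared
# with the closed form log(s2/s1) + (s1^2 + (m1-m2)^2) / (2*s2^2) - 1/2.
m1, s1 = 0.0, 1.0
m2, s2 = 1.0, 2.0
f = lambda x: norm.pdf(x, m1, s1)
g = lambda x: norm.pdf(x, m2, s2)
d_num, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -20, 20)
d_closed = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(d_num, d_closed)   # both ~0.443 nats
```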

34 Information Inequality
Theorem: D(f||g) \ge 0, with equality iff f = g almost everywhere (a.e.).
Proof.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 21 / 43

35 Properties of Mutual Information
I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)}\, dx\, dy
From the definition, it is clear that:
I(X; Y) = h(X) - h(X|Y) = h(Y) - h(Y|X) = h(X) + h(Y) - h(X, Y)
Also, I(X; Y) = D(f(x, y) || f(x) f(y)).
The properties of D(f||g) and I(X; Y) are the same as in the discrete case. Why?
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 22 / 43

38 Mutual Information with Finite Partitions
Definition (Partition): Let \mathcal{X} be the range of X. A partition P of \mathcal{X} is a finite collection of disjoint sets P_i such that \cup_i P_i = \mathcal{X}.
Definition (Quantization): The quantization of X by P, denoted [X]_P, is a discrete RV defined by
Pr([X]_P = i) = Pr(X \in P_i) = \int_{P_i} f(x)\, dx
We can now give a general definition of mutual information between two arbitrary RVs X and Y with partitions P and Q, using the mutual information between the quantized versions of X and Y, [X]_P and [Y]_Q.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 23 / 43

41 Mutual Information with Finite Partitions
Definition (Mutual Information): The mutual information between two RVs X and Y is given by
I(X; Y) = \sup_{P, Q} I([X]_P; [Y]_Q)
where the supremum is over all finite partitions P and Q.
In fact, the above definition applies whether the RVs have a pdf or a pmf: it is more general.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 24 / 43

43 Example: Mutual Information between Correlated Gaussian RVs
Let (X, Y) ~ N(0, K), where
K = \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}
That is, the correlation coefficient is \rho. What is I(X; Y)?
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 25 / 43
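
The slide leaves the answer for the lecture; combining the identity I(X;Y) = h(X) + h(Y) - h(X,Y) with the Gaussian entropy theorem gives I(X;Y) = -\frac{1}{2}\log(1-\rho^2). A short sketch (my addition) evaluating both forms:

```python
import numpy as np

def gaussian_mi_bits(sigma2, rho):
    """I(X;Y) for jointly Gaussian (X, Y) with equal variances and correlation rho,
    via I = h(X) + h(Y) - h(X, Y) and the Gaussian entropy formula."""
    K = np.array([[sigma2, rho * sigma2],
                  [rho * sigma2, sigma2]])
    h_x = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
    h_xy = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
    return 2 * h_x - h_xy

rho = 0.9
print(gaussian_mi_bits(1.0, rho))   # ~1.198 bits
print(-0.5 * np.log2(1 - rho**2))   # same: -0.5*log2(1 - rho^2)
```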

44 More Properties of Differential Entropy
Corollary: h(X|Y) \le h(X), with equality iff X and Y are independent.
Theorem (Chain rule for differential entropy):
h(X_1, ..., X_n) = \sum_{i=1}^{n} h(X_i | X_1, ..., X_{i-1})
Corollary: h(X_1, ..., X_n) \le \sum_i h(X_i), with equality iff X_1, ..., X_n are independent.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 26 / 43

45 Changing Variable
Now let Y = g(X). We know that
f_Y(y) = \frac{dF_Y(y)}{dy} = f_X(g^{-1}(y)) \left| \frac{dg^{-1}(y)}{dy} \right| = f_X(g^{-1}(y)) \left| \frac{dx}{dy} \right|
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 27 / 43

46 Changing Variable - Example
Theorem: Translation does not change differential entropy: h(X + c) = h(X).
Theorem: h(aX) = h(X) + \log |a|
Corollary: For a vector-valued RV, h(AX) = h(X) + \log |\det(A)|
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 28 / 43
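
A quick consistency check of h(aX) = h(X) + log|a| using the closed-form Gaussian entropy (my sketch):

```python
import numpy as np

# If X ~ N(0, sigma^2), then aX ~ N(0, a^2 * sigma^2), so both sides can be
# evaluated with h(N(0, v)) = 0.5 * log2(2*pi*e*v).
def h_gauss_bits(var):
    return 0.5 * np.log2(2 * np.pi * np.e * var)

sigma2, a = 1.0, 3.0
lhs = h_gauss_bits(a**2 * sigma2)             # h(aX)
rhs = h_gauss_bits(sigma2) + np.log2(abs(a))  # h(X) + log2|a|
print(lhs, rhs)   # equal
```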

47 Concavity and Convexity
Same properties as with discrete RVs:
Differential Entropy: h(X) is a concave function of f_X(x).
Mutual Information: I(X; Y) is a concave function of f_X(x) for fixed f_{Y|X}(y|x); I(X; Y) is a convex function of f_{Y|X}(y|x) for fixed f_X(x).
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 29 / 43

48 Maximum Entropy Distribution
Going back to discrete RVs: for a discrete random variable taking on K values, what distribution maximizes the entropy?
For continuous RVs, we are interested in maximizing the entropy h(f) over all f that satisfy:
1. f(x) \ge 0, with equality outside the support set S
2. \int_S f(x)\, dx = 1
3. Moment constraints: \int_S f(x) r_i(x)\, dx = \alpha_i for 1 \le i \le m.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 30 / 43

50 Maximum Entropy Distribution
Maximizing the entropy h(f) over all f that satisfy:
1. f(x) \ge 0, with equality outside the support set S
2. \int_S f(x)\, dx = 1
3. Moment constraints: \int_S f(x) r_i(x)\, dx = \alpha_i for 1 \le i \le m.
Theorem (Maximum entropy distribution): Let
f(x) = f_\lambda(x) = \exp\left(\lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)\right), x \in S,
where \lambda_0, ..., \lambda_m are chosen so that f_\lambda satisfies the above constraints. Then f_\lambda uniquely maximizes h(f) over all probability densities f satisfying the above constraints.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 31 / 43

51 Example 1
First, let S = [a, b]. What is f?
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 32 / 43

52 Example 2
Now consider the constraints E(X) = 0 and E(X^2) = \sigma^2. What is f?
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 33 / 43
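
The maximizer here turns out to be the Gaussian N(0, \sigma^2) (the n = 1 case of the next slide). As an illustration (my sketch; SciPy assumed), a Laplace density with the same variance has strictly smaller differential entropy:

```python
import numpy as np
from scipy.integrate import quad

# Compare h(f) = -integral of f log2 f for a Gaussian and a Laplace density,
# both zero-mean with variance sigma2 = 1.
sigma2 = 1.0

def h_bits(pdf, lo=-12, hi=12):
    val, _ = quad(lambda x: -pdf(x) * np.log2(pdf(x)), lo, hi)
    return val

gauss = lambda x: np.exp(-x**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
b = np.sqrt(sigma2 / 2)        # Laplace scale: variance 2*b^2 = sigma2
laplace = lambda x: np.exp(-abs(x) / b) / (2 * b)

print(h_bits(gauss))    # ~2.047 bits
print(h_bits(laplace))  # ~1.943 bits, strictly smaller
```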

53 Example 3
What zero-mean distribution maximizes the entropy on (-\infty, \infty)^n for a given covariance matrix K? Answer: the multivariate Gaussian
\phi(x) = \frac{1}{(\sqrt{2\pi})^n |K|^{1/2}} \exp\left(-\frac{1}{2} x^\top K^{-1} x\right)
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 34 / 43

55 Estimation Error and Differential Entropy
Recall, for discrete RVs, the problem: assume we know RV Y and wish to guess the value of a correlated RV X. Fano's inequality relates the probability of error in guessing X to its conditional entropy H(X|Y). As we shall see later, this problem is crucial in proving the converse to Shannon's channel capacity theorem.
In one of the assignments, we will see that H(X|Y) = 0 if and only if X is a function of Y. It means that when H(X|Y) = 0, we can estimate X from Y with zero probability of error. Fano's inequality quantifies the following idea: X can be estimated with a small probability of error only if H(X|Y) is small.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 35 / 43

59 Fano's Inequality
Assume we wish to estimate X and observe Y, related to X by p(y|x). From Y, we calculate an estimate \hat{X} = g(Y). We thus have X \to Y \to \hat{X} and wish to bound the error probability P_e = Pr(X \ne \hat{X}). Also, let E be the binary error RV (E = 1 if \hat{X} \ne X, E = 0 otherwise), so that H(E) = H(P_e). Then:
Theorem (Fano's Inequality): For any estimate \hat{X} such that X \to Y \to \hat{X}, with P_e = Pr(X \ne \hat{X}), we have
H(P_e) + P_e \log|\mathcal{X}| \ge H(X|\hat{X}) \ge H(X|Y)
This implies that
P_e \ge \frac{H(X|Y) - 1}{\log|\mathcal{X}|}
P_e cannot be too small if H(X|Y) is large, i.e., correct estimation happens only when the residual randomness of X after observing Y is small.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 36 / 43
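
A tiny helper (my addition) for the weakened bound P_e >= (H(X|Y) - 1)/log|X|; the numbers below are hypothetical:

```python
import math

def fano_lower_bound(H_X_given_Y_bits, alphabet_size):
    """Weakened Fano bound P_e >= (H(X|Y) - 1) / log2|X|, clipped at 0."""
    return max(0.0, (H_X_given_Y_bits - 1.0) / math.log2(alphabet_size))

# Example: X over 16 symbols; suppose observing Y leaves H(X|Y) = 3 bits.
print(fano_lower_bound(3.0, 16))   # P_e >= 0.5
```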

62 Estimation Error and Differential Entropy
Now we have the estimation counterpart to Fano's inequality:
Theorem (Estimation error and differential entropy): For any RV X and estimator \hat{X},
E(X - \hat{X})^2 \ge \frac{1}{2\pi e} \exp(2 h(X))
with equality iff X is Gaussian and \hat{X} is the mean of X; here h(X) is in nats.
E(X - \hat{X})^2 is the expected prediction error.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 37 / 43
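
For the equality case, a Gaussian X with \hat{X} = E[X] gives E(X - \hat{X})^2 = \sigma^2, which matches (1/2\pi e)\exp(2h(X)) exactly (my sketch):

```python
import numpy as np

# X ~ N(mu, sigma^2), estimator X_hat = mu: the mean squared error is sigma^2,
# and with h(X) in nats, exp(2*h(X)) / (2*pi*e) equals sigma^2 as well.
mu, sigma2 = 0.0, 2.5
h_nats = 0.5 * np.log(2 * np.pi * np.e * sigma2)
bound = np.exp(2 * h_nats) / (2 * np.pi * np.e)
print(bound, sigma2)   # both 2.5
```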

63 AEP for Continuous RVs
Theorem (AEP - Discrete): If X_1, ..., X_n are i.i.d. with pmf p(x), then
-\frac{1}{n} \log p(X_1, ..., X_n) \to H(X) in probability.
Theorem (AEP - Continuous): If X_1, ..., X_n are i.i.d. with density f(x), then
-\frac{1}{n} \log f(X_1, ..., X_n) \to h(X) in probability.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 38 / 43
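
An empirical illustration of the continuous AEP (my sketch, NumPy assumed): the sample average of -log2 f(X_i) concentrates around h(X) as n grows.

```python
import numpy as np

# For i.i.d. X_i ~ N(0, 1), -(1/n) * sum(log2 f(X_i)) should approach
# h(X) = 0.5*log2(2*pi*e) ~ 2.047 bits as n increases.
rng = np.random.default_rng(1)
h_true = 0.5 * np.log2(2 * np.pi * np.e)
for n in (10, 1_000, 100_000):
    x = rng.standard_normal(n)
    log2_f = (-0.5 * np.log(2 * np.pi) - x**2 / 2) / np.log(2)
    print(n, -np.mean(log2_f), h_true)
```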

65 Typical Set
Definition (Typical Set - Discrete): The typical set A_\epsilon^{(n)} with respect to p(x) is the set of sequences (x_1, ..., x_n) \in \mathcal{X}^n with the property
2^{-n(H(X)+\epsilon)} \le p(x_1, ..., x_n) \le 2^{-n(H(X)-\epsilon)}
Definition (Typical Set - Continuous): For \epsilon > 0 and any n, the typical set A_\epsilon^{(n)} with respect to f(x) is defined as
A_\epsilon^{(n)} = \left\{ (x_1, ..., x_n) \in S^n : \left| -\frac{1}{n} \log f(x_1, ..., x_n) - h(X) \right| \le \epsilon \right\}
where f(x_1, ..., x_n) = \prod_{i=1}^{n} f(x_i).
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 39 / 43

67 Typical Set and Volume
Definition: The volume Vol(A) of a set A \subseteq R^n is defined as Vol(A) = \int_A dx_1 \cdots dx_n.
Theorem (Typical Set Properties):
1. Pr(A_\epsilon^{(n)}) > 1 - \epsilon for n sufficiently large.
2. Vol(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)} for all n.
3. Vol(A_\epsilon^{(n)}) \ge (1-\epsilon) 2^{n(h(X)-\epsilon)} for n sufficiently large.
Theorem: The set A_\epsilon^{(n)} is the smallest-volume set with probability greater than or equal to 1-\epsilon, to first order in the exponent.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 40 / 43

69 Differential Entropy: Summary
h(X) = h(f) = -\int_S f(x) \log f(x)\, dx
f(X^n) \doteq 2^{-nh(X)}
Vol(A_\epsilon^{(n)}) \doteq 2^{nh(X)}
H([X]_{2^{-n}}) \approx h(X) + n
h(N(0, \sigma^2)) = \frac{1}{2} \log 2\pi e \sigma^2
h(N_n(\mu, K)) = \frac{1}{2} \log\left((2\pi e)^n |K|\right)
D(f||g) = \int f \log \frac{f}{g} \ge 0
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 41 / 43

70 Differential Entropy: Summary
h(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} h(X_i | X_1, X_2, ..., X_{i-1})   (8.88)
h(X|Y) \le h(X)   (8.89)
h(aX) = h(X) + \log |a|   (8.90)
I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)} \ge 0   (8.91)
\max_{E[XX^t] = K} h(X) = \frac{1}{2} \log\left((2\pi e)^n |K|\right)   (8.92)
E(X - \hat{X}(Y))^2 \ge \frac{1}{2\pi e} e^{2 h(X|Y)}
2^{nH(X)} is the effective alphabet size for a discrete random variable.
2^{nh(X)} is the effective support set size for a continuous random variable.
2^C is the effective alphabet size of a channel of capacity C.
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 42 / 43

71 Thank you!
Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 43 / 43


More information

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 Please submit the solutions on Gradescope. EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code

More information

A Probability Review

A Probability Review A Probability Review Outline: A probability review Shorthand notation: RV stands for random variable EE 527, Detection and Estimation Theory, # 0b 1 A Probability Review Reading: Go over handouts 2 5 in

More information

ECE 587 / STA 563: Lecture 2 Measures of Information Information Theory Duke University, Fall 2017

ECE 587 / STA 563: Lecture 2 Measures of Information Information Theory Duke University, Fall 2017 ECE 587 / STA 563: Lecture 2 Measures of Information Information Theory Duke University, Fall 207 Author: Galen Reeves Last Modified: August 3, 207 Outline of lecture: 2. Quantifying Information..................................

More information

Communications Theory and Engineering

Communications Theory and Engineering Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 AEP Asymptotic Equipartition Property AEP In information theory, the analog of

More information

National Sun Yat-Sen University CSE Course: Information Theory. Maximum Entropy and Spectral Estimation

National Sun Yat-Sen University CSE Course: Information Theory. Maximum Entropy and Spectral Estimation Maximum Entropy and Spectral Estimation 1 Introduction What is the distribution of velocities in the gas at a given temperature? It is the Maxwell-Boltzmann distribution. The maximum entropy distribution

More information

p. 6-1 Continuous Random Variables p. 6-2

p. 6-1 Continuous Random Variables p. 6-2 Continuous Random Variables Recall: For discrete random variables, only a finite or countably infinite number of possible values with positive probability (>). Often, there is interest in random variables

More information

3. Review of Probability and Statistics

3. Review of Probability and Statistics 3. Review of Probability and Statistics ECE 830, Spring 2014 Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture

More information

Lecture 11: Quantum Information III - Source Coding

Lecture 11: Quantum Information III - Source Coding CSCI5370 Quantum Computing November 25, 203 Lecture : Quantum Information III - Source Coding Lecturer: Shengyu Zhang Scribe: Hing Yin Tsang. Holevo s bound Suppose Alice has an information source X that

More information

Machine Learning Srihari. Information Theory. Sargur N. Srihari

Machine Learning Srihari. Information Theory. Sargur N. Srihari Information Theory Sargur N. Srihari 1 Topics 1. Entropy as an Information Measure 1. Discrete variable definition Relationship to Code Length 2. Continuous Variable Differential Entropy 2. Maximum Entropy

More information

4. CONTINUOUS RANDOM VARIABLES

4. CONTINUOUS RANDOM VARIABLES IA Probability Lent Term 4 CONTINUOUS RANDOM VARIABLES 4 Introduction Up to now we have restricted consideration to sample spaces Ω which are finite, or countable; we will now relax that assumption We

More information