ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory. Lecture 8: Differential Entropy. Dr. Nghi Tran (ECE, University of Akron).

Outline: 1. Review: Entropy of discrete RVs; 2. Differential Entropy: Motivation; 3. Continuous RVs: A Review.

Review: Entropy of discrete RVs - Self-Information of a RV. What is entropy, and is there an intuitive notion behind it? We first define the so-called self-information. Consider a discrete RV X with PMF P(X = x); for convenience we write the PMF as p(x), with the understanding that p(x) and p(y) refer to two different RVs and two different PMFs. Note that x is the outcome of an experiment, not necessarily a number. For a RV X, the self-information of the event X = x is defined as $I(x) = \log \frac{1}{p(x)} = -\log p(x)$. If the base of the logarithm is e, self-information is measured in nats; unless otherwise stated, we take logarithms to base 2 and measure information in bits.

Review: Entropy of discrete RVs - Self-Information of a RV (Continued). $I(x) = \log \frac{1}{p(x)} = -\log p(x)$. A very simple example: suppose a discrete information source emits binary symbols 0 and 1 with equal probability 1/2. The self-information of each symbol is 1 bit. If the source emits a block of k such bits over k time intervals, the self-information of the block is k bits. So we already have a sensible measure of information. Observe that a high-probability event conveys less information than a low-probability one.
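The arithmetic in this example is easy to reproduce. A tiny Python sketch (ours, not part of the slides; the helper name self_information_bits is an assumption):

```python
# A tiny numeric illustration of the example above: a symbol with probability 1/2
# carries -log2(1/2) = 1 bit, and a block of k independent equiprobable bits has
# probability 2^{-k}, hence self-information k bits.
import numpy as np

def self_information_bits(p):
    """Self-information -log2 p(x) of an outcome with probability p."""
    return -np.log2(p)

print(self_information_bits(0.5))        # 1.0 bit
k = 8
print(self_information_bits(0.5 ** k))   # 8.0 bits for a block of 8 bits
```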

Review: Entropy of discrete RVs - Entropy of a RV. So what is the entropy of a RV? Simply speaking, it is the average self-information, and it can be viewed as a measure of the uncertainty of a RV. Definition: The entropy H(X) of a discrete RV X with PMF p(x) is given by $H(X) = -\sum_x p(x)\log p(x)$.

Review: Entropy of discrete RVs - Entropy of a RV (Continued). $H(X) = -\sum_x p(x)\log p(x)$. Note that H(X) is a functional of the distribution of X: it does not depend on the actual values taken by X, only on their probabilities. The entropy H(X) can be interpreted as the expected value of the RV $\log\frac{1}{p(X)}$, i.e., the average self-information.
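To make the "average self-information" reading concrete, here is a minimal Python sketch (ours, not from the slides; entropy_bits is our name) that evaluates $H(X) = -\sum_x p(x)\log_2 p(x)$ for a few PMFs:

```python
# Entropy in bits of a discrete PMF, as the average of -log2 p(x).
import numpy as np

def entropy_bits(pmf):
    """Entropy -sum p log2 p of a PMF given as an array of probabilities."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # 0 log 0 is taken as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))       # fair bit: 1.0 bit
print(entropy_bits([0.9, 0.1]))       # biased bit: about 0.469 bits
print(entropy_bits([0.25] * 4))       # uniform over 4 symbols: 2.0 bits
```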

Differential Entropy: Motivation. What we considered earlier applies to discrete RVs. For continuous RVs we also need to define entropy and mutual information; in fact, most of the time we work with continuous RVs, e.g., the Gaussian channel. Given $H(X) = -\sum_x p(x)\log p(x)$ for a discrete X, can you guess the analogue for a continuous one? $H(X) = -\int_S f_X(x)\log f_X(x)\,dx$, where $f_X(x)$, or simply $f(x)$, is the probability density function of the RV X and S is the support set of X.

Continuous RVs: A Review - CDF and PDF. The cumulative distribution function (CDF) gives a complete description of the random variable: $F_X(x) = P(X \le x)$. The probability density function (PDF) is defined as the derivative of the CDF: $f_X(x) = \frac{dF_X(x)}{dx}$. Then one has the relationship $P(x_1 \le X \le x_2) = P(X \le x_2) - P(X \le x_1) = F_X(x_2) - F_X(x_1) = \int_{x_1}^{x_2} f_X(x)\,dx$. Some properties: $f_X(x) \ge 0$ (the set of x where $f_X(x) > 0$ is referred to as the support set), and $\int_{-\infty}^{+\infty} f_X(x)\,dx = 1$.
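As a quick sanity check of the CDF/PDF relationship, a small sketch (ours, not from the slides) for an exponential RV with rate 1, comparing $F_X(x_2) - F_X(x_1)$ with a numerical integral of the density:

```python
# P(x1 <= X <= x2) computed two ways: via the CDF and by integrating the PDF.
from scipy.stats import expon
from scipy.integrate import quad

x1, x2 = 0.5, 2.0
via_cdf = expon.cdf(x2) - expon.cdf(x1)
via_pdf, _ = quad(expon.pdf, x1, x2)   # numerical integration of the density
print(via_cdf, via_pdf)                # both about 0.471
```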

Joint PDF. For two RVs X and Y defined on a sample space Ω, the joint CDF is $F_{X,Y}(x,y) = P(X \le x, Y \le y)$ and the joint PDF is $f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}(x,y)}{\partial x\,\partial y}$. The marginal PDFs can be obtained from the joint PDF as $f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x,y)\,dy$ and $f_Y(y) = \int_{-\infty}^{+\infty} f_{X,Y}(x,y)\,dx$.

The Conditional PDF. The conditional PDF of the RV Y, given that the RV X takes the value x, is defined as $f_Y(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$ if $f_X(x) \ne 0$, and 0 otherwise. Two RVs X and Y are statistically independent if and only if $f_Y(y|x) = f_Y(y)$, or equivalently $f_{X,Y}(x,y) = f_X(x)f_Y(y)$. This means that knowledge of X does not affect the statistics of Y, and vice versa. As we will see later, if X and Y are independent, then X provides no information about Y and vice versa.

Quantized Random Variables. Before proceeding further with differential entropy, let us first consider RVs obtained by quantizing a continuous PDF. We divide the range of x into bins of width Δ (quantization). [Figure: a density f(x) partitioned into bins of width Δ along the x-axis.] For the i-th bin, there exists $x_i$ such that $f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx$, by the mean-value theorem. We then define the discrete RV $X^\Delta$ taking the values $\{x_i\}$ with probabilities $\{f(x_i)\Delta\}$.

Quantized Random Variable. [Figure: the continuous density f(x) (left) and its quantized version with bin width Δ (right).] $X^\Delta$ takes the values $\{x_i\}$ with probabilities $\{f(x_i)\Delta\}$: it is a scaled, quantized version of f(x), with unevenly spaced $x_i$.

Entropy of Quantized Random Variable. $X^\Delta$ takes the values $\{x_i\}$ with probabilities $\{f(x_i)\Delta\}$, so $H(X^\Delta) = -\sum_i f(x_i)\Delta\log\big(f(x_i)\Delta\big) = -\log\Delta - \sum_i f(x_i)\log\big(f(x_i)\big)\,\Delta$. Now, as $\Delta \to 0$, we have $H(X^\Delta) = -\log\Delta - \int_{-\infty}^{+\infty} f(x)\log f(x)\,dx$. The quantity $h(X) = -\int_{-\infty}^{+\infty} f(x)\log f(x)\,dx$ is defined as the differential entropy of a continuous RV X.
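The limiting statement above can be seen numerically. The sketch below (ours, not from the slides) quantizes a standard Gaussian with bin width Δ and checks that $H(X^\Delta) + \log_2\Delta$ approaches $h(X) = \frac{1}{2}\log_2 2\pi e \approx 2.05$ bits, the Gaussian value quoted later on the summary slide:

```python
# Quantize N(0,1) into bins of width Delta and compare H(X^Delta) + log2(Delta)
# with the differential entropy h(X).
import numpy as np
from scipy.stats import norm

h_exact = 0.5 * np.log2(2 * np.pi * np.e)           # h(N(0,1)) in bits, about 2.047

for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-10.0, 10.0 + delta, delta)   # covers essentially all the probability mass
    probs = np.diff(norm.cdf(edges))                # p_i = integral of f over bin i
    probs = probs[probs > 0]
    H_quant = -np.sum(probs * np.log2(probs))       # entropy of the quantized RV, in bits
    print(delta, H_quant + np.log2(delta), h_exact) # H(X^Delta) + log2(Delta) -> h(X)
```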

Differential Entropy: Definition. Definition: The differential entropy h(X) of a continuous random variable X with density f(x) is defined as $h(X) = -\int_S f(x)\log f(x)\,dx = -E\{\log f(X)\}$, where S is the support set of the random variable.

Differential Entropy. With $\Delta \to 0$, $H(X^\Delta) = -\log\Delta + h(X)$, where $h(X) = -\int_{-\infty}^{+\infty} f(x)\log f(x)\,dx = -E\{\log f(X)\}$. Unlike discrete entropy, h(X) does not give the absolute amount of information in X and is not necessarily positive. However, one can still compare the uncertainty of two continuous RVs (quantized to the same precision), and relative entropy and mutual information still work well.

Example: Uniform Distribution. For the uniform distribution $X \sim U(a,b)$: $f(x) = \frac{1}{b-a}$ for $x \in (a,b)$ and 0 elsewhere, so $h(X) = -\int_a^b \frac{1}{b-a}\log\frac{1}{b-a}\,dx = \log(b-a)$. We observe that $h(X) < 0$ when $(b-a) < 1$.
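A quick numerical confirmation (ours) of $h(X) = \log_2(b-a)$ bits, including the negative value when $b - a < 1$:

```python
# Differential entropy of U(a, b) by numerical integration of -f log2 f.
import numpy as np
from scipy.integrate import quad

def h_uniform(a, b):
    f = 1.0 / (b - a)
    val, _ = quad(lambda x: -f * np.log2(f), a, b)
    return val

print(h_uniform(0, 2), np.log2(2))      # 1.0 bit
print(h_uniform(0, 0.5), np.log2(0.5))  # -1.0 bit: differential entropy can be negative
```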

Example: Gaussian Distribution. For the Gaussian distribution $X \sim \mathcal{N}(\mu,\sigma^2)$: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.
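The slide sets up the computation of h(X) for the Gaussian; a sketch of the calculation (ours), in nats, leading to the value quoted later on the summary slide:

```latex
% Differential entropy of N(mu, sigma^2), working in nats.
\begin{align*}
h(X) &= -\int f(x)\ln f(x)\,dx
      = \int f(x)\left[\tfrac{1}{2}\ln(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\right]dx\\
     &= \tfrac{1}{2}\ln(2\pi\sigma^2) + \frac{E(X-\mu)^2}{2\sigma^2}
      = \tfrac{1}{2}\ln(2\pi\sigma^2) + \tfrac{1}{2}
      = \tfrac{1}{2}\ln\bigl(2\pi e\sigma^2\bigr)\ \text{nats}
      = \tfrac{1}{2}\log_2\bigl(2\pi e\sigma^2\bigr)\ \text{bits}.
\end{align*}
```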

Joint and Conditional Entropy. Definition (Joint Entropy): The differential entropy of a set $X_1,\ldots,X_n$ of random variables with joint pdf $f(x_1,\ldots,x_n)$ is defined as $h(X_1,\ldots,X_n) = -\int f(x^n)\log f(x^n)\,dx^n$. Definition (Conditional Entropy): If X and Y have a joint density function f(x,y), we define the conditional entropy h(X|Y) as $h(X|Y) = -\int f(x,y)\log f(x|y)\,dx\,dy$.

Multivariate Gaussian. Theorem (Entropy of a multivariate Gaussian distribution): Let $X_1,\ldots,X_n$ have a multivariate Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance matrix Q, denoted $\mathcal{N}_n(\boldsymbol{\mu}, Q)$. Then $h(X_1,\ldots,X_n) = \frac{1}{2}\log\big((2\pi e)^n |Q|\big)$, where $|Q|$ is the determinant of Q. Proof.
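A numerical sanity check (ours, not from the slides) of the theorem, comparing the closed form with a Monte Carlo estimate of $-E\{\log_2 f(\mathbf{X})\}$:

```python
# h(X) for a 2-D Gaussian: 0.5*log2((2*pi*e)^n |Q|) vs. -mean(log2 f(X)) over samples.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
mu = np.zeros(2)
Q = np.array([[2.0, 0.6],
              [0.6, 1.0]])
n = 2

closed_form = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(Q))

samples = rng.multivariate_normal(mu, Q, size=200_000)
mc = -np.mean(multivariate_normal(mu, Q).logpdf(samples)) / np.log(2)  # nats -> bits

print(closed_form, mc)   # both about 4.45 bits
```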

Relative Entropy and Mutual Information. Definition (Relative Entropy): The relative entropy (or Kullback-Leibler distance) $D(f\|g)$ between two densities f and g is defined by $D(f\|g) = \int f\log\frac{f}{g}$. Note that $D(f\|g)$ is finite only if the support set of f is contained in the support set of g. Definition (Mutual Information): The mutual information I(X;Y) between two random variables with joint density f(x,y) is defined as $I(X;Y) = \int f(x,y)\log\frac{f(x,y)}{f(x)f(y)}\,dx\,dy$.
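Both definitions are straightforward to evaluate numerically. A small sketch (ours, not from the slides) for two one-dimensional Gaussian densities, comparing the well-known closed form for $D(f\|g)$ with a Monte Carlo estimate of $E_f\{\log\frac{f}{g}\}$:

```python
# KL divergence between f = N(0,1) and g = N(1,4): closed form vs. Monte Carlo.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m1, s1 = 0.0, 1.0      # f = N(0, 1)
m2, s2 = 1.0, 2.0      # g = N(1, 4)

# Standard Gaussian-to-Gaussian KL identity (in nats), not derived on these slides.
closed_form = np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

x = rng.normal(m1, s1, size=500_000)
mc = np.mean(norm.logpdf(x, m1, s1) - norm.logpdf(x, m2, s2))

print(closed_form, mc)   # both about 0.443 nats
```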

Information Inequality. Theorem: $D(f\|g) \ge 0$ with equality iff f = g almost everywhere (a.e.). Proof.

Properties of Mutual Information. $I(X;Y) = \int f(x,y)\log\frac{f(x,y)}{f(x)f(y)}\,dx\,dy$. From the definition, it is clear that $I(X;Y) = h(X) - h(X|Y) = h(Y) - h(Y|X) = h(X) + h(Y) - h(X,Y)$. Also, $I(X;Y) = D\big(f(x,y)\,\|\,f(x)f(y)\big)$. The properties of $D(f\|g)$ and I(X;Y) are the same as in the discrete case. Why?

Mutual Information with Finite Partitions. Definition (Partition): Let $\mathcal{X}$ be the range of X. A partition $\mathcal{P}$ of $\mathcal{X}$ is a finite collection of disjoint sets $P_i$ such that $\cup_i P_i = \mathcal{X}$. Definition (Quantization): The quantization of X by $\mathcal{P}$, denoted $[X]_{\mathcal{P}}$, is the discrete RV defined by $\Pr([X]_{\mathcal{P}} = i) = \Pr(X \in P_i) = \int_{P_i} dF(x)$. We can now give a general definition of the mutual information between two arbitrary RVs X and Y using the mutual information between their quantized versions $[X]_{\mathcal{P}}$ and $[Y]_{\mathcal{Q}}$ under partitions $\mathcal{P}$ and $\mathcal{Q}$.

Mutual Information with Finite Partitions. Definition (Mutual Information): The mutual information between two RVs X and Y is given by $I(X;Y) = \sup_{\mathcal{P},\mathcal{Q}} I\big([X]_{\mathcal{P}}; [Y]_{\mathcal{Q}}\big)$, where the supremum is over all finite partitions $\mathcal{P}$ and $\mathcal{Q}$. In fact, this definition applies both to RVs with a pdf and to RVs with a pmf: it is more general.

Example: Mutual Information between Correlated Gaussian RVs. Let $(X,Y) \sim \mathcal{N}(\mathbf{0}, K)$ where $K = \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}$, i.e., the correlation coefficient is ρ. What is I(X;Y)?
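The slide leaves I(X;Y) as a question. Using the entropy expressions above (this working is ours, not shown on the slide): $I(X;Y) = h(X) + h(Y) - h(X,Y) = \log(2\pi e\sigma^2) - \frac{1}{2}\log\big((2\pi e)^2\sigma^4(1-\rho^2)\big) = -\frac{1}{2}\log(1-\rho^2)$. A Monte Carlo check in Python:

```python
# I(X;Y) = -0.5*log2(1 - rho^2) bits for jointly Gaussian X, Y with correlation rho,
# estimated via I = E[log2 f(x,y) / (f(x) f(y))].
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(2)
sigma, rho = 1.0, 0.8
K = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

xy = rng.multivariate_normal([0, 0], K, size=400_000)
log_joint = multivariate_normal([0, 0], K).logpdf(xy)
log_marg = norm.logpdf(xy[:, 0], 0, sigma) + norm.logpdf(xy[:, 1], 0, sigma)

mc = np.mean(log_joint - log_marg) / np.log(2)          # nats -> bits
closed_form = -0.5 * np.log2(1 - rho**2)

print(closed_form, mc)   # both about 0.737 bits
```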

More Properties of Differential Entropy. Corollary: $h(X|Y) \le h(X)$ with equality iff X and Y are independent. Theorem (Chain rule for differential entropy): $h(X_1,\ldots,X_n) = \sum_{i=1}^n h(X_i \mid X_1,\ldots,X_{i-1})$. Corollary: $h(X_1,\ldots,X_n) \le \sum_i h(X_i)$ with equality iff $X_1,\ldots,X_n$ are independent.

Changing Variable. Now let Y = g(X). We know that $f_Y(y) = \frac{dF_Y(y)}{dy} = f_X\big(g^{-1}(y)\big)\left|\frac{dg^{-1}(y)}{dy}\right| = f_X\big(g^{-1}(y)\big)\left|\frac{dx}{dy}\right|$.

Changing Variable - Examples. Theorem: Translation does not change differential entropy: $h(X + c) = h(X)$. Theorem: $h(aX) = h(X) + \log|a|$. Corollary: For a vector-valued RV, $h(AX) = h(X) + \log|\det(A)|$.
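A quick Monte Carlo check (ours) of the scaling property, estimating each differential entropy as $-\frac{1}{N}\sum\log_2 f$ over samples:

```python
# h(aX) - h(X) should equal log2|a|. Here X ~ N(0,1) and a = 3, so aX ~ N(0, a^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
a = 3.0
x = rng.normal(0, 1, size=500_000)

h_x = -np.mean(norm.logpdf(x, 0, 1)) / np.log(2)        # h(X) in bits
h_ax = -np.mean(norm.logpdf(a * x, 0, a)) / np.log(2)   # density of aX has scale a

print(h_ax - h_x, np.log2(a))   # both about 1.585 bits
```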

Concavity and Convexity. The same properties as for discrete RVs hold. Differential entropy: h(X) is a concave function of $f_X(x)$. Mutual information: I(X;Y) is a concave function of $f_X(x)$ for fixed $f_{Y|X}(y|x)$, and a convex function of $f_{Y|X}(y|x)$ for fixed $f_X(x)$.

Maximum Entropy Distribution. Going back to discrete RVs: for a discrete random variable taking on K values, which distribution maximizes the entropy? For continuous RVs, we are interested in maximizing the entropy h(f) over all f that satisfy: 1. $f(x) \ge 0$, with equality outside the support set S; 2. $\int_S f(x)\,dx = 1$; 3. moment constraints $\int_S f(x)r_i(x)\,dx = \alpha_i$ for $1 \le i \le m$.

Maximum Entropy Distribution. We maximize the entropy h(f) over all f that satisfy: 1. $f(x) \ge 0$, with equality outside the support set S; 2. $\int_S f(x)\,dx = 1$; 3. moment constraints $\int_S f(x)r_i(x)\,dx = \alpha_i$ for $1 \le i \le m$. Theorem (Maximum entropy distribution): Let $f_\lambda(x) = \exp\big(\lambda_0 + \sum_{i=1}^m \lambda_i r_i(x)\big)$, $x \in S$, where $\lambda_0,\ldots,\lambda_m$ are chosen so that $f_\lambda$ satisfies the above constraints. Then $f_\lambda$ uniquely maximizes h(f) over all probability densities f satisfying the constraints.
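A sketch (ours, not on the slide) of where the exponential form comes from, via a Lagrange-multiplier argument:

```latex
% Maximize h(f) subject to normalization and moment constraints.
\begin{align*}
J(f) &= -\int_S f\ln f
        + \lambda_0\!\left(\int_S f - 1\right)
        + \sum_{i=1}^m \lambda_i\!\left(\int_S f\,r_i - \alpha_i\right),\\
\frac{\partial J}{\partial f(x)}
     &= -\ln f(x) - 1 + \lambda_0 + \sum_{i=1}^m \lambda_i r_i(x) = 0
     \;\Longrightarrow\;
     f(x) = \exp\!\Big(\lambda_0 - 1 + \sum_{i=1}^m \lambda_i r_i(x)\Big),
\end{align*}
% which matches the theorem's form after absorbing the constant -1 into \lambda_0.
```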

Example 1. First, let S = [a, b]. What is f?

Example 2. Now consider the constraints $E(X) = 0$ and $E(X^2) = \sigma^2$. What is f?

Example 3. What zero-mean distribution maximizes the entropy on $(-\infty,\infty)^n$ for a given covariance matrix K? Answer: the multivariate Gaussian $\phi(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^n |K|}}\exp\left(-\frac{1}{2}\mathbf{x}^\top K^{-1}\mathbf{x}\right)$.

Estimation Error and Differential Entropy. Recall the setting for discrete RVs. The problem: we observe a RV Y and wish to guess the value of a correlated RV X. Fano's inequality relates the probability of error in guessing X to the conditional entropy H(X|Y). As we shall see later, this problem is crucial in proving the converse to Shannon's channel capacity theorem. In one of the assignments, we will see that H(X|Y) = 0 if and only if X is a function of Y; hence when H(X|Y) = 0 we can estimate X from Y with zero probability of error. Fano's inequality quantifies the idea that X can be estimated with a small probability of error only if H(X|Y) is small.

Fano's Inequality. Assume we wish to estimate X and observe Y, related to X by p(y|x). From Y we compute an estimate $\hat{X} = g(Y)$. Thus $X \to Y \to \hat{X}$ forms a Markov chain, and we wish to bound the probability of error $P_e = \Pr(X \ne \hat{X})$. Define the error RV E, equal to 1 if $\hat{X} \ne X$ and 0 otherwise, and write $H(P_e)$ for H(E). Theorem (Fano's Inequality): For any estimator $\hat{X}$ such that $X \to Y \to \hat{X}$, with $P_e = \Pr(X \ne \hat{X})$, we have $H(P_e) + P_e\log|\mathcal{X}| \ge H(X|\hat{X}) \ge H(X|Y)$. This implies $P_e \ge \frac{H(X|Y) - 1}{\log|\mathcal{X}|}$: $P_e$ cannot be too small if H(X|Y) is large, i.e., correct estimation is possible only when the residual randomness of X after observing Y is small.
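A toy numerical check (ours, not from the slides) of the inequality: let X be uniform on {0,1}, let Y be X passed through a binary symmetric channel with crossover probability p, and take $\hat{X} = Y$, so $P_e = p$ and $H(X|Y) = H(p)$:

```python
# Fano's bound H(P_e) + P_e*log2|X| >= H(X|Y) for a BSC with uniform input.
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.05, 0.2, 0.4]:
    H_X_given_Y = h2(p)                 # BSC with uniform input: H(X|Y) = H(p)
    fano_lhs = h2(p) + p * np.log2(2)   # H(P_e) + P_e*log2|X|, with |X| = 2
    print(p, fano_lhs >= H_X_given_Y, fano_lhs, H_X_given_Y)
```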

Estimation Error and Differential Entropy. We now have the estimation counterpart to Fano's inequality. Theorem (Estimation error and differential entropy): For any RV X and estimator $\hat{X}$, $E(X - \hat{X})^2 \ge \frac{1}{2\pi e}\exp\big(2h(X)\big)$, with equality iff X is Gaussian and $\hat{X}$ is the mean of X; here h(X) is in nats. $E(X - \hat{X})^2$ is the expected (squared) prediction error.
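A numerical illustration (ours) of the bound: it is tight for a Gaussian with the mean as estimator, and strict for a uniform RV with the same choice of estimator:

```python
# Check E(X - X_hat)^2 >= exp(2h(X))/(2*pi*e), with h in nats.
import numpy as np

sigma = 2.0
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma**2)       # h(N(0, sigma^2)) in nats
bound_gauss = np.exp(2 * h_gauss) / (2 * np.pi * np.e)
print(bound_gauss, sigma**2)                               # equal: both 4.0

a, b = 0.0, 1.0                                            # X ~ U(0, 1), X_hat = 0.5
h_unif = np.log(b - a)                                     # 0 nats
bound_unif = np.exp(2 * h_unif) / (2 * np.pi * np.e)       # about 0.0585
mse_unif = (b - a) ** 2 / 12                               # about 0.0833
print(bound_unif, mse_unif, bound_unif <= mse_unif)
```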

AEP for Continuous RVs. Theorem (AEP - Discrete): If $X_1,\ldots,X_n$ are i.i.d. with PMF p(x), then $-\frac{1}{n}\log p(X_1,\ldots,X_n) \to H(X)$ in probability. Theorem (AEP - Continuous): If $X_1,\ldots,X_n$ are i.i.d. with density f(x), then $-\frac{1}{n}\log f(X_1,\ldots,X_n) \to h(X)$ in probability.
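A simulation sketch (ours) of the continuous AEP for i.i.d. $\mathcal{N}(0,1)$ samples: the normalized log-density concentrates around $h(X) = \frac{1}{2}\log_2 2\pi e \approx 2.05$ bits as n grows:

```python
# -(1/n) log2 f(X_1, ..., X_n) for i.i.d. standard Gaussians, compared with h(X).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
h_exact = 0.5 * np.log2(2 * np.pi * np.e)             # about 2.047 bits

for n in [10, 100, 10_000]:
    x = rng.normal(0, 1, size=n)
    aep = -np.sum(norm.logpdf(x)) / (n * np.log(2))   # -(1/n) log2 of the joint density
    print(n, aep, h_exact)
```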

Typical Set. Definition (Typical Set - Discrete): The typical set $A_\epsilon^{(n)}$ with respect to p(x) is the set of sequences $(x_1,\ldots,x_n) \in \mathcal{X}^n$ with the property $2^{-n(H(X)+\epsilon)} \le p(x_1,\ldots,x_n) \le 2^{-n(H(X)-\epsilon)}$. Definition (Typical Set - Continuous): For $\epsilon > 0$ and any n, the typical set $A_\epsilon^{(n)}$ with respect to f(x) is defined as $A_\epsilon^{(n)} = \big\{(x_1,\ldots,x_n) \in S^n : \big|-\tfrac{1}{n}\log f(x_1,\ldots,x_n) - h(X)\big| \le \epsilon\big\}$, where $f(x_1,\ldots,x_n) = \prod_{i=1}^n f(x_i)$.

Typical Set and Volume. Definition: The volume Vol(A) of a set $A \subseteq \mathbb{R}^n$ is defined as $\mathrm{Vol}(A) = \int_A dx_1\cdots dx_n$. Theorem (Typical Set Properties): 1. $\Pr\big(A_\epsilon^{(n)}\big) > 1 - \epsilon$ for n sufficiently large. 2. $\mathrm{Vol}\big(A_\epsilon^{(n)}\big) \le 2^{n(h(X)+\epsilon)}$ for all n. 3. $\mathrm{Vol}\big(A_\epsilon^{(n)}\big) \ge (1-\epsilon)2^{n(h(X)-\epsilon)}$ for n sufficiently large. Theorem: The set $A_\epsilon^{(n)}$ is the smallest-volume set with probability greater than or equal to $1-\epsilon$, to first order in the exponent.

Differential Entropy: Summary. $h(X) = h(f) = -\int_S f(x)\log f(x)\,dx$. $f(X^n) \doteq 2^{-nh(X)}$. $\mathrm{Vol}\big(A_\epsilon^{(n)}\big) \doteq 2^{nh(X)}$. $H\big([X]_{2^{-n}}\big) \approx h(X) + n$. $h\big(\mathcal{N}(0,\sigma^2)\big) = \frac{1}{2}\log 2\pi e\sigma^2$. $h\big(\mathcal{N}_n(\boldsymbol{\mu},K)\big) = \frac{1}{2}\log\big((2\pi e)^n |K|\big)$. $D(f\|g) = \int f\log\frac{f}{g} \ge 0$.

Differential Entropy: Summary. $h(X_1,X_2,\ldots,X_n) = \sum_{i=1}^n h(X_i \mid X_1,X_2,\ldots,X_{i-1})$. (8.88) $h(X|Y) \le h(X)$. (8.89) $h(aX) = h(X) + \log|a|$. (8.90) $I(X;Y) = \int f(x,y)\log\frac{f(x,y)}{f(x)f(y)} \ge 0$. (8.91) $\max_{E XX^t = K} h(X) = \frac{1}{2}\log\big((2\pi e)^n |K|\big)$. (8.92) $E\big(X - \hat{X}(Y)\big)^2 \ge \frac{1}{2\pi e}e^{2h(X|Y)}$. $2^{nH(X)}$ is the effective alphabet size for a discrete random variable. $2^{nh(X)}$ is the effective support-set size for a continuous random variable. $2^C$ is the effective alphabet size of a channel of capacity C.

Thank you!