Lecture 11: Continuous-valued signals and differential entropy

Biology 429
Carl Bergstrom
September 20, 2008

Sources: Parts of today's lecture follow Chapter 8 from Cover and Thomas (2007). Some components of what follows, particularly the statements of theorems and definitions, come directly from that text.

Thus far in the course, we have dealt exclusively with discrete-valued signals, i.e., with communication using a finite alphabet. Many signals, and in particular many biological signals, may take on continuous values. For example, the volume of a cry, the interspike interval in a spike train, and the length of a peacock's tail are all continuous-valued signals. How do we compute the entropy of a continuous-valued signal?

When doing so, we immediately face a problem. Using a noiseless channel, one can send an infinite amount of information using a single continuous-valued signal. Suppose I have an n-bit message that I want to send. I can send it using n symbols across a discrete-valued binary channel, or I can send it with a single signal across a continuous-valued channel, as follows. Instead of sending, e.g., the message

  1011000010111011010000101111011000000111111

using n bits, I can send a single real-valued number, equal to the binary fraction

  0.1011000010111011010000101111011000000111111

through my continuous-valued channel. In this way, as I let n get large, I can send arbitrarily large amounts of information through a noiseless channel with a single message.

Given that any continuous-valued signal thus can be said to carry infinite information, how can we talk about the entropy of continuous signals? First, we note that in practice the encoding scheme above works only over a perfectly noiseless continuous-valued channel. If noise prevents us from distinguishing values that lie within k of one another, we will in practice be limited to sending about log(1/k) bits per symbol through our continuous channel. Thus by considering noise, we resolve the paradox of why it seems that we could send infinite information in finite time through a continuous-valued channel. We'll come back to noise later. For now, though, we want to develop the machinery to quantify the entropy of continuous-valued signals, even in the absence of noise.
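
To make the paradox and its resolution concrete, here is a small Python sketch (mine, not part of the original notes) that packs a bit string into a single real number in [0, 1] and then shows how many bits survive when the channel can only resolve values to within k. The helper names encode and decode are illustrative choices, not anything defined in the lecture.

    # Sketch: a bit string as a single real number, and what resolution k leaves us.
    import math

    def encode(bits):
        """Interpret a bit string as the binary fraction 0.b1 b2 b3 ..."""
        return sum(int(b) / 2 ** (i + 1) for i, b in enumerate(bits))

    def decode(x, n):
        """Recover the first n bits of the binary expansion of x."""
        out = []
        for _ in range(n):
            x *= 2
            bit = int(x)
            out.append(str(bit))
            x -= bit
        return "".join(out)

    msg = "1011000010111011010000101111011000000111111"
    x = encode(msg)
    assert decode(x, len(msg)) == msg   # noiseless channel: all n bits get through

    # If the receiver can only resolve values to within k = 2**-8, it effectively
    # sees x truncated to a grid of spacing k, so only about log2(1/k) = 8 bits survive.
    k = 2 ** -8
    x_received = math.floor(x / k) * k
    print(decode(x_received, 8) == msg[:8])      # True: the first 8 bits get through
    print(decode(x_received, len(msg)) == msg)   # False: everything past bit 8 is lost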

Rates of entropy increase

To quantify that entropy, we will think of continuous-valued signal spaces as being the limit of a sequence of increasingly finely subdivided discrete-valued alphabets. Consider a discrete-valued alphabet with n symbols; this can carry at most log n bits of information (in the case that all values are equally likely). Thus the entropy of a continuous distribution on [0, 1], divided into n = 1/k bins of length k, is at most log n = log(1/k) = -log k.

[Sketched figure: a uniform distribution on [0, 1], with its binned entropy at resolution k]

  k       -log k    H(X)
  1/2       1        1
  1/4       2        2
  1/8       3        3
  1/16      4        4
  1/32      5        5

  dH/d(-log k) = 1

Now halve the size of the bins to k/2 each. The entropy can now be at most -log(k/2) = 1 - log k bits. Halve again to k/4, and the entropy can be at most -log(k/4) = 2 - log k. Halve again: the entropy is now at most 3 - log k. And so forth: each time we double the number of divisions, the entropy of which bin we are in can increase by at most 1 bit. In other words, the rate at which entropy increases with increasingly fine subdivision of our interval is at most 1 bit per doubling of the number of subdivisions. But this rate is a derivative:

  d[max entropy] / d[-log(subdivision width)] = 1

This is the maximum rate at which entropy can increase with increasingly small subdivisions. What probability distribution achieves this rate? Well, to achieve this rate, each bin has to be equally likely to occur, and that has to be true at every level of subdivision. So the uniform probability distribution achieves this rate.
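
The table above can be checked directly. The short Python sketch below (mine, not from the notes) assigns the uniform's bin probabilities at widths k = 1/2, 1/4, ..., computes the entropy of the bin index, and confirms that each halving of k adds exactly one bit, i.e., dH/d(-log k) = 1.

    # Entropy of the bin index for the uniform density on [0, 1], at bin width k.
    import numpy as np

    # Each of the 1/k bins holds probability k, so H = -sum(k log2 k) = log2(1/k):
    # one extra bit for every halving of the bin width.
    for k in [1/2, 1/4, 1/8, 1/16, 1/32]:
        p = np.full(int(round(1 / k)), k)        # equal bin probabilities
        H = -np.sum(p * np.log2(p))
        print(f"k = {k:<8} -log k = {-np.log2(k):.0f}   H = {H:.0f} bits")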

What happens if we have some other distribution? I've sketched an example below.

[Sketched figure: a bell-shaped distribution on [0, 1], with its binned entropy at resolution k]

  k       -log k    H(X)
  1/2       1        1
  1/4       2        1.4
  1/8       3        1.7
  1/16      4        1.95
  1/32      5        2.17

  lim dH/d(-log k) = 0.20

Now (in this particular example) the first division increases the entropy by 1 bit, but subsequent divisions increase the entropy by less than a bit, and in this made-up example the rate at which entropy increases with increasingly small subdivisions is asymptotically 0.2.

Now compare the two distributions. In the sense that we discussed at the start of the lecture, a single value from either distribution gives us an infinite-precision real number and thus carries infinite information, with infinite entropy. But in another sense, the former distribution (the uniform) seems to leave us more uncertain than the latter distribution (the bell-shaped). We want to capture this with some kind of measure, and our idea of looking at the derivative dH/d(-log k) has the right sort of property. Instead of measuring the total entropy, now that we are working with continuous variables we measure the rate at which the total entropy increases as we allow increasingly fine subdivisions. This is the conceptual idea underlying differential entropy and, in general, the measurement of the entropy of continuous variables.

We can now explore the definition of continuous entropy and derive a number of expressions that parallel our expressions for discrete entropy and related quantities. But first, recall two basic definitions for continuous probabilities.

Recall 1 The cumulative distribution function (CDF) F(x) of a probability distribution F is the probability that a random draw X from F is at most x:

  F(x) = Prob(X ≤ x)

Note that the range of F(x) is [0, 1].

Recall 2 The probability density function (PDF) f(x) of distribution F is the derivative of the CDF:

  f(x) = F'(x)

We will work with well-behaved continuous distributions, such that F(x) is continuous and continuously differentiable, and ∫ f(x) dx = 1.

Now we define the differential entropy h(X). (Note that we use a lowercase h to indicate that this is the differential entropy rather than the discrete entropy H.)

Definition 1 The differential entropy h(X) of a random variable X with probability density function f(x) is given by

  h(X) = -∫_S f(x) log f(x) dx

where S is the set of possible values of X, i.e., the support set of X.

A few examples:

- Uniform on [0, 1]
- Uniform on [0, a]
- Normal with mean 0, variance σ²
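
These three examples have standard closed forms, worked out in Cover and Thomas rather than here: h = 0 for the uniform on [0, 1], h = log a for the uniform on [0, a], and h = (1/2) log(2πeσ²) for the normal. The Python sketch below is my own illustration (the helper diff_entropy is a hypothetical name); it checks the closed forms against a direct Riemann-sum evaluation of -∫ f log2 f dx, so all entropies are in bits.

    # Differential entropy in bits: numerical -∫ f log2 f dx vs. the standard closed forms.
    import numpy as np

    def diff_entropy(f, lo, hi, n=400_001):
        """Riemann-sum estimate of -∫ f(x) log2 f(x) dx over [lo, hi], assuming f > 0 there."""
        x = np.linspace(lo, hi, n)
        fx = f(x)
        return float(np.sum(-fx * np.log2(fx)) * (x[1] - x[0]))

    a, sigma = 4.0, 2.0
    gauss = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

    print(diff_entropy(lambda x: np.ones_like(x), 0, 1))             # ≈ 0
    print(diff_entropy(lambda x: np.full_like(x, 1 / a), 0, a),      # ≈ log2 a
          np.log2(a))
    print(diff_entropy(gauss, -40, 40),                              # ≈ 0.5 log2(2πeσ²)
          0.5 * np.log2(2 * np.pi * np.e * sigma**2))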

We also notice that differential entropy is invariant to translations of the random variable X:

Theorem 1 (Translation invariance of differential entropy)

  h(X + c) = h(X)

Proof: the definition of differential entropy involves only the density values f(x) and not the values of x itself.

Multiplying a random variable by a constant a, on the other hand, changes its differential entropy by an additive constant log |a|.

Theorem 2 (Multiplication alters differential entropy by an additive constant)

  h(aX) = h(X) + log |a|

The proof is by change of variables, defining a new variable Y = aX and applying the definition of differential entropy.
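
Both properties are easy to check numerically. The sketch below (my illustration, reusing the Riemann-sum idea from the previous snippet) builds the density of aX + c from a standard normal density via the change of variables g(y) = f((y - c)/a) / |a| and confirms that the shift leaves h unchanged while the scaling adds log2 |a| bits.

    # Check h(aX + c) = h(X) + log2|a| for X ~ N(0, 1), with a = 3 and c = 5.
    import numpy as np

    def h_bits(f, lo, hi, n=400_001):
        x = np.linspace(lo, hi, n)
        fx = f(x)
        return float(np.sum(-fx * np.log2(fx)) * (x[1] - x[0]))

    f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # density of X
    a, c = 3.0, 5.0
    g = lambda y: f((y - c) / a) / abs(a)                    # density of aX + c

    hX = h_bits(f, -30, 30)
    hY = h_bits(g, -30 * a + c, 30 * a + c)
    print(hX, hY, hY - hX, np.log2(abs(a)))   # hY - hX ≈ log2 3 ≈ 1.585 bits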

Relation to discrete entropy

Now we can formally see the connection between this definition and the idea of continuous symbols as the limit of n-division discrete spaces. We take the conventional Riemann view of the integral as the limit of a sequence of discrete sums over smaller and smaller subdivisions. In other words, we break the distribution up into bins of width Δ, and we consider the new discrete random variable Y to be the bin into which the value of X falls:

  Y = x_i  if  iΔ ≤ X < (i + 1)Δ

The probability distribution of Y is then

  p_i = ∫_{iΔ}^{(i+1)Δ} f(x) dx = f(x_i) Δ

where the latter equality holds by the mean value theorem. The discrete entropy of Y is then

  H(Y) = -Σ_i p_i log p_i
       = -Σ_i f(x_i) Δ log( f(x_i) Δ )
       = -Σ_i f(x_i) Δ log f(x_i) - Σ_i f(x_i) Δ log Δ
       = -Σ_i Δ f(x_i) log f(x_i) - log Δ

Since -Σ_i Δ f(x_i) log f(x_i) → -∫ f(x) log f(x) dx by the definition of the Riemann integral, we now have the following theorem:

Theorem 3 As Δ → 0,

  H(Y) + log Δ → h(f).

In particular, the entropy of an n-bit quantization of a continuous random variable X is approximately h(X) + n (take Δ = 2^-n, so that -log Δ = n bits).

This links up our earlier example of an n-bit approximation of a continuous signal with the concept of differential entropy.
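
Theorem 3 is easy to see numerically. The sketch below (mine, not from the notes) quantizes a standard normal into bins of width Δ = 2^-n, computes the discrete entropy H(Y) of the bin index, and shows that H(Y) + log2 Δ, i.e., H(Y) - n, settles down to h(X) = (1/2) log2(2πe) ≈ 2.05 bits.

    # Quantize X ~ N(0, 1) into bins of width 2**-n and watch H(Y) - n approach h(X).
    import math
    import numpy as np

    def Phi(x):
        """Standard normal CDF."""
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    h_closed = 0.5 * math.log2(2 * math.pi * math.e)    # ≈ 2.047 bits

    for n in range(1, 9):
        delta = 2.0 ** -n
        edges = np.arange(-10, 10 + delta, delta)       # covers essentially all the mass
        p = np.diff([Phi(e) for e in edges])            # bin probabilities
        p = p[p > 0]
        H = -np.sum(p * np.log2(p))
        print(n, round(H, 3), round(H - n, 3), round(h_closed, 3))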

Joint differential entropy

Now we are ready to develop a set of additional measures that we can apply to continuous random variables, much as we did in the second lecture of the course for the discrete entropy. If we just think of replacing sums with integrals, none of these is particularly surprising in form. We begin with the joint differential entropy. Recall that in lecture 3 of the course, we defined the joint entropy of a set of discrete random variables as follows.

Recall 3 The joint entropy H(X, Y) of a pair of discrete random variables with a joint distribution p(x, y) is defined as

  H(X, Y) = -Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y).

Our definition of joint differential entropy parallels this closely:

Definition 2 The joint differential entropy h(X, Y) of a pair of continuous random variables with a joint density f(x, y) is defined as

  h(X, Y) = -∫∫ f(x, y) log f(x, y) dx dy.

This generalizes from pairs to collections of n random variables in the obvious way.

Conditional differential entropy

Now we move on to conditional differential entropy. Our definition again parallels the discrete definition:

Recall 4 The discrete conditional entropy H(Y|X) is defined as

  H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x).

Definition 3 The conditional differential entropy h(Y|X) is defined as

  h(Y|X) = ∫∫ f(x, y) h(Y|X = x) dx dy.

We can alternatively write this as

  h(Y|X) = -∫∫ f(x, y) log f(y|x) dx dy.

Since f(y|x) = f(x, y)/f(x), we can recover an expression analogous to what we had for discrete entropy:

  h(Y|X) = h(X, Y) - h(X)

Relative entropy for continuous random variables

We can also define a relative entropy between two continuous density functions f and g.

Recall 5 The relative entropy D(p || q) between two discrete probability distributions p and q is given by

  D(p || q) = Σ_{x∈X} p(x) log [ p(x) / q(x) ].

Definition 4 The relative entropy D(f || g) between two continuous probability distributions f and g is given by

  D(f || g) = ∫ f log (f/g).
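
As a concrete illustration (mine, not part of the notes): the relative entropy between two Gaussian densities has a standard closed form, D(N(μ1, σ1²) || N(μ2, σ2²)) = ln(σ2/σ1) + (σ1² + (μ1 - μ2)²)/(2σ2²) - 1/2 in nats, and a Monte Carlo average of log(f/g) under f reproduces it.

    # Monte Carlo estimate of D(f || g) for f = N(0, 1) and g = N(1, 2**2), in nats.
    import numpy as np

    rng = np.random.default_rng(0)
    mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

    x = rng.normal(mu1, s1, size=1_000_000)          # samples from f
    log_f = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1 * np.sqrt(2 * np.pi))
    log_g = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2 * np.sqrt(2 * np.pi))
    D_mc = np.mean(log_f - log_g)

    D_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
    print(D_mc, D_closed)                            # both ≈ 0.443 nats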

Mutual information for continuous random variables

And finally, we can define the mutual information between two continuous variables.

Definition 5 The mutual information I(X; Y) measures how much (on average) the realization of random variable Y tells us about the realization of X, i.e., by how much the entropy of X is reduced if we know the realization of Y. If X and Y are discrete random variables, the mutual information is given by

  I(X; Y) = H(X) - H(X|Y).

When X and Y are continuous random variables, the mutual information is given by

  I(X; Y) = h(X) - h(X|Y).

As before, we can also write this as

  I(X; Y) = ∫∫ f(x, y) log [ f(x, y) / ( f(x) f(y) ) ] dx dy.

From this formulation, we again see that mutual information is symmetric with respect to x and y.
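
For a concrete check (my own example, not from the notes): if X and Y are jointly Gaussian with correlation ρ, the standard closed form is I(X; Y) = -(1/2) log2(1 - ρ²) bits. The sketch below estimates the double-integral form by Monte Carlo, averaging log2[ f(x, y) / (f(x) f(y)) ] over samples of the joint distribution, and compares the result with the closed form.

    # Monte Carlo estimate of I(X; Y) in bits for standard bivariate Gaussians with corr rho.
    import numpy as np

    rng = np.random.default_rng(1)
    rho, N = 0.8, 1_000_000

    x = rng.standard_normal(N)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(N)    # corr(x, y) = rho

    log2_joint = (-np.log2(2 * np.pi * np.sqrt(1 - rho**2))
                  - (x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2) * np.log(2)))
    log2_marginals = -np.log2(2 * np.pi) - (x**2 + y**2) / (2 * np.log(2))

    I_mc = np.mean(log2_joint - log2_marginals)
    I_closed = -0.5 * np.log2(1 - rho**2)
    print(I_mc, I_closed)                                         # both ≈ 0.737 bits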

The AEP for continuous random variables

The AEP for discrete random variables states that, with arbitrarily high probability, the negative log probability of a realized sequence of n iid random variables approaches n times the entropy of the underlying distribution as n gets large. Formally:

Recall 6 (The asymptotic equipartition theorem) Let X_1, X_2, X_3, ... be independent and identically distributed (iid) random variables drawn from some distribution with probability function p(x), and let H(X) be the entropy of this distribution. Then as n → ∞,

  -(1/n) log p(X_1, X_2, ..., X_n) → H(X)   in probability.

The AEP for continuous random variables states the equivalent for the probability density of a sequence of iid continuous random variables.

Theorem 4 (The asymptotic equipartition theorem for continuous random variables) Let X_1, X_2, X_3, ... be independent and identically distributed (iid) random variables drawn from a continuous distribution with probability density f(x), and let h(X) be the differential entropy of this distribution. Then as n → ∞,

  -(1/n) log f(X_1, X_2, ..., X_n) → E[-log f(X)] = h(X)   in probability.

The convergence to the expectation is a direct application of the weak law of large numbers; the equality is a consequence of the definition of h(X).
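
A quick simulation (my illustration) shows the convergence in Theorem 4: draw n iid samples from N(0, 1), compute -(1/n) Σ log2 f(X_i), and watch it approach h(X) = (1/2) log2(2πe) ≈ 2.05 bits as n grows.

    # Continuous AEP: -(1/n) log2 f(X_1, ..., X_n) -> h(X) for X_i ~ N(0, 1).
    import numpy as np

    rng = np.random.default_rng(2)
    h = 0.5 * np.log2(2 * np.pi * np.e)                  # ≈ 2.047 bits

    def log2_f(x):
        """log2 of the N(0, 1) density."""
        return -0.5 * x**2 / np.log(2) - 0.5 * np.log2(2 * np.pi)

    for n in [10, 100, 1_000, 10_000, 100_000]:
        x = rng.standard_normal(n)
        print(n, round(-np.mean(log2_f(x)), 3), round(h, 3))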

Much of the point of developing the discrete AEP was to let us work with typical sets.

Recall 7 The (weakly) typical set A_ε^(n) for a probability distribution p(x) is the set of sequences x_1, x_2, ..., x_n whose per-symbol negative log probability is very close to the entropy of the random variable X:

  | -(1/n) log p(x_1, x_2, ..., x_n) - H(X) | ≤ ε

Again we can talk about typical sets for continuous random variables.

Definition 6 The (weakly) typical set A_ε^(n) for a continuous probability distribution with density f(x) is the set of sequences x_1, x_2, ..., x_n whose per-symbol negative log density is very close to the differential entropy of the random variable X:

  | -(1/n) log f(x_1, x_2, ..., x_n) - h(X) | ≤ ε

As with the discrete AEP, almost all realized sequences will be members of the typical set:

  Pr[ X^n ∈ A_ε^(n) ] > 1 - ε   for n sufficiently large.

That is, the typical set covers almost all of the probability as n gets large.

When we looked at typical sets for discrete random variables, we counted the size of these sets, i.e., the number of members that they contained, and found that this was approximately 2^{nH(X)}. For continuous random variables (as with any continuous region), the appropriate measure of the size of a set is not the number of members (the set has infinitely many members) but the volume of the set. It turns out that the volume of the typical set for a continuous random variable is approximately equal to 2^{nh(X)}, where the volume of a set S in R^n is simply the integral over the set,

  Vol(S) = ∫_S dx_1 dx_2 ... dx_n.

The proof, which we omit here, is provided in Cover and Thomas.

In the case of discrete typical sets, the number of members of the typical set made up only a small fraction of the number of possible sequences (unless the distribution was uniform). The same thing holds for continuous typical sets: as n gets large, the volume of the typical set is only a small fraction of the volume of the set of all possible sequences.
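
To close with a numerical illustration (mine, not in the notes): for X_i ~ N(0, 1) and a fixed tolerance ε, the fraction of simulated length-n sequences that land in the typical set A_ε^(n) rises toward 1 as n grows, in line with Pr[ X^n ∈ A_ε^(n) ] > 1 - ε.

    # Fraction of simulated sequences inside the typical set for X_i ~ N(0, 1), eps = 0.1.
    import numpy as np

    rng = np.random.default_rng(3)
    h = 0.5 * np.log2(2 * np.pi * np.e)
    eps, trials = 0.1, 5_000

    for n in [10, 50, 200, 1_000]:
        x = rng.standard_normal((trials, n))
        # -(1/n) log2 f(x_1, ..., x_n) for each simulated sequence
        per_symbol = 0.5 * np.log2(2 * np.pi) + np.mean(x**2, axis=1) / (2 * np.log(2))
        in_typical = np.abs(per_symbol - h) <= eps
        print(n, in_typical.mean())                  # fraction -> 1 as n grows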