ECE 7680 Lecture 2: Definitions and Basic Facts

Objective: To learn a bunch of definitions about entropy and information measures that will be useful through the quarter, and to present some simple but important theorems: Jensen's inequality and the information inequality.

In this lecture a bunch of definitions and a few simple theorems are presented. While these might be somewhat bewildering at first, you should just hang on and dig in. There is a sort of algebra of informational quantities that must be presented as preliminary material before we can get much further.

The binary entropy function

We saw last time that the entropy of a random variable X is

H(X) = -\sum_x p(x) \log p(x).

Suppose X is a binary random variable,

X = 1 with probability p, and X = 0 with probability 1 - p.

Then the entropy of X is

H(X) = -p \log p - (1 - p) \log(1 - p).

Since this depends only on p, it is also sometimes written H(p). Plot it and observe: it is a concave function of p. (What does this mean?) H(0) = 0 and H(1) = 0. Why? Where is the maximum? More generally, the entropy of a binary discrete random variable with probability p is written as either H(X) or H(p).

Joint entropy

Often we are interested in the entropy of pairs of random variables (X, Y). Another way of thinking of this is as a vector of random variables.

Definition 1. If X and Y are jointly distributed according to p(x, y), then the joint entropy H(X, Y) is

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y),

or, equivalently, H(X, Y) = -E \log p(X, Y).

Definition 2. If (X, Y) ~ p(x, y), then the conditional entropy H(Y | X) is

H(Y | X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y | x) = -E_{p(x,y)} \log p(Y | X).
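
As a quick numerical companion to these definitions, here is a minimal sketch in Python (not part of the original notes). It assumes base-2 logarithms (entropy in bits) and uses a small made-up joint pmf p_xy purely for illustration.

import numpy as np

def H_binary(p):
    """Binary entropy H(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(H_binary(0.5))   # 1.0 bit: the maximum, at p = 1/2
print(H_binary(0.11))  # about 0.5 bits
print(H_binary(0.0))   # 0.0: a deterministic variable has no uncertainty

# Joint and conditional entropy for a hypothetical joint pmf p(x, y)
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])
H_XY = -np.sum(p_xy * np.log2(p_xy))            # Definition 1: H(X, Y)
p_x = p_xy.sum(axis=1)                          # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]               # p(y | x)
H_Y_given_X = -np.sum(p_xy * np.log2(p_y_given_x))  # Definition 2: H(Y | X)
print(H_XY, H_Y_given_X)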

This can also be written in the following equivalent (and also useful) ways:

H(Y | X) = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y | x) \log p(y | x) = \sum_{x \in \mathcal{X}} p(x) H(Y | X = x).

Theorem 1 (chain rule). H(X, Y) = H(X) + H(Y | X).

Interpretation: The uncertainty (entropy) about both X and Y is equal to the uncertainty (entropy) we have about X, plus whatever uncertainty we have about Y given that we know X.

Proof. This proof is very typical of the proofs in this class: it consists of a long string of equalities. (Later proofs will consist of long strings of inequalities. Some people make their livelihood out of inequalities!)

H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)
        = -\sum_x \sum_y p(x, y) \log p(x) p(y | x)
        = -\sum_x \sum_y p(x, y) \log p(x) - \sum_x \sum_y p(x, y) \log p(y | x)
        = -\sum_x p(x) \log p(x) - \sum_x \sum_y p(x, y) \log p(y | x)
        = H(X) + H(Y | X).

This can also be done in the following streamlined manner: write

\log p(X, Y) = \log p(X) + \log p(Y | X)

and take the expectation of both sides.

We can also have a joint entropy with a conditioning on it, as shown in the following corollary.

Corollary 1. H(X, Y | Z) = H(X | Z) + H(Y | X, Z).

The proof is similar to the one above. (This is a good one to work on your own.)

Relative entropy and mutual information

Suppose there is a r.v. with true distribution p. Then (as we will see) we could represent that r.v. with a code that has average length H(p). However, due to incomplete information we do not know p; instead we assume that the distribution of the r.v. is q. Then (as we will see) the code would need more bits to represent the r.v. The difference in the number of bits is denoted D(p || q). The quantity D(p || q) comes up often enough that it has a name: it is known as the relative entropy.
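
The chain rule of Theorem 1 is easy to verify numerically. The sketch below (my own, not from the notes) checks H(X,Y) = H(X) + H(Y|X) in bits on an arbitrary hypothetical joint pmf; any valid pmf with positive entries would do.

import numpy as np

p_xy = np.array([[0.125, 0.0625, 0.25],
                 [0.0625, 0.125, 0.375]])   # hypothetical joint pmf, sums to 1

def H(p):
    """Entropy in bits of a pmf given as an array; 0 log 0 treated as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]
H_joint = H(p_xy)
H_X = H(p_x)
H_Y_given_X = -np.sum(p_xy * np.log2(p_y_given_x))
assert np.isclose(H_joint, H_X + H_Y_given_X)   # chain rule holds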

Definition 3. The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

D(p || q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(X)}{q(X)}.

Note that this is not symmetric, and that q (the second argument) appears only in the denominator.

Another important concept is that of mutual information: how much information does one random variable tell about another one? In fact, this is perhaps the central idea in much of information theory. When we look at the output of a channel, we see the outcomes of a r.v. What we want to know is what went into the channel: we want to know what was sent, and the only thing we have is what came out. The channel coding theorem (which is one of the high points we are trying to reach in this class) is basically a statement about mutual information.

Definition 4. Let X and Y be r.v.s with joint distribution p(x, y) and marginal distributions p(x) and p(y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product distribution:

I(X; Y) = D(p(x, y) || p(x) p(y)) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.

Note that when X and Y are independent, p(x, y) = p(x) p(y) (the definition of independence), so I(X; Y) = 0. This makes sense: if they are independent random variables, then Y can tell us nothing about X.

An important interpretation of mutual information comes from the following.

Theorem 2. I(X; Y) = H(X) - H(X | Y).

Interpretation: The information that Y tells us about X is the reduction in uncertainty about X due to the knowledge of Y.

Proof.

I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
        = \sum_{x,y} p(x, y) \log \frac{p(x | y)}{p(x)}
        = -\sum_{x,y} p(x, y) \log p(x) + \sum_{x,y} p(x, y) \log p(x | y)
        = H(X) - H(X | Y).

Observe that by symmetry I(X; Y) = H(Y) - H(Y | X) = I(Y; X). That is, Y tells as much about X as X tells about Y. Using H(X, Y) = H(X) + H(Y | X) we also get

I(X; Y) = H(X) + H(Y) - H(X, Y).
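
Here is a small sketch (assumed example, not from the notes) computing D(p||q) for two made-up pmfs and computing I(X;Y) exactly as Definition 4 says: as the KL distance between a hypothetical joint pmf and the product of its marginals. Logs are base 2, so everything is in bits.

import numpy as np

def D(p, q):
    """Kullback-Leibler distance in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(D(p, q), D(q, p))    # generally different: D is not symmetric

p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
I_XY = D(p_xy.ravel(), np.outer(p_x, p_y).ravel())   # I(X;Y) = D(p(x,y) || p(x)p(y))
print(I_XY)    # equals 0 exactly when X and Y are independent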

The information that X tells about Y is the uncertainty in X plus the uncertainty about Y minus the uncertainty in both X and Y.

We can summarize a bunch of statements about entropy as follows:

I(X; Y) = H(X) - H(X | Y)
I(X; Y) = H(Y) - H(Y | X)
I(X; Y) = H(X) + H(Y) - H(X, Y)
I(X; Y) = I(Y; X)
I(X; X) = H(X)

More than two variables

It will be important to deal with more than one or two variables, and a variety of chain rules have been developed for this purpose. In each of these, the sequence of r.v.s X_1, X_2, ..., X_n is drawn according to the joint distribution p(x_1, x_2, ..., x_n).

Theorem 3. The joint entropy of X_1, X_2, ..., X_n is

H(X_1, X_2, ..., X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, ..., X_1).

Proof. Observe that

H(X_1, X_2) = H(X_1) + H(X_2 | X_1)
H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 | X_1) = H(X_1) + H(X_2 | X_1) + H(X_3 | X_2, X_1)
...
H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2 | X_1) + H(X_3 | X_2, X_1) + ... + H(X_n | X_{n-1}, X_{n-2}, ..., X_1).

An alternate proof can be obtained by observing that

p(x_1, x_2, ..., x_n) = \prod_{i=1}^n p(x_i | x_{i-1}, ..., x_1)

and taking an expectation.

We sometimes have two variables that we wish to consider, both conditioned upon another variable.

Definition 5. The conditional mutual information of random variables X and Y given Z is defined by

I(X; Y | Z) = H(X | Z) - H(X | Y, Z) = E \log \frac{p(X, Y | Z)}{p(X | Z) p(Y | Z)}.

In other words, it is the same as mutual information, but with everything conditioned upon Z.

The chain rule for entropy leads us to a chain rule for mutual information.

Theorem 4.

I(X_1, X_2, ..., X_n; Y) = \sum_{i=1}^n I(X_i; Y | X_{i-1}, X_{i-2}, ..., X_1).
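
A quick sanity check of Theorem 3 (my own sketch, not from the notes): draw a random joint pmf over three variables, compute each conditional entropy directly from its definition, and confirm that they sum to the joint entropy. The shapes and the random seed are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2))
p /= p.sum()                                  # random joint pmf p(x1, x2, x3)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_1 = p.sum(axis=(1, 2))                      # p(x1)
p_12 = p.sum(axis=2)                          # p(x1, x2)
H_123 = H(p)
H_1 = H(p_1)
H_2_given_1 = -np.sum(p_12 * np.log2(p_12 / p_1[:, None]))      # H(X2 | X1)
H_3_given_12 = -np.sum(p * np.log2(p / p_12[:, :, None]))       # H(X3 | X1, X2)
assert np.isclose(H_123, H_1 + H_2_given_1 + H_3_given_12)      # chain rule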

Proof.

I(X_1, X_2, ..., X_n; Y) = H(X_1, X_2, ..., X_n) - H(X_1, X_2, ..., X_n | Y)
                         = \sum_{i=1}^n H(X_i | X_{i-1}, ..., X_1) - \sum_{i=1}^n H(X_i | X_{i-1}, ..., X_1, Y)
                         = \sum_{i=1}^n I(X_i; Y | X_{i-1}, X_{i-2}, ..., X_1).

(Skip conditional relative entropy for now.)

Convexity and Jensen's inequality

A large part of information theory consists in finding bounds on certain performance measures. The analytical idea behind a bound is to substitute for a complicated expression something simpler but not exactly equal, known to be either greater or smaller than the thing it replaces. This gives rise to simpler statements (and hence some insight), but usually at the expense of precision. Knowing when to use a bound to get a useful result generally requires a fair amount of mathematical maturity and experience.

One of the more important inequalities we will use throughout information theory is Jensen's inequality. Before introducing it, you need to know about convex and concave functions.

Definition 6. A function f(x) is said to be convex over an interval (a, b) if for every x_1, x_2 \in (a, b) and 0 \le \lambda \le 1,

f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).

A function is strictly convex if equality holds only when \lambda = 0 or \lambda = 1.

To understand the definition, recall that \lambda x_1 + (1 - \lambda) x_2 is simply a line segment connecting x_1 and x_2 (in the x direction) and \lambda f(x_1) + (1 - \lambda) f(x_2) is a line segment connecting f(x_1) and f(x_2). Pictorially, the function is convex if it lies below the straight line segment connecting any two points in the interval.

Definition 7. A function f is concave if -f is convex.

You will need to keep reminding me of which is which, since when I learned this the nomenclature was "convex ∪" and "convex ∩".

Example 1. Convex: x^2, e^x, |x|, x \log x. Concave: \log x, \sqrt{x}.

One reason why we are interested in convex functions is that it is known that over the interval of convexity there is only one minimum. This can strengthen many of the results we might want.

Theorem 5. If f has a second derivative which is non-negative (positive) everywhere, then f is convex (strictly convex).
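
The defining inequality of Definition 6 can be checked numerically. Below is a minimal sketch (my own, not from the notes): it samples the chord between two points and compares it against the function, using a few of the example functions above. The tolerance and test points are arbitrary choices.

import numpy as np

def check_convexity(f, x1, x2, n=101):
    """Return True if f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2) on a grid."""
    lam = np.linspace(0, 1, n)
    chord = lam * f(x1) + (1 - lam) * f(x2)
    curve = f(lam * x1 + (1 - lam) * x2)
    return np.all(curve <= chord + 1e-12)

print(check_convexity(np.square, -1.0, 3.0))              # True: x^2 is convex
print(check_convexity(np.exp, -2.0, 2.0))                 # True: e^x is convex
print(check_convexity(np.log2, 0.5, 4.0))                 # False: log x is concave
print(check_convexity(lambda x: -np.log2(x), 0.5, 4.0))   # True: -log x is convex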

Proof. The Taylor-series expansion of f about the point x_0 is

f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x^*)(x - x_0)^2.

If f''(x) \ge 0, the last term is non-negative. Let x_0 = \lambda x_1 + (1 - \lambda) x_2 and let x = x_1. Then

f(x_1) \ge f(x_0) + f'(x_0)[(1 - \lambda)(x_1 - x_2)].

Now let x = x_2 and get

f(x_2) \ge f(x_0) + f'(x_0)[\lambda(x_2 - x_1)].

Multiply the first by \lambda and the second by 1 - \lambda, and add them together to get the convexity result.

We now introduce Jensen's inequality.

Theorem 6. If f is a convex function and X is a r.v., then

E f(X) \ge f(E X).

Put another way,

\sum_x p(x) f(x) \ge f\left( \sum_x p(x) x \right).

If f is strictly convex, then equality in the theorem implies that X = EX with probability 1. If f is concave, then E f(X) \le f(E X).

The theorem allows us (more or less) to pull a function outside of a summation in some circumstances.

Proof. The proof is by induction. When X takes on two values the inequality is

p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2).

This is true by the definition of convex functions. Inductive hypothesis: suppose the theorem is true for distributions with k - 1 values. Then letting p_i' = p_i / (1 - p_k) for i = 1, 2, ..., k - 1,

\sum_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i)
                          \ge p_k f(x_k) + (1 - p_k) f\left( \sum_{i=1}^{k-1} p_i' x_i \right)
                          \ge f\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i \right)
                          = f\left( \sum_{i=1}^{k} p_i x_i \right).

There is another inequality that got considerable use (in many of the same ways as Jensen's inequality) way back in the dark ages when I took information theory. I may refer to it simply as the information inequality.
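
A small numerical illustration of Jensen's inequality (an assumed example, not from the notes): for a convex f, E f(X) \ge f(EX); for a concave f the inequality reverses. The values and pmf of the discrete r.v. below are made up.

import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])          # values of a hypothetical discrete r.v.
p = np.array([0.4, 0.3, 0.2, 0.1])          # its pmf

EX = np.sum(p * x)
E_fX = np.sum(p * np.square(x))             # E[X^2] with f(x) = x^2 (convex)
print(E_fX, ">=", EX**2)                    # Jensen: E[X^2] >= (E X)^2

E_logX = np.sum(p * np.log2(x))             # f(x) = log x is concave
print(E_logX, "<=", np.log2(EX))            # so E[log X] <= log(E X)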

Theorem 7. \log x \le x - 1, with equality if and only if x = 1.

This can also be generalized by taking the tangent line at different points along the function.

With these simple inequalities we can now prove some facts about the information measures we have defined so far.

Theorem 8. D(p || q) \ge 0, with equality if and only if p(x) = q(x) for all x.

Proof.

-D(p || q) = -\sum_x p(x) \log \frac{p(x)}{q(x)}
           = \sum_x p(x) \log \frac{q(x)}{p(x)}
           \le \log \sum_x p(x) \frac{q(x)}{p(x)}          (Jensen's inequality)
           = \log \sum_x q(x) = 0.

Since \log is a strictly concave function, we have equality if and only if q(x)/p(x) = 1.

Proof. Here is another proof, using the information inequality:

-D(p || q) = \sum_x p(x) \log \frac{q(x)}{p(x)}
           \le \sum_x p(x) \left( \frac{q(x)}{p(x)} - 1 \right)          (\log x \le x - 1)
           = \sum_x q(x) - \sum_x p(x) = 0.

Corollary 2. Mutual information is non-negative: I(X; Y) \ge 0, with equality if and only if X and Y are independent.

Let \mathcal{X} be the set of values that the random variable X takes on and let |\mathcal{X}| denote the number of elements in the set. For discrete random variables, the uniform distribution over the range \mathcal{X} has the maximum entropy.

Theorem 9. H(X) \le \log |\mathcal{X}|, with equality iff X has a uniform distribution.

Proof. Let u(x) = 1/|\mathcal{X}| be the uniform distribution and let p(x) be the distribution of X. Then

D(p || u) = \sum_x p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X),

and since D(p || u) \ge 0, we get H(X) \le \log |\mathcal{X}|.

Note how easily this optimizing value drops into our lap by means of an inequality. There is an important principle of engineering design here: if you can show that some performance criterion is upper-bounded by some function, and then show how to achieve that upper bound, you have an optimum design. No calculus required!

The more we know, the less uncertainty there is:
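
Theorems 8 and 9 are also easy to spot-check numerically. The sketch below (my own, under the assumption of base-2 logs and a made-up alphabet size of 6) verifies D(p||q) \ge 0 for several random pmfs and confirms that the uniform distribution attains H(X) = \log|\mathcal{X}|.

import numpy as np

rng = np.random.default_rng(1)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def D(p, q):
    return np.sum(p * np.log2(p / q))

for _ in range(5):
    p = rng.random(6); p /= p.sum()
    q = rng.random(6); q /= q.sum()
    assert D(p, q) >= 0                      # information inequality (Theorem 8)
    assert H(p) <= np.log2(6) + 1e-12        # entropy never exceeds log|X| (Theorem 9)

u = np.full(6, 1/6)
print(H(u), np.log2(6))                      # equal: the uniform pmf achieves the bound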

Theorem 10. Conditioning reduces entropy: H(X | Y) \le H(X), with equality iff X and Y are independent.

Proof. 0 \le I(X; Y) = H(X) - H(X | Y).

Theorem 11. H(X_1, X_2, ..., X_n) \le \sum_{i=1}^n H(X_i), with equality if and only if the X_i are independent.

Proof. By the chain rule for entropy,

H(X_1, X_2, ..., X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, ..., X_1) \le \sum_{i=1}^n H(X_i)          (conditioning reduces entropy).
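
Finally, a short numerical check of Theorems 10 and 11 (my own sketch, not from the notes), using a random hypothetical joint pmf over two variables; the n = 2 case of Theorem 11 is the subadditivity inequality H(X, Y) \le H(X) + H(Y).

import numpy as np

rng = np.random.default_rng(2)
p_xy = rng.random((4, 3)); p_xy /= p_xy.sum()    # random joint pmf p(x, y)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
H_X_given_Y = H(p_xy) - H(p_y)                   # chain rule: H(X|Y) = H(X,Y) - H(Y)
assert H_X_given_Y <= H(p_x) + 1e-12             # Theorem 10: conditioning reduces entropy
assert H(p_xy) <= H(p_x) + H(p_y) + 1e-12        # Theorem 11 with n = 2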