LECTURE 2: Convexity and related notions

Last time:
- goals and mechanics of the class, notation
- entropy: definitions and properties
- mutual information: definitions and properties

Lecture outline:
- convexity and concavity
- Jensen's inequality
- positivity of mutual information
- data processing theorem
- Fano's inequality

Reading: Sections 2.6-2.8, 2.11.

Convexity

Definition: a function $f(x)$ is convex over $(a,b)$ iff for all $x_1, x_2 \in (a,b)$ and $0 \le \lambda \le 1$,

$$f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2),$$

and is strictly convex iff equality holds only when $\lambda = 0$ or $\lambda = 1$. $f$ is concave iff $-f$ is convex.

Convenient test: if $f$ has a second derivative that is non-negative (positive) everywhere, then $f$ is convex (strictly convex).
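As a quick sanity check, here is a minimal numerical sketch (the function and sample points are illustrative choices) of the chord inequality for $f(x) = x \log_2 x$, which is strictly convex on $(0,\infty)$ since $f''(x) = \log_2(e)/x > 0$; its negative reappears below in the entropy discussion.

```python
# Minimal sketch: check the chord inequality
# f(l*x1 + (1-l)*x2) <= l*f(x1) + (1-l)*f(x2) for f(x) = x*log2(x)
# at randomly sampled points (illustrative, not a proof).
import math
import random

def f(x):
    return x * math.log2(x)

random.seed(0)
for _ in range(10_000):
    x1, x2 = random.uniform(0.01, 10), random.uniform(0.01, 10)
    lam = random.random()
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    assert lhs <= rhs + 1e-12
print("f(x) = x log2(x) satisfied the convexity chord inequality at all sampled points")
```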

Jensen's inequality

If $f$ is a convex function and $X$ is a r.v., then

$$E[f(X)] \ge f(E[X]).$$

If $f$ is strictly convex, then $E[f(X)] = f(E[X])$ implies $X = E[X]$, i.e., $X$ is constant.
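A small numerical sketch (the support and pmf below are illustrative choices) comparing the two sides of Jensen's inequality for a convex $f$:

```python
# Sketch: compute E[f(X)] and f(E[X]) for a convex f and a discrete r.v. X,
# and check Jensen's inequality E[f(X)] >= f(E[X]).
import math

f = lambda x: x * math.log2(x)   # convex on (0, infinity)
xs = [0.5, 1.0, 2.0, 4.0]        # support of X (illustrative)
ps = [0.1, 0.2, 0.3, 0.4]        # pmf of X (illustrative)

E_X = sum(p * x for p, x in zip(ps, xs))
E_fX = sum(p * f(x) for p, x in zip(ps, xs))
print(f"E[f(X)] = {E_fX:.4f}  >=  f(E[X]) = {f(E_X):.4f}")
assert E_fX >= f(E_X) - 1e-12
```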

Concavity of entropy

Let $f(x) = -x \log(x)$. Then

$$f'(x) = -x \cdot \frac{\log(e)}{x} - \log(x) = -\log(x) - \log(e)$$

and, for $x > 0$,

$$f''(x) = -\frac{\log(e)}{x} < 0.$$

Since $H(X) = \sum_{x \in \mathcal{X}} f(P_X(x))$, the entropy of $X$ is concave in the value of $P_X(x)$ for every $x$. Thus, consider two random variables $X_1$ and $X_2$ with common alphabet $\mathcal{X}$. Then the random variable $\tilde{X}$ defined over the same $\mathcal{X}$ such that

$$P_{\tilde{X}}(x) = \lambda P_{X_1}(x) + (1-\lambda) P_{X_2}(x)$$

satisfies $H(\tilde{X}) \ge \lambda H(X_1) + (1-\lambda) H(X_2)$.
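A quick numerical illustration (the two pmfs and $\lambda$ are illustrative choices) of the concavity statement above:

```python
# Sketch: entropy of a mixture of two pmfs on the same alphabet is at least
# the corresponding mixture of the entropies.
import math

def H(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.1, 0.8]
lam = 0.3
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

print(f"H(mix) = {H(mix):.4f}  >=  lam*H(p1) + (1-lam)*H(p2) = {lam*H(p1) + (1-lam)*H(p2):.4f}")
```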

Maximum entropy

Consider any random variable $X_1^1$ on $\mathcal{X}$. For simplicity, take $\mathcal{X} = \{1, \ldots, |\mathcal{X}|\}$ (we just want to use the elements of $\mathcal{X}$ as indices). Now consider a random variable $X_1^2$ such that

$$P_{X_1^2}(x) = P_{X_1^1}(\mathrm{shift}(x)),$$

where shift denotes the cyclic shift on $(1, \ldots, |\mathcal{X}|)$. Clearly $H(X_1^1) = H(X_1^2)$. Moreover, consider $X_2^1$ defined over the same $\mathcal{X}$ such that

$$P_{X_2^1}(x) = \lambda P_{X_1^1}(x) + (1-\lambda) P_{X_1^2}(x);$$

then, by concavity, $H(X_2^1) \ge \lambda H(X_1^1) + (1-\lambda) H(X_1^2) = H(X_1^1)$. We can show recursively, with the obvious extension of notation, that

$$H(X_n^1) \ge H(X_m^1) \quad \text{for } n > m \ge 1.$$

Now $\lim_{n \to \infty} P_{X_n^1}(x) = \frac{1}{|\mathcal{X}|}$ for all $x \in \mathcal{X}$. Hence the uniform distribution maximizes entropy, and $H(X) \le \log(|\mathcal{X}|)$.
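The following is a small sketch of this shift-and-mix argument (the starting pmf and $\lambda$ are illustrative choices): the entropy never decreases under mixing with the cyclic shift, and the pmf approaches the uniform distribution, whose entropy is $\log|\mathcal{X}|$.

```python
# Sketch: repeatedly replace p by lam*p + (1-lam)*shift(p); entropy is
# nondecreasing and p approaches the uniform distribution.
import math

def H(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def shift(p):
    return p[-1:] + p[:-1]            # cyclic shift of the pmf

p = [0.6, 0.25, 0.1, 0.05]
lam = 0.5
prev = H(p)
for _ in range(50):
    p = [lam * a + (1 - lam) * b for a, b in zip(p, shift(p))]
    assert H(p) >= prev - 1e-12       # entropy never decreases
    prev = H(p)

print(f"after mixing: p ~ {[round(q, 4) for q in p]}, H = {H(p):.4f}, log2|X| = {math.log2(len(p)):.4f}")
```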

Conditioning reduces entropy

$$H(Y|X) = \sum_{x \in \mathcal{X}} P_X(x) H(Y|X=x) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_X(x) P_{Y|X}(y|x) \log_2 P_{Y|X}(y|x).$$

Since

$$P_Y(y) = \sum_{x \in \mathcal{X}} P_X(x) P_{Y|X}(y|x),$$

concavity of entropy gives $H(Y|X) \le H(Y)$. Hence $I(X;Y) = H(Y) - H(Y|X) \ge 0$.

Independence bound:

$$H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i | X_1, \ldots, X_{i-1}) \le \sum_{i=1}^n H(X_i).$$

Question: is $H(Y|X=x) \le H(Y)$?
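A toy numerical check (the distributions are illustrative choices): $H(Y|X) \le H(Y)$ holds on average, even though an individual term $H(Y|X=x)$ can exceed $H(Y)$, which bears on the question above.

```python
# Sketch: compare H(Y), H(Y|X), and a single term H(Y|X=x) for a toy joint pmf.
import math

def H(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

P_X = {0: 0.9, 1: 0.1}
P_Y_given_X = {0: [0.95, 0.05],      # Y nearly deterministic when X = 0
               1: [0.50, 0.50]}      # Y uniform when X = 1

P_Y = [sum(P_X[x] * P_Y_given_X[x][y] for x in P_X) for y in (0, 1)]
H_Y_given_X = sum(P_X[x] * H(P_Y_given_X[x]) for x in P_X)

print(f"H(Y)     = {H(P_Y):.4f}")
print(f"H(Y|X)   = {H_Y_given_X:.4f}   (always <= H(Y))")
print(f"H(Y|X=1) = {H(P_Y_given_X[1]):.4f}   (exceeds H(Y) in this example)")
```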

Mutual information and transition probability

Let us call $P_{Y|X}(y|x)$ the transition probability from $X$ to $Y$. Consider a r.v. $Z$ that takes values 0 and 1 with probabilities $\theta$ and $1-\theta$, and two transition probabilities $P^0_{Y|X}$ and $P^1_{Y|X}$ such that

$$P_{Y|X,Z}(y|x,0) = P^0_{Y|X}(y|x), \qquad P_{Y|X,Z}(y|x,1) = P^1_{Y|X}(y|x),$$

with $Z$ independent of $X$. Then

$$I(X;(Y,Z)) = I(X;Z) + I(X;Y|Z) = I(X;Y|Z)$$

since $I(X;Z) = 0$, and also

$$I(X;(Y,Z)) = I(X;Y) + I(X;Z|Y) \ge I(X;Y),$$

hence

$$\theta I(X;Y|Z=0) + (1-\theta) I(X;Y|Z=1) = I(X;Y|Z) \ge I(X;Y),$$

where $I(X;Y)$ is evaluated for the mixed transition probability $\theta P^0_{Y|X} + (1-\theta) P^1_{Y|X}$. For a fixed input assignment, $I(X;Y)$ is convex in the transition probabilities.
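A toy numerical sketch of this convexity (the two channels, the input pmf, and $\theta$ are illustrative choices):

```python
# Sketch: I(X;Y) for the mixed channel theta*W0 + (1-theta)*W1 is at most
# theta*I0 + (1-theta)*I1, with the input pmf held fixed.
import math

def mutual_info(p_x, W):
    """I(X;Y) in bits for input pmf p_x and channel matrix W[x][y] = P(y|x)."""
    p_y = [sum(p_x[x] * W[x][y] for x in range(len(p_x))) for y in range(len(W[0]))]
    return sum(p_x[x] * W[x][y] * math.log2(W[x][y] / p_y[y])
               for x in range(len(p_x)) for y in range(len(W[0]))
               if p_x[x] * W[x][y] > 0)

p_x = [0.5, 0.5]
W0 = [[0.9, 0.1], [0.1, 0.9]]        # binary symmetric channel, crossover 0.1
W1 = [[0.6, 0.4], [0.4, 0.6]]        # binary symmetric channel, crossover 0.4
theta = 0.3
W_mix = [[theta * W0[x][y] + (1 - theta) * W1[x][y] for y in range(2)] for x in range(2)]

lhs = mutual_info(p_x, W_mix)
rhs = theta * mutual_info(p_x, W0) + (1 - theta) * mutual_info(p_x, W1)
print(f"I(mixed channel) = {lhs:.4f}  <=  theta*I0 + (1-theta)*I1 = {rhs:.4f}")
```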

Mutual information and input probability

Consider a r.v. $Z$ that takes values 0 and 1 with probabilities $\theta$ and $1-\theta$, and two input distributions $P^0$ and $P^1$ such that

$$P_{X|Z}(x|0) = P^0(x), \qquad P_{X|Z}(x|1) = P^1(x),$$

with $Z$ and $Y$ conditionally independent given $X$, so $I(Y;Z|X) = 0$. Then

$$I(Y;(Z,X)) = I(Y;Z) + I(Y;X|Z) = I(Y;X) + I(Y;Z|X) = I(Y;X),$$

so

$$I(X;Y|Z) = I(X;Y) - I(Y;Z) \le I(X;Y),$$

where $I(X;Y|Z) = \theta I(X;Y|Z=0) + (1-\theta) I(X;Y|Z=1)$ and $I(X;Y)$ is evaluated for the mixed input distribution $\theta P^0 + (1-\theta) P^1$. Mutual information is a concave function of the input probabilities.

Exercise: consider a jamming game in which we try to maximize mutual information and a jammer attempts to reduce it. What will the policies be?
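A toy numerical sketch of this concavity (the channel, the two input pmfs, and $\theta$ are illustrative choices):

```python
# Sketch: with the channel fixed, I(X;Y) at the mixed input theta*p0 + (1-theta)*p1
# is at least theta*I(p0) + (1-theta)*I(p1).
import math

def mutual_info(p_x, W):
    p_y = [sum(p_x[x] * W[x][y] for x in range(len(p_x))) for y in range(len(W[0]))]
    return sum(p_x[x] * W[x][y] * math.log2(W[x][y] / p_y[y])
               for x in range(len(p_x)) for y in range(len(W[0]))
               if p_x[x] * W[x][y] > 0)

W = [[0.9, 0.1], [0.2, 0.8]]         # fixed channel P(y|x)
p0, p1, theta = [0.9, 0.1], [0.2, 0.8], 0.4
p_mix = [theta * a + (1 - theta) * b for a, b in zip(p0, p1)]

lhs = mutual_info(p_mix, W)
rhs = theta * mutual_info(p0, W) + (1 - theta) * mutual_info(p1, W)
print(f"I(mixed input) = {lhs:.4f}  >=  theta*I(p0) + (1-theta)*I(p1) = {rhs:.4f}")
```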

Markov chain

Markov chain: random variables $X, Y, Z$ form a Markov chain in that order, written $X \to Y \to Z$, if the joint PMF can be written as

$$P_{X,Y,Z}(x,y,z) = P_X(x) P_{Y|X}(y|x) P_{Z|Y}(z|y).$$

Markov chain

Consequences: $X \to Y \to Z$ iff $X$ and $Z$ are conditionally independent given $Y$:

$$P_{X,Z|Y}(x,z|y) = \frac{P_{X,Y,Z}(x,y,z)}{P_Y(y)} = \frac{P_{X,Y}(x,y)}{P_Y(y)} P_{Z|Y}(z|y) = P_{X|Y}(x|y) P_{Z|Y}(z|y),$$

so Markov implies conditional independence and vice versa. In particular, $X \to Y \to Z \iff Z \to Y \to X$ (compare the LHS above with the last RHS, which is symmetric in $X$ and $Z$).

Data Processing Theorem

If $X \to Y \to Z$ then $I(X;Y) \ge I(X;Z)$.

$$I(X;Y,Z) = I(X;Z) + I(X;Y|Z)$$
$$I(X;Y,Z) = I(X;Y) + I(X;Z|Y)$$

$X$ and $Z$ are conditionally independent given $Y$, so $I(X;Z|Y) = 0$, hence

$$I(X;Z) + I(X;Y|Z) = I(X;Y),$$

so $I(X;Y) \ge I(X;Z)$, with equality iff $I(X;Y|Z) = 0$.

Note: $X \to Z \to Y \iff I(X;Y|Z) = 0$, i.e., $Y$ depends on $X$ only through $Z$.

Consequence: you cannot undo degradation.
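A toy numerical illustration (the input pmf and the two channels are illustrative choices) of the data processing theorem on a chain $X \to Y \to Z$:

```python
# Sketch: build X -> Y -> Z from P_X, P(y|x), P(z|y) and check I(X;Y) >= I(X;Z).
import math

def mutual_info(p_x, W):
    p_y = [sum(p_x[x] * W[x][y] for x in range(len(p_x))) for y in range(len(W[0]))]
    return sum(p_x[x] * W[x][y] * math.log2(W[x][y] / p_y[y])
               for x in range(len(p_x)) for y in range(len(W[0]))
               if p_x[x] * W[x][y] > 0)

p_x = [0.5, 0.5]
W_xy = [[0.9, 0.1], [0.1, 0.9]]      # P(y|x)
W_yz = [[0.8, 0.2], [0.2, 0.8]]      # P(z|y)
# Composite channel P(z|x) = sum_y P(y|x) P(z|y), valid because X -> Y -> Z
W_xz = [[sum(W_xy[x][y] * W_yz[y][z] for y in range(2)) for z in range(2)]
        for x in range(2)]

print(f"I(X;Y) = {mutual_info(p_x, W_xy):.4f}  >=  I(X;Z) = {mutual_info(p_x, W_xz):.4f}")
```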

Consequence: Second Law of Thermodynamics

The conditional entropy $H(X_n|X_0)$ is nondecreasing as $n$ increases for a stationary Markov process $X_0, \ldots, X_n$. Look at the Markov chain $X_0 \to X_{n-1} \to X_n$. The DPT says $I(X_0; X_{n-1}) \ge I(X_0; X_n)$, i.e.,

$$H(X_{n-1}) - H(X_{n-1}|X_0) \ge H(X_n) - H(X_n|X_0).$$

By stationarity $H(X_{n-1}) = H(X_n)$, so

$$H(X_{n-1}|X_0) \le H(X_n|X_0).$$

Note: we still have $H(X_n|X_0) \le H(X_n)$.
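A small sketch (the transition matrix is an illustrative choice) of this monotonicity for a two-state chain started in its stationary distribution:

```python
# Sketch: for a stationary Markov chain, H(X_n | X_0) is nondecreasing in n.
import math

def H(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

P = [[0.9, 0.1], [0.3, 0.7]]         # transition matrix P(x_{n+1} | x_n)
pi = [0.75, 0.25]                    # its stationary distribution, used as P_{X_0}

Pn = [[1.0, 0.0], [0.0, 1.0]]        # P^0 = identity
prev = 0.0
for n in range(1, 11):
    Pn = matmul(Pn, P)               # row x0 of P^n is the pmf of X_n given X_0 = x0
    H_n = sum(pi[x0] * H(Pn[x0]) for x0 in range(2))
    assert H_n >= prev - 1e-12       # nondecreasing in n
    print(f"n = {n:2d}   H(X_n | X_0) = {H_n:.4f}")
    prev = H_n
```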

Fano's lemma

Suppose we have r.v.s $X$ and $Y$; Fano's lemma bounds the error we expect when estimating $X$ from $Y$. We generate an estimator of $X$ of the form $\hat{X} = g(Y)$. The probability of error is $P_e = \Pr(\hat{X} \ne X)$. Let $E$ be the indicator r.v. for a correct estimate, which is 1 when $\hat{X} = X$ and 0 otherwise; thus $P_e = P(E = 0)$.

Fano's lemma:

$$H(E) + P_e \log(|\mathcal{X}| - 1) \ge H(X|Y).$$

Proof of Fano's lemma

$$H(E, X|Y) = H(X|Y) + H(E|X,Y) = H(X|Y),$$

since $E$ is determined by $X$ and $\hat{X} = g(Y)$. Also

$$H(E, X|Y) = H(E|Y) + H(X|E,Y) \le H(E) + H(X|E,Y),$$

and

$$H(X|E,Y) = P_e H(X|E=0,Y) + (1-P_e) H(X|E=1,Y) = P_e H(X|E=0,Y) \le P_e \log(|\mathcal{X}| - 1),$$

because given $E = 1$ (no error) and $Y$, $X = g(Y)$ is determined, while given $E = 0$ (an error) and $Y = y$, $X$ can take at most $|\mathcal{X}| - 1$ values (anything except $g(y)$). Combining,

$$H(X|Y) \le H(E) + P_e \log(|\mathcal{X}| - 1).$$
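A toy numerical check of the bound (the joint pmf is an illustrative choice; the estimator used here is the MAP rule $g(y) = \arg\max_x P_{X|Y}(x|y)$):

```python
# Sketch: verify H(E) + P_e*log(|X|-1) >= H(X|Y) for a toy joint pmf and the
# MAP estimator g(y).
import math

def H(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

# Joint pmf P(x, y), x in {0, 1, 2}, y in {0, 1}
P = [[0.30, 0.05],
     [0.10, 0.25],
     [0.10, 0.20]]
X_size = 3

P_y = [sum(P[x][y] for x in range(X_size)) for y in range(2)]
g = [max(range(X_size), key=lambda x: P[x][y]) for y in range(2)]        # MAP rule
P_e = sum(P[x][y] for x in range(X_size) for y in range(2) if g[y] != x)

H_X_given_Y = sum(P_y[y] * H([P[x][y] / P_y[y] for x in range(X_size)]) for y in range(2))
lhs = H([P_e, 1 - P_e]) + P_e * math.log2(X_size - 1)
print(f"H(E) + P_e*log(|X|-1) = {lhs:.4f}  >=  H(X|Y) = {H_X_given_Y:.4f}")
```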

MIT OpenCourseWare
http://ocw.mit.edu

6.441 Information Theory
Spring 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.