Lecture 2: August 31

10-704: Information Processing and Learning                                Fall 2016
Lecturer: Aarti Singh                                                      Lecture 2: August 31

Note: These notes are based on scribed notes from the Spring 15 offering of this course. LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

2.1 Information Quantities

In the previous class, we defined the information content of a random outcome and the average information content of a random variable:

The Shannon information content of a random outcome x which occurs with probability p(x) is log_2 (1/p(x)).

The entropy in bits is the average uncertainty of a random variable X, i.e. a weighted combination of the Shannon information content of each value x that the random variable X could take, weighted by the probability of that value/outcome:

    H(X) = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{1}{p(x)} = E_{X \sim p}\left[\log_2 \frac{1}{p(X)}\right]

Here \mathcal{X} is the collection of all values x that X can take. It is also known as the alphabet over which X is defined. Note that we are focusing on discrete random variables for now.

The entropy will turn out to be a fundamental quantity that characterizes the fundamental limit of compression, i.e. the smallest number of bits to which a source with distribution (model) p(x) can be compressed.

We now define some more information quantities that will be useful:

The joint entropy in bits of two random variables X, Y with joint distribution p(x, y) is

    H(X, Y) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log_2 \frac{1}{p(x, y)}

The conditional entropy in bits of Y conditioned on X is the average uncertainty about Y after observing X:

    H(Y | X) = \sum_{x \in \mathcal{X}} p(x) H(Y | X = x) = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y | x) \log_2 \frac{1}{p(y | x)}
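As a quick sanity check of these definitions, the following Python sketch computes H(X), H(X, Y) and H(Y | X) for a small joint distribution (the table below is an arbitrary toy example, not one from the lecture) and verifies the chain rule H(X, Y) = H(X) + H(Y | X), which appears as property 3 in Section 2.4 below.

```python
import numpy as np

# Toy joint distribution p(x, y): rows index x in {0, 1}, columns index y in {0, 1, 2, 3}.
# Any non-negative table that sums to 1 would do.
p_xy = np.array([[1/8, 1/16, 1/32, 1/32],
                 [1/4, 1/8,  1/8,  1/4 ]])

def entropy(p):
    """H(p) in bits, ignoring zero-probability outcomes."""
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

p_x = p_xy.sum(axis=1)         # marginal p(x)

H_X  = entropy(p_x)
H_XY = entropy(p_xy.ravel())   # joint entropy H(X, Y)

# Conditional entropy H(Y | X) = sum_x p(x) H(Y | X = x)
H_Y_given_X = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(H_X, H_XY, H_Y_given_X)
assert np.isclose(H_XY, H_X + H_Y_given_X)   # chain rule: H(X, Y) = H(X) + H(Y | X)
```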

Given two distributions p, q for a random variable X, the relative entropy between p and q is

    D(p \| q) = E_{X \sim p}\left[\log \frac{1}{q(X)}\right] - E_{X \sim p}\left[\log \frac{1}{p(X)}\right] = E_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_x p(x) \log \frac{p(x)}{q(x)}

The base of the log can be 2, if information is measured in bits, or e, if information is measured in nats. The relative entropy is also known as the information divergence or the Kullback-Leibler (KL) divergence.

The relative entropy is the extra cost incurred if we use distribution q to encode X when the true underlying distribution is p. Consider the example from last lecture, where p(x) = uniform({0, 1, ..., 63}) and we need 6 yes/no questions to guess each outcome, hence the average information content (entropy) is 6 bits. If instead we use the model q(x) = uniform({0, 1, ..., 127}), then 7 questions are needed for each outcome. The extra price paid to encode an outcome x using model q when x is generated according to the true model p is 1 bit, which is exactly the relative entropy. We will see below that this is also the excess risk incurred, under the negative log likelihood loss, when the true model is p but the estimated model is q.

The mutual information between X and Y is the KL-divergence between the joint distribution and the product of the marginals. Formally:

    I(X; Y) = D(p(x, y) \| p(x) p(y)),                                        (2.1)

where p(x, y) is the joint distribution of X, Y and p(x), p(y) are the corresponding marginal distributions. Thus, p(x) p(y) denotes the joint distribution that would result if X, Y were independent. The mutual information quantifies how much dependence there is between two random variables. If X ⊥ Y, then p(x, y) = p(x) p(y) and I(X; Y) = 0.

The mutual information will turn out to be a fundamental quantity that characterizes the fundamental limit of transmission, i.e. the largest number of bits that can be reliably transmitted per use of a noisy channel with input X and output Y.

2.2 Connection to Maximum Likelihood Estimation

Suppose X = (X_1, ..., X_n) are generated from a distribution p (for example, X_i ~ p i.i.d.). In maximum likelihood estimation, we want to find a distribution q from some family Q such that the likelihood of the data is maximized:

    arg max_{q \in Q} q(X) = arg max_{q \in Q} \log q(X) = arg min_{q \in Q} \log \frac{1}{q(X)}

In machine learning, we often define a loss function. In this case, the loss function is the negative log loss: loss(q, X) = -\log q(X) = \log(1/q(X)). The expected value of this loss function is the risk: Risk(q) = E_p[\log(1/q(X))]. We want to find a distribution q that minimizes the risk. Notice that minimizing the risk with respect to a distribution q is exactly minimizing the relative entropy between p and q. This is because

    Risk(q) = E_p\left[\log \frac{1}{q(X)}\right] = E_p\left[\log \frac{p(X)}{q(X)}\right] + E_p\left[\log \frac{1}{p(X)}\right] = D(p \| q) + Risk(p)

As we will see below, the relative entropy is always non-negative, and hence the risk is minimized by setting q equal to p. Thus the minimum risk is R* = Risk(p) = H(p), the entropy of distribution p. The excess risk, Risk(q) - R*, is precisely the relative entropy between p and q.
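To make the last two points concrete, here is a small sketch that revisits the uniform({0, ..., 63}) vs. uniform({0, ..., 127}) example and checks numerically that the risk under the negative log likelihood loss decomposes as Risk(q) = D(p ‖ q) + H(p). The code is only an illustration of the identities above.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def kl(p, q):
    """Relative entropy D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# True model p: uniform on {0, ..., 63}; coding model q: uniform on {0, ..., 127}.
p = np.zeros(128)
p[:64] = 1 / 64
q = np.full(128, 1 / 128)

print(kl(p, q))   # 1.0 bit: the extra price paid for using q instead of p

# Risk under the log loss, Risk(q) = E_p[log2 1/q(X)], and its decomposition.
risk_q = float(np.sum(p[p > 0] * np.log2(1.0 / q[p > 0])))
assert np.isclose(risk_q, kl(p, q) + entropy(p))   # 7 bits = 1 bit + 6 bits
```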

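Similarly, a minimal check of definition (2.1): for an assumed 2×2 joint distribution, I(X; Y) is computed as the KL divergence between the joint and the product of its marginals; when the joint factorizes, the mutual information is (numerically) zero.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def mutual_information(p_xy):
    """I(X; Y) = D( p(x, y) || p(x) p(y) ), in bits."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl(p_xy.ravel(), (p_x * p_y).ravel())

# Dependent case: observing Y tells us something about X.
p_dep = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
# Independent case: the joint is exactly the product of its marginals.
p_ind = np.outer([0.5, 0.5], [0.8, 0.2])

print(mutual_information(p_dep))   # about 0.28 bits
print(mutual_information(p_ind))   # 0 up to floating-point error
```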
2.3 Fundamental Limits in Information Theory

The source coding model is as follows:

    Source (model p(x)) → X_1, ..., X_n → Compressor → b_1, ..., b_m bits → Decompressor → Receiver

Let the data source be generated according to some distribution p(x). The rate of a source code is defined as the average number of bits used to encode one source symbol, i.e. E_p[codelength / #source symbols] = E_p[m/n]. If the rate of a code is less than the source entropy H(X), that is, if E_p[m/n] < H(X), then perfect reconstruction is not possible: a source cannot be compressed below its entropy without loss. We will state and prove this rigorously later in the course.

The channel coding model is as follows:

    b_1, ..., b_m bits → Channel Encoder → X_1, ..., X_n → Channel p(y|x) → Y_1, ..., Y_n → Channel Decoder

The rate of a channel code is defined as the average number of message bits transmitted per channel use, i.e. E_p[#message bits / #channel uses] = E_p[m/n]. If the rate of a code is greater than the channel capacity C := max_{p(x)} I(X; Y), then perfect reconstruction is not possible.

The inference problem is similar to the channel coding problem, except that we do not design the encoder. In the density estimation setting with model class {p_θ : θ ∈ Θ}:

    θ → Channel: p_θ(X) → X_1, ..., X_n → Decoder → θ̂

We can denote the estimated model as q = p_θ̂. Under log loss:

    Excess Risk(q) = Risk(q) - Risk(p) = D(p \| q)

Fundamental limits of inference problems are often characterized by minimax lower bounds, i.e. the smallest possible excess risk that any estimator can achieve for a class of models P. For the density estimation problem, the minimax excess risk is inf_q sup_{p ∈ P} D(p ‖ q), and we will show that this equals the capacity C of the corresponding channel. This implies that for every estimator q, sup_{p ∈ P} D(p ‖ q) ≥ C. We will state and prove these results formally later in the course.

Information theory will help us identify these fundamental limits of data compression, transmission and inference, and in some cases also demonstrate that the limits are achievable. The design of efficient encoders / decoders / estimators that achieve these limits is the common objective of Signal Processing and Machine Learning algorithms.
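To make the channel capacity C := max_{p(x)} I(X; Y) concrete, the sketch below assumes a binary symmetric channel with crossover probability ε = 0.1 (a standard example, chosen here only for illustration), maximizes I(X; Y) over input distributions by brute-force search, and compares the result with the known closed form C = 1 - H(ε) for this channel.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def mutual_information(p_xy):
    # Uses the identity I(X; Y) = H(X) + H(Y) - H(X, Y) (property 4(d) in Section 2.4).
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

eps = 0.1                          # crossover probability of the binary symmetric channel
W = np.array([[1 - eps, eps],      # channel matrix p(y | x); rows indexed by x
              [eps, 1 - eps]])

# C = max over input distributions p(x) of I(X; Y): brute-force the 1-D search over p(X = 0).
best = 0.0
for a in np.linspace(0.0, 1.0, 1001):
    p_x = np.array([a, 1.0 - a])
    p_xy = p_x[:, None] * W        # joint p(x, y) = p(x) p(y | x)
    best = max(best, mutual_information(p_xy))

print(best)                                      # about 0.531 bits per channel use
print(1 - entropy(np.array([eps, 1 - eps])))     # closed form 1 - H(eps) for the BSC
```

As expected by symmetry, the maximum is attained at the uniform input distribution p(X = 0) = 1/2.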

2.4 Useful Properties of Information Quantities

1. Entropy is always non-negative: H(X) ≥ 0, with H(X) = 0 iff X is constant.

   Proof: 0 ≤ p(x) ≤ 1 implies that \log \frac{1}{p(x)} ≥ 0 for every x.

   For example, consider a binary random variable X ~ Bernoulli(θ). If θ = 0 or θ = 1, then the outcome is certain (a constant), which implies that H(X) = 0. If θ = 1/2, then H(X) = 1 bit (which is the maximum entropy for a binary random variable, since the distribution is uniform).

2. H(X) ≤ log |X|, where X is the set of all outcomes with non-zero probability. Equality is achieved iff X is uniform.

   Proof: Let u be the uniform distribution over X, i.e. u(x) = 1/|X|, and let p(x) be the probability mass function of X. Then

       D(p \| u) = \sum_x p(x) \log \frac{p(x)}{u(x)} = \log |X| - H(X)

   By non-negativity of the relative entropy (stated and proved below), 0 ≤ D(p \| u) = \log |X| - H(X), hence H(X) ≤ log |X|.

3. Chain rule: H(X, Y) = H(X) + H(Y | X).

4. The following relations hold between entropy, conditional entropy, joint entropy, and mutual information:

   (a) H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)
   (b) I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = I(Y; X)
   (c) I(X; Y) = H(X, Y) - H(X | Y) - H(Y | X)
   (d) I(X; Y) = H(X) + H(Y) - H(X, Y)

5. (Gibbs' information inequality) D(p ‖ q) ≥ 0, with equality if and only if p(x) = q(x) for all x.

   Proof: Define the support of p to be X = {x : p(x) > 0}. Then

       -D(p \| q) = \sum_{x \in X} p(x) \log \frac{q(x)}{p(x)}
                  ≤ \log \sum_{x \in X} p(x) \frac{q(x)}{p(x)}
                  = \log \sum_{x \in X} q(x)
                  ≤ \log 1 = 0

   The first inequality is Jensen's inequality, since log is concave: for a concave function, the (weighted) average of the function values at two points is at most the function value at the (weighted) average of the two points. Because log is strictly concave, we have equality in the first inequality only if q(x)/p(x) is a constant c for all x in the support of p (i.e. if q(x) = c p(x)). The second inequality is tight only when that constant c = 1, since \sum_{x \in X} p(x) = 1.

6. As a corollary, we get that I(X; Y) = D(p(x, y) ‖ p(x) p(y)) ≥ 0, with equality iff X and Y are independent, that is, p(x, y) = p(x) p(y).

7. Conditioning cannot increase entropy, i.e. information always helps: H(X | Y) ≤ H(X), with equality iff X and Y are independent.

   Proof: 0 ≤ I(X; Y) = H(X) - H(X | Y), hence H(X | Y) ≤ H(X).
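Finally, a short numerical check of the identities in property 4 and of property 7 on an arbitrary (assumed) joint distribution; it proves nothing, but it is a convenient way to catch sign errors when manipulating these quantities.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# An arbitrary joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.25, 0.10, 0.05],
                 [0.05, 0.30, 0.25]])

p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())

# Conditional entropies computed directly from the conditional distributions.
H_Y_given_X = sum(p_x[i] * entropy(p_xy[i, :] / p_x[i]) for i in range(p_xy.shape[0]))
H_X_given_Y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(p_xy.shape[1]))

I_XY = H_X + H_Y - H_XY                                    # property 4(d)
assert np.isclose(H_XY, H_X + H_Y_given_X)                 # 4(a), chain rule
assert np.isclose(I_XY, H_X - H_X_given_Y)                 # 4(b)
assert np.isclose(I_XY, H_Y - H_Y_given_X)                 # 4(b)
assert np.isclose(I_XY, H_XY - H_X_given_Y - H_Y_given_X)  # 4(c)
assert H_X_given_Y <= H_X + 1e-12                          # property 7: conditioning cannot increase entropy
print(H_X, H_X_given_Y, I_XY)
```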