Information Theory
DS-GA 1003
David Rosenberg, New York University
June 15, 2015

A Measure of Information?

Consider a discrete random variable X. How much information do we gain from observing X?
Information = degree of surprise from observing X = x.
If we know P(X = 0) = 1, then observing X = 0 gives no information.
If we know P(X = 0) = 0.999:
  Observing X = 0 gives little information.
  Observing X = 1 gives a lot of surprise / information.
The information measure h(x) should depend on p(x): smaller p(x) ⟹ more information ⟹ larger h(x).

Shannon Information Content of an Outcome

Definition: Let X ∈ 𝒳 have PMF p(x). The Shannon information content of an outcome x is
  h(x) = log(1/p(x)) = -log p(x),
where the base of the log is 2. Information is measured in bits. (Or nats if the base of the log is e.)
A less likely outcome gives more information.
Information is additive for independent events: if X and Y are independent,
  h(x, y) = -log p(x, y) = -log[p(x) p(y)] = -log p(x) - log p(y) = h(x) + h(y).
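A quick numeric check of the definition and its additivity (an illustrative sketch; the function name information_content is ours, not from the slides):

```python
import numpy as np

def information_content(p, base=2.0):
    """Shannon information content h(x) = -log p(x); bits for base 2, nats for base e."""
    return -np.log(p) / np.log(base)

# Additivity for independent events: h(x, y) = h(x) + h(y)
p_x, p_y = 0.25, 0.125
print(information_content(p_x * p_y))                        # 5.0 bits
print(information_content(p_x) + information_content(p_y))   # 2.0 + 3.0 = 5.0 bits
```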

Entropy

Definition: Let X ∈ 𝒳 have PMF p(x). The entropy of X is
  H(X) = E_p[log(1/p(X))] = -Σ_{x∈𝒳} p(x) log p(x),
using the convention that 0 log 0 = 0, since lim_{x→0+} x log x = 0.
The entropy of X is the expected information gain from observing X.
Entropy only depends on the distribution p, so we can write H(p).
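In code, the entropy of a finite PMF is a one-liner (a minimal sketch; the helper name entropy is illustrative):

```python
import numpy as np

def entropy(p, base=2.0):
    """Entropy H(p) = -sum_x p(x) log p(x), using the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes, implementing 0 log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
```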

Coding

Definition: A binary source code C is a mapping from 𝒳 to finite 0/1 sequences.
Consider a r.v. X ∈ 𝒳 and a binary source code C defined as:

  x   p(x)   C(x)
  1   1/2    0
  2   1/4    10
  3   1/8    110
  4   1/8    111
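For concreteness, here is how one might encode a sequence of source symbols with this code C (a hedged sketch; the dict and function names are ours):

```python
# The code C from the table above, as a symbol -> codeword mapping.
C = {1: "0", 2: "10", 3: "110", 4: "111"}

def encode(symbols, code=C):
    """Concatenate the codewords of a sequence of source symbols."""
    return "".join(code[s] for s in symbols)

print(encode([1, 3, 2, 4]))  # "011010111"
```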

Expected Code Length

Consider a r.v. X ∈ 𝒳 and a binary source code C defined as:

  x   p(x)   C(x)   log(1/p(x))
  1   1/2    0      log_2 2 = 1
  2   1/4    10     log_2 4 = 2
  3   1/8    110    log_2 8 = 3
  4   1/8    111    log_2 8 = 3

The entropy is H(X) = E[log(1/p(X))]:
  H(X) = (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/8)(3) = 1.75 bits.
The expected code length is
  L(C) = (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/8)(3) = 1.75 bits.
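The same arithmetic, verified numerically (an illustrative check of the worked example above):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])
codeword_lengths = np.array([1, 2, 3, 3])  # lengths of 0, 10, 110, 111

H = -np.sum(p * np.log2(p))       # entropy in bits
L = np.sum(p * codeword_lengths)  # expected code length L(C)
print(H, L)                       # 1.75 1.75
```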

Prefix Codes

A code is a prefix code if no codeword is a prefix of another.
Prefix codes can be represented on binary trees:
  Each leaf node is a codeword.
  Its encoding represents the path from root to leaf.
(Figure from David MacKay's Information Theory, Inference, and Learning Algorithms, Section 5.1.)
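Checking the prefix property amounts to scanning all ordered pairs of codewords (a small illustrative helper, not from the slides):

```python
from itertools import permutations

def is_prefix_code(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    return not any(b.startswith(a) for a, b in permutations(codewords, 2))

print(is_prefix_code(["0", "10", "110", "111"]))  # True
print(is_prefix_code(["0", "01", "11"]))          # False: "0" is a prefix of "01"
```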

Data Compression: What s the Best Prefix Code? For X p(x), we get best compression with codeword lengths l (x) logp(x). Optimal bit length of x is the Shannon Information of x. Then the expected codeword length is L = E[ logp(x )] = H(X ) Entropy H(X ) gives a lower bound on coding performance. Shannon s Theorem says we can achieve H(X ) within 1 bit. avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 8 / 18

Shannon s Source Coding Theorem Theorem (Shannon s Source Coding Theorem) The expected length L of any binary prefix code for r.v. X is at least H(X ): L H(X ). There exist codes with lengths l(x) = log 2 p(x) achieving H(X ) L < H(X ) + 1. Notation x = ceil(x) = (smallest integer x) avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 9 / 18

Shannon s Source Coding Theorem: Summary For any X p(x), code with L H(X ). Get arbitrarily close to H(X ) by grouping multiple X s and coding all at once. If we know the distribution of X, we can code optimally. e.g. Use Huffman codes or arithmetic codes. What if we don t know p(x), and we use q(x) instead? David Rosenberg (New York University) DS-GA 1003 June 15, 2015 10 / 18

Coding with the Wrong Distribution: Core Calculation Allow fractional code lengths: l q (x) = logq(x) Then expected length for coding X p(x) using l q (x) is L = E X p(x) l q (X ) = p(x) log q(x) x = [ ] p(x) 1 p(x) log q(x) p(x) x = p(x)log p(x) q(x) + p(x)log 1 p(x) x = KL(p q) + H(p), where KL(p q) is the Kullback-Leibler divergence between p and q. avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 11 / 18

Entropy, Cross-Entropy, and KL-Divergence The Kullback-Leibler or KL Diverence is defined by ( ) p(x ) KL(p q) = E p log. q(x ) KL(p q): #(extra bits) needed if we code with q(x) instead of p(x). The cross entropy for p(x) and q(x) is defined as H(p,q) = E p logq(x ). H(p,q): #(bits) needed to code X p(x) using q(x). Summary: H(p,q) = H(p) + KL(p q). avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 12 / 18

Coding with the Wrong Distribution: Integer Lengths Theorem If we code X p(x) using code lengths l(x) = log 2 q(x), the expected code length is bounded as H(p) + KL(p q) E p l(x ) < H(p) + KL(p q) + 1. So with an implementable code (using integer codeword lengths), the expected code length is within 1 bit of what could be achieved with l(x) = log 2 q(x). Proof is a slight tweak on the core calculation. avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 13 / 18

Jensen s Inequality Theorem (Jensen s Inequality) If f : X R is a convex function, and X X is a random variable, then Ef (X ) f (EX ). Moreover, if f is strictly convex, then equality implies that X = EX with probability 1 (i.e. X is a constant). e.g. f (x) = x 2 is convex. So EX 2 (EX ) 2. Thus VarX = EX 2 (EX ) 2 0. avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 14 / 18

Gibbs Inequality (KL(p q) 0) Theorem (Gibbs Inequality) Let p(x) and q(x) be PMFs on X. Then KL(p q) 0, with equality iff p(x) = q(x) for all x X. KL divergence measures the distance between distributions. Note: KL divergence not a metric. KL divergence is not symmetric. avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 15 / 18

Gibbs Inequality: Proof ( )] q(x ) KL(p q) = E p [ log p(x ) [ ( )] q(x ) log E p (Jensen s) p(x ) = log p(x) q(x) p(x) {x p(x)>0} [ ] = log q(x) x X = log 1 = 0. Since log is strictly convex, we have strict equality iff q(x)/p(x) is a constant, which implies q = p. Essentially the same proof for PDFs. avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 16 / 18

KL-Divergence for Model Estimation Suppose D = {x 1,...,x n } is a sample from unknown p(x) on X. Hypothesis space: P some set of distributions on X. Idea: Find q P that minimizes KL(p q): arg min q P KL(p, q) = arg mine p [log q P ( )] p(x ) q(x ) Don t know p, so replace expectation by average over D: { 1 n [ ] } p(xi ) arg min log q P n q(x i ) i=1 avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 17 / 18

Estimated KL-Divergence The estimated KL-divergence: 1 n = 1 n n [ ] p(xi ) log q(x i ) n logp(x i ) 1 n i=1 i=1 The minimizer of this over q P is also arg max q P n logq(x i ). i=1 n logq(x i ). This is exactly the objective for the MLE. Minimizing KL between model and truth leads to MLE. avid Rosenberg (New York University) DS-GA 1003 June 15, 2015 18 / 18 i=1