TTIC 31230, Fundamentals of Deep Learning
David McAllester, April 2017

Information Theory and Distribution Modeling

Why do we model distributions and conditional distributions using the following objective functions?

Θ* = argmin_Θ E_{x∼D} [ ln 1/P_Θ(x) ]

Θ* = argmin_Θ E_{(x,y)∼D} [ ln 1/P_Θ(y|x) ]

Why is bits per word the natural measure of the performance of a language model? How is bits per sample related to actual data compression?

Shannon's Source Coding (Compression) Theorem

Consider a data distribution D such as the natural distribution on sentences. Shannon's theorem states that the average compressed size (in bits) under optimal compression when drawing x from D is the entropy H(D):

H(D) = E_{x∼D} [ log_2 1/D(x) ]

Note that if D is the uniform distribution on 2^N items then it takes N bits to name one of the items.
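As a concrete check of the definition, here is a minimal numpy sketch (the skewed distribution is a made-up example, not from the lecture):

```python
import numpy as np

def entropy_bits(p):
    """H(D) = E_{x~D}[log2 1/D(x)] for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # outcomes with D(x) = 0 contribute nothing
    return float(np.sum(p * np.log2(1.0 / p)))

# Uniform distribution on 2^N items has entropy exactly N bits.
N = 4
print(entropy_bits(np.full(2**N, 1 / 2**N)))   # 4.0

# A skewed distribution needs fewer bits on average.
print(entropy_bits([0.5, 0.25, 0.125, 0.125])) # 1.75
```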

Shannon's Source Coding (Compression) Theorem

Consider a probability distribution D on a finite set X. We define a tree T over X to be a binary branching tree whose leaves are labeled with (all) the elements of X. Let d(x; T) be the depth of the leaf that is labeled with x. We can name each element with a bit string of length d(x; T).

Define d(T; D) = E_{x∼D} [ d(x; T) ] = average compressed size.

Theorem:

∀T  d(T; D) ≥ H(D)

∃T  d(T; D) ≤ H(D) + 1
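A quick numerical illustration of the existence half of the theorem (a sketch I'm adding, on a made-up distribution): assigning each x a codeword of length ⌈log_2 1/D(x)⌉ satisfies the Kraft inequality, so a prefix tree with these depths exists, and its average depth lies between H(D) and H(D) + 1.

```python
import math

D = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}   # toy distribution (made up)

# Shannon code lengths: d(x; T) = ceil(log2 1/D(x))
lengths = {x: math.ceil(math.log2(1 / p)) for x, p in D.items()}

kraft = sum(2 ** -l for l in lengths.values())     # <= 1, so a prefix tree exists
avg_depth = sum(p * lengths[x] for x, p in D.items())
H = sum(p * math.log2(1 / p) for p in D.values())

print(lengths)                  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}
print(kraft <= 1.0)             # True
print(H <= avg_depth <= H + 1)  # True
```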

Huffman Coding

Maintain a list of trees T_1, ..., T_N. Initially each tree is just one root node labeled with an element of X. Each tree T_i has a weight equal to the sum of the probabilities of the elements labeling its leaves. Repeatedly merge the two trees of lowest weight into a single tree until all trees are merged.
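A minimal implementation of this merge procedure, keyed on tree weight with a heap (a sketch; the helper names are mine, not the lecture's):

```python
import heapq

def huffman_code(probs):
    """probs: dict mapping symbols to probabilities. Returns symbol -> bit string."""
    # Each heap entry is (weight, counter, tree); the unique counter breaks ties
    # so the tree objects themselves are never compared.
    heap = [(p, i, x) for i, (x, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two trees of lowest weight
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))  # merge them
        counter += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: recurse on both children
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf labeled with a symbol
            code[tree] = prefix or "0"
    walk(heap[0][2], "")
    return code

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
# e.g. {'a': '0', 'b': '10', 'd': '110', 'c': '111'} (exact bit labels may vary)
```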

Optimality of Huffman Coding

Theorem: The Huffman code T for D is optimal: for any other tree T′ we have d(T′; D) ≥ d(T; D).

Proof: The algorithm maintains the invariant that there exists an optimal tree containing all the subtrees on the list. To prove that a merge operation maintains this invariant, consider any optimal tree containing the given subtrees, and consider the two subtrees T_i and T_j of minimal weight. Without loss of generality we can assume that T_i is at least as deep as T_j in that tree. Swapping the sibling of T_i for T_j brings T_i and T_j together and can only improve the average depth, so the merged tree is still contained in an optimal tree.

Modeling a Distribution

Θ* = argmin_Θ H(D, P_Θ)

H(D, P_Θ) = cross entropy = E_{x∼D} [ log_2 1/P_Θ(x) ]
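A toy version of this objective (a sketch I'm adding, not from the slides): fit a categorical model P_Θ by gradient descent on the empirical cross entropy; the minimizer recovers the empirical frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.array([0.6, 0.3, 0.1])                      # true distribution, unknown to the learner
data = rng.choice(3, size=10_000, p=D)             # samples x ~ D
counts = np.bincount(data, minlength=3) / len(data)  # empirical frequencies

theta = np.zeros(3)                                # model: P_theta(x) = softmax(theta)_x
for _ in range(2000):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    grad = p - counts                              # gradient of empirical E_x[ln 1/P_theta(x)]
    theta -= 0.5 * grad

print(np.round(p, 3))        # close to the empirical frequencies, hence close to D
print(np.round(counts, 3))
```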

Distribution Modeling and Data Compression

Theorem: For any P_Θ there exists a code T such that for all x ∈ X

log_2 1/P_Θ(x) ≤ d(x; T) ≤ log_2 1/P_Θ(x) + 1

Optimal average compressed size is achieved by

Θ* = argmin_Θ H(D, P_Θ) = argmin_Θ E_{x∼D} [ log_2 1/P_Θ(x) ]

Minimizing cross entropy is the same as optimizing data compression is the same as distribution modeling.
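To make the connection concrete (a sketch on a made-up data distribution and model): coding with lengths ⌈log_2 1/P_Θ(x)⌉ while data is drawn from D costs, on average, within one bit of the cross entropy H(D, P_Θ), so a better model means shorter codes.

```python
import math

D       = {"a": 0.5, "b": 0.3, "c": 0.2}   # data distribution (made up)
P_theta = {"a": 0.4, "b": 0.4, "c": 0.2}   # an imperfect model of D

lengths   = {x: math.ceil(math.log2(1 / P_theta[x])) for x in D}  # code built from the model
avg_bits  = sum(D[x] * lengths[x] for x in D)                     # but data comes from D
cross_ent = sum(D[x] * math.log2(1 / P_theta[x]) for x in D)
entropy   = sum(D[x] * math.log2(1 / D[x]) for x in D)

print(cross_ent <= avg_bits <= cross_ent + 1)  # True: coding with P_theta costs H(D, P_theta) + at most 1
print(entropy <= cross_ent)                    # True: the best possible model would cost H(D)
```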

Cross Entropy vs. Entropy

An LSTM language model allows us to calculate the probability of a given sentence. This allows us to measure H(D, P_Θ) by sampling.

While we can measure the cross entropy H(D, P_Θ), we cannot measure the true entropy of the source H(D), which, for language, presumably involves semantic truth. But we can show

H(D) ≤ H(D, P)

The cross entropy to the model upper bounds the true entropy of the data source.
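A minimal sketch of measuring bits per word by sampling, with a toy unigram "language model" standing in for an LSTM (the model and sentences are made up for illustration):

```python
import math

# Toy unigram model: P_theta(sentence) = product of per-word probabilities.
P_word = {"the": 0.4, "cat": 0.2, "dog": 0.2, "runs": 0.2}

samples = [                      # pretend these were drawn from the data distribution D
    ["the", "cat", "runs"],
    ["the", "dog", "runs"],
    ["the", "cat", "runs"],
]

total_bits  = sum(math.log2(1 / P_word[w]) for s in samples for w in s)
total_words = sum(len(s) for s in samples)

print(total_bits / total_words)  # Monte Carlo estimate of H(D, P_theta) in bits per word
```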

KL Divergence

The KL divergence is

KL(D, P) = H(D, P) − H(D) = E_{x∼D} [ log_2 D(x)/P(x) ]

We can show KL(D, P) ≥ 0 using Jensen's inequality applied to the convexity of the negative of the log function.

KL Divergence

−KL(D, P) = E_{x∼D} [ log P(x)/D(x) ]
          ≤ log E_{x∼D} [ P(x)/D(x) ]    (Jensen's inequality)
          = log Σ_x D(x) · P(x)/D(x)
          = log Σ_x P(x)
          = 0

Hence KL(D, P) ≥ 0.
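Checking this numerically (a sketch on made-up distributions):

```python
import numpy as np

def kl_bits(d, p):
    """KL(D, P) = E_{x~D}[log2 D(x)/P(x)]."""
    d, p = np.asarray(d, float), np.asarray(p, float)
    return float(np.sum(d * np.log2(d / p)))

D = np.array([0.5, 0.3, 0.2])
P = np.array([0.2, 0.5, 0.3])

print(kl_bits(D, P) >= 0)   # True, as Jensen's inequality guarantees
print(kl_bits(D, D))        # 0.0: the divergence vanishes exactly when P = D
```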

Rate-Distortion Autoencoders [figure credit: Kevin Frans]

Rate-Distortion Autoencoders

Rate-distortion theory addresses lossy compression. We assume:

An encoder (compression) network z_Φ(x), where z_Φ(x) is a bit string in a prefix-free code (a code corresponding to the leaves of a binary tree). We write |z| for the number of bits in the string z.

A decoder (decompression) network x̂_Ψ(z).

A distortion function L(x, x̂).

Φ*, Ψ* = argmin_{Φ,Ψ} E_{x∼D} [ |z_Φ(x)| + λ L(x, x̂_Ψ(z_Φ(x))) ]
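A toy illustration of this objective (a sketch I'm adding, not the lecture's networks): the "encoder" is a uniform scalar quantizer emitting k bits per sample, the "decoder" maps the bits back to a reconstruction, and we evaluate E[|z| + λ·L(x, x̂)] for several rates. The rate that minimizes the printed value is the best tradeoff for this quantizer family at the chosen λ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                  # data x ~ D (toy: standard normal)
lam = 100.0                                  # tradeoff weight lambda

for k in range(1, 9):                        # k = number of bits per sample, so |z| = k
    levels = 2 ** k
    lo, hi = x.min(), x.max()
    z = np.clip(((x - lo) / (hi - lo) * levels).astype(int), 0, levels - 1)  # encoder
    x_hat = lo + (z + 0.5) * (hi - lo) / levels                              # decoder
    distortion = np.mean((x - x_hat) ** 2)                                   # L(x, x_hat) = squared error
    print(k, round(k + lam * distortion, 2))                                 # rate + lambda * distortion
```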

Summary of Distribution Modeling

Distribution modeling is important when the distribution being modeled (D(x) or D(y|x)) is spread over many possible outcomes, so that precise prediction is impossible.

Mathematically, distribution modeling (minimizing cross entropy) is the same as optimizing data compression.

Summary of Distribution Modeling

Θ* = argmin_Θ H(D, P_Θ)

Conditional version: Θ* = argmin_Θ E_{x∼D} H(D(y|x), P_Θ(y|x))

H(D, P) = E_{x∼D} [ log_2 1/P(x) ]

H(D) = E_{x∼D} [ log_2 1/D(x) ]

H(D, P) ≥ H(D)

KL(D, P) = H(D, P) − H(D) = E_{x∼D} [ log_2 D(x)/P(x) ] ≥ 0

Summary of Distribution Modeling

Θ* = argmin_Θ H(D, P_Θ)

Consistency: If there exists Θ with P_Θ = D then P_{Θ*} = D. This follows from

H(D, D) = H(D) ≤ H(D, P)

Methods of Modeling Distributions

Structured Prediction: P(y|x) = softmax_y W_Θ(x) · Φ(y), where this is an exponential softmax (see the sketch below).

Rate-Distortion Autoencoding.

Variational Autoencoding.

Generative Adversarial Networks.
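A minimal sketch of the exponential softmax for structured prediction (the stand-in feature maps W_theta and Phi below are random placeholders of my own, not anything defined in the course):

```python
import numpy as np

rng = np.random.default_rng(0)

def W_theta(x):            # stand-in for a learned network mapping x to a score vector
    return rng.normal(size=8)

def Phi(y):                # stand-in for a feature map on structured outputs y
    return rng.normal(size=8)

x = "some input"
ys = ["y1", "y2", "y3"]    # candidate structured outputs

w = W_theta(x)
scores = np.array([w @ Phi(y) for y in ys])   # W_theta(x) . Phi(y) for each candidate y
P = np.exp(scores - scores.max())
P /= P.sum()                                  # P(y|x) = softmax_y W_theta(x) . Phi(y)
print(dict(zip(ys, np.round(P, 3))))
```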

END