Information Theory Primer


Information Theory Primer: Entropy, KL Divergence, Mutual Information, Jensen's Inequality
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
http://mlg.postech.ac.kr/~seungjin

Outline
- Entropy (Shannon entropy, differential entropy, and conditional entropy)
- Kullback-Leibler (KL) divergence
- Mutual information
- Jensen's inequality and Gibbs' inequality

Information Theory
Information theory answers two fundamental questions in communication theory:
- What is the ultimate data compression? The entropy H.
- What is the ultimate transmission rate of communication? The channel capacity C.
In the early 1940s it was thought that increasing the transmission rate of information over a communication channel increased the probability of error. Shannon surprised the communication theory community by proving that this is not true as long as the communication rate stays below the channel capacity. Although information theory was developed for communications, it also helps explain the ecological theory of sensory processing, and it plays a key role in elucidating the goal of unsupervised learning.

Shannon Entropy

Information and Entropy
Information can be thought of as surprise, uncertainty, or unexpectedness. Mathematically it is defined by I = -\log p_i, where p_i is the probability that the event labelled i occurs. A rare event gives large information and a frequent event gives small information. Entropy is the average information, i.e.,
H = \mathbb{E}[I] = -\sum_{i=1}^{N} p_i \log p_i.
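
As a quick numerical aside (not in the original slides), here is a minimal Python sketch of surprisal and entropy; numpy is assumed to be available, and the example distribution is arbitrary.

```python
import numpy as np

def surprisal(p):
    """Information content I = -log2(p) of an event with probability p, in bits."""
    return -np.log2(p)

def entropy(p):
    """Shannon entropy H = -sum_i p_i log2 p_i in bits (terms with p_i = 0 are skipped)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(surprisal(0.5), surprisal(0.25))   # 1.0 bit (frequent), 2.0 bits (rarer)
print(entropy([0.5, 0.25, 0.25]))        # 1.5 bits
```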

Example: Horse Race
Suppose we have a horse race with eight horses taking part. Assume that the probabilities of winning for the eight horses are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). Suppose that we wish to send a message to another person indicating which horse won the race. How many bits are required to describe this for each of the horses? 3 bits for any of the horses?

No! The win probabilities are not uniform. It makes sense to use shorter descriptions for the more probable horses and longer descriptions for the less probable ones, so that we achieve a lower average description length. For example, we can use the following strings to represent the eight horses: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. The average description length in this case is 2 bits, as opposed to 3 bits for the uniform code. This matches the entropy:
H = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{8}\log_2\frac{1}{8} - \frac{1}{16}\log_2\frac{1}{16} - 4 \cdot \frac{1}{64}\log_2\frac{1}{64} = 2 \text{ bits}.
The entropy of a random variable is a lower bound on the average number of bits required to represent the random variable, and also on the average number of questions needed to identify the variable in a game of twenty questions.
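
A short sketch (numpy assumed; not part of the slides) that reproduces these numbers: both the entropy of the win distribution and the average length of the variable-length code above come out to 2 bits.

```python
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy_bits = -np.sum(p * np.log2(p))
avg_code_len = np.sum(p * np.array([len(c) for c in codes]))

print(entropy_bits)   # 2.0 bits
print(avg_code_len)   # 2.0 bits, meeting the entropy lower bound
```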

Shannon Entropy
Given a discrete random variable X with image \mathcal{X}, the (Shannon) entropy is the average information (a measure of uncertainty), defined by
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathbb{E}_p[-\log p(x)].
Properties:
- H(X) \geq 0 (since each term in the summation is nonnegative).
- H(X) = 0 if and only if P[X = x] = 1 for some x \in \mathcal{X}.
- The entropy is maximal if all the outcomes are equally likely.
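
A brief numerical illustration of these properties (a sketch, with arbitrary example distributions; numpy assumed):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits: a certain outcome carries no uncertainty
print(entropy([0.7, 0.1, 0.1, 0.1]))      # about 1.36 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits = log2(4), the maximum over 4 outcomes
```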

Differential Entropy
Given a continuous random variable X, the differential entropy is defined as
H(p) = -\int p(x) \log p(x)\, dx = \mathbb{E}_p[-\log p(x)].
Properties:
- It can be negative.
- For a fixed variance, the Gaussian distribution achieves the maximal differential entropy.
- For x \sim \mathcal{N}(\mu, \sigma^2), H(x) = \frac{1}{2}\log(2\pi e \sigma^2).
- For x \sim \mathcal{N}(\mu, \Sigma), H(x) = \frac{1}{2}\log\det(2\pi e \Sigma).
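
As a sketch (numpy only; the grid limits and resolution are arbitrary choices), the closed-form differential entropy of a univariate Gaussian can be compared with a direct numerical evaluation of -\int p(x) \log p(x)\, dx, both in nats:

```python
import numpy as np

mu, sigma = 0.0, 2.0

# Closed form: H = 0.5 * log(2*pi*e*sigma^2), in nats
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Riemann-sum approximation of -integral p(x) log p(x) dx on a wide grid
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
h_numeric = -np.sum(p * np.log(p)) * (x[1] - x[0])

print(h_closed, h_numeric)   # both about 2.112 nats
# With sigma small enough (sigma^2 < 1/(2*pi*e)) the differential entropy is negative.
```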

Conditional Entropy
Given two discrete random variables X and Y with images \mathcal{X} and \mathcal{Y}, respectively, we expand the joint entropy:
H(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)}
= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x) p(y|x)}
= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(y|x)}
= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{1}{p(y|x)}
= H(X) + \sum_{x \in \mathcal{X}} p(x) H(Y | X = x),
where the last sum is, by definition, the conditional entropy H(Y|X).
Chain rule: H(X, Y) = H(X) + H(Y|X).

Moreover, H(X, Y) \leq H(X) + H(Y) and H(Y|X) \leq H(Y). Try to prove these by yourself! (A numerical check is sketched below.)
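
A minimal numerical check of the chain rule and of these two inequalities, using a small hand-picked joint distribution p(x, y) (not from the slides; numpy assumed):

```python
import numpy as np

# A small joint distribution p(x, y) over |X| = 2 and |Y| = 3 outcomes
pxy = np.array([[0.20, 0.10, 0.10],
                [0.05, 0.25, 0.30]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px = pxy.sum(axis=1)    # marginal p(x)
py = pxy.sum(axis=0)    # marginal p(y)

H_xy = H(pxy.ravel())   # joint entropy H(X, Y)
H_x, H_y = H(px), H(py)

# Conditional entropy H(Y|X) = sum_x p(x) H(Y | X = x)
H_y_given_x = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))

print(np.isclose(H_xy, H_x + H_y_given_x))   # True: chain rule
print(H_xy <= H_x + H_y)                     # True
print(H_y_given_x <= H_y)                    # True
```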

KL Divergence

Relative Entropy (Kullback-Leibler Divergence)
Introduced by Solomon Kullback and Richard Leibler in 1951, the relative entropy is a measure of how one probability distribution q diverges from another distribution p, written D_{KL}[p \| q] (the KL divergence of q(x) from p(x)).
For discrete probability distributions p and q:
D_{KL}[p \| q] = \sum_x p(x) \log \frac{p(x)}{q(x)}.
For probability distributions p and q of continuous random variables:
D_{KL}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)}\, dx.
Properties of the KL divergence:
- It is not symmetric: D_{KL}[p \| q] \neq D_{KL}[q \| p].
- It is always nonnegative: D_{KL}[p \| q] \geq 0 (Gibbs' inequality).
- It is a convex function on the domain of probability distributions.
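
A sketch of the discrete KL divergence in Python/numpy, illustrating the asymmetry; the example distributions are arbitrary, and q is assumed to be positive wherever p is positive.

```python
import numpy as np

def kl(p, q):
    """D_KL[p || q] = sum_i p_i log(p_i / q_i) in nats; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

print(kl(p, q), kl(q, p))   # two different nonnegative values: the divergence is not symmetric
print(kl(p, p))             # 0.0: the divergence vanishes when the distributions coincide
```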

Theorem (Convexity of divergence). Let (p_1, q_1) and (p_2, q_2) be pairs of probability distributions over a random variable X, and for \lambda \in (0, 1) define p = \lambda p_1 + (1 - \lambda) p_2 and q = \lambda q_1 + (1 - \lambda) q_2. Then
D_{KL}[p \| q] \leq \lambda D_{KL}[p_1 \| q_1] + (1 - \lambda) D_{KL}[p_2 \| q_2].
Proof: deferred to the end of this lecture.
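
A quick numerical sanity check of this statement (a sketch with randomly drawn distributions; numpy assumed):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))   # assumes strictly positive entries

rng = np.random.default_rng(1)
p1, q1 = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
p2, q2 = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))

lam = 0.3
p = lam * p1 + (1 - lam) * p2
q = lam * q1 + (1 - lam) * q2

lhs = kl(p, q)
rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
print(lhs <= rhs)   # True: the divergence is jointly convex in (p, q)
```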

Entropy and Divergence
The entropy of a random variable X with probability distribution p(x) is related to how much p(x) diverges from the uniform distribution on the support of X:
H(X) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}
= -\sum_{x \in \mathcal{X}} p(x) \log \frac{|\mathcal{X}| p(x)}{|\mathcal{X}|}
= \log |\mathcal{X}| - \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{1/|\mathcal{X}|}
= \log |\mathcal{X}| - D_{KL}[p \| \text{unif}].
The more p(x) diverges from uniform, the lower its entropy, and vice versa.
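
A one-line check of this identity (a sketch with an arbitrary distribution; numpy assumed):

```python
import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])
u = np.full_like(p, 1.0 / len(p))       # uniform distribution on the same support

H = -np.sum(p * np.log(p))              # entropy in nats
kl_to_unif = np.sum(p * np.log(p / u))  # D_KL[p || unif]

print(np.isclose(H, np.log(len(p)) - kl_to_unif))   # True: H(X) = log|X| - D_KL[p || unif]
```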

Recall
D_{KL}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)}\, dx.
Characterizing the KL divergence:
- Where both p and q are high, we are happy.
- Where p is high but q is not, we pay a price.
- Where p is low, we do not care.
- If D_{KL} = 0, the distributions are equal.


KL Divergence of Two Gaussians
Two univariate Gaussians (x \in \mathbb{R}), p(x) = \mathcal{N}(\mu_1, \sigma_1^2) and q(x) = \mathcal{N}(\mu_2, \sigma_2^2):
D_{KL}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)}\, dx = \int p(x) \log p(x)\, dx - \int p(x) \log q(x)\, dx
= \frac{1}{2} \frac{\sigma_1^2}{\sigma_2^2} + \frac{1}{2} \frac{(\mu_2 - \mu_1)^2}{\sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} - \frac{1}{2}.
Two multivariate Gaussians (x \in \mathbb{R}^D), p(x) = \mathcal{N}(\mu_1, \Sigma_1) and q(x) = \mathcal{N}(\mu_2, \Sigma_2):
D_{KL}[p \| q] = \frac{1}{2} \left[ \mathrm{tr}\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - D + \log \frac{|\Sigma_2|}{|\Sigma_1|} \right].
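
A sketch comparing the univariate closed form above with a Monte Carlo estimate of \mathbb{E}_p[\log p(x) - \log q(x)]; the parameter values are arbitrary and numpy is assumed.

```python
import numpy as np

mu1, s1 = 0.0, 1.0   # p = N(mu1, s1^2)
mu2, s2 = 1.0, 2.0   # q = N(mu2, s2^2)

# Closed form in nats
kl_closed = 0.5 * s1**2 / s2**2 + 0.5 * (mu2 - mu1)**2 / s2**2 + np.log(s2 / s1) - 0.5

# Monte Carlo estimate of E_p[log p(x) - log q(x)]
rng = np.random.default_rng(0)
x = rng.normal(mu1, s1, size=1_000_000)
log_p = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1 * np.sqrt(2 * np.pi))
log_q = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2 * np.sqrt(2 * np.pi))
kl_mc = np.mean(log_p - log_q)

print(kl_closed, kl_mc)   # both approximately 0.443 nats
```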

Mutual Information

Mutual Information
Mutual information is the relative entropy between the joint distribution and the product of the marginal distributions:
I(x, y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
= D_{KL}[p(x, y) \| p(x) p(y)]
= \mathbb{E}_{p(x,y)}\left[ \log \frac{p(x, y)}{p(x) p(y)} \right].
Mutual information can be interpreted as the reduction in the uncertainty of x due to the knowledge of y, i.e.,
I(x, y) = H(x) - H(x|y),
where H(x|y) = -\mathbb{E}_{p(x,y)}[\log p(x|y)] is the conditional entropy.
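
A numerical sketch (hand-picked 2x2 joint table, numpy assumed) computing the mutual information both as D_KL[p(x, y) || p(x) p(y)] and as H(x) - H(x|y):

```python
import numpy as np

pxy = np.array([[0.30, 0.10],
                [0.10, 0.50]])               # joint distribution p(x, y)
px = pxy.sum(axis=1, keepdims=True)          # marginal p(x) as a column
py = pxy.sum(axis=0, keepdims=True)          # marginal p(y) as a row

# I(x, y) = D_KL[p(x, y) || p(x) p(y)]
I_kl = np.sum(pxy * np.log2(pxy / (px * py)))

# I(x, y) = H(x) - H(x|y)
H_x = -np.sum(px * np.log2(px))
H_x_given_y = -np.sum(pxy * np.log2(pxy / py))   # -E_{p(x,y)}[log2 p(x|y)]
I_diff = H_x - H_x_given_y

print(I_kl, I_diff)   # identical, about 0.26 bits for this table
```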

Convexity, Jensen's Inequality, and Gibbs' Inequality

Convex Sets and Functions
Definition (Convex set). Let C be a subset of \mathbb{R}^m. C is called a convex set if
\alpha x + (1 - \alpha) y \in C, \quad \forall x, y \in C, \ \forall \alpha \in [0, 1].
Definition (Convex function). Let C be a convex subset of \mathbb{R}^m. A function f : C \to \mathbb{R} is called a convex function if
f(\alpha x + (1 - \alpha) y) \leq \alpha f(x) + (1 - \alpha) f(y), \quad \forall x, y \in C, \ \forall \alpha \in [0, 1].

Jensen's Inequality
Theorem (Jensen's inequality). If f(x) is a convex function and x is a random vector, then
\mathbb{E}[f(x)] \geq f(\mathbb{E}[x]).
Note: Jensen's inequality can also be stated for a concave function, with the direction of the inequality reversed.
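
A small simulation sketch (numpy, with an arbitrary random variable and the convex function f(x) = x^2): the sample mean of f(x) stays above f applied to the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # an arbitrary nonnegative random variable

def f(t):
    return t ** 2                              # a convex function

lhs = np.mean(f(x))    # Monte Carlo estimate of E[f(x)]
rhs = f(np.mean(x))    # f(E[x])

print(lhs, rhs, lhs >= rhs)   # E[f(x)] >= f(E[x]) for convex f (about 8.0 vs 4.0 here)
```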

Proof of Jensen's Inequality
We need to show that \sum_{i=1}^{N} p_i f(x_i) \geq f\left(\sum_{i=1}^{N} p_i x_i\right). The proof is based on a recursion, working from the right-hand side of this inequality:
f\left(\sum_{i=1}^{N} p_i x_i\right) = f\left(p_1 x_1 + \sum_{i=2}^{N} p_i x_i\right)
\leq p_1 f(x_1) + \left(\sum_{i=2}^{N} p_i\right) f\left(\frac{\sum_{i=2}^{N} p_i x_i}{\sum_{i=2}^{N} p_i}\right) \quad \left(\text{choose } \alpha = \frac{p_1}{\sum_{i=1}^{N} p_i}\right)
\leq p_1 f(x_1) + \left(\sum_{i=2}^{N} p_i\right) \left[\alpha f(x_2) + (1 - \alpha) f\left(\frac{\sum_{i=3}^{N} p_i x_i}{\sum_{i=3}^{N} p_i}\right)\right] \quad \left(\text{choose } \alpha = \frac{p_2}{\sum_{i=2}^{N} p_i}\right)
= p_1 f(x_1) + p_2 f(x_2) + \left(\sum_{i=3}^{N} p_i\right) f\left(\frac{\sum_{i=3}^{N} p_i x_i}{\sum_{i=3}^{N} p_i}\right),
and so forth.

Gibbs' Inequality
Theorem. D_{KL}[p \| q] \geq 0, with equality iff p = q.
Proof: Consider the Kullback-Leibler divergence for discrete distributions:
D_{KL}[p \| q] = \sum_i p_i \log \frac{p_i}{q_i} = -\sum_i p_i \log \frac{q_i}{p_i}
\geq -\log \left[ \sum_i p_i \frac{q_i}{p_i} \right] \quad \text{(by Jensen's inequality)}
= -\log \left[ \sum_i q_i \right] = 0.
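
A sketch that checks Gibbs' inequality empirically on randomly drawn distributions (numpy assumed): every divergence is nonnegative, and it vanishes when p = q.

```python
import numpy as np

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(6))   # random probability vectors over 6 outcomes
    q = rng.dirichlet(np.ones(6))
    assert kl(p, q) >= 0            # Gibbs' inequality: D_KL[p || q] >= 0
    assert np.isclose(kl(p, p), 0)  # equality when p = q
print("Gibbs' inequality held for all sampled pairs.")
```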

More on Gibbs' Inequality
To find the distribution p which minimizes D_{KL}[p \| q], consider the Lagrangian
E = D_{KL}[p \| q] + \lambda \left(1 - \sum_i p_i\right) = \sum_i p_i \log \frac{p_i}{q_i} + \lambda \left(1 - \sum_i p_i\right).
Compute the partial derivative \partial E / \partial p_k and set it to zero:
\frac{\partial E}{\partial p_k} = \log p_k - \log q_k + 1 - \lambda = 0,
which leads to p_k = q_k e^{\lambda - 1}. It follows from \sum_i p_i = 1 that \sum_i q_i e^{\lambda - 1} = 1, which gives \lambda = 1 and therefore p_i = q_i. The Hessian,
\frac{\partial^2 E}{\partial p_i^2} = \frac{1}{p_i}, \quad \frac{\partial^2 E}{\partial p_i \partial p_j} = 0 \ (i \neq j),
is positive definite, which shows that p_i = q_i is a genuine minimum.