Machine Learning. Lecture 02.2: Basics of Information Theory. Nevin L. Zhang

Machine Learning, Lecture 02.2: Basics of Information Theory. Nevin L. Zhang (lzhang@cse.ust.hk), Department of Computer Science and Engineering, The Hong Kong University of Science and Technology.

Jensen's Inequality. Outline: 1 Jensen's Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

Jensen's Inequality: Concave functions. A function $f$ is concave on an interval $I$ if for any $x, y \in I$ and any $\lambda \in [0, 1]$, $\lambda f(x) + (1-\lambda) f(y) \le f(\lambda x + (1-\lambda) y)$. In words, the weighted average of the function values is upper bounded by the function of the weighted average. $f$ is strictly concave if equality holds only when $x = y$.

Jensen's Inequality. Theorem (1.1): Suppose the function $f$ is concave on an interval $I$. Then for any $p_i \in [0, 1]$ with $\sum_{i=1}^n p_i = 1$ and any $x_i \in I$, $\sum_{i=1}^n p_i f(x_i) \le f\big(\sum_{i=1}^n p_i x_i\big)$. The weighted average of the function values is upper bounded by the function of the weighted average. If $f$ is strictly concave, equality holds iff $p_i p_j \neq 0$ implies $x_i = x_j$. Exercise: Prove this (using induction).
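As a quick numerical illustration of Theorem 1.1 (a minimal Python sketch, assuming NumPy; the distributions and points are randomly generated, not from the slides): for the concave function $f = \log$, the weighted average of $f(x_i)$ never exceeds $f$ of the weighted average.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    n = rng.integers(2, 6)
    p = rng.dirichlet(np.ones(n))       # random probability vector p_1..p_n
    x = rng.uniform(0.1, 10.0, size=n)  # points x_i in (0, infinity)
    lhs = np.sum(p * np.log(x))         # weighted average of f(x_i)
    rhs = np.log(np.sum(p * x))         # f of the weighted average
    assert lhs <= rhs + 1e-12           # Jensen: lhs <= rhs for concave f
print("Jensen's inequality held in all trials")
```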

Jensen's Inequality: The logarithmic function. The logarithmic function is concave on the interval $(0, \infty)$. Hence $\sum_{i=1}^n p_i \log(x_i) \le \log\big(\sum_{i=1}^n p_i x_i\big)$ for $x_i \in (0, \infty)$. In words, moving $\log$ outside the weighted sum $\sum_i p_i$ can only increase the quantity.

Entropy. Outline: 1 Jensen's Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

Entropy. The entropy of a random variable $X$ is $H(X) = \sum_X P(X) \log \frac{1}{P(X)} = -E_P[\log P(X)]$, with the convention that $0 \log(1/0) = 0$. The base of the logarithm is 2 and the unit is the bit. It is sometimes also called the entropy of the distribution.
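A minimal sketch of this definition in Python (assuming NumPy; base-2 logarithms, with the convention $0 \log(1/0) = 0$):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                       # convention: 0 * log(1/0) = 0
    return float(-np.sum(nz * np.log2(nz)))

print(entropy([0.5, 0.5]))   # 1.0 bit
print(entropy([1.0, 0.0]))   # 0.0 bits: no uncertainty
```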

Entropy. $H(X)$ measures uncertainty about $X$. Example: $X$ binary. [Figure: the binary entropy function, $H(X)$ plotted as a function of $p = P(X=1)$.] The higher $H(X)$ is, the more uncertainty there is about the value of $X$.

Entropy. Another example: $X$ is the result of a coin toss, $Y$ the result of a die throw, and $Z$ the result of randomly picking a card from a deck of 54. Which one has the highest uncertainty? Entropies: $H(X) = \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = \log 2 = 1$, $H(Y) = \frac{1}{6}\log 6 + \dots + \frac{1}{6}\log 6 = \log 6$, $H(Z) = \frac{1}{54}\log 54 + \dots + \frac{1}{54}\log 54 = \log 54$. Indeed we have $H(X) < H(Y) < H(Z)$.
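These three entropies can be checked numerically (a small Python sketch, assuming NumPy):

```python
import numpy as np

H = lambda p: -sum(q * np.log2(q) for q in p if q > 0)  # entropy in bits

H_coin = H([1/2] * 2)    # = log2(2)  = 1 bit
H_die  = H([1/6] * 6)    # = log2(6)  ~ 2.585 bits
H_card = H([1/54] * 54)  # = log2(54) ~ 5.755 bits
assert H_coin < H_die < H_card
```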

Entropy. Proposition (1.2): $H(X) \ge 0$, with equality iff $P(X=x) = 1$ for some $x \in \Omega_X$, i.e., iff there is no uncertainty. Moreover, $H(X) \le \log |\Omega_X|$, with equality iff $P(X=x) = 1/|\Omega_X|$ for all $x$: uncertainty is highest for the uniform distribution. Proof: Because $\log$ is concave, by Jensen's inequality, $H(X) = \sum_X P(X) \log \frac{1}{P(X)} \le \log\big(\sum_X P(X) \frac{1}{P(X)}\big) = \log |\Omega_X|$.

Entropy: Conditional entropy. The conditional entropy of $X$ given the event $Y=y$ is the entropy of the conditional distribution $P(X \mid Y=y)$, i.e., $H(X \mid Y=y) = \sum_X P(X \mid Y=y) \log \frac{1}{P(X \mid Y=y)}$. It is the uncertainty that remains about $X$ when $Y$ is known to be $y$. It is possible that $H(X \mid Y=y) > H(X)$: intuitively, $Y=y$ might contradict our prior knowledge about $X$ and increase our uncertainty about $X$. Exercise: Give an example.

Entropy: Conditional entropy. The conditional entropy of $X$ given the variable $Y$: $H(X \mid Y) = \sum_{y \in \Omega_Y} P(Y=y) H(X \mid Y=y) = \sum_Y P(Y) \sum_X P(X \mid Y) \log \frac{1}{P(X \mid Y)} = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X \mid Y)} = -E[\log P(X \mid Y)]$. It is the average uncertainty that remains about $X$ when $Y$ is known.
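A sketch of this computation in Python (assuming NumPy; the joint table below is a made-up example, not from the slides): $H(X \mid Y)$ is the $P(Y)$-weighted average of the entropies of the conditional columns.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y); rows index x, columns index y.
P_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

H = lambda p: float(-sum(q * np.log2(q) for q in np.ravel(p) if q > 0))

P_y = P_xy.sum(axis=0)                              # marginal P(Y)
H_X_given_Y = sum(P_y[j] * H(P_xy[:, j] / P_y[j])   # sum_y P(y) H(X | Y=y)
                  for j in range(len(P_y)))
print(H_X_given_Y)
```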

Entropy: Joint entropy. The joint entropy of $X$ and $Y$: $H(X,Y) = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X,Y)}$. Chain rule: $H(X,Y) = H(X) + H(Y \mid X) = H(Y,X) = H(Y) + H(X \mid Y)$. Proof: $\sum_{X,Y} P(X,Y) \log \frac{1}{P(X,Y)} = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X)P(Y \mid X)} = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X)} + \sum_{X,Y} P(X,Y) \log \frac{1}{P(Y \mid X)} = \sum_X P(X) \log \frac{1}{P(X)} + H(Y \mid X) = H(X) + H(Y \mid X)$.
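The chain rule can be checked numerically on the same kind of made-up joint table (a Python sketch, assuming NumPy):

```python
import numpy as np

P_xy = np.array([[0.3, 0.1],     # hypothetical joint distribution P(X, Y)
                 [0.2, 0.4]])

H = lambda p: float(-sum(q * np.log2(q) for q in np.ravel(p) if q > 0))

H_xy = H(P_xy)                                    # joint entropy H(X, Y)
P_x = P_xy.sum(axis=1)                            # marginal P(X)
H_y_given_x = sum(P_x[i] * H(P_xy[i, :] / P_x[i]) for i in range(len(P_x)))
assert np.isclose(H_xy, H(P_x) + H_y_given_x)     # H(X,Y) = H(X) + H(Y|X)
```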

Divergence. Outline: 1 Jensen's Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

Divergence: Kullback-Leibler divergence. The relative entropy, or Kullback-Leibler divergence, measures how much a distribution $Q(X)$ differs from a true probability distribution $P(X)$. The KL divergence of $Q$ from $P$ is defined as $KL(P \| Q) = \sum_X P(X) \log \frac{P(X)}{Q(X)} = E_P[\log P(X)] - E_P[\log Q(X)]$, with the conventions $0 \log \frac{0}{0} = 0$ and $p \log \frac{p}{0} = \infty$ if $p \neq 0$. It is not symmetric, so it is not a distance measure mathematically. The negation of the second term is called the cross entropy, $H(P, Q) = -E_P[\log Q(X)]$, and $H(P, Q) = KL(P \| Q) + H(P)$.
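A sketch of these definitions in Python (assuming NumPy; $P$ and $Q$ below are made-up distributions), also checking the relation $H(P,Q) = KL(P \| Q) + H(P)$ and the asymmetry of KL:

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

H_P   = -np.sum(P * np.log2(P))            # entropy H(P)
cross = -np.sum(P * np.log2(Q))            # cross entropy H(P, Q)
assert np.isclose(cross, kl(P, Q) + H_P)   # H(P,Q) = KL(P||Q) + H(P)
print(kl(P, Q), kl(Q, P))                  # generally not equal (asymmetric)
```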

Divergence: Kullback-Leibler divergence. Theorem (1.2) (Gibbs' inequality): $KL(P \| Q) \ge 0$, with equality iff $P$ is identical to $Q$. In other words, the KL divergence between $P$ and $Q$ is larger than 0 unless $P$ and $Q$ are identical. Proof: $KL(P \| Q) = \sum_X P(X) \log \frac{P(X)}{Q(X)} = -\sum_X P(X) \log \frac{Q(X)}{P(X)} \ge -\log\big(\sum_X P(X) \frac{Q(X)}{P(X)}\big)$ (Jensen's inequality) $= -\log \sum_X Q(X) = 0$.

Divergence: A corollary. Corollary (1.1) (Gibbs' inequality): $H(P, Q) \ge H(P)$, or equivalently $\sum_X P(X) \log Q(X) \le \sum_X P(X) \log P(X)$. In general, let $f(X)$ be a non-negative function. Then $\sum_X f(X) \log Q(X) \le \sum_X f(X) \log P^*(X)$, where $P^*(X) = f(X) / \sum_X f(X)$.

Divergence: Jensen-Shannon divergence. KL is not symmetric: $KL(P \| Q)$ is usually not equal to the reverse KL, $KL(Q \| P)$. The Jensen-Shannon divergence is one symmetrized version of KL: $JS(P \| Q) = \frac{1}{2} KL(P \| M) + \frac{1}{2} KL(Q \| M)$, where $M = \frac{P+Q}{2}$. Properties: $0 \le JS(P \| Q) \le \log 2$; $JS(P \| Q) = 0$ iff $P = Q$; $JS(P \| Q) = \log 2$ if $P$ and $Q$ have disjoint supports.
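A small Python sketch of the JS divergence (assuming NumPy; the distributions are made up), illustrating the disjoint-support case where $JS = \log 2 = 1$ bit:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence in bits; symmetric in p and q."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = np.array([0.5, 0.5, 0.0])
Q = np.array([0.0, 0.0, 1.0])
print(js(P, Q))   # = 1.0 = log2(2): disjoint supports
print(js(P, P))   # = 0.0: identical distributions
```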

Mutual Information. Outline: 1 Jensen's Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

Mutual Information. The mutual information of $X$ and $Y$: $I(X;Y) = H(X) - H(X \mid Y)$. It is the average reduction in uncertainty about $X$ from learning the value of $Y$, or the average amount of information $Y$ conveys about $X$.

Mutual Information: Mutual information and KL divergence. Note that $I(X;Y) = \sum_X P(X) \log \frac{1}{P(X)} - \sum_{X,Y} P(X,Y) \log \frac{1}{P(X \mid Y)} = \sum_{X,Y} P(X,Y) \log \frac{1}{P(X)} - \sum_{X,Y} P(X,Y) \log \frac{1}{P(X \mid Y)} = \sum_{X,Y} P(X,Y) \log \frac{P(X \mid Y)}{P(X)} = \sum_{X,Y} P(X,Y) \log \frac{P(X,Y)}{P(X)P(Y)} = KL(P(X,Y) \| P(X)P(Y))$. Since this form is symmetric in $X$ and $Y$, we get the equivalent definition $I(X;Y) = H(X) - H(X \mid Y) = I(Y;X) = H(Y) - H(Y \mid X)$.
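The two forms of $I(X;Y)$ can be checked against each other numerically (a Python sketch with a made-up joint table, assuming NumPy):

```python
import numpy as np

P_xy = np.array([[0.3, 0.1],     # hypothetical joint distribution P(X, Y)
                 [0.2, 0.4]])

H = lambda p: float(-sum(q * np.log2(q) for q in np.ravel(p) if q > 0))

P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

# I(X;Y) as KL between the joint and the product of marginals
I_kl = sum(P_xy[i, j] * np.log2(P_xy[i, j] / (P_x[i] * P_y[j]))
           for i in range(2) for j in range(2) if P_xy[i, j] > 0)

# I(X;Y) = H(X) - H(X|Y)
H_x_given_y = sum(P_y[j] * H(P_xy[:, j] / P_y[j]) for j in range(2))
assert np.isclose(I_kl, H(P_x) - H_x_given_y)
```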

Mutual Information: Property of mutual information. Theorem (1.3): $I(X;Y) \ge 0$, with equality iff $X \perp Y$. Interpretation: $X$ and $Y$ are independent iff $X$ contains no information about $Y$ and vice versa. Proof: Follows from the previous slide and Theorem 1.2.

Mutual Information: Conditional entropy revisited. Theorem (1.4): $H(X \mid Y) \le H(X)$, with equality iff $X \perp Y$. Observation reduces uncertainty on average, except in the case of independence. Proof: Follows from Theorem 1.3.

Mutual Information: Mutual information and entropy. From the definition of mutual information, $I(X;Y) = H(X) - H(X \mid Y)$, and the chain rule, $H(X,Y) = H(Y) + H(X \mid Y)$, we get $H(X) + H(Y) = H(X,Y) + I(X;Y)$, i.e., $I(X;Y) = H(X) + H(Y) - H(X,Y)$. Consequently, $H(X,Y) \le H(X) + H(Y)$, with equality iff $X \perp Y$.
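These identities can also be checked numerically (a Python sketch on the same kind of made-up joint table, assuming NumPy):

```python
import numpy as np

P_xy = np.array([[0.3, 0.1],     # hypothetical joint distribution P(X, Y)
                 [0.2, 0.4]])

H = lambda p: float(-sum(q * np.log2(q) for q in np.ravel(p) if q > 0))

P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
I_xy = H(P_x) + H(P_y) - H(P_xy)     # I(X;Y) = H(X) + H(Y) - H(X,Y)
assert I_xy >= 0                     # equality only if X and Y independent
assert H(P_xy) <= H(P_x) + H(P_y)    # H(X,Y) <= H(X) + H(Y)
print(I_xy)
```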

Mutual Information: Mutual information and entropy. [Venn diagram: relationships among joint entropy, conditional entropy, and mutual information.] $H(X) + H(Y) = H(X,Y) + I(X;Y)$; $I(X;Y) = H(X) - H(X \mid Y)$; $I(Y;X) = H(Y) - H(Y \mid X)$.

Mutual Information: Conditional mutual information. The conditional mutual information of $X$ and $Y$ given $Z$: $I(X;Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$. It is the average amount of information $Y$ conveys about $X$ when $Z$ is given.

Mutual Information: Conditional mutual information and KL divergence. Note that $I(X;Y \mid Z) = \sum_{X,Z} P(X,Z) \log \frac{1}{P(X \mid Z)} - \sum_{X,Y,Z} P(X,Y,Z) \log \frac{1}{P(X \mid Y,Z)} = \sum_{X,Y,Z} P(X,Y,Z) \log \frac{1}{P(X \mid Z)} - \sum_{X,Y,Z} P(X,Y,Z) \log \frac{1}{P(X \mid Y,Z)} = \sum_{X,Y,Z} P(X,Y,Z) \log \frac{P(X \mid Y,Z)}{P(X \mid Z)} = \sum_Z P(Z) \sum_{X,Y} P(X,Y \mid Z) \log \frac{P(X,Y \mid Z)}{P(X \mid Z) P(Y \mid Z)} = \sum_Z P(Z)\, KL(P(X,Y \mid Z) \| P(X \mid Z) P(Y \mid Z)) \ge 0$.
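A Python sketch of this mixture-of-KLs form (assuming NumPy; the joint distribution over $X$, $Y$, $Z$ is randomly generated for illustration, not from the slides):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y, Z), shape (|X|, |Y|, |Z|).
P = np.random.default_rng(1).dirichlet(np.ones(8)).reshape(2, 2, 2)

def cmi(P):
    """I(X;Y|Z) = sum_z P(z) KL( P(X,Y|z) || P(X|z) P(Y|z) ), in bits."""
    total = 0.0
    for z in range(P.shape[2]):
        Pz = P[:, :, z].sum()                       # P(Z = z)
        Pxy_z = P[:, :, z] / Pz                     # P(X, Y | Z=z)
        Px_z = Pxy_z.sum(axis=1, keepdims=True)     # P(X | Z=z)
        Py_z = Pxy_z.sum(axis=0, keepdims=True)     # P(Y | Z=z)
        mask = Pxy_z > 0
        total += Pz * np.sum(Pxy_z[mask] * np.log2((Pxy_z / (Px_z * Py_z))[mask]))
    return total

print(cmi(P))   # >= 0; equals 0 iff X and Y are conditionally independent given Z
```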

Mutual Information: Property of conditional mutual information. Theorem (1.5): $I(X;Y \mid Z) \ge 0$, i.e., $H(X \mid Z) \ge H(X \mid Y, Z)$, with equality iff $X \perp Y \mid Z$. Interpretation: more observations reduce uncertainty on average, except in the case of conditional independence. $X$ and $Y$ are independent given $Z$ iff $X$ contains no information about $Y$ given $Z$ and vice versa: $X \perp Y \mid Z \iff I(X;Y \mid Z) = 0$. This is another characterization of conditional independence.