

Information Theory. Sargur N. Srihari

Topics
1. Entropy as an Information Measure
   1. Discrete variable definition; relationship to code length
   2. Continuous variable: differential entropy
2. Maximum Entropy
3. Conditional Entropy
4. Kullback-Leibler Divergence (Relative Entropy)
5. Mutual Information

Information Measure
- How much information is received when we observe a specific value of a discrete random variable x?
- The amount of information is the degree of surprise: a certain event conveys no information, and more information is received when an event is unlikely.
- It therefore depends on the probability distribution p(x), through a quantity h(x).
- For two unrelated events x and y we want h(x,y) = h(x) + h(y), so we choose h(x) = -log2 p(x); the negative sign ensures the information measure is non-negative.
- The average amount of information transmitted is the expectation with respect to p(x), referred to as the entropy:
  H(x) = -Σ_x p(x) log2 p(x)
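
As a minimal illustration (not part of the slides), the surprisal h(x) = -log2 p(x) and the entropy H can be computed directly; the helper names surprisal and entropy are just illustrative.

```python
# A minimal sketch (not from the slides): surprisal h(x) and entropy H
# for a discrete distribution, using base-2 logs so the units are bits.
import numpy as np

def surprisal(p):
    """h(x) = -log2 p(x): information (in bits) gained by observing an event of probability p."""
    return -np.log2(p)

def entropy(probs):
    """H = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs > 0
    return -np.sum(probs[nonzero] * np.log2(probs[nonzero]))

print(surprisal(0.5))          # 1 bit: a fair coin flip
print(entropy([0.5, 0.5]))     # 1 bit
```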

Usefulness of Entropy
- Uniform distribution: a random variable x with 8 equally likely states needs 3 bits to transmit; also H(x) = -8 × (1/8) log2(1/8) = 3 bits.
- Non-uniform distribution: if x has 8 states with probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64), then H(x) = 2 bits.
- The non-uniform distribution has smaller entropy than the uniform one.
- Entropy also has an interpretation in terms of disorder.
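
A small check (assumed example code, not from the slides) of the two entropy values quoted above:

```python
# Verify the 3-bit uniform entropy and the 2-bit non-uniform entropy from the text.
import numpy as np

def entropy_bits(probs):
    probs = np.asarray(probs, dtype=float)
    nz = probs > 0
    return -np.sum(probs[nz] * np.log2(probs[nz]))

uniform = [1/8] * 8
skewed  = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy_bits(uniform))  # 3.0 bits
print(entropy_bits(skewed))   # 2.0 bits
```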

Relationship of Entropy to Code Length
- We can take advantage of a non-uniform distribution by using shorter codes for the more probable events.
- If x has 8 states (a, b, c, d, e, f, g, h) with probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64), we can use the codes 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
- Average code length = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/16)·4 + 4·(1/64)·6 = 2 bits, the same as the entropy of the random variable.
- A shorter code string is not possible, because the string must be unambiguously decodable into its component parts; e.g., 11001110 is uniquely decoded as the sequence cad.
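
The figures above can be verified with a short sketch (assumed code, not from the slides): the average length of this prefix code equals the 2-bit entropy, and 11001110 decodes to cad.

```python
# Average code length of the prefix code above, and decoding of "11001110".
code = {'a': '0', 'b': '10', 'c': '110', 'd': '1110',
        'e': '111100', 'f': '111101', 'g': '111110', 'h': '111111'}
probs = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/16,
         'e': 1/64, 'f': 1/64, 'g': 1/64, 'h': 1/64}

avg_len = sum(probs[s] * len(code[s]) for s in code)
print(avg_len)  # 2.0 bits

def decode(bits, code):
    """Greedy prefix-code decoding: no codeword is a prefix of another."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)

print(decode('11001110', code))  # 'cad'
```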

Relationship between Entropy and Shortest Coding Length
- Noiseless coding theorem of Shannon: entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
- For the relationship to other topics, natural logarithms are used, in which case entropy is measured in nats instead of bits.

History of Entropy: from Thermodynamics to Information Theory
- Entropy is the average amount of information needed to specify the state of a random variable.
- The concept had a much earlier origin in physics, in the context of equilibrium thermodynamics.
- It was later given a deeper interpretation as a measure of disorder, through developments in statistical mechanics.

Entropy Personae: the World of Atoms and the World of Bits
- Ludwig Eduard Boltzmann (1844-1906): created statistical mechanics.
  - First law: conservation of energy; energy is not destroyed but converted from one form to another.
  - Second law: the principle of decay in nature; entropy increases. It explains why not all energy is available to do useful work.
  - Relates the macrostate to the statistical behavior of microstates.
- Claude Shannon (1916-2001): entropy in the world of bits.
- Stephen Hawking: gravitational entropy.

Physics View of Entropy
- Allocate N objects to a set of bins so that n_i objects land in the i-th bin.
- Number of ways of allocating the objects: N ways to choose the first, N-1 for the second, and so on, giving N·(N-1)···2·1 = N!.
- We do not distinguish rearrangements within each bin: in the i-th bin there are n_i! ways of reordering its objects.
- The total number of ways of allocating N objects to the bins, called the multiplicity (also the weight of the macrostate), is therefore
  W = N! / ∏_i n_i!
- The entropy is the scaled log of the multiplicity:
  H = (1/N) ln W = (1/N) ln N! - (1/N) Σ_i ln n_i!
- Applying Stirling's approximation, ln N! ≈ N ln N - N, and holding the ratios n_i/N fixed as N → ∞ (with Σ_i n_i = N), gives
  H = -lim_{N→∞} Σ_i (n_i/N) ln(n_i/N) = -Σ_i p_i ln p_i
- The overall distribution, expressed as the ratios n_i/N, is called the macrostate; a specific arrangement of objects in the bins is a microstate.
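
As a numerical sketch (assumed, not from the slides), the scaled log multiplicity (1/N) ln W does approach -Σ_i p_i ln p_i as N grows; math.lgamma is used for ln n!.

```python
# (1/N) ln [ N! / prod_i n_i! ] converges to -sum_i p_i ln p_i for large N.
import math

def scaled_log_multiplicity(counts):
    """(1/N) ln W computed via log-gamma: ln n! = lgamma(n + 1)."""
    N = sum(counts)
    log_W = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_W / N

p = [0.5, 0.3, 0.2]
H = -sum(pi * math.log(pi) for pi in p)

for N in (10, 100, 10_000, 1_000_000):
    counts = [round(pi * N) for pi in p]
    print(N, scaled_log_multiplicity(counts), H)
# The middle column approaches H ≈ 1.0297 nats as N grows.
```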

Entropy and Histograms
- If X can take one of M values (bins, states) with p(x = x_i) = p_i, then H(p) = -Σ_i p_i ln p_i.
- The minimum value of the entropy is 0, attained when one p_i = 1 and the others are 0 (noting that lim_{p→0} p ln p = 0).
- A sharply peaked distribution has low entropy; a distribution spread more evenly has higher entropy.
- [Figure: two histograms over 30 bins; the broader distribution has the higher entropy value.]

Maximum Entropy Configuration
- Found by maximizing H using a Lagrange multiplier to enforce the normalization constraint on the probabilities: maximize
  H̃ = -Σ_i p(x_i) ln p(x_i) + λ ( Σ_i p(x_i) - 1 )
- Solution: all p(x_i) are equal, p(x_i) = 1/M, where M is the number of states.
- The maximum value of the entropy is ln M.
- To verify it is a maximum, evaluate the second derivative of the entropy:
  ∂²H̃ / ∂p(x_i) ∂p(x_j) = -I_ij / p_i
  where I_ij are the elements of the identity matrix.
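
A quick sanity check (assumed, not from the slides): among randomly drawn distributions over M states, none exceeds the uniform entropy ln M.

```python
# Sample points on the probability simplex and compare their entropy to ln M.
import numpy as np

rng = np.random.default_rng(0)
M = 8

def entropy_nats(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

uniform_H = np.log(M)
random_H = []
for _ in range(10_000):
    p = rng.dirichlet(np.ones(M))   # a random distribution over M states
    random_H.append(entropy_nats(p))

print(uniform_H, max(random_H))  # ln 8 ≈ 2.079; the sampled maximum stays below it
```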

Entropy with a Continuous Variable
- Divide x into bins of width Δ. For each bin there must exist a value x_i such that
  ∫_{iΔ}^{(i+1)Δ} p(x) dx = p(x_i) Δ
- This gives a discrete distribution with probabilities p(x_i)Δ, and entropy
  H_Δ = -Σ_i p(x_i)Δ ln( p(x_i)Δ ) = -Σ_i p(x_i)Δ ln p(x_i) - ln Δ
- Omitting the second term and considering the limit Δ → 0 gives
  H[x] = -∫ p(x) ln p(x) dx
  known as the differential entropy.
- The discrete and continuous forms of entropy differ by the quantity ln Δ, which diverges; this reflects the fact that specifying a continuous variable very precisely requires a large number of bits.
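
A numeric illustration (assumed, not from the slides) of the ln Δ relationship for a standard Gaussian: the discretized entropy plus ln Δ approaches the differential entropy as the bin width shrinks.

```python
# H_delta + ln(delta) -> differential entropy 0.5*(1 + ln(2*pi*sigma^2)) as delta -> 0.
import numpy as np
from scipy.stats import norm

sigma = 1.0
diff_entropy = 0.5 * (1 + np.log(2 * np.pi * sigma**2))

for delta in (1.0, 0.1, 0.01):
    edges = np.arange(-10, 10 + delta, delta)
    probs = np.diff(norm.cdf(edges, scale=sigma))      # bin probabilities p(x_i)*delta
    probs = probs[probs > 0]
    H_delta = -np.sum(probs * np.log(probs))           # entropy of the discretization
    print(delta, H_delta + np.log(delta), diff_entropy)
# The middle column converges to the right-hand value (≈ 1.4189 nats).
```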

Entropy with Multiple Continuous Variables
- The differential entropy for multiple continuous variables is
  H[x] = -∫ p(x) ln p(x) dx
- For what distribution is the differential entropy maximized?
  - For a discrete distribution, it is the uniform distribution.
  - For a continuous variable, it is the Gaussian, as we shall see.

Entropy as a Functional
- Ordinary calculus deals with functions; a functional is an operator that takes a function as input and returns a scalar.
- A widely used functional in machine learning is the entropy H[p(x)], which is a scalar quantity.
- We are interested in the maxima and minima of functionals, analogous to those of functions; this is the subject of the calculus of variations.

Maximizing a Functional
- A functional is a mapping from a set of functions to a real value. For what function is it maximized?
- Example: finding the shortest curve between two points on a sphere (a geodesic). With no constraints the answer is a straight line; when the curve is constrained to lie on the surface, the solution is less obvious, and there may be several.
- Constraints are incorporated using a Lagrangian.

Maximizing Differential Entropy
- Assume constraints on the first and second moments of p(x) as well as normalization:
  ∫ p(x) dx = 1,   ∫ x p(x) dx = µ,   ∫ (x - µ)² p(x) dx = σ²
- Constrained maximization is performed using Lagrange multipliers: maximize the following functional with respect to p(x):
  -∫ p(x) ln p(x) dx + λ₁( ∫ p(x) dx - 1 ) + λ₂( ∫ x p(x) dx - µ ) + λ₃( ∫ (x - µ)² p(x) dx - σ² )
- Using the calculus of variations, the derivative of the functional is set to zero, giving
  p(x) = exp{ -1 + λ₁ + λ₂ x + λ₃ (x - µ)² }
- Back-substituting into the three constraint equations leads to the result that the distribution maximizing the differential entropy is the Gaussian.
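
As an illustrative check (assumed, not from the slides), the closed-form differential entropies of a Gaussian, a uniform, and a Laplace distribution with the same variance show the Gaussian coming out largest.

```python
# Closed-form differential entropies (in nats) at a common variance:
# Gaussian 0.5*ln(2*pi*e*s^2), uniform ln(b - a), Laplace 1 + ln(2b).
import numpy as np

sigma2 = 1.0  # common variance

H_gauss   = 0.5 * np.log(2 * np.pi * np.e * sigma2)
width     = np.sqrt(12 * sigma2)            # uniform of this width has variance width^2/12
H_uniform = np.log(width)
b         = np.sqrt(sigma2 / 2)             # Laplace(scale=b) has variance 2*b^2
H_laplace = 1 + np.log(2 * b)

print(H_gauss, H_uniform, H_laplace)
# ≈ 1.4189 > 1.3466 (Laplace) > 1.2425 (uniform)
```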

Differential Entropy of the Gaussian
- The distribution that maximizes the differential entropy is the Gaussian.
- The value of the maximum entropy is
  H[x] = (1/2){ 1 + ln(2πσ²) }
- The entropy increases as the variance increases.
- Differential entropy, unlike discrete entropy, can be negative: H[x] < 0 for σ² < 1/(2πe).
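
A small sketch (assumed, not from the slides) comparing the closed-form value with a Monte Carlo estimate of E[-ln p(x)], and showing where the entropy turns negative.

```python
# The closed-form Gaussian entropy matches a Monte Carlo estimate, and it
# becomes negative once sigma^2 < 1/(2*pi*e).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

for sigma in (1.0, 0.5, 0.2):
    closed_form = 0.5 * (1 + np.log(2 * np.pi * sigma**2))
    samples = rng.normal(scale=sigma, size=200_000)
    monte_carlo = -np.mean(norm.logpdf(samples, scale=sigma))
    print(sigma, closed_form, monte_carlo)

print(1 / (2 * np.pi * np.e))  # ≈ 0.0585: entropy is negative when sigma^2 is below this
```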

Conditional Entropy
- Suppose we have a joint distribution p(x,y) from which we draw pairs of values of x and y.
- If the value of x is already known, the additional information needed to specify the corresponding value of y is -ln p(y|x).
- The average additional information needed to specify y is the conditional entropy
  H[y|x] = -∫∫ p(y,x) ln p(y|x) dy dx
- By the product rule, H[x,y] = H[y|x] + H[x], where H[x,y] is the differential entropy of p(x,y) and H[x] is the differential entropy of p(x).
- So the information needed to describe x and y is the information needed to describe x plus the additional information needed to specify y given x.
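
A discrete sanity check (assumed, not from the slides) of the chain rule H[x,y] = H[y|x] + H[x] for an arbitrary small joint distribution:

```python
# Verify H[x,y] = H[y|x] + H[x] for a 2x3 joint distribution.
import numpy as np

p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])     # rows: x, columns: y; sums to 1

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

H_xy = H(p_xy)
H_x = H(p_x)
H_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))

print(H_xy, H_y_given_x + H_x)  # the two numbers agree
```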

Relative Entropy
- Suppose an unknown distribution p(x) has been modeled by an approximating distribution q(x), i.e., q(x) is used to construct a coding scheme for transmitting values of x to a receiver.
- The average additional amount of information required to specify the value of x as a result of using q(x) instead of the true distribution p(x) is given by the relative entropy, or Kullback-Leibler (K-L) divergence.
- It is an important concept in Bayesian analysis. Entropy comes from information theory; the K-L divergence (relative entropy) is used in pattern recognition as a distance (dissimilarity) measure.

Relative Entropy or K-L Divergence
- The additional information required as a result of using q(x) in place of p(x) is
  KL(p||q) = -∫ p(x) ln q(x) dx - ( -∫ p(x) ln p(x) dx ) = ∫ p(x) ln [ p(x) / q(x) ] dx
- It is not a symmetric quantity: KL(p||q) ≠ KL(q||p).
- The K-L divergence satisfies KL(p||q) ≥ 0, with equality if and only if p(x) = q(x).
- The proof involves convex functions.
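
An illustrative computation (assumed, not from the slides) of the K-L divergence for two discrete distributions, showing non-negativity and asymmetry:

```python
# KL divergence between two discrete distributions, in nats.
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) ln [ p(x) / q(x) ] (assumes q > 0 wherever p > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.8, 0.1, 0.1]
q = [1/3, 1/3, 1/3]

print(kl(p, q), kl(q, p))   # both >= 0, and unequal: the divergence is not symmetric
print(kl(p, p))             # 0: equality iff the distributions match
```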

Convex Function
- A function f(x) is convex if every chord lies on or above the function.
- Any value of x in the interval from x = a to x = b can be written as λa + (1-λ)b, where 0 ≤ λ ≤ 1.
- The corresponding point on the chord is λf(a) + (1-λ)f(b), so convexity implies
  f( λa + (1-λ)b ) ≤ λ f(a) + (1-λ) f(b)
  (point on the curve ≤ point on the chord).
- By induction we get Jensen's inequality:
  f( Σ_{i=1}^{M} λ_i x_i ) ≤ Σ_{i=1}^{M} λ_i f(x_i)
  where λ_i ≥ 0 and Σ_i λ_i = 1.
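
A quick numeric check (assumed, not from the slides) of Jensen's inequality for the convex function f(x) = x², with randomly drawn points and weights:

```python
# f( sum_i lam_i x_i ) <= sum_i lam_i f(x_i) for a convex f and weights on the simplex.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x**2                      # a convex function

x = rng.normal(size=5)                  # arbitrary points x_i
lam = rng.dirichlet(np.ones(5))         # weights lam_i >= 0 summing to 1

lhs = f(np.dot(lam, x))                 # f( sum_i lam_i x_i )
rhs = np.dot(lam, f(x))                 # sum_i lam_i f(x_i)
print(lhs <= rhs, lhs, rhs)             # True
```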

Proof of Positivity of the K-L Divergence
- Since -ln u is a convex function, applying Jensen's inequality to the definition of the K-L divergence yields the desired result:
  KL(p||q) = -∫ p(x) ln [ q(x) / p(x) ] dx ≥ -ln ∫ p(x) [ q(x) / p(x) ] dx = -ln ∫ q(x) dx = -ln 1 = 0
- Because -ln u is strictly convex, equality holds if and only if q(x) = p(x) for all x.

Mutual Information
- Consider the joint distribution p(x,y) of two sets of variables.
  - If they are independent, the joint factorizes as p(x,y) = p(x)p(y).
  - If they are not independent, how close they are to being independent is given by the KL divergence between the joint distribution and the product of the marginals:
    I[x,y] = KL( p(x,y) || p(x)p(y) ) = -∫∫ p(x,y) ln [ p(x)p(y) / p(x,y) ] dx dy
- This is called the mutual information between the variables x and y.
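
A discrete example (assumed, not from the slides): mutual information computed as the KL divergence between a small joint distribution and the product of its marginals.

```python
# I[x,y] = KL( p(x,y) || p(x)p(y) ), in nats, for a 2x2 joint distribution.
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])                     # joint distribution over (x, y)

p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
independent = p_x * p_y                             # product of marginals

I = np.sum(p_xy * np.log(p_xy / independent))       # KL( p(x,y) || p(x)p(y) )
print(I)                                            # > 0 since x and y are dependent
```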

Mutual Information
- Using the sum and product rules,
  I[x,y] = H[x] - H[x|y] = H[y] - H[y|x]
- Mutual information is the reduction in uncertainty about x given the value of y (or vice versa).
- Bayesian perspective: if p(x) is the prior and p(x|y) is the posterior, the mutual information is the reduction in uncertainty after y is observed.
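
A follow-up check (assumed, not from the slides): for the same 2x2 joint as in the previous sketch, the KL form of mutual information agrees with H[x] - H[x|y] and H[y] - H[y|x].

```python
# Verify I[x,y] = KL(joint || marginals product) = H[x] - H[x|y] = H[y] - H[y|x].
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])

def H(p):
    p = np.asarray(p).ravel()
    return -np.sum(p * np.log(p))

p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

I_kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y[None, :]))   # uses p(x|y) = p(x,y)/p(y)
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))   # uses p(y|x) = p(x,y)/p(x)

print(I_kl, H(p_x) - H_x_given_y, H(p_y) - H_y_given_x)     # all three agree
```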