
A Gentle Tutorial on Information Theory and Learning
Roni Rosenfeld, Carnegie Mellon University (IT tutorial, 1999)

Information is not the same as knowledge: information theory is concerned with abstract possibilities, not with their meaning. Information = reduction in uncertainty.

Outline
First part based very loosely on [Abramson 63]. Information theory is usually formulated in terms of information channels and coding; we will not discuss those here.
1. Information
2. Entropy
3. Mutual Information
4. Cross Entropy and Learning

Imagine: #1, you are about to observe the outcome of a coin flip; #2, you are about to observe the outcome of a die roll. There is more uncertainty in #2. Next:
1. You observe the outcome of #1: uncertainty is reduced to zero.
2. You observe the outcome of #2: uncertainty is reduced to zero.
=> More information was provided by the outcome in #2.

Definition of Information (after [Abramson 63])
Let E be some event which occurs with probability P(E). If we are told that E has occurred, then we say that we have received

    I(E) = log2( 1 / P(E) )

bits of information. The base of the log is unimportant; it only changes the units. We will stick with bits and always assume base 2. Information can also be thought of as the amount of surprise in E (e.g. P(E) = 1 means no surprise; P(E) close to 0 means a great deal of surprise).
Example: the result of a fair coin flip (log2 2 = 1 bit).
Example: the result of a fair die roll (log2 6 ≈ 2.585 bits).
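
The definition is easy to check numerically. Below is a minimal sketch in Python (our addition, not part of the original tutorial; the helper name self_information is ours) computing I(E) = log2(1/P(E)) for the examples above.

```python
import math

def self_information(p: float) -> float:
    """I(E) = log2(1 / P(E)): bits of information received when an
    event of probability p is observed."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / p)

print(self_information(1 / 2))   # fair coin flip -> 1.0 bit
print(self_information(1 / 6))   # fair die roll  -> ~2.585 bits
print(self_information(1.0))     # certain event  -> 0.0 bits (no surprise)
```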

Information is Additive
I(k fair coin tosses) = log2( 1 / (1/2)^k ) = k bits. So:
- A random word from a 100,000-word vocabulary: I(word) = log2 100,000 ≈ 16.6 bits.
- A 1000-word document from the same source: I(document) ≈ 16,600 bits.
- A 480x640-pixel, 16-greyscale video picture: I(picture) = 307,200 · log2 16 = 1,228,800 bits.
=> A (VGA) picture is worth (a lot more than) a 1000 words! (In reality, both are gross overestimates.)

Entropy
A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, ..., sk} with probabilities {p1, p2, ..., pk}, respectively, where the symbols emitted are statistically independent. What is the average amount of information in observing the output of the source S? Call this entropy:

    H(S) = Σ_i p_i I(s_i) = Σ_i p_i log2(1/p_i) = E_P[ log2(1/p(s)) ]

Alternative Explanations of Entropy
H(S) = Σ_i p_i log2(1/p_i) is:
1. the average amount of information provided per symbol
2. the average amount of surprise when observing a symbol
3. the uncertainty an observer has before seeing the symbol
4. the average number of bits needed to communicate each symbol
(Shannon: there are codes that will communicate these symbols with efficiency arbitrarily close to H(S) bits/symbol; there are no codes that will do it with efficiency less than H(S) bits/symbol.)

Entropy as a Function of a Probability Distribution
Since the source S is fully characterized by P = {p1, ..., pk} (we don't care what the symbols s_i actually are, or what they stand for), entropy can also be thought of as a property of a probability distribution function P: the average uncertainty in the distribution. So we may also write:

    H(S) = H(P) = H(p1, p2, ..., pk) = Σ_i p_i log2(1/p_i)

(This can be generalized to continuous distributions.)
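
A minimal sketch (Python; our addition, with the usual zero-probability convention 0 · log2(1/0) = 0 assumed) of the entropy of a discrete distribution:

```python
import math

def entropy(probs) -> float:
    """H(P) = sum_i p_i * log2(1 / p_i), in bits.
    Terms with p_i == 0 contribute 0 (the limiting value of p*log2(1/p))."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # fair coin: 1.0 bit
print(entropy([1/6] * 6))         # fair die: ~2.585 bits
print(entropy([0.3, 0.5, 0.2]))   # the temperature source used later: ~1.48548 bits
```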

Properties of Entropy
H(P) = Σ_i p_i log2(1/p_i)
1. Non-negative: H(P) >= 0.
2. Invariant with respect to permutations of its inputs: H(p1, p2, ..., pk) = H(p_τ(1), p_τ(2), ..., p_τ(k)).
3. For any other probability distribution {q1, q2, ..., qk}: H(P) = Σ_i p_i log2(1/p_i) < Σ_i p_i log2(1/q_i).
4. H(P) <= log2 k, with equality iff p_i = 1/k for all i.
5. The further P is from uniform, the lower the entropy.

Special Case: k = 2
Flipping a coin with P("head") = p, P("tail") = 1 - p:

    H(p) = p log2(1/p) + (1 - p) log2(1/(1 - p))

Notice: zero uncertainty/information/surprise at the edges (p = 0 or p = 1); maximum information at p = 0.5 (1 bit); it drops off quickly away from 0.5.

Special Case: k = 2 (cont.)
This relates to the "20 questions" game strategy (halving the space). A sequence of (independent) 0s and 1s can provide up to 1 bit of information per digit, provided the 0s and 1s are equally likely at any point. If they are not equally likely, the sequence provides less information and can be compressed.

The Entropy of English
27 characters (A-Z, space); 100,000 words (5.5 characters each on average, i.e. 6.5 characters per word including the space).
Assuming independence between successive characters:
- uniform character distribution: log2 27 ≈ 4.75 bits/character
- true character distribution: 4.03 bits/character
Assuming independence between successive words:
- uniform word distribution: log2(100,000) / 6.5 ≈ 2.55 bits/character
- true word distribution: 9.45 / 6.5 ≈ 1.45 bits/character
Notice: the true entropy of English is much lower still!
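
A short sketch (Python; our addition, mirroring the entropy helper above) of the binary entropy curve described here:

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)), in bits."""
    return sum(q * math.log2(1.0 / q) for q in (p, 1.0 - p) if q > 0)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"H({p:.1f}) = {binary_entropy(p):.3f} bits")
# Zero at the edges, a maximum of exactly 1 bit at p = 0.5,
# and symmetric fall-off on either side.
```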

Two Sources
Temperature T: a random variable taking on values t, with P(T=cold) = 0.3, P(T=mild) = 0.5, P(T=hot) = 0.2.
=> H(T) = H(0.3, 0.5, 0.2) = 1.48548
Humidity M: a random variable taking on values m, with P(M=low) = 0.6, P(M=high) = 0.4.
=> H(M) = H(0.6, 0.4) = 0.97095
T and M are not independent: P(T=t, M=m) != P(T=t) · P(M=m).

Joint Probability, Joint Entropy
P(T=t, M=m):
              cold   mild   hot
    low       0.1    0.4    0.1   | 0.6
    high      0.2    0.1    0.1   | 0.4
              0.3    0.5    0.2   | 1.0

H(T) = H(0.3, 0.5, 0.2) = 1.48548
H(M) = H(0.6, 0.4) = 0.97095
H(T) + H(M) = 2.45643
Joint entropy: consider the space of (t, m) events:
    H(T, M) = Σ_{t,m} P(T=t, M=m) log2( 1 / P(T=t, M=m) )
            = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.32193
Notice that H(T, M) < H(T) + H(M)!

Conditional Probability, Conditional Entropy
P(T=t | M=m):
              cold   mild   hot
    low       1/6    4/6    1/6   | 1.0
    high      2/4    1/4    1/4   | 1.0

Conditional entropy:
    H(T | M=low)  = H(1/6, 4/6, 1/6) = 1.25163
    H(T | M=high) = H(2/4, 1/4, 1/4) = 1.5
Average conditional entropy (aka equivocation):
    H(T|M) = Σ_m P(M=m) H(T | M=m)
           = 0.6 · H(T | M=low) + 0.4 · H(T | M=high) = 1.350978
How much is M telling us on average about T?
    H(T) - H(T|M) = 1.48548 - 1.350978 ≈ 0.1345 bits

Conditional Probability, Conditional Entropy
P(M=m | T=t):
              cold   mild   hot
    low       1/3    4/5    1/2
    high      2/3    1/5    1/2
              1.0    1.0    1.0

Conditional entropy:
    H(M | T=cold) = H(1/3, 2/3) = 0.918296
    H(M | T=mild) = H(4/5, 1/5) = 0.721928
    H(M | T=hot)  = H(1/2, 1/2) = 1.0
Average conditional entropy (aka equivocation):
    H(M|T) = Σ_t P(T=t) H(M | T=t)
           = 0.3 · H(M | T=cold) + 0.5 · H(M | T=mild) + 0.2 · H(M | T=hot) = 0.8364528
How much is T telling us on average about M?
    H(M) - H(M|T) = 0.97095 - 0.8364528 ≈ 0.1345 bits
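
The numbers in this worked example are easy to reproduce. A minimal sketch (Python; our addition, with variable names of our choosing) that recomputes H(T), H(M), H(T, M), both equivocations, and the shared ~0.1345 bits:

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint distribution P(M, T): keys are (humidity, temperature), matching the table above.
joint = {("low", "cold"): 0.1, ("low", "mild"): 0.4, ("low", "hot"): 0.1,
         ("high", "cold"): 0.2, ("high", "mild"): 0.1, ("high", "hot"): 0.1}

temps = ("cold", "mild", "hot")
hums = ("low", "high")

p_t = {t: sum(joint[(m, t)] for m in hums) for t in temps}   # P(T): 0.3, 0.5, 0.2
p_m = {m: sum(joint[(m, t)] for t in temps) for m in hums}   # P(M): 0.6, 0.4

H_T = entropy(p_t.values())     # ~1.48548
H_M = entropy(p_m.values())     # ~0.97095
H_TM = entropy(joint.values())  # ~2.32193, less than H_T + H_M

# Equivocations: H(M|T) = sum_t P(T=t) * H(M | T=t), and symmetrically H(T|M).
H_M_given_T = sum(p_t[t] * entropy([joint[(m, t)] / p_t[t] for m in hums]) for t in temps)
H_T_given_M = sum(p_m[m] * entropy([joint[(m, t)] / p_m[m] for t in temps]) for m in hums)

print(H_T, H_M, H_TM)
print(H_M - H_M_given_T)  # ~0.1345 bits
print(H_T - H_T_given_M)  # ~0.1345 bits: the same number, as expected
```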

Average Mutual Information
    I(X; Y) = H(X) - H(X|Y)
            = Σ_x P(x) log2(1/P(x)) - Σ_{x,y} P(x,y) log2(1/P(x|y))
            = Σ_{x,y} P(x,y) log2( P(x|y) / P(x) )
            = Σ_{x,y} P(x,y) log2( P(x,y) / (P(x) P(y)) )
Properties of average mutual information:
- Symmetric (but in general H(X) != H(Y) and H(X|Y) != H(Y|X))
- Non-negative (although H(X) - H(X|y) may be negative for a particular observation y!)
- Zero iff X and Y are independent
- Additive (see next slide)

Mutual Information Visualized
    H(X, Y) = H(X) + H(Y) - I(X; Y)

Three Sources
From Blachman (notation: "|" means "given", ";" means "between", "," means "and"):
    H(X, Y | Z) = H({X, Y} | Z)
    H(X | Y, Z) = H(X | {Y, Z})
    I(X; Y | Z) = H(X | Z) - H(X | Y, Z)
    I(X; Y; Z) = I(X; Y) - I(X; Y | Z)    <- can be negative!
               = H(X, Y, Z) - H(X, Y) - H(X, Z) - H(Y, Z) + H(X) + H(Y) + H(Z)
    I(X; Y, Z) = I(X; Y) + I(X; Z | Y)    (additivity)
But I(X; Y) = 0 and I(X; Z) = 0 do not imply I(X; Y, Z) = 0!

A Markov Source
An order-k Markov source is a source that remembers the last k symbols emitted; i.e., the probability of emitting any symbol depends on the last k emitted symbols:
    P(s_t | s_{t-1}, s_{t-2}, ..., s_{t-k})
So the last k emitted symbols define a state, and there are q^k states. A first-order Markov source is defined by a q x q matrix: P(s_i | s_j).
Example: s_t is the position after t random steps.
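
The warning above has a classic illustration: X and Y independent fair bits with Z = X XOR Y. A small sketch (Python; our addition, with the joint distribution and helper names chosen by us) showing I(X;Y) = I(X;Z) = 0 while I(X; Y,Z) = 1 bit, so the three-way term I(X;Y;Z) is negative:

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint distribution over (X, Y, Z): X, Y are independent fair bits, Z = X XOR Y.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def marginal_entropy(keep):
    """Entropy of the marginal over the chosen coordinates, e.g. keep=(0, 2) -> H(X, Z)."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + p
    return entropy(marg.values())

def mutual_info(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B), with a and b given as coordinate index tuples."""
    return marginal_entropy(a) + marginal_entropy(b) - marginal_entropy(a + b)

print(mutual_info((0,), (1,)))    # I(X; Y)   = 0.0
print(mutual_info((0,), (2,)))    # I(X; Z)   = 0.0
print(mutual_info((0,), (1, 2)))  # I(X; Y,Z) = 1.0, even though both pairwise terms are zero
# Here I(X;Y;Z) = I(X;Y) - I(X;Y|Z) = 0 - 1 = -1 bit, so it can indeed be negative.
```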

Approximating with a Markov Source
A non-Markovian source can still be approximated by one. Example: English characters C = {c1, c2, ...}:
1. Uniform: H(C) = log2 27 ≈ 4.75 bits/char
2. Assuming zero memory: H(C) = H(0.186, 0.064, 0.027, ...) = 4.03 bits/char
3. Assuming 1st order: H(C) = H(c_i | c_{i-1}) = 3.32 bits/char
4. Assuming 2nd order: H(C) = H(c_i | c_{i-1}, c_{i-2}) = 3.1 bits/char
5. Assuming large order: Shannon got down to about 1 bit/char

Modeling an Arbitrary Source
Source D(Y) with unknown distribution P_D(Y) (recall H(P_D) = E_{P_D}[ log2(1/P_D(Y)) ]).
Goal: model (approximate) it with a learned distribution P_M(Y). What is a good model P_M(Y)?
1. RMS error over D's parameters: but D is unknown!
2. Predictive probability: maximize the expected log-likelihood the model assigns to future data from D.

Cross Entropy
The best model under the predictive-probability criterion is

    M* = argmax_M E_D[ log P_M(Y) ] = argmin_M E_D[ log(1/P_M(Y)) ],

and E_D[ log(1/P_M(Y)) ] = CH(P_D; P_M) is the cross entropy. The following are equivalent:
1. Maximize the predictive probability of P_M
2. Minimize the cross entropy CH(P_D; P_M)
3. Minimize the difference between P_D and P_M (in what sense?)

A Distance Measure Between Distributions
The Kullback-Leibler distance:

    KL(P_D; P_M) = CH(P_D; P_M) - H(P_D) = E_{P_D}[ log( P_D(Y) / P_M(Y) ) ]

Properties of the KL distance:
1. Non-negative; KL(P_D; P_M) = 0 iff P_D = P_M
2. Generally non-symmetric
The following are equivalent:
1. Maximize the predictive probability of P_M for distribution D
2. Minimize the cross entropy CH(P_D; P_M)
3. Minimize the distance KL(P_D; P_M)
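
A minimal sketch (Python; our addition, with two illustrative distributions of our choosing) showing the relationship KL(P_D; P_M) = CH(P_D; P_M) - H(P_D) and the non-symmetry noted above:

```python
import math

def entropy(p):
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """CH(P; Q) = E_P[ log2(1/Q(Y)) ]; assumes q is nonzero wherever p is."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(P; Q) = CH(P; Q) - H(P) = E_P[ log2(P(Y)/Q(Y)) ]."""
    return cross_entropy(p, q) - entropy(p)

p_d = [0.3, 0.5, 0.2]    # stand-in for the unknown "true" source distribution
p_m = [1/3, 1/3, 1/3]    # a candidate learned model

print(entropy(p_d))             # ~1.485 bits
print(cross_entropy(p_d, p_m))  # ~1.585 bits, always >= H(P_D)
print(kl(p_d, p_m))             # ~0.100 bits, the gap between the two
print(kl(p_m, p_d))             # a different number: KL is generally non-symmetric
print(kl(p_d, p_d))             # 0.0: zero iff the distributions coincide
```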