Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.


L645, Dept. of Linguistics, Indiana University, Fall 2015

Introduction

Information theory answers two fundamental questions in communication theory:
- What is the ultimate data compression?
- What is the transmission rate of communication?
Applies to: computer science, thermodynamics, economics, computational linguistics, ...
Background reading: T. Cover, J. Thomas (2006) Elements of Information Theory. Wiley.

Information & Uncertainty

Information as a decrease in uncertainty:
- We have some uncertainty about a process, e.g., which symbol (A, B, C) will be generated from a device
- We learn some information, e.g., the previous symbol is B
- How uncertain are we now about which symbol will appear?
- The more informative our information is, the less uncertain we will be.
Entropy is the way we measure how informative a random variable is:
  (1) H(p) = H(X) = - Σ_{x ∈ X} p(x) log2 p(x)
How do we get this formula?

Motivating entropy

- Assume a device that emits one symbol (A): our uncertainty about what we will see is zero.
- Assume a device that emits two symbols (A, B): with one choice (A or B), our uncertainty is one. We could use one bit (0 or 1) to encode the outcome.
- Assume a device that emits four symbols (A, B, C, D): if we made a decision tree, there would be two levels (starting with "Is it A/B or C/D?"): uncertainty is two. We need two bits (00, 01, 10, 11) to encode the outcome.
We are describing a (base 2) logarithm: log2 M, where M is the number of symbols.

Adding probabilities

log2 M assumes every choice (A, B, C, D) is equally likely. This is not the case in general.
Instead, we look at -log2 p(x), where x is the given choice, to tell us how surprising it is.
If every choice x is equally likely, then p(x) = 1/M (and M = 1/p(x)), so:
  log2 M = log2 (1/p(x)) = -log2 p(x)
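To make these definitions concrete, here is a minimal Python sketch (not part of the original slides; the distributions and function names are made-up examples) computing surprisal and entropy, and checking that a uniform distribution over M symbols has entropy log2 M:

    from math import log2

    def surprisal(p_x):
        """-log2 p(x): how surprising one outcome with probability p(x) is."""
        return -log2(p_x)

    def entropy(dist):
        """H(X) = -sum_x p(x) log2 p(x): the average surprisal, in bits."""
        return sum(p * surprisal(p) for p in dist.values() if p > 0)

    # Uniform over M = 4 symbols: entropy equals log2 M = 2 bits.
    uniform4 = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
    print(entropy(uniform4))   # 2.0

    # A skewed distribution is less surprising on average.
    skewed = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
    print(entropy(skewed))     # ~1.36 bits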

Average surprisal

-log2 p(x) tells us how surprising one particular symbol is. But on average, how surprising is a random variable?
Summation gives a weighted average:
  (2) H(X) = - Σ_x p(x) log2 p(x) = E(log2 (1/p(X)))
i.e., sum over all possible outcomes, multiplying surprisal (-log2 p(x)) by the probability of seeing that outcome (p(x)).
The amount of surprisal is the amount of information we need in order to not be surprised.
- H(X) = 0 if the outcome is certain
- H(X) = 1 if, out of 2 outcomes, both are equally likely

Entropy example

Roll an 8-sided die (or pick a character from an alphabet of 8 characters).
Because each outcome is equally likely, the entropy is:
  (3) H(X) = - Σ_{i=1}^{8} p(i) log2 p(i) = - Σ_{i=1}^{8} (1/8) log2 (1/8) = log2 8 = 3
i.e., 3 bits are needed to encode this 8-character language:

  A    E    I    O    U    F    G    H
  000  001  010  011  100  101  110  111

Entropy calculation: 6 characters

  char:  p    t    k    a    i    u
  prob:  1/8  1/4  1/8  1/4  1/8  1/8

  entropy: H(X) = - Σ_{i ∈ L} p(i) log2 p(i) = -[4 × (1/8) log2 (1/8) + 2 × (1/4) log2 (1/4)] = 2.5

Again, we need 3 bits per character with a fixed-length code:

  code:  000  001  010  011  100  101

BUT: since the distribution is NOT uniform, we can design a better code...

Designing a better code

  char:  p    t    k    a    i    u
  code:  100  00   101  01   110  111

- 0 as first digit: 2-digit character
- 1 as first digit: 3-digit character
- More likely characters get shorter codes
Task: Code the following word: KATUPATI. How many bits do we need on average? (See the sketch below.)

Joint entropy

For two random variables X & Y, joint entropy is the average amount of information needed to specify both values:
  (4) H(X, Y) = - Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log2 p(x, y)
How much do the two values influence each other? e.g., the average surprisal at seeing two POS tags next to each other.
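A small Python sketch of the coding discussion above. It is illustrative only and assumes the probabilities and variable-length code as reconstructed above (0-initial codewords of length 2, 1-initial codewords of length 3); it checks that the expected code length matches the 2.5-bit entropy and encodes the task word:

    from math import log2

    probs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
    code  = {"t": "00", "a": "01", "p": "100", "k": "101", "i": "110", "u": "111"}

    entropy = -sum(p * log2(p) for p in probs.values())
    avg_len = sum(probs[ch] * len(code[ch]) for ch in probs)
    print(entropy, avg_len)    # both 2.5 bits per character

    # Encoding the task word with this code:
    word = "katupati"
    encoded = "".join(code[ch] for ch in word)
    print(encoded, len(encoded) / len(word))   # 20 bits total, 2.5 bits/char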

Conditional entropy

Conditional entropy: how much extra information is needed to find Y's value, given that we know X?
  (5) H(Y|X) = Σ_{x ∈ X} p(x) H(Y|X = x)
             = Σ_{x ∈ X} p(x) [- Σ_{y ∈ Y} p(y|x) log2 p(y|x)]
             = - Σ_{x ∈ X} Σ_{y ∈ Y} p(x) p(y|x) log2 p(y|x)
             = - Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log2 p(y|x)

Chain rule

Chain rule for entropy:
  H(X, Y) = H(X) + H(Y|X)
  H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + ... + H(X_n|X_1, ..., X_{n-1})

Syllables in simplified Polynesian

Our model of simplified Polynesian earlier was too simple; joint entropy will help us with a better one.
Probabilities for letters on a per-syllable basis, using C and V as separate random variables, i.e., probabilities for consonants followed by a vowel (P(C, ·)) & vowels preceded by a consonant (P(·, V)):

  p    t    k    a    i    u
  1/8  3/4  1/8  1/2  1/4  1/4

On a per-letter basis, this would be the following (which we are not concerned about here):

  p     t    k     a    i    u
  1/16  3/8  1/16  1/4  1/8  1/8

Syllables in simplified Polynesian (2)

More specifically, for CV structures the joint probability P(C, V) is represented:

       p     t     k
  a   1/16  3/8   1/16
  i   1/16  3/16  0
  u   0     3/16  1/16

P(C, ·) & P(·, V) from before are marginal probabilities, e.g.:
  P(C = t) = Σ_v P(C = t, V = v) = 3/8 + 3/16 + 3/16 = 3/4

Polynesian Syllables

Find H(C, V), how surprised we are on average to see a particular syllable structure:
  (6) H(C, V) = H(C) + H(V|C) ≈ 1.061 + 1.375 ≈ 2.44
  (7) a. H(C) = - Σ_{c ∈ C} p(c) log2 p(c) ≈ 1.061
      b. H(V|C) = - Σ_{c ∈ C} Σ_{v ∈ V} p(c, v) log2 p(v|c) = 1.375
For the calculation of H(V|C): we can calculate the probability p(v|c) from the chart on the previous page, e.g., p(V = a | C = p) = 1/2 because 1/16 is half of 1/8.

Polynesian Syllables (2)

  H(C) = - Σ_{i ∈ L} p(i) log2 p(i)
       = -[2 × (1/8) log2 (1/8) + (3/4) log2 (3/4)]
       = 2 × (3/8) + (3/4)(log2 4 - log2 3)
       = 3/4 + (3/4)(2 - log2 3)
       = 3/4 + 6/4 - (3/4) log2 3
       = 9/4 - (3/4) log2 3
       ≈ 1.061
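The arithmetic above can be checked with a short Python sketch (standard library only; the joint table is the P(C, V) reconstruction shown above, and the variable names are illustrative), verifying the chain rule H(C, V) = H(C) + H(V|C):

    from math import log2

    joint = {                     # P(C, V) for simplified Polynesian syllables
        ("p", "a"): 1/16, ("p", "i"): 1/16, ("p", "u"): 0,
        ("t", "a"): 3/8,  ("t", "i"): 3/16, ("t", "u"): 3/16,
        ("k", "a"): 1/16, ("k", "i"): 0,    ("k", "u"): 1/16,
    }

    def H(dist):
        """Entropy in bits; terms with probability 0 contribute nothing."""
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    p_c = {}                      # marginal P(C), summing out V
    for (c, v), p in joint.items():
        p_c[c] = p_c.get(c, 0) + p

    H_C = H(p_c)                                       # ~1.061 bits
    H_V_given_C = -sum(p * log2(p / p_c[c])            # -sum p(c,v) log2 p(v|c)
                       for (c, v), p in joint.items() if p > 0)
    print(H_C, H_V_given_C, H(joint))                  # ~1.061, 1.375, ~2.44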

Polynesian Syllables (3)

  H(V|C) = - Σ_{x ∈ C} Σ_{y ∈ V} p(x, y) log2 p(y|x)
         = -[1/16 log2 1/2 + 3/8 log2 1/2 + 1/16 log2 1/2
            + 1/16 log2 1/2 + 3/16 log2 1/4 + 0 log2 0
            + 0 log2 0 + 3/16 log2 1/4 + 1/16 log2 1/2]
         = 1/16 log2 2 + 3/8 log2 2 + 1/16 log2 2
            + 1/16 log2 2 + 3/16 log2 4
            + 3/16 log2 4 + 1/16 log2 2
         = 11/8
Exercise: Verify this result by using H(C, V) = H(V) + H(C|V).

Mutual information

Mutual information (I(X; Y)): how related are two different random variables?
- = Amount of information one random variable contains about another
- = Reduction in uncertainty of one random variable based on knowledge of the other
Pointwise mutual information: mutual information for two points (not two distributions), e.g., a two-word collocation:
  (8) I(x; y) = log2 [p(x, y) / (p(x) p(y))]
i.e., the probability of x and y occurring together vs. independently.
Mutual information: on average, what is the common information between X and Y?
  (9) I(X; Y) = Σ_{x,y} p(x, y) log2 [p(x, y) / (p(x) p(y))]
Exercise: Calculate the pointwise mutual information of C = p and V = i from the simplified Polynesian example.

Mutual information (2)

Recall H(X|Y): the information needed to specify X when Y is known.
  (10) I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Take the information needed to specify X and subtract out the information when Y is known, i.e., the information shared by X and Y.
- When X and Y are independent, I(X; Y) = 0. (Note that log2 1 = 0 and that this happens when p(x, y) = p(x)p(y).)
- When X and Y are very dependent, I(X; Y) is high.
Exercise: Calculate I(C; V) for simplified Polynesian. (See the sketch below.)

Relative entropy

How far off are two distributions from each other? Relative entropy, or Kullback-Leibler (KL) divergence, provides such a measure (for distributions p and q):
  (11) D(p || q) = Σ_x p(x) log2 [p(x) / q(x)]
Informally, this is the distance between p and q.
If p is the correct distribution: the average number of bits wasted by using distribution q instead of p.
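A sketch for the two exercises above (the pointwise mutual information of C = p, V = i, and I(C; V)), reusing the same joint table as before. It is illustrative code, not from the slides; it also computes I(C; V) a second way via equation (10):

    from math import log2

    joint = {
        ("p", "a"): 1/16, ("p", "i"): 1/16, ("p", "u"): 0,
        ("t", "a"): 3/8,  ("t", "i"): 3/16, ("t", "u"): 3/16,
        ("k", "a"): 1/16, ("k", "i"): 0,    ("k", "u"): 1/16,
    }
    p_c, p_v = {}, {}             # marginals P(C) and P(V)
    for (c, v), p in joint.items():
        p_c[c] = p_c.get(c, 0) + p
        p_v[v] = p_v.get(v, 0) + p

    def pmi(c, v):
        """Pointwise mutual information of one (c, v) pair, eq. (8)."""
        return log2(joint[(c, v)] / (p_c[c] * p_v[v]))

    def mi():
        """Average mutual information I(C; V), eq. (9)."""
        return sum(p * log2(p / (p_c[c] * p_v[v]))
                   for (c, v), p in joint.items() if p > 0)

    def H(dist):
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    H_V_given_C = -sum(p * log2(p / p_c[c])
                       for (c, v), p in joint.items() if p > 0)

    print(pmi("p", "i"))          # pointwise MI of C=p and V=i
    print(mi())                   # I(C; V) via the double sum
    print(H(p_v) - H_V_given_C)   # same value via eq. (10): H(V) - H(V|C)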

Notes on relative entropy

  D(p || q) = Σ_x p(x) log2 [p(x) / q(x)]
- Often used as a distance measure in machine learning
- D(p || q) is always nonnegative
- D(p || q) = 0 if p = q
- D(p || q) = ∞ if there is an x ∈ X such that p(x) > 0 and q(x) = 0
- Not symmetric: D(p || q) ≠ D(q || p)

Divergence as mutual information

Our formula for mutual information was:
  (12) I(X; Y) = Σ_{x,y} p(x, y) log2 [p(x, y) / (p(x) p(y))]
meaning that it is the same as measuring how far a joint distribution (p(x, y)) is from independence (p(x)p(y)):
  (13) I(X; Y) = D(p(x, y) || p(x)p(y))

The noisy channel

The idea of information comes from Claude Shannon's description of the noisy channel.
Many natural language tasks can be viewed as: there is a noisy output, which we can observe, and we want to guess at what the input was... but it has been corrupted along the way.
Example: machine translation from English to French
- Assume that the true input is in French
- But all we can see is the garbled (English) output
- To what extent can we recover the original French?

The noisy channel theoretically

Some questions behind information theory:
- How much loss of information can we prevent when we are attempting to compress the data? i.e., how redundant does the data need to be? And what is the theoretical maximum amount of compression? (This is entropy.)
- How fast can data be transmitted perfectly? A channel has a specific capacity (defined by mutual information).
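The slides close by noting that channel capacity is defined via mutual information. As an added illustration that is not from the slides, here is a sketch for the standard binary symmetric channel: it maximizes I(X; Y) over input distributions by a simple grid search and compares the result to the known closed form 1 - H(eps), where eps is the flip probability. The identity I(X; Y) = H(X) + H(Y) - H(X, Y) used below follows from (10) and the chain rule; the function names are made up for this example.

    from math import log2

    def H(*ps):
        """Entropy in bits of the probabilities given as arguments."""
        return -sum(p * log2(p) for p in ps if p > 0)

    def mi_bsc(p1, eps):
        """I(X; Y) = H(X) + H(Y) - H(X, Y) for a binary symmetric channel
        with input P(X=1) = p1 and flip probability eps."""
        py1 = p1 * (1 - eps) + (1 - p1) * eps           # output P(Y=1)
        joint = [(1 - p1) * (1 - eps), (1 - p1) * eps,  # (x=0,y=0), (x=0,y=1)
                 p1 * eps,             p1 * (1 - eps)]  # (x=1,y=0), (x=1,y=1)
        return H(1 - p1, p1) + H(1 - py1, py1) - H(*joint)

    eps = 0.1
    capacity = max(mi_bsc(i / 1000, eps) for i in range(1001))
    print(capacity, 1 - H(eps, 1 - eps))   # both ~0.531 bits; max at uniform input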