Information Theory
L645, Dept. of Linguistics, Indiana University
Fall 2015
Information theory answers two fundamental questions in communication theory:
- What is the ultimate data compression?
- What is the ultimate transmission rate of communication?

Applies to: computer science, thermodynamics, economics, computational linguistics, ...

Background reading: T. Cover & J. Thomas (2006), Elements of Information Theory. Wiley.
Information & uncertainty

Information as a decrease in uncertainty:
- We have some uncertainty about a process, e.g., which symbol (A, B, C) will be generated by a device
- We learn some information, e.g., the previous symbol was B
- How uncertain are we now about which symbol will appear?

The more informative our information is, the less uncertain we will be.
Entropy

Entropy is the way we measure how informative a random variable is:

(1)  H(p) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x)

How do we get this formula?
Motivating entropy

- Assume a device that emits one symbol (A): our uncertainty about what we will see is zero
- Assume a device that emits two symbols (A, B). With one binary choice (A or B), our uncertainty is one
  - We could use one bit (0 or 1) to encode the outcome
- Assume a device that emits four symbols (A, B, C, D)
  - If we made a decision tree, there would be two levels (starting with "Is it A/B or C/D?"): uncertainty is two
  - We need two bits (00, 01, 10, 11) to encode the outcome
- We are describing a (base-2) logarithm: \log_2 M, where M is the number of symbols
Adding probabilities

- \log_2 M assumes every choice (A, B, C, D) is equally likely
- This is not the case in general
- Instead, we look at -\log_2 p(x), where x is the given choice, to tell us how surprising it is
- If every choice x is equally likely: p(x) = 1/M (and M = 1/p(x)), so

  \log_2 M = \log_2 \frac{1}{p(x)} = -\log_2 p(x)
Average surprisal

- -\log_2 p(x) tells us how surprising one particular symbol is. But on average, how surprising is a random variable?
- Summation gives a weighted average:

(2)  H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = E\left(\log_2 \frac{1}{p(x)}\right)

i.e., sum over all possible outcomes, multiplying the surprisal (-\log_2 p(x)) by the probability of seeing that outcome (p(x)).
- The amount of surprisal is the amount of information we need in order not to be surprised:
  - H(X) = 0 if the outcome is certain
  - H(X) = 1 if there are 2 outcomes and both are equally likely
Entropy example

- Roll an 8-sided die (or pick a character from an alphabet of 8 characters)
- Because each outcome is equally likely, the entropy is:

(3)  H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3

i.e., 3 bits are needed to encode this 8-character language:

char: A    E    I    O    U    F    G    H
code: 000  001  010  011  100  101  110  111
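A quick way to check this, as a minimal Python sketch (not part of the original slides):

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x); terms with p(x) = 0 contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# The uniform 8-outcome case: the 8-sided die / 8-character alphabet above.
uniform8 = [1/8] * 8
print(entropy(uniform8))  # 3.0 bits, matching log2(8) = 3
```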
Simplified Polynesian

Simplified Polynesian: 6 characters

char: P    T    K    A    I    U
prob: 1/8  1/4  1/8  1/4  1/8  1/8
Simplified Polynesian calculation

The entropy:

H(X) = -\sum_{i \in L} p(i) \log_2 p(i) = -\left[4 \cdot \frac{1}{8}\log_2\frac{1}{8} + 2 \cdot \frac{1}{4}\log_2\frac{1}{4}\right] = 2.5

With a fixed-length code, we again need 3 bits per character:

char: P    T    K    A    I    U
code: 000  001  010  011  100  101

BUT: since the distribution is NOT uniform, we can design a better code...
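The same computation as a short Python sketch (assuming the unigram probabilities above):

```python
from math import log2

# Unigram distribution for simplified Polynesian (P T K A I U).
probs = {'P': 1/8, 'T': 1/4, 'K': 1/8, 'A': 1/4, 'I': 1/8, 'U': 1/8}
H = -sum(p * log2(p) for p in probs.values())
print(H)  # 2.5 bits: below log2(6) ~ 2.58 because the distribution is non-uniform
```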
Simplified Polynesian: designing a better code

char: P    T    K    A    I    U
code: 100  00   101  01   110  111

- 0 as first digit: 2-digit character
- 1 as first digit: 3-digit character
- More likely characters get shorter codes

Task: Code the following word: KATUPATI. How many bits do we need on average?
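A small Python sketch for checking your answer to the task (the helper names here are mine, not from the slides):

```python
from math import log2

# The variable-length prefix code from the slide, plus the unigram probabilities.
code = {'P': '100', 'T': '00', 'K': '101', 'A': '01', 'I': '110', 'U': '111'}
probs = {'P': 1/8, 'T': 1/4, 'K': 1/8, 'A': 1/4, 'I': 1/8, 'U': 1/8}

def encode(word):
    return ''.join(code[ch] for ch in word)

print(encode('KATUPATI'))  # 101 01 00 111 100 01 00 110 -> 20 bits total

# Expected code length: sum over characters of p(c) * len(code(c)).
avg_bits = sum(probs[c] * len(code[c]) for c in code)
print(avg_bits)  # 2.5 bits per character on average, matching H(X)
```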
Joint entropy

For two random variables X & Y, the joint entropy is the average amount of information needed to specify both values:

(4)  H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)

How much do the two values influence each other? E.g., the average surprisal at seeing two POS tags next to each other.
Conditional entropy

Conditional entropy: how much extra information is needed to find Y's value, given that we know X?

(5)  H(Y|X) = \sum_{x \in X} p(x) H(Y|X = x)
            = \sum_{x \in X} p(x) \left[-\sum_{y \in Y} p(y|x) \log_2 p(y|x)\right]
            = -\sum_{x \in X} \sum_{y \in Y} p(x) p(y|x) \log_2 p(y|x)
            = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y|x)
Chain Rule

Chain rule for entropy:

H(X, Y) = H(X) + H(Y|X)

H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + ... + H(X_n|X_1, ..., X_{n-1})
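A minimal Python sketch of the two-variable chain rule on a toy joint distribution (the numbers are made up for illustration):

```python
from math import log2

# Toy joint distribution over pairs (x, y).
joint = {('a', 0): 1/2, ('a', 1): 1/4, ('b', 0): 1/8, ('b', 1): 1/8}

def H_joint(j):
    return -sum(p * log2(p) for p in j.values() if p > 0)

def marginal_x(j):
    px = {}
    for (x, _), p in j.items():
        px[x] = px.get(x, 0) + p
    return px

def H_x(j):
    return -sum(p * log2(p) for p in marginal_x(j).values() if p > 0)

def H_y_given_x(j):
    px = marginal_x(j)
    # -sum p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)
    return -sum(p * log2(p / px[x]) for (x, _), p in j.items() if p > 0)

print(H_joint(joint))               # 1.75
print(H_x(joint) + H_y_given_x(joint))  # 1.75 as well: H(X,Y) = H(X) + H(Y|X)
```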
Syllables in Simplified Polynesian

- Our model of simplified Polynesian earlier was too simple; joint entropy will help us build a better model
- Probabilities for letters on a per-syllable basis, using C and V as separate random variables
- Probabilities for consonants followed by a vowel (P(C, ·)) & vowels preceded by a consonant (P(·, V)):

char: p    t    k    a    i    u
prob: 1/8  3/4  1/8  1/2  1/4  1/4

On a per-letter basis, this would be the following (which we are not concerned about here):

char: p     t    k     a    i    u
prob: 1/16  3/8  1/16  1/4  1/8  1/8
Syllables in Simplified Polynesian (2)

More specifically, for CV structures the joint probability P(C, V) is represented:

          p     t     k    | P(·, V)
    a    1/16  3/8   1/16  |  1/2
    i    1/16  3/16  0     |  1/4
    u    0     3/16  1/16  |  1/4
    -------------------------------
 P(C, ·) 1/8   3/4   1/8

P(C, ·) & P(·, V) from before are marginal probabilities, e.g.:
P(C = t, V = a) = 3/8
P(C = t) = 3/4
Polynesian Syllables

Find H(C, V): how surprised we are on average to see a particular syllable structure

(6)  H(C, V) = H(C) + H(V|C) \approx 1.061 + 1.375 \approx 2.44

(7)  a. H(C) = -\sum_{c \in C} p(c) \log_2 p(c) \approx 1.061
     b. H(V|C) = -\sum_{c \in C} \sum_{v \in V} p(c, v) \log_2 p(v|c) = 1.375

For the calculation of H(V|C):
- We can calculate the probability p(v|c) from the chart on the previous page
- E.g., p(V = a | C = p) = 1/2 because 1/16 is half of 1/8
Polynesian Syllables (2)

H(C) = -\sum_{i \in L} p(i) \log_2 p(i)
     = -\left[2 \cdot \frac{1}{8}\log_2\frac{1}{8} + \frac{3}{4}\log_2\frac{3}{4}\right]
     = 2 \cdot \frac{1}{8}\log_2 8 + \frac{3}{4}(\log_2 4 - \log_2 3)
     = 2 \cdot \frac{1}{8} \cdot 3 + \frac{3}{4}(2 - \log_2 3)
     = \frac{3}{4} + \frac{6}{4} - \frac{3}{4}\log_2 3
     = \frac{9}{4} - \frac{3}{4}\log_2 3 \approx 1.061
Polynesian Syllables (3)

H(V|C) = -\sum_{x \in C} \sum_{y \in V} p(x, y) \log_2 p(y|x)
       = -[ 1/16 \log_2 1/2 + 3/8 \log_2 1/2 + 1/16 \log_2 1/2
           + 1/16 \log_2 1/2 + 3/16 \log_2 1/4 + 0 \log_2 0
           + 0 \log_2 0 + 3/16 \log_2 1/4 + 1/16 \log_2 1/2 ]
       = 1/16 \log_2 2 + 3/8 \log_2 2 + 1/16 \log_2 2
           + 1/16 \log_2 2 + 3/16 \log_2 4
           + 3/16 \log_2 4 + 1/16 \log_2 2
       = 11/8

(using the convention 0 \log_2 0 = 0)
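These results can be checked mechanically; a Python sketch using the joint table from the earlier slide:

```python
from math import log2

# Joint probabilities P(C, V) for simplified Polynesian syllables:
# consonants p, t, k; vowels a, i, u.
joint = {
    ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
    ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0,
    ('p', 'u'): 0,    ('t', 'u'): 3/16, ('k', 'u'): 1/16,
}

# Marginal P(C) by summing each consonant's row of joint probabilities.
p_c = {}
for (c, _), p in joint.items():
    p_c[c] = p_c.get(c, 0) + p

H_C = -sum(p * log2(p) for p in p_c.values() if p > 0)
# -sum p(c,v) log2 p(v|c); zero-probability cells contribute 0.
H_V_given_C = -sum(p * log2(p / p_c[c]) for (c, _), p in joint.items() if p > 0)

print(H_C)                # ~1.061
print(H_V_given_C)        # 1.375 = 11/8
print(H_C + H_V_given_C)  # ~2.44 = H(C, V), by the chain rule
```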
Polynesian Syllables (4)

Exercise: Verify this result by using H(C, V) = H(V) + H(C|V)
Pointwise mutual information

Mutual information (I(X; Y)): how related are two different random variables?
= the amount of information one random variable contains about another
= the reduction in uncertainty of one random variable based on knowledge of the other

Pointwise mutual information: mutual information for two points (not two distributions), e.g., a two-word collocation:

(8)  I(x; y) = \log_2 \frac{p(x, y)}{p(x) p(y)}

i.e., the probability of x and y occurring together vs. independently.

Exercise: Calculate the pointwise mutual information of C = p and V = i from the simplified Polynesian example
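A quick Python check of the exercise, using the values from the joint table (skip it if you want to work it out by hand first):

```python
from math import log2

p_joint = 1/16  # P(C = p, V = i) from the joint table
p_c = 1/8       # P(C = p)
p_v = 1/4       # P(V = i)

pmi = log2(p_joint / (p_c * p_v))
print(pmi)  # 1.0 bit: seeing p makes i twice as likely as it is overall
```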
Mutual information

Mutual information: on average, what is the common information between X and Y?

(9)  I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}
Mutual information (2)

H(X|Y): information needed to specify X when Y is known

(10)  I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

- Take the information needed to specify X and subtract out the information still needed when Y is known, i.e., the information shared by X and Y
- When X and Y are independent, I(X; Y) = 0
- When X and Y are very dependent, I(X; Y) is high

Exercise: Calculate I(C; V) for Simplified Polynesian
Note that \log_2 1 = 0 and that this happens when p(x, y) = p(x) p(y)
Relative entropy

How far off are two distributions from each other? Relative entropy, or Kullback-Leibler (KL) divergence, provides such a measure (for distributions p and q):

(11)  D(p \| q) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}

- Informally, this is the distance between p and q
- If p is the correct distribution: the average number of bits wasted by using distribution q instead of p
Notes on relative entropy

D(p \| q) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}

- Often used as a distance measure in machine learning
- D(p \| q) is always nonnegative
- D(p \| q) = 0 if p = q
- D(p \| q) = \infty if there is an x \in X such that p(x) > 0 and q(x) = 0
- Not symmetric: D(p \| q) \neq D(q \| p)
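A short Python sketch of D(p || q), illustrating the zero-probability case and the asymmetry (the two distributions here are made up for illustration):

```python
from math import log2

def kl(p, q):
    """D(p || q) in bits; infinite when q(x) = 0 for some x with p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return float('inf')
            total += px * log2(px / qx)
    return total

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]
print(kl(p, q), kl(q, p))  # two different values: D is not symmetric
print(kl(p, p))            # 0.0: a distribution diverges from itself by nothing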
Divergence as mutual information

Our formula for mutual information was:

(12)  I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}

meaning that it is the same as measuring how far a joint distribution (p(x, y)) is from independence (p(x) p(y)):

(13)  I(X; Y) = D(p(x, y) \| p(x) p(y))
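A Python sketch verifying (13) on the simplified Polynesian joint table; the resulting number also answers the earlier I(C; V) exercise:

```python
from math import log2

joint = {
    ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
    ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0,
    ('p', 'u'): 0,    ('t', 'u'): 3/16, ('k', 'u'): 1/16,
}

# Marginals P(C) and P(V) from the joint distribution.
p_c, p_v = {}, {}
for (c, v), p in joint.items():
    p_c[c] = p_c.get(c, 0) + p
    p_v[v] = p_v.get(v, 0) + p

# I(C; V) = D(p(c,v) || p(c)p(v)): KL divergence of the joint from independence.
I = sum(p * log2(p / (p_c[c] * p_v[v])) for (c, v), p in joint.items() if p > 0)
print(I)  # 0.125 bits: small, but C and V are not fully independent
```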
The noisy channel

The idea of information comes from Claude Shannon's description of the noisy channel. Many natural language tasks can be viewed as follows:
- There is an output, which we can observe
- We want to guess what the input was... but it has been corrupted along the way

Example: machine translation from English to French
- Assume that the true input is in French
- But all we can see is the garbled (English) output
- To what extent can we recover the original French?
The noisy channel, theoretically

Some questions behind information theory:
- How much loss of information can we prevent when we are attempting to compress the data? I.e., how redundant does the data need to be? And what is the theoretical maximum amount of compression? (This is entropy.)
- How fast can data be transmitted perfectly? A channel has a specific capacity (defined by mutual information).