Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.


L645, Dept. of Linguistics, Indiana University, Fall 2015

Introduction

Information theory answers two fundamental questions in communication theory:
- What is the ultimate data compression?
- What is the transmission rate of communication?
Applies to: computer science, thermodynamics, economics, computational linguistics, ...
Background reading: T. Cover, J. Thomas (2006) Elements of Information Theory. Wiley.

Information & Uncertainty

Information as a decrease in uncertainty:
- We have some uncertainty about a process, e.g., which symbol (A, B, C) will be generated from a device
- We learn some information, e.g., the previous symbol is B
- How uncertain are we now about which symbol will appear?
- The more informative our information is, the less uncertain we will be.
Entropy is the way we measure how informative a random variable is:
  (1) H(p) = H(X) = - Σ_{x ∈ X} p(x) log2 p(x)
How do we get this formula?

Motivating entropy

- Assume a device that emits one symbol (A): our uncertainty about what we will see is zero.
- Assume a device that emits two symbols (A, B): with one choice (A or B), our uncertainty is one. We could use one bit (0 or 1) to encode the outcome.
- Assume a device that emits four symbols (A, B, C, D): if we made a decision tree, there would be two levels (starting with "Is it A/B or C/D?"): uncertainty is two. We need two bits (00, 01, 10, 11) to encode the outcome.
We are describing a (base 2) logarithm: log2 M, where M is the number of symbols.

Adding probabilities

log2 M assumes every choice (A, B, C, D) is equally likely. This is not the case in general.
Instead, we look at -log2 p(x), where x is the given choice, to tell us how surprising it is.
If every choice x is equally likely, then p(x) = 1/M (and M = 1/p(x)), so:
  log2 M = log2 (1/p(x)) = -log2 p(x)
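To make these definitions concrete, here is a minimal Python sketch (not part of the original slides; the distributions and function names are made-up examples) computing surprisal and entropy, and checking that a uniform distribution over M symbols has entropy log2 M:

    from math import log2

    def surprisal(p_x):
        """-log2 p(x): how surprising one outcome with probability p(x) is."""
        return -log2(p_x)

    def entropy(dist):
        """H(X) = -sum_x p(x) log2 p(x): the average surprisal, in bits."""
        return sum(p * surprisal(p) for p in dist.values() if p > 0)

    # Uniform over M = 4 symbols: entropy equals log2 M = 2 bits.
    uniform4 = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
    print(entropy(uniform4))   # 2.0

    # A skewed distribution is less surprising on average.
    skewed = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
    print(entropy(skewed))     # ~1.36 bits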

Average surprisal

-log2 p(x) tells us how surprising one particular symbol is. But on average, how surprising is a random variable?
Summation gives a weighted average:
  (2) H(X) = - Σ_x p(x) log2 p(x) = E(log2 (1/p(X)))
i.e., sum over all possible outcomes, multiplying surprisal (-log2 p(x)) by the probability of seeing that outcome (p(x)).
The amount of surprisal is the amount of information we need in order to not be surprised.
- H(X) = 0 if the outcome is certain
- H(X) = 1 if, out of 2 outcomes, both are equally likely

Entropy example

Roll an 8-sided die (or pick a character from an alphabet of 8 characters).
Because each outcome is equally likely, the entropy is:
  (3) H(X) = - Σ_{i=1}^{8} p(i) log2 p(i) = - Σ_{i=1}^{8} (1/8) log2 (1/8) = log2 8 = 3
i.e., 3 bits are needed to encode this 8-character language:

  A    E    I    O    U    F    G    H
  000  001  010  011  100  101  110  111

Entropy calculation: 6 characters

  char:  p    t    k    a    i    u
  prob:  1/8  1/4  1/8  1/4  1/8  1/8

  entropy: H(X) = - Σ_{i ∈ L} p(i) log2 p(i) = -[4 × (1/8) log2 (1/8) + 2 × (1/4) log2 (1/4)] = 2.5

Again, we need 3 bits per character with a fixed-length code:

  code:  000  001  010  011  100  101

BUT: since the distribution is NOT uniform, we can design a better code...

Designing a better code

  char:  p    t    k    a    i    u
  code:  100  00   101  01   110  111

- 0 as first digit: 2-digit character
- 1 as first digit: 3-digit character
- More likely characters get shorter codes
Task: Code the following word: KATUPATI. How many bits do we need on average? (See the sketch below.)

Joint entropy

For two random variables X & Y, joint entropy is the average amount of information needed to specify both values:
  (4) H(X, Y) = - Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log2 p(x, y)
How much do the two values influence each other? e.g., the average surprisal at seeing two POS tags next to each other.
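A small Python sketch of the coding discussion above. It is illustrative only and assumes the probabilities and variable-length code as reconstructed above (0-initial codewords of length 2, 1-initial codewords of length 3); it checks that the expected code length matches the 2.5-bit entropy and encodes the task word:

    from math import log2

    probs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
    code  = {"t": "00", "a": "01", "p": "100", "k": "101", "i": "110", "u": "111"}

    entropy = -sum(p * log2(p) for p in probs.values())
    avg_len = sum(probs[ch] * len(code[ch]) for ch in probs)
    print(entropy, avg_len)    # both 2.5 bits per character

    # Encoding the task word with this code:
    word = "katupati"
    encoded = "".join(code[ch] for ch in word)
    print(encoded, len(encoded) / len(word))   # 20 bits total, 2.5 bits/char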

Conditional entropy

Conditional entropy: how much extra information is needed to find Y's value, given that we know X?
  (5) H(Y|X) = Σ_{x ∈ X} p(x) H(Y|X = x)
             = Σ_{x ∈ X} p(x) [- Σ_{y ∈ Y} p(y|x) log2 p(y|x)]
             = - Σ_{x ∈ X} Σ_{y ∈ Y} p(x) p(y|x) log2 p(y|x)
             = - Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log2 p(y|x)

Chain rule

Chain rule for entropy:
  H(X, Y) = H(X) + H(Y|X)
  H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + ... + H(X_n|X_1, ..., X_{n-1})

Syllables in simplified Polynesian

Our model of simplified Polynesian earlier was too simple; joint entropy will help us with a better one.
Probabilities for letters on a per-syllable basis, using C and V as separate random variables, i.e., probabilities for consonants followed by a vowel (P(C, ·)) & vowels preceded by a consonant (P(·, V)):

  p    t    k    a    i    u
  1/8  3/4  1/8  1/2  1/4  1/4

On a per-letter basis, this would be the following (which we are not concerned about here):

  p     t    k     a    i    u
  1/16  3/8  1/16  1/4  1/8  1/8

Syllables in simplified Polynesian (2)

More specifically, for CV structures the joint probability P(C, V) is represented:

       p     t     k
  a   1/16  3/8   1/16
  i   1/16  3/16  0
  u   0     3/16  1/16

P(C, ·) & P(·, V) from before are marginal probabilities, e.g.:
  P(C = t) = Σ_v P(C = t, V = v) = 3/8 + 3/16 + 3/16 = 3/4

Polynesian Syllables

Find H(C, V), how surprised we are on average to see a particular syllable structure:
  (6) H(C, V) = H(C) + H(V|C) ≈ 1.061 + 1.375 ≈ 2.44
  (7) a. H(C) = - Σ_{c ∈ C} p(c) log2 p(c) ≈ 1.061
      b. H(V|C) = - Σ_{c ∈ C} Σ_{v ∈ V} p(c, v) log2 p(v|c) = 1.375
For the calculation of H(V|C): we can calculate the probability p(v|c) from the chart on the previous page, e.g., p(V = a | C = p) = 1/2 because 1/16 is half of 1/8.

Polynesian Syllables (2)

  H(C) = - Σ_{i ∈ L} p(i) log2 p(i)
       = -[2 × (1/8) log2 (1/8) + (3/4) log2 (3/4)]
       = 2 × (3/8) + (3/4)(log2 4 - log2 3)
       = 3/4 + (3/4)(2 - log2 3)
       = 3/4 + 6/4 - (3/4) log2 3
       = 9/4 - (3/4) log2 3
       ≈ 1.061
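The arithmetic above can be checked with a short Python sketch (standard library only; the joint table is the P(C, V) reconstruction shown above, and the variable names are illustrative), verifying the chain rule H(C, V) = H(C) + H(V|C):

    from math import log2

    joint = {                     # P(C, V) for simplified Polynesian syllables
        ("p", "a"): 1/16, ("p", "i"): 1/16, ("p", "u"): 0,
        ("t", "a"): 3/8,  ("t", "i"): 3/16, ("t", "u"): 3/16,
        ("k", "a"): 1/16, ("k", "i"): 0,    ("k", "u"): 1/16,
    }

    def H(dist):
        """Entropy in bits; terms with probability 0 contribute nothing."""
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    p_c = {}                      # marginal P(C), summing out V
    for (c, v), p in joint.items():
        p_c[c] = p_c.get(c, 0) + p

    H_C = H(p_c)                                       # ~1.061 bits
    H_V_given_C = -sum(p * log2(p / p_c[c])            # -sum p(c,v) log2 p(v|c)
                       for (c, v), p in joint.items() if p > 0)
    print(H_C, H_V_given_C, H(joint))                  # ~1.061, 1.375, ~2.44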

Polynesian Syllables (3)

  H(V|C) = - Σ_{x ∈ C} Σ_{y ∈ V} p(x, y) log2 p(y|x)
         = -[1/16 log2 1/2 + 3/8 log2 1/2 + 1/16 log2 1/2
            + 1/16 log2 1/2 + 3/16 log2 1/4 + 0 log2 0
            + 0 log2 0 + 3/16 log2 1/4 + 1/16 log2 1/2]
         = 1/16 log2 2 + 3/8 log2 2 + 1/16 log2 2
            + 1/16 log2 2 + 3/16 log2 4
            + 3/16 log2 4 + 1/16 log2 2
         = 11/8
Exercise: Verify this result by using H(C, V) = H(V) + H(C|V).

Mutual information

Mutual information (I(X; Y)): how related are two different random variables?
- = Amount of information one random variable contains about another
- = Reduction in uncertainty of one random variable based on knowledge of the other
Pointwise mutual information: mutual information for two points (not two distributions), e.g., a two-word collocation:
  (8) I(x; y) = log2 [p(x, y) / (p(x) p(y))]
i.e., the probability of x and y occurring together vs. independently.
Mutual information: on average, what is the common information between X and Y?
  (9) I(X; Y) = Σ_{x,y} p(x, y) log2 [p(x, y) / (p(x) p(y))]
Exercise: Calculate the pointwise mutual information of C = p and V = i from the simplified Polynesian example.

Mutual information (2)

Recall H(X|Y): the information needed to specify X when Y is known.
  (10) I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Take the information needed to specify X and subtract out the information when Y is known, i.e., the information shared by X and Y.
- When X and Y are independent, I(X; Y) = 0. (Note that log2 1 = 0 and that this happens when p(x, y) = p(x)p(y).)
- When X and Y are very dependent, I(X; Y) is high.
Exercise: Calculate I(C; V) for simplified Polynesian. (See the sketch below.)

Relative entropy

How far off are two distributions from each other? Relative entropy, or Kullback-Leibler (KL) divergence, provides such a measure (for distributions p and q):
  (11) D(p || q) = Σ_x p(x) log2 [p(x) / q(x)]
Informally, this is the distance between p and q.
If p is the correct distribution: the average number of bits wasted by using distribution q instead of p.
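A sketch for the two exercises above (the pointwise mutual information of C = p, V = i, and I(C; V)), reusing the same joint table as before. It is illustrative code, not from the slides; it also computes I(C; V) a second way via equation (10):

    from math import log2

    joint = {
        ("p", "a"): 1/16, ("p", "i"): 1/16, ("p", "u"): 0,
        ("t", "a"): 3/8,  ("t", "i"): 3/16, ("t", "u"): 3/16,
        ("k", "a"): 1/16, ("k", "i"): 0,    ("k", "u"): 1/16,
    }
    p_c, p_v = {}, {}             # marginals P(C) and P(V)
    for (c, v), p in joint.items():
        p_c[c] = p_c.get(c, 0) + p
        p_v[v] = p_v.get(v, 0) + p

    def pmi(c, v):
        """Pointwise mutual information of one (c, v) pair, eq. (8)."""
        return log2(joint[(c, v)] / (p_c[c] * p_v[v]))

    def mi():
        """Average mutual information I(C; V), eq. (9)."""
        return sum(p * log2(p / (p_c[c] * p_v[v]))
                   for (c, v), p in joint.items() if p > 0)

    def H(dist):
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    H_V_given_C = -sum(p * log2(p / p_c[c])
                       for (c, v), p in joint.items() if p > 0)

    print(pmi("p", "i"))          # pointwise MI of C=p and V=i
    print(mi())                   # I(C; V) via the double sum
    print(H(p_v) - H_V_given_C)   # same value via eq. (10): H(V) - H(V|C)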

Notes on relative entropy

  D(p || q) = Σ_x p(x) log2 [p(x) / q(x)]
- Often used as a distance measure in machine learning
- D(p || q) is always nonnegative
- D(p || q) = 0 if p = q
- D(p || q) = ∞ if there is an x ∈ X such that p(x) > 0 and q(x) = 0
- Not symmetric: D(p || q) ≠ D(q || p)

Divergence as mutual information

Our formula for mutual information was:
  (12) I(X; Y) = Σ_{x,y} p(x, y) log2 [p(x, y) / (p(x) p(y))]
meaning that it is the same as measuring how far a joint distribution (p(x, y)) is from independence (p(x)p(y)):
  (13) I(X; Y) = D(p(x, y) || p(x)p(y))

The noisy channel

The idea of information comes from Claude Shannon's description of the noisy channel.
Many natural language tasks can be viewed as: there is a noisy output, which we can observe, and we want to guess at what the input was... but it has been corrupted along the way.
Example: machine translation from English to French
- Assume that the true input is in French
- But all we can see is the garbled (English) output
- To what extent can we recover the original French?

The noisy channel theoretically

Some questions behind information theory:
- How much loss of information can we prevent when we are attempting to compress the data? i.e., how redundant does the data need to be? And what is the theoretical maximum amount of compression? (This is entropy.)
- How fast can data be transmitted perfectly? A channel has a specific capacity (defined by mutual information).
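The slides close by noting that channel capacity is defined via mutual information. As an added illustration that is not from the slides, here is a sketch for the standard binary symmetric channel: it maximizes I(X; Y) over input distributions by a simple grid search and compares the result to the known closed form 1 - H(eps), where eps is the flip probability. The identity I(X; Y) = H(X) + H(Y) - H(X, Y) used below follows from (10) and the chain rule; the function names are made up for this example.

    from math import log2

    def H(*ps):
        """Entropy in bits of the probabilities given as arguments."""
        return -sum(p * log2(p) for p in ps if p > 0)

    def mi_bsc(p1, eps):
        """I(X; Y) = H(X) + H(Y) - H(X, Y) for a binary symmetric channel
        with input P(X=1) = p1 and flip probability eps."""
        py1 = p1 * (1 - eps) + (1 - p1) * eps           # output P(Y=1)
        joint = [(1 - p1) * (1 - eps), (1 - p1) * eps,  # (x=0,y=0), (x=0,y=1)
                 p1 * eps,             p1 * (1 - eps)]  # (x=1,y=0), (x=1,y=1)
        return H(1 - p1, p1) + H(1 - py1, py1) - H(*joint)

    eps = 0.1
    capacity = max(mi_bsc(i / 1000, eps) for i in range(1001))
    print(capacity, 1 - H(eps, 1 - eps))   # both ~0.531 bits; max at uniform input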