CS546: Learning and NLP. Lecture 4: Mathematical and Computational Paradigms. Spring 2009 (February 3, 2009).

Lecture outline: some notes on statistics; Bayesian decision theory; concentration bounds; information theory.

Introduction. Consider the following examples:
1. I saw a girl it the park. (I saw a girl in the park.)
2. The from needs to be completed. (The form needs to be completed.)
3. I maybe there tomorrow. (I may be there tomorrow.)
4. Nissan's car plants grew rapidly in the last few years.
Questions: How do we identify that there is a problem? How do we fix it?

Introduction. In order to address these problems we would like to find a systematic way to study them. But: How do we model the problem? How do we use this model to solve the problem? The problems described are new to you. You have never seen the sentences (1)-(4). How can you be so sure that you can recognize a problem and correct it?

Introduction It is clear that there exist some regularities in language. How can we exploit these to detect and correct errors? There are general mathematical approaches that can be used to translate this intuition into a model that we can study.

What is coming next? It seems clear that in order to make decisions of this sort, one has to rely on previously observed data => use statistics. In many cases you do not know the model that governs the generation of the data (the study of distribution-free methods). The main interest here is to say something about the data just by estimating some statistics. Main results in learning theory are based on large deviation (or concentration) bounds. We will also cover a few key notions from Information Theory that are relevant in the study of NLP.

Types of Statistical Analysis. Descriptive statistics: in many cases we would like to collect data and learn some interesting things about it. Here, we want to summarize the data and describe it via different types of measurements (e.g., we did this for the Tom Sawyer book last time). Inferential statistics: we want to draw conclusions from the data. Often, to draw conclusions we need to make assumptions, or formulate a probability model for the data. We will start with descriptive statistics.

Descriptive Statistics. Recall that the sample mean of a data set $x_1, x_2, \dots, x_n$ is defined as $\bar{x} = \sum_{i=1}^n x_i / n$ and the sample standard deviation as $s = \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 / (n-1)}$. What can we say about the data if we see only the sample mean and sample standard deviation? Chebyshev's inequality states that for any value of $k \ge 1$, more than $100(1 - 1/k^2)$ percent of the data is within $ks$ of the mean. E.g., for $k = 2$, more than 75% of the data lies within $2s$ of the sample mean. No assumption about the distribution is needed.
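
To make the bound concrete, here is a minimal Python sketch (added for illustration; the data set is made up) comparing the fraction of points within $ks$ of the sample mean to the Chebyshev lower bound $1 - 1/k^2$:

```python
import random

random.seed(0)
data = [random.expovariate(1.0) for _ in range(1000)]   # any data set works; no distributional assumption

n = len(data)
mean = sum(data) / n
s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5   # sample standard deviation

for k in (1.5, 2.0, 3.0):
    frac = sum(abs(x - mean) < k * s for x in data) / n
    print(f"k={k}: fraction within k*s = {frac:.3f}, Chebyshev bound = {1 - 1 / k**2:.3f}")
```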

Chebyshev's Inequality for Sample Data. Let $\bar{x}$ and $s$ be the sample mean and sample standard deviation of the data set $x_1, x_2, \dots, x_n$, where $s > 0$. Let $S_k = \{i, 1 \le i \le n : |x_i - \bar{x}| < ks\}$ and let $N(S_k)$ be the number of elements in the set $S_k$. Then, for any $k \ge 1$, $\frac{N(S_k)}{n} > 1 - \frac{1}{k^2}$.

Proof. $(n-1)s^2 = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i \in S_k} (x_i - \bar{x})^2 + \sum_{i \notin S_k} (x_i - \bar{x})^2 \ge \sum_{i \notin S_k} (x_i - \bar{x})^2 \ge \sum_{i \notin S_k} k^2 s^2 = k^2 s^2 (n - N(S_k))$, where the first inequality follows since all terms being summed are nonnegative, and the second from the definition of $S_k$. Dividing both sides by $n k^2 s^2$: $\frac{N(S_k)}{n} \ge 1 - \frac{n-1}{n k^2} > 1 - \frac{1}{k^2}$.

Limitations and Discussion. Since Chebyshev's inequality holds universally, you should expect it to be loose for many distributions. Indeed, if the data is distributed normally, we know that 95% of the data lies within $\bar{x} \pm 2s$. An analogous bound can be formulated for the true mean and variance (not their estimates): what is the probability of observing a point $ks$ away from the true mean?

Sample Correlation Coefficient. Other questions in descriptive statistics: assume that our data set consists of pairs of values that have some relationship to each other, with the $i$th data point represented as the pair $(x_i, y_i)$. A question of interest might be whether the $x_i$'s are correlated with the $y_i$'s (that is, whether large $x$ values go together with large $y$ values, etc.).

Sample Correlation Coefficient. Let $s_x$ and $s_y$ denote, respectively, the sample standard deviations of the $x$ values and the $y$ values. The sample correlation coefficient, denoted $r$, of the data pairs $(x_i, y_i)$, $i = 1, \dots, n$, is defined by $r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$. When $r > 0$ we say that the sample data pairs are positively correlated, and when $r < 0$ we say they are negatively correlated.
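
A small sketch (not from the slides; the numbers are invented) computing $r$ directly from the definition above:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]          # roughly y = 2x, so r should be close to 1

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - xbar) ** 2 for x in xs)) * math.sqrt(sum((y - ybar) ** 2 for y in ys))
print(num / den)                         # sample correlation coefficient r
```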

Properties of r:
1. $-1 \le r \le 1$.
2. If for constants $a, b$, with $b > 0$, $y_i = a + b x_i$, $i = 1, \dots, n$, then $r = 1$.
3. If for constants $a, b$, with $b < 0$, $y_i = a + b x_i$, $i = 1, \dots, n$, then $r = -1$.
4. If $r$ is the sample correlation coefficient for the data pairs $(x_i, y_i)$, $i = 1, \dots, n$, then it is also the sample correlation coefficient for the data pairs $(a + b x_i, c + d y_i)$, $i = 1, \dots, n$, provided that $b, d$ are both positive or both negative.
(Side note: keep in mind that correlation and statistical dependence are very different things.)

Elements of Probability. I hope you know this, but just two slides to make sure... The conditional probability of event $E$ given that event $F$ has occurred ($P(F) > 0$) is defined to be $P(E \mid F) = \frac{P(E \cap F)}{P(F)}$. The Product Rule: $P(E \cap F) = P(E \mid F) P(F) = P(F \mid E) P(E)$. This rule can be generalized to multiple events. The Chain Rule: $P(E_1 \cap E_2 \cap \dots \cap E_n) = P(E_1)\, P(E_2 \mid E_1)\, P(E_3 \mid E_1 \cap E_2) \cdots P(E_n \mid \bigcap_{i=1}^{n-1} E_i)$.

Elements of Probability. Total Probability: for mutually exclusive events $F_1, F_2, \dots, F_n$ with $\sum_{i=1}^n P(F_i) = 1$, $P(E) = \sum_i P(E \mid F_i) P(F_i)$. The partitioning rule can also be phrased for conditional probabilities. Total Conditional Probability: for mutually exclusive events $F_1, F_2, \dots, F_n$ with $\sum_{i=1}^n P(F_i) = 1$, $P(E \mid G) = \sum_i P(E \mid F_i, G) P(F_i \mid G)$.

Bayesian Decision Theory. Let $E, F$ be events. As a simple consequence of the product rule we get the ability to swap the order of dependence between events. Bayes' Theorem (simple form): $P(F \mid E) = \frac{P(E \mid F)\, P(F)}{P(E)}$. General form of Bayes' Theorem: $P(F_i \mid E) = \frac{P(E \mid F_i)\, P(F_i)}{P(E)} = \frac{P(E \mid F_i)\, P(F_i)}{\sum_i P(E \mid F_i)\, P(F_i)}$. Bayes' rule, which is really a simple consequence of conditioning, is the basis for an important theory of decision making.

Bayesian Decision Theory. The key observation is that Bayes' rule can be used to determine, given a collection of events, which is the most likely event. Let $B_1, B_2, \dots, B_k$ be a collection of events. Assume that we have observed the event $A$, and we would like to know which of the $B_i$'s is most likely given $A$. This is given by Bayes' rule as: $i^* = \arg\max_i P(B_i \mid A) = \arg\max_i \frac{P(A \mid B_i)\, P(B_i)}{P(A)} = \arg\max_i P(A \mid B_i)\, P(B_i)$. Note: we do not need the probability $P(A)$ for doing this (this is important for some problems, e.g. structured prediction).
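
As a toy illustration of the decision rule (the priors and likelihoods below are invented), note that we only need to compare the unnormalized products $P(A \mid B_i) P(B_i)$:

```python
priors = {"B1": 0.6, "B2": 0.3, "B3": 0.1}              # P(B_i), hypothetical values
likelihood = {"B1": 0.05, "B2": 0.20, "B3": 0.50}       # P(A | B_i), hypothetical values

scores = {b: likelihood[b] * priors[b] for b in priors}  # unnormalized posteriors
print(scores)                                            # dividing by P(A) would not change the argmax
print(max(scores, key=scores.get))                       # the most likely B_i given A
```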

Basics of Bayesian Learning. The decision rule we gave is all there is to probabilistic decision making. Everything else comes from the details of defining the probability space and the events $A, B_i$. The way we model the problem, that is, the way we formalize the decision we are after, depends on our definition of the probabilistic model. The key issues here are (1) how to represent the probability distribution, and (2) how to represent the events. The probability space can be defined over sentences and their parse trees. The probability space can be defined over English sentences and their Chinese translations.

Key differences between probabilistic approaches. The key questions, and consequently the differences between models and computational approaches, boil down to: (1) what are the probability spaces? (2) how are they represented?

Bayes decision making We assume that we observe data S, and have a hypothesis space H, from which we want to choose the Bayes Optimal one, h. We assume a probability distribution over the class of hypotheses H. In addition, we will need to know how this distribution depends on the data observed.

Bayes decision making.
1. $P(h)$ - the prior probability of a hypothesis $h$ (background knowledge).
2. $P(S)$ - the probability that this sample of the data is observed (evidence; does not depend on the hypothesis).
3. $P(S \mid h)$ - the probability of observing the sample $S$, given that the hypothesis $h$ holds.
4. $P(h \mid S)$ - the posterior probability of $h$.

Goal. Typically, we are interested in computing $P(h \mid S)$ or, at least, providing a ranking of $P(h \mid S)$ for different hypotheses $h \in H$. However, the Bayesian approach does not commit to a single hypothesis $y = h(x)$; instead it predicts $y^* = \arg\max_y P(y \mid x) = \arg\max_y \sum_{h \in H} [y = h(x)]\, P(h \mid S)$. Instead of selecting a single hypothesis you use the predictive distribution. This is not always computationally feasible.

Typical Learning Scenario. The learning step here is inducing the best $h$ from the data we observe. The learner is trying to find the most probable $h \in H$, given the observed data.
1. Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis: $h_{MAP} = \arg\max_{h \in H} P(h \mid S) = \arg\max_{h \in H} \frac{P(S \mid h)\, P(h)}{P(S)} = \arg\max_{h \in H} P(S \mid h)\, P(h)$.
2. In some cases, we assume that the hypotheses are equally probable a priori (a non-informative prior): $P(h_i) = P(h_j)$ for all $h_i, h_j \in H$. In this case: $h_{ML} = \arg\max_{h \in H} P(S \mid h)$ (the maximum likelihood hypothesis).

Example. A given coin is either fair or has a 60% bias in favor of Heads. Decide what the bias of the coin is. Two hypotheses: $h_1: P(H) = 0.5$; $h_2: P(H) = 0.6$. Prior: $P(h_1) = 0.75$, $P(h_2) = 0.25$. Now observe the data: say, we tossed once and got H. Likelihoods: $P(D \mid h_1) = 0.5$; $P(D \mid h_2) = 0.6$. $P(D) = P(D \mid h_1) P(h_1) + P(D \mid h_2) P(h_2) = 0.5 \cdot 0.75 + 0.6 \cdot 0.25 = 0.525$. Posteriors: $P(h_1 \mid D) = P(D \mid h_1) P(h_1) / P(D) = 0.5 \cdot 0.75 / 0.525 = 0.714$; $P(h_2 \mid D) = P(D \mid h_2) P(h_2) / P(D) = 0.6 \cdot 0.25 / 0.525 = 0.286$.
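
The same computation in a few lines of Python (this just reproduces the numbers on the slide):

```python
priors = {"h1": 0.75, "h2": 0.25}        # P(h1): fair coin, P(h2): 60% heads
likelihood = {"h1": 0.5, "h2": 0.6}      # P(D | h) for observing a single Head

p_D = sum(likelihood[h] * priors[h] for h in priors)
posterior = {h: likelihood[h] * priors[h] / p_D for h in priors}
print(p_D)          # 0.525
print(posterior)    # {'h1': 0.714..., 'h2': 0.285...}
```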

Another Example. Given a corpus, what is the probability of the word "it" in the corpus? Bag-of-words model: let $m$ be the number of words in the corpus and let $k$ be the number of occurrences of "it". The estimated probability is $k/m$.

Example. Modeling: assume a bag-of-words model with a binomial distribution. The problem becomes: you toss a $(p, 1-p)$ coin $m$ times and get $k$ Heads and $m-k$ Tails. What is $p$? The probability of the data observed is $P(D \mid p) = p^k (1-p)^{m-k}$. We want to find the value of $p$ that maximizes this quantity: $p^* = \arg\max_p P(D \mid p) = \arg\max_p \log P(D \mid p) = \arg\max_p [k \log p + (m-k) \log(1-p)]$. Setting the derivative to zero, $\frac{d \log P(D \mid p)}{dp} = \frac{k}{p} - \frac{m-k}{1-p} = 0$, and solving for $p$ gives $p = k/m$.
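
A quick numerical sanity check (added for illustration; $m$ and $k$ are made up) that the log-likelihood is indeed maximized at $p = k/m$:

```python
import math

m, k = 100, 37

def log_likelihood(p):
    return k * math.log(p) + (m - k) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]        # search p on a fine grid
p_hat = max(grid, key=log_likelihood)
print(p_hat, k / m)                              # both are (approximately) 0.37
```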

Bayesian Approach vs. Maximum Likelihood. Problem: the taxicab problem (Minka, 2001): viewing a city from the train, you see a taxi numbered $x$. Assuming taxis are consecutively numbered, how many taxis are in the city? We will discuss how to tackle this task with Bayesian inference. You can imagine similar problems in NLP...

Maximum Likelihood Approach. We assume that the distribution is uniform: $p(x \mid a) = U(0, a) = \begin{cases} 1/a & \text{if } 0 \le x \le a \\ 0 & \text{otherwise} \end{cases}$. Let us consider a slightly more general task: we observed $N$ taxis, and $m$ is the largest id among the observed ones. Then $p(D \mid a) = \begin{cases} 1/a^N & \text{if } a \ge m \\ 0 & \text{otherwise} \end{cases}$. To maximize $p(D \mid a)$ we select $a = m$. In other words, our best hypothesis would be that we have already observed the taxi with the largest id. The MAP approach (i.e., selecting a single hypothesis using a reasonable prior $p(a)$) would not help here either.

What to do? Full Bayesian inference, i.e., summation over possible hypotheses weighted by $p(a \mid D)$, will give higher estimates of $a$. Experiments with human concept learning also demonstrated that people give higher estimates in similar situations. Let us consider this example in detail.

Bayesian Analysis. We use the Pareto distribution as our prior: $p(a) = \mathrm{Pa}(b, K) = \begin{cases} \frac{K b^K}{a^{K+1}} & \text{if } a \ge b \\ 0 & \text{otherwise} \end{cases}$. It is the conjugate prior for the uniform distribution: a prior is conjugate to the likelihood function if the posterior distribution belongs to the same family of distributions as the prior. In our case, $p(a \mid D)$ is also a Pareto distribution. This allows us to perform analytical integration instead of numerical estimation. How do we choose the form of a prior distribution?

Bayesian Analysis. We use the Pareto distribution as our prior: $p(a) = \mathrm{Pa}(b, K) = \begin{cases} \frac{K b^K}{a^{K+1}} & \text{if } a \ge b \\ 0 & \text{otherwise} \end{cases}$. Properties: this density asserts that $a$ must be greater than $b$, but not much greater; $K$ controls how much greater.

Bayesian Analysis. $p(a) = \mathrm{Pa}(b, K) = \begin{cases} \frac{K b^K}{a^{K+1}} & \text{if } a \ge b \\ 0 & \text{otherwise} \end{cases}$. If $K \to 0$ and $b \to 0$ we get a non-informative prior, i.e., the prior becomes essentially uniform (formally it cannot be uniform over an unbounded set and be a valid distribution, but this holds in the limit). It makes sense to use non-informative priors if you have no prior knowledge.

Bayesian Analysis. Prior: $p(a) = \mathrm{Pa}(b, K) = \begin{cases} \frac{K b^K}{a^{K+1}} & \text{if } a \ge b \\ 0 & \text{otherwise} \end{cases}$. Likelihood: $p(D \mid a) = \begin{cases} 1/a^N & \text{if } a \ge m \\ 0 & \text{otherwise} \end{cases}$. We can write the joint distribution of $D$ and $a$: $p(D, a) = p(D \mid a)\, p(a) = \begin{cases} \frac{K b^K}{a^{N+K+1}} & \text{if } a \ge b \text{ and } a \ge m \\ 0 & \text{otherwise} \end{cases}$.

Bayesian Analysis. Joint distribution of $D$ and $a$: $p(D, a) = \begin{cases} \frac{K b^K}{a^{N+K+1}} & \text{if } a \ge b \text{ and } a \ge m \\ 0 & \text{otherwise} \end{cases}$. Now let us compute the evidence: $p(D) = \int_0^\infty p(D, a)\, da = \int_{\max(b, m)}^\infty \frac{K b^K}{a^{N+K+1}}\, da = \begin{cases} \frac{K}{(N+K)\, b^N} & \text{if } m \le b \\ \frac{K b^K}{(N+K)\, m^{N+K}} & \text{if } m > b \end{cases}$.

Bayesian Analysis. Evidence: $p(D) = \begin{cases} \frac{K}{(N+K)\, b^N} & \text{if } m \le b \\ \frac{K b^K}{(N+K)\, m^{N+K}} & \text{if } m > b \end{cases}$. (The slide shows a plot of the evidence as a function of $m$ with a $\mathrm{Pa}(1, 2)$ prior; the plot is not reproduced here.)

Bayesian Analysis. Evidence: $p(D) = \begin{cases} \frac{K}{(N+K)\, b^N} & \text{if } m \le b \\ \frac{K b^K}{(N+K)\, m^{N+K}} & \text{if } m > b \end{cases}$. Joint distribution of $D$ and $a$: $p(D, a) = p(D \mid a)\, p(a) = \begin{cases} \frac{K b^K}{a^{N+K+1}} & \text{if } a \ge b \text{ and } a \ge m \\ 0 & \text{otherwise} \end{cases}$. Let us derive the posterior distribution: $p(a \mid D) = p(D, a) / p(D) = \mathrm{Pa}(c, N + K)$, where $c = \max(m, b)$.

Bayesian Analysis: what is the answer? $p(a \mid D) = \mathrm{Pa}(c, N + K)$. Back to the taxicab problem. Depending on the prior we will get different answers. If we have no prior knowledge we can use the non-informative prior ($b \to 0$, $K \to 0$); then we get $p(a \mid D) = \mathrm{Pa}(m, N)$. The mode of the posterior is $m$ (that is why the MAP approach does not work either). To minimize the mean squared error we should use the mean of the posterior, $\frac{Nm}{N-1}$, but it is infinite for $N = 1$. To minimize the absolute error we should use the median; this is $y = 2^{1/N} m$, i.e., $2m$ for $N = 1$.
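
A small sketch computing the posterior summaries derived above for the non-informative prior, where the posterior is $\mathrm{Pa}(m, N)$ (the values of $m$ and $N$ below are example inputs):

```python
def taxicab_estimates(m, N):
    mode = m                                             # posterior mode (same as ML/MAP)
    mean = N * m / (N - 1) if N > 1 else float("inf")    # posterior mean Nm/(N-1)
    median = 2 ** (1.0 / N) * m                          # posterior median 2^(1/N) * m
    unbiased = (N + 1) / N * m                           # unbiased estimator (next slide)
    return mode, mean, median, unbiased

for N in (1, 2, 5):
    print(N, taxicab_estimates(m=100, N=N))
# For N = 1: mode 100, mean inf, median 200, unbiased estimate 200.
```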

Bayesian Analysis: what is the answer? What about an unbiased estimator of $a$? The cumulative distribution function of $m$ is $P(\max \le m \mid a) = P(\text{all } N \text{ samples} \le m \mid a) = (m/a)^N$. Get the pdf of $m$ by differentiation: $p(m \mid a) = \frac{d}{dm} P(\max \le m \mid a) = \frac{N}{a} (m/a)^{N-1}$. Then $E[m] = \int_0^a p(m \mid a)\, m\, dm = \int_0^a N (m/a)^N\, dm = \frac{Na}{N+1}$. So an unbiased estimator of $a$ is $\frac{N+1}{N} m$. For $N = 1$, the median and the unbiased estimator agree on $2m$.

Bayesian Analysis: which next taxi should we expect? To answer this question we need to construct the predictive distribution: $p(x \mid D) = \int_a p(x \mid a)\, p(a \mid D)\, da = \begin{cases} \frac{N+K}{(N+K+1)\, c} & \text{if } x \le c \\ \frac{(N+K)\, c^{N+K}}{(N+K+1)\, x^{N+K+1}} & \text{if } x > c \end{cases}$. Note about conjugate priors: this is the same as the evidence for one sample $x$ with prior $\mathrm{Pa}(c, N + K)$. For the original taxicab problem ($N = 1$ and $K \to 0$) we get: $p(x \mid m) = \begin{cases} \frac{1}{2m} & \text{if } x \le m \\ \frac{m}{2 x^2} & \text{if } x > m \end{cases}$.

Bayesian Analysis: Discussion. The example demonstrated that we cannot always rely on ML and MAP, or at least we need to be careful. It is often possible to perform full Bayesian inference, but we should be clever: e.g., use conjugate priors (there are tables of them; not all distributions have them). There is also criticism of conjugate priors. Similar problems arise in NLP (though we do not normally have uniform distributions there; recall the previous lecture).

Representation of probability distributions. Probability distributions can be represented in many ways. The representation is crucially important since it determines what we can say about a distribution and what can be (efficiently) computed about it. A distribution can be represented explicitly, by specifying the probability of each element in the space. It can be specified parametrically: binomial distribution, normal distribution, geometric distribution, etc. It can be represented graphically. It can be specified by describing a process that generates elements in the space. Depending on the process, it may be easy or hard to actually compute the probability of an event in the space.

Example of Representation Consider the following model: We have five characters, A,B,C,D,E. At any point in time we can generate: A, with probability 0.4, B, with probability 0.1, C, with probability 0.2, D, with probability 0.1, E, with probability 0.2. E represents end of sentence. A sentence in this language could be: AAACDCDAADE An unlikely sentence in this language would be: DDDBBBDDDBBBDDDBBB. It is clear that given the model we can compute the probability of each string and make a judgment as to which sentence is more likely.
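
A tiny sketch (added for illustration) computing string probabilities under this model, where each symbol is generated independently with the probabilities above:

```python
p = {"A": 0.4, "B": 0.1, "C": 0.2, "D": 0.1, "E": 0.2}

def string_prob(s):
    prob = 1.0
    for ch in s:
        prob *= p[ch]
    return prob

print(string_prob("AAACDCDAADE"))          # the "likely" sentence from the slide
print(string_prob("DDDBBBDDDBBBDDDBBB"))   # the "unlikely" one: many low-probability symbols
```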

Another model of Language: Markovian. Models can be more complicated; e.g., a probabilistic finite state model: the start state is A with probability 0.4, B with probability 0.4, C with probability 0.2. From A you can go to A, B or C with probabilities 0.5, 0.2, 0.3; from B you can go to A, B or C with probabilities 0.5, 0.2, 0.3; from C you can go to A, B or C with probabilities 0.5, 0.2, 0.3; etc. In this way we can easily define models that are intractable. When you think about language, you need, of course, to think about what A, B, ... represent: words, POS tags, structures of some sort, ideas?..

Learning Parameters. Often we are only given a family of probability distributions (defined parametrically, as a process, or by specifying some conditions as above). A key problem in this case might be to choose one specific distribution in this family. This can be done by observing data, by specifying some property of the distribution that we care about and that might distinguish a unique element in the family, or both.

Probabilistic Inequalities and the Law of Large Numbers. As we will see, large deviation bounds are at the basis of the direct learning approach. When one wants to directly learn how to make predictions, the approach is basically as follows: look at many examples; discover some regularities in the language; use these regularities to construct a prediction policy.

Generalization. The key question the direct learning approach needs to address, given the distribution-free assumption, is that of generalization: why would the prediction policy induced from the empirical data be valid on new, previously unobserved data? Assumption: throughout the process, we observe examples sampled from some fixed but unknown distribution, i.e., the distribution that governs the generation of examples during the learning process is the one that governs it during the evaluation process (identically distributed). Assumption: data instances are independently generated. Two basic results from statistics are useful in learning theory: Chernoff bounds and union bounds.

Chernoff Bounds. Consider a binary random variable $X$ (e.g., the result of a coin toss) which has probability $p$ of being heads (1). Two important questions are: how do we estimate $p$, and how confident are we in this estimate? Consider $m$ samples, $\{x_1, x_2, \dots\}$, drawn independently from $X$. A natural estimate of the underlying probability $p$ is the relative frequency of $x_i = 1$, that is, $\hat{p} = \sum_i x_i / m$. Notice that this is also the maximum likelihood estimate under the binomial distribution.

Chernoff Bounds. The law of large numbers: $\hat{p}$ will converge to $p$ as $m$ grows. How quickly? The Chernoff theorem gives a bound on this.

Chernoff Bounds: Theorem. For all $p \in [0, 1]$, $\epsilon > 0$: $P[\,|p - \hat{p}| > \epsilon\,] \le 2 e^{-2 m \epsilon^2}$, where the probability $P$ is taken over the distribution of training samples of size $m$ generated with underlying parameter $p$. This is an exponential rate of convergence. Some numbers: take $m = 1000$, $\epsilon = 0.05$; then $e^{-2 m \epsilon^2} = e^{-5} \approx 1/148$. That is, if we repeatedly take samples of size 1000 and compute the relative frequency, then in 147 out of 148 samples our estimate will be within 0.05 of the true probability. We need to be quite unlucky for that not to happen. There exist other large deviation bounds (aka concentration bounds): Hoeffding, McDiarmid, Azuma, etc. How do you use a bound of this sort to justify generalization?
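
A simulation sketch (added for illustration; $p = 0.3$ is an arbitrary choice) matching the numbers above: draw many samples of size $m = 1000$ and count how often the relative frequency deviates from $p$ by more than 0.05:

```python
import math
import random

random.seed(0)
p, m, eps, trials = 0.3, 1000, 0.05, 2000

bad = 0
for _ in range(trials):
    p_hat = sum(random.random() < p for _ in range(m)) / m
    if abs(p_hat - p) > eps:
        bad += 1

print(bad / trials)                      # empirical probability of a large deviation (very small)
print(2 * math.exp(-2 * m * eps ** 2))   # the Chernoff bound 2*exp(-2*m*eps^2), about 0.0135
```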

Union Bounds. A second useful result, a very simple one in fact, is the union bound: for any $n$ events $\{A_1, A_2, \dots, A_n\}$ and for any distribution $P$ whose sample space includes all $A_i$, $P[A_1 \cup A_2 \cup \dots \cup A_n] \le \sum_i P[A_i]$. The union bound follows directly from the axioms of probability theory. For example, for $n = 2$: $P[A_1 \cup A_2] = P[A_1] + P[A_2] - P[A_1 \cap A_2] \le P[A_1] + P[A_2]$. The general result follows by induction on $n$.

Today's summary. Descriptive statistics (e.g., Chebyshev's bound); Bayesian decision theory and Bayesian inference; concentration bounds. We will talk about information theory next time, but we will also start smoothing/language modeling.

Assignments. Reminder: Feb 13 (midnight) is the deadline for the first critical review. Reminder: let me know about your choice of a paper to present by next Wednesday (Feb 4).

Information Theory: Noisy Channel Model. One of the important computational abstractions made in NLP is that of the noisy channel model (Shannon). It is really just an application of Bayes' rule, but it is sometimes a useful way to think about and model problems. Shannon: modeling the process of communication across a channel, optimizing throughput and accuracy; a tradeoff between compression and redundancy. Assumption: a probabilistic process governs the generation of language (messages). A probabilistic corruption process, modeled via the notion of a channel, affects the generated language structure before it reaches the listener.

Noisy Channel: Definitions. Let $I$ be the input space, a space of some structures in the language. Let $P$ be a probability distribution over $I$; that is, this is our language model. The noisy channel has a probability distribution $P(o \mid i)$ of outputting the structure $o$ when receiving the input $i$. Schematically: input ($i$) -> noisy channel -> $o$ -> decoder -> $\hat{i}$.

Noisy Channel: more. The goal is to decode: $\hat{i} = \arg\max_{i \in I} P(i \mid o) = \arg\max_{i \in I} P(o \mid i)\, P(i) / P(o) = \arg\max_{i \in I} P(o \mid i)\, P(i)$. In addition to the language model, we also need to know the distribution of the noisy channel. Example, translation: according to this model, English is nothing but corrupted French (true text ($i$): French, output ($o$): English).
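
A toy noisy-channel decoder (a hypothetical spelling-correction example, not from the slides; all probabilities are invented): pick the intended word $i$ maximizing $P(o \mid i) P(i)$ for an observed string $o$:

```python
language_model = {"form": 0.004, "from": 0.010, "forum": 0.001}   # P(i), invented unigram LM
channel_model = {"form": 0.05, "from": 0.04, "forum": 0.001}      # P(o = "frm" | i), invented

observed = "frm"
scores = {i: channel_model[i] * language_model[i] for i in language_model}
decoded = max(scores, key=scores.get)
print(decoded, scores)        # the language model and the channel model jointly pick the source
```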

More examples

Noisy Channel Model: Conclusion. The key achievement of Shannon: the ability to quantify the theoretically best data compression possible, the best transmission rate possible, and the relation between them. Mathematically, these correspond to the entropy of the transmitted distribution and the capacity of the channel. Roughly: each channel is associated with a capacity, and if you transmit your information at a rate that is lower than it, the decoding error can be made as small as desired (Shannon's theorem, 1948). The channel capacity is given using the notion of the mutual information between the random variables $I$ and $O$. That is the reason that entropy and mutual information became important notions in NLP.

Entropy. For a given random variable $X$, how much information is conveyed in the message that $X = x$? How do we quantify it? We can first agree that the amount of information in the message that $X = x$ should depend on how likely it was that $X$ would equal $x$. Examples: there is more information in a message that there will be a tornado tomorrow than in a message that there will be no tornado (a rare event); if $X$ represents the sum of two fair dice, then there seems to be more information in the message that $X = 12$ than there would be in the message that $X = 7$, since the former event happens with probability 1/36 and the latter with probability 1/6.

Data Transmission Intuition. A good compression algorithm (e.g., Huffman coding) encodes more frequent events with shorter codes; for less frequent events more bits need to be transmitted (longer messages contain more information). Does language compress information in a similar way? To some degree: rare words are generally longer (Zipf, 1938; Guiter, 1974), and language often evolves by replacing frequent long words with shorter words (e.g., bicycle -> bike). There are interesting analyses of the application of information theory to language evolution (e.g., Plotkin and Nowak, 2000).

Information: Mathematical Form. Let us denote by $I(p)$ the amount of information contained in the message that an event whose probability is $p$ has occurred. Clearly $I(p)$ should be a nonnegative, decreasing function of $p$. To determine its form, let $X$ and $Y$ be independent random variables, and suppose that $P\{X = x\} = p$ and $P\{Y = y\} = q$. Knowledge that $X$ equals $x$ does not affect the probability that $Y$ will equal $y$, so the amount of information in the message that $X = x$ and $Y = y$ is $I(p) + I(q)$. On the other hand, $P\{X = x, Y = y\} = P\{X = x\} P\{Y = y\} = pq$, which implies that this amount of information is $I(pq)$. Therefore, the function $I$ should satisfy the identity $I(pq) = I(p) + I(q)$.

Information: Mathematical Form. $I(pq) = I(p) + I(q)$. If we define $G(p) = I(2^{-p})$ we get from the above that $G(p + q) = I(2^{-(p+q)}) = I(2^{-p} \cdot 2^{-q}) = I(2^{-p}) + I(2^{-q}) = G(p) + G(q)$. Hence $G(p) = cp$ for some constant $c$, i.e., $I(2^{-p}) = cp$. Letting $z = 2^{-p}$, we get $I(z) = -c \log_2(z)$ for some positive constant $c$. We take $c = 1$ and say that the information is measured in bits.

Entropy. Consider a random variable $X$ which takes values $x_1, \dots, x_n$ with respective probabilities $p_1, \dots, p_n$. Since $-\log(p_i)$ represents the information conveyed by the message that $X$ is equal to $x_i$, it follows that the expected amount of information that will be conveyed when the value of $X$ is transmitted is $H(X) = -\sum_{i=1}^n p_i \log_2(p_i)$. This quantity is known in information theory as the entropy of the random variable $X$.

Entropy: Formal Definition. Let $p$ be a probability distribution over a discrete domain $X$. The entropy of $p$ is $H(p) = -\sum_{x \in X} p(x) \log p(x)$. Notice that we can think of the entropy as $H(p) = E_p[-\log p(x)]$. Since $0 \le p(x) \le 1$, we have $1/p(x) \ge 1$, so $\log(1/p(x)) \ge 0$ and therefore $0 \le H(p) < \infty$.

Entropy: Intuition. The entropy of a random variable measures its uncertainty: it takes the minimal value of 0 if one event has probability 1 and the others 0 (no uncertainty), and it is maximal and equal to $\log |X|$ when the distribution is uniform. We will see the notion of entropy in three different contexts: as a measure of the information content of a distributional representation; as a measure of information (how much information about $X$ do we get from knowing about $Y$?); and as a way to characterize and choose probability distributions and, ultimately, classifiers.

Entropy: Example. Consider a simple language $L$ over a vocabulary of size 8. Assume that the distribution over $L$ is uniform: $p(l) = 1/8$ for all $l \in L$. Then $H(L) = -\sum_{l=1}^8 p(l) \log p(l) = -\log \frac{1}{8} = 3$. The most efficient encoding for transmission uses 3 bits for each symbol: an optimal code sends a message of probability $p$ in $-\log p$ bits. If the distribution over $L$ is $\{1/2, 1/8, 1/8, 1/8, 1/32, 1/32, 1/32, 1/32\}$, then $H(L) = -\sum_{l=1}^8 p(l) \log p(l) = \frac{1}{2} \cdot 1 + 3 \cdot (\frac{1}{8} \cdot 3) + 4 \cdot (\frac{1}{32} \cdot 5) = \frac{1}{2} + \frac{9}{8} + \frac{20}{32} = 2.25$.
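
Checking the two numbers above in Python (3 bits for the uniform case, 2.25 bits for the skewed distribution):

```python
import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 8, 1 / 8, 1 / 8, 1 / 32, 1 / 32, 1 / 32, 1 / 32]
print(entropy(uniform))   # 3.0
print(entropy(skewed))    # 2.25
```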

Joint and Conditional Entropy. Let $p$ be a joint probability distribution over a pair of discrete random variables $X, Y$. The average amount of information needed to specify both their values is the joint entropy: $H(p) = H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x, y)$. The conditional entropy of $Y$ given $X$ expresses the average additional information one needs to supply about $Y$ given that $X$ is known: $H(Y \mid X) = \sum_{x \in X} p(x) H(Y \mid X = x) = \sum_{x \in X} p(x) \left[ -\sum_{y \in Y} p(y \mid x) \log p(y \mid x) \right] = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(y \mid x)$.

Chain Rule. The chain rule for entropy is given by $H(X, Y) = H(X) + H(Y \mid X)$. More generally: $H(X_1, X_2, \dots, X_n) = H(X_1) + H(X_2 \mid X_1) + \dots + H(X_n \mid X_1, X_2, \dots, X_{n-1})$.

Selecting a Distribution: A simple example. Consider the problem of selecting a probability distribution over $Z = X \times C$, where $X = \{x, y\}$ and $C = \{0, 1\}$. There could be many distributions over this space of 4 elements. Assume that by observing a sample $S$ we determine that $p(C = 0) = p(x, 0) + p(y, 0) = 0.6$. We need to fill this table:

        C = 0   C = 1   total
X = x     ?       ?       ?
X = y     ?       ?       ?
total    0.6     0.4     1.0

Selecting a Distribution: A simple example. With the constraint $p(C = 0) = p(x, 0) + p(y, 0) = 0.6$, one way to fill the table is:

        C = 0   C = 1   total
X = x    0.3     0.2     0.5
X = y    0.3     0.2     0.5
total    0.6     0.4     1.0

In this case $H(p) = -\sum_z p(z) \log_2 p(z) = -(0.3 \log_2 0.3 + 0.3 \log_2 0.3 + 0.2 \log_2 0.2 + 0.2 \log_2 0.2) = 1.366 / \ln 2 \approx 1.97$.

Selecting a Distribution: A simple example. Another way to fill the table, consistent with the same constraint:

        C = 0   C = 1   total
X = x    0.5     0.3     0.8
X = y    0.1     0.1     0.2
total    0.6     0.4     1.0

In this case $H(p) = -\sum_z p(z) \log_2 p(z) = -(0.5 \log_2 0.5 + 0.3 \log_2 0.3 + 0.1 \log_2 0.1 + 0.1 \log_2 0.1) = 1.16 / \ln 2 \approx 1.68$.

Selecting a Distribution: A simple example. In the first case we distributed the unknown probability uniformly: $H(p) \approx 1.97$. In the second case it was not uniform: $H(p) \approx 1.68$. Note: the maximum entropy for a distribution over 4 elements is $-\log \frac{1}{4} = 2$.

Maximum Entropy Principle. We should try to select the most uniform distribution which satisfies the constraints we observe: that is, select the distribution which has maximum entropy. Intuitively, this minimizes the number of additional assumptions one makes about the distribution. In general, there is no closed-form solution for this problem. The distribution obtained by the MaxEnt principle is the same as the one obtained by maximum likelihood estimation for a certain restricted class of models; we will talk about this when we discuss logistic regression. This is important because we often compute from data the statistics of individual features rather than the statistics of their combinations. How do we combine them to get a model?

Mutual Information. We showed $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$, which implies $H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$. This difference is called the mutual information between $X$ and $Y$ and is denoted $I(X; Y)$: the reduction in the uncertainty of one random variable due to knowing about the other. $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y) = \sum_x p(x) \log \frac{1}{p(x)} + \sum_y p(y) \log \frac{1}{p(y)} - \sum_{x,y} p(x, y) \log \frac{1}{p(x, y)} = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$.

Mutual Information. $I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$. $I(X; Y)$ can best be thought of as a measure of independence between $X$ and $Y$: if $X$ and $Y$ are independent then $I(X; Y) = 0$; if $X$ and $Y$ are dependent then $I(X; Y)$ grows with the degree of dependence, but also with the entropy. $I(X; X) = H(X) - H(X \mid X) = H(X)$. It is also possible to talk about conditional mutual information: $I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$.
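
A short sketch (added for illustration) computing $I(X; C)$ for the two table fillings from the max-ent example earlier: the uniform filling makes $X$ and $C$ independent, so its mutual information is 0, while the skewed filling gives a small positive value:

```python
import math

def mutual_information(joint):
    px, pc = {}, {}
    for (x, c), p in joint.items():
        px[x] = px.get(x, 0.0) + p       # marginal p(x)
        pc[c] = pc.get(c, 0.0) + p       # marginal p(c)
    return sum(p * math.log2(p / (px[x] * pc[c])) for (x, c), p in joint.items() if p > 0)

uniform_fill = {("x", 0): 0.3, ("x", 1): 0.2, ("y", 0): 0.3, ("y", 1): 0.2}
skewed_fill = {("x", 0): 0.5, ("x", 1): 0.3, ("y", 0): 0.1, ("y", 1): 0.1}
print(mutual_information(uniform_fill))   # 0.0: the joint factorizes into p(x) p(c)
print(mutual_information(skewed_fill))    # > 0: X and C are dependent under this filling
```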

Comments on MI. We have defined mutual information between random variables. In many applications this notion is abused and people use it to measure point-wise mutual information. The point-wise mutual information between two points $x$ and $y$ is defined to be $I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$. This can be thought of as a statistical measure relating the events $x$ and $y$.

Comments on MI. For example, consider looking at a pair of consecutive words $w$ and $w'$: $p(w)$ is the frequency of the word $w$ in a corpus, $p(w')$ the frequency of the word $w'$, and $p(w, w')$ the frequency of the pair $w, w'$ in the corpus. Then $I(w, w')$ can be read as the amount of information that the occurrence of $w$ at position $i$ provides about the occurrence of $w'$ at position $i + 1$. This is not a good measure, and we will see examples of why (statistics such as $\chi^2$ better capture what we normally want).

Comments on MI: An example.
1. Assume $w, w'$ occur only together. Then $I(w, w') = \log \frac{p(w, w')}{p(w)\, p(w')} = \log \frac{p(w)}{p(w)\, p(w')} = \log \frac{1}{p(w')}$. So the mutual information of perfectly dependent bigrams increases as they become rarer.
2. If they are completely independent, $I(w, w') = \log \frac{p(w, w')}{p(w)\, p(w')} = \log 1 = 0$.
So it can be thought of as a good measure of independence, but not a good measure of dependence, since it depends on the scores of the individual words.

Relative Entropy or Kullback-Leibler Divergence. Let $p, q$ be two probability distributions over a discrete domain $X$. The relative entropy between $p$ and $q$ is $D(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$. Notice that $D(p \| q)$ is not symmetric. It can be viewed as $D(p \| q) = E_p \log \frac{p(x)}{q(x)}$, the expectation according to $p$ of $\log p/q$. It is therefore unbounded, and it is not defined if $p$ gives positive support to instances to which $q$ does not. A term is taken to be 0 when $p(x) = 0$.

KL Divergence: Properties. Lemma: for any probability distributions $p, q$, $D(p \| q) \ge 0$, and equality holds if and only if $p = q$. The KL divergence is not a metric: it is not symmetric and does not satisfy the triangle inequality. We can still think about it as a distance between distributions. MI can be thought of as the KL divergence between the joint probability distribution on $X, Y$ and the product distribution defined on $X, Y$: $I(X; Y) = D(p(x, y) \,\|\, p(x)\, p(y))$. In other words, it measures how far the true joint distribution is from independence.
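
A small illustration of the asymmetry (the two distributions are made up):

```python
import math

def kl(p, q):
    # D(p || q); assumes q(x) > 0 wherever p(x) > 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(kl(p, q), kl(q, p))   # two different non-negative numbers: D is not symmetric
```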

Cross Entropy. We can now think about the notion of the entropy of a language. Assume that you know the distribution; then you can think of the entropy as a measure of how surprised you are when you observe the next character in the language. Let $p$ be the true distribution, and $q$ some model that we estimate from a sample generated by $p$. The cross entropy between a random variable $X$ distributed according to $p$ and another probability distribution $q$ is given by $H(X, q) = H(X) + D(p \| q) = -\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{p(x)}{q(x)} = -\sum_x p(x) \log q(x)$.

Cross Entropy. $H(X, q) = -\sum_x p(x) \log q(x)$. Notice that this is just the expectation of $\log \frac{1}{q(x)}$ with respect to $p(x)$: $H(X, q) = E_p \log \frac{1}{q(x)}$.

Cross Entropy. We do not need to know $p$ if we make some assumptions. If we assume that $p$ (our language) is stationary (the probabilities assigned to sequences are invariant with respect to shifts) and ergodic (the process is not sensitive to initial conditions), then the Shannon-McMillan-Breiman theorem shows that $H(X, q) = \lim_{n \to \infty} -\frac{1}{n} \log q(w_1 w_2 \dots w_n)$. We can define a model $q$ and approximate its cross entropy on a long sample (hope: the lower the cross entropy, the better the model). How can we do it in practice?
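
In practice, one estimates the cross entropy of a model $q$ on a held-out sample as $-\frac{1}{n} \log_2 q(w_1 \dots w_n)$. A sketch under the hypothetical assumption that $q$ is a unigram model (so the log probability factorizes over words); all probabilities below are invented:

```python
import math

q = {"the": 0.07, "form": 0.001, "needs": 0.0005, "to": 0.03,
     "be": 0.02, "completed": 0.0001}                 # invented unigram probabilities

held_out = ["the", "form", "needs", "to", "be", "completed"]
log_q = sum(math.log2(q[w]) for w in held_out)
cross_entropy = -log_q / len(held_out)                # bits per word
print(cross_entropy)
print(2 ** cross_entropy)                             # perplexity 2^H
```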

Cross Entropy for Different Models.

Model        Cross Entropy (bits)
0-th order   4.76  (= log 27; uniform over 27 characters)
1-st order   4.03
2-nd order   2.8

The cross entropy can be used as an evaluation metric for language models (or, equivalently, the perplexity $2^H$). Reducing the perplexity thus means improving the language model. Notice that this does not always mean better predictions, though sometimes we need distributions, not predictions...

Information Theory: Summary. Noisy channel model: many applications in NLP. Entropy: measure of uncertainty; max-ent principle. Mutual information: measure of r.v. dependence. KL divergence: distance between distributions. Cross entropy: language model evaluation.