CS546: Learning and NLP, Lec 4: Mathematical and Computational Paradigms


1 CS546: Learning and NLP, Lec 4: Mathematical and Computational Paradigms. Spring 2009, February 3, 2009

2 Lecture Some notes on statistics; Bayesian decision theory; concentration bounds; information theory.

3 Introduction Consider the following examples: 1 I saw a girl it the park. (I saw a girl in the park.) 2 The from needs to be completed. (The form needs to be completed.) 3 I maybe there tomorrow. (I may be there tomorrow.) 4 Nissan's car plants grew rapidly in the last few years. Questions: How do we identify that there is a problem? How do we fix it?

4 Introduction In order to address these problems we would like to find a systematic way to study them. But: how do we model the problem? How do we use this model to solve it? The problems described are new to you. You have never seen sentences (1)-(4). How can you be so sure that you can recognize a problem and correct it?

5 Introduction It is clear that there exist some regularities in language. How can we exploit these to detect and correct errors? There are general mathematical approaches that can be used to translate this intuition into a model that we can study.

6 What is coming next? It seems clear that in order to make decisions of this sort, one has to rely on previously observed data, i.e., use statistics. In many cases you do not know the model that governs the generation of the data (the study of distribution-free methods). The main interest here is to say something about the data just by estimating some statistics. Main results in learning theory are based on large deviation (or concentration) bounds. We will also cover a few of the key notions from information theory that are relevant in the study of NLP.

7 Types of Statistical Analysis Descriptive statistics: in many cases we would like to collect data and learn some interesting things about it. Here, we want to summarize the data and describe it via different types of measurements. (E.g., we did this for the Tom Sawyer book last time.) Inferential statistics: we want to draw conclusions from the data. Often, to draw conclusions we need to make assumptions, or formulate a probability model for the data. We will start with descriptive statistics.

8 Descriptive Statistics Recall that the sample mean of a data set $x_1, x_2, \dots, x_n$ is defined as $\bar{x} = \sum_{i=1}^n x_i / n$ and the sample standard deviation as $s = \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 / (n-1)}$. What can we say about the data if we see only the sample mean and sample standard deviation? Chebyshev's inequality states that for any value of $k \ge 1$, more than $100(1 - 1/k^2)$ percent of the data lies within $ks$ of the mean. E.g., for k = 2, more than 75% of the data lies within 2s of the sample mean. No assumption about the distribution is needed.

9 Chebyshev's Inequality for Sample Data Let $\bar{x}$ and s be the sample mean and sample standard deviation of the data set $x_1, x_2, \dots, x_n$, where s > 0. Let $S_k = \{i, 1 \le i \le n : |x_i - \bar{x}| < ks\}$ and let $N(S_k)$ be the number of elements in the set $S_k$. Then, for any $k \ge 1$, $\frac{N(S_k)}{n} > 1 - \frac{1}{k^2}$.

10 Proof $(n-1)s^2 = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i \in S_k}(x_i - \bar{x})^2 + \sum_{i \notin S_k}(x_i - \bar{x})^2 \ge \sum_{i \notin S_k}(x_i - \bar{x})^2 \ge \sum_{i \notin S_k} k^2 s^2 = k^2 s^2 (n - N(S_k))$, where the first inequality follows since all terms being summed are nonnegative, and the second from the definition of $S_k$ (for $i \notin S_k$, $|x_i - \bar{x}| \ge ks$). Dividing both sides by $nk^2 s^2$: $\frac{N(S_k)}{n} \ge 1 - \frac{n-1}{nk^2} > 1 - \frac{1}{k^2}$.

11 Limitations and Discussion Since Chebyshev's inequality holds universally, you should expect it to be loose for many distributions. Indeed, if the data is distributed normally, we know that about 95% of the data lies within $\bar{x} \pm 2s$. An analogous bound can be formulated for the true mean and variance (not their estimates): what is the probability of observing a point $ks$ away from the true mean?
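To see how loose the bound typically is, here is a minimal numerical check (the exponential data below is an arbitrary illustrative choice, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)   # arbitrary skewed sample data

xbar = x.mean()
s = x.std(ddof=1)                           # sample standard deviation (n-1 denominator)

for k in (1.5, 2, 3):
    frac_within = np.mean(np.abs(x - xbar) < k * s)
    bound = 1 - 1 / k**2
    print(f"k={k}: fraction within k*s = {frac_within:.3f}, Chebyshev bound = {bound:.3f}")
```

The observed fractions sit well above the guaranteed $1 - 1/k^2$, which is the looseness discussed above.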

12 Sample Correlation Coefficient Other questions in descriptive statistics: assume that our data set consists of pairs of values that have some relationship to each other. The i-th data point is represented as the pair $(x_i, y_i)$. A question of interest might be whether the $x_i$'s are correlated with the $y_i$'s (that is, whether large x values go together with large y values, etc.).

13 Sample Correlation Coefficient Let $s_x$ and $s_y$ denote, respectively, the sample standard deviations of the x values and the y values. The sample correlation coefficient, denoted r, of the data pairs $(x_i, y_i)$, i = 1, ..., n, is defined by: $r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{(n-1) s_x s_y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$. When r > 0 we say that the sample data pairs are positively correlated, and when r < 0 we say they are negatively correlated.

14 Properties of r 1 $-1 \le r \le 1$. 2 If for constants a, b, with b > 0, $y_i = a + b x_i$, i = 1, ..., n, then r = 1. 3 If for constants a, b, with b < 0, $y_i = a + b x_i$, i = 1, ..., n, then r = -1. 4 If r is the sample correlation coefficient for the data pairs $(x_i, y_i)$, i = 1, ..., n, then it is also the sample correlation coefficient for the data pairs $(a + b x_i, c + d y_i)$, i = 1, ..., n, provided that b, d are both positive or both negative. (Side note: keep in mind that correlation and statistical dependence are very different things.)
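A small sketch that computes r directly from the definition and checks property 4; the data below is synthetic, generated just for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.5, size=200)    # roughly linearly related data

def sample_corr(x, y):
    # r = sum((x_i - xbar)(y_i - ybar)) / ((n-1) * s_x * s_y)
    n = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

print(f"r = {sample_corr(x, y):.3f}")
# Property 4: r is unchanged by affine transformations with b, d of the same sign.
print(f"r after rescaling = {sample_corr(2 + 5 * x, -1 + 0.1 * y):.3f}")
```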

15 Elements of Probability I hope you know this, but just 2 slides to make sure... The conditional probability of event E given that event F has occurred (P(F) > 0) is defined to be $P(E \mid F) = \frac{P(E \cap F)}{P(F)}$. The Product Rule: $P(E \cap F) = P(E \mid F)P(F) = P(F \mid E)P(E)$. This rule can be generalized to multiple events. The Chain Rule: $P(E_1 \cap E_2 \cap \dots \cap E_n) = P(E_1) P(E_2 \mid E_1) P(E_3 \mid E_1 \cap E_2) \cdots P(E_n \mid \bigcap_{i=1}^{n-1} E_i)$.

16 Elements of Probability Total Probability: for mutually exclusive events $F_1, F_2, \dots, F_n$ with $\sum_{i=1}^n P(F_i) = 1$, $P(E) = \sum_i P(E \mid F_i) P(F_i)$. The partitioning rule can also be phrased for conditional probabilities. Total Conditional Probability: for mutually exclusive events $F_1, F_2, \dots, F_n$ with $\sum_{i=1}^n P(F_i) = 1$, $P(E \mid G) = \sum_i P(E \mid F_i, G) P(F_i \mid G)$.

17 Bayesian Decision Theory Let E, F be events. As a simple consequence of the product rule we get the ability to swap the order of dependence between events. Bayes Theorem (simple form): $P(F \mid E) = \frac{P(E \mid F) P(F)}{P(E)}$. General form of Bayes Theorem: $P(F_i \mid E) = \frac{P(E \mid F_i) P(F_i)}{P(E)} = \frac{P(E \mid F_i) P(F_i)}{\sum_j P(E \mid F_j) P(F_j)}$. Bayes rule, which is really a simple consequence of conditioning, is the basis for an important theory of decision making.

18 Bayesian Decision Theory The key observation is that Bayes rule can be used to determine, given a collection of events, which is the more likely event. Let $B_1, B_2, \dots, B_k$ be a collection of events. Assume that we have observed the event A, and we would like to know which of the $B_i$'s is more likely given A. This is given by Bayes rule as: $i^* = \arg\max_i P(B_i \mid A) = \arg\max_i \frac{P(A \mid B_i) P(B_i)}{P(A)} = \arg\max_i P(A \mid B_i) P(B_i)$. Note: we do not need the probability P(A) to do this (this is important for some problems, e.g. structured prediction).

19 Basics of Bayesian Learning The decision rule we gave is all there is to probabilistic decision making. Everything else comes from the details of defining the probability space and the events A, B_i. The way we model the problem, that is, the way we formalize the decision we are after, depends on our definition of the probabilistic model. The key issues here are (1) how to represent the probability distribution, and (2) how to represent the events. The probability space can be defined over sentences and their parse trees. The probability space can be defined over English sentences and their Chinese translations.

20 Key differences between probabilistic approaches The key questions, and consequently the differences between models and computational approaches, boil down to: 1 what are the probability spaces? 2 how are they represented?

21 Bayes decision making We assume that we observe data S, and have a hypothesis space H, from which we want to choose the Bayes Optimal one, h. We assume a probability distribution over the class of hypotheses H. In addition, we will need to know how this distribution depends on the data observed.

22 Bayes decision making 1 P(h) - the prior probability of a hypothesis h (background knowledge). 2 P(S) - the probability that this sample of the data is observed (the evidence; it does not depend on the hypothesis). 3 P(S | h) - the probability of observing the sample S, given that the hypothesis h holds. 4 P(h | S) - the posterior probability of h.

23 Goal Typically, we are interested in computing P(h | S) or, at least, providing a ranking of P(h | S) for different hypotheses h in H. However, the full Bayesian approach does not commit to a single hypothesis y = h(x); instead it predicts $y^* = \arg\max_y P(y \mid x) = \arg\max_y \sum_{h} [y = h(x)]\, P(h \mid S)$. Instead of selecting a single hypothesis you use the predictive distribution. This is not always computationally feasible.

24 Typical Learning Scenario The learning step here is inducing the best h from the data we observe. The learner is trying to find the most probable h in H, given the observed data. 1 Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis: $h_{MAP} = \arg\max_{h \in H} P(h \mid S) = \arg\max_{h \in H} \frac{P(S \mid h) P(h)}{P(S)} = \arg\max_{h \in H} P(S \mid h) P(h)$. 2 In some cases, we assume that the hypotheses are equally probable a priori (a non-informative prior), i.e. $P(h_i) = P(h_j)$ for all $h_i, h_j \in H$. In this case $h_{ML} = \arg\max_{h \in H} P(S \mid h)$ (the Maximum Likelihood hypothesis).

25 Example A given coin is either fair or has a 60% bias in favor of Heads. Decide what the bias of the coin is. Two hypotheses: $h_1$: P(H) = 0.5; $h_2$: P(H) = 0.6. Prior: $P(h_1) = 0.75$, $P(h_2) = 0.25$. Now observe the data: say, we tossed the coin once and got H. Likelihoods: $P(D \mid h_1) = 0.5$; $P(D \mid h_2) = 0.6$. $P(D) = P(D \mid h_1)P(h_1) + P(D \mid h_2)P(h_2) = 0.5 \cdot 0.75 + 0.6 \cdot 0.25 = 0.525$. Posteriors: $P(h_1 \mid D) = P(D \mid h_1)P(h_1)/P(D) = 0.375/0.525 \approx 0.714$; $P(h_2 \mid D) = P(D \mid h_2)P(h_2)/P(D) = 0.15/0.525 \approx 0.286$.
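The same update, spelled out in a few lines of code (the hypothesis names are made up; the numbers are the ones on the slide):

```python
# Bayesian update for the biased-coin example.
priors = {"h1_fair": 0.75, "h2_bias60": 0.25}
likelihood_heads = {"h1_fair": 0.5, "h2_bias60": 0.6}   # P(D = "one H" | h)

evidence = sum(likelihood_heads[h] * priors[h] for h in priors)            # P(D) = 0.525
posterior = {h: likelihood_heads[h] * priors[h] / evidence for h in priors}

print(f"P(D) = {evidence:.3f}")
for h, p in posterior.items():
    print(f"P({h} | D) = {p:.3f}")   # ~0.714 and ~0.286
```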

26 Another Example Given a corpus, what is the probability of the word "it" in the corpus? Bag-of-words model: let m be the number of words in the corpus and let k be the number of occurrences of "it" in the corpus. The estimate is p = k/m, as derived on the next slide.

27 Example Modeling: assume a bag-of-words model with a binomial distribution. The problem becomes: you toss a (p, 1-p) coin m times and get k Heads and m-k Tails. What is p? The probability of the data observed is $P(D \mid p) = p^k (1-p)^{m-k}$. We want to find the value of p that maximizes this quantity: $p^* = \arg\max_p P(D \mid p) = \arg\max_p \log P(D \mid p) = \arg\max_p [k \log p + (m-k)\log(1-p)]$. Setting the derivative to zero, $\frac{d \log P(D \mid p)}{dp} = \frac{k}{p} - \frac{m-k}{1-p} = 0$, and solving for p gives $p^* = k/m$.
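A quick sanity check of the closed-form MLE against a brute-force grid search (the counts m and k below are arbitrary toy values):

```python
import numpy as np

m, k = 50, 12                     # toy counts: m tosses / words, k "heads" / occurrences of "it"

def log_likelihood(p):
    return k * np.log(p) + (m - k) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 10_000)
p_hat = grid[np.argmax(log_likelihood(grid))]
print(f"grid-search MLE: {p_hat:.4f}, closed form k/m = {k/m:.4f}")
```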

28 Bayesian Approach vs. Maximum Likelihood Problem: the taxicab problem (Minka, 2001): viewing a city from the train, you see a taxi numbered x. Assuming taxis are consecutively numbered, how many taxis are in the city? We will discuss how to tackle this task with Bayesian inference. You can imagine similar problems in NLP...

29 Maximum Likelihood Approach We assume that the distribution is uniform: $p(x \mid a) = U(0, a) = \begin{cases} 1/a & \text{if } 0 \le x \le a \\ 0 & \text{otherwise} \end{cases}$. Let us consider a slightly more general task: we observed N taxis, and m is the largest id among the observed ones: $p(D \mid a) = \begin{cases} 1/a^N & \text{if } a \ge m \\ 0 & \text{otherwise} \end{cases}$. To maximize $p(D \mid a)$ we select a = m. In other words, our best hypothesis would be that we have already observed the taxi with the largest id. The MAP approach (i.e. selecting a single hypothesis using a reasonable prior p(a)) would not help here either.

30 What to do? Full Bayesian inference, i.e. summation over possible hypotheses weighted by p(a | D), will give higher estimates of a. Experiments with human concept learning have also demonstrated that people give higher estimates in similar situations. Let us consider this example in detail.

31 Bayesian Analysis We use the Pareto distribution as our prior: $p(a) = \mathrm{Pa}(b, K) = \begin{cases} K b^K / a^{K+1} & \text{if } a \ge b \\ 0 & \text{otherwise} \end{cases}$. It is the conjugate prior for the uniform distribution: a prior is conjugate to the likelihood function if the posterior distribution belongs to the same family of distributions as the prior. In our case, p(a | D) is also a Pareto distribution. This allows us to perform analytical integration instead of numerical estimation. How do we choose the form of a prior distribution?

32 Bayesian Analysis We use the Pareto distribution as our prior: $p(a) = \mathrm{Pa}(b, K) = \begin{cases} K b^K / a^{K+1} & \text{if } a \ge b \\ 0 & \text{otherwise} \end{cases}$. Properties: this density asserts that a must be greater than b but not much greater; K controls how much greater.

33 Bayesian Analysis $p(a) = \mathrm{Pa}(b, K) = \begin{cases} K b^K / a^{K+1} & \text{if } a \ge b \\ 0 & \text{otherwise} \end{cases}$. If $K \to 0$ and $b \to 0$ we get a non-informative prior, i.e. the prior becomes essentially uniform (formally it cannot be uniform over an unbounded set and still be a valid distribution, but this holds in the limit). It makes sense to use non-informative priors if you have no prior knowledge.

34 Bayesian Analysis Prior: $p(a) = \mathrm{Pa}(b, K) = \begin{cases} K b^K / a^{K+1} & \text{if } a \ge b \\ 0 & \text{otherwise} \end{cases}$. Likelihood: $p(D \mid a) = \begin{cases} 1/a^N & \text{if } a \ge m \\ 0 & \text{otherwise} \end{cases}$. We can write the joint distribution of D and a: $p(D, a) = p(D \mid a) p(a) = \begin{cases} \frac{K b^K}{a^{N+K+1}} & \text{if } a \ge b,\ a \ge m \\ 0 & \text{otherwise} \end{cases}$.

35 Bayesian Analysis Joint distribution of D and a: $p(D, a) = \begin{cases} \frac{K b^K}{a^{N+K+1}} & \text{if } a \ge b,\ a \ge m \\ 0 & \text{otherwise} \end{cases}$. Now let us compute the evidence: $p(D) = \int_0^\infty p(D, a)\, da = \int_{\max(b,m)}^\infty \frac{K b^K}{a^{N+K+1}}\, da = \begin{cases} \frac{K}{(N+K) b^N} & \text{if } m \le b \\ \frac{K b^K}{(N+K) m^{N+K}} & \text{if } m > b \end{cases}$.
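A small sketch that checks the closed-form evidence against numerical integration for one arbitrary choice of b, K, N and m (the m > b case); scipy is used only for the quadrature:

```python
import numpy as np
from scipy.integrate import quad

b, K, N, m = 1.0, 2.0, 3, 47.0                       # arbitrary illustrative values (m > b)
c = max(b, m)

# p(D, a) = K * b**K / a**(N+K+1) for a >= max(b, m), zero elsewhere
evidence_numeric, _ = quad(lambda a: K * b**K / a**(N + K + 1), c, np.inf)
evidence_closed = K * b**K / ((N + K) * m**(N + K))  # the m > b branch of the formula above

print(evidence_numeric, evidence_closed)             # the two values should agree
```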

36 Bayesian Analysis Evidence: $p(D) = \begin{cases} \frac{K}{(N+K) b^N} & \text{if } m \le b \\ \frac{K b^K}{(N+K) m^{N+K}} & \text{if } m > b \end{cases}$. [Plot omitted: the evidence as a function of m with a Pa(1, 2) prior.]

37 Bayesian Analysis Evidence: $p(D) = \begin{cases} \frac{K}{(N+K) b^N} & \text{if } m \le b \\ \frac{K b^K}{(N+K) m^{N+K}} & \text{if } m > b \end{cases}$. Joint distribution of D and a: $p(D, a) = p(D \mid a) p(a) = \begin{cases} \frac{K b^K}{a^{N+K+1}} & \text{if } a \ge b,\ a \ge m \\ 0 & \text{otherwise} \end{cases}$. Let us derive the posterior distribution: $p(a \mid D) = p(D, a)/p(D) = \mathrm{Pa}(c, N + K)$, where $c = \max(m, b)$.

38 Bayesian Analysis: what is the answer? $p(a \mid D) = \mathrm{Pa}(c, N + K)$. Back to the taxicab problem. Depending on the prior we will get different answers. If we have no prior knowledge we can use the non-informative prior (b = 0, K = 0); then we get $p(a \mid D) = \mathrm{Pa}(m, N)$. The mode of the posterior is m (that is why the MAP approach does not work either). To minimize the mean squared error we should use the mean of the posterior, $\frac{Nm}{N-1}$, but it is infinite for N = 1. To minimize the absolute error we should use the median; this is $y = 2^{1/N} m$, i.e. 2m for N = 1.

39 Bayesian Analysis: what is the answer? What about the unbiased estimator of a? The cumulative distribution function of m is $P(\max \le m \mid a) = P(\text{all } N \text{ samples} \le m \mid a) = (m/a)^N$. We get the pdf of m by differentiation: $p(m \mid a) = \frac{d}{dm} P(\max \le m \mid a) = \frac{N}{a}\left(\frac{m}{a}\right)^{N-1}$. Then $E[m] = \int_0^a p(m \mid a)\, m\, dm = \int_0^a N \left(\frac{m}{a}\right)^N dm = \frac{Na}{N+1}$. So the unbiased estimator of a is $\frac{N+1}{N} m$. For N = 1, the median and the unbiased estimator agree on 2m.
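A simulation sketch of these two facts: the expected maximum is Na/(N+1), and (N+1)/N * m is unbiased for a (the "true number of taxis" below is an arbitrary made-up value):

```python
import numpy as np

rng = np.random.default_rng(2)
a_true, N, trials = 100.0, 3, 200_000               # hypothetical true a and sample size

maxima = rng.uniform(0, a_true, size=(trials, N)).max(axis=1)
print(maxima.mean(), N * a_true / (N + 1))          # simulated E[m] vs theory
print(((N + 1) / N * maxima).mean())                # the unbiased estimator averages to ~a_true
```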

40 Bayesian Analysis: which next taxi should we expect? To answer this question we need to construct the predictive distribution: $p(x \mid D) = \int_a p(x \mid a) p(a \mid D)\, da = \begin{cases} \frac{N+K}{(N+K+1)c} & \text{if } x \le c \\ \frac{(N+K)c^{N+K}}{(N+K+1)x^{N+K+1}} & \text{if } x > c \end{cases}$. Note about conjugate priors: this is the same as the evidence for one sample x with prior $\mathrm{Pa}(c, N + K)$. For the original taxicab problem (N = 1) and $K \to 0$ we get: $p(x \mid m) = \begin{cases} \frac{1}{2m} & \text{if } x \le m \\ \frac{m}{2x^2} & \text{if } x > m \end{cases}$.

41 Bayesian Analysis: Discussion The example demonstrated that we cannot always rely on ML and MAP, or at least we need to be careful. It is often possible to perform full Bayesian inference. But we should be clever: e.g., use conjugate priors (there are tables of them; not all distributions have them). There is also criticism of conjugate priors. Similar problems arise in NLP (though we do not normally have uniform distributions there; recall the previous lecture).

42 Representation of probability distributions Probability distributions can be represented in many ways. The representation is crucially important since it determines what we can say about the distribution and what can be (efficiently) computed about it. A distribution can be represented explicitly, by specifying the probability of each element in the space. It can be specified parametrically: binomial distribution, normal distribution, geometric distribution, etc. It can be represented graphically. It can be specified by describing a process that generates elements in the space. Depending on the process, it may be easy or hard to actually compute the probability of an event in the space.

43 Example of Representation Consider the following model: we have five characters, A, B, C, D, E. At any point in time we can generate: A with probability 0.4, B with probability 0.1, C with probability 0.2, D with probability 0.1, E with probability 0.2. E represents end of sentence. A sentence in this language could be: AAACDCDAADE. An unlikely sentence in this language would be: DDDBBBDDDBBBDDDBBB. It is clear that given the model we can compute the probability of each string and make a judgment as to which sentence is more likely.
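A direct implementation of this model as a sketch; it just multiplies the per-character probabilities from the slide:

```python
from math import prod

# Character probabilities from the slide; E marks end of sentence.
P = {"A": 0.4, "B": 0.1, "C": 0.2, "D": 0.1, "E": 0.2}

def string_prob(s):
    return prod(P[ch] for ch in s)

print(string_prob("AAACDCDAADE"))          # the "likely" sentence
print(string_prob("DDDBBBDDDBBBDDDBBB"))   # far less probable under this model
```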

44 Another model of Language: Markovian Models can be more complicated; e.g., a probabilistic finite state model: the start state is A with probability 0.4, B with probability 0.4, or C with probability 0.2. From A you can go to A, B or C with probabilities 0.5, 0.2, 0.3. From B you can go to A, B or C with probabilities 0.5, 0.2, 0.3. From C you can go to A, B or C with probabilities 0.5, 0.2, 0.3. Etc. In this way we can easily define models that are intractable. When you think about language, you need, of course, to think about what A, B, ... represent: words, POS tags, structures of some sort, ideas?

45 Learning Parameters Often we are only given a family of probability distributions (defined parametrically, as a process, or by specifying some conditions as above). A key problem in this case might be to choose one specific distribution in this family. This can be done by observing data, by specifying some property of the distribution that we care about and that might distinguish a unique element in the family, or both.

46 Probabilistic Inequalities and the Law of Large Numbers As we will see, large deviation bounds are at the base of the direct learning approach. When one wants to directly learn how to make predictions, the approach is basically as follows: look at many examples; discover some regularities in the language; use these regularities to construct a prediction policy.

47 Generalization The key question the direct learning approach needs to address, given the distribution-free assumption, is that of generalization. Why would the prediction policy induced from the empirical data be valid on new, previously unobserved data? Assumption: throughout the process, we observe examples sampled from some fixed, but unknown, distribution; i.e., the distribution that governs the generation of examples during the learning process is the one that governs it during the evaluation process (identically distributed). Assumption: data instances are independently generated. Two basic results from statistics are useful in learning theory (Chernoff bounds and union bounds).

48 Chernoff Bounds Consider a binary random variable X (e.g., the result of a coin toss) which has probability p of being heads (1). Two important questions are: how do we estimate p? How confident are we in this estimate? Consider m samples, $\{x_1, x_2, \dots, x_m\}$, drawn independently from X. A natural estimate of the underlying probability p is the relative frequency of $x_i = 1$, that is, $\hat{p} = \sum_i x_i / m$. Notice that this is also the maximum likelihood estimate according to the binomial distribution.

49 Chernoff Bounds The law of large numbers: $\hat{p}$ will converge to p as m grows. How quickly? Chernoff's theorem gives a bound on this.

50 Chernoff Bounds: Theorem For all $p \in [0, 1]$, $\epsilon > 0$, $P[|p - \hat{p}| > \epsilon] \le 2e^{-2m\epsilon^2}$, where the probability P is taken over the distribution of training samples of size m generated with underlying parameter p. This is an exponential rate of convergence. Some numbers: take m = 1000, $\epsilon = 0.05$; then $e^{-2m\epsilon^2} = e^{-5} \approx 1/148$. That is: if we repeatedly take samples of size 1000 and compute the relative frequency, then in 147 out of 148 samples our estimate will be within 0.05 of the true probability. We need to be quite unlucky for that not to happen. There exist other large deviation bounds (aka concentration bounds): Hoeffding, McDiarmid, Azuma, etc. How do you use a bound of this sort to justify generalization?
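A quick empirical check of the bound by simulation (p is an arbitrary choice; any value gives the same picture):

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, eps, reps = 0.3, 1000, 0.05, 50_000

p_hat = rng.binomial(m, p, size=reps) / m                 # relative frequencies over many samples
empirical = np.mean(np.abs(p_hat - p) > eps)
bound = 2 * np.exp(-2 * m * eps**2)
print(f"empirical P[|p - p_hat| > eps] = {empirical:.5f}, bound = {bound:.5f}")
```

As expected, the observed deviation probability is far below the bound, which holds for any distribution.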

51 Union Bounds A second useful result, a very simple one in fact, is the union bound: for any n events $\{A_1, A_2, \dots, A_n\}$ and for any distribution P whose sample space includes all $A_i$, $P[A_1 \cup A_2 \cup \dots \cup A_n] \le \sum_i P[A_i]$. The union bound follows directly from the axioms of probability theory. For example, for n = 2, $P[A_1 \cup A_2] = P[A_1] + P[A_2] - P[A_1 \cap A_2] \le P[A_1] + P[A_2]$. The general result follows by induction on n.

52 Today's summary Descriptive statistics (e.g., Chebyshev's bound). Bayesian decision theory, Bayesian inference. Concentration bounds. We will talk about information theory next time, but we will also start smoothing/language modeling.

53 Assignments Reminder: Feb 13 (midnight) is the deadline for the first critical review. Reminder: let me know about your choice of a paper to present by next Wednesday (Feb 4).

54 Information Theory: Noisy Channel model One of the important computational abstractions made in NLP is that of the Noisy Channel Model (Shannon). Actually, it is just an application of Bayes rule, but it is sometimes a useful way to think about and model problems. Shannon: modeling the process of communication across a channel, optimizing throughput and accuracy. Tradeoff between compression and redundancy. Assumption: a probabilistic process governs the generation of language (messages). A probabilistic corruption process, modeled via the notion of a channel, affects the generated language structure before it reaches the listener.

55 Noisy Channel: Definitions Let I be the input space, a space of some structures in the language. Let P be a probability distribution over I; this is our language model. The noisy channel has a probability distribution P(o | i) of outputting the structure o when receiving the input i. Input (i) -> Noisy Channel -> o -> Decoder -> $\hat{i}$

56 Noisy Channel: more The goal is to decode: $\hat{i} = \arg\max_{i \in I} P(i \mid o) = \arg\max_{i \in I} P(o \mid i) P(i) / P(o) = \arg\max_{i \in I} P(o \mid i) P(i)$. In addition to the language model, we also need to know the distribution of the noisy channel. Example, translation: according to this model, English is nothing but corrupted French (true text (i): French, output (o): English).
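A toy decoding step under this model, for the "I saw a girl it the park" example from the introduction; every probability below is a made-up illustrative number, not an estimate from real data:

```python
# Which intended word i best explains the observed token o = "it" in "a girl __ the park"?
lm_prob      = {"in": 0.30, "it": 0.001, "at": 0.05}   # P(i): language-model probability in this context
channel_prob = {"in": 0.10, "it": 0.80,  "at": 0.02}   # P(o = "it" | i): typo / confusion model

best = max(lm_prob, key=lambda i: channel_prob[i] * lm_prob[i])
print(best)   # argmax_i P(o | i) P(i)  ->  "in"
```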

57 More examples

58 Noisy Channel model: Conclusion The key achievement of Shannon: the ability to quantify the theoretically best data compression possible, the best transmission rate possible, and the relation between them. Mathematically, these correspond to the entropy of the transmitted distribution and the capacity of the channel. Roughly: each channel is associated with a capacity, and if you transmit your information at a rate that is lower than it, the decoding error can be made arbitrarily small (Shannon's theorem, 1948). The channel capacity is given using the notion of the mutual information between the random variables I and O. That is the reason that entropy and mutual information became important notions in NLP.

59 Entropy For a given random variable X, how much information is conveyed in the message that X = x? How do we quantify it? We can first agree that the amount of information in the message that X = x should depend on how likely it was that X would equal x. Examples: there is more information in a message that there will be a tornado tomorrow than in a message that there will be no tornado (a rare event). If X represents the sum of two fair dice, then there seems to be more information in the message that X = 12 than there would be in the message that X = 7, since the former event happens with probability 1/36 and the latter with probability 1/6.

60 Data Transmission Intuition Intuition: a good compression algorithm (e.g., Huffman coding) encodes more frequent events with shorter codes; for less frequent events more bits need to be transmitted (longer messages contain more information). Does language compress information in a similar way? To some degree: rare words are generally longer (Zipf, 1938; Guiter, 1974), and language often evolves by replacing frequent long words with shorter words (e.g., bicycle -> bike). There are interesting analyses of the application of information theory to language evolution (e.g., Plotkin and Nowak, 2000).

61 Information: Mathematical Form Let us denote by I(p) the amount of information contained in the message that an event whose probability is p has occurred. Clearly I(p) should be a nonnegative, decreasing function of p. To determine its form, let X and Y be independent random variables, and suppose that P{X = x} = p and P{Y = y} = q. Knowledge that X equals x does not affect the probability that Y will equal y, so the amount of information in the message that X = x and Y = y is I(p) + I(q). On the other hand, P{X = x, Y = y} = P{X = x}P{Y = y} = pq, which implies that this amount of information is I(pq). Therefore, the function I should satisfy the identity I(pq) = I(p) + I(q).

62 Information: Mathematical Form I(pq) = I(p) + I(q). If we define $G(p) = I(2^{-p})$ we get from the above that $G(p+q) = I(2^{-(p+q)}) = I(2^{-p} \cdot 2^{-q}) = I(2^{-p}) + I(2^{-q}) = G(p) + G(q)$. We get G(p) = cp for some constant c, or $I(2^{-p}) = cp$. Letting $z = 2^{-p}$, $I(z) = -c \log_2(z)$ for some positive constant c. Assume that c = 1 and say that the information is measured in bits.

63 Entropy Consider a random variable X which takes values $x_1, \dots, x_n$ with respective probabilities $p_1, \dots, p_n$. Since $-\log(p_i)$ represents the information conveyed by the message that X is equal to $x_i$, it follows that the expected amount of information that will be conveyed when the value of X is transmitted is $H(X) = -\sum_{i=1}^n p_i \log_2(p_i)$. This quantity is known in information theory as the entropy of the random variable X.

64 Entropy: Formal Definition Let p be a probability distribution over a discrete domain X. The entropy of p is $H(p) = -\sum_{x \in X} p(x) \log p(x)$. Notice that we can think of the entropy as $H(p) = E_p[\log \frac{1}{p(x)}]$. Since $0 \le p(x) \le 1$, $1/p(x) \ge 1$, $\log \frac{1}{p(x)} \ge 0$, and therefore $0 \le H(p) \le \infty$.

65 Entropy: Intuition The entropy of a random variable measures its uncertainty: it takes its minimal value (of 0) if one event has probability 1 and the others 0 (no uncertainty); it is maximal and equal to log |X| when the distribution is uniform. We will see the notion of entropy in three different contexts: as a measure of the information content in a distributional representation; as a measure of information (how much information about X do we get from knowing about Y?); and as a way to characterize and choose probability distributions and, ultimately, classifiers.

66 Entropy: Example Consider a simple language L over a vocabulary of size 8. Assume that the distribution over L is uniform: p(l) = 1/8 for all l in L. Then $H(L) = -\sum_{l=1}^8 p(l) \log p(l) = -\log \frac{1}{8} = 3$. The most efficient encoding for transmission: 3 bits for each symbol. An optimal code sends a message of probability p in $-\log p$ bits. If the distribution over L is {1/2, 1/8, 1/8, 1/8, 1/32, 1/32, 1/32, 1/32}, then $H(L) = -\sum_{l=1}^8 p(l) \log p(l) = \frac{1}{2} \cdot 1 + 3 \cdot \frac{1}{8} \cdot 3 + 4 \cdot \frac{1}{32} \cdot 5 = \frac{1}{2} + \frac{9}{8} + \frac{20}{32} = 2.25$.
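The same two computations as a minimal entropy sketch:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log 0 is taken to be 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([1/8] * 8))                                        # 3.0
print(entropy_bits([1/2, 1/8, 1/8, 1/8, 1/32, 1/32, 1/32, 1/32]))     # 2.25
```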

67 Joint and Conditional Entropy Let p be a joint probability distribution over a pair of discrete random variables X, Y. The average amount of information needed to specify both their values is the joint entropy: $H(p) = H(X, Y) = -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log p(x, y)$. The conditional entropy of Y given X, for p(x, y), expresses the average additional information one needs to supply about Y given that X is known: $H(Y \mid X) = \sum_{x \in X} p(x) H(Y \mid X = x) = \sum_{x \in X} p(x)\left[-\sum_{y \in Y} p(y \mid x) \log p(y \mid x)\right] = -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log p(y \mid x)$.

68 Chain Rule The chain rule for entropy is given by: $H(X, Y) = H(X) + H(Y \mid X)$. More generally: $H(X_1, X_2, \dots, X_n) = H(X_1) + H(X_2 \mid X_1) + \dots + H(X_n \mid X_1, X_2, \dots, X_{n-1})$.

69 Selecting a Distribution: A simple example Consider the problem of selecting a probability distribution over $Z = X \times C$, where X = {x, y} and C = {0, 1}. There could be many distributions over this space of 4 elements. Assume that by observing a sample S we determine that $p\{C = 0\} = p(x, 0) + p(y, 0) = 0.6$. We need to fill this table:

          C = 0   C = 1   total
X = x
X = y
total      0.6             1.0

70 Selecting a Distribution: A simple example Consider the problem of selecting a probability distribution over $Z = X \times C$, where X = {x, y} and C = {0, 1}, given the constraint $p\{C = 0\} = p(x, 0) + p(y, 0) = 0.6$. One way to fill the table is to spread the mass evenly within each column:

          C = 0   C = 1   total
X = x      0.3     0.2     0.5
X = y      0.3     0.2     0.5
total      0.6     0.4     1.0

In this case $H(p) = -\sum_{z \in Z} p(z) \log p(z) = -(0.3 \log 0.3 + 0.3 \log 0.3 + 0.2 \log 0.2 + 0.2 \log 0.2) = 1.366 / \ln 2 = 1.97$ bits.

71 Selecting a Distribution: A simple example Consider the same problem over $Z = X \times C$, where X = {x, y} and C = {0, 1}, with the same constraint $p\{C = 0\} = p(x, 0) + p(y, 0) = 0.6$. Another fill that satisfies the constraint:

          C = 0   C = 1   total
X = x      0.5     0.3     0.8
X = y      0.1     0.1     0.2
total      0.6     0.4     1.0

In this case $H(p) = -\sum_{z \in Z} p(z) \log p(z) = -(0.5 \log 0.5 + 0.3 \log 0.3 + 0.1 \log 0.1 + 0.1 \log 0.1) = 1.16 / \ln 2 = 1.68$ bits.

72 Selecting a Distribution: A simple example In the first case we distributed the unknown probability uniformly: H(p) = 1.97. In the second case, it was not uniform: H(p) = 1.68. Note: the maximum entropy for a distribution over 4 elements is $-\log \frac{1}{4} = 2$.

73 Maximum Entropy Principle We should try to select the most uniform distribution that satisfies the constraints we observe: that is, select the distribution with maximum entropy. Intuitively, we minimize the number of additional assumptions we make about the distribution. In general, there is no closed-form solution for this problem. The distribution obtained by the MaxEnt principle is the same as the one obtained by maximum likelihood estimation for some restricted class of models; we will talk about this when we discuss logistic regression. This is important because we often compute from data statistics of individual features rather than statistics of their combinations. How do we combine them to get a model?
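A numerical version of this principle for the small table above; this is only a sketch using a generic constrained optimizer (scipy's SLSQP), not the specialized algorithms usually used to fit MaxEnt models:

```python
import numpy as np
from scipy.optimize import minimize

# p = (p(x,0), p(y,0), p(x,1), p(y,1)); maximize entropy subject to the observed constraint.
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log2(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.6},   # p(x,0) + p(y,0) = 0.6
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # probabilities sum to 1
]
res = minimize(neg_entropy, x0=np.full(4, 0.25), bounds=[(0, 1)] * 4, constraints=constraints)

print(np.round(res.x, 3))   # ~[0.3, 0.3, 0.2, 0.2]: the most uniform fill of the table
print(-res.fun)             # ~1.97 bits, matching the first fill above
```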

74 Mutual Information We showed $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$, which implies $H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$. This difference is called the mutual information between X and Y and is denoted I(X; Y): the reduction in the uncertainty of one RV due to knowing about the other. $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y) = \sum_x p(x)\log\frac{1}{p(x)} + \sum_y p(y)\log\frac{1}{p(y)} - \sum_{x,y} p(x, y)\log\frac{1}{p(x, y)} = \sum_{x,y} p(x, y)\log\frac{1}{p(x)} + \sum_{x,y} p(x, y)\log\frac{1}{p(y)} - \sum_{x,y} p(x, y)\log\frac{1}{p(x, y)} = \sum_{x,y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}$.

75 Mutual Information $I(X; Y) = \sum_{x,y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}$. I(X; Y) can best be thought of as a measure of independence between X and Y: if X and Y are independent then I(X; Y) = 0; if X and Y are dependent then I(X; Y) grows with the degree of dependence but also with the entropy. $I(X; X) = H(X) - H(X \mid X) = H(X)$. It is also possible to talk about conditional mutual information: $I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$.
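A small sketch computing I(X;Y) from a joint table (the 2x2 joint below is an arbitrary illustrative choice):

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                    # joint distribution of two binary variables

p_x = p_xy.sum(axis=1, keepdims=True)            # marginals
p_y = p_xy.sum(axis=0, keepdims=True)
indep = p_x @ p_y                                # product of marginals

mi = np.sum(p_xy * np.log2(p_xy / indep))        # I(X;Y) = sum p(x,y) log p(x,y)/(p(x)p(y))
print(mi)                                        # > 0: the variables are dependent
print(np.sum(indep * np.log2(indep / indep)))    # 0.0 for an independent joint
```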

76 Comments on MI We have defined mutual information between random variables. In many applications this is abused, and people use it to measure point-wise mutual information. The point-wise mutual information between two outcomes x and y can be defined to be $I(x, y) = \log\frac{p(x, y)}{p(x)p(y)}$. This can be thought of as some statistical measure relating the events x and y.

77 Comments on MI For example, consider looking at a pair of consecutive words w and w': p(w) - the frequency of the word w in a corpus, p(w') - the frequency of the word w' in a corpus, p(w, w') - the frequency of the pair (w, w') in the corpus. Then consider I(w, w') as the amount of information about the occurrence of w at location i given that we know that w' occurs at i + 1. This is not a good measure, and we will see examples of why (statistics such as $\chi^2$ better represent what we normally want).

78 Comments on MI: An example 1 Assume that w and w' occur only together. Then $I(w, w') = \log\frac{p(w, w')}{p(w)p(w')} = \log\frac{p(w)}{p(w)p(w')} = \log\frac{1}{p(w')}$. So the point-wise mutual information of perfectly dependent bigrams increases as they become rarer. 2 If they are completely independent, $I(w, w') = \log\frac{p(w, w')}{p(w)p(w')} = \log 1 = 0$. So it can be thought of as a good measure of independence, but not a good measure of dependence, since it depends on the frequencies of the individual words.
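A point-wise MI computation over bigram counts, as a sketch (the tiny "corpus" is made up; real corpus counts would be plugged in the same way):

```python
import math
from collections import Counter

tokens = "the cat sat on the mat because the cat likes the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    p_w1 = unigrams[w1] / n_uni
    p_w2 = unigrams[w2] / n_uni
    p_pair = bigrams[(w1, w2)] / n_bi
    return math.log2(p_pair / (p_w1 * p_w2))

print(f"PMI('the', 'cat') = {pmi('the', 'cat'):.2f}")
print(f"PMI('cat', 'sat') = {pmi('cat', 'sat'):.2f}")   # the rarer pair scores higher, echoing point 1 above
```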

79 Relative Entropy or Kullback-Leibler Divergence Let p, q be two probability distributions over a discrete domain X. The relative entropy between p and q is $D(p, q) = D(p \| q) = \sum_{x \in X} p(x) \log\frac{p(x)}{q(x)}$. Notice that $D(p \| q)$ is not symmetric. It can be viewed as $D(p \| q) = E_p\left[\log\frac{p(x)}{q(x)}\right]$, the expectation according to p of log p/q. It is therefore unbounded, and it is not defined if p gives positive support to instances to which q does not. A term is 0 when p(x) = 0.

80 KL Divergence: Properties Lemma: for any probability distributions p, q, $D(p \| q) \ge 0$, and equality holds if and only if p = q. The KL divergence is not a metric: it is not symmetric and does not satisfy the triangle inequality. We can still think of it as a distance between distributions. MI can be thought of as the KL divergence between the joint probability distribution on X, Y and the product distribution defined on X, Y: $I(X; Y) = D(p(x, y) \,\|\, p(x)p(y))$; or, in other words, it is a measure of how far the true joint is from independence.
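A short sketch of D(p||q) that also checks the identity I(X;Y) = D(joint || product of marginals) on the 2x2 example used above:

```python
import numpy as np

def kl_bits(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                                  # terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1, keepdims=True), p_xy.sum(axis=0, keepdims=True)

print(kl_bits(p_xy.ravel(), (p_x @ p_y).ravel()))                        # equals I(X;Y) from before
print(kl_bits([0.9, 0.1], [0.5, 0.5]), kl_bits([0.5, 0.5], [0.9, 0.1]))  # asymmetry of KL
```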

81 Cross Entropy We can now think about the notion of the entropy of a language. Assume that you know the distribution; then you can think of the entropy as a measure of how surprised you are when you observe the next character in the language. Let p be the true distribution, and q some model that we estimate from a sample generated by p. The cross entropy between a random variable X distributed according to p and another probability distribution q is given by: $H(X, q) = H(X) + D(p \| q) = -\sum_x p(x) \log p(x) + \sum_x p(x) \log\frac{p(x)}{q(x)} = -\sum_x p(x) \log q(x)$.

82 Cross Entropy $H(X, q) = -\sum_x p(x) \log q(x)$. Notice that this is just the expectation of $\log\frac{1}{q(x)}$ with respect to p(x): $H(X, q) = E_p\left[\log\frac{1}{q(x)}\right]$.

83 Cross Entropy We do not need to know p if we make some assumptions. If we assume that p (our language) is stationary (probabilities assigned to sequences are invariant with respect to shifts) and ergodic (a process that is not sensitive to initial conditions), then the Shannon-McMillan-Breiman theorem shows that: $H(X, q) \approx \lim_{n \to \infty} -\frac{1}{n} \log q(w_1 w_2 \dots w_n)$. We can define a model q and approximate the cross entropy of a language model this way (the hope: the lower the cross entropy, the better the model). How can we do it in practice?
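In practice: take a long sample $w_1 \dots w_n$, compute $-\frac{1}{n}\log_2 q(w_1 \dots w_n)$ under the model q, and report it (or its exponential, the perplexity). A sketch using the toy character model from the earlier slide and a made-up sample string:

```python
import math

q = {"A": 0.4, "B": 0.1, "C": 0.2, "D": 0.1, "E": 0.2}    # unigram character model from the earlier slide
sample = "AAACDCDAADEABACCADEAACDE"                        # pretend this was drawn from the true language

n = len(sample)
cross_entropy = -sum(math.log2(q[ch]) for ch in sample) / n
print(f"cross entropy = {cross_entropy:.3f} bits/char, perplexity = {2 ** cross_entropy:.2f}")
```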

84 Cross Entropy for Different Models

Model         Cross Entropy (bits)
0-th order    4.76 (= log 27; uniform over 27 characters)
1-st order
2-nd order    2.8

The cross entropy can be used as an evaluation metric for language models (or, equivalently, the perplexity $2^H$). Reducing the perplexity thus means improving the language model. Notice that this does not always mean better predictions, though sometimes we need distributions, not predictions...

85 Information Theory: Summary Noisy Channel Model: many applications in NLP Entropy: measure of uncertainty, max-ent principle Mutual Information: measure of r.v. dependence KL divergence: distance between distributions Cross Entropy: language model evaluation


More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019 Lecture 10: Probability distributions DANIEL WELLER TUESDAY, FEBRUARY 19, 2019 Agenda What is probability? (again) Describing probabilities (distributions) Understanding probabilities (expectation) Partial

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Review: Probability. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler

Review: Probability. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler Review: Probability BM1: Advanced Natural Language Processing University of Potsdam Tatjana Scheffler tatjana.scheffler@uni-potsdam.de October 21, 2016 Today probability random variables Bayes rule expectation

More information

Shannon s Noisy-Channel Coding Theorem

Shannon s Noisy-Channel Coding Theorem Shannon s Noisy-Channel Coding Theorem Lucas Slot Sebastian Zur February 2015 Abstract In information theory, Shannon s Noisy-Channel Coding Theorem states that it is possible to communicate over a noisy

More information

1 INFO 2950, 2 4 Feb 10

1 INFO 2950, 2 4 Feb 10 First a few paragraphs of review from previous lectures: A finite probability space is a set S and a function p : S [0, 1] such that p(s) > 0 ( s S) and s S p(s) 1. We refer to S as the sample space, subsets

More information

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy Coding and Information Theory Chris Williams, School of Informatics, University of Edinburgh Overview What is information theory? Entropy Coding Information Theory Shannon (1948): Information theory is

More information

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on

More information

Information Theory and Hypothesis Testing

Information Theory and Hypothesis Testing Summer School on Game Theory and Telecommunications Campione, 7-12 September, 2014 Information Theory and Hypothesis Testing Mauro Barni University of Siena September 8 Review of some basic results linking

More information

CS-E3210 Machine Learning: Basic Principles

CS-E3210 Machine Learning: Basic Principles CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Information Theory. Week 4 Compressing streams. Iain Murray,

Information Theory. Week 4 Compressing streams. Iain Murray, Information Theory http://www.inf.ed.ac.uk/teaching/courses/it/ Week 4 Compressing streams Iain Murray, 2014 School of Informatics, University of Edinburgh Jensen s inequality For convex functions: E[f(x)]

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

LECTURE 13. Last time: Lecture outline

LECTURE 13. Last time: Lecture outline LECTURE 13 Last time: Strong coding theorem Revisiting channel and codes Bound on probability of error Error exponent Lecture outline Fano s Lemma revisited Fano s inequality for codewords Converse to

More information

Language Modelling: Smoothing and Model Complexity. COMP-599 Sept 14, 2016

Language Modelling: Smoothing and Model Complexity. COMP-599 Sept 14, 2016 Language Modelling: Smoothing and Model Complexity COMP-599 Sept 14, 2016 Announcements A1 has been released Due on Wednesday, September 28th Start code for Question 4: Includes some of the package import

More information

CS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev

CS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev CS4705 Probability Review and Naïve Bayes Slides from Dragomir Radev Classification using a Generative Approach Previously on NLP discriminative models P C D here is a line with all the social media posts

More information

Introduction to Information Entropy Adapted from Papoulis (1991)

Introduction to Information Entropy Adapted from Papoulis (1991) Introduction to Information Entropy Adapted from Papoulis (1991) Federico Lombardo Papoulis, A., Probability, Random Variables and Stochastic Processes, 3rd edition, McGraw ill, 1991. 1 1. INTRODUCTION

More information

QB LECTURE #4: Motif Finding

QB LECTURE #4: Motif Finding QB LECTURE #4: Motif Finding Adam Siepel Nov. 20, 2015 2 Plan for Today Probability models for binding sites Scoring and detecting binding sites De novo motif finding 3 Transcription Initiation Chromatin

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Introduction to Probabilistic Methods Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 Introduction One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

More information

Probability Theory for Machine Learning. Chris Cremer September 2015

Probability Theory for Machine Learning. Chris Cremer September 2015 Probability Theory for Machine Learning Chris Cremer September 2015 Outline Motivation Probability Definitions and Rules Probability Distributions MLE for Gaussian Parameter Estimation MLE and Least Squares

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

Lecture 1: Probability Fundamentals

Lecture 1: Probability Fundamentals Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability

More information