CS546: Learning and NLP Lec 4: Mathematical and Computational Paradigms
1 CS546: Learning and NLP, Lec 4: Mathematical and Computational Paradigms. Spring 2009. February 3, 2009
2 Lecture: some notes on statistics; Bayesian decision theory; concentration bounds; information theory
3 Introduction. Consider the following examples: 1. I saw a girl it the park. (I saw a girl in the park.) 2. The from needs to be completed. (The form needs to be completed.) 3. I maybe there tomorrow. (I may be there tomorrow.) 4. Nissan's car plants grew rapidly in the last few years. Questions: How do we identify that there is a problem? How do we fix it?
4 Introduction. In order to address these problems we would like to find a systematic way to study them. But: How do we model the problem? How do we use this model to solve the problem? The problems described are new to you. You have never seen sentences (1)-(4). How can you be so sure that you can recognize a problem and correct it?
5 Introduction It is clear that there exist some regularities in language. How can we exploit these to detect and correct errors? There are general mathematical approaches that can be used to translate this intuition into a model that we can study.
6 What is coming next? It seems clear that in order to make decisions of this sort, one has to rely on previously observed data = use statistics. In many cases you do not know the model which governs the generation of the data (the study of distribution-free methods). The main interest here is to say something about the data just by estimating some statistics. Main results in learning theory are based on Large Deviation (or Concentration) Bounds. We will also cover a few of the key notions from Information Theory that are relevant in the study of NLP.
7 Types of Statistical Analysis. Descriptive Statistics: in many cases we would like to collect data and learn some interesting things about it. Here, we want to summarize data and describe it via different types of measurements. (E.g., we did this for the Tom Sawyer book last time.) Inferential Statistics: we want to draw conclusions from the data. Often, to draw conclusions we need to make assumptions, or formulate a probability model for the data. We will start with descriptive statistics.
8 Descriptive Statistics. Recall that the sample mean of a data set x_1, x_2, ..., x_n is defined as x̄ = (1/n) Σ_{i=1}^n x_i, and the sample standard deviation as s = sqrt( Σ_{i=1}^n (x_i − x̄)² / (n−1) ). What can we say about the data if we see only the sample mean and sample variance? Chebyshev's Inequality states that for any value of k ≥ 1, more than 100(1 − 1/k²) percent of the data lies within ks of the mean. E.g., for k = 2, more than 75% of the data lies within 2s of the sample mean. No assumption about the distribution is needed.
9 Chebyshev's Inequality for Sample Data. Let x̄ and s be the sample mean and sample standard deviation of the data set x_1, x_2, ..., x_n, where s > 0. Let S_k = { i, 1 ≤ i ≤ n : |x_i − x̄| < ks } and let N(S_k) be the number of elements in the set S_k. Then, for any k ≥ 1, N(S_k)/n > 1 − 1/k².
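As an illustrative numerical check (not part of the original slides; the exponential test data is an arbitrary choice), a short Python sketch comparing the observed fraction of points within ks of the sample mean against the Chebyshev lower bound:

```python
import random

def chebyshev_check(data, k):
    """Fraction of points strictly within k sample standard deviations
    of the sample mean, versus the Chebyshev lower bound 1 - 1/k^2."""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    frac = sum(1 for x in data if abs(x - mean) < k * s) / n
    return frac, 1 - 1 / k ** 2

random.seed(0)
# Skewed, clearly non-normal data: the bound still holds
data = [random.expovariate(1.0) for _ in range(1000)]
for k in (1.5, 2, 3):
    frac, bound = chebyshev_check(data, k)
    assert frac > bound
```

Because the bound is distribution-free, the observed fraction typically exceeds it by a wide margin, which is exactly the looseness discussed on the limitations slide.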
10 Proof. (n−1)s² = Σ_{i=1}^n (x_i − x̄)² = Σ_{i∈S_k} (x_i − x̄)² + Σ_{i∉S_k} (x_i − x̄)² ≥ Σ_{i∉S_k} (x_i − x̄)² ≥ Σ_{i∉S_k} k²s² = k²s² (n − N(S_k)), where the first inequality follows since all terms being summed are nonnegative, and the second from the definition of S_k. Dividing both sides by nk²s²: N(S_k)/n ≥ 1 − (n−1)/(nk²) > 1 − 1/k².
11 Limitations and Discussion. Since Chebyshev's inequality holds universally, you should probably expect it to be loose for many distributions. Indeed, if the data is distributed normally, we know that 95% of the data lies within x̄ ± 2s. An analogous bound can be formulated for the true mean and variance (not their estimates): what is the probability of observing a point ks away from the true mean?
12 Sample Correlation Coefficient. Other questions in descriptive statistics: assume that our data set consists of pairs of values that have some relationship to each other. The ith data point is represented as the pair (x_i, y_i). A question of interest might be whether the x_i's are correlated with the y_i's (that is, whether large x values go together with large y values, etc.).
13 Sample Correlation Coefficient. Let s_x and s_y denote, respectively, the sample standard deviations of the x values and the y values. The sample correlation coefficient, denoted r, of the data pairs (x_i, y_i), i = 1, ..., n, is defined by: r = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / ((n−1) s_x s_y) = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1}^n (x_i − x̄)² · Σ_{i=1}^n (y_i − ȳ)² ). When r > 0 we say that the sample data pairs are positively correlated, and when r < 0 we say they are negatively correlated.
14 Properties of r. 1. −1 ≤ r ≤ 1. 2. If for constants a, b, with b > 0, y_i = a + b x_i, i = 1, ..., n, then r = 1. 3. If for constants a, b, with b < 0, y_i = a + b x_i, i = 1, ..., n, then r = −1. 4. If r is the sample correlation coefficient for the data pairs (x_i, y_i), i = 1, ..., n, then it is also the sample correlation coefficient for the data pairs (a + b x_i, c + d y_i), i = 1, ..., n, provided that b, d are both positive or both negative. (Side note: keep in mind that correlation and statistical dependence are very different things.)
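A small Python sketch (illustrative only; the data points are made up) that computes r from the definition and checks properties 3 and 4 numerically:

```python
def sample_corr(xs, ys):
    """Sample correlation coefficient r of paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4, 5]
ys = [2.0, 3.9, 6.1, 8.0, 10.2]            # roughly y = 2x
r = sample_corr(xs, ys)
assert r > 0.99                             # strongly positively correlated
# Property 4: invariant under positive affine maps of either variable
r2 = sample_corr([10 + 3 * x for x in xs], [1 + 0.5 * y for y in ys])
assert abs(r - r2) < 1e-12
# Property 3: an exact negative linear relation gives r = -1
r3 = sample_corr(xs, [5 - 2 * x for x in xs])
assert abs(r3 + 1) < 1e-12
```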
15 Elements of Probability. I hope you know this, but just 2 slides to make sure... The conditional probability of event E given that event F has occurred (P(F) > 0) is defined to be P(E|F) = P(E ∩ F) / P(F). (1) The Product Rule: P(E ∩ F) = P(E|F)P(F) = P(F|E)P(E). This rule can be generalized to multiple events. The Chain Rule: P(E_1 ∩ E_2 ∩ ... ∩ E_n) = P(E_1) P(E_2|E_1) P(E_3|E_1 ∩ E_2) ··· P(E_n | E_1 ∩ ... ∩ E_{n−1}).
16 Elements of Probability. Total Probability: for mutually exclusive events F_1, F_2, ..., F_n with Σ_{i=1}^n P(F_i) = 1, P(E) = Σ_i P(E|F_i) P(F_i). The partitioning rule can also be phrased for conditional probabilities. Total Conditional Probability: for mutually exclusive events F_1, F_2, ..., F_n with Σ_{i=1}^n P(F_i) = 1, P(E|G) = Σ_i P(E|F_i, G) P(F_i|G).
17 Bayesian Decision Theory. Let E, F be events. As a simple consequence of the product rule we get the ability to swap the order of dependence between events. Bayes Theorem (simple form): P(F|E) = P(E|F) P(F) / P(E). General form of Bayes Theorem: P(F_i|E) = P(E|F_i) P(F_i) / P(E) = P(E|F_i) P(F_i) / Σ_i P(E|F_i) P(F_i). Bayes Rule, which is really a simple consequence of conditioning, is the basis for an important theory of decision making.
18 Bayesian Decision Theory. The key observation is that Bayes rule can be used to determine, given a collection of events, which is the most likely event. Let B_1, B_2, ..., B_k be a collection of events. Assume that we have observed the event A, and we would like to know which of the B_i's is most likely given A. This is given by Bayes rule as: i* = argmax_i P(B_i|A) = argmax_i P(A|B_i) P(B_i) / P(A) = argmax_i P(A|B_i) P(B_i). Note: we do not need the probability P(A) for doing this (this is important for some problems, e.g., structured prediction).
19 Basics of Bayesian Learning. The decision rule we gave is all there is to probabilistic decision making. Everything else comes from the details of defining the probability space and the events A, B_i. The way we model the problem, that is, the way we formalize the decision we are after, depends on our definition of the probabilistic model. The key issues here are (1) how to represent the probability distribution, and (2) how to represent the events. The probability space can be defined over sentences and their parse trees. The probability space can be defined over English sentences and their Chinese translations.
20 Key differences between probabilistic approaches. The key questions, and consequently the differences between models and computational approaches, boil down to: 1. what are the probability spaces? 2. how are they represented?
21 Bayes decision making We assume that we observe data S, and have a hypothesis space H, from which we want to choose the Bayes Optimal one, h. We assume a probability distribution over the class of hypotheses H. In addition, we will need to know how this distribution depends on the data observed.
22 Bayes decision making. 1. P(h) - the prior probability of a hypothesis h (background knowledge). 2. P(S) - the probability that this sample of the data is observed (the evidence; does not depend on the hypothesis). 3. P(S|h) - the probability of observing the sample S, given that the hypothesis h holds. 4. P(h|S) - the posterior probability of h.
23 Goal. Typically, we are interested in computing P(h|S) or, at least, providing a ranking of P(h|S) for different hypotheses h ∈ H. However, the full Bayesian approach predicts y for input x (each hypothesis predicting y = h(x)) as: y* = argmax_y P(y|x) = argmax_y Σ_{h∈H} [y = h(x)] P(h|S). Instead of selecting a single hypothesis you use the predictive distribution. Not always computationally feasible.
24 Typical Learning Scenario. The learning step here is inducing the best h from the data we observe. The learner is trying to find the most probable h ∈ H, given the observed data. 1. Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis: h_MAP = argmax_{h∈H} P(h|S) = argmax_{h∈H} P(S|h) P(h) / P(S) = argmax_{h∈H} P(S|h) P(h). 2. In some cases, we assume that the hypotheses are equally probable a priori (a non-informative prior), i.e., P(h_i) = P(h_j) for all h_i, h_j ∈ H. In this case: h_ML = argmax_{h∈H} P(S|h) (the Maximum Likelihood hypothesis).
25 Example. A given coin is either fair or has a 60% bias in favor of Heads. Decide what the bias of the coin is. Two hypotheses: h_1: P(H) = 0.5; h_2: P(H) = 0.6. Prior: P(h_1) = 0.75, P(h_2) = 0.25. Now observe the data: say, we tossed once and got H. P(D|h_1) = 0.5; P(D|h_2) = 0.6. P(D) = P(D|h_1) P(h_1) + P(D|h_2) P(h_2) = 0.5 · 0.75 + 0.6 · 0.25 = 0.525. P(h_1|D) = P(D|h_1) P(h_1) / P(D) = 0.375/0.525 ≈ 0.714; P(h_2|D) = P(D|h_2) P(h_2) / P(D) = 0.15/0.525 ≈ 0.286.
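The coin example can be reproduced with a few lines of Python (an illustrative sketch; the helper name `posterior` is ours, not from the slides):

```python
def posterior(priors, likelihoods):
    """Bayes rule over a finite hypothesis space:
    P(h|D) = P(D|h) P(h) / sum_h' P(D|h') P(h')."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# h1: fair coin, h2: 60% bias toward Heads; we observe a single Head.
priors = [0.75, 0.25]
likelihoods = [0.5, 0.6]
post = posterior(priors, likelihoods)
assert abs(sum(post) - 1) < 1e-12
assert abs(post[0] - 0.375 / 0.525) < 1e-12   # ≈ 0.714
assert abs(post[1] - 0.150 / 0.525) < 1e-12   # ≈ 0.286
```

Note that the evidence P(D) is computed only to normalize; for the argmax decision it could be dropped, as the earlier slide observed.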
26 Another Example. Given a corpus, what is the probability of the word "it" in the corpus? Bag-of-words model: let m be the number of words in the corpus. Let k be the number of "it"s in the corpus. The probability is k/m.
27 Example. Modeling: assume a bag-of-words model, with a binomial distribution model. The problem becomes: you toss a (p, 1−p) coin m times and get k Heads, m−k Tails. What is p? The probability of the data observed is: P(D|p) = p^k (1−p)^{m−k}. We want to find the value of p that maximizes this quantity: p* = argmax_p P(D|p) = argmax_p log P(D|p) = argmax_p [ k log p + (m−k) log(1−p) ]. Setting the derivative to zero: d log P(D|p)/dp = k/p − (m−k)/(1−p) = 0. Solving this for p gives: p = k/m.
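The closed-form answer p = k/m can be sanity-checked numerically by maximizing the log-likelihood over a grid (an illustrative sketch; k = 37, m = 100 are arbitrary):

```python
import math

def log_likelihood(p, k, m):
    # log P(D|p) = k log p + (m-k) log(1-p)
    return k * math.log(p) + (m - k) * math.log(1 - p)

k, m = 37, 100
grid = [i / 1000 for i in range(1, 1000)]    # avoid p = 0 and p = 1
p_hat = max(grid, key=lambda p: log_likelihood(p, k, m))
# The grid maximizer coincides with the analytical answer k/m = 0.37,
# since the log-likelihood is strictly concave in p.
assert abs(p_hat - k / m) < 1e-9
```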
28 Bayesian Approach vs. Maximum Likelihood. Problem: the taxicab problem (Minka, 2001): viewing a city from the train, you see a taxi numbered x. Assuming taxis are consecutively numbered, how many taxis are in the city? We will discuss how to tackle this task with Bayesian inference. You can imagine similar problems in NLP...
29 Maximum Likelihood Approach. We assume that the distribution is uniform: p(x|a) = U(0, a) = 1/a if 0 ≤ x ≤ a, and 0 otherwise. Let us consider a slightly more general task: we observed N taxis, and m is the largest id among the observed ones. p(D|a) = 1/a^N if a ≥ m, and 0 otherwise. To maximize p(D|a) we select a = m. In other words, our best hypothesis would be that we already observed the largest id. The MAP approach (i.e., selecting a single hypothesis using a reasonable prior p(a)) would not help here either.
30 What to do? Full Bayesian inference, i.e., summation over possible hypotheses weighted by p(a|D), will give higher estimates of a. Experiments with human concept learning have also demonstrated that people give higher estimates in similar situations. Let us consider this example in detail.
31 Bayesian Analysis. We use the Pareto distribution as our prior: p(a) = Pa(b, K) = K b^K / a^{K+1} if a ≥ b, and 0 otherwise. It is the conjugate prior for the uniform distribution: a prior is conjugate to the likelihood function if the posterior distribution belongs to the same family of distributions as the prior. In our case: p(a|D) is also a Pareto distribution. This allows us to perform analytical integration instead of numerical estimation. How do we choose the form of a prior distribution?
32 Bayesian Analysis. We use the Pareto distribution as our prior: p(a) = Pa(b, K) = K b^K / a^{K+1} if a ≥ b, and 0 otherwise. Properties: this density asserts that a must be greater than b but not much greater; K controls how much greater.
33 Bayesian Analysis. p(a) = Pa(b, K) = K b^K / a^{K+1} if a ≥ b, and 0 otherwise. If K → 0 and b → 0 we get a non-informative prior, i.e., the prior becomes essentially uniform (formally it cannot be uniform over an unbounded set and be a valid distribution, but this holds in the limit). It makes sense to use non-informative priors if you have no prior knowledge.
34 Bayesian Analysis. Prior: p(a) = Pa(b, K) = K b^K / a^{K+1} if a ≥ b, and 0 otherwise. Likelihood: p(D|a) = 1/a^N if a ≥ m, and 0 otherwise. We can write the joint distribution of D and a: p(D, a) = p(D|a) p(a) = K b^K / a^{N+K+1} if a ≥ b and a ≥ m, and 0 otherwise.
35 Bayesian Analysis. Joint distribution of D and a: p(D, a) = K b^K / a^{N+K+1} if a ≥ b and a ≥ m, and 0 otherwise. Now let us compute the evidence: p(D) = ∫_0^∞ p(D, a) da = ∫_{max(b,m)}^∞ K b^K / a^{N+K+1} da = K / ((N+K) b^N) if m ≤ b, and K b^K / ((N+K) m^{N+K}) if m > b.
36 Bayesian Analysis. Evidence: p(D) = K / ((N+K) b^N) if m ≤ b, and K b^K / ((N+K) m^{N+K}) if m > b. Evidence as a function of m with a Pa(1, 2) prior (figure omitted).
37 Bayesian Analysis. Evidence: p(D) = K / ((N+K) b^N) if m ≤ b, and K b^K / ((N+K) m^{N+K}) if m > b. Joint distribution of D and a: p(D, a) = p(D|a) p(a) = K b^K / a^{N+K+1} if a ≥ b and a ≥ m, and 0 otherwise. Let us derive the posterior distribution: p(a|D) = p(D, a)/p(D) = Pa(c, N+K), where c = max(m, b).
38 Bayesian Analysis: what is the answer? p(a|D) = Pa(c, N+K). Back to the taxicab problem. Depending on the prior we will get different answers. If we have no prior knowledge we can use the non-informative prior (b → 0, K → 0). Then we get p(a|D) = Pa(m, N). The mode of the posterior is m (that is why the MAP approach does not work either). To minimize the mean squared error we should use the mean of the posterior, Nm/(N−1), but it is infinite for N = 1. To minimize the absolute error we should use the median; this happens for y = 2^{1/N} m, i.e., 2m for N = 1.
39 Bayesian Analysis: what is the answer? What about the unbiased estimator of a? The cumulative distribution function of m is: P(max ≤ m | a) = P(all N samples ≤ m | a) = (m/a)^N. Get the pdf of m by differentiation: p(m|a) = d/dm P(max ≤ m | a) = (N/a)(m/a)^{N−1}. Then E[m] = ∫_0^a p(m|a) m dm = ∫_0^a N (m/a)^N dm = Na/(N+1). So the unbiased estimator of a is ((N+1)/N) m. For N = 1, the median and the unbiased estimator agree on 2m.
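The competing point estimates under the non-informative prior can be collected in a small Python sketch (illustrative; the function name and sample values m = 100, N = 1 are ours):

```python
def taxicab_estimates(m, N):
    """Point estimates of the number of taxis a, having seen N taxis
    with maximum observed id m, under the non-informative prior
    (b -> 0, K -> 0), so the posterior is Pa(m, N)."""
    map_est = m                       # mode of the posterior
    median = 2 ** (1 / N) * m         # minimizes expected absolute error
    unbiased = (N + 1) / N * m        # frequentist unbiased estimator
    mean = N * m / (N - 1) if N > 1 else float('inf')  # minimizes squared error
    return map_est, median, unbiased, mean

map_est, median, unbiased, mean = taxicab_estimates(m=100, N=1)
assert map_est == 100
assert median == 200.0                # 2^(1/1) * m = 2m
assert unbiased == 200.0              # (N+1)/N * m = 2m; agrees with the median
assert mean == float('inf')           # posterior mean diverges for N = 1
```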
40 Bayesian Analysis: which next taxi should we expect? To answer this question we need to construct the predictive distribution: p(x|D) = ∫ p(x|a) p(a|D) da = (N+K) / ((N+K+1) c) if x ≤ c, and (N+K) c^{N+K} / ((N+K+1) x^{N+K+1}) if x > c. Note about conjugate priors: this is the same as the evidence for one sample x with prior Pa(c, N+K). For the original taxicab problem (N = 1 and K → 0) we get: p(x|m) = 1/(2m) if x ≤ m, and m/(2x²) if x > m.
41 Bayesian Analysis: Discussion. The example demonstrated that we cannot always rely on ML and MAP, or at least we need to be careful. It is often possible to perform full Bayesian inference, but we should be clever: e.g., use conjugate priors (there are tables of them; not all distributions have them). There is criticism of conjugate priors. Similar problems arise in NLP (though we do not normally have uniform distributions there; recollect the previous lecture).
42 Representation of probability distributions. Probability distributions can be represented in many ways. The representation is crucially important since it determines what we can say about a distribution and what can be (efficiently) computed about it. A distribution can be represented explicitly, by specifying the probability of each element in the space. It can be specified parametrically: binomial distribution, normal distribution, geometric distribution, etc. It can be represented graphically. It can be specified by a process that generates elements in the space. Depending on the process, it may be easy or hard to actually compute the probability of an event in the space.
43 Example of Representation Consider the following model: We have five characters, A,B,C,D,E. At any point in time we can generate: A, with probability 0.4, B, with probability 0.1, C, with probability 0.2, D, with probability 0.1, E, with probability 0.2. E represents end of sentence. A sentence in this language could be: AAACDCDAADE An unlikely sentence in this language would be: DDDBBBDDDBBBDDDBBB. It is clear that given the model we can compute the probability of each string and make a judgment as to which sentence is more likely.
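Under this model the probability of a string is just the product of per-character probabilities, which a few lines of Python make concrete (an illustrative sketch; the second string here is the slide's unlikely sentence with an explicit E appended so both strings terminate):

```python
import math

# Character probabilities from the example model; E ends the sentence.
P = {'A': 0.4, 'B': 0.1, 'C': 0.2, 'D': 0.1, 'E': 0.2}

def string_prob(s):
    """Probability of generating the string s: characters are drawn
    independently, so the probability is a product."""
    return math.prod(P[ch] for ch in s)

likely = string_prob("AAACDCDAADE")
unlikely = string_prob("DDDBBBDDDBBBDDDBBBE")
assert likely > unlikely   # the model judges the first sentence far more probable
```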
44 Another model of Language: Markovian. Models can be more complicated; e.g., a probabilistic finite state model: the start state is A with probability 0.4, B with probability 0.4, C with probability 0.2. From A you can go to A, B or C with probabilities 0.5, 0.2, 0.3. From B you can go to A, B or C with probabilities 0.5, 0.2, 0.3. From C you can go to A, B or C with probabilities 0.5, 0.2, 0.3. Etc. In this way we can easily define models that are intractable. When you think about language, you need, of course, to think about what A, B, ... represent: words, POS tags, structures of some sort, ideas?..
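Scoring a state sequence under this finite-state model is a direct application of the chain rule; a minimal Python sketch (the transition tables encode the numbers on the slide):

```python
start = {'A': 0.4, 'B': 0.4, 'C': 0.2}
# In this example every state has the same outgoing distribution.
trans = {s: {'A': 0.5, 'B': 0.2, 'C': 0.3} for s in 'ABC'}

def markov_prob(seq):
    """Chain rule for a first-order Markov model:
    P(s1...sn) = P(s1) * prod_i P(s_i | s_{i-1})."""
    p = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    return p

assert abs(markov_prob("AB") - 0.4 * 0.2) < 1e-15
assert abs(markov_prob("ABC") - 0.4 * 0.2 * 0.3) < 1e-15
```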
45 Learning Parameters. Often we are only given a family of probability distributions (defined parametrically, as a process, or by specifying some conditions as above). A key problem in this case might be to choose one specific distribution in this family. This can be done by observing data, by specifying some property of the distribution that we care about that might distinguish a unique element in the family, or both.
46 Probabilistic Inequalities and the Law of Large Numbers. As we will see, large deviation bounds are at the base of the Direct Learning approach. When one wants to directly learn how to make predictions, the approach is basically as follows: look at many examples; discover some regularities in the language; use these regularities to construct a prediction policy.
47 Generalization. The key question the direct learning approach needs to address, given the distribution-free assumption, is that of generalization: why would the prediction policy induced from the empirical data be valid on new, previously unobserved data? Assumption: throughout the process, we observe examples sampled from some fixed, but unknown, distribution, i.e., the distribution that governs the generation of examples during the learning process is the one that governs it during the evaluation process (identically distributed). Assumption: data instances are independently generated. Two basic results from statistics are useful in learning theory (Chernoff bounds and union bounds).
48 Chernoff Bounds. Consider a binary random variable X (e.g., the result of a coin toss) which has probability p of being head (1). Two important questions are: How do we estimate p? How confident are we in this estimate? Consider m samples {x_1, x_2, ..., x_m} drawn independently from X. A natural estimate of the underlying probability p is the relative frequency of x_i = 1, that is, p̂ = Σ_i x_i / m. Notice that this is also the maximum likelihood estimate according to the binomial distribution.
49 Chernoff Bounds. The law of large numbers: p̂ will converge to p as m grows. How quickly? The Chernoff theorem gives a bound on this.
50 Chernoff Bounds: Theorem. For all p ∈ [0, 1], ε > 0: P[ |p − p̂| > ε ] ≤ 2e^{−2mε²}, (2) where the probability P is taken over the distribution of training samples of size m generated with underlying parameter p. Exponential rate of convergence. Some numbers: take m = 1000, ε = 0.05; then e^{−2mε²} = e^{−5} ≈ 1/148. That is: if we repeatedly take samples of size 1000 and compute the relative frequency, then in 147 out of 148 samples our estimate will be within 0.05 of the true probability. We need to be quite unlucky for that not to happen. There exist some other large deviation bounds (aka concentration bounds): Hoeffding, McDiarmid, Azuma, etc. How do you use a bound of this sort to justify generalization?
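A quick simulation (illustrative only; p = 0.3 and the trial counts are arbitrary choices) shows the empirical deviation rate sitting well below the 2e^{−2mε²} bound from the theorem:

```python
import math, random

random.seed(1)
p, m, eps = 0.3, 1000, 0.05
bound = 2 * math.exp(-2 * m * eps ** 2)   # = 2e^{-5}, about 0.0135

trials = 2000
bad = 0
for _ in range(trials):
    # Draw a sample of size m and compute the relative frequency
    p_hat = sum(random.random() < p for _ in range(m)) / m
    if abs(p_hat - p) > eps:
        bad += 1
assert bad / trials <= bound   # the bound holds (and is quite loose here)
```

The gap between the observed rate and the bound is expected: like Chebyshev, the Chernoff bound is distribution-free in spirit and pays for its generality with slack on any particular distribution.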
51 Union Bounds. A second useful result, a very simple one in fact, is the Union Bound: for any n events {A_1, A_2, ..., A_n} and for any distribution P whose sample space includes all A_i, P[A_1 ∪ A_2 ∪ ... ∪ A_n] ≤ Σ_i P[A_i]. (3) The Union Bound follows directly from the axioms of probability theory. For example, for n = 2: P[A_1 ∪ A_2] = P[A_1] + P[A_2] − P[A_1 ∩ A_2] ≤ P[A_1] + P[A_2]. The general result follows by induction on n.
52 Today s summary Descriptive statistics (e.g., Chebyshev s bound) Bayesian Decision Theory, Bayesian Inference Concentration Bounds We will talk about Information Theory next time but we will also start smoothing/language modeling
53 Assignments. Reminder: Feb 13 (midnight) is the deadline for the first critical review. Reminder: let me know about your choice of a paper to present by next Wednesday (Feb 4).
54 Information Theory: Noisy Channel Model. One of the important computational abstractions made in NLP is that of the Noisy Channel Model (Shannon). It is actually just an application of Bayes rule, but it is sometimes a useful way to think about and model problems. Shannon: modeling the process of communication across a channel, optimizing throughput and accuracy; a tradeoff between compression and redundancy. Assumption: a probabilistic process governs the generation of language (messages). A probabilistic corruption process, modeled via the notion of a channel, affects the generated language structure before it reaches the listener.
55 Noisy Channel: Definitions. Let I be the input space, a space of some structures in the language. Let P be a probability distribution over I; that is, this is our language model. The noisy channel has a probability distribution P(o|i) of outputting the structure o when receiving the input i: Input (i) -> Noisy Channel -> o -> Decoder -> î
56 Noisy Channel: more. The goal is to decode: î = argmax_{i∈I} P(i|o) = argmax_{i∈I} P(o|i) P(i) / P(o) = argmax_{i∈I} P(o|i) P(i). In addition to the language model, we also need to know the distribution of the noisy channel. Example, translation: according to this model, English is nothing but corrupted French (true text (i): French; output (o): English).
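The decoding rule can be sketched in a toy Python example in the spirit of the intro sentence "The from needs to be completed" (all probabilities below are invented for illustration, not estimated from any corpus):

```python
# Hypothetical language model P(i): "form" is far more common than "from"
# in this context.
lm = {'form': 0.9, 'from': 0.1}
# Hypothetical channel model P(o|i): probability of typing o when
# intending i (made-up numbers).
channel = {('from', 'form'): 0.3,
           ('from', 'from'): 0.7}

def decode(o, candidates):
    """argmax_i P(o|i) P(i); P(o) is constant in i and can be dropped."""
    return max(candidates, key=lambda i: channel[(o, i)] * lm[i])

# 0.3 * 0.9 = 0.27 for "form" beats 0.7 * 0.1 = 0.07 for "from":
assert decode('from', ['form', 'from']) == 'form'   # the typo gets corrected
```

The same argmax structure underlies spelling correction, speech recognition, and the translation example above; only the definitions of the two distributions change.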
57 More examples
58 Noisy Channel model: Conclusion. The key achievement of Shannon: the ability to quantify the theoretically best data compression possible, the best transmission rate possible, and the relation between them. Mathematically, these correspond to the Entropy of the transmitted distribution and the Capacity of the channel. Roughly: each channel is associated with a capacity, and if you transmit your information at a rate lower than it, the decoding error can be made as small as desired (Shannon's theorem, 1948). The channel capacity is given via the notion of mutual information between the random variables I and O. That is the reason that Entropy and Mutual Information became important notions in NLP.
59 Entropy. For a given random variable X, how much information is conveyed in the message that X = x? How do we quantify it? We can first agree that the amount of information in the message that X = x should depend on how likely it was that X would equal x. Examples: there is more information in a message that there will be a tornado tomorrow than in a message that there will be no tornado (a rare event). If X represents the sum of two fair dice, then there seems to be more information in the message that X = 12 than in the message that X = 7, since the former event happens with probability 1/36 and the latter with probability 1/6.
60 Data Transmission Intuition. Intuition: a good compression algorithm (e.g., Huffman coding) encodes more frequent events with shorter codes; for less frequent events more bits need to be transmitted (longer messages contain more information). Does language compress information in a similar way? To some degree: rare words are generally longer (Zipf, 1938; Guiter, 1974), and language often evolves by replacing frequent long words with shorter words (e.g., bicycle -> bike). There are interesting analyses of applying information theory to language evolution (e.g., Plotkin and Nowak, 2000).
61 Information: Mathematical Form. Let us denote by I(p) the amount of information contained in the message that an event whose probability is p has occurred. Clearly I(p) should be a nonnegative, decreasing function of p. To determine its form, let X and Y be independent random variables, and suppose that P{X = x} = p and P{Y = y} = q. Knowledge that X equals x does not affect the probability that Y will equal y, so the amount of information in the message that X = x and Y = y is I(p) + I(q). On the other hand: P{X = x, Y = y} = P{X = x} P{Y = y} = pq, which implies that this amount of information is I(pq). Therefore, the function I should satisfy the identity I(pq) = I(p) + I(q).
62 Information: Mathematical Form. I(pq) = I(p) + I(q). If we define G(p) = I(2^{−p}) we get from the above that G(p + q) = I(2^{−(p+q)}) = I(2^{−p} · 2^{−q}) = I(2^{−p}) + I(2^{−q}) = G(p) + G(q). We get G(p) = cp for some constant c, or I(2^{−p}) = cp. Letting z = 2^{−p}: I(z) = −c log₂(z) for some positive constant c. Assume that c = 1 and say that the information is measured in bits.
63 Entropy. Consider a random variable X which takes values x_1, ..., x_n with respective probabilities p_1, ..., p_n. Since −log(p_i) represents the information conveyed by the message that X is equal to x_i, it follows that the expected amount of information that will be conveyed when the value of X is transmitted is: H(X) = −Σ_{i=1}^n p_i log₂(p_i). This quantity is known in information theory as the entropy of the random variable X.
64 Entropy: Formal Definition. Let p be a probability distribution over a discrete domain X. The entropy of p is H(p) = −Σ_{x∈X} p(x) log p(x). Notice that we can think of the entropy as H(p) = E_p[ −log p(x) ]. Since 0 ≤ p(x) ≤ 1, we have 1/p(x) ≥ 1, log 1/p(x) ≥ 0, and therefore 0 ≤ H(p) < ∞.
65 Entropy: Intuition. The entropy of a random variable measures its uncertainty: it takes its minimal value (of 0) if one event has probability 1 and the others 0 (no uncertainty), and it is maximal and equal to log|X| when the distribution is uniform. We will see the notion of entropy in three different contexts: as a measure of the information content of a distributional representation; as a measure of information (how much information about X do we get from knowing about Y?); and as a way to characterize and choose probability distributions and, ultimately, classifiers.
66 Entropy: Example. Consider a simple language L over a vocabulary of size 8. Assume that the distribution over L is uniform: p(l) = 1/8 for all l ∈ L. Then H(L) = −Σ_{l∈L} p(l) log p(l) = −log(1/8) = 3. The most efficient encoding for transmission: 3 bits for each symbol. An optimal code sends a message of probability p in −log p bits. If the distribution over L is {1/2, 1/8, 1/8, 1/8, 1/32, 1/32, 1/32, 1/32}, then H(L) = −Σ_{l∈L} p(l) log p(l) = 1/2 · 1 + 3 · (1/8 · 3) + 4 · (1/32 · 5) = 1/2 + 9/8 + 20/32 = 2.25.
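Both slide calculations can be checked in a couple of lines of Python (an illustrative sketch of the definition, with the usual 0 log 0 = 0 convention):

```python
import math

def entropy(probs):
    # H(p) = -sum_x p(x) log2 p(x), with 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
assert abs(entropy(uniform) - 3.0) < 1e-12   # 3 bits per symbol

skewed = [1/2, 1/8, 1/8, 1/8, 1/32, 1/32, 1/32, 1/32]
assert abs(entropy(skewed) - 2.25) < 1e-12   # matches the hand calculation
```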
67 Joint and Conditional Entropy. Let p be a joint probability distribution over a pair of discrete random variables X, Y. The average amount of information needed to specify both their values is the joint entropy: H(p) = H(X, Y) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y). The conditional entropy of Y given X, for p(x, y), expresses the average additional information one needs to supply about Y given that X is known: H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x) = Σ_{x∈X} p(x) [ −Σ_{y∈Y} p(y|x) log p(y|x) ] = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x).
68 Chain Rule. The chain rule for entropy is given by: H(X, Y) = H(X) + H(Y|X). More generally: H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_1, X_2, ..., X_{n−1}).
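The two-variable chain rule can be verified numerically on any joint distribution; a small Python sketch (the joint table below is an arbitrary illustrative choice):

```python
import math

def H(probs):
    # Entropy in bits, with 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint distribution p(x, c) over X = {x, y}, C = {0, 1}
joint = {('x', 0): 0.3, ('x', 1): 0.2, ('y', 0): 0.3, ('y', 1): 0.2}
px = {'x': 0.5, 'y': 0.5}   # marginal of the first variable

H_joint = H(joint.values())
H_X = H(px.values())
# H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)
H_Y_given_X = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items())
assert abs(H_joint - (H_X + H_Y_given_X)) < 1e-12   # chain rule holds
```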
69 Selecting a Distribution: A simple example. Consider the problem of selecting a probability distribution over Z = X × C, where X = {x, y} and C = {0, 1}. There could be many over this space of 4 elements. Assume that by observing a sample S we determine that: p{C = 0} = p(x, 0) + p(y, 0) = 0.6. We need to fill this table:

        C = 0   C = 1   total
X = x
X = y
total   0.6             1
70 Selecting a Distribution: A simple example. Consider the problem of selecting a probability distribution over Z = X × C, where X = {x, y} and C = {0, 1}. There could be many over this space of 4 elements. Assume that by observing a sample S we determine that: p{C = 0} = p(x, 0) + p(y, 0) = 0.6. One way to fill the table:

        C = 0   C = 1   total
X = x   0.3     0.2     0.5
X = y   0.3     0.2     0.5
total   0.6     0.4     1

In this case H(p) = −Σ_z p(z) log p(z) = −(0.3 log 0.3 + 0.2 log 0.2 + 0.3 log 0.3 + 0.2 log 0.2) ≈ 1.366/ln 2 ≈ 1.97.
71 Selecting a Distribution: A simple example. Consider the problem of selecting a probability distribution over Z = X × C, where X = {x, y} and C = {0, 1}. There could be many over this space of 4 elements. Assume that by observing a sample S we determine that: p{C = 0} = p(x, 0) + p(y, 0) = 0.6. Another way to fill the table:

        C = 0   C = 1   total
X = x   0.5     0.1     0.6
X = y   0.1     0.3     0.4
total   0.6     0.4     1

In this case H(p) = −Σ_z p(z) log p(z) = −(0.5 log 0.5 + 0.1 log 0.1 + 0.1 log 0.1 + 0.3 log 0.3) ≈ 1.168/ln 2 ≈ 1.68.
72 Selecting a Distribution: A simple example. In the first case we distributed the unknown probability mass uniformly: H(p) ≈ 1.97. In the second case, it was not uniform: H(p) ≈ 1.68. Note: the maximum entropy for a distribution over 4 elements is −log(1/4) = 2.
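A quick Python check of the comparison (illustrative; both candidate fills satisfy the single constraint p(x,0) + p(y,0) = 0.6 from the slides):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Cells ordered as p(x,0), p(x,1), p(y,0), p(y,1)
uniform_fill = [0.3, 0.2, 0.3, 0.2]   # spreads the mass as evenly as possible
skewed_fill = [0.5, 0.1, 0.1, 0.3]
h1, h2 = H(uniform_fill), H(skewed_fill)
assert h1 > h2        # the more uniform fill has higher entropy (~1.97 vs ~1.68)
assert h1 < 2.0       # 2 bits = maximum entropy over 4 outcomes
```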
73 Maximum Entropy Principle. We should try to select the most uniform distribution which satisfies the constraints we observe, i.e., select a distribution which has the maximum entropy. Intuitively, this minimizes the number of additional assumptions one makes about the distribution. In general, there is no closed-form solution for this problem. The distribution obtained by the MaxEnt principle is the same as the one obtained by Maximum Likelihood estimation for some restricted class of models; we will talk about this when we discuss logistic regression. This is important because we often compute from data the statistics of individual features rather than statistics of their combinations. How do we combine them to get a model?
74 Mutual Information. We showed: H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y), which implies: H(X) − H(X|Y) = H(Y) − H(Y|X). This difference is called the mutual information between X and Y and is denoted I(X; Y): the reduction in the uncertainty of one RV due to knowing about the other. I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y) = Σ_x p(x) log 1/p(x) + Σ_y p(y) log 1/p(y) − Σ_{x,y} p(x, y) log 1/p(x, y) = Σ_{x,y} p(x, y) log 1/p(x) + Σ_{x,y} p(x, y) log 1/p(y) − Σ_{x,y} p(x, y) log 1/p(x, y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ].
75 Mutual Information. I(X; Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]. I(X; Y) can best be thought of as a measure of independence between X and Y: if X and Y are independent then I(X; Y) = 0; if X and Y are dependent then I(X; Y) grows with the degree of dependence, but also with the entropy. I(X; X) = H(X) − H(X|X) = H(X). It is also possible to talk about conditional mutual information: I(X; Y|Z) = H(X|Z) − H(X|Y, Z).
76 Comments on MI. We have defined mutual information between random variables. In many applications this is abused, and people use it to measure pointwise mutual information. The pointwise mutual information between two events x and y can be defined to be: I(x, y) = log [ p(x, y) / (p(x) p(y)) ]. This can be thought of as a statistical measure relating the events x and y.
77 Comments on MI. For example, consider looking at a pair of consecutive words w and w': p(w) is the frequency of the word w in a corpus, p(w') the frequency of the word w', and p(w, w') the frequency of the pair (w, w') in the corpus. Then consider I(w, w') as the amount of information about the occurrence of w at location i given that we know that w' occurs at i + 1. This is not a good measure, and we will see examples of why (statistics such as χ² better represent what we normally want).
78 Comments on MI: An example. 1. Assume that w, w' occur only together. Then: I(w, w') = log [ p(w, w') / (p(w) p(w')) ] = log [ p(w) / (p(w) p(w')) ] = log [ 1/p(w') ]. So the mutual information increases among perfectly dependent bigrams as they become rarer. 2. If they are completely independent: I(w, w') = log [ p(w, w') / (p(w) p(w')) ] = log 1 = 0. So PMI can be thought of as a good measure of independence, but not a good measure of dependence, since it depends on the frequencies of the individual words.
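Both observations can be checked numerically (an illustrative sketch; the frequencies 1e-3 and 1e-6 are arbitrary stand-ins for a common and a rare bigram):

```python
import math

def pmi(p_xy, p_x, p_y):
    # Pointwise mutual information: log2 [ p(x,y) / (p(x) p(y)) ]
    return math.log2(p_xy / (p_x * p_y))

# Perfectly dependent bigrams: each word occurs only in the pair,
# so p(w, w') = p(w) = p(w')
common, rare = 1e-3, 1e-6
assert abs(pmi(common, common, common) - math.log2(1 / common)) < 1e-9
assert abs(pmi(rare, rare, rare) - math.log2(1 / rare)) < 1e-9
# The rarer the perfectly dependent pair, the higher its PMI:
assert pmi(rare, rare, rare) > pmi(common, common, common)
# A completely independent pair has PMI = 0:
assert pmi(common * common, common, common) == 0.0
```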
79 Relative Entropy or Kullback-Leibler Divergence. Let p, q be two probability distributions over a discrete domain X. The relative entropy between p and q is D(p, q) = D(p‖q) = Σ_{x∈X} p(x) log [ p(x)/q(x) ]. Notice that D(p‖q) is not symmetric. It can be viewed as D(p‖q) = E_p[ log p(x)/q(x) ], the expectation according to p of log p/q. It is therefore unbounded, and not defined if p gives positive support to instances to which q does not. A term with p(x) = 0 contributes 0.
80 KL Divergence: Properties
Lemma. For any probability distributions p, q, D(p || q) >= 0, and equality holds if and only if p = q.
The KL divergence is not a metric: it is not symmetric and does not satisfy the triangle inequality. We can still think of it as a distance between distributions.
MI can be thought of as the KL divergence between the joint probability distribution on X, Y and the product distribution defined on X, Y:
I(X; Y) = D(p(x, y) || p(x) p(y)).
In other words, it is a measure of how far the true joint distribution is from independence.
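The identity I(X; Y) = D(joint || product) can be verified directly. A minimal sketch on a toy joint distribution (the numbers are arbitrary choices of mine):

```python
from math import log2

def kl(p, q):
    """D(p || q) for distributions given as dicts over the same domain.
    Terms with p(x) = 0 contribute 0 by convention."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# A small joint distribution over (X, Y), with X and Y dependent.
joint = {("a", 0): 0.3, ("a", 1): 0.2, ("b", 0): 0.1, ("b", 1): 0.4}
px = {"a": 0.5, "b": 0.5}   # marginal of X
py = {0: 0.4, 1: 0.6}       # marginal of Y
product = {(x, y): px[x] * py[y] for x in px for y in py}

mi = kl(joint, product)     # I(X; Y) = D(joint || product)
print(mi)                   # positive, since X and Y are dependent
print(kl(joint, joint))     # 0.0, since D(p || p) = 0
```

The second print illustrates the equality case of the lemma: D(p || q) = 0 exactly when p = q.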
81 Cross Entropy
We can now think about the notion of the entropy of a language. Assume that you know the distribution; then you can think of the entropy as a measure of how surprised you are when you observe the next character in the language.
Let p be the true distribution, and q some model that we estimate from a sample generated by p.
The cross entropy between a random variable X distributed according to p and another probability distribution q is given by:
H(X, q) = H(X) + D(p || q) = - sum_x p(x) log p(x) + sum_x p(x) log [ p(x) / q(x) ] = - sum_x p(x) log q(x).
82 Cross Entropy
H(X, q) = - sum_x p(x) log q(x).
Notice that this is just the expectation of log(1/q(x)) with respect to p(x):
H(X, q) = E_p [ log (1 / q(x)) ].
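The decomposition H(X, q) = H(X) + D(p || q) is easy to see numerically. A minimal sketch with toy distributions of my choosing:

```python
from math import log2

def cross_entropy(p, q):
    """H(X, q) = -sum_x p(x) log2 q(x), for dicts over the same domain."""
    return -sum(px * log2(q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # "true" distribution
q = {"a": 0.25, "b": 0.5, "c": 0.25}   # an imperfect model of p

print(cross_entropy(p, p))  # 1.5 bits: equals H(X), since D(p || p) = 0
print(cross_entropy(p, q))  # 1.75 bits: larger by exactly D(p || q)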
83 Cross Entropy
We do not need to know p if we make some assumptions.
If we assume that p (our language) is stationary (probabilities assigned to sequences are invariant with respect to shifts) and ergodic (the process is not sensitive to initial conditions), then the Shannon-McMillan-Breiman theorem shows that:
H(X, q) = lim_{n -> inf} - (1/n) log q(w_1 w_2 ... w_n).
We can define a model q and approximate the entropy of the language under that model (hope: the lower the entropy, the better the model).
How can we do it in practice?
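In practice, then, one scores a long held-out sample under the model and averages per symbol. A minimal sketch, assuming for simplicity a unigram model q with independent symbols (the model and sample are made up):

```python
from math import log2

def per_symbol_cross_entropy(sequence, q):
    """Estimate of H(X, q): -(1/n) log2 q(w_1 ... w_n),
    for a unigram model q that treats symbols as independent."""
    return -sum(log2(q[w]) for w in sequence) / len(sequence)

q = {"a": 0.5, "b": 0.5}     # a uniform unigram model over two characters
sample = list("abbaabab")    # a sample standing in for text from the language
print(per_symbol_cross_entropy(sample, q))  # 1.0 bit per symbol
```

With a real language model, q(w_1 ... w_n) would factor by the chain rule into conditional probabilities; the per-symbol averaging is the same.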
84 Cross Entropy for Different Models
Model       Cross Entropy (bits)
0-th order  4.76 (= log 27; uniform over 27 characters)
1-st order  4.03
2-nd order  2.8
The cross entropy can be used as an evaluation metric for language models (or, equivalently, the perplexity 2^H).
Reducing the perplexity thus means improving the language model. Notice that this does not always mean better predictions, though sometimes we need distributions, not predictions...
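The perplexities corresponding to the table can be computed directly from 2^H; this small sketch does just that for the three character models above:

```python
def perplexity(cross_entropy_bits):
    """Perplexity is 2 raised to the cross entropy (measured in bits)."""
    return 2 ** cross_entropy_bits

# Perplexities for the character models in the table above:
for name, h in [("0-th order", 4.76), ("1-st order", 4.03), ("2-nd order", 2.8)]:
    print(name, round(perplexity(h), 1))
```

The 0-th order model comes out at about 27, matching the uniform distribution over 27 characters; the 2-nd order model reduces this to about 7 effective choices per character.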
85 Information Theory: Summary Noisy Channel Model: many applications in NLP Entropy: measure of uncertainty, max-ent principle Mutual Information: measure of r.v. dependence KL divergence: distance between distributions Cross Entropy: language model evaluation