Ling/CMSC 773 Take-Home Midterm Spring 2008


Solution set. This document is a solution set. It will only be available to students temporarily. It may not be kept, nor shown to anyone outside the current class at any time. Typical correct answers are given along with some guidance as to point values for grading; notwithstanding that explicit guidance, all grading is discretionary and based on the level of understanding and competence explicitly demonstrated in answers.

1 Short Answer Questions [30 points total]

Answer each question clearly and briefly. (Even if you make sure to be pretty thorough, which you should in order to obtain partial credit, these are short answers, not long essays.)

(a) Give a constituency tree for Many students said they enjoyed the interesting exam (using the usual sort of notation from class, please, nothing linguistically fancy). Give a head table for the context-free rules needed to construct the tree, i.e. context-free rules annotated to show which constituent is the head. Then show explicitly, step by step, how to convert your constituency tree into an unlabeled dependency tree.

Answer: Tree and head table should be reasonable, and students must show or describe explicitly the intermediate steps of (a) propagating the heads up the tree, and (b) using the propagated heads to extract the dependency relations. (Guidance: 2 points for each of those two criteria, otherwise discretionary.)
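Not part of the original solution: a minimal Python sketch of the two conversion steps in (a), head propagation followed by dependency extraction. The specific tree, head table, and helper names are illustrative assumptions, not the expected exam answer.

```python
# Illustrative sketch of constituency-to-dependency conversion via head propagation.
# The tree, head table, and function names are assumptions for demonstration only.

TREE = ("S",
        ("NP", ("JJ", "Many"), ("NNS", "students")),
        ("VP", ("VBD", "said"),
               ("SBAR", ("S",
                   ("NP", ("PRP", "they")),
                   ("VP", ("VBD", "enjoyed"),
                          ("NP", ("DT", "the"), ("JJ", "interesting"), ("NN", "exam")))))))

# Head table: for each parent label, the index of the head child in its rule.
HEAD_CHILD = {"S": 1, "NP": -1, "VP": 0, "SBAR": 0}

def lexical_head(node):
    """Propagate heads bottom-up: return the head word of a subtree."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]                      # preterminal: the word itself is the head
    return lexical_head(children[HEAD_CHILD[label]])

def dependencies(node, deps=None):
    """Extract (head, dependent) arcs: each non-head child's head depends on the parent's head."""
    if deps is None:
        deps = []
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return deps
    h = lexical_head(node)
    for child in children:
        ch = lexical_head(child)
        if ch != h:
            deps.append((h, ch))                # parent's head governs the non-head child's head
        dependencies(child, deps)
    return deps

print(dependencies(TREE))
# -> [('said', 'students'), ('students', 'Many'), ('said', 'enjoyed'), ...]
```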
(b) For this question, assume data sparseness is not a problem, i.e. assume all probabilities can be estimated accurately. Consider the difficulty that a simple probabilistic CFG has in distinguishing the likelihood of the sentences people eat roasted peanuts and peanuts eat roasted people. Briefly describe one solution to this problem and its advantages and disadvantages. Now consider the sentence Children sometimes munch chocolate soldiers on Christmas morning. With the solution you just described, is the probability of this sentence likely to have a value that accords with your intuition? Explain why or why not.

Answer: Recall that the difficulty in this example is that the two sentences have the same probability, since the expansions from N → peanuts and from N → people take place without regard to where they are in the tree. An obvious solution to pick is fully lexicalizing the grammar, i.e. propagating terminal symbols up into the nonterminals so that you'd have rules like VP[eat] → V[eat] NP[people]. This has the advantage of making lexically based co-occurrence probabilities relevant in the grammar, e.g. the above rule would be much less likely than VP[eat] → V[eat] NP[peanuts], so the two sentences' trees would have different probabilities. However, chocolate soldier is still a problem NP for that solution: the problem would show up in the probability for VP[munch] → V[munch] NP[soldiers], which would be low assuming any reasonably realistic training set. (The corpus is a proxy for knowledge about what happens in the world, and generally soldiers don't get munched.) Chocolate soldiers behave differently from real soldiers, both linguistically and in the world. The linguistically interesting aspect of this example is that it violates the assumption, which holds true pretty generally, that the distribution of a phrase is heavily dominated by the head of the phrase.

(c) Suppose you have an unfair coin that comes up heads 7/8 of the time when you flip it. Suppose you flip the coin 1000 times and report the outcome each time, paying one penny per bit for your communication. (In cash, due before you get to send the message.) If you're thrifty (you want to pay as little as possible) and smart (you've taken this class), what's the lowest amount you could expect to get away with paying? Justify your answer.

Answer: If we were communicating the results of 1000 trials all at once, we could choose a code where heads gets a shorter representation than tails. In that case, the entropy of the distribution provides a lower bound for the average message length; that's H(1/8, 7/8) = -(1/8 * log2(1/8) + 7/8 * log2(7/8)) ≈ 0.54 bits, so the total payment would have a lower bound of $5.40. However, the problem says you've got to communicate the outcome each time. Unless you've proposed some way to communicate less than a single bit of information in a message (none that I'm aware of), there's no way to encode heads in less than a bit. This means that the best you can do is send 1 for heads and 0 for tails (or vice versa), for a total best cost of $10.00. (Grading guidance: 3/5 for using the entropy to get $5.40. Another point for recognizing the fraction-of-a-bit problem but getting something other than $10.00.)
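A quick numerical check of the entropy bound in (c); this snippet is illustrative and not part of the original solution.

```python
import math

# Entropy of the biased coin, in bits per flip.
p_heads = 7 / 8
H = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))
print(f"H(1/8, 7/8) = {H:.4f} bits per flip")               # about 0.5436

# Lower bound if all 1000 outcomes could be encoded jointly, at one penny per bit.
print(f"block-coding lower bound: ${1000 * H / 100:.2f}")    # about $5.44 (the solution rounds H to 0.54, giving $5.40)

# Reporting each outcome separately forces at least one bit per flip.
print(f"one-message-per-flip cost: ${1000 * 1 / 100:.2f}")   # $10.00
```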

(d) A former Amazon employee once said: "We sold more books today that didn't sell at all yesterday than we sold today of all the books that did sell yesterday." (To state this in plainer English, "If you look at our infrequently-sold books, the total number we sell (let's call it N) is very large, even if each individual book is only bought by a few people. And if you look at N, it's bigger than the total sales of the frequently sold books!") George Zipf would be unsurprised. Briefly explain why.

Answer: This quote was taken from The Long Tail at Wikipedia. The Long Tail is a property of many distributions, including Zipf's. From Wikipedia: "In these distributions a high-frequency or high-amplitude population is followed by a low-frequency or low-amplitude population which gradually tails off. In many cases the infrequent or low-amplitude events... can cumulatively outnumber or outweigh the initial portion... such that in aggregate they comprise the majority." (Guidance: Some reference to the long tail property of Zipfian distributions must be made.)

(e) Briefly explain/illustrate how a measure of uncertainty can also be considered a measure of surprise and a measure of quantity of information.

Answer: This can be illustrated by considering a scenario in which outcomes are communicated between parties who share knowledge of the probability distribution over those outcomes, e.g. the result of a horse race (if one imagines the same horses racing over and over again, with a fixed probability distribution for which horse will win). If the outcome is completely certain, then uncertainty is 0, and so is the quantity of information that needs to be communicated, and equally so the recipient's level of surprise. If the result is maximally uncertain, i.e. a uniform distribution, then the recipient has absolutely no information in advance, so one could not be any more surprised by an outcome, and the quantity of information communicated is at its maximum. (Guidance: All three characterizations must be mentioned.)

(f) Here's an argument: Consider the sentences (i) Hold the newsreader's nose squarely, waiter, or friendly milk will countermand my trousers, and (ii) Countermand friendly, hold milk my, newsreader's nose or squarely the trousers waiter will. Neither (it's fair to assume) has ever occurred in the prior experience of a speaker of English, and yet any such speaker would readily identify the former as grammatical and the latter as ungrammatical. Therefore one cannot associate the notion "grammaticality of a sentence in English" with the notion "likelihood of a sentence in English." Briefly but convincingly demolish this argument.

Answer: The student should be able to generalize from the argument involving colorless green ideas sleep furiously to this example. The argument is done nicely in Section 4.2 of the Abney (1996) reading, and in the (optional) Pereira (2000) reading, Pereira shows that using the aggregate bigram model (Question 2!), the grammatical and ungrammatical versions of the sentence differ in probability by five orders of magnitude. Applying those arguments would be nice, but I'm ok with any reasonable discussion showing that the student knows maximum likelihood estimation of a bigram model is a naïve way to assign probabilities to sentences (i) and (ii), and that better correlations between probability and grammaticality intuitions can be obtained with more sophisticated probabilistic modeling (e.g. good smoothing, hidden classes, backoff to part-of-speech tags).

2 EM [20 points]

Consider the following hidden model variation on a bigram model for word sequences. As in the usual bigram model, we express the probability of an entire sequence w_1 w_2 ... w_T by

(1)  p(w_1 w_2 \ldots w_T) = p(w_1) \prod_{t=2}^{T} p(w_t \mid w_{t-1}).

However, the parameters used in the product are defined as follows:

(2)  p(w_t \mid w_{t-1}) = \sum_{c=1}^{C} p(w_t \mid c)\, p(c \mid w_{t-1})

In plain English, the generative story for this model is the following. Instead of generating the next word w_t based on the previous word w_{t-1}, as in a usual bigram model, we generate a class c based on w_{t-1}, and then we generate w_t based on c.[1] So the probability of choosing the next word w_t is a sum of probabilities, one for each hidden class. The probability contributed to the choice of w_t for each class c, namely p(w_t | c) p(c | w_{t-1}), represents the joint probability of two events: picking c based on w_{t-1}, and then picking w_t based on c. Since c is hidden, we have to sum up over the different possibilities. Intuitively, the hidden class c can be viewed as capturing the general properties of w_{t-1} that are relevant for generating the next word. Or you can think of going through hidden class c to get from w_{t-1} to w_t.

As you can see, there are two sets of parameters in this model. The first set is the word-to-class probabilities p(c | w_i), and the second set is the class-to-word probabilities p(w_j | c). (Where both w_i and w_j range over the entire vocabulary, and c ranges over the set of C hidden classes.) Because of the way the model is structured, there's an EM algorithm for estimating these parameters that is much simpler than the Forward-Backward algorithm. In particular, there's no need at all for dynamic programming.[2] For the sake of consistent notation, please use N(x, y) as your notation for counts. E.g. N(w_i, w_j) would be the number of times w_i is followed by w_j.

(a) (2 points) In addition to capturing generalizations (by associating words with abstract classes that are learned automatically), an advantage of this model is that in practice it has far fewer parameters than a usual (non-hidden) bigram model. Let the size of the vocabulary be V and the number of classes be C. How many classes can you use in the hidden model before the total number of parameters exceeds the number of parameters in the usual (non-hidden) bigram model? What does the comparison between the two models' total parameter counts look like for typical values of C = 32 and V = 50000?

Answer: A bigram model has V^2 parameters (or more properly V(V+1) if you include the parameters for starting the string with w_i for each w_i, but O(V^2) in any case, so I won't nitpick on that point). The hidden model has O(2CV) parameters. Minus nitpicking, the latter reaches the former when C exceeds V/2. The bigram model's count is many orders of magnitude larger in the usual circumstances: 2.5 × 10^9 versus 3.2 × 10^6. (Guidance: require explicit mentions of V^2 and 2CV, and a correct explicit comparison using typical values.)

(b) (2 points) Suppose the classes c were observable rather than hidden. Express the maximum likelihood estimate for the probability p(w_j | c) in terms of counts.

Answer: p(w_j \mid c) = \frac{N(w_j, c)}{\sum_{w_j} N(w_j, c)}. Expressing the denominator as N(c) is ok.

[1] Also, as usual, we can assume w_1 is always a special start word that always starts an observed sequence with probability 1.

[2] Intuitively, this is because the choice of the hidden class at every step depends only on what's observable, i.e.
hidden classes are independent of each other given the intervening word.

(c) (2 points) Suppose the classes c were observable rather than hidden. Express the maximum likelihood estimate for the probability p(c | w_i) in terms of counts.

Answer: p(c \mid w_i) = \frac{N(c, w_i)}{\sum_c N(c, w_i)}. Expressing the denominator as N(w_i) is ok.

(d) (7 points) Recall that for many cases of EM algorithms, the basic structure of the algorithm can be described as follows:

1. Set initial values for parameters \mu.
2. E-step: Figure out expected counts of relevant events, where those events typically involve both observable and hidden values, using the current parameters \mu to determine what's expected.
3. M-step: Use the expected counts to compute \mu_new, a new set of parameters. In the EM algorithms we've studied, this is a maximum likelihood estimate, i.e. just normalizing (expected) counts.
4. If the probabilities have converged (or if we've done some maximum number of iterations), stop; otherwise let \mu = \mu_new and go back to the E-step for the next iteration.

If you read about the EM algorithm for this model, the updating of the parameters is described as having the following M-step for the two sets of parameters:

(3)  p_{new}(c \mid w_i) = \frac{\sum_{w_j} N(w_i, w_j)\, p(c \mid w_i, w_j)}{\sum_c \sum_{w_j} N(w_i, w_j)\, p(c \mid w_i, w_j)}

(4)  p_{new}(w_j \mid c) = \frac{\sum_{w_i} N(w_i, w_j)\, p(c \mid w_i, w_j)}{\sum_{w_j} \sum_{w_i} N(w_i, w_j)\, p(c \mid w_i, w_j)}

The E-step is described as:

(5)  p(c \mid w_i, w_j) = \frac{p(w_j \mid c)\, p(c \mid w_i)}{\sum_{c'} p(w_j \mid c')\, p(c' \mid w_i)}

In words, p(c | w_i, w_j) can be thought of as the probability that c was the hidden state that got used when going from w_i to w_j. Explain why equation 4 is the way to calculate the new value for p_new(w_j | c) given the previous guesses for the parameters. Partial credit will be assigned for good explanations, but a perfect answer will show via equations (along with written explanation) how to derive equation 4 using the non-hidden maximum likelihood estimate (part b) as the starting point. (I.e. the form of the explanation will follow the same general schema we used for deriving the update for HMM transition probabilities a_{ij}, starting with the maximum likelihood estimate you'd use if all the state transitions were visible. Though, as was mentioned, no dynamic program is needed.)

Answer: What's most important here is recognizing that although you want to assign probabilities involving c, you can only observe transitions from w_i to w_j; therefore the key step is distributing the credit for every such transition to all the c's that transition could have gone through. Although p(c | w_i, w_j) was given to you, it can be derived as follows by taking observable maximum likelihood estimation as the starting point. If we could observe the c, we would compute

(6)  p(c \mid w_i, w_j) = \frac{N(w_i, c, w_j)}{\sum_c N(w_i, c, w_j)}.

Since we can't, we provide an estimate in terms of expectations using the current probability estimates:

(7)  p(c \mid w_i, w_j) = \frac{E[N(w_i, c, w_j)]}{E[\sum_c N(w_i, c, w_j)]}.

Now, there are N(w_i, c, w_j) transitions from w_i to w_j that go through c. This count can be seen as the total number of opportunities (transitions from w_i to w_j) times the expected probability of that opportunity going through c. According to the model, that probability would be Pr(c | w_i) Pr(w_j | c). This gives us

(8)  p(c \mid w_i, w_j) = \frac{N(w_i, w_j)\, p(w_j \mid c)\, p(c \mid w_i)}{\sum_{c'} N(w_i, w_j)\, p(w_j \mid c')\, p(c' \mid w_i)}.

We can pull N(w_i, w_j) outside the sum in the denominator, and then cancel with the same value in the numerator, yielding the expression for p(c | w_i, w_j) given above.

Now we are in a position to derive p_new(w_j | c). The logic is quite similar. We'd like

p_{new}(w_j \mid c) = \frac{N(w_j, c)}{\sum_{w_j} N(w_j, c)}

but the c are unobserved, so we need to use expectations based on the current probabilities, which we can express as

p_{new}(w_j \mid c) = \frac{E[N(w_j, c)]}{E[\sum_{w_j} N(w_j, c)]}.

Considering the numerator, the number of times we go to w_j, and do so via c, is a sum of counts over all the w_i we could have come from, i.e. \sum_{w_i} N(w_i, c, w_j). We can now play the usual trick of expressing that total count in terms of the number of observed opportunities to go through c (which is the number of times we go from w_i to w_j, i.e. N(w_i, w_j)) and the probability of such an opportunity actually going through c, which is p(c | w_i, w_j). Thus, we have reduced the problem of the expected value for the numerator to the problem of figuring out p(c | w_i, w_j). But that's solved: Eq. (5) shows how to compute that value in terms of the data and the current model's estimates for p(c | w_i) and p(w_j | c). The denominator follows trivially.

(e) (7 points) Same as part (d), but explain why equation 3 is the way to update p_new(c | w_i).

Answer: The logic is identical to part (d). Moving from observable counts to expectations, you get

p_{new}(c \mid w_i) = \frac{E[N(c, w_i)]}{E[\sum_c N(c, w_i)]}.

Sum N(w_i, c, w_j) over w_j to get the joint count you want in the numerator, so now the problem is again reduced to estimating N(w_i, c, w_j) in terms of the observed N(w_i, w_j) and the current model probability estimates, same as above. Since we're conditioning on w_i, the denominator must sum the numerator over classes c.

For those who have ever looked more formally at EM: you will not receive credit for deriving the EM updates from the model's likelihood function, using Lagrange multipliers to optimize, etc. I don't want to see a Q function. I'm looking for an understanding of the updates in relation to the non-hidden maximum likelihood estimate, as was our focus in discussions of the Forward-Backward and Inside-Outside algorithms.
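The following is a minimal numpy sketch (not part of the original solution) of EM iterations for this aggregate bigram model, using Eq. (5) for the E-step and Eqs. (3)-(4) for the M-step. The toy sizes, random initialization, and variable names are assumptions.

```python
import numpy as np

# Toy sizes; in practice V is the vocabulary size and C the number of hidden classes.
V, C = 6, 2
rng = np.random.default_rng(0)

# N[i, j] = number of times word i is followed by word j (observed bigram counts).
N = rng.integers(1, 6, size=(V, V)).astype(float)

# Parameters: p_c_given_w[i, c] = p(c | w_i);  p_w_given_c[c, j] = p(w_j | c).
p_c_given_w = rng.random((V, C))
p_c_given_w /= p_c_given_w.sum(axis=1, keepdims=True)
p_w_given_c = rng.random((C, V))
p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True)

for iteration in range(20):
    # E-step, Eq. (5): posterior[i, j, c] = p(c | w_i, w_j)
    joint = p_c_given_w[:, None, :] * p_w_given_c.T[None, :, :]   # p(c|w_i) * p(w_j|c)
    posterior = joint / joint.sum(axis=2, keepdims=True)

    # Expected counts: E[N(w_i, c, w_j)] = N(w_i, w_j) * p(c | w_i, w_j)
    expected = N[:, :, None] * posterior

    # M-step, Eq. (3): p_new(c | w_i), normalizing over classes for each w_i.
    num_c = expected.sum(axis=1)                      # sum over w_j -> shape (V, C)
    p_c_given_w = num_c / num_c.sum(axis=1, keepdims=True)

    # M-step, Eq. (4): p_new(w_j | c), normalizing over words for each class c.
    num_w = expected.sum(axis=0).T                    # sum over w_i -> shape (C, V)
    p_w_given_c = num_w / num_w.sum(axis=1, keepdims=True)

print(np.round(p_c_given_w, 3))
print(np.round(p_w_given_c, 3))
```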

3 N-gram models [15 points]

Consider the following problem, known as language identification: given a previously unseen string of natural language text, what language is it written in? This is often solved using n-gram models. Assume you have training samples containing 20,000 words of running text in each of k languages, {L_1, ..., L_k}. Furthermore, to keep things simple, assume that all of these languages are alphabetic (i.e. not ideographic like many Asian languages), that all data is in Unicode, and that all k languages use significantly overlapping alphabets (e.g. you could suppose they all use a roman alphabet with accents/diacritics). If you need to for concreteness, you can assume the genre is newswire, and that there is no markup. Now, imagine that you are asked to design a solution to the language identification problem based on n-grams.

(a) (5 points) Would you use n-grams composed of words or characters? Justify your answer in quantitative terms, clearly stating any assumptions you make (e.g. about sizes of vocabularies, sizes of alphabets, or any other relevant quantities).

Answer: I expect characters, because of data sparseness; any argument for words would have to be quite innovative and strong. Answers should involve comparisons of V^n parameters for alphabets (typically, say, 50) versus vocabularies (typically at least in the tens of thousands). (Guidance: -2 points if no explicit discussion of what n would be in this scenario: n matters, since sufficiently wide character n-grams can be just as sparse as small word n-grams.)

(b) (5 points) Explain in detail how, given a new piece of text in one of the k languages, you would take a Bayesian approach to identifying which language L_i it is written in. You can assume that all k languages are a priori equally likely.

Answer: If O is the new observed text, the goal is to choose L_i maximizing p(L_i | O). Using Bayes' Rule, the assumption of uniform priors, and the fact that p(O) is constant, this can be shown to be equivalent to maximizing p(O | L_i) or, to write it another way, p_i(O), where p_i is a model trained on a sample of L_i. (Guidance: 2 points for just computing the probabilities p_i(s) for string s and picking the i with the highest probability, without Bayes. Evaluating perplexity/cross entropy rather than likelihood is also ok.)

(c) (5 points) Suppose you have read about two different kinds of n-gram model (e.g. a bigram model and a trigram model, or a standard bigram model and the aggregate bigram model of the previous question) and you want to know which kind performs better on this problem. Under the assumption that you have only the small amount of data in each language as described above, explain in detail how you would conduct an evaluation to assess which of the two kinds of models is better for this problem.

Answer: Minimally, any evaluation must have a clear distinction between training and test data. (Guidance: deduct 4 points if absent.) In this scenario, one would use m-fold cross-validation because there is so little data available. (Guidance: deduct 2 points if absent.)
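A minimal sketch (not part of the original solution) of the Bayesian character-bigram approach from parts (a) and (b), with add-one smoothing; the tiny training samples and helper names are placeholders.

```python
import math
from collections import Counter

def char_bigrams(text):
    """Yield character bigrams, padding with a start symbol."""
    padded = "^" + text
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(sample):
    """Collect add-one-smoothed character-bigram statistics from a training sample."""
    bigrams = Counter(char_bigrams(sample))
    contexts = Counter(bg[0] for bg in char_bigrams(sample))
    alphabet_size = len(set(sample) | {"^"})
    return bigrams, contexts, alphabet_size

def log_likelihood(model, text):
    """log p(text | L_i) under the smoothed character bigram model for language L_i."""
    bigrams, contexts, A = model
    return sum(math.log((bigrams[bg] + 1) / (contexts[bg[0]] + A))
               for bg in char_bigrams(text))

# Placeholder training samples; in the exam scenario these would be 20,000 words each.
SAMPLES = {"english": "the cat sat on the mat and the dog ran",
           "french":  "le chat est sur le tapis et le chien court"}
MODELS = {lang: train(text) for lang, text in SAMPLES.items()}

def identify(text):
    """With uniform priors, argmax_i p(L_i | O) reduces to argmax_i p(O | L_i)."""
    return max(MODELS, key=lambda lang: log_likelihood(MODELS[lang], text))

print(identify("le chien"))   # expected: french
```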

4 Relative Entropy [20 points]

Selectional preferences are semantic constraints that a predicate places on its arguments. For example, the verb drink prefers that its objects be in the semantic class beverage, which is why John drank the wine sounds a lot better than, say, John drank the toaster. In the early 1990s, the following was proposed as a probabilistic model of selectional preferences:

S_{object}(v) = D(p(c \mid v) \,\|\, p(c)) = \sum_{c \in \text{Classes}} \Pr(\text{object is in class } c \mid \text{verb is } v) \log \frac{\Pr(\text{object is in class } c \mid \text{verb is } v)}{\Pr(\text{object is in class } c)}

A_{object}(v, c) = \Pr(\text{object is in class } c \mid \text{verb is } v) \log \frac{\Pr(\text{object is in class } c \mid \text{verb is } v)}{\Pr(\text{object is in class } c)}

The value S_object(v) was referred to as the selectional preference strength of verb v, and the value A_object(v, c) was called the selectional association between the verb v and a particular semantic category c.

(a) Assuming probabilities are estimated accurately (for example, pretend the classes are observable), and assuming that verb-object frequencies in the corpus are a reasonably accurate reflection of the real world, would the model correctly predict that A_object(drink, beverage) should be higher than A_object(drink, appliance)? Explain why or why not.

(b) Below, the graph on the left shows A_object(v_1, c) and the graph on the right shows A_object(v_2, c), for two verbs v_1 and v_2. (The x-axis ranges over possible semantic categories c, and each bar gives the value of A_object(v, c).) As you can see, one of the verbs is eat and the other one is find. Even without the labels, you would have been able to identify which was which, based on what you know about the model (and the assumptions we've made). Explain how, with explicit reference to the formal definition of the model.

(c) The person who proposed this model claimed that S_object(v) models, quantitatively, how much information the verb carries about the semantic category of its direct object. Explain why this claim is true.

Answer:

(a) Assuming a corpus where probabilities are estimated accurately and reflect the way things work in the real world, the prediction would be correct. Consider the component probabilities in the definition of A_object. Given that the verb is drink, the probability of a beverage as direct object is high; but the marginal probability of beverages as direct objects is low (not likely across all verbs). So the ratio will have a high value, as will the value it's being multiplied by. In contrast, for appliance direct objects, the conditional probability given the verb drink is low, so the ratio is going to have a smallish value (even if the marginal probability of appliance is also low). Notice how similar this is to the homework problem where you were asked to look at the way certain terms did and did not contribute to the relative entropy between two probability distributions.

(b) Verb find is on the left and eat is on the right. Since find is not particularly selective about the semantic class of its direct object (you can find happiness, a job, a planet, your keys, your long lost uncle, etc.), no one semantic category is likely to particularly dominate the conditional probability distribution Pr(object is in class c | verb is v), which is consistent with the flat, relatively undifferentiated profile on the left. (Notice that if c is relatively independent of v, we expect Pr(c | v) ≈ Pr(c), which implies Pr(c | v) log [Pr(c | v)/Pr(c)] ≈ Pr(c | v) log 1 = 0.) In contrast, the verb eat is highly selective about the semantic category of its direct objects, which is reflected in the spiky profile on the right hand side. (But note that neither of these profiles represents class frequency, which should be obvious since there are negative values for some classes.) As an aside, notice that although both profiles have some negative values, the magnitudes of the negative values are very small relative to the magnitudes of the positive values. This, too, relates to the homework problems that involved relative entropy. It also provides a visual, intuitive explanation for the somewhat counterintuitive fact that D(p || q) can't ever be negative.

(c) The truth of the claim follows from the definition of S_object as the relative entropy (K-L divergence) between two particular distributions. Probability Pr(object is in class c | verb is v) is the probability of the class given the verb. Probability Pr(object is in class c) can be seen as the probability of a particular semantic class in direct object position in the absence of information about the specific verb. Now, recall that D(p || q) can be interpreted as the number of bits wasted, on average, by encoding events using distribution q as the basis for the encoding rather than the true distribution p. In this case, we are comparing the true distribution, when you know what the verb is, with a less informed model where you ignore the verb. Therefore the relative entropy can be interpreted as the number of extra bits you need on average to identify c if you don't know v. Conversely, it's the information you gain about semantic class c if you choose to use the distribution where you do know what the verb is. Therefore it's fair to say that S_object measures the quantity of information the verb carries about the semantic category of the direct object.

Incidentally, this model was first proposed in Resnik (1993), Selection and Information: A Class-based Approach to Lexical Relationships, Ph.D. dissertation, Computer and Information Science, University of Pennsylvania.
(Guidance: most credit for the right general argument, but deduct if information is not characterized in the sense of information theory or if K-L divergence is not included in the discussion.)
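As an illustration of the definitions above (not part of the original solution), here is a small computation of S_object(v) and A_object(v, c) from hypothetical, made-up distributions over a handful of classes.

```python
import math

# Hypothetical distributions over a few semantic classes (made up for illustration).
# p_c: marginal probability that a direct object is in class c, across all verbs.
p_c = {"beverage": 0.05, "food": 0.20, "appliance": 0.05, "abstract": 0.70}

# p_c_given_drink: probability of each class given that the verb is "drink".
p_c_given_drink = {"beverage": 0.80, "food": 0.10, "appliance": 0.01, "abstract": 0.09}

def selectional_association(p_cond, p_marg, c):
    """A_object(v, c) = Pr(c | v) * log2( Pr(c | v) / Pr(c) )."""
    return p_cond[c] * math.log2(p_cond[c] / p_marg[c])

def selectional_preference_strength(p_cond, p_marg):
    """S_object(v) = D( p(c | v) || p(c) ), i.e. the sum of A_object(v, c) over classes."""
    return sum(selectional_association(p_cond, p_marg, c) for c in p_cond)

for c in p_c:
    print(f"A(drink, {c}) = {selectional_association(p_c_given_drink, p_c, c):+.3f}")
print(f"S(drink) = {selectional_preference_strength(p_c_given_drink, p_c):.3f}")
# Individual A values can be negative, but S (a K-L divergence) is never negative.
```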

5 Parsing and Evaluation [15 points]

You have been hired by AwesomeSearchEngine.com to make recommendations concerning language technology. They have been approached by Whizdee, Inc., a startup company, and in a sales presentation Whizdee's sales representative says the following: "We're doing very exciting work on parsing, and our results are very impressive. In one experiment, we trained one of our parsers on 80% of the Penn Treebank and tested on the other 20%, and its labeled recall on constituent boundaries was 98.2%. We also did an experiment, training on the same data, where the labeled precision for constituent boundaries was 97.6%. With numbers like that, how could you lose?"

(a) Consider the true parse T, and the system parse P, below:

[T] (S (NP the prof) (VP (VPRT looked up) (NP the grade)) (ADVP today))
[P] (S (NP the prof) (VP looked (PP up (NP the grade)) (ADVP today)))

What are the values for labeled precision and labeled recall? Note that I've omitted all part of speech labels because they're not used in constituent recall/precision calculations. Every uppercase symbol in the parse is a constituent label.

(b) Should people be impressed by Whizdee's numbers? Explain why or why not.

(c) Whizdee's decided you're so smart that they, too, want to pay you as a consultant. They've got a question-answering search engine that uses a parser, and a contract from the New York Times. They paid minimum wage to impoverished linguistics grad students in New York City to create parse trees for 20,000 New York Times articles written between March 15, 1997 and March 15 of a later year. Whizdee plans to make their system available to the public starting April 1, but the New York Times insists on a formal parser evaluation first. One of Whizdee's scientists says that they should evaluate their parser by doing 10-fold cross validation. Another of their scientists says that they should evaluate it by training on the data up to March 15, 2006 and testing on the rest. Explain what the two competing evaluation approaches are and discuss the advantages and disadvantages of each approach.

Solution

(a) Labeled precision and labeled recall for parse trees are computed by comparing the spans covered by the constituents along with the labels. For example, both P and T contain NP:0-2 (a noun phrase spanning positions 0-2, where 0 is the position just before the and 2 is the position just after prof). So this constituent would be a true positive. VP:2-7 in P would be a false positive: it matches the label of the VP in T, but not its span (the true VP goes from 2 to 6). In this example, there are 4 true positives (S, two NPs, and ADVP), and each of P and T contains 6 constituents, so precision and recall both work out to be 4/6 = 2/3.
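A small sketch (not in the original solution) of the labeled precision/recall computation from Solution (a): it extracts labeled spans from the two bracketings above and compares them. The simple bracket parser is an assumption written just for this format.

```python
# Illustrative computation of labeled precision/recall for the trees in part (a).

T = "(S (NP the prof) (VP (VPRT looked up) (NP the grade)) (ADVP today))"
P = "(S (NP the prof) (VP looked (PP up (NP the grade)) (ADVP today)))"

def labeled_spans(bracketing):
    """Return the set of (label, start, end) constituents in a bracketed string."""
    spans, stack, pos = set(), [], 0
    for token in bracketing.replace("(", " ( ").replace(")", " ) ").split():
        if token == "(":
            stack.append(None)              # placeholder; the next token is the label
        elif token == ")":
            label, start = stack.pop()
            spans.add((label, start, pos))
        elif stack and stack[-1] is None:
            stack[-1] = (token, pos)        # constituent label and its start position
        else:
            pos += 1                        # a word: advance the span position
    return spans

true_spans, sys_spans = labeled_spans(T), labeled_spans(P)
tp = len(true_spans & sys_spans)
print("precision =", tp, "/", len(sys_spans), "=", round(tp / len(sys_spans), 3))
print("recall    =", tp, "/", len(true_spans), "=", round(tp / len(true_spans), 3))
# Both come out to 4/6 = 0.667, matching the solution.
```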
(b) They did two experiments and both were reported incompletely. My intent in the question was for you to observe that for most tasks, precision and recall, measured independently, can be traded off against each other. For example, a high precision result can be obtained trivially by labeling only constituents that are very unlikely to be wrong; as an extreme case, this gives parse trees that contain nothing but an S covering the whole sentence. (Look, ma, 100% precision!) So I was ok with responses encouraging skepticism on these grounds (and, say, suggesting F-measure as a harder-to-game combined measure).

On the other hand, if you think about it, a falsely high recall result for parse trees is much harder to obtain, especially for labeled constituents: since you have to choose a unique label per constituent and you also need a valid parse tree, you can't just throw in all possible constituents. (For example, in P, the preposition up could not be part of both a VPRT looked up and the PP up the grade.) So I'd also give full credit (and a pat on the back) for the argument that high labeled recall could be a credible indication that the parser is doing a good job.

(c) Both approaches properly separate training and test data. Ten-fold cross validation divides the full data set into folds f_1, ..., f_10 and performs ten training-test splits: each fold f_i is treated as test data, using {f_j : j ≠ i} as training data. The ten evaluation results are then averaged to give you a single figure of merit. This has the advantage of providing a better estimate of performance, since you can compute both the mean and the standard deviation. It has the disadvantage of being a bit more costly and complicated and, more important, it's not very well suited to the stated task. The parser is going to need to perform well on news data generated after the parser is trained (taking into account new people, places, etc., as well as new expressions in the language as they are introduced), and the cross-validation strategy doesn't take chronological ordering into account.

The alternative here could be described as a chronological train/test split. The advantage is that it is a better match for the task: unless there's going to be constant retraining (requiring constant treebanking), the application will be using a parser that was trained on older data and run on chronologically newer data. The main disadvantage is that a single train/test split gives you less information than cross validation, since you only have the one data point.
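To make the two evaluation designs concrete, here is a small sketch (not part of the original solution) of how the splits differ; the article list and the scoring function are placeholders.

```python
# Illustrative comparison of 10-fold cross-validation vs. a chronological split.
# `articles` is assumed to be sorted by publication date, and `evaluate` is a
# stand-in for training a parser on one set and scoring it on another.

def ten_fold_splits(articles, k=10):
    """Yield (train, test) pairs, one per fold; chronology is ignored."""
    folds = [articles[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [a for j, f in enumerate(folds) if j != i for a in f]
        yield train, test

def chronological_split(articles, cutoff_fraction=0.9):
    """Train on the earliest articles, test on the most recent ones."""
    cutoff = int(len(articles) * cutoff_fraction)
    return articles[:cutoff], articles[cutoff:]

def evaluate(train, test):
    """Placeholder: would train a parser on `train` and return labeled F1 on `test`."""
    return 0.0

articles = [f"article_{i:05d}" for i in range(20000)]   # assumed sorted by date

cv_scores = [evaluate(tr, te) for tr, te in ten_fold_splits(articles)]
print("cross-validation: mean of", len(cv_scores), "scores (plus a standard deviation)")

train, test = chronological_split(articles)
print("chronological: one score, training on", len(train), "and testing on", len(test))
```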


More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the

More information

Conditional Probability, Independence and Bayes Theorem Class 3, Jeremy Orloff and Jonathan Bloom

Conditional Probability, Independence and Bayes Theorem Class 3, Jeremy Orloff and Jonathan Bloom Conditional Probability, Independence and Bayes Theorem Class 3, 18.05 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Know the definitions of conditional probability and independence of events. 2.

More information

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction 15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information

NaturalLanguageProcessing-Lecture08

NaturalLanguageProcessing-Lecture08 NaturalLanguageProcessing-Lecture08 Instructor (Christopher Manning):Hi. Welcome back to CS224n. Okay. So today, what I m gonna do is keep on going with the stuff that I started on last time, going through

More information

LING 473: Day 10. START THE RECORDING Coding for Probability Hidden Markov Models Formal Grammars

LING 473: Day 10. START THE RECORDING Coding for Probability Hidden Markov Models Formal Grammars LING 473: Day 10 START THE RECORDING Coding for Probability Hidden Markov Models Formal Grammars 1 Issues with Projects 1. *.sh files must have #!/bin/sh at the top (to run on Condor) 2. If run.sh is supposed

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

More information

Log-Linear Models, MEMMs, and CRFs

Log-Linear Models, MEMMs, and CRFs Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx

More information

Recap: Lexicalized PCFGs (Fall 2007): Lecture 5 Parsing and Syntax III. Recap: Charniak s Model. Recap: Adding Head Words/Tags to Trees

Recap: Lexicalized PCFGs (Fall 2007): Lecture 5 Parsing and Syntax III. Recap: Charniak s Model. Recap: Adding Head Words/Tags to Trees Recap: Lexicalized PCFGs We now need to estimate rule probabilities such as P rob(s(questioned,vt) NP(lawyer,NN) VP(questioned,Vt) S(questioned,Vt)) 6.864 (Fall 2007): Lecture 5 Parsing and Syntax III

More information

Parsing with Context-Free Grammars

Parsing with Context-Free Grammars Parsing with Context-Free Grammars Berlin Chen 2005 References: 1. Natural Language Understanding, chapter 3 (3.1~3.4, 3.6) 2. Speech and Language Processing, chapters 9, 10 NLP-Berlin Chen 1 Grammars

More information

Ratios, Proportions, Unit Conversions, and the Factor-Label Method

Ratios, Proportions, Unit Conversions, and the Factor-Label Method Ratios, Proportions, Unit Conversions, and the Factor-Label Method Math 0, Littlefield I don t know why, but presentations about ratios and proportions are often confused and fragmented. The one in your

More information

Text Mining. March 3, March 3, / 49

Text Mining. March 3, March 3, / 49 Text Mining March 3, 2017 March 3, 2017 1 / 49 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, 2017 2 / 49

More information

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:

More information

P (E) = P (A 1 )P (A 2 )... P (A n ).

P (E) = P (A 1 )P (A 2 )... P (A n ). Lecture 9: Conditional probability II: breaking complex events into smaller events, methods to solve probability problems, Bayes rule, law of total probability, Bayes theorem Discrete Structures II (Summer

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information