Ling/CMSC 773 Take-Home Midterm Spring 2008


Solution set. This document is a solution set. It will only be available to students temporarily. It may not be kept, nor shown to anyone outside the current class at any time. Typical correct answers are given along with some guidance as to point values for grading; notwithstanding that explicit guidance, all grading is discretionary and based on the level of understanding and competence explicitly demonstrated in answers.

1 Short Answer Questions [30 points total]

Answer each question clearly and briefly. (Even if you make sure to be pretty thorough, which you should in order to obtain partial credit, these are short answers, not long essays.)

(a) Give a constituency tree for Many students said they enjoyed the interesting exam (using the usual sort of notation from class, please, nothing linguistically fancy). Give a head table for the context-free rules needed to construct the tree, i.e. context-free rules annotated to show which constituent is the head. Then show explicitly, step by step, how to convert your constituency tree into an unlabeled dependency tree.

Answer: Tree and head table should be reasonable, and students must show or describe explicitly the intermediate steps of (a) propagating the heads up the tree, and (b) using the propagated heads to extract the dependency relations. (Guidance: 2 points for each of those two criteria, otherwise discretionary.)
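Not part of the original solution: a minimal Python sketch of the two conversion steps in (a), head propagation followed by dependency extraction. The specific tree, head table, and helper names are illustrative assumptions, not the expected exam answer.

```python
# Illustrative sketch of constituency-to-dependency conversion via head propagation.
# The tree, head table, and function names are assumptions for demonstration only.

TREE = ("S",
        ("NP", ("JJ", "Many"), ("NNS", "students")),
        ("VP", ("VBD", "said"),
               ("SBAR", ("S",
                   ("NP", ("PRP", "they")),
                   ("VP", ("VBD", "enjoyed"),
                          ("NP", ("DT", "the"), ("JJ", "interesting"), ("NN", "exam")))))))

# Head table: for each parent label, the index of the head child in its rule.
HEAD_CHILD = {"S": 1, "NP": -1, "VP": 0, "SBAR": 0}

def lexical_head(node):
    """Propagate heads bottom-up: return the head word of a subtree."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]                      # preterminal: the word itself is the head
    return lexical_head(children[HEAD_CHILD[label]])

def dependencies(node, deps=None):
    """Extract (head, dependent) arcs: each non-head child's head depends on the parent's head."""
    if deps is None:
        deps = []
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return deps
    h = lexical_head(node)
    for child in children:
        ch = lexical_head(child)
        if ch != h:
            deps.append((h, ch))                # parent's head governs the non-head child's head
        dependencies(child, deps)
    return deps

print(dependencies(TREE))
# -> [('said', 'students'), ('students', 'Many'), ('said', 'enjoyed'), ...]
```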
(b) For this question, assume data sparseness is not a problem, i.e. assume all probabilities can be estimated accurately. Consider the difficulty that a simple probabilistic CFG has in distinguishing the likelihood of the sentences people eat roasted peanuts and peanuts eat roasted people. Briefly describe one solution to this problem and its advantages and disadvantages. Now consider the sentence Children sometimes munch chocolate soldiers on Christmas morning. With the solution you just described, is the probability of this sentence likely to have a value that accords with your intuition? Explain why or why not.

Answer: Recall that the difficulty in this example is that the two sentences have the same probability, since the expansions from N → peanuts and from N → people take place without regard to where they are in the tree. An obvious solution to pick is fully lexicalizing the grammar, i.e. propagating terminal symbols up into the nonterminals so that you'd have rules like VP[eat] → V[eat] NP[people]. This has the advantage of making lexically based co-occurrence probabilities relevant in the grammar, e.g. the above rule would be much less likely than VP[eat] → V[eat] NP[peanuts], so the two sentences' trees would have different probabilities. However, chocolate soldier is still a problem NP for that solution: the problem would show up in the probability for VP[munch] → V[munch] NP[soldiers], which would be low assuming any reasonably realistic training set. (The corpus is a proxy for knowledge about what happens in the world, and generally soldiers don't get munched.) Chocolate soldiers behave differently from real soldiers, both linguistically and in the world. The linguistically interesting aspect of this example is that it violates the assumption, which holds true pretty generally, that the distribution of a phrase is heavily dominated by the head of the phrase.

(c) Suppose you have an unfair coin that comes up heads 7/8 of the time when you flip it. Suppose you flip the coin 1000 times and report the outcome each time, paying one penny per bit for your communication. (In cash, due before you get to send the message.) If you're thrifty (you want to pay as little as possible) and smart (you've taken this class), what's the lowest amount you could expect to get away with paying? Justify your answer.

Answer: If we were communicating the results of 1000 trials all at once, we could choose a code where heads gets a shorter representation than tails. In that case, the entropy of the distribution provides a lower bound for the average message length; that's H(1/8, 7/8) = -(1/8 * log2(1/8) + 7/8 * log2(7/8)) ≈ 0.54 bits, so the total payment would have a lower bound of $5.40. However, the problem says you've got to communicate the outcome each time. Unless you've proposed some way to communicate less than a single bit of information in a message (none that I'm aware of), there's no way to encode heads in less than a bit. This means that the best you can do is send 1 for heads and 0 for tails (or vice versa), for a total best cost of $10.00. (Grading guidance: 3/5 for using the entropy to get $5.40. Another point for recognizing the fraction-of-a-bit problem but getting something other than $10.00.)
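A quick numerical check of the entropy bound in (c); this snippet is illustrative and not part of the original solution.

```python
import math

# Entropy of the biased coin, in bits per flip.
p_heads = 7 / 8
H = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))
print(f"H(1/8, 7/8) = {H:.4f} bits per flip")               # about 0.5436

# Lower bound if all 1000 outcomes could be encoded jointly, at one penny per bit.
print(f"block-coding lower bound: ${1000 * H / 100:.2f}")    # about $5.44 (the solution rounds H to 0.54, giving $5.40)

# Reporting each outcome separately forces at least one bit per flip.
print(f"one-message-per-flip cost: ${1000 * 1 / 100:.2f}")   # $10.00
```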

(d) A former Amazon employee once said: "We sold more books today that didn't sell at all yesterday than we sold today of all the books that did sell yesterday." (To state this in plainer English, "If you look at our infrequently-sold books, the total number we sell (let's call it N) is very large, even if each individual book is only bought by a few people. And if you look at N, it's bigger than the total sales of the frequently sold books!") George Zipf would be unsurprised. Briefly explain why.

Answer: This quote was taken from The Long Tail at Wikipedia. The Long Tail is a property of many distributions, including Zipf's. From Wikipedia: "In these distributions a high-frequency or high-amplitude population is followed by a low-frequency or low-amplitude population which gradually tails off. In many cases the infrequent or low-amplitude events... can cumulatively outnumber or outweigh the initial portion... such that in aggregate they comprise the majority." (Guidance: Some reference to the long tail property of Zipfian distributions must be made.)

(e) Briefly explain/illustrate how a measure of uncertainty can also be considered a measure of surprise and a measure of quantity of information.

Answer: This can be illustrated by considering a scenario in which outcomes are communicated between parties who share knowledge of the probability distribution over those outcomes, e.g. the result of a horse race (if one imagines the same horses racing over and over again, with a fixed probability distribution for which horse will win). If the outcome is completely certain, then uncertainty is 0, and so is the quantity of information that needs to be communicated, and equally so the recipient's level of surprise. If the result is maximally uncertain, i.e. a uniform distribution, then the recipient has absolutely no information in advance, so one could not be any more surprised by an outcome, and the quantity of information communicated is at its maximum. (Guidance: All three characterizations must be mentioned.)

(f) Here's an argument: Consider the sentences (i) Hold the newsreader's nose squarely, waiter, or friendly milk will countermand my trousers, and (ii) Countermand friendly, hold milk my, newsreader's nose or squarely the trousers waiter will. Neither (it's fair to assume) has ever occurred in the prior experience of a speaker of English, and yet any such speaker would readily identify the former as grammatical and the latter as ungrammatical. Therefore one cannot associate the notion "grammaticality of a sentence in English" with the notion "likelihood of a sentence in English." Briefly but convincingly demolish this argument.

Answer: The student should be able to generalize from the argument involving colorless green ideas sleep furiously to this example. The argument is done nicely in Section 4.2 of the Abney (1996) reading, and in the (optional) Pereira (2000) reading, Pereira shows that using the aggregate bigram model (Question 2!), the grammatical and ungrammatical versions of the sentence differ in probability by five orders of magnitude. Applying those arguments would be nice, but I'm ok with any reasonable discussion showing that the student knows maximum likelihood estimation of a bigram model is a naïve way to assign probabilities to sentences (i) and (ii), and that better correlations between probability and grammaticality intuitions can be obtained with more sophisticated probabilistic modeling (e.g. good smoothing, hidden classes, backoff to part-of-speech tags).

2 EM [20 points]

Consider the following hidden model variation on a bigram model for word sequences. As in the usual bigram model, we express the probability of an entire sequence w_1 w_2 ... w_T by

(1)  p(w_1 w_2 \ldots w_T) = p(w_1) \prod_{t=2}^{T} p(w_t \mid w_{t-1}).

However, the parameters used in the product are defined as follows:

(2)  p(w_t \mid w_{t-1}) = \sum_{c=1}^{C} p(w_t \mid c)\, p(c \mid w_{t-1})

In plain English, the generative story for this model is the following. Instead of generating the next word w_t based on the previous word w_{t-1}, as in a usual bigram model, we generate a class c based on w_{t-1}, and then we generate w_t based on c.[1] So the probability of choosing the next word w_t is a sum of probabilities, one for each hidden class. The probability contributed to the choice of w_t for each class c, namely p(w_t | c) p(c | w_{t-1}), represents the joint probability of two events: picking c based on w_{t-1}, and then picking w_t based on c. Since c is hidden, we have to sum up over the different possibilities. Intuitively, the hidden class c can be viewed as capturing the general properties of w_{t-1} that are relevant for generating the next word. Or you can think of going through hidden class c to get from w_{t-1} to w_t.

As you can see, there are two sets of parameters in this model. The first set is the word-to-class probabilities p(c | w_i), and the second set is the class-to-word probabilities p(w_j | c). (Where both w_i and w_j range over the entire vocabulary, and c ranges over the set of C hidden classes.) Because of the way the model is structured, there's an EM algorithm for estimating these parameters that is much simpler than the Forward-Backward algorithm. In particular, there's no need at all for dynamic programming.[2] For the sake of consistent notation, please use N(x, y) as your notation for counts. E.g. N(w_i, w_j) would be the number of times w_i is followed by w_j.

(a) (2 points) In addition to capturing generalizations (by associating words with abstract classes that are learned automatically), an advantage of this model is that in practice it has far fewer parameters than a usual (non-hidden) bigram model. Let the size of the vocabulary be V and the number of classes be C. How many classes can you use in the hidden model before the total number of parameters exceeds the number of parameters in the usual (non-hidden) bigram model? What does the comparison between the two models' total parameter counts look like for typical values of C = 32 and V = 50000?

Answer: A bigram model has V^2 parameters (or more properly V(V+1) if you include the parameters for starting the string with w_i for each w_i, but O(V^2) in any case, so I won't nitpick on that point). The hidden model has O(2CV) parameters. Minus nitpicking, the latter reaches the former when C exceeds V/2. The bigram model's count is many orders of magnitude larger in the usual circumstances: 2.5 × 10^9 versus 3.2 × 10^6. (Guidance: require explicit mentions of V^2 and 2CV, and a correct explicit comparison using typical values.)

(b) (2 points) Suppose the classes c were observable rather than hidden. Express the maximum likelihood estimate for the probability p(w_j | c) in terms of counts.

Answer: p(w_j \mid c) = \frac{N(w_j, c)}{\sum_{w_j} N(w_j, c)}. Expressing the denominator as N(c) is ok.

[1] Also, as usual, we can assume w_1 is always a special start word that always starts an observed sequence with probability 1.

[2] Intuitively, this is because the choice of the hidden class at every step depends only on what's observable, i.e.
hidden classes are independent of each other given the intervening word.

(c) (2 points) Suppose the classes c were observable rather than hidden. Express the maximum likelihood estimate for the probability p(c | w_i) in terms of counts.

Answer: p(c \mid w_i) = \frac{N(c, w_i)}{\sum_c N(c, w_i)}. Expressing the denominator as N(w_i) is ok.

(d) (7 points) Recall that for many cases of EM algorithms, the basic structure of the algorithm can be described as follows:

1. Set initial values for parameters \mu.
2. E-step: Figure out expected counts of relevant events, where those events typically involve both observable and hidden values, using the current parameters \mu to determine what's expected.
3. M-step: Use the expected counts to compute \mu_new, a new set of parameters. In the EM algorithms we've studied, this is a maximum likelihood estimate, i.e. just normalizing (expected) counts.
4. If the probabilities have converged (or if we've done some maximum number of iterations), stop; otherwise let \mu = \mu_new and go back to the E-step for the next iteration.

If you read about the EM algorithm for this model, the updating of the parameters is described as having the following M-step for the two sets of parameters:

(3)  p_{new}(c \mid w_i) = \frac{\sum_{w_j} N(w_i, w_j)\, p(c \mid w_i, w_j)}{\sum_c \sum_{w_j} N(w_i, w_j)\, p(c \mid w_i, w_j)}

(4)  p_{new}(w_j \mid c) = \frac{\sum_{w_i} N(w_i, w_j)\, p(c \mid w_i, w_j)}{\sum_{w_j} \sum_{w_i} N(w_i, w_j)\, p(c \mid w_i, w_j)}

The E-step is described as:

(5)  p(c \mid w_i, w_j) = \frac{p(w_j \mid c)\, p(c \mid w_i)}{\sum_{c'} p(w_j \mid c')\, p(c' \mid w_i)}

In words, p(c | w_i, w_j) can be thought of as the probability that c was the hidden state that got used when going from w_i to w_j. Explain why equation 4 is the way to calculate the new value for p_new(w_j | c) given the previous guesses for the parameters. Partial credit will be assigned for good explanations, but a perfect answer will show via equations (along with written explanation) how to derive equation 4 using the non-hidden maximum likelihood estimate (part b) as the starting point. (I.e. the form of the explanation will follow the same general schema we used for deriving the update for HMM transition probabilities a_{ij}, starting with the maximum likelihood estimate you'd use if all the state transitions were visible. Though, as was mentioned, no dynamic program is needed.)

Answer: What's most important here is recognizing that although you want to assign probabilities involving c, you can only observe transitions from w_i to w_j; therefore the key step is distributing the credit for every such transition to all the c's that transition could have gone through. Although p(c | w_i, w_j) was given to you, it can be derived as follows by taking observable maximum likelihood estimation as the starting point. If we could observe the c, we would compute

(6)  p(c \mid w_i, w_j) = \frac{N(w_i, c, w_j)}{\sum_c N(w_i, c, w_j)}.

Since we can't, we provide an estimate in terms of expectations using the current probability estimates:

(7)  p(c \mid w_i, w_j) = \frac{E[N(w_i, c, w_j)]}{E[\sum_c N(w_i, c, w_j)]}.

Now, there are N(w_i, c, w_j) transitions from w_i to w_j that go through c. This count can be seen as the total number of opportunities (transitions from w_i to w_j) times the expected probability of that opportunity going through c. According to the model, that probability would be Pr(c | w_i) Pr(w_j | c). This gives us

(8)  p(c \mid w_i, w_j) = \frac{N(w_i, w_j)\, p(w_j \mid c)\, p(c \mid w_i)}{\sum_{c'} N(w_i, w_j)\, p(w_j \mid c')\, p(c' \mid w_i)}.

We can pull N(w_i, w_j) outside the sum in the denominator, and then cancel with the same value in the numerator, yielding the expression for p(c | w_i, w_j) given above.

Now we are in a position to derive p_new(w_j | c). The logic is quite similar. We'd like

p_{new}(w_j \mid c) = \frac{N(w_j, c)}{\sum_{w_j} N(w_j, c)}

but the c are unobserved, so we need to use expectations based on the current probabilities, which we can express as

p_{new}(w_j \mid c) = \frac{E[N(w_j, c)]}{E[\sum_{w_j} N(w_j, c)]}.

Considering the numerator, the number of times we go to w_j, and do so via c, is a sum of counts over all the w_i we could have come from, i.e. \sum_{w_i} N(w_i, c, w_j). We can now play the usual trick of expressing that total count in terms of the number of observed opportunities to go through c (which is the number of times we go from w_i to w_j, i.e. N(w_i, w_j)) and the probability of such an opportunity actually going through c, which is p(c | w_i, w_j). Thus, we have reduced the problem of the expected value for the numerator to the problem of figuring out p(c | w_i, w_j). But that's solved: Eq. (5) shows how to compute that value in terms of the data and the current model's estimates for p(c | w_i) and p(w_j | c). The denominator follows trivially.

(e) (7 points) Same as part (d), but explain why equation 3 is the way to update p_new(c | w_i).

Answer: The logic is identical to part (d). Moving from observable counts to expectations, you get

p_{new}(c \mid w_i) = \frac{E[N(c, w_i)]}{E[\sum_c N(c, w_i)]}.

Sum N(w_i, c, w_j) over w_j to get the joint count you want in the numerator, so now the problem is again reduced to estimating N(w_i, c, w_j) in terms of the observed N(w_i, w_j) and the current model probability estimates, same as above. Since we're conditioning on w_i, the denominator must sum the numerator over classes c.

For those who have ever looked more formally at EM: you will not receive credit for deriving the EM updates from the model's likelihood function, using Lagrange multipliers to optimize, etc. I don't want to see a Q function. I'm looking for an understanding of the updates in relation to the non-hidden maximum likelihood estimate, as was our focus in discussions of the Forward-Backward and Inside-Outside algorithms.
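The following is a minimal numpy sketch (not part of the original solution) of EM iterations for this aggregate bigram model, using Eq. (5) for the E-step and Eqs. (3)-(4) for the M-step. The toy sizes, random initialization, and variable names are assumptions.

```python
import numpy as np

# Toy sizes; in practice V is the vocabulary size and C the number of hidden classes.
V, C = 6, 2
rng = np.random.default_rng(0)

# N[i, j] = number of times word i is followed by word j (observed bigram counts).
N = rng.integers(1, 6, size=(V, V)).astype(float)

# Parameters: p_c_given_w[i, c] = p(c | w_i);  p_w_given_c[c, j] = p(w_j | c).
p_c_given_w = rng.random((V, C))
p_c_given_w /= p_c_given_w.sum(axis=1, keepdims=True)
p_w_given_c = rng.random((C, V))
p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True)

for iteration in range(20):
    # E-step, Eq. (5): posterior[i, j, c] = p(c | w_i, w_j)
    joint = p_c_given_w[:, None, :] * p_w_given_c.T[None, :, :]   # p(c|w_i) * p(w_j|c)
    posterior = joint / joint.sum(axis=2, keepdims=True)

    # Expected counts: E[N(w_i, c, w_j)] = N(w_i, w_j) * p(c | w_i, w_j)
    expected = N[:, :, None] * posterior

    # M-step, Eq. (3): p_new(c | w_i), normalizing over classes for each w_i.
    num_c = expected.sum(axis=1)                      # sum over w_j -> shape (V, C)
    p_c_given_w = num_c / num_c.sum(axis=1, keepdims=True)

    # M-step, Eq. (4): p_new(w_j | c), normalizing over words for each class c.
    num_w = expected.sum(axis=0).T                    # sum over w_i -> shape (C, V)
    p_w_given_c = num_w / num_w.sum(axis=1, keepdims=True)

print(np.round(p_c_given_w, 3))
print(np.round(p_w_given_c, 3))
```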

3 N-gram models [15 points]

Consider the following problem, known as language identification: given a previously unseen string of natural language text, what language is it written in? This is often solved using n-gram models. Assume you have training samples containing 20,000 words of running text in each of k languages, {L_1, ..., L_k}. Furthermore, to keep things simple, assume that all of these languages are alphabetic (i.e. not ideographic like many Asian languages), that all data is in Unicode, and that all k languages use significantly overlapping alphabets (e.g. you could suppose they all use a roman alphabet with accents/diacritics). If you need to for concreteness, you can assume the genre is newswire, and that there is no markup. Now, imagine that you are asked to design a solution to the language identification problem based on n-grams.

(a) (5 points) Would you use n-grams composed of words or characters? Justify your answer in quantitative terms, clearly stating any assumptions you make (e.g. about sizes of vocabularies, sizes of alphabets, or any other relevant quantities).

Answer: I expect characters, because of data sparseness; any argument for words would have to be quite innovative and strong. Answers should involve comparisons of V^n parameters for alphabets (typically, say, 50) versus vocabularies (typically at least in the tens of thousands). (Guidance: -2 points if no explicit discussion of what n would be in this scenario: n matters, since sufficiently wide character n-grams can be just as sparse as small word n-grams.)

(b) (5 points) Explain in detail how, given a new piece of text in one of the k languages, you would take a Bayesian approach to identifying which language L_i it is written in. You can assume that all k languages are a priori equally likely.

Answer: If O is the new observed text, the goal is to choose L_i maximizing p(L_i | O). Using Bayes' Rule, the assumption of uniform priors, and the fact that p(O) is constant, this can be shown to be equivalent to maximizing p(O | L_i) or, to write it another way, p_i(O), where p_i is a model trained on a sample of L_i. (Guidance: 2 points for just computing the probabilities p_i(s) for string s and picking the i with the highest probability, without Bayes. Evaluating perplexity/cross entropy rather than likelihood is also ok.)

(c) (5 points) Suppose you have read about two different kinds of n-gram model (e.g. a bigram model and a trigram model, or a standard bigram model and the aggregate bigram model of the previous question) and you want to know which kind performs better on this problem. Under the assumption that you have only the small amount of data in each language as described above, explain in detail how you would conduct an evaluation to assess which of the two kinds of models is better for this problem.

Answer: Minimally, any evaluation must have a clear distinction between training and test data. (Guidance: deduct 4 points if absent.) In this scenario, one would use m-fold cross-validation because there is so little data available. (Guidance: deduct 2 points if absent.)
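A minimal sketch (not part of the original solution) of the Bayesian character-bigram approach from parts (a) and (b), with add-one smoothing; the tiny training samples and helper names are placeholders.

```python
import math
from collections import Counter

def char_bigrams(text):
    """Yield character bigrams, padding with a start symbol."""
    padded = "^" + text
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(sample):
    """Collect add-one-smoothed character-bigram statistics from a training sample."""
    bigrams = Counter(char_bigrams(sample))
    contexts = Counter(bg[0] for bg in char_bigrams(sample))
    alphabet_size = len(set(sample) | {"^"})
    return bigrams, contexts, alphabet_size

def log_likelihood(model, text):
    """log p(text | L_i) under the smoothed character bigram model for language L_i."""
    bigrams, contexts, A = model
    return sum(math.log((bigrams[bg] + 1) / (contexts[bg[0]] + A))
               for bg in char_bigrams(text))

# Placeholder training samples; in the exam scenario these would be 20,000 words each.
SAMPLES = {"english": "the cat sat on the mat and the dog ran",
           "french":  "le chat est sur le tapis et le chien court"}
MODELS = {lang: train(text) for lang, text in SAMPLES.items()}

def identify(text):
    """With uniform priors, argmax_i p(L_i | O) reduces to argmax_i p(O | L_i)."""
    return max(MODELS, key=lambda lang: log_likelihood(MODELS[lang], text))

print(identify("le chien"))   # expected: french
```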

4 Relative Entropy [20 points]

Selectional preferences are semantic constraints that a predicate places on its arguments. For example, the verb drink prefers that its objects be in the semantic class beverage, which is why John drank the wine sounds a lot better than, say, John drank the toaster. In the early 1990s, the following was proposed as a probabilistic model of selectional preferences:

S_{object}(v) = D(p(c \mid v) \,\|\, p(c)) = \sum_{c \in \text{Classes}} \Pr(\text{object is in class } c \mid \text{verb is } v) \log \frac{\Pr(\text{object is in class } c \mid \text{verb is } v)}{\Pr(\text{object is in class } c)}

A_{object}(v, c) = \Pr(\text{object is in class } c \mid \text{verb is } v) \log \frac{\Pr(\text{object is in class } c \mid \text{verb is } v)}{\Pr(\text{object is in class } c)}

The value S_object(v) was referred to as the selectional preference strength of verb v, and the value A_object(v, c) was called the selectional association between the verb v and a particular semantic category c.

(a) Assuming probabilities are estimated accurately (for example, pretend the classes are observable), and assuming that verb-object frequencies in the corpus are a reasonably accurate reflection of the real world, would the model correctly predict that A_object(drink, beverage) should be higher than A_object(drink, appliance)? Explain why or why not.

(b) Below, the graph on the left shows A_object(v_1, c) and the graph on the right shows A_object(v_2, c), for two verbs v_1 and v_2. (The x-axis ranges over possible semantic categories c, and each bar gives the value of A_object(v, c).) As you can see, one of the verbs is eat and the other one is find. Even without the labels, you would have been able to identify which was which, based on what you know about the model (and the assumptions we've made). Explain how, with explicit reference to the formal definition of the model.

(c) The person who proposed this model claimed that S_object(v) models, quantitatively, how much information the verb carries about the semantic category of its direct object. Explain why this claim is true.

Answer:

(a) Assuming a corpus where probabilities are estimated accurately and reflect the way things work in the real world, the prediction would be correct. Consider the component probabilities in the definition of A_object. Given that the verb is drink, the probability of a beverage as direct object is high; but the marginal probability of beverages as direct objects is low (not likely across all verbs). So the ratio will have a high value, as will the value it's being multiplied by. In contrast, for appliance direct objects, the conditional probability given the verb drink is low, so the ratio is going to have a smallish value (even if the marginal probability of appliance is also low). Notice how similar this is to the homework problem where you were asked to look at the way certain terms did and did not contribute to the relative entropy between two probability distributions.

(b) Verb find is on the left and eat is on the right. Since find is not particularly selective about the semantic class of its direct object (you can find happiness, a job, a planet, your keys, your long lost uncle, etc.), no one semantic category is likely to particularly dominate the conditional probability distribution Pr(object is in class c | verb is v), which is consistent with the flat, relatively undifferentiated profile on the left. (Notice that if c is relatively independent of v, we expect Pr(c | v) ≈ Pr(c), which implies Pr(c | v) log [Pr(c | v)/Pr(c)] ≈ Pr(c | v) log 1 = 0.) In contrast, the verb eat is highly selective about the semantic category of its direct objects, which is reflected in the spiky profile on the right hand side. (But note that neither of these profiles represents class frequency, which should be obvious since there are negative values for some classes.) As an aside, notice that although both profiles have some negative values, the magnitudes of the negative values are very small relative to the magnitudes of the positive values. This, too, relates to the homework problems that involved relative entropy. It also provides a visual, intuitive explanation for the somewhat counterintuitive fact that D(p || q) can't ever be negative.

(c) The truth of the claim follows from the definition of S_object as the relative entropy (K-L divergence) between two particular distributions. Probability Pr(object is in class c | verb is v) is the probability of the class given the verb. Probability Pr(object is in class c) can be seen as the probability of a particular semantic class in direct object position in the absence of information about the specific verb. Now, recall that D(p || q) can be interpreted as the number of bits wasted, on average, by encoding events using distribution q as the basis for the encoding rather than the true distribution p. In this case, we are comparing the true distribution, when you know what the verb is, with a less informed model where you ignore the verb. Therefore the relative entropy can be interpreted as the number of extra bits you need on average to identify c if you don't know v. Conversely, it's the information you gain about semantic class c if you choose to use the distribution where you do know what the verb is. Therefore it's fair to say that S_object measures the quantity of information the verb carries about the semantic category of the direct object.

Incidentally, this model was first proposed in Resnik (1993), Selection and Information: A Class-based Approach to Lexical Relationships, Ph.D. dissertation, Computer and Information Science, University of Pennsylvania.
(Guidance: most credit for the right general argument, but deduct if information is not characterized in the sense of information theory or if K-L divergence is not included in the discussion.)
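As an illustration of the definitions above (not part of the original solution), here is a small computation of S_object(v) and A_object(v, c) from hypothetical, made-up distributions over a handful of classes.

```python
import math

# Hypothetical distributions over a few semantic classes (made up for illustration).
# p_c: marginal probability that a direct object is in class c, across all verbs.
p_c = {"beverage": 0.05, "food": 0.20, "appliance": 0.05, "abstract": 0.70}

# p_c_given_drink: probability of each class given that the verb is "drink".
p_c_given_drink = {"beverage": 0.80, "food": 0.10, "appliance": 0.01, "abstract": 0.09}

def selectional_association(p_cond, p_marg, c):
    """A_object(v, c) = Pr(c | v) * log2( Pr(c | v) / Pr(c) )."""
    return p_cond[c] * math.log2(p_cond[c] / p_marg[c])

def selectional_preference_strength(p_cond, p_marg):
    """S_object(v) = D( p(c | v) || p(c) ), i.e. the sum of A_object(v, c) over classes."""
    return sum(selectional_association(p_cond, p_marg, c) for c in p_cond)

for c in p_c:
    print(f"A(drink, {c}) = {selectional_association(p_c_given_drink, p_c, c):+.3f}")
print(f"S(drink) = {selectional_preference_strength(p_c_given_drink, p_c):.3f}")
# Individual A values can be negative, but S (a K-L divergence) is never negative.
```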

5 Parsing and Evaluation [15 points]

You have been hired by AwesomeSearchEngine.com to make recommendations concerning language technology. They have been approached by Whizdee, Inc., a startup company, and in a sales presentation Whizdee's sales representative says the following: "We're doing very exciting work on parsing, and our results are very impressive. In one experiment, we trained one of our parsers on 80% of the Penn Treebank and tested on the other 20%, and its labeled recall on constituent boundaries was 98.2%. We also did an experiment, training on the same data, where the labeled precision for constituent boundaries was 97.6%. With numbers like that, how could you lose?"

(a) Consider the true parse T, and the system parse P, below:

[T] (S (NP the prof) (VP (VPRT looked up) (NP the grade)) (ADVP today))
[P] (S (NP the prof) (VP looked (PP up (NP the grade)) (ADVP today)))

What are the values for labeled precision and labeled recall? Note that I've omitted all part of speech labels because they're not used in constituent recall/precision calculations. Every uppercase symbol in the parse is a constituent label.

(b) Should people be impressed by Whizdee's numbers? Explain why or why not.

(c) Whizdee's decided you're so smart that they, too, want to pay you as a consultant. They've got a question-answering search engine that uses a parser, and a contract from the New York Times. They paid minimum wage to impoverished linguistics grad students in New York City to create parse trees for 20,000 New York Times articles written between March 15, 1997 and March 15 of a later year. Whizdee plans to make their system available to the public starting April 1, but the New York Times insists on a formal parser evaluation first. One of Whizdee's scientists says that they should evaluate their parser by doing 10-fold cross validation. Another of their scientists says that they should evaluate it by training on the data up to March 15, 2006 and testing on the rest. Explain what the two competing evaluation approaches are and discuss the advantages and disadvantages of each approach.

Solution

(a) Labeled precision and labeled recall for parse trees are computed by comparing the spans covered by the constituents along with the labels. For example, both P and T contain NP:0-2 (a noun phrase spanning positions 0-2, where 0 is the position just before the and 2 is the position just after prof). So this constituent would be a true positive. VP:2-7 in P would be a false positive: it matches the label of the VP in T, but not its span (the true VP goes from 2 to 6). In this example, there are 4 true positives (S, two NPs, and ADVP), and each of P and T contains 6 constituents, so precision and recall both work out to be 4/6 = 2/3.
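A small sketch (not in the original solution) of the labeled precision/recall computation from Solution (a): it extracts labeled spans from the two bracketings above and compares them. The simple bracket parser is an assumption written just for this format.

```python
# Illustrative computation of labeled precision/recall for the trees in part (a).

T = "(S (NP the prof) (VP (VPRT looked up) (NP the grade)) (ADVP today))"
P = "(S (NP the prof) (VP looked (PP up (NP the grade)) (ADVP today)))"

def labeled_spans(bracketing):
    """Return the set of (label, start, end) constituents in a bracketed string."""
    spans, stack, pos = set(), [], 0
    for token in bracketing.replace("(", " ( ").replace(")", " ) ").split():
        if token == "(":
            stack.append(None)              # placeholder; the next token is the label
        elif token == ")":
            label, start = stack.pop()
            spans.add((label, start, pos))
        elif stack and stack[-1] is None:
            stack[-1] = (token, pos)        # constituent label and its start position
        else:
            pos += 1                        # a word: advance the span position
    return spans

true_spans, sys_spans = labeled_spans(T), labeled_spans(P)
tp = len(true_spans & sys_spans)
print("precision =", tp, "/", len(sys_spans), "=", round(tp / len(sys_spans), 3))
print("recall    =", tp, "/", len(true_spans), "=", round(tp / len(true_spans), 3))
# Both come out to 4/6 = 0.667, matching the solution.
```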
(b) They did two experiments and both were reported incompletely. My intent in the question was for you to observe that for most tasks, precision and recall, measured independently, can be traded off against each other. For example, a high precision result can be obtained trivially by labeling only constituents that are very unlikely to be wrong; as an extreme case, this gives parse trees that contain nothing but an S covering the whole sentence. (Look, ma, 100% precision!) So I was ok with responses encouraging skepticism on these grounds (and, say, suggesting F-measure as a harder-to-game combined measure).

On the other hand, if you think about it, a falsely high recall result for parse trees is much harder to obtain, especially for labeled constituents: since you have to choose a unique label per constituent and you also need a valid parse tree, you can't just throw in all possible constituents. (For example, in P, the preposition up could not be part of both a VPRT looked up and the PP up the grade.) So I'd also give full credit (and a pat on the back) for the argument that high labeled recall could be a credible indication that the parser is doing a good job.

(c) Both approaches properly separate training and test data. Ten-fold cross validation divides the full data set into folds f_1, ..., f_10 and performs ten training-test splits: each fold f_i is treated as test data, using {f_j : j ≠ i} as training data. The ten evaluation results are then averaged to give you a single figure of merit. This has the advantage of providing a better estimate of performance, since you can compute both the mean and the standard deviation. It has the disadvantage of being a bit more costly and complicated and, more important, it's not very well suited to the stated task. The parser is going to need to perform well on news data generated after the parser is trained (taking into account new people, places, etc., as well as new expressions in the language as they are introduced), and the cross-validation strategy doesn't take chronological ordering into account.

The alternative here could be described as a chronological train/test split. The advantage is that it is a better match for the task: unless there's going to be constant retraining (requiring constant treebanking), the application will be using a parser that was trained on older data and run on chronologically newer data. The main disadvantage is that a single train/test split gives you less information than cross validation, since you only have the one data point.
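To make the two evaluation designs concrete, here is a small sketch (not part of the original solution) of how the splits differ; the article list and the scoring function are placeholders.

```python
# Illustrative comparison of 10-fold cross-validation vs. a chronological split.
# `articles` is assumed to be sorted by publication date, and `evaluate` is a
# stand-in for training a parser on one set and scoring it on another.

def ten_fold_splits(articles, k=10):
    """Yield (train, test) pairs, one per fold; chronology is ignored."""
    folds = [articles[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [a for j, f in enumerate(folds) if j != i for a in f]
        yield train, test

def chronological_split(articles, cutoff_fraction=0.9):
    """Train on the earliest articles, test on the most recent ones."""
    cutoff = int(len(articles) * cutoff_fraction)
    return articles[:cutoff], articles[cutoff:]

def evaluate(train, test):
    """Placeholder: would train a parser on `train` and return labeled F1 on `test`."""
    return 0.0

articles = [f"article_{i:05d}" for i in range(20000)]   # assumed sorted by date

cv_scores = [evaluate(tr, te) for tr, te in ten_fold_splits(articles)]
print("cross-validation: mean of", len(cv_scores), "scores (plus a standard deviation)")

train, test = chronological_split(articles)
print("chronological: one score, training on", len(train), "and testing on", len(test))
```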


More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the

More information

Conditional Probability, Independence and Bayes Theorem Class 3, Jeremy Orloff and Jonathan Bloom

Conditional Probability, Independence and Bayes Theorem Class 3, Jeremy Orloff and Jonathan Bloom Conditional Probability, Independence and Bayes Theorem Class 3, 18.05 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Know the definitions of conditional probability and independence of events. 2.

More information

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction 15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information

NaturalLanguageProcessing-Lecture08

NaturalLanguageProcessing-Lecture08 NaturalLanguageProcessing-Lecture08 Instructor (Christopher Manning):Hi. Welcome back to CS224n. Okay. So today, what I m gonna do is keep on going with the stuff that I started on last time, going through

More information

LING 473: Day 10. START THE RECORDING Coding for Probability Hidden Markov Models Formal Grammars

LING 473: Day 10. START THE RECORDING Coding for Probability Hidden Markov Models Formal Grammars LING 473: Day 10 START THE RECORDING Coding for Probability Hidden Markov Models Formal Grammars 1 Issues with Projects 1. *.sh files must have #!/bin/sh at the top (to run on Condor) 2. If run.sh is supposed

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

More information

Log-Linear Models, MEMMs, and CRFs

Log-Linear Models, MEMMs, and CRFs Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx

More information

Recap: Lexicalized PCFGs (Fall 2007): Lecture 5 Parsing and Syntax III. Recap: Charniak s Model. Recap: Adding Head Words/Tags to Trees

Recap: Lexicalized PCFGs (Fall 2007): Lecture 5 Parsing and Syntax III. Recap: Charniak s Model. Recap: Adding Head Words/Tags to Trees Recap: Lexicalized PCFGs We now need to estimate rule probabilities such as P rob(s(questioned,vt) NP(lawyer,NN) VP(questioned,Vt) S(questioned,Vt)) 6.864 (Fall 2007): Lecture 5 Parsing and Syntax III

More information

Parsing with Context-Free Grammars

Parsing with Context-Free Grammars Parsing with Context-Free Grammars Berlin Chen 2005 References: 1. Natural Language Understanding, chapter 3 (3.1~3.4, 3.6) 2. Speech and Language Processing, chapters 9, 10 NLP-Berlin Chen 1 Grammars

More information

Ratios, Proportions, Unit Conversions, and the Factor-Label Method

Ratios, Proportions, Unit Conversions, and the Factor-Label Method Ratios, Proportions, Unit Conversions, and the Factor-Label Method Math 0, Littlefield I don t know why, but presentations about ratios and proportions are often confused and fragmented. The one in your

More information

Text Mining. March 3, March 3, / 49

Text Mining. March 3, March 3, / 49 Text Mining March 3, 2017 March 3, 2017 1 / 49 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, 2017 2 / 49

More information

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:

More information

P (E) = P (A 1 )P (A 2 )... P (A n ).

P (E) = P (A 1 )P (A 2 )... P (A n ). Lecture 9: Conditional probability II: breaking complex events into smaller events, methods to solve probability problems, Bayes rule, law of total probability, Bayes theorem Discrete Structures II (Summer

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information