Transformational Priors Over Grammars

1 Transformational Priors Over Grammars. Jason Eisner, Johns Hopkins University. July 6, 2002, EMNLP. This talk is called Transformational Priors Over Grammars. It should become clear what I mean by a prior over grammars, and where the transformations come in. But here's the big concept:

2 The Big Concept. Want to parse (or build a syntactic language model). Must estimate rule probabilities. Problem: Too many possible rules! Especially with lexicalization and flattening (which help). So it's hard to estimate probabilities. Suppose we want to estimate probabilities of parse trees, either to pick the best one or to do language modeling. Then we have to estimate the probabilities of context-free rules. But the problem, as usual, is sparse data: there are too many rules, too many probabilities to estimate. This is especially true if we use lexicalized rules, especially flat ones where all the dependents attach at one go. It does help to use such rules, as we'll see, but it also increases the number of parameters.

3 The Big Concept. Problem: Too many rules! Especially with lexicalization and flattening (which help). So it's hard to estimate probabilities. Solution: Related rules tend to have related probs. POSSIBLE relationships are given a priori; LEARN which relationships are strong in this language (just like feature selection). Method has connections to: parameterized finite-state machines (Monday's talk), Bayesian networks (inference, abduction, explaining away), and linguistic theory (transformations, metarules, etc.). The solution, I think, is to realize that related rules tend to have related probabilities. Then if you don't have enough data to observe a rule's probability directly, you can estimate it by looking at other, related rules. It's a form of smoothing. Sort of like reducing the number of parameters, although actually I'm going to keep all the parameters in case the data aren't sparse, and use a prior to bias their values in case the data are sparse. OLD: This is like reducing the number of parameters, since it lets you predict a rule's probability instead of learning it. OLD: (More precisely, you have a prior expectation of that rule probability, which can be overridden by data, but which you can fall back on in the absence of data.) What do I mean by related rules? I mean something like active and passive, but it varies from language to language. So you give the model a grab bag of possible relationships, which is language independent, and it learns which ones are predictive. That's akin to feature selection, in TBL or maxent modeling. You have a large set of candidate features generated by filling in templates, but only a few hundred or a few thousand of them turn out to be useful. The statistical method I'll use is a new one, but it has connections to other things. First of all, I'm giving a very general talk first thing Monday morning about PFSMs, and these models are a special case.

4 Problem: Too Many Rules... The slide shows a parse fragment headed by fund, with S expanding to TO (to), fund, NP (projects), SBAR (that ...), next to the Treebank counts of every rule headed by fund:
26 NP → DT fund
24 NN → fund
8 NP → DT NN fund
7 NNP → fund
5 S → TO fund NP
2 NP → NNP fund
2 NP → DT NPR NN fund
2 S → TO fund NP PP
1 NP → DT JJ NN fund
1 NP → DT NPR JJ fund
1 NP → DT ADJP NNP fund
1 NP → DT JJ JJ NN fund
1 NP → DT NN fund SBAR
1 NPR → fund
1 NP-PRD → DT NN fund VP
1 NP → DT NN fund PP
1 NP → DT ADJP NN fund ADJP
1 NP → DT ADJP fund PP
1 NP → DT JJ fund PP-TMP
1 NP-PRD → DT ADJP NN fund VP
1 NP → NNP fund , VP ,
1 NP → PRP$ fund
1 S-ADV → DT JJ fund
1 NP → DT NNP NNP fund
1 SBAR → NP MD fund NP PP
1 NP → DT JJ JJ fund SBAR
1 NP → DT JJ NN fund SBAR
1 NP → DT NNP fund
1 NP → NP$ JJ NN fund
1 NP → DT JJ fund
Here's a parse, or a fragment of one; the whole sentence might be "I want to fund projects that are worthy." To see whether it's a likely parse, we see whether its individual CF rules are likely. For instance, the rule we need here for fund was used 5 times in training data.

5 [Want To Multiply Rule Probabilities] The slide shows the same parse fragment and the same rule counts for fund, together with (oversimplified) p(tree) = ... · p(expansion of S) · p(expansion of TO) · p(expansion of NP) · p(expansion of SBAR) · ... The other rules in the parse have their own counts. And to get the probability of the parse, basically you convert the counts to probabilities and multiply them. I'm oversimplifying, but you already know how PCFGs work and it doesn't matter to this talk. What matters is how to convert the counts to probabilities.
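
To make the arithmetic concrete, here is a minimal Python sketch (mine, not the talk's code) of the oversimplified PCFG scoring just described: relative-frequency rule probabilities from counts, multiplied over the rules in a parse. The counts and the tiny parse are illustrative.

```python
# Minimal sketch: score a parse as a product of relative-frequency rule probabilities.
from collections import defaultdict

rule_counts = {
    ("S", ("TO", "fund", "NP")): 5,
    ("S", ("TO", "fund", "NP", "PP")): 2,
    ("NP", ("DT", "fund")): 26,
    ("NN", ("fund",)): 24,
}

# Total count of rules expanding each left-hand side.  (Really we'd condition on
# the head word too; this is the "oversimplified" version from the slide.)
lhs_totals = defaultdict(int)
for (lhs, _), c in rule_counts.items():
    lhs_totals[lhs] += c

def rule_prob(lhs, rhs):
    """Relative-frequency estimate of p(rhs | lhs)."""
    return rule_counts.get((lhs, rhs), 0) / lhs_totals[lhs]

# A parse is treated here as just the bag of rules used to build it.
parse = [("S", ("TO", "fund", "NP")), ("NP", ("DT", "fund"))]

p_tree = 1.0
for lhs, rhs in parse:
    p_tree *= rule_prob(lhs, rhs)
print(p_tree)
```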

6 Too Many Rules But Luckily... All these rules for fund, and other rules still unobserved, are connected by the deep structure of English. (The slide repeats the rule-count list for fund from slide 4.) Notice that I'm using lexicalized rules. Every rule I pull out of training data contains a word, so words can be idiosyncratic: the list of rules for fund might be different than the list of rules for another noun, or at least have different counts. That's important for parsing. Now, I didn't pick fund for any reason; in fact, this is an old slide. But it's instructive to look at this list of rules for fund, which is from the Penn Treebank. It's a long list, is the first thing to notice, and we haven't seen them all: there's a long tail of singletons. But there's order here. All of these rules are connected in ways that are common in English.

7 Rules Are Related. fund behaves like a typical singular noun: one fact! (though a PCFG represents it as many apparently unrelated rules). The slide highlights the noun rules in the count list. We could summarize them by saying that fund behaves like a typical singular noun. That's just one fact to learn; we don't have to learn the rules individually. So in a sense there's only one parameter here.

8 Rules Are Related. fund behaves like a typical singular noun or transitive verb: one more fact, even if several more rules. Verb rules are RELATED; we should be able to PREDICT the ones we haven't seen. Of course, the noun summary is not quite right, because we just saw fund used as a transitive verb, "to fund projects that are worthy." There are a few verb rules in the list. But that's just a second fact. These verb rules are related. We've only seen a few, but that should be enough to predict the rest of the transitive verb paradigm.

9 Rules Are Related. fund behaves like a typical singular noun or transitive verb, but as a noun it has an idiosyncratic fondness for purpose clauses ("the ACL fund to put proceedings online", "the old ACL fund for students to attend ACL"): one more fact, and it predicts dozens of unseen rules. I said it could act as a typical noun or verb, but that's still not quite right, because as a noun it's not quite typical. Look at these rules in orange, for noun phrases like "They describe what the fund does." Typical nouns can take these purpose clauses, but fund takes them more often than typical, probably for semantic reasons. Well, that's fact #3 about fund. It explains these 5 rules and predicts dozens more.

10 Rules Are Related. fund behaves like a typical singular noun or transitive verb, but as a noun has an idiosyncratic fondness for purpose clauses, and maybe other idiosyncrasies to be discovered, like unaccusativity. The slide contrasts "NSF issued the grant" / "The grant issued today" with "NSF funded the grant" / "The grant funded today???": an unlikely sentence, but if we do see it, is unaccusativity plausible (vs. the other parse)? And I'll mention one more potential fact. There's no rule in training data suggesting that fund might be unaccusative. What's unaccusative? It's kind of a sneaky passive, like "NSF issued the grant; the grant issued today." We'd like to say that since some verbs like issue can do this, maybe fund can too: "NSF funded the grant; the grant funded today." Based on the evidence we've seen so far, that's low-probability. But we don't want it to be too low, since if the system were to see this sentence, it would have to decide between the unaccusative parse and treating today as a direct object: today was funded by the grant. We'd want it to admit that the unaccusative parse is syntactically reasonable. That's how it can learn new constructions: it does EM. Oh, that's the best parse of this weird sentence, I guess I'll count it as new training data.

11 All This Is Quantitative! fund behaves like a typical singular noun or transitive verb (how often?), but as a noun has an idiosyncratic fondness for purpose clauses, and maybe other idiosyncrasies to be discovered, like unaccusativity. And how does that tell us p(rule)? So we have 4 facts about fund; there might be more. And really, they're all quantitative. How OFTEN is it a noun, and how often a verb? How MUCH more frequent are purpose clauses than for the typical noun? How OFTEN is fund used unaccusatively?

12 Format of the Rules. The slide contrasts the traditional structure for "Jim put pizza in the oven", a spine of rules all headed by put (S → NP VP, VP → VP PP, VP → V NP, V → put), with a single flat rule S → NP put NP PP. Here's a traditional structure for "Jim put pizza in the oven." Going from the top down, it expands S by a sequence of 4 rules. Nowadays we condition those expansions on the head word, put; note that put is the head of all these projections. But I'm going to argue in favor of the structure on the right, which collapses the spine of put into a single level. Put takes all its dependents at once.
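
As a toy illustration of this flattening (my own sketch, not the paper's representation), here is code that collapses a head spine into a single flat rule; the tree encoding used here is an assumption made just for this example.

```python
# A node is (label, head_word, children), where a child is either a subtree or
# the head word itself (a bare string).
def flatten_spine(node):
    """Collapse every child that shares this node's head into its parent,
    yielding a flat rule like S -> NP put NP PP."""
    label, head, children = node
    rhs = []
    for child in children:
        if isinstance(child, str):          # the head word itself
            rhs.append(child)
        elif child[1] == head:              # same head: part of the spine, recurse
            rhs.extend(flatten_spine(child)[1])
        else:                               # a dependent: keep just its label
            rhs.append(child[0])
    return label, rhs

# "Jim put pizza in the oven" in the traditional nested form from the slide.
tree = ("S", "put",
        [("NP", "Jim", ["Jim"]),
         ("VP", "put",
          [("VP", "put",
            [("V", "put", ["put"]),
             ("NP", "pizza", ["pizza"])]),
           ("PP", "in", ["in", "the", "oven"])])])

print(flatten_spine(tree))   # ('S', ['NP', 'put', 'NP', 'PP'])
```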

13 Format of the Rules. Why use flat rules? They avoid silly independence assumptions: a win (Johnson 1998, and new experiments). And our method likes them: traditional rules aren't systematically related, but relationships exist among wide, flat rules that express different ways of filling the same roles. The slide shows S → NP put NP PP, with Jim, a pizza, and in the oven filling the slots. It's a way of avoiding independence assumptions. Adjuncts are not really independent of one another. And even traditional methods do better when estimating the whole flat rule at one go. Mark Johnson showed that for some special cases, and my experiments bear it out in spades. But the new method especially wants to work with wide, flat rules, because it looks at relationships among rules.

14 Format of the Rules. Same points as the previous slide; here the slots are filled differently: S → NP put PP NP, with the object (a very heavy pizza) shifted to after the PP (in the oven).

15 Format of the Rules. Same points again; here the example is S → NP, NP put PP, with the object fronted: "A pepperoni pizza he put in the oven. What's next, shrimp fricassee?" Compare "Jim put a pizza in the oven last week" and "A pizza was put in the oven last week."

16 Format of the Rules. Why use flat rules? They avoid silly independence assumptions: a win (Johnson 1998, new experiments). Our method likes them: traditional rules aren't systematically related, but relationships exist among wide, flat rules that express different ways of filling the same roles (S → NP put NP PP: Jim, a pizza, in the oven). In short, flat rules are the locus of transformations. What a transformation does is to take a flat rule, a word together with all its roles, and rearrange the way those roles are expressed syntactically.

17 Format of the Rules. Why use flat rules? They avoid silly independence assumptions: a win (Johnson 1998, new experiments). Our method likes them: traditional rules aren't systematically related, but relationships exist among wide, flat rules that express different ways of filling the same roles. Flat rules are the locus of exceptions (e.g., put is exceptionally likely to take a PP, but not a second PP). In short, flat rules are the locus of transformations. And some ways of expressing those roles may be particularly favored.

18 Hey Just Like Linguistics! Intuition: listing is costly and hard to learn; most rules are derived. Lexicalized syntactic formalisms (CG, LFG, TAG, HPSG, LCFG): the grammar is a set of lexical entries, very like flat rules, and exceptional entries are OK (flat rules are the locus of exceptions, e.g., put is exceptionally likely to take a PP, but not a second PP). Listed entries vs. derived entries: these formalisms explain coincidental patterns of lexical entries with metarules / transformations / lexical redundancy rules (in short, flat rules are the locus of transformations). Hey, just like linguistics: think about the lexicon. What these formalisms have in common is that they all end in G... no, just kidding. In all of them, the grammar is just a set of lexical entries that can be combined in various ways. And a lexical entry is always basically like one of our flat rules. It's OK to have some weird entries in the lexicon, but there's also a lot of redundancy, as we saw for FUND. And linguists have mechanisms for deriving the redundant entries. So you could also see this talk as being about how to stochasticize these approaches to syntax, including the lexical redundancy rules.

19 The Rule Smoothing Task. Input: rule counts (from parses or putative parses). Output: a probability distribution over rules. Evaluation: perplexity of held-out rule counts. That is, did we assign high probability to the rules needed to correctly parse test data? Now that we know what a rule is, let's talk about rule smoothing. We look at some parses and count up the rules. In EM, they'd be fractional counts. Then we have to figure out the real probabilities of the rules. To evaluate, we look at some more parses and see whether they perplex our model. Our model is good if it assigns high probability to the rules needed to parse test sentences correctly.

20 The Rule Smoothing Task. Input: rule counts (from parses or putative parses). Output: a probability distribution over rules. Evaluation: perplexity of held-out rule counts. Rule probabilities: p(S → NP put NP PP | S, put). There is an infinite set of possible rules, so we will estimate even p(S → NP Adv PP put PP PP NP AdjP S | S, put) = a very tiny number > 0.
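
Here is a rough sketch, in Python, of the evaluation just described; the model interface and the held-out counts are illustrative stand-ins, not the paper's actual setup.

```python
# Perplexity of a smoothed model p(frame | word) on held-out (word, frame) counts.
import math

def perplexity(model, heldout_counts):
    """model(word, frame) -> probability; heldout_counts: {(word, frame): count}.
    Returns exp of the average negative log-probability per held-out rule token."""
    total_logprob, total_tokens = 0.0, 0
    for (word, frame), count in heldout_counts.items():
        p = model(word, frame)
        if p == 0.0:
            return float("inf")     # an unsmoothed model is infinitely perplexed
        total_logprob += count * math.log(p)
        total_tokens += count
    return math.exp(-total_logprob / total_tokens)

# Toy usage with a hypothetical smoothed model:
heldout = {("fund", ("S", "TO", "_", "NP")): 2,
           ("fund", ("NP", "DT", "JJ", "_")): 1}
uniform = lambda word, frame: 1e-4          # stand-in for a real smoothed model
print(perplexity(uniform, heldout))
```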

21 Grid of Lexicalized Rules. The slide lays the rules out as a grid: the columns are head words (encourage, question, fund, merge, repay, remove) and the rows are frames (To NP; To NP PP; To AdvP NP; To AdvP NP PP; To PP; To S; NP NP .; NP NP PP .; NP Md NP; NP Md NP PPTmp; NP Md PP PP; NP SBar .; etc.). For example, S → To fund NP PP ("to fund projects with ease") and S → To merge NP PP ("to merge projects with ease"). We saw this rule before. I've pulled out the head word, fund, and used it as a column label. The rest of the rule without the word is the row label, and I'll call it a frame. Here's a similar atom: only the word is different, the frame is the same, so it goes in a different column of the same row.

22 Training Counts. The same grid, now filled with the count of each (word, frame) pair: To AdvP NP 1; To AdvP NP PP 1; NP NP . 2; NP NP PP . 1; NP Md NP 1; NP Md NP PPTmp 1; NP Md PP PP 1; To PP 1; To S 1; NP SBar . 2; (other). And in training data, we saw each of those frames twice. These are real counts, by the way. So this is our training data...

23 Naive prob. estimates (MLE model). The slide shows the same grid, now holding the estimate of p(frame | word) * 1000. First column, encourage: we have seen encourage once with each of these frames, for a total of 5, so we give each frame the probability 1 out of 5, or 0.2. And every other column works the same way. But there are more things in language and speech, MLE model, than are dreamt of in your philosophy! In other words, all these zeroes are a problem. There are new things under the sun, and they will show up in test data. In fact, they'll show up quite often! When the parser needs a particular atom, chances are 21% it never saw that atom during training, so it's got prob = 0 in this table. You might point out, well, that's because your atoms are so specific: they're specified down to the level of the particular word. But chances are 5% it never even saw the frame before, so the whole row is all 0's. We have to generalize from the other rows.
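
A minimal sketch of the MLE model on this slide, with made-up counts and frame names, just to show where the zeroes come from:

```python
# Each column (word) is normalized separately, so
# p(frame | word) = count(word, frame) / count(word).
# Any frame never seen with a word (and any frame never seen at all) gets
# probability zero, which is exactly the problem the talk addresses.
from collections import defaultdict

counts = {                       # a few illustrative (word, frame) counts
    ("fund",   "To _ NP"): 2,
    ("fund",   "NP Md _ NP"): 1,
    ("merge",  "To _ NP PP"): 1,
    ("remove", "To AdvP _ NP"): 1,
}

word_totals = defaultdict(int)
for (word, frame), c in counts.items():
    word_totals[word] += c

def p_mle(frame, word):
    return counts.get((word, frame), 0) / word_totals[word]

print(p_mle("To _ NP", "fund"))      # 2/3
print(p_mle("To _ NP PP", "fund"))   # 0.0: unseen, though clearly possible
```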

24 TASK: counts → probs ("smoothing"). Same grid: estimate of p(frame | word) * 1000. ...but for parsing what we really need is the true probabilities of the atoms, not the counts. (These are probs * 1000, to 2 significant figures, so the ones that jut left are big, the ones that jut right are small.) [flip back and forth to counts] So that's our task: to turn counts into probabilities. This is traditionally called smoothing: we've sort of smeared the black counts down the column so that we get some positive probability on each row. In fact, as the legend at the bottom says, these are p(frame | word), so each column is a distribution over possible frames for a word. It ranges over ALL possible frames, and it sums to 1. A possible frame is anything of the form blah blah blah --- blah blah, so there are infinitely many of them. You'll note that the counts involved are very small. These are real data, and they're 0's and 1's and 2's. The difference between seeing something 0 times and 1 time may just be a matter of luck, or it may be significant. That's why this problem is hard, and why I'm using a Bayesian approach, which weighs evidence carefully against expectations.

25 Smooth Matrix via LSA / SVD, or SBS? The slide shows the same grid of (word, frame) counts. No: then each column would be approximated by a linear combination of standard columns. The same is true for similarity-based smoothing (Lee et al.). That's not a bad idea, since the column for fund would be a mixture of noun behavior and verb behavior. But lots of rows of the training matrix are all zeros. That is, lots of frames in test data never showed up in training data at all. SVD can't handle that. And SVD doesn't know anything about the internal structure of frames; it sees each column as a vector over interchangeable dimensions. But it's clear from looking at this table that the internal structure is really predictive. If a frame appears, then it generally appears with PP added at the right edge. These frames for remove are just split-infinitive versions of these: to completely remove, to surgically remove.

26 Smoothing via a Bayesian Prior. Choose the grammar to maximize p(observed rule counts | grammar) * p(grammar), where a grammar is a probability distribution over rules. Our job: define p(grammar). Question: what makes a grammar likely, a priori? This paper's answer: systematicity. Rules are mainly derivable from other rules, with relatively few stipulations ("deep facts"). We'd like a grammar that explains the observed data, and is also a priori a good grammar. So we use Bayes' Rule and maximize this product. By a grammar I mean a probability distribution over rules.

27 Only a Few Deep Facts. fund behaves like a transitive verb 10% of the time and a noun 90% of the time, and takes purpose clauses 5 times as often as a typical noun. (The slide repeats the rule-count list for fund.) These are the key facts. If fund has other little idiosyncrasies, we can add those facts, but small idiosyncrasies don't hurt the prior probability much; the prior cares more about big ones.

28 Smoothing via a Bayesian Prior. Previous work (several papers in the past decade): rules should be few, short, and approximately equiprobable. Those priors try to keep rules out of the grammar, a bad idea for lexicalized grammars. This work: the prior tries to get related rules into the grammar, e.g. the passive of a transitive at 1/20 the probability: "NSF spraggles the project" → "The project is spraggled by NSF." It would be weird for the passive to be missing, and the prior knows it! In fact, it is weird if p(passive) is too far from 1/20 * p(active). Few facts, not few rules! If you can say "NSF funds the project", it would be really weird if you couldn't say "The project is funded by NSF." This prior is going to be much happier if the passive rule for fund is in there. Or really, all rules are in there; the question is what the probability is.

29 Simple Edit Transformations (for now, stick to these; see the paper for various evidence that they should be predictive). Start with S → NP see NP ("I see you"). Delete NP gives S → NP see ("I see"). Insert PP gives S → NP see NP PP ("I see you with my own eyes"). Subst NP → SBAR gives S → NP see SBAR ("I see that it's love"). You do fancier things by a sequence of edits: another Insert PP gives S → NP see SBAR PP ("I see that it's love with my own eyes"), and Swap SBAR,PP gives S → NP see PP SBAR ("I see with my own eyes that it's love"). We won't do passives in these experiments, though; stick to simple edit transformations. Start with a simple transitive verb rule. What are the related rules? We could suppress the direct object and let the hearer infer it. We could make the instrument explicit, in case the hearer can't infer it: I see you in the park with a telescope. We could change the type of the direct object and see not an object, you, but a proposition, that it's love. And we could do heavy-shift. These have different probabilities. Remember, we're treating this as feature selection. We'll tell the statistical model about all kinds of transformations, including insertion of weird constituents in weird places, and let it figure out which ones are good. But I did some preliminary experiments to verify that in general, edit transformations do tend to be predictive. You can read about those in the paper.
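
Here is a small Python sketch (my own illustration, not the paper's transformation inventory) that enumerates single-edit neighbors of a flat frame, writing the head position as "_"; the insertable and substitutable symbols are assumptions made for the example.

```python
def edit_neighbors(frame, insertable=("PP",), substitutable={"NP": ["SBAR"]}):
    """Enumerate frames reachable by one Delete, Insert, Subst, or Swap edit."""
    frame = list(frame)
    head = frame.index("_")
    neighbors = []
    # Delete any non-head symbol.
    for i, sym in enumerate(frame):
        if i != head:
            neighbors.append(("Delete " + sym, frame[:i] + frame[i+1:]))
    # Insert a new symbol at any position to the right of the head (for brevity).
    for sym in insertable:
        for i in range(head + 1, len(frame) + 1):
            neighbors.append(("Insert " + sym, frame[:i] + [sym] + frame[i:]))
    # Substitute one nonterminal for another.
    for i, sym in enumerate(frame):
        for new in substitutable.get(sym, []):
            neighbors.append(("Subst %s->%s" % (sym, new),
                              frame[:i] + [new] + frame[i+1:]))
    # Swap two adjacent non-head symbols.
    for i in range(len(frame) - 1):
        if head not in (i, i + 1):
            swapped = frame[:]
            swapped[i], swapped[i+1] = swapped[i+1], swapped[i]
            neighbors.append(("Swap %s,%s" % (frame[i], frame[i+1]), swapped))
    return neighbors

for op, result in edit_neighbors(["NP", "_", "NP"]):   # S -> NP see NP
    print(op, result)
```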

30 The slide attaches probabilities to the arcs of this graph, including a Halt arc at each rule, so that for example p(S → NP see SBAR PP) = 0.5 * 0.1 * 0.1 * ... : START goes to S → NP see NP ("I see you") with probability 0.5, and arcs such as Delete NP (0.1), Insert PP (0.2), Subst NP → SBAR (0.1), Insert PP (0.1), and Swap SBAR,PP (0.6) lead to the other rules, with Halt probabilities like 0.9 at the rules themselves. These transformations have probabilities. We can use the transformation probabilities to calculate the rule probabilities, basically in the obvious way.
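
A sketch of that "obvious way", under my own simplifying assumptions: the arc probabilities below reuse the slide's numbers where they are legible and are otherwise made up, and whatever probability mass is left at a vertex is treated as its Halt probability.

```python
# p(rule) = sum over derivation paths that halt at that rule of the product of
# the arc probabilities along the path (including the final Halt step).
from collections import defaultdict

start = {"S -> NP see NP": 0.5, "S -> NP see": 0.3, "S -> DT JJ see": 0.2}
arcs = {   # vertex -> list of (probability, successor); remaining mass halts
    "S -> NP see NP":      [(0.1, "S -> NP see"),
                            (0.2, "S -> NP see NP PP"),
                            (0.1, "S -> NP see SBAR")],
    "S -> NP see SBAR":    [(0.1, "S -> NP see SBAR PP")],
    "S -> NP see SBAR PP": [(0.6, "S -> NP see PP SBAR")],
}

def rule_probs(max_depth=4):
    """Approximate p(rule) by enumerating derivation paths up to max_depth."""
    probs = defaultdict(float)
    def walk(vertex, path_prob, depth):
        out = arcs.get(vertex, [])
        halt = 1.0 - sum(p for p, _ in out)
        probs[vertex] += path_prob * halt          # halt here: emit this rule
        if depth < max_depth:
            for p, nxt in out:
                walk(nxt, path_prob * p, depth + 1)
    for vertex, p in start.items():
        walk(vertex, p, 0)
    return probs

for rule, p in sorted(rule_probs().items(), key=lambda kv: -kv[1]):
    print("%.4f  %s" % (p, rule))
```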

31 The graph goes on forever (it even contains rules like S → NP Adv PP see PP PP NP AdjP S). From START, 0.5 goes to S → NP see NP ("I see you") and the whole transitive verb paradigm (with probs), 0.3 goes to S → NP see ("I see") and the intransitive verb paradigm, and some probability goes to the noun paradigm, e.g. S → DT JJ see ("the holy see"). We could get mixture behavior by adjusting the start probs. But that's not quite right: it can't handle negative exceptions within a paradigm. And what of the language's transformation probs? So if we increase the 0.5, then all the transitive verb rules increase. If we increase the 0.3, then all the intransitive verb rules increase. And there's some crosstalk between them, so if we suddenly learn that see is a transitive verb, we also raise the probability that it could be used intransitively. Just as fund can be used as a verb or noun, so can see. So one way to get a simple mixture behavior would be to adjust the start weights. That would be sort of like SVD, where a word is approximated as a linear combination of basis vectors. But here the basis vectors, or paradigms, are infinite: they're distributions over an infinite set of possible rules. And we're doing something else that SVD can't do, which is to use information about the dimensions; some of these rules are related to each other in the sense of having low edit distance. If every word's distribution over frames is a mixture of standard distributions (which is not quite what we'll end up doing), then maybe we should use LSA or SVD to find those standard distributions and the mixture coefficients. But that would just model the observed distribution vector as a sum of standard vectors. It wouldn't constrain what those standard vectors looked like, for example by paying attention to edit distance. And those standard vectors would have finite support, so they couldn't assign probability to frames that were never observed in training at all.

32 Infinitely Many Arc Probabilities: Derive From a Finite Parameter Set. The slide shows two Insert PP arcs: one from S → NP see to S → NP see PP, and one from S → NP see NP to S → NP see NP PP (where there are more places to insert, so the probability is split among more options). Why not just give any two PP-insertion arcs the same probability? In the second one, we are inserting PP into a slightly different context. And in the second one, there are more places to insert PP, so each has lower probability.

33 Arc Probabilities: A Conditional Log-Linear Model. The model gives p(arc | vertex). Each arc out of a vertex, e.g. the Insert PP arc from S → NP see NP to S → NP see NP PP, gets a probability of the form (1/Z) exp(sum of its feature weights), here (1/Z) exp(θ3 + θ5 + θ6); the Halt arc likewise gets (1/Z) exp(...). To make sure the outgoing arcs sum to 1, introduce a normalizing factor Z (at each vertex).

34 Arc Probabilities: A Conditional Log-Linear Model. Again the two Insert PP arcs: from S → NP see to S → NP see PP (PP inserted into a slightly different context), and from S → NP see NP to S → NP see NP PP (more places to insert). Both are PP-adjunction arcs. Same probability? Almost, but not quite.

35 Arc Probabilities: A Conditional Log-Linear Model. It is not enough just to say "Insert PP." Each arc bears several features, whose weights determine its probability: the Insert PP arc from S → NP see NP to S → NP see NP PP has probability (1/Z) exp(θ3 + θ6 + θ7). Feature weights: a feature of weight 0 has no effect; raising a feature's weight strengthens all arcs with that feature. Every arc bears several features describing what it does, which together determine its probability. It's like turning a knob that adjusts a whole class of arcs. There are as many knobs as there are features. But only finitely many.

36 Arc Probabilities: A Conditional Log-Linear Model. The Insert PP arc from S → NP see to S → NP see PP has probability (1/Z) exp(θ3 + θ5 + θ7); the one from S → NP see NP to S → NP see NP PP has probability (1/Z) exp(θ3 + θ6 + θ7). θ3 appears on arcs that insert PP into S; θ5 on arcs that insert PP just after the head; θ6 on arcs that insert PP just after NP; θ7 on arcs that insert PP just before the edge.
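
Putting those pieces together, here is a minimal sketch of the conditional log-linear arc model; the weight values, the halt feature, and the particular candidate arcs are illustrative assumptions, but the normalization at each vertex is as described.

```python
# p(arc | vertex) = exp(sum of the arc's feature weights) / Z(vertex).
import math

theta = {"insert_PP_into_S": 1.2,        # theta_3
         "insert_PP_after_head": 0.4,    # theta_5
         "insert_PP_after_NP": 0.3,      # theta_6
         "insert_PP_before_edge": 0.5,   # theta_7
         "halt": 2.0}                    # my own stand-in feature for Halt arcs

def arc_probs(candidate_arcs):
    """candidate_arcs: {arc_name: [feature names]} -> {arc_name: probability}."""
    scores = {arc: math.exp(sum(theta[f] for f in feats))
              for arc, feats in candidate_arcs.items()}
    Z = sum(scores.values())                       # normalize at this vertex
    return {arc: s / Z for arc, s in scores.items()}

# A simplified subset of the outgoing arcs from the vertex "S -> NP see NP".
vertex_arcs = {
    "Insert PP after NP":   ["insert_PP_into_S", "insert_PP_after_NP",
                             "insert_PP_before_edge"],
    "Insert PP after head": ["insert_PP_into_S", "insert_PP_after_head"],
    "Halt":                 ["halt"],
}
print(arc_probs(vertex_arcs))
```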

37 Arc Probabilities: A Conditional Log-Linear Model. (Repeats the previous slide's arcs and feature definitions.)

38 Arc Probabilities: A Conditional Log-Linear Model. The arc from S → NP see to S → NP see PP has probability (1/Z) exp(θ3 + θ5 + θ7); the arc from S → NP see NP to S → NP see NP PP has probability (1/Z) exp(θ3 + θ6 + θ7). These arcs share most features, so their probabilities tend to rise and fall together. To fit data, we could still manipulate them independently (via θ5, θ6).

39 Prior Distribution. The PCFG grammar is determined by θ0, θ1, θ2, ...

40 Universal Grammar 40

41 Instantiated Grammar 41

42 Prior Distribution. The grammar is determined by θ0, θ1, θ2, ... Our prior: θi ~ N(0, σ²), IID. Thus −log p(grammar) = c + (θ0² + θ1² + θ2² + ...) / 2σ². So good grammars have few large weights. The prior prefers one generalization to many exceptions.
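
In code, the resulting training objective looks roughly like this: a sketch under the usual Gaussian-prior convention, with a toy stand-in for the likelihood of the rule counts.

```python
# Maximize log p(counts | grammar) + log p(grammar), i.e. minimize the negative
# log-likelihood plus a quadratic penalty on the feature weights.
def neg_log_posterior(theta, neg_log_likelihood, sigma2=1.0):
    """theta: list of feature weights; neg_log_likelihood: callable on theta."""
    penalty = sum(t * t for t in theta) / (2.0 * sigma2)   # from theta_i ~ N(0, sigma^2)
    return neg_log_likelihood(theta) + penalty

# Toy usage: a quadratic stand-in for the real likelihood of the rule counts.
toy_nll = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 0.5) ** 2
print(neg_log_posterior([1.0, -0.5], toy_nll, sigma2=2.0))
```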

43 Arc Probabilities: A Conditional Log-Linear Model. Again, the arc from S → NP see to S → NP see PP has probability (1/Z) exp(θ3 + θ5 + θ7), and the arc from S → NP see NP to S → NP see NP PP has probability (1/Z) exp(θ3 + θ6 + θ7). To raise both rules' probs, it is cheaper to use θ3 than both θ5 and θ6. This generalizes: it also raises other cases of PP-insertion!

44 Arc Probabilities: A Conditional Log-Linear Model. The arc from S → NP fund NP to S → NP fund NP PP has probability (1/Z) exp(θ3 + θ82 + θ6 + θ7), and the arc from S → NP see NP to S → NP see NP PP has probability (1/Z) exp(θ3 + θ84 + θ6 + θ7). To raise both probs, it is cheaper to use θ3 than both θ82 and θ84. This generalizes: it also raises other cases of PP-insertion!

45 Reparameterization. The grammar is determined by θ0, θ1, θ2, ... A priori, the θi are normally distributed. We've reparameterized! The parameters are feature weights θi, not rule probabilities. Important tendencies are captured in big weights. Similarly, the Fourier transform finds the formants; similarly, SVD finds the principal components. It's on this deep level that we want to compare events, impose priors, etc.

48 Other models of this string: max-likelihood, n-gram, Collins arg/adj, hybrids.

49 Simple Bigram Model (Eisner 1996). A parser assumes a tree is probable if its component rules are; try assuming a rule is probable if its component bigrams are. For a frame A B C _ D (writing _ for the head word's position), multiply p(A | start) p(B | A) p(C | B) p(_ | C) p(D | _) p(stop | D): a Markov process with 1 symbol of memory, conditioned on L, w, and the side of the head, with one-count backoff to handle sparse data (Chen & Goodman 1996). Overall, p(L → A B C _ D | w) = p(L | w) · p(A B C _ D | L, w). OK, here's a simple model that does assign non-zero probability to every atom. Remember we've assumed a tree is probable if its component atoms are. We might make the same independence assumption at a finer grain, and assume that an atom is probable if its subatomic particles are. I've taken a subatomic particle to be a sequence of 2 consecutive nonterminals in the frame, known as a bigram. The bigrams here are start-A, A-B, etc. To get the prob of the frame, we do like before: multiply together the bigrams' probabilities and divide by the overlap. And that's just saying that the frame is generated by a Markov process with one symbol of memory. With some other tricks of the trade, this does respectably at parsing if you have $250K worth of data.
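
A toy version of this bigram frame model, ignoring the conditioning on L, w, and the side of the head and using a tiny floor in place of one-count backoff; the table entries are made-up numbers.

```python
# The frame's symbols are generated left to right by a Markov process with one
# symbol of memory; the frame probability is the product of its bigram probabilities.
def frame_prob(frame, bigram_prob):
    """frame: e.g. ["NP", "_", "NP", "PP"]; bigram_prob(prev, next) -> prob."""
    symbols = ["<start>"] + list(frame) + ["<stop>"]
    p = 1.0
    for prev, nxt in zip(symbols, symbols[1:]):
        p *= bigram_prob(prev, nxt)
    return p

# Toy bigram table, with a small floor standing in for backoff smoothing.
table = {("<start>", "NP"): 0.6, ("NP", "_"): 0.5, ("_", "NP"): 0.4,
         ("NP", "PP"): 0.2, ("PP", "<stop>"): 0.7}
bigram = lambda a, b: table.get((a, b), 1e-3)
print(frame_prob(["NP", "_", "NP", "PP"], bigram))
```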

50 Use non-flat frames? Extra training info. For test, sum over all bracketings. 50

51 Perplexity: Predicting test frames. 20% further reduction from the previous literature. We can get a big perplexity reduction just by flattening.

52 Perplexity: Predicting test frames. (Chart comparing the best model with transformations, the best model without transformations, and results from the previous literature.)

53 Test rules with 0 training observations: p(rule | head, S). (Chart comparing the best model with transformations to the best model without.)

54 Test rules with 1 training observation: p(rule | head, S). (Same comparison.)

55 Test rules with 2 training observations: p(rule | head, S). (Same comparison.)

56 Forced matching task. This tests the model's ability to extrapolate novel frames for a word. Randomly select two (word, frame) pairs from test data, ensuring that neither frame was ever seen in training. Ask the model to choose a matching: (word 1 with frame A, word 2 with frame B) versus (word 1 with frame B, word 2 with frame A). That is, does frame A look more like word 1's known frames or word 2's? The transformation model makes 20% fewer errors than the bigram model.
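
The decision rule for this task can be sketched in a few lines; the toy model and frame names below are illustrative stand-ins.

```python
# Match frames to words by comparing the straight assignment's probability
# against the crossed assignment's.
def forced_match(model, word1, word2, frameA, frameB):
    """model(frame, word) -> probability.  Returns the preferred matching."""
    straight = model(frameA, word1) * model(frameB, word2)
    crossed  = model(frameB, word1) * model(frameA, word2)
    return "straight" if straight >= crossed else "crossed"

# Toy usage with a hypothetical smoothed model:
toy_table = {("To _ NP PP", "fund"): 3e-4, ("NP Md _ PP PP", "merge"): 1e-4}
toy_model = lambda frame, word: toy_table.get((frame, word), 1e-5)
print(forced_match(toy_model, "fund", "merge", "To _ NP PP", "NP Md _ PP PP"))
```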

57 Graceful degradation. Twice as much data, but no transformations. Even when you take away half of the transformation model's data, it still wins. Or in the case of hybrid models, it's about a tie. So these kinds of perplexity reductions were comparable to a twofold increase in the amount of training data.

58 Summary: Reparameterize the PCFG in terms of deep transformation weights, to be learned under a simple prior. Problem: Too many rules! Especially with lexicalization and flattening (which help). So it's hard to estimate probabilities. Solution: Related rules tend to have related probs. POSSIBLE relationships are given a priori; LEARN which relationships are strong in this language (just like feature selection). Method has connections to: parameterized finite-state machines (Monday's talk), Bayesian networks (inference, abduction, explaining away), and linguistic theory (transformations, metarules, etc.).

59 FIN 59


More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

ANLP Lecture 6 N-gram models and smoothing

ANLP Lecture 6 N-gram models and smoothing ANLP Lecture 6 N-gram models and smoothing Sharon Goldwater (some slides from Philipp Koehn) 27 September 2018 Sharon Goldwater ANLP Lecture 6 27 September 2018 Recap: N-gram models We can model sentence

More information

Discrete Structures Proofwriting Checklist

Discrete Structures Proofwriting Checklist CS103 Winter 2019 Discrete Structures Proofwriting Checklist Cynthia Lee Keith Schwarz Now that we re transitioning to writing proofs about discrete structures like binary relations, functions, and graphs,

More information

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009 CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Empirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model

Empirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model Empirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model (most slides from Sharon Goldwater; some adapted from Philipp Koehn) 5 October 2016 Nathan Schneider

More information

AN ALGEBRA PRIMER WITH A VIEW TOWARD CURVES OVER FINITE FIELDS

AN ALGEBRA PRIMER WITH A VIEW TOWARD CURVES OVER FINITE FIELDS AN ALGEBRA PRIMER WITH A VIEW TOWARD CURVES OVER FINITE FIELDS The integers are the set 1. Groups, Rings, and Fields: Basic Examples Z := {..., 3, 2, 1, 0, 1, 2, 3,...}, and we can add, subtract, and multiply

More information

One-to-one functions and onto functions

One-to-one functions and onto functions MA 3362 Lecture 7 - One-to-one and Onto Wednesday, October 22, 2008. Objectives: Formalize definitions of one-to-one and onto One-to-one functions and onto functions At the level of set theory, there are

More information

Language Modeling. Michael Collins, Columbia University

Language Modeling. Michael Collins, Columbia University Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting

More information

Toss 1. Fig.1. 2 Heads 2 Tails Heads/Tails (H, H) (T, T) (H, T) Fig.2

Toss 1. Fig.1. 2 Heads 2 Tails Heads/Tails (H, H) (T, T) (H, T) Fig.2 1 Basic Probabilities The probabilities that we ll be learning about build from the set theory that we learned last class, only this time, the sets are specifically sets of events. What are events? Roughly,

More information

1 Rules in a Feature Grammar

1 Rules in a Feature Grammar 27 Feature Grammars and Unification Drew McDermott drew.mcdermott@yale.edu 2015-11-20, 11-30, 2016-10-28 Yale CS470/570 1 Rules in a Feature Grammar Grammar rules look like this example (using Shieber

More information

Algorithms for Syntax-Aware Statistical Machine Translation

Algorithms for Syntax-Aware Statistical Machine Translation Algorithms for Syntax-Aware Statistical Machine Translation I. Dan Melamed, Wei Wang and Ben Wellington ew York University Syntax-Aware Statistical MT Statistical involves machine learning (ML) seems crucial

More information

Take the Anxiety Out of Word Problems

Take the Anxiety Out of Word Problems Take the Anxiety Out of Word Problems I find that students fear any problem that has words in it. This does not have to be the case. In this chapter, we will practice a strategy for approaching word problems

More information

Guide to Proofs on Sets

Guide to Proofs on Sets CS103 Winter 2019 Guide to Proofs on Sets Cynthia Lee Keith Schwarz I would argue that if you have a single guiding principle for how to mathematically reason about sets, it would be this one: All sets

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 26 February 2018 Recap: tagging POS tagging is a sequence labelling task.

More information

A* Search. 1 Dijkstra Shortest Path

A* Search. 1 Dijkstra Shortest Path A* Search Consider the eight puzzle. There are eight tiles numbered 1 through 8 on a 3 by three grid with nine locations so that one location is left empty. We can move by sliding a tile adjacent to the

More information

Physics Motion Math. (Read objectives on screen.)

Physics Motion Math. (Read objectives on screen.) Physics 302 - Motion Math (Read objectives on screen.) Welcome back. When we ended the last program, your teacher gave you some motion graphs to interpret. For each section, you were to describe the motion

More information

perplexity = 2 cross-entropy (5.1) cross-entropy = 1 N log 2 likelihood (5.2) likelihood = P(w 1 w N ) (5.3)

perplexity = 2 cross-entropy (5.1) cross-entropy = 1 N log 2 likelihood (5.2) likelihood = P(w 1 w N ) (5.3) Chapter 5 Language Modeling 5.1 Introduction A language model is simply a model of what strings (of words) are more or less likely to be generated by a speaker of English (or some other language). More

More information

Ling 240 Lecture #15. Syntax 4

Ling 240 Lecture #15. Syntax 4 Ling 240 Lecture #15 Syntax 4 agenda for today Give me homework 3! Language presentation! return Quiz 2 brief review of friday More on transformation Homework 4 A set of Phrase Structure Rules S -> (aux)

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a

More information

Sums and products: What we still don t know about addition and multiplication

Sums and products: What we still don t know about addition and multiplication Sums and products: What we still don t know about addition and multiplication Carl Pomerance, Dartmouth College Hanover, New Hampshire, USA Based on joint work with R. P. Brent, P. Kurlberg, J. C. Lagarias,

More information

Multilevel Coarse-to-Fine PCFG Parsing

Multilevel Coarse-to-Fine PCFG Parsing Multilevel Coarse-to-Fine PCFG Parsing Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, Shrivaths Iyengar, Jeremy Moore, Michael Pozar, and Theresa

More information

CS1800: Mathematical Induction. Professor Kevin Gold

CS1800: Mathematical Induction. Professor Kevin Gold CS1800: Mathematical Induction Professor Kevin Gold Induction: Used to Prove Patterns Just Keep Going For an algorithm, we may want to prove that it just keeps working, no matter how big the input size

More information

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning A Supertag-Context Model for Weakly-Supervised CCG Parser Learning Dan Garrette Chris Dyer Jason Baldridge Noah A. Smith U. Washington CMU UT-Austin CMU Contributions 1. A new generative model for learning

More information

Exploratory Factor Analysis and Principal Component Analysis

Exploratory Factor Analysis and Principal Component Analysis Exploratory Factor Analysis and Principal Component Analysis Today s Topics: What are EFA and PCA for? Planning a factor analytic study Analysis steps: Extraction methods How many factors Rotation and

More information

Statistical NLP Spring A Discriminative Approach

Statistical NLP Spring A Discriminative Approach Statistical NLP Spring 2008 Lecture 6: Classification Dan Klein UC Berkeley A Discriminative Approach View WSD as a discrimination task (regression, really) P(sense context:jail, context:county, context:feeding,

More information

CKY & Earley Parsing. Ling 571 Deep Processing Techniques for NLP January 13, 2016

CKY & Earley Parsing. Ling 571 Deep Processing Techniques for NLP January 13, 2016 CKY & Earley Parsing Ling 571 Deep Processing Techniques for NLP January 13, 2016 No Class Monday: Martin Luther King Jr. Day CKY Parsing: Finish the parse Recognizer à Parser Roadmap Earley parsing Motivation:

More information

What we still don t know about addition and multiplication. Carl Pomerance, Dartmouth College Hanover, New Hampshire, USA

What we still don t know about addition and multiplication. Carl Pomerance, Dartmouth College Hanover, New Hampshire, USA What we still don t know about addition and multiplication Carl Pomerance, Dartmouth College Hanover, New Hampshire, USA You would think that all of the issues surrounding addition and multiplication were

More information

Hierarchical Bayesian Nonparametrics

Hierarchical Bayesian Nonparametrics Hierarchical Bayesian Nonparametrics Micha Elsner April 11, 2013 2 For next time We ll tackle a paper: Green, de Marneffe, Bauer and Manning: Multiword Expression Identification with Tree Substitution

More information

Feature selection. Micha Elsner. January 29, 2014

Feature selection. Micha Elsner. January 29, 2014 Feature selection Micha Elsner January 29, 2014 2 Using megam as max-ent learner Hal Daume III from UMD wrote a max-ent learner Pretty typical of many classifiers out there... Step one: create a text file

More information