Unit 2: Tree Models. CS 562: Empirical Methods in Natural Language Processing. Lectures 19-23: Context-Free Grammars and Parsing


1 CS 562: Empirical Methods in Natural Language Processing Unit 2: Tree Models Lectures 19-23: Context-Free Grammars and Parsing Oct-Nov 2009 Liang Huang

2 Big Picture

we have already covered...
- sequence models (WFSAs, WFSTs)
- decoding (Viterbi Algorithm)
- supervised training (counting, smoothing)
- unsupervised training (EM using forward-backward)

in this unit we'll look beyond sequences, and cover...
- tree models (probabilistic context-free grammars and extensions)
- decoding ("parsing", CKY Algorithm)
- supervised training (lexicalization, history-annotation, ...)
- unsupervised training (EM via inside-outside)

3 Course Project (see handout)

Important Dates
- Tue Nov 10: initial project proposal due
- Thu Nov 19: final project proposal due; data preparation and one-hour baseline finished
- Thu Dec 3: midway project presentations
- Wed Dec 16: final project writeups due

Topic (see list of samples from previous years)
- must involve statistical processing of linguistic structures
- NO boring topics like text classification with bags of words

Amount of Work: ~2 HWs for each student

4 Review of HW3

what's wrong with this epron-jpron.wfst?

100
(0 (AA "AA" *e* 1.0))
(0 (D "D" *e* 1.0))
(0 (F "F" *e* 1.0))
(0 (HH "HH" *e* 1.0))
(0 (G "G" *e* 1.0))
(0 (B "B" *e* 1.0))
...
(AA (100 *e* "O O"))
(AA (100 *e* "A"))
(AA (100 *e* "O"))
(AA (100 *e* "A A"))
...
(M (100 *e* "N"))
(M (100 *e* "M"))
(N (100 *e* "NN"))
(100 (0 " " *e* 1.0))
(100 (0 *e* *e* 1.0))

5 Limitations of Sequence Models

can you write an FSA/FST for the following?
- { (a^n, b^n) }
- { (a^2n, b^n) }
- { a^n b^n }
- { w w^R }
- { (w, w^R) }

does it matter to human languages?
- [The woman saw the boy [that heard the man [that left]]].
- [The claim [that the house [he bought] is valuable] is wrong].

but humans can't really process infinite recursions... stack overflow!

6 Let's try to write a grammar... (courtesy of Julia Hockenmaier)

let's take a closer look... we'll try our best to represent English in an FSA...

basic sentence structure: N, V, N

7 Subject-Verb-Object

compose it with a lexicon, and we get an HMM. so far so good

8 (Recursive) Adjectives (courtesy of Julia Hockenmaier)

the ball
the big ball
the big, red ball
the big, red, heavy ball
...

then add Adjectives, which modify Nouns
- the number of modifiers/adjuncts can be unlimited
- how about no determiner before the noun? "play tennis"

9 Recursive PPs (courtesy of Julia Hockenmaier)

the ball
the ball in the garden
the ball in the garden behind the house
the ball in the garden behind the house near the school
...

recursion can be more complex, but we can still model it with FSAs!
so why bother to go beyond finite-state?

10 FSAs can't go hierarchical! (courtesy of Julia Hockenmaier)

but sentences have a hierarchical structure! so that we can infer the meaning
we need not only strings, but also trees
FSAs are flat, and can only do tail recursion (i.e., loops)
but we need real (branching) recursion for languages

11 FSAs can't do Center Embedding (courtesy of Julia Hockenmaier)

The mouse ate the corn.
The mouse that the snake ate ate the corn.
The mouse that the snake that the hawk ate ate ate the corn.
...
vs. The claim that the house he bought was valuable was wrong.
vs. I saw the ball in the garden behind the house near the school.

in theory, these infinite recursions are still grammatical: competence (grammatical knowledge)
in practice, studies show that English has a limit of 3: performance (processing and memory limitations)
FSAs can model finite embeddings, but only very inconveniently.

12 How about Recursive FSAs?

problem of FSAs: only tail recursion, no branching recursion
- can't represent hierarchical structures (trees)
- can't generate center-embedded strings
is there a simple way to improve it? recursive transition networks (RTNs)

[RTN figure: one small network per category, e.g. S: 0 -NP-> 1 -VP-> 2; NP: 0 -Det-> 1 -N-> 2; VP: 0 -V-> 1 -NP-> 2]

13 Context-Free Grammars

S → NP VP
NP → Det N
NP → NP PP
PP → P NP
N → {ball, garden, house, sushi, ...}
P → {in, behind, with}
V → ...
Det → ...
VP → V NP
VP → VP PP
...

14 Context-Free Grammars

A CFG is a 4-tuple ⟨N, Σ, R, S⟩:
- a set of nonterminals N (e.g. N = {S, NP, VP, PP, Noun, Verb, ...})
- a set of terminals Σ (e.g. Σ = {I, you, he, eat, drink, sushi, ball, ...})
- a set of rules R ⊆ {A → β : left-hand side (LHS) A ∈ N, right-hand side (RHS) β ∈ (N ∪ Σ)*}
- a start symbol S (sentence)
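To make the 4-tuple concrete, here is a minimal sketch in Python (the `Grammar` class and the rule format are illustrative, not from the course code):

```python
# A CFG as a 4-tuple <N, Sigma, R, S>: rules are (LHS, RHS-tuple) pairs.
from collections import defaultdict

class Grammar:
    def __init__(self, rules, start="S"):
        self.rules = defaultdict(list)          # R: LHS -> list of RHS tuples
        for lhs, rhs in rules:
            self.rules[lhs].append(tuple(rhs))
        self.start = start                      # S
        self.nonterminals = set(self.rules)     # N: symbols that have rules
        rhs_syms = {s for rhss in self.rules.values() for r in rhss for s in r}
        self.terminals = rhs_syms - self.nonterminals   # Sigma: everything else

g = Grammar([("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP")),
             ("Det", ("the",)), ("N", ("ball",)), ("V", ("saw",))])
```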

15 Parse Trees

N → {sushi, tuna}
P → {with}
V → {eat}
NP → N
NP → NP PP
PP → P NP
VP → V NP
VP → VP PP

16 CFGs for Center-Embedding

The mouse ate the corn.
The mouse that the snake ate ate the corn.
The mouse that the snake that the hawk ate ate ate the corn.
...

{ a^n b^n }   { w w^R }
can you also do { a^n b^n c^n }? or { w w^R w }? { a^n b^n c^m d^m }?
what's the limitation of CFGs?

CFG for center-embedded clauses: S → NP ate NP; NP → NP RC; RC → that NP ate

17 Review

write a CFG for...
- { a^m b^n c^n d^m }
- { a^m b^n c^{3m+2n} }
- { a^m b^n c^m d^n }
- buffalo buffalo buffalo...

write an FST or synchronous CFG for... (see the worked example below)
- { (w, w^R) }
- { (a^n, b^n) }

HW3: including p(eprons) is wrong
HW4: using carmel to test your own code
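As a worked example for the first language (one possible answer, sketched here; not the only solution): the dependencies nest, so put the b-c recursion inside the a-d recursion:

S → a S d
S → T
T → b T c
T → ε

By contrast, { a^m b^n c^m d^n } has crossing a...c and b...d dependencies, which is exactly what a CFG cannot capture.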

18 Chomsky Hierarchy (figure; CS 498 JH: Introduction to NLP, Fall 08)

19 Constituents, Heads, Dependents (figure; CS 498 JH: Introduction to NLP, Fall 08)

20 Constituency Test: how about "there is" or "I do"? (figure; CS 498 JH: Introduction to NLP, Fall 08)

21 Arguments and Adjuncts: arguments are obligatory (figure; CS 498 JH: Introduction to NLP, Fall 08)

22 Arguments and Adjuncts: adjuncts are optional (figure; CS 498 JH: Introduction to NLP, Fall 08)

23 Noun Phrases (NPs) (figure; CS 498 JH: Introduction to NLP, Fall 08)

24 The NP Fragment (figure; CS 498 JH: Introduction to NLP, Fall 08)

25 ADJPs and PPs (figure; CS 498 JH: Introduction to NLP, Fall 08)

26 Verb Phrase (VP) (figure; CS 498 JH: Introduction to NLP, Fall 08)

27 VPs redefined (figure; CS 498 JH: Introduction to NLP, Fall 08)

28 Sentences (figure; CS 498 JH: Introduction to NLP, Fall 08)

29 Sentence Redefined (figure; CS 498 JH: Introduction to NLP, Fall 08)

30 Probabilistic CFG

normalization: for every nonterminal A, Σ_β p(A → β) = 1
what's the most likely tree? (in the finite-state world, what's the most likely string?)
given string w, what's the most likely tree for w?
this is called parsing (like decoding)

(CS 498 JH: Introduction to NLP, Fall 08)
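A minimal sketch of the normalization check in Python (the dict-of-probabilities format is an assumption for illustration):

```python
# Check Sum_beta p(A -> beta) = 1 for every nonterminal A.
from collections import defaultdict

def is_normalized(pcfg, tol=1e-9):
    # pcfg: dict mapping (lhs, rhs_tuple) -> probability
    total = defaultdict(float)
    for (lhs, _rhs), p in pcfg.items():
        total[lhs] += p
    return all(abs(t - 1.0) < tol for t in total.values())

toy = {("S", ("NP", "VP")): 1.0,
       ("NP", ("Det", "N")): 0.7,
       ("NP", ("NP", "PP")): 0.3}
assert is_normalized(toy)
```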

31 Probability of a tree (figure; CS 498 JH: Introduction to NLP, Fall 08)

32 Most likely tree given string

parsing is to search for the best tree t* such that:
t* = argmax_t p(t | w) = argmax_t p(t) p(w | t) = argmax_{t: yield(t)=w} p(t)
(the last step holds because p(w | t) is 1 exactly when yield(t) = w, and 0 otherwise)
analogous to HMM decoding
is it related to intersection or composition in FSTs?

33 CKY Algorithm

[chart figure: the goal item (S, 0, n) at the top of a triangular chart over the input w0 w1 ... wn-1]

34 CKY Algorithm

input: flies like a flower

S → NP VP
NP → DT NN
NP → NNS
NP → NP PP
VP → VB NP
VP → VP PP
VP → VB
PP → P NP
VB → flies
NNS → flies
VB → like
P → like
DT → a
NN → flower

35 CKY Algorithm

[chart figure: the filled CKY chart for "flies like a flower" under the same grammar; the width-1 cells include {NNS, VB, NP}, {VB, P, VP}, {DT}, {NN}, and the full-span cell contains {S, VP, NP}]
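Putting the chart-filling together, here is a Viterbi-CKY sketch in Python for a CNF PCFG (the `unary`/`binary` rule-table format is an illustrative choice, not the course's code):

```python
# Viterbi CKY over a PCFG in Chomsky Normal Form.
# unary:  word -> list of (A, p) for lexical rules A -> word
# binary: (B, C) -> list of (A, p) for rules A -> B C
def cky(words, unary, binary, start="S"):
    n = len(words)
    best, back = {}, {}
    for i, w in enumerate(words):                   # width-1 spans
        for A, p in unary.get(w, []):
            best[A, i, i + 1], back[A, i, i + 1] = p, w
    for width in range(2, n + 1):                   # bottom-up over widths
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):               # split point
                for (B, C), prods in binary.items():
                    if (B, i, k) in best and (C, k, j) in best:
                        s = best[B, i, k] * best[C, k, j]
                        for A, p in prods:
                            if p * s > best.get((A, i, j), 0.0):
                                best[A, i, j] = p * s
                                back[A, i, j] = (k, B, C)   # for tree recovery
    return best.get((start, 0, n), 0.0), back
```

The nested loops over (i, k, j) times the rule lookups give the O(n^3 |P|) time noted on the later complexity slides; the tree is recovered by following backpointers down from (S, 0, n).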

36 CKY Example (figure; CS 498 JH: Introduction to NLP, Fall 08)

37 Chomsky Normal Form

wait! how can you assume a CFG is binary-branching?
well, we can always convert a CFG into Chomsky Normal Form (CNF):
A → B C
A → a
how to deal with epsilon-removal? how to do it with a PCFG?
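A sketch of the binarization step only (unary-rule and epsilon elimination are separate passes); the `@`-prefixed fresh symbols are an illustrative convention:

```python
# Binarize long rules: A -> B C D becomes A -> B @X1 and @X1 -> C D.
# For a PCFG, the original rule's probability stays on the first new rule
# and the introduced rules get probability 1, so tree probabilities are unchanged.
def binarize(rules):
    out, fresh = [], 0
    for lhs, rhs in rules:
        rhs = list(rhs)
        while len(rhs) > 2:
            fresh += 1
            new = f"@X{fresh}"
            out.append((lhs, (rhs[0], new)))   # peel off the leftmost symbol
            lhs, rhs = new, rhs[1:]
        out.append((lhs, tuple(rhs)))
    return out
```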

38 Parsing as Deduction

(B, i, k) : a    (C, k, j) : b
------------------------------  A → B C
(A, i, j) : a · b · Pr(A → B C)

39 Parsing as Intersection

(B, i, k) : a    (C, k, j) : b
------------------------------  A → B C
(A, i, j) : a · b · Pr(A → B C)

intersection between a CFG G and an FSA D:
- define L(G) to be the set of strings (i.e., yields) G generates
- define L(G ∩ D) = L(G) ∩ L(D)
- what does this new language generate?? what does the new grammar look like?
- what about CFG ∩ CFG?

40 Parsing as Composition (figure)

41 Packed Forests

a compact representation of many parses by sharing common sub-derivations
- polynomial-space encoding of an exponentially large set

[forest figure over the input 0 I 1 saw 2 him 3 with 4 a 5 mirror 6: nodes plus hyperedges form a hypergraph]

(Klein and Manning, 2001; Huang and Chiang, 2005)

42 Lattice vs. Forest (figure)

43 Forest and Deduction

(B, i, k) : a    (C, k, j) : b
------------------------------  A → B C
(A, i, j) : a · b · Pr(A → B C)

in forest terms: the antecedent items (B, i, k) and (C, k, j) are the tails u1 : a and u2 : b of a hyperedge e with weight function f_e; the consequent item (A, i, j) is its head v, with v : f_e(a, b)

(Nederhof, 2003)

44 Related Formalisms

[figure: a hypergraph drawn as an AND/OR graph; each node v is an OR-node, each hyperedge e is an AND-node, and its tails u1, u2 are OR-nodes]

45 Viterbi Algorithm for DAGs

1. topological sort
2. visit each vertex v in sorted order and do updates:
   for each incoming edge (u, v) in E, use d(u) to update d(v):
   d(v) ⊕= d(u) ⊗ w(u, v)
   key observation: d(u) is fixed to optimal at this time

time complexity: O(|V| + |E|)
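The two steps in Python, instantiated with the Viterbi (max, ×) semiring (the graph format is illustrative; `graphlib` is in the standard library from Python 3.9):

```python
# Viterbi over a DAG: after a topological sort, each d(u) is final before
# any edge (u, v) is relaxed, so one sweep suffices: O(|V| + |E|).
from graphlib import TopologicalSorter

def viterbi_dag(vertices, in_edges, source):
    # in_edges: dict v -> list of (u, weight) incoming edges
    preds = {v: [u for u, _ in in_edges.get(v, [])] for v in vertices}
    d = {v: 0.0 for v in vertices}
    d[source] = 1.0
    for v in TopologicalSorter(preds).static_order():
        for u, w in in_edges.get(v, []):
            d[v] = max(d[v], d[u] * w)      # d(v) (+)= d(u) (x) w(u, v)
    return d
```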

46 Viterbi Algorithm for DAHs

1. topological sort
2. visit each vertex v in sorted order and do updates:
   for each incoming hyperedge e = ((u1, ..., u|e|), v, f_e), use the d(ui)'s to update d(v):
   d(v) ⊕= f_e(d(u1), ..., d(u|e|))
   key observation: the d(ui)'s are fixed to optimal at this time

time complexity: O(|V| + |E|) (assuming constant arity)
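And the same loop generalized to hyperedges, a sketch assuming probability-valued scores where each hyperedge carries its own weight function f_e:

```python
# Viterbi over a directed acyclic hypergraph (DAH): for each incoming
# hyperedge e = ((u1, ..., uk), v, f_e), update d(v) with f_e(d(u1), ..., d(uk)).
def viterbi_dah(order, in_edges, axioms):
    # order: vertices topologically sorted (all tails before their head)
    # in_edges: dict v -> list of (tails_tuple, f_e)
    d = dict(axioms)                        # axiom vertices get initial scores
    for v in order:
        for tails, f_e in in_edges.get(v, []):
            s = f_e(*(d[u] for u in tails))
            if s > d.get(v, 0.0):
                d[v] = s
    return d
```

With items (X, i, j), tails ((B, i, k), (C, k, j)), and f_e(a, b) = a · b · Pr(A → B C), this is exactly the CKY recursion of the next slide.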

47 Example: CKY Parsing

parsing with CFGs in Chomsky Normal Form (CNF)
- a typical instance of the generalized Viterbi for DAHs
- many variants of CKY ~ various topological orderings

[figure: two chart orderings reaching the goal item (S, 0, n): bottom-up and left-to-right]

complexity: O(n^3 |P|)

48 Example: CKY Parsing

parsing with CFGs in Chomsky Normal Form (CNF)
- a typical instance of the generalized Viterbi for DAHs
- many variants of CKY ~ various topological orderings

[figure: three chart orderings reaching the goal item (S, 0, n): bottom-up, left-to-right, and right-to-left]

complexity: O(n^3 |P|)

49 Parser/Tree Evaluation

how would you evaluate the quality of output trees? need to define a similarity measure between trees
for sequences, we used:
- same length: Hamming distance (e.g., POS tagging)
- varying length: edit distance (e.g., Japanese transliteration)
- varying length: precision/recall/F (e.g., word segmentation)
- varying length: crossing brackets (e.g., word segmentation)
for trees, we use precision/recall/F and crossing brackets: the standard PARSEVAL metrics (implemented as evalb.py)

50 PARSEVAL

comparing nodes ("brackets"):
- labelled (by default): (NP, 2, 5); or unlabelled: (2, 5)
- precision: how many predicted nodes are correct?
- recall: how many correct nodes are predicted?
- how to fake precision or recall?
- F-score: F = 2pr / (p + r)
- other metrics: crossing brackets

example: matched = 6, predicted = 7, gold = 7, so precision = 6/7, recall = 6/7, F = 6/7
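The arithmetic as a sketch in Python (evalb.py mentioned above is the real course implementation; the `parseval` function and the bracket sets below are made up to reproduce the matched = 6/7 figures):

```python
# Labelled PARSEVAL over bracket triples (label, i, j).
from collections import Counter

def parseval(predicted, gold):
    matched = sum((Counter(predicted) & Counter(gold)).values())
    p = matched / len(predicted)            # correct among predicted
    r = matched / len(gold)                 # predicted among correct
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

pred = [("S", 0, 7), ("NP", 0, 2), ("VP", 2, 7), ("NP", 3, 5),
        ("PP", 5, 7), ("NP", 6, 7), ("DT", 5, 6)]
gold = [("S", 0, 7), ("NP", 0, 2), ("VP", 2, 7), ("NP", 3, 7),
        ("PP", 5, 7), ("NP", 6, 7), ("DT", 5, 6)]
print(parseval(pred, gold))   # matched=6, predicted=7, gold=7 -> (6/7, 6/7, 6/7)
```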

51 Inside-Outside Algorithm (figure)

52 Inside-Outside Algorithm (figure)

53 Inside-Outside Algorithm

the inside probability β is easy to compute (CKY, with max replaced by +)
what is the outside probability α(X, i, j)? we need to enumerate the ways to get to TOP from (X, i, j)
(X, i, j) can be combined with other nodes on the left or the right:
- L: Σ_{Y → Z X, k} α(Y, k, j) · Pr(Y → Z X) · β(Z, k, i)
- R: Σ_{Y → X Z, k} α(Y, i, k) · Pr(Y → X Z) · β(Z, j, k)
why is β used in computing α? very different from the forward-backward algorithm
what is the likelihood of the sentence? β(TOP, 0, n), or α(w_i, i, i+1) for any i

54 Inside-Outside Algorithm

[figure: the two outside cases for (X, i, j); left: X is the right child, with Y spanning (k, j) and Z spanning (k, i); right: X is the left child, with Y spanning (i, k) and Z spanning (j, k)]

L: Σ_{Y → Z X, k} α(Y, k, j) · Pr(Y → Z X) · β(Z, k, i)
R: Σ_{Y → X Z, k} α(Y, i, k) · Pr(Y → X Z) · β(Z, j, k)

55 Inside-Outside Algorithm

how do you do EM with the alphas and betas?
- easy; M-step: normalize the fractional counts
- the fractional count of a rule instance (X, i, j) → (Y, i, k) (Z, k, j) is
  α(X, i, j) · Pr(X → Y Z) · β(Y, i, k) · β(Z, k, j), divided by the sentence likelihood β(TOP, 0, n)

if we replace + by max, what will alpha/beta mean?
- β': Viterbi inside: the best way to derive (X, i, j)
- α': Viterbi outside: the best way to get to TOP from (X, i, j)
- now what is α'(X, i, j) ⊗ β'(X, i, j)? the best derivation that contains (X, i, j) (useful for pruning)
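A sketch of the inside pass (rule tables as in the CKY sketch above) together with the fractional count of one rule instance; names are illustrative:

```python
# Inside probabilities: the same loops as CKY, with max replaced by +.
def inside(words, unary, binary, top="TOP"):
    n = len(words)
    beta = {}
    for i, w in enumerate(words):
        for A, p in unary.get(w, []):
            beta[A, i, i + 1] = beta.get((A, i, i + 1), 0.0) + p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (B, C), prods in binary.items():
                    bl = beta.get((B, i, k), 0.0)
                    br = beta.get((C, k, j), 0.0)
                    if bl and br:
                        for A, p in prods:
                            beta[A, i, j] = beta.get((A, i, j), 0.0) + p * bl * br
    return beta                 # sentence likelihood: beta[top, 0, n]

# E-step: fractional count of one rule instance (X,i,j) -> (Y,i,k) (Z,k,j),
# normalized by the sentence likelihood.
def frac_count(alpha, beta, rule_prob, X, Y, Z, i, k, j, likelihood):
    return alpha[X, i, j] * rule_prob * beta[Y, i, k] * beta[Z, k, j] / likelihood
```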

56 Translation as Parsing

translation with SCFGs => monolingual parsing
- parse the source input with the source projection
- build the corresponding target sub-strings in parallel

example rules:
VP → ⟨PP(1) VP(2), VP(2) PP(1)⟩
VP → ⟨juxing le huitan, held a meeting⟩
PP → ⟨yu Shalong, with Sharon⟩

[derivation figure: PP(1,3) covers "yu Shalong" / "with Sharon", VP(3,6) covers "juxing le huitan" / "held a talk", and VP(1,6) combines them into "held a talk with Sharon"]

complexity: same as CKY parsing, O(n^3)
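A sketch of the target-side reassembly for one derivation (the data structures are illustrative): each node pairs a target RHS, in which integers index sub-derivations, with its list of children:

```python
# Build the target string from a source-side SCFG derivation.
def target_yield(node):
    target_rhs, children = node    # ints in target_rhs index into children
    out = []
    for sym in target_rhs:
        if isinstance(sym, int):
            out.extend(target_yield(children[sym]))   # recurse, reordered
        else:
            out.append(sym)                           # target terminal
    return out

# the derivation above: VP -> <PP(1) VP(2), VP(2) PP(1)>
pp = (["with", "Sharon"], [])
vp = (["held", "a", "talk"], [])
root = ([1, 0], [pp, vp])          # target side puts VP(2) before PP(1)
print(" ".join(target_yield(root)))   # held a talk with Sharon
```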

57 Adding a Bigram Model

[figure: forest items augmented with target boundary words, e.g. VP(1,6) carrying "held ... talk", "held ... meeting", or "hold ... talks", built from PP(1,3) "with ... Sharon" and VP(3,6) "held ... talk", with bigrams scored where the sub-strings meet]

complexity: O(n^3 |V|^{4(m-1)})
