CS 562: Empirical Methods in Natural Language Processing Unit 2: Tree Models Lectures 19-23: Context-Free Grammars and Parsing Oct-Nov 2009 Liang Huang (lhuang@isi.edu)
Big Picture we have already covered... sequence models (WFSAs, WFSTs), decoding (Viterbi algorithm), supervised training (counting, smoothing), unsupervised training (EM using forward-backward); in this unit we'll look beyond sequences, and cover... tree models (probabilistic context-free grammars and extensions), decoding (parsing: the CKY algorithm), supervised training (lexicalization, history-annotation, ...), unsupervised training (EM via inside-outside)
Course Project (see handout) Important Dates: Tue Nov 10: initial project proposal due; Thu Nov 19: final project proposal due, data preparation and one-hour baseline finished; Thu Dec 3: midway project presentations; Wed Dec 16: final project writeups due at oana@isi.edu. Topic (see list of samples from previous years): must involve statistical processing of linguistic structures; NO boring topics like text classification with bags of words. Amount of work: ~2 HWs per student
Review of HW3 what's wrong with this epron-jpron.wfst? 100 (0 (AA "AA" *e* 1.0 ) ) (0 (D "D" *e* 1.0 ) ) (0 (F "F" *e* 1.0 ) ) (0 (HH "HH" *e* 1.0 ) ) (0 (G "G" *e* 1.0 ) ) (0 (B "B" *e* 1.0 ) )... (AA (100 *e* "O O" 0.012755102040816327) ) (AA (100 *e* "A" 0.4642857142857143) ) (AA (100 *e* "O" 0.49489795918367346) ) (AA (100 *e* "A A" 0.015306122448979591) )... (M (100 *e* "N" 0.08396946564885496) ) (M (100 *e* "M" 0.6793893129770993) ) (N (100 *e* "NN" 0.015942028985507246) ) ( 100 ( 0 " " *e* 1.0 ) ) ( 100 ( 0 *e* *e* 1.0 ) )
Limitations of Sequence Models can you write an FSA/FST for the following? { (a^n, b^n) } { (a^2n, b^n) } { a^n b^n } { w w^R } { (w, w^R) } does it matter to human languages? [The woman saw the boy [that heard the man [that left] ] ]. [The claim [that the house [he bought] is valuable] is wrong]. but humans can't really process infinite recursions... stack overflow!
Let's try to write a grammar... (courtesy of Julia Hockenmaier) let's take a closer look... we'll try our best to represent English in an FSA... basic sentence structure: N, V, N
Subject-Verb-Object compose it with a lexicon, and we get an HMM so far so good 7
(Recursive) Adjectives the ball (courtesy of Julia Hockenmaier) the big ball the big, red ball the big, red, heavy ball... then add Adjectives, which modify Nouns the number of modifiers/adjuncts can be unlimited. how about no determiner before noun? play tennis 8
Recursive PPs (courtesy of Julia Hockenmaier) the ball the ball in the garden the ball in the garden behind the house the ball in the garden behind the house near the school... recursion can be more complex but we can still model it with FSAs! so why bother to go beyond finite-state? 9
FSAs can't go hierarchical! (courtesy of Julia Hockenmaier) but sentences have a hierarchical structure! so that we can infer the meaning we need not only strings, but also trees FSAs are flat, and can only do tail recursion (i.e., loops) but we need real (branching) recursion for languages
FSAs can't do Center Embedding The mouse ate the corn. (courtesy of Julia Hockenmaier) The mouse that the snake ate ate the corn. The mouse that the snake that the hawk ate ate ate the corn.... vs. The claim that the house he bought was valuable was wrong. vs. I saw the ball in the garden behind the house near the school. in theory, these infinite recursions are still grammatical: competence (grammatical knowledge); in practice, studies show that English has a limit of about 3 levels: performance (processing and memory limitations). FSAs can model finite embeddings, but only very inconveniently.
How about Recursive FSAs? problem of FSAs: only tail recursion, no branching recursion; can't represent hierarchical structures (trees); can't generate center-embedded strings. is there a simple way to improve it? recursive transition networks (RTNs) --------------------------------------- S NP VP -> 0 ------> 1 ------> 2 -> --------------------------------------- --------------------------------------- NP Det N -> 0 ------> 1 ------> 2 -> --------------------------------------- ---------------------------------- VP V NP -> 0 ------> 1 ------> 2 -> ----------------------------------
Context-Free Grammars S -> NP VP; NP -> Det N; NP -> NP PP; PP -> P NP; N -> {ball, garden, house, sushi}; P -> {in, behind, with}; V -> ...; Det -> ...; VP -> V NP; VP -> VP PP; ...
Context-Free Grammars A CFG is a 4-tuple ⟨N, Σ, R, S⟩: a set of nonterminals N (e.g. N = {S, NP, VP, PP, Noun, Verb, ...}); a set of terminals Σ (e.g. Σ = {I, you, he, eat, drink, sushi, ball, ...}); a set of rules R ⊆ {A -> β with left-hand side (LHS) A ∈ N and right-hand side (RHS) β ∈ (N ∪ Σ)*}; a start symbol S (sentence)
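A quick concrete rendering of the 4-tuple: the sketch below (toy grammar loosely following the earlier slides; helper names and the tree encoding are my own, not from the lecture) stores N, Σ, R, S in plain Python containers and computes the yield of a parse tree.

```python
# N, Sigma, R, S as plain Python containers (toy grammar, terminals abridged);
# trees are nested tuples (label, child1, child2, ...), leaves are 1-tuples.
N = {"S", "NP", "VP", "PP", "Det", "N", "V", "P"}
Sigma = {"the", "ball", "garden", "saw", "in"}
R = {
    "S":   [("NP", "VP")],
    "NP":  [("Det", "N"), ("NP", "PP")],
    "VP":  [("V", "NP"), ("VP", "PP")],
    "PP":  [("P", "NP")],
    "Det": [("the",)],
    "N":   [("ball",), ("garden",)],
    "V":   [("saw",)],
    "P":   [("in",)],
}
S = "S"

def yield_of(tree):
    """Leaves of a parse tree, left to right."""
    label, *children = tree
    if not children:                  # a terminal leaf
        return [label]
    return [w for c in children for w in yield_of(c)]

tree = ("S",
        ("NP", ("Det", ("the",)), ("N", ("ball",))),
        ("VP", ("V", ("saw",)),
               ("NP", ("Det", ("the",)), ("N", ("garden",)))))
```

The yield of a tree is exactly the string it derives, which is what "yield(t) = w" refers to in the parsing slides below.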
Parse Trees N -> {sushi, tuna} P -> {with} V -> {eat} NP -> N NP -> NP PP PP -> P NP VP -> V NP VP -> VP PP
CFGs for Center-Embedding The mouse ate the corn. The mouse that the snake ate ate the corn. The mouse that the snake that the hawk ate ate ate the corn.... { a^n b^n } { w w^R } can you also do { a^n b^n c^n }? or { w w^R w }? { a^n b^n c^m d^m }? what's the limitation of CFGs? CFG for center-embedded clauses: S -> NP ate NP; NP -> NP RC; RC -> that NP ate
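For the string language { a^n b^n }, the grammar S -> a S b | ε gives an equally direct recognizer; a small sketch (the function name is my own):

```python
# Recognizer for { a^n b^n }, following the grammar S -> a S b | epsilon:
# peel one 'a' off the front and one 'b' off the back per derivation step.
def is_anbn(s):
    if s == "":                # S -> epsilon
        return True
    return s.startswith("a") and s.endswith("b") and is_anbn(s[1:-1])
```

No FSA can do this check, since it needs an unbounded counter; the single recursive call is exactly the branching recursion the slides are pointing at.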
Review write a CFG for... { a^m b^n c^n d^m } { a^m b^n c^(3m+2n) } { a^m b^n c^m d^n } buffalo buffalo buffalo... write an FST or synchronous CFG for... { (w, w^R) } { (a^n, b^n) } HW3: including p(eprons) is wrong HW4: using carmel to test your own code
Chomsky Hierarchy CS 498 JH: Introduction to NLP (Fall 08) 18
Constituents, Heads, Dependents
Constituency Test how about "there is" or "I do"?
Arguments and Adjuncts arguments are obligatory
Arguments and Adjuncts adjuncts are optional
Noun Phrases (NPs)
The NP Fragment
ADJPs and PPs
Verb Phrase (VP)
VPs redefined
Sentences
Sentence Redefined
Probabilistic CFG normalization: sum_β p(A -> β) = 1 what's the most likely tree? in the finite-state world, what's the most likely string? given string w, what's the most likely tree for w? this is called parsing (like decoding)
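The normalization condition sum_β p(A -> β) = 1 per left-hand side is easy to check mechanically; a sketch with a made-up rule table:

```python
from collections import defaultdict

# Toy PCFG stored as (LHS, RHS) -> probability; the numbers are made up.
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.7,
    ("NP", ("NP", "PP")): 0.3,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("VP", "PP")): 0.4,
}

def is_normalized(rules, tol=1e-9):
    """Check sum_beta p(A -> beta) = 1 for every LHS A."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return all(abs(t - 1.0) < tol for t in totals.values())
```

This is the same per-LHS normalization used later by the M-step of inside-outside, where the divided fractional counts must again sum to one for each nonterminal.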
Probability of a tree
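Under a PCFG the probability of a tree is the product of the probabilities of the rules it uses; a minimal sketch (toy rule probabilities, tree encoded as nested tuples; all names and numbers are illustrative):

```python
# Toy rule probabilities, keyed by (LHS, tuple-of-RHS-labels); made-up numbers.
P = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.7,
    ("VP", ("V", "NP")): 0.6,
    ("Det", ("the",)): 1.0,
    ("N", ("ball",)): 0.5,
    ("V", ("saw",)): 1.0,
}

def tree_prob(tree):
    """p(tree) = product over the rules used in the tree."""
    label, *children = tree
    if not children:                       # a terminal leaf contributes nothing
        return 1.0
    rhs = tuple(c[0] for c in children)
    p = P[(label, rhs)]                    # probability of the rule at this node
    for c in children:
        p *= tree_prob(c)
    return p

tree = ("S",
        ("NP", ("Det", ("the",)), ("N", ("ball",))),
        ("VP", ("V", ("saw",)),
               ("NP", ("Det", ("the",)), ("N", ("ball",)))))
```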
Most likely tree given string parsing is the search for the best tree t* such that: t* = argmax_t p(t | w) = argmax_t p(t) p(w | t) = argmax_{t: yield(t)=w} p(t) analogous to HMM decoding is it related to intersection or composition in FSTs?
CKY Algorithm goal item: (S, 0, n); input: w0 w1 ... wn-1 NAACL 2009 Dynamic Programming
CKY Algorithm flies like a flower S -> NP VP NP -> DT NN NP -> NNS NP -> NP PP VP -> VB NP VP -> VP PP VP -> VB PP -> P NP VB -> flies NNS -> flies VB -> like P -> like DT -> a NN -> flower
CKY Algorithm chart for "flies like a flower": leaf cells {NNS, VB, NP} (flies), {VB, P} (like), {DT} (a), {NN} (flower); span cells {NP} (a flower), {VP, PP} (like a flower); top cell {S, VP, NP} (grammar as on the previous slide)
CKY Example
Chomsky Normal Form wait! how can you assume a CFG is binary-branching? well, we can always convert a CFG into Chomsky Normal Form (CNF): A -> B C, A -> a how to deal with epsilon-removal? how to do it with a PCFG?
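The core binarization step of CNF conversion can be sketched as follows (the intermediate-symbol naming scheme is my own; unary and epsilon removal are left aside, as on the slide):

```python
# One step of CNF conversion: binarize any rule whose RHS is longer than 2
# by introducing fresh intermediate nonterminals.
def binarize(rules):
    """rules: list of (lhs, rhs_tuple); returns an equivalent binary rule list."""
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            new_nt = f"{lhs}~{'_'.join(rhs[1:])}"  # fresh intermediate symbol
            out.append((lhs, (rhs[0], new_nt)))    # peel off the first RHS symbol
            lhs, rhs = new_nt, rhs[1:]
        out.append((lhs, rhs))
    return out
```

For a PCFG the same transformation works with the original rule keeping its probability and every introduced rule getting probability 1, so tree probabilities are preserved.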
Parsing as Deduction
    (B, i, k) : a    (C, k, j) : b
    --------------------------------  (A -> B C)
    (A, i, j) : a × b × Pr(A -> B C)
Parsing as Intersection
    (B, i, k) : a    (C, k, j) : b
    --------------------------------  (A -> B C)
    (A, i, j) : a × b × Pr(A -> B C)
intersection between a CFG G and an FSA D: define L(G) to be the set of strings (i.e., yields) G generates; define L(G ∩ D) = L(G) ∩ L(D) what does this new language generate? what does the new grammar look like? what about CFG ∩ CFG?
Parsing as Composition
Packed Forests a compact representation of many parses by sharing common sub-derivations polynomial-space encoding of an exponentially large set nodes + hyperedges = a hypergraph 0 I 1 saw 2 him 3 with 4 a 5 mirror 6 (Klein and Manning, 2001; Huang and Chiang, 2005)
Lattice vs. Forest
Forest and Deduction (Nederhof, 2003) the antecedents (B, i, k) : a and (C, k, j) : b are the tails u1, u2 of a hyperedge e with weight function fe; the consequent (A, i, j) : fe(a, b) = a × b × Pr(A -> B C) is the head v
Related Formalisms a hypergraph viewed as an AND/OR graph: the head v is an OR-node, each hyperedge e is an AND-node, and its tails u1, u2 are OR-nodes
Viterbi Algorithm for DAGs 1. topological sort 2. visit each vertex v in sorted order and do updates: for each incoming edge (u, v) in E, use d(u) to update d(v): d(v) = d(v) ⊕ d(u) ⊗ w(u, v) key observation: d(u) is already fixed to its optimal value at this time time complexity: O(|V| + |E|)
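A minimal sketch of this DAG Viterbi in the (max, +) semiring, assuming vertices are already numbered in topological order (the edge encoding and function name are my own):

```python
# Edges are (u, v, w) triples with u < v, i.e. the vertex numbering is itself
# a topological order (an assumption of this sketch).
def viterbi_dag(n, edges, source=0):
    """Best-path (Viterbi) scores from source in a DAG, (max, +) semiring."""
    NEG = float("-inf")
    d = [NEG] * n
    d[source] = 0.0
    for u, v, w in sorted(edges, key=lambda e: e[1]):  # relax edges by head vertex
        if d[u] > NEG:                                 # d(u) is final by this point
            d[v] = max(d[v], d[u] + w)                 # d(v) <- d(v) (+) d(u) (x) w(u,v)
    return d
```

Sorting the edges by head vertex is just one way to realize "visit each vertex in sorted order"; the key property is that every d(u) is final before any edge out of u is relaxed.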
Viterbi Algorithm for DAHs 1. topological sort 2. visit each vertex v in sorted order and do updates: for each incoming hyperedge e = ((u1, ..., u|e|), v, fe), use the d(ui)'s to update d(v): d(v) = d(v) ⊕ fe(d(u1), ..., d(u|e|)) key observation: the d(ui)'s are already fixed to optimal at this time time complexity: O(|V| + |E|) (assuming constant arity)
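The hypergraph generalization can be sketched the same way; here the weight function f_e multiplies the tail scores by the hyperedge weight, as in a probability semiring (the data encoding is my own):

```python
# A hyperedge is ((u1, ..., uk), head, weight); f_e multiplies the tail scores
# by the hyperedge weight in this sketch.
def viterbi_hypergraph(order, hyperedges, axioms):
    """order: non-axiom vertices in topological order;
    axioms: dict vertex -> base score (the leaves)."""
    d = dict(axioms)
    for v in order:
        best = 0.0
        for tails, head, w in hyperedges:
            if head == v and all(t in d for t in tails):
                score = w
                for t in tails:
                    score *= d[t]          # f_e(d(u1), ..., d(uk))
                best = max(best, score)    # (+) is max in the Viterbi semiring
        d[v] = best
    return d
```

Scanning all hyperedges per vertex is quadratic and only for clarity; indexing hyperedges by head recovers the O(|V| + |E|) bound from the slide.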
Example: CKY Parsing parsing with CFGs in Chomsky Normal Form (CNF) typical instance of the generalized Viterbi for DAHs many variants of CKY ~ various topological orderings toward the goal item (S, 0, n): bottom-up, left-to-right, right-to-left O(n^3 |P|)
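A compact CKY Viterbi chart for the "flies like a flower" grammar from the earlier slide; the rule probabilities are made up for illustration, and the unary rules NP -> NNS and VP -> VB are applied at the leaves:

```python
from collections import defaultdict

# Toy probabilities; grammar follows the "flies like a flower" slide.
binary = {  # (B, C) -> list of (A, prob) for rules A -> B C
    ("NP", "VP"): [("S", 1.0)],
    ("DT", "NN"): [("NP", 0.4)],
    ("NP", "PP"): [("NP", 0.2)],
    ("VB", "NP"): [("VP", 0.5)],
    ("VP", "PP"): [("VP", 0.3)],
    ("VB", "PP"): [("VP", 0.2)],
    ("P", "NP"):  [("PP", 1.0)],
}
lexical = {  # word -> list of (tag, prob)
    "flies":  [("VB", 0.5), ("NNS", 0.5)],
    "like":   [("VB", 0.5), ("P", 0.5)],
    "a":      [("DT", 1.0)],
    "flower": [("NN", 1.0)],
}
unary = {"NNS": [("NP", 0.4)], "VB": [("VP", 0.3)]}  # NP -> NNS, VP -> VB

def cky(words):
    """Viterbi CKY: chart[i, j] maps labels to the best probability of a
    derivation of that label over words[i:j]."""
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):                  # fill the leaf cells
        for tag, p in lexical[w]:
            cell = chart[i, i + 1]
            cell[tag] = max(cell.get(tag, 0.0), p)
            for a, q in unary.get(tag, []):        # leaf-level unary rules
                cell[a] = max(cell.get(a, 0.0), p * q)
    for span in range(2, n + 1):                   # bottom-up over span sizes
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # try every split point
                for b, pb in chart[i, k].items():
                    for c, pc in chart[k, j].items():
                        for a, q in binary.get((b, c), []):
                            cell = chart[i, j]
                            cell[a] = max(cell.get(a, 0.0), pb * pc * q)
    return chart[0, n]
```

The three nested loops over span, start, and split point are exactly the O(n^3) in the complexity, with the grammar loop contributing the |P| factor.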
Parser/Tree Evaluation how would you evaluate the quality of output trees? need to define a similarity measure between trees for sequences, we used: same length: Hamming distance (e.g., POS tagging); varying length: edit distance (e.g., Japanese transliteration); varying length: precision/recall/F (e.g., word segmentation); varying length: crossing brackets (e.g., word segmentation) for trees, we use precision/recall/F and crossing brackets: the standard PARSEVAL metrics (implemented as evalb.py)
PARSEVAL comparing nodes ("brackets"): labelled (by default): (NP, 2, 5); or unlabelled: (2, 5) precision: how many predicted nodes are correct? recall: how many correct nodes are predicted? how would you fake precision or recall? F-score: F = 2pr/(p+r) other metrics: crossing brackets example: matched=6, predicted=7, gold=7, so precision=6/7, recall=6/7, F=6/7
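The PARSEVAL computation itself is tiny; a sketch (this is not the course's evalb.py, just the metric over bracket sets):

```python
# Labelled-bracket PARSEVAL: each node is a (label, i, j) bracket.
def parseval(predicted, gold):
    """predicted, gold: sets of (label, i, j) brackets; returns (p, r, F)."""
    matched = len(predicted & gold)
    p = matched / len(predicted) if predicted else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Dropping the label from each bracket gives the unlabelled variant; predicting few, safe brackets fakes precision, predicting many fakes recall, which is why F combines both.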
Inside-Outside Algorithm
Inside-Outside Algorithm the inside probability beta is easy to compute (CKY with max replaced by +) what is the outside probability alpha(X, i, j)? need to enumerate the ways to get to TOP from (X, i, j): (X, i, j) can be combined with other nodes on the left or right: L: sum_{Y -> Z X, k} alpha(Y, k, j) Pr(Y -> Z X) beta(Z, k, i) R: sum_{Y -> X Z, k} alpha(Y, i, k) Pr(Y -> X Z) beta(Z, j, k) why is beta used in alpha? very different from the forward-backward algorithm what is the likelihood of the sentence? beta(TOP, 0, n), or alpha(w_i, i, i+1) for any i
Inside-Outside Algorithm (picture: left case, spans 0..k..i..j..n with Y over (k, j) rewriting as Z over (k, i) and X over (i, j); right case, spans 0..i..j..k..n with Y over (i, k) rewriting as X over (i, j) and Z over (j, k)) L: sum_{Y -> Z X, k} alpha(Y, k, j) Pr(Y -> Z X) beta(Z, k, i) R: sum_{Y -> X Z, k} alpha(Y, i, k) Pr(Y -> X Z) beta(Z, j, k)
Inside-Outside Algorithm how do you do EM with the alphas and betas? easy; M-step: normalize the fractional counts the fractional count of rule (X, i, j) -> (Y, i, k) (Z, k, j) is alpha(X, i, j) Pr(X -> Y Z) beta(Y, i, k) beta(Z, k, j), divided by the sentence likelihood if we replace + by max, what will alpha/beta mean? beta': Viterbi inside: best way to derive (X, i, j) alpha': Viterbi outside: best way to get to TOP from (X, i, j) now what is alpha'(X, i, j) × beta'(X, i, j)? the best derivation that contains (X, i, j) (useful for pruning)
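The inside probabilities beta are exactly the CKY chart with max replaced by +; a sketch on a toy grammar (S -> S S [0.4], S -> a [0.6]; grammar and numbers are made up for illustration):

```python
from collections import defaultdict

binary = {("S", "S"): [("S", 0.4)]}   # rules A -> B C stored as (B, C) -> [(A, prob)]
lexical = {"a": [("S", 0.6)]}         # S -> a

def inside(words, binary, lexical, start="S"):
    """beta[X, i, j] = total probability that X derives words[i:j]
    (the CKY recurrence with max replaced by +)."""
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):                  # leaf cells
        for a, p in lexical.get(w, []):
            beta[a, i, i + 1] += p
    for span in range(2, n + 1):                   # bottom-up over span sizes
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # sum over split points
                for (b, c), rules in binary.items():
                    pb, pc = beta[b, i, k], beta[c, k, j]
                    if pb and pc:
                        for a, q in rules:
                            beta[a, i, j] += q * pb * pc
    return beta[start, 0, n]   # the likelihood of the sentence
```

The outside pass then runs top-down over the same chart using the L and R sums from the previous slides, and the two together give the fractional rule counts for EM.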
Translation as Parsing translation with SCFGs => monolingual parsing: parse the source input with the source projection, building the corresponding target sub-strings in parallel VP -> PP(1) VP(2), VP(2) PP(1) VP -> juxing le huitan, held a talk PP -> yu Shalong, with Sharon example: PP(1,3) spans "yu Shalong" ("with Sharon"), VP(3,6) spans "juxing le huitan" ("held a talk"), and they combine into VP(1,6): "held a talk with Sharon" complexity: same as CKY parsing, O(n^3)
Adding a Bigram Model with a bigram LM, each item must remember its target-side boundary words: e.g. PP(1,3): with ... Sharon and VP(3,6): held ... talk combine into VP(1,6): held ... Sharon (other candidate items include held ... meeting, hold ... talks, etc.) complexity: O(n^3 V^(4(m-1)))