Unit 2: Tree Models. CS 562: Empirical Methods in Natural Language Processing. Lectures 19-23: Context-Free Grammars and Parsing

CS 562: Empirical Methods in Natural Language Processing Unit 2: Tree Models Lectures 19-23: Context-Free Grammars and Parsing Oct-Nov 2009 Liang Huang (lhuang@isi.edu)

Big Picture
We have already covered:
- sequence models (WFSAs, WFSTs)
- decoding (Viterbi algorithm)
- supervised training (counting, smoothing)
- unsupervised training (EM using forward-backward)
In this unit we'll look beyond sequences, and cover:
- tree models (probabilistic context-free grammars and extensions)
- decoding ("parsing", CKY algorithm)
- supervised training (lexicalization, history-annotation, ...)
- unsupervised training (EM via inside-outside)

Course Project (see handout)
Important dates:
- Tue Nov 10: initial project proposal due
- Thu Nov 19: final project proposal due; data preparation and one-hour baseline finished
- Thu Dec 3: midway project presentations
- Wed Dec 16: final project writeups due at oana@isi.edu
Topic (see list of samples from previous years):
- must involve statistical processing of linguistic structures
- NO boring topics like text classification with bags of words
Amount of work: ~2 HWs for each student

Review of HW3
What's wrong with this epron-jpron.wfst?

100
(0 (AA "AA" *e* 1.0 ) )
(0 (D "D" *e* 1.0 ) )
(0 (F "F" *e* 1.0 ) )
(0 (HH "HH" *e* 1.0 ) )
(0 (G "G" *e* 1.0 ) )
(0 (B "B" *e* 1.0 ) )
...
(AA (100 *e* "O O" 0.012755102040816327) )
(AA (100 *e* "A" 0.4642857142857143) )
(AA (100 *e* "O" 0.49489795918367346) )
(AA (100 *e* "A A" 0.015306122448979591) )
...
(M (100 *e* "N" 0.08396946564885496) )
(M (100 *e* "M" 0.6793893129770993) )
(N (100 *e* "NN" 0.015942028985507246) )
( 100 ( 0 " " *e* 1.0 ) )
( 100 ( 0 *e* *e* 1.0 ) )

Limitations of Sequence Models
Can you write an FSA/FST for the following?
- { (a^n, b^n) }
- { (a^2n, b^n) }
- { a^n b^n }
- { w w^R }
- { (w, w^R) }
Does it matter to human languages?
- [The woman saw the boy [that heard the man [that left] ] ].
- [The claim [that the house [he bought] is valuable] is wrong].
But humans can't really process infinite recursion... stack overflow!

Let's Try to Write a Grammar... (courtesy of Julia Hockenmaier)
Let's take a closer look: we'll try our best to represent English with an FSA.
Basic sentence structure: N, V, N

Subject-Verb-Object
Compose it with a lexicon, and we get an HMM. So far so good.

(Recursive) Adjectives (courtesy of Julia Hockenmaier)
the ball / the big ball / the big, red ball / the big, red, heavy ball / ...
Then add adjectives, which modify nouns; the number of modifiers/adjuncts can be unlimited.
How about no determiner before the noun? ("play tennis")

Recursive PPs (courtesy of Julia Hockenmaier)
the ball / the ball in the garden / the ball in the garden behind the house / the ball in the garden behind the house near the school / ...
Recursion can be more complex, but we can still model it with FSAs!
So why bother to go beyond finite-state?

FSAs Can't Go Hierarchical! (courtesy of Julia Hockenmaier)
But sentences have a hierarchical structure, which is what lets us infer the meaning: we need not only strings, but also trees.
FSAs are flat, and can only do tail recursion (i.e., loops), but we need real (branching) recursion for language.

FSAs Can't Do Center Embedding (courtesy of Julia Hockenmaier)
- The mouse ate the corn.
- The mouse that the snake ate ate the corn.
- The mouse that the snake that the hawk ate ate ate the corn.
- ...
vs. The claim that the house he bought was valuable was wrong.
vs. I saw the ball in the garden behind the house near the school.
In theory, these infinite recursions are still grammatical: competence (grammatical knowledge).
In practice, studies show that English has a limit of about 3: performance (processing and memory limitations).
FSAs can model finite embeddings, but only very inconveniently.

How About Recursive FSAs?
Problems with FSAs: only tail recursion, no branching recursion; they can't represent hierarchical structures (trees) and can't generate center-embedded strings.
Is there a simple way to improve them? Recursive transition networks (RTNs):

  S:   0 --NP-->  1 --VP--> 2 (final)
  NP:  0 --Det--> 1 --N-->  2 (final)
  VP:  0 --V-->   1 --NP--> 2 (final)

Context-Free Grammars
S  → NP VP       VP → V NP
NP → Det N       VP → VP PP
NP → NP PP       PP → P NP
N → {ball, garden, house, sushi, ...}
P → {in, behind, with}
V → ...    Det → ...

Context-Free Grammars
A CFG is a 4-tuple ⟨N, Σ, R, S⟩:
- a set of nonterminals N (e.g. N = {S, NP, VP, PP, Noun, Verb, ...})
- a set of terminals Σ (e.g. Σ = {I, you, he, eat, drink, sushi, ball, ...})
- a set of rules R ⊆ {A → β : left-hand side (LHS) A ∈ N, right-hand side (RHS) β ∈ (N ∪ Σ)*}
- a start symbol S (sentence)
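
To make the definition concrete, here is a minimal sketch (Python, not part of the course materials) of a CFG as plain data, using the toy grammar from the previous slide; the V and Det lexical entries and the depth cutoff are assumptions added so the little generator runs and terminates.

import random

# A CFG as plain Python data: R maps each LHS nonterminal to a list of RHS tuples.
RULES = {
    "S":   [("NP", "VP")],
    "NP":  [("Det", "N"), ("NP", "PP")],
    "VP":  [("V", "NP"), ("VP", "PP")],
    "PP":  [("P", "NP")],
    "N":   [("ball",), ("garden",), ("house",), ("sushi",)],
    "P":   [("in",), ("behind",), ("with",)],
    "V":   [("saw",), ("hit",)],          # assumed lexicon, not from the slide
    "Det": [("the",), ("a",)],            # assumed lexicon, not from the slide
}
NONTERMINALS = set(RULES)   # N; every other symbol is a terminal; the start symbol is S

def generate(symbol="S", depth=0, max_depth=6):
    """Expand `symbol` top-down; beyond max_depth, always take the first
    (non-recursive) alternative so the recursion is guaranteed to stop."""
    if symbol not in NONTERMINALS:
        return [symbol]                   # terminal: emit the word
    alternatives = RULES[symbol]
    rhs = alternatives[0] if depth >= max_depth else random.choice(alternatives)
    return [w for sym in rhs for w in generate(sym, depth + 1, max_depth)]

print(" ".join(generate()))   # e.g. "the ball in a garden saw the sushi"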

Parse Trees
N → {sushi, tuna}    P → {with}    V → {eat}
NP → N       NP → NP PP      PP → P NP
VP → V NP    VP → VP PP

CFGs for Center-Embedding
- The mouse ate the corn.
- The mouse that the snake ate ate the corn.
- The mouse that the snake that the hawk ate ate ate the corn.
- ...
{ a^n b^n }   { w w^R }
Can you also do { a^n b^n c^n }? or { w w^R w }? { a^n b^n c^m d^m }?
What's the limitation of CFGs?
CFG for center-embedded clauses: S → NP ate NP;  NP → NP RC;  RC → that NP ate
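
A minimal sketch (Python, not the course code) that unfolds this center-embedding grammar; the base noun phrases and "the corn" object are assumptions added so the recursion bottoms out.

# Unfolding the CFG: S -> NP ate NP;  NP -> NP RC;  RC -> that NP ate.
NOUNS = ["the mouse", "the snake", "the hawk"]   # assumed base NPs

def np(depth, i=0):
    """An NP with `depth` levels of center embedding (NP -> NP RC; RC -> that NP ate)."""
    if depth == 0:
        return NOUNS[i % len(NOUNS)]
    return f"{NOUNS[i % len(NOUNS)]} that {np(depth - 1, i + 1)} ate"

for d in range(3):
    print(f"{np(d)} ate the corn.")
# the mouse ate the corn.
# the mouse that the snake ate ate the corn.
# the mouse that the snake that the hawk ate ate ate the corn.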

Review
Write a CFG for:
- { a^m b^n c^n d^m }
- { a^m b^n c^(3m+2n) }
- { a^m b^n c^m d^n }
- buffalo buffalo buffalo ...
Write an FST or synchronous CFG for:
- { (w, w^R) }
- { (a^n, b^n) }
HW3: including p(eprons) is wrong
HW4: use carmel to test your own code

Chomsky Hierarchy (this and the following slides are from CS 498 JH: Introduction to NLP, Fall 08)

Constituents, Heads, Dependents

Constituency Test: how about "there is" or "I do"?

Arguments and Adjuncts: arguments are obligatory

Arguments and Adjuncts: adjuncts are optional

Noun Phrases (NPs)

The NP Fragment

ADJPs and PPs

Verb Phrase (VP)

VPs redefined

Sentences

Sentence Redefined

Probabilistic CFG
Normalization: for each nonterminal A,  Σ_β p(A → β) = 1
What's the most likely tree? (In the finite-state world: what's the most likely string?)
Given a string w, what's the most likely tree for w? This is called parsing (like decoding).
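
A quick way to see the normalization constraint concretely: this minimal sketch (Python, not the course code; the rule probabilities are made up) stores a toy PCFG and checks that the probabilities of all rules with the same left-hand side sum to 1.

from collections import defaultdict

# A PCFG as a list of (LHS, RHS, probability) rules.
PCFG = [
    ("S",  ("NP", "VP"), 1.0),
    ("NP", ("Det", "N"), 0.7),
    ("NP", ("NP", "PP"), 0.3),
    ("VP", ("V", "NP"),  0.6),
    ("VP", ("VP", "PP"), 0.4),
    ("PP", ("P", "NP"),  1.0),
]

def check_normalization(rules, tol=1e-9):
    """For each LHS, verify that its rule probabilities sum to 1 (up to rounding)."""
    totals = defaultdict(float)
    for lhs, _, p in rules:
        totals[lhs] += p
    return {lhs: abs(total - 1.0) < tol for lhs, total in totals.items()}

print(check_normalization(PCFG))   # {'S': True, 'NP': True, 'VP': True, 'PP': True}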

Probability of a Tree
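
What the (omitted) figure illustrates is the standard definition: the probability of a tree is the product of the probabilities of the rules used in its derivation. A minimal sketch (Python, not the course code; the toy rule table and tree are made up, building on the sushi/tuna grammar above):

from math import prod   # Python 3.8+

# Trees are nested tuples (label, child1, child2, ...) with bare strings as leaves.
RULE_PROB = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("N",)):       0.7,
    ("NP", ("NP", "PP")): 0.3,
    ("VP", ("V", "NP")):  0.6,
    ("PP", ("P", "NP")):  1.0,
    ("N",  ("sushi",)):   0.5,
    ("N",  ("tuna",)):    0.5,
    ("V",  ("eat",)):     1.0,
    ("P",  ("with",)):    1.0,
}

def tree_prob(tree):
    """p(tree) = product of rule probabilities over all internal nodes."""
    if isinstance(tree, str):          # leaf (terminal): contributes nothing
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return RULE_PROB[(label, rhs)] * prod(tree_prob(c) for c in children)

t = ("S", ("NP", ("N", "sushi")), ("VP", ("V", "eat"), ("NP", ("N", "tuna"))))
print(tree_prob(t))   # ≈ 0.0735 = 1.0 · 0.7 · 0.5 · 0.6 · 1.0 · 0.7 · 0.5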

Most Likely Tree Given a String
Parsing is the search for the best tree t* such that:
  t* = argmax_t p(t | w) = argmax_t p(t) p(w | t) = argmax_{t : yield(t) = w} p(t)
This is analogous to HMM decoding.
Is it related to intersection or composition in FSTs?

CKY Algorithm
(Chart figure: the goal item (S, 0, n) spans the whole input w0 w1 ... wn-1.)

CKY Algorithm
Example input: "flies like a flower", with grammar:
  S  → NP VP      VP → VB NP     VB  → flies    DT → a
  NP → DT NN      VP → VP PP     NNS → flies    NN → flower
  NP → NNS        VP → VB        VB  → like
  NP → NP PP      PP → P NP      P   → like

CKY Algorithm
(Chart for "flies like a flower" completely filled in, using the same grammar as above; the cell spanning the whole sentence contains S, VP, and NP.)
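
For concreteness, here is a minimal CKY recognizer sketch (Python, not the course code) for the toy grammar on these slides; since the grammar has the unary rules NP → NNS and VP → VB, each cell gets a small unary-closure pass.

# Binary, unary, and lexical rules of the toy grammar, kept separate.
BINARY = [("S", "NP", "VP"), ("NP", "DT", "NN"), ("NP", "NP", "PP"),
          ("VP", "VB", "NP"), ("VP", "VP", "PP"), ("PP", "P", "NP")]
UNARY = [("NP", "NNS"), ("VP", "VB")]
LEXICAL = [("VB", "flies"), ("NNS", "flies"), ("VB", "like"), ("P", "like"),
           ("DT", "a"), ("NN", "flower")]

def close_unary(cell):
    """Repeatedly apply unary rules X -> Y inside one cell until nothing changes."""
    changed = True
    while changed:
        changed = False
        for X, Y in UNARY:
            if Y in cell and X not in cell:
                cell.add(X)
                changed = True

def cky_recognize(words):
    """chart[i][j] holds the nonterminals that can derive words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                  # lexical rules
        chart[i][i + 1] = {X for X, word in LEXICAL if word == w}
        close_unary(chart[i][i + 1])
    for span in range(2, n + 1):                   # binary rules, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for X, Y, Z in BINARY:
                    if Y in chart[i][k] and Z in chart[k][j]:
                        chart[i][j].add(X)
            close_unary(chart[i][j])
    return chart

words = "flies like a flower".split()
chart = cky_recognize(words)
print(chart[0][len(words)])   # contains S, VP, and NP: the sentence is grammatical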

CKY Example

Chomsky Normal Form
Wait! How can you assume a CFG is binary-branching?
Well, we can always convert a CFG into Chomsky Normal Form (CNF):
  A → B C
  A → a
How do we deal with epsilon-removal? How do we do it with a PCFG?
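
One piece of the CNF conversion, binarization of long right-hand sides, is easy to sketch (Python, not the course code; the @-prefixed nonterminal names are an arbitrary convention). For a PCFG, the original rule's probability would go on the first new rule and the introduced rules would get probability 1; epsilon and unary removal are not shown.

def binarize(rules):
    """rules: list of (lhs, rhs_tuple). Returns an equivalent binary-branching list,
    introducing a fresh nonterminal for each long tail."""
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            # replace A -> X1 X2 ... Xn by A -> X1 @A_X2_..._Xn, recursing on the tail
            new_nt = f"@{lhs}_{'_'.join(rhs[1:])}"
            out.append((lhs, (rhs[0], new_nt)))
            lhs, rhs = new_nt, rhs[1:]
        out.append((lhs, rhs))
    return out

print(binarize([("VP", ("VB", "NP", "PP", "PP"))]))
# [('VP', ('VB', '@VP_NP_PP_PP')), ('@VP_NP_PP_PP', ('NP', ...)), ...]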

Parsing as Deduction

    (B, i, k) : a     (C, k, j) : b
    -------------------------------   (A → B C)
    (A, i, j) : a · b · Pr(A → B C)

Parsing as Intersection

    (B, i, k) : a     (C, k, j) : b
    -------------------------------   (A → B C)
    (A, i, j) : a · b · Pr(A → B C)

Intersection between a CFG G and an FSA D:
- define L(G) to be the set of strings (i.e., yields) G generates
- define L(G ∩ D) = L(G) ∩ L(D)
- what does this new language generate? what does the new grammar look like?
- what about CFG ∩ CFG?

Parsing as Composition

Packed Forests
A packed forest is a compact representation of many parses that shares common sub-derivations: a polynomial-space encoding of an exponentially large set.
It consists of nodes and hyperedges, i.e., a hypergraph. Example sentence: 0 I 1 saw 2 him 3 with 4 a 5 mirror 6
(Klein and Manning, 2001; Huang and Chiang, 2005)

Lattice vs. Forest

Forest and Deduction (Nederhof, 2003)
A deduction step corresponds to a hyperedge:
- the antecedents (B, i, k) : a and (C, k, j) : b correspond to the tail nodes u1 : a and u2 : b
- the consequent (A, i, j) corresponds to the head node v, with value v : fe(a, b)
- the rule A → B C (with weight Pr(A → B C)) corresponds to the hyperedge e with weight function fe

Related Formalisms
(Figure: a hypergraph viewed as an AND/OR graph; the node v is an OR-node, the hyperedge e is an AND-node, and the tail nodes u1, u2 are OR-nodes.)

Viterbi Algorithm for DAGs
1. topological sort
2. visit each vertex v in sorted order and do updates: for each incoming edge (u, v) in E, use d(u) to update d(v):  d(v) ⊕= d(u) ⊗ w(u, v)
Key observation: d(u) is already fixed to its optimal value at this time.
Time complexity: O(|V| + |E|)
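
A minimal sketch of this procedure (Python, not the course code), using max/product as the semiring and assuming the vertices are already numbered in a topological order; the tiny graph and weights are made up.

from collections import defaultdict

def viterbi_dag(num_vertices, edges, source=0):
    """Best (max-product) path scores from `source` in a DAG.
    edges: list of (u, v, w); vertices 0..num_vertices-1 are assumed to be in
    topological order already, so a plain left-to-right sweep suffices."""
    incoming = defaultdict(list)
    for u, v, w in edges:
        incoming[v].append((u, w))
    d = [0.0] * num_vertices
    d[source] = 1.0
    for v in range(num_vertices):        # visit in topological order
        for u, w in incoming[v]:         # relax each incoming edge
            d[v] = max(d[v], d[u] * w)   # d(u) is already optimal here
    return d

edges = [(0, 1, 0.5), (0, 2, 0.4), (1, 3, 0.9), (2, 3, 0.8)]
print(viterbi_dag(4, edges))   # [1.0, 0.5, 0.4, 0.45]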

Viterbi Algorithm for DAHs
1. topological sort
2. visit each vertex v in sorted order and do updates: for each incoming hyperedge e = ((u1, ..., u|e|), v, fe), use the d(ui)'s to update d(v):  d(v) ⊕= fe(d(u1), ..., d(u|e|))
Key observation: the d(ui)'s are already fixed to optimal at this time.
Time complexity: O(|V| + |E|) (assuming constant arity)
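
The same idea, one level up: a minimal sketch (Python, not the course code) of Viterbi over an acyclic hypergraph, with a tiny CKY-flavoured example whose items and weights are made up.

from collections import defaultdict

def viterbi_hypergraph(order, hyperedges, axioms):
    """order: nodes in topological order; hyperedges: (tails, head, weight);
    axioms: dict node -> initial value. Each hyperedge combines its tail values
    by product with its weight; a node takes the max over incoming hyperedges."""
    incoming = defaultdict(list)
    for tails, head, w in hyperedges:
        incoming[head].append((tails, w))
    d = defaultdict(float, axioms)
    for v in order:
        for tails, w in incoming[v]:
            score = w
            for u in tails:              # d(u) already optimal for each tail
                score *= d[u]
            d[v] = max(d[v], score)
    return dict(d)

# Toy CKY-style instance: two leaf items and one binary rule building (S, 0, 2).
axioms = {("NP", 0, 1): 0.6, ("VP", 1, 2): 0.5}
edges = [((("NP", 0, 1), ("VP", 1, 2)), ("S", 0, 2), 1.0)]
order = [("NP", 0, 1), ("VP", 1, 2), ("S", 0, 2)]
print(viterbi_hypergraph(order, edges, axioms)[("S", 0, 2)])   # 0.3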

Example: CKY Parsing
Parsing with CFGs in Chomsky Normal Form (CNF) is a typical instance of the generalized Viterbi algorithm for DAHs.
The many variants of CKY correspond to different topological orderings of the items, all ending at the goal item (S, 0, n): bottom-up, left-to-right, right-to-left, ...
Complexity: O(n^3 |P|)

Parser/Tree Evaluation
How would you evaluate the quality of output trees? We need to define a similarity measure between trees.
For sequences, we used:
- same length: Hamming distance (e.g., POS tagging)
- varying length: edit distance (e.g., Japanese transliteration)
- varying length: precision/recall/F (e.g., word segmentation)
- varying length: crossing brackets (e.g., word segmentation)
For trees, we use precision/recall/F and crossing brackets: the standard PARSEVAL metrics (implemented as evalb.py).

PARSEVAL
Comparing nodes ("brackets"): labelled (by default), e.g. (NP, 2, 5); or unlabelled, e.g. (2, 5).
- precision: how many predicted nodes are correct?
- recall: how many correct nodes are predicted?
- how to fake precision or recall?
- F-score: F = 2pr / (p + r)
- other metrics: crossing brackets
Example from the slide: matched = 6, predicted = 7, gold = 7, so precision = 6/7, recall = 6/7, F = 6/7.
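
A minimal sketch of labelled-bracket precision/recall/F (Python; this is not evalb.py, and unlike the real metric it also counts preterminal brackets and skips evalb's special cases). The toy gold/predicted trees below, a PP-attachment ambiguity, are made up.

def brackets(tree, start=0):
    """Return (set of (label, i, j) brackets, span length).
    A tree is a nested tuple (label, child1, ...) with strings as leaves."""
    if isinstance(tree, str):
        return set(), 1
    label, *children = tree
    spans, pos = set(), start
    for child in children:
        child_spans, length = brackets(child, pos)
        spans |= child_spans
        pos += length
    spans.add((label, start, pos))
    return spans, pos - start

def parseval(predicted, gold):
    p_spans, _ = brackets(predicted)
    g_spans, _ = brackets(gold)
    matched = len(p_spans & g_spans)
    prec, rec = matched / len(p_spans), matched / len(g_spans)
    return prec, rec, 2 * prec * rec / (prec + rec)

gold = ("S", ("NP", "I"), ("VP", ("V", "saw"), ("NP", ("NP", "him"),
        ("PP", ("P", "with"), ("NP", "a", "mirror")))))
pred = ("S", ("NP", "I"), ("VP", ("VP", ("V", "saw"), ("NP", "him")),
        ("PP", ("P", "with"), ("NP", "a", "mirror"))))
print(parseval(pred, gold))   # precision = recall = F = 8/9 on this toy pair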

Inside-Outside Algorithm
The inside probability beta is easy to compute (CKY with max replaced by +).
What is the outside probability alpha(X, i, j)? We need to enumerate the ways to get to TOP from (X, i, j): (X, i, j) can be combined with other nodes on its left or on its right:
  left:   Σ_{Y → Z X} Σ_k  alpha(Y, k, j) · Pr(Y → Z X) · beta(Z, k, i)
  right:  Σ_{Y → X Z} Σ_k  alpha(Y, i, k) · Pr(Y → X Z) · beta(Z, j, k)
Why is beta used inside alpha? (Very different from the forward-backward algorithm.)
What is the likelihood of the sentence? beta(TOP, 0, n), or alpha(w_i, i, i+1) for any i.

Inside-Outside Algorithm
(Figure: the two cases for computing alpha(X, i, j). Left case: X is the right child of Y, which spans (k, j) with sibling Z over (k, i). Right case: X is the left child of Y, which spans (i, k) with sibling Z over (j, k).)
  left:   Σ_{Y → Z X} Σ_k  alpha(Y, k, j) · Pr(Y → Z X) · beta(Z, k, i)
  right:  Σ_{Y → X Z} Σ_k  alpha(Y, i, k) · Pr(Y → X Z) · beta(Z, j, k)

Inside-Outside Algorithm
How do you do EM with the alphas and betas? Easy: the M-step just renormalizes the fractional counts for each LHS.
The fractional count of a rule instance (X, i, j) → (Y, i, k) (Z, k, j) is alpha(X, i, j) · Pr(X → Y Z) · beta(Y, i, k) · beta(Z, k, j) (divided by the sentence probability).
If we replace + by max, what do alpha and beta mean?
- beta: Viterbi inside, the best way to derive (X, i, j)
- alpha: Viterbi outside, the best way to get to TOP from (X, i, j)
Now what is alpha(X, i, j) · beta(X, i, j)? The best derivation that contains (X, i, j) (useful for pruning).
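
Putting the pieces together, a minimal sketch (Python, not the course code) of the inside and outside passes for a PCFG in CNF, plus the fractional rule counts used in the E-step; the toy grammar, probabilities, and sentence are made up.

from collections import defaultdict

BINARY = {("S", "NP", "VP"): 1.0, ("NP", "NP", "PP"): 0.3,
          ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3, ("PP", "P", "NP"): 1.0}
LEXICAL = {("NP", "I"): 0.3, ("NP", "sushi"): 0.2, ("NP", "tuna"): 0.2,
           ("V", "eat"): 1.0, ("P", "with"): 1.0}

def inside_outside(words, top="S"):
    n = len(words)
    beta, alpha = defaultdict(float), defaultdict(float)
    for i, w in enumerate(words):                        # inside: lexical items
        for (X, word), p in LEXICAL.items():
            if word == w:
                beta[X, i, i + 1] += p
    for span in range(2, n + 1):                         # inside: bottom-up
        for i in range(n - span + 1):
            j = i + span
            for (X, Y, Z), p in BINARY.items():
                for k in range(i + 1, j):
                    beta[X, i, j] += p * beta[Y, i, k] * beta[Z, k, j]
    alpha[top, 0, n] = 1.0                               # outside: top-down
    for span in range(n, 1, -1):
        for i in range(n - span + 1):
            j = i + span
            for (X, Y, Z), p in BINARY.items():
                for k in range(i + 1, j):
                    alpha[Y, i, k] += alpha[X, i, j] * p * beta[Z, k, j]
                    alpha[Z, k, j] += alpha[X, i, j] * p * beta[Y, i, k]
    # E-step: fractional count of a rule = sum over spans of
    # alpha(X,i,j) * Pr(X -> Y Z) * beta(Y,i,k) * beta(Z,k,j) / beta(TOP,0,n)
    sent_prob = beta[top, 0, n]
    counts = defaultdict(float)
    for (X, Y, Z), p in BINARY.items():
        for i in range(n):
            for j in range(i + 2, n + 1):
                for k in range(i + 1, j):
                    counts[X, Y, Z] += alpha[X, i, j] * p * beta[Y, i, k] * beta[Z, k, j] / sent_prob
    return sent_prob, counts

prob, counts = inside_outside("I eat sushi with tuna".split())
print(prob)                                                # ≈ 0.00504
print(counts["VP", "VP", "PP"], counts["NP", "NP", "PP"])  # ≈ 0.5 and ≈ 0.5: equally likely attachments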

Translation as Parsing
Translation with SCFGs reduces to monolingual parsing: parse the source input with the source projection of the grammar, and build the corresponding target sub-strings in parallel.
Example rules:
  VP → PP(1) VP(2) , VP(2) PP(1)
  VP → juxing le huitan , held a talk
  PP → yu Shalong , with Sharon
In the example, PP[1,3] "yu Shalong" → "with Sharon" and VP[3,6] "juxing le huitan" → "held a talk" combine into VP[1,6] → "held a talk with Sharon".
Complexity: same as CKY parsing, O(n^3).
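
The target-side reordering step is simple enough to sketch (Python, not the course code; the helper name and data layout are made up): given target strings already built for the source sub-constituents, the parent's target string is obtained by permuting them according to the indices on the rule's target side.

def apply_scfg_rule(target_rhs, child_targets):
    """target_rhs: list of 1-based indices into the source RHS and/or terminal
    strings; child_targets: target strings already built for the source children."""
    out = []
    for sym in target_rhs:
        out.append(child_targets[sym - 1] if isinstance(sym, int) else sym)
    return " ".join(out)

# Rule from the slide: VP -> PP(1) VP(2) , VP(2) PP(1)
# PP(1) "yu Shalong" -> "with Sharon";  VP(2) "juxing le huitan" -> "held a talk"
print(apply_scfg_rule([2, 1], ["with Sharon", "held a talk"]))
# -> "held a talk with Sharon"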

Adding a Bigram Model
(Figure: each chart item now also records its boundary words for the language model, e.g. VP[3,6] carries hypotheses "held ... talk", "held ... meeting", "hold ... talks"; PP[1,3] carries "with ... Sharon", "with ... Shalong", "along ... Sharon"; combining them gives VP[1,6] hypotheses such as "held ... talk", bigram-scored against "with ... Sharon".)
Complexity: O(n^3 |V|^{4(m-1)})