Chapter 14 (Partially) Unsupervised Parsing


The linguistically-motivated tree transformations we discussed previously are very effective, but when we move to a new language, we may have to come up with new ones. It would be nice if we could automatically discover these transformations. Suppose that we have a grammar defined over nonterminals of the form X[q], where X is a nonterminal from the training data (e.g., NP) and q is a number between 1 and k (for simplicity, let's say k = 2). We only observe trees over nonterminals X, but need to learn weights for our grammar. We can do this using EM (Matsuzaki, Miyao, and Tsujii, 2005; Petrov et al., 2006).

14.1 Hard EM on trees

Suppose our first training example is the following tree:

(S (NP (DT the) (NN cat)) (VP (VBD saw) (NP (DT the) (NN dog))))

And suppose that our initial grammar is (each rule is followed by its probability):

Lexical rules:
  DT[1] → the     1
  DT[2] → the     1
  NN[1] → cat     0.2
  NN[1] → dog     0.8
  NN[2] → cat     0.7
  NN[2] → dog     0.3
  VBD[1] → saw    1
  VBD[2] → saw    1

S rules:
  S[1] → NP[1] VP[1]    0.2
  S[1] → NP[1] VP[2]    0.4
  S[1] → NP[2] VP[1]    0.1
  S[1] → NP[2] VP[2]    0.3
  S[2] → NP[1] VP[1]    0.5
  S[2] → NP[1] VP[2]    0.1
  S[2] → NP[2] VP[1]    0.2
  S[2] → NP[2] VP[2]    0.2

NP rules:
  NP[1] → DT[1] NN[1]   0.2
  NP[1] → DT[1] NN[2]   0.4
  NP[1] → DT[2] NN[1]   0.1
  NP[1] → DT[2] NN[2]   0.3
  NP[2] → DT[1] NN[1]   0.5
  NP[2] → DT[1] NN[2]   0.1
  NP[2] → DT[2] NN[1]   0.2
  NP[2] → DT[2] NN[2]   0.2

VP rules:
  VP[1] → VBD[1] NP[1]  0.2
  VP[1] → VBD[1] NP[2]  0.4
  VP[1] → VBD[2] NP[1]  0.1
  VP[1] → VBD[2] NP[2]  0.3
  VP[2] → VBD[1] NP[1]  0.5
  VP[2] → VBD[1] NP[2]  0.1
  VP[2] → VBD[2] NP[1]  0.2
  VP[2] → VBD[2] NP[2]  0.2

If we want to do hard EM, we need to do:

E step: find the highest-weight derivation of the grammar that matches the observed tree (modulo the annotations [q]).

M step: re-estimate the weights of the grammar by counting the rules used in the derivations found in the E step and normalizing.

The M step is easy. The E step is essentially Viterbi CKY, only easier because we're given a tree instead of just a string. The chart for this algorithm looks like the following, where each cell works exactly like the cells in CKY:

[Partially filled chart over "the cat saw the dog": the cell over the first "the" contains DT[1] 1 and DT[2] 1; the cell over "cat" contains NN[1] 0.2 and NN[2] 0.7; the cell spanning "the cat" contains NP[1] 0.28 and NP[2] 0.14; the remaining cells are blank.]

Can you fill in the rest?
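
To make the E step concrete, here is a minimal Python sketch of Viterbi over the latent annotations on a fixed observed tree. It is an illustration only: the tree encoding, the rule dictionaries, and the function name best are assumptions, not anything defined in the text, and a full implementation would also keep back-pointers so the M step can count the rules in the winning derivation.

```python
# A minimal sketch of the hard-EM E step on one observed tree: Viterbi
# search over the latent annotations [q], with the tree shape and the
# unannotated labels held fixed. Data structures and names are
# illustrative assumptions, not code from the text.

# Observed tree: (label, [children]) for internal nodes, (tag, word) for leaves.
tree = ('S',
        [('NP', [('DT', 'the'), ('NN', 'cat')]),
         ('VP', [('VBD', 'saw'),
                 ('NP', [('DT', 'the'), ('NN', 'dog')])])])

K = 2  # number of latent annotations per nonterminal

# Annotated rules and their probabilities (only the NP rules are filled in here).
lex = {('DT', 1, 'the'): 1.0, ('DT', 2, 'the'): 1.0,
       ('NN', 1, 'cat'): 0.2, ('NN', 2, 'cat'): 0.7,
       ('NN', 1, 'dog'): 0.8, ('NN', 2, 'dog'): 0.3,
       ('VBD', 1, 'saw'): 1.0, ('VBD', 2, 'saw'): 1.0}
bin_rules = {('NP', 1, 'DT', 1, 'NN', 1): 0.2, ('NP', 1, 'DT', 1, 'NN', 2): 0.4,
             ('NP', 1, 'DT', 2, 'NN', 1): 0.1, ('NP', 1, 'DT', 2, 'NN', 2): 0.3,
             ('NP', 2, 'DT', 1, 'NN', 1): 0.5, ('NP', 2, 'DT', 1, 'NN', 2): 0.1,
             ('NP', 2, 'DT', 2, 'NN', 1): 0.2, ('NP', 2, 'DT', 2, 'NN', 2): 0.2,
             # ... the S and VP rules from the table above go here ...
             }

def best(node):
    """Map each annotation q to the weight of the best annotated derivation
    of this subtree whose root is annotated with q."""
    label, rest = node
    if isinstance(rest, str):                      # leaf: (tag, word)
        return {q: lex.get((label, q, rest), 0.0) for q in range(1, K + 1)}
    left, right = rest                             # binary internal node
    bl, br = best(left), best(right)
    return {q: max(bin_rules.get((label, q, left[0], q1, right[0], q2), 0.0)
                   * bl[q1] * br[q2]
                   for q1 in range(1, K + 1) for q2 in range(1, K + 1))
            for q in range(1, K + 1)}

# The NP cell over "the cat" comes out as in the chart above:
print(best(('NP', [('DT', 'the'), ('NN', 'cat')])))   # approx {1: 0.28, 2: 0.14}
```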

14.2 Hard EM on strings

Another scenario is that we're given only strings instead of trees, and we have some grammar that we want to learn weights for. For example, we might train a grammar on the Wall Street Journal portion of the Penn Treebank but want to learn a parser for Twitter.

This kind of domain adaptation problem is often solved using something similar to hard EM. Extract the grammar rules and initialize the weights by training on the labeled training data (in our example, the WSJ portion of the Treebank). Then:

E step: find the highest-weight tree of the grammar for each string in the unlabeled training data (in our example, the Twitter data).

M step: re-estimate the weights of the grammar.

The E/M steps can be repeated, though in practice one might find that a single iteration works best.

The following sections, all optional, present the machinery that we need in order to do real EM on either strings or trees. First, we present a more abstract view of parsing as intersection between a CFG and a finite-state automaton (analogous to how we viewed decoding with FSTs as intersection). Then, we show how to compute expected counts of rules using the Inside/Outside algorithm, which is analogous to computing expected counts of transitions using the Forward/Backward algorithm.

14.3 Parsing as intersection (optional)

Previously we showed how to find the best tree for w. But suppose we don't just want the best tree, but all possible trees. This is, again, an easy modification: instead of keeping the best back-pointer in each back[i, j][X], we keep a set of all the back-pointers.

Require: string w = w_1 ... w_n and grammar G = (N, Σ, R, S)
Ensure: G′ generates all parses of w
  initialize chart[i, j][X] ← ∅ for all 0 ≤ i < j ≤ n, X ∈ N
  for all i ← 1, ..., n and (X →p w_i) ∈ R do
      chart[i−1, i][X] ← chart[i−1, i][X] ∪ {X_{i−1,i} →p w_i}
  end for
  for ℓ ← 2, ..., n do
      for i ← 0, ..., n − ℓ do
          j ← i + ℓ
          for k ← i + 1, ..., j − 1 do
              for all (X →p Y Z) ∈ R do
                  if chart[i, k][Y] ≠ ∅ and chart[k, j][Z] ≠ ∅ then
                      chart[i, j][X] ← chart[i, j][X] ∪ {X_{i,j} →p Y_{i,k} Z_{k,j}}
                  end if
              end for
          end for
      end for
  end for
  G′ ← ⋃_{0 ≤ i < j ≤ n, X ∈ N} chart[i, j][X]

The grammar G′ generates all and only the parse trees of w. It is usually called a packed forest. It is a forest because it represents a set of trees, and it is packed because it is only polynomial-sized yet represents a possibly exponential set of trees. (It is more commonly represented as a data structure called a hypergraph rather than as a CFG. But since the two representations are equivalent, we will stick with a CFG.)
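
The following Python sketch mirrors this construction under assumed data structures (rule tuples and a dictionary-of-sets chart); it is meant as an illustration of the idea, not a definitive implementation.

```python
from collections import defaultdict

# Sketch of CKY that keeps *all* back-pointers, yielding a packed forest.
# Assumed rule formats: lexical (X, word, p) and binary (X, Y, Z, p).
# Forest rules are ((X, i, j), right_hand_side, p).

def parse_forest(words, lex_rules, bin_rules):
    n = len(words)
    chart = defaultdict(set)            # chart[(i, j, X)] = set of forest rules
    for i, w in enumerate(words, start=1):
        for X, word, p in lex_rules:
            if word == w:
                chart[(i - 1, i, X)].add(((X, i - 1, i), (w,), p))
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for X, Y, Z, p in bin_rules:
                    if chart[(i, k, Y)] and chart[(k, j, Z)]:
                        chart[(i, j, X)].add(
                            ((X, i, j), ((Y, i, k), (Z, k, j)), p))
    # The packed forest is the union of all cells; its nonterminals are
    # (X, i, j) triples and its start symbol is (S, 0, n).
    return set().union(*chart.values())

# Usage with a toy unannotated grammar (names assumed for illustration):
lex_rules = [('DT', 'the', 1.0), ('NN', 'cat', 1.0), ('NN', 'dog', 1.0),
             ('VBD', 'saw', 1.0)]
bin_rules = [('NP', 'DT', 'NN', 1.0), ('VP', 'VBD', 'NP', 1.0),
             ('S', 'NP', 'VP', 1.0)]
for rule in sorted(parse_forest('the cat saw the dog'.split(),
                                lex_rules, bin_rules), key=str):
    print(rule)
```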

The packed forest can be thought of as a CFG that generates the language L(G) ∩ {w}. More generally, a CFG can be intersected with any finite-state automaton to produce another CFG. Why would we want to do this? For example, consider the following fragment G of our grammar (??) from above (NP is the start symbol):

NP → DT NN      (14.1)
DT → a          (14.2)
DT → an         (14.3)
NN → arrow      (14.4)
NN → banana     (14.5)

This generates the ungrammatical strings

a arrow         (14.6)
an banana       (14.7)

How do we tell a computer when to say a and when to say an? The most natural way is to use an FSA like the following FSA M (14.8):

[FSA diagram: state q0 is both initial and final, with self-loops labeled arrow and banana; an arc labeled a from q0 to qa and an arc labeled banana from qa back to q0; an arc labeled an from q0 to qan and an arc labeled arrow from qan back to q0.]

So we need to combine G and M into a single thing (another CFG) that generates the intersection of L(G) and L(M). To do this, we can use a construction due to Bar-Hillel (Bar-Hillel, Perles, and Shamir, 1961), which we illustrate by example. Assume that M has no ε-transitions.

The basic idea is to modify the CFG so that it simulates the action of M. Every nonterminal symbol A gets annotated with two states, becoming A_{q,r}. State q says where the FSA could be before reading the yield of the symbol, and r says where the FSA would end up after reading the yield of the symbol. We consider each rule of the grammar one by one.

First, consider the production DT → a. Suppose that M is in state q0. After reading the yield of this rule, it will end up in state qa. So we make a new rule,

DT_{q0,qa} → a

But what if M starts in state qa or state qan? In that case, M would reject the string, so we don't make any new rules for these cases. By similar reasoning, we make rules

DT_{q0,qan} → an
NN_{q0,q0} → arrow
NN_{qan,q0} → arrow
NN_{q0,q0} → banana
NN_{qa,q0} → banana

Now consider the production NP → DT NN. Imagine that M is in state q0 and reads the yield of DT. What state will it be in after? Since we don't know, we consider all possibilities. Suppose that it moves to state qa, and suppose that after reading the yield of NN, it ends up in state qan. To capture this case, we make a new rule,

NP_{q0,qan} → DT_{q0,qa} NN_{qa,qan}

We do this for every possible sequence of states:

NP_{q0,q0} → DT_{q0,q0} NN_{q0,q0}
NP_{q0,qa} → DT_{q0,q0} NN_{q0,qa}
NP_{q0,qan} → DT_{q0,q0} NN_{q0,qan}
NP_{q0,q0} → DT_{q0,qa} NN_{qa,q0}
⋮

(We didn't promise that this construction would be the most efficient one!) Finally, we create a new start symbol, S, and a rule

S → NP_{q0,q0}

because q0 is both the initial state and a final state. If M had more than one final state, we would make a new rule for each.

Question 24. How big is the resulting grammar, as a function of the number of rules in G and states in M?

But not all the rules generated are actually used. For example, it's impossible to get from q0 to q0 while reading a DT, so any rule that involves DT_{q0,q0} can be deleted. We can build a more compact grammar as follows. Consider again the production NP → DT NN. Imagine that M is in state q0 and reads the yield of DT. What state will it be in after? Instead of considering all possibilities, we only look at the annotated nonterminals that have been created already. We have created DT_{q0,qa}, so M could be in state qa. Then, after reading NN, it could be in state q0 (because we have created NN_{qa,q0}). So we make a new rule,

NP_{q0,q0} → DT_{q0,qa} NN_{qa,q0}

and, by similar reasoning,

NP_{q0,q0} → DT_{q0,qan} NN_{qan,q0}

but no others. We do this for all the rules in the grammar. But after we are done, we must go back and process all the rules again, because the set of annotated nonterminals may have changed. Only when the set of annotated nonterminals converges can we stop. In this case, we are done. The resulting grammar is:

NP_{q0,q0} → DT_{q0,qa} NN_{qa,q0}
NP_{q0,q0} → DT_{q0,qan} NN_{qan,q0}
DT_{q0,qa} → a
DT_{q0,qan} → an
NN_{q0,q0} → arrow
NN_{qan,q0} → arrow
NN_{q0,q0} → banana
NN_{qa,q0} → banana

Question 25. Extend the FSA M to include all the vocabulary from grammar (??) and find their intersection.
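
Below is a small Python sketch of the naive construction described above, the one that enumerates every possible sequence of states; the FSA encoding, the rule tuples, and all variable names are assumptions made for illustration.

```python
from itertools import product

# Sketch of the naive Bar-Hillel construction for the example above:
# annotate each nonterminal with a (before, after) pair of FSA states.

states = ['q0', 'qa', 'qan']
trans = {('q0', 'a'): 'qa', ('q0', 'an'): 'qan',
         ('q0', 'arrow'): 'q0', ('q0', 'banana'): 'q0',
         ('qa', 'banana'): 'q0', ('qan', 'arrow'): 'q0'}
initial, finals = 'q0', {'q0'}

bin_rules = [('NP', 'DT', 'NN')]
lex_rules = [('DT', 'a'), ('DT', 'an'), ('NN', 'arrow'), ('NN', 'banana')]
start = 'NP'

new_rules = []
# Lexical rules: X_{q,r} -> word only if M can read word from q to r.
for X, word in lex_rules:
    for q in states:
        r = trans.get((q, word))
        if r is not None:
            new_rules.append(((X, q, r), word))
# Binary rules: X_{q,s} -> Y_{q,r} Z_{r,s} for *every* choice of q, r, s.
for X, Y, Z in bin_rules:
    for q, r, s in product(states, repeat=3):
        new_rules.append(((X, q, s), ((Y, q, r), (Z, r, s))))
# New start rules: one per final state, since q0 is the initial state.
for f in finals:
    new_rules.append(('S_new', ((start, initial, f),)))

for rule in new_rules:
    print(rule)
```

As the text notes, many of these rules (for example, anything mentioning DT annotated with (q0, q0)) can never be used; the compact construction avoids creating them by only combining annotated nonterminals that already exist, repeating until the set stops growing.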

14.4 Inside/outside probabilities (optional)

Recall that when training a weighted FSA using EM, the key algorithmic step was calculating forward and backward probabilities. The forward probability of a state q is the total weight of all paths from the start state to q, and the backward probability is the total weight of all paths from q to any final state. We can define a similar concept for a node (nonterminal symbol) in a forest.

The inside probability of a node X_{i,j} is the total weight of all derivations X_{i,j} ⇒* w, where w is any string of terminal symbols. That is, it is the total weight of all subtrees derivable by X_{i,j}.

[Figure 14.1: The inside probability of X_{i,j} is the total weight of all subtrees rooted in X_{i,j} (gray), and the outside probability is the total weight of all tree fragments with a broken-off node X_{i,j} (white).]
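
To make the definition concrete, here is a minimal Python sketch of the inside computation (assumed rule formats and names, continuing the style of the earlier sketches): the same CKY loop as before, but summing rule probability times the children's inside probabilities instead of taking a maximum.

```python
from collections import defaultdict

def inside_probs(words, lex_rules, bin_rules):
    """inside[(i, j, X)] = total weight of all derivations of words[i:j] from X.
    Assumed rule formats: lexical (X, word, p), binary (X, Y, Z, p)."""
    n = len(words)
    inside = defaultdict(float)
    for i, w in enumerate(words):
        for X, word, p in lex_rules:
            if word == w:
                inside[(i, i + 1, X)] += p
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for X, Y, Z, p in bin_rules:
                    inside[(i, j, X)] += p * inside[(i, k, Y)] * inside[(k, j, Z)]
    return inside

# Usage with a small toy grammar (names assumed):
lex_rules = [('DT', 'the', 1.0), ('NN', 'cat', 0.5), ('NN', 'dog', 0.5),
             ('VBD', 'saw', 1.0)]
bin_rules = [('NP', 'DT', 'NN', 1.0), ('VP', 'VBD', 'NP', 1.0),
             ('S', 'NP', 'VP', 1.0)]
words = 'the cat saw the dog'.split()
inside = inside_probs(words, lex_rules, bin_rules)
print(inside[(0, len(words), 'S')])    # total weight of all parses: 0.25 here
```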

[Figure 14.2: To compute the outside probability of Y_{i,k}, we use the outside probability of X_{i,j} and the inside probability of Z_{k,j}.]

The outside probability is the total weight of all derivations S ⇒* v X_{i,j} w, where v and w are strings of terminal symbols. That is, it is the total weight of all tree fragments with root S and a single broken-off node X_{i,j} (see Figure 14.1).

We already left it to you (Question ??) to figure out how to calculate the total weight of all derivations. The intermediate values of this calculation are the inside probabilities. How about the outside probabilities? This computation proceeds top-down. To compute the outside probability of Y_{i,k}, we look at its possible parents X_{i,j} and siblings Z_{k,j}. For each, we can compute the total weight of the outside part of all derivations going through these three nodes: it is

outside[X_{i,j}] · P(X → Y Z) · inside[Z_{k,j}].

If we sum over all possible parents and siblings (both left and right), we get the outside probability of Y_{i,k}. More precisely, the algorithm looks like this:

Require: string w = w_1 ... w_n and grammar G = (N, Σ, R, S)
Require: inside[X_{i,j}] is the inside probability of X_{i,j}
Ensure: outside[X_{i,j}] is the outside probability of X_{i,j}
  for all X, i, j do
      initialize outside[X_{i,j}] ← 0
  end for
  outside[S_{0,n}] ← 1
  for ℓ ← n, n−1, ..., 2 do                          ▷ top-down
      for i ← 0, ..., n − ℓ do
          j ← i + ℓ
          for k ← i + 1, ..., j − 1 do
              for all (X →p Y Z) ∈ R do
                  outside[Y_{i,k}] ← outside[Y_{i,k}] + outside[X_{i,j}] · p · inside[Z_{k,j}]
                  outside[Z_{k,j}] ← outside[Z_{k,j}] + outside[X_{i,j}] · p · inside[Y_{i,k}]
              end for
          end for
      end for
  end for
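
As an illustrative companion to this pseudocode (again with assumed names, and reusing the inside_probs sketch from above), an outside computation in Python might look like this:

```python
from collections import defaultdict

def outside_probs(words, bin_rules, inside, start='S'):
    """outside[(i, j, X)] = total weight of tree fragments rooted in `start`
    with a single broken-off node X over words[i:j]. Uses the inside table
    from the inside_probs sketch; binary rules are (X, Y, Z, p)."""
    n = len(words)
    outside = defaultdict(float)
    outside[(0, n, start)] = 1.0
    for length in range(n, 1, -1):          # top-down: longest spans first
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for X, Y, Z, p in bin_rules:
                    out_x = outside[(i, j, X)]
                    outside[(i, k, Y)] += out_x * p * inside[(k, j, Z)]
                    outside[(k, j, Z)] += out_x * p * inside[(i, k, Y)]
    return outside

# Continuing the earlier example:
outside = outside_probs(words, bin_rules, inside)
# Sanity check: for any single word position i, summing inside times outside
# over the preterminal cells spanning it recovers the total weight of all parses.
i = 0
print(sum(inside[(i, i + 1, X)] * outside[(i, i + 1, X)]
          for X in ('DT', 'NN', 'VBD')),
      inside[(0, len(words), 'S')])         # both 0.25 here
```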

Question 26. Fill in the rest of the details for EM training of a PCFG.

1. What is the total weight of derivations going through a forest edge X_{i,j} → Y_{i,k} Z_{k,j}?
2. What is the fractional count of this edge (the probability of using this edge given input string w)?
3. How do you compute the fractional count of a production X → Y Z (the probability of using this production given input string w)?
4. How do you re-estimate the probability of a production X → Y Z?
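
For reference, here is one standard way to write the quantities that Question 26 asks about, in the inside/outside notation above; this is a sketch of the usual inside/outside derivation, not necessarily the only formulation.

1. Total weight of derivations through the edge:
   weight(X_{i,j} → Y_{i,k} Z_{k,j}) = outside[X_{i,j}] · P(X → Y Z) · inside[Y_{i,k}] · inside[Z_{k,j}]

2. Fractional count of the edge given w (divide by the total weight of all derivations of w, which is inside[S_{0,n}]):
   c(X_{i,j} → Y_{i,k} Z_{k,j}) = weight(X_{i,j} → Y_{i,k} Z_{k,j}) / inside[S_{0,n}]

3. Fractional count of a production: sum the edge counts over all spans,
   c(X → Y Z) = Σ_{0 ≤ i < k < j ≤ n} c(X_{i,j} → Y_{i,k} Z_{k,j})

4. Re-estimation (M step): sum fractional counts over all training strings and normalize over productions with the same left-hand side,
   P̂(X → Y Z) = c(X → Y Z) / Σ_{X → α} c(X → α)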

Bibliography

Bar-Hillel, Y., M. Perles, and E. Shamir (1961). "On formal properties of simple phrase structure grammars". In: Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung 14.2, pp. 143–172.

Matsuzaki, Takuya, Yusuke Miyao, and Jun'ichi Tsujii (2005). "Probabilistic CFG with Latent Annotations". In: Proc. ACL, pp. 75–82.

Petrov, Slav et al. (2006). "Learning Accurate, Compact, and Interpretable Tree Annotation". In: Proc. COLING-ACL, pp. 433–440.