Probabilistic (Stochastic) Context-Free Grammars (PCFGs)


Motivating example: "The velocity of the seismic waves rises to..."
Parse: [S [NP_sg [DT The] [NN velocity] [PP [IN of] [NP_pl the seismic waves]]] [VP_sg rises to ...]]
The singular verb "rises" agrees with the head noun "velocity", not with the nearer plural noun "waves".

PCFGs
A PCFG G consists of:
- A set of terminals, {w^k}, k = 1, ..., V
- A set of nonterminals, {N^i}, i = 1, ..., n
- A designated start symbol, N^1
- A set of rules, {N^i → ζ^j} (where ζ^j is a sequence of terminals and nonterminals)
- A corresponding set of probabilities on rules such that, for all i:  Σ_j P(N^i → ζ^j) = 1

PCFG notation
- Sentence: a sequence of words w_1 ... w_m
- w_ab: the subsequence w_a ... w_b
- N^i_ab: nonterminal N^i dominates w_a ... w_b
- N^i ⇒* ζ: repeated derivation from N^i gives ζ

PCFG probability of a string
  P(w_1m) = Σ_t P(w_1m, t)   where t ranges over parses of w_1m
          = Σ_{t : yield(t) = w_1m} P(t)

A simple PCFG (in CNF)
  S  → NP VP    1.0        NP → astronomers   0.1
  PP → P NP     1.0        NP → ears          0.18
  VP → V NP     0.7        NP → saw           0.04
  VP → VP PP    0.3        NP → stars         0.18
  P  → with     1.0        NP → telescopes    0.1
  V  → saw      1.0        NP → NP PP         0.4
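As a concrete sketch (not part of the slides), this grammar can be written down as two Python dictionaries, one for binary rules and one for lexical rules, and checked against the normalization condition from the PCFG definition above (for each nonterminal, the probabilities of its rules sum to 1). The variable names are my own.

    from collections import defaultdict

    # P(N^j -> N^r N^s), keyed by (lhs, left child, right child)
    binary_rules = {
        ("S",  "NP", "VP"): 1.0,
        ("PP", "P",  "NP"): 1.0,
        ("VP", "V",  "NP"): 0.7,
        ("VP", "VP", "PP"): 0.3,
        ("NP", "NP", "PP"): 0.4,
    }

    # P(N^j -> w^k), keyed by (lhs, word)
    lexical_rules = {
        ("P",  "with"):        1.0,
        ("V",  "saw"):         1.0,
        ("NP", "astronomers"): 0.1,
        ("NP", "ears"):        0.18,
        ("NP", "saw"):         0.04,
        ("NP", "stars"):       0.18,
        ("NP", "telescopes"):  0.1,
    }

    # Normalization check: for every nonterminal N^j, the probabilities of all
    # rules with N^j on the left-hand side sum to 1.
    totals = defaultdict(float)
    for (lhs, _, _), p in binary_rules.items():
        totals[lhs] += p
    for (lhs, _), p in lexical_rules.items():
        totals[lhs] += p
    assert all(abs(t - 1.0) < 1e-9 for t in totals.values())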

t_1 (each node is annotated with the probability of the rule that expands it):
  [S_1.0 [NP_0.1 astronomers]
         [VP_0.7 [V_1.0 saw]
                 [NP_0.4 [NP_0.18 stars]
                         [PP_1.0 [P_1.0 with] [NP_0.18 ears]]]]]

t_2:
  [S_1.0 [NP_0.1 astronomers]
         [VP_0.3 [VP_0.7 [V_1.0 saw] [NP_0.18 stars]]
                 [PP_1.0 [P_1.0 with] [NP_0.18 ears]]]]

The two parse trees' probabilities and the sentence probability
  P(t_1) = 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072
  P(t_2) = 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0006804
  P(w_15) = P(t_1) + P(t_2) = 0.0015876
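The same numbers drop out of a quick check; the factors below are just the rule probabilities read off t_1 and t_2 above (floating-point output is only approximate):

    p_t1 = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18   # P(t_1) ≈ 0.0009072
    p_t2 = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 0.18 * 1.0 * 1.0 * 0.18   # P(t_2) ≈ 0.0006804
    print(p_t1 + p_t2)                                              # P(w_15) ≈ 0.0015876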

Assumptions of PCFGs
1. Place invariance (like time invariance in an HMM): for all k, P(N^j_k(k+c) → ζ) is the same.
2. Context-free: P(N^j_kl → ζ | words outside w_k ... w_l) = P(N^j_kl → ζ)
3. Ancestor-free: P(N^j_kl → ζ | ancestor nodes of N^j_kl) = P(N^j_kl → ζ)

Let the upper-left index in ^iN^j be an arbitrary identifying index for a particular token of a nonterminal. Then,
  P([^1S [^2NP the man] [^3VP snores]])
    = P(^1S_13 → ^2NP_12 ^3VP_33, ^2NP_12 → the man, ^3VP_33 → snores)
    = ...
    = P(S → NP VP) P(NP → the man) P(VP → snores)

Some features of PCFGs
Reasons to use a PCFG, and some idea of their limitations:
- Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a sentence, but not a very good one, since the model is not lexicalized.
- Better suited to grammar induction (Gold 1967).
- Robustness. (Admit everything, with low probability.)

Some features of PCFGs
- Gives a probabilistic language model for English.
- In practice, a PCFG is a worse language model for English than a trigram model.
- One can hope to combine the strengths of a PCFG and a trigram model.
- A PCFG encodes certain biases, e.g., that smaller trees are normally more probable.

Improper (inconsistent) distributions
Consider the grammar:
  S → rhubarb   P = 1/3
  S → S S       P = 2/3
Then:
  P(rhubarb) = 1/3
  P(rhubarb rhubarb) = 2/3 × 1/3 × 1/3 = 2/27
  P(rhubarb rhubarb rhubarb) = 2 × (2/3)^2 × (1/3)^3 = 8/243
  P(L) = 1/3 + 2/27 + 8/243 + ... = 1/2
This is an improper/inconsistent distribution: the probabilities of all finite strings sum to less than 1. It is not a problem if you estimate from a parsed treebank (Chi and Geman 1998).
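One way to see where the value 1/2 comes from (a standard branching-process argument, not spelled out on the slide): let p be the total probability that S derives a finite string, and condition on the first rule used:

  p = 1/3 + (2/3) p^2   =>   2p^2 - 3p + 1 = 0   =>   p ∈ {1/2, 1}

The probability of deriving a finite string is the smallest nonnegative root, so P(L) = 1/2; the missing mass corresponds to derivations that never terminate.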

Questions for PCFGs
Just as for HMMs, there are three basic questions we wish to answer:
- What is the probability of a sentence, P(w_1m | G)?
- What is the most likely parse of a sentence, argmax_t P(t | w_1m, G)?
- Learning: how do we find G such that P(w_1m | G) is maximized?

Chomsky Normal Form grammars
We'll do the case of Chomsky Normal Form grammars, which only have rules of the form:
  N^i → N^j N^k
  N^i → w^j
Any CFG can be represented by a weakly equivalent CFG in Chomsky Normal Form. It's straightforward to generalize the algorithm (recall chart parsing).

PCFG parameters
For a CNF PCFG with n nonterminals and V terminals, the parameters are:
  P(N^j → N^r N^s | G)   an n^3 matrix of parameters
  P(N^j → w^k | G)       an n × V matrix of parameters
subject to, for each j = 1, ..., n:
  Σ_{r,s} P(N^j → N^r N^s) + Σ_k P(N^j → w^k) = 1

Probabilistic Regular Grammars
A PRG has only rules of the form:
  N^i → w^j N^k
  N^i → w^j
with start state N^1.
For an HMM, the probabilities of all strings of a given length sum to one:
  Σ_{w_1n} P(w_1n) = 1   for each n
whereas in a PCFG or a PRG, the probabilities of all strings in the language sum to one:
  Σ_{w ∈ L} P(w) = 1

Consider: P(John decided to bake a).
This gets high probability under an HMM, but low probability under a PRG or a PCFG, since those models spread probability over complete strings and this is an unlikely complete sentence. The difference can be implemented via a sink (finish) state.
(Figure: a PRG with start state Π, contrasted with an HMM augmented with explicit Start and Finish states.)

Comparison of HMMs (PRGs) and PCFGs
(Figure: for the observation O: "the big brown box", the HMM/PRG view is a chain of states X: NP, N, N, N, sink emitting the words, versus the PRG derivation NP → the N, N → big N, N → brown N, N → box.)

Inside and outside probabilities
This suggests: whereas for an HMM we have
  Forwards:  α_i(t) = P(w_1(t-1), X_t = i)
  Backwards: β_i(t) = P(w_tT | X_t = i)
for a PCFG we make use of inside and outside probabilities, defined as follows:
  Outside: α_j(p, q) = P(w_1(p-1), N^j_pq, w_(q+1)m | G)
  Inside:  β_j(p, q) = P(w_pq | N^j_pq, G)
A slight generalization of dynamic Bayes nets covers PCFG inference by the inside-outside algorithm (an and-or tree: conjunctive daughters, disjunctively chosen).

Inside and outside probabilities in PCFGs
(Figure: the root N^1 spans w_1 ... w_m; for a node N^j_pq, the outside probability α covers the words w_1 ... w_(p-1) and w_(q+1) ... w_m around the node, while the inside probability β covers the words w_p ... w_q that N^j dominates.)

Probability of a string: inside probability
  P(w_1m | G) = P(N^1 ⇒* w_1m | G) = P(w_1m | N^1_1m, G) = β_1(1, m)
Base case: we want to find β_j(k, k) (the probability of a rule N^j → w_k):
  β_j(k, k) = P(w_k | N^j_kk, G) = P(N^j → w_k | G)

Induction: we want to find β_j(p, q) for p < q. Since the grammar is in Chomsky Normal Form, the first rule used must be of the form N^j → N^r N^s, so we can proceed by induction, dividing the string in two at various places and summing the result:
(Figure: N^j_pq expands as N^r_pd over w_p ... w_d and N^s_(d+1)q over w_(d+1) ... w_q.)
These inside probabilities can be calculated bottom up.

For all j,
  β_j(p, q) = P(w_pq | N^j_pq, G)
    = Σ_{r,s} Σ_{d=p}^{q-1} P(w_pd, N^r_pd, w_(d+1)q, N^s_(d+1)q | N^j_pq, G)
    = Σ_{r,s} Σ_{d=p}^{q-1} P(N^r_pd, N^s_(d+1)q | N^j_pq, G) P(w_pd | N^j_pq, N^r_pd, N^s_(d+1)q, G) P(w_(d+1)q | N^j_pq, N^r_pd, N^s_(d+1)q, w_pd, G)
    = Σ_{r,s} Σ_{d=p}^{q-1} P(N^r_pd, N^s_(d+1)q | N^j_pq, G) P(w_pd | N^r_pd, G) P(w_(d+1)q | N^s_(d+1)q, G)
    = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q)

Calculation of inside probabilities (CKY algorithm)

       q=1: astronomers   q=2: saw                  q=3: stars       q=4: with     q=5: ears
  p=1  β_NP = 0.1                                   β_S = 0.0126                   β_S = 0.0015876
  p=2                     β_NP = 0.04, β_V = 1.0    β_VP = 0.126                   β_VP = 0.015876
  p=3                                               β_NP = 0.18                    β_NP = 0.01296
  p=4                                                                β_P = 1.0     β_PP = 0.18
  p=5                                                                              β_NP = 0.18

(Cell (p, q) holds β_j(p, q) for each nonterminal N^j that can span words p through q; the top-right cell is the sentence probability β_S(1, 5) = 0.0015876.)
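A minimal Python sketch of this bottom-up computation, assuming the binary_rules and lexical_rules dictionaries from the grammar sketch above (function and variable names are my own); it reproduces the chart values, e.g. β_S(1, 5) = 0.0015876.

    from collections import defaultdict

    def inside_probs(words, binary_rules, lexical_rules):
        m = len(words)
        beta = defaultdict(float)           # beta[(j, p, q)] = P(w_p..w_q | N^j_pq, G)
        # Base case: beta_j(k, k) = P(N^j -> w_k)
        for k, w in enumerate(words, start=1):
            for (lhs, word), prob in lexical_rules.items():
                if word == w:
                    beta[(lhs, k, k)] += prob
        # Induction: beta_j(p, q) = sum_{r,s,d} P(N^j -> N^r N^s) beta_r(p, d) beta_s(d+1, q)
        for span in range(2, m + 1):
            for p in range(1, m - span + 2):
                q = p + span - 1
                for (lhs, r, s), prob in binary_rules.items():
                    for d in range(p, q):
                        beta[(lhs, p, q)] += prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
        return beta

    beta = inside_probs("astronomers saw stars with ears".split(),
                        binary_rules, lexical_rules)
    print(beta[("S", 1, 5)])   # ≈ 0.0015876, matching the chart above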

Outside probabilities
Probability of a string: for any k, 1 ≤ k ≤ m,
  P(w_1m | G) = Σ_j P(w_1(k-1), w_k, w_(k+1)m, N^j_kk | G)
              = Σ_j P(w_1(k-1), N^j_kk, w_(k+1)m | G) P(w_k | w_1(k-1), N^j_kk, w_(k+1)m, G)
              = Σ_j α_j(k, k) P(N^j → w_k)
Inductive (DP) calculation: one calculates the outside probabilities top down (after determining the inside probabilities).

Outside probabilities
Base case:
  α_1(1, m) = 1
  α_j(1, m) = 0, for j ≠ 1
Inductive case: a node N^j_pq is either the left or the right branch of its parent; we sum over both possibilities, calculating with outside and inside probabilities.
(Figure, left-branch case: the parent N^f_pe expands as N^j_pq N^g_(q+1)e, with the sibling N^g covering w_(q+1) ... w_e.)

Outside probabilities: inductive case
A node N^j_pq might be the left or the right branch of its parent node; we sum over both possibilities.
(Figure, right-branch case: the parent N^f_eq expands as N^g_e(p-1) N^j_pq, with the sibling N^g covering w_e ... w_(p-1).)

Inductive case:
  α_j(p, q)
    = [ Σ_{f,g} Σ_{e=q+1}^{m} P(w_1(p-1), w_(q+1)m, N^f_pe, N^j_pq, N^g_(q+1)e) ]
    + [ Σ_{f,g} Σ_{e=1}^{p-1} P(w_1(p-1), w_(q+1)m, N^f_eq, N^g_e(p-1), N^j_pq) ]
    = [ Σ_{f,g} Σ_{e=q+1}^{m} P(w_1(p-1), w_(e+1)m, N^f_pe) P(N^j_pq, N^g_(q+1)e | N^f_pe) P(w_(q+1)e | N^g_(q+1)e) ]
    + [ Σ_{f,g} Σ_{e=1}^{p-1} P(w_1(e-1), w_(q+1)m, N^f_eq) P(N^g_e(p-1), N^j_pq | N^f_eq) P(w_e(p-1) | N^g_e(p-1)) ]
    = [ Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) P(N^f → N^j N^g) β_g(q+1, e) ]
    + [ Σ_{f,g} Σ_{e=1}^{p-1} α_f(e, q) P(N^f → N^g N^j) β_g(e, p-1) ]
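A matching Python sketch of the top-down outside computation, assuming the inside_probs, binary_rules, and lexical_rules definitions from the earlier sketches (names are my own). The check at the end is the identity from the "Probability of a string" slide above: Σ_j α_j(k, k) P(N^j → w_k) equals the sentence probability.

    from collections import defaultdict

    def outside_probs(words, binary_rules, beta, start="S"):
        m = len(words)
        alpha = defaultdict(float)
        alpha[(start, 1, m)] = 1.0              # base case: alpha_1(1, m) = 1
        # Top down: each span only refers to strictly larger, already computed spans.
        for span in range(m - 1, 0, -1):
            for p in range(1, m - span + 2):
                q = p + span - 1
                for (f, left, right), prob in binary_rules.items():
                    # N^j_pq as the left branch: parent N^f_pe, sibling N^g_(q+1)e
                    for e in range(q + 1, m + 1):
                        alpha[(left, p, q)] += alpha[(f, p, e)] * prob * beta[(right, q + 1, e)]
                    # N^j_pq as the right branch: parent N^f_eq, sibling N^g_e(p-1)
                    for e in range(1, p):
                        alpha[(right, p, q)] += alpha[(f, e, q)] * prob * beta[(left, e, p - 1)]
        return alpha

    words = "astronomers saw stars with ears".split()
    alpha = outside_probs(words, binary_rules, beta)
    # Sanity check at k = 1: sum_j alpha_j(k, k) P(N^j -> w_k) = P(w_1m) ≈ 0.0015876
    k = 1
    print(sum(alpha[(j, k, k)] * p
              for (j, w), p in lexical_rules.items() if w == words[k - 1]))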

Overall probability of a node existing
As with an HMM, we can form a product of the inside and outside probabilities. This time:
  α_j(p, q) β_j(p, q) = P(w_1(p-1), N^j_pq, w_(q+1)m | G) P(w_pq | N^j_pq, G)
                      = P(w_1m, N^j_pq | G)
Therefore,
  P(w_1m, N_pq | G) = Σ_j α_j(p, q) β_j(p, q)
where N_pq means that some nonterminal spans words p through q. Just in the cases of the root node and the preterminals do we know there will always be some such constituent.
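Continuing the running example (a sketch reusing alpha, beta, and the rule dictionaries from above), the probability that some constituent spans "stars with ears" is Σ_j α_j(3, 5) β_j(3, 5); dividing by the sentence probability gives the posterior for that span.

    nonterminals = {j for (j, _, _) in binary_rules} | {j for (j, _) in lexical_rules}
    span_prob = sum(alpha[(j, 3, 5)] * beta[(j, 3, 5)] for j in nonterminals)
    print(span_prob)                       # P(w_1m, N_35 | G) ≈ 0.0009072, i.e. P(t_1)
    print(span_prob / beta[("S", 1, 5)])   # posterior that words 3-5 form a constituent ≈ 0.571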

Training a PCFG
We construct an EM training algorithm, as for HMMs. We would like to calculate how often each rule is used:
  P̂(N^j → ζ) = C(N^j → ζ) / Σ_γ C(N^j → γ)
If we have parsed (treebank) data, we can simply count; otherwise we work iteratively from the expectations of the current model.
Consider:
  α_j(p, q) β_j(p, q) = P(N^1 ⇒* w_1m, N^j ⇒* w_pq | G)
                      = P(N^1 ⇒* w_1m | G) P(N^j ⇒* w_pq | N^1 ⇒* w_1m, G)
We have already solved how to calculate P(N^1 ⇒* w_1m); call this probability π. Then:
  P(N^j ⇒* w_pq | N^1 ⇒* w_1m, G) = α_j(p, q) β_j(p, q) / π
and
  E(N^j is used in the derivation) = Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) β_j(p, q) / π

In the case where we are not dealing with a preterminal, we substitute the inductive definition of β; for any r, s and q > p:
  P(N^j → N^r N^s ⇒* w_pq | N^1 ⇒* w_1m, G) = Σ_{d=p}^{q-1} α_j(p, q) P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q) / π
Therefore the expectation is:
  E(N^j → N^r N^s, N^j used) = Σ_{p=1}^{m-1} Σ_{q=p+1}^{m} Σ_{d=p}^{q-1} α_j(p, q) P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q) / π
Now for the maximization step, we want:
  P(N^j → N^r N^s) = E(N^j → N^r N^s, N^j used) / E(N^j used)

Therefore, the reestimation formula P̂(N^j → N^r N^s) is the quotient:
  P̂(N^j → N^r N^s) = [ Σ_{p=1}^{m-1} Σ_{q=p+1}^{m} Σ_{d=p}^{q-1} α_j(p, q) P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q) ] / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) β_j(p, q) ]
Similarly,
  E(N^j → w^k | N^1 ⇒* w_1m, G) = Σ_{h=1}^{m} α_j(h, h) P(N^j → w_h, w_h = w^k) / π
Therefore,
  P̂(N^j → w^k) = [ Σ_{h=1}^{m} α_j(h, h) P(N^j → w_h, w_h = w^k) ] / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) β_j(p, q) ]
Inside-Outside algorithm: repeat this process until the change in the estimated probabilities is small.
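A sketch of one re-estimation (EM) step on a single sentence, putting the pieces together; it assumes the inside_probs and outside_probs functions from the earlier sketches and, like them, is illustrative rather than the slides' own code. (For the multi-sentence case on the next slide, the numerators and denominators are simply summed over sentences before dividing.)

    from collections import defaultdict

    def reestimate(words, binary_rules, lexical_rules, start="S"):
        m = len(words)
        beta = inside_probs(words, binary_rules, lexical_rules)
        alpha = outside_probs(words, binary_rules, beta, start)
        pi = beta[(start, 1, m)]             # P(N^1 =>* w_1m | G)

        nonterminals = {j for (j, _, _) in binary_rules} | {j for (j, _) in lexical_rules}
        num_bin = defaultdict(float)         # E(N^j -> N^r N^s used)
        num_lex = defaultdict(float)         # E(N^j -> w^k used)
        denom = defaultdict(float)           # E(N^j used)

        for p in range(1, m + 1):
            for q in range(p, m + 1):
                for j in nonterminals:
                    denom[j] += alpha[(j, p, q)] * beta[(j, p, q)] / pi
                for (j, r, s), prob in binary_rules.items():
                    for d in range(p, q):    # empty when p == q, so only p < q contributes
                        num_bin[(j, r, s)] += (alpha[(j, p, q)] * prob *
                                               beta[(r, p, d)] * beta[(s, d + 1, q)] / pi)
        for k, w in enumerate(words, start=1):
            for (j, word), prob in lexical_rules.items():
                if word == w:
                    num_lex[(j, word)] += alpha[(j, k, k)] * prob / pi

        # Maximization step (assumes every nonterminal has nonzero expected count here).
        new_binary = {rule: num_bin[rule] / denom[rule[0]] for rule in binary_rules}
        new_lexical = {rule: num_lex[rule] / denom[rule[0]] for rule in lexical_rules}
        return new_binary, new_lexical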

Multiple training instances: if we have training sentences W = (W_1, ..., W_ω), with W_i = (w_1, ..., w_m_i), and we let u and v be the common subterms from before:
  u_i(p, q, j, r, s) = Σ_{d=p}^{q-1} α_j(p, q) P(N^j → N^r N^s) β_r(p, d) β_s(d+1, q) / P(N^1 ⇒* W_i | G)
and
  v_i(p, q, j) = α_j(p, q) β_j(p, q) / P(N^1 ⇒* W_i | G)
Assuming the observations are independent, we can sum contributions:
  P̂(N^j → N^r N^s) = [ Σ_{i=1}^{ω} Σ_{p=1}^{m_i-1} Σ_{q=p+1}^{m_i} u_i(p, q, j, r, s) ] / [ Σ_{i=1}^{ω} Σ_{p=1}^{m_i} Σ_{q=p}^{m_i} v_i(p, q, j) ]
and
  P̂(N^j → w^k) = [ Σ_{i=1}^{ω} Σ_{h : w_h = w^k} v_i(h, h, j) ] / [ Σ_{i=1}^{ω} Σ_{p=1}^{m_i} Σ_{q=p}^{m_i} v_i(p, q, j) ]

Problems with the Inside-Outside algorithm
1. It is slow. Each iteration is O(m^3 n^3), where m = Σ_{i=1}^{ω} m_i and n is the number of nonterminals in the grammar.
2. Local maxima are much more of a problem than for HMMs. Charniak reports that on each trial a different local maximum was found. Possible remedies: simulated annealing; restricting rules by initializing some parameters to zero; HMM-based initialization; reallocating nonterminals away from "greedy" terminals.
3. Lari and Young suggest that you need many more nonterminals available than are theoretically necessary to get good grammar learning (about a threefold increase?). This compounds the first problem.
4. There is no guarantee that the nonterminals the algorithm learns will bear any satisfactory resemblance to the kinds of nonterminals normally motivated in linguistic analysis (NP, VP, etc.).