Probabilistic Context Free Grammars


1 Defining PCFGs

A PCFG G consists of:

1. A set of terminals: {w^k}, k = 1, ..., V
2. A set of nonterminals: {N^i}, i = 1, ..., n
3. A designated start symbol: N^1
4. A set of rules: {N^i → ξ^j}, where ξ^j is a sequence of terminals and nonterminals
5. A probability function P which assigns probabilities to rules so that, for all nonterminals N^i:

      Σ_j P(N^i → ξ^j) = 1

Conventions:

   Notation             Meaning
   G                    Grammar (PCFG)
   L                    Language (generated or accepted by a grammar)
   t                    Parse tree
   {N^1, ..., N^n}      Nonterminal vocabulary (N^1 the start symbol)
   {w^1, ..., w^V}      Terminal vocabulary
   w_1 ... w_m          Sentence to be parsed
   N^j_{pq}             Nonterminal N^j spans positions p through q in the sentence
   α_j(p, q)            Outside probabilities
   β_j(p, q)            Inside probabilities
   N^j ⇒* w_p ... w_q   Nonterminal N^j dominates words w_p through w_q in the sentence (not necessarily directly)

A key property of PCFGs is what we will call the independence assumption:

Independence Assumption: The probability of a node expanding to a sequence of daughters depends only on the immediate mother node, not on any node above it or outside the current constituent.

We first address the problem of finding the probability of a string given a PCFG. This is directly useful for some applications, such as filtering the hypotheses of a speech recognizer. But considering some efficient algorithms for calculating string probabilities will have a side effect: it will provide a line of attack on a somewhat more central problem for this course, finding the most probable parse of a string.

2 Example

Thus the only parameters of a PCFG are the rule probabilities, as in the following simple PCFG:

(1)   S  → NP VP   1.0        NP → NP PP        0.4
      PP → P NP    1.0        NP → astronomers  0.1
      VP → V NP    0.7        NP → ears         0.18
      VP → VP PP   0.3        NP → saw          0.04
      P  → with    1.0        NP → stars        0.18
      V  → saw     1.0        NP → telescopes   0.1

Notice that the constraint Σ_j P(NP → ξ^j) = 1 is met: the sum of the rule probabilities for all the NP rules is 1.

We consider the probability of the sentence

(2) Astronomers saw stars with ears.

with respect to this grammar, which admits the two analyses shown in Figures 1 and 2. The probability of each analysis is computed simply by multiplying the probabilities of the rules used in it. We compute the probability of the string by summing the probabilities of its two analyses:

      P(w_{1,5}) = P(t_1) + P(t_2) = 0.0009072 + 0.0006804 = 0.0015876

3 Two algorithms for efficiently computing string probabilities

The naive approach to computing string probabilities: compute the probability of each parse and add. But the number of parses of a string with a CFG in general grows exponentially with the length of the string, so this is inefficient.

3.1 Inside Probabilities

      P(w_{1m} | G) = P(N^1 ⇒* w_{1m} | G) = P(w_{1m} | N^1_{1m}, G) = β_1(1, m)

The probability of a word string given the grammar equals the probability that the start symbol exhaustively dominates the string, which is the probability of the string w_{1m} given that the start symbol dominates words 1 through m. This is what we call the inside probability of the string for the span (1, m) and the category N^1.

The general definition of an inside probability:

      β_j(p, q) = P(w_{pq} | N^j_{pq}, G)
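Before turning to the efficient algorithms, here is a quick check of the numbers so far by the naive route, as a small Python sketch. It is not part of the original notes; the dictionary G and the helper tree_prob are names we chose. It stores the grammar in (1), checks the properness constraint, and recomputes P(t_1), P(t_2), and their sum, the value the inside algorithm below delivers as β_S(1, 5).

```python
from collections import defaultdict
import math

# Our own representation (not from the notes): a rule is stored as
# (lhs, rhs) with rhs a tuple of symbols, mapped to its probability.
G = {
    ("S",  ("NP", "VP")): 1.0,  ("PP", ("P", "NP")):      1.0,
    ("VP", ("V", "NP")):  0.7,  ("VP", ("VP", "PP")):     0.3,
    ("P",  ("with",)):    1.0,  ("V",  ("saw",)):         1.0,
    ("NP", ("NP", "PP")): 0.4,  ("NP", ("astronomers",)): 0.1,
    ("NP", ("ears",)):    0.18, ("NP", ("saw",)):         0.04,
    ("NP", ("stars",)):   0.18, ("NP", ("telescopes",)):  0.1,
}

# Properness constraint: for each nonterminal, its rule probabilities sum to 1.
totals = defaultdict(float)
for (lhs, _rhs), prob in G.items():
    totals[lhs] += prob
print({lhs: round(t, 10) for lhs, t in totals.items()})  # every value is 1.0

def tree_prob(rules_used):
    """Probability of one parse tree: the product of the rules it uses."""
    return math.prod(G[rule] for rule in rules_used)

# Rules used in t1 (PP attached inside the object NP) and t2 (PP attached
# to the VP); see Figures 1 and 2.
t1 = [("S", ("NP", "VP")), ("NP", ("astronomers",)), ("VP", ("V", "NP")),
      ("V", ("saw",)), ("NP", ("NP", "PP")), ("NP", ("stars",)),
      ("PP", ("P", "NP")), ("P", ("with",)), ("NP", ("ears",))]
t2 = [("S", ("NP", "VP")), ("NP", ("astronomers",)), ("VP", ("VP", "PP")),
      ("VP", ("V", "NP")), ("V", ("saw",)), ("NP", ("stars",)),
      ("PP", ("P", "NP")), ("P", ("with",)), ("NP", ("ears",))]

print(tree_prob(t1))                  # about 0.0009072
print(tree_prob(t2))                  # about 0.0006804
print(tree_prob(t1) + tree_prob(t2))  # about 0.0015876
```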

There is an algorithm for computing the inside probability of a string, known as the inside algorithm. It breaks down into a base case and an induction:

1. Base case: The event that a nonterminal N^j exhaustively dominates the word at position k is N^j_{kk}, and the probability that this happens given the grammar is:

      β_j(k, k) = P(w_k | N^j_{kk}, G)
                = P(N^j → w_k | G)

   The point of the β notation is that it uses only indices: indices of word positions (subscripts) and grammar elements (superscripts). The point of the second line is that this reduces to a fact about the grammar, irrespective of word positions.

2. Induction: We assume a grammar in Chomsky normal form, so the situation is as pictured in Figure 3. Then

      β_j(p, q) = P(w_{pq} | N^j_{pq}, G)                                                              (1)
                = Σ_{r,s} Σ_{d=p}^{q-1} P(w_{pd}, N^r_{pd}, w_{(d+1)q}, N^s_{(d+1)q} | N^j_{pq}, G)    (2)

   We regard the probability of an analysis of w_{pq} as an instance of category N^j as the sum of the probabilities of analyses using some way of expanding N^j; each such analysis uses some production

      N^j → N^r N^s

   Following the picture in Figure 3, we split the word string into two halves at word d, p ≤ d < q: w_{pd} and w_{(d+1)q}, one corresponding to N^r and one to N^s. Thus the jointly occurring events in (2) are word string w_{pd} being analyzed as N^r and word string w_{(d+1)q} being analyzed as N^s.

      β_j(p, q) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^r_{pd}, N^s_{(d+1)q} | N^j_{pq}, G)
                                        · P(w_{pd} | N^j_{pq}, N^r_{pd}, N^s_{(d+1)q}, G)
                                        · P(w_{(d+1)q} | N^j_{pq}, N^r_{pd}, N^s_{(d+1)q}, w_{pd}, G)  (3)

   Equation (3) uses the chain rule on the joint events of (2). Now we use the independence assumptions of PCFGs to conclude:

      β_j(p, q) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^r_{pd}, N^s_{(d+1)q} | N^j_{pq}, G) · P(w_{pd} | N^r_{pd}, G) · P(w_{(d+1)q} | N^s_{(d+1)q}, G)   (4)

   And substituting in the definitions of an inside probability and a rule probability, we have the basic equation we can compute with:

      β_j(p, q) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) · β_r(p, d) · β_s(d + 1, q)                   (5)

This means we can compute inside probabilities bottom up. We use a dynamic programming algorithm, and as is customary we represent the computations in a table (chart): cell (p, q), for row p and column q, contains the result of the β computations for the span beginning with word p and ending with word q.

First we do all the word βs:

   (1,1) astronomers:  β_NP = 0.1
   (2,2) saw:          β_NP = 0.04, β_V = 1.0
   (3,3) stars:        β_NP = 0.18
   (4,4) with:         β_P = 1.0
   (5,5) ears:         β_NP = 0.18

We now proceed as with CYK, trying to build constituents of length 2 next. Two new cells can be filled:

   (2,3) saw stars:  β_VP = 0.126
   (4,5) with ears:  β_PP = 0.18

The computations of β_VP(2, 3) and β_PP(4, 5):

   β_VP(2, 3) = P(VP → V NP) · β_V(2, 2) · β_NP(3, 3)     (6)
              = 0.7 · 1.0 · 0.18                          (7)
              = 0.126                                     (8)

   β_PP(4, 5) = P(PP → P NP) · β_P(4, 4) · β_NP(5, 5)     (9)
              = 1.0 · 1.0 · 0.18                          (10)
              = 0.18                                      (11)

And in the next round (constituents of length 3):

   (1,3) astronomers saw stars:  β_S  = 0.0126
   (3,5) stars with ears:        β_NP = 0.01296

The computations of β_S(1, 3) and β_NP(3, 5):

   β_S(1, 3)  = P(S → NP VP) · β_NP(1, 1) · β_VP(2, 3)     (12)
              = 1.0 · 0.1 · 0.126                          (13)
              = 0.0126                                     (14)

   β_NP(3, 5) = P(NP → NP PP) · β_NP(3, 3) · β_PP(4, 5)    (15)
              = 0.4 · 0.18 · 0.18                          (16)
              = 0.01296                                    (17)

And in the only round with an ambiguous constituent, one new cell is filled:

   (2,5) saw stars with ears:  β_VP = 0.015876

The computation of β_VP(2, 5):

   β_VP(2, 5) = P(VP → V NP) · β_V(2, 2) · β_NP(3, 5)  +  P(VP → VP PP) · β_VP(2, 3) · β_PP(4, 5)    (18)
              = (0.7 · 1.0 · 0.01296) + (0.3 · 0.126 · 0.18)                                         (19)
              = 0.009072 + 0.006804                                                                  (20)
              = 0.015876                                                                             (21)

The probability table (or chart) may be completed in the next step, filling the last cell with β_S = 0.0015876. The completed chart, cell by cell:

   (1,1): β_NP = 0.1
   (2,2): β_NP = 0.04, β_V = 1.0
   (3,3): β_NP = 0.18
   (4,4): β_P  = 1.0
   (5,5): β_NP = 0.18
   (2,3): β_VP = 0.126
   (4,5): β_PP = 0.18
   (1,3): β_S  = 0.0126
   (3,5): β_NP = 0.01296
   (2,5): β_VP = 0.015876
   (1,5): β_S  = 0.0015876

The computation of β_S(1, 5) is left as an exercise.
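The chart computation above is mechanical enough to automate. The following bottom-up Python sketch of the inside algorithm (ours, not from the notes; binary, lexical, and inside are our own names) assumes a grammar whose rules are either binary or lexical, as here, and reproduces the β values in the chart, including the β_S(1, 5) value in the last cell.

```python
from collections import defaultdict

# Our own encoding of the example grammar: binary rules P(A -> B C)
# and lexical rules P(A -> w), kept in two dictionaries.
binary = {("S", "NP", "VP"): 1.0, ("PP", "P", "NP"): 1.0,
          ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
          ("NP", "NP", "PP"): 0.4}
lexical = {("P", "with"): 1.0, ("V", "saw"): 1.0,
           ("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18,
           ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1}

def inside(words):
    """beta[(j, p, q)] = P(w_p..w_q | N^j spans p..q); positions are 1-based."""
    m = len(words)
    beta = defaultdict(float)          # missing cells count as 0
    # Base case: beta_j(k, k) = P(N^j -> w_k).
    for k, w in enumerate(words, start=1):
        for (cat, word), prob in lexical.items():
            if word == w:
                beta[(cat, k, k)] = prob
    # Induction (5): sum over binary rules and split points d, shorter spans first.
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (a, b, c), prob in binary.items():
                for d in range(p, q):
                    beta[(a, p, q)] += prob * beta[(b, p, d)] * beta[(c, d + 1, q)]
    return beta

beta = inside("astronomers saw stars with ears".split())
print(beta[("VP", 2, 5)])  # about 0.015876
print(beta[("S", 1, 5)])   # about 0.0015876, the string probability
```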

3.2 Outside Probabilities

We define α_j(p, q), the outside probability for a category N^j and a span w_{pq}, as

      α_j(p, q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} | G)

This is the probability of generating N^j covering the span (p, q) (whatever the words in it) along with all the words outside it.

Using the notion of outside probability, we can view the problem of computing the probability of a string from the point of view of any word in it. Pick a word w_k. The following is true:

      P(w_{1m} | G) = Σ_j P(w_{1(k-1)}, w_k, w_{(k+1)m}, N^j_{kk} | G)
                    = Σ_j P(w_{1(k-1)}, w_{(k+1)m}, N^j_{kk} | G) · P(w_k | w_{1(k-1)}, N^j_{kk}, w_{(k+1)m}, G)
                    = Σ_j α_j(k, k) · P(N^j → w_k)

We start with the observation that any assignment of a category to w_k fixes a probability for analyses of the entire string compatible with that assignment, so we sum up such analysis probabilities for each category N^j in the grammar. We then apply the chain rule, the independence assumptions of PCFGs, and the definition of α_j(p, q) to derive an expression for the probability of the entire string. So if we knew the α_j(k, k) values, computing the string probability would be easy.

We now show the outside algorithm, a method for solving the more general problem of computing α_j(p, q). This is done top down. We break it down as before into a base case and an induction:

1. Base case:

      α_1(1, m) = 1
      α_j(1, m) = 0   for j ≠ 1

2. Induction: The situation is as pictured in Figure 4 or Figure 5. Then

      α_j(p, q) = [Σ_{f, g≠j} Σ_{e=q+1}^{m} P(w_{1(p-1)}, w_{(q+1)m}, N^f_{pe}, N^j_{pq}, N^g_{(q+1)e})]
                + [Σ_{f,g} Σ_{e=1}^{p-1} P(w_{1(p-1)}, w_{(q+1)m}, N^f_{eq}, N^g_{e(p-1)}, N^j_{pq})]

   The first double sum is the case of Figure 4; the second is the case of Figure 5. In the first case, we restrict the two categories not to be equal, so as not to count rules of the form X → Y Y twice.

      α_j(p, q) = [Σ_{f, g≠j} Σ_{e=q+1}^{m} P(w_{1(p-1)}, w_{(e+1)m}, N^f_{pe}) · P(N^j_{pq}, N^g_{(q+1)e} | N^f_{pe}) · P(w_{(q+1)e} | N^g_{(q+1)e})]
                + [Σ_{f,g} Σ_{e=1}^{p-1} P(w_{1(e-1)}, w_{(q+1)m}, N^f_{eq}) · P(N^g_{e(p-1)}, N^j_{pq} | N^f_{eq}) · P(w_{e(p-1)} | N^g_{e(p-1)})]

                = [Σ_{f, g≠j} Σ_{e=q+1}^{m} α_f(p, e) · P(N^f → N^j N^g) · β_g(q + 1, e)]
                + [Σ_{f,g} Σ_{e=1}^{p-1} α_f(e, q) · P(N^f → N^g N^j) · β_g(e, p - 1)]                 (23)
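As with the inside probabilities, the recursion can be transcribed into code fairly directly. The sketch below is ours, reusing binary, lexical, and inside from the previous sketch; since this grammar has no rules of the form X → Y Y, the g ≠ j restriction in (23) never comes into play and is not encoded here. The final lines check the identity P(w_{1m} | G) = Σ_j α_j(k, k) · P(N^j → w_k) at k = 3.

```python
def outside(words, beta):
    """alpha[(j, p, q)] = P(w_1..w_{p-1}, N^j spans p..q, w_{q+1}..w_m | G)."""
    m = len(words)
    alpha = defaultdict(float)
    alpha[("S", 1, m)] = 1.0          # base case: alpha_1(1, m) = 1, 0 for other categories
    # Work top down, from the widest spans to the narrowest.
    for span in range(m - 1, 0, -1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (f, left, right), prob in binary.items():
                # Current node as the left child: parent N^f spans (p, e), sibling spans (q+1, e).
                for e in range(q + 1, m + 1):
                    alpha[(left, p, q)] += prob * alpha[(f, p, e)] * beta[(right, q + 1, e)]
                # Current node as the right child: parent N^f spans (e, q), sibling spans (e, p-1).
                for e in range(1, p):
                    alpha[(right, p, q)] += prob * alpha[(f, e, q)] * beta[(left, e, p - 1)]
    return alpha

words = "astronomers saw stars with ears".split()
beta = inside(words)
alpha = outside(words, beta)

# Check: P(w_1..w_m) = sum_j alpha_j(k, k) * P(N^j -> w_k), for any word position k.
k = 3  # "stars"
total = sum(alpha[(cat, k, k)] * prob
            for (cat, word), prob in lexical.items() if word == words[k - 1])
print(total)  # about 0.0015876, matching beta_S(1, 5)
```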

4 Finding the most probable parse

We define δ_j(p, q) as the highest probability of a parse of w_{pq} as N^j. What follows is a Viterbi-like algorithm for finding the best parse. Corresponding to Viterbi values we will have δ values; corresponding to a Viterbi backtrace we will have pointers to the subtrees that help us produce our highest-probability parse for each (category, start, end) triple.

1. Initialization:

      δ_i(p, p) = P(N^i → w_p)

2. Induction:

      δ_i(p, q) = max_{1≤j,k≤n, p≤r<q} P(N^i → N^j N^k) · δ_j(p, r) · δ_k(r + 1, q)

   Store backtrace:

      Ψ_i(p, q) = argmax_{(j,k,r)} P(N^i → N^j N^k) · δ_j(p, r) · δ_k(r + 1, q)

3. Termination: The probability of the maximum-probability tree t̂ is

      P(t̂) = δ_1(1, m)

   In addition we know the mother node and span of the highest-probability parse: N^1 and (1, m). In general, given a category and a span (i, p, q), we can construct the highest-probability parse tree below it by following the Ψ pointers. Let

      Ψ_i(p, q) = (j, k, r)

   Then

      left daughter  = N^j_{pr}
      right daughter = N^k_{(r+1)q}
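A direct transcription of this procedure into Python might look as follows (again reusing the binary and lexical dictionaries from the inside-algorithm sketch; viterbi_parse and build_tree are our own names). On the example sentence it recovers the t_1 analysis of Figure 1 with probability 0.0009072.

```python
def viterbi_parse(words):
    """CKY-style search for the highest-probability parse under the PCFG."""
    m = len(words)
    delta, psi = {}, {}
    # Initialization: delta_i(p, p) = P(N^i -> w_p).
    for p, w in enumerate(words, start=1):
        for (cat, word), prob in lexical.items():
            if word == w:
                delta[(cat, p, p)] = prob
                psi[(cat, p, p)] = w
    # Induction: maximize over rules N^i -> N^j N^k and split points r.
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (i, j, k), prob in binary.items():
                for r in range(p, q):
                    if (j, p, r) in delta and (k, r + 1, q) in delta:
                        score = prob * delta[(j, p, r)] * delta[(k, r + 1, q)]
                        if score > delta.get((i, p, q), 0.0):
                            delta[(i, p, q)] = score
                            psi[(i, p, q)] = (j, k, r)   # backtrace
    return delta, psi

def build_tree(psi, cat, p, q):
    """Follow the Psi backtraces to reconstruct the best tree below (cat, p, q)."""
    entry = psi[(cat, p, q)]
    if isinstance(entry, str):                 # lexical cell: (category, word)
        return (cat, entry)
    j, k, r = entry
    return (cat, build_tree(psi, j, p, r), build_tree(psi, k, r + 1, q))

delta, psi = viterbi_parse("astronomers saw stars with ears".split())
print(delta[("S", 1, 5)])           # about 0.0009072: probability of the best tree
print(build_tree(psi, "S", 1, 5))   # the t_1 analysis, as nested (category, ...) tuples
```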

Figure 1: t_1, the analysis in which the PP attaches inside the object NP, shown as a bracketing with each node annotated with the probability of the rule expanding it:

   [S 1.0 [NP 0.1 Astronomers] [VP 0.7 [V 1.0 saw] [NP 0.4 [NP 0.18 stars] [PP 1.0 [P 1.0 with] [NP 0.18 ears]]]]]

   P(t_1) = 1.0 · 0.1 · 0.7 · 1.0 · 0.4 · 0.18 · 1.0 · 1.0 · 0.18 = 0.0009072

Figure 2: t_2, the analysis in which the PP attaches to the VP:

   [S 1.0 [NP 0.1 Astronomers] [VP 0.3 [VP 0.7 [V 1.0 saw] [NP 0.18 stars]] [PP 1.0 [P 1.0 with] [NP 0.18 ears]]]]

   P(t_2) = 1.0 · 0.1 · 0.3 · 0.7 · 1.0 · 0.18 · 1.0 · 1.0 · 0.18 = 0.0006804

Figure 3: Computation of the inside probability β_j(p, q): N^j_{pq} is expanded by a rule N^j → N^r N^s, with N^r dominating w_p ... w_d and N^s dominating w_{d+1} ... w_q.

Figure 4: Computation of the outside probability (left category) α_j(p, q): under the root N^1, a node N^f_{pe} is expanded as N^j_{pq} N^g_{(q+1)e}; the words w_1 ... w_{p-1} and w_{e+1} ... w_m lie outside N^f.

Figure 5: Computation of the outside probability (right category) α_j(p, q): under the root N^1, a node N^f_{eq} is expanded as N^g_{e(p-1)} N^j_{pq}; the words w_1 ... w_{e-1} and w_{q+1} ... w_m lie outside N^f.