Parsing: Probabilistic CFG (PCFG)
Laura Kallmeyer
Heinrich-Heine-Universität Düsseldorf
Winter 2017/18

Table of contents
1 Introduction
2 PCFG
3 Inside and outside probability
4 Parsing
Jurafsky and Martin (2009). (Some of the slides are due to Wolfgang Maier.)

Data-Driven Parsing
Linguistic grammars do not have to be created manually. Another way to obtain a grammar is to interpret the syntactic structures in a treebank as the derivations of a latent grammar and to use an appropriate algorithm for grammar extraction.
One can also estimate occurrence probabilities for the rules of such a grammar. These can be used to determine the best parse or parses of a sentence. Furthermore, rule probabilities can serve to speed up parsing.
Parsing with a probabilistic grammar obtained from a treebank is called data-driven parsing.

PCFG (1)
In most cases, probabilistic CFGs are used for data-driven parsing.
PCFG: A Probabilistic Context-Free Grammar (PCFG) is a tuple $G_P = (N, T, P, S, p)$ where $(N, T, P, S)$ is a CFG and $p : P \to [0,1]$ is a function such that for all $A \in N$:
$$\sum_{A \to \alpha \in P} p(A \to \alpha) = 1$$
Here $[0,1]$ denotes $\{x \in \mathbb{R} \mid 0 \le x \le 1\}$, and $p(A \to \alpha)$ is the conditional probability $p(A \to \alpha \mid A)$.


PCFG (2)
Example PCFG, start symbol VP:
0.8  VP → V NP
0.2  VP → VP PP
1    NP → Det N
1    PP → P NP
0.1  N → N PP
0.6  N → man
0.3  N → telescope
1    V → sees
1    Det → the
1    P → with
Probability of a parse tree: the product of the probabilities of the rules used to generate the parse tree.
Probability of a category A spanning a string w: the sum of the probabilities of all parse trees with root label A and yield w.
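Purely as an illustration (not part of the original slides), the grammar above can be written down as a small Python dictionary and checked for per-symbol normalization; the data layout and function name below are my own choices:

```python
from collections import defaultdict

# Rules of the example PCFG: (lhs, rhs) -> probability.
# Terminal rules have a one-element rhs.
RULES = {
    ("VP", ("V", "NP")): 0.8,
    ("VP", ("VP", "PP")): 0.2,
    ("NP", ("Det", "N")): 1.0,
    ("PP", ("P", "NP")): 1.0,
    ("N", ("N", "PP")): 0.1,
    ("N", ("man",)): 0.6,
    ("N", ("telescope",)): 0.3,
    ("V", ("sees",)): 1.0,
    ("Det", ("the",)): 1.0,
    ("P", ("with",)): 1.0,
}

def check_normalization(rules):
    """For every left-hand side A, the probabilities of all A-rules must sum to 1."""
    totals = defaultdict(float)
    for (lhs, _), prob in rules.items():
        totals[lhs] += prob
    return {lhs: abs(total - 1.0) < 1e-9 for lhs, total in totals.items()}

print(check_normalization(RULES))
# {'VP': True, 'NP': True, 'PP': True, 'N': True, 'V': True, 'Det': True, 'P': True}
```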


PCFG (3)
Parse tree probability (grammar as on the previous slide), input: sees the man with the telescope.
Two parse trees with root label VP:
t1: [VP [VP [V sees] [NP [Det the] [N man]]] [PP [P with] [NP [Det the] [N telescope]]]]  (PP attached to the VP)
t2: [VP [V sees] [NP [Det the] [N [N man] [PP [P with] [NP [Det the] [N telescope]]]]]]  (PP attached to the N)
P(t1) = 0.6 · 0.8 · 0.2 · 0.3 = 0.0288
P(t2) = 0.6 · 0.8 · 0.1 · 0.3 = 0.0144
(Rules with probability 1 are omitted from the products.)
p(VP, sees the man with the telescope) = 0.0288 + 0.0144 = 0.0432
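A minimal sketch (my own addition, assuming a tree is represented simply by the list of rules it uses) that reproduces the two tree probabilities and their sum:

```python
from functools import reduce

# Probabilities of the example grammar rules, written as plain strings.
RULE_PROB = {
    "VP -> V NP": 0.8, "VP -> VP PP": 0.2, "NP -> Det N": 1.0,
    "PP -> P NP": 1.0, "N -> N PP": 0.1, "N -> man": 0.6,
    "N -> telescope": 0.3, "V -> sees": 1.0, "Det -> the": 1.0,
    "P -> with": 1.0,
}

def tree_probability(rules_used):
    """Probability of a parse tree = product over the rules used in the tree."""
    return reduce(lambda acc, r: acc * RULE_PROB[r], rules_used, 1.0)

# Rules used in t1 (PP attached to the VP) and in t2 (PP attached to the N).
t1 = ["VP -> VP PP", "VP -> V NP", "V -> sees", "NP -> Det N", "Det -> the",
      "N -> man", "PP -> P NP", "P -> with", "NP -> Det N", "Det -> the",
      "N -> telescope"]
t2 = ["VP -> V NP", "V -> sees", "NP -> Det N", "Det -> the", "N -> N PP",
      "N -> man", "PP -> P NP", "P -> with", "NP -> Det N", "Det -> the",
      "N -> telescope"]

p1, p2 = tree_probability(t1), tree_probability(t2)
print(round(p1, 4), round(p2, 4), round(p1 + p2, 4))   # 0.0288 0.0144 0.0432
```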

PCFG (4)
Probabilities of leftmost derivations:
Probability of a leftmost derivation. Let $G = (N, T, P, S, p)$ be a PCFG, and let $\alpha, \gamma \in (N \cup T)^*$.
Let $A \to \beta \in P$. The probability of the leftmost derivation step $\alpha \Rightarrow_l^{A \to \beta} \gamma$ is
$$p(\alpha \Rightarrow_l^{A \to \beta} \gamma) = p(A \to \beta)$$
Let $A_1 \to \beta_1, \dots, A_m \to \beta_m \in P$, $m \in \mathbb{N}$. The probability of the leftmost derivation $\alpha \Rightarrow_l^{A_1 \to \beta_1} \dots \Rightarrow_l^{A_m \to \beta_m} \gamma$ is
$$p(\alpha \Rightarrow_l^{A_1 \to \beta_1} \dots \Rightarrow_l^{A_m \to \beta_m} \gamma) = \prod_{i=1}^{m} p(A_i \to \beta_i)$$

PCFG (5)
The probability of leftmost deriving $\gamma$ from $\alpha$, $\alpha \Rightarrow_l^* \gamma$, is defined as the sum over the probabilities of all leftmost derivations of $\gamma$ from $\alpha$:
$$p(\alpha \Rightarrow_l^* \gamma) = \sum_{i=1}^{k} \prod_{j=1}^{m_i} p(A_j^i \to \beta_j^i)$$
where $k \in \mathbb{N}$ is the number of leftmost derivations of $\gamma$ from $\alpha$, $m_i \in \mathbb{N}$ is the length of the $i$-th derivation, and $A_j^i \to \beta_j^i$ is the $j$-th derivation step of the $i$-th leftmost derivation.
In the following, the subscript $l$ is omitted; for probabilities, derivations are identified with the corresponding leftmost derivation.


PCFG (6)
Consistent PCFG: A PCFG is consistent if the sum of the probabilities of all sentences in the language equals 1.
Example of an inconsistent PCFG:
0.4  S → A
0.6  S → B
1    A → a
1    B → B
Problem: probability mass disappears into infinite derivations.
$$\sum_{w \in L(G)} p(w) = p(a) = 0.4$$
PCFGs estimated from treebanks are usually consistent.

Inside and outside probability (1)
Given a PCFG and an input $w = w_1 \dots w_n$, determine the likelihood of $w$, i.e., compute $\sum_{t \in T(w)} P(t)$, where $T(w)$ is the set of parse trees of $w$.
We don't want to compute the probability of every parse tree separately and then sum over the results. This is too expensive. Instead, we adopt a computation with tabulation, in order to share the results for common subtrees.

Inside and outside probability (2)
Idea: We fill an $|N| \times |w| \times |w|$ matrix $\alpha$ where the first dimension is the id of a non-terminal and the second and third are the start and end indices of a span. $\alpha_{A,i,j}$ gives the probability of deriving $w_i \dots w_j$ from $A$ or, put differently, of a parse tree with root label $A$ and yield $w_i \dots w_j$:
$$\alpha_{A,i,j} = P(A \Rightarrow^* w_i \dots w_j \mid A)$$
Inside computation:
1. for all $1 \le i \le |w|$ and $A \in N$: if $A \to w_i \in P$, then $\alpha_{A,i,i} = p(A \to w_i)$, else $\alpha_{A,i,i} = 0$
2. for all $1 \le i < j \le |w|$ and $A \in N$:
$$\alpha_{A,i,j} = \sum_{A \to BC \in P} \sum_{k=i}^{j-1} p(A \to BC)\,\alpha_{B,i,k}\,\alpha_{C,k+1,j}$$
We have in particular $\alpha_{S,1,|w|} = P(w)$.
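As a concrete illustration (not part of the slides), here is a small Python sketch of the inside computation; it uses the example grammar of the next slide and reproduces the value $P(aaaa) = 0.0387$ computed there:

```python
from collections import defaultdict

# Example PCFG from the following slide (lexical rules A -> a, binary rules A -> B C).
LEXICAL = {("S", "a"): 0.1, ("A", "a"): 1.0}
BINARY = {("S", "A", "S"): 0.3, ("S", "A", "X"): 0.6, ("X", "S", "A"): 1.0}
NONTERMS = {"S", "A", "X"}

def inside(words):
    """Inside probabilities: alpha[(A, i, j)] = P(A =>* w_i ... w_j), 1-based indices."""
    n = len(words)
    alpha = defaultdict(float)
    for i, w in enumerate(words, start=1):              # length-1 spans (step 1)
        for a in NONTERMS:
            alpha[(a, i, i)] = LEXICAL.get((a, w), 0.0)
    for length in range(2, n + 1):                      # longer spans (step 2)
        for i in range(1, n - length + 2):
            j = i + length - 1
            for (a, b, c), p in BINARY.items():
                for k in range(i, j):
                    alpha[(a, i, j)] += p * alpha[(b, i, k)] * alpha[(c, k + 1, j)]
    return alpha

alpha = inside(["a", "a", "a", "a"])
print(round(alpha[("S", 1, 4)], 4))   # 0.0387 = P(aaaa)
```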


Inside and outside probability (3)
Inside computation for the PCFG
0.3: S → A S,  0.6: S → A X,  0.1: S → a,  1: X → S A,  1: A → a
and the input $w = aaaa$. The chart cell for span $(i, j)$ contains the non-zero inside probabilities $\alpha_{A,i,j}$:
spans of length 1, i.e., $(1,1), \dots, (4,4)$:  $\alpha_A = 1$,  $\alpha_S = 0.1$
spans of length 2, i.e., $(1,2), (2,3), (3,4)$:  $\alpha_S = 0.03$,  $\alpha_X = 0.1$
spans of length 3, i.e., $(1,3), (2,4)$:  $\alpha_S = 0.069$,  $\alpha_X = 0.03$
span $(1,4)$:  $\alpha_S = 0.0387$,  $\alpha_X = 0.069$
$P(aaaa) = \alpha_{S,1,4} = 0.0387$

Inside and outside probability (4)
We can also compute the outside probability of a given non-terminal A with a span from i to j.
Inside: sum over all possibilities for the tree below A (spanning from i to j).
Outside: sum over all possibilities for the part of the parse tree outside the tree below A, i.e., over all possibilities to complete an ⟨A, i, j⟩ tree into a parse tree for the entire sentence.
[Figure: a parse tree with a node A spanning positions i to j; the subtree below A corresponds to the inside probability $\alpha_{A,i,j}$, the remainder of the tree to the outside probability $\beta_{A,i,j}$.]

Inside and outside probability (5)
We fill an $|N| \times |w| \times |w|$ matrix $\beta$ such that $\beta_{A,i,j}$ gives the probability of deriving $w_1 \dots w_{i-1} A w_{j+1} \dots w_{|w|}$ from $S$ or, put differently, of deriving a tree with root label $S$ and yield $w_1 \dots w_{i-1} A w_{j+1} \dots w_{|w|}$:
$$\beta_{A,i,j} = P(S \Rightarrow^* w_1 \dots w_{i-1} A w_{j+1} \dots w_{|w|} \mid S)$$
We need the inside probabilities in order to compute the outside probabilities.
Outside computation:
1. $\beta_{S,1,|w|} = 1$ and $\beta_{A,1,|w|} = 0$ for all $A \neq S$
2. for all $1 \le i \le j \le |w|$ (except $i = 1$, $j = |w|$) and $A \in N$:
$$\beta_{A,i,j} = \sum_{B \to AC \in P} \sum_{k=j+1}^{|w|} p(B \to AC)\,\beta_{B,i,k}\,\alpha_{C,j+1,k} \;+\; \sum_{B \to CA \in P} \sum_{k=1}^{i-1} p(B \to CA)\,\beta_{B,k,j}\,\alpha_{C,k,i-1}$$


Inside and outside probability (6)
Outside computation for the PCFG
0.3: S → A S,  0.6: S → A X,  0.1: S → a,  1: X → S A,  1: A → a
and the input $w = aaa$. The chart cell for span $(i, j)$ contains the outside probabilities $\beta_{A,i,j}$:
$(1,3)$:  $\beta_S = 1$,  $\beta_A = 0$,  $\beta_X = 0$
$(2,3)$:  $\beta_S = 0.3$,  $\beta_A = 0$,  $\beta_X = 0.6$
$(3,3)$:  $\beta_S = 0.09$,  $\beta_A = 0.06$,  $\beta_X = 0.18$
$(1,2)$:  $\beta_S = 0$,  $\beta_A = 0.03$,  $\beta_X = 0$
$(2,2)$:  $\beta_S = 0.6$,  $\beta_A = 0.009$,  $\beta_X = 0$
$(1,1)$:  $\beta_S = 0$,  $\beta_A = 0.069$,  $\beta_X = 0$
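The following Python sketch (again my own illustration, not from the slides) implements the outside recursion for this grammar. The inside values for $w = aaa$ are hard-coded from the inside example above; the script reproduces the outside chart and checks that $\sum_A \alpha_{A,i,i}\,\beta_{A,i,i} = P(w) = 0.069$ for every position $i$ (see the identity on the next slide):

```python
from collections import defaultdict

# Binary rules A -> B C of the example grammar.
BINARY = {("S", "A", "S"): 0.3, ("S", "A", "X"): 0.6, ("X", "S", "A"): 1.0}
NONTERMS = {"S", "A", "X"}

# Inside probabilities for w = aaa, taken from the inside example above
# (only non-zero entries; keys are (A, i, j) with 1-based word positions).
ALPHA = defaultdict(float, {
    ("A", 1, 1): 1.0, ("S", 1, 1): 0.1,
    ("A", 2, 2): 1.0, ("S", 2, 2): 0.1,
    ("A", 3, 3): 1.0, ("S", 3, 3): 0.1,
    ("S", 1, 2): 0.03, ("X", 1, 2): 0.1,
    ("S", 2, 3): 0.03, ("X", 2, 3): 0.1,
    ("S", 1, 3): 0.069, ("X", 1, 3): 0.03,
})

def outside(alpha, n, start="S"):
    """Outside probabilities beta[(A, i, j)], computed top-down over span lengths."""
    beta = defaultdict(float)
    for a in NONTERMS:                           # step 1: the whole-sentence span
        beta[(a, 1, n)] = 1.0 if a == start else 0.0
    for length in range(n - 1, 0, -1):           # shorter spans need longer ones
        for i in range(1, n - length + 2):
            j = i + length - 1
            for (b, left, right), p in BINARY.items():
                # left child of B -> left right, sibling covers j+1 .. k
                for k in range(j + 1, n + 1):
                    beta[(left, i, j)] += p * beta[(b, i, k)] * alpha[(right, j + 1, k)]
                # right child of B -> left right, sibling covers k .. i-1
                for k in range(1, i):
                    beta[(right, i, j)] += p * beta[(b, k, j)] * alpha[(left, k, i - 1)]
    return beta

beta = outside(ALPHA, n=3)
print(round(beta[("A", 3, 3)], 4))   # 0.06
print(round(beta[("A", 1, 1)], 4))   # 0.069
# Sanity check: sum_A alpha[A,i,i] * beta[A,i,i] = P(w) = 0.069 for every i.
for i in (1, 2, 3):
    print(round(sum(ALPHA[(a, i, i)] * beta[(a, i, i)] for a in NONTERMS), 4))
```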

Inside and outside probability (7)
The following holds:
1. The probability of a parse tree for $w$ with a node labeled $A$ that spans $w_i \dots w_j$ is
$$P(S \Rightarrow^* w_1 \dots w_{i-1} A w_{j+1} \dots w_n \Rightarrow^* w_1 \dots w_n) = \alpha_{A,i,j}\,\beta_{A,i,j}$$
2. In particular: $P(w) = \alpha_{S,1,|w|}$

Parsing (1)
In PCFG parsing, we want to compute the most probable parse tree (= the most probable (leftmost) derivation) for a given input sentence $w$, also called the Viterbi parse.
This means that we are disambiguating: among several readings, we search for the best one. Sometimes the $k$ best are searched for ($k > 1$).
During parsing, we must make sure that updates on probabilities (because a better derivation has been found for a non-terminal) do not require updates on other parts of the chart: the order should be such that an item is used within a derivation only once its final probability has been reached.

Parsing (2)
We can extend the symbolic CYK parser to a probabilistic one. Instead of summing over all derivations (as in the computation of the inside probability), we keep the best one (Viterbi algorithm).
Assume a three-dimensional chart C indexed by (non-terminal, start index, length).
C_{A,i,l} := 0 for all A, i, l
C_{A,i,1} := p if p : A → w_i ∈ P   (scan)
for all l ∈ [1..n]:
    for all i ∈ [1..n−l+1]:
        for every p : A → B C ∈ P:
            for every l_1 ∈ [1..l−1]:
                C_{A,i,l} = max{C_{A,i,l}, p · C_{B,i,l_1} · C_{C,i+l_1,l−l_1}}   (complete)
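A compact Python sketch of this probabilistic CYK (my own illustration; the rule encoding is an assumption, not from the slides), run on the telescope grammar from the PCFG example earlier. The best VP analysis of "sees the man with the telescope" comes out with probability 0.0288, as computed before:

```python
from collections import defaultdict

# Telescope grammar: lexical rules A -> a and binary rules A -> B C.
LEXICAL = {("V", "sees"): 1.0, ("Det", "the"): 1.0, ("P", "with"): 1.0,
           ("N", "man"): 0.6, ("N", "telescope"): 0.3}
BINARY = {("VP", "V", "NP"): 0.8, ("VP", "VP", "PP"): 0.2, ("NP", "Det", "N"): 1.0,
          ("PP", "P", "NP"): 1.0, ("N", "N", "PP"): 0.1}

def viterbi_cky(words):
    """C[(A, i, l)] = probability of the best parse of words[i-1 : i-1+l] with root A."""
    n = len(words)
    C = defaultdict(float)                            # missing entries count as 0
    for i, w in enumerate(words, start=1):            # scan
        for (a, word), p in LEXICAL.items():
            if word == w:
                C[(a, i, 1)] = p
    for l in range(2, n + 1):                         # complete
        for i in range(1, n - l + 2):
            for (a, b, c), p in BINARY.items():
                for l1 in range(1, l):
                    cand = p * C[(b, i, l1)] * C[(c, i + l1, l - l1)]
                    C[(a, i, l)] = max(C[(a, i, l)], cand)
    return C

chart = viterbi_cky("sees the man with the telescope".split())
print(round(chart[("VP", 1, 6)], 4))   # 0.0288 (best parse: PP attached to the VP)
```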

Parsing (3)
We extend this to a parser. The parser can also deal with unary productions A → B.
Every chart field has three components: the probability, the rule that has been used and, if the rule is binary, the length l_1 of the first right-hand side element.
We assume that the grammar does not contain any loops A ⇒⁺ A.

Parsing (4)
C_{A,i,1} = ⟨p, A → w_i, −⟩ if p : A → w_i ∈ P   (scan)
for all l ∈ [1..n] and for all i ∈ [1..n−l+1]:
    for all p : A → B C ∈ P and for all l_1 ∈ [1..l−1]:   (binary complete)
        if C_{B,i,l_1} ≠ ∅ and C_{C,i+l_1,l−l_1} ≠ ∅ then:
            p_new = p · C_{B,i,l_1}[1] · C_{C,i+l_1,l−l_1}[1]
            if C_{A,i,l} = ∅ or C_{A,i,l}[1] < p_new then:
                C_{A,i,l} = ⟨p_new, A → B C, l_1⟩
    repeat until C does not change any more:   (unary complete)
        for every p : A → B ∈ P:
            if C_{B,i,l} ≠ ∅ then:
                p_new = p · C_{B,i,l}[1]
                if C_{A,i,l} = ∅ or C_{A,i,l}[1] < p_new then:
                    C_{A,i,l} = ⟨p_new, A → B, −⟩
return build_tree(S, 1, n)
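A Python sketch of the full parser (my own illustration; the names and the chart encoding are assumptions). It adds unary completion and backpointers and, run on the example of the next slide, reproduces the final chart entry ⟨.09, VP → V NP, 1⟩ for the whole input together with the corresponding tree:

```python
# Example grammar from the next slide.
LEXICAL = {("V", "sees"): 0.3, ("V", "comes"): 0.4, ("V", "eats"): 0.3,
           ("Det", "this"): 1.0, ("N", "apple"): 0.5, ("N", "morning"): 0.5}
UNARY = {("VP", "V"): 0.3}                                # A -> B
BINARY = {("VP", "VP", "NP"): 0.1, ("VP", "V", "NP"): 0.6,
          ("NP", "Det", "N"): 1.0}                        # A -> B C

def parse(words):
    """Chart entries C[(A, i, l)] = (prob, rule, l1); l1 is None for non-binary rules."""
    n = len(words)
    C = {}
    for i, w in enumerate(words, start=1):                            # scan
        for (a, word), p in LEXICAL.items():
            if word == w and C.get((a, i, 1), (0,))[0] < p:
                C[(a, i, 1)] = (p, (a, w), None)
    for l in range(1, n + 1):
        for i in range(1, n - l + 2):
            for (a, b, c), p in BINARY.items():                       # binary complete
                for l1 in range(1, l):
                    if (b, i, l1) in C and (c, i + l1, l - l1) in C:
                        p_new = p * C[(b, i, l1)][0] * C[(c, i + l1, l - l1)][0]
                        if C.get((a, i, l), (0,))[0] < p_new:
                            C[(a, i, l)] = (p_new, (a, b, c), l1)
            changed = True                                            # unary complete
            while changed:
                changed = False
                for (a, b), p in UNARY.items():
                    if (b, i, l) in C:
                        p_new = p * C[(b, i, l)][0]
                        if C.get((a, i, l), (0,))[0] < p_new:
                            C[(a, i, l)] = (p_new, (a, b), None)
                            changed = True
    return C

def build_tree(C, a, i, l):
    """Read the best tree off the backpointers."""
    prob, rule, l1 = C[(a, i, l)]
    if l1 is not None:                      # binary rule (A, B, C)
        _, b, c = rule
        return (a, build_tree(C, b, i, l1), build_tree(C, c, i + l1, l - l1))
    if rule in UNARY:                       # unary rule (A, B)
        return (a, build_tree(C, rule[1], i, l))
    return (a, rule[1])                     # lexical rule (A, word)

chart = parse("eats this morning".split())
print(chart[("VP", 1, 3)])            # (0.09, ('VP', 'V', 'NP'), 1)
print(build_tree(chart, "VP", 1, 3))  # ('VP', ('V', 'eats'), ('NP', ('Det', 'this'), ('N', 'morning')))
```

Note that the entry ⟨.0045, VP → VP NP, 1⟩ is computed first and then overwritten, exactly as in the chart revision shown on the next slide.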


Parsing (5)
Example grammar, start symbol VP:
.1  VP → VP NP
.6  VP → V NP
.3  VP → V
1   NP → Det N
.5  N → apple
.5  N → morning
.3  V → sees
.4  V → comes
.3  V → eats
1   Det → this
Input: w = eats this morning. The chart is filled as follows (cells given as (i, l)):
(1,1): ⟨.3, V → eats⟩ and, by unary completion, ⟨.09, VP → V⟩
(2,1): ⟨1, Det → this⟩
(3,1): ⟨.5, N → morning⟩
(2,2): ⟨.5, NP → Det N, 1⟩
(1,3): first ⟨.0045, VP → VP NP, 1⟩, then revised to ⟨.09, VP → V NP, 1⟩
(The analysis of the VP spanning the whole input gets revised since a better parse tree has been found.)

Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall Series in Artificial Intelligence. Pearson Education International, second edition.