Probabilistic Context-Free Grammars. Michael Collins, Columbia University


Overview
- Probabilistic Context-Free Grammars (PCFGs)
- The CKY algorithm for parsing with PCFGs

A Probabilistic Context-Free Grammar (PCFG)

S  → NP VP    1.0
VP → Vi       0.4
VP → Vt NP    0.4
VP → VP PP    0.2
NP → DT NN    0.3
NP → NP PP    0.7
PP → P NP     1.0

Vi → sleeps      1.0
Vt → saw         1.0
NN → man         0.7
NN → woman       0.2
NN → telescope   0.1
DT → the         1.0
IN → with        0.5
IN → in          0.5

The probability of a tree t with rules α_1 → β_1, α_2 → β_2, ..., α_n → β_n is

  p(t) = ∏_{i=1}^{n} q(α_i → β_i)

where q(α → β) is the probability for rule α → β.
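As a minimal sketch (an assumed encoding, not from the lecture itself), the rule probabilities q can be stored in a dictionary keyed by (left-hand side, right-hand side), and a tree scored by multiplying the probabilities of the rules it uses:

```python
# Hypothetical encoding of part of the toy grammar: q(alpha -> beta) keyed by (alpha, beta).
q = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.3,
    ("VP", ("Vi",)):      0.4,
    ("DT", ("the",)):     1.0,
    ("NN", ("man",)):     0.7,
    ("Vi", ("sleeps",)):  1.0,
}

def tree_probability(rules_used):
    """p(t) = product of q(alpha -> beta) over the rules used in the tree t."""
    p = 1.0
    for rule in rules_used:
        p *= q[rule]
    return p

# Rules used in the derivation of "the man sleeps" under the grammar above:
rules_used = [
    ("S",  ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("DT", ("the",)),
    ("NN", ("man",)),
    ("VP", ("Vi",)),
    ("Vi", ("sleeps",)),
]
print(tree_probability(rules_used))  # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084
```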

DERIVATION        RULES USED     PROBABILITY
S                 S → NP VP      1.0
NP VP             NP → DT NN     0.3
DT NN VP          DT → the       1.0
the NN VP         NN → dog       0.1
the dog VP        VP → Vi        0.4
the dog Vi        Vi → laughs    0.5
the dog laughs
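The probability of this parse is the product of the rule probabilities used in the derivation: 1.0 × 0.3 × 1.0 × 0.1 × 0.4 × 0.5 = 0.006.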

Properties of PCFGs
- Assigns a probability to each left-most derivation, or parse tree, allowed by the underlying CFG.
- Say we have a sentence s, and the set of derivations for that sentence is T(s). Then a PCFG assigns a probability p(t) to each member of T(s), i.e., we now have a ranking of parse trees in order of probability.
- The most likely parse tree for a sentence s is arg max_{t ∈ T(s)} p(t).

Data for Parsing Experiments: Treebanks
- Penn WSJ Treebank = 50,000 sentences with associated trees
- Usual set-up: 40,000 training sentences, 2,400 test sentences
- An example tree (the full bracketing, with non-terminals such as S, NP, VP, PP, SBAR and part-of-speech tags such as NNP, VBD, DT, is not reproduced here) has the yield: "Canadian Utilities had 1988 revenue of C$ 1.16 billion, mainly from its natural gas and electric utility businesses in Alberta, where the company serves about 800,000 customers."

Deriving a PCFG from a Treebank
- Given a set of example trees (a treebank), the underlying CFG can simply be all rules seen in the corpus.
- Maximum-likelihood estimates:
    q_ML(α → β) = Count(α → β) / Count(α)
  where the counts are taken from a training set of example trees.
- If the training data is generated by a PCFG, then as the training data size goes to infinity, the maximum-likelihood PCFG will converge to the same distribution as the true PCFG.
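As a sketch of how these counts might be collected (with an assumed tree encoding, not the lecture's), trees can be stored as nested tuples and the counts accumulated recursively:

```python
from collections import Counter

def count_rules(tree, rule_counts, lhs_counts):
    """Count the rule alpha -> beta at this node, then recurse into non-terminal children."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, rule_counts, lhs_counts)

def estimate_pcfg(treebank):
    """q_ML(alpha -> beta) = Count(alpha -> beta) / Count(alpha)."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        count_rules(tree, rule_counts, lhs_counts)
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

# A toy treebank with a single tree for "the man sleeps":
toy_tree = ("S",
            ("NP", ("DT", "the"), ("NN", "man")),
            ("VP", ("Vi", "sleeps")))
print(estimate_pcfg([toy_tree]))  # every rule occurs exactly once, so all estimates are 1.0
```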

PCFGs
Booth and Thompson (1973) showed that a CFG with rule probabilities correctly defines a distribution over the set of derivations provided that:
1. The rule probabilities define conditional distributions over the different ways of rewriting each non-terminal.
2. A technical condition on the rule probabilities holds, ensuring that the probability of a derivation terminating in a finite number of steps is 1. (This condition is not really a practical concern.)

Parsing with a PCFG
- Given a PCFG and a sentence s, define T(s) to be the set of trees with s as the yield.
- Given a PCFG and a sentence s, how do we find
    arg max_{t ∈ T(s)} p(t)

Chomsky Normal Form
A context-free grammar G = (N, Σ, R, S) in Chomsky Normal Form is as follows:
- N is a set of non-terminal symbols
- Σ is a set of terminal symbols
- R is a set of rules which take one of two forms:
  - X → Y_1 Y_2 for X ∈ N, and Y_1, Y_2 ∈ N
  - X → Y for X ∈ N, and Y ∈ Σ
- S ∈ N is a distinguished start symbol
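As an aside (a standard construction, not stated on this slide): any CFG can be converted to an equivalent Chomsky Normal Form grammar, so restricting the parsing algorithm below to CNF loses no generality. For example, a hypothetical ternary rule VP → Vt NP PP can be replaced by VP → Vt @VP and @VP → NP PP, where @VP is a new non-terminal introduced for the binarization.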

A Dynamic Programming Algorithm
Given a PCFG and a sentence s, how do we find
  max_{t ∈ T(s)} p(t)
Notation:
- n = number of words in the sentence
- w_i = i-th word in the sentence
- N = the set of non-terminals in the grammar
- S = the start symbol in the grammar
Define a dynamic programming table
  π[i, j, X] = maximum probability of a constituent with non-terminal X spanning words i ... j inclusive
Our goal is to calculate max_{t ∈ T(s)} p(t) = π[1, n, S]

An Example: the dog saw the man with the telescope

A Dynamic Programming Algorithm
Base case definition: for all i = 1 ... n, for X ∈ N,
  π[i, i, X] = q(X → w_i)
(note: define q(X → w_i) = 0 if X → w_i is not in the grammar)
Recursive definition: for all i = 1 ... n, j = (i + 1) ... n, X ∈ N,
  π(i, j, X) = max_{X → Y Z ∈ R, s ∈ {i ... (j−1)}} ( q(X → Y Z) × π(i, s, Y) × π(s + 1, j, Z) )
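For example, under the toy grammar, π(1, 3, S) considers the rule S → NP VP with split points s = 1 and s = 2, taking the best combination of an NP over words 1 ... s and a VP over words (s + 1) ... 3; the split point s records where the left child ends and the right child begins.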

An Example
  π(i, j, X) = max_{X → Y Z ∈ R, s ∈ {i ... (j−1)}} ( q(X → Y Z) × π(i, s, Y) × π(s + 1, j, Z) )
the dog saw the man with the telescope

The Full Dynamic Programming Algorithm
Input: a sentence s = x_1 ... x_n, a PCFG G = (N, Σ, S, R, q).
Initialization: For all i ∈ {1 ... n}, for all X ∈ N,
  π(i, i, X) = q(X → x_i) if X → x_i ∈ R, 0 otherwise
Algorithm:
  For l = 1 ... (n − 1)
    For i = 1 ... (n − l)
      Set j = i + l
      For all X ∈ N, calculate
        π(i, j, X) = max_{X → Y Z ∈ R, s ∈ {i ... (j−1)}} ( q(X → Y Z) × π(i, s, Y) × π(s + 1, j, Z) )
      and
        bp(i, j, X) = arg max_{X → Y Z ∈ R, s ∈ {i ... (j−1)}} ( q(X → Y Z) × π(i, s, Y) × π(s + 1, j, Z) )
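The pseudocode above translates fairly directly into code. The following is a minimal sketch (an assumed implementation, not the lecture's code), with binary rules stored as a dictionary {(X, Y, Z): q} and lexical rules as {(X, word): q}:

```python
from collections import defaultdict

def cky(words, nonterminals, binary_rules, lexical_rules):
    """CKY for a PCFG in Chomsky Normal Form.

    binary_rules:  dict {(X, Y, Z): q} for rules X -> Y Z
    lexical_rules: dict {(X, word): q} for rules X -> word
    Returns the tables pi (max probabilities) and bp (backpointers).
    """
    n = len(words)
    pi = defaultdict(float)  # pi[(i, j, X)] = max prob of an X spanning words i..j (1-based)
    bp = {}                  # bp[(i, j, X)] = ((X, Y, Z), s) achieving that maximum

    # Group binary rules by their left-hand side for the inner loop below.
    rules_by_lhs = defaultdict(list)
    for (X, Y, Z), q in binary_rules.items():
        rules_by_lhs[X].append((Y, Z, q))

    # Initialization: pi(i, i, X) = q(X -> w_i), or 0 if the rule is absent.
    for i in range(1, n + 1):
        for X in nonterminals:
            pi[(i, i, X)] = lexical_rules.get((X, words[i - 1]), 0.0)

    # Main loop: spans of length l = 1 .. n-1.
    for l in range(1, n):
        for i in range(1, n - l + 1):
            j = i + l
            for X in nonterminals:
                best, best_bp = 0.0, None
                for Y, Z, q in rules_by_lhs[X]:
                    for s in range(i, j):
                        p = q * pi[(i, s, Y)] * pi[(s + 1, j, Z)]
                        if p > best:
                            best, best_bp = p, ((X, Y, Z), s)
                pi[(i, j, X)] = best
                if best_bp is not None:
                    bp[(i, j, X)] = best_bp

    return pi, bp
```

After running, pi[(1, n, S)] equals max_{t ∈ T(s)} p(t), and following the backpointers bp down from (1, n, S) recovers the highest-probability tree. The nested loops over span lengths, start positions, rules, and split points give roughly O(n³ · |R|) running time.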

A Dynamic Programming Algorithm for the Sum
Given a PCFG and a sentence s, how do we find
  Σ_{t ∈ T(s)} p(t)
Notation:
- n = number of words in the sentence
- w_i = i-th word in the sentence
- N = the set of non-terminals in the grammar
- S = the start symbol in the grammar
Define a dynamic programming table
  π[i, j, X] = sum of probabilities for constituents with non-terminal X spanning words i ... j inclusive
Our goal is to calculate Σ_{t ∈ T(s)} p(t) = π[1, n, S]
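Computationally, only one change to the CKY sketch above is needed: replace the max over rules and split points with a sum (accumulate each product q(X → Y Z) × π(i, s, Y) × π(s + 1, j, Z) into a running total, and drop the backpointers). The resulting quantity π[1, n, S] is the total probability of the sentence under the PCFG; this variant is commonly known as the inside algorithm.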

Summary
- PCFGs augment CFGs by including a probability for each rule in the grammar. The probability of a parse tree is the product of the probabilities of the rules in the tree.
- To build a PCFG-based parser:
  1. Learn a PCFG from a treebank.
  2. Given a test sentence, use the CKY algorithm to compute the highest-probability tree for the sentence under the PCFG.