Probabilistic Context-free Grammars


Probabilistic Context-free Grammars. Computational Linguistics. Alexander Koller. 24 November 2017.

The CKY Recognizer. Example grammar: S → NP VP, NP → Det N, VP → V NP, V → ate, NP → John, Det → a, N → sandwich. [Parse chart for "John ate a sandwich", columns i = 1…4, rows k = 2…5, with entries such as NP (John), V (ate), Det (a), N (sandwich), NP (a sandwich), S (John ate a sandwich).] Cell at column i, row k: Ch(i,k) = { A | A ⇒* w_i … w_{k-1} }.

Recognizer to Parser. Parser: we need to construct parse trees from the chart. Do this by memorizing how each A ∈ Ch(i,k) can be constructed from smaller parts: if A was built from B ∈ Ch(i,j) and C ∈ Ch(j,k) using a rule A → B C, store (B, C, j) as a backpointer for A in Ch(i,k) (analogous to backpointers in HMMs). Once the chart has been filled, enumerate trees recursively by following backpointers, starting at S ∈ Ch(1, n+1); a sketch of this tree extraction is shown below.
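A minimal Python sketch of the backpointer-following step. The data layout is hypothetical (my own, not from the slides): backpointers[(A, i, k)] holds a list of (B, C, j) splits recorded during CKY, and is empty or missing for lexical cells.

def extract_trees(A, i, k, backpointers, words):
    # Enumerate all parse trees rooted in A over the span (i, k).
    # words is a 0-based list of tokens, so w_i is words[i - 1].
    splits = backpointers.get((A, i, k), [])
    if not splits:                              # lexical cell: A covers w_i ... w_{k-1}
        return [(A, " ".join(words[i - 1:k - 1]))]
    trees = []
    for (B, C, j) in splits:                    # A was built from B in Ch(i,j), C in Ch(j,k)
        for left in extract_trees(B, i, j, backpointers, words):
            for right in extract_trees(C, j, k, backpointers, words):
                trees.append((A, left, right))  # tree as nested (label, children...) tuples
    return trees

All parses of the sentence are then extract_trees("S", 1, len(words) + 1, backpointers, words).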

Let's play a game. Given a nonterminal symbol, expand it. You can take one of two moves: expand the nonterminal into a sequence of other nonterminals (use the nonterminals S, NP, VP, PP, or POS tags), or expand the nonterminal into a word.

Penn Treebank POS tags

Some real trees. [Drawings of Penn Treebank trees #0001 and #0002.] nltk.corpus.treebank.parsed_sents("wsj_0001.mrg")[0].draw()
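A minimal way to reproduce this, assuming the NLTK sample of the Penn Treebank has been installed (nltk.download("treebank")):

import nltk

# Assumes the "treebank" sample corpus is installed: nltk.download("treebank")
tree = nltk.corpus.treebank.parsed_sents("wsj_0001.mrg")[0]
print(tree)    # bracketed Penn Treebank notation
tree.draw()    # opens a window with the drawn tree, as shown on the slide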

Ambiguity. Need to disambiguate: find the correct parse tree for an ambiguous sentence. [Two parse trees for "I shot an elephant in my pyjamas": one attaches the PP "in my pyjamas" inside the object NP, the other attaches it to the VP.] How do we identify the correct tree? How do we compute it efficiently? (Remember: exponential number of readings.)

Probabilistic CFGs. A probabilistic context-free grammar (PCFG) is a context-free grammar in which each production rule A → w has a probability P(A → w | A): when we expand A, how likely is it that we choose A → w? For each nonterminal A, the probabilities must sum to one: Σ_w P(A → w | A) = 1. We will write P(A → w) instead of P(A → w | A) for short.

An example. S → NP VP [1.0], VP → V NP [0.5], VP → VP PP [0.5], NP → Det N [0.8], NP → i [0.2], V → shot [1.0], N → N PP [0.4], N → elephant [0.3], N → pyjamas [0.3], PP → P NP [1.0], P → in [1.0], Det → an [0.5], Det → my [0.5]. (Let's pretend for simplicity that Det = PRP$.)
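As a sanity check, here is a small Python sketch (the dictionary encoding is my own, not from the slides) that stores this grammar and verifies that the rule probabilities of each nonterminal sum to one:

# Example PCFG from the slide: rules[lhs] maps each right-hand side to its probability.
rules = {
    "S":   {("NP", "VP"): 1.0},
    "VP":  {("V", "NP"): 0.5, ("VP", "PP"): 0.5},
    "NP":  {("Det", "N"): 0.8, ("i",): 0.2},
    "N":   {("N", "PP"): 0.4, ("elephant",): 0.3, ("pyjamas",): 0.3},
    "PP":  {("P", "NP"): 1.0},
    "V":   {("shot",): 1.0},
    "P":   {("in",): 1.0},
    "Det": {("an",): 0.5, ("my",): 0.5},
}

for lhs, expansions in rules.items():
    total = sum(expansions.values())
    assert abs(total - 1.0) < 1e-9, f"probabilities for {lhs} sum to {total}"
print("all nonterminals OK")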

Generative process. A PCFG generates random derivations of the CFG; each event (expanding a nonterminal by a production rule) is statistically independent of all the others. For example, S ⇒ NP VP ⇒ i VP ⇒ i VP PP ⇒* i shot an elephant in my pyjamas (rule probabilities 1.0, 0.2, 0.5, …) has probability 0.00072, while the derivation S ⇒ NP VP ⇒ i VP ⇒* i V Det N ⇒ i V Det N PP ⇒* i shot an elephant in my pyjamas, which attaches the PP inside the object via N → N PP (0.4), has probability 0.00057.

Parse trees. [Tree t1 attaches the PP "in my pyjamas" to the VP; tree t2 attaches it to the object noun.] The probability of a parse tree is the product of the probabilities of the rules used in it: P(t1) = 0.00072, P(t2) = 0.00057. Correct = more probable parse tree.
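To make the arithmetic concrete, here is a small Python sketch (the rule lists are simply the rules used in each tree, read off the slide) that multiplies the rule probabilities of both trees:

import math

# Rule probabilities used in t1 (PP attached to the VP):
t1_rules = [1.0,  # S -> NP VP
            0.2,  # NP -> i
            0.5,  # VP -> VP PP
            0.5,  # VP -> V NP
            1.0,  # V -> shot
            0.8,  # NP -> Det N   (an elephant)
            0.5,  # Det -> an
            0.3,  # N -> elephant
            1.0,  # PP -> P NP
            1.0,  # P -> in
            0.8,  # NP -> Det N   (my pyjamas)
            0.5,  # Det -> my
            0.3]  # N -> pyjamas

# Rules used in t2: the same, except the PP is attached via N -> N PP (0.4)
# instead of VP -> VP PP (0.5).
t2_rules = [1.0, 0.2, 0.5, 1.0, 0.8, 0.5, 0.4, 0.3, 1.0, 1.0, 0.8, 0.5, 0.3]

print(math.prod(t1_rules))   # ~0.00072
print(math.prod(t2_rules))   # ~0.000576 (shown rounded as 0.00057 on the slide)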

Language modeling. As with other generative models (HMMs!), we can define the probability P(w) of a string by marginalizing over its possible parses: P(w) = Σ_{t ∈ parses(w)} P(t). This can be computed efficiently with inside probabilities; see next time.

Disambiguation. Assumption: the correct parse tree is the parse tree that had the highest probability of being generated by the random process, i.e. argmax_{t ∈ parses(w)} P(t). We use a variant of the Viterbi algorithm to compute it. Here, Viterbi is based on CKY; it can be done with other parsing algorithms too.

The intuition. Ordinary CKY parse chart: Ch(i,k) = { A | A ⇒* w_i … w_{k-1} }. [Chart for "shot an elephant in my pyjamas" with cells such as V (shot), Det (an), N (elephant), PP (in my pyjamas), NP (an elephant), N (elephant in my pyjamas), VP (shot an elephant), NP (an elephant in my pyjamas), VP (shot an elephant in my pyjamas).]

The intuition. Viterbi CKY parse chart: Ch(i,k) = { (A, p) | p = max_{d : A ⇒* w_i … w_{k-1}} P(d) }. [Same chart with the best probabilities attached: V: 1.0 (shot), Det: 0.5 (an), N: 0.3 (elephant), PP: 0.12 (in my pyjamas), NP: 0.12 (an elephant), N: 0.014 (elephant in my pyjamas), VP: 0.06 (shot an elephant), NP: 0.006 (an elephant in my pyjamas), VP: 0.0036 (shot an elephant in my pyjamas).]

Viterbi CKY. Define, for each span (i,k) and each nonterminal A, the probability V(A, i, k) = max_{d : A ⇒* w_i … w_{k-1}} P(d). Compute V iteratively, inside out, i.e. starting from small spans and working our way up to longer spans: V(A, i, i+1) = P(A → w_i) and V(A, i, k) = max_{A → B C, i < j < k} P(A → B C) · V(B, i, j) · V(C, j, k).

Viterbi CKY - pseudocode

set all V[A,i,k] to 0
for all i from 1 to n:
    for all A with a rule A -> wi:
        add A to Ch(i, i+1)
        V[A, i, i+1] = P(A -> wi)
for all b from 2 to n:                       # length of the span
    for all i from 1 to n-b+1:               # start of the span
        for all k from 1 to b-1:             # offset of the split point
            for all B in Ch(i, i+k) and C in Ch(i+k, i+b):
                for all production rules A -> B C:
                    add A to Ch(i, i+b)
                    if P(A -> B C) * V[B, i, i+k] * V[C, i+k, i+b] > V[A, i, i+b]:
                        V[A, i, i+b] = P(A -> B C) * V[B, i, i+k] * V[C, i+k, i+b]
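For concreteness, here is a compact runnable Python version of the same algorithm. It is my own sketch: the function name and the dictionary encodings of the binary and lexical rules are assumptions, not code from the lecture.

from collections import defaultdict

def viterbi_cky(words, binary_rules, lexical_rules):
    # words: list of tokens w_1 ... w_n (0-based Python list)
    # binary_rules: dict (B, C) -> list of (A, prob) for rules A -> B C
    # lexical_rules: dict word -> list of (A, prob) for rules A -> word
    # Returns (V, back): V[(i, k)] maps each nonterminal A to the probability of the
    # best derivation A =>* w_i ... w_{k-1}; back stores backpointers for tree recovery.
    n = len(words)
    V = defaultdict(dict)
    back = {}

    for i in range(1, n + 1):                          # lexical cells, spans (i, i+1)
        for A, p in lexical_rules.get(words[i - 1], []):
            V[(i, i + 1)][A] = p
            back[(A, i, i + 1)] = words[i - 1]

    for b in range(2, n + 1):                          # span length
        for i in range(1, n - b + 2):                  # span start
            k = i + b                                  # span end (exclusive)
            for j in range(i + 1, k):                  # split point
                for B, pB in V[(i, j)].items():
                    for C, pC in V[(j, k)].items():
                        for A, pA in binary_rules.get((B, C), []):
                            p = pA * pB * pC
                            if p > V[(i, k)].get(A, 0.0):
                                V[(i, k)][A] = p
                                back[(A, i, k)] = (B, C, j)
    return V, back

Encoding the example grammar this way and running viterbi_cky on "i shot an elephant in my pyjamas" puts S with probability 0.00072 into the cell for the full sentence and VP with 0.0036 into the cell covering "shot an elephant in my pyjamas", matching the chart developed on the next slides.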

Viterbi-CKY in action. [A sequence of slides fills the Viterbi chart Ch(i,k) = { (A, p) | p = max_{d : A ⇒* w_i … w_{k-1}} P(d) } for "shot an elephant in my pyjamas" step by step: first the small cells V: 1.0 (shot), Det: 0.5 (an), N: 0.3 (elephant), PP: 0.12 (in my pyjamas); then NP: 0.12 (an elephant), N: 0.014 (elephant in my pyjamas), VP: 0.06 (shot an elephant), NP: 0.0058 (an elephant in my pyjamas); finally the full span first receives VP: 0.0029 via VP → V NP and is then updated to VP: 0.0036 via VP → VP PP, so the best analysis attaches the PP to the VP.]

Remarks. Viterbi CKY has exactly the same nested loops as the ordinary CKY parser: computing V in addition to Ch only changes the constant factor, so the asymptotic runtime remains O(n^3). Compute the optimal parse by storing backpointers, the same backpointers as in ordinary CKY; if we only care about the best parse (and not all parses), it is sufficient to store the single best backpointer for each (A, i, k), i.e. this actually uses less memory than ordinary CKY.

Obtaining the PCFG. How do we obtain the CFG? Write it by hand, derive it from a treebank, or use grammar induction from raw text. How do we obtain the rule probabilities once we have the CFG? Maximum likelihood estimation from a treebank, or EM training from raw text (the inside-outside algorithm).
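For reference, the maximum likelihood estimate mentioned here is just relative-frequency counting over the treebank (a standard formula, not spelled out on this slide): P̂(A → β | A) = c(A → β) / Σ_β′ c(A → β′), i.e. count how often each rule is used in the treebank trees and divide by the number of times its left-hand side is expanded.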

The Penn Treebank (Marcus et al., 1993). A large (for the mid-90s) quantity of text, annotated with POS tags and syntactic structures. It consists of several sub-corpora: Wall Street Journal: 1 year of news text, 1 million words; Brown corpus: balanced corpus, 1 million words; ATIS: dialogues on flight bookings, 5000 words; Switchboard: spoken dialogue, 3 million words. The WSJ part of the PTB is the standard corpus for training and evaluating PCFG parsers.

Annotation format. [Parse tree for "That cold, empty sky was full of fire and light.", with constituent nodes S, NP-SBJ, VP, ADJP-PRD, PP, NP and POS tags on the words.]

Annotation format. The same tree in the Treebank's bracketed text format:

( (S
    (NP-SBJ (DT That) (JJ cold) (, ,) (JJ empty) (NN sky) )
    (VP (VBD was)
      (ADJP-PRD (JJ full)
        (PP (IN of)
          (NP (NN fire) (CC and) (NN light) ))))
    (. .) ))

Reading off the grammar. We can directly read off the grammar (the one in the annotators' heads) from the trees in the treebank. This yields a very large CFG, e.g. 4500 rules for VP: VP → VBD PP; VP → VBD PP PP; VP → VBD PP PP PP; VP → VBD PP PP PP PP; VP → VBD ADVP PP; VP → VBD PP ADVP; …; VP → VBD PP PP PP PP PP ADVP PP. The long rules arise from sentences such as "This mostly happens because we go from football in the fall to lifting in the winter to football again in the spring."
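As an illustration, here is a small sketch (my own, using the NLTK sample of the Penn Treebank; it assumes nltk.download("treebank") has been run) that reads productions off the treebank trees and estimates their probabilities by relative frequency:

from collections import Counter
import nltk

# Assumes the NLTK sample of the Penn Treebank is installed: nltk.download("treebank")
rule_counts = Counter()
lhs_counts = Counter()

for tree in nltk.corpus.treebank.parsed_sents():
    for prod in tree.productions():          # all rules used in this tree
        rule_counts[prod] += 1
        lhs_counts[prod.lhs()] += 1

# Maximum likelihood estimate: P(A -> beta | A) = c(A -> beta) / c(A)
rule_probs = {prod: count / lhs_counts[prod.lhs()]
              for prod, count in rule_counts.items()}

# e.g. inspect how many different VP expansions occur in the sample
vp_rules = [p for p in rule_probs if str(p.lhs()) == "VP"]
print(len(vp_rules), "distinct VP rules in the sample")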

Evaluation. Step 1: Decide on a training and a test corpus. For the WSJ corpus, there is a conventional split by sections: sections 2-21 for training, section 22 for development, section 23 for testing.

Evaluation. Step 2: How should we measure the accuracy of the parser? Straightforward idea: measure exact match, i.e. the proportion of gold-standard trees that the parser got exactly right. This is too strict: the parser makes many decisions when parsing a sentence, and a single incorrect parsing decision makes the whole tree wrong. We want a more fine-grained measure.

Comparing parse trees. Idea 2 (PARSEVAL): Compare the structure of the parse tree and the gold-standard tree. Labeled: which constituents (span + syntactic category) of one tree also occur in the other? Unlabeled: how do the trees bracket the substrings of the sentence (ignoring syntactic categories)? [Example: a gold tree and a parse tree for "But the concept is workable"; the parse differs from the gold tree in some of its constituents.]

Precision. What proportion of the constituents in the parse tree is also present in the gold tree? [Gold and parse trees for "But the concept is workable" as before.] Labeled precision = 7 / 11 = 63.6%; unlabeled precision = 10 / 11 = 90.9%.

Recall. What proportion of the constituents in the gold tree is also present in the parse tree? [Same trees.] Labeled recall = 7 / 9 = 77.8%; unlabeled recall = 8 / 9 = 88.9%.

F-Score. Precision and recall measure opposing qualities of a parser ("soundness" and "completeness"). Summarize both together in the f-score: F1 = 2 · P · R / (P + R). In the example, we have a labeled f-score of 70.0 and an unlabeled f-score of 89.9.
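A minimal sketch of this computation (my own; representing constituents as (label, start, end) triples is an assumption about the input format, not something defined on the slides):

from collections import Counter

def parseval(gold, parse, labeled=True):
    # gold, parse: lists of constituents, each a (label, start, end) triple.
    # With labeled=False, the labels are ignored and only spans are compared.
    if not labeled:
        gold = [(start, end) for _, start, end in gold]
        parse = [(start, end) for _, start, end in parse]
    matched = sum((Counter(gold) & Counter(parse)).values())   # multiset intersection
    precision = matched / len(parse)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# Toy example: the parse gets 2 of 3 gold constituents right (the VP span is mislabeled).
gold  = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)]
parse = [("S", 0, 5), ("NP", 0, 2), ("PP", 2, 5)]
print(parseval(gold, parse))                  # labeled:   (0.667, 0.667, 0.667)
print(parseval(gold, parse, labeled=False))   # unlabeled: (1.0, 1.0, 1.0)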

Summary. PCFGs extend CFGs with rule probabilities; the events of the random process are nonterminal expansion steps, and these are all statistically independent. Use the Viterbi CKY parser to find the most probable parse tree for a sentence in cubic time. Read grammars off treebanks (next time: learn the rule probabilities). Evaluation of statistical parsers.