Probabilistic Context Free Grammars. Many slides from Michael Collins


Overview
- Probabilistic Context-Free Grammars (PCFGs)
- The CKY Algorithm for parsing with PCFGs

A Probabilistic Context-Free Grammar (PCFG)

S → NP VP    1.0
VP → Vi      0.4
VP → Vt NP   0.4
VP → VP PP   0.2
NP → DT NN   0.3
NP → NP PP   0.7
PP → P NP    1.0

Vi → sleeps     1.0
Vt → saw        1.0
NN → man        0.7
NN → woman      0.2
NN → telescope  0.1
DT → the        1.0
IN → with       0.5
IN → in         0.5

- Probability of a tree t with rules α_1 → β_1, α_2 → β_2, ..., α_n → β_n is p(t) = ∏_{i=1}^{n} q(α_i → β_i), where q(α → β) is the probability for rule α → β.
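
A minimal sketch of this tree-probability computation; representing a tree by the list of rules used in its derivation is an assumption of the illustration, not something defined on the slides.

```python
# Rule probabilities q(rule) for a fragment of the grammar above
# (rules written as (left-hand side, right-hand side) tuples).
q = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.3,
    ("VP", ("Vi",)): 0.4,
    ("DT", ("the",)): 1.0,
    ("NN", ("man",)): 0.7,
    ("Vi", ("sleeps",)): 1.0,
}

def tree_probability(rules_used, q):
    """p(t) = product of q(rule) over the rules in the derivation of t."""
    p = 1.0
    for rule in rules_used:
        p *= q[rule]
    return p

# The tree for "the man sleeps" uses these six rules:
derivation = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("DT", ("the",)),
              ("NN", ("man",)), ("VP", ("Vi",)), ("Vi", ("sleeps",))]
print(tree_probability(derivation, q))   # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084
```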

DERIVATION       RULES USED      PROBABILITY
S                S → NP VP       1.0
NP VP            NP → DT NN      0.3
DT NN VP         DT → the        1.0
the NN VP        NN → dog        0.1
the dog VP       VP → Vi         0.4
the dog Vi       Vi → laughs     0.5
the dog laughs
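
Multiplying the rule probabilities used in this derivation (the final step, left implicit above) gives the probability of the tree for "the dog laughs": p(t) = 1.0 × 0.3 × 1.0 × 0.1 × 0.4 × 0.5 = 0.006.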

Properties of PCFGs
- Assigns a probability to each left-most derivation, or parse tree, allowed by the underlying CFG.
- Say we have a sentence s, and the set of derivations for that sentence is T(s). Then a PCFG assigns a probability p(t) to each member of T(s), i.e., we now have a ranking of the trees in order of probability.
- The most likely parse tree for a sentence s is arg max_{t ∈ T(s)} p(t).

Data for Parsing Experiments: Treebanks
- Penn WSJ Treebank = 50,000 sentences with associated trees
- Usual set-up: 40,000 training sentences, 2,400 test sentences
- An example tree, for the sentence: "Canadian Utilities had 1988 revenue of C$ 1.16 billion, mainly from its natural gas and electric utility businesses in Alberta, where the company serves about 800,000 customers." (The tree diagram itself is not reproduced here.)

Deriving a PCFG from a Treebank
- Given a set of example trees (a treebank), the underlying CFG can simply be all rules seen in the corpus.
- Maximum likelihood estimates:

  q_ML(α → β) = Count(α → β) / Count(α)

  where the counts are taken from a training set of example trees.
- If the training data is generated by a PCFG, then as the training data size goes to infinity, the maximum-likelihood PCFG will converge to the same distribution as the true PCFG.
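
A rough sketch of how these estimates might be computed; the tree representation as (nonterminal, children) tuples is an assumption of this illustration, not something specified in the slides.

```python
from collections import defaultdict

def ml_estimates(treebank):
    """Estimate q_ML(rule) = Count(rule) / Count(lhs) from a list of trees.

    Assumed tree representation: a tree is either a terminal string or a
    tuple (nonterminal, [child trees]).
    """
    rule_counts = defaultdict(int)
    lhs_counts = defaultdict(int)

    def collect(tree):
        if isinstance(tree, str):          # terminal: no rule rooted here
            return
        lhs, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1
        for c in children:
            collect(c)

    for tree in treebank:
        collect(tree)

    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

# Example: the tree for "the dog laughs" from the derivation above.
toy_tree = ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]),
                  ("VP", [("Vi", ["laughs"])])])
q = ml_estimates([toy_tree])   # every rule gets probability 1.0 here
```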

PCFGs
Booth and Thompson (1973) showed that a CFG with rule probabilities correctly defines a distribution over the set of derivations provided that:
1. The rule probabilities define conditional distributions over the different ways of rewriting each non-terminal.
2. A technical condition on the rule probabilities holds, ensuring that the probability of the derivation terminating in a finite number of steps is 1. (This condition is not really a practical concern.)

Parsing with a PCFG
- Given a PCFG and a sentence s, define T(s) to be the set of trees with s as the yield.
- Given a PCFG and a sentence s, how do we find

  arg max_{t ∈ T(s)} p(t)

Chomsky Normal Form
A context-free grammar G = (N, Σ, R, S) in Chomsky Normal Form is as follows:
- N is a set of non-terminal symbols
- Σ is a set of terminal symbols
- R is a set of rules which take one of two forms:
  - X → Y_1 Y_2 for X ∈ N, and Y_1, Y_2 ∈ N
  - X → Y for X ∈ N, and Y ∈ Σ
- S ∈ N is a distinguished start symbol
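
As a concrete, assumed way to store such a grammar in code, binary and lexical rules can be kept in separate dictionaries keyed by the rule itself; this hypothetical encoding is reused in the CKY sketch further below.

```python
# A hypothetical encoding of a PCFG in Chomsky Normal Form:
# binary rules X -> Y Z and lexical rules X -> word, each mapped to q(rule).
binary_rules = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "Vt", "NP"): 0.4,
    ("NP", "DT", "NN"): 0.3,
    ("PP", "IN", "NP"): 1.0,
}
lexical_rules = {
    ("Vt", "saw"): 1.0,
    ("DT", "the"): 1.0,
    ("NN", "man"): 0.7,
    ("IN", "with"): 0.5,
}
```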

A Dynamic Programming Algorithm
- Given a PCFG and a sentence s, how do we find

  max_{t ∈ T(s)} p(t)

- Notation:
  n = number of words in the sentence
  w_i = i-th word in the sentence
  N = the set of non-terminals in the grammar
  S = the start symbol in the grammar
- Define a dynamic programming table

  π[i, j, X] = maximum probability of a constituent with non-terminal X spanning words i...j inclusive

- Our goal is to calculate max_{t ∈ T(s)} p(t) = π[1, n, S]

An Example
the dog saw the man with the telescope

A Dynamic Programming Algorithm
- Base case definition: for all i = 1...n, for X ∈ N,

  π[i, i, X] = q(X → w_i)

  (note: define q(X → w_i) = 0 if X → w_i is not in the grammar)
- Recursive definition: for all i = 1...n, j = (i+1)...n, X ∈ N,

  π(i, j, X) = max_{X → Y Z ∈ R, s ∈ {i...(j-1)}} ( q(X → Y Z) × π(i, s, Y) × π(s+1, j, Z) )

An Example

π(i, j, X) = max_{X → Y Z ∈ R, s ∈ {i...(j-1)}} ( q(X → Y Z) × π(i, s, Y) × π(s+1, j, Z) )

the dog saw the man with the telescope

The Full Dynamic Programming Algorithm

Input: a sentence s = x_1...x_n, a PCFG G = (N, Σ, S, R, q).

Initialization: For all i ∈ {1...n}, for all X ∈ N,

  π(i, i, X) = q(X → x_i) if X → x_i ∈ R, 0 otherwise

Algorithm:
- For l = 1...(n-1)
  - For i = 1...(n-l)
    - Set j = i + l
    - For all X ∈ N, calculate

      π(i, j, X) = max_{X → Y Z ∈ R, s ∈ {i...(j-1)}} ( q(X → Y Z) × π(i, s, Y) × π(s+1, j, Z) )

      and

      bp(i, j, X) = arg max_{X → Y Z ∈ R, s ∈ {i...(j-1)}} ( q(X → Y Z) × π(i, s, Y) × π(s+1, j, Z) )
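
A Python sketch of the full algorithm, using the dictionary-based grammar encoding assumed earlier (binary_rules and lexical_rules); indexing follows the 1-based convention of the slides. This is an illustration, not a reference implementation.

```python
def cky_parse(words, binary_rules, lexical_rules, start="S"):
    """Viterbi CKY for a PCFG in Chomsky Normal Form.

    binary_rules:  dict mapping (X, Y, Z) -> q(X -> Y Z)
    lexical_rules: dict mapping (X, word) -> q(X -> word)
    Returns (max tree probability, backpointer table bp); pi[(i, j, X)]
    is the best probability of an X spanning words i..j inclusive.
    """
    n = len(words)
    nonterminals = ({x for (x, _, _) in binary_rules} |
                    {x for (x, _) in lexical_rules})
    pi, bp = {}, {}

    # Initialization: pi(i, i, X) = q(X -> x_i), or 0 if that rule is absent.
    for i in range(1, n + 1):
        for X in nonterminals:
            pi[(i, i, X)] = lexical_rules.get((X, words[i - 1]), 0.0)

    # Fill in spans of increasing length l.
    for l in range(1, n):
        for i in range(1, n - l + 1):
            j = i + l
            for X in nonterminals:
                best, arg = 0.0, None
                for (x, Y, Z), q in binary_rules.items():
                    if x != X:
                        continue
                    for s in range(i, j):
                        p = q * pi.get((i, s, Y), 0.0) * pi.get((s + 1, j, Z), 0.0)
                        if p > best:
                            best, arg = p, (Y, Z, s)
                pi[(i, j, X)], bp[(i, j, X)] = best, arg

    return pi.get((1, n, start), 0.0), bp
```

A real implementation would index binary rules by their left-hand side rather than scanning the whole rule set for every cell, but the structure above mirrors the pseudocode on the slide.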

A Dynamic Programming Algorithm for the Sum
- Given a PCFG and a sentence s, how do we find

  Σ_{t ∈ T(s)} p(t)

- Notation:
  n = number of words in the sentence
  w_i = i-th word in the sentence
  N = the set of non-terminals in the grammar
  S = the start symbol in the grammar
- Define a dynamic programming table

  π[i, j, X] = sum of probabilities of constituents with non-terminal X spanning words i...j inclusive

- Our goal is to calculate Σ_{t ∈ T(s)} p(t) = π[1, n, S]
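
Only the inner update of the CKY sketch above changes: the max over rules and split points becomes a sum. A sketch, reusing the same assumed table and grammar representation (names are my own):

```python
def inside_update(pi, binary_rules, nonterminals, i, j):
    """Fill pi[(i, j, X)] with the *sum* over rules X -> Y Z and split
    points s, instead of the max used for Viterbi parsing."""
    for X in nonterminals:
        pi[(i, j, X)] = sum(
            q * pi.get((i, s, Y), 0.0) * pi.get((s + 1, j, Z), 0.0)
            for (x, Y, Z), q in binary_rules.items() if x == X
            for s in range(i, j)
        )
```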

Summary
- PCFGs augment CFGs by including a probability for each rule in the grammar.
- The probability of a parse tree is the product of the probabilities of the rules in the tree.
- To build a PCFG-based parser:
  1. Learn a PCFG from a treebank
  2. Given a test sentence, use the CKY algorithm to compute the highest-probability tree for the sentence under the PCFG

Dependency Parsing

Unlabeled Dependency Parses

root John saw a movie

- root is a special root symbol.
- Each dependency is a pair (h, m) where h is the index of a head word and m is the index of a modifier word. In the figures, we represent a dependency (h, m) by a directed edge from h to m.
- Dependencies in the above example are (0, 2), (2, 1), (2, 4), and (4, 3). (We take 0 to be the root symbol.)
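
A minimal way to hold such a parse in code; the list-of-pairs and head-dictionary representations here are assumed conventions, not something defined in the slides.

```python
# Dependencies for "root John saw a movie": words are indexed from 1
# (John=1, saw=2, a=3, movie=4) and 0 is the root symbol.
sentence = ["*root*", "John", "saw", "a", "movie"]
dependencies = [(0, 2), (2, 1), (2, 4), (4, 3)]   # (head index, modifier index)

# An equivalent head array: head[m] is the head index of word m.
head = {m: h for (h, m) in dependencies}          # {2: 0, 1: 2, 4: 2, 3: 4}
```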

All Dependency Parses for "John saw Mary"

(Diagrams of the possible dependency parses of "root John saw Mary"; the trees themselves are not reproduced here.)

A More Complex Example

root John saw a movie that he liked today

Conditions on Dependency Structures

root John saw a movie that he liked today

- The dependency arcs form a directed tree, with the root symbol at the root of the tree. (Definition: a directed tree rooted at root is a tree where, for every word w other than the root, there is a directed path from root to w.)
- There are no crossing dependencies. Dependency structures with no crossing dependencies are sometimes referred to as projective structures.
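
A sketch of how these two conditions might be checked for the head-dictionary representation above; the function names and details are my own.

```python
def is_directed_tree(head, n):
    """Check that following head links from every word 1..n reaches the
    root (index 0) without revisiting a word, i.e. the arcs form a
    directed tree rooted at root."""
    for m in range(1, n + 1):
        seen, node = set(), m
        while node != 0:
            if node in seen or node not in head:
                return False
            seen.add(node)
            node = head[node]
    return True

def is_projective(head):
    """Check that no two dependency arcs cross."""
    spans = [(min(h, m), max(h, m)) for m, h in head.items()]
    return not any(l1 < l2 < r1 < r2
                   for (l1, r1) in spans for (l2, r2) in spans)
```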

Dependency Parsing Resources
- The CoNLL 2006 conference had a shared task on dependency parsing of 12 languages (Arabic, Chinese, Czech, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish, Turkish). 19 different groups developed dependency parsing systems. (See also CoNLL 2007.)
- PhD thesis on the topic: Ryan McDonald, Discriminative Training and Spanning Tree Algorithms for Dependency Parsing, University of Pennsylvania.
- For some languages, e.g., Czech, there are dependency banks available which contain training data in the form of sentences paired with dependency structures.
- For other languages, we can extract dependency structures from treebanks.

(A lexicalized parse tree for the sentence "Hillary told Clinton that she was president", with head-word/head-tag annotations at each node, e.g. S(told,V), NP(Hillary,NNP), VP(told,VBD), SBAR(that,COMP); the tree diagram itself is not reproduced here.)

Unlabeled dependencies:
(0, 2)  (for root → told)
(2, 1)  (for told → Hillary)
(2, 3)  (for told → Clinton)
(2, 4)  (for told → that)
(4, 6)  (for that → was)
(6, 5)  (for was → she)
(6, 7)  (for was → president)

Efficiency of Dependency Parsing
- PCFG parsing is O(n^3 G^3), where n is the length of the sentence and G is the number of non-terminals in the grammar.
- Lexicalized PCFG parsing is O(n^5 G^3), where n is the length of the sentence and G is the number of non-terminals in the grammar.
- Unlabeled dependency parsing is O(n^3).

GLMs for Dependency Parsing
- x is a sentence
- GEN(x) is the set of all dependency structures for x
- f(x, y) is a feature vector for a sentence x paired with a dependency parse y

GLMs for Dependency Parsing
- To run the perceptron algorithm, we must be able to efficiently calculate

  arg max_{y ∈ GEN(x)} w · f(x, y)

- Local feature vectors: define

  f(x, y) = Σ_{(h,m) ∈ y} g(x, h, m)

  where g(x, h, m) maps a sentence x and a dependency (h, m) to a local feature vector.
- Can then use dynamic programming to calculate

  arg max_{y ∈ GEN(x)} w · f(x, y) = arg max_{y ∈ GEN(x)} Σ_{(h,m) ∈ y} w · g(x, h, m)
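
A sketch of the arc-factored scoring idea, with the local feature function g left abstract (a version of g is sketched after the next slide); representing w as a dict from feature name to weight is an assumption of this illustration.

```python
def score(x, y, w, g):
    """Compute w . f(x, y), where f(x, y) = sum over (h, m) in y of g(x, h, m).

    x: the sentence, y: a dependency parse as a list of (h, m) pairs,
    w: a dict mapping feature names to weights,
    g: a function returning a list of feature names for a single arc.
    """
    return sum(w.get(feat, 0.0) for (h, m) in y for feat in g(x, h, m))
```

For projective trees, the arg max over GEN(x) is then typically computed with an O(n^3) arc-factored dynamic program such as Eisner's algorithm rather than by enumerating y.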

Definition of Local Feature Vectors
- Features from McDonald et al. (2005):
- Note: define w_i to be the i-th word in the sentence, and t_i to be the part-of-speech (POS) tag for the i-th word.
- Unigram features: identity of w_h; identity of w_m; identity of t_h; identity of t_m.
- Bigram features: identity of the 4-tuple <w_h, w_m, t_h, t_m>; identity of sub-sets of this 4-tuple, e.g., identity of the pair <w_h, w_m>.
- Contextual features: identity of the 4-tuple <t_h, t_{h+1}, t_{m-1}, t_m>; similar features which consider t_{h-1} and t_{m+1}, giving 4 possible feature types.
- In-between features: identity of triples <t_h, t, t_m> for any tag t seen between words h and m.
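
A sketch of a local feature function g along the lines of a few of these template families; the feature-string formats and the assumption that x is a (words, tags) pair are illustrative choices, not part of the slides.

```python
def g(x, h, m):
    """Local features for the dependency (h, m).

    x is assumed to be a pair (words, tags), both indexed so that position 0
    is the root symbol. Only a few of the template families above are shown.
    """
    words, tags = x
    feats = [
        # Unigram features
        "hw=" + words[h], "mw=" + words[m],
        "ht=" + tags[h], "mt=" + tags[m],
        # Bigram features: the full 4-tuple and one sub-pair
        "hw,mw,ht,mt=%s,%s,%s,%s" % (words[h], words[m], tags[h], tags[m]),
        "hw,mw=%s,%s" % (words[h], words[m]),
    ]
    # In-between features: tag triples <t_h, t, t_m> for tags between h and m
    lo, hi = min(h, m), max(h, m)
    for t in tags[lo + 1:hi]:
        feats.append("between=%s,%s,%s" % (tags[h], t, tags[m]))
    return feats
```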

Results from McDonald (2005)

Method                  Accuracy
Collins (1997)          91.4%
1st-order dependency    90.7%
2nd-order dependency    91.5%

- Accuracy is the percentage of correct unlabeled dependencies.
- Collins (1997) is the result from a lexicalized context-free parser, with dependencies extracted from the parser's output.
- "1st-order dependency" is the method just described; "2nd-order dependency" is a model that uses richer representations.
- Advantages of the dependency parsing approaches: simplicity, efficiency (O(n^3) parsing time).