Penn Treebank Parsing
Advanced Topics in Language Processing
Stephen Clark

The Penn Treebank

- 40,000 sentences of WSJ newspaper text annotated with phrase-structure trees
- The trees contain some predicate-argument information and traces
- Created in the early 90s
- Produced by automatically parsing the newspaper sentences, followed by manual correction
- Took around 3 years to create
- Sparked a parsing competition which is still running today, leading some commentators to describe the last 10 years of NLP as "the study of the WSJ"

An Example Penn Treebank Tree

[Tree diagram for the sentence "It is difficult to understand what I want to do", showing the full Penn Treebank annotation: empty NPs (0) and co-indexed traces (WHNP_2, NP_1, 0_1, 0_2) marking the extracted wh-phrase and the understood subjects and objects.]

A Tree a Typical PTB Parser Would Produce

[Tree diagram for the same sentence as produced by a typical Penn Treebank parser: the phrase structure is recovered, but the empty elements and co-indexed traces are absent. Taken from Dienes (2004).]
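The contrast between the two trees is easiest to see in bracketed notation. Below is a minimal sketch using simplified bracketings of just the embedded clause "what I want to do" (approximations for illustration, not the literal Treebank annotation), which reads both versions with NLTK's Tree class and lists the empty elements that only the gold annotation contains.

```python
# A minimal sketch, assuming NLTK is installed (pip install nltk).
# The bracketings are simplified fragments for the embedded clause
# "what I want to do", not the literal Treebank annotation.
from nltk import Tree

# Treebank-style annotation: co-indexed wh-trace (WHNP-2 / *-2) and
# an empty controlled subject (*-1) are represented explicitly.
gold = Tree.fromstring(
    "(SBAR (WHNP-2 what) (S (NP-1 I) (VP (V want)"
    " (S (NP *-1) (VP (TO to) (VP (V do) (NP *-2)))))))"
)

# Typical parser output: same phrase structure, but the empty
# elements and co-indexation are simply missing.
parsed = Tree.fromstring(
    "(SBAR (WHNP what) (S (NP I) (VP (V want)"
    " (S (VP (TO to) (VP (V do)))))))"
)

def empty_elements(tree):
    """Return the leaves that are null elements (marked here with '*')."""
    return [leaf for leaf in tree.leaves() if leaf.startswith("*")]

print("gold tree empties:  ", empty_elements(gold))    # ['*-1', '*-2']
print("parser tree empties:", empty_elements(parsed))  # []
```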

Characterisation of Statistical Parsing 1

- What is the grammar which determines the set of legal syntactic structures for a sentence? How is that grammar obtained?
- What is the algorithm for determining the set of legal parses for a sentence (given a grammar)?
- What is the model for determining the plausibility of different parses for a sentence?
- What is the algorithm, given the model and a set of possible parses, which finds the best parse?

Characterisation of Statistical Parsing 2

Just two components:

    T_best = argmax_T Score(T, S)

- the model: a function Score which assigns scores (probabilities) to (tree, sentence) pairs
- the parser: the algorithm which implements the search for T_best

Statistical parsing is seen as more of a pattern recognition / machine learning problem plus search: the grammar is only implicitly defined by the training data and the method used by the parser for generating hypotheses.
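This two-component view can be made concrete as a pair of interfaces: a model that scores (tree, sentence) pairs, and a parser that searches for the highest-scoring tree. The sketch below is purely illustrative; the class and function names are not taken from any particular parser.

```python
# A minimal sketch of the "model + search" view of statistical parsing.
# All names here are illustrative; no particular parser's API is implied.
from typing import Iterable, Protocol

class Tree: ...          # placeholder for whatever tree representation is used

class Model(Protocol):
    def score(self, tree: Tree, sentence: list[str]) -> float:
        """Score(T, S): e.g. a (log-)probability of the (tree, sentence) pair."""
        ...

def best_parse(model: Model,
               sentence: list[str],
               candidates: Iterable[Tree]) -> Tree:
    """T_best = argmax_T Score(T, S), searching over candidate trees.

    In a real parser the candidate set is never enumerated explicitly;
    a chart algorithm such as CKY performs the argmax implicitly.
    """
    return max(candidates, key=lambda t: model.score(t, sentence))
```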

Statistical Parsing Models

A probabilistic approach would suggest the following for the Score function:

    Score(T, S) = P(T | S)

- Lots of research on different probability models for Penn Treebank trees: generative models, log-linear (maximum entropy) models, ...
- We will consider generative models and log-linear models

Generative Models

    argmax_T P(T | S) = argmax_T P(T, S) / P(S)
                      = argmax_T P(T, S)

- Why model the joint probability when the sentence is given?
- Modelling a parse as a generative process allows the parse to be broken into manageable parts, for which the corresponding probabilities can be reliably estimated
- Probability estimation is easy for these sorts of models (ignoring smoothing issues): maximum likelihood estimation = relative frequency estimation
- But choosing how to break up the parse is something of a black art
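For generative models of this kind, maximum likelihood estimation really is just counting: the estimate of P(outcome | context) is count(context, outcome) / count(context). A minimal sketch of such a relative-frequency estimator follows (smoothing deliberately ignored, as on the slide; the class name is illustrative).

```python
# Relative-frequency (maximum likelihood) estimation for a conditional
# distribution P(outcome | context). Smoothing is deliberately ignored.
from collections import defaultdict

class RelFreqEstimator:
    def __init__(self):
        self.joint = defaultdict(int)     # count(context, outcome)
        self.context = defaultdict(int)   # count(context)

    def observe(self, context, outcome):
        self.joint[(context, outcome)] += 1
        self.context[context] += 1

    def prob(self, context, outcome) -> float:
        if self.context[context] == 0:
            return 0.0                    # unseen context; smoothing would go here
        return self.joint[(context, outcome)] / self.context[context]

# e.g. estimating P(rule | parent) from observed rule applications:
#   est.observe("NP", ("DT", "NN")); est.prob("NP", ("DT", "NN"))
```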

PCFGs

- Probabilistic Context-Free Grammars provide a ready-made solution to the statistical parsing problem
- However, it is important to realise that parameters do not have to be associated with the rules of a context-free grammar: we can choose to break up the tree in any way we like
- But extracting a PCFG from the Penn Treebank and parsing with it provides a useful baseline: such a PCFG parser obtains roughly 70-75% Parseval scores
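Reading a baseline PCFG off the treebank amounts to counting rule occurrences and normalising by the count of each left-hand side. A minimal sketch, assuming the small treebank sample shipped with NLTK is available (nltk.download('treebank')); the full Penn Treebank would be processed the same way.

```python
# A minimal sketch of reading a PCFG off treebank trees by relative frequency.
# Assumes the small treebank sample shipped with NLTK (nltk.download('treebank')).
from collections import defaultdict
from nltk.corpus import treebank

rule_count = defaultdict(int)    # count(LHS -> RHS)
lhs_count = defaultdict(int)     # count(LHS)

for tree in treebank.parsed_sents():
    for prod in tree.productions():          # CFG rules used in this tree
        rule_count[prod] += 1
        lhs_count[prod.lhs()] += 1

# Maximum likelihood rule probabilities: P(LHS -> RHS | LHS)
pcfg = {rule: n / lhs_count[rule.lhs()] for rule, n in rule_count.items()}

# e.g. the single most probable expansion of NP in the sample
np_rules = [(p, str(r)) for r, p in pcfg.items() if str(r.lhs()) == "NP"]
print(max(np_rules))
```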

Parameterisation of a Parse Tree

Collins describes the following two criteria for a good parameterisation:

- Discriminative power: the parameters should include the contextual information required for the disambiguation process (PCFGs fail in this regard)
- Compactness: the model should have as few parameters as possible (while still retaining adequate discriminative power)

Collins' thesis (p. 11) has an instructive belief networks example, which shows that the choice of variable ordering is crucial.

Generative Models for Parse Trees

Representation:
- the set of part-of-speech tags
- whether to pass lexical heads up the tree (lexicalisation)
- whether to replace words with their morphological stems

Decomposition:
- the order in which to generate the tree, i.e. the order of decisions, d_i, made in generating the tree
- these decisions do not have to correspond to parsing decisions

Independence assumptions:
- group decision sequences into equivalence classes, Φ:

    P(T, S) = ∏_{i=1}^{n} P(d_i | Φ(d_1 ... d_{i-1}))
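Written as code, such a history-based model scores a tree by multiplying the probability of each decision given the equivalence class of its history. In the sketch below the decision representation, Φ and the conditional probability function are illustrative placeholders.

```python
# A minimal sketch of a history-based generative model:
#   P(T, S) = prod_i P(d_i | Phi(d_1 ... d_{i-1}))
# The decision representation and Phi are illustrative placeholders.
import math
from typing import Callable, Hashable, Sequence

Decision = Hashable
History = tuple

def log_prob_tree(decisions: Sequence[Decision],
                  phi: Callable[[History], Hashable],
                  cond_prob: Callable[[Hashable, Decision], float]) -> float:
    """Sum of log P(d_i | Phi(history)) over the decision sequence."""
    logp = 0.0
    for i, d in enumerate(decisions):
        context = phi(tuple(decisions[:i]))   # equivalence class of the history
        logp += math.log(cond_prob(context, d))
    return logp

# Example Phi: a Markov-style assumption that only the previous two
# decisions matter (this particular choice is just for illustration).
phi_markov2 = lambda history: history[-2:]
```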

Successive Parameterisations

- Simple PCFG
- PCFG + dependencies
- Dependencies + direction
- Dependencies + direction + relations
- Dependencies + direction + relations + subcategorisation
- Dependencies + direction + relations + subcategorisation + distance
- Dependencies + direction + relations + subcategorisation + distance + parts-of-speech

The Basic Generative Model

Each rule in a (lexicalised) PCFG has the following form:

    P(h) → L_n(l_n) ... L_1(l_1) H(h) R_1(r_1) ... R_m(r_m)

P is the parent; H is the head-child; L_i and R_i are left and right modifiers (n or m may be zero).

The probability of a rule can be written (exactly) using the chain rule:

    p(L_n(l_n) ... L_1(l_1) H(h) R_1(r_1) ... R_m(r_m) | P(h))
        = p(H | P(h))
          × ∏_{i=1}^{n} p(L_i(l_i) | L_1(l_1) ... L_{i-1}(l_{i-1}), P(h), H)
          × ∏_{j=1}^{m} p(R_j(r_j) | L_1(l_1) ... L_n(l_n), R_1(r_1) ... R_{j-1}(r_{j-1}), P(h), H)

Independence Assumptions

For Model 1, assume the modifiers are generated independently of each other:

    p_l(L_i(l_i) | L_1(l_1) ... L_{i-1}(l_{i-1}), P(h), H) = p_l(L_i(l_i) | P(h), H)
    p_r(R_j(r_j) | L_1(l_1) ... L_n(l_n), R_1(r_1) ... R_{j-1}(r_{j-1}), P(h), H) = p_r(R_j(r_j) | P(h), H)

Example rule: S(bought) → NP(week) NP(IBM) VP(bought)

    p_h(VP | S, bought)
    × p_l(NP(IBM) | S, VP, bought)
    × p_l(NP(week) | S, VP, bought)
    × p_l(STOP | S, VP, bought)
    × p_r(STOP | S, VP, bought)
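The product on the slide instantiates the chain-rule decomposition under Model 1's independence assumptions: generate the head child, then each left and right modifier independently, terminating each side with STOP. A minimal sketch of that product follows; the probability functions p_h, p_l, p_r are stand-ins for relative-frequency estimates, and the distance and part-of-speech conditioning of the full model are omitted.

```python
# A minimal sketch of the Model 1 score for
#   S(bought) -> NP(week) NP(IBM) VP(bought)
# p_h, p_l, p_r stand in for probability tables estimated from the treebank.
STOP = ("STOP", None)

def model1_rule_prob(parent, head_word, head_child,
                     left_mods, right_mods, p_h, p_l, p_r):
    """Head child, then each left/right modifier independently, then STOP."""
    prob = p_h(head_child, parent, head_word)
    for mod in left_mods + [STOP]:
        prob *= p_l(mod, parent, head_child, head_word)
    for mod in right_mods + [STOP]:
        prob *= p_r(mod, parent, head_child, head_word)
    return prob

# Usage for the slide's example (modifiers listed head-outward as
# (label, head-word) pairs):
# model1_rule_prob("S", "bought", "VP",
#                  left_mods=[("NP", "IBM"), ("NP", "week")],
#                  right_mods=[],
#                  p_h=..., p_l=..., p_r=...)
```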

Subcategorisation Parameters

- A better model would distinguish optional arguments (adjuncts) from required arguments (complements)
- In "Last week IBM bought Lotus", "Last week" is an optional argument
- Here the verb subcategorises for an NP subject to the left and an NP object to the right
- Subjects are often omitted from subcat frames for English (because every verb has a subject in English), but we'll keep them in the model

Subcategorisation Parameters

Probability of the rule S(bought) → NP(week) NP-C(IBM) VP(bought):

    p_h(VP | S, bought)
    × p_lc({NP-C} | S, VP, bought)
    × p_rc({} | S, VP, bought)
    × p_l(NP-C(IBM) | S, VP, bought, {NP-C})
    × p_l(NP(week) | S, VP, bought, {})
    × p_l(STOP | S, VP, bought, {})
    × p_r(STOP | S, VP, bought, {})
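With subcategorisation, the left and right subcat sets are generated first, and each generated complement is checked off the relevant set, so the remaining requirements condition later decisions. A minimal sketch for the slide's example follows; the probability functions are illustrative stand-ins, and the "-C" suffix convention for complements is taken from the slide.

```python
# A minimal sketch of the Model 2 score for
#   S(bought) -> NP(week) NP-C(IBM) VP(bought)
# p_h, p_lc, p_rc, p_l, p_r stand in for relative-frequency estimates.
STOP = ("STOP", None)

def model2_rule_prob(parent, head_word, head_child, left_mods, right_mods,
                     left_subcat, right_subcat, p_h, p_lc, p_rc, p_l, p_r):
    prob = p_h(head_child, parent, head_word)
    prob *= p_lc(frozenset(left_subcat), parent, head_child, head_word)
    prob *= p_rc(frozenset(right_subcat), parent, head_child, head_word)

    lc = set(left_subcat)
    for label, word in left_mods + [STOP]:
        prob *= p_l((label, word), parent, head_child, head_word, frozenset(lc))
        if label.endswith("-C"):
            lc.discard(label)           # a generated complement is checked off
    rc = set(right_subcat)
    for label, word in right_mods + [STOP]:
        prob *= p_r((label, word), parent, head_child, head_word, frozenset(rc))
        if label.endswith("-C"):
            rc.discard(label)
    return prob

# Slide example: left subcat {NP-C}, right subcat {}, modifiers head-outward:
# model2_rule_prob("S", "bought", "VP",
#                  left_mods=[("NP-C", "IBM"), ("NP", "week")], right_mods=[],
#                  left_subcat={"NP-C"}, right_subcat=set(), ...)
```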

Results

- Model 1 achieves 87.5/87.7 LP/LR on WSJ section 23, according to the Parseval measures
- Model 2 achieves 88.1/88.3 LP/LR
- Current best scores on this task are around 91
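The LP/LR figures are Parseval labelled precision and recall: the labelled constituent spans of the proposed tree are compared with those of the gold tree. A minimal sketch of the computation follows, ignoring standard evaluation details such as punctuation and preterminal handling, with trees represented as nested lists purely for illustration.

```python
# A minimal sketch of Parseval labelled precision/recall over constituent
# spans, ignoring details such as punctuation and preterminal handling.
from collections import Counter

def labelled_spans(tree):
    """Return a Counter of (label, start, end) constituents; leaves are words."""
    spans = []
    def walk(node, start):
        if isinstance(node, str):              # a word
            return start + 1
        end = start
        for child in node[1:]:                 # node = [label, child, child, ...]
            end = walk(child, end)
        spans.append((node[0], start, end))
        return end
    walk(tree, 0)
    return Counter(spans)

def parseval(gold_tree, test_tree):
    gold, test = labelled_spans(gold_tree), labelled_spans(test_tree)
    matched = sum((gold & test).values())
    lp = matched / sum(test.values())          # labelled precision
    lr = matched / sum(gold.values())          # labelled recall
    return lp, lr

# Toy example: the test tree mis-attaches the PP inside the object NP.
gold = ["S", ["NP", "IBM"], ["VP", "bought", ["NP", "Lotus"], ["PP", "in", "April"]]]
test = ["S", ["NP", "IBM"], ["VP", "bought", ["NP", ["NP", "Lotus"], ["PP", "in", "April"]]]]
print(parseval(gold, test))   # (0.833..., 1.0)
```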

Next lecture

- Estimating the parameters
- Finding T_best (parsing)

References

Michael Collins (1999), Head-Driven Statistical Models for Natural Language Parsing, PhD thesis, University of Pennsylvania