Multiword Expression Identification with Tree Substitution Grammars


Multiword Expression Identification with Tree Substitution Grammars
Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning
Stanford University, EMNLP 2011

Main Idea
Use syntactic context to find multiword expressions.
Syntactic context: constituency parses.
Multiword expressions: idiomatic constructions.

Which languages?
Results and analysis for French: its lexicographic tradition of compiling MWE lists means annotated data!
English examples in the talk.

Motivating Example: Humans get this
1. He kicked the pail.
2. He kicked the bucket. ("He died.") (Katz and Postal 1963)

The Stanford parser can't tell the difference
Both sentences receive the same structure:
[S [NP He] [VP kicked [NP the pail]]]
[S [NP He] [VP kicked [NP the bucket]]]

What does the lexicon contain?
Single-word entries? kick: <agent, theme>; die: <theme>
Multi-word entries? kick the bucket: <theme>
[S [NP He] [VP kicked [NP the bucket]]]

Lexicon-Grammar: He kicked the bucket
The idiom gets a single flat multiword-verb node, parallel to its one-word paraphrase:
[S [NP He] [VP died]]
[S [NP He] [VP [MWV kicked the bucket]]] (Gross 1986)

MWEs in Lexicon-Grammar
Classified by global POS: MWV.
Described by internal POS sequence: VBD DT NN.
Flat structures: [MWV [VBD kicked] [DT the] [NN bucket]].
Of theoretical interest, but...

Why do we care (in NLP)? MWE knowledge improves:
Dependency parsing (Nivre and Nilsson 2004)
Constituency parsing (Arun and Keller 2005)
Sentence generation (Hogan et al. 2007)
Machine translation (Carpuat and Diab 2010)
Shallow parsing (Korkontzelos and Manandhar 2010)
Most of these experiments assume high-accuracy identification!

French and the French Treebank
MWEs are common in French: 5,000 multiword adverbs.
Paris 7 French Treebank: 16,000 trees; 13% of tokens are part of an MWE.
Example: [MWC [P sous] [N prétexte] [C que]] "on the grounds that"

French Treebank: MWE types
[Bar chart: MWEs by global POS (I, ET, CL, PRO, ADV, D, V, C, P, N), in % of total MWEs (0-50); N dominates.]
Lots of nominal compounds, e.g., N N "numéro deux".

MWE Identification Evaluation
Identification is a by-product of parsing.
Corpus: Paris 7 French Treebank (FTB)
Split: same as Crabbé and Candito (2008)
Metrics: precision and recall (a scoring sketch follows below)
Lengths ≤ 40 words
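Since identification falls out of parsing, scoring reduces to comparing the MWE spans read off the predicted trees against the gold treebank spans. A minimal sketch of that computation, with illustrative names rather than the authors' evaluation code:

```python
# Score MWE identification as precision/recall/F1 over labeled spans
# (start, end, MWE-category) read off predicted vs. gold parse trees.

def mwe_prf(gold_spans, pred_spans):
    """Precision, recall, and F1 over sets of (start, end, label) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                       # exactly matching spans
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: one gold MWE missed, one spurious prediction.
gold = [(0, 3, "MWV"), (5, 7, "MWN")]
pred = [(0, 3, "MWV"), (4, 7, "MWN")]
print(mwe_prf(gold, pred))  # (0.5, 0.5, 0.5)
```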

MWE Identification: Parent-Annotated PCFG
[Bar chart, F1: PA-PCFG 32.6]

MWE Identification: n-gram methods
[Bar chart, F1: PA-PCFG 32.6, mwetoolkit 34.7]
The standard approach in the 2008 MWE Shared Task, the MWE Workshops, etc.

n-gram methods: mwetoolkit
Based on surface statistics.
Step 1: Lemmatize and POS-tag the corpus.
Step 2: Compute n-gram statistics (sketched below): maximum likelihood estimator, Dice's coefficient, pointwise mutual information, Student's t-score (Ramisch, Villavicencio, and Boitet 2010).
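For reference, here are bigram versions of these association measures as commonly defined in the collocation literature; the function names and toy counts are illustrative, not mwetoolkit's API:

```python
import math

# Bigram association measures from corpus counts: c_xy = count of the
# bigram (x, y), c_x and c_y = unigram counts, n = corpus size.

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information of the bigram (x, y)."""
    return math.log2((c_xy * n) / (c_x * c_y))

def dice(c_xy, c_x, c_y):
    """Dice's coefficient: overlap of x and y normalized by their counts."""
    return 2 * c_xy / (c_x + c_y)

def t_score(c_xy, c_x, c_y, n):
    """Student's t-score against the independence null hypothesis."""
    expected = c_x * c_y / n
    return (c_xy - expected) / math.sqrt(c_xy)

# Toy example: "coup de" seen 100 times; "coup" 120 times, "de" 50,000
# times, in a corpus of 1M tokens.
print(pmi(100, 120, 50_000, 1_000_000))      # ~4.06: strongly associated
print(dice(100, 120, 50_000))                # ~0.004
print(t_score(100, 120, 50_000, 1_000_000))  # ~9.4
```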

n-gram methods: mwetoolkit
Step 3: Create n-gram feature vectors.
Step 4: Train a binary classifier.
This exploits the statistical idiomaticity of MWEs.

Is statistical idiomaticity sufficient?
French multiword verbs can be discontiguous: in "va d'ailleurs bon train" ("is also well underway"), the MWV "va ... bon train" is interrupted by the MWADV "d'ailleurs". The VN tree maintains the relationship between the MWV parts; surface n-gram statistics do not.

Recap: French MWE Identification Baselines
[Bar chart, F1: PA-PCFG 32.6, mwetoolkit 34.7]
Let's build a better grammar.

Better PCFGs: Manual grammar splits
Symbol refinement à la Klein and Manning (2003). Example: mark coordination nodes that contain a verbal nucleus (VN). In "Ou bien doit-il ..." ("Otherwise he must ..."), the COORD spanning [C Ou] [ADV bien] [VN doit -il] is relabeled COORD-hasVN (see the sketch below).
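As a concrete illustration of this kind of split, a minimal sketch that relabels COORD nodes dominating a VN; the (label, children) tree encoding is ours, not the paper's:

```python
# Relabel any COORD node that dominates a verbal nucleus (VN) as
# COORD-hasVN. Trees are (label, children) tuples; leaves are strings.

def refine(tree):
    if isinstance(tree, str):            # leaf (terminal word)
        return tree
    label, children = tree
    children = [refine(c) for c in children]
    if label == "COORD" and any(
        isinstance(c, tuple) and c[0] == "VN" for c in children
    ):
        label = "COORD-hasVN"            # the manual split
    return (label, children)

# "Ou bien doit-il ...": the COORD contains a VN, so it is relabeled.
coord = ("COORD", [("C", ["Ou"]), ("ADV", ["bien"]),
                   ("VN", [("V", ["doit"]), ("CL", ["-il"])])])
print(refine(coord)[0])  # COORD-hasVN
```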

French MWE Identification: Manual Splits
[Bar chart, F1: PA-PCFG 32.6, mwetoolkit 34.7, Splits 63.1]
MWE features: high-frequency POS sequences.

Capture more syntactic context?
PCFGs work well! Larger rules: Tree Substitution Grammars (TSG).
Relationship with Data-Oriented Parsing (DOP): same grammar formalism (TSG); we include unlexicalized fragments; different parameter estimation.

Which tree fragments do we select?
Start from the parse with the flat MWV node:
[S [NP [N He]] [VP [MWV [V kicked] [D the] [N bucket]]]]
and cut it into elementary trees, for example:
[NP [N He]]   [V kicked]   [MWV V [D the] [N bucket]]   [S NP [VP MWV]]

TSG Grammar Extraction as Tree Selection
The fragment [MWV V [D the] [N bucket]] describes the MWE context while leaving V open, allowing for inflection: kick, kicked, kicking.
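To make the segmentation concrete, here is a minimal sketch of reading fragments off a tree whose non-terminal nodes carry binary cut variables (the b variables of the DP-TSG slides later in the deck). The (label, b, children) encoding and the particular cut assignment are ours, chosen to reproduce the fragments above:

```python
# Read elementary trees off a segmented parse. Each non-terminal node is
# (label, b, children): b=1 means "cut here" (the node roots a fragment
# and appears as a substitution site in its parent), b=0 means the node
# stays internal to the enclosing fragment. Leaves are strings.

def grow(node, out):
    """Return the fragment containing `node`; cut children root new
    fragments, which are collected into `out`."""
    label, b, children = node
    kids = []
    for c in children:
        if isinstance(c, str):
            kids.append(c)               # terminals stay in the fragment
        elif c[1] == 1:                  # cut: child is a substitution site
            kids.append(c[0])
            out.append(grow(c, out))     # ...and roots its own fragment
        else:                            # no cut: child stays internal
            kids.append(grow(c, out))
    return (label, kids)

def extract(root):
    out = []
    out.append(grow(root, out))
    return out

tree = ("S", 1,
        [("NP", 1, [("N", 0, ["He"])]),
         ("VP", 0,
          [("MWV", 1,
            [("V", 1, ["kicked"]),
             ("D", 0, ["the"]),
             ("N", 0, ["bucket"])])])])
for frag in extract(tree):
    print(frag)
# ('NP', [('N', ['He'])])
# ('V', ['kicked'])
# ('MWV', ['V', ('D', ['the']), ('N', ['bucket'])])
# ('S', ['NP', ('VP', ['MWV'])])
```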

Dirichlet process TSG (DP-TSG)
Tree selection as non-parametric clustering.¹
Labeled Chinese Restaurant process: a Dirichlet process (DP) prior for each non-terminal type c.
Supervised case: segment the treebank.
¹ Cohn, Goldwater, and Blunsom 2009; Post and Gildea 2009; O'Donnell, Tenenbaum, and Goodman 2009.

DP-TSG: Learning and Inference
DP base distribution from the manually-split CFG.
Type-based Gibbs sampler (Liang, Jordan, and Klein 2010); fast convergence: 400 iterations.
The derivations of a TSG are a CFG forest, so we can use an SCFG decoder: cdec (Dyer et al. 2010).
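Why TSG derivations form a CFG forest: every elementary tree can be flattened into ordinary CFG rules by giving its internal nodes unique symbols, after which a standard CFG/SCFG decoder such as cdec can search over TSG derivations. A toy sketch of that idea; the encoding and naming scheme are ours, not necessarily the paper's:

```python
import itertools

_uid = itertools.count()

def to_cfg(frag, rules, lhs=None):
    """Flatten an elementary tree into CFG rules.

    frag = (label, children); a child is either a string (a terminal or a
    frontier non-terminal, kept as-is) or a nested (label, children) tuple
    (an internal node, which gets a unique symbol so the fragment can be
    reassembled unambiguously from the rules).
    """
    label, children = frag
    rhs = []
    for c in children:
        if isinstance(c, str):
            rhs.append(c)
        else:
            sub = f"{c[0]}_{next(_uid)}"   # unique internal symbol
            rhs.append(sub)
            to_cfg(c, rules, lhs=sub)
    rules.append((lhs or label, tuple(rhs)))
    return rules

# The MWV fragment from earlier flattens to three plain CFG rules:
mwv = ("MWV", ["V", ("D", ["the"]), ("N", ["bucket"])])
for l, r in to_cfg(mwv, []):
    print(l, "->", " ".join(r))
# D_0 -> the
# N_1 -> bucket
# MWV -> V D_0 N_1
```

Because each renamed internal symbol appears on the left of exactly one rule, any CFG derivation using these rules reconstructs the original fragment.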

French MWE Identification: DP-TSG
[Bar chart, F1: PA-PCFG 32.6, mwetoolkit 34.7, Splits 63.1, DP-TSG 71.1]
The DP-TSG result is a lower bound.

Human-interpretable DP-TSG rules
MWN → coup de N:
coup de pied     "kick"
coup de coeur    "favorite"
coup de foudre   "love at first sight"
coup de main     "help"
coup de grâce    "death blow"
n-gram methods treat each of these as a separate feature vector.

DP-TSG errors: Overgeneration
"Le marché national" ("the national market"):
Reference: [NP [D Le] [N marché] [AP [A national]]]
DP-TSG:    [NP [D Le] [MWN [N marché] [A national]]]
MWEs are subtle; the reference annotation is sometimes inconsistent.

Standard Parsing Evaluation
Same setup as MWE identification!
Corpus: Paris 7 French Treebank (FTB)
Split: same as Crabbé and Candito (2008)
Metrics: Evalb and Leaf Ancestor
Lengths ≤ 40 words

French Parsing Evaluation: All bracketings
[Bar chart, Evalb F1: PA-PCFG 67.6, Splits 75.2, DP-TSG 75.8]
See the paper for more results (Stanford, Berkeley, etc.).

Future Directions
Syntactic context for n-gram methods: parse the corpus! Adapt lexical context measures to syntactic context.
DP-TSG: a better base distribution.

Conclusion
Parsers work well for MWE identification.
Other languages: combine treebanks with MWE lists.
Non-"gold mode" parsing results for French.
Code: Google "Stanford parser".

Un grand merci. (Thanks a lot.)

Questions?

MWE Identification Results
[Bar chart, F1: PA-PCFG 32.6, mwetoolkit 34.7, Splits 63.1, Berkeley 69.6, Stanford 70.1, DP-TSG 71.1]

Dirichlet process TSG
A DP prior for each non-terminal type $c \in V$:
$$\theta_c \mid c, \alpha_c, P_0(\cdot \mid c) \sim \mathrm{DP}(\alpha_c, P_0)$$
$$e \mid \theta_c \sim \theta_c$$
A binary variable $b_s$ for each non-terminal node in the corpus; in the supervised case we segment the treebank. (Cohn, Goldwater, and Blunsom 2009; Post and Gildea 2009; O'Donnell, Tenenbaum, and Goodman 2009)
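Under this prior, the posterior predictive probability of an elementary tree follows the labeled Chinese Restaurant process: reuse a fragment in proportion to how often it has been used, or back off to the base distribution. A minimal sketch; all container and function names are illustrative, not the authors' implementation:

```python
from collections import defaultdict

def crp_prob(e, c, counts, totals, alpha, P0):
    """p(e | fragments with root c seen so far) under DP(alpha_c, P0).
    counts[c][e] and totals[c] record fragment usage; alpha[c] is the
    concentration parameter; P0 is the base distribution."""
    return (counts[c][e] + alpha[c] * P0(e, c)) / (totals[c] + alpha[c])

counts = defaultdict(lambda: defaultdict(int))
totals = defaultdict(int)
alpha = defaultdict(lambda: 1.0)

counts["MWV"][("MWV", "V", "the", "bucket")] = 3
totals["MWV"] = 10
print(crp_prob(("MWV", "V", "the", "bucket"), "MWV",
               counts, totals, alpha, lambda e, c: 1e-6))
# (3 + 1.0 * 1e-6) / (10 + 1.0) ~= 0.27
```

The Gibbs sampler resamples the cut variables $b_s$ by comparing such probabilities for the merged versus split fragments; the type-based sampler resamples all occurrences of a tree type jointly.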

DP-TSG: Base distribution $P_0$
Phrasal rules:
$$P_0(A^+ \rightarrow B\ C^+) = p_{\mathrm{MLE}}(A \rightarrow B\ C)\; s_B\,(1 - s_C)$$
$p_{\mathrm{MLE}}$ is the manually-split grammar! $s_B$ is the stop probability.

DP-TSG: Base distribution $P_0$
Lexical insertion rules:
$$P_0(C^+ \rightarrow t) = p_{\mathrm{MLE}}(C \rightarrow t)\; p(t)$$
$p(t)$ is the unigram probability of word $t$.
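A direct transcription of the two base-distribution cases above, as a sketch. Here p_mle holds the manually-split grammar's rule probabilities, s[X] the stop probability for non-terminal X, and p_uni[t] the unigram probability of word t; these container names are ours:

```python
def p0_phrasal(A, B, C, p_mle, s):
    """P0(A+ -> B C+) = p_MLE(A -> B C) * s_B * (1 - s_C)."""
    return p_mle[(A, (B, C))] * s[B] * (1 - s[C])

def p0_lexical(C, t, p_mle, p_uni):
    """P0(C+ -> t) = p_MLE(C -> t) * p(t)."""
    return p_mle[(C, (t,))] * p_uni[t]
```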

Tree substitution grammars
A probabilistic TSG is a 5-tuple $\langle V, \Sigma, R, S, \theta \rangle$:
$c \in V$ are non-terminals
$S \in V$ is a unique start symbol
$t \in \Sigma$ are terminals
$e \in R$ are elementary trees
$\theta_{c,e} \in \theta$ are parameters, one for each tree fragment
(elementary tree == tree fragment)
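To make the definition concrete, a minimal rendering of the 5-tuple as a data structure; the field names are ours, not from any released code:

```python
from dataclasses import dataclass, field

@dataclass
class ProbabilisticTSG:
    nonterminals: set            # V
    terminals: set               # Sigma
    fragments: list              # R: elementary trees (= tree fragments)
    start: str                   # S in V: the unique start symbol
    theta: dict = field(default_factory=dict)  # theta[(c, e)]

    def prob(self, c, e):
        """Parameter theta_{c,e} for fragment e rooted at non-terminal c."""
        return self.theta.get((c, e), 0.0)
```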