Prepositional Phrase Attachment over Word Embedding Products

Prepositional Phrase Attachment over Word Embedding Products
Pranava Madhyastha (1), Xavier Carreras (2), Ariadna Quattoni (2)
(1) University of Sheffield   (2) Naver Labs Europe

Prepositional Phrases

I went to the restaurant by bike
vs.
I went to the restaurant by the bridge

[In each sentence, candidate "prep?" attachment arcs link the PP to either the verb or the preceding noun.]
s/bridge/arno/?

Local PP Attachment Decisions

[Ratnaparkhi, 1998] approached PP attachment as a local classification decision.
This loses the potential advantage of inference over the whole structure.

Two settings:
- Binary attachment problem [Ratnaparkhi, 1998]
- Multi-class attachment problem [Belinkov et al., 2015]

Word Embeddings

Previous approaches: features in a binary decision setting.
Word embeddings are a rich source of information.
How far can we go with only word embeddings?

Resolving Ternary Relations using Word Embeddings

φ(head), φ(prep), φ(mod)

The problem reduces to finding the compatibility between φ(head), φ(prep) and φ(mod).
One simple approach: f(h, p, m) = w^T [v_h ⊗ v_p ⊗ v_m]
To also exploit each dimension alone, a simple trick: append a constant 1 to each embedding.

Our Proposal: Unfolding the Tensors

[Diagram: the three-way tensor W, with modes h, p, m, unfolded along the head mode.]

This results in: f(h, p, m) = v_h^T W [v_p ⊗ v_m]
We can further explore compositionality as: f(h, p, m) = α(h)^T W β(p, m)
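A minimal numpy sketch of this unfolded-tensor score (not the authors' code; the embedding dimension and the random parameter matrix are illustrative assumptions):

```python
import numpy as np

n = 100                                  # embedding dimension (illustrative)
rng = np.random.default_rng(0)

# Unfolded parameter tensor: rows index head dimensions,
# columns index (preposition x modifier) dimension pairs.
W = rng.normal(size=(n, n * n))

def score(v_h, v_p, v_m, W):
    """f(h, p, m) = v_h^T W (v_p ⊗ v_m), with ⊗ the Kronecker product."""
    return v_h @ W @ np.kron(v_p, v_m)

v_h, v_p, v_m = rng.normal(size=(3, n))
print(score(v_h, v_p, v_m, W))
```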

Training the Model

We use the logistic loss, i.e. standard conditional maximum-likelihood optimization:

    arg min_W  logistic(tset, W) + λ ||W||

As a structural constraint, we use rank-constrained regularization.
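One plausible reading of this objective, sketched in numpy for illustration only (the candidate-head structure of the examples and the generic norm penalty standing in for the rank regularizer are assumptions):

```python
import numpy as np

def neg_log_likelihood(W, examples, lam=0.1):
    """Conditional negative log-likelihood over candidate heads,
    plus a generic norm penalty (stand-in for the rank regularizer)."""
    loss = 0.0
    for cand_heads, v_p, v_m, gold in examples:
        # score each candidate head for this (preposition, modifier) pair
        scores = np.array([v_h @ W @ np.kron(v_p, v_m) for v_h in cand_heads])
        scores -= scores.max()                    # numerical stability
        loss -= scores[gold] - np.log(np.exp(scores).sum())
    return loss + lam * np.linalg.norm(W)
```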

Empirical Overview

- Exploring word embeddings
- Exploring compositionality
- Binary attachment datasets
- Multiple attachment datasets
- Semi-supervised multiple attachment experiments with PTB and WTB

Exploring Word Embeddings

- Word2vec-based skip-gram embeddings [Mikolov et al., 2013]
- Skip-dep [Bansal et al., 2014]: word2vec with dependency contexts
- Exploration over different dimensions and training data

Exploring Word Embeddings

Attachment accuracy on the RRR dev set, by embedding type, source data, and dimension n:

Type                   Source data    n = 50   n = 100   n = 300
Skip-gram              BLLIP          83.23    83.77     83.84
Skip-gram              Wikipedia      83.74    84.25     84.22
Skip-gram              NYT            84.76    85.06     85.15
Skip-dep               BLLIP          85.52    86.33     85.97
Skip-dep               Wikipedia      84.23    84.39     84.32
Skip-dep               NYT            85.27    85.48
Skip-gram & Skip-dep   BLLIP          83.44

Exploring Compositionality

f(h, p, m) = α(h)^T W β(p, m)

Exploring the possibilities for β(p, m):
- Sum:              f(h, p, m) = α(h)^T W (v_p + v_m)
- Concatenation:    f(h, p, m) = α(h)^T W (v_p ; v_m)
- Product:          f(h, p, m) = α(h)^T W (v_p ⊗ v_m)
- Prep. identities: f(h, p, m) = α(h)^T W_p v_m
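As an illustration only (not the authors' code), the four choices of β(p, m) can be written in numpy as follows; the preposition vocabulary size and the one-hot indicator i_p for the identity variant are assumptions:

```python
import numpy as np

def beta_sum(v_p, v_m):
    """Sum composition: vector of size n."""
    return v_p + v_m

def beta_concat(v_p, v_m):
    """Concatenation: vector of size 2n."""
    return np.concatenate([v_p, v_m])

def beta_prep_identity(prep_id, v_m, num_preps):
    """Preposition identities: one-hot i_p ⊗ v_m, vector of size P*n."""
    i_p = np.zeros(num_preps)
    i_p[prep_id] = 1.0
    return np.kron(i_p, v_m)

def beta_product(v_p, v_m):
    """Product composition: v_p ⊗ v_m, vector of size n*n."""
    return np.kron(v_p, v_m)

# In every case the score is f(h, p, m) = alpha_h @ W @ beta,
# with W of shape (n, len(beta)).
```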

Exploring Compositionality: Results

Composition of p and m             Tensor size   Acc.
Sum              [v_p + v_m]       n × n         84.42
Concatenation    [v_p ; v_m]       n × 2n        84.94
Prep. identities [i_p ⊗ v_m]       n × Pn        84.36
Product          [v_p ⊗ v_m]       n × n²        85.52

Empirical Results over Binary Attachment Datasets

Test accuracy:

Method           Word embedding                   RRR     WIKI    NYT
Tensor product   Skip-gram, Wikipedia, n = 100    84.96   83.48   82.13
Tensor product   Skip-gram, NYT, n = 100          85.11   83.52   82.65
Tensor product   Skip-dep, BLLIP, n = 100         86.13   83.60   82.30
Tensor product   Skip-dep, Wikipedia, n = 100     85.01   83.53   82.10
Tensor product   Skip-dep, NYT, n = 100           85.49   83.64   83.47
Stetina et al., 1997 (*)                          88.1    -       -
Collins et al., 1999                              84.1    72.7    80.9
Belinkov et al., 2014 (*)                         85.6    -       -
Nakashole et al., 2015 (*)                        84.3    79.3    84.3

Scores refer to accuracy measures.

Multiple Heads and Positional Information

f(h, p, m) = α(h)^T W β(p, m)

In this case, we can modify α(h) to include the positional information:
α(h) = δ_h ⊗ v_h, where δ_h ∈ R^H is the position vector.
This can also be written as: f(h, p, m) = v_h^T W_{δ_h} (v_p ⊗ v_m).
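A hedged numpy sketch of this positional variant, under the assumption that δ_h is a one-hot vector over H candidate head positions, so that W amounts to a stack of H position-specific matrices (dimensions are illustrative):

```python
import numpy as np

n, H = 50, 10                       # embedding dim and number of head positions (illustrative)
rng = np.random.default_rng(1)
W = rng.normal(size=(H, n, n * n))  # one n x n^2 slice per candidate head position

def score_positional(v_h, v_p, v_m, pos, W):
    """f(h, p, m) = v_h^T W_{δ_h} (v_p ⊗ v_m), with δ_h one-hot at `pos`."""
    return v_h @ W[pos] @ np.kron(v_p, v_m)
```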

Experiments with the Multiple Attachment Setting

Test accuracy:

Method                           Arabic   English
Tensor product (n = 50, ℓ2)      -        87.8
Tensor product (n = 50, ℓ*)      -        88.3
Tensor product (n = 100, ℓ*)     81.1     88.4
Belinkov et al., 2014 (basic)    77.1     85.4
Belinkov et al., 2014 (syn)      79.1     87.1
Belinkov et al., 2014 (feat)     80.4     87.7
Belinkov et al., 2014 (full)     82.6     88.7
Yu et al., 2016 (full)           -        90.3

Semi-supervised Experiments

- Embeddings trained on source and target raw data
- Model trained on the source data
- Compared against the Stanford neural-network-based parser and the Turbo parser

Semi-supervised Experiments: PTB vs. Web Treebank

Method                             PTB Test   Web Treebank Test
                                              A       E       N       R       W       Avg
                                   (2523)     (868)   (936)   (839)   (902)   (788)   (4333)
Tensor, BLLIP                      89.0       82.7    82.6    87.4    82.5    86.3    84.2
Tensor, BLLIP+WTB                  88.9       83.3    85.2    90.1    85.9    86.6    86.1
Chen et al., 2014                  87.3       79.3    79.7    85.7    82.2    83.8    82.0
Martins et al., 2010 (2nd order)   88.8       83.6    83.7    87.6    84.2    87.8    85.3
Martins et al., 2010 (3rd order)   88.9       84.2    84.5    87.6    84.4    87.6    85.6

Scores refer to accuracy measures.

What Seems to Work

[Dependency parse example comparing the PP attachment chosen by the Turbo (3rd-order), Tensor, and Stanford parsers on the fragment: "... Came the disintegration of the Beatles' minds with LSD ..."]

Where it has a Problem

[Dependency parse example: "... the return address for the letters to the Senators ..."]

Conclusions

- We described a PP attachment model based on simple tensor products.
- Products of embeddings seem better at capturing compositions.
- On out-of-domain datasets, the models show considerable promise.
- The proposed models can serve as a building block for dependency parsing methods.

Thank You

Learning with Structural Constraints

Controlling the learned parameter matrix with regularization:
- With ℓ2 regularization we obtain a dense, dispersed parameter matrix.
- With ℓ* (low-rank inducing) regularization we obtain a low-rank matrix.

Recall the singular value decomposition: SVD(W) = U Σ V^T, where U is n × k, Σ = diag(σ_1, ..., σ_k), and V^T is k × n.

Bonus: everything can be optimized as a convex problem with FOBOS!

A Quick Look at Optimization (FOBOS)

Input: gradient function g
Output: W_{t+1}, where each proximal step solves
    W_{t+1} = argmin_W ||W - W_{t+0.5}||_2^2 + η_t λ r(W)

W_1 = 0
while t < T do
    η_t = c / sqrt(t)                          (step size)
    W_{t+0.5} = W_t - η_t g(W_t)               (gradient step)
    if ℓ2 regularizer then
        W_{t+1} = (1 / (1 + η_t λ)) W_{t+0.5}
    else if ℓ* regularizer then
        U Σ V^T = SVD(W_{t+0.5})
        Σ_{i,i} = max(Σ_{i,i} - η_t λ, 0)      (singular value thresholding)
        W_{t+1} = U Σ V^T
    end
end
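A minimal numpy sketch of the two proximal updates in the pseudocode above (ℓ2 shrinkage and singular-value soft-thresholding for the nuclear norm); the step-size schedule and hyperparameters are illustrative, not the authors' settings:

```python
import numpy as np

def prox_l2(W_half, eta, lam):
    """Proximal step for the squared-l2 regularizer: simple shrinkage."""
    return W_half / (1.0 + eta * lam)

def prox_nuclear(W_half, eta, lam):
    """Proximal step for the nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W_half, full_matrices=False)
    s = np.maximum(s - eta * lam, 0.0)
    return (U * s) @ Vt

def fobos(grad, shape, T=100, c=1.0, lam=0.1, nuclear=True):
    """Forward-backward splitting: gradient step, then proximal step."""
    W = np.zeros(shape)
    for t in range(1, T + 1):
        eta = c / np.sqrt(t)                 # decaying step size (illustrative)
        W_half = W - eta * grad(W)
        W = prox_nuclear(W_half, eta, lam) if nuclear else prox_l2(W_half, eta, lam)
    return W
```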

Data

- BLLIP: 1.8 million sentences and 43 million tokens of Wall Street Journal text (excluding the PTB evaluation sets)
- English Wikipedia: 13.1 million sentences and 129 million tokens
- The New York Times portion of the GigaWord corpus: 52 million sentences and 1,253 million tokens