Adaptive Multi-Compositionality for Recursive Neural Models with Applications to Sentiment Analysis. July 31, 2014



Semantic Composition

Principle of Compositionality: the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them.

Compositional nature of natural language: go beyond words towards sentences.

Examples:
- red car -> red + car
- not very good -> not + (very + good)
- eat food -> eat + food

Recursive Neural Models (RNMs)
- Utilize the recursive structure of sentences to obtain semantic representations
- The vector representations are used as features and fed into a softmax classifier to predict their labels
- Learn to recursively perform semantic composition in vector space
- One family of popular deep learning models

[Figure: a recursive composition tree over "not very good": "very" and "good" are composed into "very good", which is composed with "not"; a softmax classifier on the node vector predicts the label Negative.]

Semantic Composition with Matrix/Tensor

The main difference among recursive neural models (RNMs) lies in their semantic composition methods.

RNN (Socher et al. 2011):
v = f( W [v_l; v_r] + b )

RNTN (Yu et al. 2013; Socher et al. 2013):
v = f( [v_l; v_r]^T T^{[1:D]} [v_l; v_r] + W [v_l; v_r] + b )

Problem: RNN and RNTN employ the same global composition function for all pairs of input vectors.
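To make the two global composition functions concrete, here is a minimal NumPy sketch (the dimension D = 4, the random initialization, and tanh as the nonlinearity f are illustrative assumptions, not the authors' settings):

import numpy as np

D = 4                                            # word vector dimension (assumed for illustration)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(D, 2 * D))       # global composition matrix
b = np.zeros(D)                                  # bias
T = rng.normal(scale=0.1, size=(D, 2 * D, 2 * D))  # tensor T^[1:D], one slice per output dimension

def rnn_compose(v_l, v_r):
    # RNN: v = f(W [v_l; v_r] + b), with f = tanh
    x = np.concatenate([v_l, v_r])
    return np.tanh(W @ x + b)

def rntn_compose(v_l, v_r):
    # RNTN: v = f([v_l; v_r]^T T^[1:D] [v_l; v_r] + W [v_l; v_r] + b)
    x = np.concatenate([v_l, v_r])
    bilinear = np.array([x @ T[d] @ x for d in range(D)])
    return np.tanh(bilinear + W @ x + b)

# Toy usage: compose "not (very good)" bottom-up with random placeholder vectors.
v_not, v_very, v_good = rng.normal(size=(3, D))
v_very_good = rnn_compose(v_very, v_good)          # very + good
v_not_very_good = rnn_compose(v_not, v_very_good)  # not + (very good)

The key point is that the same W (and T) is applied at every node, regardless of what kind of phrase is being composed.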

Motivation of This Work

Use different composition functions for different types of compositions:
- Negation: not good, not bad
- Intensification: very good, pretty bad
- Contrast: the movie is good, but I love it
- Sentiment word + target/aspect: good movie, low price

Model the composition as a distribution over multiple composition functions, and adaptively select them.

From One Global Composition Function to Adaptive Multi-Compositionality

Adaptive Compositionality

Use more than one composition function and adaptively select them depending on the input vectors:

v = f\left( \sum_{h=1}^{C} P(g_h | v_l, v_r) \, g_h(v_l, v_r) \right)

where g_h is the h-th composition function (both matrices and tensors can be used) and P(g_h | v_l, v_r) is the distribution over composition functions, predicted by a softmax classifier from the input vectors.

[Figure: the output vector is a weighted combination of candidate composition functions g_1, g_2, g_3, g_4; a softmax classifier over the input vectors produces the selection distribution.]

Three ways to define the selection distribution:
- Avg-AdaMC: P(g_h | v_l, v_r) = 1/C
- Weighted-AdaMC: [P(g_1 | v_l, v_r), \ldots, P(g_C | v_l, v_r)] = softmax(\beta S [v_l; v_r]); the Boltzmann distribution is used to adaptively select g_h
- Max-AdaMC: P(g_h | v_l, v_r) = 1 for the function with the maximum score, 0 otherwise
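A minimal NumPy sketch of the adaptive composition and its three selection schemes (the dimensions, random initialization, tanh as f, and the restriction to purely matrix-based g_h are illustrative assumptions, not the authors' settings):

import numpy as np

D, C = 4, 3                                      # vector dimension and number of composition functions (assumed)
rng = np.random.default_rng(1)
Ws = rng.normal(scale=0.1, size=(C, D, 2 * D))   # one composition matrix W_h per function g_h
S = rng.normal(scale=0.1, size=(C, 2 * D))       # selection matrix
beta = 2.0                                       # sharpness of the Boltzmann distribution

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def selection(v_l, v_r, mode="weighted"):
    # Distribution P(g_h | v_l, v_r) over the C composition functions.
    x = np.concatenate([v_l, v_r])
    if mode == "avg":                            # Avg-AdaMC: uniform weights
        return np.full(C, 1.0 / C)
    scores = S @ x
    if mode == "max":                            # Max-AdaMC: pick only the best-scoring function
        p = np.zeros(C)
        p[np.argmax(scores)] = 1.0
        return p
    return softmax(beta * scores)                # Weighted-AdaMC: Boltzmann distribution

def adamc_compose(v_l, v_r, mode="weighted"):
    # v = f( sum_h P(g_h | v_l, v_r) * g_h(v_l, v_r) ), here with g_h(v_l, v_r) = W_h [v_l; v_r]
    x = np.concatenate([v_l, v_r])
    p = selection(v_l, v_r, mode)
    return np.tanh(sum(p[h] * (Ws[h] @ x) for h in range(C)))

v_l, v_r = rng.normal(size=(2, D))
print(selection(v_l, v_r), adamc_compose(v_l, v_r))

Avg-AdaMC ignores the children when weighting the functions, while Weighted-AdaMC and Max-AdaMC condition the choice on [v_l; v_r]; Max-AdaMC is the limit of Weighted-AdaMC as beta grows large.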

Objective Function

Minimize the cross-entropy error with L2 regularization:

\min_\Theta E(\Theta) = -\sum_i \sum_j t_j^i \log y_j^i + \sum_{\theta \in \Theta} \lambda_\theta \|\theta\|_2^2

Target vector: t^j = [0 1 0]; predicted distribution: y^j = [0.07 0.69 0.15]

Parameters are updated with AdaGrad (Duchi, Hazan, and Singer 2011):

\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{G_t}} \frac{\partial E}{\partial \theta}\Big|_{\theta=\theta_{t-1}}, \qquad G_t = G_{t-1} + \left( \frac{\partial E}{\partial \theta}\Big|_{\theta=\theta_{t-1}} \right)^2
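The AdaGrad update above amounts to a few lines of NumPy; a small sketch (the learning rate, the epsilon added for numerical stability, and the toy gradient are assumptions for illustration):

import numpy as np

def adagrad_step(theta, grad, G, eta=0.01, eps=1e-8):
    # G_t = G_{t-1} + (dE/dtheta)^2 ;  theta_t = theta_{t-1} - eta / sqrt(G_t) * dE/dtheta
    G = G + grad ** 2
    theta = theta - eta / (np.sqrt(G) + eps) * grad
    return theta, G

theta = np.zeros(5)
G = np.zeros(5)
for _ in range(100):
    grad = 2 * (theta - 1.0)   # toy gradient of ||theta - 1||^2, standing in for dE/dtheta
    theta, G = adagrad_step(theta, grad, G)

Each parameter keeps its own running sum of squared gradients, so frequently updated parameters automatically get smaller step sizes.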

Parameter Estimation

Gradients are computed with the back-propagation algorithm, where x^i = [v_l^i; v_r^i] is the input to node i.

Classification:
\frac{\partial E}{\partial U_{mn}} = \sum_i v_n^i (y_m^i - t_m^i)

\delta_m^{i,r} =
\begin{cases}
\sum_k (y_k^i - t_k^i) U_{km} \, f'(a_m^i), & r = i \\
\left( \sum_k \delta_k^{par(i),r} \, \partial a_k^{par(i)} / \partial v_m^i \right) f'(a_m^i), & r \in anc(i)
\end{cases}

Composition selection:
\frac{\partial E}{\partial S_{mn}} = \sum_i \sum_{r \in bp(i)} \sum_k \sum_h \delta_k^{i,r} \, a_k^{i,g_h} \, x_n^i \cdot
\begin{cases}
\beta P(g_h | v_l^i, v_r^i) \left( P(g_h | v_l^i, v_r^i) - 1 \right), & h = m \\
\beta P(g_h | v_l^i, v_r^i) P(g_m | v_l^i, v_r^i), & h \neq m
\end{cases}

Linear composition:
\frac{\partial E}{\partial W^h_{mn}} = \sum_i \sum_{r \in bp(i)} \delta_m^{i,r} x_n^i P(g_h | v_l^i, v_r^i)

Tensor composition:
\frac{\partial E}{\partial V^h[d]_{mn}} = \sum_i \sum_{r \in bp(i)} \delta_d^{i,r} x_m^i x_n^i P(g_h | v_l^i, v_r^i)

Word embedding:
\frac{\partial E}{\partial L_{dw}} = \sum_{i: w_i = w} \sum_{r \in bp(i)} \delta_d^{i,r}
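To make the linear and tensor gradients above concrete, here is a hedged NumPy sketch for a single node (delta, x, and P are placeholder values standing in for the back-propagated error term, the concatenated children [v_l; v_r], and the selection distribution; the full gradients sum these per-node contributions over all nodes and roots):

import numpy as np

D, C = 4, 3
rng = np.random.default_rng(2)
delta = rng.normal(size=D)       # back-propagated error at this node (placeholder)
x = rng.normal(size=2 * D)       # concatenated children [v_l; v_r] (placeholder)
P = np.array([0.2, 0.5, 0.3])    # P(g_h | v_l, v_r) (placeholder)

# Linear composition: dE/dW_h[m, n] = delta_m * x_n * P(g_h | v_l, v_r)
grad_W = np.stack([P[h] * np.outer(delta, x) for h in range(C)])                      # shape (C, D, 2D)

# Tensor composition: dE/dV_h[d][m, n] = delta_d * x_m * x_n * P(g_h | v_l, v_r)
grad_V = np.stack([P[h] * np.einsum('d,m,n->dmn', delta, x, x) for h in range(C)])    # shape (C, D, 2D, 2D)

# Word embeddings: the error vector is simply accumulated into the rows of L
# corresponding to the leaf words (omitted here).

Each gradient is an outer product of the error term and the node's inputs, scaled by how strongly the corresponding composition function was selected for that node.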

Stanford Sentiment Treebank
- 10,662 critic reviews from Rotten Tomatoes
- 215,154 phrases obtained from the output of the Stanford Parser
- Workers on Amazon Mechanical Turk annotated polarity levels for all these phrases
- The sentiment scale is merged into five categories (very negative, negative, neutral, positive, very positive)
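A small sketch of the merging step, assuming the phrase-level sentiment scores lie on a [0, 1] scale and using the commonly cited cut-offs at 0.2/0.4/0.6/0.8 (the exact thresholds are an assumption here, not stated on the slide):

def sst_label(score):
    # Map a sentiment score in [0, 1] to one of the five categories.
    bins = [(0.2, "very negative"), (0.4, "negative"),
            (0.6, "neutral"), (0.8, "positive"), (1.0, "very positive")]
    for upper, label in bins:
        if score <= upper:
            return label
    return "very positive"   # scores above 1.0 should not occur; kept as a safe default

print(sst_label(0.1), sst_label(0.55), sst_label(0.9))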

[Table: results of the evaluation on the Sentiment Treebank; the top three methods are shown in bold.] Our methods achieve the best performance when β is set to 2.

Recall the adaptive composition and its variants:

v = f\left( \sum_{h=1}^{C} P(g_h | v_l, v_r) \, g_h(v_l, v_r) \right)

- Avg-AdaMC: P(g_h | v_l, v_r) = 1/C
- Weighted-AdaMC: [P(g_1 | v_l, v_r), \ldots, P(g_C | v_l, v_r)] = softmax(\beta S [v_l; v_r])
- Max-AdaMC: P(g_h | v_l, v_r) = 1 for the function with the maximum score, 0 otherwise

Vector Representations

Word/Phrase -> Neighboring Words/Phrases in the Vector Space
- good -> cool, fantasy, classic, watchable, attractive
- boring -> dull, bad, disappointing, horrible, annoying
- ingenious -> extraordinary, inspirational, imaginative, thoughtful, creative
- soundtrack -> execution, animation, cast, colors, scene
- good actors -> good ideas, good acting, good looks, good sense, great cast
- thought-provoking film -> beautiful film, engaging film, lovely film, remarkable film, riveting story
- painfully bad -> how bad, too bad, really bad, so bad, very bad
- not a good movie -> is n't much fun, is n't very funny, nothing new, is n't as funny, of clichés
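Neighbor lists like the ones above can be produced by a cosine-similarity lookup over the learned node representations; a minimal sketch with placeholder vectors (the phrase set and dimensionality are made up for illustration):

import numpy as np

def nearest(query, vectors, k=5):
    # vectors: dict mapping word/phrase -> learned vector representation (placeholder data here)
    q = vectors[query]
    sims = {p: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-8))
            for p, v in vectors.items() if p != query}
    return sorted(sims, key=sims.get, reverse=True)[:k]

rng = np.random.default_rng(0)
vectors = {p: rng.normal(size=25)
           for p in ["good", "cool", "boring", "dull", "good actors", "good acting"]}
print(nearest("good", vectors, k=3))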

[Figure: t-SNE visualization of learned word vectors, grouped by sentiment class. Very positive: creative, great, perfect, superb, amazing. Positive: fancy, good, cool, promising, interested. Objective: plot, near, buy, surface, them, version. Negative: problem, slow, sick, mess, poor, wrong. Very negative: failure, worst, disaster, horrible.]

Composition Pairs in the Composition Space

For a composition pair (v_l, v_r), we use its distribution over composition functions [P(g_1 | v_l, v_r), \ldots, P(g_C | v_l, v_r)] to query its neighboring pairs.

Composition Pair -> Neighboring Composition Pairs
- really bad -> very bad / only dull / much bad / extremely bad / (all that) bad
- (is n't) (necessarily bad) -> (is n't) (painfully bad) / not mean-spirited / not (too slow) / not well-acted / (have otherwise) (been bland)
- great (Broadway play) -> great (cinematic innovation) / great subject / great performance / energetic entertainment / great (comedy filmmaker)
- (arty and) jazzy -> (Smart and) fun / (verve and) fun / (unique and) entertaining / (gentle and) engrossing / (warmth and) humor
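The same kind of lookup can be done in the composition space: each pair is represented by its distribution over composition functions, and its neighbors are the pairs with the closest distributions. A small self-contained sketch (the selection matrix S, β = 2, the Euclidean distance, and the placeholder pairs are assumptions for illustration):

import numpy as np

C = 3
rng = np.random.default_rng(3)
S = rng.normal(scale=0.1, size=(C, 8))   # selection matrix over concatenated children (assumed shapes)

def pair_signature(v_l, v_r, beta=2.0):
    # [P(g_1 | v_l, v_r), ..., P(g_C | v_l, v_r)] = softmax(beta * S [v_l; v_r])
    z = beta * (S @ np.concatenate([v_l, v_r]))
    e = np.exp(z - z.max())
    return e / e.sum()

def nearest_pairs(query, pairs, k=5):
    # pairs: dict mapping a phrase-pair label -> (v_l, v_r) child vectors (placeholder data here)
    q = pair_signature(*pairs[query])
    dists = {label: float(np.linalg.norm(pair_signature(v_l, v_r) - q))
             for label, (v_l, v_r) in pairs.items() if label != query}
    return sorted(dists, key=dists.get)[:k]

pairs = {label: (rng.normal(size=4), rng.normal(size=4))
         for label in ["really bad", "very bad", "extremely bad", "great subject"]}
print(nearest_pairs("really bad", pairs, k=2))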

Visualization: Composition Pairs

[Figure: t-SNE of composition pairs plotted by their distributions [P(g_1 | v_l, v_r), \ldots, P(g_C | v_l, v_r)]. Clusters correspond to composition types such as: these/this/the *, * and, for/with *, and *, adj noun (*), Entity, Negation, Intensification, verb *, * 's, a/an/two *, of *.]

Highlighted clusters and example pairs:
- adj noun (*): Best films, Riveting story, Solid cast, Talented director, Gorgeous visuals
- Intensification: Really good, Quite funny, Damn fine, Very good, Particularly funny
- Negation: Is never dull, Not smart, Not a good movie, Is n't much fun, Wo n't be disappointed
- Entity: Roberto Alagna, Pearl Harbor, Elizabeth Hurley, Diane Lane, Pauly Shore

Future Work
- Apply AdaMC to other NLP tasks
- Utilize external information to adaptively select the composition functions: part-of-speech tags, syntactic parsing results
- Mix different composition types together: linear combination approach (RNN), tensor-based approach (RNTN), multiplication approach

THANKS!