Adaptive Multi-Compositionality for Recursive Neural Models with Applications to Sentiment Analysis

July 31, 2014
Semantic Composition

Principle of Compositionality: the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them.
- Compositional nature of natural language
- Go beyond words towards sentences

Examples:
- red car -> red + car
- not very good -> not + (very + good)
- eat food -> eat + food
Recursive Neural Models (RNMs)
- Utilize the recursive structures of sentences to obtain semantic representations
- Learn to recursively perform semantic composition in vector space
- The vector representations are used as features and fed into a softmax classifier to predict their labels
- One family of popular deep learning models

[Figure: parse tree for "not very good" — "very" and "good" compose into "very good", then "not" and "very good" compose into "not very good"; the softmax classifier predicts Negative.]
Semantic Composition with Matrix/Tensor

The main difference among recursive neural models (RNMs) lies in their semantic composition methods.
- RNN (Socher et al. 2011): v = f( W [v_l; v_r] + b )
- RNTN (Yu et al. 2013; Socher et al. 2013): v = f( [v_l; v_r]^T T^[1:D] [v_l; v_r] + W [v_l; v_r] + b )

Problem: RNN and RNTN employ the same global composition function for all pairs of input vectors.
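The two composition functions above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the function names and the choice f = tanh are assumptions:

```python
import numpy as np

def rnn_compose(v_l, v_r, W, b):
    """RNN composition: v = f(W [v_l; v_r] + b), with f = tanh assumed."""
    x = np.concatenate([v_l, v_r])   # stacked child vectors, shape (2D,)
    return np.tanh(W @ x + b)        # W: (D, 2D), b: (D,)

def rntn_compose(v_l, v_r, T, W, b):
    """RNTN composition: adds a bilinear tensor term x^T T^[1:D] x."""
    x = np.concatenate([v_l, v_r])               # shape (2D,)
    bilinear = np.einsum('i,dij,j->d', x, T, x)  # T: (D, 2D, 2D), one slice per output dim
    return np.tanh(bilinear + W @ x + b)
```

Both functions map a pair of D-dimensional child vectors to a single D-dimensional parent vector, so they can be applied bottom-up along the parse tree.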
Motivation of This Work

Use different composition functions for different types of composition:
- Negation: not good, not bad
- Intensification: very good, pretty bad
- Contrast: the movie is slow, but I love it
- Sentiment word + target/aspect: good movie, low price

Model the composition as a distribution over multiple composition functions, and adaptively select among them.
One Global Composition Function -> Adaptive Multi-Compositionality
Adaptive Compositionality

Use more than one composition function and adaptively select among them depending on the input vectors:

v = f( Σ_{h=1}^{C} P(g_h | v_l, v_r) · g_h(v_l, v_r) )

where g_h is the h-th composition function (both matrices and tensors can be used), and P(g_h | v_l, v_r) is the distribution over composition functions, predicted by a softmax classifier from the input vectors. Three variants:
- Avg-AdaMC: P(g_h | v_l, v_r) = 1/C
- Weighted-AdaMC: [P(g_1 | v_l, v_r), …, P(g_C | v_l, v_r)] = softmax( β S [v_l; v_r] ); the Boltzmann distribution is used to adaptively select g_h
- Max-AdaMC: P(g_h | v_l, v_r) = 1 for the maximum-score composition function, 0 otherwise
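A minimal NumPy sketch of the three AdaMC variants. This is illustrative only: the name `adamc_compose`, the choice f = tanh, and the `comps` list standing in for the functions g_1…g_C are all assumptions, not the authors' code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def adamc_compose(v_l, v_r, comps, S, beta=2.0, variant='weighted'):
    """Compose two child vectors with C composition functions g_1..g_C,
    mixed by a distribution P(g_h | v_l, v_r).

    comps: list of C functions, each mapping the stacked input x -> (D,) vector
    S:     (C, 2D) scoring matrix for the softmax/Boltzmann selection
    """
    x = np.concatenate([v_l, v_r])
    C = len(comps)
    if variant == 'avg':            # Avg-AdaMC: uniform weights 1/C
        P = np.full(C, 1.0 / C)
    elif variant == 'weighted':     # Weighted-AdaMC: softmax(beta * S [v_l; v_r])
        P = softmax(beta * (S @ x))
    else:                           # Max-AdaMC: all mass on the top-scoring g_h
        P = np.zeros(C)
        P[np.argmax(S @ x)] = 1.0
    v = np.tanh(sum(P[h] * comps[h](x) for h in range(C)))
    return v, P
```

Note that Avg-AdaMC and Max-AdaMC are the limiting cases of Weighted-AdaMC as the inverse temperature β goes to 0 and to infinity, respectively.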
Objective Function

Minimize the cross-entropy error with L2 regularization:

min_Θ E(Θ) = − Σ_i Σ_j t_j^i log y_j^i + Σ_{θ∈Θ} λ_θ ||θ||_2^2

- Target vector: t^j = [0 1 0]
- Predicted distribution: y^j = [0.07 0.69 0.15]

Optimized with AdaGrad (Duchi, Hazan, and Singer 2011):

G_t = G_{t−1} + ( ∂E/∂θ |_{θ=θ_{t−1}} )^2
θ_t = θ_{t−1} − (η / √G_t) · ∂E/∂θ |_{θ=θ_{t−1}}
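The AdaGrad update above can be sketched as a one-step function over a parameter array. A minimal illustration, not the authors' code; the function name and the small `eps` smoothing term are assumptions:

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.01, eps=1e-8):
    """One AdaGrad update:
    G_t = G_{t-1} + grad**2
    theta_t = theta_{t-1} - eta / sqrt(G_t) * grad
    Per-coordinate accumulation G gives each parameter its own decaying step size.
    """
    G = G + grad ** 2
    theta = theta - eta / (np.sqrt(G) + eps) * grad
    return theta, G

# Usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, G = np.array([5.0]), np.zeros(1)
for _ in range(2000):
    theta, G = adagrad_step(theta, 2 * theta, G, eta=0.5)
```

Frequently-updated parameters accumulate a large G and take smaller steps, which suits the sparse word-embedding gradients in this model.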
Parameter Estimation

Back-propagation algorithm. Let x^i = [v_l^i; v_r^i], let a^{i,g_h} = g_h(v_l^i, v_r^i), and let bp(i) denote the nodes whose errors are back-propagated to node i.
- Classification (softmax matrix U): ∂E/∂U_mn = Σ_i v_n^i (y_m^i − t_m^i)
- Error signals: at the predicting node, δ_m^i = [ Σ_k U_km (y_k^i − t_k^i) ] f′(a_m^i); for ancestor nodes r ∈ anc(i), δ_m^{i_r} is obtained by back-propagating the parent's δ^{par(i_r)} through the composition and multiplying by f′(a_m^{i_r})
- Composition selection (matrix S): ∂E/∂S_mn = Σ_i Σ_{r∈bp(i)} Σ_k δ_k^{i_r} Σ_h a_k^{i,g_h} x_n^i · β P(g_h | v_l^i, v_r^i) ( 1[h = m] − P(g_m | v_l^i, v_r^i) )
- Linear composition: ∂E/∂W_mn^h = Σ_i Σ_{r∈bp(i)} δ_m^{i_r} x_n^i P(g_h | v_l^i, v_r^i)
- Tensor composition: ∂E/∂V^h[d]_mn = Σ_i Σ_{r∈bp(i)} δ_d^{i_r} x_m^i x_n^i P(g_h | v_l^i, v_r^i)
- Word embedding: ∂E/∂L_{w,d} = Σ_{i=w} Σ_{r∈bp(i)} δ_d^{i_r}
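The closed-form classification gradient ∂E/∂U_mn = v_n (y_m − t_m) can be sanity-checked against finite differences. An illustrative check, not the authors' code; all names are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(U, v, t):
    """E = -sum_j t_j log y_j with y = softmax(U v)."""
    y = softmax(U @ v)
    return -np.sum(t * np.log(y))

def analytic_grad_U(U, v, t):
    """Closed form from the slides: dE/dU_mn = v_n (y_m - t_m)."""
    y = softmax(U @ v)
    return np.outer(y - t, v)

def numeric_grad_U(U, v, t, eps=1e-6):
    """Central finite differences over each entry of U, as a sanity check."""
    g = np.zeros_like(U)
    for m in range(U.shape[0]):
        for n in range(U.shape[1]):
            Up, Um = U.copy(), U.copy()
            Up[m, n] += eps
            Um[m, n] -= eps
            g[m, n] = (cross_entropy(Up, v, t) - cross_entropy(Um, v, t)) / (2 * eps)
    return g
```

The same finite-difference check applies to the S, W, V, and L gradients, which is a standard way to debug a hand-derived back-propagation.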
Stanford Sentiment Treebank
- 10,662 critic reviews from Rotten Tomatoes
- 215,154 phrases from the output of the Stanford Parser
- Workers on Amazon Mechanical Turk annotated polarity levels for all these phrases
- The sentiment scales are merged into five categories (very negative, negative, neutral, positive, very positive)
[Table: results of evaluation on the Sentiment Treebank; the top three methods are in bold.] Our methods achieve the best performance when β is set to 2.
v = f( Σ_{h=1}^{C} P(g_h | v_l, v_r) · g_h(v_l, v_r) )
- Avg-AdaMC: P(g_h | v_l, v_r) = 1/C
- Weighted-AdaMC: [P(g_1 | v_l, v_r), …, P(g_C | v_l, v_r)] = softmax( β S [v_l; v_r] )
- Max-AdaMC: P(g_h | v_l, v_r) = 1 for the maximum-score composition function, 0 otherwise
Vector Representations

Querying neighboring words/phrases in the vector space:
- good -> cool, fantasy, classic, watchable, attractive
- boring -> dull, bad, disappointing, horrible, annoying
- ingenious -> extraordinary, inspirational, imaginative, thoughtful, creative
- soundtrack -> execution, animation, cast, colors, scene
- good actors -> good ideas, good acting, good looks, good sense, great cast
- thought-provoking film -> beautiful film, engaging film, lovely film, remarkable film, riveting story
- painfully bad -> how bad, too bad, really bad, so bad, very bad
- not a good movie -> isn't much fun, isn't very funny, nothing new, isn't as funny, of clichés
[t-SNE visualization of word vectors, which form sentiment clusters:
- Very positive: creative, great, perfect, superb, amazing
- Positive: fancy, good, cool, promising, interested
- Objective: plot, near, buy, surface, them, version
- Negative: problem, slow, sick, mess, poor, wrong
- Very negative: failure, worst, disaster, horrible]
Composition Pairs in the Composition Space

For the composition pair (v_l, v_r), we use the distribution over composition functions [P(g_1 | v_l, v_r), …, P(g_C | v_l, v_r)] to query its neighboring pairs:
- really bad -> very bad / only dull / much bad / extremely bad / (all that) bad
- (is n't) (necessarily bad) -> (is n't) (painfully bad) / not mean-spirited / not (too slow) / not well-acted / (have otherwise) (been bland)
- great (Broadway play) -> great (cinematic innovation) / great subject / great performance / energetic entertainment / great (comedy filmmaker)
- (arty and) jazzy -> (Smart and) fun / (verve and) fun / (unique and) entertaining / (gentle and) engrossing / (warmth and) humor
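The neighbor query can be sketched as a similarity lookup over the distributions. Illustrative only; the slide does not state which similarity measure is used, so cosine similarity here is an assumption:

```python
import numpy as np

def nearest_pairs(query_P, pair_Ps, pair_names, k=5):
    """Rank composition pairs by cosine similarity of their
    composition-function distributions [P(g_1 | v_l, v_r), ..., P(g_C | v_l, v_r)]."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(query_P, P) for P in pair_Ps])
    order = np.argsort(sims)[::-1][:k]   # indices of the k most similar pairs
    return [pair_names[i] for i in order]
```

Pairs that trigger the same composition function (e.g. negation) end up with similar distributions and are therefore returned as neighbors of each other.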
Visualization: Composition Pairs

[t-SNE visualization of composition pairs by their distributions [P(g_1 | v_l, v_r), …, P(g_C | v_l, v_r)]. The pairs form clusters such as: these/this/the *; * and; for/with *; * and *; adj noun (*); entity; negation; intensification; verb *; * 's; a/an/two *; of *. Highlighted examples:
- Adj + noun: best films, riveting story, solid cast, talented director, gorgeous visuals
- Intensification: really good, quite funny, damn fine, very good, particularly funny
- Negation: is never dull, not smart, not a good movie, isn't much fun, won't be disappointed
- Entities: Roberto Alagna, Pearl Harbor, Elizabeth Hurley, Diane Lane, Pauly Shore]
Future Work
- Use AdaMC for other NLP tasks
- Utilize external information to adaptively select the composition functions: part-of-speech tags, syntactic parsing results
- Mix different composition types together: linear combination approach (RNN), tensor-based approach (RNTN), multiplication approach
THANKS!