Multiple Aspect Ranking Using the Good Grief Algorithm. Benjamin Snyder and Regina Barzilay MIT


From One Opinion To Many. Much previous work assumes one opinion per text (Turney 2002; Pang et al 2002; Pang & Lee 2005). Real texts contain multiple, related opinions: "The food was a little greasy, but it was priced pretty well. Our only complaint was the service after our order was taken." (Aspects: Food, Price, Service.)

http://www.we8there.com

Multiple Aspect Opinion Analysis. The Task: predict the writer's opinion on a fixed set of aspects (e.g. food, service, price) using a fixed scale (e.g. 1-5). Simple approach: treat each aspect as an independent ranking (rating) problem.

Shortcomings of Independent Ranking. Multiple opinions in a single text are correlated; real text relates opinions in coherent, meaningful ways. "The food was a little greasy, but it was priced pretty well. Our only complaint was the service after our order was taken." (Service < Food < Price.) Independent ranking fails to model these correlations and to exploit discourse cues.

Key Goals. Build on the previous success of simple linear models trained with the Perceptron in NLP: simple, fast training; high performance; exact, fast decoding. Extend the framework to tasks with complex label dependencies: a task-specific dependency space; label dependencies sensitive to input features; factorization of the label prediction and dependency models.

Our Idea: The Model. An individual ranking model for each aspect, plus a meta-model which predicts discourse relations between aspects, e.g. Service < Food < Price (order), or the Binary Agreement Model: Service = Food = Price (agreement) vs. ¬[Service = Food = Price] (disagreement).

Our Idea: Learning and Inference. Resolve differences between the individual rankers and the meta-model using the overall confidence of all model components; the meta-model glues together individual ranking decisions in a coherent way. Optimize the individual ranker parameters to operate with the meta-model through joint Perceptron updates.

Meta-Model. [Figure: the meta-model relating the aspects Food, Service, Price, and Ambience.] "The restaurant was a bit uneven. Although the food was tasty, the window curtains blocked out all sunlight."

Basic Ranking Framework (Crammer & Singer 2001). Goal: assign each input x ∈ R^n a rank in {1, ..., k}. Model: a weight vector w ∈ R^n, and boundaries b = (b_1, ..., b_{k-1}) which divide the real line into k segments.

Rank Decoding. Input: x. Compute score(x) = w · x. The boundaries b_0 = -∞ ≤ b_1 ≤ b_2 ≤ b_3 ≤ b_4 ≤ b_5 = ∞ (here k = 5) partition the real line, and the output ŷ is the index of the segment containing score(x), e.g. ŷ = 3.
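As a concrete illustration (not from the slides; the function and variable names are our own), here is a minimal Python sketch of this decoding rule. It assumes the boundary array ends with b_k = +inf, so the score always falls in some segment.

```python
import numpy as np

def decode_rank(w, b, x):
    """Return the rank r in {1, ..., k} whose segment contains w . x.

    b holds the boundaries (b_1, ..., b_k) with b[-1] = +inf;
    the implicit b_0 = -inf is never needed.
    """
    score = np.dot(w, x)
    for r, boundary in enumerate(b, start=1):
        if score < boundary:
            return r
```

For k = 5 this would be called with b = [b1, b2, b3, b4, np.inf], matching the b_0 = -∞, ..., b_5 = ∞ picture above.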

Multiple Aspect Ranking. Each input x has m aspects; e.g. reviews rate products for different qualities: food, service, ambience, etc. Associated with each input x is a rank vector r ∈ {1, ..., k}^m, whose i-th component r_i is the rank for aspect i, e.g. r = <5, 3, 4, 4, 4> (food, service, ...).

Joint Ranking Model. A ranking model (w[i], b[i]) for each aspect i, plus a linear agreement model a ∈ R^n which predicts unanimity across aspects via sign(a · x). These are the component models. To combine the predictions of the individual rankers and the agreement model, we introduce grief terms and choose the joint rank which minimizes their sum.

Grief of Component Models. g_i(x, r_i): a measure of the dissatisfaction of the i-th ranking model with rank r_i for input x. g_a(x, r): a measure of the dissatisfaction of the agreement model with rank vector r for input x.

Grief of the i-th Ranking Model. g_i(x, r) = min_c |c| s.t. b[i]_{r-1} ≤ score_i(x) + c < b[i]_r. That is, c is the smallest perturbation needed to move the aspect's score into the segment for rank r.
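A hedged sketch of this computation (our own code, with illustrative names): each aspect's boundaries are stored with explicit -inf and +inf endpoints, so rank r occupies the half-open segment [b[r-1], b[r]); the strictness of the upper bound is ignored, which only matters exactly at a boundary.

```python
def ranker_grief(score, b, r):
    """Smallest |c| such that b[r-1] <= score + c < b[r].

    score : the aspect's raw score, w[i] . x
    b     : boundaries with b[0] = -inf and b[-1] = +inf
    r     : candidate rank in {1, ..., k}
    """
    lo, hi = b[r - 1], b[r]
    if score < lo:
        return lo - score   # push the score up into rank r's segment
    if score >= hi:
        return score - hi   # pull the score down into rank r's segment
    return 0.0              # score already lies in the segment: zero grief
```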

Grief of the Agreement Model. g_a(x, r) = min_c |c| s.t. [(a · x + c > 0) ∧ (r_1 = r_2 = ... = r_m)] ∨ [(a · x + c ≤ 0) ∧ ¬(r_1 = r_2 = ... = r_m)]. That is, c is the smallest perturbation needed to make the agreement score consistent with the rank vector r.
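The agreement grief admits the same kind of sketch: a unanimous rank vector requires a · x + c > 0, and a non-unanimous one requires a · x + c ≤ 0, so the grief is simply how far the agreement score sits on the wrong side of zero (again, our own illustrative code).

```python
def agreement_grief(a_score, ranks):
    """Smallest |c| making sign(a_score + c) consistent with ranks.

    a_score : the agreement model's score, a . x
    ranks   : a candidate rank vector (r_1, ..., r_m)
    """
    if len(set(ranks)) == 1:      # unanimous: need a_score + c > 0
        return max(0.0, -a_score)
    return max(0.0, a_score)      # disagreement: need a_score + c <= 0
```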

Good Grief Decoding. Select the joint rank which minimizes the total grief: H(x) = argmin_{r ∈ {1,...,k}^m} [ g_a(x, r) + Σ_{i=1}^m g_i(x, r_i) ]. Exact search is linear in the number of aspects: O(m).
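To make the objective concrete, here is a brute-force sketch built on the ranker_grief and agreement_grief helpers above. Note this naive enumeration is O(k^m); it is shown only for illustration, whereas the slide's point is that exact search can be done in time linear in the number of aspects.

```python
from itertools import product

def good_grief_decode(scores, boundaries, a_score, k):
    """Brute-force argmin of total grief over rank vectors in {1,...,k}^m.

    scores     : per-aspect scores, scores[i] = w[i] . x
    boundaries : per-aspect boundary arrays (with -inf/+inf endpoints)
    a_score    : the agreement model's score, a . x
    """
    m = len(scores)
    best, best_grief = None, float("inf")
    for ranks in product(range(1, k + 1), repeat=m):
        total = agreement_grief(a_score, ranks) + sum(
            ranker_grief(scores[i], boundaries[i], ranks[i]) for i in range(m))
        if total < best_grief:
            best, best_grief = ranks, total
    return best
```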

Disagreement with high confidence. [Figure: food and ambience score lines with boundaries b_1, b_2, b_3; the agreement model's score lies deep in the disagree region, so the perturbation c needed to flip it is large.]

Disagreement with low confidence. [Figure: the same setup, but the agreement model's score lies close to 0, so a small perturbation c suffices.]

Joint Learning. Idea: optimize the individual ranker parameters for Good Grief decoding. 1. Train the agreement model on the corpus. 2. Incorporate grief minimization into the online learning procedure for the rankers: jointly decode each training instance and simultaneously update all rankers.

Joint Online Learning (update rule based on Crammer & Singer 2001).
Input: (x^1, y^1), ..., (x^T, y^T); agreement model a; decoder definition H(x).
Initialize: w[i]^1 = 0; b[i]^1_1 = ... = b[i]^1_{k-1} = 0 and b[i]^1_k = ∞, for i = 1, ..., m.
Loop: for t = 1, 2, ..., T:
1. Get a new instance x^t ∈ R^n.
2. Predict ŷ^t = H(x^t; w^t, b^t, a) by grief minimization.
3. Get the true label y^t.
4. For each aspect i = 1, ..., m (simultaneous updates): if ŷ[i]^t ≠ y[i]^t, update the model:
4.a For r = 1, ..., k-1: if y[i]^t ≤ r then y[i]^t_r = -1, else y[i]^t_r = 1.
4.b For r = 1, ..., k-1: if (ŷ[i]^t - r) · y[i]^t_r ≤ 0 then τ[i]^t_r = y[i]^t_r, else τ[i]^t_r = 0.
4.c Update w[i]^{t+1} = w[i]^t + (Σ_r τ[i]^t_r) x^t; for r = 1, ..., k-1: b[i]^{t+1}_r = b[i]^t_r - τ[i]^t_r.
Output: H(x; w^{T+1}, b^{T+1}, a).
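A sketch of this training loop in Python (our own rendering, not the authors' code; `decode` stands in for any Good Grief decoder H(x), such as the one sketched earlier, and the per-threshold updates below are equivalent to the summed update in step 4.c):

```python
import numpy as np

def joint_online_train(data, decode, m, k, n):
    """One pass of joint online training over m aspect rankers.

    data   : iterable of (x, y), x of shape (n,), y a length-m rank vector
    decode : callable (w, b, x) -> joint rank vector, i.e. H(x)
    """
    w = [np.zeros(n) for _ in range(m)]
    b = [np.append(np.zeros(k - 1), np.inf) for _ in range(m)]
    for x, y in data:
        y_hat = decode(w, b, x)              # step 2: grief minimization
        for i in range(m):                   # step 4: simultaneous updates
            if y_hat[i] == y[i]:
                continue
            for r in range(1, k):            # thresholds r = 1, ..., k-1
                y_r = -1 if y[i] <= r else 1          # step 4.a
                if (y_hat[i] - r) * y_r <= 0:         # step 4.b: violated
                    w[i] = w[i] + y_r * x             # step 4.c
                    b[i][r - 1] -= y_r
    return w, b
```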

Feature Representation. Each review is represented as a binary feature vector; features track the presence or absence of words and word bigrams, for about 30,000 total features. Such lexical features were previously found effective for sentiment analysis (Wiebe 2000; Pang et al 2004) and discourse analysis (Marcu & Echihabi 2002).
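For instance, a minimal sketch of such a feature map (our own illustration; whitespace tokenization is assumed, since the slides do not specify the preprocessing):

```python
def binary_features(text):
    """Presence/absence features over words and word bigrams."""
    tokens = text.lower().split()
    feats = set(tokens)                    # unigram presence
    feats.update(zip(tokens, tokens[1:]))  # bigram presence
    return feats  # a review's binary vector = indicator of this set
```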

Evaluation. 4,500 restaurant reviews (www.we8there.com), randomly split 3,500 / 500 / 500 into training, development, and test data. Average review length: 115 words. Each review ranks the restaurant with respect to food, service, ambience, value, and overall experience on a scale of 1-5. Metric: average rank loss = (Σ_{t=1}^T |r̂_t - r_t|) / T.
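The metric itself is a one-liner; a sketch, assuming ranks are given as integers:

```python
def average_rank_loss(predicted, true):
    """Mean absolute difference between predicted and true ranks."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)
```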

Performance of Agreement Model. Majority baseline (disagreement): 58%. Agreement model accuracy: 67%. By the Good Grief criterion, raw accuracy is not what matters, but rather accuracy as a function of confidence: as confidence goes up, so does accuracy.

Accuracy of Agreement Model. [Figure: accuracy as a function of confidence.] The 33% of the data with highest confidence is classified at 80% accuracy; the 10% with highest confidence, at 90% accuracy.

Baselines. PRANK: independent rankers for each aspect, trained using the PRank algorithm (Crammer & Singer 2001). MAJORITY: always predict <5, 5, 5, 5, 5>. SIM: a joint model using cosine similarity between aspects (Basilico & Hofmann 2004), with score_i(x) = w[i] · x + Σ_j sim(i, j) (w[j] · x). GG DECODE: Good Grief decoding, but with independent training.
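For contrast with the grief-based combination, the SIM baseline's score is a direct similarity-weighted sum. A sketch (our own code; the slide's formula is followed literally, summing over all j, since it does not say whether the j = i term is excluded):

```python
import numpy as np

def sim_score(ws, sim, x, i):
    """SIM baseline: w[i] . x + sum_j sim(i, j) * (w[j] . x)."""
    return np.dot(ws[i], x) + sum(
        sim(i, j) * np.dot(ws[j], x) for j in range(len(ws)))
```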

Average Rank Loss on Test Set (lower is better):

                  Food    Service  Value   Atmosphere  Experience  Total
majority          0.848   1.056    1.030   1.044       1.028       1.001
prank             0.606   0.676    0.700   0.776       0.618       0.675
sim               0.562   0.648    0.706   0.798       0.600       0.663
GG decode         0.544   0.648    0.704   0.798       0.584*      0.656
GG train+decode   0.534*  0.622*   0.644*  0.774*      0.584       0.632*

* = statistically significant improvement over the closest rival (Fisher sign test).

Average Rank Loss, split by agreement status:

                  Agreement  Disagreement
prank             0.414      0.864
GG train+decode   0.324      0.854

Cases of disagreement (58% of corpus): relative reduction in error of 1%. Cases of agreement (42% of corpus): relative reduction in error of 22%.

Technical Contributions. A novel framework for tasks with complex label dependencies: simple, fast, exact, and accurate. An explicit meta-model: task-specific dependency spaces; features tailored for dependency prediction; joint Perceptron updates for the label predictors.

Conclusions & Future Work. We applied the Good Grief framework to multiple-aspect sentiment analysis: the agreement model guides aspect rank predictions, and the result outperforms all baseline models. Future work: apply the GG framework to other tasks (classification, regression, etc.) and to more complex label dependency spaces.

Data and Code available: http://people.csail.mit.edu/bsnyder Thank You!

Model Expressivity. Fully implicit opinions: "Our only complaint was the service after our order was taken." One opinion expressed in terms of another: "The food was good, but not the ambience." [Figure: illustration on the Ambience aspect.]

Decoding. "The restaurant was a bit uneven. Although the food was tasty, the window curtains blocked out all sunlight." [Figure: step-by-step joint decoding for the Food and Ambience aspects together with the Agreement model; a candidate rank pair marked X is rejected in favor of one consistent with the agreement prediction.]

Online Training. "The restaurant was a bit uneven. Although the food was tasty, the window curtains blocked out all sunlight." [Figure: step-by-step online training on the same example; a mispredicted rank (marked X) triggers simultaneous updates to the Food and Ambience rankers alongside the Agreement model.]