Classification: Analyzing Sentiment


STAT/CSE 416: Machine Learning. Emily Fox, University of Washington. April 17, 2018.

Predicting sentiment by topic: An intelligent restaurant review system

It's a big day and I want to book a table at a nice Japanese restaurant. Seattle has many sushi restaurants: what are people saying about the food? The ambiance?

Positive reviews are not positive about everything. Sample review: "Watching the chefs create incredible edible art made the experience very unique. My wife tried their ramen and it was pretty forgettable. All the sushi was delicious! Easily best sushi in Seattle." The experience and the sushi are praised; the ramen is not.

From reviews to topic sentiments: a novel intelligent restaurant review app takes all reviews for a restaurant and summarizes sentiment by topic (Experience, Ramen, Sushi), surfacing representative sentences such as "Easily best sushi in Seattle."

The intelligent restaurant review system first breaks all reviews into sentences, e.g.: "The seaweed salad was just OK, vegetable salad was just ordinary." "I like the interior decoration and the blackboard menu on the wall." "All the sushi was delicious." "My wife tried their ramen and it was pretty forgettable." "The sushi was amazing, and the rice is just outstanding." "The service is somewhat hectic." "Easily best sushi in Seattle."

Core building block: a sentence sentiment classifier, which takes a single sentence such as "Easily best sushi in Seattle." and predicts its sentiment.

Putting the system together: break all reviews into sentences, select the sentences about sushi ("All the sushi was delicious.", "The sushi was amazing, and the rice is just outstanding.", "Easily best sushi in Seattle."), run each through the sentence sentiment classifier, average the predictions, and display the most positive sentences, e.g., "Easily best sushi in Seattle." A rough sketch of this pipeline appears below.
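As a rough illustration (not the course's actual implementation), the pipeline could look like this in Python; classify_sentence is a hypothetical stand-in for the sentence sentiment classifier:

    import re

    def topic_sentiment(reviews, topic, classify_sentence):
        """Split reviews into sentences, keep those mentioning the topic,
        classify each (+1/-1), and average the predictions."""
        sentences = [s.strip() for review in reviews
                     for s in re.split(r"[.!?]", review) if s.strip()]
        on_topic = [s for s in sentences if topic.lower() in s.lower()]
        if not on_topic:
            return None  # nothing said about this topic
        predictions = [classify_sentence(s) for s in on_topic]
        return sum(predictions) / len(predictions)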

Classifier applications. A classifier takes an input x (e.g., a sentence from a review, such as "Sushi was awesome, the food was awesome, but the service was awful.") and the classifier model produces an output y, the predicted class.

Spam filtering: the input x is the text of the email, the sender, the IP, etc.; the output y is spam or not spam.

Multiclass classification: the output y has more than 2 categories, e.g., classifying a webpage (input x) as Education, Finance, or Technology (output y).

Image classification: the input x is the image pixels; the output y is the predicted object.

Personalized medical diagnosis: the input x describes the patient; the classifier model outputs y as healthy or a disease (cold, flu, pneumonia, ...).

Reading your mind: the inputs x are brain region intensities; the output y is the object the subject is thinking about (e.g., hammer vs. house).

Linear classifiers

[Pipeline diagram: Training Data → x → Feature extraction → h(x) → ML model → ŷ; the predictions ŷ and true labels y feed a Quality metric, which the ML algorithm uses to learn the weights ŵ.]

Representing classifiers: how does it work? The classifier model maps an input x (a sentence from a review) to an output y (the predicted class).

Simple threshold classifier. Keep a list of positive words (great, awesome, good, amazing, ...) and a list of negative words (bad, terrible, disgusting, sucks, ...). Given a sentence from a review (input x), count the positive and negative words in the sentence:

If number of positive words > number of negative words: ŷ = +1
Else: ŷ = -1

Example: "Sushi was great, the food was awesome, but the service was terrible." contains two positive words (great, awesome) and one negative word (terrible), so ŷ = +1.
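A minimal sketch of this threshold classifier in Python (word lists abbreviated from the slide; the tokenization is a simplification):

    positive_words = {"great", "awesome", "good", "amazing"}
    negative_words = {"bad", "terrible", "disgusting", "sucks"}

    def threshold_classify(sentence):
        """+1 if more positive than negative words, else -1."""
        words = sentence.lower().replace(",", " ").replace(".", " ").split()
        n_pos = sum(w in positive_words for w in words)
        n_neg = sum(w in negative_words for w in words)
        return +1 if n_pos > n_neg else -1

    # 2 positive words (great, awesome) vs. 1 negative (terrible) -> +1
    print(threshold_classify("Sushi was great, the food was awesome, "
                             "but the service was terrible."))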

Problems with the threshold classifier:
- How do we get the lists of positive/negative words?
- Words have different degrees of sentiment: "great" > "good". How do we weigh different words? (Addressed by learning a classifier.)
- Single words are not enough: "good" is positive, but "not good" is negative. (Addressed by more elaborate features.)

A (linear) classifier will use training data to learn a weight for each word:

Word                         Weight
good                          1.0
great                         1.5
awesome                       2.7
bad                          -1.0
terrible                     -2.1
awful                        -3.3
restaurant, the, we, where    0.0

Scoring a sentence. Given learned weights (good 1.0, great 1.2, awesome 1.7, bad -1.0, terrible -2.1, awful -3.3, all other words 0.0) and the input x: "Sushi was great, the food was awesome, but the service was terrible.", the score is the weighted count of the words in the sentence: Score(x) = 1.2 + 1.7 - 2.1 = 0.8.

This is called a linear classifier, because the output is a weighted sum of the inputs. Simple linear classifier:

Score(x) = weighted count of words in sentence
If Score(x) > 0: ŷ = +1
Else: ŷ = -1
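The same weighted count as a Python sketch, using the illustrative weights from the table above:

    weights = {"good": 1.0, "great": 1.2, "awesome": 1.7,
               "bad": -1.0, "terrible": -2.1, "awful": -3.3}
    # all other words (restaurant, the, we, where, ...) implicitly weigh 0.0

    def score(sentence):
        words = sentence.lower().replace(",", " ").replace(".", " ").split()
        return sum(weights.get(w, 0.0) for w in words)

    x = "Sushi was great, the food was awesome, but the service was terrible."
    print(score(x))                    # 1.2 + 1.7 - 2.1 = 0.8 (up to float rounding)
    print(+1 if score(x) > 0 else -1)  # 0.8 > 0 -> ŷ = +1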

More generically, the model is ŷ_i = sign(Score(x_i)), where

Score(x_i) = w_0 h_0(x_i) + w_1 h_1(x_i) + ... + w_D h_D(x_i) = Σ_{j=0}^{D} w_j h_j(x_i) = w^T h(x_i)

and the features h_j can be anything:
- feature 1 = h_0(x), e.g., the constant 1
- feature 2 = h_1(x), e.g., x[1] = #awesome
- feature 3 = h_2(x), e.g., x[2] = #awful; or log(x[7]) · x[2] = log(#bad) · #awful; or tf-idf("awful")
- ...
- feature D+1 = h_D(x), some other function of x[1], ..., x[d]

[Pipeline diagram as before, with the ML model now outputting ŷ = sign(ŵ^T h(x)), either -1 or +1.]
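In vector form this is just a dot product; a small NumPy sketch (the weights and features here are the running example, not learned values):

    import numpy as np

    def predict(w, h_x):
        """ŷ = sign(Score(x)), with Score(x) = w·h(x)."""
        return int(np.sign(w @ h_x))

    w = np.array([0.0, 1.0, -1.5])   # [w_0, weight on #awesome, weight on #awful]
    h_x = np.array([1.0, 2.0, 1.0])  # [constant h_0 = 1, #awesome = 2, #awful = 1]
    print(predict(w, h_x))           # Score = 0 + 2.0 - 1.5 = 0.5 > 0 -> 1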

Decision boundaries. Suppose only two words have non-zero coefficients:

Input       Coefficient   Value
1           w_0            0.0
#awesome    w_1            1.0
#awful      w_2           -1.5

Score(x) = 1.0 · #awesome - 1.5 · #awful

[Plot: sentences placed in the (#awesome, #awful) plane, both axes running 0 to 4; e.g., "Sushi was awesome, the food was awesome, but the service was awful." sits at #awesome = 2, #awful = 1.]

Decision boundary example. With Score(x) = 1.0 · #awesome - 1.5 · #awful, the line 1.0 · #awesome - 1.5 · #awful = 0 splits the plane: Score(x) > 0 on one side (predict +1), Score(x) < 0 on the other (predict -1). This line is the decision boundary; it separates the + and - predictions.

Effect of changing coefficients: changing w_0 from 0.0 to 1.0 gives Score(x) = 1.0 + 1.0 · #awesome - 1.5 · #awful, and the boundary 1.0 + 1.0 · #awesome - 1.5 · #awful = 0 is the original line shifted.

Changing w_2 from -1.5 to -3.0 instead gives Score(x) = 1.0 + 1.0 · #awesome - 3.0 · #awful, which tilts the boundary 1.0 + 1.0 · #awesome - 3.0 · #awful = 0: each #awful now counts more heavily against the sentence.

For more inputs (still linear features), e.g., x[1] = #awesome, x[2] = #awful, x[3] = #great, the score is Score(x) = w_0 + w_1 · #awesome + w_2 · #awful + w_3 · #great, and the decision boundary becomes a plane in the three-dimensional feature space.

For general features: more general classifiers (not just linear features) give decision boundaries with more complicated shapes.

Training and evaluating a classifier

[Pipeline diagram as before.]

Training a classifier = learning the weights. Split the labeled data, pairs (x, y) such as (Sentence1, +1) and (Sentence2, -1), into a training set and a test set. Learn the classifier on the training set (e.g., the word weights good 1.0, awesome 1.7, bad -1.0, awful -3.3), then evaluate it on the test set.

Classification error. Feed each test example to the learned classifier with its label hidden, compare the prediction ŷ to the true label, and tally corrects and mistakes: e.g., "Sushi was great" predicted positive is correct, while "Food was OK" predicted positive against a negative label is a mistake.

Classification error & accuracy. Error measures the fraction of mistakes:

error = (# mistakes) / (total # of test examples)

Best possible value is 0.0. Often we instead measure accuracy, the fraction of correct predictions:

accuracy = (# correct) / (total # of test examples) = 1 - error

Best possible value is 1.0.
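These definitions are one line of code each; a quick sketch on made-up labels:

    def error_and_accuracy(y_true, y_pred):
        """error = (# mistakes) / (total); accuracy = 1 - error."""
        mistakes = sum(t != p for t, p in zip(y_true, y_pred))
        error = mistakes / len(y_true)
        return error, 1.0 - error

    # 1 mistake out of 4 test examples -> error 0.25, accuracy 0.75
    print(error_and_accuracy([+1, -1, +1, +1], [+1, -1, -1, +1]))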

What's a good accuracy? What if you ignore the sentence and just guess? For binary classification, half the time you'll get it right (on average), so accuracy = 0.5. For k classes, accuracy = 1/k: about 0.33 for 3 classes, 0.25 for 4 classes, and so on. At the very, very, very least, you should healthily beat random; otherwise, the classifier is (usually) pointless.

Is a classifier with 90% accuracy good? It depends. 2010 data shows that 90% of emails sent are spam, so predicting that every email is spam gets you 90% accuracy! This is majority class prediction (see the sketch below): a silly approach with seemingly amazing performance whenever there is class imbalance, i.e., one class is much more common than the others. It beats random guessing if you know the majority class.

So always dig in and ask the hard questions about reported accuracies:
- Is there class imbalance?
- How does it compare to a simple baseline approach, such as random guessing or majority class prediction?
- Most importantly: what accuracy does my application need? What is good enough for my user's experience? What is the impact of the mistakes we make?
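A tiny illustration of how majority class prediction inflates accuracy under class imbalance (the 90%/10% split mirrors the spam statistic above):

    # 90 spam, 10 not spam; always predict the majority class ("spam")
    y_true = ["spam"] * 90 + ["not spam"] * 10
    y_pred = ["spam"] * 100
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy)  # 0.9, without looking at a single email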

False positives, false negatives, and confusion matrices. Types of mistakes:

                  Predicted +1           Predicted -1
True label +1     True Positive          False Negative (FN)
True label -1     False Positive (FP)    True Negative

The cost of different types of mistakes can be different (and high) in some applications:

Application         False negative        False positive
Spam filtering      Annoying              Email lost
Medical diagnosis   Disease not treated   Wasteful treatment

Confusion matrix for binary classification: the same 2x2 table of True Positives, False Negatives (FN), False Positives (FP), and True Negatives, now filled with counts over the test set.
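Tallying the four cells is straightforward; a sketch assuming labels are +1/-1 with +1 as the positive class:

    def confusion_matrix(y_true, y_pred):
        pairs = list(zip(y_true, y_pred))
        return {"TP": sum(t == +1 and p == +1 for t, p in pairs),
                "FN": sum(t == +1 and p == -1 for t, p in pairs),
                "FP": sum(t == -1 and p == +1 for t, p in pairs),
                "TN": sum(t == -1 and p == -1 for t, p in pairs)}

    print(confusion_matrix([+1, +1, -1, -1], [+1, -1, +1, -1]))
    # {'TP': 1, 'FN': 1, 'FP': 1, 'TN': 1}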

Confusion matrix for multiclass classification: with classes Healthy, Cold, and Flu, the table has one row per true label and one column per predicted label, giving counts for all nine (true, predicted) combinations.

Learning curves (again): How much data do I need?

How much data does a model need to learn? The more the merrier, but data quality is the most important factor. Theoretical techniques can sometimes bound how much data is needed; the bounds are typically too loose for practical application, but they provide guidance. In practice, more complex models require more data, and empirical analysis can provide guidance.

More complex models tend to have less bias Sentiment classifier using single words can do OK, but Never classifies correctly: The sushi was not good. More complex model: consider pairs of words (bigrams) 51 Word Weight good +1.5 not good -2.1 Less bias è potentially more accurate, needs more data to learn Models with less bias tend to need more data to learn well, but do better with sufficient data Test error Classifier based on single words Amount of training data 52 26

Summary of classification intro Training Data x Feature extraction h(x) ML model ŷ y ŵ ML algorithm Quality metric 54 27

What you can do now Identify a classification problem and some common applications Describe decision boundaries and linear classifiers Train a classifier Measure its error - Some rules of thumb for good accuracy Interpret the types of error associated with classification Describe the tradeoffs between model bias and data set size 55 Linear classifiers: Logistic regression STAT/CSE 416: Machine Learning Emily Fox University of Washington April 17, 2018 28

Are you sure about the prediction? Class probability How confident is your prediction? Thus far, we ve outputted a prediction or But, how sure are you about the prediction? The sushi & everything else were awesome! The sushi was good, the service was OK Definite Not sure 58 29

Using probabilities in classification How confident is your prediction? The sushi & everything else were awesome! The sushi was good, the service was OK Definite The sushi & everything P(y=+1 x= ) else were awesome! = 0.99 Not sure The sushi was good, P(y=+1 x= ) the service was OK = 0.55 60 Many classifiers provide a degree of certainty: Output label Input sentence P(y x) Extremely useful in practice 30

Goal: Learn conditional probabilities from data Training data: N observations (x i,y i ) x[1] = #awesome x[2] = #awful y = sentiment 2 1 +1 0 2-1 3 3-1 4 1 +1 Optimize quality metric on training data Find best model P by finding best ŵ Useful for predicting ŷ 61 Sentence from review Input: x Predict most likely class P(y x) = estimate of class probabilities If P(y=+1 x) > 0.5: ŷ = Else: ŷ = Estimating P(y x) improves interpretability: - Predict ŷ = +1 and tell me how sure you are 62 31