ANLP Lecture 10 Text Categorization with Naive Bayes


ANLP Lecture 10: Text Categorization with Naive Bayes. Sharon Goldwater, 6 October 2014.

Categorization

Important task for both humans and machines:
- object identification
- face recognition
- spoken word recognition

Text categorization

- Spam detection (spam / not spam)
- Sentiment analysis (positive/negative attitude, 1-5 star review)
- Author attribution (male/female, specific author, healthy/mental illness)
- Topic classification (sport/finance/travel/etc.)

Formalizing the task

Given document d and set of categories C, we want to assign d to the category

  ĉ = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(d | c) P(c)

(by Bayes' rule; the denominator P(d) is the same for every c, so it can be dropped from the argmax).

How to approximate P(d | c)?

Each document d is represented by features f_1, f_2, ..., f_{n_d} (the words in the doc). In fact, we only care about the feature counts:

         the  your  model  cash  Viagra  class  account  orderz
doc 1     12     3      1     0       0      2        0       0
doc 2     10     4      0     4       0      0        2       0
doc 3     25     4      0     0       0      1        1       0
doc 4     14     2      0     1       3      0        1       1
doc 5     17     5      0     2       0      0        1       1

A model where we care only about (unigram) word counts is called a bag-of-words model.
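As an illustration of the bag-of-words idea, here is a minimal Python sketch (not from the lecture; the tokenizer and example documents are made up) that turns raw text into word-count vectors:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count word occurrences."""
    return Counter(text.lower().split())

# Hypothetical example documents (not the ones in the count table above)
docs = [
    "get your cash now and open your account",
    "the model of the year",
]
for d in docs:
    print(bag_of_words(d))
```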

How to approximate P(d | c)?

Represent d using its feature values:

  P(d | c) = P(f_1, f_2, ..., f_n | c)

Naive Bayes assumption: features are conditionally independent given the class.

  P(d | c) ≈ P(f_1 | c) P(f_2 | c) ... P(f_n | c)

Full model

Given a document with features f_1, f_2, ..., f_n and set of categories C, choose

  ĉ = argmax_{c ∈ C} P(c) ∏_{i=1}^{n} P(f_i | c)

This is called a Naive Bayes classifier (classification = categorization).
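A minimal sketch of this decision rule in Python (not from the lecture; the argument names and probability-table format are hypothetical). Log probabilities are used so long products of small numbers do not underflow; this does not change the argmax:

```python
import math

def classify(features, priors, feat_probs):
    """Return argmax_c P(c) * prod_i P(f_i | c), computed in log space.

    priors:     dict class -> P(c)
    feat_probs: dict class -> dict feature -> P(f | c)
                (smoothing, introduced below, keeps these nonzero)
    features:   list of feature tokens observed in the document
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for f in features:
            score += math.log(feat_probs[c][f])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```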

Learning the class priors

P(c) is normally estimated with MLE:

  P̂(c) = N_c / N

N_c = the number of training documents in class c
N = the total number of training documents
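A one-function sketch of this MLE estimate (hypothetical names; assumes labels is simply a list of class names, one per training document):

```python
from collections import Counter

def estimate_priors(labels):
    """MLE class priors: P(c) = N_c / N."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

# The five labelled documents from the next slide: three spam (+), two not (-)
print(estimate_priors(["-", "+", "-", "+", "+"]))   # {'-': 0.4, '+': 0.6}
```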

Learning the class priors: example

Given training documents with correct labels:

         the  your  model  cash  Viagra  class  account  orderz  spam?
doc 1     12     3      1     0       0      2        0       0     -
doc 2     10     4      0     4       0      0        2       0     +
doc 3     25     4      0     0       0      1        1       0     -
doc 4     14     2      0     1       3      0        1       1     +
doc 5     17     5      0     2       0      0        1       1     +

  P̂(spam) = 3/5

Learning the feature probabilities

P(f_i | c) is normally estimated with simple smoothing:

  P̂(f_i | c) = (count(f_i, c) + α) / Σ_{f ∈ F} (count(f, c) + α)

count(f_i, c) = the number of times f_i occurs in class c
F = the set of possible features
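A sketch of this add-α estimate (hypothetical function and variable names, not from the lecture; assumes counts maps each class to a dict of word counts and vocab is the feature set F):

```python
def smoothed_feat_probs(counts, vocab, alpha=1.0):
    """Add-alpha estimate of P(f | c):
    (count(f, c) + alpha) / (total count in c + alpha * |F|).
    """
    probs = {}
    for c, word_counts in counts.items():
        denom = sum(word_counts.values()) + alpha * len(vocab)
        probs[c] = {f: (word_counts.get(f, 0) + alpha) / denom
                    for f in vocab}
    return probs
```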

Learning the feature probabilities: example

         the  your  model  cash  Viagra  class  account  orderz  spam?
doc 1     12     3      1     0       0      2        0       0     -
doc 2     10     4      0     4       0      0        2       0     +
doc 3     25     4      0     0       0      1        1       0     -
doc 4     14     2      0     1       3      0        1       1     +
doc 5     17     5      0     2       0      0        1       1     +

  P̂(your | +) = (4 + 2 + 5 + α) / ((all words in + class) + αF) = (11 + α) / (70 + αF)

  P̂(orderz | +) = (2 + α) / ((all words in + class) + αF) = (2 + α) / (70 + αF)

Classifying a test document: example

Test document d: get your cash and your orderz

Suppose all features not shown earlier have P̂(f_i | +) = α / (70 + αF). Then

  P(+ | d) ∝ P(+) ∏_{i=1}^{n} P(f_i | +)
           = P(+) · α/(70 + αF) · (11 + α)/(70 + αF) · (7 + α)/(70 + αF) · α/(70 + αF) · (11 + α)/(70 + αF) · (2 + α)/(70 + αF)
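The same computation as a Python sketch (not from the lecture; the +-class word counts and the total of 70 follow the slide, while α = 1 and F = 1000 are arbitrary values chosen only for illustration):

```python
def spam_score(doc_words, alpha, F):
    """Unnormalised score P(+) * prod_i P(w_i | +) for the running example."""
    plus_counts = {"your": 11, "cash": 7, "orderz": 2}  # counts in the + class (from the table)
    total_plus = 70                                     # total word count in the + class, per the slide
    p_plus = 3 / 5                                      # MLE prior from the labelled documents
    score = p_plus
    for w in doc_words:
        score *= (plus_counts.get(w, 0) + alpha) / (total_plus + alpha * F)
    return score

print(spam_score("get your cash and your orderz".split(), alpha=1, F=1000))
```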

Classifying a test document: example

Test document d: get your cash and your orderz

Do the same for P(− | d), and choose the class with the larger value.

Alternative feature values and feature sets

- Use only binary values for f_i: did this word occur in d or not? (see the sketch after this list)
- Use only a subset of the vocabulary for F:
  - Ignore stopwords (function words and others with little content)
  - Choose a small task-relevant set (e.g., using a sentiment lexicon)
- Use more complex features (bigrams, syntactic features, morphological features, ...)
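A minimal sketch of the binary-feature variant mentioned in the first bullet (hypothetical names; not from the lecture):

```python
def binarize(word_counts):
    """Map a bag-of-words count vector to binary presence/absence features."""
    return {w: 1 for w, n in word_counts.items() if n > 0}

print(binarize({"your": 2, "cash": 1, "model": 0}))   # {'your': 1, 'cash': 1}
```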

Advantages of Naive Bayes

- Very easy to implement
- Very fast to train and test
- Doesn't require as much training data as some other methods
- Usually works reasonably well

This should be your baseline method for any classification task.

Problems with Naive Bayes

The Naive Bayes assumption is naive!

Consider categories Travel, Finance, Sport. Are the following features independent given the category?

  beach, sun, ski, snow, pitch, palm, football, relax, ocean

Problems with Naive Bayes

The Naive Bayes assumption is naive!

Consider categories Travel, Finance, Sport. Are the following features independent given the category?

  beach, sun, ski, snow, pitch, palm, football, relax, ocean

No! They might be closer to independent if we defined finer-grained categories (beach vacations vs. ski vacations), but we don't usually want to.

Non-independent features

- Features are not usually independent given the class
- Adding multiple feature types (e.g., words and morphemes) often leads to even stronger correlations between features
- Accuracy of the classifier can sometimes still be OK, but it will be highly overconfident in its decisions

Example: NB sees 5 features that all point to class 1 and treats them as five independent sources of evidence. It is like asking 5 friends for an opinion when some of them got theirs from each other.

How to evaluate performance?

Important question for any NLP task.

Intrinsic evaluation: design a measure inherent to the task
- Language modeling: perplexity
- POS tagging: accuracy (% of tags correct)
- Categorization: F-score (coming up next)

How to evaluate performance?

Important question for any NLP task.

Intrinsic evaluation: design a measure inherent to the task
- Language modeling: perplexity
- POS tagging: accuracy (% of tags correct)
- Categorization: F-score (coming up next)

Extrinsic evaluation: measure effects on a downstream task
- Language modeling: does it improve my ASR/MT system?
- POS tagging: does it improve my parser/IR system?
- Categorization: does it reduce user search time in an IR setting?

Intrinsic evaluation for categorization

Categorization as detection: is this document about sport or not?

Classes may be very unbalanced. We can get 95% accuracy by always choosing "not", but this isn't useful. We need a better measure.

Two measures

Assume we have a gold standard: correct labels for the test set. We also have a system for detecting the items of interest (docs about sport).

  Precision = (# items system got right) / (# items system detected)

  Recall = (# items system got right) / (# items system should have detected)

Example of precision and recall

  Item in class?
  Gold standard:   +  +  -  -  +  -  -  -
  System output:   -  +  -  +  -  -  -  -

# correct = 1, # detected = 2, # in gold standard = 3

  Precision = 1/2    Recall = 1/3
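A small sketch (not from the lecture; function and variable names are made up) computing these measures for the example above, treating "+" as the class of interest:

```python
def precision_recall(gold, predicted, positive="+"):
    """Precision and recall of `predicted` against `gold` for the positive class."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p == positive)
    detected = sum(1 for p in predicted if p == positive)
    relevant = sum(1 for g in gold if g == positive)
    return correct / detected, correct / relevant

gold   = ["+", "+", "-", "-", "+", "-", "-", "-"]
system = ["-", "+", "-", "+", "-", "-", "-", "-"]
print(precision_recall(gold, system))   # (0.5, 0.333...)
```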

Precision-Recall curves

If the system has a tunable parameter, we can vary the precision/recall trade-off and plot a precision-recall curve.

[Figure: precision-recall curves, from http://ivrgwww.epfl.ch/supplementary_material/rk_cvpr09/]

F-measure

Can also combine precision and recall into a single F-measure:

  F_β = (β² + 1) P R / (β² P + R)

Normally we just set β = 1 to get F_1:

  F_1 = 2 P R / (P + R)

F_1 is the harmonic mean of P and R: similar to the arithmetic mean when P and R are close, but it penalizes large differences between P and R.
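A short sketch of the formula (not from the lecture), checked on the earlier example where P = 1/2 and R = 1/3:

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

print(f_measure(0.5, 1 / 3))   # 0.4, the F1 for the precision/recall example above
```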