Machine Learning for natural language processing
Classification: Naive Bayes
Laura Kallmeyer
Heinrich-Heine-Universität Düsseldorf
Summer 2016
1 / 20

Introduction
Classification = supervised method for assigning an input to one class out of a finite set of possible classes.
Today: a generative classifier that builds a model for each class.
Reading: Jurafsky & Martin (2015), chapter 7, and Manning et al. (2008), chapter 13.
2 / 20

Table of contents
1 Motivation
2 Multinomial naive Bayes classifier
3 Training the classifier
4 Evaluation
3 / 20

Motivation
In the following, we are concerned with text classification: the task of classifying an entire text by assigning it a label drawn from some finite set of labels.
Common text categorization tasks: sentiment analysis, spam detection, authorship attribution.
Some classifiers operate with hand-written rules. Our focus is, however, on supervised machine learning.
Generative classifiers (e.g., naive Bayes) build a model for each class, while discriminative classifiers learn which features are useful for discriminating between the different classes.
4 / 20

Multinomial naive Bayes classifier
Intuition: Represent a text document as a bag of words, keeping only frequency information but ignoring word order.
Example
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.
Word counts: a 6, of 6, is 3, truth 2, that 2, man 2, or 2, it 1, universally 1, acknowledged 1, ...
5 / 20
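
As an illustration (not from the slides), here is a minimal Python sketch of the bag-of-words representation using collections.Counter; the lowercasing and tokenization choices are assumptions.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return a word-frequency map, ignoring word order (and, here, case and punctuation)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

text = ("It is a truth universally acknowledged, that a single man in possession "
        "of a good fortune, must be in want of a wife.")
print(bag_of_words(text).most_common(5))
# [('a', 4), ('in', 2), ('of', 2), ...]
```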

Multinomial naive Bayes classifier
Naive Bayes returns the class ĉ out of the set C of classes which has the maximum posterior probability given the document d:
ĉ = argmax_{c ∈ C} P(c|d)
Reminder: Bayes rule
P(x|y) = P(y|x)P(x) / P(y)
Hence
ĉ = argmax_{c ∈ C} P(c|d) = argmax_{c ∈ C} P(d|c)P(c) / P(d) = argmax_{c ∈ C} P(d|c)P(c)
(P(d) can be dropped since it is the same for every class c.)
P(c): prior probability of the class c
P(d|c): likelihood of the document d given the class c
P(d|c)P(c) = P(d, c): joint probability of class and document
6 / 20

Multinomial naive Bayes classifier
We represent d as a set of features f_1, ..., f_n and make the naive Bayes assumption that
P(f_1, f_2, ..., f_n|c) = P(f_1|c) · ... · P(f_n|c)
Each word w_1, w_2, ..., w_|d| in the document d is a feature (|d| being the number of word tokens in d):
ĉ = argmax_{c ∈ C} P(c) · ∏_{i=1}^{|d|} P(w_i|c)
As usual, we calculate in log space:
ĉ = argmax_{c ∈ C} (log P(c) + Σ_{i=1}^{|d|} log P(w_i|c))
7 / 20
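
A minimal Python sketch (my own, not part of the slides) of this log-space decision rule, assuming the log prior and per-class word log probabilities have already been estimated; the variable names are assumptions.

```python
import math

def classify(doc_tokens, log_prior, log_likelihood):
    """Return argmax_c [ log P(c) + sum_i log P(w_i | c) ].

    log_prior:      dict class -> log P(c)
    log_likelihood: dict class -> dict word -> log P(w | c)
    Words not known to a class model are simply skipped (cf. the unknown-word rule below).
    """
    best_class, best_score = None, -math.inf
    for c, lp in log_prior.items():
        score = lp + sum(log_likelihood[c][w]
                         for w in doc_tokens if w in log_likelihood[c])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```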

Training the classifier
First try: Maximum likelihood estimates, based on frequencies in the training data.
Our training data consists of N_doc documents, each of which is in a unique class c ∈ C. N_c is the number of documents belonging to class c. C(w, c) gives the number of times word w occurs in a document from class c.
P̂(c) = N_c / N_doc
P̂(w|c) = C(w, c) / Σ_{w'} C(w', c)
8 / 20

Training the classifier
Example
Classes A and B, the documents to be classified are strings over {a, b}.
Training data:
d    c
aa   A
ba   A
ab   A
bb   B
(Note that without any smoothing, this example does not allow us to calculate in log space because log P(a|B) = log 0 is not defined.)
P(A) = 0.75, P(B) = 0.25
P(a|A) = 4/6 = 2/3, P(b|A) = 2/6 = 1/3, P(a|B) = 0/2 = 0, P(b|B) = 2/2 = 1
Classification of new documents:
d      P(d|A)   P(d|A)P(A)   P(d|B)   P(d|B)P(B)   class
aaba   0.1      0.075        0        0            A
aaa    0.3      0.225        0        0            A
bbba   0.02     0.015        0        0            A
bbbb   0.01     0.0075       1        0.25         B
9 / 20
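
A small Python sketch (my own) of the maximum likelihood estimates from the previous slide, checked against the example above; function and variable names are assumptions.

```python
from collections import Counter

def train_mle(docs):
    """docs: list of (token_list, class). Returns MLE priors P(c) and likelihoods P(w|c)."""
    n_doc = len(docs)
    n_c = Counter(c for _, c in docs)            # N_c: documents per class
    word_counts = {c: Counter() for c in n_c}    # C(w, c)
    for tokens, c in docs:
        word_counts[c].update(tokens)
    prior = {c: n_c[c] / n_doc for c in n_c}
    likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())             # sum over w' of C(w', c)
        likelihood[c] = {w: cnt / total for w, cnt in counts.items()}
    return prior, likelihood

# The training data of the example above
docs = [(list("aa"), "A"), (list("ba"), "A"), (list("ab"), "A"), (list("bb"), "B")]
prior, likelihood = train_mle(docs)
print(prior)        # {'A': 0.75, 'B': 0.25}
print(likelihood)   # approx. {'A': {'a': 0.67, 'b': 0.33}, 'B': {'b': 1.0}}
```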

Training the classifier
Problems: Unseen combinations of w and c. Unknown words.
Simplest solution for unseen w, c combinations: add-one (Laplace) smoothing, commonly used in naive Bayes text categorization:
P̂(w|c) = (C(w, c) + 1) / Σ_{w'} (C(w', c) + 1) = (C(w, c) + 1) / (Σ_{w'} C(w', c) + |V|)
(V being the vocabulary.)
Standard solution for unknown words: simply remove them from the test document.
10 / 20

Training the classifier
Example from Jurafsky & Martin (2015), chapter 7
Training:
c   d
-   just plain boring
-   entirely predictable and lacks energy
-   no surprises and very few laughs
+   very powerful
+   the most fun film of the summer
Test:
?   S = predictable with no originality
P(-) = 3/5, P(+) = 2/5, |V| = 20
The unknown test words ("with", "originality") are removed, leaving "predictable" and "no"; the - and + training documents contain 14 and 9 word tokens in total:
P(S|-)P(-) = (1+1)/(14+20) · (1+1)/(14+20) · 3/5 = (2 · 2 · 3)/(34 · 34 · 5) = 0.002076
P(S|+)P(+) = (0+1)/(9+20) · (0+1)/(9+20) · 2/5 = (1 · 1 · 2)/(29 · 29 · 5) = 0.000476
The classifier therefore chooses the negative class.
11 / 20
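
The following sketch (my own) adds add-one smoothing to the estimator sketched earlier and reproduces the numbers of this example; helper names are assumptions.

```python
from collections import Counter

def train_laplace(docs):
    """docs: list of (token_list, class). Returns P(c), add-one smoothed P(w|c), and the vocabulary."""
    n_doc = len(docs)
    n_c = Counter(c for _, c in docs)
    word_counts = {c: Counter() for c in n_c}
    for tokens, c in docs:
        word_counts[c].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    prior = {c: n_c[c] / n_doc for c in n_c}
    likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # (C(w,c) + 1) / (sum_w' C(w',c) + |V|)
        likelihood[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return prior, likelihood, vocab

docs = [("just plain boring".split(), "-"),
        ("entirely predictable and lacks energy".split(), "-"),
        ("no surprises and very few laughs".split(), "-"),
        ("very powerful".split(), "+"),
        ("the most fun film of the summer".split(), "+")]
prior, likelihood, vocab = train_laplace(docs)

# Unknown words ("with", "originality") are removed from the test document
test = [w for w in "predictable with no originality".split() if w in vocab]
for c in prior:
    score = prior[c]
    for w in test:
        score *= likelihood[c][w]
    print(c, round(score, 6))    # -: 0.002076, +: 0.000476
```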

Evaluation
First consider the simple case of |C| = 2, i.e., we have only 2 possible classes: the label is either "in c" or "not in c".
The classifier is evaluated on human-labeled data (gold labels). For each document, we have a gold label and a system label. Four possibilities:
                   gold positive     gold negative
system positive    true positive     false positive
system negative    false negative    true negative
12 / 20

Evaluation
The following evaluation metrics are used (tp/fp/tn/fn = number of true/false positives/negatives):
1 Precision: How many of the items the system classified as positive are actually positive?
Precision = tp / (tp + fp)
2 Recall: How many of the positives are classified as positive by the system?
Recall = tp / (tp + fn)
3 Accuracy: How many of the classes the system has assigned are correct?
Accuracy = (tp + tn) / (tp + fp + tn + fn)
13 / 20

Evaluation
The F-measure combines precision P and recall R:
F_β = ((β² + 1) · P · R) / (β² · P + R)
β weights the importance of precision and recall: β > 1 favors recall; β < 1 favors precision; β = 1: both are equally important.
With β = 1, we get
F_1 = 2PR / (P + R)
(F_1 = harmonic mean of P and R.)
14 / 20
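
A short sketch (not from the slides) of these metrics as plain Python functions; argument names are my own.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def f_measure(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives the harmonic mean."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# Numbers from the example on the next slides: class A with tp=2, fp=1, fn=0, tn=1
p, r = precision(2, 1), recall(2, 0)
print(round(p, 2), round(r, 2), round(f_measure(p, r), 2))   # 0.67 1.0 0.8
```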

Evaluation
Example
Training data:
d    c
a    A
aa   A
b    A
bb   B
Test data:
d      c
ab     A
ba     A
bba    B
bbbb   B
Estimates (using add-one smoothing; base-10 logs, rounded):
log P(A) = log 3/4 = -0.12      log P(B) = log 1/4 = -0.6
log P(a|A) = log 4/6 = -0.18    log P(b|A) = log 2/6 = -0.48
log P(a|B) = log 1/4 = -0.6     log P(b|B) = log 3/4 = -0.12
Classes assigned to test data:
ab:  log P(A) + log P(a|A) + log P(b|A) = -0.12 - 0.18 - 0.48 = -0.78
     log P(B) + log P(a|B) + log P(b|B) = -0.6 - 0.6 - 0.12 = -1.32
     class A since -0.78 > -1.32
bba: A: -(0.12 + 2 · 0.48 + 0.18) = -1.26
     B: -(0.6 + 2 · 0.12 + 0.6) = -1.44
     class A
15 / 20

Evaluation
Example continued
Test data:
d      c gold   c system
ab     A        A
ba     A        A
bba    B        A
bbbb   B        B
Evaluation with respect to class A (B = not A, i.e., ¬A):
             gold A   gold ¬A
system A     2        1
system ¬A    0        1
P = 2/(2+1) = 0.67,  R = 2/(2+0) = 1,  Accuracy = 3/4 = 0.75,  F_1 = (2 · 0.67 · 1)/(0.67 + 1) = 0.8
Evaluation with respect to class B:
             gold B   gold ¬B
system B     1        0
system ¬B    1        2
P = 1/1 = 1,  R = 1/2 = 0.5,  Accuracy = 3/4 = 0.75,  F_1 = (2 · 1 · 0.5)/(1 + 0.5) = 0.67
16 / 20
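
A sketch (my own) that derives the per-class counts and metrics above directly from the gold/system label pairs; names are assumptions.

```python
def per_class_metrics(gold, system, cls):
    """Compute tp/fp/fn/tn and P, R, F1 for one class, treating it as 'positive'."""
    tp = sum(1 for g, s in zip(gold, system) if g == cls and s == cls)
    fp = sum(1 for g, s in zip(gold, system) if g != cls and s == cls)
    fn = sum(1 for g, s in zip(gold, system) if g == cls and s != cls)
    tn = len(gold) - tp - fp - fn
    p, r = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return (tp, fp, fn, tn), p, r, f1

gold   = ["A", "A", "B", "B"]   # ab, ba, bba, bbbb
system = ["A", "A", "A", "B"]
print(per_class_metrics(gold, system, "A"))   # ((2, 1, 0, 1), ~0.67, 1.0, ~0.8)
print(per_class_metrics(gold, system, "B"))   # ((1, 0, 1, 2), 1.0, 0.5, ~0.67)
```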

Evaluation
Evaluation for classifiers with more than 2 classes but a unique class for each document (multinomial classification):
Results can be represented in a confusion matrix with one column for every gold class and one row for every system class.
We can compute precision and recall for every single class c as before, based on a separate contingency table for that class.
The contingency tables can be pooled into one combined contingency table.
Two ways of combining this into an overall evaluation:
1 Macroaveraging: average P/R over all classes.
2 Microaveraging: compute P/R from the pooled contingency table.
17 / 20

Evaluation
Example
We classify documents as to whether they are from the 18th, 19th or 20th century. Our test set comprises 600 gold-labeled documents. Possible confusion matrix:
                     gold labels
                     18th   19th   20th
system   18th        150    35     0
labels   19th        20     110    5
         20th        10     10     260
Separate contingency tables (rows: system, columns: gold):
18th:            19th:            20th:
      yes  no          yes  no          yes  no
yes   150  35    yes   110  25    yes   260  20
no    30   385   no    45   420   no    5    315
Pooled table:
      yes  no
yes   520  80
no    80   1120
18 / 20

Evaluation
Example continued
18th:            19th:            20th:            Pooled:
      yes  no          yes  no          yes  no          yes  no
yes   150  35    yes   110  25    yes   260  20    yes   520  80
no    30   385   no    45   420   no    5    315   no    80   1120
Single class P and R:
18th: P_18th = 150/(150+35) = 0.81,  R_18th = 150/(150+30) = 0.83
19th: P_19th = 110/(110+25) = 0.81,  R_19th = 110/(110+45) = 0.71
20th: P_20th = 260/(260+20) = 0.93,  R_20th = 260/(260+5) = 0.98
Macroaverage P: (0.81 + 0.81 + 0.93)/3 = 0.85
Microaverage P: 520/(520+80) = 0.87
The microaverage is dominated by the more frequent classes.
19 / 20
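
A short sketch (my own) of macro- vs. micro-averaged precision computed from this confusion matrix; it follows the slide's convention of rows = system labels and columns = gold labels.

```python
# Rows = system labels, columns = gold labels; order: 18th, 19th, 20th
confusion = [[150, 35, 0],
             [20, 110, 5],
             [10, 10, 260]]

n = len(confusion)
tp = [confusion[i][i] for i in range(n)]
fp = [sum(confusion[i]) - confusion[i][i] for i in range(n)]   # other golds in system row i

per_class_p = [tp[i] / (tp[i] + fp[i]) for i in range(n)]
macro_p = sum(per_class_p) / n
micro_p = sum(tp) / (sum(tp) + sum(fp))                        # from the pooled contingency table

print([round(p, 2) for p in per_class_p])    # [0.81, 0.81, 0.93]
print(round(macro_p, 2), round(micro_p, 2))  # 0.85 0.87
```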

References
Jurafsky, Daniel & James H. Martin. 2015. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of the 3rd edition.
Manning, Christopher D., Prabhakar Raghavan & Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
20 / 20