Probability Review and Naïve Bayes

Probability Review and Naïve Bayes. Instructor: Alan Ritter. Some slides adapted from Dan Jurafsky and Brendan O'Connor.

What is Probability? The probability the coin will land heads is 0.5. Q: What does this mean? Two interpretations. Frequentist (repeated trials): if we flip the coin many times, about half the flips come up heads. Bayesian: we believe there is an equal chance of heads and tails. Advantage of the Bayesian view: it covers events that do not have long-term frequencies. Q: What is the probability the polar ice caps will melt by 2050?

Probability Review. Probabilities sum to one: Σ_x P(X = x) = 1. Conditional probability: P(A | B) = P(A, B) / P(B). Chain rule: P(A | B) P(B) = P(A, B).

Probability Review. Marginalization: Σ_x P(X = x, Y) = P(Y). Disjunction / union: P(A ∨ B) = P(A) + P(B) − P(A ∧ B). Negation: P(¬A) = 1 − P(A).
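
To make these identities concrete, here is a minimal sketch that checks them numerically on a hypothetical toy joint distribution over two binary variables (the numbers are made up for illustration):

```python
# Toy joint distribution over two binary variables A and B (hypothetical numbers).
joint = {
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.1, (False, False): 0.4,
}

# Marginalization: P(A) = sum over B of P(A, B)
p_a = sum(p for (a, b), p in joint.items() if a)
p_b = sum(p for (a, b), p in joint.items() if b)

# Conditional probability: P(A | B) = P(A, B) / P(B)
p_a_given_b = joint[(True, True)] / p_b

# Union: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = p_a + p_b - joint[(True, True)]

# Negation: P(not A) = 1 - P(A)
print(p_a, p_a_given_b, p_a_or_b, 1 - p_a)
```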

Bayes Rule. Hypothesis H (unknown); data D (observed evidence). The generative model P(D | H) describes how the hypothesis causes the data; Bayesian inference recovers P(H | D). Bayes Rule tells us how to flip the conditional, reasoning from effects back to causes. It is useful if you assume a generative model for your data: P(H | D) = P(D | H) P(H) / P(D).

Bayes Rule. Bayes Rule tells us how to flip the conditional, reasoning from effects back to causes; useful if you assume a generative model for your data. P(H | D) = P(D | H) P(H) / P(D), where P(H | D) is the posterior, P(D | H) the likelihood, P(H) the prior, and P(D) the normalizer.

Bayes Rule. Bayes Rule tells us how to flip the conditional, reasoning from effects back to causes; useful if you assume a generative model for your data. Writing the normalizer as a sum over hypotheses: P(H | D) = P(D | H) P(H) / Σ_h P(D | h) P(h), with the same likelihood, prior, and posterior roles as before.

Bayes Rule. Bayes Rule tells us how to flip the conditional, reasoning from effects back to causes; useful if you assume a generative model for your data. Dropping the normalizer, the posterior is proportional to likelihood × prior: P(H | D) ∝ P(D | H) P(H) (proportional to, so it doesn't sum to 1).

Bayes Rule Example. There is a disease that affects a tiny fraction of the population (0.01%). Symptoms include a headache and stiff neck. 99% of patients with the disease have these symptoms; 1% of the general population has these symptoms. Q: Assuming you have the symptoms, what is your probability of having the disease?
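
A minimal sketch of the computation this example calls for, using the numbers on the slide (0.01% prevalence, 99% of patients show the symptoms, 1% of the general population shows the symptoms):

```python
# Bayes rule for the disease example: P(disease | symptoms)
p_disease = 0.0001               # prior: 0.01% of the population has the disease
p_symptoms_given_disease = 0.99  # likelihood: 99% of patients show the symptoms
p_symptoms = 0.01                # normalizer: 1% of the population shows the symptoms

posterior = p_symptoms_given_disease * p_disease / p_symptoms
print(posterior)  # 0.0099 -- the symptoms still leave only about a 1% chance of disease
```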

Text Classification

Is this Spam?

Who wrote which Federalist Papers? 1787-8: anonymous essays (by Jay, Madison, and Hamilton) tried to convince New York to ratify the U.S. Constitution. The authorship of 12 of the essays was in dispute. 1963: solved by Mosteller and Wallace using Bayesian methods. [Portraits: James Madison, Alexander Hamilton]

What is the subject of this article? A MEDLINE article is assigned categories from the MeSH subject category hierarchy, e.g. Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology.

Positive or negative movie review? "unbelievably disappointing" / "Full of zany characters and richly applied satire, and some great plot twists" / "this is the greatest screwball comedy ever filmed" / "It was pathetic. The worst part about it was the boxing scenes."

Text Classification: definition. Input: a document d and a fixed set of classes C = {c1, c2, ..., cJ}. Output: a predicted class c ∈ C.

Classification Methods: Hand-coded rules. Rules based on combinations of words or other features, e.g. spam: black-list-address OR ("dollars" AND "have been selected"). Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive.

Classification Methods: Supervised Machine Learning. Input: a document d, a fixed set of classes C = {c1, c2, ..., cJ}, and a training set of m hand-labeled documents (d1, c1), ..., (dm, cm). Output: a learned classifier γ: d → c.

Classification Methods: Supervised Machine Learning. Any kind of classifier: Naïve Bayes, logistic regression, support-vector machines, k-nearest neighbors.

Naïve Bayes Intuition. A simple ("naïve") classification method based on Bayes rule. Relies on a very simple representation of the document: a bag of words.

Bag of words for document classification. [Figure: the words of a test document (parser, language, label, translation, ...) are compared against per-class word lists: Machine Learning (learning, training, algorithm, shrinkage, network, ...), NLP (parser, tag, training, translation, language, ...), Garbage Collection (garbage, collection, memory, optimization, region, ...), Planning (planning, temporal, reasoning, plan, language, ...), GUI (...). Which class best matches the test document?]

Bayes Rule Applied to Documents and Classes. For a document d and a class c: P(c | d) = P(d | c) P(c) / P(d).

Naïve Bayes Classifier (I). c_MAP = argmax_{c ∈ C} P(c | d) = argmax_{c ∈ C} P(d | c) P(c) / P(d) (Bayes Rule) = argmax_{c ∈ C} P(d | c) P(c) (dropping the denominator). MAP is "maximum a posteriori" = the most likely class.

Naïve Bayes Classifier (II). c_MAP = argmax_{c ∈ C} P(d | c) P(c) = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c), representing the document d by its features x1, ..., xn.

Naïve Bayes Classifier (IV). c_MAP = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c). The likelihood term has O(|X|^n · |C|) parameters, which could only be estimated if a very, very large number of training examples was available. The prior ("How often does this class occur?") is easy: we can just count the relative frequencies in a corpus.

Multinomial Naïve Bayes Independence Assumptions. To estimate P(x1, x2, ..., xn | c) we make two assumptions. Bag-of-words assumption: assume position doesn't matter. Conditional independence: assume the feature probabilities P(xi | cj) are independent given the class c.

Multinomial Naïve Bayes Classifier. c_MAP = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c); under the independence assumptions, c_NB = argmax_{c ∈ C} P(c) ∏_{x ∈ X} P(x | c).

Applying Multinomial Naive Bayes Classifiers to Text Classification. positions ← all word positions in the test document; c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj).

Learning the Multinomial Naïve Bayes Model. First attempt: maximum likelihood estimates, which simply use the frequencies in the data: P̂(cj) = doccount(C = cj) / N_doc and P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj).

Parameter estimation. P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj): the fraction of times word wi appears among all words in documents of topic cj. Create a mega-document for topic j by concatenating all docs in this topic, then use the frequency of wi in the mega-document.
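
A minimal sketch of these maximum likelihood counts in Python; the tiny corpus, variable names, and whitespace tokenization are assumptions for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical tiny labeled corpus: (document, class) pairs.
train = [("great plot and great acting", "pos"),
         ("pathetic boring plot", "neg")]

doc_count = Counter(c for _, c in train)                 # doccount(C = c_j)
word_count = defaultdict(Counter)                        # count(w_i, c_j)
for doc, c in train:
    word_count[c].update(doc.split())                    # "mega-document" counts per class

n_docs = len(train)
p_class = {c: doc_count[c] / n_docs for c in doc_count}  # MLE of P(c_j)
p_word = {c: {w: n / sum(word_count[c].values())         # MLE of P(w_i | c_j)
              for w, n in word_count[c].items()}
          for c in word_count}
print(p_class, p_word["pos"]["great"])  # "great" is 2 of 5 tokens in the "pos" mega-document
```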

Problem with Maximum Likelihood. What if we have seen no training documents with the word "fantastic" classified in the topic positive (thumbs-up)? Then P̂("fantastic" | positive) = 0, and zero probabilities cannot be conditioned away, no matter the other evidence: c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c).

Laplace (add-1) smoothing for Naïve Bayes. P̂(wi | c) = (count(wi, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|), instead of the unsmoothed count(wi, c) / Σ_{w ∈ V} count(w, c).

Multinomial Naïve Bayes: Learning. Calculate the P(cj) terms: for each cj in C do: docsj ← all docs with class = cj; P(cj) ← |docsj| / total # of documents.

Multinomial Naïve Bayes: Learning. From the training corpus, extract the Vocabulary. Calculate the P(wk | cj) terms: Textj ← a single doc containing all of docsj; for each word wk in Vocabulary: nk ← # of occurrences of wk in Textj; P(wk | cj) ← (nk + α) / (n + α·|Vocabulary|), where n is the total number of word tokens in Textj.
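
Putting the two learning slides together, a minimal sketch of add-α training in Python; the function name train_nb, the default α = 1 (Laplace add-1), and whitespace tokenization are assumptions for illustration:

```python
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (text, class) pairs. Returns P(c), P(w | c), and the vocabulary."""
    vocab = {w for text, _ in docs for w in text.split()}
    doc_count = Counter(c for _, c in docs)
    word_count = defaultdict(Counter)
    for text, c in docs:
        word_count[c].update(text.split())        # Text_j: all docs of class c concatenated

    prior = {c: doc_count[c] / len(docs) for c in doc_count}      # P(c_j)
    likelihood = {}
    for c in doc_count:
        n = sum(word_count[c].values())           # total word tokens in Text_j
        denom = n + alpha * len(vocab)
        likelihood[c] = {w: (word_count[c][w] + alpha) / denom    # P(w_k | c_j)
                         for w in vocab}
    return prior, likelihood, vocab
```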

Exercise

Naïve Bayes Classification: Practical Issues. c_MAP = argmax_c P(c | x1, ..., xn) = argmax_c P(x1, ..., xn | c) P(c) = argmax_c P(c) ∏_{i=1}^{n} P(xi | c). This multiplies together lots of probabilities, and probabilities are numbers between 0 and 1. Q: What could go wrong here?

Working with probabilities in log space

Log Identities (review): log(ab) = log(a) + log(b); log(a/b) = log(a) − log(b); log(a^n) = n·log(a).

Naïve Bayes with Log Probabilities. c_MAP = argmax_c P(c | x1, ..., xn) = argmax_c P(c) ∏_{i=1}^{n} P(xi | c) = argmax_c log[P(c) ∏_{i=1}^{n} P(xi | c)] = argmax_c log P(c) + Σ_{i=1}^{n} log P(xi | c).

Naïve Bayes with Log Probabilities. c_MAP = argmax_c log P(c) + Σ_{i=1}^{n} log P(xi | c). Q: Why don't we have to worry about floating point underflow anymore?
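
A minimal sketch of log-space prediction, assuming prior and likelihood dictionaries like the ones returned by the hypothetical train_nb sketch above (skipping out-of-vocabulary words here is one of several possible choices):

```python
import math

def predict_nb(text, prior, likelihood):
    """Return argmax_c [log P(c) + sum_i log P(x_i | c)] for a whitespace-tokenized document."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in text.split():
            if w in likelihood[c]:                 # skip words outside the vocabulary
                score += math.log(likelihood[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```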

What if we want to calculate posterior log-probabilities? P(c | x1, ..., xn) = P(c) ∏_{i=1}^{n} P(xi | c) / Σ_{c'} P(c') ∏_{i=1}^{n} P(xi | c'), so log P(c | x1, ..., xn) = log P(c) + Σ_{i=1}^{n} log P(xi | c) − log[Σ_{c'} P(c') ∏_{i=1}^{n} P(xi | c')].

Log Exp Sum Trick: motivation. We have a bunch of log probabilities: log(p1), log(p2), log(p3), ..., log(pn). We want log(p1 + p2 + p3 + ... + pn). We could convert back from log space, sum, then take the log, but if the probabilities are very small this will result in floating point underflow.

Log Exp Sum Trick: log[Σ_i exp(xi)] = x_max + log[Σ_i exp(xi − x_max)].
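
A minimal sketch of the trick in Python; the function names logsumexp and log_posteriors are illustrative, and the second function uses the trick to turn log-space Naïve Bayes scores into posterior log-probabilities:

```python
import math

def logsumexp(xs):
    """log(sum_i exp(x_i)), computed without underflowing tiny probabilities."""
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))

def log_posteriors(scores):
    """scores: dict class -> log P(c) + sum_i log P(x_i | c). Returns log P(c | x_1..x_n)."""
    log_z = logsumexp(list(scores.values()))     # log of the normalizer
    return {c: s - log_z for c, s in scores.items()}
```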

Another issue: Smoothing. P̂(wi | c) = (count(wi, c) + 1) / (Σ_{w' ∈ V} count(w', c) + |V|).

Another issue: Smoothing. Alpha doesn't necessarily need to be 1 (it is a hyperparameter): P̂(wi | c) = (count(wi, c) + α) / (Σ_{w' ∈ V} count(w', c) + α·|V|).

Another issue: Smoothing. Can think of alpha as a pseudocount: an imaginary number of times this word has been seen. P̂(wi | c) = (count(wi, c) + α) / (Σ_{w' ∈ V} count(w', c) + α·|V|).

Another issue: Smoothing. P̂(wi | c) = (count(wi, c) + α) / (Σ_{w' ∈ V} count(w', c) + α·|V|). Q: What if alpha = 0? Q: What if alpha = 0.000001? Q: What happens as alpha gets very large?

Overfitting. The model cares too much about the training data. How to check for overfitting? Compare training vs. test accuracy. The pseudocount parameter combats overfitting.

Q: How to pick alpha? Split the data into train vs. test, try a bunch of different values, and pick the value of alpha that performs best. What values to try? Grid search (10^-2, 10^-1, ..., 10^2). [Figure: accuracy plotted against alpha; use the value with the highest accuracy.]
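
A minimal sketch of that grid search, assuming the hypothetical train_nb and predict_nb sketches above and a held-out set of (text, class) pairs:

```python
def pick_alpha(train_docs, heldout_docs, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Return the alpha from the grid with the best held-out accuracy."""
    best_alpha, best_acc = None, -1.0
    for alpha in grid:
        prior, likelihood, _ = train_nb(train_docs, alpha=alpha)
        correct = sum(predict_nb(text, prior, likelihood) == c
                      for text, c in heldout_docs)
        acc = correct / len(heldout_docs)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha
```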

Data Splitting. Train vs. test. Better: train (used for fitting model parameters), dev (used for tuning hyperparameters), test (reserved for final evaluation). Alternatively: cross-validation.

Feature Engineering. What is your word / feature representation? Tokenization rules: split on whitespace? Treat uppercase the same as lowercase? What about numbers? Punctuation? Stemming?
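
A minimal sketch of one possible set of choices (lowercasing, collapsing digits, stripping punctuation); these particular rules are illustrative assumptions, not the lecture's recommendation:

```python
import re

def tokenize(text):
    """Lowercase, map digit runs to a <num> placeholder, drop punctuation, split on whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "<num>", text)          # collapse numbers
    text = re.sub(r"[^\w<>\s]", " ", text)        # strip punctuation
    return text.split()

print(tokenize("Unbelievably disappointing!! 2 stars."))
# ['unbelievably', 'disappointing', '<num>', 'stars']
```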