Naïve Bayes Classifiers and Logistic Regression. Doug Downey, Northwestern EECS 349, Winter 2014.

Naïve Bayes Classifiers
Combines all the ideas we've covered (conditional independence, Bayes rule, statistical estimation, Bayes nets) in a simple, yet accurate classifier.
Classifier: a function f(x) from X = {<x_1, ..., x_d>} to Class. E.g., X = {<GRE, GPA, Letters>}, Class = {yes, no, wait}.

Probability => Classification (1 of 2)
Classification task: learn a function f(x) from X = {<x_1, ..., x_d>} to Class, given examples D = {(x, y)}.
Probabilistic approach: learn P(Class = y | X = x) from D; given x, pick the maximally probable y.

Probability => Classification (2 of 2)
More formally: f(x) = arg max_y P(Class = y | X = x, θ_MAP)
θ_MAP: the MAP parameters of P(Class = y | X = x), learned from data. We'll focus on using the MAP estimate, but we can also use ML or full Bayesian estimates.
Predict the next coin flip? This is an instance of the problem with X = null. Given D = "hhht tht", estimate P(θ | D) and find θ_MAP. Predict Class = heads iff θ_MAP > ½.
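To make the coin-flip case concrete, here is a minimal sketch (not from the slides) of the MAP estimate of θ = P(heads) under an assumed Beta(a, b) prior; the pseudo-count values a and b below are illustrative.

```python
# MAP estimate of theta = P(heads) from the slide's flips, under an assumed
# Beta(a, b) prior. a = b = 1 recovers the maximum-likelihood estimate 4/7.
D = "hhht tht"          # 4 heads, 3 tails
a, b = 2, 2             # hypothetical prior pseudo-counts

heads, tails = D.count("h"), D.count("t")

# Mode of the Beta(heads + a, tails + b) posterior:
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)

print(f"theta_MAP = {theta_map:.3f} -> predict "
      f"{'heads' if theta_map > 0.5 else 'tails'}")
```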

Example: Text Classification
"Dear Sir/Madam, We are pleased to inform you of the result of the Lottery Winners International programs held on the 30/8/2004. Your e-mail address attached to ticket number: EL-23133 with serial Number: EL-123542, batch number: 8/163/EL-35, lottery Ref number: EL-9318 and drew lucky numbers 7-1-8-36-4-22 which consequently won in the 1st category, you have therefore been approved for a lump sum pay out of US$1,500,000.00 (One Million, Five Hundred Thousand United States dollars)"
SPAM or NOT SPAM?

Representation
X = document. Estimate P(Class = {spam, non-spam} | X).
Question: how do we represent X? One dimension for each possible e-mail, i.e., each possible permutation of words? No. There are lots of possibilities; a common choice is bag of words: the e-mail above becomes a vector of word counts, e.g. Sir: 1, Lottery: 10, Dollars: 7, With: 38.
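As a small illustration (not from the slides), a bag-of-words vector can be built in a couple of lines of Python; the tokenizer below is an assumed simple lowercase alphabetic split, and the text is a truncated stand-in for the full e-mail.

```python
from collections import Counter
import re

# Truncated stand-in for the e-mail above.
text = ("Dear Sir/Madam, We are pleased to inform you of the result of the "
        "Lottery Winners International programs ...")

# Assumed tokenization: lowercase, alphabetic tokens only.
tokens = re.findall(r"[a-z]+", text.lower())
bag = Counter(tokens)            # word -> count

print(bag.most_common(3))        # e.g. [('the', 2), ('of', 2), ...]
```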

Bag of Words
Ignores word order: no emphasis on the title, no compositional meaning ("Cold War" -> "cold" and "war"), etc. But it massively reduces dimensionality/complexity. Still, recording the presence or absence of each word in a 100,000-word vocabulary entails 2^100,000 distinct vectors.

Naïve Bayes Classifiers
P(Class | X) for |Val(X)| = 2^100,000 requires 2^100,000 parameters. Problematic.
Bayes rule: P(Class | X) = P(X | Class) P(Class) / P(X)
Assume the presence of word i is independent of all other words given Class:
P(Class | X) = Π_i P(w_i | Class) P(Class) / P(X)
Now only 200,001 parameters for P(Class | X).

Naïve Bayes Assumption
Features are conditionally independent given the class. Not P("Republican", "Democrat") = P("Republican") P("Democrat"), but instead P("Republican", "Democrat" | Class = Politics) = P("Republican" | Class = Politics) P("Democrat" | Class = Politics).
Still an absurd assumption: is "Lottery" independent of "Winner" given SPAM? Is "lunch" independent of "noon" given NOT SPAM? But it offers massive tractability advantages and works quite well in practice.
Lesson: overly strong independence assumptions sometimes allow you to build an accurate model where you otherwise couldn't.

Getting the Parameters from Data
Parameters θ = <θ_ij = P(w_i | Class = j)>
Maximum likelihood: estimate P(w_i | Class = j) from D by counting, as the fraction of documents in class j containing word i. But what if word i never occurs in class j?
Commonly used MAP estimate:
θ_ij = [(# docs in class j with word i) + 1] / [(# docs in class j) + V]
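A minimal sketch of this smoothed estimate on a hypothetical four-document corpus; the estimator follows the formula above, with V the vocabulary size.

```python
from collections import defaultdict

# Hypothetical labelled corpus: (set of words in the document, class).
docs = [({"lottery", "winner", "dollars"}, "spam"),
        ({"lottery", "prize"},             "spam"),
        ({"lunch", "noon"},                "ham"),
        ({"meeting", "noon"},              "ham")]

vocab = set().union(*(words for words, _ in docs))
V = len(vocab)                                       # here V = 7

n_docs = defaultdict(int)                            # docs per class
n_word = defaultdict(lambda: defaultdict(int))       # docs per class containing word
for words, label in docs:
    n_docs[label] += 1
    for w in words:
        n_word[label][w] += 1

# theta[j][w] = (# docs in class j with word w + 1) / (# docs in class j + V)
theta = {j: {w: (n_word[j][w] + 1) / (n_docs[j] + V) for w in vocab}
         for j in n_docs}

print(theta["spam"]["lottery"])   # (2 + 1) / (2 + 7) = 0.333...
print(theta["ham"]["lottery"])    # (0 + 1) / (2 + 7): nonzero despite zero count
```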

Caveats
Naïve Bayes is effective as a classifier, but not as effective at producing probability estimates: the product Π_i P(w_i | Class) pushes estimates toward 0 or 1.
In practice, numerical underflow is typical at classification time; compare sums of logs instead of products.
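A sketch of classification by comparing sums of logs; the class priors and word probabilities are hypothetical stand-ins for estimates learned as above (the Bernoulli presence/absence variant of Naïve Bayes is assumed).

```python
import math

priors = {"spam": 0.5, "ham": 0.5}
theta = {"spam": {"lottery": 0.8, "winner": 0.7, "lunch": 0.1, "noon": 0.2},
         "ham":  {"lottery": 0.1, "winner": 0.1, "lunch": 0.6, "noon": 0.7}}

def classify(words):
    scores = {}
    for c in priors:
        # Sum log-probabilities instead of multiplying probabilities,
        # so a long document cannot underflow to 0.0.
        score = math.log(priors[c])
        for w, p in theta[c].items():
            score += math.log(p if w in words else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify({"lottery", "winner"}))   # -> 'spam'
print(classify({"lunch", "noon"}))       # -> 'ham'
```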

Discriminative vs. Generative Training
Say our graph G has variables X, Y. The previous method learns P(X, Y). But often, the only inferences we care about are of the form P(Y | X), e.g. P(Disease | Symptoms = e), P(StockMarketCrash | RecentPriceActivity = e).

Discriminative vs. Generative Training
Learning P(X, Y): generative training. The learned model can generate the full data X, Y.
Learning only P(Y | X): discriminative training. The model can't assign probabilities to X, only to Y given X.
Idea: only model what we care about. Don't waste data on parameters irrelevant to the task, and side-step false independence assumptions in training (example to follow).

Generative Model Example
Naïve Bayes model: Y is binary {1 = spam, 0 = not spam}; X is an n-vector: the message has word i (1) or not (0).
Re-write P(Y | X) using Bayes rule and apply the Naïve Bayes assumption: 2n + 1 parameters for n observed variables.
[Figure: Bayes net with class node Spam pointing to word nodes Lottery, winner, ..., Dear]

Generative => Discriminative (1 of 3)
But P(Y | X) can be written more compactly:
P(Y = 1 | X) = 1 / (1 + exp(w_0 + w_1 x_1 + ... + w_n x_n))
Total of n + 1 parameters w_i.
[Figure: the same Bayes net as before, with class node Spam and word nodes Lottery, winner, ..., Dear]

Generative => Discriminative (2 of 3)
One way to do the conversion (variables binary):
exp(w_0) = [P(Y = 0) P(X_1 = 0 | Y = 0) P(X_2 = 0 | Y = 0) ...] / [P(Y = 1) P(X_1 = 0 | Y = 1) P(X_2 = 0 | Y = 1) ...]
For i > 0:
exp(w_i) = [P(X_i = 0 | Y = 1) P(X_i = 1 | Y = 0)] / [P(X_i = 0 | Y = 0) P(X_i = 1 | Y = 1)]
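A sketch of this conversion for a hypothetical two-word vocabulary; the resulting logistic form returns the same P(Y = 1 | X) as applying Bayes rule to the Naïve Bayes parameters directly.

```python
import math

# Hypothetical Bernoulli Naive Bayes parameters.
p_y1 = 0.4                        # P(Y = 1)
p_x1 = {0: {0: 0.05, 1: 0.60},    # p_x1[i][y] = P(X_i = 1 | Y = y)
        1: {0: 0.30, 1: 0.20}}

# exp(w_0) = P(Y=0) prod_i P(X_i=0|Y=0)  /  P(Y=1) prod_i P(X_i=0|Y=1)
w0 = math.log((1 - p_y1) / p_y1)
for p in p_x1.values():
    w0 += math.log((1 - p[0]) / (1 - p[1]))

# exp(w_i) = P(X_i=0|Y=1) P(X_i=1|Y=0)  /  P(X_i=0|Y=0) P(X_i=1|Y=1)
w = {i: math.log(((1 - p[1]) * p[0]) / ((1 - p[0]) * p[1]))
     for i, p in p_x1.items()}

def p_y1_given_x(x):
    """Logistic form: P(Y=1 | x) = 1 / (1 + exp(w_0 + sum_i w_i x_i))."""
    return 1.0 / (1.0 + math.exp(w0 + sum(w[i] * x[i] for i in w)))

print(p_y1_given_x({0: 1, 1: 0}))   # ~0.901, same as Bayes rule on the NB model
```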

Generative => Discriminative (3 of 3)
We reduced 2n + 1 parameters to n + 1. Bias vs. variance arguments say this must be better, right? Not exactly: if we construct P(Y | X) to be equivalent to Naïve Bayes (as before), then it is equivalent to Naïve Bayes.
Idea: optimize the n + 1 parameters directly, using the training data.

Discriminative Training
In our example: P(Y = 1 | X) = 1 / (1 + exp(w_0 + w_1 x_1 + ... + w_n x_n))
Goal: find the w_i that maximize the likelihood of the training-data Ys given the training-data Xs. Known as logistic regression. Solved with gradient ascent techniques; a convex (actually concave) optimization problem.
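A bare-bones sketch of gradient ascent (here with stochastic, per-example updates) on a tiny hypothetical dataset, keeping the slide's sign convention P(Y = 1 | X) = 1 / (1 + exp(w_0 + Σ_i w_i x_i)); the learning rate and epoch count are arbitrary choices.

```python
import math

def p_y1(w0, w, x):
    # Slide's convention: P(Y=1 | x) = 1 / (1 + exp(w0 + sum_i w_i x_i))
    return 1.0 / (1.0 + math.exp(w0 + sum(wi * xi for wi, xi in zip(w, x))))

def train(data, lr=0.5, epochs=2000):
    """Maximize sum_j log P(y_j | x_j) by per-example gradient ascent."""
    w0, w = 0.0, [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            g = p_y1(w0, w, x) - y     # with this convention, dLL/dw_i = g * x_i
            w0 += lr * g
            for i, xi in enumerate(x):
                w[i] += lr * g * xi
    return w0, w

# Hypothetical separable dataset: y = x_1 AND x_2.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w0, w = train(data)
print([int(p_y1(w0, w, x) > 0.5) for x, _ in data])   # -> [0, 0, 0, 1]
```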

Naïve Bayes vs. LR
Naïve Bayes trusts its assumptions in training; logistic regression doesn't, so it recovers better when the assumptions are violated.

NB vs. LR: Example
Training data:

SPAM  Lottery  Winner  Lunch  Noon
1     1        1       0      0
1     1        1       1      1
0     0        0       1      1
0     1        1       0      1

Naïve Bayes will classify the last example incorrectly, even after training on it! Whereas logistic regression is perfect with, e.g., w_0 = 0.1, w_lottery = w_winner = w_lunch = -0.2, w_noon = 0.4.
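A quick numeric check of both claims (a sketch, using maximum-likelihood counting for Naïve Bayes and the weights quoted above for logistic regression):

```python
import math

# The table above: x = (lottery, winner, lunch, noon), y = 1 for SPAM.
data = [([1, 1, 0, 0], 1), ([1, 1, 1, 1], 1),
        ([0, 0, 1, 1], 0), ([1, 1, 0, 1], 0)]

def nb_predict(x):
    """Naive Bayes with maximum-likelihood (counting) estimates."""
    scores = {}
    for c in (0, 1):
        rows = [xi for xi, yi in data if yi == c]
        score = len(rows) / len(data)                  # P(Y = c)
        for i, xi in enumerate(x):
            p1 = sum(r[i] for r in rows) / len(rows)   # P(X_i = 1 | Y = c)
            score *= p1 if xi == 1 else 1.0 - p1
        scores[c] = score
    return max(scores, key=scores.get)

print([nb_predict(x) for x, _ in data])   # -> [1, 1, 0, 1]: last example wrong

# Logistic regression with the slide's weights, same sign convention as above.
w0, w = 0.1, [-0.2, -0.2, -0.2, 0.4]
def lr_predict(x):
    s = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return int(1.0 / (1.0 + math.exp(s)) > 0.5)

print([lr_predict(x) for x, _ in data])   # -> [1, 1, 0, 0]: all four correct
```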

Logistic Regression in Practice
Can be employed for any numeric variables X_i, or for other variable types by converting them to numeric (e.g., indicator) functions. Regularization plays the role that priors play in Naïve Bayes. Optimization is tractable, but (way) more expensive than counting (as in Naïve Bayes).
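As a sketch of how regularization enters the optimization, here is one L2-penalized ("Gaussian prior on the weights") gradient step under the same sign convention as above; the penalty strength lam is an assumed hyperparameter.

```python
import math

def regularized_step(w0, w, x, y, lr=0.1, lam=0.01):
    """One stochastic gradient-ascent step on log P(y | x) - (lam/2) * ||w||^2."""
    p1 = 1.0 / (1.0 + math.exp(w0 + sum(wi * xi for wi, xi in zip(w, x))))
    new_w0 = w0 + lr * (p1 - y)                      # bias left unpenalized
    new_w = [wi + lr * ((p1 - y) * xi - lam * wi) for wi, xi in zip(w, x)]
    return new_w0, new_w

# Works for arbitrary numeric features, not just 0/1 indicators.
print(regularized_step(0.0, [0.0, 0.0], [2.5, -1.0], 1))
```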

Discriminative Training
Naïve Bayes vs. logistic regression is one illustrative case. Discriminative training is applicable more broadly, whenever the queries P(Y | X) are known a priori.