CSE446: Naïve Bayes Winter 2016

CSE446: Naïve Bayes Winter 2016 Ali Farhadi Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer

Supervised Learning: find f. Given: a training set {(x_i, y_i) | i = 1 … n}. Find: a good approximation to f : X → Y. Examples: what are X and Y? Spam detection: map email to {Spam, Ham}. Digit recognition: map pixels to {0,1,2,3,4,5,6,7,8,9}. Stock prediction: map news, historic prices, etc. to (the real numbers) R. Classification.

Example: Spam Filter. Input: email. Output: spam/ham. Setup: get a large collection of example emails, each labeled spam or ham. Note: someone has to hand-label all this data! Want to learn to predict labels of new, future emails. Features: the attributes used to make the ham/spam decision. Words: FREE! Text patterns: $dd, CAPS. Non-text: SenderInContacts.

Example emails from the slide:

"Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret."

"TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99"

"Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."

Example: Digit Recognition. Input: images / pixel grids. Output: a digit 0-9. Setup: get a large collection of example images, each labeled with a digit. Note: someone has to hand-label all this data! Want to learn to predict labels of new, future digit images. Features: the attributes used to make the digit decision. Pixels: (6,8) = ON. Shape patterns: NumComponents, AspectRatio, NumLoops. (The slide shows example digit images labeled 0, 1, 2, 1, ??.)

Other Classification Tasks. In classification, we predict labels y (classes) for inputs x. Examples: Spam detection (input: document, classes: spam / ham); OCR (input: images, classes: characters); Medical diagnosis (input: symptoms, classes: diseases); Automatic essay grading (input: document, classes: grades); Fraud detection (input: account activity, classes: fraud / no fraud); Customer service email routing; and many more. Classification is an important commercial technology!

Let's take a probabilistic approach!!! Can we directly estimate the data distribution P(X, Y)? How do we represent these? How many parameters? Prior, P(Y): suppose Y is composed of k classes. Likelihood, P(X | Y): suppose X is composed of n binary features. Example dataset (mpg):

mpg  cylinders displacement horsepower weight acceleration modelyear maker
good 4 low    low    low    high   75to78 asia
bad  6 medium medium medium medium 70to74 america
bad  4 medium medium medium low    75to78 europe
bad  8 high   high   high   low    70to74 america
bad  6 medium medium medium medium 70to74 america
bad  4 low    medium low    medium 70to74 asia
bad  4 low    medium low    low    70to74 asia
bad  8 high   high   high   low    75to78 america
...
bad  8 high   high   high   low    70to74 america
good 8 high   medium high   high   79to83 america
bad  8 high   high   high   low    75to78 america
good 4 low    low    low    low    79to83 america
bad  6 medium medium medium high   75to78 america
good 4 medium low    low    low    79to83 america
good 4 low    low    medium high   79to83 america
bad  8 high   high   high   low    70to74 america
good 4 low    medium low    medium 75to78 europe
bad  5 medium medium medium medium 75to78 europe

Complex model! High variance with limited data!!!

Conditional Independence. X is conditionally independent of Y given Z if the probability distribution for X is independent of the value of Y, given the value of Z. Equivalent to: P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all values x, y, z.

Naïve Bayes. Naïve Bayes assumption: features are independent given the class: P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y). More generally: P(X_1, …, X_n | Y) = ∏_i P(X_i | Y). How many parameters now? Suppose X is composed of n binary features.
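To make the comparison concrete: with k classes and n binary features, the two models need the following numbers of parameters (the worked numbers below are my own arithmetic, added for illustration, not from the slides).

```latex
\text{full joint } P(X_1,\dots,X_n \mid Y):\; k\,(2^{n}-1) \text{ parameters}
\qquad \text{vs.} \qquad
\text{Na\"ive Bayes: } k\,n \;+\; (k-1) \text{ parameters}
```

For example, with k = 2 classes and n = 30 binary features, the full joint needs 2·(2^30 − 1) ≈ 2 × 10^9 numbers, while Naïve Bayes needs only 2·30 + 1 = 61.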

The Naïve Bayes Classifier. Given: prior P(Y); n conditionally independent features X given the class Y; for each X_i, the likelihood P(X_i | Y). (Graphical model: Y with arrows to X_1, X_2, …, X_n.) Decision rule: y* = h_NB(x) = argmax_y P(y) ∏_i P(x_i | y). If a certain assumption holds, NB is the optimal classifier! Will discuss at the end of lecture!
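A minimal Python sketch of this decision rule, assuming the prior and per-feature likelihood tables have already been estimated; the names prior, likelihood, and naive_bayes_predict are illustrative, not from the lecture.

```python
import math

def naive_bayes_predict(x, prior, likelihood):
    """Return argmax_y P(y) * prod_i P(x_i | y), computed in log space.

    x:          list of feature values x_1, ..., x_n
    prior:      dict mapping class y -> P(Y = y)
    likelihood: dict mapping (i, x_i, y) -> P(X_i = x_i | Y = y)
    Assumes all looked-up probabilities are nonzero (e.g. already smoothed).
    """
    best_y, best_score = None, float("-inf")
    for y, p_y in prior.items():
        score = math.log(p_y)
        for i, x_i in enumerate(x):
            score += math.log(likelihood[(i, x_i, y)])
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```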

A Digit Recognizer Input: pixel grids Output: a digit 0-9

Naïve Bayes for Digits (Binary Inputs). Simple version: one feature F_{i,j} for each grid position <i,j>. Possible feature values are on/off, based on whether intensity is more or less than 0.5 in the underlying image. Each input maps to a feature vector (one binary value per grid position). Here: lots of features, each binary valued. Naïve Bayes model: P(Y | F_{0,0}, …) ∝ P(Y) ∏_{i,j} P(F_{i,j} | Y). Are the features independent given the class? What do we need to learn?
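A small sketch, under the slide's 0.5-intensity assumption, of how a pixel grid could be turned into binary on/off features; binarize_pixels is a hypothetical helper, not code from the course.

```python
import numpy as np

def binarize_pixels(image, threshold=0.5):
    """Map a 2D grid of pixel intensities in [0, 1] to binary on/off features.

    Each grid position (i, j) becomes one feature F_ij: 1 ("on") when the
    intensity exceeds the threshold, 0 ("off") otherwise.
    """
    image = np.asarray(image, dtype=float)
    return (image > threshold).astype(int).ravel()

# Toy 3x3 "image"
print(binarize_pixels([[0.0, 0.9, 0.1],
                       [0.2, 0.8, 0.0],
                       [0.7, 0.1, 0.6]]))   # [0 1 0 0 1 0 1 0 1]
```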

Example Distributions (prior over digits and two pixel-feature conditionals):

y   P(Y=y)   P(F=on | Y=y), feature 1   P(F=on | Y=y), feature 2
1   0.1      0.01                       0.05
2   0.1      0.05                       0.01
3   0.1      0.05                       0.90
4   0.1      0.30                       0.80
5   0.1      0.80                       0.90
6   0.1      0.90                       0.90
7   0.1      0.05                       0.25
8   0.1      0.60                       0.85
9   0.1      0.50                       0.60
0   0.1      0.80                       0.80

MLE for the parameters of NB. Given a dataset, Count(A = a, B = b) is the number of examples with A = a and B = b. MLE for discrete NB is simply:
Prior: P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')
Likelihood: P(X_i = x | Y = y) = Count(X_i = x, Y = y) / Σ_{x'} Count(X_i = x', Y = y)
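A sketch of these count-based estimates in Python, assuming the training data arrives as (x, y) pairs with discrete feature values; mle_estimates and the dictionary layout are illustrative choices, not from the slides.

```python
from collections import Counter, defaultdict

def mle_estimates(data):
    """Count-based MLE for discrete Naive Bayes.

    data: iterable of (x, y) pairs, x a tuple of discrete feature values.
    Returns (prior, likelihood) with
      prior[y]              = Count(Y = y) / sum_y' Count(Y = y')
      likelihood[(i, v, y)] = Count(X_i = v, Y = y) / sum_v' Count(X_i = v', Y = y)
    """
    class_counts = Counter()
    feature_counts = defaultdict(Counter)   # (i, y) -> counts of values v
    for x, y in data:
        class_counts[y] += 1
        for i, v in enumerate(x):
            feature_counts[(i, y)][v] += 1

    total = sum(class_counts.values())
    prior = {y: c / total for y, c in class_counts.items()}
    likelihood = {}
    for (i, y), counter in feature_counts.items():
        denom = sum(counter.values())
        for v, c in counter.items():
            likelihood[(i, v, y)] = c / denom
    return prior, likelihood
```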

Subtleties of NB classifier 1: violating the NB assumption. Usually, features are not conditionally independent: actual probabilities P(Y | X) are often biased towards 0 or 1. Nonetheless, NB is the single most used classifier out there. NB often performs well, even when the assumption is violated. [Domingos & Pazzani 96] discuss some conditions for good performance.

Subtleties of NB classifier 2: overfitting. (Slide example: "2 wins!!")

For binary features, we already know the answer! MAP: use the most likely parameter: θ̂ = argmax_θ P(θ | D) = (α_H + β_H − 1) / (α_H + β_H + α_T + β_T − 2), where α_H, α_T are observed counts and β_H, β_T come from the Beta prior. A Beta prior is equivalent to extra observations for each feature. As N → ∞, the prior is forgotten. But for small sample sizes, the prior is important!

Multinomials: Laplace Smoothing. Laplace's estimate: pretend you saw every outcome k extra times: P_LAP,k(x) = (count(x) + k) / (N + k|X|). Coin example: observations H, H, T. What's Laplace with k = 0? k is the strength of the prior. Can derive this as a MAP estimate for a multinomial with Dirichlet priors. Laplace for conditionals: smooth each condition independently: P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|).
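A small sketch of Laplace's estimate, assuming every possible outcome appears as a key in the count table (with count 0 if unseen); laplace_estimate is a hypothetical helper, not from the slides.

```python
def laplace_estimate(counts, k=1):
    """Laplace-smoothed estimate P_LAP,k(x) = (count(x) + k) / (N + k * |X|).

    counts: dict mapping every possible outcome x to its observed count
            (unseen outcomes should appear with count 0).
    k:      strength of the prior; k = 0 recovers the plain MLE.
    """
    n = sum(counts.values())
    num_outcomes = len(counts)
    return {x: (c + k) / (n + k * num_outcomes) for x, c in counts.items()}

# Coin example from the slide: observations H, H, T
print(laplace_estimate({"H": 2, "T": 1}, k=0))  # MLE: H -> 2/3, T -> 1/3
print(laplace_estimate({"H": 2, "T": 1}, k=1))  # Laplace: H -> 0.6, T -> 0.4
```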

Text classification. Classify e-mails: Y = {Spam, NotSpam}. Classify news articles: Y = {what is the topic of the article?}. Classify webpages: Y = {Student, professor, project, …}. What about the features X? The text!

Features X are the entire document; X_i is the i-th word in the article.

NB for text classification. P(X | Y) is huge!!! An article has at least 1000 words, X = {X_1, …, X_1000}. X_i represents the i-th word in the document, i.e., the domain of X_i is the entire vocabulary, e.g., Webster's Dictionary (or more), 10,000 words, etc. The NB assumption helps a lot!!! P(X_i = x_i | Y = y) is just the probability of observing word x_i in a document on topic y.

Bag of words model. Typical additional assumption: position in the document doesn't matter: P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y) (all positions have the same distribution). Bag of words model: the order of words on the page is ignored. Sounds really silly, but often works very well! Example: "When the lecture is over, remember to wake up the person sitting next to you in the lecture room."

Bag of words model. Typical additional assumption: position in the document doesn't matter: P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y) (all positions have the same distribution). Bag of words model: the order of words on the page is ignored. Sounds really silly, but often works very well! The same sentence as a bag of words: "in is lecture lecture next over person remember room sitting the the the to to up wake when you"

Bag of Words Approach: represent each document by its word counts, e.g. aardvark: 0, about: 2, all: 2, Africa: 1, apple: 0, anxious: 0, …, gas: 1, …, oil: 1, …, Zaire: 0.

NB with Bag of Words for text classification. Learning phase: Prior P(Y): count how many documents come from each topic (the prior). P(X_i | Y): for each topic, count how many times you saw each word in documents of that topic (+ prior); remember this distribution is shared across all positions i. Test phase: for each document, use the naïve Bayes decision rule.
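A compact sketch of this learning/test recipe, assuming documents arrive as (word list, topic) pairs and using add-one (Laplace) smoothing for the word probabilities; the function names and data layout are illustrative, not from the lecture.

```python
import math
from collections import Counter

def train_nb_bow(docs):
    """Learning phase for bag-of-words Naive Bayes with add-one smoothing.

    docs: iterable of (word_list, topic) pairs.
    Returns (log_prior, log_word_prob, vocab); the word distribution is
    shared across all positions i, as on the slide.
    """
    doc_counts = Counter()     # topic -> number of documents
    word_counts = {}           # topic -> Counter over words
    vocab = set()
    for words, topic in docs:
        doc_counts[topic] += 1
        word_counts.setdefault(topic, Counter()).update(words)
        vocab.update(words)

    n_docs = sum(doc_counts.values())
    log_prior = {y: math.log(c / n_docs) for y, c in doc_counts.items()}
    log_word_prob = {}
    for y, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)   # add-one smoothing
        for w in vocab:
            log_word_prob[(w, y)] = math.log((counts[w] + 1) / denom)
    return log_prior, log_word_prob, vocab

def classify_nb_bow(words, log_prior, log_word_prob, vocab):
    """Test phase: naive Bayes decision rule on one document (unknown words skipped)."""
    scores = {y: lp + sum(log_word_prob[(w, y)] for w in words if w in vocab)
              for y, lp in log_prior.items()}
    return max(scores, key=scores.get)
```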

Twenty News Groups results

Learning curve for Twenty News Groups

What if we have continuous X_i? E.g., character recognition: X_i is the i-th pixel. Gaussian Naïve Bayes (GNB): P(X_i = x | Y = y_k) = (1 / (σ_ik √(2π))) exp(−(x − μ_ik)² / (2σ_ik²)). Sometimes we assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).

Estimating Parameters: Y discrete, X_i continuous. Maximum likelihood estimates:
Mean: μ̂_ik = (1 / Σ_j δ(Y^j = y_k)) Σ_j X_i^j δ(Y^j = y_k), where j indexes training examples.
Variance: σ̂²_ik = (1 / Σ_j δ(Y^j = y_k)) Σ_j (X_i^j − μ̂_ik)² δ(Y^j = y_k),
with δ(x) = 1 if x is true, else 0.
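A sketch of Gaussian NB fitting and prediction with these ML estimates, assuming NumPy arrays for the data; fit_gnb and predict_gnb are hypothetical names, and the tiny variance floor is my own addition to avoid division by zero.

```python
import numpy as np

def fit_gnb(X, y):
    """ML estimates of per-class means, variances, and priors for Gaussian NB.

    X: (m, n) array of continuous features; y: length-m array of class labels.
    Returns a dict: class -> (mean vector, variance vector, prior).
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # np.var uses the 1/N (maximum likelihood) normalization by default;
        # the tiny floor avoids dividing by zero for constant features.
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(y))
    return params

def predict_gnb(x, params):
    """Pick the class maximizing log P(y) + sum_i log N(x_i; mu_iy, sigma_iy^2)."""
    x = np.asarray(x, dtype=float)
    best_c, best_score = None, -np.inf
    for c, (mu, var, prior) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = np.log(prior) + log_lik
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```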

Bayes Classifier is Optimal! Learn h : X → Y, where X is the features and Y the target classes. Suppose you know the true P(Y | X). Bayes classifier: h_Bayes(x) = argmax_y P(Y = y | X = x). Why?

Optimal classification. Theorem: the Bayes classifier h_Bayes is optimal! That is, error_true(h_Bayes) ≤ error_true(h) for all h. Why? p_h(error) = ∫_x p_h(error | x) p(x) = ∫_x Σ_y 1[h(x) ≠ y] p(y | x) p(x).
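A short sketch of the argument, in my own wording but following the slide's setup: for any classifier h, the conditional error at x is one minus the probability it assigns to its prediction, and the Bayes classifier makes that as small as possible.

```latex
p_h(\mathrm{error} \mid x)
  = \sum_{y} \mathbf{1}[\,h(x) \neq y\,]\, p(y \mid x)
  = 1 - p\bigl(h(x) \mid x\bigr)
  \;\ge\; 1 - \max_{y}\, p(y \mid x)
  = p_{h_{\mathrm{Bayes}}}(\mathrm{error} \mid x)
```

Integrating both sides against p(x) gives error_true(h_Bayes) ≤ error_true(h) for every h.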

What you need to know about Naïve Bayes. Naïve Bayes classifier: what the assumption is, why we use it, how we learn it, why Bayesian estimation is important. Text classification: bag of words model. Gaussian NB: features are still conditionally independent; each feature has a Gaussian distribution given the class. Optimal decision using the Bayes Classifier.