Machine Learning: Classification and the Bayes Classifier. Representing data: choosing a hypothesis class. Learning: h: X → Y. Eric Xing


Machine Learning 10-701/15-781, Spring 2008
Naïve Bayes Classifier
Eric Xing
Lecture 3, January 23, 2006
Reading: Chap. 4 CB and handouts

Classification
Representing data: choosing a hypothesis class
Learning: h: X → Y
X: features; Y: target classes

Suppose you know the following:
Class-conditional distributions, P(X | Y):
  p(X | Y = 1) = p(X; μ_1, Σ_1)
  p(X | Y = 2) = p(X; μ_2, Σ_2)
Class prior (i.e., "weight"): P(Y)
Together these form a generative model of the data, and the Bayes classifier predicts the class with the largest posterior P(Y | X).

Optimal classification
Theorem: the Bayes classifier is optimal; that is, no other classifier achieves a lower probability of classification error.
Proof: recall last lecture.
How do we learn a Bayes classifier?
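As a concrete illustration of this decision rule (not from the slides), the sketch below classifies a point using known Gaussian class-conditionals; the means, covariances, and priors are made-up values for illustration.

```python
# Minimal sketch of a Bayes classifier with known Gaussian class-conditionals.
# The parameter values below are illustrative assumptions, not from the lecture.
import numpy as np
from scipy.stats import multivariate_normal

mu = {1: np.array([0.0, 0.0]), 2: np.array([2.0, 2.0])}   # class means (assumed)
Sigma = {1: np.eye(2), 2: np.eye(2)}                       # class covariances (assumed)
prior = {1: 0.6, 2: 0.4}                                   # class prior P(Y) (assumed)

def bayes_classify(x):
    # Pick the class with the largest posterior; P(X) is a shared constant,
    # so comparing P(X | Y = y) * P(Y = y) across classes is enough.
    scores = {y: multivariate_normal.pdf(x, mean=mu[y], cov=Sigma[y]) * prior[y]
              for y in prior}
    return max(scores, key=scores.get)

print(bayes_classify(np.array([0.5, 0.2])))   # -> 1
print(bayes_classify(np.array([1.8, 2.1])))   # -> 2
```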

Recall Bayes rule:
  P(Y | X) = P(X | Y) P(Y) / P(X)
which is shorthand for:
  P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / P(X = x_j)
Equivalently, since P(X = x_j) is just a normalizing constant,
  P(Y = y_i | X = x_j) ∝ P(X = x_j | Y = y_i) P(Y = y_i),
which is the common abbreviation P(Y | X) ∝ P(X | Y) P(Y).
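A short worked example may help make the shorthand concrete; the numbers below are invented for illustration (the spam/"Rolex" setting echoes the example used later in the lecture).

```latex
% Worked Bayes-rule example with invented numbers (not from the slides).
% Let Y = 1 denote "spam" and X = 1 denote "the word 'Rolex' appears".
\begin{align*}
P(Y=1) &= 0.2, \quad P(X=1 \mid Y=1) = 0.3, \quad P(X=1 \mid Y=0) = 0.01 \\
P(X=1) &= P(X=1 \mid Y=1)P(Y=1) + P(X=1 \mid Y=0)P(Y=0)
        = 0.3 \cdot 0.2 + 0.01 \cdot 0.8 = 0.068 \\
P(Y=1 \mid X=1) &= \frac{P(X=1 \mid Y=1)\,P(Y=1)}{P(X=1)} = \frac{0.06}{0.068} \approx 0.88
\end{align*}
```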

Learning a Bayes classifier
Training data: a set of (X, Y) pairs.
Learning = estimating P(X | Y) and P(Y).
Classification = using Bayes rule to calculate P(Y | X_new).
How hard is it to learn the optimal classifier? How do we represent these distributions, and how many parameters do they need?
  Prior, P(Y): suppose Y is composed of k classes; this needs k − 1 parameters.
  Likelihood, P(X | Y): suppose X is composed of n binary features; a full joint table needs k(2^n − 1) parameters.
This is a complex model: high variance with limited data!

Naïve Bayes: assume that X_i and X_j are conditionally independent given Y, for all i ≠ j.

Conditional independence
X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z, which we often write as
  P(X | Y, Z) = P(X | Z).
Equivalent to:
  P(X, Y | Z) = P(X | Z) P(Y | Z).

Summary
The Bayes classifier is the best classifier: it minimizes the probability of classification error.
Nonparametric vs. parametric classifiers: a nonparametric classifier does not rely on any assumption concerning the structure of the underlying density function. A classifier becomes the Bayes classifier if the density estimates converge to the true densities as an infinite number of samples is used. The resulting error is the Bayes error, the smallest achievable error given the underlying distributions.

The Naïve Bayes assumption
Naïve Bayes assumption: features are independent given the class:
  P(X_1, X_2 | Y) = P(X_1 | X_2, Y) P(X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
More generally:
  P(X_1, ..., X_n | Y) = ∏_i P(X_i | Y)
How many parameters now? Suppose X is composed of n binary features: only n parameters per class (plus the prior), instead of 2^n − 1 per class.
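To make the parameter-count comparison concrete, here is a small arithmetic sketch; the feature and class counts are arbitrary illustrative choices.

```python
# Parameter counts for a full joint model vs. Naive Bayes, for k classes and
# n binary features. The values of n and k below are illustrative assumptions.
n, k = 30, 2

full_joint = k * (2**n - 1)   # one probability table per class over all 2^n feature vectors
naive_bayes = k * n           # one Bernoulli parameter P(X_i = 1 | Y = y) per feature per class
prior = k - 1                 # parameters for P(Y)

print(f"full joint:  {full_joint + prior:,} parameters")
print(f"naive Bayes: {naive_bayes + prior:,} parameters")
# -> full joint: 2,147,483,647 parameters; naive Bayes: 61 parameters
```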

The Naïve Bayes classifier
Given:
  Prior P(Y)
  n conditionally independent features X given the class Y
  For each X_i, the likelihood P(X_i | Y)
Decision rule:
  y* = argmax_y P(Y = y) ∏_i P(X_i = x_i | Y = y)
If the assumption holds, NB is the optimal classifier!

Naïve Bayes algorithm
Train_Naïve_Bayes(examples):
  for each* value y_k, estimate P(Y = y_k)
  for each* value x_ij of each attribute X_i, estimate P(X_i = x_ij | Y = y_k)
Classify(X_new): apply the decision rule above (see the sketch below).
* Probabilities must sum to 1, so we need to estimate only n − 1 parameters.
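A minimal sketch of the decision rule, assuming the probability tables P(Y) and P(X_i | Y) have already been estimated; the table values here are invented for illustration.

```python
# Sketch of the Naive Bayes decision rule with hand-set probability tables.
# All numbers are illustrative assumptions, not estimates from real data.
import math

prior = {"spam": 0.2, "ham": 0.8}                       # P(Y)
likelihood = {                                          # P(X_i = 1 | Y), one Bernoulli per word
    "spam": {"rolex": 0.30, "meeting": 0.05},
    "ham":  {"rolex": 0.01, "meeting": 0.40},
}

def nb_classify(features):
    """features: dict mapping word -> 0/1 indicator for this document."""
    best_y, best_logp = None, -math.inf
    for y, p_y in prior.items():
        logp = math.log(p_y)
        for word, x in features.items():
            p = likelihood[y][word]
            logp += math.log(p if x == 1 else 1.0 - p)   # log P(X_i = x | Y = y)
        if logp > best_logp:
            best_y, best_logp = y, logp
    return best_y

print(nb_classify({"rolex": 1, "meeting": 0}))   # -> 'spam'
```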

Learning NB: parameter estimation
Maximum Likelihood Estimate (MLE): choose the θ that maximizes the probability of the observed data D:
  θ_MLE = argmax_θ P(D | θ)
Maximum a Posteriori (MAP) estimate: choose the θ that is most probable given the prior and the data:
  θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)

MLE for the parameters of NB
Discrete features: maximum likelihood estimates (MLEs). Given a dataset, let Count(A = a, B = b) be the number of examples where A = a and B = b. Then
  P̂(Y = y) = Count(Y = y) / N
  P̂(X_i = x | Y = y) = Count(X_i = x, Y = y) / Count(Y = y)
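The counting is simple enough to sketch directly; the toy dataset below is invented for illustration.

```python
# Sketch of MLE for Naive Bayes parameters by counting (toy invented data).
from collections import Counter

# Each example: (features dict, label). Entirely made-up toy data.
data = [
    ({"rolex": 1, "meeting": 0}, "spam"),
    ({"rolex": 1, "meeting": 0}, "spam"),
    ({"rolex": 0, "meeting": 1}, "ham"),
    ({"rolex": 0, "meeting": 1}, "ham"),
    ({"rolex": 0, "meeting": 0}, "ham"),
]

label_counts = Counter(y for _, y in data)
N = len(data)
prior = {y: c / N for y, c in label_counts.items()}                   # P(Y=y) = Count(Y=y)/N

feature_counts = Counter()                                            # Count(X_i = 1, Y = y)
for x, y in data:
    for word, val in x.items():
        feature_counts[(word, y)] += val

likelihood = {(word, y): feature_counts[(word, y)] / label_counts[y]  # P(X_i = 1 | Y = y)
              for word in ("rolex", "meeting") for y in label_counts}

print(prior)        # {'spam': 0.4, 'ham': 0.6}
print(likelihood)   # e.g. P(rolex=1 | spam) = 1.0, P(rolex=1 | ham) = 0.0
```

Note that the last line already shows the zero-probability problem discussed next: a word never seen with a class gets probability exactly 0.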

Subtleties of the NB classifier, 1: violating the NB assumption
Often the X_i are not really conditionally independent. We use Naïve Bayes in many cases anyway, and it often works pretty well: often the right classification, even when not the right probability (see [Domingos & Pazzani, 1996]). But the resulting probabilities P(Y | X_new) are biased toward 1 or 0 (why?).

Subtleties of the NB classifier, 2: insufficient training data
What if you never see a training instance where X_1000 = a when Y = b? E.g., Y = {spam email or not}, X_1000 = {contains "Rolex"}; then the MLE gives
  P(X_1000 = T | Y = T) = 0.
Thus, no matter what values X_1, ..., X_n take,
  P(Y = T | X_1, X_2, ..., X_1000 = T, ..., X_n) = 0,
because the single zero factor wipes out the whole product. What now?

MAP for the parameters of NB
Discrete features: maximum a posteriori (MAP) estimates, given a prior. Consider a binary feature; θ is a Bernoulli rate with a Beta prior:
  P(θ; α_T, α_F) = Γ(α_T + α_F) / (Γ(α_T) Γ(α_F)) · θ^(α_T − 1) (1 − θ)^(α_F − 1) = θ^(α_T − 1) (1 − θ)^(α_F − 1) / B(α_T, α_F)
Let β_a = Count(X = a), the number of examples where X = a.

Bayesian learning for NB parameters, a.k.a. smoothing
Posterior distribution of θ (Bernoulli likelihood with a Beta prior; analogously, Multinomial with a Dirichlet prior):
  P(θ | D) ∝ θ^(β_T + α_T − 1) (1 − θ)^(β_F + α_F − 1), i.e., Beta(β_T + α_T, β_F + α_F)
MAP estimate:
  θ_MAP = (β_T + α_T − 1) / (β_T + β_F + α_T + α_F − 2)
The Beta prior is equivalent to extra thumbtack flips. As N → ∞, the prior is forgotten; but for small sample sizes, the prior is important!
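The one-line derivation behind the smoothed estimate, written out for a single binary feature (standard Beta-Bernoulli algebra, not transcribed from the slides):

```latex
% Beta-Bernoulli posterior and MAP estimate for a single binary feature.
% beta_T, beta_F are observed counts; alpha_T, alpha_F are prior pseudo-counts.
\begin{align*}
P(\theta \mid D) &\propto P(D \mid \theta)\,P(\theta)
  \propto \theta^{\beta_T}(1-\theta)^{\beta_F}\cdot\theta^{\alpha_T-1}(1-\theta)^{\alpha_F-1}
  = \theta^{\beta_T+\alpha_T-1}(1-\theta)^{\beta_F+\alpha_F-1} \\
\hat{\theta}_{\mathrm{MAP}} &= \frac{\beta_T+\alpha_T-1}{\beta_T+\beta_F+\alpha_T+\alpha_F-2}
  \qquad\text{(with } \alpha_T=\alpha_F=2 \text{ this is Laplace ``add-one'' smoothing)}
\end{align*}
```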

MAP for the parameters of NB
Dataset of N examples.
Let β_iab = Count(X_i = a, Y = b), the number of examples where X_i = a and Y = b.
Let γ_b = Count(Y = b).
Priors:
  Q(X_i | Y) ~ Multinomial(α_i1, ..., α_iK), or Multinomial(α/K)
  Q(Y) ~ Multinomial(τ_1, ..., τ_M), or Multinomial(τ/M), i.e., m virtual examples
MAP estimate: combine the observed counts with the prior's virtual counts before normalizing.
Now, even if you never observe a feature/class combination, the posterior probability is never zero.

Text classification
Classify e-mails: Y = {Spam, NotSpam}
Classify news articles: Y = {what is the topic of the article?}
Classify webpages: Y = {student, professor, project, ...}
What about the features X? The text!

Features X are the entire document: X_i is the i-th word in the article.

NB for text classification
P(X | Y) is huge! An article has at least 1000 words, X = {X_1, ..., X_1000}, and X_i represents the i-th word in the document, i.e., the domain of X_i is the entire vocabulary, e.g., Webster's Dictionary (or more), 10,000 words, etc. The NB assumption helps a lot: P(X_i = x_i | Y = y) is just the probability of observing word x_i in a document on topic y.

Bag of words model
Typical additional assumption: position in the document doesn't matter:
  P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y)
This is the bag of words model: the order of words on the page is ignored. It sounds really silly, but it often works very well!
Example sentence: "When the lecture is over, remember to wake up the person sitting next to you in the lecture room."
As a bag of words (order discarded): in is lecture lecture next over person remember room sitting the the the to to up wake when you

Bag of words approach: represent a document by its word counts, e.g.
  aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, Zaire 0
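A tiny sketch of turning text into bag-of-words counts; the tokenization (lowercasing, keeping only letters) is a simplifying assumption of mine, not something specified in the slides.

```python
# Sketch: bag-of-words representation (word order discarded, only counts kept).
# The tokenization (lowercase, letters only) is a simplifying assumption.
import re
from collections import Counter

def bag_of_words(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

sentence = ("When the lecture is over, remember to wake up the person "
            "sitting next to you in the lecture room.")
print(bag_of_words(sentence))
# Counter({'the': 3, 'lecture': 2, 'to': 2, 'when': 1, 'is': 1, ...})
```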

NB with bag of words for text classification
Learning phase:
  Prior P(Y): count how many documents you have from each topic (+ prior).
  P(X_i | Y): for each topic, count how many times you saw each word in documents of that topic (+ prior).
Test phase:
  For each document, use the naïve Bayes decision rule (see the sketch below).
Twenty Newsgroups results.
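Putting the pieces together, here is a compact sketch of the learning and test phases with add-one smoothing and log-probabilities; the two-document training set is invented for illustration.

```python
# Sketch: multinomial Naive Bayes for text with add-one (Laplace) smoothing.
# Training documents and labels are invented toy data.
import math
from collections import Counter, defaultdict

train = [("buy cheap rolex watches now", "spam"),
         ("lecture room changed for the machine learning class", "notspam")]

# Learning phase: count documents per class and word occurrences per class.
doc_counts = Counter(y for _, y in train)
word_counts = defaultdict(Counter)
for text, y in train:
    word_counts[y].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def log_posterior(text, y):
    # log P(Y=y) + sum_i log P(word_i | Y=y), with add-one smoothing over the vocabulary.
    logp = math.log(doc_counts[y] / len(train))
    total = sum(word_counts[y].values())
    for w in text.split():
        logp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
    return logp

def classify(text):
    return max(doc_counts, key=lambda y: log_posterior(text, y))

print(classify("cheap rolex now"))           # -> 'spam'
print(classify("machine learning lecture"))  # -> 'notspam'
```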

Learning curve for Twenty Newsgroups.

What if we have continuous X_i?
E.g., character recognition: X_i is the i-th pixel.
Gaussian Naïve Bayes (GNB): model each feature as
  P(X_i = x | Y = y_k) = N(x; μ_ik, σ_ik)
Sometimes we assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).

Estimating parameters: Y discrete, X_i continuous
Maximum likelihood estimates (with j indexing training examples, and δ(x) = 1 if x is true, else 0):
  μ̂_ik = Σ_j X_i^j δ(Y^j = y_k) / Σ_j δ(Y^j = y_k)
  σ̂_ik² = Σ_j (X_i^j − μ̂_ik)² δ(Y^j = y_k) / Σ_j δ(Y^j = y_k)
Gaussian Naïve Bayes.
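A minimal sketch of these estimates and the resulting GNB decision rule; the tiny dataset is invented, and the variance floor is an assumption added to avoid division by zero.

```python
# Sketch: Gaussian Naive Bayes with per-class, per-feature mean/variance MLEs.
# The tiny dataset is invented; a small variance floor avoids division by zero.
import numpy as np

X = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 4.2], [3.1, 3.8]])  # toy continuous features
y = np.array([0, 0, 1, 1])                                      # toy class labels

classes = np.unique(y)
prior = {k: np.mean(y == k) for k in classes}
mu = {k: X[y == k].mean(axis=0) for k in classes}                # mu_ik
var = {k: X[y == k].var(axis=0) + 1e-9 for k in classes}         # sigma_ik^2 (MLE, floored)

def log_gaussian(x, m, v):
    return -0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)

def gnb_classify(x):
    # argmax_k  log P(Y=k) + sum_i log N(x_i; mu_ik, sigma_ik^2)
    scores = {k: np.log(prior[k]) + log_gaussian(x, mu[k], var[k]).sum() for k in classes}
    return max(scores, key=scores.get)

print(gnb_classify(np.array([1.1, 2.0])))   # -> 0
print(gnb_classify(np.array([2.9, 4.0])))   # -> 1
```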

Example: GNB for classifying mental states [Mitchell et al.]
~1 mm resolution, ~2 images per sec., 15,000 voxels/image; non-invasive and safe; measures the Blood Oxygen Level Dependent (BOLD) response.
[Figure: typical impulse response over ~10 sec.]
Brain scans can track activation with precision and sensitivity [Mitchell et al.].

Gaussian Naïve Bayes: learned μ_{voxel,word} for P(BrainActivity | WordCategory = {People, Animal}) [Mitchell et al.]
Learned Bayes models: means for P(BrainActivity | WordCategory) [Mitchell et al.]
[Figure: learned mean activations for People words vs. Animal words.]
Pairwise classification accuracy: 85%.

What you need to know about Naïve Bayes
Optimal decisions using the Bayes classifier.
The Naïve Bayes classifier: what the assumption is, why we use it, how we learn it, and why Bayesian estimation is important.
Text classification: the bag of words model.
Gaussian NB: features are still conditionally independent; each feature has a Gaussian distribution given the class.