CSC 411: Lecture 09: Naive Bayes

CSC 411: Lecture 09: Naive Bayes
Class based on Raquel Urtasun & Rich Zemel's lectures
Sanja Fidler, University of Toronto
Feb 8, 2015

Today
Classification
Multi-dimensional (Gaussian) Bayes classifier
Estimate probability densities from data
Naive Bayes classifier

Generative vs Discriminative
Two approaches to classification:
Discriminative classifiers estimate the parameters of the decision boundary / class separator directly from labeled examples:
learn p(y | x) directly (logistic regression models)
learn mappings from inputs to classes (least-squares, neural nets)
Generative approach: model the distribution of inputs characteristic of each class (Bayes classifier):
build a model of p(x | y)
apply Bayes' rule

Bayes Classifier
Aim: diagnose whether a patient has diabetes, i.e., classify into one of two classes (yes C = 1; no C = 0).
Run a battery of tests. Given the patient's results x = [x_1, x_2, \dots, x_d]^T, we want to update the class probabilities using Bayes' rule:

p(C | x) = \frac{p(x | C) \, p(C)}{p(x)}

More formally: posterior = (class likelihood × prior) / evidence.
How can we compute p(x) for the two-class case?

p(x) = p(x | C = 0) p(C = 0) + p(x | C = 1) p(C = 1)
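
As a small numerical illustration (not part of the original slides), the two-class posterior can be evaluated directly once we commit to some class-conditional model; the 1-D Gaussian likelihoods and the prior below are made-up values for the diabetes example:

```python
from scipy.stats import norm

def posterior_c1(x, prior_c1=0.3):
    """p(C = 1 | x) via Bayes' rule, under assumed 1-D Gaussian likelihoods."""
    lik_c1 = norm.pdf(x, loc=52.0, scale=8.0)   # p(x | C = 1), hypothetical parameters
    lik_c0 = norm.pdf(x, loc=40.0, scale=6.0)   # p(x | C = 0), hypothetical parameters
    evidence = lik_c1 * prior_c1 + lik_c0 * (1 - prior_c1)   # p(x), two-class case
    return lik_c1 * prior_c1 / evidence

print(posterior_c1(48.0))  # p(C = 1 | x = 48) under the assumed model
```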

Classification: Diabetes Example
Last class we had a single observation per patient: the white blood cell count.

p(C = 1 | x = 48) = \frac{p(x = 48 | C = 1) \, p(C = 1)}{p(x = 48)}

Add a second observation: the plasma glucose value. Now our input x is 2-dimensional.

Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
Gaussian Discriminant Analysis in its general form assumes that p(x | t) is distributed according to a multivariate normal (Gaussian) distribution.
Multivariate Gaussian distribution:

p(x | t = k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right]

where |\Sigma_k| denotes the determinant of the matrix and d is the dimension of x.
Each class k has an associated mean vector \mu_k and covariance matrix \Sigma_k.
Typically the classes share a single covariance matrix \Sigma ("share" means that they have the same parameters, the covariance matrix in this case):

\Sigma = \Sigma_1 = \dots = \Sigma_k

Multivariate Data
Multiple measurements (sensors): d inputs/features/attributes, N instances/observations/examples.

X =
\begin{bmatrix}
x_1^{(1)} & x_2^{(1)} & \cdots & x_d^{(1)} \\
x_1^{(2)} & x_2^{(2)} & \cdots & x_d^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
x_1^{(N)} & x_2^{(N)} & \cdots & x_d^{(N)}
\end{bmatrix}

Multivariate Parameters
Mean: E[x] = [\mu_1, \dots, \mu_d]^T
Covariance:

\Sigma = \mathrm{Cov}(x) = E[(x - \mu)(x - \mu)^T] =
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{bmatrix}

Correlation: \mathrm{Corr}(x) is the covariance divided by the product of the standard deviations,

\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}

Multivariate Gaussian Distribution
x \sim \mathcal{N}(\mu, \Sigma), a Gaussian (or normal) distribution, is defined as

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right]

The Mahalanobis distance (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) measures the distance from x to \mu in terms of \Sigma.
It normalizes for differences in variances and correlations.
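
A minimal NumPy sketch of evaluating this density (the helper name mvn_logpdf and the toy parameters are ours, not from the slides); working in log space and solving against \Sigma rather than inverting it explicitly is the usual numerically stable choice:

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log density of the multivariate Gaussian N(mu, Sigma) at x."""
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # Mahalanobis distance (squared)
    _, logdet = np.linalg.slogdet(Sigma)         # log |Sigma|
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Toy example with made-up parameters
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(mvn_logpdf(np.array([1.0, -1.0]), mu, Sigma))
```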

Bivariate Normal

\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\Sigma = 0.5 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\Sigma = 2 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}

Figure: probability density function and contour plot of the pdf for each covariance.

Bivariate Normal

\mathrm{var}(x_1) = \mathrm{var}(x_2), \quad \mathrm{var}(x_1) > \mathrm{var}(x_2), \quad \mathrm{var}(x_1) < \mathrm{var}(x_2)

Figure: probability density function and contour plot of the pdf for each case.

Bivariate Normal

\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}

Figure: probability density function and contour plot of the pdf for each covariance.

Bivariate Normal

\mathrm{Cov}(x_1, x_2) = 0, \quad \mathrm{Cov}(x_1, x_2) > 0, \quad \mathrm{Cov}(x_1, x_2) < 0

Figure: probability density function and contour plot of the pdf for each case.

Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
The GDA (GBC) decision boundary is based on the class posterior:

\log p(t_k | x) = \log p(x | t_k) + \log p(t_k) - \log p(x)
               = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log p(t_k) - \log p(x)

Decision: take the class with the highest posterior probability.
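
A hedged sketch of this decision rule, assuming the per-class means, covariances, and priors are already given (the function names are ours). Since \log p(x) is the same for every class, it is dropped from the scores:

```python
import numpy as np

def gda_log_posterior(x, mus, Sigmas, priors):
    """Per-class scores log p(x | t_k) + log p(t_k); log p(x) is omitted (same for all k)."""
    scores = []
    for mu, Sigma, prior in zip(mus, Sigmas, priors):
        d = mu.shape[0]
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        maha = diff @ np.linalg.solve(Sigma, diff)
        scores.append(-0.5 * (d * np.log(2 * np.pi) + logdet + maha) + np.log(prior))
    return np.array(scores)

def gda_predict(x, mus, Sigmas, priors):
    # Decision: take the class with the highest (log) posterior
    return int(np.argmax(gda_log_posterior(x, mus, Sigmas, priors)))
```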

Decision Boundary
Figure: class likelihoods, the discriminant P(t_1 | x) = 0.5, and the posterior for t_1.

Decision Boundary when Shared Covariance Matrix
Figure: decision boundary when the classes share a single covariance matrix.

Learning
Learn the parameters using maximum likelihood:

\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{n=1}^{N} p(x^{(n)}, t^{(n)} | \phi, \mu_0, \mu_1, \Sigma)
                                 = \log \prod_{n=1}^{N} p(x^{(n)} | t^{(n)}, \mu_0, \mu_1, \Sigma) \, p(t^{(n)} | \phi)

What have we assumed?

More on MLE
Assume the prior is Bernoulli (we have two classes):

p(t | \phi) = \phi^t (1 - \phi)^{1-t}

You can compute the ML estimates in closed form:

\phi = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 1]

\mu_0 = \frac{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 0] \, x^{(n)}}{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 0]}

\mu_1 = \frac{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 1] \, x^{(n)}}{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = 1]}

\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x^{(n)} - \mu_{t^{(n)}}) (x^{(n)} - \mu_{t^{(n)}})^T
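
These closed-form estimates are a few lines of NumPy; the sketch below covers the two-class, shared-covariance case and assumes 0/1 labels (gda_fit is our name, not from the slides):

```python
import numpy as np

def gda_fit(X, t):
    """ML estimates for two-class GDA with a shared covariance.
    X: (N, d) array of inputs; t: (N,) array of 0/1 labels."""
    N, d = X.shape
    phi = np.mean(t == 1)                              # Bernoulli prior parameter
    mu0 = X[t == 0].mean(axis=0)                       # class-0 mean
    mu1 = X[t == 1].mean(axis=0)                       # class-1 mean
    diffs = X - np.where(t[:, None] == 1, mu1, mu0)    # x^(n) - mu_{t^(n)}
    Sigma = diffs.T @ diffs / N                        # shared covariance
    return phi, mu0, mu1, Sigma
```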

Gaussian Discriminant Analysis vs Logistic Regression
If you examine p(t = 1 | x) under GDA, you will find that it looks like this:

p(t | x, \phi, \mu_0, \mu_1, \Sigma) = \frac{1}{1 + \exp(-w^T x)}

where w is an appropriate function of (\phi, \mu_0, \mu_1, \Sigma).
So the decision boundary has the same form as logistic regression!
When should we prefer GDA to LR, and vice versa?

Gaussian Discriminant Analysis vs Logistic Regression
GDA makes the stronger modeling assumption: it assumes the class-conditional data is multivariate Gaussian.
If this is true, GDA is asymptotically efficient (the best model in the limit of large N).
But LR is more robust and less sensitive to incorrect modeling assumptions.
Many class-conditional distributions lead to a logistic classifier.
When these distributions are non-Gaussian, in the limit of large N, LR beats GDA.

Simplifying the Model
What if x is high-dimensional?
For the Gaussian Bayes classifier, if the input x is high-dimensional, then the covariance matrix has many parameters.
Save some parameters by using a shared covariance for the classes.
Any other ideas you can think of?

Naive Bayes
Naive Bayes is an alternative generative model: it assumes the features are independent given the class,

p(x | t = k) = \prod_{i=1}^{d} p(x_i | t = k)

Assuming the likelihoods are Gaussian, how many parameters are required for the Naive Bayes classifier?
Important note: Naive Bayes does not assume a particular distribution for the per-feature likelihoods.

Naive Bayes Classifier
Given the prior p(t = k) and, assuming the features are conditionally independent given the class, a likelihood p(x_i | t = k) for each x_i, the decision rule is

y = \arg\max_k \; p(t = k) \prod_{i=1}^{d} p(x_i | t = k)

If the assumption of conditional independence holds, NB is the optimal classifier.
If not, it is a heavily regularized version of the generative classifier. What's the regularization?
Note: NB's assumption (conditional independence) typically does not hold in practice. However, the resulting algorithm still works well on many problems, and it typically serves as a decent baseline for more sophisticated models.

Gaussian Naive Bayes
The Gaussian Naive Bayes classifier assumes that the likelihoods are Gaussian:

p(x_i | t = k) = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\left[ -\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2} \right]

(this is just a 1-dimensional Gaussian, one for each input dimension).
The model is the same as Gaussian Discriminant Analysis with a diagonal covariance matrix.
Maximum likelihood estimates of the parameters:

\mu_{ik} = \frac{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = k] \, x_i^{(n)}}{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = k]}

\sigma_{ik}^2 = \frac{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = k] \, (x_i^{(n)} - \mu_{ik})^2}{\sum_{n=1}^{N} \mathbb{1}[t^{(n)} = k]}
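
Putting the estimates and the decision rule together, a minimal Gaussian Naive Bayes in NumPy might look as follows (the class and method names are ours; variances are the MLE, i.e. normalized by the class count, matching the formulas above):

```python
import numpy as np

class GaussianNaiveBayes:
    """Per-class, per-dimension Gaussian likelihoods with MLE parameters."""

    def fit(self, X, t):
        self.classes = np.unique(t)
        self.priors = np.array([np.mean(t == k) for k in self.classes])
        self.mu = np.array([X[t == k].mean(axis=0) for k in self.classes])   # mu_{ik}, shape (K, d)
        self.var = np.array([X[t == k].var(axis=0) for k in self.classes])   # sigma_{ik}^2, shape (K, d)
        return self

    def predict(self, X):
        # log p(x_i | t = k) for every example, class, and dimension: shape (N, K, d)
        log_lik = -0.5 * (np.log(2 * np.pi * self.var)[None, :, :]
                          + (X[:, None, :] - self.mu[None, :, :]) ** 2 / self.var[None, :, :])
        # log p(t = k) + sum_i log p(x_i | t = k): shape (N, K)
        scores = np.log(self.priors)[None, :] + log_lik.sum(axis=2)
        return self.classes[np.argmax(scores, axis=1)]
```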

Decision Boundary: Shared Variances (between Classes)
Figure: decision boundary when the variances are shared between classes (the variances may differ across input dimensions).

Decision Boundary: isotropic
Same variance across all classes and input dimensions, and all class priors equal.
Classification then depends only on the distance to the class mean. Why?

Decision Boundary: isotropic
In this case \sigma_{i,k} = \sigma (just one parameter), and the class priors are equal (e.g., p(t_k) = 0.5 for the 2-class case).
Going back to the class posterior for GDA:

\log p(t_k | x) = \log p(x | t_k) + \log p(t_k) - \log p(x)
               = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log p(t_k) - \log p(x)

where we take \Sigma_k = \sigma^2 I and ignore terms that don't depend on k (they don't matter when we take the max over classes):

\log p(t_k | x) = -\frac{1}{2\sigma^2} (x - \mu_k)^T (x - \mu_k)
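
In code, the isotropic, equal-prior case collapses to nearest-mean classification; a brief sketch under those assumptions (the function name is ours):

```python
import numpy as np

def isotropic_predict(X, mus):
    """Pick the class whose mean is closest in Euclidean distance.
    X: (N, d) inputs; mus: (K, d) class means."""
    # The shared factor 1/(2*sigma^2) is identical for all classes,
    # so it does not affect the argmin.
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    return np.argmin(d2, axis=1)
```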

Spam Classification
You have examples of emails that are spam and non-spam.
How would you classify spam vs non-spam?
Think about it at home; solution in the next tutorial.