Naive Bayes and Gaussian Bayes Classifier


Naive Bayes and Gaussian Bayes Classifier
Mengye Ren
mren@cs.toronto.edu
October 18, 2015

Naive Bayes

Bayes rule:
\[ p(t \mid x) = \frac{p(x \mid t)\, p(t)}{p(x)} \]

Naive Bayes assumption:
\[ p(x \mid t) = \prod_{j=1}^{D} p(x_j \mid t) \]

Likelihood function:
\[ L(\theta) = p(x, t \mid \theta) = p(x \mid t, \theta)\, p(t \mid \theta) \]

Example: Spam Classification

Each vocabulary word is one feature dimension. We encode each email as a feature vector \( x \in \{0, 1\}^{|V|} \), where \( x_j = 1 \) iff vocabulary word \( j \) appears in the email.

We want to model the probability of any word \( x_j \) appearing in an email, given that the email is spam or not. Example words: "$10,000", "Toronto", "Piazza", etc.

Idea: use a Bernoulli distribution to model \( p(x_j \mid t) \). Example: \( p(\text{``\$10,000''} \mid \text{spam}) = 0.3 \).
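
To make the encoding concrete, here is a minimal sketch (not the lecture's demo code) of the binary bag-of-words representation described above; the vocabulary and the example email are made up for illustration.

```python
import numpy as np

# Hypothetical vocabulary; in practice |V| is the full vocabulary size.
vocab = ["$10,000", "toronto", "piazza", "free", "meeting"]

def encode(email_tokens, vocab):
    """x_j = 1 iff vocabulary word j appears in the email."""
    tokens = set(t.lower() for t in email_tokens)
    return np.array([1 if w.lower() in tokens else 0 for w in vocab])

x = encode(["Claim", "your", "$10,000", "prize", "for", "free"], vocab)
print(x)  # [1 0 0 1 0]
```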

Bernoulli Naive Bayes

Assuming all data points \( x^{(i)} \) are i.i.d. samples and \( p(x_j \mid t) \) follows a Bernoulli distribution with parameter \( \mu_{jt} \),

\[
\prod_{i=1}^{N} p\left(x^{(i)}, t^{(i)}\right)
= \prod_{i=1}^{N} p\left(t^{(i)}\right) p\left(x^{(i)} \mid t^{(i)}\right)
= \prod_{i=1}^{N} p\left(t^{(i)}\right) \prod_{j=1}^{D} \mu_{j t^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{j t^{(i)}}\right)^{1 - x_j^{(i)}}
\]

where \( p(t) = \pi_t \). The parameters \( \pi_t \) and \( \mu_{jt} \) can be learnt using maximum likelihood.

Derivation of maximum likelihood estimator (MLE)

\( \theta = [\mu, \pi] \)

\[
\log L(\theta) = \log p(x, t \mid \theta)
= \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} + \sum_{j=1}^{D} x_j^{(i)} \log \mu_{j t^{(i)}} + \left(1 - x_j^{(i)}\right) \log\left(1 - \mu_{j t^{(i)}}\right) \right]
\]

Want: \( \arg\max_\theta \log L(\theta) \) subject to \( \sum_k \pi_k = 1 \).

Derivation of maximum likelihood estimator (MLE)

Take the derivative with respect to \( \mu \) and set it to zero:

\[
\frac{\partial \log L(\theta)}{\partial \mu_{jk}}
= \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0
\]

\[
\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left[ x_j^{(i)} \left(1 - \mu_{jk}\right) - \left(1 - x_j^{(i)}\right) \mu_{jk} \right] = 0
\]

\[
\mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}
\]

Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive \( \pi \):

\[
\frac{\partial}{\partial \pi_k} \left[ \log L(\theta) + \lambda \left( \sum_{\kappa} \pi_{\kappa} - 1 \right) \right]
= \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \frac{1}{\pi_k} + \lambda = 0
\quad\Rightarrow\quad
\pi_k = -\frac{1}{\lambda} \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)
\]

Apply the constraint \( \sum_k \pi_k = 1 \):

\[
\lambda = -N, \qquad \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}{N}
\]
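
The two closed-form estimators above translate directly into code. Below is a minimal sketch under the stated model, assuming a binary data matrix X of shape N x D and integer labels t in {0, ..., K-1}; the function and variable names (bernoulli_nb_mle, pi, mu) are mine, not the lecture's.

```python
import numpy as np

def bernoulli_nb_mle(X, t, K, eps=1e-12):
    """Closed-form MLE for Bernoulli naive Bayes: pi_k and mu_jk."""
    N, D = X.shape
    pi = np.zeros(K)        # pi_k
    mu = np.zeros((K, D))   # mu_jk, stored as mu[k, j]
    for k in range(K):
        mask = (t == k)
        Nk = mask.sum()
        pi[k] = Nk / N                             # pi_k = sum_i 1(t=k) / N
        mu[k] = X[mask].sum(axis=0) / (Nk + eps)   # mu_jk = sum_i 1(t=k) x_j / sum_i 1(t=k)
    return pi, mu                                  # eps only guards against an empty class
```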

Spam Classification Demo
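
The lecture's demo itself is not reproduced here; the following sketch shows what a minimal version could look like, reusing bernoulli_nb_mle from the previous sketch and scoring each class by log pi_k plus the sum of per-feature Bernoulli log-likelihoods. The toy data is made up.

```python
import numpy as np

def bernoulli_nb_predict(X, pi, mu, eps=1e-12):
    mu = np.clip(mu, eps, 1 - eps)
    # log-joint per class: log pi_k + sum_j [x_j log mu_jk + (1 - x_j) log(1 - mu_jk)]
    log_joint = np.log(pi) + X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T
    return np.argmax(log_joint, axis=1)

X = np.array([[1, 0, 0, 1, 0],   # spam-looking
              [0, 0, 1, 0, 1],   # ham-looking
              [1, 1, 0, 1, 0]])
t = np.array([1, 0, 1])
pi, mu = bernoulli_nb_mle(X, t, K=2)
print(bernoulli_nb_predict(X, pi, mu))  # [1 0 1] on the training points
```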

Gaussian Bayes Classifier

Instead of assuming conditional independence of the \( x_j \), we model \( p(x \mid t) \) as a multivariate Gaussian distribution; the dependence among the \( x_j \) is encoded in the covariance matrix.

Multivariate Gaussian distribution:
\[
f(x) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
\]
where \( \mu \) is the mean, \( \Sigma \) is the covariance matrix, and \( D = \dim(x) \).
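
As a sanity check of the formula, here is a direct (unoptimized) translation into code; this is my own sketch, not part of the slides, and the test values are arbitrary.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma), with D = x.size."""
    D = x.size
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

x = np.array([0.5, -0.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_density(x, mu, Sigma))
```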

Derivation of maximum likelihood estimator (MLE)

\( \theta = [\mu, \Sigma, \pi] \), \( Z_t = \sqrt{(2\pi)^D \det(\Sigma_t)} \)

\[
p(x \mid t) = \frac{1}{Z_t} \exp\left( -\frac{1}{2} (x - \mu_t)^T \Sigma_t^{-1} (x - \mu_t) \right)
\]

\[
\log L(\theta) = \log p(x, t \mid \theta) = \log p(t \mid \theta) + \log p(x \mid t, \theta)
= \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} - \log Z_{t^{(i)}} - \frac{1}{2} \left( x^{(i)} - \mu_{t^{(i)}} \right)^T \Sigma_{t^{(i)}}^{-1} \left( x^{(i)} - \mu_{t^{(i)}} \right) \right]
\]

Want: \( \arg\max_\theta \log L(\theta) \) subject to \( \sum_k \pi_k = 1 \).

Derivation of maximum likelihood estimator (MLE)

Take the derivative with respect to \( \mu \):

\[
\frac{\partial \log L}{\partial \mu_k}
= \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \Sigma_k^{-1} \left( x^{(i)} - \mu_k \right) = 0
\quad\Rightarrow\quad
\mu_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}
\]

Derivation of maximum likelihood estimator (MLE)

Take the derivative with respect to \( \Sigma^{-1} \) (not \( \Sigma \)). Useful identities:

\[
\frac{\partial \det(A)}{\partial A} = \det(A)\, A^{-T}, \qquad
\det\left(A^{-1}\right) = \det(A)^{-1}, \qquad
\frac{\partial\, x^T A x}{\partial A} = x x^T, \qquad
\Sigma^T = \Sigma
\]

\[
\frac{\partial \log L}{\partial \Sigma_k^{-1}}
= \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left[ -\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} - \frac{1}{2} \left( x^{(i)} - \mu_k \right)\left( x^{(i)} - \mu_k \right)^T \right] = 0
\]

Derivation of maximum likelihood estimator (MLE)

\[
\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}}, \qquad
Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)} = (2\pi)^{D/2} \det(\Sigma_k)^{1/2} = (2\pi)^{D/2} \det\left(\Sigma_k^{-1}\right)^{-1/2}
\]

\[
\frac{\partial Z_k}{\partial \Sigma_k^{-1}}
= -\frac{1}{2} (2\pi)^{D/2} \det\left(\Sigma_k^{-1}\right)^{-3/2} \det\left(\Sigma_k^{-1}\right) \Sigma_k^{T}
\quad\Rightarrow\quad
\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = -\frac{1}{2} \Sigma_k
\]

\[
\frac{\partial \log L}{\partial \Sigma_k^{-1}}
= \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} \left( x^{(i)} - \mu_k \right)\left( x^{(i)} - \mu_k \right)^T \right] = 0
\]

\[
\Sigma_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left( x^{(i)} - \mu_k \right)\left( x^{(i)} - \mu_k \right)^T}{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}
\]

Derivation of maximum likelihood estimator (MLE)

\[
\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}{N}
\]

(Same as in the Bernoulli case.)
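
Putting the three estimators together, here is a minimal sketch (mine, not the lecture's) of the closed-form Gaussian MLE, assuming X is an N x D real matrix and t holds integer labels in {0, ..., K-1}.

```python
import numpy as np

def gaussian_bayes_mle(X, t, K):
    """Per-class priors, means, and covariances via the closed-form MLE."""
    N, D = X.shape
    pi = np.zeros(K)
    mu = np.zeros((K, D))
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        Xk = X[t == k]
        Nk = len(Xk)
        pi[k] = Nk / N                   # pi_k = sum_i 1(t=k) / N
        mu[k] = Xk.mean(axis=0)          # mu_k = class mean
        diff = Xk - mu[k]
        Sigma[k] = diff.T @ diff / Nk    # Sigma_k = class scatter / N_k
    return pi, mu, Sigma
```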

Gaussian Bayes Classifier Demo
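
As with the spam demo, the lecture's demo is not reproduced here; a minimal prediction sketch could look like the following, reusing gaussian_bayes_mle from the sketch above and scoring log pi_k - (1/2) log det Sigma_k - (1/2)(x - mu_k)^T Sigma_k^{-1} (x - mu_k) per class (the shared (2 pi)^{D/2} factor is dropped). The toy data is synthetic.

```python
import numpy as np

def gaussian_bayes_predict(X, pi, mu, Sigma):
    N, K = X.shape[0], len(pi)
    scores = np.zeros((N, K))
    for k in range(K):
        diff = X - mu[k]
        _, logdet = np.linalg.slogdet(Sigma[k])
        # Mahalanobis distance of each row to class k
        maha = np.einsum('nd,dc,nc->n', diff, np.linalg.inv(Sigma[k]), diff)
        scores[:, k] = np.log(pi[k]) - 0.5 * logdet - 0.5 * maha
    return np.argmax(scores, axis=1)

np.random.seed(0)
X = np.vstack([np.random.randn(20, 2) + [0, 0], np.random.randn(20, 2) + [3, 3]])
t = np.array([0] * 20 + [1] * 20)
pi, mu, Sigma = gaussian_bayes_mle(X, t, K=2)
print((gaussian_bayes_predict(X, pi, mu, Sigma) == t).mean())  # close to 1.0 on this toy data
```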

Gaussian Bayes Classifier

If we constrain \( \Sigma_t \) to be diagonal, then \( p(x \mid t) \) factorizes into a product of the \( p(x_j \mid t) \):

\[
p(x \mid t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\sum_{j=1}^{D} \frac{(x_j - \mu_{jt})^2}{2\, \Sigma_{t,jj}} \right)
= \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\, \Sigma_{t,jj}}} \exp\left( -\frac{(x_j - \mu_{jt})^2}{2\, \Sigma_{t,jj}} \right)
= \prod_{j=1}^{D} p(x_j \mid t)
\]

A diagonal covariance matrix satisfies the naive Bayes assumption.
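
A quick numerical check of this factorization (my own, not from the slides): with a diagonal covariance, the joint density equals the product of the univariate densities. The numbers are arbitrary.

```python
import numpy as np

mu = np.array([1.0, -2.0, 0.5])
var = np.array([0.5, 2.0, 1.5])   # diagonal of Sigma_t
x = np.array([0.8, -1.5, 0.0])

# joint density with diagonal Sigma_t
Sigma = np.diag(var)
D = x.size
joint = (np.exp(-0.5 * (x - mu) @ np.linalg.solve(Sigma, x - mu))
         / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma)))
# product of univariate Gaussian densities p(x_j | t)
product = np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))
print(np.isclose(joint, product))  # True
```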

Gaussian Bayes Classifier

Case 1: the covariance matrix is shared among classes,
\[ p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma) \]

Case 2: each class has its own covariance,
\[ p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma_t) \]

Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary is where \( p(t = 1 \mid x) = p(t = 0 \mid x) \):

\[
\log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1)
= \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)
\]

\[
C + x^T \Sigma^{-1} x - 2 \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1
= x^T \Sigma^{-1} x - 2 \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0
\]

\[
\left[ 2 (\mu_0 - \mu_1)^T \Sigma^{-1} \right] x - (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 + \mu_1) = -C
\]

\[
a^T x - b = 0
\]

where \( C = 2 \log(\pi_0 / \pi_1) \) collects the log-prior terms. The decision boundary is a linear function of \( x \) (a hyperplane in general).
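
A small sketch (mine, not the lecture's) that computes a and b explicitly and verifies numerically that a point on a^T x = b has equal class log-joints; the parameter values are arbitrary examples.

```python
import numpy as np

def linear_boundary(mu0, mu1, Sigma, pi0, pi1):
    """Coefficients of the shared-covariance boundary a^T x - b = 0."""
    Sinv = np.linalg.inv(Sigma)
    a = 2 * Sinv @ (mu1 - mu0)
    b = mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0 - 2 * np.log(pi1 / pi0)
    return a, b

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])
a, b = linear_boundary(mu0, mu1, Sigma, pi0=0.6, pi1=0.4)

def log_joint(x, mu, pi):
    # log pi - 0.5 (x - mu)^T Sigma^{-1} (x - mu); the shared normalizer cancels
    d = x - mu
    return np.log(pi) - 0.5 * d @ np.linalg.solve(Sigma, d)

x = np.array([0.0, b / a[1]])  # a point satisfying a^T x = b (assumes a[1] != 0)
print(np.isclose(log_joint(x, mu0, 0.6), log_joint(x, mu1, 0.4)))  # True
```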

Relation to Logistic Regression

With a shared covariance, we can write the posterior distribution \( p(t = 0 \mid x) \) as

\[
p(t = 0 \mid x) = \frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)}
= \frac{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma) + \pi_1\, \mathcal{N}(x \mid \mu_1, \Sigma)}
\]

\[
= \left\{ 1 + \frac{\pi_1}{\pi_0} \exp\left[ -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right] \right\}^{-1}
\]

\[
= \left\{ 1 + \exp\left[ \log \frac{\pi_1}{\pi_0} + (\mu_1 - \mu_0)^T \Sigma^{-1} x + \frac{1}{2} \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) \right] \right\}^{-1}
\]

\[
= \frac{1}{1 + \exp(-w^T x - b)}
\]

which is exactly the logistic regression form, with \( w = \Sigma^{-1}(\mu_0 - \mu_1) \) and \( b = \log\frac{\pi_0}{\pi_1} + \frac{1}{2}\left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) \).
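
To see the correspondence numerically, the sketch below (mine, not from the slides) builds w and b from the Gaussian parameters as in the last line above and checks that the sigmoid of w^T x + b matches the posterior p(t = 0 | x) computed directly from the two Gaussians; the parameter values are arbitrary.

```python
import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])
pi0, pi1 = 0.6, 0.4
Sinv = np.linalg.inv(Sigma)

w = Sinv @ (mu0 - mu1)
b = np.log(pi0 / pi1) + 0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0)

def gauss(x, mu):
    # unnormalized N(x | mu, Sigma); the shared normalizer cancels in the posterior
    d = x - mu
    return np.exp(-0.5 * d @ Sinv @ d)

x = np.array([0.7, -0.3])
posterior = pi0 * gauss(x, mu0) / (pi0 * gauss(x, mu0) + pi1 * gauss(x, mu1))
sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + b)))
print(np.isclose(posterior, sigmoid))  # True
```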

Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the boundary \( p(t = 1 \mid x) = p(t = 0 \mid x) \) gives

\[
\log \pi_1 - \log Z_1 - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)
= \log \pi_0 - \log Z_0 - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)
\]

\[
x^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) x
- 2 \left( \mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \right) x
+ \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) = C
\]

\[
x^T Q x - 2 b^T x + c = 0
\]

where the constant \( C = 2 \log(\pi_1 / \pi_0) + \log\frac{\det \Sigma_0}{\det \Sigma_1} \) absorbs the priors and normalizers. The decision boundary is a quadratic function; in the 2-d case it looks like an ellipse, a parabola, or a hyperbola.
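
Here is a sketch (mine) of Q, b, and c, with a numerical check that the quadratic form equals twice the difference of class log-joints, so it vanishes exactly on the decision boundary; the parameter values are arbitrary.

```python
import numpy as np

mu0, mu1 = np.zeros(2), np.array([2.0, 1.0])
Sigma0 = np.eye(2)
Sigma1 = np.array([[2.0, 0.0], [0.0, 0.5]])
pi0, pi1 = 0.5, 0.5

S0inv, S1inv = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)
Q = S1inv - S0inv
b = S1inv @ mu1 - S0inv @ mu0
c = (mu1 @ S1inv @ mu1 - mu0 @ S0inv @ mu0
     - 2 * np.log(pi1 / pi0) - np.log(np.linalg.det(Sigma0) / np.linalg.det(Sigma1)))

def log_joint(x, mu, Sigma, pi):
    # log pi - log Z - 0.5 (x - mu)^T Sigma^{-1} (x - mu), D = 2 here
    d = x - mu
    return (np.log(pi) - 0.5 * np.log((2 * np.pi) ** 2 * np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.solve(Sigma, d))

x = np.array([1.0, -0.5])
q = x @ Q @ x - 2 * b @ x + c
print(np.isclose(q, 2 * (log_joint(x, mu0, Sigma0, pi0) - log_joint(x, mu1, Sigma1, pi1))))  # True
```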

Thanks!