Naive Bayes and Gaussian Bayes Classifier
Mengye Ren
mren@cs.toronto.edu
October 18, 2015
Naive Bayes

Bayes rule:
$$p(t \mid x) = \frac{p(x \mid t)\, p(t)}{p(x)}$$
Naive Bayes assumption:
$$p(x \mid t) = \prod_{j=1}^{D} p(x_j \mid t)$$
Likelihood function:
$$L(\theta) = p(x, t \mid \theta) = p(x \mid t, \theta)\, p(t \mid \theta)$$
Example: Spam Classification

Each vocabulary word is one feature dimension.
We encode each email as a binary feature vector $x \in \{0, 1\}^{|V|}$, where $x_j = 1$ iff vocabulary word $j$ appears in the email.
We want to model the probability of any word $x_j$ appearing in an email, given that the email is spam or not.
Example words: $10,000, Toronto, Piazza, etc.
Idea: use a Bernoulli distribution to model $p(x_j \mid t)$.
Example: $p(\text{``\$10,000''} \mid \text{spam}) = 0.3$
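As a minimal sketch of this encoding (the vocabulary and email below are made up for illustration, not from the lecture's demo):

```python
import numpy as np

# Hypothetical vocabulary; a real spam filter would use thousands of words.
vocab = ["$10,000", "toronto", "piazza", "free", "winner"]

def encode(email_words, vocab):
    """Binary bag-of-words: x_j = 1 iff vocabulary word j appears."""
    words = set(email_words)
    return np.array([1 if w in words else 0 for w in vocab])

x = encode(["you", "are", "a", "winner", "of", "$10,000"], vocab)
# x = [1, 0, 0, 0, 1]: only "$10,000" and "winner" are in the vocabulary.
```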
Bernoulli Naive Bayes

Assuming all data points $x^{(i)}$ are i.i.d. samples, and $p(x_j \mid t)$ follows a Bernoulli distribution with parameter $\mu_{jt}$:
$$\prod_{i=1}^{N} p(x^{(i)}, t^{(i)}) = \prod_{i=1}^{N} p(t^{(i)}) \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}$$
where $p(t) = \pi_t$. The parameters $\pi_t, \mu_{jt}$ can be learnt using maximum likelihood.
Derivation of maximum likelihood estimator (MLE)

$\theta = [\mu, \pi]$
$$\log L(\theta) = \log p(x, t \mid \theta) = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} + \sum_{j=1}^{D} x_j^{(i)} \log \mu_{jt^{(i)}} + \left(1 - x_j^{(i)}\right) \log\left(1 - \mu_{jt^{(i)}}\right) \right]$$
Want: $\arg\max_\theta \log L(\theta)$ subject to $\sum_k \pi_k = 1$
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. $\mu_{jk}$ and set it to zero:
$$\frac{\partial \log L(\theta)}{\partial \mu_{jk}} = \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left[ \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right] = 0$$
$$\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left[ (1 - \mu_{jk})\, x_j^{(i)} - \mu_{jk} \left(1 - x_j^{(i)}\right) \right] = 0$$
$$\mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}$$
Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive $\pi_k$:
$$\frac{\partial}{\partial \pi_k} \left[ \log L(\theta) + \lambda \left( \sum_{\kappa} \pi_\kappa - 1 \right) \right] = 0 \;\Rightarrow\; \frac{1}{\pi_k} \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) + \lambda = 0$$
$$\pi_k = -\frac{1}{\lambda} \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)$$
Apply the constraint $\sum_k \pi_k = 1$: $\lambda = -N$, so
$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}{N}$$
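The two closed-form estimators above are just per-class counts. A minimal numpy sketch (the toy data is made up; this is not the lecture's demo code):

```python
import numpy as np

def fit_bernoulli_nb(X, t, n_classes):
    """Closed-form MLE: mu[j, k] is the fraction of class-k examples with
    x_j = 1, and pi[k] is the fraction of examples in class k."""
    N, D = X.shape
    mu = np.zeros((D, n_classes))
    pi = np.zeros(n_classes)
    for k in range(n_classes):
        mask = (t == k)
        pi[k] = mask.mean()            # sum of 1(t=k) over N
        mu[:, k] = X[mask].mean(axis=0)  # per-feature count ratio within class k
    return mu, pi

# Toy data: 4 emails, 2 features, labels 0 = ham, 1 = spam.
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
t = np.array([1, 1, 0, 0])
mu, pi = fit_bernoulli_nb(X, t, 2)
# pi = [0.5, 0.5]; mu[0, 1] = 1.0 since both spam emails contain word 0.
```

Note that the raw MLE can produce $\mu_{jk}$ exactly 0 or 1 (as for word 0 here), which makes the likelihood vanish on unseen combinations; in practice a smoothing prior is usually added.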
Spam Classification Demo
Gaussian Bayes Classifier

Instead of assuming conditional independence of the $x_j$, we model $p(x \mid t)$ as a Gaussian distribution; the dependence between the $x_j$ is encoded in the covariance matrix.

Multivariate Gaussian distribution:
$$f(x) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
$\mu$: mean, $\Sigma$: covariance matrix, $D$: $\dim(x)$
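The density formula above can be evaluated directly in numpy. A small sketch (parameter values are arbitrary); at $x = \mu$ the exponent vanishes, so the density reduces to the normalizer $1/\sqrt{(2\pi)^D \det(\Sigma)}$:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density, following the slide's formula."""
    D = len(mu)
    Z = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    d = x - mu
    # solve() applies Sigma^{-1} without forming the inverse explicitly.
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / Z

mu = np.zeros(2)
Sigma = np.diag([1.0, 4.0])   # det(Sigma) = 4
val = gaussian_density(mu, mu, Sigma)
# val = 1 / sqrt((2*pi)^2 * 4) = 1 / (4*pi)
```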
Derivation of maximum likelihood estimator (MLE)

$\theta = [\mu, \Sigma, \pi]$, $\quad Z = \sqrt{(2\pi)^D \det(\Sigma)}$
$$p(x \mid t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
$$\log L(\theta) = \log p(x, t \mid \theta) = \log p(t \mid \theta) + \log p(x \mid t, \theta)$$
$$= \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} - \log Z_{t^{(i)}} - \frac{1}{2} \left( x^{(i)} - \mu_{t^{(i)}} \right)^T \Sigma_{t^{(i)}}^{-1} \left( x^{(i)} - \mu_{t^{(i)}} \right) \right]$$
Want: $\arg\max_\theta \log L(\theta)$ subject to $\sum_k \pi_k = 1$
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. $\mu_k$:
$$\frac{\partial \log L}{\partial \mu_k} = \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \Sigma^{-1} \left( x^{(i)} - \mu_k \right) = 0$$
$$\mu_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}$$
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. $\Sigma^{-1}$ (not $\Sigma$). Note:
$$\frac{\partial \det(A)}{\partial A} = \det(A)\, A^{-T}, \qquad \det(A)^{-1} = \det\left(A^{-1}\right), \qquad \frac{\partial\, x^T A x}{\partial A} = x x^T, \qquad \Sigma^T = \Sigma$$
$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left[ -\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} - \frac{1}{2} \left(x^{(i)} - \mu_k\right) \left(x^{(i)} - \mu_k\right)^T \right] = 0$$
Derivation of maximum likelihood estimator (MLE)

$$\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}}, \qquad Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)} = (2\pi)^{D/2} \det\left(\Sigma_k^{-1}\right)^{-1/2}$$
$$\frac{\partial Z_k}{\partial \Sigma_k^{-1}} = -\frac{1}{2} (2\pi)^{D/2} \det\left(\Sigma_k^{-1}\right)^{-3/2} \det\left(\Sigma_k^{-1}\right) \Sigma_k^T \;\Rightarrow\; \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = -\frac{1}{2} \Sigma_k$$
$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \right] = 0$$
$$\Sigma_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right) \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T}{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}$$
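The Gaussian estimators above are the per-class sample mean and the per-class scatter matrix (note the division by the class count, not the class count minus one, since this is the biased MLE). A numpy sketch on made-up data:

```python
import numpy as np

def fit_gaussian_classes(X, t, n_classes):
    """Per-class MLE: mu_k is the class mean, Sigma_k the mean outer product
    of the centered class examples, exactly as in the derivation."""
    params = []
    for k in range(n_classes):
        Xk = X[t == k]
        mu_k = Xk.mean(axis=0)
        d = Xk - mu_k
        Sigma_k = d.T @ d / len(Xk)   # sum of (x - mu_k)(x - mu_k)^T over class k
        params.append((mu_k, Sigma_k))
    return params

X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [7.0, 5.0]])
t = np.array([0, 0, 1, 1])
(mu0, S0), (mu1, S1) = fit_gaussian_classes(X, t, 2)
# mu0 = [1, 0], mu1 = [6, 5]; each class varies only in the first coordinate.
```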
Derivation of maximum likelihood estimator (MLE)

$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left(t^{(i)} = k\right)}{N}$$
(Same as Bernoulli.)
Gaussian Bayes Classifier Demo
Gaussian Bayes Classifier

If we constrain $\Sigma_t$ to be diagonal, then $p(x \mid t)$ factorizes into a product of the $p(x_j \mid t)$:
$$p(x \mid t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\frac{1}{2} (x - \mu_t)^T \Sigma_t^{-1} (x - \mu_t) \right) = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi \Sigma_{t,jj}}} \exp\left( -\frac{(x_j - \mu_{jt})^2}{2 \Sigma_{t,jj}} \right) = \prod_{j=1}^{D} p(x_j \mid t)$$
A diagonal covariance matrix satisfies the naive Bayes assumption.
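The factorization above can be confirmed numerically: for a diagonal covariance, the joint density equals the product of per-dimension univariate Gaussians. A quick check with arbitrary parameter values:

```python
import numpy as np

mu = np.array([1.0, -2.0, 0.5])
var = np.array([0.5, 2.0, 1.0])          # diagonal entries of Sigma
x = np.array([0.3, -1.0, 1.2])

D = len(mu)
Sigma = np.diag(var)
d = x - mu

# Full multivariate density with the diagonal covariance matrix.
joint = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) \
        / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

# Product of D univariate Gaussian densities, one per dimension.
per_dim = np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))
# joint and per_dim agree to machine precision
```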
Gaussian Bayes Classifier

Case 1: The covariance matrix is shared among classes:
$$p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma)$$
Case 2: Each class has its own covariance:
$$p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma_t)$$
Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary satisfies $p(t = 1 \mid x) = p(t = 0 \mid x)$:
$$\log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)$$
Expanding both sides, the quadratic terms cancel:
$$-2 \log \pi_1 + x^T \Sigma^{-1} x - 2 \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 = -2 \log \pi_0 + x^T \Sigma^{-1} x - 2 \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0$$
$$\left[ 2 (\mu_0 - \mu_1)^T \Sigma^{-1} \right] x + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 - 2 \log \frac{\pi_1}{\pi_0} = 0$$
$$a^T x - b = 0$$
The decision boundary is a linear function of $x$ (a hyperplane in general).
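A numerical sanity check of this cancellation, with arbitrary made-up parameter values: the difference of the two class discriminants is exactly the linear function $a^T x - b$ at every point.

```python
import numpy as np

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # shared covariance
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
pi0, pi1 = 0.6, 0.4
Si = np.linalg.inv(Sigma)

def log_disc(x, mu, pi):
    """Log discriminant log(pi) - 0.5 (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return np.log(pi) - 0.5 * d @ Si @ d

# Linear coefficients obtained by expanding the equality of discriminants.
a = Si @ (mu1 - mu0)
b = 0.5 * (mu1 @ Si @ mu1 - mu0 @ Si @ mu0) - np.log(pi1 / pi0)

x = np.array([0.7, -1.3])
diff = log_disc(x, mu1, pi1) - log_disc(x, mu0, pi0)
# diff equals a @ x - b, so the boundary {diff = 0} is the hyperplane a^T x = b.
```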
Relation to Logistic Regression

We can write the posterior distribution $p(t = 0 \mid x)$ as
$$\frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma) + \pi_1\, \mathcal{N}(x \mid \mu_1, \Sigma)}$$
$$= \left\{ 1 + \frac{\pi_1}{\pi_0} \exp\left[ -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right] \right\}^{-1}$$
$$= \left\{ 1 + \exp\left[ \log \frac{\pi_1}{\pi_0} + (\mu_1 - \mu_0)^T \Sigma^{-1} x + \frac{1}{2} \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) \right] \right\}^{-1}$$
$$= \frac{1}{1 + \exp(-w^T x - b)}$$
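This identity is easy to verify numerically: with shared covariance, the Bayes posterior computed from the two Gaussians coincides with a logistic sigmoid of a linear function of $x$. All parameter values below are arbitrary:

```python
import numpy as np

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
mu0, mu1 = np.array([-1.0, 0.5]), np.array([2.0, -0.5])
pi0, pi1 = 0.7, 0.3
Si = np.linalg.inv(Sigma)

def gauss(x, mu):
    """Shared-covariance Gaussian density N(x | mu, Sigma)."""
    d = x - mu
    D = len(mu)
    return np.exp(-0.5 * d @ Si @ d) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

x = np.array([0.4, 0.9])
# Posterior p(t = 0 | x) via Bayes rule.
posterior0 = pi0 * gauss(x, mu0) / (pi0 * gauss(x, mu0) + pi1 * gauss(x, mu1))

# Same posterior in logistic-regression form: sigmoid(w^T x + b).
w = Si @ (mu0 - mu1)
b = np.log(pi0 / pi1) - 0.5 * (mu0 @ Si @ mu0 - mu1 @ Si @ mu1)
sigmoid0 = 1.0 / (1.0 + np.exp(-(w @ x + b)))
# posterior0 and sigmoid0 agree to machine precision
```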
Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary satisfies $p(t = 1 \mid x) = p(t = 0 \mid x)$:
$$\log \pi_1 - \log Z_1 - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \log \pi_0 - \log Z_0 - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)$$
Now the quadratic terms no longer cancel. Collecting the prior and normalizer terms into a constant $C$:
$$x^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) x - 2 \left( \mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \right) x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) = C$$
$$x^T Q x - 2 b^T x + c = 0$$
The decision boundary is a quadratic function. In the 2-d case it looks like an ellipse, a parabola, or a hyperbola.
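A quick check of the quadratic expansion, again with arbitrary parameter values: the difference of the two Mahalanobis terms equals the expanded form $x^T Q x - 2 b^T x + c$ at any point.

```python
import numpy as np

S0 = np.array([[1.0, 0.0], [0.0, 2.0]])
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S0i, S1i = np.linalg.inv(S0), np.linalg.inv(S1)

def quad(x, mu, Si):
    """Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ Si @ d

x = np.array([1.1, -0.4])
# Difference of the two Mahalanobis terms ...
lhs = quad(x, mu1, S1i) - quad(x, mu0, S0i)
# ... equals the expanded quadratic form with these coefficients.
Q = S1i - S0i
b = S1i @ mu1 - S0i @ mu0
c = mu1 @ S1i @ mu1 - mu0 @ S0i @ mu0
rhs = x @ Q @ x - 2 * b @ x + c
```

When $\Sigma_1 = \Sigma_0$, the matrix $Q$ vanishes and the boundary degenerates to the linear case above.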
Thanks!