Naive Bayes and Gaussian Bayes Classifier

1 Naive Bayes and Gaussian Bayes Classifier

Elias Tragas, October 3, 2016

2 Naive Bayes

Bayes rule:

    p(t \mid x) = \frac{p(x \mid t)\, p(t)}{p(x)}

Naive Bayes assumption:

    p(x \mid t) = \prod_{j=1}^{D} p(x_j \mid t)

Likelihood function:

    L(\theta) = p(x, t \mid \theta) = p(x \mid t, \theta)\, p(t \mid \theta)

3 Example: Spam Classification

Each vocabulary word is one feature dimension.
We encode each email as a feature vector x \in \{0, 1\}^{|V|}, with x_j = 1 iff vocabulary word j appears in the email.

4 Example: Spam Classification

Each vocabulary word is one feature dimension.
We encode each email as a feature vector x \in \{0, 1\}^{|V|}, with x_j = 1 iff vocabulary word j appears in the email.
We want to model the probability of any word x_j appearing in an email, given that the email is spam or not.
Example words: $10,000, Toronto, Piazza, etc.

5 Example: Spam Classification

Each vocabulary word is one feature dimension.
We encode each email as a feature vector x \in \{0, 1\}^{|V|}, with x_j = 1 iff vocabulary word j appears in the email.
We want to model the probability of any word x_j appearing in an email, given that the email is spam or not.
Example words: $10,000, Toronto, Piazza, etc.
Idea: use a Bernoulli distribution to model p(x_j \mid t).
Example: p("$10,000" \mid \text{spam}) = 0.3

6 Bernoulli Naive Bayes

Assume all data points x^{(i)} are i.i.d. samples, and p(x_j \mid t) follows a Bernoulli distribution with parameter \mu_{jt}:

    p(x^{(i)} \mid t^{(i)}) = \prod_{j=1}^{D} \mu_{j t^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{j t^{(i)}}\right)^{1 - x_j^{(i)}}
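
As a concrete illustration of this product, here is a minimal NumPy sketch (not part of the original slides; the function name and array layout are hypothetical) that evaluates p(x^{(i)} | t^{(i)}) for every training example:

    import numpy as np

    def bernoulli_likelihood(X, t, mu):
        """p(x^(i) | t^(i)) for every example under the Bernoulli naive Bayes model.

        X  : (N, D) binary feature matrix
        t  : (N,) integer class labels
        mu : (D, K) Bernoulli parameters, mu[j, k] = p(x_j = 1 | t = k)
        """
        mu_t = mu[:, t].T                                 # (N, D): row i holds mu_{j, t_i}
        per_feature = np.where(X == 1, mu_t, 1.0 - mu_t)  # mu^x (1 - mu)^(1 - x) for binary x
        return per_feature.prod(axis=1)                   # (N,) product over the D features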

7 Bernoulli Naive Bayes

Assume all data points x^{(i)} are i.i.d. samples, and p(x_j \mid t) follows a Bernoulli distribution with parameter \mu_{jt}:

    p(x^{(i)} \mid t^{(i)}) = \prod_{j=1}^{D} \mu_{j t^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{j t^{(i)}}\right)^{1 - x_j^{(i)}}

    p(t, x) = \prod_{i=1}^{N} p(t^{(i)})\, p(x^{(i)} \mid t^{(i)}) = \prod_{i=1}^{N} p(t^{(i)}) \prod_{j=1}^{D} \mu_{j t^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{j t^{(i)}}\right)^{1 - x_j^{(i)}}

8 Bernoulli Naive Bayes

Assume all data points x^{(i)} are i.i.d. samples, and p(x_j \mid t) follows a Bernoulli distribution with parameter \mu_{jt}:

    p(x^{(i)} \mid t^{(i)}) = \prod_{j=1}^{D} \mu_{j t^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{j t^{(i)}}\right)^{1 - x_j^{(i)}}

    p(t, x) = \prod_{i=1}^{N} p(t^{(i)})\, p(x^{(i)} \mid t^{(i)}) = \prod_{i=1}^{N} p(t^{(i)}) \prod_{j=1}^{D} \mu_{j t^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{j t^{(i)}}\right)^{1 - x_j^{(i)}}

where p(t) = \pi_t. The parameters \pi_t and \mu_{jt} can be learnt using maximum likelihood.

9 Derivation of maximum likelihood estimator (MLE)

    \theta = [\mu, \pi]

    \log L(\theta) = \log p(x, t \mid \theta)

10 Derivation of maximum likelihood estimator (MLE)

    \theta = [\mu, \pi]

    \log L(\theta) = \log p(x, t \mid \theta) = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} + \sum_{j=1}^{D} x_j^{(i)} \log \mu_{j t^{(i)}} + \left(1 - x_j^{(i)}\right) \log\left(1 - \mu_{j t^{(i)}}\right) \right]

Want: \arg\max_\theta \log L(\theta) subject to \sum_k \pi_k = 1
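
A direct transcription of this objective as a NumPy sketch (illustrative names, not from the slides); it can be used to check that the closed-form estimates derived below do indeed increase the log-likelihood:

    import numpy as np

    def bernoulli_nb_log_likelihood(X, t, mu, pi, eps=1e-12):
        """log L(theta) = sum_i [ log pi_{t_i} + sum_j x_ij log mu_{j t_i} + (1 - x_ij) log(1 - mu_{j t_i}) ]."""
        mu = np.clip(mu, eps, 1.0 - eps)   # avoid log(0)
        mu_t = mu[:, t].T                  # (N, D) parameters for each example's class
        per_example = (np.log(pi)[t]
                       + (X * np.log(mu_t) + (1.0 - X) * np.log(1.0 - mu_t)).sum(axis=1))
        return per_example.sum()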

11 Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \mu:

    \frac{\partial \log L(\theta)}{\partial \mu_{jk}} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0

12 Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \mu:

    \frac{\partial \log L(\theta)}{\partial \mu_{jk}} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0

    \Rightarrow\; \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ x_j^{(i)} (1 - \mu_{jk}) - \left(1 - x_j^{(i)}\right) \mu_{jk} \right] = 0

13 Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \mu:

    \frac{\partial \log L(\theta)}{\partial \mu_{jk}} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0

    \Rightarrow\; \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ x_j^{(i)} (1 - \mu_{jk}) - \left(1 - x_j^{(i)}\right) \mu_{jk} \right] = 0

    \Rightarrow\; \mu_{jk} \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, x_j^{(i)}

14 Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \mu:

    \frac{\partial \log L(\theta)}{\partial \mu_{jk}} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0

    \Rightarrow\; \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ x_j^{(i)} (1 - \mu_{jk}) - \left(1 - x_j^{(i)}\right) \mu_{jk} \right] = 0

    \Rightarrow\; \mu_{jk} \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, x_j^{(i)}

    \Rightarrow\; \mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}

15 Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive \pi:

    \frac{\partial}{\partial \pi_k} \left[ \log L(\theta) + \lambda \left( \sum_{\kappa} \pi_{\kappa} - 1 \right) \right] = 0 \;\Rightarrow\; \lambda = -\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, \frac{1}{\pi_k}

16 Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive \pi:

    \frac{\partial}{\partial \pi_k} \left[ \log L(\theta) + \lambda \left( \sum_{\kappa} \pi_{\kappa} - 1 \right) \right] = 0 \;\Rightarrow\; \lambda = -\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, \frac{1}{\pi_k}

    \Rightarrow\; \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}{-\lambda}

17 Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive \pi:

    \frac{\partial}{\partial \pi_k} \left[ \log L(\theta) + \lambda \left( \sum_{\kappa} \pi_{\kappa} - 1 \right) \right] = 0 \;\Rightarrow\; \lambda = -\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, \frac{1}{\pi_k}

    \Rightarrow\; \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}{-\lambda}

Apply the constraint \sum_k \pi_k = 1 \;\Rightarrow\; -\lambda = N, so

    \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}{N}
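
A short sketch of these closed-form estimates (hypothetical function name; X is an N x D binary matrix, t holds labels in {0, ..., K-1}, and every class is assumed to appear at least once). In practice a Laplace smoothing count is often added to avoid zero probabilities, but it is omitted here to match the MLE on the slides:

    import numpy as np

    def fit_bernoulli_nb(X, t, K):
        """Closed-form MLE: mu[j, k] and pi[k] from the count formulas above."""
        N, D = X.shape
        mu = np.empty((D, K))
        pi = np.empty(K)
        for k in range(K):
            mask = (t == k)
            pi[k] = mask.sum() / N                       # pi_k = N_k / N
            mu[:, k] = X[mask].sum(axis=0) / mask.sum()  # mu_jk = co-occurrence count / N_k
        return mu, pi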

18 Sanity check

    \mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}

This means each word's probability under a class is the number of times the word and the class co-occurred, divided by the number of times the class appeared.

    \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}{N}

This means the prior probability of a class is the number of times it appeared, divided by the total number of training examples.

19 Tips

Remember that these symbols are supposed to mean something. When you're doing a derivation, focus on keeping the context of all the symbols you introduce; it will help you realize when your results are nonsense.

It's important to think about how this model will behave in the real world. For example, the naive Bayes classifier's output is based on a product of numbers x < 1. What does this say about the behaviour of the output over long sequences?

We encode our word vector with binary occurrence indicators, not the number of times each word appears. Why does this make sense to do? What are the implications of this approach?

20 Spam Classification Demo
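
The demo itself is not reproduced here. The sketch below (illustrative, assuming the fit_bernoulli_nb parameters from the previous sketch) shows how the learned parameters would typically be used at test time: score each class with \log \pi_k + \sum_j [x_j \log \mu_{jk} + (1 - x_j) \log(1 - \mu_{jk})] and take the argmax. Working in log space avoids the underflow from long products of numbers below 1 mentioned in the tips:

    import numpy as np

    def predict_bernoulli_nb(X, mu, pi, eps=1e-12):
        """Most probable class for each row of the binary matrix X, scored in log space."""
        mu = np.clip(mu, eps, 1.0 - eps)
        # log p(t = k | x) up to a constant:
        #   log pi_k + sum_j [ x_j log mu_jk + (1 - x_j) log(1 - mu_jk) ]
        log_post = X @ np.log(mu) + (1.0 - X) @ np.log(1.0 - mu) + np.log(pi)
        return log_post.argmax(axis=1)

A typical usage would be mu, pi = fit_bernoulli_nb(X_train, t_train, K=2) followed by predict_bernoulli_nb(X_test, mu, pi).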

21 Gaussian Bayes Classifier

Instead of assuming conditional independence of the x_j, we model p(x \mid t) as a Gaussian distribution, and the dependence relations among the x_j are encoded in the covariance matrix.

22 Gaussian Bayes Classifier

Instead of assuming conditional independence of the x_j, we model p(x \mid t) as a Gaussian distribution, and the dependence relations among the x_j are encoded in the covariance matrix.

Multivariate Gaussian distribution:

    f(x) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

\mu: mean, \Sigma: covariance matrix, D = \dim(x)
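
A small sketch (hypothetical helper, not from the slides) that evaluates this density in log form for numerical stability:

    import numpy as np

    def gaussian_log_density(x, mu, Sigma):
        """log N(x | mu, Sigma) for a single D-dimensional point x."""
        D = mu.shape[0]
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)        # numerically stable log det(Sigma)
        maha = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
        return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)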

23 Derivation of maximum likelihood estimator (MLE)

    \theta = [\mu, \Sigma, \pi], \quad Z = \sqrt{(2\pi)^D \det(\Sigma)}

    p(x \mid t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

24 Derivation of maximum likelihood estimator (MLE)

    \theta = [\mu, \Sigma, \pi], \quad Z = \sqrt{(2\pi)^D \det(\Sigma)}

    p(x \mid t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

    \log L(\theta) = \log p(x, t \mid \theta) = \log p(t \mid \theta) + \log p(x \mid t, \theta)

25 Derivation of maximum likelihood estimator (MLE)

    \theta = [\mu, \Sigma, \pi], \quad Z = \sqrt{(2\pi)^D \det(\Sigma)}

    p(x \mid t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

    \log L(\theta) = \log p(x, t \mid \theta) = \log p(t \mid \theta) + \log p(x \mid t, \theta)
                   = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} - \log Z_{t^{(i)}} - \frac{1}{2} \left(x^{(i)} - \mu_{t^{(i)}}\right)^T \Sigma_{t^{(i)}}^{-1} \left(x^{(i)} - \mu_{t^{(i)}}\right) \right]

Want: \arg\max_\theta \log L(\theta) subject to \sum_k \pi_k = 1

26 Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \mu:

    \frac{\partial \log L}{\partial \mu_k} = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, \Sigma_k^{-1} \left(x^{(i)} - \mu_k\right) = 0

27 Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \mu:

    \frac{\partial \log L}{\partial \mu_k} = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, \Sigma_k^{-1} \left(x^{(i)} - \mu_k\right) = 0

    \Rightarrow\; \mu_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}

28 Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \Sigma^{-1} (not \Sigma).

29 Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \Sigma^{-1} (not \Sigma). Note:

    \frac{\partial \det(A)}{\partial A} = \det(A)\, A^{-T}, \quad \det(A)^{-1} = \det(A^{-1}), \quad \frac{\partial\, x^T A x}{\partial A} = x x^T, \quad \Sigma^T = \Sigma

30 Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. \Sigma^{-1} (not \Sigma). Note:

    \frac{\partial \det(A)}{\partial A} = \det(A)\, A^{-T}, \quad \det(A)^{-1} = \det(A^{-1}), \quad \frac{\partial\, x^T A x}{\partial A} = x x^T, \quad \Sigma^T = \Sigma

    \frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ -\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} - \frac{1}{2} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \right] = 0

32 Derivation of maximum likelihood estimator (MLE)

    Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)}

    \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{(2\pi)^{D/2} \det(\Sigma_k)^{1/2}} \cdot \frac{\partial}{\partial \Sigma_k^{-1}} \left[ (2\pi)^{D/2} \det\left(\Sigma_k^{-1}\right)^{-1/2} \right]

33 Derivation of maximum likelihood estimator (MLE)

    Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)}

    \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{(2\pi)^{D/2} \det(\Sigma_k)^{1/2}} \cdot \frac{\partial}{\partial \Sigma_k^{-1}} \left[ (2\pi)^{D/2} \det\left(\Sigma_k^{-1}\right)^{-1/2} \right]

    = \det\left(\Sigma_k^{-1}\right)^{1/2} \left( -\frac{1}{2} \right) \det\left(\Sigma_k^{-1}\right)^{-3/2} \det\left(\Sigma_k^{-1}\right) \Sigma_k^T = -\frac{1}{2} \Sigma_k

34 Derivation of maximum likelihood estimator (MLE)

    Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)}

    \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{(2\pi)^{D/2} \det(\Sigma_k)^{1/2}} \cdot \frac{\partial}{\partial \Sigma_k^{-1}} \left[ (2\pi)^{D/2} \det\left(\Sigma_k^{-1}\right)^{-1/2} \right]

    = \det\left(\Sigma_k^{-1}\right)^{1/2} \left( -\frac{1}{2} \right) \det\left(\Sigma_k^{-1}\right)^{-3/2} \det\left(\Sigma_k^{-1}\right) \Sigma_k^T = -\frac{1}{2} \Sigma_k

    \frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \right] = 0

35 Derivation of maximum likelihood estimator (MLE)

    Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)}

    \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{(2\pi)^{D/2} \det(\Sigma_k)^{1/2}} \cdot \frac{\partial}{\partial \Sigma_k^{-1}} \left[ (2\pi)^{D/2} \det\left(\Sigma_k^{-1}\right)^{-1/2} \right]

    = \det\left(\Sigma_k^{-1}\right)^{1/2} \left( -\frac{1}{2} \right) \det\left(\Sigma_k^{-1}\right)^{-3/2} \det\left(\Sigma_k^{-1}\right) \Sigma_k^T = -\frac{1}{2} \Sigma_k

    \frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \right] = 0

    \Rightarrow\; \Sigma_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T}{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}

36 Derivation of maximum likelihood estimator (MLE)

    \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}{N} \quad \text{(same as Bernoulli)}
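
Putting the three estimators together, a minimal fitting sketch (illustrative names, not from the slides; X is a real-valued N x D matrix, t holds labels in {0, ..., K-1}). The ridge term is an optional extra, not part of the MLE, that keeps each covariance invertible when a class has few examples:

    import numpy as np

    def fit_gaussian_bayes(X, t, K, ridge=0.0):
        """MLE for the Gaussian Bayes classifier: per-class mean, covariance, and prior."""
        N, D = X.shape
        mus, Sigmas, pis = [], [], []
        for k in range(K):
            Xk = X[t == k]
            mu_k = Xk.mean(axis=0)
            diff = Xk - mu_k
            Sigma_k = diff.T @ diff / Xk.shape[0] + ridge * np.eye(D)
            mus.append(mu_k)
            Sigmas.append(Sigma_k)
            pis.append(Xk.shape[0] / N)
        return np.array(mus), np.array(Sigmas), np.array(pis)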

37 Gaussian Bayes Classifier Demo

38 Gaussian Bayes Classifier

If we constrain \Sigma to be diagonal, then we can rewrite p(x \mid t) as a product of p(x_j \mid t):

    p(x \mid t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\frac{1}{2} (x - \mu_t)^T \Sigma_t^{-1} (x - \mu_t) \right)

39 Gaussian Bayes Classifier

If we constrain \Sigma to be diagonal, then we can rewrite p(x \mid t) as a product of p(x_j \mid t):

    p(x \mid t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\frac{1}{2} (x - \mu_t)^T \Sigma_t^{-1} (x - \mu_t) \right)
               = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\, \Sigma_{t,jj}}} \exp\left( -\frac{(x_j - \mu_{jt})^2}{2\, \Sigma_{t,jj}} \right) = \prod_{j=1}^{D} p(x_j \mid t)

A diagonal covariance matrix satisfies the naive Bayes assumption.
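
A quick numerical check of this factorization (illustrative only; it reuses the gaussian_log_density sketch given earlier): with a diagonal covariance, the joint log-density equals the sum of D univariate Gaussian log-densities.

    import numpy as np

    # Diagonal covariance => the density factorizes over dimensions.
    rng = np.random.default_rng(0)
    D = 4
    mu = rng.normal(size=D)
    var = rng.uniform(0.5, 2.0, size=D)      # the diagonal entries Sigma_{t,jj}
    x = rng.normal(size=D)

    joint = gaussian_log_density(x, mu, np.diag(var))
    factored = np.sum(-0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var))
    assert np.allclose(joint, factored)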

40 Gaussian Bayes Classifier

Case 1: the covariance matrix is shared among classes:

    p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma)

Case 2: each class has its own covariance:

    p(x \mid t) = \mathcal{N}(x \mid \mu_t, \Sigma_t)

41 Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary is where p(t = 1 \mid x) = p(t = 0 \mid x), equivalently p(x, t = 1) = p(x, t = 0).

42 Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary is where p(t = 1 \mid x) = p(t = 0 \mid x):

    \log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)

43 Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary is where p(t = 1 \mid x) = p(t = 0 \mid x):

    \log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)

Multiplying by -2 and writing C = 2 \log(\pi_0 / \pi_1) for the prior terms:

    C + x^T \Sigma^{-1} x - 2 \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 = x^T \Sigma^{-1} x - 2 \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0

    \left[ 2 (\mu_0 - \mu_1)^T \Sigma^{-1} \right] x - \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) = -C

44 Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary is where p(t = 1 \mid x) = p(t = 0 \mid x):

    \log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)

Multiplying by -2 and writing C = 2 \log(\pi_0 / \pi_1) for the prior terms:

    C + x^T \Sigma^{-1} x - 2 \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 = x^T \Sigma^{-1} x - 2 \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0

    \left[ 2 (\mu_0 - \mu_1)^T \Sigma^{-1} \right] x - \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) = -C

    a^T x - b = 0

The decision boundary is a linear function of x (a hyperplane in general).
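
A short sketch (hypothetical helper, not from the slides) that computes the hyperplane coefficients from the fitted parameters, with the shared covariance assumed; points with a^T x - b > 0 fall on the class-1 side of the boundary:

    import numpy as np

    def shared_cov_boundary(mu0, mu1, Sigma, pi0, pi1):
        """Coefficients (a, b) of the linear boundary a^T x - b = 0 for a shared covariance.

        a^T x - b > 0 means class 1 has the larger joint probability p(x, t).
        """
        a = np.linalg.solve(Sigma, mu1 - mu0)                  # Sigma^{-1} (mu_1 - mu_0)
        b = 0.5 * (mu1 @ np.linalg.solve(Sigma, mu1)
                   - mu0 @ np.linalg.solve(Sigma, mu0)) - np.log(pi1 / pi0)
        return a, b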

45 Relation to Logistic Regression

We can write the posterior distribution p(t = 0 \mid x) as

    p(t = 0 \mid x) = \frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma) + \pi_1\, \mathcal{N}(x \mid \mu_1, \Sigma)}

46 Relation to Logistic Regression

We can write the posterior distribution p(t = 0 \mid x) as

    p(t = 0 \mid x) = \frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma) + \pi_1\, \mathcal{N}(x \mid \mu_1, \Sigma)}

    = \left\{ 1 + \frac{\pi_1}{\pi_0} \exp\left[ -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right] \right\}^{-1}

47 Relation to Logistic Regression

We can write the posterior distribution p(t = 0 \mid x) as

    p(t = 0 \mid x) = \frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma) + \pi_1\, \mathcal{N}(x \mid \mu_1, \Sigma)}

    = \left\{ 1 + \frac{\pi_1}{\pi_0} \exp\left[ -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right] \right\}^{-1}

    = \left\{ 1 + \exp\left[ \log \frac{\pi_1}{\pi_0} + (\mu_1 - \mu_0)^T \Sigma^{-1} x - \frac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) \right] \right\}^{-1}

48 Relation to Logistic Regression

We can write the posterior distribution p(t = 0 \mid x) as

    p(t = 0 \mid x) = \frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma)}{\pi_0\, \mathcal{N}(x \mid \mu_0, \Sigma) + \pi_1\, \mathcal{N}(x \mid \mu_1, \Sigma)}

    = \left\{ 1 + \frac{\pi_1}{\pi_0} \exp\left[ -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right] \right\}^{-1}

    = \left\{ 1 + \exp\left[ \log \frac{\pi_1}{\pi_0} + (\mu_1 - \mu_0)^T \Sigma^{-1} x - \frac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) \right] \right\}^{-1}

    = \frac{1}{1 + \exp(-w^T x - b)}

so the posterior has exactly the sigmoid (logistic regression) form in x.
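
A small numerical check of this identity (illustrative only; it assumes the shared_cov_boundary and gaussian_log_density sketches from earlier): the exact posterior p(t = 1 | x) computed from the two Gaussians matches the sigmoid of the linear score a^T x - b.

    import numpy as np

    rng = np.random.default_rng(1)
    D = 3
    mu0, mu1 = rng.normal(size=D), rng.normal(size=D)
    A = rng.normal(size=(D, D))
    Sigma = A @ A.T + np.eye(D)                 # a valid shared covariance matrix
    pi0, pi1 = 0.4, 0.6
    x = rng.normal(size=D)

    # Sigmoid of the linear score from the shared-covariance boundary sketch.
    a, b = shared_cov_boundary(mu0, mu1, Sigma, pi0, pi1)
    post1_sigmoid = 1.0 / (1.0 + np.exp(-(a @ x - b)))

    # Exact posterior p(t = 1 | x) from the two class-conditional Gaussians.
    log_joint0 = np.log(pi0) + gaussian_log_density(x, mu0, Sigma)
    log_joint1 = np.log(pi1) + gaussian_log_density(x, mu1, Sigma)
    post1_exact = np.exp(log_joint1) / (np.exp(log_joint0) + np.exp(log_joint1))

    assert np.allclose(post1_sigmoid, post1_exact)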

49 Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary is again where p(t = 1 \mid x) = p(t = 0 \mid x).

50 Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary is again where p(t = 1 \mid x) = p(t = 0 \mid x):

    \log \pi_1 - \frac{1}{2} \log \det(\Sigma_1) - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} \log \det(\Sigma_0) - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)

51 Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary is again where p(t = 1 \mid x) = p(t = 0 \mid x):

    \log \pi_1 - \frac{1}{2} \log \det(\Sigma_1) - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} \log \det(\Sigma_0) - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)

    x^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) x - 2 \left( \mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \right) x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) = C

(C collects the constant prior and log-determinant terms.)

52 Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary is again where p(t = 1 \mid x) = p(t = 0 \mid x):

    \log \pi_1 - \frac{1}{2} \log \det(\Sigma_1) - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} \log \det(\Sigma_0) - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)

    x^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) x - 2 \left( \mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \right) x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) = C

(C collects the constant prior and log-determinant terms.)

    x^T Q x - 2 b^T x + c = 0

The decision boundary is a quadratic function. In the 2-d case it looks like an ellipse, a parabola, or a hyperbola.
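
A sketch of the corresponding coefficients Q, b, c (hypothetical helper, not from the slides; the constants from the priors and normalizers are folded into c):

    import numpy as np

    def quadratic_boundary(mu0, mu1, Sigma0, Sigma1, pi0, pi1):
        """Coefficients (Q, b, c) of the boundary x^T Q x - 2 b^T x + c = 0."""
        S0inv, S1inv = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)
        Q = S1inv - S0inv
        b = S1inv @ mu1 - S0inv @ mu0
        # c collects the quadratic terms in the means plus the prior and log-determinant constants
        c = (mu1 @ S1inv @ mu1 - mu0 @ S0inv @ mu0
             - 2.0 * np.log(pi1 / pi0)
             + np.linalg.slogdet(Sigma1)[1] - np.linalg.slogdet(Sigma0)[1])
        return Q, b, c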

53 Thanks!
