Probabilistic classification
CE-717: Machine Learning, Sharif University of Technology
M. Soleymani, Fall 2016
Topics
- Probabilistic approach: Bayes decision theory
- Generative models: Gaussian Bayes classifier, Naïve Bayes
- Discriminative models: logistic regression
Classification problem: probabilistic view
- Each feature is treated as a random variable, and the class label is also a random variable.
- We observe the feature values for a random sample and intend to find its class label.
- Evidence: feature vector $x$
- Query: class label
Definitions
- Posterior probability: $p(C_k|x)$
- Likelihood or class-conditional probability: $p(x|C_k)$
- Prior probability: $p(C_k)$
- $p(x)$: pdf of feature vector $x$, with $p(x) = \sum_{k=1}^{K} p(x|C_k)\,p(C_k)$
- $p(x|C_k)$: pdf of feature vector $x$ for samples of class $C_k$
- $p(C_k)$: probability that the label is $C_k$
Bayes decision rule ($K = 2$)
- If $P(C_1|x) > P(C_2|x)$ decide $C_1$; otherwise decide $C_2$.
- $p(\text{error}|x) = \begin{cases} p(C_2|x) & \text{if we decide } C_1 \\ p(C_1|x) & \text{if we decide } C_2 \end{cases}$
- If we use the Bayes decision rule: $p(\text{error}|x) = \min\{p(C_1|x),\, p(C_2|x)\}$.
- Using the Bayes rule, for each $x$, $p(\text{error}|x)$ is as small as possible, and thus this rule minimizes the probability of error.
Optimal classifier
- The optimal decision is the one that minimizes the expected number of mistakes.
- We show that the Bayes classifier is an optimal classifier.
Bayes decision rule: minimizing the misclassification rate ($K = 2$)
- Decision regions: $R_k = \{x \mid \alpha(x) = k\}$; all points in $R_k$ are assigned to class $C_k$.
- $p(\text{error}) = E_{x,y}[I(\alpha(x) \neq y)] = p(x \in R_1, C_2) + p(x \in R_2, C_1)$
  $= \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx$
  $= \int_{R_1} p(C_2|x)\,p(x)\,dx + \int_{R_2} p(C_1|x)\,p(x)\,dx$
- To minimize this, choose the class with the highest $p(C_k|x)$ as $\alpha(x)$.
Bayes minimum error
- Bayes minimum-error classifier: $\min_{\alpha(\cdot)} E_{x,y}[I(\alpha(x) \neq y)]$ (zero-one loss)
- If we know the probabilities in advance, the above optimization problem is solved easily: $\alpha(x) = \arg\max_y p(y|x)$.
- In practice, we estimate $p(y|x)$ from a set of training samples $D$.
Bayes theorem
- $p(C_k|x) = \dfrac{p(x|C_k)\,p(C_k)}{p(x)}$ (posterior = likelihood × prior / evidence)
- Posterior probability: $p(C_k|x)$
- Likelihood or class-conditional probability: $p(x|C_k)$
- Prior probability: $p(C_k)$
- $p(x)$: pdf of feature vector $x$, with $p(x) = \sum_{k=1}^{K} p(x|C_k)\,p(C_k)$
- $p(x|C_k)$: pdf of feature vector $x$ for samples of class $C_k$
- $p(C_k)$: probability that the label is $C_k$
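A minimal sketch of this computation (the likelihood and prior numbers below are made up for illustration): given $p(x|C_k)$ evaluated at a particular $x$ and the priors $p(C_k)$, the posterior follows directly from Bayes' theorem.

```python
import numpy as np

# Hypothetical values of p(x | C_k) at one observed x, and the priors p(C_k).
likelihoods = np.array([0.05, 0.20])   # p(x | C_1), p(x | C_2)
priors      = np.array([2/3, 1/3])     # p(C_1), p(C_2)

evidence  = np.sum(likelihoods * priors)      # p(x) = sum_k p(x | C_k) p(C_k)
posterior = likelihoods * priors / evidence   # p(C_k | x), Bayes' theorem

print(posterior)            # e.g. [0.333..., 0.666...]
print(posterior.argmax())   # index of the class with the highest posterior
```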
Bayes decision rule: example
- Bayes decision: choose the class with the highest $p(C_k|x)$.
- Priors: $p(C_1) = 2/3$, $p(C_2) = 1/3$.
- $p(C_k|x) = \dfrac{p(x|C_k)\,p(C_k)}{p(x)}$, where $p(x) = p(C_1)\,p(x|C_1) + p(C_2)\,p(x|C_2)$.
- (Figure: class-conditional densities $p(x|C_1)$, $p(x|C_2)$, posteriors $p(C_1|x)$, $p(C_2|x)$, and the resulting decision regions.)
Bayesian decision rule
- If $P(C_1|x) > P(C_2|x)$ decide $C_1$; otherwise decide $C_2$.
- Equivalently: if $\dfrac{p(x|C_1)\,P(C_1)}{p(x)} > \dfrac{p(x|C_2)\,P(C_2)}{p(x)}$ decide $C_1$; otherwise decide $C_2$.
- Equivalently: if $p(x|C_1)\,P(C_1) > p(x|C_2)\,P(C_2)$ decide $C_1$; otherwise decide $C_2$.
Bayes decision rule: example
- Bayes decision: choose the class with the highest $p(C_k|x)$.
- (Figure: class-conditional densities $p(x|C_1)$, $p(x|C_2)$ with priors $p(C_1) = 2/3$, $p(C_2) = 1/3$, the resulting posteriors $p(C_1|x)$, $p(C_2|x)$, and decision regions.)
Bayes classifier
- Simple Bayes classifier: estimate the posterior probability of each class.
- What should the decision criterion be? Choose the class with the highest $p(C_k|x)$.
- The optimal decision is the one that minimizes the expected number of mistakes.
Diabetes example
- (Figure: histogram of white blood cell count for the patients; adapted from Sanja Fidler's slides, University of Toronto, CSC411.)
Diabetes example
- The doctor has a prior $p(y = 1) = 0.2$.
- Prior: in the absence of any observation, what do I know about the probability of the classes?
- A patient comes in with white blood cell count $x$. Does the patient have diabetes, i.e., what is $p(y = 1|x)$?
- Given a new observation, we still need to compute the posterior.
Diabetes example
- (Figure: class-conditional densities $p(x|y = 0)$ and $p(x|y = 1)$ over white blood cell count; adapted from Sanja Fidler's slides, University of Toronto, CSC411.)
Estimate probability densities from data
- Assume Gaussian distributions for $p(x|C_1)$ and $p(x|C_2)$.
- Recall that for samples $\{x^{(1)}, \ldots, x^{(N)}\}$, if we assume a Gaussian distribution, the MLE estimates are $\hat\mu = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}$ and $\hat\sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (x^{(n)} - \hat\mu)^2$.
Diabetes example
- Fit a Gaussian to each class: $p(x|y = 1) = \mathcal{N}(\mu_1, \sigma_1^2)$ with
  $\mu_1 = \dfrac{\sum_{n:\, y^{(n)}=1} x^{(n)}}{N_1}$, $\quad \sigma_1^2 = \dfrac{\sum_{n:\, y^{(n)}=1} (x^{(n)} - \mu_1)^2}{N_1}$, $\quad N_1 = \sum_{n} y^{(n)}$
  (and similarly for $p(x|y = 0)$).
- (Figure: fitted densities $p(x|y = 0)$ and $p(x|y = 1)$; adapted from Sanja Fidler's slides, University of Toronto, CSC411.)
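A short sketch of this per-class Gaussian fit (the data here is synthetic, standing in for the white-blood-cell-count feature; the means, variances, and sample sizes are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D feature: 80 healthy samples (y=0), 20 diabetic samples (y=1).
x = np.concatenate([rng.normal(5.0, 1.0, 80), rng.normal(7.5, 1.5, 20)])
y = np.concatenate([np.zeros(80, dtype=int), np.ones(20, dtype=int)])

for k in (0, 1):
    xk = x[y == k]
    mu_k  = xk.mean()                   # MLE mean: (1/N_k) * sum of x^(n) in class k
    var_k = ((xk - mu_k) ** 2).mean()   # MLE variance (divides by N_k, not N_k - 1)
    print(f"class {k}: mu = {mu_k:.2f}, sigma^2 = {var_k:.2f}")
```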
Diabetes example
- Add a second observation: plasma glucose value.
- (Figure: the two features plotted per class; adapted from Sanja Fidler's slides, University of Toronto, CSC411.)
Generative approach for this example
- Multivariate Gaussian distributions for $p(x|C_k)$:
  $p(x|y = k) = \dfrac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\left\{-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right\}, \quad k = 1, 2$
- Prior distribution $p(C_k)$: $p(y = 1) = \pi$, $\quad p(y = 0) = 1 - \pi$.
MLE for the multivariate Gaussian
- For samples $\{x^{(1)}, \ldots, x^{(N)}\}$, if we assume a multivariate Gaussian distribution, the MLE estimates are:
  $\hat\mu = \dfrac{1}{N}\sum_{n=1}^{N} x^{(n)}$, $\qquad \hat\Sigma = \dfrac{1}{N}\sum_{n=1}^{N} (x^{(n)} - \hat\mu)(x^{(n)} - \hat\mu)^T$
Generative approach: example
- Maximum likelihood estimation ($D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$):
  $\hat\pi = \dfrac{N_1}{N}$
  $\hat\mu_1 = \dfrac{\sum_{n=1}^{N} y^{(n)} x^{(n)}}{N_1}$, $\qquad \hat\mu_2 = \dfrac{\sum_{n=1}^{N} (1 - y^{(n)})\, x^{(n)}}{N_2}$
  $\hat\Sigma_1 = \dfrac{1}{N_1}\sum_{n=1}^{N} y^{(n)} (x^{(n)} - \hat\mu_1)(x^{(n)} - \hat\mu_1)^T$
  $\hat\Sigma_2 = \dfrac{1}{N_2}\sum_{n=1}^{N} (1 - y^{(n)}) (x^{(n)} - \hat\mu_2)(x^{(n)} - \hat\mu_2)^T$
  where $N_1 = \sum_{n=1}^{N} y^{(n)}$ and $N_2 = N - N_1$.
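A sketch of this generative classifier with separate covariance matrices (the function and variable names are my own; binary labels in {0, 1} are assumed, as above):

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """MLE for a two-class Gaussian Bayes classifier with separate covariances.
    X: (N, d) array of features, y: (N,) array of labels in {0, 1}."""
    pi = y.mean()                              # estimate of p(y = 1)
    params = {}
    for k in (0, 1):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)                   # class mean mu_k
        diff = Xk - mu
        Sigma = diff.T @ diff / len(Xk)        # MLE covariance (divide by N_k)
        params[k] = (mu, Sigma)
    return pi, params

def log_gaussian(x, mu, Sigma):
    """log N(x; mu, Sigma) for a single d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * diff @ np.linalg.solve(Sigma, diff))

def predict(x, pi, params):
    """Pick the class with the larger log p(x | C_k) + log p(C_k)."""
    score1 = log_gaussian(x, *params[1]) + np.log(pi)
    score0 = log_gaussian(x, *params[0]) + np.log(1 - pi)
    return int(score1 > score0)
```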
Decision boundary for the Gaussian Bayes classifier
- The boundary satisfies $p(C_1|x) = p(C_2|x)$, i.e., $\ln p(C_1|x) = \ln p(C_2|x)$.
- Since $p(C_k|x) = \dfrac{p(x|C_k)\,p(C_k)}{p(x)}$:
  $\ln p(x|C_1) + \ln p(C_1) - \ln p(x) = \ln p(x|C_2) + \ln p(C_2) - \ln p(x)$
  $\Rightarrow \ln p(x|C_1) + \ln p(C_1) = \ln p(x|C_2) + \ln p(C_2)$
- For Gaussian class-conditionals:
  $\ln p(x|C_k) = -\dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_k| - \dfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$
Decision boundary
- (Figure: class-conditional densities $p(x|C_1)$, $p(x|C_2)$, the posterior $p(C_1|x)$, and the decision boundary where $p(C_1|x) = p(C_2|x)$.)
Shared covariance matrix
- When the classes share a single covariance matrix, $\Sigma = \Sigma_1 = \Sigma_2$:
  $p(x|C_k) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left\{-\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right\}, \quad k = 1, 2$
  $p(C_1) = \pi$, $\quad p(C_2) = 1 - \pi$
Likelihood
$\prod_{n=1}^{N} p(x^{(n)}, y^{(n)} \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} p(x^{(n)} \mid y^{(n)}, \mu_1, \mu_2, \Sigma)\; p(y^{(n)} \mid \pi)$
Shared covariance matrix
- Maximum likelihood estimation ($D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$):
  $\hat\pi = \dfrac{N_1}{N}$
  $\hat\mu_1 = \dfrac{\sum_{n=1}^{N} y^{(n)} x^{(n)}}{N_1}$, $\qquad \hat\mu_2 = \dfrac{\sum_{n=1}^{N} (1 - y^{(n)})\, x^{(n)}}{N_2}$
  $\hat\Sigma = \dfrac{1}{N}\left[\sum_{n \in C_1} (x^{(n)} - \hat\mu_1)(x^{(n)} - \hat\mu_1)^T + \sum_{n \in C_2} (x^{(n)} - \hat\mu_2)(x^{(n)} - \hat\mu_2)^T\right]$
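A minimal sketch of the shared-covariance fit (again with hypothetical names; labels are assumed to be in {0, 1}):

```python
import numpy as np

def fit_shared_covariance(X, y):
    """MLE when both classes share one covariance matrix.
    X: (N, d) features, y: (N,) labels in {0, 1}."""
    N = len(y)
    pi  = y.mean()                        # p(y = 1) = N_1 / N
    mu1 = X[y == 1].mean(axis=0)
    mu0 = X[y == 0].mean(axis=0)
    d1 = X[y == 1] - mu1                  # deviations around each class's own mean
    d0 = X[y == 0] - mu0
    Sigma = (d1.T @ d1 + d0.T @ d0) / N   # pooled scatter divided by the total N
    return pi, mu0, mu1, Sigma
```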
Decision boundary with a shared covariance matrix
- $\ln p(x|C_1) + \ln p(C_1) = \ln p(x|C_2) + \ln p(C_2)$, where
  $\ln p(x|C_k) = -\dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma| - \dfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)$
- The quadratic terms $x^T \Sigma^{-1} x$ cancel, so the decision boundary is linear in $x$.
Bayes decision rule: multi-class misclassification rate
- For the multi-class problem, it is simpler to compute the probability of a correct decision than the probability of error of the Bayes decision rule: $P(\text{error}) = 1 - P(\text{correct})$
- $P(\text{correct}) = \sum_{i=1}^{K} \int_{R_i} p(x, C_i)\,dx = \sum_{i=1}^{K} \int_{R_i} p(C_i|x)\,p(x)\,dx$
- $R_i$: the subset of feature space assigned to class $C_i$ by the classifier.
Bayes minimum error
- Bayes minimum-error classifier: $\min_{\alpha(\cdot)} E_{x,y}[I(\alpha(x) \neq y)]$ (zero-one loss)
- $\alpha(x) = \arg\max_y p(y|x)$
Minimizing the Bayes risk (expected loss)
- $E_{x,y}\left[L(\alpha(x), y)\right] = \int \sum_{j=1}^{K} L(\alpha(x), C_j)\,p(x, C_j)\,dx = \int p(x) \sum_{j=1}^{K} L(\alpha(x), C_j)\,p(C_j|x)\,dx$
- For each $x$, minimize the inner sum, which is called the conditional risk.
- Bayes minimum-loss (risk) decision rule:
  $\alpha(x) = \arg\min_{i=1,\ldots,K} \sum_{j=1}^{K} L_{ij}\,p(C_j|x)$
  where $L_{ij}$ is the loss of assigning a sample to $C_i$ when the correct class is $C_j$.
Minimizing expected loss: special case (loss = misclassification rate)
- Problem definition for this special case: if action $\alpha(x) = i$ is taken and the true category is $C_j$, the decision is correct if $i = j$ and incorrect otherwise.
- Zero-one loss function: $L_{ij} = 1 - \delta_{ij} = \begin{cases} 0 & i = j \\ 1 & \text{otherwise} \end{cases}$
- $\alpha(x) = \arg\min_{i=1,\ldots,K} \sum_{j=1}^{K} L_{ij}\,p(C_j|x) = \arg\min_{i} \left[0 \cdot p(C_i|x) + \sum_{j \neq i} p(C_j|x)\right] = \arg\min_{i} \left[1 - p(C_i|x)\right] = \arg\max_{i} p(C_i|x)$
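A tiny sketch of the minimum-risk rule for one input, with a made-up posterior vector and loss matrix:

```python
import numpy as np

posterior = np.array([0.3, 0.6, 0.1])   # hypothetical p(C_j | x) for K = 3 classes
L = np.array([[0, 1, 1],                # L[i, j]: loss of deciding C_i (row)
              [1, 0, 1],                #          when the true class is C_j (column)
              [4, 4, 0]])               # here, wrongly deciding C_3 is extra costly

conditional_risk = L @ posterior        # risk of action i: sum_j L[i, j] p(C_j | x)
decision = conditional_risk.argmin()    # Bayes minimum-risk decision
print(conditional_risk, decision)

# With the zero-one loss (L = 1 - I), this reduces to argmax_j p(C_j | x).
```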
Probabilistic discriminant functions
- Discriminant functions are a popular way of representing a classifier: a discriminant function $f_i(x)$ for each class $C_i$ ($i = 1, \ldots, K$), and $x$ is assigned to class $C_i$ if $f_i(x) > f_j(x)$ for all $j \neq i$.
- Representing the Bayesian classifier using discriminant functions:
  - Classifier minimizing error rate: $f_i(x) = P(C_i|x)$
  - Classifier minimizing risk: $f_i(x) = -\sum_{j=1}^{K} L_{ij}\,p(C_j|x)$
Naïve Bayes classifier
- Generative methods can require a high number of parameters (for the full class-conditional density).
- Naïve Bayes assumption: conditional independence of the features given the class,
  $p(x|C_k) = p(x_1|C_k)\,p(x_2|C_k) \cdots p(x_d|C_k)$
Naïve Bayes classifier
- In the decision phase, it finds the label of $x$ according to:
  $\arg\max_{k=1,\ldots,K} p(C_k|x) = \arg\max_{k=1,\ldots,K} p(C_k) \prod_{i=1}^{d} p(x_i|C_k)$
  since $p(x|C_k) = p(x_1|C_k) \cdots p(x_d|C_k)$ implies $p(C_k|x) \propto p(C_k) \prod_{i=1}^{d} p(x_i|C_k)$.
Naïve Bayes classifier
- It first estimates the class-conditional densities $p(x_1|C_k), \ldots, p(x_d|C_k)$ and the prior probability $p(C_k)$ for each class ($k = 1, \ldots, K$) from the training set.
- It thus finds $d$ univariate distributions $p(x_1|C_k), \ldots, p(x_d|C_k)$ instead of one multivariate distribution $p(x|C_k)$.
- Example 1: for a Gaussian class-conditional density $p(x|C_k)$, it estimates $d + d$ parameters (a mean and a variance per dimension) instead of $d + \frac{d(d+1)}{2}$ parameters.
- Example 2: for a Bernoulli class-conditional density $p(x|C_k)$ over binary features, it estimates $d$ parameters (one mean per dimension) instead of $2^d - 1$ parameters.
Naïve Bayes: discrete example
Training data (Diabetes D, Smoke S, Heart Disease H):
  D  S  H
  Y  N  Y
  Y  N  N
  N  Y  N
  N  Y  N
  N  N  N
  N  Y  Y
  N  N  N
  N  Y  Y
  N  N  N
  Y  N  N
Estimated probabilities:
  $p(H{=}\text{Yes}) = 0.3$
  $p(D{=}\text{Yes}|H{=}\text{Yes}) = 1/3$, $\quad p(S{=}\text{Yes}|H{=}\text{Yes}) = 2/3$
  $p(D{=}\text{Yes}|H{=}\text{No}) = 2/7$, $\quad p(S{=}\text{Yes}|H{=}\text{No}) = 2/7$
Decision on $x = [\text{Yes}, \text{Yes}]$ (a person who has diabetes and also smokes):
  $p(H{=}\text{Yes}|x) \propto p(H{=}\text{Yes})\,p(D{=}\text{Yes}|H{=}\text{Yes})\,p(S{=}\text{Yes}|H{=}\text{Yes}) \approx 0.066$
  $p(H{=}\text{No}|x) \propto p(H{=}\text{No})\,p(D{=}\text{Yes}|H{=}\text{No})\,p(S{=}\text{Yes}|H{=}\text{No}) \approx 0.057$
  Thus decide $H = \text{Yes}$.
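A small sketch reproducing the computation above (column order D, S, H as in the table; the helper names are mine):

```python
# Training table: (Diabetes, Smoke, Heart Disease), one tuple per person.
data = [("Y","N","Y"), ("Y","N","N"), ("N","Y","N"), ("N","Y","N"), ("N","N","N"),
        ("N","Y","Y"), ("N","N","N"), ("N","Y","Y"), ("N","N","N"), ("Y","N","N")]

def cond(idx, value, h):
    """p(feature[idx] = value | H = h), estimated by counting."""
    rows = [r for r in data if r[2] == h]
    return sum(r[idx] == value for r in rows) / len(rows)

p_yes = sum(r[2] == "Y" for r in data) / len(data)               # p(H = Yes) = 0.3

# x = [D = Yes, S = Yes]; compare p(H) * prod_i p(x_i | H) for the two labels.
score_yes = p_yes       * cond(0, "Y", "Y") * cond(1, "Y", "Y")  # ~ 0.066
score_no  = (1 - p_yes) * cond(0, "Y", "N") * cond(1, "Y", "N")  # ~ 0.057
print("decide H = Yes" if score_yes > score_no else "decide H = No")
```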
Probabilistic classifiers
- How can we find the probabilities required in the Bayes decision rule?
- Probabilistic classification approaches fall into two main categories:
  - Generative: estimate the joint pdf $p(x, C_k)$ for each class $C_k$ (or, alternatively, estimate both $p(x|C_k)$ and $p(C_k)$), and then use them to find $p(C_k|x)$.
  - Discriminative: directly estimate $p(C_k|x)$ for each class $C_k$.
Generative approach
- Inference stage: determine the class-conditional densities $p(x|C_k)$ and the priors $p(C_k)$; use Bayes' theorem to find $p(C_k|x)$.
- Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if $p(C_i|x) > p(C_j|x)$ for all $j \neq i$, then decide $C_i$.
Discriminative vs. generative approach
- (Figure comparing the two approaches; from [Bishop].)
Class-conditional densities vs. posterior
- (Figure from [Bishop]: class-conditional densities $p(x|C_1)$, $p(x|C_2)$ and the resulting posterior $p(C_1|x)$.)
- For shared-covariance Gaussian class-conditionals, the posterior is a logistic sigmoid of a linear function of $x$:
  $p(C_1|x) = \sigma(w^T x + w_0)$
  $w = \Sigma^{-1}(\mu_1 - \mu_2)$
  $w_0 = -\dfrac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \dfrac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln\dfrac{p(C_1)}{p(C_2)}$
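A sketch of this mapping from a fitted shared-covariance Gaussian model to the sigmoid posterior (the parameters mu1, mu2, Sigma and the priors p1, p2 are assumed to have been estimated already, e.g. with the shared-covariance MLE above):

```python
import numpy as np

def posterior_c1(x, mu1, mu2, Sigma, p1, p2):
    """p(C_1 | x) = sigma(w^T x + w_0) for shared-covariance Gaussian classes."""
    Sigma_inv = np.linalg.inv(Sigma)
    w  = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p1 / p2))
    a = w @ x + w0
    return 1.0 / (1.0 + np.exp(-a))     # logistic sigmoid of a linear function of x
```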
Discriminative approach
- Inference stage: determine the posterior class probabilities $P(C_k|x)$ directly.
- Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if $P(C_i|x) > P(C_j|x)$ for all $j \neq i$, then decide $C_i$.
Posterior probabilities
- Two-class: $p(C_1|x)$ can be written as a logistic sigmoid for a wide choice of $p(x|C_k)$ distributions:
  $p(C_1|x) = \sigma(a(x)) = \dfrac{1}{1 + \exp(-a(x))}$
- Multi-class: $p(C_k|x)$ can be written as a softmax for a wide choice of $p(x|C_k)$:
  $p(C_k|x) = \dfrac{\exp(a_k(x))}{\sum_{j=1}^{K} \exp(a_j(x))}$
Discriminative approach: logistic regression
- More general than discriminant functions: $f(x; w)$ predicts posterior probabilities $P(y = 1|x)$ for $K = 2$.
- $f(x; w) = \sigma(w^T x)$, with $x = [1, x_1, \ldots, x_d]$ and $w = [w_0, w_1, \ldots, w_d]$.
- $\sigma(\cdot)$ is the activation function; here, the sigmoid (logistic) function $\sigma(z) = \dfrac{1}{1 + e^{-z}}$.
Logistic regression
- $f(x; w)$: probability that $y = 1$ given $x$ (parameterized by $w$); $K = 2$, $y \in \{0, 1\}$.
  $P(y = 1|x, w) = f(x; w)$, $\qquad P(y = 0|x, w) = 1 - f(x; w)$
- $f(x; w) = \sigma(w^T x)$, $\quad 0 \leq f(x; w) \leq 1$: the estimated probability of $y = 1$ on input $x$.
- Example: cancer (malignant, benign). $f(x; w) = 0.7$ means a 70% chance of the tumor being malignant.
Logistic regression: decision surface
- Decision surface: $f(x; w) = \text{constant}$, e.g., $f(x; w) = \sigma(w^T x) = \dfrac{1}{1 + e^{-w^T x}} = 0.5$.
- Decision surfaces are linear functions of $x$.
- If $f(x; w) \geq 0.5$ then $y = 1$, else $y = 0$; with the bias absorbed into $w$, this is equivalent to: if $w^T x \geq 0$ then $y = 1$, else $y = 0$.
Logistic regression: ML estimation
- Maximum (conditional) log-likelihood:
  $\hat{w} = \arg\max_{w} \log \prod_{i=1}^{n} p(y^{(i)}|x^{(i)}, w)$
  $p(y^{(i)}|x^{(i)}, w) = f(x^{(i)}; w)^{y^{(i)}} \left(1 - f(x^{(i)}; w)\right)^{1 - y^{(i)}}$
  $\log p(y|X, w) = \sum_{i=1}^{n} \left[y^{(i)} \log f(x^{(i)}; w) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)}; w)\right)\right]$
Logistic regression: cost function
- $\hat{w} = \arg\min_{w} J(w)$, where
  $J(w) = -\sum_{i=1}^{n} \log p(y^{(i)}|x^{(i)}, w) = -\sum_{i=1}^{n} \left[y^{(i)} \log f(x^{(i)}; w) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)}; w)\right)\right]$
- There is no closed-form solution for $\nabla_w J(w) = 0$; however, $J(w)$ is convex.
Logistic regression: gradient descent
- $w^{t+1} = w^{t} - \eta \nabla_w J(w^{t})$
- $\nabla_w J(w) = \sum_{i=1}^{n} \left(f(x^{(i)}; w) - y^{(i)}\right) x^{(i)}$
- Is it similar to the gradient of SSE for linear regression? There, $\nabla_w J(w) = \sum_{i=1}^{n} \left(w^T x^{(i)} - y^{(i)}\right) x^{(i)}$.
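A minimal sketch of batch gradient descent for logistic regression (the step size, iteration count, and the 1/n scaling of the gradient are my own choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Minimize J(w) by batch gradient descent.
    X: (n, d) inputs, y: (n,) labels in {0, 1}; a bias feature is appended."""
    Xb = np.hstack([np.ones((len(X), 1)), X])    # x = [1, x_1, ..., x_d]
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        f = sigmoid(Xb @ w)                      # f(x^(i); w) for all samples
        grad = Xb.T @ (f - y)                    # sum_i (f(x^(i); w) - y^(i)) x^(i)
        w -= eta * grad / len(y)                 # gradient descent step
    return w
```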
Logistic regression: loss function
- $\text{Loss}(y, f(x; w)) = -y \log f(x; w) - (1 - y) \log\left(1 - f(x; w)\right)$
- Since $y = 1$ or $y = 0$:
  $\text{Loss}(y, f(x; w)) = \begin{cases} -\log f(x; w) & \text{if } y = 1 \\ -\log\left(1 - f(x; w)\right) & \text{if } y = 0 \end{cases}$
- How is it related to the zero-one loss $\text{Loss}(y, \hat{y}) = \begin{cases} 1 & y \neq \hat{y} \\ 0 & y = \hat{y} \end{cases}$, given $f(x; w) = \dfrac{1}{1 + \exp(-w^T x)}$?
Logistic regression: cost function (summary)
- Logistic regression (LR) has a more appropriate cost function for classification than SSE and the perceptron loss.
- Why is the cost function of LR more suitable than the SSE cost $J(w) = \frac{1}{n}\sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)}; w)\right)^2$, where $f(x; w) = \sigma(w^T x)$?
  - The conditional distribution $p(y|x, w)$ in the classification problem is not Gaussian (it is Bernoulli).
  - The cost function of LR is also convex.
Multi-class logistic regression
- For each class $k$, $f_k(x; W)$ predicts the probability of $y = k$, i.e., $P(y = k|x, W)$.
- On a new input $x$, to make a prediction, pick the class that maximizes $f_k(x; W)$:
  $\alpha(x) = \arg\max_{k=1,\ldots,K} f_k(x)$, i.e., if $f_k(x) > f_j(x)$ for all $j \neq k$, then decide $C_k$.
Multi-class logistic regression
- $K > 2$, $\quad y \in \{1, 2, \ldots, K\}$
- $f_k(x; W) = p(y = k|x) = \dfrac{\exp(w_k^T x)}{\sum_{j=1}^{K} \exp(w_j^T x)}$: the normalized exponential (aka softmax).
- If $w_k^T x \gg w_j^T x$ for all $j \neq k$, then $p(C_k|x) \approx 1$ and $p(C_j|x) \approx 0$.
- Compare with the generative posterior $p(C_k|x) = \dfrac{p(x|C_k)\,p(C_k)}{\sum_{j=1}^{K} p(x|C_j)\,p(C_j)}$.
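A short sketch of the softmax itself; subtracting the maximum before exponentiating is a standard numerical-stability trick (my addition, not from the slides) that leaves the result unchanged:

```python
import numpy as np

def softmax(a):
    """Normalized exponential: exp(a_k) / sum_j exp(a_j)."""
    a = a - a.max()          # shift for numerical stability (does not change the ratios)
    e = np.exp(a)
    return e / e.sum()
```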
Logistic regression: multi-class cost function
- $\hat{W} = \arg\min_{W} J(W)$, where
  $J(W) = -\log \prod_{i=1}^{n} p(y^{(i)}|x^{(i)}, W) = -\log \prod_{i=1}^{n} \prod_{k=1}^{K} f_k(x^{(i)}; W)^{y_k^{(i)}} = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \log f_k(x^{(i)}; W)$
- $y$ is a vector of length $K$ (1-of-K coding), e.g., $y = [0, 0, 1, 0]^T$ when the target class is $C_3$.
- $W = [w_1 \;\cdots\; w_K]$, $\qquad Y = \begin{bmatrix} y_1^{(1)} & \cdots & y_K^{(1)} \\ \vdots & & \vdots \\ y_1^{(n)} & \cdots & y_K^{(n)} \end{bmatrix}$
Logistic regression: multi-class gradient descent
- $w_j^{t+1} = w_j^{t} - \eta \nabla_{w_j} J(W^{t})$
- $\nabla_{w_j} J(W) = \sum_{i=1}^{n} \left(f_j(x^{(i)}; W) - y_j^{(i)}\right) x^{(i)}$
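A sketch of the multi-class training loop under the same assumptions as the binary case (one-hot targets, a bias feature appended, and my own choices for step size and gradient scaling):

```python
import numpy as np

def train_softmax_regression(X, Y, eta=0.1, n_iters=1000):
    """Batch gradient descent for multi-class logistic regression.
    X: (n, d) inputs; Y: (n, K) one-hot (1-of-K) targets."""
    Xb = np.hstack([np.ones((len(X), 1)), X])     # append a bias feature
    W = np.zeros((Xb.shape[1], Y.shape[1]))       # one weight vector w_k per class
    for _ in range(n_iters):
        A = Xb @ W                                # a_k(x^(i)) = w_k^T x^(i)
        A -= A.max(axis=1, keepdims=True)         # numerical stability
        F = np.exp(A)
        F /= F.sum(axis=1, keepdims=True)         # f_k(x^(i); W), the softmax
        grad = Xb.T @ (F - Y)                     # column j: sum_i (f_j - y_j^(i)) x^(i)
        W -= eta * grad / len(X)
    return W
```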
Logistic Regression (LR): summary
- LR is a linear classifier.
- The LR optimization problem is obtained by maximum likelihood when assuming a Bernoulli distribution for the conditional probabilities, whose mean is $\dfrac{1}{1 + e^{-w^T x}}$.
- There is no closed-form solution for its optimization problem.
- But the cost function is convex, and the global optimum can be found by gradient descent (equivalently, gradient ascent on the log-likelihood).
Discriminative vs. generative: number of parameters
- $d$-dimensional feature space.
- Logistic regression: $d + 1$ parameters, $w = (w_0, w_1, \ldots, w_d)$.
- Generative approach (Gaussian class-conditionals with a shared covariance matrix): $2d$ parameters for the means, $d(d+1)/2$ parameters for the shared covariance matrix, and one parameter for the class prior $p(C_1)$.
- But LR is more robust and less sensitive to incorrect modeling assumptions.
Summary of alternatives
- Generative: most demanding, because it finds the joint distribution $p(x, C_k)$; usually needs a large training set to find $p(x|C_k)$; can find $p(x)$, which is useful for outlier or novelty detection.
- Discriminative: specifies what is really needed (i.e., $p(C_k|x)$); more computationally efficient.
Resources
- C. Bishop, Pattern Recognition and Machine Learning, Sections 4.2-4.3.