Classification. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2012.
Topics: discriminant functions, logistic regression, perceptron, generative models, generative vs. discriminative, naïve Bayes.
Classification problem. Given: a training set, i.e., a labeled set of N input-output pairs D = {(x^(i), y^(i))}_{i=1}^N with y ∈ {1, ..., K}. Goal: given an input x, assign it to one of K classes. Examples: spam filtering, handwritten digit recognition.
Types of classifiers. A discriminant function f(x) directly outputs the class of the input x. Probabilistic classification approaches can be divided into two main categories: generative and discriminative.
Target coding scheme. Target values: for binary classification, a target variable y ∈ {0, 1}; for multiple classes (K > 2), y is a vector of length K (1-of-K coding). Target class C_j: y_j = 1 and y_i = 0 for i ≠ j, e.g., y = (0, 0, 1, 0)^T.
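A minimal sketch of the 1-of-K coding in Python (the helper name is mine, not from the slides):

import numpy as np

def one_of_k(label, K):
    # Encode class label j in {1, ..., K} as a length-K vector with y_j = 1.
    y = np.zeros(K)
    y[label - 1] = 1.0
    return y

print(one_of_k(3, 4))  # [0. 0. 1. 0.], matching the example y = (0, 0, 1, 0)^T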
Linear classifiers. Decision boundaries are linear functions: a (d-1)-dimensional hyperplane within the d-dimensional input space. Linearly separable: data points can be exactly classified by linear decision surfaces. We start with binary linear classification.
Discriminant functions. f(x; w) = w^T x, where x = [1, x_1, x_2, ..., x_d]^T and w = [w_0, w_1, w_2, ..., w_d]^T (w_0: bias). If f(x; w) = w^T x ≥ 0 then decide C_1, else C_2. Decision boundary: f(x; w) = 0. Here f(x; w) predicts discrete class labels.
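A minimal NumPy sketch of this decision rule (function names are mine; the weight vector reuses the illustrative w = [-3, 0.75, 1] from the following slide):

import numpy as np

# Illustrative weights w = [w_0, w_1, w_2]; w_0 is the bias.
w = np.array([-3.0, 0.75, 1.0])

def discriminant(x, w):
    # f(x; w) = w^T x, with x augmented by a leading 1 for the bias term.
    return w @ np.concatenate(([1.0], x))

def classify(x, w):
    # Decide C_1 if f(x; w) >= 0, otherwise C_2.
    return "C1" if discriminant(x, w) >= 0 else "C2"

print(classify(np.array([4.0, 2.0]), w))  # f = 2.0   >= 0 -> C1
print(classify(np.array([1.0, 1.0]), w))  # f = -1.25 <  0 -> C2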
Decision boundary (linear): -3 + (3/4) x_1 + x_2 = 0, i.e., w = [-3, 0.75, 1]^T. If w^T x ≥ 0 then y = 1, else y = 0. [Figure: the decision line in the (x_1, x_2) plane.]
Non-linear decision boundary: choose non-linear features; the classifier is still linear in the parameters w. Example: the boundary -1 + x_1^2 + x_2^2 = 0 with x = (x_1, x_2), φ(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]^T and w = [-1, 0, 0, 1, 1, 0]^T. If w^T φ(x) ≥ 0 then y = 1, else y = 0.
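As an illustration, the circular boundary above can be handled by the same linear machinery (a sketch; the feature map and weights are the ones shown above):

import numpy as np

def phi(x):
    # Quadratic feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2].
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # boundary: -1 + x1^2 + x2^2 = 0

def classify(x, w):
    return 1 if w @ phi(x) >= 0 else 0

print(classify((0.2, 0.3), w))  # inside the unit circle  -> 0
print(classify((1.5, 0.0), w))  # outside the unit circle -> 1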
SSE cost function, two classes: J(w) = ||Xw - y||^2, where X is the n x (d+1) matrix whose i-th row is (1, x_1^(i), ..., x_d^(i)) and w = (w_0, w_1, ..., w_d)^T. Is it suitable for classification?
Least squares vs. logistic regression (we will see logistic regression in the next slides). Least squares penalizes predictions that are "too correct" (points far on the correct side of the boundary) and also lacks robustness to noise. [Figure: least squares vs. logistic regression decision boundaries.]
SSE cost function with a sigmoid, two classes: J(w) = ||g(Xw) - y||^2, with g a sigmoid function applied elementwise and X, w as before. Can the sigmoid solve the problem?
Multi-class classification. [Figure: two-class vs. multi-class decision regions in the (x_1, x_2) plane.]
Multi-class classification: one-vs-all (one-vs-rest). [Figure: one binary classifier per class (class 1, class 2, class 3), each separating its class from the rest.]
Multi-class approaches: one-vs-all, all-vs-all.
Multiple classes: y ∈ {1, 2, ..., K}. If f_i(x) > f_j(x) for all j ≠ i, then decide C_i.
SSE cost function, multiple classes: J(W) = Tr{(XW - Y)^T (XW - Y)}, where X is as before, W = [w_1 ... w_K], and Y is the n x K matrix whose rows are the 1-of-K target vectors.
Least squares vs. logistic regression. [Figure: decision boundaries obtained by least squares and by logistic regression.]
Logistic regression. More general than discriminant functions: f(x; w) predicts the posterior probability P(y = 1 | x).
Logistic regression (cont'd). f(x; w) = g(w^T x), where g(.) is an activation function. Sigmoid (logistic) activation: g(z) = 1 / (1 + e^(-z)).
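A quick numerical check of the logistic function (a sketch):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z)); squashes any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))  # ~0.0067, 0.5, ~0.9933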
Logistic regression (cont'd). 0 ≤ f(x; w) ≤ 1 is the estimated probability that y = 1 on input x (with y = 0 or y = 1): f(x; w) = P(y = 1 | x, w), and P(y = 0 | x, w) = 1 - P(y = 1 | x, w).
Logistic regression (cont'd). Example: cancer diagnosis (malignant vs. benign). f(x; w) = 0.7 means a 70% chance of the tumor being malignant.
Logistic regression: decision surface. A decision surface is f(x; w) = constant, i.e., g(w^T x) = 1 / (1 + e^(-w^T x)) = constant, so decision surfaces are linear functions of x. Logistic regression is a generalized linear model, with more complex analytical and computational properties than linear regression.
Logistic regression: decision surface (cont'd). If f(x; w) ≥ 0.5 then y = 1, else y = 0; equivalently, if w^T x ≥ 0 then y = 1, else y = 0.
Logistic regression: cost function. Is SSE a proper cost function for classification? J(w) = (1/n) Σ_{i=1}^n (y^(i) - f(x^(i); w))^2. Is it convex? Is it appropriate for a classification problem? Are the conditional distributions p(y | x, w) in classification Gaussian?
Logistic regression: loss function. Loss(y, f(x; w)) = -log f(x; w) if y = 1, and -log(1 - f(x; w)) if y = 0. How is it related to the 0-1 loss, Loss(ŷ, y) = 1 if ŷ ≠ y and 0 if ŷ = y?
Logistic regression: loss function. For y = 1 or y = 0, Loss(y, f(x; w)) = -y log f(x; w) - (1 - y) log(1 - f(x; w)), where f(x; w) = 1 / (1 + exp(-w^T x)).
Logistic regression: cost function. Loss(y, f(x; w)) = -log p(y | f(x; w)) when assuming a Bernoulli distribution for p(y | w, x): p(y | w, x) = f(x; w)^y (1 - f(x; w))^(1-y). Maximum (conditional) log-likelihood: log p(D_y | D_x, w) = Σ_{i=1}^n log p(y^(i) | w, x^(i)).
Logistic regression: cost function. J(w) = (1/n) Σ_{i=1}^n Loss(y^(i), f(x^(i); w)) = -(1/n) Σ_{i=1}^n [ y^(i) log f(x^(i); w) + (1 - y^(i)) log(1 - f(x^(i); w)) ]. There is no closed-form solution for min_w J(w); however, J(w) is convex.
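A NumPy sketch of this cross-entropy cost (variable names are mine; a small constant guards against log(0)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(w, X, y):
    # J(w) = -(1/n) sum_i [ y_i log f(x_i; w) + (1 - y_i) log(1 - f(x_i; w)) ].
    # X is n x (d+1) with a leading column of ones; labels y are in {0, 1}.
    f = sigmoid(X @ w)
    eps = 1e-12  # avoid log(0)
    return -np.mean(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))

# Tiny example: at w = 0 every prediction is 0.5, so the cost is log 2 ~= 0.693.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(cross_entropy_cost(np.zeros(2), X, y))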
Gradient descent: w^(t+1) = w^t - η ∇_w J(w^t), with ∇_w J(w) = (1/n) Σ_{i=1}^n (f(x^(i); w) - y^(i)) x^(i). Note the similarity to the gradient of the SSE cost for linear regression.
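A minimal batch gradient-descent loop implementing this update (the learning rate, iteration count, and toy data are arbitrary choices, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    # Gradient descent: w <- w - eta * (1/n) * X^T (f(X; w) - y).
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / n
        w -= eta * grad
    return w

# Toy 1-D data with an intercept column.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic_regression(X, y)
print(sigmoid(X @ w))  # predicted probabilities increase with the feature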
Logistic regression example. [Figure adapted from Jaakkola's slides.]
Generalization to multi-class. Train a logistic regression classifier f_i(x; w) for each class i; f_i(x; w) predicts the probability of y = i, i.e., P(y = i | x, w). On a new input x, pick the class that maximizes f_i(x; w): ŷ(x) = argmax_i f_i(x; w).
Logistic regression, multi-class (K > 2): p(y = k | x) = exp(w_k^T x) / Σ_{j=1}^K exp(w_j^T x), the normalized exponential (aka softmax). If w_k^T x ≫ w_j^T x for all j ≠ k, then p(C_k | x) ≈ 1 and p(C_j | x) ≈ 0. Compare with Bayes' rule: p(C_k | x) = p(x | C_k) p(C_k) / Σ_{j=1}^K p(x | C_j) p(C_j).
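A sketch of the softmax posterior (the weights are hypothetical; subtracting the maximum score is a standard numerical-stability trick that leaves the probabilities unchanged):

import numpy as np

def softmax_posterior(W, x):
    # p(y = k | x) = exp(w_k^T x) / sum_j exp(w_j^T x); W has one row w_k per class.
    scores = W @ x
    scores = scores - scores.max()  # stability; ratios are unchanged
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical weights for 3 classes
x = np.array([2.0, 0.5])
p = softmax_posterior(W, x)
print(p, p.sum())  # probabilities over the 3 classes, summing to 1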
Logistic Regression (LR): summary. LR is a linear classifier. The LR optimization problem is obtained by maximum likelihood when assuming a Bernoulli distribution for the conditional probabilities. There is no closed-form solution, but the cost function is convex, so gradient descent (equivalently, gradient ascent on the log-likelihood) reaches the global optimum.
Perceptron algorithm: a linear discriminant model. Two-class: y ∈ {-1, 1}, with y = -1 for C_2 and y = 1 for C_1. Goal: for all i, x^(i) ∈ C_1 ⇒ w^T x^(i) > 0 and x^(i) ∈ C_2 ⇒ w^T x^(i) < 0. Model: f(x; w) = g(w^T x), where g(z) = -1 for z < 0 and g(z) = 1 for z ≥ 0.
Perceptron architecture. [Figure: perceptron architecture producing f(x).]
Perceptron criterion: J_P(w) = -Σ_{i∈M} w^T x^(i) y^(i), where M is the subset of training data that are misclassified. Assumption: the classes are linearly separable.
Some classification criteria. [Figure: the misclassification-count criterion J(w) and the perceptron criterion J_P(w) as functions of (w_0, w_1); source: Duda's book.]
Some classification criteria. [Figure: the perceptron criterion J_P(w) and the squared-error criterion J_q(w); source: Duda's book.]
Stochastic gradient descent for the perceptron: w^(t+1) = w^t - η ∇_w J_P(w^t), where ∇_w J_P(w) = -Σ_{i∈M} x^(i) y^(i). Online perceptron: if x^(i) is misclassified, w^(t+1) = w^t + η x^(i) y^(i). Perceptron convergence theorem: the algorithm converges for linearly separable data. There may be many solutions; which one do we end up with?
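A sketch of the online perceptron update with η = 1 and labels in {-1, +1} (the toy data are mine and linearly separable, as the convergence theorem assumes):

import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    # Online perceptron: whenever y_i * (w^T x_i) <= 0, set w <- w + eta * y_i * x_i.
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # misclassified (or on the boundary)
                w += eta * y_i * x_i
                errors += 1
        if errors == 0:                # converged: every point classified correctly
            break
    return w

# Toy separable data; the first column is the constant bias input.
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_perceptron(X, y)
print(np.sign(X @ w))  # reproduces y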
Convergence of the perceptron: each update changes w in a direction that corrects the error.
Bayes decision theory. Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x). Posterior probability: p(C_k | x); likelihood: p(x | C_k); prior probability: p(C_k).
Bayes decision theory. Bayes decision: choose the class with the highest p(C_k | x). Cost function: the probability of misclassification, i.e., minimizing the chance of assigning x to the wrong class. This can be extended to minimizing an expected loss instead of the misclassification error.
Minimizing the misclassification rate. R_k: decision regions; all points in R_k are assigned to class C_k. p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx. To minimize it, choose the class with the highest p(C_k | x).
Probability of correct classification (multi-class): p(correct) = Σ_{k=1}^K p(x ∈ R_k, C_k) = Σ_{k=1}^K ∫_{R_k} p(x, C_k) dx, and p(correct) = 1 - p(mistake).
Discriminative vs. generative approach.
Generative approach. Inference stage: determine P(x | C_k) for each class individually, determine P(C_k), and use Bayes' theorem to find P(C_k | x). Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if P(C_i | x) > P(C_j | x) for all j ≠ i, then decide C_i. It models the distribution of inputs as well as outputs, so it can generate synthetic data points.
Discriminative approach. Inference stage: determine the posterior class probabilities P(C_k | x) directly. Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if P(C_i | x) > P(C_j | x) for all j ≠ i, then decide C_i. Example: logistic regression.
Generative approach: example. p(x | C_k) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp{-(1/2)(x - μ_k)^T Σ^(-1) (x - μ_k)} for k = 1, 2, with p(C_1) = p and p(C_2) = 1 - p.
Generative approach: example (cont'd). Same Gaussian class-conditional densities with shared covariance, p(C_1) = p, p(C_2) = 1 - p. Maximum likelihood estimation on D = {(x^(i), y^(i))}_{i=1}^n: p = n_1 / n; μ_1 = Σ_{i=1}^n y^(i) x^(i) / n_1, μ_2 = Σ_{i=1}^n (1 - y^(i)) x^(i) / n_2; Σ = (n_1/n) S_1 + (n_2/n) S_2, where S_k = (1/n_k) Σ_{x^(i) ∈ C_k} (x^(i) - μ_k)(x^(i) - μ_k)^T for k = 1, 2.
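A NumPy sketch of these maximum-likelihood estimates for the shared-covariance Gaussian model (the function name and synthetic data are mine):

import numpy as np

def fit_shared_gaussian(X, y):
    # ML estimates for the two-class Gaussian model with shared covariance.
    # Returns the prior p = p(C1), class means mu1, mu2, and pooled covariance Sigma.
    n = len(y)
    X1, X2 = X[y == 1], X[y == 0]
    n1, n2 = len(X1), len(X2)
    p = n1 / n
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / n1
    S2 = (X2 - mu2).T @ (X2 - mu2) / n2
    Sigma = (n1 / n) * S1 + (n2 / n) * S2
    return p, mu1, mu2, Sigma

# Toy data drawn from two Gaussians with the same covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, size=(50, 2)),
               rng.normal([-2, -2], 1.0, size=(50, 2))])
y = np.concatenate([np.ones(50), np.zeros(50)])
p, mu1, mu2, Sigma = fit_shared_gaussian(X, y)
print(p, mu1, mu2)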
Class-conditional densities vs. posterior. [Figure.]
Maximum likelihood on what? Generative models: the data likelihood P(X, Y | w); they also compute P(X | w). Discriminative models: the conditional data likelihood; they learn P(Y | X, w), which is what classification requires, and do not waste effort learning P(X | w).
Discriminative vs. generative: number of parameters for a d-dimensional feature space. Logistic regression: d + 1 parameters, w = (w_0, w_1, ..., w_d). Generative approach with Gaussian class-conditional densities and a shared covariance matrix: 2d parameters for the means, d(d + 1)/2 parameters for the shared covariance matrix, and one parameter for the class prior p(C_1).
Naïve Bayes classifier. Generative methods can require a high number of parameters. Naïve Bayes makes a conditional independence assumption: p(x | C_k) = p(x_1 | C_k) p(x_2 | C_k) ... p(x_d | C_k).
Naïve Bayes classifier. With p(x | C_k) = p(x_1 | C_k) p(x_2 | C_k) ... p(x_d | C_k), the posterior satisfies p(C_k | x) ∝ p(C_k) Π_{i=1}^d p(x_i | C_k). Two-class Gaussian naïve Bayes: 2d + 1 parameters.
Naïve Bayes: discrete example (PlayTennis). p(PlayTennis = Yes) = 9/14 = 0.64, p(PlayTennis = No) = 5/14 = 0.36; p(Outlook = Sunny | PlayTennis = Yes) = 2/9 ≈ 0.22, p(Outlook = Sunny | PlayTennis = No) = 3/5 = 0.6; and similarly for the other attribute values.
Naïve Bayes: discrete example (cont'd). For x = (Sunny, Cool, High, Strong): p(Yes | x) ∝ p(Yes) p(Sunny | Yes) p(Cool | Yes) p(High | Yes) p(Strong | Yes) = 0.0053, and p(No | x) ∝ p(No) p(Sunny | No) p(Cool | No) p(High | No) p(Strong | No) = 0.0206, so naïve Bayes predicts PlayTennis = No.
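A short Python sketch reproducing this calculation; the conditional probabilities not shown on the previous slide are the standard PlayTennis estimates from Mitchell's textbook example:

# Naive Bayes score for x = (Sunny, Cool, High, Strong):
# prior times the product of per-feature conditionals, p(C) * prod_i p(x_i | C).
prior = {"Yes": 9 / 14, "No": 5 / 14}
cond = {
    "Yes": {"Sunny": 2 / 9, "Cool": 3 / 9, "High": 3 / 9, "Strong": 3 / 9},
    "No":  {"Sunny": 3 / 5, "Cool": 1 / 5, "High": 4 / 5, "Strong": 3 / 5},
}
x = ["Sunny", "Cool", "High", "Strong"]

scores = {}
for c in ("Yes", "No"):
    s = prior[c]
    for value in x:
        s *= cond[c][value]
    scores[c] = s

print(scores)                       # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))  # 'No' -> predict PlayTennis = No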
Bayes optimal classifier. Training data: D = {(x^(i), y^(i))}_{i=1}^n; H: hypothesis space. p(C_k | x, D) = Σ_{h∈H} p(C_k | x, h) p(h | D), where p(h | D) ∝ p(D | h) p(h) and p(D | h) = Π_{i=1}^n p(y^(i) | x^(i), h).
Summary of alternatives. Generative: the most demanding, because it finds the joint distribution p(x, C_k); it usually needs a large training set to estimate p(x | C_k), but it can find p(x), which is useful for outlier or novelty detection. Discriminative: specifies only what is really needed (i.e., p(C_k | x)) and is more computationally efficient. Discriminant function: does not offer many of the capabilities of probabilistic methods.