Classification. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2012.
Topics: discriminant functions, logistic regression, perceptron, generative models, generative vs. discriminative, naïve Bayes.
Classification problem. Given: a training set, i.e., a labeled set of N input-output pairs D = {(x^(i), y^(i))}_{i=1}^N with y ∈ {1, ..., K}. Goal: given an input x, assign it to one of K classes. Examples: spam filtering, handwritten digit recognition.
Types of classifiers. A discriminant function f(x) directly outputs the class of the input x. Probabilistic classification approaches can be divided into two main categories: generative and discriminative.
Target coding scheme. Target values: for binary classification, a target variable y ∈ {0, 1}; for multiple classes (K > 2), y is a vector of length K (1-of-K coding). Target class C_j: y_j = 1 and y_i = 0 for i ≠ j, e.g., y = (0, 0, 1, 0)^T.
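A minimal sketch of the 1-of-K coding in Python (the helper name is mine, not from the slides):

import numpy as np

def one_of_k(label, K):
    # Encode class label j in {1, ..., K} as a length-K vector with y_j = 1.
    y = np.zeros(K)
    y[label - 1] = 1.0
    return y

print(one_of_k(3, 4))  # [0. 0. 1. 0.], matching the example y = (0, 0, 1, 0)^T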
Linear classifiers. Decision boundaries are linear functions: a (d-1)-dimensional hyperplane within the d-dimensional input space. Linearly separable: data points can be exactly classified by linear decision surfaces. We start with binary linear classification.
Discriminant functions. f(x; w) = w^T x, where x = [1, x_1, x_2, ..., x_d]^T and w = [w_0, w_1, w_2, ..., w_d]^T (w_0: bias). If f(x; w) = w^T x ≥ 0 then decide C_1, else C_2. Decision boundary: f(x; w) = 0. Here f(x; w) predicts discrete class labels.
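A minimal NumPy sketch of this decision rule (function names are mine; the weight vector reuses the illustrative w = [-3, 0.75, 1] from the following slide):

import numpy as np

# Illustrative weights w = [w_0, w_1, w_2]; w_0 is the bias.
w = np.array([-3.0, 0.75, 1.0])

def discriminant(x, w):
    # f(x; w) = w^T x, with x augmented by a leading 1 for the bias term.
    return w @ np.concatenate(([1.0], x))

def classify(x, w):
    # Decide C_1 if f(x; w) >= 0, otherwise C_2.
    return "C1" if discriminant(x, w) >= 0 else "C2"

print(classify(np.array([4.0, 2.0]), w))  # f = 2.0   >= 0 -> C1
print(classify(np.array([1.0, 1.0]), w))  # f = -1.25 <  0 -> C2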
Decision boundary (linear): -3 + (3/4) x_1 + x_2 = 0, i.e., w = [-3, 0.75, 1]^T. If w^T x ≥ 0 then y = 1, else y = 0. [Figure: the decision line in the (x_1, x_2) plane.]
Non-linear decision boundary: choose non-linear features; the classifier is still linear in the parameters w. Example: the boundary -1 + x_1^2 + x_2^2 = 0 with x = (x_1, x_2), φ(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]^T and w = [-1, 0, 0, 1, 1, 0]^T. If w^T φ(x) ≥ 0 then y = 1, else y = 0.
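As an illustration, the circular boundary above can be handled by the same linear machinery (a sketch; the feature map and weights are the ones shown above):

import numpy as np

def phi(x):
    # Quadratic feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2].
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # boundary: -1 + x1^2 + x2^2 = 0

def classify(x, w):
    return 1 if w @ phi(x) >= 0 else 0

print(classify((0.2, 0.3), w))  # inside the unit circle  -> 0
print(classify((1.5, 0.0), w))  # outside the unit circle -> 1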
SSE cost function, two classes: J(w) = ||Xw - y||^2, where X is the n x (d+1) matrix whose i-th row is (1, x_1^(i), ..., x_d^(i)) and w = (w_0, w_1, ..., w_d)^T. Is it suitable for classification?
Least squares vs. logistic regression (we will see logistic regression in the next slides). Least squares penalizes predictions that are "too correct" (points far on the correct side of the boundary) and also lacks robustness to noise. [Figure: least squares vs. logistic regression decision boundaries.]
SSE cost function with a sigmoid, two classes: J(w) = ||g(Xw) - y||^2, with g a sigmoid function applied elementwise and X, w as before. Can the sigmoid solve the problem?
Multi-class classification. [Figure: two-class vs. multi-class decision regions in the (x_1, x_2) plane.]
Multi-class classification: one-vs-all (one-vs-rest). [Figure: one binary classifier per class (class 1, class 2, class 3), each separating its class from the rest.]
Multi-class approaches: one-vs-all, all-vs-all.
Multiple classes: y ∈ {1, 2, ..., K}. If f_i(x) > f_j(x) for all j ≠ i, then decide C_i.
SSE cost function, multiple classes: J(W) = Tr{(XW - Y)^T (XW - Y)}, where X is as before, W = [w_1 ... w_K], and Y is the n x K matrix whose rows are the 1-of-K target vectors.
Least squares vs. logistic regression. [Figure: decision boundaries obtained by least squares and by logistic regression.]
Logistic regression. More general than discriminant functions: f(x; w) predicts the posterior probability P(y = 1 | x).
Logistic regression (cont'd). f(x; w) = g(w^T x), where g(.) is an activation function. Sigmoid (logistic) activation: g(z) = 1 / (1 + e^(-z)).
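A quick numerical check of the logistic function (a sketch):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z)); squashes any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))  # ~0.0067, 0.5, ~0.9933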
Logistic regression (cont'd). 0 ≤ f(x; w) ≤ 1 is the estimated probability that y = 1 on input x (with y = 0 or y = 1): f(x; w) = P(y = 1 | x, w), and P(y = 0 | x, w) = 1 - P(y = 1 | x, w).
Logistic regression (cont'd). Example: cancer diagnosis (malignant vs. benign). f(x; w) = 0.7 means a 70% chance of the tumor being malignant.
Logistic regression: decision surface. A decision surface is f(x; w) = constant, i.e., g(w^T x) = 1 / (1 + e^(-w^T x)) = constant, so decision surfaces are linear functions of x. Logistic regression is a generalized linear model, with more complex analytical and computational properties than linear regression.
Logistic regression: decision surface (cont'd). If f(x; w) ≥ 0.5 then y = 1, else y = 0; equivalently, if w^T x ≥ 0 then y = 1, else y = 0.
Logistic regression: cost function. Is SSE a proper cost function for classification? J(w) = (1/n) Σ_{i=1}^n (y^(i) - f(x^(i); w))^2. Is it convex? Is it appropriate for a classification problem? Are the conditional distributions p(y | x, w) in classification Gaussian?
Logistic regression: loss function. Loss(y, f(x; w)) = -log f(x; w) if y = 1, and -log(1 - f(x; w)) if y = 0. How is it related to the 0-1 loss, Loss(ŷ, y) = 1 if ŷ ≠ y and 0 if ŷ = y?
Logistic regression: loss function. For y = 1 or y = 0, Loss(y, f(x; w)) = -y log f(x; w) - (1 - y) log(1 - f(x; w)), where f(x; w) = 1 / (1 + exp(-w^T x)).
Logistic regression: cost function. Loss(y, f(x; w)) = -log p(y | f(x; w)) when assuming a Bernoulli distribution for p(y | w, x): p(y | w, x) = f(x; w)^y (1 - f(x; w))^(1-y). Maximum (conditional) log-likelihood: log p(D_y | D_x, w) = Σ_{i=1}^n log p(y^(i) | w, x^(i)).
Logistic regression: cost function. J(w) = (1/n) Σ_{i=1}^n Loss(y^(i), f(x^(i); w)) = -(1/n) Σ_{i=1}^n [ y^(i) log f(x^(i); w) + (1 - y^(i)) log(1 - f(x^(i); w)) ]. There is no closed-form solution for min_w J(w); however, J(w) is convex.
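A NumPy sketch of this cross-entropy cost (variable names are mine; a small constant guards against log(0)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(w, X, y):
    # J(w) = -(1/n) sum_i [ y_i log f(x_i; w) + (1 - y_i) log(1 - f(x_i; w)) ].
    # X is n x (d+1) with a leading column of ones; labels y are in {0, 1}.
    f = sigmoid(X @ w)
    eps = 1e-12  # avoid log(0)
    return -np.mean(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))

# Tiny example: at w = 0 every prediction is 0.5, so the cost is log 2 ~= 0.693.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(cross_entropy_cost(np.zeros(2), X, y))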
Gradient descent: w^(t+1) = w^t - η ∇_w J(w^t), with ∇_w J(w) = (1/n) Σ_{i=1}^n (f(x^(i); w) - y^(i)) x^(i). Note the similarity to the gradient of the SSE cost for linear regression.
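A minimal batch gradient-descent loop implementing this update (the learning rate, iteration count, and toy data are arbitrary choices, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    # Gradient descent: w <- w - eta * (1/n) * X^T (f(X; w) - y).
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / n
        w -= eta * grad
    return w

# Toy 1-D data with an intercept column.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic_regression(X, y)
print(sigmoid(X @ w))  # predicted probabilities increase with the feature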
Logistic regression example. [Figure adapted from Jaakkola's slides.]
Generalization to multi-class. Train a logistic regression classifier f_i(x; w) for each class i; f_i(x; w) predicts the probability of y = i, i.e., P(y = i | x, w). On a new input x, pick the class that maximizes f_i(x; w): ŷ(x) = argmax_i f_i(x; w).
Logistic regression, multi-class (K > 2): p(y = k | x) = exp(w_k^T x) / Σ_{j=1}^K exp(w_j^T x), the normalized exponential (aka softmax). If w_k^T x ≫ w_j^T x for all j ≠ k, then p(C_k | x) ≈ 1 and p(C_j | x) ≈ 0. Compare with Bayes' rule: p(C_k | x) = p(x | C_k) p(C_k) / Σ_{j=1}^K p(x | C_j) p(C_j).
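A sketch of the softmax posterior (the weights are hypothetical; subtracting the maximum score is a standard numerical-stability trick that leaves the probabilities unchanged):

import numpy as np

def softmax_posterior(W, x):
    # p(y = k | x) = exp(w_k^T x) / sum_j exp(w_j^T x); W has one row w_k per class.
    scores = W @ x
    scores = scores - scores.max()  # stability; ratios are unchanged
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical weights for 3 classes
x = np.array([2.0, 0.5])
p = softmax_posterior(W, x)
print(p, p.sum())  # probabilities over the 3 classes, summing to 1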
Logistic Regression (LR): summary. LR is a linear classifier. The LR optimization problem is obtained by maximum likelihood when assuming a Bernoulli distribution for the conditional probabilities. There is no closed-form solution, but the cost function is convex, so gradient descent (equivalently, gradient ascent on the log-likelihood) reaches the global optimum.
Perceptron algorithm: a linear discriminant model. Two-class: y ∈ {-1, 1}, with y = -1 for C_2 and y = 1 for C_1. Goal: for all i, x^(i) ∈ C_1 ⇒ w^T x^(i) > 0 and x^(i) ∈ C_2 ⇒ w^T x^(i) < 0. Model: f(x; w) = g(w^T x), where g(z) = -1 for z < 0 and g(z) = 1 for z ≥ 0.
Perceptron architecture. [Figure: perceptron architecture producing f(x).]
Perceptron criterion: J_P(w) = -Σ_{i∈M} w^T x^(i) y^(i), where M is the subset of training data that are misclassified. Assumption: the classes are linearly separable.
Some classification criteria. [Figure: the misclassification-count criterion J(w) and the perceptron criterion J_P(w) as functions of (w_0, w_1); source: Duda's book.]
Some classification criteria. [Figure: the perceptron criterion J_P(w) and the squared-error criterion J_q(w); source: Duda's book.]
Stochastic gradient descent for the perceptron: w^(t+1) = w^t - η ∇_w J_P(w^t), where ∇_w J_P(w) = -Σ_{i∈M} x^(i) y^(i). Online perceptron: if x^(i) is misclassified, w^(t+1) = w^t + η x^(i) y^(i). Perceptron convergence theorem: the algorithm converges for linearly separable data. There may be many solutions; which one do we end up with?
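A sketch of the online perceptron update with η = 1 and labels in {-1, +1} (the toy data are mine and linearly separable, as the convergence theorem assumes):

import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    # Online perceptron: whenever y_i * (w^T x_i) <= 0, set w <- w + eta * y_i * x_i.
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # misclassified (or on the boundary)
                w += eta * y_i * x_i
                errors += 1
        if errors == 0:                # converged: every point classified correctly
            break
    return w

# Toy separable data; the first column is the constant bias input.
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_perceptron(X, y)
print(np.sign(X @ w))  # reproduces y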
Convergence of the perceptron: each update changes w in a direction that corrects the error.
Bayes decision theory. Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x). Posterior probability: p(C_k | x); likelihood: p(x | C_k); prior probability: p(C_k).
Bayes decision theory. Bayes decision: choose the class with the highest p(C_k | x). Cost function: the probability of misclassification, i.e., minimizing the chance of assigning x to the wrong class. This can be extended to minimizing an expected loss instead of the misclassification error.
Minimizing the misclassification rate. R_k: decision regions; all points in R_k are assigned to class C_k. p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx. To minimize it, choose the class with the highest p(C_k | x).
Probability of correct classification (multi-class): p(correct) = Σ_{k=1}^K p(x ∈ R_k, C_k) = Σ_{k=1}^K ∫_{R_k} p(x, C_k) dx, and p(correct) = 1 - p(mistake).
Discriminative vs. generative approach.
Generative approach. Inference stage: determine P(x | C_k) for each class individually, determine P(C_k), and use Bayes' theorem to find P(C_k | x). Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if P(C_i | x) > P(C_j | x) for all j ≠ i, then decide C_i. It models the distribution of inputs as well as outputs, so it can generate synthetic data points.
Discriminative approach. Inference stage: determine the posterior class probabilities P(C_k | x) directly. Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if P(C_i | x) > P(C_j | x) for all j ≠ i, then decide C_i. Example: logistic regression.
Generative approach: example. p(x | C_k) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp{-(1/2)(x - μ_k)^T Σ^(-1) (x - μ_k)} for k = 1, 2, with p(C_1) = p and p(C_2) = 1 - p.
Generative approach: example (cont'd). Same Gaussian class-conditional densities with shared covariance, p(C_1) = p, p(C_2) = 1 - p. Maximum likelihood estimation on D = {(x^(i), y^(i))}_{i=1}^n: p = n_1 / n; μ_1 = Σ_{i=1}^n y^(i) x^(i) / n_1, μ_2 = Σ_{i=1}^n (1 - y^(i)) x^(i) / n_2; Σ = (n_1/n) S_1 + (n_2/n) S_2, where S_k = (1/n_k) Σ_{x^(i) ∈ C_k} (x^(i) - μ_k)(x^(i) - μ_k)^T for k = 1, 2.
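A NumPy sketch of these maximum-likelihood estimates for the shared-covariance Gaussian model (the function name and synthetic data are mine):

import numpy as np

def fit_shared_gaussian(X, y):
    # ML estimates for the two-class Gaussian model with shared covariance.
    # Returns the prior p = p(C1), class means mu1, mu2, and pooled covariance Sigma.
    n = len(y)
    X1, X2 = X[y == 1], X[y == 0]
    n1, n2 = len(X1), len(X2)
    p = n1 / n
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / n1
    S2 = (X2 - mu2).T @ (X2 - mu2) / n2
    Sigma = (n1 / n) * S1 + (n2 / n) * S2
    return p, mu1, mu2, Sigma

# Toy data drawn from two Gaussians with the same covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, size=(50, 2)),
               rng.normal([-2, -2], 1.0, size=(50, 2))])
y = np.concatenate([np.ones(50), np.zeros(50)])
p, mu1, mu2, Sigma = fit_shared_gaussian(X, y)
print(p, mu1, mu2)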
Class-conditional densities vs. posterior. [Figure.]
Maximum likelihood on what? Generative models: the data likelihood P(X, Y | w); they also compute P(X | w). Discriminative models: the conditional data likelihood; they learn P(Y | X, w), which is what classification requires, and do not waste effort learning P(X | w).
Discriminative vs. generative: number of parameters for a d-dimensional feature space. Logistic regression: d + 1 parameters, w = (w_0, w_1, ..., w_d). Generative approach with Gaussian class-conditional densities and a shared covariance matrix: 2d parameters for the means, d(d + 1)/2 parameters for the shared covariance matrix, and one parameter for the class prior p(C_1).
Naïve Bayes classifier. Generative methods can require a high number of parameters. Naïve Bayes makes a conditional independence assumption: p(x | C_k) = p(x_1 | C_k) p(x_2 | C_k) ... p(x_d | C_k).
Naïve Bayes classifier. With p(x | C_k) = p(x_1 | C_k) p(x_2 | C_k) ... p(x_d | C_k), the posterior satisfies p(C_k | x) ∝ p(C_k) Π_{i=1}^d p(x_i | C_k). Two-class Gaussian naïve Bayes: 2d + 1 parameters.
Naïve Bayes: discrete example (PlayTennis). p(PlayTennis = Yes) = 9/14 = 0.64, p(PlayTennis = No) = 5/14 = 0.36; p(Outlook = Sunny | PlayTennis = Yes) = 2/9 ≈ 0.22, p(Outlook = Sunny | PlayTennis = No) = 3/5 = 0.6; and similarly for the other attribute values.
Naïve Bayes: discrete example (cont'd). For x = (Sunny, Cool, High, Strong): p(Yes | x) ∝ p(Yes) p(Sunny | Yes) p(Cool | Yes) p(High | Yes) p(Strong | Yes) = 0.0053, and p(No | x) ∝ p(No) p(Sunny | No) p(Cool | No) p(High | No) p(Strong | No) = 0.0206, so naïve Bayes predicts PlayTennis = No.
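A short Python sketch reproducing this calculation; the conditional probabilities not shown on the previous slide are the standard PlayTennis estimates from Mitchell's textbook example:

# Naive Bayes score for x = (Sunny, Cool, High, Strong):
# prior times the product of per-feature conditionals, p(C) * prod_i p(x_i | C).
prior = {"Yes": 9 / 14, "No": 5 / 14}
cond = {
    "Yes": {"Sunny": 2 / 9, "Cool": 3 / 9, "High": 3 / 9, "Strong": 3 / 9},
    "No":  {"Sunny": 3 / 5, "Cool": 1 / 5, "High": 4 / 5, "Strong": 3 / 5},
}
x = ["Sunny", "Cool", "High", "Strong"]

scores = {}
for c in ("Yes", "No"):
    s = prior[c]
    for value in x:
        s *= cond[c][value]
    scores[c] = s

print(scores)                       # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))  # 'No' -> predict PlayTennis = No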
Bayes optimal classifier. Training data: D = {(x^(i), y^(i))}_{i=1}^n; H: hypothesis space. p(C_k | x, D) = Σ_{h∈H} p(C_k | x, h) p(h | D), where p(h | D) ∝ p(D | h) p(h) and p(D | h) = Π_{i=1}^n p(y^(i) | x^(i), h).
Summary of alternatives. Generative: the most demanding, because it finds the joint distribution p(x, C_k); it usually needs a large training set to estimate p(x | C_k), but it can find p(x), which is useful for outlier or novelty detection. Discriminative: specifies only what is really needed (i.e., p(C_k | x)) and is more computationally efficient. Discriminant function: does not offer many of the capabilities of probabilistic methods.