LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning


Linear vs non-linear classifiers: in k-NN we saw an example of a non-linear classifier: the decision boundary, i.e., the line that separates the three classes (x, circle, diamond) from each other, is not a straight line. In contrast, for linear classifiers the decision boundary is a straight line (or a hyperplane). *http://nlp.stanford.edu/

Classification problem: based on a labeled set of data points (X, y), learn a function f: X → y in order to predict the label (class) of new, unseen data points. The input feature vectors x may be numeric, but the target variable y is categorical. Binary classification: the target variable y can take only 2 values, representing the possible classes. Multiclass classification: the target variable y can take more than 2 values (but a finite number of them).

Linear classifiers: what is the class of the unknown record (the bullet point in the figure)? The separating line will tell us! Linearly separable sets: there exists a hyperplane (here, a line) that correctly classifies all points.

Binary classification: two classes to choose from (yes or no, negative or positive). Input: training data, n-dimensional vectors x (or points). Process: find a linear function f(x) = w_1 x_1 + ... + w_n x_n + w_0 = w^T x (with x augmented by a constant component x_0 = 1). The sign (+ or −) of f(x) shows the class of a newly given point, so the predicted class y′ of a new point x′ is: y′ = +1 if f(x′) > 0, y′ = −1 if f(x′) < 0. For f(x) = 0 we get the boundary hyperplane.
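A minimal sketch of this decision rule in Python (NumPy assumed; the weights and test point below are made-up values for illustration only):

import numpy as np

w = np.array([0.4, -0.7])   # illustrative weights w_1, w_2
w0 = 0.1                    # illustrative intercept w_0

def predict(x):
    """Return +1 or -1 according to the sign of f(x) = w^T x + w_0."""
    f = w @ x + w0
    return 1 if f > 0 else -1

print(predict(np.array([2.0, 0.5])))   # class depends on which side of the line x falls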

Binary classification: the decision boundary w^T x = 0 separates the region where w^T x > 0 (class 1) from the region where w^T x < 0 (class -1).

Learning the classification boundary: there are many algorithms, which differ in how they measure whether the model fits the training data well (typically via a loss function) and in whether they use regularization. We are going to see the following: the simple Perceptron, Logistic Regression, the Support Vector Classifier (SVC) (not only the linear case), and Naive Bayes.

The simple Perceptron algorithm computes a vector w that linearly separates the input points (with no guarantee that the optimal solution will be found). It always converges if the points are linearly separable; many different lines are possible depending on the starting point. A small learning rate lowers the probability of misclassifying correct points. Rosenblatt algorithm: 1. Randomly initialize w. 2. Set the learning rate η to a value between 0 and 1. 3. Repeat until all points are correctly classified: 4. For each point m in the set of misclassified points M: 5. w ← w + η y_m x_m.
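A minimal NumPy sketch of the Rosenblatt update above (the toy linearly separable data are made up for illustration; the bias is handled by appending a constant 1 to each point):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
X = np.hstack([X, np.ones((40, 1))])           # append x_0 = 1 so w_0 is learned too
y = np.array([-1] * 20 + [1] * 20)

w = rng.normal(size=3)                         # 1. randomly initialize w
eta = 0.1                                      # 2. learning rate between 0 and 1
while True:                                    # 3. repeat until all points are correct
    misclassified = [m for m in range(len(y)) if y[m] * (w @ X[m]) <= 0]
    if not misclassified:
        break
    for m in misclassified:                    # 4. for each misclassified point
        w = w + eta * y[m] * X[m]              # 5. w <- w + eta * y_m * x_m

print("learned w:", w)
print("remaining training errors:", sum(y[i] * (X[i] @ w) <= 0 for i in range(len(y))))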

Perceptron example

Support Vector Machine: among the many possible separating lines between class 1 and class -1, which one is the best? Intuitively, the one far away from the data points, i.e., the one for which the margin is maximized. Margin: the distance of the closest data point from the decision boundary. (Slides inspired by Jinwei Gu, An Introduction of Support Vector Machine.)

Support Vector Machine: the data points (vectors) that define the margin are called support vectors. They form a small subset of the dataset and are used to define the decision boundary; the rest of the data points do not participate in the decision.

Support Vector Machine: the further a point is from the decision boundary, the more confident we are about its class assignment.

Support Vector Machine: so, picking the green line (the SVM classifier's choice) gives a classification safety margin: a slight error in the measurement will not affect the result. Larger-margin classifier → better generalization ability and noise tolerance.

Support Vector Machine: for example, if we had chosen the blue line, the highlighted point would have been assigned to class 1, whereas with the green line it is correctly classified in class -1.

Support Vector Machine, learning phase: we search for w (the weight vector) and b (the intercept) that define the decision boundary. For the decision hyperplane it holds that w^T x + b = 0, and n = w / ||w|| is the unit-length normal vector of the hyperplane.

Support Vector Machine, learning phase: for the support vectors x_+ and x_- it holds that w^T x_+ + b = +1 and w^T x_- + b = -1. For the margin width it holds that M = 2d = 2 (w^T x_+ + b) / ||w|| = 2 / ||w||, where d is the distance of a support vector from the decision boundary and n = w / ||w|| is the unit-length normal vector of the hyperplane.

Support Vector Machine, learning phase. Optimization problem: maximize the margin width 2 / ||w|| such that w^T x_i + b ≥ +1 for y_i = +1 and w^T x_i + b ≤ -1 for y_i = -1.

Support Vector Machine, learning phase. Optimization problem (alternatively): minimize (1/2) ||w||^2 such that y_i (w^T x_i + b) ≥ 1.

Support Vector Machine, learning phase: this is quadratic programming with linear constraints. Introducing Lagrange multipliers α_i, the problem "minimize (1/2) ||w||^2 such that y_i (w^T x_i + b) ≥ 1" becomes: minimize the Lagrangian L_p(w, b, α) = (1/2) ||w||^2 − Σ_{i=1..n} α_i [ y_i (w^T x_i + b) − 1 ], s.t. α_i ≥ 0.

Support Vector Machine, learning phase: setting the derivatives of L_p to zero gives ∂L_p/∂w = 0 ⇒ w = Σ_{i=1..n} α_i y_i x_i and ∂L_p/∂b = 0 ⇒ Σ_{i=1..n} α_i y_i = 0.

Support Vector Machine, learning phase. Lagrangian dual problem: maximize Σ_{i=1..n} α_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j x_i^T x_j, s.t. α_i ≥ 0 (a property of the α_i introduced as Lagrange multipliers) and Σ_{i=1..n} α_i y_i = 0 (the result of differentiating the original Lagrangian w.r.t. b).

Support Vector Machine, learning phase: from the KKT (Karush–Kuhn–Tucker) conditions we know that α_i [ y_i (w^T x_i + b) − 1 ] = 0. Thus, only support vectors have α_i ≠ 0. The solution has the form w = Σ_{i=1..n} α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i and b = y_i − w^T x_i, where x_i is any support vector.

Support Vector Machine, testing phase: the linear discriminant function is g(x) = w^T x + b = Σ_{i∈SV} α_i y_i x_i^T x + b. If g(x) ≥ 0 then x is in class 1, otherwise it belongs to class -1. Note that g relies on a dot product between the test point x and the support vectors x_i; also keep in mind that solving the optimization problem involved computing the dot products x_i^T x_j between all pairs of training points.
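As a sketch (assuming scikit-learn and NumPy are available, with a toy dataset made up here), g(x) can be recomputed from the fitted support vectors alone and compared with the library's own decision function; scikit-learn's dual_coef_ attribute stores the products α_i y_i:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([0.5, 0.3])
# g(x) = sum_{i in SV} alpha_i y_i x_i^T x + b, with dual_coef_ = alpha_i * y_i
g = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(g, clf.decision_function([x_new])[0])    # the two values agree
print("predicted class:", 1 if g >= 0 else -1)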

Soft margin classification: what if a circle point exists among the dots? Then the data are not linearly separable!

Soft margin classification: allow for small classification errors by introducing slack variables ξ_i. A non-zero value of ξ_i allows x_i to not meet the margin requirement, at a cost proportional to the value of ξ_i.

Soft margin classification, optimization problem: min_{w, ξ_i} ||w||^2 + C Σ_{i=1..N} ξ_i subject to y_i (w^T x_i + b) ≥ 1 − ξ_i for i = 1...N. C is the penalty that we pay for each error.
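A small illustration of the role of C (a sketch assuming scikit-learn, on overlapping toy data made up here): a small C tolerates more margin violations and keeps more support vectors, while a large C penalizes errors more heavily:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# overlapping classes, so some slack xi_i > 0 is unavoidable
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.2f}")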

What about non-linearly separable data? Here, a straight line cannot separate the data well; a circle describes the separation better. What if we introduce a third dimension representing that circle, x_3 = x_1^2 + x_2^2, i.e., apply a non-linear transformation Φ(x)? Now the data in the transformed space defined by (x_1, x_2, x_3) are linearly separable! So, let's use Φ(x) instead of x.
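A sketch of this idea (toy circular data and the helper phi() are made up here; scikit-learn assumed): adding the feature x_3 = x_1^2 + x_2^2 makes the inner disc and the outer ring linearly separable, so a linear classifier succeeds in the transformed space where it fails in the original one:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# inner disc (class -1) and outer ring (class +1)
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([-1] * 100 + [1] * 100)

def phi(X):
    # non-linear transformation Phi(x) = (x1, x2, x1^2 + x2^2)
    return np.column_stack([X, (X ** 2).sum(axis=1)])

linear_orig = SVC(kernel="linear").fit(X, y)
linear_phi = SVC(kernel="linear").fit(phi(X), y)
print("accuracy in original space:   ", linear_orig.score(X, y))
print("accuracy in transformed space:", linear_phi.score(phi(X), y))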

Support Vector Classifier, learning phase (for the original space): minimize L_p(w, b, α) = (1/2) ||w||^2 − Σ_{i=1..n} α_i [ y_i (w^T x_i + b) − 1 ], s.t. α_i ≥ 0. Lagrangian dual problem: maximize Σ_{i=1..n} α_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j x_i^T x_j, s.t. α_i ≥ 0 and Σ_{i=1..n} α_i y_i = 0. The data points participate only as dot products x_i^T x_j, so let's replace them with the transformed data points Φ(x_i)!

Support Vector Machine kernels, learning phase: maximize Σ_{i=1..n} α_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j Φ(x_i)^T Φ(x_j), s.t. α_i ≥ 0 and Σ_{i=1..n} α_i y_i = 0. K(x_i, x_j) = Φ(x_i)^T Φ(x_j) is called the kernel function. Kernel trick: instead of having to transform each data point to the new space, we can directly replace the dot product with the kernel.

Support Vector Machine kernels, testing phase: g(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + b. If g(x) ≥ 0 then x is in class 1, otherwise it belongs to class -1.

Popular kernels: the polynomial kernel K(x, y) = (x^T y + 1)^d (d = 1: linear kernel; d = 2: quadratic kernel) and the (Gaussian) radial basis function (RBF) kernel K(x, y) = exp(−γ ||x − y||^2).
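These kernels are easy to write down directly; a minimal NumPy sketch (the values of d, gamma and the test vectors are arbitrary choices for illustration):

import numpy as np

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (x^T y + 1)^d ; d = 1 gives the linear kernel
    return (x @ y + 1) ** d

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y_vec = np.array([0.5, -1.0])
print(polynomial_kernel(x, y_vec, d=1))   # linear kernel value
print(polynomial_kernel(x, y_vec, d=2))   # quadratic kernel value
print(rbf_kernel(x, y_vec))               # RBF kernel value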

Logistic Regression: despite its name, it is a classification method. It applies a sigmoid function σ(f(x)) = 1 / (1 + e^(−f(x))) over a linear classifier f(x) = w^T x. So, for a point x we have two cases: if σ(f(x)) ≥ 0.5 then y = 1; if σ(f(x)) < 0.5 then y = -1. It can also be interpreted as the confidence (a probability) with which an element is in a binary (1, -1) class y: p(y | x) = 1 / (1 + e^(−y w^T x)). A sigmoid (sigma-like) function is fit to the data.

Logistic Regression computes the confidence (probability) with which an element is in a binary (1, -1) class y: p(y | x) = 1 / (1 + e^(−y w^T x)). Predict y = 1 when p(y = 1 | x) ≥ 0.5 and y = -1 otherwise. Obviously, p(y = 1 | x) = 1 − p(y = -1 | x).

Logistic Regression: logistic regression is linear* because the decision boundary it produces is linear as a function of x. To see this, consider the probabilities with which a point x belongs to class y = 1 and to class y = -1: p(y = 1 | x) = 1 / (1 + e^(−w^T x)), so 1 − p(y = 1 | x) = e^(−w^T x) / (1 + e^(−w^T x)) and log[ p(y = 1 | x) / (1 − p(y = 1 | x)) ] = w^T x. On the decision boundary the probabilities are equal to 0.5, so log(0.5 / 0.5) = 0 = w^T x: the decision boundary is a hyperplane. *More concretely, it is a generalized linear model.
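A minimal sketch of this prediction rule in Python (NumPy assumed; the weights and test point are made-up illustrative values, with the bias handled by appending x_0 = 1):

import numpy as np

w = np.array([1.5, -2.0, 0.3])   # illustrative weights; last entry acts as w_0 via x_0 = 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x):
    """p(y = 1 | x) = sigmoid(w^T x), with x augmented by a constant 1."""
    x_aug = np.append(x, 1.0)
    return sigmoid(w @ x_aug)

x_new = np.array([0.8, 0.2])
p = predict_proba(x_new)
print(p, 1 if p >= 0.5 else -1)   # probability of class 1 and the predicted label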

Logistic Regression: why do we choose the sigmoid function to fit the data? Compare a least-squares fit of w_1 x + w_0 to y with a fit of σ(w_1 x + w_0) to y: the fit of w_1 x + w_0 is dominated by the more distant points and causes misclassification, whereas LR regresses the sigmoid to the class data.

Logistic Regression: similarly in the 2D space (figures comparing the LR decision boundary with a linear least-squares fit).

Logistic Regression: to compute the parameters w we have to solve an optimization problem. For this we use the data likelihood function: l(w) = Π_{i=1..N} 1 / (1 + e^(−y_i w^T x_i)). Thus, we search for the w that maximizes the likelihood of the data: log Π_{i=1..N} 1 / (1 + e^(−y_i w^T x_i)) = Σ_{i=1..N} log[ 1 / (1 + e^(−y_i w^T x_i)) ] = − Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)). For convenience, the loss function is the negative logarithm of the data likelihood: L(w) = Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)).

Logistic Regression: finally, w is the argument that minimizes L(w): w = argmin_w Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)). As in linear regression, in logistic regression we can add a regularization term r(w) (L1 or L2): w = argmin_w ( Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)) + C r(w) ), where C is the tuning parameter and r(w) is ||w||_2^2 or ||w||_1. Note that via regularization we also allow for small misclassifications (to be compared with the soft-margin Support Vector Classifier).

Logistic Regression: w = argmin_w ( Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)) + C r(w) ). For correctly classified points −y_i w^T x_i is negative, and thus log(1 + e^(−y_i w^T x_i)) is near zero. For incorrectly classified points −y_i w^T x_i is positive, and thus log(1 + e^(−y_i w^T x_i)) can be large. Hence the optimization penalizes parameters which lead to such misclassifications.
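A NumPy sketch of this regularized loss (the helper logistic_loss and the toy data are made up for illustration): a weight vector aligned with the labeling gives a small loss, the opposite direction a large one.

import numpy as np

def logistic_loss(w, X, y, C=0.1, penalty="l2"):
    """L(w) = sum_i log(1 + exp(-y_i w^T x_i)) + C * r(w), with y_i in {-1, +1}."""
    margins = y * (X @ w)
    data_term = np.sum(np.log1p(np.exp(-margins)))
    r = np.sum(w ** 2) if penalty == "l2" else np.sum(np.abs(w))
    return data_term + C * r

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)          # a linearly separable toy labeling
print(logistic_loss(np.array([1.0, 1.0]), X, y))    # aligned with the labeling: small loss
print(logistic_loss(np.array([-1.0, -1.0]), X, y))  # opposite direction: large loss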

Naive Bayes Classifier: assumes that the features x are conditionally independent given the output y. Here we will see the method for discrete values. It uses the Bayes rule.

Probability Basics: prior, conditional and joint probability for random variables. Prior probability: P(x_1). Conditional probability: P(x_1 | x_2), P(x_2 | x_1). Joint probability: for x = (x_1, x_2), P(x) = P(x_1, x_2). Relationship: P(x_1, x_2) = P(x_2 | x_1) P(x_1) = P(x_1 | x_2) P(x_2). Independence: P(x_2 | x_1) = P(x_2), P(x_1 | x_2) = P(x_1), P(x_1, x_2) = P(x_1) P(x_2).

Naive Bayes Classifier: assumes that the features x are conditionally independent given the output y. Here we will see the method for discrete values. It uses the Bayes rule: P(y | x) = P(x | y) P(y) / P(x), i.e., Posterior = (Likelihood × Prior) / Evidence. P(y | x): the posterior probability of the class (target) given an attribute vector (predictor). P(y): the prior class probability. P(x | y): the likelihood, i.e., the probability of generating the predictor x given target y. P(x): the evidence, i.e., the prior probability of the predictor x.

Naive Bayes Classifier, Maximum A Posteriori (MAP) classification rule: to classify an input x, find the posterior probability P(y_i | x) for each output class y_i and assign to x the label y with the highest posterior among all P(y_i | x). Note that since the factor P(x) in the Bayes rule P(y_i | x) = P(x | y_i) P(y_i) / P(x) is common to all P(y_i | x), we can omit it when applying MAP. Using the conditional-independence assumption, the classification rule can be written as: y = argmax_y P(y) Π_{i=1..n} P(x_i | y). The Naive Bayes classifier has a fast training phase, even with many features, because it collects statistics from each feature individually.

Naive Bayes Classifier as a linear classifier: the Naive Bayes classifier can be seen as a linear classifier if we consider the log version of the classification rule: f(x) = log[ P(y=1 | x) / P(y=0 | x) ] = log P(y=1 | x) − log P(y=0 | x) = (log θ_1 − log θ_0)^T x + (log P(y=1) − log P(y=0)). f(x) is a linear function of x, and f(x) = 0 is the decision boundary.

Naive Bayes Classifier An example: Play Tennis depending on weather conditions Is it probable that we will play tennis on a (Sunny, Hot, Normal, Strong) day?

Naive Bayes Classifier, Step 1: build a frequency table for each attribute.

Outlook      Play=Yes  Play=No
Sunny        2         3
Overcast     4         0
Rain         3         2

Temperature  Play=Yes  Play=No
Hot          2         2
Mild         4         2
Cool         3         1

Humidity     Play=Yes  Play=No
High         3         4
Normal       6         1

Wind         Play=Yes  Play=No
Strong       3         3
Weak         6         2

Naive Bayes Classifier, Step 2: build a likelihood table for each attribute.

Outlook      Play=Yes  Play=No
Sunny        2/9       3/5
Overcast     4/9       0/5
Rain         3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity     Play=Yes  Play=No
High         3/9       4/5
Normal       6/9       1/5

Wind         Play=Yes  Play=No
Strong       3/9       3/5
Weak         6/9       2/5

Class priors: P(Play=Yes) = 9/14, P(Play=No) = 5/14.

Naive Bayes Classifier, Step 3: compute the probabilities (using the likelihood tables above). For example, P(Yes | Strong): P(x | y) = P(Strong | Yes) = 3/9 = 0.33, P(y) = P(Yes) = 9/14 = 0.64, P(x) = P(Strong) = 6/14 = 0.43, so P(Yes | Strong) = 0.33 × 0.64 / 0.43 = 0.49.

Naive Bayes Classifier. Q: Is it probable that we will play tennis on a day x = (Outlook: Sunny, Temp: Hot, Humidity: Normal, Wind: Strong)? A: Let's apply MAP and compare P(x | Yes) P(Yes) with P(x | No) P(No). Likelihood_Yes = P(x | Yes) = P(Outlook=Sunny | Yes) × P(Temp=Hot | Yes) × P(Humidity=Normal | Yes) × P(Wind=Strong | Yes) = 2/9 × 2/9 × 6/9 × 3/9 ≈ 0.011. Likelihood_No = P(x | No) = P(Outlook=Sunny | No) × P(Temp=Hot | No) × P(Humidity=Normal | No) × P(Wind=Strong | No) = 3/5 × 2/5 × 1/5 × 3/5 ≈ 0.029. P(Yes) = 9/14 = 0.64 and P(No) = 5/14 = 0.36, so P(x | Yes) P(Yes) ≈ 0.011 × 0.64 ≈ 0.007 and P(x | No) P(No) ≈ 0.029 × 0.36 ≈ 0.010. Since P(x | No) P(No) > P(x | Yes) P(Yes), x is classified as a No.
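The same computation as a short Python sketch (the likelihoods are read off the tables above):

# Likelihoods of (Sunny, Hot, Normal, Strong) read from the likelihood tables
p_x_given_yes = (2/9) * (2/9) * (6/9) * (3/9)
p_x_given_no = (3/5) * (2/5) * (1/5) * (3/5)
p_yes, p_no = 9/14, 5/14

score_yes = p_x_given_yes * p_yes
score_no = p_x_given_no * p_no
print(round(p_x_given_yes, 3), round(p_x_given_no, 3))   # ~0.011 and ~0.029
print(round(score_yes, 3), round(score_no, 3))           # ~0.007 and ~0.010
print("prediction:", "Yes" if score_yes > score_no else "No")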

Naive Bayes Classifier, the zero-frequency problem: when an attribute value has zero frequency for some class y_i, the likelihood P(x | y_i) will be equal to zero! To avoid this, we can simply augment all the counts by one. E.g., for Outlook: the counts Sunny 2/3, Overcast 4/0, Rain 3/2 (Yes/No) become Sunny 3/4, Overcast 5/1, Rain 4/3.
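A sketch of this add-one (Laplace) correction for the Outlook counts above (the helper function is made up for illustration):

# Outlook counts for Play=Yes and Play=No, from the frequency table
counts_yes = {"Sunny": 2, "Overcast": 4, "Rain": 3}
counts_no = {"Sunny": 3, "Overcast": 0, "Rain": 2}

def smoothed_likelihoods(counts):
    # add 1 to every count so that no likelihood is exactly zero
    smoothed = {value: c + 1 for value, c in counts.items()}
    total = sum(smoothed.values())
    return {value: c / total for value, c in smoothed.items()}

print(smoothed_likelihoods(counts_no))   # P(Overcast | No) is now 1/8 instead of 0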

Binary vs multiclass classifiers (figures: a binary problem and a multiclass problem).

Multiclass classification (linear): each sample can have one class out of many possibilities. The linear algorithms that we saw (LR, SVC, Naive Bayes) can be used essentially as they are for multiclass problems too. How? One-vs-All. One-vs-All: a binary problem with class 0 being some class and class 1 being the rest of the classes. If the set of possible classes is Y, then for each y_i ∈ Y learn a line separating class y_i from the rest of the classes using a binary classifier, and then use all the learned lines to separate the classes.

One vs All classification (figures: one separating line learned for each class, then all the lines combined).

One vs All classification: e.g., for LR we compute p(y_i | x) for each class y_i; the class with the highest probability is the predicted class of a new point x.
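A sketch of One-vs-All with logistic regression (scikit-learn assumed; the three-class toy data are made up here): one binary classifier is trained per class, and a new point gets the class whose classifier is most confident:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
centers = np.array([[0, 4], [-3, -2], [3, -2]])
X = np.vstack([rng.normal(c, 0.8, (30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

classifiers = {}
for cls in np.unique(y):
    # binary problem: class `cls` vs. the rest
    classifiers[cls] = LogisticRegression().fit(X, (y == cls).astype(int))

x_new = np.array([[0.5, 3.0]])
scores = {cls: clf.predict_proba(x_new)[0, 1] for cls, clf in classifiers.items()}
print(scores)
print("predicted class:", max(scores, key=scores.get))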

One vs All classification: do you see any problem here? What about the points in the triangle between the three lines? Choose the line closest to the point → the class with the highest probability.

Sources: Cambridge UP, Support vector machines and machine learning on documents; Xiaojin Zhu, Naive Bayes Classifier; Ke Chen, Naive Bayes Classifier; Ng, Multiclass classification; Jinwei Gu, An Introduction of Support Vector Machine.