Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

COMP-652, Lecture 9 - October 7, 2009

Outline: perceptrons (definition, the perceptron learning rule, convergence); margin and max-margin classifiers; (linear) support vector machines; formulation as an optimization problem; the generalized Lagrangian and the dual; allowing for noise (soft margins); solving the dual: SMO.

Perceptrons

Consider a binary classification problem with data $\{x_i, y_i\}_{i=1}^m$, $y_i \in \{-1, +1\}$. A perceptron is a classifier of the form:
$$h_{w,w_0}(x) = \mathrm{sgn}(w \cdot x + w_0) = \begin{cases} +1 & \text{if } w \cdot x + w_0 \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
Here, $w$ is a vector of weights, $\cdot$ denotes the dot product, and $w_0$ is a constant offset. The decision boundary is $w \cdot x + w_0 = 0$. Perceptrons output a class, not a probability. An example $(x, y)$ is classified correctly iff $y(w \cdot x + w_0) > 0$.

A gradient descent-like learning rule

Consider the following procedure:
1. Initialize $w$ and $w_0$ randomly.
2. While any training examples remain incorrectly classified:
(a) Loop through all misclassified examples.
(b) For misclassified example $i$, perform the updates $w \leftarrow w + \gamma y_i x_i$, $w_0 \leftarrow w_0 + \gamma y_i$, where $\gamma$ is a step-size parameter.
The update equation, or sometimes the whole procedure, is called the perceptron learning rule. Intuition: for examples misclassified as negative, the update increases $w \cdot x_i + w_0$; for examples misclassified as positive, it decreases it.
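
The procedure translates directly into a few lines of NumPy. The sketch below is the editor's illustration, not code from the lecture; the zero initialization, step size, and epoch cap are assumptions chosen for convenience.

import numpy as np

def perceptron_train(X, y, gamma=1.0, max_epochs=1000):
    """Perceptron learning rule: X is (m, d), y is (m,) with entries in {-1, +1}."""
    m, d = X.shape
    w, w0 = np.zeros(d), 0.0               # zero init (random also works)
    for _ in range(max_epochs):
        misclassified = [i for i in range(m) if y[i] * (X[i] @ w + w0) <= 0]
        if not misclassified:               # all examples correctly classified
            return w, w0
        for i in misclassified:             # update on each misclassified example
            w += gamma * y[i] * X[i]
            w0 += gamma * y[i]
    return w, w0                            # may not have converged (data not separable?)

def perceptron_predict(X, w, w0):
    return np.sign(X @ w + w0)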

Gradient descent interpretation

The perceptron learning rule can be interpreted as a gradient descent procedure, but with the following perceptron criterion function:
$$J(w, w_0) = \sum_{i=1}^m \begin{cases} 0 & \text{if } y_i(w \cdot x_i + w_0) \ge 0 \\ -y_i(w \cdot x_i + w_0) & \text{if } y_i(w \cdot x_i + w_0) < 0 \end{cases}$$
For correctly classified examples, the error is zero. For incorrectly classified examples, the error is by how much $w \cdot x_i + w_0$ is on the wrong side of the decision boundary. $J$ is piecewise linear, so it has a gradient almost everywhere; the gradient gives the perceptron learning rule. $J$ is zero iff all examples are classified correctly, just like the 0-1 loss function.
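
The criterion and its (sub)gradient are easy to write down in code; the small NumPy sketch below is the editor's, added only to make the gradient-descent reading concrete (function names are illustrative).

import numpy as np

def perceptron_criterion(w, w0, X, y):
    """J(w, w0): sum of -y_i (w . x_i + w0) over the misclassified examples."""
    scores = y * (X @ w + w0)
    return -scores[scores < 0].sum()

def perceptron_criterion_grad(w, w0, X, y):
    """Subgradient of J; a negative (sub)gradient step recovers the batch perceptron rule."""
    mask = y * (X @ w + w0) < 0                          # misclassified examples
    grad_w = -(y[mask][:, None] * X[mask]).sum(axis=0)
    grad_w0 = -y[mask].sum()
    return grad_w, grad_w0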

Linear separability

The data set is linearly separable if and only if there exist $w, w_0$ such that, for all $i$, $y_i(w \cdot x_i + w_0) > 0$; or, equivalently, the 0-1 loss is zero.

[Figure: two 2-D data sets of positive (+) and negative (-) examples plotted against x1 and x2, panels (a) and (b).]

Perceptron convergence theorem

The perceptron convergence theorem states that if the perceptron learning rule is applied to a linearly separable data set, a solution will be found after some finite number of updates. The number of updates depends on the data set, and also on the step-size parameter. If the data is not linearly separable, there will be oscillation (which can be detected automatically).

Perceptron learning example: separable data

[Figure: a separable 2-D data set plotted against x1 and x2, with the initial weights w = [0 0], w0 = 0.]

Perceptron learning example: separable data

[Figure: the same data set with the decision boundary learned by the perceptron, w = [4.111 3.8704], w0 = -4.]

Weight as a combination of input vectors

Recall the perceptron learning rule: $w \leftarrow w + \gamma y_i x_i$, $w_0 \leftarrow w_0 + \gamma y_i$. If the initial weights are zero, then at any step the weights are a linear combination of the feature vectors:
$$w = \sum_{i=1}^m \alpha_i y_i x_i, \qquad w_0 = \sum_{i=1}^m \alpha_i y_i$$
where $\alpha_i$ is the sum of the step sizes used for all updates based on example $i$. This is called the dual representation of the classifier. Even by the end of training, some examples may never have participated in an update.
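
This dual representation can be checked numerically by tracking how much total step size each example contributes. The sketch below is the editor's illustration (a variant of the earlier hypothetical perceptron_train), not part of the original slides.

import numpy as np

def perceptron_train_dual(X, y, gamma=1.0, max_epochs=1000):
    """Perceptron with zero initial weights, tracking alpha_i = total step size used on example i."""
    m, d = X.shape
    w, w0, alpha = np.zeros(d), 0.0, np.zeros(m)
    for _ in range(max_epochs):
        updated = False
        for i in range(m):
            if y[i] * (X[i] @ w + w0) <= 0:   # example i is misclassified
                w += gamma * y[i] * X[i]
                w0 += gamma * y[i]
                alpha[i] += gamma             # accumulate the step size for example i
                updated = True
        if not updated:
            break
    # Dual representation: w and w0 are recovered from the alphas alone
    assert np.allclose(w, (alpha * y) @ X) and np.isclose(w0, alpha @ y)
    return w, w0, alpha

Examples with alpha_i = 0 never participated in an update.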

Examples used (bold) and not used (faint) in updates

[Figure: the data set with the examples that participated in updates shown in bold and the others faint; learned boundary w = [4.111 3.8704], w0 = -4.]

Comment: solutions are non-unique

[Figure: the same data set separated by a different hyperplane, w = [2.1395 1.9372], w0 = -2.]

Perceptron summary

Perceptrons can be trained to fit linearly separable data, using a gradient descent rule. There are other fitting approaches, e.g., formulation as a linear constraint satisfaction problem / linear program. Solutions are non-unique. Logistic neurons are often thought of as a smooth version of a perceptron. For non-linearly separable data: perhaps the data can be linearly separated in a different feature space? Perhaps we can relax the criterion of separating all the data?

Support Vector Machines

Support vector machines (SVMs) for binary classification can be viewed as a way of training perceptrons. There are three main new ideas: an alternative optimization criterion (the "margin"), which eliminates the non-uniqueness of solutions and has theoretical advantages; a way of handling non-separable data by allowing mistakes; and an efficient way of operating in expanded feature spaces: the kernel trick. SVMs can also be used for multiclass classification and regression.

Returning to the non-uniqueness issue

Consider a linearly separable binary classification data set $\{x_i, y_i\}_{i=1}^m$. There are infinitely many hyperplanes that separate the classes.

[Figure: a 2-D data set with positive examples on one side, negative examples on the other, and several candidate separating hyperplanes drawn between them.]

Which plane is best? Relatedly, for a given plane, for which points should we be most confident in the classification?

The margin, and linear SVMs

For a given separating hyperplane, the margin is two times the (Euclidean) distance from the hyperplane to the nearest training example.

[Figure: a separable 2-D data set with a separating hyperplane and the margin strip around it.]

It is the width of the strip around the decision boundary containing no training examples. A linear SVM is a perceptron for which we choose $w, w_0$ so that the margin is maximized.

Distance to the decision boundary

Suppose we have a decision boundary that separates the data.

[Figure: a point A = x_i, its projection B onto the decision boundary, the normal vector w, and the distance γ_i between A and B.]

Let $\gamma_i$ be the distance from instance $x_i$ to the decision boundary. How can we write $\gamma_i$ in terms of $x_i, y_i, w, w_0$?

Distance to the decision boundary (II)

The vector $w$ is normal to the decision boundary; thus, $\frac{w}{\|w\|}$ is the unit normal. The vector from B to A is $\gamma_i \frac{w}{\|w\|}$. B, the point on the decision boundary nearest $x_i$, is $x_i - \gamma_i \frac{w}{\|w\|}$. As B is on the decision boundary,
$$w \cdot \left( x_i - \gamma_i \frac{w}{\|w\|} \right) + w_0 = 0$$
Solving for $\gamma_i$ yields, for a positive example:
$$\gamma_i = \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|}$$

The margin

The margin of the hyperplane is $2M$, where $M = \min_i \gamma_i$. The most direct statement of the problem of finding a maximum margin separating hyperplane is thus
$$\max_{w, w_0} \min_i \gamma_i \;=\; \max_{w, w_0} \min_i y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \right)$$
This turns out to be inconvenient for optimization, however...
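
For concreteness, these quantities are one line each in NumPy. The helper below is the editor's sketch (names are illustrative): it computes the signed distances $\gamma_i$ and the half-margin $M$ of a given separating hyperplane.

import numpy as np

def geometric_margins(w, w0, X, y):
    """gamma_i = y_i * (w . x_i + w0) / ||w||; positive for correctly classified points."""
    return y * (X @ w + w0) / np.linalg.norm(w)

def half_margin(w, w0, X, y):
    """M = min_i gamma_i; the margin of the hyperplane is 2 * M."""
    return geometric_margins(w, w0, X, y).min()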

Treating the γ_i as constraints

From the definition of the margin, we have:
$$M \le \gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \right) \quad \forall i$$
This suggests: maximize $M$ with respect to $w, w_0$, subject to
$$y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \right) \ge M \text{ for all } i$$
Problems: $\|w\|$ appears nonlinearly in the constraints, and the problem is underconstrained: if $(w, w_0, M)$ is an optimal solution, then so is $(\beta w, \beta w_0, M)$ for any $\beta > 0$.

Adding a constraint

Let's try adding the constraint that $\|w\| M = 1$. This allows us to rewrite the objective function and constraints as:
$$\min \|w\| \quad \text{w.r.t. } w, w_0 \quad \text{s.t. } y_i(w \cdot x_i + w_0) \ge 1$$
This is really nice because the constraints are linear. The objective function $\|w\|$ is still a bit awkward.

Final formulation

Let's minimize $\|w\|^2$ instead of $\|w\|$. (Taking the square is a monotone transformation, as $\|w\|$ is positive, so this doesn't change the optimal solution.) This gets us to:
$$\min \|w\|^2 \quad \text{w.r.t. } w, w_0 \quad \text{s.t. } y_i(w \cdot x_i + w_0) \ge 1$$
This we can solve! How? It is a quadratic programming (QP) problem, a standard type of optimization problem for which many efficient packages are available. Better yet, it's a convex (positive semidefinite) QP.
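
To show how small this QP is in an off-the-shelf solver, here is a hedged sketch using the cvxpy modeling library; the library choice and names are the editor's assumptions, since the lecture does not prescribe any particular package.

import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min ||w||^2 s.t. y_i (w . x_i + w0) >= 1 as a convex QP."""
    m, d = X.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    problem.solve()                     # only succeeds if the data are separable
    return w.value, w0.value

If the data are not linearly separable, the problem is infeasible and the solver reports it, which is what motivates the soft-margin version later in the lecture.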

Example

[Figure: two data sets with the maximum-margin hyperplanes found by the QP; left panel: w = [49.6504 46.8962], w0 = -48.6936; right panel: w = [11.7959 12.8066], w0 = -12.9174.]

We have a solution, but no support vectors yet...

Lagrange multipliers for inequality constraints (revisited)

Suppose we have the following optimization problem, called the primal:
$$\min_w f(w) \quad \text{such that } g_i(w) \le 0, \; i = 1 \ldots k$$
We define the generalized Lagrangian:
$$L(w, \alpha) = f(w) + \sum_{i=1}^k \alpha_i g_i(w), \qquad (1)$$
where $\alpha_i$, $i = 1 \ldots k$, are the Lagrange multipliers.

A different optimization problem

Consider $P(w) = \max_{\alpha : \alpha_i \ge 0} L(w, \alpha)$. Observe that the following is true (why?):
$$P(w) = \begin{cases} f(w) & \text{if all constraints are satisfied} \\ +\infty & \text{otherwise} \end{cases}$$
Hence, instead of computing $\min_w f(w)$ subject to the original constraints, we can compute:
$$p^* = \min_w P(w) = \min_w \max_{\alpha : \alpha_i \ge 0} L(w, \alpha)$$

Dual optimization problem

Let $d^* = \max_{\alpha : \alpha_i \ge 0} \min_w L(w, \alpha)$ (max and min are reversed). We can show that $d^* \le p^*$:
Let $p^* = L(w_p, \alpha_p)$ and $d^* = L(w_d, \alpha_d)$. Then
$$d^* = L(w_d, \alpha_d) \le L(w_p, \alpha_d) \le L(w_p, \alpha_p) = p^*.$$

Dual optimization problem

If $f$ and the $g_i$ are convex and the $g_i$ can all be satisfied simultaneously for some $w$, then we have equality: $d^* = p^* = L(w^*, \alpha^*)$. Moreover, $w^*, \alpha^*$ solve the primal and dual if and only if they satisfy the following conditions (called Karush-Kuhn-Tucker):
$$\frac{\partial}{\partial w_i} L(w^*, \alpha^*) = 0, \quad i = 1 \ldots n \qquad (2)$$
$$\alpha_i^* g_i(w^*) = 0, \quad i = 1 \ldots k \qquad (3)$$
$$g_i(w^*) \le 0, \quad i = 1 \ldots k \qquad (4)$$
$$\alpha_i^* \ge 0, \quad i = 1 \ldots k \qquad (5)$$

Back to maximum margin perceptron

We wanted to solve (rewritten slightly):
$$\min \frac{1}{2}\|w\|^2 \quad \text{w.r.t. } w, w_0 \quad \text{s.t. } 1 - y_i(w \cdot x_i + w_0) \le 0$$
The Lagrangian is:
$$L(w, w_0, \alpha) = \frac{1}{2}\|w\|^2 + \sum_i \alpha_i \left(1 - y_i(w \cdot x_i + w_0)\right)$$
The primal problem is $\min_{w, w_0} \max_{\alpha : \alpha_i \ge 0} L(w, w_0, \alpha)$. We will solve the dual problem: $\max_{\alpha : \alpha_i \ge 0} \min_{w, w_0} L(w, w_0, \alpha)$. In this case, the optimal solutions coincide, because we have a quadratic objective and linear constraints (both of which are convex).

Solving the dual

From KKT condition (2), the derivatives of $L(w, w_0, \alpha)$ with respect to $w$ and $w_0$ should be 0. The condition on the derivative with respect to $w_0$ gives $\sum_i \alpha_i y_i = 0$. The condition on the derivative with respect to $w$ gives:
$$w = \sum_i \alpha_i y_i x_i$$
Just like for the perceptron with zero initial weights, the optimal solution for $w$ is a linear combination of the $x_i$, and likewise for $w_0$. The output is
$$h_{w,w_0}(x) = \mathrm{sgn}\left( \sum_{i=1}^m \alpha_i y_i (x_i \cdot x) + w_0 \right)$$
The output depends on the weighted dot products of the input vector with the training examples.

Solving the dual (II)

By plugging these back into the expression for $L$, we get:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j)$$
with constraints: $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.

The support vectors

Suppose we find the optimal $\alpha$s (e.g., using a standard QP package). The $\alpha_i$ will be $> 0$ only for the points for which $1 - y_i(w \cdot x_i + w_0) = 0$. These are the points lying on the edge of the margin, and they are called support vectors, because they define the decision boundary. The output of the classifier for a query point $x$ is computed as:
$$\mathrm{sgn}\left( \sum_{i=1}^m \alpha_i y_i (x_i \cdot x) + w_0 \right)$$
Hence, the output is determined by computing the dot products of the query point with the support vectors!
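
Continuing the earlier cvxpy sketch (again an editor's illustration with assumed names, not code from the lecture), the dual can be solved directly, the support vectors read off from the positive alphas, and $w$, $w_0$ recovered from them:

import cvxpy as cp
import numpy as np

def hard_margin_svm_dual(X, y, tol=1e-6):
    """Maximize sum(alpha) - 1/2 sum_ij y_i y_j alpha_i alpha_j (x_i . x_j)."""
    m = X.shape[0]
    alpha = cp.Variable(m)
    # sum_ij y_i y_j a_i a_j (x_i . x_j) = || sum_i a_i y_i x_i ||^2
    objective = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha))
    problem = cp.Problem(cp.Maximize(objective), [alpha >= 0, y @ alpha == 0])
    problem.solve()
    a = alpha.value
    sv = np.where(a > tol)[0]            # indices of the support vectors
    w = (a * y) @ X                      # w = sum_i alpha_i y_i x_i
    w0 = np.mean(y[sv] - X[sv] @ w)      # from y_i (w . x_i + w0) = 1 on the support vectors
    return w, w0, sv

On separable data this recovers the same hyperplane as the primal sketch, and sv typically contains only a handful of points.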

Example

[Figure: the maximum-margin solution on the example data set; the support vectors are shown in bold.]

Soft margin classifiers

Recall that in the linearly separable case, we compute the solution to the following optimization problem:
$$\min \frac{1}{2}\|w\|^2 \quad \text{w.r.t. } w, w_0 \quad \text{s.t. } y_i(w \cdot x_i + w_0) \ge 1$$
If we want to allow misclassifications, we can relax the constraints to:
$$y_i(w \cdot x_i + w_0) \ge 1 - \xi_i$$
If $\xi_i \in (0, 1)$, the data point is within the margin. If $\xi_i \ge 1$, then the data point is misclassified. We define the soft error as $\sum_i \xi_i$. We will have to change the criterion to reflect the soft errors.

New problem formulation with soft errors

Instead of:
$$\min \frac{1}{2}\|w\|^2 \quad \text{w.r.t. } w, w_0 \quad \text{s.t. } y_i(w \cdot x_i + w_0) \ge 1$$
we want to solve:
$$\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{w.r.t. } w, w_0, \xi_i \quad \text{s.t. } y_i(w \cdot x_i + w_0) \ge 1 - \xi_i, \; \xi_i \ge 0$$
Note that the soft errors include points that are misclassified, as well as points within the margin. There is a linear penalty for both categories. The choice of the constant $C$ controls overfitting.

A built-in overfitting knob

$$\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{w.r.t. } w, w_0, \xi_i \quad \text{s.t. } y_i(w \cdot x_i + w_0) \ge 1 - \xi_i, \; \xi_i \ge 0$$
If $C$ is 0, there is no penalty for soft errors, so the focus is on maximizing the margin, even if this means more mistakes. If $C$ is very large, the emphasis on the soft errors will cause the margin to decrease, if this helps to classify more examples correctly.
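
The soft-margin problem stays a convex QP, so the earlier cvxpy sketch extends with one extra variable; as before, this is an editor's illustration under assumed names, not the lecture's code.

import cvxpy as cp
import numpy as np

def soft_margin_svm(X, y, C=1.0):
    """min 1/2 ||w||^2 + C * sum(xi)  s.t.  y_i (w . x_i + w0) >= 1 - xi_i,  xi_i >= 0."""
    m, d = X.shape
    w, w0, xi = cp.Variable(d), cp.Variable(), cp.Variable(m)
    objective = 0.5 * cp.sum_squares(w) + C * cp.sum(xi)
    constraints = [cp.multiply(y, X @ w + w0) >= 1 - xi, xi >= 0]
    problem = cp.Problem(cp.Minimize(objective), constraints)
    problem.solve()
    return w.value, w0.value, xi.value

Sweeping C on a noisy data set (e.g., 0.01, 1, 100) illustrates the trade-off described above: a small C favors a wide margin and tolerates soft errors, while a large C shrinks the margin to fix individual mistakes.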

Lagrangian for the new problem

Like before, we can write a Lagrangian for the problem and then use the dual formulation to find the optimal parameters:
$$L(w, w_0, \alpha, \xi, \mu) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i + \sum_i \alpha_i \left(1 - \xi_i - y_i(w \cdot x_i + w_0)\right) - \sum_i \mu_i \xi_i$$
All the previously described machinery can be used to solve this problem. Note that in addition to the $\alpha_i$ we have coefficients $\mu_i$, which ensure that the errors are positive, but do not participate in the decision boundary. Next time: an even better way of dealing with non-linearly separable data.

Solving the quadratic optimization problem

Many approaches exist. Because we have constraints, gradient descent does not apply directly (the optimum might be outside of the feasible region). Platt's algorithm (SMO) is the fastest current approach, based on coordinate ascent.

Coordinate ascent

Suppose you want to find the maximum of some function $F(\alpha_1, \ldots, \alpha_n)$. Coordinate ascent optimizes the function by repeatedly picking one $\alpha_i$ and optimizing over it while all other parameters are fixed. There are different ways of looping through the parameters: round-robin; repeatedly picking a parameter at random; choosing next the variable expected to make the largest improvement; ...
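
To make the idea concrete, here is a tiny, editor-written example of round-robin coordinate ascent on a simple concave quadratic; the function, grid, and names are illustrative and not taken from the lecture.

import numpy as np

def coordinate_ascent(F, alpha, candidates, n_sweeps=50):
    """Round-robin coordinate ascent: maximize F one coordinate at a time by 1-D grid search."""
    alpha = alpha.astype(float).copy()
    for _ in range(n_sweeps):
        for i in range(len(alpha)):                    # round-robin over coordinates
            def set_coord(c):
                trial = alpha.copy()
                trial[i] = c
                return trial
            values = [F(set_coord(c)) for c in candidates]
            alpha[i] = candidates[int(np.argmax(values))]
    return alpha

# A simple concave quadratic; its true maximizer is (8/3, -10/3).
F = lambda a: -(a[0] - 1) ** 2 - (a[1] + 2) ** 2 - a[0] * a[1]
print(coordinate_ascent(F, np.zeros(2), np.linspace(-5, 5, 201)))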

Example

[Figure: coordinate ascent example.]

Our optimization problem

$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j K(x_i, x_j)$$
with constraints: $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
Suppose we want to optimize for $\alpha_1$ while $\alpha_2, \ldots, \alpha_n$ are fixed. We cannot, because $\alpha_1$ is completely determined by the last constraint: $\alpha_1 = -y_1 \sum_{i=2}^m \alpha_i y_i$. Instead, we have to optimize pairs of $\alpha_i, \alpha_j$ parameters together.

SMO

Suppose that we want to optimize $\alpha_1$ and $\alpha_2$ together, while all other parameters are fixed. Because $\sum_{i=1}^m y_i \alpha_i = 0$, we know that $y_1 \alpha_1 + y_2 \alpha_2 = -\sum_{i=3}^m y_i \alpha_i = \xi$, where $\xi$ is a constant. So $\alpha_1 = y_1(\xi - y_2 \alpha_2)$ (because $y_1$ is either $+1$ or $-1$, so $y_1^2 = 1$). This defines a line, and any pair $(\alpha_1, \alpha_2)$ which is a solution has to be on this line.

SMO (II)

We also know that $0 \le \alpha_1 \le C$ and $0 \le \alpha_2 \le C$, so the solution has to be on the segment of this line inside the rectangle $[0, C] \times [0, C]$.

[Figure: the box $[0, C] \times [0, C]$ in the $(\alpha_1, \alpha_2)$ plane, with the constraint line clipped to a segment inside it.]

SMO (III)

By plugging $\alpha_1$ back into the optimization criterion, we obtain a quadratic function of $\alpha_2$, whose optimum we can find exactly. If the optimum is inside the rectangle, we take it. If not, we pick the closest intersection point of the line and the rectangle. This procedure is very fast because all of these are simple computations.
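
To show what "simple computations" means in practice, here is a hedged sketch of a single pair update in the style of Platt's SMO (simplified, with editor-chosen names; the full algorithm adds heuristics for choosing which pair to update and caches the errors):

import numpy as np

def smo_pair_update(i, j, alpha, b, y, C, K):
    """One simplified SMO step on the pair (alpha_i, alpha_j); K is the Gram matrix."""
    f = lambda k: (alpha * y) @ K[:, k] + b        # current decision value on example k
    E_i, E_j = f(i) - y[i], f(j) - y[j]            # prediction errors
    # Ends of the feasible segment for alpha_j inside the [0, C] x [0, C] box
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2 * K[i, j] - K[i, i] - K[j, j]          # second derivative along the constraint line
    if L == H or eta >= 0:
        return alpha, b                            # nothing to do for this pair
    a_j = np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H)  # 1-D optimum, clipped to the box
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)           # keep y_i a_i + y_j a_j constant
    # Update the offset so the KKT conditions hold on the updated pair
    b_i = b - E_i - y[i] * (a_i - alpha[i]) * K[i, i] - y[j] * (a_j - alpha[j]) * K[i, j]
    b_j = b - E_j - y[i] * (a_i - alpha[i]) * K[i, j] - y[j] * (a_j - alpha[j]) * K[j, j]
    alpha = alpha.copy()
    alpha[i], alpha[j] = a_i, a_j
    b = b_i if 0 < a_i < C else (b_j if 0 < a_j < C else (b_i + b_j) / 2)
    return alpha, b

Repeatedly sweeping over pairs that violate the KKT conditions and applying this update drives the dual objective to its maximum.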