COMP 652: Machine Learning Lecture 12 COMP 652 Lecture 12 1 / 37

Today: Perceptrons: definition; perceptron learning rule; convergence. (Linear) support vector machines: margin and the max margin classifier; formulation as an optimization problem; generalized Lagrangian and dual. COMP 652 Lecture 12 2 / 37

Perceptrons. Consider a binary classification problem with data {(x_i, y_i)}_{i=1}^m. Let the two classes be y = -1 and y = +1. A perceptron is a classifier of the form: h_{w,w_0}(x) = sgn(w · x + w_0) = +1 if w · x + w_0 ≥ 0, and -1 otherwise. Here, w is a vector of weights, · denotes the dot product, and w_0 is a constant offset. The decision boundary is w · x + w_0 = 0. How do we measure goodness of fit w.r.t. a data set? COMP 652 Lecture 12 3 / 37
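As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the perceptron classifier above; the function name and the choice of mapping a zero score to +1 are my own, matching the slide's convention for sgn.

```python
import numpy as np

def perceptron_predict(w, w0, X):
    """Perceptron decision rule h_{w,w0}(x) = sgn(w . x + w0), applied to each row of X.

    A score of exactly zero is mapped to +1, matching the slide's convention.
    """
    scores = X @ w + w0              # w . x_i + w_0 for every example
    return np.where(scores >= 0, 1, -1)
```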

The 0-1 loss function. Perceptrons do not output class probabilities, just a guess at the class. An example (x, y) is classified correctly if (and pretty much only if) y(w · x + w_0) > 0. For a data set {(x_i, y_i)}_{i=1}^m, the 0-1 loss of a particular perceptron, given by parameters w, w_0, is simply the number of examples classified incorrectly. The 0-1 loss function: takes values in the set {0, 1, ..., m}; is piecewise constant; and its derivatives with respect to w, w_0 are zero everywhere that they are defined. Not good for an analytical solution or for gradient descent. COMP 652 Lecture 12 4 / 37
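To make the 0-1 loss concrete, here is a short sketch (my own helper, under the same NumPy conventions as above) that counts misclassified examples; ties on the boundary are counted as errors, consistent with the "pretty much only if" caveat.

```python
import numpy as np

def zero_one_loss(w, w0, X, y):
    """Count examples with y_i * (w . x_i + w0) <= 0, i.e. the 0-1 loss in {0, 1, ..., m}."""
    functional_margins = y * (X @ w + w0)
    return int(np.sum(functional_margins <= 0))
```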

A gradient descent-like learning rule. Consider the following procedure: 1. Initialize w and w_0 randomly. 2. While any training examples remain incorrectly classified: (a) loop through all misclassified examples; (b) for misclassified example i, perform the updates w ← w + γ y_i x_i, w_0 ← w_0 + γ y_i, where γ is a step-size parameter. The update equation, or sometimes the whole procedure, is called the perceptron learning rule. Does this rule make sense? COMP 652 Lecture 12 5 / 37

A gradient descent-like learning rule. Consider the following procedure: 1. Initialize w and w_0 randomly. 2. While any training examples remain incorrectly classified: (a) loop through all misclassified examples; (b) for misclassified example i, perform the updates w ← w + γ y_i x_i, w_0 ← w_0 + γ y_i, where γ is a step-size parameter. The update equation, or sometimes the whole procedure, is called the perceptron learning rule. Does this rule make sense? Yes, because for examples misclassified as negative, it increases w · x_i + w_0, and for examples misclassified as positive, it decreases w · x_i + w_0. COMP 652 Lecture 12 6 / 37
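A runnable sketch of the perceptron learning rule just described, assuming a NumPy data matrix X (one example per row) and labels y in {-1, +1}; the epoch cap is my addition so the loop also terminates on non-separable data.

```python
import numpy as np

def perceptron_train(X, y, gamma=1.0, max_epochs=1000, seed=0):
    """Perceptron learning rule: repeatedly sweep the data and update on mistakes."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, w0 = rng.normal(size=d), rng.normal()   # step 1: random initialization
    for _ in range(max_epochs):                # cap the sweeps (non-separable data never converges)
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ w + w0) <= 0:    # example i is misclassified
                w = w + gamma * y[i] * X[i]    # w  <- w  + gamma * y_i * x_i
                w0 = w0 + gamma * y[i]         # w0 <- w0 + gamma * y_i
                mistakes += 1
        if mistakes == 0:                      # nothing misclassified: stop
            break
    return w, w0
```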

Gradient descent interpretation. The perceptron learning rule can be interpreted as a gradient descent procedure, but with a different error function than the 0-1 loss. Consider the perceptron criterion function J(w, w_0) = Σ_{i=1}^m { 0 if y_i(w · x_i + w_0) ≥ 0; -y_i(w · x_i + w_0) if y_i(w · x_i + w_0) < 0 }. For correctly classified examples, the error is zero. For incorrectly classified examples, the error measures by how much w · x_i + w_0 is on the wrong side of the decision boundary. J is piecewise linear, and its gradient gives the perceptron learning rule. J is zero if (and pretty much only if) all examples are classified correctly, just like the 0-1 loss function. Do we get convergence? COMP 652 Lecture 12 7 / 37
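The perceptron criterion and its (sub)gradient can be written directly from the definition; a batch gradient step with step size γ then recovers the learning rule above. This is a sketch with my own function names, under the same NumPy conventions as before.

```python
import numpy as np

def perceptron_criterion(w, w0, X, y):
    """J(w, w0): sum of -y_i*(w . x_i + w0) over misclassified examples, 0 for correct ones."""
    margins = y * (X @ w + w0)
    return float(np.sum(-margins[margins < 0]))

def perceptron_criterion_grad(w, w0, X, y):
    """(Sub)gradient of J: sums of -y_i*x_i and -y_i over the misclassified examples."""
    mis = (y * (X @ w + w0)) < 0
    grad_w = -(y[mis, None] * X[mis]).sum(axis=0)
    grad_w0 = -float(y[mis].sum())
    return grad_w, grad_w0
```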

Linear separability. Consider a (binary classification) data set {(x_i, y_i)}_{i=1}^m. The data set is linearly separable if and only if there exist w, w_0 such that y_i(w · x_i + w_0) > 0 for all i; or equivalently, the 0-1 loss is zero. [Figure: two 2-D data sets, panels (a) and (b), with axes x_1 and x_2.] COMP 652 Lecture 12 8 / 37

Perceptron convergence theorem The perceptron convergence theorem states that if the perceptron learning rule is applied to a linearly separable data set, a solution will be found after some finite number of updates. The number of updates depends on the data set, and also on the step size parameter. Step size = 1 can be used. See Duda, Hart & Stork, Chapter 5, for proof. COMP 652 Lecture 12 9 / 37

Perceptron learning example: separable data. [Figure] COMP 652 Lecture 12 10 / 37

Perceptron learning example: separable data. [Figure] COMP 652 Lecture 12 11 / 37

Weight as a combination of input vectors. Recall the perceptron learning rule: w ← w + γ y_i x_i, w_0 ← w_0 + γ y_i. If the initial weights are zero, then at any step the weights are a linear combination of the feature vectors. That is, at any time we can write w = Σ_{i=1}^m α_i y_i x_i, w_0 = Σ_{i=1}^m α_i y_i, where α_i is the sum of the step sizes used for all updates based on example i. This is called the dual representation of the classifier. Even by the end of training, some examples may never have participated in an update. COMP 652 Lecture 12 12 / 37
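The dual representation suggests an equivalent implementation that stores only the coefficients α_i; the sketch below (my own code, assuming zero initial weights and a fixed step size γ) accumulates them during training.

```python
import numpy as np

def perceptron_train_dual(X, y, gamma=1.0, max_epochs=1000):
    """Track alpha_i (total step size contributed by example i) instead of w, w0 directly;
    at any time w = sum_i alpha_i*y_i*x_i and w0 = sum_i alpha_i*y_i."""
    n = X.shape[0]
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            w = (alpha * y) @ X                 # reconstruct w from the dual coefficients
            w0 = float(np.dot(alpha, y))
            if y[i] * (X[i] @ w + w0) <= 0:     # example i is misclassified
                alpha[i] += gamma               # it participates in one more update
                updated = True
        if not updated:
            break
    return alpha                                # examples with alpha_i == 0 were never used
```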

Examples used (bold) and not used (faint) in updates. [Figure] COMP 652 Lecture 12 13 / 37

Comment: Solutions are non-unique. [Figure] COMP 652 Lecture 12 14 / 37

Perceptron comments. Perceptrons can be learned to fit linearly separable data, using a gradient descent rule. There are other fitting approaches, e.g., formulation as a linear constraint satisfaction problem / linear program. Solutions are non-unique. What about data that is not linearly separable? Training by the perceptron learning rule does not terminate / need not converge. Perhaps the data can be linearly separated in a different feature space? Perhaps we can relax the criterion of separating all the data? COMP 652 Lecture 12 15 / 37

Support Vector Machines. Support vector machines (SVMs) for binary classification are a way of training perceptrons (pretty much). There are three main ideas they add: an alternative optimization criterion (the "margin"), which eliminates the non-uniqueness of solutions and has certain theoretical advantages; an efficient way of operating in expanded feature spaces (the kernel trick); and a way of handling non-separable data. SVMs can also be used for multi-class classification and regression. COMP 652 Lecture 12 16 / 37

Returning to the non-uniqueness issue. Consider a linearly separable binary classification data set {(x_i, y_i)}_{i=1}^m. There are infinitely many hyperplanes that separate the two classes. [Figure: several separating hyperplanes for the same data.] Which plane is best? Relatedly, for a given plane, for which points should we be most confident in the classification? COMP 652 Lecture 12 17 / 37

The margin, and linear SVMs. For a given separating hyperplane, the margin is two times the (Euclidean) distance from the hyperplane to the nearest training example. [Figure: the margin strip around a separating hyperplane.] It is the width of the strip around the decision boundary containing no training examples. A linear SVM is a perceptron for which we choose w, w_0 so that the margin is maximized. COMP 652 Lecture 12 18 / 37

Distance to the decision boundary. Suppose we have a decision boundary that separates the data. [Figure: a point A = x_i, its projection B onto the boundary, the normal vector w, and the distance γ_i.] Let γ_i be the distance from instance x_i to the decision boundary. How can we write γ_i in terms of x_i, y_i, w, w_0? COMP 652 Lecture 12 19 / 37

Distance to the decision boundary (II). The vector w is normal to the decision boundary; thus w/‖w‖ is the unit normal. The vector from B to A is γ_i (w/‖w‖). B, the point on the decision boundary nearest x_i, is x_i - γ_i (w/‖w‖). As B is on the decision boundary, w · (x_i - γ_i w/‖w‖) + w_0 = 0. Solving for γ_i yields γ_i = (w/‖w‖) · x_i + w_0/‖w‖. That is for a positive example; for a negative example, we get the negative of that. In either case, we can write: γ_i = y_i ((w/‖w‖) · x_i + w_0/‖w‖). COMP 652 Lecture 12 20 / 37
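The signed distances γ_i are easy to compute in bulk; this is a small sketch under the same NumPy conventions as before, and M = min_i γ_i from the next slide is then a one-liner.

```python
import numpy as np

def geometric_margins(w, w0, X, y):
    """gamma_i = y_i * ((w/||w||) . x_i + w0/||w||); negative values mean x_i is misclassified."""
    return y * (X @ w + w0) / np.linalg.norm(w)

# M = geometric_margins(w, w0, X, y).min()   # half the margin of the hyperplane (next slide)
```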

The margin. From the previous definition, the margin of the hyperplane is 2M, where M = min_i γ_i. The most direct statement of the problem of finding a maximum margin separating hyperplane is thus max_{w,w_0} min_i γ_i = max_{w,w_0} min_i y_i ((w/‖w‖) · x_i + w_0/‖w‖). This turns out to be inconvenient for optimization, however... Aside: the distances γ_i were computed assuming that w, w_0 define a separating hyperplane. If they do not, i.e., some data is misclassified, then at least one γ_i will be negative. Thus, maximizing min_i γ_i not only gives us a separating hyperplane (assuming one exists), it also gives us one with maximum margin. COMP 652 Lecture 12 21 / 37

Treating the γ_i as constraints. From the definition of the margin, we have: M ≤ γ_i = y_i ((w/‖w‖) · x_i + w_0/‖w‖) for all i. This suggests: maximize M with respect to w, w_0, subject to y_i ((w/‖w‖) · x_i + w_0/‖w‖) ≥ M for all i. COMP 652 Lecture 12 22 / 37

Treating the γ_i as constraints. From the definition of the margin, we have: M ≤ γ_i = y_i ((w/‖w‖) · x_i + w_0/‖w‖) for all i. This suggests: maximize M with respect to w, w_0, subject to y_i ((w/‖w‖) · x_i + w_0/‖w‖) ≥ M for all i. Problems: ‖w‖ appears nonlinearly in the constraints, and the problem is underconstrained: if (w, w_0, M) is an optimal solution, then so is (βw, βw_0, M) for any β > 0. COMP 652 Lecture 12 23 / 37

Adding a constraint. Let's try adding the constraint that ‖w‖ M = 1. This allows us to rewrite the objective function and constraints as: min ‖w‖ w.r.t. w, w_0, s.t. y_i(w · x_i + w_0) ≥ 1. This is really nice because the constraints are linear. The objective function ‖w‖ is still a bit awkward. COMP 652 Lecture 12 24 / 37

Final formulation. Let's minimize ‖w‖² instead of ‖w‖. (Squaring is a monotone transformation, since the norm is non-negative, so this doesn't change the optimal solution.) This gets us to: min ‖w‖² w.r.t. w, w_0, s.t. y_i(w · x_i + w_0) ≥ 1. This we can solve! How? COMP 652 Lecture 12 25 / 37

Final formulation. Let's minimize ‖w‖² instead of ‖w‖. (Squaring is a monotone transformation, since the norm is non-negative, so this doesn't change the optimal solution.) This gets us to: min ‖w‖² w.r.t. w, w_0, s.t. y_i(w · x_i + w_0) ≥ 1. This we can solve! How? It is a quadratic programming (QP) problem, a standard type of optimization problem for which many efficient software packages are available. Better yet, it is a convex (positive semidefinite) QP. Computational complexity? COMP 652 Lecture 12 26 / 37
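As a sketch of how this QP might be handed to an off-the-shelf solver, here is one possible formulation using the cvxpy modelling library; the use of cvxpy is my own assumption (any convex QP solver would do, and the course's own example on the next slide uses MATLAB).

```python
import cvxpy as cp
import numpy as np

def linear_svm_primal(X, y):
    """Hard-margin linear SVM primal: minimize ||w||^2  s.t.  y_i*(w . x_i + w0) >= 1."""
    n, d = X.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    constraints = [cp.multiply(y, X @ w + w0) >= 1]   # one linear constraint per example
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    problem.solve()                                   # convex QP, handled by the installed solver
    return w.value, w0.value
```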

A pause to think. We could stop here: this QP gives us a unique solution for finding a maximum margin perceptron / linear SVM. E.g., from MATLAB: [Figure] However, we're going to look at solving the same optimization problem another way. The formulation we've arrived at: doesn't show us the kernel trick; doesn't show us the support vectors. COMP 652 Lecture 12 27 / 37

Lagrange multipliers for inequality constraints. Suppose we have the following optimization problem, called the primal: min_w f(w) such that g_i(w) ≤ 0, i = 1, ..., k. We define the generalized Lagrangian: L(w, α) = f(w) + Σ_{i=1}^k α_i g_i(w),   (1) where the α_i, i = 1, ..., k are the Lagrange multipliers. COMP 652 Lecture 12 28 / 37

A different optimization problem. Consider P(w) = max_{α: α_i ≥ 0} L(w, α). Observe that the following is true (why?): P(w) = f(w) if all constraints are satisfied, and +∞ otherwise. (If some g_i(w) > 0, the maximization can drive α_i to infinity; if all g_i(w) ≤ 0, the best choice is α_i g_i(w) = 0 for every i.) Hence, instead of computing min_w f(w) subject to the original constraints, we can compute: p* = min_w P(w) = min_w max_{α: α_i ≥ 0} L(w, α). COMP 652 Lecture 12 29 / 37

Dual optimization problem. Let d* = max_{α: α_i ≥ 0} min_w L(w, α) (the max and min are reversed). We can show that d* ≤ p*: let p* = L(w_p, α_p) and d* = L(w_d, α_d); then d* = L(w_d, α_d) ≤ L(w_p, α_d) ≤ L(w_p, α_p) = p*. COMP 652 Lecture 12 30 / 37

Dual optimization problem. If f and the g_i are convex and the g_i can all be satisfied simultaneously for some w, then we have equality: d* = p* = L(w*, α*). Moreover, w*, α* solve the primal and dual if and only if they satisfy the following conditions (called Karush-Kuhn-Tucker): ∂L(w*, α*)/∂w_i = 0 for i = 1, ..., n (2); α*_i g_i(w*) = 0 for i = 1, ..., k (3); g_i(w*) ≤ 0 for i = 1, ..., k (4); α*_i ≥ 0 for i = 1, ..., k (5). COMP 652 Lecture 12 31 / 37

Back to the maximum margin perceptron. We wanted to solve (rewritten slightly): min (1/2)‖w‖² w.r.t. w, w_0, s.t. 1 - y_i(w · x_i + w_0) ≤ 0. The Lagrangian is: L(w, w_0, α) = (1/2)‖w‖² + Σ_i α_i (1 - y_i(w · x_i + w_0)). The primal problem is: min_{w,w_0} max_{α: α_i ≥ 0} L(w, w_0, α). We will solve the dual problem: max_{α: α_i ≥ 0} min_{w,w_0} L(w, w_0, α). In this case, the optimal solutions coincide, because we have a quadratic objective and linear constraints (both of which are convex). COMP 652 Lecture 12 32 / 37

Solving the dual. From KKT condition (2), we know that the derivatives of L(w, w_0, α) w.r.t. w, w_0 should be 0. The condition on the derivative w.r.t. w_0 gives Σ_i α_i y_i = 0. The condition on the derivative w.r.t. w gives: w = Σ_i α_i y_i x_i. Just like for the perceptron with zero initial weights, the optimal solution for w is a linear combination of the x_i, and likewise for w_0. The output is h_{w,w_0}(x) = sgn(Σ_{i=1}^m α_i y_i (x_i · x) + w_0). The output depends on the weighted dot product of the input vector with the training examples. COMP 652 Lecture 12 33 / 37
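Writing out the substitution that the next slide refers to (using w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0):

```latex
\begin{aligned}
L(w, w_0, \alpha)
 &= \tfrac{1}{2}\|w\|^2 + \sum_i \alpha_i\bigl(1 - y_i(w \cdot x_i + w_0)\bigr)\\
 &= \tfrac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
    + \sum_i \alpha_i
    - \sum_{i,j}\alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
    - w_0 \sum_i \alpha_i y_i\\
 &= \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j (x_i \cdot x_j),
 \qquad\text{since } \textstyle\sum_i \alpha_i y_i = 0 .
\end{aligned}
```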

Solving the dual (II). By plugging these back into the expression for L, we get: max_α Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j (x_i · x_j), with constraints: α_i ≥ 0 and Σ_i α_i y_i = 0. COMP 652 Lecture 12 34 / 37
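One way this dual might be solved numerically; again a sketch assuming cvxpy, and it uses the identity ‖Σ_i α_i y_i x_i‖² = Σ_{i,j} α_i α_j y_i y_j (x_i · x_j) to keep the objective in a form the modelling library accepts.

```python
import cvxpy as cp
import numpy as np

def linear_svm_dual(X, y):
    """Hard-margin SVM dual: max_alpha  sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j (x_i . x_j)
    subject to alpha_i >= 0 and sum_i alpha_i y_i = 0."""
    n = X.shape[0]
    alpha = cp.Variable(n)
    u = cp.multiply(alpha, y)                                        # u_i = alpha_i * y_i
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ u))
    constraints = [alpha >= 0, cp.sum(u) == 0]
    cp.Problem(objective, constraints).solve()
    return alpha.value
```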

The support vectors. Suppose we find the optimal α's (e.g., obtained by a standard QP package). The α_i will be > 0 only for the points for which 1 - y_i(w · x_i + w_0) = 0. These are the points lying on the edge of the margin, and they are called support vectors, because they define the decision boundary. The output of the classifier for a query point x is computed as sgn(Σ_{i=1}^m α_i y_i (x_i · x) + w_0). Hence, the output is determined by computing the dot product of the query point with the support vectors! COMP 652 Lecture 12 35 / 37
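A short sketch (my own helper names) of how the support vectors, w, and w_0 might be recovered from the dual solution and used to classify a query point; w_0 follows from the fact that any support vector satisfies y_i (w · x_i + w_0) = 1.

```python
import numpy as np

def classifier_from_alpha(X, y, alpha, tol=1e-6):
    """Recover the support vectors, w and w0 from the dual variables alpha."""
    sv = alpha > tol                          # support vectors: alpha_i > 0 (up to tolerance)
    w = (alpha * y) @ X                       # w = sum_i alpha_i * y_i * x_i
    w0 = float(np.mean(y[sv] - X[sv] @ w))    # y_i*(w.x_i + w0) = 1  =>  w0 = y_i - w.x_i; average for stability
    return sv, w, w0

# Classify query points Xq with np.sign(Xq @ w + w0), which equals
# sgn(sum_i alpha_i * y_i * (x_i . x) + w0) since alpha_i = 0 off the support vectors.
```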

Example: support vectors are shown in bold. [Figure] COMP 652 Lecture 12 36 / 37

Summary of perceptrons and linear SVMs. A perceptron is essentially just a linear decision boundary, with the positive class on one side and the negative class on the other. The perceptron learning rule allows finding a decision boundary if the data is linearly separable... but it's non-unique. The margin is twice the distance from the decision boundary to the nearest instance. A linear support vector machine is a perceptron that maximizes the margin. It can be found by quadratic programming. The solution weights are a linear combination of the support vectors, the training instances closest to the decision boundary (a fact made clear by the transformation to the dual). COMP 652 Lecture 12 37 / 37