Linear, Binary SVM Classifiers


COMPSCI 371D Machine Learning

Outline

1. What Linear, Binary SVM Classifiers Do
2. Margin
3. Loss and Regularized Risk
4. Training an SVM is a Quadratic Program
5. The KKT Conditions and the Support Vectors
6. The Dual Problem

What Linear, Binary SVM Classifiers Do: The Separable Case

Where to place the boundary? The number of choices grows with d.

What Linear, Binary SVM Classifiers Do: SVMs Maximize the Smallest Margin

Placing the boundary as far as possible from the nearest samples improves generalization: leave as much empty space around the boundary as possible.
Only the points that barely make the margin matter; these are the support vectors.
Initially, we don't know which points will be support vectors.

What Linear, Binary SVM Classifiers Do: The General Case

If the data is not linearly separable, there must be misclassified samples; these have a negative margin.
Assign a penalty inversely proportional to the margin, and a penalty that grows affinely with the negative of the margin.
Give different weights to the two penalties (cross-validation!).
Find the optimal compromise: minimum risk (total penalty).

Margin: Separating Hyperplane

X = R^d and Y = {-1, +1} (more convenient labels).
Hyperplane: n^T x + c = 0 with ‖n‖ = 1.
Decision rule: ŷ = h(x) = sign(n^T x + c); n points towards the ŷ = +1 half-space.
If y is the true label, the decision is correct if
  n^T x + c ≥ 0 when y = +1
  n^T x + c ≤ 0 when y = -1.
More compactly, a decision is correct if y(n^T x + c) ≥ 0.
SVMs want this inequality to hold with a margin.
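To make the decision rule concrete, here is a minimal Python sketch; the unit normal n, offset c, and test point x are made-up example values, not from the lecture.

```python
import numpy as np

# Hypothetical example values: n must be a unit vector, c is the offset.
n = np.array([0.6, 0.8])   # ||n|| = 1
c = -0.5

def predict(x):
    """Decision rule yhat = sign(n^T x + c); n points toward the yhat = +1 half-space."""
    return 1 if n @ x + c >= 0 else -1

def is_correct(x, y):
    """A decision is correct exactly when y (n^T x + c) >= 0."""
    return y * (n @ x + c) >= 0

x = np.array([1.0, 0.2])
print(predict(x), is_correct(x, +1))
```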

Margin: Margin

The margin of (x, y) is the signed distance of x from the boundary: positive if x is on the correct side of the boundary, negative otherwise:
  μ_v(x, y) := y (n^T x + c)   where v = (n, c).
Margin of a training set T:
  μ_v(T) := min_{(x, y) ∈ T} μ_v(x, y).
The boundary separates T if μ_v(T) > 0.
[Figure: a separating hyperplane with normal n, the ŷ = +1 and ŷ = -1 half-spaces, and the margins μ_v(x, +1), μ_v(x, -1) of two sample points.]
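A small sketch of the two margin definitions, assuming ‖n‖ = 1 so that μ_v(x, y) is indeed a signed distance; the function names are ours, not the lecture's.

```python
import numpy as np

def point_margin(n, c, x, y):
    """mu_v(x, y) = y (n^T x + c) for v = (n, c) with ||n|| = 1."""
    return y * (n @ x + c)

def set_margin(n, c, X, Y):
    """mu_v(T) = min over (x, y) in T of mu_v(x, y); the boundary separates T iff this is > 0."""
    return min(point_margin(n, c, x, y) for x, y in zip(X, Y))
```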

Loss and Regularized Risk: The Hinge Loss

Reference margin μ* > 0 (unknown, to be determined).
Hinge loss of (x, y): ℓ(x, y) = max{0, μ* - μ_v(x, y)}.
Training samples with μ_v(x, y) ≥ μ* are classified correctly with a margin of at least μ*.
Some loss is incurred as soon as μ_v(x, y) < μ*, even if the sample is classified correctly.
[Figure: the reference margin μ* on either side of the separating hyperplane, with losses ξ(x, +1), ξ(x, -1) for points that do not clear it.]
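The hinge loss as a one-liner; mu_star stands for the reference margin μ* and margin for μ_v(x, y), both assumed given.

```python
def hinge_loss(mu_star, margin):
    """max{0, mu* - mu_v(x, y)}: zero only when the point clears the reference
    margin; positive for points inside it, even if they are classified correctly."""
    return max(0.0, mu_star - margin)
```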

Loss and Regularized Risk: The Training Risk

The training risk for SVMs is not just (1/N) Σ_{n=1}^N ℓ(x_n, y_n).
A regularization term is added to force 1/μ* to be small, that is, the margin to be large.
The separating hyperplane is n^T x + c = 0. Let w^T x + b = 0 with w = ω n, b = ω c, and ω = ‖w‖ = 1/μ*.
ω is a reciprocal scaling factor: large margin, small ω.
Make the risk higher when ω is large (small margin):
  L_T(w, b) := (1/2)‖w‖² + (C/N) Σ_{n=1}^N ℓ(x_n, y_n; w, b)
where
  ℓ(x, y) = (1/μ*) max{0, μ* - μ_v(x, y)} = max{0, 1 - y(n^T x + c)/μ*} = max{0, 1 - y(w^T x + b)}.
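A quick numerical check (on arbitrary made-up values) that the rescaled hinge loss (1/μ*) max{0, μ* - y(n^T x + c)} equals max{0, 1 - y(w^T x + b)} once w = n/μ* and b = c/μ*:

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.normal(size=3)
n /= np.linalg.norm(n)           # unit normal
c, mu_star = 0.3, 0.7            # arbitrary offset and reference margin
w, b = n / mu_star, c / mu_star  # rescaled parameters

x, y = rng.normal(size=3), -1
lhs = max(0.0, mu_star - y * (n @ x + c)) / mu_star
rhs = max(0.0, 1.0 - y * (w @ x + b))
print(np.isclose(lhs, rhs))      # True
```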

Loss and Regularized Risk: Regularized Risk

ERM classifier: (w*, b*) = ERM_T(w, b) = argmin_{(w, b)} L_T(w, b)
where
  L_T(w, b) := (1/2)‖w‖² + (C/N) Σ_{n=1}^N ℓ(x_n, y_n; w, b)
  ℓ(x_n, y_n; w, b) := max{0, 1 - y_n(w^T x_n + b)}.
C determines a trade-off: large C makes ‖w‖ less important, hence larger ω, hence a smaller margin, hence fewer samples within the margin.
We buy a larger margin by accepting more samples inside it. Cross-validation!
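Putting the pieces together, a sketch of evaluating the regularized training risk L_T(w, b); X is an N×d matrix of samples and Y a vector of ±1 labels (our naming, not the lecture's).

```python
import numpy as np

def regularized_risk(w, b, X, Y, C):
    """L_T(w, b) = (1/2)||w||^2 + (C/N) sum_n max{0, 1 - y_n (w^T x_n + b)}."""
    N = len(Y)
    hinge = np.maximum(0.0, 1.0 - Y * (X @ w + b))
    return 0.5 * (w @ w) + (C / N) * hinge.sum()
```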

Training an SVM is a Quadratic Program: Rephrasing as a Quadratic Program

(w*, b*) = argmin_{(w, b)} (1/2)‖w‖² + (C/N) Σ_{n=1}^N ℓ(x_n, y_n; w, b)
where ℓ(x_n, y_n; w, b) := max{0, 1 - y_n(w^T x_n + b)}.
Not differentiable because of the max: bummer!
Neat trick: instead of minimizing L_T over w, b, we introduce new variables ξ_n and minimize L_T over w, b, ξ_1, ..., ξ_N with the constraints ξ_n ≥ ℓ(x_n, y_n; w, b).
We would not decrease L_T if we made ξ_n bigger.
Advantage: we can now split ξ_n ≥ ℓ(x_n, y_n; w, b) into two separate, affine constraints because ℓ is a hinge:
  ℓ ≥ 0 everywhere, so we need ξ_n ≥ 0;
  when 1 - y_n(w^T x_n + b) ≥ 0, we need ξ_n ≥ 1 - y_n(w^T x_n + b).
Thus, ξ_n ≥ 0 and y_n(w^T x_n + b) - 1 + ξ_n ≥ 0.

Training an SVM is a Quadratic Program: Quadratic Program Formulation

We achieve differentiability at the cost of adding N slack variables ξ_n.
Old:
  min_{(w, b)} (1/2)‖w‖² + (C/N) Σ_{n=1}^N ℓ(x_n, y_n; w, b)
  where ℓ(x_n, y_n; w, b) := max{0, 1 - y_n(w^T x_n + b)}.
New:
  min_{w, b, ξ} f(w, ξ) where f(w, ξ) = (1/2)‖w‖² + γ Σ_{n=1}^N ξ_n with γ := C/N,
  subject to the constraints y_n(w^T x_n + b) - 1 + ξ_n ≥ 0 and ξ_n ≥ 0.
We have our quadratic program!
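As a sketch of how this primal quadratic program could be handed to a generic convex solver, here is a version using cvxpy (an assumption on our part; the lecture does not prescribe any solver, and the function name is ours):

```python
import numpy as np
import cvxpy as cp

def train_primal_svm(X, Y, C):
    """Minimize (1/2)||w||^2 + (C/N) sum_n xi_n subject to
    y_n (w^T x_n + b) - 1 + xi_n >= 0 and xi_n >= 0."""
    N, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(N)
    gamma = C / N
    objective = 0.5 * cp.sum_squares(w) + gamma * cp.sum(xi)
    constraints = [cp.multiply(Y, X @ w + b) - 1 + xi >= 0, xi >= 0]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value, b.value, xi.value
```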

Training an SVM is a Quadratic Program: The Lagrangian

min_{w, b, ξ} f(w, ξ) where f(w, ξ) = (1/2)‖w‖² + γ Σ_{n=1}^N ξ_n with γ := C/N,
subject to the constraints y_n(w^T x_n + b) - 1 + ξ_n ≥ 0 and ξ_n ≥ 0.
Lagrangian: L(u, λ) = f(u) - λ^T c(u), with one multiplier per constraint:
  L(w, b, ξ, α, β) := (1/2)‖w‖² + γ Σ_{n=1}^N ξ_n - Σ_{n=1}^N α_n [y_n(w^T x_n + b) - 1 + ξ_n] - Σ_{n=1}^N β_n ξ_n.
Here u is (w, b, ξ) and λ collects both the α_n and the β_n.

The KKT Conditions and the Support Vectors: The KKT Conditions

L(w, b, ξ, α, β) := (1/2)‖w‖² + γ Σ_{m=1}^N ξ_m - Σ_{m=1}^N α_m [y_m(w^T x_m + b) - 1 + ξ_m] - Σ_{m=1}^N β_m ξ_m

Stationarity:
  ∂L/∂w = w - Σ_{n=1}^N α_n y_n x_n = 0
  ∂L/∂b = -Σ_{n=1}^N α_n y_n = 0
  ∂L/∂ξ_n = γ - α_n - β_n = 0
Primal feasibility: y_n(w^T x_n + b) - 1 + ξ_n ≥ 0 and ξ_n ≥ 0
Dual feasibility: α_n ≥ 0 and β_n ≥ 0
Complementarity: α_n [y_n(w^T x_n + b) - 1 + ξ_n] = 0 and β_n ξ_n = 0
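For intuition, the KKT conditions can be checked numerically on candidate primal variables (w, b, ξ) and multipliers (α, β); a rough sketch (the helper and its tolerance are ours, not the lecture's):

```python
import numpy as np

def kkt_satisfied(w, b, xi, alpha, beta, X, Y, gamma, tol=1e-6):
    """Check stationarity, feasibility, and complementarity up to a tolerance."""
    slack = Y * (X @ w + b) - 1 + xi
    return (np.allclose(w, (alpha * Y) @ X, atol=tol)        # dL/dw = 0
            and abs(alpha @ Y) < tol                          # dL/db = 0
            and np.allclose(gamma, alpha + beta, atol=tol)    # dL/dxi = 0
            and (slack > -tol).all() and (xi > -tol).all()    # primal feasibility
            and (alpha > -tol).all() and (beta > -tol).all()  # dual feasibility
            and (np.abs(alpha * slack) < tol).all()           # complementarity
            and (np.abs(beta * xi) < tol).all())
```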

The KKT Conditions and the Support Vectors: The Support Vectors

w = Σ_{n=1}^N α_n y_n x_n
The separating-hyperplane parameters w are a linear combination of the training data points x_n.
In the separable case, only the data points on the margin boundaries have nonzero α_n.
In the non-separable case, it's more complicated, but only data points near the margin have nonzero α_n.
Either way, these data points are called the support vectors.
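Given a dual solution α (next slide), the first stationarity condition gives w directly, and the support vectors are the points with α_n > 0 up to a numerical tolerance. A small sketch, with names of our choosing:

```python
import numpy as np

def recover_w_and_support(alpha, X, Y, tol=1e-8):
    """w = sum_n alpha_n y_n x_n; the support vectors are the x_n with alpha_n > 0."""
    w = (alpha * Y) @ X
    support = np.flatnonzero(alpha > tol)
    return w, support
```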

The Dual Problem: The Dual

Some crank-turning: solve the KKT conditions for w and ξ, and plug back in.
This yields a very simple dual:
  D(α) = Σ_{n=1}^N α_n - (1/2) Σ_{m=1}^N Σ_{n=1}^N α_m α_n y_m y_n x_m^T x_n,
to be maximized subject to the constraints Σ_{n=1}^N α_n y_n = 0 and 0 ≤ α_n ≤ γ.
D turns out to be concave, so maximizing it (equivalently, minimizing -D) is a convex program.
Key feature: x_n appears only in inner products with other x_m. This will have very important consequences.
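A sketch of solving the dual with the same generic solver (again assuming cvxpy, which the lecture does not require); note that the data enter only through inner products x_m^T x_n, as the last bullet emphasizes:

```python
import numpy as np
import cvxpy as cp

def train_dual_svm(X, Y, C):
    """Maximize D(alpha) = sum_n alpha_n - (1/2) sum_{m,n} alpha_m alpha_n y_m y_n x_m^T x_n
    subject to sum_n alpha_n y_n = 0 and 0 <= alpha_n <= gamma = C/N."""
    N = X.shape[0]
    gamma = C / N
    alpha = cp.Variable(N)
    # The quadratic term alpha^T Q alpha with Q_{mn} = y_m y_n x_m^T x_n,
    # written as || sum_n alpha_n y_n x_n ||^2 to keep the problem convex for the solver.
    quad = cp.sum_squares(X.T @ cp.multiply(alpha, Y))
    objective = cp.sum(alpha) - 0.5 * quad
    constraints = [Y @ alpha == 0, alpha >= 0, alpha <= gamma]
    cp.Problem(cp.Maximize(objective), constraints).solve()
    return alpha.value
```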