CSC 411 Lecture 17: Support Vector Machine

Ethan Fetaya, James Lucas and Emad Andrews
University of Toronto

Today
Max-margin classification: SVM
Hard SVM
Duality
Soft SVM

Today
We are back to supervised learning. We are given training data {(x^(i), t^(i))}_{i=1}^N.
We will look at classification, so t^(i) will represent the class label.
We will focus on binary classification (two classes).
We will consider a linear classifier first (non-linear decision boundaries next class).
Tiny change from before: instead of using t = 1 and t = 0 for the positive and negative classes, we will use t = +1 for the positive class and t = -1 for the negative class.

Logistic Regression
y = +1 if w^T x + b ≥ 0
y = -1 if w^T x + b < 0
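
For concreteness, here is a minimal NumPy sketch of this decision rule (the function name and the use of NumPy are illustrative assumptions, not part of the lecture):

    import numpy as np

    def predict(w, b, X):
        # Returns +1 where w^T x + b >= 0 and -1 otherwise, for each row x of X.
        scores = X @ w + b
        return np.where(scores >= 0, 1, -1)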

Max Margin Classification
If the data is linearly separable, which separating hyperplane do you pick?
Aim: learn a boundary that leads to the largest margin (buffer) from points on both sides.
Why: intuition, theoretical support, and it works well in practice.
The subset of vectors that support (determine) the boundary are called the support vectors.

Hard SVM
Assume (for now) the data is linearly separable: there exist w and b such that ∀i: sign(w^T x^(i) + b) = t^(i).
Equivalently: ∀i: t^(i)(w^T x^(i) + b) > 0.
We want to maximize the margin; how do we formulate it mathematically?
What is the distance of x^(i) from the hyperplane w^T x + b = 0? It is |w^T x^(i) + b| / ||w|| (show!). Equivalently (assuming w separates the data), t^(i)(w^T x^(i) + b) / ||w||.
The margin for w and b: min_i t^(i)(w^T x^(i) + b) / ||w||, the distance to the closest data point.
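
As a quick check on the margin formula, here is a small NumPy sketch that computes the margin of a given separating (w, b) on a dataset (illustrative code, not from the lecture; it assumes X is an N x d array and t a vector of +/-1 labels):

    import numpy as np

    def margin(w, b, X, t):
        # Signed distances t^(i) (w^T x^(i) + b) / ||w||; the margin is the smallest one.
        distances = t * (X @ w + b) / np.linalg.norm(w)
        return distances.min()

If the returned value is negative, (w, b) does not actually separate the data.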

Hard SVM Objective
Hard SVM objective V1:
  (w*, b*) = argmax_{w,b} [ min_i t^(i)(w^T x^(i) + b) / ||w|| ]
  s.t. ∀i: t^(i)(w^T x^(i) + b) > 0
The straightforward way to write the SVM objective.
Note: if the data is linearly separable you don't really need the t^(i)(w^T x^(i) + b) > 0 constraints (why?).
Writing it differently will lead to easier optimization. The objective is scale invariant, so we can normalize the margin to one.
Hard SVM objective V2:
  (w*, b*) = argmax_{w,b} 1/||w|| = argmin_{w,b} (1/2)||w||^2
  s.t. min_i t^(i)(w^T x^(i) + b) = 1
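
The scale-invariance claim can be made explicit with a one-line calculation (a restatement of the slide's argument, written out here in LaTeX):

    \[
    \frac{t^{(i)}\big((c\,w)^\top x^{(i)} + c\,b\big)}{\lVert c\,w\rVert}
    = \frac{c\, t^{(i)}\big(w^\top x^{(i)} + b\big)}{c\,\lVert w\rVert}
    = \frac{t^{(i)}\big(w^\top x^{(i)} + b\big)}{\lVert w\rVert}
    \qquad \text{for any } c > 0,
    \]

so rescaling (w, b) by c > 0 does not change the margin, and we are free to choose c so that min_i t^(i)(w^T x^(i) + b) = 1.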

Hard SVM Objective
Further simplification: require the margin to be at least one.
Hard SVM (primal) objective:
  (w*, b*) = argmin_{w,b} (1/2)||w||^2
  s.t. ∀i: t^(i)(w^T x^(i) + b) ≥ 1
Why is this equivalent? If the margin isn't exactly one we can scale w down and get a smaller norm.
This is a convex quadratic programming problem.
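
Since the primal is a convex QP, it can be handed to a generic convex solver. Below is a minimal sketch using the CVXPY modeling library (the choice of CVXPY, and the names X and t, are assumptions for illustration; the lecture does not prescribe a solver):

    import cvxpy as cp
    import numpy as np

    def hard_svm_primal(X, t):
        # X: (N, d) array of inputs, t: length-N array of +/-1 labels (assumed separable).
        N, d = X.shape
        w = cp.Variable(d)
        b = cp.Variable()
        # Constraints t^(i) (w^T x^(i) + b) >= 1 for all i.
        constraints = [cp.multiply(t, X @ w + b) >= 1]
        objective = cp.Minimize(0.5 * cp.sum_squares(w))
        cp.Problem(objective, constraints).solve()
        return w.value, b.value

A generic QP solver like this is fine for small toy problems; the Recap slide points to SMO and Pegasos for realistic sizes.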

SVM Dual Form
Convert the constrained minimization to an unconstrained optimization problem: represent the constraints as penalty terms:
  min_{w,b} (1/2)||w||^2 + penalty term
For data {(x^(i), t^(i))}_{i=1}^N, use the following penalty for each constraint:
  max_{α_i ≥ 0} α_i [1 - (w^T x^(i) + b) t^(i)] = 0 if (w^T x^(i) + b) t^(i) ≥ 1, ∞ otherwise.
Rewrite the minimization problem:
  min_{w,b} { (1/2)||w||^2 + Σ_i max_{α_i ≥ 0} α_i [1 - (w^T x^(i) + b) t^(i)] }
  = min_{w,b} max_{α ≥ 0} { (1/2)||w||^2 + Σ_i α_i [1 - (w^T x^(i) + b) t^(i)] }
where the α_i are the Lagrange multipliers.

SVM Dual Form
Let J(w, b; α) = (1/2)||w||^2 + Σ_i α_i [1 - (w^T x^(i) + b) t^(i)].
Swap the max and min; this gives a lower bound:
  max_{α ≥ 0} min_{w,b} J(w, b; α) ≤ min_{w,b} max_{α ≥ 0} J(w, b; α)
Equality holds under certain conditions; this is called strong duality.

KKT conditions
Solving:
  max_{α ≥ 0} min_{w,b} J(w, b; α) = max_{α ≥ 0} min_{w,b} { (1/2)||w||^2 + Σ_i α_i [1 - (w^T x^(i) + b) t^(i)] }
Convex analysis theory: the solution satisfies all constraints and the following KKT conditions:
  ∂J(w, b; α)/∂w = w - Σ_i α_i t^(i) x^(i) = 0   ⟹   w = Σ_i α_i t^(i) x^(i)
  ∂J(w, b; α)/∂b = -Σ_i α_i t^(i) = 0
  α_i [1 - (w^T x^(i) + b) t^(i)] = 0   (complementary slackness)
Then substitute back to get the final dual objective:
  L = max_{α ≥ 0} { Σ_{i=1}^N α_i - (1/2) Σ_{i,j=1}^N t^(i) t^(j) α_i α_j (x^(i)T x^(j)) }
Set b = -(1/2) ( max_{i: t^(i) = -1} w^T x^(i) + min_{i: t^(i) = +1} w^T x^(i) ).
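
Assuming the dual variables α have already been obtained from some QP solver, the KKT conditions above translate directly into code; the following NumPy sketch (illustrative, not from the lecture; hard-margin case, with X an array and t a +/-1 label vector) recovers w, b and the support-vector indices:

    import numpy as np

    def recover_primal(alpha, X, t, tol=1e-8):
        # KKT: w = sum_i alpha_i t^(i) x^(i)
        w = X.T @ (alpha * t)
        # Support vectors are the points with alpha_i > 0.
        support = np.where(alpha > tol)[0]
        # b from the closest point of each class, as on the slide.
        b = -0.5 * ((X[t == -1] @ w).max() + (X[t == 1] @ w).min())
        return w, b, support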

Support Vectors
From the KKT conditions we have
  w = Σ_i α_i t^(i) x^(i),   ∀i: α_i [1 - (w^T x^(i) + b) t^(i)] = 0
If point x^(i) isn't on the boundary, then α_i = 0!
The optimal solution is a sum of only the support vectors.
Can show that if there are relatively few support vectors then SVM generalizes well.
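
In practice a library such as scikit-learn (the sklearn mentioned in the Recap slide) exposes exactly these quantities after fitting. A small illustrative sketch, using a large C to approximate the hard-margin setting (the toy data is an assumption for illustration):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
    t = np.array([-1, -1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard margin
    clf.fit(X, t)

    print(clf.support_)               # indices of the support vectors
    print(clf.support_vectors_)       # the support vectors themselves
    print(clf.dual_coef_)             # alpha_i * t^(i) for the support vectors
    print(clf.coef_, clf.intercept_)  # w and b for the linear kernel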

Summary of Linear SVM
Binary, linearly separable classification.
Linear classifier with maximal margin.
Showed two objectives, primal and dual; both are convex quadratic optimization problems.
The primal optimizes d variables, the dual optimizes N variables.
The weights are w = Σ_i α_i t^(i) x^(i).
Only a small subset of the α_i's will be nonzero, and the corresponding x^(i)'s are the support vectors S.
Prediction on a new example:
  y = sign[b + x^T w] = sign[b + x^T ( Σ_{i∈S} α_i t^(i) x^(i) )]
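
The prediction rule written in terms of the support vectors is what later makes the kernel trick possible. A small NumPy sketch of it (illustrative names, assuming alpha and b come from a solved dual):

    import numpy as np

    def predict_dual(x_new, alpha, b, X, t):
        # y = sign( b + sum_{i in S} alpha_i t^(i) (x_new . x^(i)) ); only alpha_i > 0 contribute.
        S = alpha > 1e-8
        score = b + np.sum(alpha[S] * t[S] * (X[S] @ x_new))
        return 1 if score >= 0 else -1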

Non-separable data
So far we assumed the data is separable; what do we do if it's not?
The constraints t^(i)(w^T x^(i) + b) ≥ 1 cannot be satisfied for all data points.
Solution: allow the constraints to be violated, but penalize the violation.
Introduce slack variables ξ_i ≥ 0 and make the new constraints t^(i)(w^T x^(i) + b) ≥ 1 - ξ_i.
What is the minimal slack needed? ξ_i = max{1 - t^(i)(w^T x^(i) + b), 0}
The penalty is C Σ_i ξ_i for some hyperparameter C. Some variations use a C Σ_i ξ_i^2 penalty.

Soft-SVM
The new objective:
  min_{w,b,ξ} (1/2)||w||^2 + C Σ_i ξ_i
  s.t. ξ_i ≥ 0;  ∀i: t^(i)(w^T x^(i) + b) ≥ 1 - ξ_i
An example lies on the wrong side of the hyperplane when ξ_i > 1; therefore Σ_i ξ_i upper bounds the number of training errors.
C trades off training error vs. model complexity.
This is known as the soft-margin extension.
Can show the dual form is the same except we have an additional α_i ≤ C constraint (and a different b computation).
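
Extending the earlier CVXPY sketch to the soft-margin objective just adds the slack variables (again an illustrative sketch under the same assumptions about X and t; C is the hyperparameter from the slide):

    import cvxpy as cp
    import numpy as np

    def soft_svm_primal(X, t, C=1.0):
        N, d = X.shape
        w = cp.Variable(d)
        b = cp.Variable()
        xi = cp.Variable(N)  # slack variables, one per training point
        constraints = [
            cp.multiply(t, X @ w + b) >= 1 - xi,
            xi >= 0,
        ]
        objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
        cp.Problem(objective, constraints).solve()
        return w.value, b.value, xi.value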

Hinge Loss
Our objective:
  min_{w,b,ξ} (1/2)||w||^2 + C Σ_i ξ_i
  s.t. ξ_i ≥ 0;  ∀i: t^(i)(w^T x^(i) + b) ≥ 1 - ξ_i
We can plug in the optimal ξ_i and get the equivalent objective
  min_{w,b} (1/2)||w||^2 + C Σ_i ℓ_hinge(w^T x^(i) + b, t^(i))
We define the hinge loss as ℓ_hinge(w^T x^(i) + b, t^(i)) = max{1 - t^(i)(w^T x^(i) + b), 0}.
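
The unconstrained form is easy to write down directly; a short NumPy sketch of the regularized hinge objective (illustrative, matching the slide's notation):

    import numpy as np

    def soft_svm_objective(w, b, X, t, C):
        # (1/2)||w||^2 + C * sum_i max{1 - t^(i)(w^T x^(i) + b), 0}
        hinge = np.maximum(1.0 - t * (X @ w + b), 0.0)
        return 0.5 * np.dot(w, w) + C * hinge.sum()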

Hinge Loss
  ℓ_hinge(z, t) = max{1 - z·t, 0}
  ℓ_logistic(z, t) = ln(1 + exp(-z·t))
Another surrogate loss for the 0-1 loss. Convex.
Not smooth at the kink, but that is OK (we can use subgradients; not covered in this course).
The hinge loss can be used with other learning algorithms, like neural networks.
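
For intuition, both surrogate losses are one-liners; an illustrative comparison at a few values of z·t (not from the lecture):

    import numpy as np

    zt = np.array([-2.0, 0.0, 1.0, 2.0])    # values of z * t
    hinge = np.maximum(1.0 - zt, 0.0)        # max{1 - z t, 0}
    logistic = np.log1p(np.exp(-zt))         # ln(1 + exp(-z t))
    print(hinge)     # [3. 1. 0. 0.]
    print(logistic)  # approx [2.13 0.69 0.31 0.13]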

Example: Pedestrian Detection
Big breakthrough in computer vision (2006): linear SVM + hand-crafted features (HOG).
[Image credit: "Histograms of Oriented Gradients for Human Detection"]

SVM Theoretical Guarantees *
Assume the data is separated with margin γ and that ||x|| ≤ 1.
Can show that with probability at least 1 - δ, the 0-1 loss of (hard) SVM will be bounded by
  O( (1/γ^2)/N + log(1/δ)/N )
Main observation: this does not depend on the dimension!
Can show similar results for soft SVM. Very important for kernels (soon).

Recap
Maximum margin classifiers.
Convex quadratic optimization.
Primal and dual objectives; which one to use depends on the dimension and the data size. Open-source packages like sklearn support both.
If the data isn't separable, use slack variables to penalize constraint violations.
Another perspective: hinge loss with ℓ2 regularization.
Important note: if you use an off-the-shelf quadratic solver it will be very slow; specialized SVM solvers like SMO are much better. Can also use SGD, see Pegasos.
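
To make the last pointer concrete, here is a minimal Pegasos-style stochastic subgradient sketch for the regularized hinge objective (an illustrative simplification, not the full Pegasos algorithm: it uses the (λ/2)||w||^2 + average-hinge form and omits the bias term):

    import numpy as np

    def pegasos_style_sgd(X, t, lam=0.01, n_iters=10000, seed=0):
        # Minimizes (lam/2)||w||^2 + (1/N) sum_i max{1 - t^(i) w^T x^(i), 0} by stochastic subgradient steps.
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.zeros(d)
        for it in range(1, n_iters + 1):
            i = rng.integers(N)
            eta = 1.0 / (lam * it)         # standard Pegasos step size
            if t[i] * (X[i] @ w) < 1:      # hinge active: subgradient includes -t^(i) x^(i)
                w = (1 - eta * lam) * w + eta * t[i] * X[i]
            else:
                w = (1 - eta * lam) * w
        return w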