Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers
Machine Learning and Pattern Recognition, September 16, 2008
Piotr Mirowski
Based on slides by Sumit Chopra and Fu-Jie Huang

Outline
- What is behind Support Vector Machines?
- Constrained Optimization
- Lagrange Duality
- Support Vector Machines in Detail
- Kernel Trick
- LibSVM demo

Binary Classification Problem
Given: training data generated according to a distribution D,
$$ (x_1, y_1), \dots, (x_p, y_p) \in \mathbb{R}^n \times \{-1, 1\} $$
where $\mathbb{R}^n$ is the input space and $\{-1, 1\}$ the label space.
Problem: find a classifier (a function) $h(x) : \mathbb{R}^n \to \{-1, 1\}$ that generalizes well on a test set drawn from the same distribution D.
Solutions:
- Linear approach: linear classifiers (e.g. Perceptron)
- Non-linear approach: non-linear classifiers (e.g. Neural Networks, SVM)

Linearly Separable Data
Assume that the training data is linearly separable.
[Figure: two linearly separable classes of points in the $(x_1, x_2)$ plane]

Linearly Separable Data
Assume that the training data is linearly separable.
Separating hyperplane: $w \cdot x + b = 0$, with $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ (measuring abscissae along the axis parallel to $w$, the hyperplane lies at distance $|b| / \|w\|$ from the origin $0$).
Then the classifier is $h(x) = w \cdot x + b$, and inference is $\mathrm{sign}(h(x)) \in \{-1, 1\}$.

Linearly Separable Data
Assume that the training data is linearly separable, with separating hyperplane $w \cdot x + b = 0$.
Scale $(w, b)$ so that for the closest points $w \cdot x + b = \pm 1$, i.e. $h(x) = w \cdot x + b \in \{-1, 1\}$ on those points.
The margin is then $\rho = 1 / \|w\|$ on each side, and the total separation between the hyperplanes $w \cdot x + b = -1$ and $w \cdot x + b = 1$ is $2\rho = 2 / \|w\|$.
Goal: maximize the margin $\rho$ (or $2\rho$).
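
As a quick check of the margin formula, take $x_+$ on the hyperplane $w \cdot x + b = 1$ and $x_-$ on $w \cdot x + b = -1$ with $x_+ - x_-$ parallel to $w$:
$$ w \cdot (x_+ - x_-) = 2, \qquad x_+ - x_- = t \, \frac{w}{\|w\|} \;\Rightarrow\; t \, \|w\| = 2 \;\Rightarrow\; \|x_+ - x_-\| = \frac{2}{\|w\|} = 2\rho. $$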

Optimization Problem
A constrained optimization problem:
$$ \min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, m $$
where $y_i$ is the label and $x_i$ the input.
This is equivalent to maximizing the margin $\rho = 1 / \|w\|$.
It is a convex optimization problem: the objective is convex, and the constraints are affine, hence convex.
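
To make the primal QP concrete, here is a minimal sketch, assuming the cvxpy package and an illustrative hand-made linearly separable dataset (the arrays X and y below are not from the slides):

    # Sketch: hard-margin SVM primal QP on a tiny linearly separable dataset.
    import numpy as np
    import cvxpy as cp

    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # class +1
                  [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # class -1
    y = np.array([1, 1, 1, -1, -1, -1])

    w = cp.Variable(2)
    b = cp.Variable()
    # Objective: (1/2) ||w||^2
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    # Constraints: y_i (w . x_i + b) >= 1 for all i
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()

    print("w =", w.value, " b =", b.value)
    print("margin =", 1.0 / np.linalg.norm(w.value))

Any QP solver would do; cvxpy is used here only for brevity.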

Optimization Problem
Compare:
$$ \min_{w, b} \; \underbrace{\frac{1}{2} \|w\|^2}_{\text{objective}} \quad \text{s.t.} \quad \underbrace{y_i (w \cdot x_i + b) \ge 1, \; i = 1, \dots, m}_{\text{constraints}} $$
with the unconstrained, regularized form:
$$ \min_{w, b} \; \underbrace{\sum_{i=1}^{p} \big( y_i - (w \cdot x_i + b) \big)^2}_{\text{energy / errors}} + \underbrace{\lambda \|w\|^2}_{\text{regularization}} $$

Optimization: Some Theory
The problem:
$$ \min_x \; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i = 1, \dots, m \quad \text{and} \quad h_i(x) = 0, \; i = 1, \dots, p $$
with objective function $f_0$, $m$ inequality constraints and $p$ equality constraints.
Solution of the problem: $x^o$ is a global optimum if the problem is convex, and only a local optimum in general if the problem is not convex.

Optimization: Some Theory
Example: standard Linear Program (LP)
$$ \min_x \; c^T x \quad \text{s.t.} \quad A x = b, \;\; x \ge 0 $$
Example: least-squares solution of linear equations (with $L_2$-norm regularization of the solution $x$), i.e. ridge regression:
$$ \min_x \; x^T x \quad \text{s.t.} \quad A x = b $$
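
A minimal numerical sketch of both examples, assuming NumPy and SciPy; the matrices A, b, c below are arbitrary illustrative data:

    # Sketch: solving the two example problems numerically.
    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 2.0, 1.0],
                  [2.0, 0.0, 1.0]])
    b = np.array([4.0, 3.0])
    c = np.array([1.0, 1.0, 1.0])

    # Standard LP: min c^T x  s.t.  Ax = b, x >= 0
    lp = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))
    print("LP solution:", lp.x)

    # Minimum-norm solution of Ax = b: min x^T x  s.t.  Ax = b
    # (closed form via the pseudo-inverse)
    x_min_norm = np.linalg.pinv(A) @ b
    print("min-norm solution:", x_min_norm, " residual:", A @ x_min_norm - b)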

Big Picture
Constrained vs. unconstrained optimization.
Hierarchy of objective functions $f_0$:
- smooth = infinitely differentiable
- convex = has a global optimum
[Diagram: $f_0$ branches into convex vs. non-convex, each into smooth vs. non-smooth; the SVM objective sits on the convex side, neural networks (NN) on the non-convex side]

Toy Example: Equality Constraint
Example 1:
$$ \min_x \; x_1 + x_2 \quad \text{s.t.} \quad h_1(x) = x_1^2 + x_2^2 - 2 = 0 $$
with $f(x) = x_1 + x_2$. The gradients are
$$ \nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right) = (1, 1), \qquad \nabla h_1 = \left( \frac{\partial h_1}{\partial x_1}, \frac{\partial h_1}{\partial x_2} \right) = (2 x_1, 2 x_2). $$
At the optimal solution the two gradients are parallel:
$$ \nabla f(x^o) = \lambda_1^o \, \nabla h_1(x^o). $$

Toy Example: Equality Constraint
$x$ is not an optimal solution if there exists a step $s \neq 0$ such that $h_1(x + s) = 0$ and $f(x + s) < f(x)$.
Using a first-order Taylor expansion:
$$ h_1(x + s) = h_1(x) + \nabla h_1(x)^T s = \nabla h_1(x)^T s = 0 \qquad (1) $$
$$ f(x + s) - f(x) = \nabla f(x)^T s < 0 \qquad (2) $$
Such an $s$ can exist only when $\nabla f(x)$ and $\nabla h_1(x)$ are not parallel.

Toy Example: Equality Constraint
Thus we have $\nabla f(x^o) = \lambda_1^o \nabla h_1(x^o)$.
The Lagrangian:
$$ L(x, \lambda_1) = f(x) - \lambda_1 h_1(x) $$
with $\lambda_1$ the Lagrange multiplier (or dual variable) for $h_1$.
Thus at the solution:
$$ \nabla_x L(x^o, \lambda_1^o) = \nabla f(x^o) - \lambda_1^o \nabla h_1(x^o) = 0 $$
This is just a necessary (not a sufficient) condition: $x$ being a solution implies that $\nabla h_1(x)$ and $\nabla f(x)$ are parallel.
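
As a worked check of Example 1, the Lagrangian conditions can be solved in closed form:
$$ \nabla_x L(x, \lambda_1) = (1 - 2\lambda_1 x_1, \; 1 - 2\lambda_1 x_2) = 0 \;\Rightarrow\; x_1 = x_2 = \frac{1}{2\lambda_1}, $$
$$ x_1^2 + x_2^2 = 2 \;\Rightarrow\; \frac{1}{2\lambda_1^2} = 2 \;\Rightarrow\; \lambda_1 = \pm\frac{1}{2}. $$
The choice $\lambda_1^o = -\tfrac{1}{2}$ gives the minimizer $x^o = (-1, -1)$ with $f(x^o) = -2$, while $\lambda_1 = \tfrac{1}{2}$ gives the maximizer $(1, 1)$.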

Toy Example: Inequality Constraint
Example 2:
$$ \min_x \; x_1 + x_2 \quad \text{s.t.} \quad c_1(x) = 2 - x_1^2 - x_2^2 \ge 0 $$
with $f(x) = x_1 + x_2$. The gradients are
$$ \nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right) = (1, 1), \qquad \nabla c_1 = \left( \frac{\partial c_1}{\partial x_1}, \frac{\partial c_1}{\partial x_2} \right) = (-2 x_1, -2 x_2). $$

Toy Example: Inequality Constraint
$x$ is not an optimal solution if there exists a step $s \neq 0$ such that $c_1(x + s) \ge 0$ and $f(x + s) < f(x)$.
Using a first-order Taylor expansion:
$$ c_1(x + s) = c_1(x) + \nabla c_1(x)^T s \ge 0 \qquad (1) $$
$$ f(x + s) - f(x) = \nabla f(x)^T s < 0 \qquad (2) $$

Toy Example: Inequality Constraint
Case 1: inactive constraint, $c_1(x) > 0$.
Condition (1) holds for any sufficiently small $s$, so a descent step exists as long as $\nabla f(x) \neq 0$; e.g. take $s = -\alpha \nabla f(x)$ with $\alpha > 0$.
Case 2: active constraint, $c_1(x) = 0$.
Conditions (1) and (2) become $\nabla c_1(x)^T s \ge 0$ and $\nabla f(x)^T s < 0$; no such $s$ exists when
$$ \nabla f(x) = \lambda_1 \nabla c_1(x), \quad \text{where } \lambda_1 \ge 0. $$

Toy Example: Inequality Constraint
Thus we have the Lagrangian (as before):
$$ L(x, \lambda_1) = f(x) - \lambda_1 c_1(x) $$
with $\lambda_1$ the Lagrange multiplier (or dual variable) for $c_1$.
The optimality conditions:
$$ \nabla_x L(x^o, \lambda_1^o) = \nabla f(x^o) - \lambda_1^o \nabla c_1(x^o) = 0 \quad \text{for some } \lambda_1^o \ge 0 $$
and
$$ \lambda_1^o \, c_1(x^o) = 0 \qquad \text{(complementarity condition)} $$
i.e. either $c_1(x^o) = 0$ (active constraint) or $\lambda_1^o = 0$ (inactive constraint).
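
As a worked check of Example 2: since $f(x) = x_1 + x_2$ is unbounded below on $\mathbb{R}^2$, the constraint must be active at the optimum, $c_1(x^o) = 0$, i.e. $x_1^2 + x_2^2 = 2$. Then
$$ \nabla f(x^o) = \lambda_1^o \nabla c_1(x^o) \;\Rightarrow\; (1, 1) = \lambda_1^o (-2 x_1^o, -2 x_2^o) \;\Rightarrow\; x_1^o = x_2^o = -\frac{1}{2\lambda_1^o}, $$
and the constraint gives $\lambda_1^o = \tfrac{1}{2} \ge 0$, so $x^o = (-1, -1)$, with $c_1(x^o) = 0$ (active) and $\lambda_1^o c_1(x^o) = 0$ as required.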

Same Concepts in a More General Setting

Lagrange Function
The problem:
$$ \min_x \; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i = 1, \dots, m \quad \text{and} \quad h_i(x) = 0, \; i = 1, \dots, p $$
with objective function $f_0$, $m$ inequality constraints and $p$ equality constraints.
Standard tool for constrained optimization: the Lagrange function
$$ L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) $$
where the $\lambda_i$ and $\nu_i$ are the dual variables or Lagrange multipliers.

Lagrange Dual Function
Defined, for $\lambda \in \mathbb{R}^m$ ($m$ inequality constraints) and $\nu \in \mathbb{R}^p$ ($p$ equality constraints), as the minimum value of the Lagrange function over $x$:
$$ g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}, \qquad g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) = \inf_{x \in D} \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right) $$

Lagrange Dual Function
Interpretation of the Lagrange dual function: write the original problem as an unconstrained problem with hard indicator (penalty) functions,
$$ \min_x \; f_0(x) + \sum_{i=1}^{m} I_-\big(f_i(x)\big) + \sum_{i=1}^{p} I_0\big(h_i(x)\big) $$
where
$$ I_-(u) = \begin{cases} 0 & u \le 0 \;\text{(satisfied)} \\ \infty & u > 0 \;\text{(unsatisfied)} \end{cases} \qquad I_0(u) = \begin{cases} 0 & u = 0 \;\text{(satisfied)} \\ \infty & u \ne 0 \;\text{(unsatisfied)} \end{cases} $$
are the indicator functions.

Lagrange Dual Function
Interpretation of the Lagrange dual function: the Lagrange multipliers in the dual function can be seen as a softer version of the indicator (penalty) functions:
$$ \min_x \; f_0(x) + \sum_{i=1}^{m} I_-\big(f_i(x)\big) + \sum_{i=1}^{p} I_0\big(h_i(x)\big) \quad \longrightarrow \quad \inf_{x \in D} \; f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) $$

Lagrange Dual Function
The Lagrange dual function gives a lower bound on the optimal value of the problem: $g(\lambda, \nu) \le p^o$.
Proof: let $\tilde{x}$ be a feasible point and let $\lambda \ge 0$. Then
$$ f_i(\tilde{x}) \le 0, \; i = 1, \dots, m \qquad \text{and} \qquad h_i(\tilde{x}) = 0, \; i = 1, \dots, p, $$
so that
$$ L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \underbrace{\sum_{i=1}^{m} \lambda_i f_i(\tilde{x})}_{\le 0} + \underbrace{\sum_{i=1}^{p} \nu_i h_i(\tilde{x})}_{= 0} \le f_0(\tilde{x}). $$
Hence
$$ g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) \le L(\tilde{x}, \lambda, \nu) \le f_0(\tilde{x}), $$
and in particular, taking $\tilde{x}$ optimal, $g(\lambda, \nu) \le p^o$.

Sufficient Condition
If $(x^o, \lambda^o, \nu^o)$ is a saddle point of the Lagrangian, i.e. if for all $x \in \mathbb{R}^n$, all $\lambda \ge 0$ and all $\nu$:
$$ L(x^o, \lambda, \nu) \le L(x^o, \lambda^o, \nu^o) \le L(x, \lambda^o, \nu^o), $$
then $x^o$ is a solution of the primal problem (with optimal value $p^o$).

Lagrange Dual Problem
The Lagrange dual function gives a lower bound on the optimal value of the problem. We seek the best such lower bound on the minimum of the objective:
$$ \text{maximize} \;\; g(\lambda, \nu) \quad \text{s.t.} \;\; \lambda \ge 0 $$
The dual optimal value and solution: $d^o = g(\lambda^o, \nu^o)$.
The Lagrange dual problem is convex even if the original problem is not.

Primal / Dual Problems
Primal problem:
$$ p^o = \min_{x \in D} \; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i = 1, \dots, m, \quad h_i(x) = 0, \; i = 1, \dots, p $$
Dual problem:
$$ d^o = \max_{\lambda, \nu} \; g(\lambda, \nu) \quad \text{s.t.} \quad \lambda \ge 0, \qquad \text{where} \quad g(\lambda, \nu) = \inf_{x \in D} \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right) $$
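
As a worked illustration, the dual of the standard-form LP from the earlier example ($\min_x c^T x$ s.t. $Ax = b$, $x \ge 0$) can be read off the Lagrangian, taking $\lambda \ge 0$ as the multiplier for $-x \le 0$ and $\nu$ for $Ax - b = 0$:
$$ L(x, \lambda, \nu) = c^T x - \lambda^T x + \nu^T (A x - b) = (c - \lambda + A^T \nu)^T x - b^T \nu, $$
$$ g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \begin{cases} -b^T \nu & \text{if } c - \lambda + A^T \nu = 0 \\ -\infty & \text{otherwise,} \end{cases} $$
so, eliminating $\lambda = c + A^T \nu \ge 0$, the dual problem is $\max_\nu \; -b^T \nu$ subject to $A^T \nu + c \ge 0$.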

Weak Duality
Weak duality theorem: $d^o \le p^o$.
Optimal duality gap: $p^o - d^o \ge 0$.
This bound is sometimes used to obtain an estimate of the optimal value of an original problem that is difficult to solve.

Strong Duality
Strong duality: $d^o = p^o$. Strong duality does not hold in general.
Slater's condition: there exists an $x \in D$ that is strictly feasible, i.e.
$$ f_i(x) < 0 \;\text{ for } i = 1, \dots, m \qquad \text{and} \qquad h_i(x) = 0 \;\text{ for } i = 1, \dots, p. $$
Strong duality theorem: if the problem is convex and Slater's condition holds, then strong duality is attained: there exist $(\lambda^o, \nu^o)$ with
$$ d^o = g(\lambda^o, \nu^o) = \max_{\lambda \ge 0, \nu} g(\lambda, \nu) = \inf_x L(x, \lambda^o, \nu^o) = p^o. $$

Optimality Conditions: First Order
Complementary slackness: if strong duality holds, then at the optimum $(x^o, \lambda^o, \nu^o)$:
$$ \lambda_i^o f_i(x^o) = 0, \quad i = 1, \dots, m. $$
Proof:
$$ f_0(x^o) = g(\lambda^o, \nu^o) \qquad \text{(strong duality)} $$
$$ = \inf_x \left( f_0(x) + \sum_{i=1}^{m} \lambda_i^o f_i(x) + \sum_{i=1}^{p} \nu_i^o h_i(x) \right) $$
$$ \le f_0(x^o) + \sum_{i=1}^{m} \lambda_i^o f_i(x^o) + \sum_{i=1}^{p} \nu_i^o h_i(x^o) $$
$$ \le f_0(x^o) \qquad \text{(since } \forall i: f_i(x^o) \le 0, \; \lambda_i^o \ge 0, \text{ and } \forall i: h_i(x^o) = 0\text{)}. $$
Hence both inequalities hold with equality, so $\sum_i \lambda_i^o f_i(x^o) = 0$, and each (non-positive) term must vanish: $\lambda_i^o f_i(x^o) = 0$.

Optimality Conditions: First Order
Karush-Kuhn-Tucker (KKT) conditions: if strong duality holds, then at the optimum:
$$ f_i(x^o) \le 0, \quad i = 1, \dots, m $$
$$ h_i(x^o) = 0, \quad i = 1, \dots, p $$
$$ \lambda_i^o \ge 0, \quad i = 1, \dots, m $$
$$ \lambda_i^o f_i(x^o) = 0, \quad i = 1, \dots, m $$
$$ \nabla f_0(x^o) + \sum_{i=1}^{m} \lambda_i^o \nabla f_i(x^o) + \sum_{i=1}^{p} \nu_i^o \nabla h_i(x^o) = 0 $$
The KKT conditions are necessary in general (local optimum), and necessary and sufficient for convex problems (global optimum).

What are Support Vector Machines?
- Linear classifiers
- (Mostly) binary classifiers
- Supervised training
- Good generalization, with explicit bounds

Main Ideas Behind Support Vector Machines
- Maximal margin
- Dual space
- Linear classifiers in a high-dimensional space, using a non-linear mapping
- Kernel trick

Quadratic Programming

Using the Lagrangian
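
One standard way to carry out this step (a sketch consistent with the primal problem above): introduce multipliers $\alpha_i \ge 0$ for the constraints $y_i (w \cdot x_i + b) \ge 1$ and form the Lagrangian of the hard-margin problem,
$$ L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \big( y_i (w \cdot x_i + b) - 1 \big), \qquad \alpha_i \ge 0. $$
Setting the gradients with respect to the primal variables to zero gives
$$ \nabla_w L = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0, $$
and complementary slackness, $\alpha_i \big( y_i (w \cdot x_i + b) - 1 \big) = 0$, shows that only the closest points (the support vectors) can have $\alpha_i > 0$.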

Dual Space

Strong Duality

Duality Gap

No Duality Gap Thanks to Convexity

Dual Form
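
For reference, a sketch of the standard result: substituting the stationarity conditions $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$ back into the Lagrangian gives the dual problem in the $\alpha_i$ alone,
$$ \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) \quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0, $$
and the resulting classifier depends on the data only through dot products,
$$ h(x) = \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b, $$
with $b$ recovered from any support vector via $y_i (w \cdot x_i + b) = 1$. This is what makes the kernel trick possible: replace $x_i \cdot x_j$ by $k(x_i, x_j)$.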

Non-linear Separation of Datasets
Linear separation is impossible in most problems.
Illustration from Prof. Mohri's lecture notes.

Non-separable Datasets
Solutions:
1) Non-linear classifiers
2) Increase the dimensionality of the dataset and add a non-linear mapping Ф
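
A minimal sketch of idea 2) above, on a hypothetical 1-D toy dataset: points labelled +1 when |x| > 1 cannot be linearly separated on the line, but become separable after the non-linear mapping Ф(x) = (x, x²). The dataset and the mapping are illustrative assumptions, not from the slides:

    # Sketch: an explicit non-linear mapping makes a 1-D dataset linearly separable.
    import numpy as np

    x = np.array([-2.0, -1.5, 1.5, 2.0,   # label +1 (|x| > 1)
                  -0.5, 0.0, 0.3, 0.8])   # label -1 (|x| <= 1)
    y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

    def phi(x):
        # Non-linear mapping R -> R^2: x -> (x, x^2)
        return np.stack([x, x**2], axis=1)

    Z = phi(x)
    # In the mapped space, the hyperplane z2 = 1 (w = (0, 1), b = -1) separates the classes:
    w, b = np.array([0.0, 1.0]), -1.0
    print(np.sign(Z @ w + b) == y)   # all True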

Kernel Trick
A kernel $k(x, x')$ acts as a similarity measure between 2 data samples.
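
For illustration, a sketch of the kernels listed later in the deck (linear, polynomial, Gaussian) written as explicit similarity functions of two samples; the parameter values are illustrative defaults:

    # Sketch of common kernel functions k(x, x') = similarity between two samples.
    import numpy as np

    def linear_kernel(x, z):
        return np.dot(x, z)

    def polynomial_kernel(x, z, degree=3, coef0=1.0):
        return (np.dot(x, z) + coef0) ** degree

    def gaussian_kernel(x, z, gamma=0.5):
        # Corresponds to an inner product in an infinite-dimensional feature space.
        return np.exp(-gamma * np.sum((x - z) ** 2))

    x1 = np.array([1.0, 2.0])
    x2 = np.array([2.0, 0.5])
    print(linear_kernel(x1, x2), polynomial_kernel(x1, x2), gaussian_kernel(x1, x2))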

Kernel Trick Illustrated

Curse of Dimensionality Due to the Non-Linear Mapping

Positive Definite Symmetric Kernels (Mercer Condition)

Advantages of SVM
- Work very well...
- Error bounds are easy to obtain: generalization error is small and predictable
- Fool-proof method: (mostly) three kernels to choose from: Gaussian, linear and polynomial, sigmoid
- Very small number of parameters to optimize

Limitations of SVM
Size limitation: the size of the kernel matrix is quadratic in the number of training vectors.
Speed limitations:
1) During training: a very large quadratic programming problem must be solved numerically. Solutions:
   - Chunking
   - Sequential Minimal Optimization (SMO): breaks the QP problem into many small QP problems that are solved analytically
   - Hardware implementations
2) During testing: cost scales with the number of support vectors. Solution: online SVM.
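
The outline mentions a LibSVM demo; as a stand-in sketch, scikit-learn's SVC class (which wraps LibSVM) trains a Gaussian-kernel SVM in a few lines. The dataset (two interleaved half-moons) and the hyperparameter values are illustrative assumptions:

    # Sketch of a LibSVM-style demo via scikit-learn's SVC (a wrapper around LibSVM).
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian kernel
    clf.fit(X, y)

    print("training accuracy:", clf.score(X, y))
    print("support vectors per class:", clf.n_support_)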