Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1

Outline What is behind Support Vector Machines? Constrained Optimization Lagrange Duality Support Vector Machines in Detail Kernel Trick LibSVM demo 2

Binary Classification Problem Given: Training data generated according to the distribution D x 1, y 1,, x p, y p R n { 1,1} input label input space label space Problem: Find a classifier (a function) h x : R n { 1,1} such that it generalizes well on the test set obtained from the same distribution D Solution: Linear Approach: linear classifiers (e.g. Perceptron) Non Linear Approach: non-linear classifiers (e.g. Neural Networks, SVM) 3

Linearly Separable Data Assume that the training data is linearly separable x 2 x 1 4

Linearly Separable Data Assume that the training data is linearly separable x 2 w. x b=0 abscissa on axis parallel to w abscissa of origin 0 is b w x 1 Then the classifier is: h x = w. x b Inference: sign h x { 1,1} where w R n,b R 5

Linearly Separable Data x 2 Assume that the training data is linearly separable w. x b=0 w. x b=0 = 1 w Margin w = 1 w Maximize margin ρ (or 2ρ) so that: x 1 For the closest points: w. x b= 1 h x = w. x b { 1,1} w. x b=1 2 = 2 w 6

Optimization Problem A Constrained Optimization Problem min w 1 2 w 2 s.t.: y i w. x i b 1, i=1,, m label input Equivalent to maximizing the margin A convex optimization problem: Objective is convex Constraints are affine hence convex = 1 w 7

Optimization Problem Compare: min w 1 2 w 2 objective s.t.: y i w. x i b 1, i=1,, m constraints With: min w p i=1 y i w. x i b 2 w 2 energy/errors regularization 8

Optimization: Some Theory The problem: min f 0 x x s.t.: f i x 0, h i x =0, i=1,, m i=1,, p objective function inequality constraints equality constraints Solution of problem: x o Global (unique) optimum if the problem is convex Local optimum if the problem is not convex 9

Optimization: Some Theory Example: Standard Linear Program (LP) min c T x x s.t.: Ax=b x 0 Example: Least Squares Solution of Linear Equations (with L 2 norm regularization of the solution x) i.e. Ridge Regression min x T x x s.t.: Ax=b 10

Big Picture Constrained / unconstrained optimization Hierarchy of objective function: smooth = infinitely derivable convex = has a global optimum f 0 convex non-convex smooth non-smooth smooth non-smooth SVM NN 11

Toy Example: Equality Constraint Example 1: min x 1 x 2 f s.t.: x 2 1 x 2 2 2=0 h 1 x 2 h 1 h 1 f f = f x 1 f x 2 f 1, 1 f h 1 x 1 h 1 = h 1 x 1 h 1 x 2 At Optimal Solution: f x o = 1 o h 1 x o 12

Toy Example: Equality Constraint x is not an optimal solution, if there exists such that h 1 x s = 0 f x s f x Using first order Taylor's expansion s 0 h 1 x s = h 1 x h 1 x T s = h 1 x T s = 0 f x s f x = f x T s 0 1 2 Such an when s h 1 x are not parallel can exist only and f x h 1 x s f x 13

Toy Example: Equality Constraint Thus we have f x o = 1 o h 1 x o The Lagrangian L x, 1 = f x 1 h 1 x Lagrange multiplier or dual variable for h 1 Thus at the solution x L x o, 1 o = f x o 1 o h 1 x o = 0 This is just a necessary (not a sufficient) condition x solution implies h 1 x f x 14

Toy Example: Inequality Constraint Example 2: min x 1 x 2 f s.t.: 2 x 2 1 x 2 2 0 c 1 x 2 c 1 c 1 f = f x 1 f x 2 f c 1 1, 1 f f x 1 c 1 = c 1 x 1 c 1 x 2 15

Toy Example: Inequality Constraint x is not an optimal solution, if there exists such that c 1 x s 0 f x s f x Using first order Taylor's expansion s 0 f c 1 s x 2 x 1 c 1 x s = c 1 x c 1 x T s 0 1 s f f x s f x = f x T s 0 2 1, 1 16

Toy Example: Inequality Constraint Case 1: Inactive constraint Any sufficiently small s as long as f 1 x 0 c 1 x 0 f Thus s = f x where 0 s c 1 x 1 Case 2: Active constraint c 1 x = 0 c 1 x T s 0 1 f x T s 0 2 1, 1 s f f x = 1 c 1 x, where 1 0 s f x 17 c 1 x

Toy Example: Inequality Constraint Thus we have the Lagrangian (as before) L x, 1 = f x 1 c 1 x Lagrange multiplier or dual variable for c 1 The optimality conditions and x L x o, 1 o = f x o 1 o c 1 x o = 0 for some 1 0 1 o c 1 x o = 0 Complementarity condition either c 1 x o = 0 or 1 o = 0 (active) (inactive) 18

Same Concepts in a More General Setting 19

Lagrange Function The Problem min x f 0 x s.t.: f i x 0, h i x =0, i=1,, m i=1,, p objective function m inequality constraints p equality constraints Standard tool for constrained optimization: the Lagrange Function m L x,, = f 0 x i=1 p i f i x i=1 i h i x dual variables or Lagrange multipliers 20

Lagrange Dual Function, Defined, for as the minimum value of the Lagrange function over x m inequality constraints p equality constraints g : R m R p R g, =inf x D L x,, =inf x D f 0 x m i=1 p 1 f i x i=1 i h i x 21

Lagrange Dual Function Interpretation of Lagrange dual function: Writing the original problem as unconstrained problem but with hard indicators (penalties) minimize x where m f 0 x i=1 satisfied p I 0 f i x i=1 unsatisfied I 1 h i x I 0 u ={ 0 u 0 I u 0} 1 u ={ 0 u=0 u 0} satisfied unsatisfied indicator functions 22

Lagrange Dual Function Interpretation of Lagrange dual function: The Lagrange multipliers in Lagrange dual function can be seen as softer version of indicator (penalty) functions. minimize x m f 0 x i=1 p I 0 f i x i=1 I 1 h i x inf x D m f 0 x i=1 p i f i x i=1 i h i x 23

Lagrange Dual Function Lagrange dual function gives a lower bound on optimal value of the problem: g, p o Proof: Let x be a feasible optimal point and let 0. Then we have: f i x 0 i=1,, m h i x = 0 i=1,, p 24

Lagrange Dual Function Lagrange dual function gives a lower bound on optimal value of the problem: x g, = inf x D g, p o 0 Proof: Let be a feasible optimal point and let. Then we have: Thus Hence f i x 0 h i x = 0 m L x,, = f 0 x i=1 i=1,,m i=1,, p. 0 i f i x i=1 L x,, L x,, f 0 x p i h i x f 0 x 27

Sufficient Condition If x o, o, o is a saddle point, i.e. if x R n, 0, L x o,, L x o, o, o L x, o, o L x,,, o, o... then x o, o, o is a solution of p o x o x 28

Lagrange Dual Problem Lagrange dual function gives a lower bound on the optimal value of the problem. We seek the best lower bound to minimize the objective: maximize s.t.: 0 g, The dual optimal value and solution: d o = g o, o The Lagrange dual problem is convex even if the original problem is not. 29

Primal / Dual Problems Primal problem: min x D p o f 0 x s.t.: f i x 0, h i x =0, i=1,, m i=1,, p Dual problem: d o max, s.t.: g, g, =inf x D 0 m f 0 x i=1 p 1 f i x i=1 i h i x 30

Weak Duality Weak duality theorem: d o p o Optimal duality gap: p o d o 0 This bound is sometimes used to get an estimate on the optimal value of the original problem that is difficult to solve. 31

Strong Duality Strong Duality: d o = p o Strong duality does not hold in general. Slater's Condition: If x D f i x 0 h i x = 0 and it is strictly feasible: for i=1, m for i=1, p Strong Duality theorem: if Slater's condition holds and the problem is convex, then strong duality is attained: o, o with d o =g o, o =max, g, =inf x L x, o, o = p o 32

Optimality Conditions: First Order Complementary slackness: if strong duality holds, then at optimality x o, o, o i o f i x o = 0 i=1, m Proof: f 0 x o = g o, o = inf x m f 0 x i=1 m f 0 x o i=1 f 0 x o (strong duality) p o i f i x i=1 p o i f i x o i=1 i o h i x i o h i x o i, f i x 0, i 0 i, h i x = 0 33

Optimality Conditions: First Order Karush-Kuhn-Tucker (KKT) conditions If the strong duality holds, then at optimality: f i x o 0, h i x o = 0, i o 0, i o f i x o = 0, m f 0 x o i=1 i=1,,m i=1,, p i=1,,m i=1,, m p o i f i x o i=1 i o h i x o = 0 KKT conditions are necessary in general (local optimum) necessary and sufficient in case of convex problems (global optimum) 34

What are Support Vector Machines? Linear classifiers (Mostly) binary classifiers Supervised training Good generalization with explicit bounds 35

Main Ideas Behind Support Vector Machines Maximal margin Dual space Linear classifiers in high-dimensional space using non-linear mapping Kernel trick 36

Quadratic Programming 37

Using the Lagrangian 38

Dual Space 39

Strong Duality 40

Duality Gap 41

No Duality Gap Thanks to Convexity 42

Dual Form 43

Non-linear separation of datasets Non-linear separation is impossible in most problems Illustration from Prof. Mohri's lecture notes 44

Non-separable datasets Solutions: 1) Nonlinear classifiers 2) Increase dimensionality of dataset and add a non-linear mapping Ф 45

Kernel Trick similarity measure between 2 data samples 46

Kernel Trick Illustrated 47

Curse of Dimensionality Due to the Non-Linear Mapping 48

Positive Symmetric Definite Kernels (Mercer Condition) 49

Advantages of SVM Work very well... Error bounds easy to obtain: Generalization error small and predictable Fool-proof method: (Mostly) three kernels to choose from: Gaussian Linear and Polynomial Sigmoid Very small number of parameters to optimize 50

Limitations of SVM Size limitation: Size of kernel matrix is quadratic with the number of training vectors Speed limitations: 1) During training: very large quadratic programming problem solved numerically Solutions: Chunking Sequential Minimal Optimization (SMO) breaks QP problem into many small QP problems solved analytically Hardware implementations 2) During testing: number of support vectors Solution: Online SVM 51