Classification / Regression: Support Vector Machines
Jeff Howbert, Introduction to Machine Learning, Winter 2012
Topics
- SVM classifiers for linearly separable classes
- SVM classifiers for non-linearly separable classes
- SVM classifiers for nonlinear decision boundaries
  - kernel functions
- Other applications of SVMs
- Software
Linearly separable classes
Goal: find a linear decision boundary (hyperplane) that separates the classes
One possible solution
Another possible solution
Other possible solutions
Which one is better? B1 or B2? How do you define better?
Hyperplane that maximizes the margin will have better generalization => B1 is better than B2
[Figure: hyperplanes B1 and B2 with their margin boundaries (b11, b12 and b21, b22) and a test sample illustrating the wider margin of B1]
B1: w · x + b = 0
Margin boundaries (b11 and b12): w · x + b = +1 and w · x + b = −1

y_i = f(x_i) = +1 if w · x_i + b ≥ 1
              −1 if w · x_i + b ≤ −1

margin = 2 / ||w||
We want to maximize:
   margin = 2 / ||w||
Which is equivalent to minimizing:
   L(w) = ||w||² / 2
But subject to the following constraints:
   y_i = f(x_i) = +1 if w · x_i + b ≥ 1
                 −1 if w · x_i + b ≤ −1
This is a constrained convex optimization problem. Solve with numerical approaches, e.g. quadratic programming.
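The objective and constraints above can be checked directly in code. A minimal sketch, with a toy dataset and a candidate (w, b) that are assumed for illustration (not fitted by an optimizer):

```python
import numpy as np

# Toy linearly separable data (assumed): two classes in 2-D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A candidate separating hyperplane (assumed, not fitted here).
w = np.array([0.5, 0.5])
b = 0.0

# Objective to minimize: L(w) = ||w||^2 / 2
L = 0.5 * np.dot(w, w)

# Constraints: y_i (w . x_i + b) >= 1 for every training sample.
margins = y * (X @ w + b)
feasible = bool(np.all(margins >= 1.0))

print(L)                        # 0.25
print(feasible)                 # True
print(2.0 / np.linalg.norm(w))  # geometric margin 2/||w||
```

Any (w, b) satisfying all constraints is feasible; the optimizer's job is to pick the feasible one with the smallest L(w), i.e. the largest margin.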
Solving for w that gives maximum margin:
1. Combine objective function and constraints into a new objective function, using Lagrange multipliers λ_i:
   L_primal = (1/2) ||w||² − Σ_{i=1}^{N} λ_i ( y_i (w · x_i + b) − 1 )
2. To minimize this Lagrangian, we take derivatives with respect to w and b and set them to 0:
   ∂L_primal/∂w = 0  ⇒  w = Σ_{i=1}^{N} λ_i y_i x_i
   ∂L_primal/∂b = 0  ⇒  Σ_{i=1}^{N} λ_i y_i = 0
Solving for w that gives maximum margin:
3. Substituting and rearranging gives the dual of the Lagrangian:
   L_dual = Σ_{i=1}^{N} λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x_i · x_j)
which we try to maximize (not minimize).
4. Once we have the λ_i, we can substitute into previous equations to get w and b.
5. This defines w and b as linear combinations of the training data.
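The dual objective and the recovery of w from the multipliers can be sketched in a few lines of numpy. The data and multiplier values below are assumed for illustration; the λ chosen happens to satisfy the constraint Σ λ_i y_i = 0:

```python
import numpy as np

# Evaluate the dual objective
#   L_dual = sum_i lambda_i - 1/2 sum_{i,j} lambda_i lambda_j y_i y_j (x_i . x_j)
# for a toy dataset and a candidate set of multipliers (values assumed).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
lam = np.array([0.25, 0.25])  # satisfies sum_i lambda_i y_i = 0

K = X @ X.T                          # Gram matrix of dot products x_i . x_j
Q = (y[:, None] * y[None, :]) * K    # Q_ij = y_i y_j (x_i . x_j)
L_dual = lam.sum() - 0.5 * lam @ Q @ lam

# w recovered from the multipliers: w = sum_i lambda_i y_i x_i
w = (lam * y) @ X

print(L_dual)  # 0.25
print(w)       # [0.5 0.5]
```

Quadratic programming solvers maximize L_dual over λ ≥ 0 under the equality constraint; here the given λ is already the maximizer for this two-point problem.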
Optimizing the dual is easier.
- Function of λ_i only, not λ_i and w.
- Convex optimization: guaranteed to find global optimum.
Most of the λ_i go to zero.
- The x_i for which λ_i ≠ 0 are called the support vectors. These support (lie on) the margin boundaries.
- The x_i for which λ_i = 0 lie away from the margin boundaries. They are not required for defining the maximum margin hyperplane.
Example of solving for maximum margin hyperplane
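The slide's worked figure is not reproduced here, so the following is a minimal substitute example with assumed data: one support vector per class, for which the dual can be maximized in closed form. With two points and the constraint Σ λ_i y_i = 0, both multipliers are equal, and L_dual(λ) = 2λ − (λ²/2)·||x₊ − x₋||² peaks at λ = 2 / ||x₊ − x₋||²:

```python
import numpy as np

# Worked example with one support vector per class (data assumed).
x_pos = np.array([1.0, 1.0])    # class +1
x_neg = np.array([-1.0, -1.0])  # class -1

# Closed-form maximizer of the dual for this two-point problem.
d2 = np.sum((x_pos - x_neg) ** 2)
lam = 2.0 / d2

# w = sum_i lambda_i y_i x_i ;  b from w . x_pos + b = 1 at a support vector.
w = lam * (x_pos - x_neg)
b = 1.0 - np.dot(w, x_pos)

print(w)                        # [0.5 0.5]
print(b)                        # 0.0
print(2.0 / np.linalg.norm(w))  # margin = distance between the two points
```

As a sanity check, the margin 2/||w|| equals the distance between the two support vectors, exactly as geometry demands when each class contributes one point on its margin boundary.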
What if the classes are not linearly separable?
Now which one is better? B1 or B2? How do you define better?
What if the problem is not linearly separable?
Solution: introduce slack variables ξ_i
Need to minimize:
   L(w) = ||w||² / 2 + C Σ_{i=1}^{N} ξ_i^k
Subject to:
   y_i = f(x_i) = +1 if w · x_i + b ≥ 1 − ξ_i
                 −1 if w · x_i + b ≤ −1 + ξ_i
C is an important hyperparameter, whose value is usually optimized by cross-validation.
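For k = 1, the smallest slack satisfying the relaxed constraint is ξ_i = max(0, 1 − y_i(w · x_i + b)), which makes the objective computable directly. A sketch with assumed data and an assumed (w, b), where one point falls inside the margin:

```python
import numpy as np

# Soft-margin objective with k = 1:
#   L(w) = ||w||^2 / 2 + C * sum_i xi_i
# Data and (w, b) are assumed for illustration.
X = np.array([[2.0, 2.0], [-2.0, -2.0], [0.5, 0.5]])  # last point violates the margin
y = np.array([1.0, -1.0, 1.0])
w, b = np.array([0.5, 0.5]), 0.0
C = 1.0

# Minimal slack needed by each sample to satisfy its relaxed constraint.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
L = 0.5 * np.dot(w, w) + C * np.sum(xi)

print(xi)  # [0.  0.  0.5]
print(L)   # 0.75
```

Larger C penalizes slack more heavily (narrower margin, fewer violations); smaller C tolerates violations in exchange for a wider margin, which is why C is tuned by cross-validation.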
Slack variables for nonseparable data
What if decision boundary is not linear?
Solution: nonlinear transform of attributes
Φ: [x1, x2] → [x1, (x1 + x2)^4]
Solution: nonlinear transform of attributes
Φ: [x1, x2] → [(x1² − x1), (x2² − x2)]
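The idea behind both transforms can be demonstrated with a simpler illustrative mapping (not the slide's exact one): points inside vs. outside a circle are not linearly separable in [x1, x2], but under Φ: [x1, x2] → [x1², x2²] the circle x1² + x2² = r² becomes a straight line. Data and the separating plane below are assumed:

```python
import numpy as np

# Inside-vs-outside-a-circle data (assumed).
X = np.array([[0.5, 0.0], [0.0, 0.5],    # inside the unit circle, class +1
              [2.0, 0.0], [0.0, 2.0]])   # outside, class -1
y = np.array([1, 1, -1, -1])

# Nonlinear transform Phi: [x1, x2] -> [x1^2, x2^2]
Phi = X ** 2

# In the transformed space the plane z1 + z2 = 1.5 separates the classes,
# i.e. a linear boundary with w = (-1, -1), b = 1.5.
w, b = np.array([-1.0, -1.0]), 1.5
margins = y * (Phi @ w + b)
separable = bool(np.all(margins > 0))

print(separable)  # True
```

A linear SVM trained on Phi would therefore find a separating hyperplane, which maps back to a circular boundary in the original attribute space.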
Issues with finding useful nonlinear transforms
- Not feasible to do manually as number of attributes grows (i.e. any real-world problem)
- Usually involves transformation to higher-dimensional space
  - increases computational burden of SVM optimization
  - curse of dimensionality
With SVMs, can circumvent all the above via the kernel trick
Kernel trick
Support vector machines:
- Don't need to specify the attribute transform Φ(x)
- Only need to know how to calculate the dot product of any two transformed samples:
   k(x1, x2) = Φ(x1) · Φ(x2)
The kernel function k is substituted into the dual of the Lagrangian, allowing determination of a maximum margin hyperplane in the (implicitly) transformed space Φ(x).
All subsequent calculations, including predictions on test samples, are done using the kernel in place of computing Φ(x1) · Φ(x2) explicitly.
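In particular, the trained classifier predicts via f(x) = sign( Σ_i λ_i y_i k(x_i, x) + b ), summing only over support vectors and never forming Φ(x). A sketch where the support vectors, multipliers, and b are assumed rather than produced by a real training run:

```python
import numpy as np

# Gaussian (RBF) kernel; gamma is a hyperparameter (value assumed).
def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Assumed result of training: support vectors, labels, multipliers, bias.
sv_X = np.array([[1.0, 1.0], [-1.0, -1.0]])
sv_y = np.array([1.0, -1.0])
lam = np.array([0.6, 0.6])
b = 0.0

# Kernelized decision function: f(x) = sign( sum_i lambda_i y_i k(x_i, x) + b )
def predict(x):
    s = sum(l * yi * rbf(xi, x) for l, yi, xi in zip(lam, sv_y, sv_X))
    return np.sign(s + b)

print(predict(np.array([0.8, 0.9])))    # 1.0
print(predict(np.array([-1.2, -0.7])))  # -1.0
```

Note that only kernel evaluations against the support vectors appear at prediction time, which is what makes infinite-dimensional Φ (next slides) usable in practice.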
Common kernel functions for SVM
- linear:                   k(x1, x2) = x1 · x2
- polynomial:               k(x1, x2) = (γ x1 · x2 + c)^d
- Gaussian or radial basis: k(x1, x2) = exp( −γ ||x1 − x2||² )
- sigmoid:                  k(x1, x2) = tanh( γ x1 · x2 + c )
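The four kernels translate directly into plain functions on attribute vectors; the hyperparameter values below (γ, c, d) are arbitrary defaults chosen for illustration:

```python
import numpy as np

def linear(x1, x2):
    return np.dot(x1, x2)

def polynomial(x1, x2, gamma=1.0, c=1.0, d=3):
    return (gamma * np.dot(x1, x2) + c) ** d

def gaussian(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def sigmoid(x1, x2, gamma=0.1, c=0.0):
    return np.tanh(gamma * np.dot(x1, x2) + c)

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(linear(a, b))      # 4.0
print(polynomial(a, b))  # 125.0
print(gaussian(a, b))    # exp(-0.5 * 6.25) ≈ 0.0439
print(sigmoid(a, b))     # tanh(0.4) ≈ 0.3800
```

Any of these can be dropped into the dual in place of x_i · x_j; the Gaussian kernel is the one whose implicit Φ is infinite-dimensional, as the next slide notes.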
For some kernels (e.g. Gaussian) the implicit transform Φ(x) is infinite-dimensional! But calculations with the kernel are done in the original space, so computational burden and curse of dimensionality aren't a problem.
Applications of SVMs to machine learning
- Classification
  - binary
  - multiclass
  - one-class
- Regression
- Transduction (semi-supervised learning)
- Ranking
- Clustering
- Structured labels
Software
- SVMlight: http://svmlight.joachims.org/
- libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - includes MATLAB / Octave interface