Data Mining: Support Vector Machines
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar
02/03/2018

Support Vector Machines
- Find a linear hyperplane (decision boundary) that will separate the data.
Support Vector Machines
[Figure: B1, one possible solution.]
[Figure: B2, another possible solution.]
[Figure: other possible solutions.]
- Which one is better, B1 or B2? How do you define "better"?
Support Vector Machines
[Figure: B1 and B2 with their margins, bounded by b11, b12 and b21, b22.]
- Find the hyperplane that maximizes the margin => B1 is better than B2.

Support Vector Machines
- Decision boundary: $w \cdot x + b = 0$
- Margin hyperplanes: $w \cdot x + b = 1$ and $w \cdot x + b = -1$ (through b11 and b12)
- $f(x) = \begin{cases} 1 & \text{if } w \cdot x + b \ge 1 \\ -1 & \text{if } w \cdot x + b \le -1 \end{cases}$
- $\text{Margin} = \dfrac{2}{\|w\|}$
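The margin formula and decision rule above can be sketched in a few lines of code. This is a minimal illustration with a made-up weight vector and bias (not values from the slides):

```python
import math

def margin(w):
    """Width of the SVM margin, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def decide(w, b, x):
    """Decision rule: +1 if w.x + b >= 1, -1 if w.x + b <= -1,
    and 0 for points falling inside the margin band."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    if s >= 1:
        return 1
    if s <= -1:
        return -1
    return 0  # inside the margin

# Hypothetical boundary: w = (3, 4), b = -5, so ||w|| = 5
w, b = (3.0, 4.0), -5.0
print(margin(w))                 # 2 / 5 = 0.4
print(decide(w, b, (2.0, 1.0)))  # 3*2 + 4*1 - 5 = 5 >= 1 -> +1
```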
Linear SVM
- Linear model: $f(x) = \begin{cases} 1 & \text{if } w \cdot x + b \ge 1 \\ -1 & \text{if } w \cdot x + b \le -1 \end{cases}$
- Learning the model is equivalent to determining the values of $w$ and $b$.
- How to find $w$ and $b$ from the training data?

Learning Linear SVM
- Objective is to maximize: $\text{Margin} = \dfrac{2}{\|w\|}$
- Which is equivalent to minimizing: $L(w) = \dfrac{\|w\|^2}{2}$
- Subject to the constraints: $y_i = 1$ if $w \cdot x_i + b \ge 1$ and $y_i = -1$ if $w \cdot x_i + b \le -1$, or equivalently $y_i (w \cdot x_i + b) \ge 1, \; i = 1, 2, \ldots, N$
- This is a constrained optimization problem; solve it using the Lagrange multiplier method.
Example of Linear SVM

  x1      x2      y    λ
  0.3858  0.4687   1   65.5261   <- support vector
  0.4871  0.6110  -1   65.5261   <- support vector
  0.9218  0.4103  -1   0
  0.7382  0.8936  -1   0
  0.1763  0.0579   1   0
  0.4057  0.3529   1   0
  0.9355  0.8132  -1   0
  0.2146  0.0099   1   0

Learning Linear SVM
- The decision boundary depends only on the support vectors: given a data set with the same support vectors, the decision boundary will not change.
- How to classify using SVM once $w$ and $b$ are found? Given a test record $x_i$:
  $f(x_i) = \begin{cases} 1 & \text{if } w \cdot x_i + b \ge 1 \\ -1 & \text{if } w \cdot x_i + b \le -1 \end{cases}$
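The two rows with nonzero λ are the support vectors, and the boundary can be recovered from them via the standard dual-to-primal relations $w = \sum_i \lambda_i y_i x_i$ and, for any support vector $x_s$, $b = y_s - w \cdot x_s$. A sketch using the table's values (the rounded λ's mean the check is only approximate):

```python
# Support vectors from the table: (x, y, lambda)
svs = [((0.3858, 0.4687),  1, 65.5261),
       ((0.4871, 0.6110), -1, 65.5261)]

# Primal weights from the dual solution: w = sum_i lam_i * y_i * x_i
w = [sum(lam * y * x[j] for x, y, lam in svs) for j in range(2)]

# b from a support vector x_s: y_s * (w . x_s + b) = 1  =>  b = y_s - w . x_s
x_s, y_s, _ = svs[0]
b = y_s - sum(wj * xj for wj, xj in zip(w, x_s))

def classify(x):
    """Sign of w . x + b gives the predicted class."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

print([round(v, 3) for v in w], round(b, 3))  # roughly [-6.638, -9.324] 7.931
print(classify((0.1763, 0.0579)))             # a class +1 point from the table
```

The remaining six points (λ = 0) play no role: dropping them leaves `w` and `b` unchanged, which is exactly the slide's claim that the boundary depends only on the support vectors.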
Support Vector Machines
- What if the problem is not linearly separable?

Support Vector Machines
- What if the problem is not linearly separable? Introduce slack variables.
- Need to minimize: $L(w) = \dfrac{\|w\|^2}{2} + C \sum_{i=1}^{N} \xi_i^k$
- Subject to: $w \cdot x_i + b \ge 1 - \xi_i$ if $y_i = 1$, and $w \cdot x_i + b \le -1 + \xi_i$ if $y_i = -1$
- If k is 1 or 2, this leads to the same objective function as linear SVM but with different constraints (see textbook).
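For k = 1, the optimal slack for each point is $\xi_i = \max(0,\, 1 - y_i(w \cdot x_i + b))$ (the hinge loss), so the soft-margin objective can be evaluated directly. A sketch with hypothetical data and C, not values from the slides:

```python
def soft_margin_objective(w, b, data, C):
    """L(w) = ||w||^2 / 2 + C * sum_i xi_i, with k = 1 and the optimal
    slack xi_i = max(0, 1 - y_i * (w . x_i + b)) -- the hinge loss."""
    reg = 0.5 * sum(wi * wi for wi in w)
    slack = sum(max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                for x, y in data)
    return reg + C * slack

# Hypothetical data: the last point sits on the wrong side of its margin
data = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), -1)]
print(soft_margin_objective((1.0, 0.0), 0.0, data, C=10.0))  # 0.5 + 10 * 1.5 = 15.5
```

Larger C penalizes margin violations more heavily, trading margin width against training errors.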
Support Vector Machines
[Figure: B1 and B2 with their margins, bounded by b11, b12 and b21, b22.]
- Find the hyperplane that optimizes both factors: a wide margin and few margin violations.

Nonlinear Support Vector Machines
- What if the decision boundary is not linear?
Nonlinear Support Vector Machines
- Trick: transform the data into a higher-dimensional space.
- Decision boundary: $w \cdot \Phi(x) + b = 0$

Learning Nonlinear SVM
- Optimization problem: same form as the linear case, which leads to the same set of equations (but involving $\Phi(x)$ instead of $x$).
Learning Nonlinear SVM
- Issues:
  - What type of mapping function $\Phi$ should be used?
  - How to do the computation in high-dimensional space? Most computations involve the dot product $\Phi(x_i) \cdot \Phi(x_j)$ -- curse of dimensionality?

Learning Nonlinear SVM
- Kernel Trick: $\Phi(x_i) \cdot \Phi(x_j) = K(x_i, x_j)$
- $K(x_i, x_j)$ is a kernel function (expressed in terms of the coordinates in the original space).
- Examples: [kernel formulas shown on slide]
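A concrete instance of the kernel trick: on 2-D inputs, the degree-2 polynomial kernel $K(x, y) = (x \cdot y + 1)^2$ corresponds to the explicit map $\Phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)$ (one standard choice). The sketch below verifies that evaluating K in the original 2-D space gives the same number as the dot product in the 6-D feature space:

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def K(x, y):
    """Degree-2 polynomial kernel, evaluated in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1] + 1.0) ** 2

x, y = (0.3, -1.2), (2.0, 0.5)
lhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # dot product in feature space
print(lhs, K(x, y))  # equal up to float rounding
```

This is why the mapping never has to be carried out explicitly: K does the 6-D computation at 2-D cost.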
Example of Nonlinear SVM
[Figure: SVM with polynomial degree-2 kernel.]

Learning Nonlinear SVM
- Advantages of using a kernel:
  - Don't have to know the mapping function $\Phi$.
  - Computing the dot product $\Phi(x_i) \cdot \Phi(x_j)$ in the original space avoids the curse of dimensionality.
- Not all functions can be kernels:
  - Must make sure there is a corresponding $\Phi$ in some high-dimensional space.
  - Mercer's theorem (see textbook).
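Mercer's condition implies that the kernel (Gram) matrix built on any finite sample must be symmetric positive semidefinite. A minimal sketch for two points, using the fact that a symmetric 2x2 matrix is PSD iff its trace and determinant are both nonnegative (the functions and sample points here are illustrative, not from the slides):

```python
def gram_2x2(k, x1, x2):
    """Kernel (Gram) matrix of k on a two-point sample."""
    return [[k(x1, x1), k(x1, x2)],
            [k(x2, x1), k(x2, x2)]]

def is_psd_2x2(m):
    """A symmetric 2x2 matrix is PSD iff trace >= 0 and det >= 0."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return m[0][1] == m[1][0] and trace >= 0 and det >= 0

dot = lambda x, y: x[0] * y[0] + x[1] * y[1]
poly2 = lambda x, y: (dot(x, y) + 1.0) ** 2   # a valid kernel
not_kernel = lambda x, y: -dot(x, y)          # no corresponding Phi exists

x1, x2 = (1.0, 0.0), (0.0, 2.0)
print(is_psd_2x2(gram_2x2(poly2, x1, x2)))       # True
print(is_psd_2x2(gram_2x2(not_kernel, x1, x2)))  # False: Gram matrix is indefinite
```

Failing this check on even one sample is enough to show a function is not a kernel; passing it on samples is only evidence, since Mercer's theorem quantifies over all finite samples.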
Characteristics of SVM
- Since the learning problem is formulated as a convex optimization problem, efficient algorithms are available to find the global minimum of the objective function (many other methods use greedy approaches and find only locally optimal solutions).
- Overfitting is addressed by maximizing the margin of the decision boundary, but the user still needs to choose the type of kernel function and the cost function.
- Difficult to handle missing values.
- Robust to noise.
- High computational complexity for building the model.