UVA CS 4501: Machine Learning. Lecture 11b: Support Vector Machine (nonlinear) Kernel Trick and in Practice. Dr. Yanjun Qi. University of Virginia
1 UVA CS 4501: Machine Learning Lecture 11b: Support Vector Machine (nonlinear) Kernel Trick and in Practice Dr. Yanjun Qi University of Virginia Department of Computer Science
2 Where are we? è Five major sections of this course q Regression (supervised) q Classification (supervised) q Unsupervised models q Learning theory q Graphical models 10/22/18 Dr. Yanjun Qi / UVA CS 2
3 Today q Support Vector Machine (SVM) ü History of SVM ü Large Margin Linear Classifier ü Define Margin (M) in terms of model parameters ü Optimization to learn model parameters (w, b) ü Non-linearly separable case ü Optimization with dual form ü Nonlinear decision boundary ü Practical Guide 10/22/18 Dr. Yanjun Qi / UVA CS 3
4 Classifying in 1-d Can an SVM correctly classify this data? What about this? (figure: points along the x axis) 10/22/18 Dr. Yanjun Qi / UVA CS 4
5 Classifying in 1-d Can an SVM correctly classify this data? And now? (extend with the polynomial basis function x²; figure axes: x and x²) 10/22/18 Dr. Yanjun Qi / UVA CS 5
6 RECAP: Polynomial regression Introduce basis functions 10/22/18 Dr. Yanjun Qi / UVA CS 6 Dr. Nando de Freitas's tutorial slide
7 Non-linear SVMs: 2D The original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable: x = (x_1, x_2), φ(x) = (x_1², x_2², √2 x_1 x_2), i.e. Φ: x → φ(x) (figure axes: x_1, x_2 in the input space; x_1², x_2², √2 x_1 x_2 in the feature space) This slide is courtesy of 7
8 10/22/18 Dr. Yanjun Qi / UVA CS Credit: Stanford ML course 8
9 Dual SVM for linearly separable case: Training / Testing
Our dual target function:
max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
subject to Σ_i α_i y_i = 0 and α_i ≥ 0 for all i
(a dot product between all pairs of training samples!)
To evaluate a new sample x_ts we need to compute:
w^T x_ts + b = Σ_i α_i y_i x_i^T x_ts + b
ŷ_ts = sign( Σ_{i ∈ SupportVectors} α_i y_i x_i^T x_ts + b )
(a dot product with all(?) training samples; in fact only with the support vectors, since α_i = 0 for the rest)
10/22/18 9
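As a sketch of the evaluation rule above, assuming the dual weights α_i have already been found by a QP solver (the data and alpha values below are invented for illustration):

import numpy as np

# Toy training set with hypothetical solved dual variables; in practice
# the alpha_i come out of the QP above, these values are illustrative.
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])
y_train = np.array([1, 1, -1])
alpha = np.array([0.5, 0.0, 0.5])  # alpha_i = 0 for non-support vectors
b = -1.0

def predict(x_ts):
    # w^T x_ts + b = sum_i alpha_i y_i (x_i^T x_ts) + b; only the
    # support vectors (alpha_i > 0) contribute to the sum.
    return np.sign(np.sum(alpha * y_train * (X_train @ x_ts)) + b)

print(predict(np.array([2.5, 2.5])))  # -1.0 for this toy setup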
10 Training
Original (linear):
max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
subject to Σ_i α_i y_i = 0 and C ≥ α_i ≥ 0 for all i
With basis functions:
max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j Φ(x_i)^T Φ(x_j)
subject to Σ_i α_i y_i = 0 and C ≥ α_i ≥ 0 for all i
10/22/18 Dr. Yanjun Qi / UVA CS 10
11 Testing
w^T x_ts + b = Σ_i α_i y_i x_i^T x_ts + b
ŷ_ts = sign( Σ_{i ∈ SupportVectors} α_i y_i x_i^T x_ts + b )
10/22/18 Dr. Yanjun Qi / UVA CS 11
12 A little bit of theory: Vapnik-Chervonenkis (VC) dimension If data is mapped into a sufficiently high dimension, then samples will in general be linearly separable; N data points are in general separable in a space of N−1 dimensions or more!!! The VC dimension of the set of oriented lines in R² is 3. It can be shown that the VC dimension of the family of oriented separating hyperplanes in R^N is at least N+1. 10/22/18 Dr. Yanjun Qi / UVA CS 12
13 If data is mapped into a sufficiently high dimension, then samples will in general be linearly separable; N data points are in general separable in a space of N−1 dimensions or more!!! Is this too much computational work? Possible problems: High computation burden due to high dimensionality; Many more parameters to estimate (figure: input space mapped by Φ(.) into feature space) 10/22/18 Dr. Yanjun Qi / UVA CS 13
14 (figure: input space mapped by Φ(.) into feature space) SVM solves these two issues simultaneously: Kernel tricks for efficient computation; Dual formulation only assigns parameters to samples, not to features 10/22/18 Dr. Yanjun Qi / UVA CS 14
15 SVM solves these two issues simultaneously: Kernel tricks for efficient computation; Dual formulation only assigns parameters to samples, not to features. (1). Kernel tricks for efficient computation: Never represent features explicitly; Compute dot products in closed form. Very interesting theory (Reproducing Kernel Hilbert Spaces), not covered in detail here 10/22/18 Dr. Yanjun Qi / UVA CS 15
16 Linear kernel (we've seen it): K(x,z) = x^T z
Polynomial kernel (we will see an example): K(x,z) = (1 + x^T z)^d where d = 2, 3, … To get the feature vectors we concatenate all dth-order polynomial terms of the components of x (weighted appropriately).
Radial basis kernel: K(x,z) = exp(−r ‖x − z‖²). In this case, r is a hyperparameter. The feature space of the RBF kernel has an infinite number of dimensions.
Never represent features explicitly; compute dot products with a closed form. Very interesting theory (Reproducing Kernel Hilbert Spaces), not covered in detail here. 10/22/18 Dr. Yanjun Qi / UVA CS
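A minimal sketch of these three kernel functions in Python (numpy assumed; the argument names and the RBF hyperparameter r follow the notation above):

import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, d=2):
    # (1 + x^T z)^d, concatenating all dth-order terms implicitly
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, r=0.5):
    # exp(-r * ||x - z||^2); the implicit feature space is infinite-dimensional
    return np.exp(-r * np.sum((x - z) ** 2))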
17 Example: Quadratic kernels
K(x,z) = (1 + x^T z)^d with K(x,z) := Φ(x)^T Φ(z)
Consider all quadratic terms for x_1, x_2, …, x_p:
Φ(x) = ( 1, √2 x_1, …, √2 x_p, x_1², …, x_p², √2 x_1 x_2, …, √2 x_{p−1} x_p )
These feed into the dual:
max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j Φ(x_i)^T Φ(x_j)
subject to Σ_i α_i y_i = 0 and α_i ≥ 0 for all i
10/22/18 Dr. Yanjun Qi / UVA CS 17
18 K(x,z) = (1 + x^T z)^d 10/22/18 Dr. Yanjun Qi / UVA CS 18
19 The kernel trick
Computing Φ(x)^T Φ(z) through the explicit basis-function representation costs O(p^d · n²) operations when building a poly-kernel matrix over n training samples. So, if we define the kernel function as follows, there is no need to carry out the basis functions explicitly:
K(x,z) = (1 + x^T z)^d
Building the poly-kernel matrix directly through the K(x,z) function among n training samples costs only O(p · n²) operations. è
max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
subject to Σ_i α_i y_i = 0 and C ≥ α_i ≥ 0 for all i in train
This is because x^T z gives a scalar, and raising it to the power d only costs constant FLOPS.
10/22/18 Dr. Yanjun Qi / UVA CS 19
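To see the trick concretely, here is a numerical check (a sketch; phi below is the explicit quadratic map from the previous slide, with d = 2) that (1 + x^T z)² equals Φ(x)^T Φ(z) while costing only O(p) instead of O(p²) work per pair:

import numpy as np

def phi(x):
    # Explicit quadratic basis: 1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j (i < j)
    p = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * x[i] for i in range(p)]
    feats += [x[i] ** 2 for i in range(p)]
    feats += [np.sqrt(2) * x[i] * x[j] for i in range(p) for j in range(i + 1, p)]
    return np.array(feats)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print((1 + x @ z) ** 2)  # kernel value, computed in the original space
print(phi(x) @ phi(z))   # identical value via the explicit basis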
20 Summary: Due to the Kernel Trick
Change all inner products to kernel functions. For training:
Original (linear):
max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
subject to Σ_i α_i y_i = 0 and C ≥ α_i ≥ 0 for all i in train
With kernel function (nonlinear):
max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
subject to Σ_i α_i y_i = 0 and C ≥ α_i ≥ 0 for all i in train
10/22/18 20
21 Summary: Due to the Kernel Trick
For the new data x_ts:
Original (linear): ŷ_ts = sign( Σ_{i ∈ SupportVectors} α_i y_i x_i^T x_ts + b )
With kernel function (nonlinear): ŷ_ts = sign( Σ_{i ∈ SupportVectors} α_i y_i K(x_i, x_ts) + b )
10/22/18 Dr. Yanjun Qi / UVA CS 21
22 Kernel Trick: Implicit Basis Functions
For some kernels (e.g. RBF) the implicit transform basis φ(x) is infinite-dimensional! But calculations with the kernel are done in the original space, so the computational burden and the curse of dimensionality aren't a problem.
K(x,z) = exp(−r ‖x − z‖²)
O(p · n²) operations to build an RBF-kernel matrix for training è the Gaussian RBF kernel corresponds to an infinite-dimensional vector space. YouTube video of Caltech's Abu-Mostafa explaining this in more detail: https://www.youtube.com/watch?v=xuj5jbqihlu&t=25m53s
10/22/18 Dr. Yanjun Qi / UVA CS 22
23 Time Cost Comparisons 10/22/18 Dr. Yanjun Qi / UVA CS 23
24 Kernel Functions (Extra) In using an SVM, only the kernel (and not the basis functions) is specified. The kernel can be thought of as a similarity measure between the input objects. Not every similarity measure can be used as a kernel function, however: Mercer's condition states that any positive semidefinite kernel K(x, y) can be expressed as a dot product in a high-dimensional space. 10/22/18 Dr. Yanjun Qi / UVA CS 24
25 Kernel Matrix The kernel function creates the kernel matrix, which summarizes all the (training) data 10/22/18 Dr. Yanjun Qi / UVA CS 25
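A sketch of building the training kernel matrix and numerically checking Mercer's positive-semidefiniteness condition (rbf_kernel and the toy data are assumptions carried over from the sketches above):

import numpy as np

def rbf_kernel(x, z, r=0.5):
    return np.exp(-r * np.sum((x - z) ** 2))

X = np.random.randn(20, 3)  # 20 toy training samples, 3 features
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# Mercer: a valid kernel matrix is symmetric positive semidefinite,
# so all eigenvalues should be >= 0 (up to floating-point error).
print(np.linalg.eigvalsh(K).min() >= -1e-10)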
26 Choosing the Kernel Probably the most tricky part of using SVM. The kernel is important because it creates the kernel matrix, which summarizes all the data. Many principled kernels have been proposed (diffusion kernel, Fisher kernel, string kernel, tree kernel, graph kernel, …). The kernel trick has allowed non-traditional data like strings and trees to be used as input to SVM, instead of feature vectors. In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for most applications. 10/22/18 Dr. Yanjun Qi / UVA CS 26
27 The kernel trick has allowed data like strings and trees to be used as input to SVM, instead of feature vectors. Vector data vs. structured data, e.g. graphs, sequences, 3D structures, … Original Space → Feature Space 10/22/18 Dr. Yanjun Qi / UVA CS 27
28 Mercer Kernel vs. Smoothing Kernel (Extra) The kernels used in Support Vector Machines are different from the kernels used in Local Weighted / Kernel Regression. We can think of Support Vector Machine kernels as Mercer kernels, and Local Weighted / Kernel Regression's kernels as smoothing kernels. 10/22/18 Dr. Yanjun Qi / UVA CS 28
29 Why do SVMs work? q If we are using huge feature spaces (e.g., with kernels), how come we are not overfitting the data? ü The number of parameters remains the same (and most are set to 0) ü While we have a lot of inputs, in the end we only care about the support vectors, and these are usually a small group of samples ü Maximizing the margin acts as a sort of regularization term, leading to reduced overfitting 10/22/18 Dr. Yanjun Qi / UVA CS 29
31 Today q Support Vector Machine (SVM) ü History of SVM ü Large Margin Linear Classifier ü Define Margin (M) in terms of model parameters ü Optimization to learn model parameters (w, b) ü Non-linearly separable case ü Optimization with dual form ü Nonlinear decision boundary ü Practical Guide 10/22/18 Dr. Yanjun Qi / UVA CS 31
32 Software A list of SVM implementations can be found online. Some implementations (such as LIBSVM) can handle multi-class classification. SVMLight is among the earliest implementations of SVM. Several Matlab toolboxes for SVM are also available. 10/22/18 Dr. Yanjun Qi / UVA CS 32
33 Summary: Steps for Using SVM in HW Prepare the feature-data matrix. Select the kernel to use. Select the parameters of the kernel and the value of C; you can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters. Execute the training algorithm and obtain the α_i. Unseen data can be classified using the α_i and the support vectors; see the sketch below. 10/22/18 Dr. Yanjun Qi / UVA CS 33
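These steps map directly onto common SVM packages. A hedged end-to-end sketch using scikit-learn's SVC (which wraps LIBSVM); the toy data, the kernel choice, and the parameter values are placeholders that a validation set should decide:

import numpy as np
from sklearn.svm import SVC

# (1) prepare the feature-data matrix and labels (toy data for illustration)
X_train = np.random.randn(100, 5)
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, 1, -1)

# (2)-(3) select the kernel, its parameters, and the value of C
clf = SVC(kernel='rbf', gamma=0.5, C=1.0)

# (4) training yields the alpha_i (exposed as dual_coef_ = alpha_i * y_i)
clf.fit(X_train, y_train)
print(clf.support_vectors_.shape)  # the support vectors kept by the model

# (5) classify unseen data using the alpha_i and the support vectors
X_test = np.random.randn(10, 5)
print(clf.predict(X_test))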
34 Practical Guide to SVM From the authors of LIBSVM: A Practical Guide to Support Vector Classification, Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, http:// guide.pdf 10/22/18 Dr. Yanjun Qi / UVA CS 34
35 LIBSVM http:// ü Developed by Chih-Jen Lin et al. ü Tools for Support Vector classification ü Also supports multi-class classification ü C++/Java/Python/Matlab/Perl wrappers ü Linux/UNIX/Windows ü SMO implementation, fast!!! A Practical Guide to Support Vector Classification 10/22/18 Dr. Yanjun Qi / UVA CS 35
36 (a) Data file formats for LIBSVM Training.dat stores one sample per line in a sparse format: the label (+1 or −1) comes first, followed by index:value pairs for the non-zero features; an example follows. 10/22/18 Dr. Yanjun Qi / UVA CS 36
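A hypothetical example of this sparse format (all feature values invented for illustration):

+1 1:0.708 2:1 3:1 4:-0.320
-1 1:0.583 3:-1 4:0.333
+1 1:0.166 2:1 3:-0.333 4:1
-1 2:1 3:1 4:-0.6

Indices that are absent on a line (e.g. index 2 on the second sample) are implicitly zero.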
37 (b) Feature Preprocessing (1) Categorical Features Recommend using m numbers to represent an m-category attribute. Only one of the m numbers is one, and the others are zero. For example, a three-category attribute such as {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0). 10/22/18 Dr. Yanjun Qi / UVA CS 37 A Practical Guide to Support Vector Classification
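A minimal sketch of this one-indicator-per-category encoding in Python:

# One indicator feature per category; exactly one is 1, the rest are 0.
categories = ['red', 'green', 'blue']

def one_hot(value):
    return [1 if value == c else 0 for c in categories]

print(one_hot('red'))   # [1, 0, 0]
print(one_hot('blue'))  # [0, 0, 1]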
38 Feature Preprocessing (2) Scaling before applying SVM is very important: to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges, and to avoid numerical difficulties during the calculation. Recommend linearly scaling each attribute to the range [−1, +1] or [0, 1]. 10/22/18 Dr. Yanjun Qi / UVA CS 38 A Practical Guide to Support Vector Classification
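A sketch of the recommended linear scaling to [−1, +1]; note that the minima and maxima must be computed on the training set and then reused unchanged on the test set:

import numpy as np

def fit_scaler(X_train):
    # Per-attribute min and max, learned from training data only
    return X_train.min(axis=0), X_train.max(axis=0)

def scale(X, lo, hi):
    # Linearly map each attribute from [lo, hi] to [-1, +1]
    return 2.0 * (X - lo) / (hi - lo) - 1.0

X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
lo, hi = fit_scaler(X_train)
print(scale(X_train, lo, hi))  # every column now spans [-1, +1]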
40 Feature Preprocessing 10/22/18 Dr. Yanjun Qi / UVA CS 40 A Practical Guide to Support Vector Classification
41 Feature Preprocessing (3) Missing values Very, very tricky! Easy way: impute the missing values with the mean value of the variable. A little bit harder: imputation using nearest neighbors. Even more complex: e.g. EM-based methods (beyond the scope). 10/22/18 Dr. Yanjun Qi / UVA CS 41 A Practical Guide to Support Vector Classification
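The easy (mean-imputation) way above, sketched with numpy, where NaN marks a missing entry:

import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
col_means = np.nanmean(X, axis=0)               # per-column mean, ignoring NaN
X_filled = np.where(np.isnan(X), col_means, X)  # replace NaN by the column mean
print(X_filled)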
42 Feature Preprocessing (4) The out-of-token issue For a discrete feature variable, this is very tricky to handle. Easy way: replace the unseen values with the most likely value (in train) of the variable. Easy way: replace the unseen values with a random value (in train) of the variable. More solutions later in the Naive Bayes slides! 10/22/18 Dr. Yanjun Qi / UVA CS 42
43 (c) Model Selection 10/22/18 Dr. Yanjun Qi / UVA CS 43
44 Model selection: find the right C Select the right penalty parameter C. 10/22/18 Dr. Yanjun Qi / UVA 44
45 (c) Model Selection Three parameters for a polynomial kernel (in LIBSVM's form K(x,z) = (γ x^T z + r)^d: the degree d, the scale γ, and the offset r). A Practical Guide to Support Vector Classification 10/22/18 Dr. Yanjun Qi / UVA CS 45
46 (d) Pipeline Procedures (1) train / test (2) k-fold cross validation (3) k-fold CV on train to choose hyperparameters, then test 10/22/18 Dr. Yanjun Qi / UVA CS 46
47 Evaluation Choice-I: Train and Test The training dataset consists of input-output pairs. Evaluation: compute f(x_test), then measure the loss on the pair (f(x_test), y_test). 10/22/18 Dr. Yanjun Qi / UVA CS 47
48 Evaluation Choice-II: Cross Validation Problem: we don't have enough data to set aside a test set. Solution: each data point is used both as train and as test. Common types: K-fold cross-validation (e.g. K=5, K=10); 2-fold cross-validation; Leave-one-out cross-validation (LOOCV). A good practice is to randomly shuffle all training samples before splitting; a sketch follows. 10/22/18 Dr. Yanjun Qi / UVA CS 48
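A sketch of pipeline choice (3): k-fold cross-validation on the training split to pick hyperparameters, then one final test, using scikit-learn's GridSearchCV; the grid values are illustrative, not recommendations:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X = np.random.randn(200, 5)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
# train_test_split shuffles by default, matching the advice above
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25, random_state=0)

# 5-fold CV on the training split only; the test split stays untouched
grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), grid, cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_)
print(search.score(X_ts, y_ts))  # final held-out accuracy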
49 Evaluation Choice-III: For HW2-Q2 A Practical Guide to Support Vector Classification 10/22/18 Dr. Yanjun Qi / UVA CS 49
50 A Practical Guide to Support Vector Classification 10/22/18 Dr. Yanjun Qi / UVA CS 50
51 Today: Review & Practical Guide q Support Vector Machine (SVM) ü History of SVM ü Large Margin Linear Classifier ü Define Margin (M) in terms of model parameters ü Optimization to learn model parameters (w, b) ü Non-linearly separable case ü Optimization with dual form ü Nonlinear decision boundary ü Practical Guide ü File format / LIBSVM ü Feature preprocessing ü Model selection ü Pipeline procedure 10/22/18 Dr. Yanjun Qi / UVA CS 51
52 Support Vector Machine
Task: classification
Representation: kernel function, via the kernel trick K(x_i, x_j) = Φ(x_i)^T Φ(x_j)
Score function: margin + hinge loss (optional):
argmin_{w,b} Σ_{j=1}^p w_j² + C Σ_{i=1}^n ε_i
subject to ∀x_i ∈ D_train: y_i (x_i^T w + b) ≥ 1 − ε_i
Search/Optimization: QP with the dual form:
max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
subject to Σ_i α_i y_i = 0 and C ≥ α_i ≥ 0 for all i
Models, Parameters: dual weights α_i, with w = Σ_i α_i x_i y_i
10/22/18 Dr. Yanjun Qi / UVA CS 52
53 Why SVM Works? (Extra) Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of a classifier. Vapnik argues that the flexibility of a classifier should not be characterized by the number of parameters, but by the capacity of the classifier. This is formalized by the VC-dimension of a classifier. The SVM objective can also be justified by structural risk minimization: the empirical risk (training error), plus a term related to the generalization ability of the classifier, is minimized. Another view: the SVM loss function is analogous to ridge regression; the term ½ ‖w‖² shrinks the parameters towards zero to avoid overfitting. 10/22/18 Dr. Yanjun Qi / UVA CS 53
54 References Big thanks to Prof. Ziv Bar-Joseph and Prof. Eric Xing of CMU for allowing me to reuse some of their slides. The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman. Prof. Andrew Moore of CMU's slides. Tutorial slides from Dr. Tie-Yan Liu, MSR Asia. A Practical Guide to Support Vector Classification, Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. Tutorial slides from Stanford Convex Optimization I, Boyd & Vandenberghe. 10/22/18 Dr. Yanjun Qi / UVA CS 54