10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers




2 Linear Classifiers
f(x, w, b) = sign(w · x + b)
Points with w · x + b > 0 are assigned to class +1; points with w · x + b < 0 are assigned to class -1.
[Figure: scatter plot; one marker denotes +1, the other denotes -1.]
How would you classify this data?
Linear Classifiers
f(x, w, b) = sign(w · x + b)
Any of these would be fine... but which is best?
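The decision rule f(x, w, b) = sign(w · x + b) can be sketched in a few lines of Python. The weights, bias and points below are hypothetical values chosen for illustration, and sending ties (w · x + b = 0) to class -1 is an assumption the slides do not address.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def classify(x, w, b):
    """Return +1 if w . x + b > 0, otherwise -1 (ties go to -1 by assumption)."""
    return 1 if dot(w, x) + b > 0 else -1

# Hypothetical 2-D example: the separator is the line x1 + x2 = 1.
w, b = [1.0, 1.0], -1.0
predicted = [classify(p, w, b) for p in [(2.0, 2.0), (0.0, 0.0)]]
print(predicted)
```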

3 Linear Classifiers
f(x, w, b) = sign(w · x + b)
[Figure: a candidate separator that leaves a point misclassified to the +1 class.]
How would you classify this data?
Linear Classifiers: summary
- Many common text classifiers are linear classifiers.
- Despite this similarity, there are large performance differences between them.
- For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
- What should be done for non-separable problems?

4 Separation by Hyperplanes
- Assume linear separability for now: in 2 dimensions the classes can be separated by a line; in higher dimensions, hyperplanes are needed.
- A separating hyperplane can be found by linear programming, or iteratively, e.g. with the perceptron.
- In 2 dimensions the separator can be expressed as ax + by = c.
Which Hyperplane?
- In general, there are lots of possible solutions for a, b, c.
- A Support Vector Machine (SVM) finds an optimal solution.
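As a rough illustration of finding some separating hyperplane (not the optimal one), here is a minimal perceptron sketch. The training points, the epoch limit and the function name `perceptron` are assumptions made for this example; the algorithm only terminates early when the data are linearly separable.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def perceptron(points, labels, epochs=100):
    """Return (w, b) after perceptron updates; stops once an epoch has no mistakes."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            # A point is a mistake when it is on the wrong side or on the boundary.
            if y * (dot(w, x) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

# Hypothetical, linearly separable toy data with labels in {+1, -1}.
points = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]
w, b = perceptron(points, labels)
print(w, b)
```

Different visiting orders or starting weights generally give different separators, which is exactly the ambiguity the "Which Hyperplane?" question raises.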

5 Support Vector Machine (SVM)
- SVMs maximize the margin around the separating hyperplane.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Finding it is a quadratic programming problem.
[Figure labels: support vectors; maximize margin.]
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only support vectors are important; other training examples are ignorable.
3. Empirically it works very, very well.

6 Maximum Margin: Formalization
- w: hyperplane normal
- x_i: data point i
- y_i: class of data point i (+1 or -1)
Constrained optimization formalization:
(1) y_i (w · x_i + b) >= 1 for every data point i
(2) maximize the margin 2/||w||, or equivalently minimize (1/2) w · w
[Figure: maximum margin in geometrical terms.]
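To make the quantity 2/||w|| concrete, the sketch below computes the geometric margin of a given hyperplane and compares it with 2/||w||. It assumes w and b are scaled so that the closest points satisfy y_i (w · x_i + b) = 1; the two data points and the helper name `geometric_margin` are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def geometric_margin(points, labels, w, b):
    """Smallest signed distance from any point to the hyperplane w . x + b = 0."""
    norm_w = math.sqrt(dot(w, w))
    return min(y * (dot(w, x) + b) / norm_w for x, y in zip(points, labels))

# Hypothetical data: one point per class, with w, b scaled so that
# y_i (w . x_i + b) = 1 holds for both points.
points = [(2.0, 2.0), (0.0, 0.0)]
labels = [1, -1]
w, b = [0.5, 0.5], -1.0

margin = geometric_margin(points, labels, w, b)  # distance to the closest point
width = 2.0 / math.sqrt(dot(w, w))               # the 2/||w|| of the slide
print(margin, width)
```

Under that scaling the full width between the two classes is twice the distance to the closest point, which is why maximizing 2/||w|| and minimizing (1/2) w · w describe the same problem.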

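7 Quadratic Programming
One can show that the hyperplane w with maximum margin is:
w = sum_i alpha_i y_i x_i
- x_i: data point i
- y_i: class of data point i (+1 or -1)
The alpha_i are the so-called Lagrange multipliers, the solution to maximizing:
sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j y_i y_j (x_i · x_j), subject to alpha_i >= 0 and sum_i alpha_i y_i = 0
Most alpha_i will be zero.
Soft-Margin SVMs
Define a distance for each point with respect to the separator ax + by = c:
- (ax + by) - c for red points
- c - (ax + by) for green points
This distance is negative for bad points, i.e. points on the wrong side of the separator.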
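The following sketch shows how w is assembled from the multipliers once they are known. It does not solve the quadratic program: the alpha values are worked out by hand for this hypothetical three-point data set, and recovering b from a support vector via b = y_k - w · x_k is the standard relation for the separable case, assumed here rather than taken from the slides.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def weights_from_multipliers(points, labels, alphas):
    """Compute w = sum_i alpha_i * y_i * x_i."""
    w = [0.0] * len(points[0])
    for x, y, alpha in zip(points, labels, alphas):
        for j, xj in enumerate(x):
            w[j] += alpha * y * xj
    return w

def bias_from_support_vector(x_k, y_k, w):
    """For a support vector, y_k (w . x_k + b) = 1, hence b = y_k - w . x_k."""
    return y_k - dot(w, x_k)

# Hypothetical data; the multipliers were computed by hand for this example.
points = [(2.0, 2.0), (0.0, 0.0), (3.0, 3.0)]
labels = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]  # the third point is not a support vector

w = weights_from_multipliers(points, labels, alphas)
b = bias_from_support_vector(points[0], labels[0], w)
print(w, b)
```

The third point has alpha = 0 and contributes nothing to w, which illustrates why training examples other than the support vectors are ignorable.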

8 Solve Quadratic Program
- The solution gives the separator between the two classes: a choice of a, b.
- Given a new point (x, y), its proximity to each class can be scored by evaluating ax + by.
- Set a confidence threshold.
Support Vector Machines: The cost function

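9 Alternative view of logistic regression
h_theta(x) = 1 / (1 + e^(-theta^T x))
If y = 1, we want h_theta(x) ≈ 1, i.e. theta^T x >> 0.
If y = 0, we want h_theta(x) ≈ 0, i.e. theta^T x << 0.
Alternative view of logistic regression
Cost of example: -y log h_theta(x) - (1 - y) log(1 - h_theta(x))
If y = 1 (want theta^T x >> 0): the cost is -log(1 / (1 + e^(-theta^T x))).
If y = 0 (want theta^T x << 0): the cost is -log(1 - 1 / (1 + e^(-theta^T x))).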

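10 Support vector machine
Logistic regression: minimize over theta
(1/m) sum_{i=1..m} [ y_i (-log h_theta(x_i)) + (1 - y_i) (-log(1 - h_theta(x_i))) ] + (lambda / 2m) sum_{j=1..n} theta_j^2
Support vector machine: minimize over theta
C sum_{i=1..m} [ y_i cost_1(theta^T x_i) + (1 - y_i) cost_0(theta^T x_i) ] + (1/2) sum_{j=1..n} theta_j^2
SVM hypothesis
Hypothesis: h_theta(x) = 1 if theta^T x >= 0, and 0 otherwise.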
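A small sketch may help compare the two per-example costs. It assumes labels in {0, 1}, writes z for theta^T x, and uses the hinge forms max(0, 1 - z) and max(0, 1 + z) as stand-ins for the SVM costs of a positive and a negative example; the exact slope of the piecewise-linear costs on the slides is not recoverable here, so treat `cost1` and `cost0` as assumptions. For simplicity the regularizer covers every component of theta (no separate intercept).

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def logistic_cost(z, y):
    """Logistic regression cost of one example, with z = theta^T x and y in {0, 1}."""
    h = 1.0 / (1.0 + math.exp(-z))
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost1(z):
    """Assumed SVM cost for y = 1: zero once z >= 1, linear below that."""
    return max(0.0, 1.0 - z)

def cost0(z):
    """Assumed SVM cost for y = 0: zero once z <= -1, linear above that."""
    return max(0.0, 1.0 + z)

def svm_objective(theta, examples, labels, C):
    """C * (sum of per-example costs) + (1/2) * sum of squared parameters."""
    data_term = 0.0
    for x, y in zip(examples, labels):
        z = dot(theta, x)
        data_term += y * cost1(z) + (1 - y) * cost0(z)
    return C * data_term + 0.5 * dot(theta, theta)

# Hypothetical parameters and data.
theta = [1.0, -1.0]
examples = [(2.0, 0.0), (0.0, 2.0)]
labels = [1, 0]
objective = svm_objective(theta, examples, labels, C=10.0)
print(objective, logistic_cost(0.0, 1), cost1(0.0))
```

At z = 0 the logistic cost is log 2 while the assumed SVM cost is 1, and the SVM cost reaches exactly zero once the example clears the threshold, which the logistic cost never does.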

11 Support Vector Machines: Cost function and decision boundaries
Support Vector Machine
If y = 1, we want theta^T x >= 1 (not just >= 0).
If y = 0, we want theta^T x <= -1 (not just < 0).


12 SVM Decision Boundary
Whenever y_i = 1: theta^T x_i >= 1.
Whenever y_i = 0: theta^T x_i <= -1.
SVM Decision Boundary: Linearly separable case
[Figure: large margin classifier; axes x_1, x_2.]

13 Large margin classifier in presence of outliers
[Figure: axes x_1, x_2.]
Support Vector Machines: Kernels I

14 Non-linearly separable categories: transforming the space
Non-linear decision boundaries: transform the feature space
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)

15 Non-linear Decision Boundary
[Figure: axes x_1, x_2.]
Is there a different / better choice of features?
How a non-linear mapping can make a non-linear problem linearly separable
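A one-dimensional toy case shows how a mapping can make a problem linearly separable. In the sketch below the data, the map φ(x) = (x, x²) and the separator in feature space are all hypothetical choices for illustration: no single threshold on x separates the classes, but a line in the (x, x²) plane does.

```python
def phi(x):
    """Map a scalar x to the 2-D feature vector (x, x^2)."""
    return (x, x * x)

def classify(features, w, b):
    """Linear classifier in feature space: +1 if w . f + b > 0, otherwise -1."""
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score > 0 else -1

# Hypothetical 1-D data: the +1 class lies on both sides of the -1 class.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

# In feature space the horizontal line x^2 = 2.5 separates the classes.
w, b = (0.0, 1.0), -2.5
predicted = [classify(phi(x), w, b) for x in xs]
print(predicted)
```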

16 Support Vector Machines: Kernels II
Some popular kernels
- Polynomials of degree d
- Polynomials of degree up to d
- Gaussian/radial kernels (polynomials of all orders; the projected space has infinite dimension)
- Sigmoid

17 Kernels
Recall: we're maximizing
sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j y_i y_j (x_i · x_j)
Observation: the data only occur in dot products.
We can map the data into a very high-dimensional space (even infinite!) as long as the kernel is computable.
For a mapping function Φ, compute the kernel K(i, j) = Φ(x_i) · Φ(x_j).
Example: [equation missing in source]
The Kernel Trick
- The linear classifier relies on the dot product between vectors: K(x_i, x_j) = x_i^T x_j.
- If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j).
- A kernel function is a function that corresponds to an inner product in some expanded feature space.
- We don't have to compute Φ: x → φ(x) explicitly; K(x_i, x_j) is enough for SVM learning.
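The kernel trick can be checked numerically on a small case. The sketch below uses the degree-2 polynomial kernel K(x, z) = (x · z)² on 2-D inputs, for which one explicit feature map is φ(x) = (x1², √2 x1 x2, x2²). The input vectors are hypothetical, and this particular kernel is chosen only because its feature map is short enough to write out.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

# Hypothetical input vectors.
x, z = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(x, z)       # one dot product in 2-D, then a square
via_mapping = dot(phi(x), phi(z))     # explicit mapping to 3-D, then a dot product
print(via_kernel, via_mapping)
```

Both routes give the same number, but the kernel route never builds φ(x), which is what makes very high-dimensional feature spaces affordable.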


18 What Functions are Kernels?
- For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome.
- Mercer's theorem: every positive semi-definite symmetric function is a kernel.
- Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
K =
K(x_1, x_1)  K(x_1, x_2)  K(x_1, x_3)  ...  K(x_1, x_N)
K(x_2, x_1)  K(x_2, x_2)  K(x_2, x_3)  ...  K(x_2, x_N)
...
K(x_N, x_1)  K(x_N, x_2)  K(x_N, x_3)  ...  K(x_N, x_N)
SVM with Kernels
Hypothesis: given x, compute features f. Predict y = 1 if theta^T f >= 0.
Training: minimize over theta
C sum_{i=1..m} [ y_i cost_1(theta^T f_i) + (1 - y_i) cost_0(theta^T f_i) ] + (1/2) sum_j theta_j^2
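The Gram matrix condition can be spot-checked numerically. The sketch below builds Gram matrices for two similarity functions on hypothetical points and evaluates the quadratic form z^T K z for some vectors z. Random probing can only reveal a violation, it cannot prove positive semi-definiteness, so this is a sanity check rather than a test of Mercer's condition. The negative squared distance is included as an assumed example of a similarity that fails.

```python
import random

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_kernel(x, z):
    """Polynomial kernel (x . z + 1)^2, a valid kernel."""
    return (dot(x, z) + 1.0) ** 2

def neg_sq_distance(x, z):
    """Negative squared distance: a similarity that is not a valid kernel."""
    return -sum((a - b) ** 2 for a, b in zip(x, z))

def gram_matrix(points, kernel):
    """K[i][j] = kernel(x_i, x_j) for every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, z):
    """Evaluate z^T K z."""
    n = len(K)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))

# Hypothetical points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
K_good = gram_matrix(points, poly_kernel)
K_bad = gram_matrix(points, neg_sq_distance)

rng = random.Random(0)  # fixed seed so the probe vectors are reproducible
probes = [[rng.uniform(-1.0, 1.0) for _ in points] for _ in range(200)]
smallest_good = min(quadratic_form(K_good, z) for z in probes)
bad_value = quadratic_form(K_bad, [1.0] * len(points))  # all-ones vector
print(smallest_good, bad_value)
```

For the negative squared distance the diagonal is zero and every off-diagonal entry is negative, so the all-ones vector already gives a negative quadratic form.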

19 SVM parameters
C (= 1/lambda):
- Large C: lower bias, higher variance.
- Small C: higher bias, lower variance.
sigma^2:
- Large sigma^2: features vary more smoothly. Higher bias, lower variance.
- Small sigma^2: features vary less smoothly. Lower bias, higher variance.
Support Vector Machines: Using an SVM


20 Use an SVM software package (e.g. liblinear, libsvm, ...) to solve for the parameters theta.
Need to specify:
- Choice of parameter C.
- Choice of kernel (similarity function):
  E.g. no kernel ("linear kernel"): predict y = 1 if theta^T x >= 0.
  Gaussian kernel: f_i = exp(-||x - l_i||^2 / (2 sigma^2)), where l_i = x_i. Need to choose sigma^2.
Kernel (similarity) functions:
    function f = kernel(x1, x2)
        f = exp(-||x1 - x2||^2 / (2 sigma^2))
    return
Note: do perform feature scaling before using the Gaussian kernel.
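A short sketch of why feature scaling matters for the Gaussian kernel. The rows, the value sigma = 1 and the helper `standardize` (zero mean, unit variance per feature) are assumptions for illustration; other scaling schemes would serve equally well.

```python
import math

def gaussian_kernel(x1, x2, sigma):
    """exp(-||x1 - x2||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def standardize(rows):
    """Rescale every feature (column) to zero mean and unit variance."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(dim)]
    return [tuple((r[j] - means[j]) / stds[j] for j in range(dim)) for r in rows]

# Hypothetical rows whose two features live on very different scales.
rows = [(1000.0, 1.0), (2000.0, 3.0), (3000.0, 2.0)]
scaled = standardize(rows)

raw_similarity = gaussian_kernel(rows[0], rows[1], sigma=1.0)
scaled_similarity = gaussian_kernel(scaled[0], scaled[1], sigma=1.0)
print(raw_similarity, scaled_similarity)
```

Without scaling, the large-valued feature dominates ||x1 - x2||² and the similarity underflows to zero, so the second feature has no influence at all.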


21 Other choices of kernel
Note: not all similarity functions make valid kernels. (They need to satisfy a technical condition called Mercer's Theorem to make sure SVM packages' optimizations run correctly and do not diverge.)
Many off-the-shelf kernels are available:
- Polynomial kernel: K(x, l) = (x^T l + constant)^degree
- More esoteric: string kernel, chi-square kernel, histogram intersection kernel, ...
Multi-class classification
Many SVM packages already have built-in multi-class classification functionality. Otherwise, use the one-vs.-all method: train K SVMs, one to distinguish y = i from the rest, for i = 1, 2, ..., K, to get theta_1, theta_2, ..., theta_K. Pick the class i with the largest theta_i^T x.
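The prediction step of one-vs.-all can be sketched as follows. The parameter vectors are hypothetical stand-ins for K already trained linear SVMs (no training happens here), and classes are numbered from 1 to match the slide.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def predict_one_vs_all(x, thetas):
    """Return (class, scores): the 1-based class whose classifier scores x highest."""
    scores = [dot(theta, x) for theta in thetas]
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1, scores

# Hypothetical parameters for K = 3 one-vs.-rest classifiers.
thetas = [
    [1.0, 0.0],    # class 1 vs. the rest
    [0.0, 1.0],    # class 2 vs. the rest
    [-1.0, -1.0],  # class 3 vs. the rest
]
predicted_class, scores = predict_one_vs_all((0.2, 0.9), thetas)
print(predicted_class, scores)
```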

22 Logistic regression vs. SVMs
n = number of features, m = number of training examples.
- If n is large (relative to m): use logistic regression, or an SVM without a kernel ("linear kernel").
- If n is small and m is intermediate: use an SVM with a Gaussian kernel.
- If n is small and m is large: create/add more features, then use logistic regression or an SVM without a kernel.
A neural network is likely to work well for most of these settings, but may be slower to train.
THANKS
Many of these slides come from Andrew Ng's Coursera course at Stanford.
