Review: Support vector machines. Machine learning techniques and image analysis
Review: Support vector machines

Margin optimization:

$$\min_{w, w_0} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.$$

What are support vectors?
Review: Support vector machines (separable case)

Margin optimization:

$$\min_{w, w_0} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.$$

Add the constraints as terms to the objective function:

$$\min_{w, w_0} \ \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right]$$

What if $y_i(w_0 + w^T x_i) > 1$ (not a support vector) at the solution? Then we must have $\alpha_i = 0$.
Review: SVM optimization (separable case)

$$\max_{\{\alpha_i \ge 0\}} \min_{w, w_0} \underbrace{\left\{ \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right] \right\}}_{L(w, w_0; \alpha)}$$

First, we fix $\alpha$ and minimize $L(w, w_0; \alpha)$ w.r.t. $w, w_0$:

$$\nabla_w L(w, w_0; \alpha) = w - \sum_{i=1}^n \alpha_i y_i x_i = 0, \qquad \frac{\partial}{\partial w_0} L(w, w_0; \alpha) = -\sum_{i=1}^n \alpha_i y_i = 0,$$

which gives

$$w(\alpha) = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0.$$
Review: SVM optimization (separable case)

$$w(\alpha) = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0.$$

Now we can substitute this solution into the objective:

$$\max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \left\{ \frac{1}{2}\|w(\alpha)\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i\big(w_0(\alpha) + w(\alpha)^T x_i\big) \right] \right\}$$

$$= \max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j.$$
Review: SVM optimization (separable case)

Dual optimization problem:

$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j$$

$$\text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \ \text{for all} \ i = 1, \dots, n.$$

Solving this quadratic program yields the optimal $\alpha$. We substitute it back to get $w$:

$$w = w(\alpha) = \sum_{i=1}^n \alpha_i y_i x_i$$

What is the structure of the solution?
Review: SVM classification

$$w = \sum_{\alpha_i > 0} \alpha_i y_i x_i.$$

Given a test example $x$, how is it classified?

$$y = \operatorname{sign}\left( w_0 + w^T x \right) = \operatorname{sign}\left( w_0 + \Big( \sum_{\alpha_i > 0} \alpha_i y_i x_i \Big)^T x \right) = \operatorname{sign}\left( w_0 + \sum_{\alpha_i > 0} \alpha_i y_i x_i^T x \right)$$

The classifier is based on an expansion in terms of dot products of $x$ with the support vectors.
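To make the derivation concrete, here is a minimal sketch (not from the slides) that solves this dual QP on a made-up toy dataset with SciPy's general-purpose constrained minimizer, then recovers $w$, $w_0$ and classifies via the support vector expansion. In practice a dedicated QP solver would be used; the data and tolerance here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (made-up for illustration).
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# G_ij = y_i y_j x_i^T x_j, so the dual objective is sum(a) - 0.5 a^T G a.
G = np.outer(y, y) * (X @ X.T)

res = minimize(
    fun=lambda a: 0.5 * a @ G @ a - a.sum(),             # negated dual (we minimize)
    x0=np.zeros(n),
    jac=lambda a: G @ a - np.ones(n),
    bounds=[(0.0, None)] * n,                            # alpha_i >= 0
    constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y_i = 0
)
alpha = res.x

# Recover the primal solution from the support vector expansion.
w = (alpha * y) @ X                 # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                   # support vectors: alpha_i > 0
w0 = np.mean(y[sv] - X[sv] @ w)     # from y_i (w0 + w^T x_i) = 1 on the margin

print("alpha:", alpha.round(3))
print("prediction:", np.sign(w0 + np.array([1.0, -1.0]) @ w))
```

Note how only a few training points end up with $\alpha_i > 0$: exactly the structure-of-the-solution question above.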
Review: Non-separable case

What if the training data are not linearly separable?

Basic idea: minimize

$$\frac{1}{2}\|w\|^2 + C \cdot (\text{penalty for violating margin constraints}).$$
Non-separable case

Rewrite the constraints with slack variables $\xi_i \ge 0$:

$$\min_{w, w_0} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i\left( w_0 + w^T x_i \right) - 1 + \xi_i \ge 0.$$

Whenever the margin is $\ge 1$ (the original constraint is satisfied), $\xi_i = 0$. Whenever the margin is $< 1$ (the constraint is violated), we pay a linear penalty: $\xi_i = 1 - y_i(w_0 + w^T x_i)$.

Penalty function: $\max\left( 0, 1 - y_i(w_0 + w^T x_i) \right)$
Review: Hinge loss

$$\max\left( 0, 1 - y_i(w_0 + w^T x_i) \right)$$
Connection between SVMs and logistic regression

Support vector machines use the hinge loss: $\max\left( 0, 1 - y_i(w_0 + w^T x_i) \right)$.

Logistic regression models $P(y_i \mid x_i; w, w_0) = \dfrac{1}{1 + e^{-y_i(w_0 + w^T x_i)}}$ and uses the log loss: $\log\left( 1 + e^{-y_i(w_0 + w^T x_i)} \right)$.
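A small sketch (illustrative, with assumed margin values) comparing the two losses as functions of the margin $m_i = y_i(w_0 + w^T x_i)$:

```python
import numpy as np

# Both losses as functions of the margin m = y (w0 + w^T x).
def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)   # exactly zero once the margin reaches 1

def log_loss(m):
    return np.log1p(np.exp(-m))       # -log P(y | x); positive for every finite margin

for m in [-2.0, 0.0, 1.0, 3.0]:
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  log={log_loss(m):.3f}")
```

Both penalize small or negative margins; the hinge loss vanishes for margins $\ge 1$, while the log loss decays but never reaches zero.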
Review: Non-separable case (Source: G. Shakhnarovich)

Dual problem:

$$\max_{\alpha} \ \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j x_i^T x_j$$

$$\text{subject to} \quad \sum_{i=1}^N \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C \ \text{for all} \ i = 1, \dots, N.$$

- $\alpha_i = 0$: not a support vector.
- $0 < \alpha_i < C$: SV on the margin, $\xi_i = 0$.
- $\alpha_i = C$: over the margin, either misclassified ($\xi_i > 1$) or not ($0 < \xi_i \le 1$) (see the sketch after this list).
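As a hedged illustration (not from the slides), the sketch below fits a soft-margin SVM with scikit-learn, whose `dual_coef_` stores $y_i \alpha_i$ for the support vectors, on made-up blob data and splits the SVs into the margin ($0 < \alpha_i < C$) and bound ($\alpha_i = C$) cases:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Made-up overlapping data so that some margin constraints are violated.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_.ravel())   # alpha_i for each support vector
on_margin = alpha < C - 1e-6             # 0 < alpha_i < C: SV on the margin
at_bound = ~on_margin                    # alpha_i = C: over the margin
print(f"SVs on the margin: {on_margin.sum()}, at the bound alpha_i = C: {at_bound.sum()}")
```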
Nonlinear SVM

General idea: try to map the original input space into a high-dimensional feature space where the data are separable.
Example of nonlinear mapping

Not separable in 1D; separable in 2D. What is $\phi(x)$?

$$\phi(x) = (x, x^2).$$
Example of nonlinear mapping (Source: G. Shakhnarovich)

Consider the mapping

$$\phi: [x_1, x_2]^T \mapsto [1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2]^T.$$

The (linear) SVM classifier in the feature space:

$$\hat{y} = \operatorname{sign}\left( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i \phi(x_i)^T \phi(x) \right)$$

The dot product in the feature space:

$$\phi(x)^T \phi(z) = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = \left( 1 + x^T z \right)^2.$$
Dot products and feature space (Source: G. Shakhnarovich)

We defined a nonlinear mapping into feature space,

$$\phi: [x_1, x_2]^T \mapsto [1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2]^T,$$

and saw that $\phi(x)^T \phi(z) = K(x, z)$ using the kernel $K(x, z) = \left( 1 + x^T z \right)^2$. That is, we can calculate dot products in the feature space implicitly, without ever writing out the feature expansion!
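A quick numerical check (illustrative, with randomly drawn points) that the explicit feature map and the kernel agree:

```python
import numpy as np

sqrt2 = np.sqrt(2.0)

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([1.0, sqrt2 * x1, sqrt2 * x2, x1**2, x2**2, sqrt2 * x1 * x2])

def K(x, z):
    """Degree-2 polynomial kernel (1 + x^T z)^2."""
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(z), K(x, z))   # the two values agree
```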
The kernel trick (Source: G. Shakhnarovich)

Replace dot products in the SVM formulation with kernel values. The optimization problem becomes

$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j).$$

We need to compute pairwise kernel values for the training data.

The classifier now defines a nonlinear decision boundary in the original space:

$$\hat{y} = \operatorname{sign}\left( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i K(x_i, x) \right)$$

We need to compute $K(x_i, x)$ for all SVs $x_i$.
Mercer's kernels (Source: G. Shakhnarovich)

What kind of function $K$ is a valid kernel, i.e., such that there exists a feature space $\phi(x)$ in which $K(x, z) = \phi(x)^T \phi(z)$?

Theorem (Mercer, 1930s). $K$ must be continuous; symmetric: $K(x, z) = K(z, x)$; and positive definite: for any $x_1, \dots, x_N$, the kernel matrix

$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{pmatrix}$$

must be positive definite.
Some popular kernels (Source: G. Shakhnarovich)

The linear kernel: $K(x, z) = x^T z$. This leads to the original, linear SVM.

The polynomial kernel: $K(x, z; c, d) = (c + x^T z)^d$. We can write the expansion explicitly, by concatenating powers up to $d$ and multiplying by appropriate weights.
Example: SVM with polynomial kernel (figure; Source: G. Shakhnarovich)
Radial basis function kernel (Source: G. Shakhnarovich)

$$K(x, z; \sigma) = \exp\left( -\frac{1}{\sigma^2} \|x - z\|^2 \right).$$

The RBF kernel is a measure of similarity between two examples. The mapping $\phi(x)$ is infinite-dimensional!

What is the role of the parameter $\sigma$?

Consider $\sigma \to 0$. Then $K(x, z; \sigma) \to 1$ if $x = z$ and $\to 0$ if $x \ne z$. The SVM simply memorizes the training data (overfitting, lack of generalization).

What about $\sigma \to \infty$? Then $K(x, z) \to 1$ for all $x, z$. The SVM underfits.
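An illustrative sketch (not from the slides, on a made-up toy problem) of this trade-off using scikit-learn, which parameterizes the RBF kernel as $K(x, z) = \exp(-\gamma \|x - z\|^2)$, so $\gamma \approx 1/\sigma^2$: large $\gamma$ corresponds to small $\sigma$ (memorization), small $\gamma$ to large $\sigma$ (underfitting).

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# Two interleaving half-moons: not linearly separable in the input space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Vary the kernel width: watch the support vector count and the training fit.
for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:7.2f}  support vectors={clf.n_support_.sum():3d}  "
          f"train accuracy={clf.score(X, y):.2f}")
```

With very large $\gamma$ nearly every training point becomes a support vector and training accuracy is perfect, which is the memorization regime described above.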
SVM with RBF (Gaussian) kernels (figure; Source: G. Shakhnarovich). Note: some SVs here are not close to the boundary.
Making more Mercer kernels

Let $K_1$ and $K_2$ be Mercer kernels. Then the following functions are also Mercer kernels (checked numerically in the sketch after this list):

- $K(x, z) = K_1(x, z) + K_2(x, z)$
- $K(x, z) = a K_1(x, z)$ ($a$ is a positive scalar)
- $K(x, z) = K_1(x, z)\, K_2(x, z)$
- $K(x, z) = x^T B z$ ($B$ is a symmetric positive semi-definite matrix)

Multiple kernel learning: learn kernel combinations as part of the SVM optimization.
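A small sketch (illustrative, on random data) that checks these closure properties by testing whether the resulting kernel matrices are positive semi-definite via their eigenvalues:

```python
import numpy as np

def is_psd_kernel_matrix(K, tol=1e-8):
    """Check positive semi-definiteness of a kernel matrix via eigenvalues."""
    Ks = 0.5 * (K + K.T)                  # symmetrize against round-off
    return np.linalg.eigvalsh(Ks).min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

K1 = X @ X.T                              # linear kernel
K2 = (1.0 + X @ X.T) ** 2                 # degree-2 polynomial kernel
print(is_psd_kernel_matrix(K1 + K2))      # sum of kernels
print(is_psd_kernel_matrix(3.0 * K1))     # positive scaling
print(is_psd_kernel_matrix(K1 * K2))      # elementwise (Schur) product = product kernel
```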
Multi-class SVMs

Various direct formulations exist, but they are not widely used in practice. It is more common to obtain multi-class classifiers by combining two-class SVMs in various ways.

One vs. others (see the sketch after this list):
- Training: learn an SVM for each class vs. the others.
- Testing: apply each SVM to the test example and assign to it the class of the SVM that returns the highest decision value.

One vs. one:
- Training: learn an SVM for each pair of classes.
- Testing: each learned SVM votes for a class to assign to the test example.

Other options: error-correcting codes, decision trees/DAGs, etc.
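A minimal one-vs-others sketch (illustrative; the dataset choice is an assumption, and scikit-learn also ships built-in multi-class handling):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Training: one binary SVM per class, class k vs. everything else.
machines = {k: SVC(kernel="linear").fit(X, np.where(y == k, 1, -1)) for k in classes}

def predict(x):
    # Testing: assign the class whose SVM returns the highest decision value.
    scores = {k: clf.decision_function(x.reshape(1, -1))[0] for k, clf in machines.items()}
    return max(scores, key=scores.get)

print(predict(X[0]), y[0])
```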
Support Vector Regression (Source: G. Shakhnarovich)

The model: $f(x) = w_0 + w^T x$

Instead of the margin around the predicted decision boundary, we have an $\epsilon$-tube around the predicted function: targets within $\epsilon$ of the prediction incur no loss.

$\epsilon$-insensitive loss:

$$L_\epsilon\big(y, f(x)\big) = \max\left( 0, \ |y - f(x)| - \epsilon \right)$$
Support Vector Regression

Optimization: introduce constraints and slack variables for going above or below the tube.

$$\min_{w_0, w} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n (\xi_i + \hat{\xi}_i)$$

$$\text{subject to} \quad (w_0 + w^T x_i) - y_i \le \epsilon + \xi_i, \qquad y_i - (w_0 + w^T x_i) \le \epsilon + \hat{\xi}_i, \qquad \xi_i, \hat{\xi}_i \ge 0, \quad i = 1, \dots, n.$$
Support Vector Regression: Dual problem

$$\max_{\alpha, \hat{\alpha}} \ \sum_{i=1}^n (\hat{\alpha}_i - \alpha_i) y_i - \epsilon \sum_{i=1}^n (\hat{\alpha}_i + \alpha_i) - \frac{1}{2} \sum_{i,j=1}^n (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j) x_i^T x_j,$$

$$\text{subject to} \quad 0 \le \hat{\alpha}_i, \alpha_i \le C, \qquad \sum_{i=1}^n (\hat{\alpha}_i - \alpha_i) = 0, \quad i = 1, \dots, n.$$

Note that at the solution, we must have $\xi_i \hat{\xi}_i = 0$ and $\alpha_i \hat{\alpha}_i = 0$.

We can let $\beta_i = \hat{\alpha}_i - \alpha_i$ and simplify:

$$\max_{\beta} \ \sum_{i=1}^n y_i \beta_i - \epsilon \sum_{i=1}^n |\beta_i| - \frac{1}{2} \sum_{i,j=1}^n \beta_i \beta_j x_i^T x_j,$$

$$\text{subject to} \quad -C \le \beta_i \le C, \qquad \sum_{i=1}^n \beta_i = 0, \quad i = 1, \dots, n.$$

Then $f(x) = w_0 + \sum_{i=1}^n \beta_i x_i^T x$, where $w_0$ is chosen so that $|f(x_i) - y_i| = \epsilon$ for any $i$ with $0 < |\beta_i| < C$.
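An illustrative sketch (made-up data; scikit-learn's SVR solves this dual internally) showing the role of the $\epsilon$-tube: residuals inside the tube incur no loss, and only points on or outside the tube become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1D regression data (made-up for illustration).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

# epsilon sets the tube half-width around the predicted function.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
inside = np.abs(svr.predict(X) - y) <= 0.1 + 1e-9
print(f"support vectors: {len(svr.support_)}  points inside tube: {inside.sum()}/80")
```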
SVM: summary (Source: G. Shakhnarovich)

Main ideas:
- Large margin classification
- The kernel trick

What does the complexity/generalization ability of SVMs depend on?
- The constant $C$
- Choice of kernel and kernel parameters
- Number of support vectors

A crucial component: a good QP solver. Tons of off-the-shelf packages exist.
Advantages and disadvantages of SVMs

Advantages:
- One of the most successful ML techniques!
- Good generalization ability for small training sets
- The kernel trick is powerful and flexible
- The margin-based formalism can be extended to a large class of problems (regression, structured prediction, etc.)

Disadvantages:
- Computational and storage complexity of training: quadratic in the size of the training set
- In the worst case, can degenerate to nearest neighbor (every training point a support vector), but is much slower to train
- No direct multi-class formulation