Review: Support vector machines

Margin optimization:
$$\min_{w, w_0} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.$$

What are support vectors?
Review: Support vector machines (separable case)

Margin optimization:
$$\min_{w, w_0} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.$$

Add the constraints as terms to the objective function:
$$\min_{w, w_0} \ \frac{1}{2}\|w\|^2 + \max_{\alpha_i \ge 0} \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right]$$

What if $y_i(w_0 + w^T x_i) > 1$ (not a support vector) at the solution? Then we must have $\alpha_i = 0$.
Review: SVM optimization (separable case)

$$\max_{\{\alpha_i \ge 0\}} \ \min_{w, w_0} \ \underbrace{\frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right]}_{L(w, w_0; \alpha)}$$

First, we fix $\alpha$ and minimize $L(w, w_0; \alpha)$ w.r.t. $w, w_0$:
$$\nabla_w L(w, w_0; \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0, \qquad \frac{\partial}{\partial w_0} L(w, w_0; \alpha) = -\sum_{i=1}^{n} \alpha_i y_i = 0,$$
which gives
$$w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Review: SVM optimization (separable case)

$$w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Now we can substitute this solution into the objective:
$$\max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \left\{ \frac{1}{2}\|w(\alpha)\|^2 + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i\big(w_0 + w(\alpha)^T x_i\big) \right] \right\}
= \max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j.$$
Review: SVM optimization (separable case)

Dual optimization problem:
$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$$
$$\text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \ \text{for all} \ i = 1, \dots, n.$$

Solving this quadratic program yields the optimal $\alpha$. We substitute it back to get $w$:
$$w = w(\alpha) = \sum_{i=1}^{n} \alpha_i y_i x_i$$

What is the structure of the solution?
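A minimal sketch of this structure in practice, using scikit-learn's SVC as a stand-in for the QP solver; the toy data and the large C (to approximate the hard-margin case) are chosen purely for illustration:

```python
# Sketch: fit a linear SVM on toy 2D data and inspect the dual solution.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # very large C ~ hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only;
# all other training points have alpha_i = 0.
alpha_y = clf.dual_coef_.ravel()
sv = clf.support_vectors_

# Recover w = sum_i alpha_i y_i x_i from the support vectors alone.
w = alpha_y @ sv
print("number of support vectors:", len(sv))
print("w from the dual expansion:", w, " sklearn coef_:", clf.coef_.ravel())
```

Only a handful of training points end up with $\alpha_i > 0$; they alone determine $w$.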
Review: SVM classification

$$w = \sum_{\alpha_i > 0} \alpha_i y_i x_i.$$

Given a test example $x$, how is it classified?
$$\hat{y} = \mathrm{sign}\left( w_0 + w^T x \right)
= \mathrm{sign}\left( w_0 + \Big(\sum_{\alpha_i > 0} \alpha_i y_i x_i\Big)^{T} x \right)
= \mathrm{sign}\left( w_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, x_i^T x \right)$$

The classifier is based on an expansion in terms of dot products of $x$ with the support vectors.
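A minimal numpy sketch of this decision rule, assuming $w_0$, the support vectors, their labels, and the $\alpha_i$ are already available (all values below are invented for illustration):

```python
import numpy as np

def svm_predict(x, w0, alphas, sv_labels, sv_points):
    """Classify x via the support-vector expansion:
    sign(w0 + sum_i alpha_i y_i x_i^T x)."""
    scores = alphas * sv_labels * (sv_points @ x)   # alpha_i * y_i * (x_i^T x)
    return np.sign(w0 + scores.sum())

# Illustrative values (not from a real training run):
sv_points = np.array([[1.0, 1.0], [-1.0, -1.0]])
sv_labels = np.array([1.0, -1.0])
alphas = np.array([0.5, 0.5])
w0 = 0.0
print(svm_predict(np.array([2.0, 0.5]), w0, alphas, sv_labels, sv_points))  # -> 1.0
```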
Review: Non-separable case

What if the training data are not linearly separable?

Basic idea: minimize
$$\frac{1}{2}\|w\|^2 + C \cdot (\text{penalty for violating margin constraints}).$$
Non-separable case

Rewrite the constraints with slack variables $\xi_i \ge 0$:
$$\min_{w, w_0} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 + \xi_i \ge 0.$$

Whenever the margin is $\ge 1$ (the original constraint is satisfied), $\xi_i = 0$.

Whenever the margin is $< 1$ (the constraint is violated), we pay a linear penalty: $\xi_i = 1 - y_i(w_0 + w^T x_i)$.

Penalty function: $\max\left(0,\ 1 - y_i(w_0 + w^T x_i)\right)$.
Review: Hinge loss

$$\max\left(0,\ 1 - y_i(w_0 + w^T x_i)\right)$$
Connection between SVMs and logistic regression

Logistic regression:
$$P(y_i \mid x_i; w, w_0) = \frac{1}{1 + e^{-y_i(w_0 + w^T x_i)}}, \qquad \text{log loss:} \quad \log\left(1 + e^{-y_i(w_0 + w^T x_i)}\right)$$

Support vector machines:
$$\text{hinge loss:} \quad \max\left(0,\ 1 - y_i(w_0 + w^T x_i)\right)$$
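A small numpy sketch comparing the two losses as a function of the signed margin $y_i(w_0 + w^T x_i)$:

```python
import numpy as np

margins = np.linspace(-2, 2, 9)              # values of y_i * (w0 + w^T x_i)
hinge = np.maximum(0.0, 1.0 - margins)       # SVM: hinge loss
log_loss = np.log(1.0 + np.exp(-margins))    # logistic regression: log loss

for m, h, l in zip(margins, hinge, log_loss):
    print(f"margin {m:+.1f}  hinge {h:.3f}  log {l:.3f}")
# Both penalize small and negative margins; the hinge loss is exactly zero once
# the margin exceeds 1, while the log loss decays smoothly but never reaches 0.
```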
Review: Non-separable case (Source: G. Shakhnarovich)

Dual problem:
$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$$
$$\text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C \ \text{for all} \ i = 1, \dots, n.$$

$\alpha_i = 0$: not a support vector.
$0 < \alpha_i < C$: SV on the margin, $\xi_i = 0$.
$\alpha_i = C$: over the margin, either misclassified ($\xi_i > 1$) or not ($0 < \xi_i \le 1$).
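A sketch of how one might inspect this structure in a fitted soft-margin SVM. It uses scikit-learn's SVC, which exposes $\alpha_i y_i$ for the support vectors through `dual_coef_`; the noisy toy data is invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] + X[:, 1] + rng.normal(scale=0.8, size=80) > 0, 1, -1)  # noisy labels

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_.ravel())    # |alpha_i y_i| = alpha_i for the SVs
on_margin = np.sum(alpha < C - 1e-8)      # 0 < alpha_i < C: SV on the margin, xi_i = 0
at_bound = np.sum(np.isclose(alpha, C))   # alpha_i = C: margin violated (xi_i > 0)
print(f"non-SVs: {len(X) - len(alpha)}, margin SVs: {on_margin}, bound SVs: {at_bound}")
```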
Nonlinear SVM

General idea: try to map the original input space into a high-dimensional feature space where the data are separable.
Example of nonlinear mapping

Not separable in 1D; separable in 2D.

What is $\varphi(x)$? $\varphi(x) = (x, x^2)$.
Example of nonlinear mapping (Source: G. Shakhnarovich)

Consider the mapping:
$$\varphi: [x_1, x_2]^T \mapsto [1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1 x_2]^T.$$

The (linear) SVM classifier in the feature space:
$$\hat{y} = \mathrm{sign}\left( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i\, \varphi(x_i)^T \varphi(x) \right)$$

The dot product in the feature space:
$$\varphi(x)^T \varphi(z) = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1 x_2 z_1 z_2 = \left(1 + x^T z\right)^2.$$
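A quick numpy check of this identity, with the $\sqrt{2}$ factors of the feature map written out explicitly (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in 2D.
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
lhs = phi(x) @ phi(z)        # dot product in the 6-dimensional feature space
rhs = (1.0 + x @ z) ** 2     # kernel evaluated in the original 2-dimensional space
print(lhs, rhs)              # the two values agree (up to rounding)
```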
Dot products and feature space (Source: G. Shakhnarovich)

We defined a nonlinear mapping into feature space,
$$\varphi: [x_1, x_2]^T \mapsto [1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1 x_2]^T,$$
and saw that $\varphi(x)^T \varphi(z) = K(x, z)$ using the kernel $K(x, z) = \left(1 + x^T z\right)^2$.

I.e., we can calculate dot products in the feature space implicitly, without ever writing out the feature expansion!
The kernel trick (Source: G. Shakhnarovich)

Replace dot products in the SVM formulation with kernel values.

The optimization problem:
$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
We need to compute pairwise kernel values for the training data.

The classifier now defines a nonlinear decision boundary in the original space:
$$\hat{y} = \mathrm{sign}\left( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i K(x_i, x) \right)$$
We need to compute $K(x_i, x)$ for all support vectors $x_i$.
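A minimal numpy sketch of both steps for a generic kernel function; the function and variable names are illustrative, not a reference implementation:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian kernel: exp(-||x - z||^2 / sigma^2).
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def gram_matrix(X, kernel):
    # Pairwise kernel values K(x_i, x_j) needed by the dual QP over the training set.
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

def kernel_predict(x, w0, alphas, sv_labels, sv_points, kernel):
    # Nonlinear decision rule: sign(w0 + sum_i alpha_i y_i K(x_i, x)).
    s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, sv_labels, sv_points))
    return np.sign(w0 + s)

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(gram_matrix(X, rbf_kernel))
```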
Mercer's kernels (Source: G. Shakhnarovich)

What kind of function $K$ is a valid kernel, i.e. such that there exists a feature map $\varphi(x)$ for which $K(x, z) = \varphi(x)^T \varphi(z)$?

Theorem due to Mercer (1930s): $K$ must be continuous; symmetric, $K(x, z) = K(z, x)$; and positive semi-definite: for any $x_1, \dots, x_N$, the kernel matrix
$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{pmatrix}$$
must be positive semi-definite.
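One can check the positive semi-definiteness condition empirically on a sample of points by looking at the eigenvalues of the kernel matrix. A small numpy sketch, using the degree-2 polynomial kernel and arbitrary sample points:

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(10, 3))   # arbitrary sample points

# Kernel matrix for the polynomial kernel K(x, z) = (1 + x^T z)^2.
K = (1.0 + X @ X.T) ** 2

eigvals = np.linalg.eigvalsh(K)               # symmetric matrix, so use eigvalsh
print("smallest eigenvalue:", eigvals.min())  # >= 0 (up to rounding) for a valid kernel
```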
Some popular kernels (Source: G. Shakhnarovich)

The linear kernel: $K(x, z) = x^T z$. This leads to the original, linear SVM.

The polynomial kernel: $K(x, z; c, d) = (c + x^T z)^d$. We can write the expansion explicitly, by concatenating powers up to $d$ and multiplying by appropriate weights.
Example: SVM with polynomial kernel Source: G. Shakhnarovich
Radial basis function kernel (Source: G. Shakhnarovich)

$$K(x, z; \sigma) = \exp\left( -\frac{1}{\sigma^2} \|x - z\|^2 \right).$$

The RBF kernel is a measure of similarity between two examples. The mapping $\varphi(x)$ is infinite-dimensional!

What is the role of the parameter $\sigma$?

Consider $\sigma \to 0$. Then $K(x, z; \sigma) \to 1$ if $x = z$ and $\to 0$ if $x \ne z$. The SVM simply memorizes the training data (overfitting, lack of generalization).

What about $\sigma \to \infty$? Then $K(x, z) \to 1$ for all $x, z$. The SVM underfits.
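A sketch of this bandwidth effect using scikit-learn's SVC. Note that sklearn parameterizes the RBF kernel as $\exp(-\gamma \|x - z\|^2)$, so $\gamma$ plays the role of $1/\sigma^2$; the XOR-like toy data is made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] * X[:, 1])            # a simple nonlinear (XOR-like) labeling

for gamma in [0.01, 1.0, 1000.0]:         # large gamma <-> small sigma
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:7.2f}  train accuracy={clf.score(X, y):.2f}  "
          f"#SV={len(clf.support_)}")
# Very large gamma (sigma -> 0) tends to memorize the data (many SVs, perfect
# training accuracy); very small gamma (sigma -> infinity) tends to underfit.
```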
SVM with RBF (Gaussian) kernels (Source: G. Shakhnarovich)

Note: some support vectors here are not close to the boundary.
Making more Mercer kernels

Let $K_1$ and $K_2$ be Mercer kernels. Then the following functions are also Mercer kernels (see the sketch below):
$$K(x, z) = K_1(x, z) + K_2(x, z)$$
$$K(x, z) = a K_1(x, z) \quad (a \text{ is a positive scalar})$$
$$K(x, z) = K_1(x, z)\, K_2(x, z)$$
$$K(x, z) = x^T B z \quad (B \text{ is a symmetric positive semi-definite matrix})$$

Multiple kernel learning: learn kernel combinations as part of the SVM optimization.
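A small sketch of kernel composition built from these closure rules; the component kernels and the weight are chosen arbitrarily for illustration:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def combined_kernel(x, z):
    # A positive scaling plus a sum of Mercer kernels is again a Mercer kernel.
    return 2.0 * linear_kernel(x, z) + rbf_kernel(x, z, sigma=0.5)

x, z = np.array([1.0, 0.5]), np.array([-0.2, 2.0])
print(combined_kernel(x, z))
```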
Multi-class SVMs

Various direct formulations exist, but they are not widely used in practice. It is more common to obtain multi-class classifiers by combining two-class SVMs in various ways (a sketch of the two main strategies follows below).

One vs. others:
Training: learn an SVM for each class vs. the others.
Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value.

One vs. one:
Training: learn an SVM for each pair of classes.
Testing: each learned SVM votes for a class to assign to the test example.

Other options: error-correcting codes, decision trees/DAGs, etc.
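A brief scikit-learn sketch of both strategies (SVC itself already uses one-vs-one internally; the explicit wrappers make the two schemes visible), on the standard iris data purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)        # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)   # one vs. others
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)    # one vs. one

print("one-vs-rest training accuracy:", ovr.score(X, y))
print("one-vs-one  training accuracy:", ovo.score(X, y))
```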
Support Vector Regression (Source: G. Shakhnarovich)

The model: $f(x) = w_0 + w^T x$

Instead of a margin around the predicted decision boundary, we have an $\epsilon$-tube around the predicted function: predictions within $[y(x) - \epsilon,\ y(x) + \epsilon]$ incur no loss ($\epsilon$-insensitive loss).
Support Vector Regression

Optimization: introduce constraints and slack variables for going above or below the tube.
$$\min_{w_0, w} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \hat{\xi}_i)$$
$$\text{subject to} \quad (w_0 + w^T x_i) - y_i \le \epsilon + \xi_i, \qquad y_i - (w_0 + w^T x_i) \le \epsilon + \hat{\xi}_i, \qquad \xi_i, \hat{\xi}_i \ge 0, \quad i = 1, \dots, n.$$
Support Vector Regression: Dual problem

$$\max_{\alpha, \hat{\alpha}} \ \sum_{i=1}^{n} (\hat{\alpha}_i - \alpha_i) y_i - \epsilon \sum_{i=1}^{n} (\hat{\alpha}_i + \alpha_i) - \frac{1}{2} \sum_{i,j=1}^{n} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j) x_i^T x_j$$
$$\text{subject to} \quad 0 \le \hat{\alpha}_i, \alpha_i \le C, \qquad \sum_{i=1}^{n} (\hat{\alpha}_i - \alpha_i) = 0, \quad i = 1, \dots, n.$$

Note that at the solution, we must have $\xi_i \hat{\xi}_i = 0$ and $\alpha_i \hat{\alpha}_i = 0$.

We can let $\beta_i = \hat{\alpha}_i - \alpha_i$ and simplify:
$$\max_{\beta} \ \sum_{i=1}^{n} y_i \beta_i - \epsilon \sum_{i=1}^{n} |\beta_i| - \frac{1}{2} \sum_{i,j=1}^{n} \beta_i \beta_j x_i^T x_j$$
$$\text{subject to} \quad -C \le \beta_i \le C, \qquad \sum_{i=1}^{n} \beta_i = 0, \quad i = 1, \dots, n.$$

Then $f(x) = w_0 + \sum_{i=1}^{n} \beta_i x_i^T x$, where $w_0$ is chosen so that $|f(x_i) - y_i| = \epsilon$ for any $i$ with $0 < |\beta_i| < C$.
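A brief scikit-learn sketch of $\epsilon$-tube regression; the noisy sine data is made up for illustration:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# epsilon is the half-width of the tube; points inside it incur no loss
# and do not become support vectors.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("number of support vectors:", len(reg.support_))
print("prediction at x=2.5:", reg.predict([[2.5]])[0])
```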
SVM: summary (Source: G. Shakhnarovich)

Main ideas:
Large-margin classification.
The kernel trick.

What does the complexity/generalization ability of SVMs depend on?
The constant C.
The choice of kernel and kernel parameters.
The number of support vectors.

A crucial component: a good QP solver. There are plenty of off-the-shelf packages.
Advantages and disadvantages of SVMs

Advantages:
One of the most successful ML techniques!
Good generalization ability for small training sets.
The kernel trick is powerful and flexible.
The margin-based formalism can be extended to a large class of problems (regression, structured prediction, etc.).

Disadvantages:
Computational and storage complexity of training: quadratic in the size of the training set.
In the worst case, the classifier can degenerate to nearest neighbor (every training point becomes a support vector), but it is much slower to train.
No direct multi-class formulation.