Review: Support vector machines. Machine learning techniques and image analysis


Review: Support vector machines

Margin optimization:
$$\min_{(w, w_0)} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.$$

What are support vectors?

Review: Support vector machines (separable case)

Margin optimization:
$$\min_{(w, w_0)} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.$$

Add the constraints as terms to the objective function:
$$\min_{(w, w_0)} \; \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right]$$

What if $y_i(w_0 + w^T x_i) > 1$ (not a support vector) at the solution? Then we must have $\alpha_i = 0$.

Review: SVM optimization (separable case)

$$\max_{\{\alpha_i \ge 0\}} \; \min_{(w, w_0)} \; \underbrace{\frac{1}{2}\|w\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i(w_0 + w^T x_i) \right]}_{L(w, w_0; \alpha)}$$

First, we fix $\alpha$ and minimize $L(w, w_0; \alpha)$ w.r.t. $w, w_0$:
$$\nabla_w L(w, w_0; \alpha) = w - \sum_{i=1}^n \alpha_i y_i x_i = 0, \qquad \frac{\partial}{\partial w_0} L(w, w_0; \alpha) = -\sum_{i=1}^n \alpha_i y_i = 0.$$

Therefore
$$w(\alpha) = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0.$$

Review: SVM optimization (separable case)

Recall $w(\alpha) = \sum_{i=1}^n \alpha_i y_i x_i$ and $\sum_{i=1}^n \alpha_i y_i = 0$. Now we can substitute this solution into the objective:
$$\max_{\{\alpha_i \ge 0,\; \sum_i \alpha_i y_i = 0\}} \left\{ \frac{1}{2}\|w(\alpha)\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i\big(w_0 + w(\alpha)^T x_i\big) \right] \right\}
= \max_{\{\alpha_i \ge 0,\; \sum_i \alpha_i y_i = 0\}} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j.$$

Review: SVM optimization (separable case)

Dual optimization problem:
$$\max_{\alpha} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j$$
$$\text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \;\; \text{for all } i = 1, \dots, n.$$

Solving this quadratic program yields the optimal $\alpha$. We substitute it back to get $w$:
$$w = w(\alpha) = \sum_{i=1}^n \alpha_i y_i x_i$$

What is the structure of the solution?
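
A minimal sketch (not part of the lecture) of solving this hard-margin dual QP numerically, using scipy's SLSQP solver on an assumed separable toy set; the variable names (X, y, alpha) are illustrative. Most of the recovered alphas are zero, which is the structure the slide asks about.

```python
# Hard-margin SVM dual solved as a small constrained optimization (illustrative sketch).
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])  # toy separable data
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q[i, j] = y_i y_j x_i^T x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # negative dual objective: -(sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j)
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0, None)] * n                                 # alpha_i >= 0

res = minimize(neg_dual, np.zeros(n), method="SLSQP", bounds=bounds, constraints=constraints)
alpha = res.x
w = (alpha * y) @ X                                      # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                        # support vectors: alpha_i > 0
w0 = np.mean(y[sv] - X[sv] @ w)                          # from y_i (w0 + w^T x_i) = 1 on the SVs
print("alpha:", np.round(alpha, 3), "w:", w, "w0:", round(w0, 3))
```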

Review: SVM classification

$$w = \sum_{\alpha_i > 0} \alpha_i y_i x_i.$$

Given a test example $x$, how is it classified?
$$\hat{y} = \mathrm{sign}\left( w_0 + w^T x \right)
= \mathrm{sign}\left( w_0 + \Big(\sum_{\alpha_i > 0} \alpha_i y_i x_i\Big)^T x \right)
= \mathrm{sign}\left( w_0 + \sum_{\alpha_i > 0} \alpha_i y_i \, x_i^T x \right)$$

The classifier is based on an expansion in terms of dot products of $x$ with the support vectors.
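
As an illustrative aside (assumed toy data, not from the lecture), scikit-learn's SVC exposes exactly this expansion: dual_coef_ stores $\alpha_i y_i$ for the support vectors and support_vectors_ stores the $x_i$, so the decision value can be recomputed by hand.

```python
# Recomputing the support-vector expansion from a fitted linear SVC (sketch).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # a large C approximates the hard margin

x_test = np.array([0.5, 1.0])
# decision value: w0 + sum over SVs of (alpha_i y_i) x_i^T x_test
decision = clf.intercept_[0] + np.sum(clf.dual_coef_[0] * (clf.support_vectors_ @ x_test))
print(np.sign(decision), clf.predict([x_test])[0])   # the two should agree
```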

Review: Non-separable case

What if the training data are not linearly separable?

Basic idea: minimize
$$\frac{1}{2}\|w\|^2 + C \cdot (\text{penalty for violating the margin constraints}).$$

Non-separable case

Rewrite the constraints with slack variables $\xi_i \ge 0$:
$$\min_{(w, w_0)} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w_0 + w^T x_i) - 1 + \xi_i \ge 0.$$

Whenever the margin is $\ge 1$ (original constraint satisfied), $\xi_i = 0$.
Whenever the margin is $< 1$ (constraint violated), we pay a linear penalty: $\xi_i = 1 - y_i(w_0 + w^T x_i)$.

Penalty function: $\max\left(0, \; 1 - y_i(w_0 + w^T x_i)\right)$

Review: Hinge loss

$$\max\left(0, \; 1 - y_i(w_0 + w^T x_i)\right)$$

Connection between SVMs and logistic regression

Support vector machines use the hinge loss: $\max\left(0, \; 1 - y_i(w_0 + w^T x_i)\right)$.

Logistic regression models $P(y_i \mid x_i; w, w_0) = \dfrac{1}{1 + e^{-y_i(w_0 + w^T x_i)}}$ and minimizes the log loss: $\log\left(1 + e^{-y_i(w_0 + w^T x_i)}\right)$.
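
A small numerical comparison (illustrative, not from the lecture) of the two losses as a function of the margin $m = y(w_0 + w^T x)$:

```python
# Hinge loss vs. log loss over a range of margins (sketch).
import numpy as np

margins = np.linspace(-2, 2, 9)
hinge = np.maximum(0.0, 1.0 - margins)     # SVM hinge loss
log_loss = np.log1p(np.exp(-margins))      # logistic regression log loss

for m, h, l in zip(margins, hinge, log_loss):
    print(f"margin {m:+.1f}  hinge {h:.3f}  log {l:.3f}")
# Both penalize small or negative margins; the hinge loss is exactly zero once the margin
# exceeds 1, while the log loss only approaches zero asymptotically.
```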

Review: Non-separable case (Source: G. Shakhnarovich)

Dual problem:
$$\max_{\alpha} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \;\; \text{for all } i = 1, \dots, n.$$

$\alpha_i = 0$: not a support vector.
$0 < \alpha_i < C$: support vector on the margin, $\xi_i = 0$.
$\alpha_i = C$: over the margin, either misclassified ($\xi_i > 1$) or not ($0 < \xi_i \le 1$).
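
A sketch (assumed overlapping toy data) of these three regimes in practice, reading them off a fitted soft-margin SVC:

```python
# Counting non-SVs (alpha=0), margin SVs (0<alpha<C), and bound SVs (alpha=C).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

abs_alpha = np.abs(clf.dual_coef_[0])        # |alpha_i y_i| = alpha_i for the support vectors
n_sv = len(abs_alpha)
n_bound = np.sum(np.isclose(abs_alpha, C))   # alpha_i = C: points inside or beyond the margin
print(f"{len(X) - n_sv} points with alpha=0, "
      f"{n_sv - n_bound} margin SVs (0<alpha<C), {n_bound} bound SVs (alpha=C)")
```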

Nonlinear SVM General idea: try to map the original input space into a high-dimensional feature space where the data is separable.

Example of nonlinear mapping

Not separable in 1D; separable in 2D. What is $\phi(x)$? $\phi(x) = (x, x^2)$.
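
A minimal sketch (assumed toy data) of this mapping: a 1D set whose positive class lies on both sides of the negative class is not linearly separable on the line, but $\phi(x) = (x, x^2)$ makes it separable by a horizontal line in the plane.

```python
# 1D data made linearly separable by the explicit feature map phi(x) = (x, x^2).
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([+1, +1, -1, -1, +1, +1])     # positive class on both sides: not separable in 1D

phi = np.column_stack([x, x ** 2])          # phi(x) = (x, x^2)
clf = SVC(kernel="linear").fit(phi, y)
print(clf.score(phi, y))                    # 1.0: perfectly separable after the mapping
```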

Example of nonlinear mapping (Source: G. Shakhnarovich)

Consider the mapping
$$\phi : [x_1, x_2]^T \mapsto [1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2]^T.$$

The (linear) SVM classifier in the feature space:
$$\hat{y} = \mathrm{sign}\left( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i \, \phi(x_i)^T \phi(x) \right)$$

The dot product in the feature space:
$$\phi(x)^T \phi(z) = 1 + 2x_1 z_1 + 2x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2x_1 x_2 z_1 z_2 = \left(1 + x^T z\right)^2.$$

Dot products and feature space (Source: G. Shakhnarovich)

We defined a nonlinear mapping into feature space,
$$\phi : [x_1, x_2]^T \mapsto [1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2]^T,$$
and saw that $\phi(x)^T \phi(z) = K(x, z)$ for the kernel $K(x, z) = \left(1 + x^T z\right)^2$.

That is, we can compute dot products in the feature space implicitly, without ever writing out the feature expansion!
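
A quick numerical check (illustrative, not from the lecture) that the explicit quadratic feature map and the kernel $K(x, z) = (1 + x^T z)^2$ give the same dot product; the test points are arbitrary assumptions.

```python
# Explicit feature map vs. implicit kernel evaluation (sketch).
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

explicit = phi(x) @ phi(z)
implicit = (1.0 + x @ z) ** 2
print(explicit, implicit, np.isclose(explicit, implicit))   # identical up to rounding
```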

The kernel trick (Source: G. Shakhnarovich)

Replace dot products in the SVM formulation with kernel values. The optimization problem becomes
$$\max_{\alpha} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j).$$
We need to compute pairwise kernel values for the training data.

The classifier now defines a nonlinear decision boundary in the original space:
$$\hat{y} = \mathrm{sign}\left( \hat{w}_0 + \sum_{\alpha_i > 0} \alpha_i y_i K(x_i, x) \right)$$
We need to compute $K(x_i, x)$ for all support vectors $x_i$.
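
A minimal sketch (assumed toy data and parameters) of the kernel trick in practice: we only ever build kernel matrices, never the feature expansion, and scikit-learn's SVC accepts them directly via kernel="precomputed".

```python
# Training and predicting with precomputed kernel matrices (sketch).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular boundary: not linearly separable

def poly_kernel(A, B, c=1.0, d=2):
    # K(x, z) = (c + x^T z)^d for all pairs of rows of A and B
    return (c + A @ B.T) ** d

K_train = poly_kernel(X, X)                               # pairwise kernel values on training data
clf = SVC(kernel="precomputed", C=10.0).fit(K_train, y)

X_test = rng.normal(size=(5, 2))
K_test = poly_kernel(X_test, X)                           # rows: test points, columns: training points
print(clf.predict(K_test))
```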

Mercer's kernels (Source: G. Shakhnarovich)

What kind of function $K$ is a valid kernel, i.e. such that there exists a feature map $\phi(x)$ for which $K(x, z) = \phi(x)^T \phi(z)$?

Theorem (Mercer, 1930s): $K$ must be continuous, symmetric ($K(x, z) = K(z, x)$), and positive definite: for any $x_1, \dots, x_N$, the kernel matrix
$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{pmatrix}$$
must be positive definite.
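
An illustrative numerical check (not a proof, and not from the lecture) of the positive-definiteness condition: the Gram matrix of the RBF kernel on assumed random points has no negative eigenvalues.

```python
# Eigenvalue check of a kernel (Gram) matrix (sketch).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
sigma = 1.5

# pairwise squared distances, then K(x, z) = exp(-||x - z||^2 / sigma^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / sigma ** 2)

eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())   # >= 0 up to numerical precision
```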

Some popular kernels (Source: G. Shakhnarovich)

The linear kernel: $K(x, z) = x^T z$. This leads to the original, linear SVM.

The polynomial kernel: $K(x, z; c, d) = (c + x^T z)^d$. We can write the expansion explicitly, by concatenating powers up to $d$ and multiplying by appropriate weights.

Example: SVM with polynomial kernel (Source: G. Shakhnarovich)

Radial basis function kernel (Source: G. Shakhnarovich)

$$K(x, z; \sigma) = \exp\left( -\frac{1}{\sigma^2} \|x - z\|^2 \right).$$

The RBF kernel is a measure of similarity between two examples. The mapping $\phi(x)$ is infinite-dimensional!

What is the role of the parameter $\sigma$?

Consider $\sigma \to 0$. Then $K(x, z; \sigma) \to 1$ if $x = z$ and $\to 0$ if $x \ne z$. The SVM simply memorizes the training data (overfitting, lack of generalization).

What about $\sigma \to \infty$? Then $K(x, z) \to 1$ for all $x, z$. The SVM underfits.
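
A sketch (assumed toy data) of this effect. scikit-learn parameterizes the RBF kernel as $\exp(-\gamma\|x - z\|^2)$, so $\gamma$ plays the role of $1/\sigma^2$: a very large $\gamma$ (tiny $\sigma$) memorizes the training set, a tiny $\gamma$ (huge $\sigma$) underfits.

```python
# Over- and underfitting with the RBF kernel as gamma = 1/sigma^2 varies (sketch).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in [1000.0, 1.0, 1e-4]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:g}: train acc {clf.score(X_tr, y_tr):.2f}, "
          f"test acc {clf.score(X_te, y_te):.2f}")
```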

SVM with RBF (Gaussian) kernels (Source: G. Shakhnarovich)

Note: some support vectors here are not close to the boundary.

Making more Mercer kernels

Let $K_1$ and $K_2$ be Mercer kernels. Then the following functions are also Mercer kernels:
$K(x, z) = K_1(x, z) + K_2(x, z)$
$K(x, z) = a K_1(x, z)$ ($a$ is a positive scalar)
$K(x, z) = K_1(x, z)\, K_2(x, z)$
$K(x, z) = x^T B z$ ($B$ is a symmetric positive semi-definite matrix)

Multiple kernel learning: learn kernel combinations as part of the SVM optimization.
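
An illustrative numerical sanity check (not a proof) of the first and third closure rules on assumed random points: the sum and the elementwise product of two kernel Gram matrices remain positive semidefinite.

```python
# Eigenvalue check for sums and products of kernel matrices (sketch).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 4))

K1 = X @ X.T                        # linear kernel Gram matrix
K2 = (1.0 + X @ X.T) ** 2           # polynomial kernel Gram matrix (c=1, d=2)

for name, K in [("K1 + K2", K1 + K2), ("K1 * K2", K1 * K2)]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to rounding
```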

Multi-class SVMs

Various direct formulations exist, but they are not widely used in practice. It is more common to obtain multi-class classifiers by combining two-class SVMs in various ways (see the sketch after this list).

One vs. others:
Training: learn an SVM for each class vs. the others.
Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value.

One vs. one:
Training: learn an SVM for each pair of classes.
Testing: each learned SVM votes for a class to assign to the test example.

Other options: error-correcting codes, decision trees/DAGs, ...
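
A minimal sketch of the two standard reductions using scikit-learn's wrappers; the dataset (iris) and parameters are illustrative choices, not part of the lecture.

```python
# One-vs-rest and one-vs-one multi-class SVMs built from binary SVCs (sketch).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                                # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)    # one SVM per class
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)     # one SVM per pair of classes

print("one-vs-rest binary SVMs:", len(ovr.estimators_))          # k
print("one-vs-one  binary SVMs:", len(ovo.estimators_))          # k(k-1)/2
```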

Support Vector Regression (Source: G. Shakhnarovich)

The model: $f(x) = w_0 + w^T x$.

Instead of a margin around the predicted decision boundary, we have an $\epsilon$-tube around the predicted function: the tube spans $y(x) - \epsilon$ to $y(x) + \epsilon$, and the loss is $\epsilon$-insensitive (zero inside the tube). [Figure: the $\epsilon$-tube around $y(x)$ and the $\epsilon$-insensitive loss.]

Support Vector Regression

Optimization: introduce constraints and slack variables for going above or below the tube:
$$\min_{w_0, w} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n (\xi_i + \hat{\xi}_i)$$
$$\text{subject to} \quad (w_0 + w^T x_i) - y_i \le \epsilon + \xi_i, \quad y_i - (w_0 + w^T x_i) \le \epsilon + \hat{\xi}_i, \quad \xi_i, \hat{\xi}_i \ge 0, \quad i = 1, \dots, n.$$

Support Vector Regression: Dual problem

$$\max_{\alpha, \hat{\alpha}} \; \sum_{i=1}^n (\hat{\alpha}_i - \alpha_i) y_i - \epsilon \sum_{i=1}^n (\hat{\alpha}_i + \alpha_i) - \frac{1}{2} \sum_{i,j=1}^n (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)\, x_i^T x_j$$
$$\text{subject to} \quad 0 \le \hat{\alpha}_i, \alpha_i \le C, \quad \sum_{i=1}^n (\hat{\alpha}_i - \alpha_i) = 0, \quad i = 1, \dots, n.$$

Note that at the solution we must have $\xi_i \hat{\xi}_i = 0$ and $\alpha_i \hat{\alpha}_i = 0$.

We can let $\beta_i = \hat{\alpha}_i - \alpha_i$ and simplify:
$$\max_{\beta} \; \sum_{i=1}^n y_i \beta_i - \epsilon \sum_{i=1}^n |\beta_i| - \frac{1}{2} \sum_{i,j=1}^n \beta_i \beta_j\, x_i^T x_j$$
$$\text{subject to} \quad -C \le \beta_i \le C, \quad \sum_{i=1}^n \beta_i = 0, \quad i = 1, \dots, n.$$

Then $f(x) = w_0 + \sum_{i=1}^n \beta_i\, x_i^T x$, where $w_0$ is chosen so that $|f(x_i) - y_i| = \epsilon$ for any $i$ with $0 < |\beta_i| < C$.
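
A minimal sketch (assumed noisy toy data) of $\epsilon$-SVR with scikit-learn: points whose predictions fall strictly inside the $\epsilon$-tube incur zero loss and do not become support vectors.

```python
# epsilon-SVR on a noisy line; non-support-vector points lie inside the epsilon-tube (sketch).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 60).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.2, size=60)   # noisy line

reg = SVR(kernel="linear", C=10.0, epsilon=0.3).fit(X, y)
residuals = np.abs(reg.predict(X) - y)

print("support vectors:", len(reg.support_), "of", len(X))
non_sv = np.setdiff1d(np.arange(len(X)), reg.support_)
if len(non_sv) > 0:
    print("max |residual| among non-SVs:", residuals[non_sv].max())   # < epsilon = 0.3
```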

SVM: summary (Source: G. Shakhnarovich)

Main ideas:
Large margin classification
The kernel trick

What does the complexity/generalization ability of SVMs depend on?
The constant C
Choice of kernel and kernel parameters
Number of support vectors

A crucial component: a good QP solver. There are tons of off-the-shelf packages.

Advantages and disadvantages of SVMs

Advantages:
One of the most successful ML techniques!
Good generalization ability for small training sets
The kernel trick is powerful and flexible
The margin-based formalism can be extended to a large class of problems (regression, structured prediction, etc.)

Disadvantages:
Computational and storage complexity of training: quadratic in the size of the training set
In the worst case, the SVM can degenerate to a nearest-neighbor classifier (every training point becomes a support vector), while being much slower to train
No direct multi-class formulation