Statistical Machine Learning from Data


Statistical Machine Learning from Data
Support Vector Machines

Samy Bengio
IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
bengio@idiap.ch, http://www.idiap.ch/~bengio
January 18, 2006

Outline

1. Linear Support Vector Machines
2. Non-Linear SVMs, Kernels, and the Final Solution
3. Complexity and SMO
4. A Zoo of Kernel Methods

1. Linear Support Vector Machines

Setup

Training set: $(x_i, y_i)_{i=1}^n \in \mathbb{R}^d \times \{-1, 1\}$.
We would like to find a hyperplane $w \cdot x + b = 0$ (with $w \in \mathbb{R}^d$, $b \in \mathbb{R}$) which separates the two classes.

The Margin

Let $d_+$ be the shortest distance from the hyperplane to the closest positive example, and $d_-$ the shortest distance from the hyperplane to the closest negative example. Define the margin of the hyperplane to be $d_+ + d_-$. The simplest SVM looks for the separating hyperplane with the largest margin.

SVMs and the Margin (Graphical View)

[Figure: two classes separated by the hyperplane $w \cdot x + b = 0$; the parallel hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$ pass through the closest examples of each class, and the distance between them is the margin.]

Why is it Good to Maximize the Margin?

There are several justifications for favoring large margins, for instance: if training and test data come from the same distribution, and every test point lies within some distance $\rho$ of a training point, then a margin of $2\rho$ is enough to classify all test points correctly.

Why is it Good to Maximize the Margin? (cont'd)

If all points lie at a distance of at least $\rho$ from the separator, and all points are contained in a bounded sphere, then a small perturbation of the separator's definition will not hurt. Hence one can use fewer bits to encode the separating hyperplane. This is related to the Minimum Description Length principle: the best description of the data, in terms of generalization error, should be the one that requires the fewest bits to store.

Formulation of the SVM Problem

We can define the following constraints:
$w \cdot x_i + b \geq +1$ for $y_i = +1$
$w \cdot x_i + b \leq -1$ for $y_i = -1$
They can be combined as: $y_i (w \cdot x_i + b) - 1 \geq 0$.
One can show that $d_+ = d_- = 1/\|w\|$, with $\|w\|$ the Euclidean norm of $w$. Hence, the margin is simply $2/\|w\|$. So we would like to minimize
$\frac{\|w\|^2}{2}$
under the constraints $y_i (w \cdot x_i + b) - 1 \geq 0 \;\; \forall i$.
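
As a quick sanity check on this formulation, here is a minimal sketch using scikit-learn (a library choice of ours, not mentioned in the slides); a very large $C$ approximates the hard-margin problem on separable data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: class +1 in the upper-right, class -1 lower-left.
X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.0],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM (essentially no slack allowed).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # hyperplane parameters
margin = 2.0 / np.linalg.norm(w)         # margin = 2 / ||w||
print("w =", w, "b =", b, "margin =", margin)
print("support vectors:", clf.support_vectors_)
```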

A Constrained Optimization Problem

The normal way to solve an optimization problem with cost $C(w)$ and parameter $w$ is to set $\frac{\partial C}{\partial w} = 0$. Example: to minimize $C(w) = \frac{w^2}{2} - 3w$, set $\frac{\partial C}{\partial w} = w - 3 = 0$, which gives $w = 3$. When there are constraints $c_i \geq 0$, use Lagrange multipliers and verify the solution with the Karush-Kuhn-Tucker (KKT) conditions.

A Constrained Optimization Problem (cont'd)

Form the Lagrangian by subtracting one term for each constraint $c_i \geq 0$, weighted by a positive Lagrange multiplier:
$L(w, \alpha) = C(w) - \sum_i \alpha_i c_i$
We must now minimize $L$ with respect to $w$, subject to $\frac{\partial L}{\partial \alpha_i} = 0$ and $\alpha_i \geq 0 \;\forall i$. We can equivalently solve the dual problem: maximize $L$ with respect to $\alpha$, subject to $\frac{\partial L}{\partial w} = 0$ and $\alpha_i \geq 0 \;\forall i$. The general problem is to find a saddle point: $\max_\alpha \min_w L(w, \alpha)$.

Lagrangian Formulation for SVMs

We introduce one Lagrange multiplier $\alpha_i$, $i = 1, \ldots, n$, for each inequality constraint:
$L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^n \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right)$
$L$ has to be minimized w.r.t. the primal variables $w$ and $b$, and maximized w.r.t. the dual variables $\alpha_i$. At the extremum, we have $\frac{\partial L}{\partial w} = 0$ and $\frac{\partial L}{\partial b} = 0$.

Solve the Lagrangian

We have: $L = \frac{\|w\|^2}{2} - \sum_{i=1}^n \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right)$
Setting $\frac{\partial L}{\partial w} = w - \sum_{i=1}^n \alpha_i y_i x_i = 0$ gives $w = \sum_{i=1}^n \alpha_i y_i x_i$.
Setting $\frac{\partial L}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0$ gives $\sum_{i=1}^n \alpha_i y_i = 0$.

Substituting to get the Dual

Substituting $w = \sum_i \alpha_i y_i x_i$ into
$L = \frac{\|w\|^2}{2} - \sum_{i=1}^n \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right)$
gives
$L = \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j - b \sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i$
$= \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$
where the term in $b$ vanishes because $\sum_i \alpha_i y_i = 0$.

The Dual Formulation

We need to maximize:
$L = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$
subject to $\alpha_i \geq 0 \;\forall i$ and $\sum_{i=1}^n \alpha_i y_i = 0$.
This can be solved using classical quadratic programming optimization packages, based for instance on constrained gradient descent.
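
For instance, the dual can be handed to a generic QP solver. Below is a sketch using the cvxopt package (our choice; the slides do not name a solver), including recovery of $w$ and $b$ from the KKT conditions. It assumes linearly separable data, since the hard-margin dual is otherwise unbounded:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_dual(X, y):
    """Solve the hard-margin SVM dual with a generic QP solver.

    cvxopt minimizes (1/2) a'Pa + q'a  s.t.  Ga <= h, Aa = b,
    so we negate the dual objective to turn maximization into minimization.
    """
    n = len(y)
    Yx = y[:, None] * X
    P = matrix(Yx @ Yx.T)                        # P_ij = y_i y_j x_i . x_j
    q = matrix(-np.ones(n))                      # encodes -sum_i alpha_i
    G = matrix(-np.eye(n))                       # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))   # sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

    w = ((alpha * y)[:, None] * X).sum(axis=0)   # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                            # support vectors: alpha_i > 0
    b0 = np.mean(y[sv] - X[sv] @ w)              # b from y_i (w.x_i + b) = 1
    return alpha, w, b0
```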

The KKT Conditions

The following KKT conditions are satisfied at the solution:
$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0$
$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0$
$y_i (w \cdot x_i + b) - 1 \geq 0 \;\forall i$
$\alpha_i \geq 0 \;\forall i$
$\alpha_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0 \;\forall i$
The last condition can be used to estimate $b$ after $w$ has been found during training.

A Bug

This minimization problem does not have any solution if the two classes are not separable.

[Figure: an example $x_i$ on the wrong side of its margin hyperplane, at distance $\xi_i$ from it.]

Fixing the Bug: Soft Margin

Relax the constraints: use a soft margin instead of a hard margin. We would like to minimize:
$\frac{\|w\|^2}{2} + C \sum_{i=1}^n \xi_i$
under the constraints:
$y_i (w \cdot x_i + b) \geq 1 - \xi_i \;\forall i$
$\xi_i \geq 0 \;\forall i$

The Non-Separable Dual Formulation

We need to maximize:
$L = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$
subject to $0 \leq \alpha_i \leq C$, $i = 1, \ldots, n$, and $\sum_{i=1}^n \alpha_i y_i = 0$.
We then obtain $w$ and $b$ as follows:
$w = \sum_i \alpha_i y_i x_i$
$\alpha_i \left[ 1 - \xi_i - y_i (w \cdot x_i + b) \right] = 0$
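
Relative to the hard-margin sketch above, only the inequality block of the QP changes: the box constraint $0 \leq \alpha_i \leq C$ becomes two stacked inequalities. A minimal sketch, assuming the same cvxopt encoding as before:

```python
import numpy as np
from cvxopt import matrix

def soft_margin_box(n, C):
    # Encode 0 <= alpha_i <= C as -alpha_i <= 0 and alpha_i <= C;
    # everything else in the hard-margin QP stays the same.
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    return G, h
```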

Support Vectors

Note that the decision function can be rewritten as:
$\hat{y} = \mathrm{sign}\left( \sum_i \alpha_i y_i\, x_i \cdot x + b \right)$
Training examples $x_i$ with $\alpha_i \neq 0$ are support vectors.

[Figure: examples with $\alpha_i = 0$ lie outside the margin; support vectors with $0 < \alpha_i < C$ lie on the margin; those with $\alpha_i = C$ lie inside the margin or are misclassified.]

2. Non-Linear SVMs, Kernels, and the Final Solution

Non-Linear SVMs

Project the data into a higher dimensional space, where it should be easier to separate the two classes: given a function $\phi : \mathbb{R}^d \to \mathcal{F}$, work with $\phi(x_i)$ instead of $x_i$.

[Figure: data that is not linearly separable in $\mathbb{R}^2$ becomes linearly separable after mapping $\phi$ into a 3-dimensional feature space.]
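
To make the projection idea concrete, here is a small sketch (the data and mapping are illustrative choices, not from the slides): points inside versus outside a circle are not linearly separable in the plane, but a quadratic feature map makes them separable by a plane:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 0.5, 1, -1)   # label: inside/outside a circle

def phi(X):
    """Explicit quadratic feature map: R^2 -> R^3."""
    return np.column_stack([X[:, 0] ** 2,
                            np.sqrt(2) * X[:, 0] * X[:, 1],
                            X[:, 1] ** 2])

Z = phi(X)
# In feature space the classes are separated by the plane z1 + z3 = 0.5,
# since z1 + z3 = x1^2 + x2^2. A linear classifier on Z is now enough.
print(np.all((Z[:, 0] + Z[:, 2] < 0.5) == (y == 1)))   # True
```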

The Kernel Trick

Note that we only ever need the dot products $\phi(x_i) \cdot \phi(x_j)$. Unfortunately, computing them explicitly could be very expensive in a high dimensional space. Use instead a kernel: a function $k(x, z)$ which represents a dot product in a hidden feature space. Example: instead of
$\phi(x) = \left( x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2 \right)$ and $k(x, z) = \phi(x) \cdot \phi(z)$,
use $k(x, z) = (x \cdot z)^2$.
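
A quick numerical check of this identity (the input values are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel in R^2.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both sides compute the same number: (x . z)^2 = phi(x) . phi(z),
# but the kernel never builds the feature vectors.
print(np.dot(phi(x), phi(z)))   # 1.0
print(np.dot(x, z) ** 2)        # 1.0
```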

Common Kernels

Polynomial: $k(x, z) = (u\, x \cdot z + v)^p$ with $u \in \mathbb{R}$, $v \in \mathbb{R}$, $p \in \mathbb{N}^+$
Gaussian: $k(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$ with $\sigma \in \mathbb{R}^+$
Note that the function $k(x, z) = \tanh(u\, x \cdot z + v)$ is not a kernel (it is not positive semi-definite in general)!
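
Both kernels are straightforward to vectorize over whole datasets; a minimal numpy sketch (the parameter names $u$, $v$, $p$, $\sigma$ follow the slide):

```python
import numpy as np

def polynomial_kernel(X, Z, u=1.0, v=1.0, p=2):
    # k(x, z) = (u * x.z + v)^p, computed for all pairs at once.
    return (u * X @ Z.T + v) ** p

def gaussian_kernel(X, Z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); the squared distance is
    # expanded as ||x||^2 + ||z||^2 - 2 x.z to stay fully vectorized.
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2 * sigma ** 2))
```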

Mercer's Condition

Which functions are kernels? There exists a mapping $\phi$ and an expansion $k(x, z) = \sum_i \phi(x)_i\, \phi(z)_i$ if and only if, for any $g(x)$ such that $\int g(x)^2\, dx$ is finite,
$\int\!\!\int k(x, z)\, g(x)\, g(z)\, dx\, dz \geq 0$
In practice, a kernel gives rise to a positive semi-definite matrix (for example a symmetric similarity matrix).
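
The practical criterion is easy to probe numerically; a sketch checking that the Gaussian kernel's Gram matrix on random data is positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

# Gram matrix of the Gaussian kernel on a random sample: Mercer's condition
# implies all its eigenvalues are >= 0 (up to numerical noise).
sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True: positive semi-definite
```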

Final Solution

Maximize
$\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$
under the constraints $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$.
For any $i$ with $0 < \alpha_i < C$, compute $b$ using
$1 - y_i \left( \sum_j \alpha_j y_j\, k(x_j, x_i) + b \right) = 0$
Decision function: $\hat{y} = \mathrm{sign}\left( \sum_i \alpha_i y_i\, k(x_i, x) + b \right)$
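
A minimal sketch of this decision function (the helper name and argument layout are ours); only the support vectors need to be stored and summed over:

```python
import numpy as np

def svm_decision(x, support_X, support_y, alpha, b, kernel):
    """Kernelized SVM decision: sign(sum_i alpha_i y_i k(x_i, x) + b)."""
    k = np.array([kernel(xi, x) for xi in support_X])
    return np.sign(np.dot(alpha * support_y, k) + b)

# Usage with the quadratic kernel from earlier (alpha, b come from training):
quadratic = lambda x, z: np.dot(x, z) ** 2
```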

The KKT Conditions

The following KKT conditions are satisfied at the solution:
$\alpha_i = 0$ for all $i$ such that $y_i (w \cdot x_i + b) \geq 1$ (non-support vectors)
$0 < \alpha_i < C$ for all $i$ such that $y_i (w \cdot x_i + b) = 1$ (support vectors on the margin)
$\alpha_i = C$ for all $i$ such that $y_i (w \cdot x_i + b) \leq 1$ (support vectors inside the margin or misclassified)

Facts to Remember

SVMs maximize the margin (in the feature space).
Use the soft margin trick when the classes are not separable.
Project the data into a higher dimensional space to capture non-linear relations.
Kernels simplify the computation.
The Lagrangian method leads to a nice quadratic optimization problem under constraints.

SVMs in Practice

To tune the capacity, the kernel is the most important parameter to choose:
Polynomial kernel: increasing the degree $p$ increases the capacity.
Gaussian kernel: increasing $\sigma$ decreases the capacity.
Also tune $C$, the trade-off between the margin and the errors. For non-noisy data sets, $C$ usually does not have much influence. Choose $C$ carefully for noisy data sets: small values usually give better results.
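
In practice this tuning is often done by cross-validated grid search; a sketch with scikit-learn (our library choice, with arbitrary toy data). Note that scikit-learn parameterizes the Gaussian kernel by $\gamma = 1/(2\sigma^2)$, so capacity moves in the opposite direction of $\sigma$:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy data (illustrative): two noisy Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

# Large gamma = small sigma = high capacity; small C = wide, tolerant margin.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```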

3. Complexity and SMO

Complexity of the QP Problem

You need $O(n^2)$ memory just to store the kernel matrix! A naive optimization technique would then cost at least $O(n^3)$ to $O(n^4)$ time. What about datasets of 100,000 examples or more? Various approaches have been proposed:
Chunking: at each step, solve the QP problem with all non-zero $\alpha_i$ from the previous step, plus the $M$ worst examples violating the KKT conditions.
Decomposition: solve a series of smaller QP problems, where each one adds an example that violates the KKT conditions.
Sequential Minimal Optimization (SMO): solve the smallest possible optimization problem at each iteration.

SMO Framework

At every step, choose two Lagrange multipliers $\alpha_i$ to jointly optimize, with at least one violating the KKT conditions (there are several heuristics to select the most violating ones). Find the optimal value for these two multipliers analytically and update the SVM model. Because of the constraint $\sum_i \alpha_i y_i = 0$, the chosen pair $(\alpha_1, \alpha_2)$ is restricted to a line segment inside the box $[0, C]^2$: $\alpha_1 - \alpha_2 = k$ when $y_1 \neq y_2$, and $\alpha_1 + \alpha_2 = k$ when $y_1 = y_2$. This procedure converges to the optimum.
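
A compact version of this procedure is the "simplified SMO" often used for teaching, which picks the second multiplier at random rather than with Platt's heuristics. A sketch assuming a linear kernel; it illustrates the pairwise update and clipping rules, not the full production algorithm:

```python
import numpy as np

def smo_simplified(X, y, C=1.0, tol=1e-3, max_passes=10):
    """Simplified SMO for the soft-margin dual with a linear kernel."""
    n = len(y)
    K = X @ X.T                                     # linear kernel matrix
    alpha, b = np.zeros(n), 0.0
    rng = np.random.default_rng(0)
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]   # prediction error on x_i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(n) if k != i])  # random 2nd multiplier
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # Clip bounds keeping (alpha_i, alpha_j) on the constraint segment.
                if y[i] != y[j]:
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # Update b so the KKT conditions hold for the changed pair.
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] \
                     - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] \
                     - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```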

4. A Zoo of Kernel Methods

A Zoo of Kernel Methods in the Literature

To control explicitly the number of support vectors: ν-SVMs.
For regression problems: Support Vector Regression.
For density estimation or representation: Kernel PCA.
For generative models: the Fisher kernel.
For discrete sequences: string kernels.
...
How to design a kernel? Prior knowledge! It amounts to:
choosing a similarity measure between two examples in the data,
choosing a linear representation of the data,
choosing a feature space for learning.
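
Several of the variants listed above have off-the-shelf implementations; a quick orientation sketch with scikit-learn (the library and parameter values are our choices, not the slides'):

```python
from sklearn.svm import NuSVC, SVR
from sklearn.decomposition import KernelPCA

nu_svm = NuSVC(nu=0.1, kernel="rbf")            # nu upper-bounds the fraction of
                                                # margin errors, lower-bounds SVs
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)     # epsilon-insensitive regression
kpca = KernelPCA(n_components=2, kernel="rbf")  # non-linear representation learning
```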