Machine Learning 4771

Machine Learning 4771. Instructors: Adrian Weller and Ilia Vovsha

Lecture 3: Support Vector Machines. Dual Forms, Non-Separable Data, Kernels (Bishop Ch. 7, Burges Tutorial)

Dual Form Derivation

Recall the optimal hyperplane problem in primal space:

\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t. } \forall i: y_i(w^T x_i + b) \ge 1

This is a convex program; define the Lagrangian and find its stationary point:

L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right], \quad \alpha_i \ge 0

Minimize L over {w, b}, maximize over {\alpha}:

\frac{\partial L(w,b,\alpha)}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \ \Longrightarrow\ w = \sum_i \alpha_i y_i x_i

\frac{\partial L(w,b,\alpha)}{\partial b} = -\sum_i \alpha_i y_i = 0 \ \Longrightarrow\ \sum_i \alpha_i y_i = 0

Dual Form

This is a convex program; define the Lagrangian and find its stationary point:

L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right], \quad \alpha_i \ge 0

Minimize L over {w, b}, maximize over {\alpha}:

\frac{\partial L}{\partial w} = 0 \ \Longrightarrow\ w = \sum_i \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \ \Longrightarrow\ \sum_i \alpha_i y_i = 0, \qquad \alpha_i \ge 0

Plug these back into the Lagrangian to get the dual form:

\max_\alpha D(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t. } \sum_i \alpha_i y_i = 0, \ \alpha_i \ge 0
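
As an illustration (not part of the original slides), here is a minimal sketch of solving this dual numerically on a made-up 2-D toy set with scipy's SLSQP solver; the data, tolerances, and variable names are all assumptions for the example.

import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2-D data (made up for illustration).
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = (y[:, None] * y[None, :]) * (X @ X.T)        # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):
    # Negative of D(alpha); minimizing it maximizes the dual.
    return 0.5 * a @ G @ a - a.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * len(y)                        # alpha_i >= 0 (hard margin)
res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
print("alphas:", np.round(alpha, 3))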

Why Solve in Dual Space?

QP runs in cubic polynomial time (in the number of variables):
QP in primal space has complexity O(d^3), where d is the dimensionality of the input vectors (the weight vector).
QP in dual space has complexity O(N^3), where N is the number of examples.
More importantly, the dual space yields deeper results.

Primal: \min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t. } \forall i: y_i(w^T x_i + b) \ge 1

Dual: \max_\alpha D(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t. } \sum_i \alpha_i y_i = 0, \ \alpha_i \ge 0; \qquad \alpha^* \ \Longrightarrow\ w^* = \sum_i \alpha_i^* y_i x_i

Dual Form Properties

1. We obtain the solution vector w* from the alphas. The vector is a weighted sum of the examples; if a weight (alpha) is zero, the corresponding example is not relevant.
2. Both the hyperplane and the objective of the optimization problem do not explicitly depend on the dimensionality of the vectors (only on the inner products).
3. The contribution from each class is equal (regardless of class size).

(1) w^* = \sum_i \alpha_i^* y_i x_i
(2a) f(x) = (w^*)^T x + b^* = \sum_i \alpha_i^* y_i (x_i \cdot x) + b^*
(2b) \max_\alpha D(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
(3) \sum_i \alpha_i y_i = 0 \ \Longrightarrow\ \sum_{i: y_i = +1} \alpha_i = \sum_{i: y_i = -1} \alpha_i

Support Vectors

1. The value of b* is chosen to maximize the margin. The optimal values of {w*, b*} must satisfy the KKT conditions.
2. From (1), we can conclude that nonzero alphas correspond to vectors that satisfy the equality constraint and hence are the closest to the optimal hyperplane. We call them support vectors (SVs).
3. The optimal hyperplane is unique, but the expansion on the SVs is not.
4. To compute the optimal value of b: consider b for each SV and average the values to smooth out numerical errors.

(1) \forall i: \alpha_i^* \left[ y_i((w^*)^T x_i + b^*) - 1 \right] = 0
(2) \alpha_i^* > 0 \ \Longrightarrow\ y_i((w^*)^T x_i + b^*) = 1
(4) \forall i: \alpha_i^* > 0, \ y_i(w^T x_i + b) = 1 \ \Longrightarrow\ \hat{b}_i = y_i - w^T x_i, \quad b = \mathrm{avg}\{\hat{b}_i\}
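
Continuing the toy sketch above (again illustrative, not from the slides): the support vectors are the points with nonzero alpha, and b* is averaged over them exactly as in (4). The tolerance and names are arbitrary.

tol = 1e-6                                # numerical threshold for "nonzero" alpha
sv = alpha > tol                          # support vectors
w = (alpha * y) @ X                       # w* = sum_i alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)            # b_hat_i = y_i - w^T x_i, averaged over SVs
predict = lambda Xnew: np.sign(Xnew @ w + b)
print("SV indices:", np.where(sv)[0], " w*:", w.round(3), " b*:", float(b))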

Sparse Solution

If most alphas are zero, the solution is sparse. Sparsity is useful for several reasons:
1. Examples with zero alphas are non-support vectors that can be ignored at test time (computationally faster).
2. Only some of the data is relevant to the learning machine.
3. Few SVs ⇒ easier problem ⇒ better generalization.
4. Given large amounts of data, we can do the optimization more efficiently.

w^* = \sum_i \alpha_i^* y_i x_i, \qquad f(x) = (w^*)^T x + b^* = \sum_i \alpha_i^* y_i (x_i \cdot x) + b^*

Non-Separable Sets

What happens if the data is (linearly) non-separable? There is no solution, since not all constraints can be satisfied; the corresponding alphas go to infinity.

Idea: instead of perfectly classifying each point, relax the problem by introducing non-negative slack variables ξ_i that allow mistakes (while minimizing them). Instead of the hard margin, we get the soft margin formulation.

New set of constraints: \forall i: y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0

Δ-Margin Separating Hyperplanes

To construct the Δ-margin separating hyperplane for the linearly non-separable case, we consider the following problem:

\min_{w,b,\xi} F(\xi) = \sum_i \xi_i \quad \text{s.t. } \forall i: y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \|w\|^2 \le \frac{1}{\Delta^2}

Why this particular form? Recall that the margin of the canonical hyperplane (with f(x) = \pm 1 on the margin boundary) is \Delta = 1/\|w\|, and the VC dimension of Δ-margin hyperplanes is bounded by h \le \min\left( \frac{r^2}{\Delta^2}, N \right) + 1, where r = \max_i \|x_i\|.

SOCP Primal Form (soft margin)

Note: we are doing SRM (structural risk minimization). Effectively we are constructing a structure, where each element includes all hyperplanes with margin of at least some value Δ.

We need to specify Δ (or a range of Δs) and somehow pick the best one. For each Δ we would solve a second-order cone program (SOCP), since the additional constraint on the norm of w is a second-order cone constraint. It might be simpler computationally to consider an equivalent QP (actually, it is not clear that the SOCP is more expensive).

\min_{w,b,\xi} F(\xi) = \sum_i \xi_i \quad \text{s.t. } \forall i: y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \|w\|^2 \le \frac{1}{\Delta^2}

QP Primal Form (soft margin)

Another approach: modify the primal form for the linearly separable case by adding slack variables:

\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } \forall i: y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0

The penalty parameter C penalizes each mistake and balances empirical error against capacity (it is a regularization parameter). We need to specify C (or a range of Cs) and somehow pick the best one; for each C we need to solve a QP.
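
This objective can equivalently be written as a regularized hinge loss, \frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i(w^T x_i + b)). Below is a rough subgradient-descent sketch of that equivalent form; the learning rate, epoch count, and function name are arbitrary choices, not the lecture's algorithm.

import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    # Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)) by subgradient descent.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                                   # points with nonzero slack
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Usage: w, b = soft_margin_svm(X, y, C=1.0)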

Soft Margin Dual Form

To obtain the dual from the primal, we follow the same derivation procedure we used in the separable (hard margin) case (define the Lagrangian, find the saddle point, plug back in):

Primal: \min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0

Dual: \max_\alpha D(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t. } \sum_i \alpha_i y_i = 0, \ 0 \le \alpha_i \le C

The dual is almost identical to the one for the separable case, but now the alphas cannot grow beyond the upper bound C.

Soft Margin Properties

As we try to enforce a classification for a data point, its Lagrange multiplier alpha keeps growing. Clamping alpha so it stops growing at C makes the machine give up on those non-separable points (mistakes). Mechanical analogy: support vector forces and torques.

The optimal values of {w*, b*} must satisfy the KKT conditions. To compute the optimal value of b: consider b for each SV whose alpha value is < C (the unbounded support vectors) and average the values to smooth out numerical errors.

0 < \alpha_i^* < C \ \Longrightarrow\ \xi_i = 0 \ \Longrightarrow\ y_i((w^*)^T x_i + b^*) = 1 \ \Longrightarrow\ \hat{b}_i = y_i - (w^*)^T x_i, \quad b = \mathrm{avg}\{\hat{b}_i\}

QP vs. SOCP

Why is the QP form (primal and dual) solved by default in practice? The obvious answer is computational reasons (but this is also a great example of herd mentality). From an optimization perspective, QPs are easier (faster) to solve than SOCPs. However, this is not necessarily the case if we use decomposition methods (break the problem into parts, take many small steps instead of a few large ones).

The optimal solution sets coincide (for each Δ we can find a C which yields the same hyperplane H). Observe that Δ is directly related to the VC dimension; that is, Δ is (potentially) a more intuitive parameter to set.

SOCP form: \min_{w,b,\xi} F(\xi) = \sum_i \xi_i \quad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \|w\|^2 \le \frac{1}{\Delta^2}

QP form: \min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0

Support Vector Machines (idea)

What happens if the problem is nonlinear? We can only use linear decision rules (that is the problem setting), but using them in input space would give poor performance!

Idea:
1. Map the input vectors {x} into some high-dimensional feature space Z through some nonlinear mapping ϕ chosen a priori.
2. In the feature space, construct an optimal (linear) hyperplane.

[Diagram: an input vector (x(1), x(2), ..., x(n)) is mapped by ϕ(x) to a feature vector (z(1), z(2), ..., z(m)), where the hyperplane H is constructed.]

Support Vector Machines

Note: we have seen this idea before when we discussed regression; the nonlinear mapping is akin to using basis functions. Vectors in the feature space are called feature vectors. To generalize the original problem, we replace all input vectors with feature vectors:

x_i \to \Phi(x_i), \qquad (x_i \cdot x_j) \to (\Phi(x_i) \cdot \Phi(x_j))

D(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \left( \Phi(x_i) \cdot \Phi(x_j) \right)

w^* = \sum_i \alpha_i^* y_i \Phi(x_i), \qquad f(x) = \sum_i \alpha_i^* y_i \left( \Phi(x_i) \cdot \Phi(x) \right) + b^*

We obtain a nonlinear classifier in the original space.
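
A small sketch of this explicit-mapping route (before kernels): map the inputs with the quadratic Φ used in the later example and reuse any linear solver, e.g. the soft_margin_svm sketch above. Everything here is illustrative, not the lecture's code.

import numpy as np

def phi(X):
    # x = (x1, x2)  ->  (x1^2, sqrt(2) x1 x2, x2^2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# Train a linear SVM on phi(X) instead of X; the separating hyperplane in
# feature space is a quadratic decision boundary in the original input space.
# w_z, b_z = soft_margin_svm(phi(X), y, C=1.0)
# f = lambda Xnew: np.sign(phi(Xnew) @ w_z + b_z)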

Kernels (idea)

So far we assumed that the nonlinear mapping is explicit, i.e. we had to specify how each element of the feature vector is obtained from the input vector.

Example (quadratic polynomial): \Phi(x) = [x_1^2, \sqrt{2}\, x_1 x_2, x_2^2]

Curse of dimensionality: what if our feature space has dimensionality of one billion?

Example (polynomials): consider d-dimensional data and polynomials of order p. The explicit mapping yields a huge feature space, \dim(H) = \binom{d+p}{p}. For example, images of size 16x16 with p = 4 have \dim(H) \approx 183 million.

Observe that the algorithm depends on the data only through dot products, so we can define a kernel function which considers the feature space implicitly.
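
To see the blow-up concretely, one can simply count monomials (this calculation is mine, not from the slides; the quoted figure corresponds to counting monomials of degree exactly p, while counting all degrees up to p gives a slightly larger number of the same order):

from math import comb

d, p = 16 * 16, 4                  # 16x16 images, 4th-order polynomial features
print(comb(d + p - 1, p))          # 183,181,376  monomials of degree exactly p
print(comb(d + p, p))              # 186,043,585  monomials of degree at most p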

Kernels

The concept of kernel functions is old (1950s), powerful, and general. For any algorithm that depends on the data only through dot (inner) products, we can replace each inner product with a general kernel function. The kernel function defines a similarity measure according to which we compare each pair of examples. An arbitrary function can be used as a kernel if it satisfies Mercer's condition.

x_i \to \Phi(x_i), \qquad (x_i \cdot x_j) \to K(x_i, x_j)

D(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

w^* = \sum_i \alpha_i^* y_i \Phi(x_i), \qquad f(x) = \sum_i \alpha_i^* y_i K(x_i, x) + b^*

Mercer's Condition

Theorem (Mercer): A continuous symmetric function K(u,v) has an expansion of the form (1) below if and only if condition (2) holds. We then say that K(u,v) describes an inner product in some feature space.

(1) K(u,v) = \sum_{k=1}^{\infty} a_k\, z_u(k)\, z_v(k), \quad \forall k: a_k > 0

(2) \iint K(u,v)\, g(u)\, g(v)\, du\, dv \ge 0, \quad \forall g

We assume that g is a reasonable (finite-norm) function defined on the same domain as K.
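
A practical finite-sample proxy for this condition (a sketch, and weaker than the theorem itself): on any finite set of points, a valid kernel must produce a symmetric positive semi-definite Gram matrix, which we can check via its eigenvalues. The kernels, tolerance, and sample here are illustrative.

import numpy as np

def gram(K, X):
    # Gram matrix of kernel K over the rows of X.
    return np.array([[K(xi, xj) for xj in X] for xi in X])

def looks_psd(G, tol=1e-8):
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol

X = np.random.randn(20, 2)
K_quad = lambda u, v: (u @ v) ** 2             # quadratic kernel: valid (PSD Gram)
K_bad = lambda u, v: -np.linalg.norm(u - v)    # not a Mercer kernel (negative eigenvalues)
print(looks_psd(gram(K_quad, X)), looks_psd(gram(K_bad, X)))   # True False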

Example: Quadratic Polynomial

To solve the optimization problem we need to evaluate the kernel for each pair of examples in the data set (recall that the objective contains a sum over all {i, j}). Mercer's condition requires that the Gram matrix (the matrix of all pairwise kernel values) is positive semi-definite (PSD).

Concrete example: \Phi(x) = [x_1^2, \sqrt{2}\, x_1 x_2, x_2^2]

K(x, x') = \Phi(x) \cdot \Phi(x') = x_1^2 x_1'^2 + 2 x_1 x_1' x_2 x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = (x \cdot x')^2

Gram matrix (for three examples):
K = \begin{pmatrix} K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) \\ K(x_2,x_1) & K(x_2,x_2) & K(x_2,x_3) \\ K(x_3,x_1) & K(x_3,x_2) & K(x_3,x_3) \end{pmatrix}
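
A two-line numerical check of that identity (the specific points are arbitrary):

import numpy as np

phi = lambda x: np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])
x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp), (x @ xp) ** 2)   # both equal 1.0:  <phi(x), phi(x')> = (x . x')^2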

Typical Kernels

Polynomial: K(x_i, x_j) = (x_i^T x_j + 1)^p

RBF: K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)

[Figures: Polynomial kernel; RBF kernel.]
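
The same two kernels written out as code (a sketch; p and sigma are free parameters to be tuned):

import numpy as np

def poly_kernel(xi, xj, p=3):
    return (xi @ xj + 1.0) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))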

SVMs as Neural Nets

An SVM can be viewed as a two-layer feed-forward neural network: the first layer selects the basis (a similarity measure w.r.t. each SV), and the second layer constructs a linear function in the space selected by the first layer.

f(x) = \sum_{i=1}^{N_{SV}} \alpha_i^* y_i K(x_i, x) + b^*, \qquad \hat{y} = \mathrm{sign}\!\left( \sum_i \alpha_i^* y_i K(x_i, x) + b^* \right)

[Diagram: inputs x(1), ..., x(n) feed into hidden units K(x_1, x), K(x_2, x), ..., K(x_{N_{SV}}, x); their outputs are weighted by y_i \alpha_i, summed with b, and passed through sign.]
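
The two layers written directly as code (illustrative; the argument names are mine): the first line is the hidden layer of kernel evaluations against the support vectors, the second the weighted linear output.

import numpy as np

def svm_predict(x, sv_X, sv_y, sv_alpha, b, kernel):
    hidden = np.array([kernel(xi, x) for xi in sv_X])   # layer 1: K(x_i, x) per SV
    return np.sign(sv_alpha @ (sv_y * hidden) + b)      # layer 2: sign(sum_i a_i y_i K + b)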

SVMs in Practice

Most real-world problems are non-trivial (non-separable): we must use a soft margin formulation. Most real-world problems are nonlinear (at least globally): we need to use kernels (the radial basis function is the most popular choice). The soft margin form with kernels has two parameters to tune (C and the kernel parameter).

Parameter search problem: we must specify a grid of parameters, solve a QP for each, and select the best pair. We often use cross-validation for this purpose. Note: leave-one-out CV is very efficient if we have few support vectors!

Large-scale data requires special treatment (we cannot store the kernel matrix in memory). Free software implementing decomposition methods: LIBSVM, SVMlight.

Feature selection is not obvious when using kernels (the hyperplane is defined only implicitly): w^* = \sum_i \alpha_i^* y_i \Phi(x_i)
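
A typical parameter search in practice might look like the sketch below (assuming scikit-learn with an RBF kernel; X, y stand for your data and the grid values are arbitrary):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.1, 1, 10, 100],          # soft-margin penalty
              "gamma": [0.01, 0.1, 1]}          # RBF width, gamma = 1 / (2 sigma^2)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X, y)
# print(search.best_params_, search.best_score_)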