Support Vector Machines


Support Vector Machines
Sridhar Mahadevan (mahadeva@cs.umass.edu)
University of Massachusetts, CMPSCI 689

Margin Classifiers

[Figure: two classes of points separated by the hyperplane $\langle w, x\rangle - b = 0$, with the margin marked on either side.]

Optimal Margin Classification

Consider the problem of finding a set of weights $w$ that produces a hyperplane with the maximum geometric margin:

$$\max_{\gamma, w, b} \ \gamma \quad \text{such that} \quad y_i(\langle w, x_i\rangle - b) \ge \gamma, \ i = 1,\dots,m, \quad \|w\| = 1$$

We eliminate the non-convex constraint $\|w\| = 1$ as follows:

$$\min_{w} \ \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y_i(\langle w, x_i\rangle - b) \ge 1, \ i = 1,\dots,m$$
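
As a concrete illustration of this primal problem, here is a minimal sketch that solves it with cvxpy (assumed available); the toy data set and variable names are invented for the example, not part of the slides.

```python
# Hard-margin primal: min (1/2)||w||^2  s.t.  y_i (<w, x_i> - b) >= 1
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [0.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w - b) >= 1])
prob.solve()

print("w =", w.value, " b =", b.value)
print("geometric margin =", 1.0 / np.linalg.norm(w.value))
```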

Lagrange Dual Formulation

The primal optimal margin classification problem can be formulated as

$$\min_w f(w) \quad \text{such that} \quad g_i(w) \ge 0, \ i = 1,\dots,k \quad \text{and} \quad h_i(w) = 0, \ i = 1,\dots,l$$

The dual problem can be formulated using Lagrange multipliers as $\max_{\alpha, \beta : \alpha \ge 0} L_D(\alpha, \beta)$, where

$$L_D(\alpha, \beta) = \min_w \left( f(w) - \sum_{i=1}^{k} \alpha_i g_i(w) - \sum_{i=1}^{l} \beta_i h_i(w) \right)$$

Lagrange Dual Formulation

Weak Duality Theorem: The dual formulation always produces a value that is upper bounded by the optimal value of the primal problem.

Strong Duality Theorem: The optimal value of the Lagrange dual is exactly equal to the primal optimal value, provided that $f(w)$ is convex, the feasible region defined by the constraints $g_i(w) \ge 0$ is convex (each $g_i$ is concave; in the SVM case each $g_i$ is affine), and each $h_i(w)$ is an affine function, i.e., $h_i(w) = \langle a_i, w\rangle - b_i$.

Weak Duality Theorem

Suppose $w$ is a feasible solution to the primal problem, and that $\alpha$ and $\beta$ constitute a feasible solution to the dual problem. Then

$$L_D(\alpha, \beta) = \min_u L(u, \alpha, \beta) \ \le\ L(w, \alpha, \beta) = f(w) - \sum_i \alpha_i g_i(w) - \sum_i \beta_i h_i(w) \ \le\ f(w)$$

This implies the following condition:

$$\max_{\alpha, \beta : \alpha \ge 0} L_D(\alpha, \beta) \ \le\ \min_w \{ f(w) : g_i(w) \ge 0, \ h_i(w) = 0 \}$$

Sparsity of Parameters

Corollary: Let $w^*$ be a weight vector that satisfies the primal constraints, and let $\alpha^*, \beta^*$ be Lagrange multipliers that satisfy the dual constraints, with

$$f(w^*) = L_D(\alpha^*, \beta^*), \quad \alpha_i^* \ge 0, \quad g_i(w^*) \ge 0, \quad h_i(w^*) = 0$$

Then $\alpha_i^*\, g_i(w^*) = 0$ for $i = 1,\dots,k$.

The proof follows easily by noting that the inequality

$$f(w^*) - \sum_i \alpha_i^* g_i(w^*) - \sum_i \beta_i^* h_i(w^*) \ \le\ f(w^*)$$

becomes an equality only when $\alpha_i^* g_i(w^*) = 0$ for $i = 1,\dots,k$, since each term $\alpha_i^* g_i(w^*)$ is non-negative.

Saddle Point Function

[Figure: surface plot of the saddle function $x^2 - y^2$ over $x, y \in [-10, 10]$.]

Duality Gap and Saddle Points

Define a saddle point as a triple $(w^*, \alpha^*, \beta^*)$, where $w^* \in \Omega$, $\alpha^* \ge 0$, and for all $w \in \Omega$, $\alpha \ge 0$, $\beta$:

$$L(w^*, \alpha, \beta) \ \le\ L(w^*, \alpha^*, \beta^*) \ \le\ L(w, \alpha^*, \beta^*)$$

Theorem: The triple $(w^*, \alpha^*, \beta^*)$ is a saddle point if and only if $w^*$ is a solution to the primal problem, $(\alpha^*, \beta^*)$ is a solution to the dual problem, and there is no duality gap, so $f(w^*) = L_D(\alpha^*, \beta^*)$.

Strong Duality Theorem: If $f(w)$ is convex, $w \in \Omega$ where $\Omega$ is a convex set, and the $g_i, h_i$ are affine functions, then the duality gap is 0.

Karush-Kuhn-Tucker Conditions

Assume $f(w)$ is convex, each constraint function $g_i(w)$ is concave (affine in the SVM case), and each $h_i(w)$ is an affine function, i.e., $h_i(w) = \langle a_i, w\rangle - b_i$. Let there be at least one $w$ such that $g_i(w) > 0$ for all $i$ (Slater's condition). Then there exist $w^*, \alpha^*, \beta^*$ such that $w^*$ solves the primal, $(\alpha^*, \beta^*)$ solves the dual, the duality gap is 0, and the following KKT conditions hold:

$$\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1,\dots,n \qquad (1)$$

$$\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1,\dots,l \qquad (2)$$

$$\alpha_i^*\, g_i(w^*) = 0, \quad i = 1,\dots,k \qquad (3)$$

$$g_i(w^*) \ge 0, \quad i = 1,\dots,k \qquad (4)$$

$$\alpha_i^* \ge 0, \quad i = 1,\dots,k \qquad (5)$$
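
As a quick illustration of these conditions (a worked example added here, not from the slides), consider minimizing $f(w) = \frac{1}{2}w^2$ subject to the single constraint $g(w) = w - 1 \ge 0$. The Lagrangian is

$$L(w, \alpha) = \tfrac{1}{2}w^2 - \alpha(w - 1)$$

Stationarity (1) gives $w^* = \alpha^*$, and complementary slackness (3) gives $\alpha^*(w^* - 1) = 0$. If $\alpha^* = 0$ then $w^* = 0$, which violates feasibility (4); so $w^* = 1$, $\alpha^* = 1$, which satisfies all five conditions.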

Support Vectors

We can formulate the classification problem as:

$$\min_w \ \frac{1}{2}\|w\|^2 \quad \text{such that} \quad g_i(w) = y_i(\langle w, x_i\rangle - b) - 1 \ \ge\ 0, \quad i = 1,\dots,m$$

The KKT conditions imply that the instances with $\alpha_i > 0$ are exactly those whose functional margin equals 1 (because then $g_i(w) = 0$). Since the constraint is tight only for the points with the smallest margin, we have nonzero $\alpha_i$ only for the points closest to the decision boundary. These are called the support vectors.

Dual Form

We can write the Lagrangian for our optimal margin classifier as

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left( y_i(\langle w, x_i\rangle - b) - 1 \right)$$

To solve the dual form, we first minimize with respect to $w$ and $b$, and then maximize w.r.t. $\alpha$:

$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^m \alpha_i y_i x_i$$

$$\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^m \alpha_i y_i = 0$$

Support Vectors

We can simplify the Lagrangian into the following form:

$$\max_\alpha \ \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle \quad \text{s.t.} \quad \alpha_i \ge 0 \ \text{ and } \ \sum_i \alpha_i y_i = 0$$

Support Vectors

Given the maximizing $\alpha_i$, we use the equation $w = \sum_{i=1}^m \alpha_i y_i x_i$ to find the maximizing $w$. A new instance $x$ is classified using a weighted sum of inner products (over only the support vectors!):

$$\langle w, x\rangle - b = \sum_{i=1}^m \alpha_i y_i \langle x_i, x\rangle - b = \sum_{i \in SV} \alpha_i y_i \langle x_i, x\rangle - b$$

The intercept term $b$ can be found from the primal constraints:

$$b = \frac{\max_{y_i = -1} \langle w, x_i\rangle + \min_{y_i = 1} \langle w, x_i\rangle}{2}$$
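
Putting the last two slides together, here is a minimal numerical sketch (assuming cvxpy and a toy separable data set chosen for illustration): solve the dual QP, read off the support vectors, and recover $w$ and $b$. For simplicity the intercept is averaged over the tight constraints $y_i(\langle w, x_i\rangle - b) = 1$ rather than using the max/min formula above.

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -1.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

K = X @ X.T                                   # Gram matrix <x_i, x_j>
P = np.outer(y, y) * K + 1e-8 * np.eye(m)     # small jitter keeps quad_form PSD

alpha = cp.Variable(m)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, P))
constraints = [alpha >= 0, alpha @ y == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
sv = a > 1e-6                                 # support vectors: alpha_i > 0
w = (a * y) @ X                               # w = sum_i alpha_i y_i x_i
# From a tight constraint y_i(<w, x_i> - b) = 1  =>  b = <w, x_i> - y_i
b = np.mean(X[sv] @ w - y[sv])

print("support vectors:", np.where(sv)[0])
print("decision values:", np.sign(X @ w - b))   # should reproduce y
```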

Geometric Margin

Theorem: Consider a linearly separable set of instances $(x_1, y_1), \dots, (x_m, y_m)$, and suppose $\alpha^*, b^*$ is a solution to the dual optimization problem. Then the geometric margin can be expressed as

$$\gamma = \frac{1}{\|w\|} = \frac{1}{\sqrt{\sum_{i \in SV} \alpha_i^*}}$$

Geometric Margin

Proof: Due to the KKT conditions, it follows that for all support vectors $j \in SV$,

$$y_j f(x_j, \alpha^*, b^*) = y_j \left( \sum_{i \in SV} y_i \alpha_i^* \langle x_i, x_j\rangle - b^* \right) = 1$$

Therefore, using $\sum_j \alpha_j^* y_j = 0$,

$$\|w\|^2 = \left\langle \sum_i \alpha_i^* y_i x_i, \ \sum_j \alpha_j^* y_j x_j \right\rangle = \sum_{j \in SV} \alpha_j^* y_j \sum_{i \in SV} \alpha_i^* y_i \langle x_i, x_j\rangle = \sum_{j \in SV} \alpha_j^* (1 + y_j b^*) = \sum_{j \in SV} \alpha_j^*$$

Dealing with Nonseparable Data

[Figure: examples of nonseparable data, contrasting a high-variance, low-bias decision boundary with a high-bias, low-variance one.]

Soft Margin Classifiers

We reformulate the concept of margin to allow misclassifications. The slack variable $\xi_i$ represents the extent to which a margin constraint is violated:

$$y_i(\langle w, x_i\rangle - b) \ \ge\ 1 - \xi_i \quad \text{where } \xi_i \ge 0, \ i = 1,\dots,l$$

Soft Margin Classifiers

Similar to ridge regression, define a constant $C$ which controls the extent to which we are willing to tolerate errors. A soft-margin classifier solves the following constrained optimization problem:

$$\text{Minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i^2$$

$$\text{subject to} \quad y_i(\langle w, x_i\rangle - b) \ \ge\ 1 - \xi_i, \quad i = 1,\dots,l, \quad \text{where } \xi_i \ge 0, \ i = 1,\dots,l$$
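
For intuition, here is a minimal soft-margin example using scikit-learn (assumed available); note that sklearn's SVC penalizes the slacks linearly, $C\sum_i \xi_i$ (the standard L1 soft margin), rather than the squared slacks written on this slide, and the data set here is invented for the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])
X[0] = [-2, -2]          # inject an outlier into the positive class

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```

Small $C$ tolerates the outlier and keeps a wide margin; large $C$ tries harder to classify every point, at the cost of a narrower margin.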

Sequential Minimal Optimization

SMO uses coordinate ascent: to maximize $F(\alpha_1,\dots,\alpha_n)$, pick some $\alpha_i$ and optimize it while holding all the other parameters fixed. The (soft-margin) dual problem is

$$\max_\alpha \ \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle \quad \text{s.t.} \quad 0 \le \alpha_i \le C \ \text{ and } \ \sum_i \alpha_i y_i = 0$$

Since $\alpha_1 = -y_1 \sum_{i=2}^m \alpha_i y_i$, the equality constraint prevents us from updating a single multiplier on its own, but we can update any pair.

SMO

If we pick $\alpha_1$ and $\alpha_2$, we know that

$$y_1 \alpha_1 + y_2 \alpha_2 = -\sum_{i=3}^m y_i \alpha_i = \varsigma$$

This implies that

$$\alpha_1 = y_1(\varsigma - y_2 \alpha_2)$$

This equation defines a line; $\alpha_1$ and $\alpha_2$ must lie on this line (inside the box $0 \le \alpha_1, \alpha_2 \le C$) to be a feasible solution. The objective function can then be reformulated as a quadratic function of $\alpha_2$ alone and maximized analytically to get the new values of $\alpha_2$ and $\alpha_1$.
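
The pairwise update can be sketched in code. The following is a compact "simplified SMO" in the spirit of this slide, not the slides' exact algorithm: the second multiplier is chosen at random, the toy data set is invented, and the code uses the convention $f(x) = \langle w, x\rangle + b$, which differs from the slides' $\langle w, x\rangle - b$ only in the sign of $b$.

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=20, seed=0):
    rng = np.random.RandomState(seed)
    m = X.shape[0]
    K = X @ X.T
    alpha, b = np.zeros(m), 0.0
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            Ei = (alpha * y) @ K[:, i] + b - y[i]          # f(x_i) - y_i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(m) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai, aj = alpha[i], alpha[j]
                # Bounds keeping the pair on its line inside the box [0, C]^2
                if y[i] != y[j]:
                    L, H = max(0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0, ai + aj - C), min(C, ai + aj)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj) < 1e-5:
                    continue
                alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])
                # Update the threshold from whichever multiplier is unbounded
                b1 = b - Ei - y[i] * (alpha[i] - ai) * K[i, i] - y[j] * (alpha[j] - aj) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai) * K[i, j] - y[j] * (alpha[j] - aj) * K[j, j]
                b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Tiny separable example: the alphas should end up nonzero only for support vectors.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [0.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = smo_train(X, y, C=10.0)
w = (alpha * y) @ X
print("alpha =", np.round(alpha, 3), " predictions:", np.sign(X @ w + b))
```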

SMO

[Figure: the box constraint $0 \le \alpha_1, \alpha_2 \le C$; the feasible pair lies on a diagonal line segment inside the box, giving the clipping bounds $L$ and $H$.]

ε-insensitive Loss

[Figure: the ε-insensitive loss $L$ plotted against $y - (\langle w, x\rangle - b)$; the loss is zero inside a tube of width $2\varepsilon$ and grows outside it.]

SVM Regression

We introduce two slack variables $\xi_i$ and $\hat{\xi}_i$, which represent the penalty for exceeding or falling below the target value by more than $\epsilon$. The primal problem can be formulated as

$$\text{Minimize} \quad \|w\|^2 + \lambda \sum_{i=1}^{l} \left( \xi_i^2 + \hat{\xi}_i^2 \right)$$

$$\text{subject to} \quad (\langle w, x_i\rangle - b) - y_i \ \le\ \epsilon + \xi_i, \quad i = 1,\dots,l$$

$$\text{and} \quad y_i - (\langle w, x_i\rangle - b) \ \le\ \epsilon + \hat{\xi}_i, \quad i = 1,\dots,l$$

$$\text{where} \quad \xi_i, \hat{\xi}_i \ge 0, \quad i = 1,\dots,l$$
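
Here is a minimal ε-insensitive regression example with scikit-learn's SVR (assumed available). Note that SVR uses the linear penalty $C\sum_i (\xi_i + \hat{\xi}_i)$ rather than the squared slacks on this slide, but the ε-tube idea is the same; the data set is a toy sine curve invented for the example.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
# Only points outside (or on the boundary of) the epsilon-tube become support vectors.
print("support vectors:", len(svr.support_), "of", len(X))
print("prediction at x=0:", svr.predict([[0.0]])[0])
```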

Mercer's Theorem

Theorem: Given a function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, $K$ constitutes a kernel if, for any finite set of instances $x_i$, $1 \le i \le n$, the corresponding kernel (or Gram) matrix is symmetric and positive semi-definite. The Gram matrix of $k(x, z)$ is the matrix $K = (k(x_i, x_j))_{i,j=1}^n$.

Mercer's Theorem

Let us restrict our attention to kernels whose Gram matrices are positive semi-definite, i.e., whose eigenvalues are non-negative. Then we know that

$$K = \lambda_1 v_1 v_1^T + \dots + \lambda_n v_n v_n^T = \sum_{i=1}^n \lambda_i v_i v_i^T$$

Consider the nonlinear mapping

$$\phi : x_i \mapsto \left( \sqrt{\lambda_t}\, v_{ti} \right)_{t=1}^n$$

Then we can see that

$$\langle \phi(x_i), \phi(x_j) \rangle = \sum_{t=1}^n \lambda_t v_{ti} v_{tj} = K(x_i, x_j)$$
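
The finite-dimensional construction above is easy to check numerically. The sketch below (assuming numpy; the RBF kernel and random data are chosen only for illustration) eigendecomposes a Gram matrix and verifies that the feature map reproduces it.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 3)

# RBF (Gaussian) kernel Gram matrix, which is symmetric positive semi-definite.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)

lam, V = np.linalg.eigh(K)          # columns of V are the eigenvectors v_t
lam = np.clip(lam, 0.0, None)       # clip tiny negative values from round-off
Phi = V * np.sqrt(lam)              # row i of Phi is phi(x_i)

print(np.allclose(Phi @ Phi.T, K))  # True: <phi(x_i), phi(x_j)> == K_ij
```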

Making New Kernels from Old

Let $K_1$ and $K_2$ be two kernels defined over the same input space, $\mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$.

Question: Is $K(x, y) = K_1(x, y) K_2(x, y)$ also a kernel?

Solution: Since $K_1$ and $K_2$ are kernels, Mercer's theorem tells us that their Gram matrices are positive semi-definite: $\alpha^T K_1 \alpha \ge 0$ and $\alpha^T K_2 \alpha \ge 0$ for all vectors $\alpha$. The Gram matrix of $K$ is the elementwise (Hadamard) product $K_1 \circ K_2$, and by the Schur product theorem the Hadamard product of two positive semi-definite matrices is again positive semi-definite. Thus

$$\alpha^T K \alpha = \alpha^T (K_1 \circ K_2)\, \alpha \ \ge\ 0$$

This makes $K$ a kernel as well.
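
A quick numerical illustration of the Schur product argument (not a proof; the linear and RBF kernels and random data here are just example choices):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(30, 4)

K1 = X @ X.T                                        # linear kernel
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K2 = np.exp(-sq)                                    # RBF kernel
K = K1 * K2                                         # Hadamard (elementwise) product

print(np.linalg.eigvalsh(K).min() >= -1e-8)         # True up to round-off
```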

Convolution Kernels

Consider an object $x = (x_1, \dots, x_d)$, where each part $x_i \in X_i$. We can define the part-of relation $R(x_1, \dots, x_d, x)$, which holds if and only if $x_1, \dots, x_d$ are indeed the parts of $x$. Of course, there may be more than one way to decompose $x$ into its parts (e.g., think of subsequences of strings, or subtrees of a tree, etc.). Let

$$R^{-1}(x) = \{ (x_1, \dots, x_d) : R(x_1, \dots, x_d, x) \}$$

Convolution Kernels

The convolution kernel $k(x, y)$ is defined as

$$k(x, y) = \sum_{\substack{(x_1,\dots,x_d)\, \in\, R^{-1}(x) \\ (y_1,\dots,y_d)\, \in\, R^{-1}(y)}} \ \prod_{i=1}^d k_i(x_i, y_i)$$

where $k_i(x_i, y_i)$ is a kernel on the $i$-th component. Watkins (1999) defined string kernels, which can be seen as an instance of a convolution kernel.

String Kernels

Consider the set of all length-$n$ subsequences of a word: e.g., the length-2 subsequences of "bat" are "ba", "at", and "b-t" (i.e., "bt" with a gap). The length $l(\mathbf{i})$ of an occurrence of a subsequence $u$ is defined as $i_l - i_f + 1$ if the occurrence begins at position $i_f$ in the string $s$ and ends at position $i_l$.

Consider the mapping $\phi : \Sigma^* \to \mathbb{R}^{\Sigma^n}$, where $\Sigma$ is an alphabet, $\Sigma^*$ is the set of all strings, and $\Sigma^n$ is the set of all strings of length $n$.

String Kernels

Given any subsequence $u \in \Sigma^n$, define

$$\phi_u(s) = \sum_{\mathbf{i} :\, u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})}$$

where $\mathbf{i}$ is an index vector giving the positions at which the subsequence $u$ occurs in the string $s$, and $\lambda \in (0, 1)$. The string kernel is defined as

$$K_n(s, t) = \sum_{u \in \Sigma^n} \ \sum_{\mathbf{i} :\, s[\mathbf{i}] = u} \ \sum_{\mathbf{j} :\, t[\mathbf{j}] = u} \lambda^{l(\mathbf{i}) + l(\mathbf{j})}$$

Clearly, $K_n(s, t) = \langle \phi(s), \phi(t)\rangle$, and so this is a valid kernel.
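
For short strings this definition can be evaluated directly by brute force. The sketch below (a naive enumeration of index vectors, not the efficient dynamic-programming implementation; function names are invented) computes $\phi_u$ and $K_n$ exactly as defined above.

```python
from itertools import combinations
from collections import defaultdict

def subsequence_features(s, n, lam):
    """phi_u(s) = sum over occurrences i of u in s of lam ** l(i),
    where l(i) = last index - first index + 1."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):     # all index vectors i
        u = "".join(s[k] for k in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def string_kernel(s, t, n=2, lam=0.5):
    phi_s, phi_t = subsequence_features(s, n, lam), subsequence_features(t, n, lam)
    return sum(phi_s[u] * phi_t[u] for u in phi_s if u in phi_t)

# "bat" and "cat" share only the length-2 subsequence "at".
print(string_kernel("bat", "cat"))      # lam^2 * lam^2 = 0.0625 for lam = 0.5
```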

Fisher Kernels

Let $P(X \mid \theta)$ be any generative model (e.g., a hidden Markov model). Consider the Fisher score

$$U_X = \nabla_\theta \log P(X \mid \theta)$$

in other words, the gradient of the log-likelihood of a particular input $X$ with respect to the parameters. Define the information matrix $I = E(U_X U_X^T)$, where the expectation is over $P(X \mid \theta)$. The Fisher kernel is

$$K(x, y) = U_x^T I^{-1} U_y$$

The Fisher kernel can be approximated (taking $I$ to be the identity) as

$$K(x, y) \approx U_x^T U_y$$
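
As a tiny worked example (assuming a toy univariate Gaussian model with unknown mean $\mu$ and fixed unit variance, not from the slides), the Fisher score is $U_x = x - \mu$, the information is $I = E[(x-\mu)^2] = 1$, and the kernel reduces to $(x - \mu)(y - \mu)$:

```python
# Fisher kernel for N(mu, 1): U_x = d/dmu log N(x; mu, 1) = x - mu, I = 1.
def fisher_kernel_gaussian(x, y, mu=0.0):
    u_x, u_y = x - mu, y - mu          # Fisher scores
    info = 1.0                          # Fisher information for this model
    return u_x * (1.0 / info) * u_y

print(fisher_kernel_gaussian(1.5, -0.5, mu=0.5))   # (1.0) * (-1.0) = -1.0
```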