
Support Vector Machines
Mark Craven and David Page
Computer Sciences 760, Spring

Some of the slides in these lectures have been adapted/borrowed from materials developed by Tom Dietterich, Pedro Domingos, Tom Mitchell, David Page, and Jude Shavlik.

Goals for the lecture

you should understand the following concepts
- the margin
- slack variables
- the linear support vector machine
- hinge loss
- nonlinear SVMs
- the kernel trick
- the primal and dual formulations of SVM learning
- support vectors
- the kernel matrix
- valid kernels
- polynomial kernel
- Gaussian kernel
- string kernels
- support vector regression

Four key SVM ideas

- Maximize the margin: don't choose just any separating hyperplane
- Penalize misclassified examples: use soft constraints and slack variables
- Use optimization methods to find the model: linear programming, quadratic programming
- Use kernels to represent nonlinear functions and handle complex instances (sequences, trees, graphs, etc.)

[Slide credit: Burr Settles, UW CS PhD]

Some key vector concepts

the dot product between two vectors w and x is defined as:

$$w \cdot x = w^T x = \sum_i w_i x_i$$

for example: $(1)(4) + (3)(-2) + (-5)(-1) = 3$

the 2-norm (Euclidean length) of a vector x is defined as:

$$\|x\|_2 = \sqrt{\sum_i x_i^2}$$

Linear separator learning revisited

suppose we encode our classes as {-1, +1} and consider a linear classifier

$$h(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i + b > 0 \\ -1 & \text{otherwise} \end{cases}$$

an instance (x, y) will be classified correctly if $y(w^T x + b) > 0$

[figure: training instances plotted in the (x_1, x_2) plane with a linear separator]
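
The slides give only the formulas; here is a minimal NumPy sketch of these definitions (the variable names and the choice of b are my own, for illustration):

```python
import numpy as np

w = np.array([1.0, 3.0, -5.0])
x = np.array([4.0, -2.0, -1.0])

dot = w @ x                  # dot product: (1)(4) + (3)(-2) + (-5)(-1) = 3
norm = np.linalg.norm(x)     # 2-norm: sqrt(sum_i x_i^2)

def h(x, w, b):
    """Linear classifier with classes encoded as {-1, +1}."""
    return 1 if w @ x + b > 0 else -1

# an instance (x, y) is classified correctly iff y * (w @ x + b) > 0
print(dot, norm, h(x, w, b=0.0))
```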

Large margin classification

Given a training set that is linearly separable, there are infinitely many hyperplanes that could separate the positive/negative instances. Which one should we choose?

In SVM learning, we find the hyperplane that maximizes the margin.

[Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010]

The margin

suppose we learn a hyperplane h given a training set D
let $x^+$ denote the closest instance to the hyperplane among positive instances, and similarly for $x^-$ and negative instances
since $w \cdot x + b = 0$ and $c(w \cdot x + b) = 0$ define the same hyperplane, we have the freedom to choose the normalization of w
choose the normalization such that $w \cdot x^+ + b = 1$ and $w \cdot x^- + b = -1$ for positive and negative support vectors respectively
then the margin is given by

$$\text{margin}(h) = \frac{1}{2}\,\frac{w}{\|w\|_2} \cdot \left(x^+ - x^-\right) = \frac{w^T x^+ - w^T x^-}{2\,\|w\|_2} = \frac{1}{\|w\|_2}$$
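
As a numerical check of the $1/\|w\|_2$ formula, here is a hedged sketch that approximates the hard-margin SVM in scikit-learn by taking a very large C; the toy data set is invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# toy linearly separable data (made up for this example)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [5.0, 5.0], [6.0, 6.5], [7.0, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# a very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# with the canonical normalization, the geometric margin is 1 / ||w||_2
print("margin =", 1.0 / np.linalg.norm(w))
# support vectors satisfy y * (w.x + b) ~ 1
print(y[clf.support_] * (X[clf.support_] @ w + b))
```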

The hard-margin SVM

given a training set $D = \{\langle x^{(1)}, y^{(1)}\rangle, \dots, \langle x^{(m)}, y^{(m)}\rangle\}$, we can frame the goal of maximizing the margin as a constrained optimization problem:

$$\min_{w,\,b} \ \frac{1}{2}\|w\|_2^2 \qquad \text{(adjust these parameters to minimize this)}$$

subject to the constraints: $y^{(i)}(w^T x^{(i)} + b) \ge 1$ for $i = 1, \dots, m$ (correctly classify $x^{(i)}$ with room to spare)

and use standard algorithms to find an optimal solution to this problem

The soft-margin SVM [Cortes & Vapnik, Machine Learning 1995]

if the training instances are not linearly separable, the previous formulation will fail
we can adjust our approach by using slack variables (denoted by ξ) to tolerate errors:

$$\min_{w,\,b,\,\xi^{(1)},\dots,\xi^{(m)}} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{m} \xi^{(i)}$$

subject to the constraints: $y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi^{(i)}$ and $\xi^{(i)} \ge 0$ for $i = 1, \dots, m$

C determines the relative importance of maximizing the margin vs. minimizing slack
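
The lecture doesn't show a solver; as one way to see that this really is a standard quadratic program, here is a sketch of the soft-margin primal in cvxpy (my own formulation; in practice a dedicated SVM solver such as sklearn.svm.SVC would be used instead):

```python
import cvxpy as cp
import numpy as np

def soft_margin_svm(X, y, C=1.0):
    """Solve the primal soft-margin SVM directly as a QP (a sketch)."""
    m, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    xi = cp.Variable(m)                    # slack variables, one per instance
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,   # margin constraints
                   xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```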

The effect of C in a soft-margin SVM

[Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010]

Hinge loss

when we covered neural nets, we talked about minimizing squared loss and cross-entropy loss
SVMs minimize hinge loss: $\text{loss}(h(x), y) = \max\left(0,\ 1 - y\,h(x)\right)$

[figure: loss (error) as a function of the model output h(x) when y = 1, comparing squared loss, 0/1 loss, and hinge loss]
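
A small sketch of the three losses from the figure, so their shapes can be compared numerically (the grid of scores is arbitrary):

```python
import numpy as np

def hinge_loss(y, score):
    """max(0, 1 - y * h(x)) with y in {-1, +1}; zero once y * h(x) >= 1."""
    return np.maximum(0.0, 1.0 - y * score)

def zero_one_loss(y, score):
    return (np.sign(score) != y).astype(float)

def squared_loss(y, score):
    return (y - score) ** 2

scores = np.linspace(-2.0, 2.0, 9)   # candidate model outputs h(x)
print(hinge_loss(1, scores))         # losses for a positive example (y = 1)
print(zero_one_loss(1, scores))
print(squared_loss(1, scores))
```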

Nonlinear classifiers

What if a linear separator is not an appropriate decision boundary for a given task?

For any data set, there exists a mapping ϕ to a higher-dimensional space such that the data is linearly separable:

$$\varphi(x) = \left(\varphi_1(x),\ \varphi_2(x),\ \dots,\ \varphi_k(x)\right)$$

Example: mapping to quadratic space. Suppose x is represented by 2 features, $x = \langle x_1, x_2 \rangle$:

$$\varphi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ 1\right)$$

now try to find a linear separator in this space

Nonlinear classifiers

for the linear case, our discriminant function was given by

$$h(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ -1 & \text{otherwise} \end{cases}$$

for the nonlinear case, it can be expressed as

$$h(x) = \begin{cases} 1 & \text{if } \hat{w} \cdot \varphi(x) + b > 0 \\ -1 & \text{otherwise} \end{cases}$$

where $\hat{w}$ is a higher-dimensional vector
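
A direct transcription of the slide's quadratic map into code (a sketch; the function names are mine):

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def phi(x):
    """Explicit quadratic feature map from the slide, for 2-d inputs."""
    x1, x2 = x
    return np.array([x1**2, SQRT2 * x1 * x2, x2**2,
                     SQRT2 * x1, SQRT2 * x2, 1.0])

def h(x, w_hat, b):
    """Nonlinear classifier: a linear separator in the mapped 6-d space."""
    return 1 if w_hat @ phi(x) + b > 0 else -1
```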

SVMs with polynomial kernels

[Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010]

The kernel trick

explicitly computing this nonlinear mapping does not scale well
a dot product between two higher-dimensional mappings can sometimes be implemented by a kernel function

example: quadratic kernel

$$\begin{aligned} k(x, z) &= (x \cdot z + 1)^2 = (x_1 z_1 + x_2 z_2 + 1)^2 \\ &= x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 \\ &= \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ 1\right) \cdot \left(z_1^2,\ \sqrt{2}\,z_1 z_2,\ z_2^2,\ \sqrt{2}\,z_1,\ \sqrt{2}\,z_2,\ 1\right) \\ &= \varphi(x) \cdot \varphi(z) \end{aligned}$$
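
The identity can be checked numerically; the test vectors below are arbitrary:

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def phi(x):
    x1, x2 = x
    return np.array([x1**2, SQRT2 * x1 * x2, x2**2,
                     SQRT2 * x1, SQRT2 * x2, 1.0])

def quadratic_kernel(x, z):
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(quadratic_kernel(x, z))   # 4.0, computed in the original 2-d space
print(phi(x) @ phi(z))          # 4.0, computed via the explicit 6-d map
```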

The kernel trick

thus we can use a kernel to compute the dot product without explicitly mapping the instances to a higher-dimensional space:

$$k(x, z) = (x \cdot z + 1)^2 = \varphi(x) \cdot \varphi(z)$$

But why is the kernel trick helpful?

Using the kernel trick

given a training set $D = \{\langle x^{(1)}, y^{(1)}\rangle, \dots, \langle x^{(m)}, y^{(m)}\rangle\}$, suppose the weight vector can be represented as a linear combination of the training instances:

$$w = \sum_{i=1}^{m} \alpha_i x^{(i)}$$

then we can represent a linear SVM's decision value as

$$\sum_{i=1}^{m} \alpha_i \left(x^{(i)} \cdot x\right) + b$$

and a nonlinear SVM's as

$$\sum_{i=1}^{m} \alpha_i \left(\varphi(x^{(i)}) \cdot \varphi(x)\right) + b = \sum_{i=1}^{m} \alpha_i\, k\!\left(x^{(i)}, x\right) + b$$
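
In code, the kernel-expansion form of the decision value looks like this (a sketch; alpha, b, and the kernels are placeholders for a learned model):

```python
import numpy as np

def svm_decision_value(x, X_train, alpha, b, kernel):
    """sum_i alpha_i * k(x_i, x) + b, evaluated without ever forming phi(x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train)) + b

linear_kernel = lambda u, v: u @ v
quadratic_kernel = lambda u, v: (u @ v + 1.0) ** 2
```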

Can we represent a weight vector as a linear combination of training instances?

consider perceptron learning, where each weight can be represented as $w_j = \sum_i \alpha_i x_j^{(i)}$

proof: each weight update has the form

$$w_j^{(t)} = w_j^{(t-1)} + \eta\, \delta_t^{(i)}\, y^{(i)}\, x_j^{(i)} \qquad \text{where } \delta_t^{(i)} = \begin{cases} 1 & \text{if } x^{(i)} \text{ misclassified in update } t \\ 0 & \text{otherwise} \end{cases}$$

summing the updates over t and grouping by instance:

$$w_j = \sum_t \sum_i \eta\, \delta_t^{(i)}\, y^{(i)}\, x_j^{(i)} = \sum_i \left(\eta\, y^{(i)} \sum_t \delta_t^{(i)}\right) x_j^{(i)} = \sum_i \alpha_i\, x_j^{(i)}$$

Note: this is for the case in which y ∈ {-1, 1}

The primal and dual formulations of the hard-margin SVM

primal:

$$\min_{w,\,b} \ \frac{1}{2}\|w\|_2^2 \qquad \text{subject to: } y^{(i)}(w^T x^{(i)} + b) \ge 1 \text{ for } i = 1, \dots, m$$

dual:

$$\max_{\alpha_1,\dots,\alpha_m} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{j=1}^{m} \sum_{k=1}^{m} \alpha_j \alpha_k\, y^{(j)} y^{(k)} \left(x^{(j)} \cdot x^{(k)}\right)$$

subject to: $\alpha_i \ge 0$ for $i = 1, \dots, m$, and $\sum_{i=1}^{m} \alpha_i\, y^{(i)} = 0$
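
The same representer idea can be made concrete with a kernel perceptron, which keeps the classifier entirely in dual form (not from the lecture; a sketch assuming a two-class problem with y in {-1, +1}):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10, eta=1.0):
    """Perceptron in its dual form: the weight vector is never built,
    only the coefficients alpha_i of w = sum_i alpha_i x_i."""
    m = len(X)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            score = sum(alpha[j] * kernel(X[j], X[i]) for j in range(m))
            if y[i] * score <= 0:          # x_i misclassified
                alpha[i] += eta * y[i]     # dual form of w += eta * y_i * x_i
    return alpha
```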

The dual formulation with a kernel (hard-margin version)

primal:

$$\min_{w,\,b} \ \frac{1}{2}\|w\|_2^2 \qquad \text{subject to: } y^{(i)}\left(w^T \varphi(x^{(i)}) + b\right) \ge 1 \text{ for } i = 1, \dots, m$$

dual:

$$\max_{\alpha_1,\dots,\alpha_m} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{j=1}^{m} \sum_{k=1}^{m} \alpha_j \alpha_k\, y^{(j)} y^{(k)}\, k\!\left(x^{(j)}, x^{(k)}\right)$$

subject to: $\alpha_i \ge 0$ for $i = 1, \dots, m$, and $\sum_{i=1}^{m} \alpha_i\, y^{(i)} = 0$

Support vectors

the solution to the dual formulation is a sparse linear combination of the training instances
those instances having $\alpha_i > 0$ are called support vectors; they lie on the margin boundary
the solution wouldn't change if all the instances with $\alpha_i = 0$ were deleted

[figure: a separating hyperplane with the support vectors on the margin boundary]
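
scikit-learn's SVC solves exactly this dual and exposes the sparse solution; a small sketch on invented data:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [5.0, 5.0], [6.0, 6.5], [7.0, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="rbf", gamma=0.1, C=10.0).fit(X, y)

# the dual solution is sparse: only support vectors appear
print(clf.support_)      # indices of the instances with alpha_i > 0
print(clf.dual_coef_)    # the products y_i * alpha_i for those instances
# decision values are sum_i alpha_i y_i k(x_i, x) + b
print(clf.decision_function(X))
```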

The kernel matrix

the kernel matrix (a.k.a. Gram matrix) represents pairwise similarities for instances in the training set:

$$K = \begin{pmatrix} k(x^{(1)}, x^{(1)}) & k(x^{(1)}, x^{(2)}) & \cdots & k(x^{(1)}, x^{(m)}) \\ k(x^{(2)}, x^{(1)}) & & & \vdots \\ \vdots & & \ddots & \\ k(x^{(m)}, x^{(1)}) & & \cdots & k(x^{(m)}, x^{(m)}) \end{pmatrix}$$

it represents the information about the training set that is provided as input to the optimization process

Some common kernels

- polynomial of degree d: $k(x, z) = (x \cdot z)^d$
- polynomial of degree up to d: $k(x, z) = (x \cdot z + 1)^d$
- radial basis function (RBF) (a.k.a. Gaussian): $k(x, z) = \exp\left(-\gamma\, \|x - z\|^2\right)$
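
A short sketch that builds the Gram matrix for any kernel function (the two example kernels match the list above):

```python
import numpy as np

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix: K[i, j] = k(x_i, x_j)."""
    m = len(X)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

rbf = lambda x, z, gamma=0.5: np.exp(-gamma * np.sum((x - z) ** 2))
poly = lambda x, z, d=2: (x @ z + 1.0) ** d
```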

The RBF kernel

the feature mapping ϕ for the RBF kernel is infinite-dimensional!
recall that $k(x, z) = \varphi(x) \cdot \varphi(z)$

$$\begin{aligned} k(x, z) &= \exp\left(-\tfrac{1}{2}\|x - z\|^2\right) \quad \text{for } \gamma = \tfrac{1}{2} \\ &= \exp\left(-\tfrac{1}{2}\|x\|^2\right) \exp\left(-\tfrac{1}{2}\|z\|^2\right) \exp(x \cdot z) \\ &= \exp\left(-\tfrac{1}{2}\|x\|^2\right) \exp\left(-\tfrac{1}{2}\|z\|^2\right) \sum_{n=0}^{\infty} \frac{(x \cdot z)^n}{n!} \end{aligned}$$

from the Taylor series expansion of $\exp(x \cdot z)$; since each term $(x \cdot z)^n$ is a polynomial kernel of degree n, the combined feature space is infinite-dimensional

The RBF kernel illustrated

[figures: RBF decision boundaries for γ = 10, γ = 100, and γ = 1000; from openclassroom.stanford.edu (Andrew Ng)]
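
The factorization can be verified numerically by truncating the Taylor series (a sketch; the test vectors are arbitrary):

```python
import numpy as np
from math import factorial

def rbf(x, z):
    return np.exp(-0.5 * np.sum((x - z) ** 2))   # gamma = 1/2

def rbf_via_series(x, z, terms=20):
    """Same kernel via the slide's factorization, with exp(x.z) replaced by
    a truncated Taylor series; each (x.z)^n / n! term is a scaled polynomial
    kernel, so the implicit feature map is infinite-dimensional."""
    series = sum((x @ z) ** n / factorial(n) for n in range(terms))
    return np.exp(-0.5 * x @ x) * np.exp(-0.5 * z @ z) * series

x, z = np.array([0.5, 1.0]), np.array([1.0, -0.5])
print(rbf(x, z), rbf_via_series(x, z))   # nearly equal
```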

What makes a valid kernel?

$k(x, z)$ is a valid kernel if there is some ϕ such that $k(x, z) = \varphi(x) \cdot \varphi(z)$
this holds for a symmetric function $k(x, z)$ if and only if the kernel matrix K is positive semidefinite for any training set (Mercer's theorem)

definition of positive semidefinite (p.s.d.): $\forall v : v^T K v \ge 0$

Kernel algebra

given a valid kernel, we can make new valid kernels using a variety of operators:

kernel composition | mapping composition
- $k(x, v) = k_a(x, v) + k_b(x, v)$ | $\varphi(x) = \left(\varphi_a(x), \varphi_b(x)\right)$
- $k(x, v) = \gamma\, k_a(x, v),\ \gamma > 0$ | $\varphi(x) = \sqrt{\gamma}\, \varphi_a(x)$
- $k(x, v) = k_a(x, v)\, k_b(x, v)$ | $\varphi_l(x) = \varphi_{a i}(x)\, \varphi_{b j}(x)$
- $k(x, v) = x^T A v$, where A is p.s.d. | $\varphi(x) = L^T x$, where $A = L L^T$
- $k(x, v) = f(x)\, f(v)\, k_a(x, v)$ | $\varphi(x) = f(x)\, \varphi_a(x)$
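
Mercer's condition can be checked empirically on any one Gram matrix (a sketch; this tests a single training set, not all training sets, so it can only falsify validity):

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """A symmetric matrix K is p.s.d. iff v^T K v >= 0 for all v,
    i.e. iff all of its eigenvalues are >= 0."""
    K = (K + K.T) / 2.0        # symmetrize for numerical stability
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```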

The power of kernel functions

kernels can be designed and used to represent complex data types such as
- strings
- trees
- graphs
- etc.

let's consider a specific example

The protein classification task

Given: amino-acid sequence of a protein
Do: predict the family to which it belongs

GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVCVLAHHFGKEFTPPVQAAYAKVVAGVANALAHKYH

The k-spectrum feature map

we can represent sequences by counts of all of their k-mers

x = AKQDYYYYEI, k = 3
ϕ(x) = (0, 0, …, 1, …, 1, …, 2, …), with one dimension per possible k-mer (AAA, AAC, …, AKQ, …, DYY, …, YYY, …)

the dimension of ϕ(x) is $|A|^k$, where |A| is the size of the alphabet
using 6-mers for protein sequences, $20^6$ = 64 million
almost all of the elements in ϕ(x) are 0, since a sequence of length l has at most l − k + 1 k-mers

The k-spectrum kernel

consider the k-spectrum kernel applied to x and z with k = 3:

x = AKQDYYYYEI: ϕ(x) = (0, …, 1, …, 1, …, 0), with counts 1 at AKQ and 1 at YEI (and 2 at YYY)
z = AKQIAKQYEI: ϕ(z) = (0, …, 2, …, 1, …, 0), with counts 2 at AKQ and 1 at YEI

$$\varphi(x) \cdot \varphi(z) = (1)(2) + (1)(1) = 3$$
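
Because ϕ(x) is so sparse, the kernel is easy to compute from k-mer counts alone; a short sketch (reproducing the slide's example):

```python
from collections import Counter

def kmer_counts(seq, k):
    """Sparse k-spectrum representation: counts of the len(seq)-k+1 k-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, z, k):
    """phi(x) . phi(z) computed on the sparse counts, never building
    the |A|^k-dimensional vector explicitly."""
    cx, cz = kmer_counts(x, k), kmer_counts(z, k)
    return sum(n * cz[kmer] for kmer, n in cx.items())

print(spectrum_kernel("AKQDYYYYEI", "AKQIAKQYEI", k=3))   # 3, as on the slide
```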

(k, m)-mismatch feature map

closely related protein sequences may have few exact matches, but many near matches
the (k, m)-mismatch feature map uses the k-spectrum representation, but allows up to m mismatches

x = AKQ, k = 3, m = 1
ϕ(x) = (0, …, 1, …, 1, …, 1, …, 0), with a count at every 3-mer within one substitution of AKQ (AAQ, AKQ, DKQ, …)

Using a trie to represent 3-mers

example: representing all 3-mers of the sequence QAAKKQAKKY

[figure: a trie whose root-to-leaf paths spell out the 3-mers of QAAKKQAKKY]
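
A sketch of the mismatch neighborhood that defines this feature map (the tiny alphabet here is hypothetical; the real protein alphabet has 20 amino acids):

```python
from itertools import combinations, product

def mismatch_neighborhood(kmer, m, alphabet):
    """All strings within Hamming distance <= m of a k-mer; in the
    (k, m)-mismatch map, the k-mer contributes a count to each one."""
    neighbors = set()
    for positions in combinations(range(len(kmer)), m):
        for letters in product(alphabet, repeat=m):
            s = list(kmer)
            for pos, ch in zip(positions, letters):
                s[pos] = ch
            neighbors.add("".join(s))
    return neighbors

print(sorted(mismatch_neighborhood("AKQ", 1, "ADKQY")))  # AAQ, AKQ, DKQ, ...
```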

Computing the kernels efficiently with tries [Leslie et al., NIPS 2002]

k-spectrum kernel
- for each sequence, build a trie representing its k-mers
- compute the kernel $\varphi(x) \cdot \varphi(z)$ by traversing the trie for x using k-mers from z
- update the kernel function when reaching a leaf

(k, m)-mismatch kernel
- for each sequence, build a trie representing its k-mers and also k-mers with at most m mismatches
- compute the kernel $\varphi(x) \cdot \varphi(z)$ by traversing the trie for x using k-mers from z
- update the kernel function when reaching a leaf

scales linearly with sequence length: $O\left(k^{m+1} |A|^{m} \left(|x| + |z|\right)\right)$

Support vector regression

the SVM idea can also be applied in regression tasks
an ε-insensitive error function specifies that a training instance is well explained if the model's prediction is within ε of $y^{(i)}$

[figure: the ε-insensitive error function, and a regression line with tube boundaries $(w^T x^{(i)} + b) - y^{(i)} = \varepsilon$ and $y^{(i)} - (w^T x^{(i)} + b) = \varepsilon$]
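
scikit-learn implements this idea as SVR; a minimal sketch on synthetic data (the data set and hyperparameters are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

# epsilon sets the width of the insensitive tube; C trades margin vs. slack
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

# only instances on or outside the tube become support vectors
print(len(svr.support_), "support vectors out of", len(X))
```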

Support vector regression

$$\min_{w,\,b,\,\xi^{(1)},\dots,\xi^{(m)},\,\hat{\xi}^{(1)},\dots,\hat{\xi}^{(m)}} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{m} \left(\xi^{(i)} + \hat{\xi}^{(i)}\right)$$

subject to the constraints, for $i = 1, \dots, m$:

$$\left(w^T x^{(i)} + b\right) - y^{(i)} \le \varepsilon + \xi^{(i)}$$
$$y^{(i)} - \left(w^T x^{(i)} + b\right) \le \varepsilon + \hat{\xi}^{(i)}$$
$$\xi^{(i)},\ \hat{\xi}^{(i)} \ge 0$$

slack variables allow predictions for some training instances to be off by more than ε

Learning theory justification for maximizing the margin

$$\underbrace{\text{error}_{\mathcal{D}}(h)}_{\text{error on true distribution}} \le \underbrace{\text{error}_{D}(h)}_{\text{training set error}} + \sqrt{\frac{VC\left(\log\frac{2m}{VC} + 1\right) + \log\frac{4}{\delta}}{m}}$$

where VC is the VC-dimension of the hypothesis class

Vapnik showed there is a connection between the margin and the VC dimension:

$$VC \le \frac{4 R^2}{\text{margin}_D(h)^2}$$

where R is a constant dependent on the training data
thus to minimize the VC dimension (and to improve the error bound) → maximize the margin

Comments on SVMs

- we can find solutions that are globally optimal (maximize the margin) because the learning task is framed as a convex optimization problem: no local minima, in contrast to multi-layer neural nets
- there are two formulations of the optimization: primal and dual
  - the dual represents the classifier decision in terms of support vectors
  - the dual enables the use of kernel functions
- we can use a wide range of optimization methods to learn SVMs: standard quadratic programming solvers, SMO [Platt, 1999], linear programming solvers for some formulations, etc.
- kernels provide a powerful way to
  - allow nonlinear decision boundaries
  - represent/compare complex objects such as strings and trees
  - incorporate domain knowledge into the learning task
- using the kernel trick, we can implicitly use high-dimensional mappings without explicitly computing them
- one SVM can represent only a binary classification task; multi-class problems are handled using multiple SVMs and some encoding (one class vs. rest, ECOC, etc.)
- empirically, SVMs have shown state-of-the-art accuracy for many tasks
- the kernel idea can be extended to other tasks (anomaly detection, etc.)
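
As an illustration of the one-class-vs-rest encoding mentioned above, here is a minimal scikit-learn sketch (the blob data set is invented): one binary SVM is trained per class, and prediction goes to the class whose SVM gives the highest decision value.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# three classes, handled by three binary linear SVMs
X, y = make_blobs(n_samples=60, centers=3, random_state=0)
clf = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(clf.predict(X[:5]))
```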
