Support Vector Machines
Konstantin Tretyakov (kt@ut.ee)
MTAT.03.227 Machine Learning
So far
Supervised machine learning
  Linear models: Least squares regression; Fisher's discriminant, Perceptron, Logistic model
  Non-linear models: Neural networks, Decision trees, Association rules
Unsupervised machine learning: Clustering/EM, PCA
Generic scaffolding: Probabilistic modeling, ML/MAP estimation; Performance evaluation, Statistical learning theory; Linear algebra, Optimization methods
Coming up next
Supervised machine learning
  Linear models: Least squares regression, SVM; Fisher's discriminant, Perceptron, Logistic regression, SVM
  Non-linear models: Neural networks, Decision trees, Association rules, SVM, Kernel-XXX
Unsupervised machine learning: Clustering/EM, PCA, Kernel-XXX
Generic scaffolding: Probabilistic modeling, ML/MAP estimation; Performance evaluation, Statistical learning theory; Linear algebra, Optimization methods, Kernels
First things first
SVM (y in {-1, +1}):
> library(e1071)
> m <- svm(x, y, kernel = "linear")
> predict(m, newx)
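(A complete toy run, for reference. This is a sketch: the data, seed, and variable names below are made up for illustration; only library(e1071), svm(), and predict() come from the slide above.)

library(e1071)
set.seed(42)                               # toy data: two Gaussian clouds
x <- rbind(matrix(rnorm(40, mean = -1), ncol = 2),
           matrix(rnorm(40, mean = +1), ncol = 2))
y <- factor(rep(c(-1, 1), each = 20))      # labels in {-1, +1}
m <- svm(x, y, kernel = "linear")          # C-classification by default
newx <- matrix(rnorm(10), ncol = 2)        # five new points
predict(m, newx)                           # predicted labels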
Quiz
This line is called ...?
This vector is ...?
Those lines are ...?
f(x) = ?
x_1 = ?  y_1 = ?
Functional margin of x_1?
Geometric margin of x_1?
Distance to origin?
Quiz
Separating hyperplane
Normal w
Isolines (level lines)
f(x) = w^T x + b
x_1 = (2, 6); y_1 = 1
Functional margin: y_1 f(x_1)
Geometric margin: y_1 f(x_1) / ||w||
Distance to origin: d = -b / ||w||
Quiz
Suppose we scale w and b by some constant. Will it:
Affect the separating hyperplane? How?
Affect the functional margins? How?
Affect the geometric margins? How?
Quiz
Example: w -> 2w, b = 0
Quiz
Suppose we scale w and b by some constant. Will it:
Affect the separating hyperplane? How?
  No: w^T x + b = 0  <=>  2w^T x + 2b = 0
Affect the functional margins? How?
  Yes: (2w^T x_i + 2b) y_i = 2 (w^T x_i + b) y_i
Affect the geometric margins? How?
  No: (2w^T x + 2b) / ||2w|| = (w^T x + b) / ||w||
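(A quick numeric check of these answers. The w and b below are made up; x_1 and y_1 are the quiz point.)

w <- c(-1, 2); b <- -3                            # hypothetical classifier
x1 <- c(2, 6); y1 <- 1                            # the quiz point
fm <- function(w, b) y1 * (sum(w * x1) + b)       # functional margin
gm <- function(w, b) fm(w, b) / sqrt(sum(w^2))    # geometric margin
c(fm(w, b), fm(2 * w, 2 * b))   # 7 14: functional margin doubles
c(gm(w, b), gm(2 * w, 2 * b))   # equal: geometric margin unchanged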
Which classifier is best?
Maximal margin classifier
Why maximal margin?
Well-defined, single stable solution
Noise-tolerant
Small parameterization
(Fairly) efficient algorithms exist for finding it
Maximal margin: Separable case
Isolines: f(x) = 1 and f(x) = -1
Constraint: f(x_i) y_i >= 1
The (geometric) distance to the isoline f(x) = 1 is:
d = f(x) / ||w|| = 1 / ||w||
Maximal margin: Separable case
Among all linear classifiers (w, b) which keep all points at a functional margin of 1 or more, we look for the one which has the largest distance d to the corresponding isolines, i.e. the largest geometric margin.
As d = 1 / ||w||, this is equivalent to finding the classifier with minimal ||w||, which is equivalent to finding the classifier with minimal ||w||^2.
Compare
Generic linear classification (separable case): find (w, b) such that all points are classified correctly, i.e. f(x_i) y_i > 0.
Maximal margin classification (separable case): find (w, b) such that all points are classified correctly with a fixed functional margin, i.e. f(x_i) y_i >= 1, and ||w||^2 is minimal.
Remember
SVM optimization problem (separable case):
min_{w,b} (1/2) ||w||^2
such that (w^T x_i + b) y_i >= 1
General case ("soft margin")
The same, but we also penalize all margin violations.
SVM optimization problem:
min_{w,b} (1/2) ||w||^2 + C Σ_i ξ_i,  where ξ_i = [1 - f(x_i) y_i]_+
Writing m_i = f(x_i) y_i for the margin of point i, this is:
min_{w,b} (1/2) ||w||^2 + C Σ_i hinge(m_i),  where hinge(m) = [1 - m]_+
Hinge loss
hinge(m) = [1 - m]_+
Classification loss functions
Generic classification: min_{w,b} Σ_i [m_i < 0]
Perceptron: min_{w,b} Σ_i (-m_i)_+
Least squares classification*: min_{w,b} Σ_i (m_i - 1)^2
Boosting: min_{w,b} Σ_i exp(-m_i)
Logistic regression: min_{w,b} Σ_i log(1 + e^{-m_i})
Regularized logistic regression: min_{w,b} Σ_i log(1 + e^{-m_i}) + λ (1/2) ||w||^2
SVM: min_{w,b} Σ_i [1 - m_i]_+ + (1/(2C)) ||w||^2
L2-SVM: min_{w,b} Σ_i [1 - m_i]_+^2 + (1/(2C)) ||w||^2
L1-regularized L2-SVM: min_{w,b} Σ_i [1 - m_i]_+^2 + (1/(2C)) ||w||_1
etc.
In general
min_{w,b} Σ_i φ(m_i) + λ Ω(w)
(model fit + model complexity)
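(The losses above are one-liners in R; a sketch with our own function names, plotting hinge against logistic to compare their shapes.)

zero_one   <- function(m) as.numeric(m < 0)   # generic classification
perceptron <- function(m) pmax(-m, 0)
lsq        <- function(m) (m - 1)^2           # least squares classification
boost      <- function(m) exp(-m)             # boosting
logistic   <- function(m) log(1 + exp(-m))    # logistic regression
hinge      <- function(m) pmax(1 - m, 0)      # SVM
curve(hinge(x), from = -2, to = 3, xlab = "margin m", ylab = "loss")
curve(logistic(x), add = TRUE, lty = 2)       # compare the shapes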
Compare to MAP estimation
max_{Model} log P(Data | Model) + log P(Model)
(likelihood + model prior)
Solving the SVM
min_{w,b} (1/2) ||w||^2 + C Σ_i [1 - f(x_i) y_i]_+
Equivalently, with slack variables ξ_i:
min_{w,b,ξ} (1/2) ||w||^2 + C Σ_i ξ_i
such that f(x_i) y_i >= 1 - ξ_i, ξ_i >= 0
A quadratic function with linear constraints!
This is quadratic programming: minimize (1/2) x^T Q x + c^T x subject to Ax >= b, Cx = d.
> library(quadprog)
> solve.QP(Q, -c, A, b, meq)
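(Filling in Q, c, A, b for the soft-margin primal gives a naive but complete trainer. A sketch under our own naming, with variable vector z = (w, b, ξ); the tiny ridge is there only because solve.QP requires a positive definite matrix.)

library(quadprog)
# X is n x d, y numeric in {-1, +1}; z = (w, b, xi_1, ..., xi_n).
svm_primal_qp <- function(X, y, C = 1) {
  n <- nrow(X); d <- ncol(X)
  # Quadratic term of (1/2)||w||^2; tiny ridge on (b, xi) keeps Q positive definite.
  Q <- diag(c(rep(1, d), rep(1e-8, 1 + n)))
  # solve.QP minimizes (1/2) z'Qz - dvec'z, hence dvec = -(0, ..., 0, C, ..., C).
  dvec <- c(rep(0, d + 1), rep(-C, n))
  # Constraint rows: y_i (w'x_i + b) + xi_i >= 1  and  xi_i >= 0.
  A_margin <- cbind(X * y, y, diag(n))
  A_slack  <- cbind(matrix(0, n, d + 1), diag(n))
  Amat <- t(rbind(A_margin, A_slack))     # solve.QP wants constraints as columns
  bvec <- c(rep(1, n), rep(0, n))
  z <- solve.QP(Q, dvec, Amat, bvec, meq = 0)$solution
  list(w = z[1:d], b = z[d + 1])
}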
Solving the SVM: Dual
min_{w,b,ξ} (1/2) ||w||^2 + C Σ_i ξ_i such that f(x_i) y_i - 1 + ξ_i >= 0, ξ_i >= 0
is equivalent to:
min_{w,b,ξ} max_{α >= 0, β >= 0} (1/2) ||w||^2 + C Σ_i ξ_i - Σ_i α_i (f(x_i) y_i - 1 + ξ_i) - Σ_i β_i ξ_i
Regrouping the ξ terms:
min_{w,b,ξ} max_{α >= 0, β >= 0} (1/2) ||w||^2 + Σ_i ξ_i (C - α_i - β_i) - Σ_i α_i (f(x_i) y_i - 1)
At the optimum C - α_i - β_i = 0; as β_i >= 0, this leaves the constraint 0 <= α_i <= C.
Solving the SVM: Dual
min_{w,b} max_α (1/2) ||w||^2 - Σ_i α_i (f(x_i) y_i - 1),  0 <= α_i <= C
Sparsity: α_i is nonzero only for those points which have f(x_i) y_i - 1 <= 0.
Now swap the min and the max (this can be done here, in particular, because everything is nice and convex).
Solving the SVM: Dual
max_α min_{w,b} (1/2) ||w||^2 - Σ_i α_i (f(x_i) y_i - 1),  0 <= α_i <= C
Next solve the inner (unconstrained) min as usual:
d/dw: w - Σ_i α_i y_i x_i = 0
d/db: Σ_i α_i y_i = 0
Solving the SVM: Dual
Express w and substitute:
w = Σ_i α_i y_i x_i  (the dual representation)
Σ_i α_i y_i = 0
Solving the SVM: Dual
Substituting gives:
max_α Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
0 <= α_i <= C,  Σ_i α_i y_i = 0
Solving the SVM: Dual
In matrix form:
max_α 1^T α - (1/2) α^T (K ∘ Y) α
where K_ij = x_i^T x_j, Y_ij = y_i y_j (∘ is the elementwise product)
0 <= α_i <= C,  y^T α = 0
Solving the SVM: Dual
Equivalently: min_α (1/2) α^T (K ∘ Y) α - 1^T α,  0 <= α_i <= C,  y^T α = 0
Then find b from the condition: f(x_i) y_i = 1 if 0 < α_i < C
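(The same quadprog call handles the dual; again a sketch with our own names. The equality y^T α = 0 goes in as the single meq constraint, and b is recovered from the margin support vectors as above.)

library(quadprog)
svm_dual_qp <- function(X, y, C = 1) {      # y numeric in {-1, +1}
  n <- nrow(X)
  K <- X %*% t(X)                           # linear kernel: K_ij = x_i'x_j
  Q <- (y %*% t(y)) * K + diag(1e-8, n)     # K ∘ Y, plus ridge for positive definiteness
  Amat <- cbind(y, diag(n), -diag(n))       # y'a = 0; a_i >= 0; -a_i >= -C
  bvec <- c(0, rep(0, n), rep(-C, n))
  a <- solve.QP(Q, rep(1, n), Amat, bvec, meq = 1)$solution
  w <- colSums(a * y * X)                   # recover w = sum_i a_i y_i x_i
  sv <- which(a > 1e-6 & a < C - 1e-6)      # margin SVs: 0 < a_i < C
  b <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)  # from f(x_i) y_i = 1
  list(alpha = a, w = w, b = b)
}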
Support vectors
Support vectors
[Figure: training points annotated with their α_i values (0, 0.5, 1, C, ...); the constraints Σ_i α_i y_i = 0 and 0 <= α_i <= C hold.]
Sparsity
The dual solution is often very sparse; this allows the optimization to be performed efficiently (the working set approach).
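(On an e1071 model, e.g. the toy model m fitted at the start, this sparsity is directly visible; the fields below are standard parts of the fitted svm object.)

m$tot.nSV      # total number of support vectors
m$index        # which training points they are
head(m$coefs)  # their dual coefficients α_i y_i; all other α_i are zero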
Kernels
f(x) = w^T x + b and w = Σ_i α_i y_i x_i give:
f(x) = Σ_i α_i y_i x_i^T x + b = Σ_i α_i y_i K(x_i, x) + b
where K is the kernel function.
Changing the kernel changes the classifier, e.g.:
f(x) = w_1 x + w_2 x^2 + b
f(x) = Σ_i α_i y_i exp(-||x_i - x||^2) + b
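(To see a kernel in action, the opening toy example can be rerun with the radial kernel on data that no line separates; the data and gamma = 1 below are our choices.)

library(e1071)
set.seed(1)
x2 <- matrix(rnorm(200), ncol = 2)                 # 100 points in 2D
y2 <- factor(ifelse(rowSums(x2^2) > 1.5, 1, -1))   # circular class boundary
m2 <- svm(x2, y2, kernel = "radial", gamma = 1)    # K(x,x') = exp(-gamma ||x - x'||^2)
table(predict(m2, x2), y2)                         # fits the non-linear boundary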
Quiz
SVM is a linear classifier. Margin maximization can be achieved via minimization of ____.
SVM uses ____ loss and ____ regularization.
Besides hinge loss I also know ____ loss and ____ loss.
SVM in both primal and dual form is solved using ____ programming.
Quiz
In the primal formulation we solve for the parameter vector ____.
In the dual formulation we solve for ____ instead.
The ____ form of SVM is typically sparse.
Support vectors are those training points for which ____.
The relation between primal and dual variables is: ____ = ____.
A kernel is a generalization of the ____ product.