Support Vector Machines
CISC 5800, Professor Daniel Leeds

Separating boundary, defined by w
- A separating hyperplane splits class 0 and class 1.
- The plane is defined by the vector w perpendicular to it.
- Is data point x in class 0 or class 1?
  w^T x + b > 0 : class 1
  w^T x + b < 0 : class 0
- But where do we place the boundary?

Max margin classifiers
- Logistic regression: LL(y|x; w) = Σ_i [ y_i w^T x_i - log(1 + e^{w^T x_i}) ]
- Every data point x_i is considered when fitting the boundary w, so outlier data pulls the boundary towards it.

Focus on boundary points
- Find the largest margin between the boundary points on both sides.
- Works well in practice.
- We can call the boundary points support vectors.
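The decision rule above is just a sign test on w^T x + b. A minimal numpy sketch (the weight vector, offset, and data points are invented for illustration):

    import numpy as np

    def classify(X, w, b):
        # Sign test from the slide: w^T x + b > 0 gives class 1, otherwise class 0
        return np.where(X @ w + b > 0, 1, 0)

    w, b = np.array([1.0, -1.0]), 0.0        # invented separator
    X = np.array([[2.0, 0.5], [0.5, 2.0]])   # invented data points
    print(classify(X, w, b))                 # [1 0]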
Maximum margin definitions
- M is the margin width.
- x+ is a +1 point closest to the boundary; x- is a -1 point closest to the boundary.
- x+ = λw + x-, and ||x+ - x-|| = M.
- Three parallel planes: w^T x + b = +1, w^T x + b = 0 (the boundary), and w^T x + b = -1.
- Classify as +1 if w^T x + b >= 1; classify as -1 if w^T x + b <= -1; undefined if -1 < w^T x + b < 1.
- Maximizing M is equivalent to minimizing ||w||.

λ derivation
- w^T x+ + b = +1
- w^T (λw + x-) + b = +1
- λ w^T w + w^T x- + b = +1
- λ w^T w + (-1 - b) + b = +1, since w^T x- + b = -1
- λ = 2 / (w^T w)

M derivation
- M = ||x+ - x-|| = ||λw|| = λ ||w||
- M = (2 / w^T w) ||w|| = 2 / ||w||

Support vector machine (SVM) optimization
- max_w M is equivalent to min_w (1/2) w^T w, subject to:
  w^T x_i + b >= +1 for x_i in class +1
  w^T x_i + b <= -1 for x_i in class -1
- Optimization with constraints: set ∂f(w)/∂w_j = 0 with Lagrange multipliers; solve via gradient descent and matrix calculus.
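To make the margin algebra concrete, a tiny numpy check that the two expressions for M agree, λ = 2/(w^T w) and M = 2/||w|| (the weight vector is invented):

    import numpy as np

    w = np.array([3.0, 4.0])            # invented weight vector; ||w|| = 5
    lam = 2.0 / (w @ w)                 # lambda = 2 / (w^T w) from the derivation
    M = lam * np.linalg.norm(w)         # M = lambda * ||w||
    print(M, 2.0 / np.linalg.norm(w))   # 0.4 0.4: both forms of M match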
Support vector machine (SVM) optimization with slack variables
- What if the data are not linearly separable? Or linearly separable, but only with narrow margins?
- Introduce a slack variable ε_i for each data point (figure: points with ε_1, ε_3, ε_4 lie on the wrong side of their margin).
- argmin_{w,b} (1/2) w^T w + C Σ_i ε_i, subject to:
  w^T x_i + b >= 1 - ε_i for x_i in class +1
  w^T x_i + b <= -1 + ε_i for x_i in class -1
  ε_i >= 0
- Each error ε_i is penalized based on its distance from the separator.

Hyper-parameters for learning
- C influences the tolerance for label errors versus narrow margins in argmin_{w,b} (1/2) w^T w + C Σ_i ε_i (see the sketch after this section).
- Gradient ascent: w_j <- w_j + ε [ Σ_i x_j^i (y_i - g(w^T x_i)) - (λ/n) w_j ]; the learning rate ε influences the effect of individual data points in learning.
- T, the number of training examples, and L, the number of loops through the data, balance learning and over-fitting.
- Regularization: λ influences the strength of your prior belief.

Alternate SVM formulation
- Constraints: α_i >= 0 and Σ_i α_i y_i = 0.
- w = Σ_i α_i x_i y_i
- Support vectors are the x_i with α_i > 0; the y_i are the data labels, +1 or -1.
- To classify sample x_k, compute: w^T x_k + b = Σ_i α_i y_i x_i^T x_k + b
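A small scikit-learn sketch of the C trade-off named above; the two-cluster dataset is invented, and sklearn's SVC solves this same slack-variable optimization:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two overlapping classes, so slack variables are needed.
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
    y = np.array([-1] * 50 + [+1] * 50)

    for C in (0.01, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        w = clf.coef_[0]
        print(f"C={C}: margin 2/||w|| = {2 / np.linalg.norm(w):.2f}, "
              f"support vectors = {len(clf.support_)}")

    # The fitted model stores exactly the dual quantities above:
    # clf.dual_coef_ holds alpha_i * y_i for the support vectors (alpha_i > 0),
    # and classification uses sum_i alpha_i y_i x_i^T x_k + b.

Small C tolerates more slack (wider margin, more support vectors); large C punishes errors (narrower margin, fewer support vectors).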
Example
- w = Σ_i α_i x_i y_i, with α_i >= 0 and Σ_i α_i y_i = 0
- x_1 = (1, 1), y_1 = +1, α_1 = 0.5
- x_2 = (0, 0), y_2 = +1, α_2 = 0.7
- x_3 = (-1, -1), y_3 = -1, α_3 = 1
- x_4 = (-0.5, -3), y_4 = -1, α_4 = 0.2
- w = 0.5 (1, 1) + 0.7 (0, 0) - 1 (-1, -1) - 0.2 (-0.5, -3) = (0.5 + 1 + 0.1, 0.5 + 1 + 0.6) = (1.6, 2.1)

Hyper-parameters to learn
- Each data point x has N features (presuming we classify with w^T x + b).
- Separator: w and b. N elements of w plus 1 value for b: N+1 parameters.
- OR: t support vectors -> t non-zero α_i plus 1 value for b: t+1 parameters.

Classifying with additional dimensions
- Note: more dimensions make it easier to separate T training points; training error is minimized, but we may risk over-fitting.
- (Figure: 1-D data with no linear separator along x_1 becomes linearly separable after the mapping φ(x_1) = (x_1, x_1^2).)

Quadratic mapping function (math)
- (x_1, x_2, x_3, x_4) -> (x_1, x_2, x_3, x_4, x_1^2, ..., x_4^2, x_1 x_2, x_1 x_3, ..., x_2 x_4, x_3 x_4)
- N features -> N + N + N(N-1)/2 features, roughly N^2 features.
- That is roughly N^2 values to learn for w in the higher-dimensional space.
- Or, observe: (v^T x + 1)^2 = (v_1 x_1 + ... + v_N x_N + 1)^2 = 2 v_1 v_2 x_1 x_2 + ... + 2 v_{N-1} v_N x_{N-1} x_N + v_1^2 x_1^2 + ... + v_N^2 x_N^2 + 2 v_1 x_1 + ... + 2 v_N x_N + 1, so a v with only N elements operates in the quadratic space.
- Classification then takes the form Σ_i α_i y_i (x_i^T x_k + 1)^2 + b.
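The worked example can be checked mechanically. A numpy sketch of the dual-weight sum; b is taken as 0 and the query point x_k is invented, both purely for illustration:

    import numpy as np

    # The four training points, labels, and alpha values from the example above.
    X = np.array([[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0], [-0.5, -3.0]])
    y = np.array([+1, +1, -1, -1])
    alpha = np.array([0.5, 0.7, 1.0, 0.2])

    print(alpha @ y)               # 0.0: the constraint sum_i alpha_i y_i = 0 holds

    w = (alpha * y) @ X            # w = sum_i alpha_i x_i y_i
    print(w)                       # [1.6 2.1]

    # Classifying a new sample x_k via the support-vector sum (b taken as 0 here):
    x_k = np.array([1.0, -1.0])
    print(np.sum(alpha * y * (X @ x_k)), w @ x_k)   # identical scores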
Quadratic mapping function, simplified
- x = (x_1, x_2) -> φ(x) = (√2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2, 1)
- x = (5, -2) -> (5√2, -2√2, 25, 4, -10√2, 1)
- x_k = (3, -1) -> (3√2, -√2, 9, 1, -3√2, 1)
- φ(x)^T φ(x_k) = 30 + 4 + 225 + 4 + 60 + 1 = 324
- Or, observe: (x^T x_k + 1)^2 = (15 + 2 + 1)^2 = 18^2 = 324

Mapping function(s)
- Map from the low-dimensional space x = (x_1, x_2) to the higher-dimensional space φ(x) = (√2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2, 1).
- N data points are guaranteed to be separable in a space of N-1 dimensions or more.
- Classifying x_k: w = Σ_i α_i φ(x_i) y_i, so compute Σ_i α_i y_i φ(x_i)^T φ(x_k) + b.

Kernels
- Classifying x_k: Σ_i α_i y_i φ(x_i)^T φ(x_k) + b
- Kernel trick: estimate the high-dimensional dot product with a function K(x_i, x_k) = φ(x_i)^T φ(x_k).
- Now classify x_k with Σ_i α_i y_i K(x_i, x_k) + b.

Radial basis kernel
- Try a projection to infinitely many dimensions: φ(x) = (x_1, ..., x_n, x_1^2, ..., x_n^2, x_1^3, ..., x_n^3, ...)
- Taylor expansion: e^x = x^0/0! + x^1/1! + x^2/2! + x^3/3! + ...
- K(x, x_k) = exp( -||x - x_k||^2 / (2σ^2) )
- Note: ||x - x_k||^2 = (x - x_k)^T (x - x_k)
- The resulting separating surface can curve around all the support vectors.
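A numpy check of the kernel trick numbers above, plus the RBF kernel; the feature map is the √2 map written above, σ is a free parameter, and the RBF kernel is written with the common 1/(2σ^2) scaling since the slide's exact normalization is ambiguous in the source:

    import numpy as np

    def phi(x):
        # Explicit quadratic feature map: (sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2, 1)
        x1, x2 = x
        s = np.sqrt(2.0)
        return np.array([s * x1, s * x2, x1**2, x2**2, s * x1 * x2, 1.0])

    x, xk = np.array([5.0, -2.0]), np.array([3.0, -1.0])
    print(phi(x) @ phi(xk))       # 324.0: dot product in the mapped space
    print((x @ xk + 1.0) ** 2)    # 324.0: same value via the kernel (x^T xk + 1)^2

    def rbf(x, xk, sigma=1.0):
        # Radial basis kernel exp(-||x - xk||^2 / (2 sigma^2))
        d = x - xk
        return np.exp(-(d @ d) / (2.0 * sigma**2))

    print(rbf(x, xk, sigma=2.0))  # between 0 and 1; sigma sets how fast similarity decays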
Example RBF-kernel separator
- (Figure: large margin, non-linear separation.)

Potential dangers of RBF-kernel separator
- (Figure: small margin, over-fitting, non-linear separation.)

The power of SVM (+kernels)
- The boundary is defined by a few support vectors.
  Caused by: maximizing the margin.
  Causes: less over-fitting.
  Similar to: regularization.
- Kernels keep the number of learned parameters in check.

Binary -> M-class classification
- Learn a boundary for class m vs. all other classes.
- Only M-1 separators are needed for M classes: the M-th class is for data outside of classes 1, 2, 3, ..., M-1.
- For each separator, find the boundary that gives the highest margin for the data points x (a one-vs-rest sketch follows below).
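A sketch of the M-1 one-vs-rest scheme just described, using scikit-learn; the three-class dataset and the fallback class are invented for illustration:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    # Three toy classes; fit M-1 = 2 one-vs-rest separators, leaving class 2 as
    # the "outside all learned classes" case.
    X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in ([0, 0], [3, 0], [0, 3])])
    y = np.repeat([0, 1, 2], 30)

    seps = {m: SVC(kernel="linear").fit(X, (y == m).astype(int)) for m in (0, 1)}

    def predict(x_k):
        # Pick a class whose max-margin separator fires; else fall back to class M.
        for m, clf in seps.items():
            if clf.decision_function([x_k])[0] > 0:
                return m
        return 2

    print(predict([0.1, 0.2]), predict([3.1, -0.1]), predict([0.0, 3.2]))  # 0 1 2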
Benefits of generative methods
- P(D|θ) and P(θ|D) can generate a non-linear boundary.
- E.g., Gaussians with multiple variances.
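A brief sketch of that claim (all means and variances invented): two class-conditional Gaussians with different variances give a quadratic, here circular, equal-likelihood boundary rather than a line:

    import numpy as np
    from scipy.stats import multivariate_normal

    g0 = multivariate_normal(mean=[0, 0], cov=1.0)   # narrow class-0 Gaussian
    g1 = multivariate_normal(mean=[0, 0], cov=9.0)   # wide class-1 Gaussian

    def classify(x):
        # Pick the class whose Gaussian gives x higher likelihood.
        return 0 if g0.logpdf(x) > g1.logpdf(x) else 1

    print(classify([0.5, 0.5]))   # near the shared mean -> class 0
    print(classify([4.0, 4.0]))   # far from the mean    -> class 1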