2/14/2018

Support Vector Machines
CISC 5800  Professor Daniel Leeds

Separating boundary, defined by w
- A separating hyperplane splits class 0 and class 1
- The plane is defined by a vector w perpendicular to the plane
- Is data point x in class 0 or class 1?
  - w^T x + b > 0  ->  class 1
  - w^T x + b < 0  ->  class 0
- But where do we place the boundary?

Max margin classifiers
- Logistic classifier: LL(y|x; w) = sum_i [ y^i (w^T x^i) - log(1 + e^{w^T x^i}) ]
- Every data point x^i is considered when fitting the boundary w
- Outlier data pulls the boundary towards it

Focus on boundary points
- Find the largest margin between the boundary points on both sides
- Works well in practice
- We can call the boundary points support vectors
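The sign-of-(w^T x + b) decision rule above can be sketched in a few lines of NumPy; the weights w and bias b here are made-up values for illustration, not learned ones:

```python
import numpy as np

# Hypothetical weights and bias for a 2-D separating hyperplane.
w = np.array([2.0, -1.0])
b = -0.5

def classify(x):
    """Return 1 if w^T x + b > 0 (class 1), else 0 (class 0)."""
    return 1 if w @ x + b > 0 else 0

print(classify(np.array([1.0, 0.0])))   # w^T x + b = 1.5 > 0  -> class 1
print(classify(np.array([0.0, 1.0])))   # w^T x + b = -1.5 < 0 -> class 0
```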
Maximum margin definitions
- Three parallel planes: w^T x + b = 1,  w^T x + b = 0,  w^T x + b = -1
- M is the margin width
- x^+ is a +1 point closest to the boundary; x^- is a -1 point closest to the boundary
- Classify as +1 if w^T x + b >= 1
- Classify as -1 if w^T x + b <= -1
- Undefined if -1 < w^T x + b < 1
- M = 2 / sqrt(w^T w)

Support vector machine (SVM) optimization
- argmin_w w^T w  subject to:
  - w^T x^i + b >= 1 for x^i in class 1
  - w^T x^i + b <= -1 for x^i in class -1
- Equivalently, with multipliers: argmin_w w^T w + sum_i λ^i [1 - (w^T x^i + b)] for the class-1 constraints
- Margin geometry: x^+ = λw + x^-, and ||x^+ - x^-|| = M
- So maximizing M is the same as minimizing w^T w

Support vector machine (SVM) optimization
- argmin_w w^T w  subject to:
  - w^T x^i + b >= 1 for x^i in class 1
  - w^T x^i + b <= -1 for x^i in class -1
- Gradient-style update (sketch): w_j <- w_j + λ^+ [1 - (sum_j w_j x_j^+ + b)] + λ^- [(sum_j w_j x_j^- + b) + 1], pushing w to satisfy each margin constraint

Alternate SVM formulation
- w = sum_i α^i x^i y^i
- Support vectors x^i have α^i > 0
- The y^i are the data labels, +1 or -1
- Constraints: α^i >= 0 and sum_i α^i y^i = 0
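As a sketch of these formulas, using scikit-learn's `SVC` on an invented toy dataset, we can recover w from the dual form w = sum_i α^i y^i x^i and check the margin width M = 2/sqrt(w^T w); `dual_coef_` stores α^i y^i for each support vector:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative, not from the lecture).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

# Dual form: w = sum_i alpha^i y^i x^i.
# sklearn's dual_coef_ holds alpha^i * y^i for each support vector.
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
M = 2.0 / np.sqrt(w @ w)                      # margin width M = 2 / sqrt(w^T w)

print(w)    # matches clf.coef_.ravel()
print(M)    # distance between the closest +1 and -1 points
```

Here the closest points are [2, 2] and [-2, -2], so the margin should come out near sqrt(32) ≈ 5.66.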
Support vector machine (SVM) optimization with slack variables
- What if the data are not linearly separable?
- argmin_{w,b} w^T w + C sum_i ε^i  subject to:
  - w^T x^i + b >= 1 - ε^i for x^i in class 1
  - w^T x^i + b <= -1 + ε^i for x^i in class -1
  - ε^i >= 0
- Each error ε^i is penalized based on its distance from the separator
- [figure: non-separable points with slacks ε_1, ε_3, ε_4 measured from the margin]

Support vector machine (SVM) optimization with slack variables
- Example: linearly separable, but with narrow margins
- argmin_{w,b} w^T w + C sum_i ε^i  subject to:
  - w^T x^i + b >= 1 - ε^i for x^i in class 1
  - w^T x^i + b <= -1 + ε^i for x^i in class -1
  - ε^i >= 0

Hyper-parameters for learning
- argmin_{w,b} w^T w + C sum_i ε^i
- Optimization constraints: C influences the tolerance for label errors versus narrow margins
- Gradient ascent: w_j <- w_j + ε [x_j^i y^i g(w^T x^i) - w_j/λ]; the learning rate ε influences the effect of individual data points in learning
- T (number of training examples) and L (number of loops through the data) balance learning and over-fitting
- Regularization: λ influences the strength of your prior belief

Hyper-parameters to learn
- Each data point x^i has N features (presuming we classify with w^T x^i + b)
- Separator: w and b
  - N elements of w, 1 value for b: N+1 parameters
  - OR: t support vectors -> t non-zero α^i, 1 value for b: t+1 parameters
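A small illustration of C's role, on invented overlapping data: a small C tolerates lots of slack (wide margin, many points on or inside the margin become support vectors), while a large C punishes violations and tightens the margin:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds -- not linearly separable.
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),
               rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Small C: tolerate many slack violations -> wide margin, many support vectors.
# Large C: penalize violations heavily -> narrow margin, fewer support vectors.
soft = SVC(kernel="linear", C=0.01).fit(X, y)
hard = SVC(kernel="linear", C=100.0).fit(X, y)

print(len(soft.support_), len(hard.support_))
```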
Classifying with additional dimensions
- Note: more dimensions make it easier to separate T training points -- training error is minimized, but may risk over-fit
- No linear separator for x_1 alone; the mapping φ(x_1) = (x_1, x_1^2) gives a linear separator
- [figure: 1-D data along the x_1 axis, and the same data mapped into (x_1, x_1^2)]

Quadratic mapping function (math)
- (x_1, x_2, x_3, x_4) -> (x_1, x_2, x_3, x_4, x_1^2, x_2^2, ..., x_1 x_2, x_1 x_3, ..., x_2 x_4, x_3 x_4)
- N features -> N + N + N(N-1)/2 ≈ N^2 features
- ≈ N^2 values to learn for w in the higher-dimensional space
- Or, observe: (v^T x + 1)^2 = v_1^2 x_1^2 + ... + v_N^2 x_N^2 + 2 v_1 v_2 x_1 x_2 + ... + 2 v_{N-1} v_N x_{N-1} x_N + 2 v_1 x_1 + ... + 2 v_N x_N + 1 -- a v with only N elements operates in the quadratic space
- w^T x^k + b = sum_i α^i y^i (x^{iT} x^k + 1)^2 + b

Quadratic mapping function simplified
- x = [x_1, x_2] -> φ(x) = [x_1^2, x_2^2, √2 x_1, √2 x_2, √2 x_1 x_2, 1]
- x^i = [5, -2], x^k = [3, -1]:
  φ(x^i)^T φ(x^k) = 324
- Or, observe: (x^{iT} x^k + 1)^2 = (15 + 2 + 1)^2 = 324

Mapping function(s)
- Map from the low-dimensional space x = (x_1, x_2) to the higher-dimensional space φ(x) = (x_1^2, x_2^2, √2 x_1, √2 x_2, √2 x_1 x_2, 1)
- N data points are guaranteed to be separable in a space of N-1 dimensions or more
- Classifying x^k: w = sum_i α^i φ(x^i) y^i, so compute sum_i α^i y^i φ(x^i)^T φ(x^k) + b
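The equivalence between the explicit quadratic map and the kernel (x^T x^k + 1)^2 can be checked numerically on the example points above; the √2 scaling factors are what make the identity exact:

```python
import numpy as np

def phi(x):
    """Quadratic feature map for 2-D input, scaled so that
    phi(x) . phi(z) equals (x . z + 1)^2 exactly."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, 1.0])

xi = np.array([5.0, -2.0])
xk = np.array([3.0, -1.0])

explicit = phi(xi) @ phi(xk)     # dot product in the 6-D mapped space
kernel = (xi @ xk + 1) ** 2      # same value, computed in the 2-D space

print(explicit, kernel)          # both 324.0
```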
Kernels
- Classifying x^k: sum_i α^i y^i φ(x^i)^T φ(x^k) + b
- Kernel trick: estimate the high-dimensional dot product with a function K(x^i, x^k)
- Now classifying x^k: sum_i α^i y^i K(x^i, x^k) + b

Radial basis kernel
- Try a projection to infinitely many dimensions: φ(x) = (x_1, ..., x_n, x_1^2, ..., x_n^2, ...)
- Taylor expansion: e^x = x^0/0! + x^1/1! + x^2/2! + x^3/3! + ...
- K(x^i, x^k) = exp( -||x^i - x^k||^2 / (2σ^2) )
- Note: ||x^i - x^k||^2 = (x^i - x^k)^T (x^i - x^k)
- Draws a separating plane that curves around all support vectors

Example RBF-kernel separator
- Large margin, non-linear separation

Potential dangers of the RBF-kernel separator
- Small margin -> overfitting, non-linear separation
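A minimal sketch of the RBF kernel as defined above (σ is a free hyper-parameter; 1.0 here is an arbitrary choice):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d = x - z
    return np.exp(-(d @ d) / (2 * sigma**2))

x = np.array([1.0, 0.0])
z = np.array([0.0, 0.0])
print(rbf_kernel(x, x))   # identical points -> 1.0, the kernel's maximum
print(rbf_kernel(x, z))   # exp(-0.5) ~ 0.6065; decays as points move apart
```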
The power of SVM (+kernels)
- Boundary defined by a few support vectors
  - Caused by: maximizing the margin
  - Causes: less overfitting
  - Similar to: regularization
- Kernels keep the number of learned parameters in check

Binary -> M-class classification
- Learn a boundary for class m vs. all other classes
- Only need M-1 separators for M classes: the M-th class is for data outside classes 1, 2, 3, ..., M-1
- Classify each data point x by the boundary that gives it the highest margin

Benefits of generative methods
- P(D|θ) and P(θ|D) can generate a non-linear boundary
- E.g.: Gaussians with multiple variances
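The M-1-separator scheme can be sketched on invented three-class data: train one linear SVM per class for classes 1..M-1, classify by the separator with the highest margin, and fall back to the leftover M-th class when no separator fires:

```python
import numpy as np
from sklearn.svm import SVC

# Three toy classes (illustrative): class 0 up top, class 1 to the right,
# class 2 at the bottom-left, which will be the "leftover" class.
X = np.array([[0, 5], [1, 5], [5, 0], [5, 1], [-5, -5], [-4, -5]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])

# M = 3 classes -> train only M-1 = 2 one-vs-rest separators.
seps = []
for m in range(2):
    ym = np.where(y == m, 1, -1)
    seps.append(SVC(kernel="linear", C=10.0).fit(X, ym))

def predict(x):
    """Pick the separator giving the highest margin w^T x + b;
    if none is positive, assign the leftover class M-1 (here, 2)."""
    scores = [s.decision_function([x])[0] for s in seps]
    best = int(np.argmax(scores))
    return best if scores[best] > 0 else 2

print(predict([0.0, 6.0]), predict([6.0, 0.0]), predict([-5.0, -4.0]))
```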