Which Separator? Spring 1

Owen Russell
6 years ago

1 Which Separator?

2-3 Which Separator?
Maximize the margin to the closest points.

4-5 Margin of a point
$\gamma^i \equiv y^i (w \cdot x^i + b)$ is proportional to the perpendicular distance of point $x^i$ to the hyperplane.
The geometric margin is $\gamma^i / \lVert w \rVert$.

6-7 Margin
$\gamma^i \equiv y^i (w \cdot x^i + b)$
Scaling $w$ changes the value of the margin but not the actual distances to the separator (geometric margin).
Pick the margin to the closest positive and negative points to be 1:
$+1\,(w \cdot x^1 + b) = 1$ and $-1\,(w \cdot x^2 + b) = 1$
Combining these: $w \cdot (x^1 - x^2) = 2$
Dividing by the length of $w$ gives the perpendicular distance between the lines (2 × geometric margin):
$\frac{w}{\lVert w \rVert} \cdot (x^1 - x^2) = \frac{2}{\lVert w \rVert}$
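The difference between the margin $y (w \cdot x + b)$ and the geometric margin can be checked numerically. The following is a minimal sketch in plain Python; the separator `w`, `b` and the point `x` are hypothetical values chosen only for illustration, and the helper names are not from the slides.

```python
import math

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def functional_margin(w, b, x, y):
    """y * (w . x + b): the margin of a point as defined on the slide."""
    return y * (dot(w, x) + b)

def geometric_margin(w, b, x, y):
    """Margin divided by ||w||: the perpendicular distance to the hyperplane."""
    return functional_margin(w, b, x, y) / math.sqrt(dot(w, w))

# Hypothetical separator and labeled point (illustration only).
w, b = [3.0, 4.0], -5.0
x, y = [3.0, 4.0], +1

fm = functional_margin(w, b, x, y)
gm = geometric_margin(w, b, x, y)

# Rescale (w, b) by 10: the margin value scales, the distance does not.
w10, b10 = [10 * wi for wi in w], 10 * b
fm10 = functional_margin(w10, b10, x, y)
gm10 = geometric_margin(w10, b10, x, y)
```

Because the geometric margin is invariant to the scale of $w$, the scale can be fixed by requiring the closest points to have margin exactly 1.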

8-9 Picking w to Maximize Margin
Pick $w$ to maximize the margin width $\frac{2}{\lVert w \rVert}$ (twice the geometric margin),
or, equivalently, minimize $\lVert w \rVert = \sqrt{w \cdot w}$,
or, equivalently, minimize $\frac{1}{2} \lVert w \rVert^2 = \frac{1}{2}\, w \cdot w = \frac{1}{2} \sum_j w_j^2$,
while classifying points correctly: $y^i (w \cdot x^i + b) \ge 1$ or, equivalently, $y^i (w \cdot x^i + b) - 1 \ge 0$ for all $i$.

10 Constrained Optimization
$\min_w \frac{1}{2} \lVert w \rVert^2$ subject to $y^i (w \cdot x^i + b) - 1 \ge 0,\ \forall i$

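11 Constrained optimization
[Figure: the same one-dimensional objective minimized under three settings.]
No constraint: $x^* = 0$. Constraint $x \ge -1$: $x^* = 0$. Constraint $x \ge 1$: $x^* = 1$.
How do we solve with constraints? Lagrange multipliers!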

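12 Lagrange multipliers: dual variables
Toy problem: $\min_x x^2$ subject to $x \ge b$.
Rewrite the constraint: $x - b \ge 0$.
Add a Lagrange multiplier $\alpha$ and introduce the Lagrangian (objective): $L(x, \alpha) = x^2 - \alpha (x - b)$.
Add a new constraint: $\alpha \ge 0$.
We will solve: $\min_x \max_{\alpha \ge 0} L(x, \alpha)$.
Why does this work at all? min is fighting max!
$x < b$: $(x - b) < 0$, so $\max_\alpha -\alpha (x - b) = \infty$; min won't let that happen.
$x > b$, $\alpha \ge 0$: $(x - b) > 0$, so $\max_\alpha -\alpha (x - b) = 0$ with $\alpha^* = 0$; min is cool with 0, and $L(x, \alpha) = x^2$ (original objective).
$x = b$: $\alpha$ can be anything, and $L(x, \alpha) = x^2$ (original objective).
Since min is on the outside, it can force max to behave, and the constraints will be satisfied.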

13-15 Constrained Optimization
$\min_w \frac{1}{2} \lVert w \rVert^2$ subject to $y^i (w \cdot x^i + b) - 1 \ge 0,\ \forall i$
Convert to unconstrained optimization by incorporating the constraints as an additional term:
$\min_{w,b} \left( \frac{1}{2} \lVert w \rVert^2 - \sum_i \alpha_i \left[ y^i (w \cdot x^i + b) - 1 \right] \right), \quad \alpha_i \ge 0,\ \forall i$
The $\alpha_i$ are the Lagrange multipliers.
To minimize the expression: minimize the first (original) term and maximize the second (constraint) term. Since $\alpha_i > 0$, this encourages the constraints to be satisfied, but we want the least distortion of the original term.
This is the method of Lagrange multipliers.

16-18 Maximizing the Margin
$L(w, b) = \frac{1}{2} \lVert w \rVert^2 - \sum_i \alpha_i \left[ y^i (w \cdot x^i + b) - 1 \right]$
Minimized when: $w^* = \sum_i \alpha_i y^i x^i$ and $\sum_i \alpha_i y^i = 0$
Substituting $w^*$ into $L$ yields the dual Lagrangian:
$L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{k=1}^{m} \alpha_i \alpha_k y^i y^k \, x^i \cdot x^k$
Only dot products of the feature vectors appear.
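The dual Lagrangian can be evaluated directly from pairwise dot products. The following is a minimal sketch on a hypothetical two-point data set; the multipliers `alpha = [0.25, 0.25]` are the known optimum for this particular toy set, and all names are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 1/2 sum_i sum_k alpha_i alpha_k y_i y_k (x_i . x_k)."""
    m = len(X)
    pair_sum = sum(
        alpha[i] * alpha[k] * y[i] * y[k] * dot(X[i], X[k])
        for i in range(m)
        for k in range(m)
    )
    return sum(alpha) - 0.5 * pair_sum

def w_from_alpha(alpha, X, y):
    """w* = sum_i alpha_i y_i x_i, computed one coordinate at a time."""
    return [
        sum(alpha[i] * y[i] * X[i][j] for i in range(len(X)))
        for j in range(len(X[0]))
    ]

# Hypothetical toy data: one positive and one negative point.
X = [[2.0, 2.0], [0.0, 0.0]]
y = [+1, -1]
alpha = [0.25, 0.25]  # satisfies sum_i alpha_i y_i = 0

w_star = w_from_alpha(alpha, X, y)
dual_value = dual_objective(alpha, X, y)
primal_value = 0.5 * dot(w_star, w_star)  # 1/2 ||w*||^2
```

At the optimum the dual value and the primal value $\frac{1}{2} \lVert w^* \rVert^2$ coincide, which gives a quick sanity check on a candidate solution.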

19-22 Dual Lagrangian
$\max_\alpha L(\alpha)$ subject to $\sum_i \alpha_i y^i = 0$ and $\alpha_i \ge 0,\ \forall i$
In general, since $\alpha_i \ge 0$, either
$\alpha_i = 0$: the constraint is satisfied with no distortion, or
$\alpha_i > 0$: the constraint is satisfied with equality ($x^i$ is known as a support vector).
At the optimum: $w^* = \sum_i \alpha_i y^i x^i$ and $b = \frac{1}{y^i} - w^* \cdot x^i$ for any support vector $x^i$.
$L(\alpha)$ has a unique maximum vector, which can be found using quadratic programming or gradient ascent.
[Figure: training points annotated with their $\alpha_i$ values; several are labeled $\alpha_i = 0$.]
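Gradient ascent on the dual is easiest to see on a problem with one positive and one negative point. In that case the constraint $\sum_i \alpha_i y^i = 0$ forces both multipliers to share one value $a$, and the dual reduces to $2a - \frac{1}{2} a^2 \lVert x^+ - x^- \rVert^2$. The sketch below uses hypothetical data and a hand-picked step size; it is an illustration under those assumptions, not a general-purpose solver.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def solve_two_point_dual(x_pos, x_neg, lr=0.1, iters=200):
    """Gradient ascent on L(a) = 2a - 0.5 * a^2 * ||x_pos - x_neg||^2."""
    diff = [p - n for p, n in zip(x_pos, x_neg)]
    d2 = dot(diff, diff)
    a = 0.0
    for _ in range(iters):
        grad = 2.0 - a * d2            # dL/da
        a = max(0.0, a + lr * grad)    # ascent step, keeping a >= 0
    return a

# Hypothetical toy data: x_pos has label +1, x_neg has label -1.
x_pos, x_neg = [2.0, 2.0], [0.0, 0.0]

a = solve_two_point_dual(x_pos, x_neg)

# Recover the separator: w* = sum_i alpha_i y_i x_i, b = 1/y_i - w* . x_i.
w_star = [a * (p - n) for p, n in zip(x_pos, x_neg)]
b_star = 1.0 / (+1) - dot(w_star, x_pos)
```

The step size must stay below $2 / \lVert x^+ - x^- \rVert^2$ for this iteration to converge; with more than two points, the equality constraint has to be handled explicitly, which is where a quadratic programming solver is the usual choice.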

23 SVM Classifier
Given unknown vector $u$, predict class (1 or -1) as follows:
$h(u) = \operatorname{sign}\left( \sum_{i=1}^{k} \alpha_i y^i \, x^i \cdot u + b \right)$
The sum is over the $k$ support vectors.
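The decision rule translates almost line for line into code. A minimal sketch, assuming the support vectors, labels, multipliers, and offset are already known; the numeric values below are hypothetical, and treating a score of exactly zero as class +1 is an arbitrary convention chosen here.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def svm_score(u, support_vectors, labels, alphas, b):
    """sum_i alpha_i y_i (x_i . u) + b, summed over the support vectors only."""
    return sum(
        a * yi * dot(xi, u)
        for a, yi, xi in zip(alphas, labels, support_vectors)
    ) + b

def svm_predict(u, support_vectors, labels, alphas, b):
    """Sign of the score; a score of exactly 0 is mapped to +1 by convention."""
    return 1 if svm_score(u, support_vectors, labels, alphas, b) >= 0 else -1

# Hypothetical trained classifier with two support vectors.
support_vectors = [[2.0, 2.0], [0.0, 0.0]]
labels = [+1, -1]
alphas = [0.25, 0.25]
b = -1.0

pred_far = svm_predict([3.0, 3.0], support_vectors, labels, alphas, b)
pred_near = svm_predict([0.5, 0.5], support_vectors, labels, alphas, b)
```

Training points with $\alpha_i = 0$ contribute nothing to the sum, so they never need to be stored for prediction.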

24 Bankruptcy Example
$\alpha_i y^i$ for support vectors are non-zero; all others are zero.

25-29 Key Points
Learning depends only on dot products of sample pairs.
Recognition depends only on dot products of the unknown with samples.
Exclusive reliance on dot products enables an approach to non-linearly-separable problems.
The classifier depends only on the support vectors, not on all the training points.
Max margin lowers hypothesis variance.
The optimal classifier is defined uniquely: there are no local maxima in the search space.
Polynomial in the number of data points and dimensionality.

30 Not Linearly Separable?
Require $0 \le \alpha_i \le C$.
$C$ is specified by the user; it controls the tradeoff between the size of the margin and classification errors.
$C = \infty$ for the separable case.

31-32 C Change
[Figures: separators for different values of $C$; the surviving panel labels are C=10 and C=100.]

33 Example: Linearly Separable
Image by Patrick Winston

34 Another example: Not linearly separable
Image by Patrick Winston

35 Isn't a linear classifier very limiting?
[Figure: data that is not linearly separable becomes linearly separable using the squared value of features.]
Important: a linear separator in the transformed feature space maps into a non-linear separator in the original feature space.

36 Not separable? Try a higher dimensional space!
Not separable with a 2D line. Separable with a 3D plane.

37-38 What you need
To get into the new feature space, you use $\Phi(x^i)$.
The transformation can be to a higher-dimensional feature space and may be non-linear in the feature values.
Recall that SVMs only use dot products of the data, so:
To optimize the classifier, you need $\Phi(x^i) \cdot \Phi(x^k)$.
To run the classifier, you need $\Phi(x^i) \cdot \Phi(u)$.
So, all you need is a way to compute dot products in the transformed space as a function of vectors in the original space!

39 The Kernel Trick
If dot products can be efficiently computed by $\Phi(x^i) \cdot \Phi(x^k) = K(x^i, x^k)$,
then all you need is a function on low-dimensional inputs, $K(x^i, x^k)$.
You don't ever need to construct the high-dimensional $\Phi(x^i)$.

40-41 Standard Choices For Kernels
No change (linear kernel): $\Phi(x^i) \cdot \Phi(x^k) = K(x^i, x^k) = x^i \cdot x^k$
Polynomial kernel ($n$th order): $K(x^i, x^k) = (1 + x^i \cdot x^k)^n$

42-43 Polynomial Kernel Example (one feature)
Not separable in the original feature $x$.
$\Phi(x) = (x^2, \sqrt{2}\,x, 1)$: separable.
$\Phi(x) \cdot \Phi(z) = x^2 z^2 + 2xz + 1 = (1 + xz)^2$
[Figure: negative and positive points plotted against $x$, and against $\sqrt{2}\,x$ and $x^2$.]

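44 Polynomial Kernel
The polynomial kernel for $n = 2$ and 2 features, $x = [x_1\ x_2]$,
$K(x, z) = (1 + x \cdot z)^2$
is equivalent to the following feature mapping:
$\Phi(x) = [x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ 1]$
We can verify that:
$\Phi(x) \cdot \Phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = (1 + x_1 z_1 + x_2 z_2)^2 = (1 + x \cdot z)^2 = K(x, z)$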
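The equivalence between the second-order polynomial kernel and an explicit six-dimensional feature map, $\Phi(x) = [x_1^2, x_2^2, \sqrt{2}\,x_1 x_2, \sqrt{2}\,x_1, \sqrt{2}\,x_2, 1]$, can be confirmed numerically. A minimal sketch; the input vectors are hypothetical and the function names are illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit feature map for the 2nd-order polynomial kernel on 2 features."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]

def poly_kernel(x, z, n=2):
    """K(x, z) = (1 + x . z)^n, computed in the original 2-D space."""
    return (1.0 + dot(x, z)) ** n

# Hypothetical inputs (illustration only).
x = [1.0, 2.0]
z = [3.0, -1.0]

explicit = dot(phi(x), phi(z))  # dot product in the 6-D transformed space
implicit = poly_kernel(x, z)    # same quantity without building phi
```

The kernel evaluation costs one 2-D dot product and one power, while the explicit route builds two 6-D vectors first; the gap widens quickly as the order $n$ and the number of features grow.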

45 Polynomial Kernel
Images by Patrick Winston

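46 Standard Choices For Kernels
No change (linear kernel): $\Phi(x^i) \cdot \Phi(x^k) = K(x^i, x^k) = x^i \cdot x^k$
Polynomial kernel ($n$th order): $K(x^i, x^k) = (1 + x^i \cdot x^k)^n$
Radial basis kernel ($\sigma$ is the standard deviation): $K(x^i, x^k) = e^{-\lVert x^i - x^k \rVert^2 / 2\sigma^2} = e^{-(x^i - x^k) \cdot (x^i - x^k) / 2\sigma^2}$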

47 Radial-basis kernel
Classifier based on a sum of Gaussian bumps with standard deviation $\sigma$, centered on the support vectors.
$h(u) = \operatorname{sign}[h'(u)]$
$h'(u) = \sum_{i=1}^{k} \alpha_i y^i K(x^i, u) + b$
$K(x^i, u) = e^{-\frac{1}{2\sigma^2} \lVert x^i - u \rVert^2}$
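A radial-basis classifier is a weighted sum of Gaussian bumps centered on the support vectors. The sketch below assumes the common form $K(x, u) = e^{-\lVert x - u \rVert^2 / 2\sigma^2}$ with $\sigma$ as the standard deviation; the support vectors, coefficients, offset, and $\sigma$ are hypothetical values, and `coeffs` holds the products $\alpha_i y^i$.

```python
import math

def rbf_kernel(x, u, sigma):
    """Gaussian bump: exp(-||x - u||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - ui) ** 2 for xi, ui in zip(x, u))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rbf_score(u, support_vectors, coeffs, b, sigma):
    """h'(u) = sum_i (alpha_i * y_i) * K(x_i, u) + b."""
    return sum(
        c * rbf_kernel(x, u, sigma)
        for c, x in zip(coeffs, support_vectors)
    ) + b

def rbf_classify(u, support_vectors, coeffs, b, sigma):
    """h(u) = sign(h'(u)); a score of exactly 0 is mapped to +1 by convention."""
    return 1 if rbf_score(u, support_vectors, coeffs, b, sigma) >= 0 else -1

# Hypothetical classifier: one positive and one negative bump.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
coeffs = [+1.0, -1.0]  # alpha_i * y_i
b, sigma = 0.0, 0.5

at_positive_center = rbf_classify([0.0, 0.0], support_vectors, coeffs, b, sigma)
at_negative_center = rbf_classify([1.0, 1.0], support_vectors, coeffs, b, sigma)
```

A small $\sigma$ makes each bump narrow, so the decision surface hugs individual support vectors; a large $\sigma$ spreads the bumps out and smooths the boundary.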

48-50 Radial-basis kernel
[Figure: radial-basis classifier with four support vectors; the value of $\sigma$ is not legible in this copy.]
$y^1 \alpha_1 = 1.76$, $y^2 \alpha_2 = -1.76$, $y^3 \alpha_3 = 1.76$, $y^4 \alpha_4 = -1.76$, $b = 0.55$
$h'(u) = \sum_{i=1}^{4} \alpha_i y^i K(x^i, u) + b$, with $K(x^i, u) = e^{-\frac{1}{2\sigma^2} \lVert x^i - u \rVert^2}$

51 Radial-basis kernel (large σ)
Images by Patrick Winston

52 Another radial-basis example (small σ)
Image by Patrick Winston

53-54 Cross-Validation Error
Does mapping to a very high-dimensional space lead to over-fitting?
Generally, no, thanks to the fact that only the support vectors determine the decision surface.
The expected leave-one-out cross-validation error depends on the number of support vectors, not on the dimensionality of the feature space:
Expected CV error $\le$ (Expected # support vectors) / (# training samples)
If most data points are support vectors, that is a sign of possible overfitting, independent of the dimensionality of the feature space.
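The bound is a single ratio, so it is cheap to check after training. A minimal sketch with hypothetical counts; the function name is illustrative, and the result is an upper bound on the expected leave-one-out error rather than a measurement of it.

```python
def loo_error_bound(num_support_vectors, num_training_samples):
    """Upper bound on expected leave-one-out CV error: #SV / #samples."""
    if num_training_samples <= 0:
        raise ValueError("need at least one training sample")
    return num_support_vectors / num_training_samples

# Hypothetical counts (illustration only).
few_svs = loo_error_bound(40, 500)    # small fraction of support vectors
many_svs = loo_error_bound(450, 500)  # most points are support vectors
```

When the ratio approaches 1, as in the second call, the bound says almost nothing, which matches the slide's warning that a large share of support vectors signals possible overfitting.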

55-57 Summary
A single global maximum: quadratic programming or gradient descent.
Fewer parameters: $C$ and kernel parameters ($n$ for the polynomial kernel, $\sigma$ for the radial basis kernel).
Kernel: quadratic minimization depends only on dot products of sample vectors.
Recognition depends only on dot products of the unknown vector with sample vectors.
Reliance on only dot products enables efficient feature mapping to higher-dimensional spaces where linear separation is more effective.

58 Real Data
Wisconsin Breast Cancer Data
9 features
C=1
37 support vectors are used from 512 training data points
12 prediction errors on the training set (98% accuracy)
96% accuracy on 171 held-out points
Essentially the same performance as nearest neighbors and decision trees
Don't expect such good performance on every data set

59 Success Stories
Gene microarray data: outperformed all other classifiers; specially designed kernel.
Text categorization: linear kernel in >10,000 D input space; best prediction performance; 35 times faster to train than the next best classifier (decision trees).
Many others: applist.html
