FMA901F: Machine Learning Lecture 5: Support Vector Machines. Cristian Sminchisescu

1 FMA901F: Machine Learning Lecture 5: Support Vector Machines Cristian Sminchisescu

2 Back to Binary Classification Setup We are given a finite, possibly noisy, set of training data: $\{(x_i, y_i)\}$, $i = 1, \dots, n$. Each input $x_i$ is paired with a binary output $y_i \in \{+1, -1\}$. Based only on the training data, construct a machine that generates outputs given inputs. Now a new sample is drawn from the same distribution as the training sample. We wish to run the machine on the new sample input and classify it correctly, as either positive or negative.

3 Example: Face Detection

4 Discriminant Function Once again, we will restrict our attention to learning machines that separate the positive and negative examples using a linear function with parameters $w$ and $b$: $g(x) = w^\top x + b$. Linear functions: here the bias is denoted $b$.

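5 Linear Discriminant Function $g(x)$ is a linear function: $g(x) = w^\top x + b$. It defines a hyperplane in feature space, with $w^\top x + b > 0$ on the positive side and $w^\top x + b < 0$ on the negative side. The unit-length normal vector of the hyperplane is $n = w / \|w\|$. [Figure: positive and negative points in the $(x_1, x_2)$ plane, separated by a hyperplane with normal $n$.]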

6 Linear Discriminant Function How can we classify the data using a linear discriminant in order to minimize the error rate? [Figure: positive and negative points in the $(x_1, x_2)$ plane.]

9 Linear Discriminant Function How can we classify the data using a linear discriminant in order to minimize the error rate? Many possible answers! Which one is the best? A: The best is the one that gives the lowest error on new test data! [Figure: positive and negative points in the $(x_1, x_2)$ plane.]

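10 Large Margin Linear Classifier One option: the linear discriminant function with the maximum margin. The geometric margin of a point is its signed distance to a separating hyperplane, $\rho(x_i, y_i) = y_i (w^\top x_i + b) / \|w\|$, and the margin of the hyperplane is the distance from the point closest to it: $\rho = \min_{i = 1, \dots, n} \rho(x_i, y_i)$. Examples closest to the hyperplane are support vectors. The discriminant margin is the maximum width of the band that can be drawn separating contrastive support vectors. [Figure: positive and negative points in the $(x_1, x_2)$ plane, with the separation zone, the margin, and the support vectors $x^+$, $x^-$ marked.]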
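The margin definition above can be checked numerically. The following is a minimal sketch in plain Python; the function name `geometric_margin` and the toy data are illustrative assumptions, not part of the lecture.

```python
import math

def geometric_margin(w, b, X, y):
    """Smallest signed distance y_i * (w . x_i + b) / ||w|| over the data set."""
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    return min(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) / norm_w
        for xi, yi in zip(X, y)
    )

# Hypothetical toy data: two positive and two negative points in the plane.
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-4.0, 2.0)]
y = [+1, +1, -1, -1]

margin = geometric_margin((1.0, 0.0), 0.0, X, y)
# Rescaling (w, b) describes the same hyperplane, so the margin is unchanged.
margin_rescaled = geometric_margin((5.0, 0.0), 0.0, X, y)
```

Because the geometric margin divides by $\|w\|$, it does not depend on how $(w, b)$ is scaled, which is what makes the canonical normalization used later possible.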

11 Large Margin Linear Classifier

12 VC Dimension Consider a binary classification problem and a function class. Each function of the class induces a labeling of the patterns. There are at most $2^m$ labelings for $m$ patterns. A very rich function class might be able to realize all $2^m$ separations, in which case it is said to shatter the $m$ points. However, the function class may not be rich enough. The VC dimension is defined as the largest $m$ such that there exists a set of $m$ points which the class can shatter, or $\infty$ if no such $m$ exists. It is a one-number summary of the capacity of the learning machine.

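13 Cover's Theorem Gives the number of possible linear separations of $m$ points, in general position, in a $d$-dimensional space. If $m \le d + 1$, then all $2^m$ separations are possible, so the VC dimension is $d + 1$. If $m > d + 1$, the number of linear separations is $2 \sum_{i=0}^{d} \binom{m-1}{i}$. As we increase $d$, there are more terms in the sum and the VC dimension grows. Points are assumed to be in general position; however, in practical applications the points could lie on a lower-dimensional manifold.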
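The counting formula can be evaluated directly. This is a small sketch assuming the standard form of Cover's function counting theorem for hyperplanes with a bias term; the symbols `m` (number of points) and `d` (input dimension) and the function name `cover_count` are naming assumptions made here.

```python
from math import comb

def cover_count(m, d):
    """Number of linearly separable labelings of m points in general
    position in R^d, using hyperplanes with a bias term."""
    # comb(n, k) is 0 when k > n, so the sum saturates at 2**(m-1) terms' worth.
    return 2 * sum(comb(m - 1, i) for i in range(d + 1))

three_points_in_plane = cover_count(3, 2)  # all 2**3 labelings are separable
four_points_in_plane = cover_count(4, 2)   # 14 of the 2**4 = 16 labelings
```

The two labelings missing for four points in the plane are the XOR-type ones, which is consistent with the VC dimension of lines in the plane being 3.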

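14 Large Margin Linear Classifier Given a set of data points $\{(x_i, y_i)\}$, $i = 1, 2, \dots, n$, where: for $y_i = +1$, $w^\top x_i + b > 0$; for $y_i = -1$, $w^\top x_i + b < 0$. Canonical hyperplane: under a scale transformation on both $w$ and $b$, we can remove the gauge freedom in the above and require: for $y_i = +1$, $w^\top x_i + b \ge 1$; for $y_i = -1$, $w^\top x_i + b \le -1$. [Figure: positive and negative points in the $(x_1, x_2)$ plane.]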

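15 Large Margin Linear Classifier We know that $w^\top x^+ + b = 1$ and $w^\top x^- + b = -1$. The separation is $S = n^\top (x^+ - x^-) = \frac{w^\top (x^+ - x^-)}{\|w\|} = \frac{2}{\|w\|}$. [Figure: margin between support vectors $x^+$ and $x^-$, with unit normal $n$, in the $(x_1, x_2)$ plane.]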

16 Large Margin Linear Classifier Formulation: maximize $\frac{2}{\|w\|}$ over $w, b$ such that: for $y_i = +1$, $w^\top x_i + b \ge 1$; for $y_i = -1$, $w^\top x_i + b \le -1$.

17 Large Margin Linear Classifier Formulation: minimize $\frac{1}{2} \|w\|^2$ over $w, b$ such that: for $y_i = +1$, $w^\top x_i + b \ge 1$; for $y_i = -1$, $w^\top x_i + b \le -1$.

18 Large Margin Linear Classifier Formulation: minimize $\frac{1}{2} \|w\|^2$ over $w, b$ such that $y_i (w^\top x_i + b) \ge 1$ for all $i$. This is a quadratic program with linear constraints.

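19 Solving the Optimization Problem Quadratic programming with linear constraints: minimize $\frac{1}{2} \|w\|^2$ s.t. $y_i (w^\top x_i + b) \ge 1$. Lagrangian function: $L_p(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$ s.t. $\alpha_i \ge 0$. The Lagrangian needs to be minimized w.r.t. $w$ and $b$, and maximized w.r.t. $\alpha$.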

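20 Solving the Optimization Problem $L_p(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$ s.t. $\alpha_i \ge 0$. Setting $\frac{\partial L_p}{\partial w} = 0$ gives $w = \sum_{i=1}^{n} \alpha_i y_i x_i$; setting $\frac{\partial L_p}{\partial b} = 0$ gives $\sum_{i=1}^{n} \alpha_i y_i = 0$. The solution is an expansion in terms of training examples. Due to strict convexity, $w$ is unique, although the $\alpha_i$ need not be.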

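21 Solving the Optimization Problem $L_p(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$ s.t. $\alpha_i \ge 0$. Lagrangian dual problem: maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$ s.t. $\alpha_i \ge 0$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.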

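22 Solving the Optimization Problem From the KKT conditions we know: $\alpha_i \left[ y_i (w^\top x_i + b) - 1 \right] = 0$. Thus only support vectors have $\alpha_i \ne 0$. The solution has the form $w = \sum_{i=1}^{n} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$. Get $b$ from $y_i (w^\top x_i + b) - 1 = 0$, where $x_i$ is a support vector. [Figure: support vectors $x^+$, $x^-$ in the $(x_1, x_2)$ plane.]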
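As an illustration of how the primal solution is read off the dual one, here is a minimal sketch in plain Python. The helper name `recover_primal`, the tolerance, and the three-point toy problem are assumptions for this example; the multipliers were worked out by hand for that toy problem and are not produced by a solver here.

```python
def recover_primal(alpha, X, y, tol=1e-9):
    """Recover (w, b) and the support-vector indices from dual multipliers."""
    dim = len(X[0])
    # w is an expansion in the training examples: w = sum_i alpha_i y_i x_i.
    w = [sum(a * yi * xi[k] for a, xi, yi in zip(alpha, X, y)) for k in range(dim)]
    # Only support vectors have a non-zero multiplier.
    sv = [i for i, a in enumerate(alpha) if a > tol]
    # For a support vector, y_i (w . x_i + b) = 1 and y_i**2 = 1, so
    # b = y_i - w . x_i. Averaging over support vectors is a common choice.
    b = sum(y[i] - sum(wk * xk for wk, xk in zip(w, X[i])) for i in sv) / len(sv)
    return w, b, sv

# Hypothetical toy problem: the third point lies well outside the margin.
X = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]
y = [+1, -1, +1]
alpha = [0.5, 0.5, 0.0]  # hand-derived dual solution for this toy problem

w, b, support = recover_primal(alpha, X, y)
```

For this toy problem the recovered hyperplane is $w = (1, 0)$, $b = 0$, so the separation $2 / \|w\|$ equals the distance between the two support vectors.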

23 Solving the Optimization Problem The linear discriminant function is: $g(x) = w^\top x + b = \sum_{i \in SV} \alpha_i y_i \, x_i^\top x + b$. It relies on a dot product between the test point and the support vectors. Solving the optimization problem involved computing the dot products between all pairs of training points.

24 Soft Margin Linear Classifier What if the data is not linearly separable due to noise or outliers? Slack variables $\xi_i$ can be added to allow for the misclassification of difficult or noisy data. [Figure: points labeled +1 and -1 in the $(x_1, x_2)$ plane, with slacks $\xi_1$, $\xi_2$ marked.]

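25 Soft Margin Linear Classifier Formulation: minimize $\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$ over $w, b, \xi$ such that $y_i (w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$. For $0 < \xi_i \le 1$, the point is between the margin and the correct side of the hyperplane; for $\xi_i > 1$, the point is misclassified. The parameter $C$ can be viewed as a means to control overfitting: small $C$ allows constraints to be easily ignored (large margin); large $C$ makes constraints hard to ignore (narrow margin); $C = \infty$ enforces all constraints (hard margin).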

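26 Soft Margin Linear Classifier Formulation (Lagrangian dual problem): maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$ such that $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.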

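27 Soft Margin Interpretation (I) The constraint $y_i g(x_i) \ge 1 - \xi_i$, together with $\xi_i \ge 0$, can be written more concisely as $\xi_i = \max(0, 1 - y_i g(x_i))$. Hence we need to solve the learning problem $\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i g(x_i))$.

28 Soft Margin Interpretation (II) We need to solve the learning problem $\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i g(x_i))$. If $y_i g(x_i) > 1$, the point is outside the margin and does not contribute to the loss. If $y_i g(x_i) = 1$, the point is on the margin and does not contribute to the loss (as in the hard margin case). If $y_i g(x_i) < 1$, the point violates the margin constraint and contributes to the loss.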

29 SVM Uses Hinge Loss The hinge loss can be viewed as an approximation to the 0-1 loss.
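The relation between the two losses is easy to tabulate. The sketch below, in plain Python, evaluates both as functions of the margin value $y \, g(x)$; the function names and the sample margins are illustrative assumptions.

```python
def hinge_loss(margin):
    """Hinge loss max(0, 1 - margin), where margin = y * g(x)."""
    return max(0.0, 1.0 - margin)

def zero_one_loss(margin):
    """0-1 loss: 1 for a point on the wrong side or on the boundary, else 0."""
    return 1.0 if margin <= 0.0 else 0.0

# Margins: outside the margin, on the margin, inside the margin,
# on the decision boundary, and misclassified.
margins = [2.0, 1.0, 0.5, 0.0, -1.0]
hinge = [hinge_loss(m) for m in margins]        # [0.0, 0.0, 0.5, 1.0, 2.0]
zero_one = [zero_one_loss(m) for m in margins]  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

At every margin value the hinge loss is at least as large as the 0-1 loss, so it acts as a convex upper bound that can be minimized with standard convex optimization.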

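30 Non-linear SVMs Datasets that are linearly separable with noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? [Figures: 1-D data on the $x$ axis in an easy case and a hard case; the hard case plotted against $x$ and $x^2$.]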

31 Non-linear SVMs: Feature Space General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable, $\Phi: x \mapsto \varphi(x)$.

32 How to Use the Feature Space? The feature point $\varphi(x)$ corresponding to an input point $x$ is called the image (or the lifting) of $x$; the input point, if any, corresponding to a given feature vector is called the pre-image of that vector. The naive way to use a feature space is to explicitly compute the image of every training and testing point, and run the algorithm fully in feature space. Two potential problems: (1) The feature space may be very high-dimensional or infinite-dimensional, so direct (explicit) calculations in such a feature space may not be practical, or even possible. (2) We may sometimes want to map an answer back from feature space to the input space. This is called the pre-image problem. For some kernels analytical expressions are available, but in most other cases some form of (local) optimization may be necessary.

33 Nonlinear SVMs: The Kernel Trick With this mapping, our discriminant function is now: $g(x) = w^\top \varphi(x) + b = \sum_{i \in SV} \alpha_i y_i \, \varphi(x_i)^\top \varphi(x) + b$. There is no need to know this mapping explicitly, because we only use the dot product of feature vectors, both in training and in testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: $K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j)$.

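34 Positive Definite Kernels Gram matrix. Given a function $k: \mathcal{X}^2 \to \mathbb{K}$ (where $\mathbb{K} = \mathbb{C}$ or $\mathbb{R}$) and patterns $x_1, \dots, x_m \in \mathcal{X}$, the $m \times m$ matrix $K$ with elements $K_{ij} = k(x_i, x_j)$ is called the Gram matrix (or kernel matrix) of $k$ w.r.t. $x_1, \dots, x_m$. Positive definite matrix. A complex $m \times m$ matrix $K$ satisfying $\sum_{i,j} c_i \bar{c}_j K_{ij} \ge 0$ for all $c_i \in \mathbb{C}$ is called positive definite. Similarly, a real symmetric $m \times m$ matrix satisfying the above for all $c_i \in \mathbb{R}$ is called positive definite. A function $k$ whose Gram matrix is positive definite for every choice of patterns is called a positive definite kernel. Other names used for positive definite kernels: Mercer kernels, reproducing kernels, admissible kernels, support vector kernels, covariance functions.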

35 Examples of Kernels Examples of commonly used kernel functions: Linear kernel: $K(x_i, x_j) = x_i^\top x_j$. Polynomial kernel: $K(x_i, x_j) = (1 + x_i^\top x_j)^p$. Gaussian (Radial Basis Function, RBF) kernel: $K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2 \sigma^2} \right)$. Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 \, x_i^\top x_j + \beta_1)$.
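The kernels listed above are short enough to write out. The following plain-Python sketch is illustrative only: the function names and default parameter values are assumptions, and the explicit feature map `phi_poly2_1d` is a standard textbook construction added here to show that a kernel value equals a dot product in an expanded feature space.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, p=2):
    return (1.0 + dot(u, v)) ** p

def gaussian_kernel(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, beta0=1.0, beta1=0.0):
    return math.tanh(beta0 * dot(u, v) + beta1)

def phi_poly2_1d(x):
    """Explicit feature map for the degree-2 polynomial kernel on scalars:
    (1 + x*y)**2 = 1 + 2*x*y + x**2 * y**2 = phi(x) . phi(y)."""
    return (1.0, math.sqrt(2.0) * x, x * x)

x, y = 0.7, -1.3
via_kernel = polynomial_kernel((x,), (y,), p=2)
via_feature_map = dot(phi_poly2_1d(x), phi_poly2_1d(y))
```

Both routes give the same number, but the kernel route never builds the feature vector, which is what makes very high-dimensional feature spaces usable.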

36 Generality of the Kernel Trick Given an algorithm expressed in terms of a positive definite kernel $k$, we can construct an alternative algorithm by replacing $k$ with another positive definite kernel $\tilde{k}$. This is not limited to cases where $k$ is a dot product in the input domain. Any algorithm that only depends on dot products (i.e. is rotationally invariant) can be kernelized. Kernels are defined on general sets (rather than just dot product spaces!) and their use leads to an embedding of general data types in linear spaces.

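37 Nonlinear SVM: Optimization Formulation (Lagrangian dual problem): maximize $\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$ such that $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$. The solution of the discriminant function is $g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$.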

38 Support Vector Machine: Algorithm 1. Choose a kernel function 2. Choose a value for $C$ 3. Solve the quadratic programming problem (many software packages available, e.g. libsvm) 4. Construct the discriminant function from the support vectors

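39 Sequential Minimal Optimization For any two multipliers $\alpha_1, \alpha_2$, the constraints reduce to $0 \le \alpha_1, \alpha_2 \le C$ and $y_1 \alpha_1 + y_2 \alpha_2 = \gamma$ (a constant), and the resulting two-variable problem can be solved analytically. The algorithm: 1. Find a Lagrange multiplier $\alpha_1$ that violates the KKT conditions for the optimization problem 2. Pick a second multiplier $\alpha_2$ and optimize the pair $(\alpha_1, \alpha_2)$ 3. Repeat steps 1 and 2 until convergence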
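The three steps above can be sketched in code. What follows is a simplified SMO variant in plain Python with a linear kernel, written for illustration: it picks the second multiplier at random instead of using the selection heuristics of full SMO, and the function names, default parameters (`C`, `tol`, `max_passes`, `max_sweeps`, `seed`) and toy data are all assumptions rather than anything specified in the lecture.

```python
import random

def simplified_smo(X, y, C=10.0, tol=1e-3, max_passes=20, max_sweeps=1000, seed=0):
    """Simplified SMO for the soft-margin dual (linear kernel). Returns (alpha, b)."""
    rng = random.Random(seed)
    n = len(X)
    # Dot products between all pairs of training points.
    K = [[sum(p * q for p, q in zip(X[i], X[j])) for j in range(n)] for i in range(n)]
    alpha, b = [0.0] * n, 0.0

    def g(i):
        # Current discriminant value at training point i.
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b

    passes = 0
    for _ in range(max_sweeps):
        if passes >= max_passes:
            break
        changed = 0
        for i in range(n):
            E_i = g(i) - y[i]
            # Step 1: does alpha_i violate the KKT conditions (within tol)?
            if not ((y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0)):
                continue
            # Step 2: pick a second multiplier (at random in this sketch).
            j = rng.choice([k for k in range(n) if k != i])
            E_j = g(j) - y[j]
            a_i, a_j = alpha[i], alpha[j]
            # Feasible interval for alpha_j given the box and equality constraints.
            if y[i] != y[j]:
                lo, hi = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                lo, hi = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = 2.0 * K[i][j] - K[i][i] - K[j][j]
            if lo >= hi or eta >= 0.0:
                continue
            # Analytic optimum of the two-variable problem, clipped to [lo, hi].
            new_j = min(hi, max(lo, a_j - y[j] * (E_i - E_j) / eta))
            if abs(new_j - a_j) < 1e-8:
                continue
            new_i = a_i + y[i] * y[j] * (a_j - new_j)
            b1 = b - E_i - y[i] * (new_i - a_i) * K[i][i] - y[j] * (new_j - a_j) * K[i][j]
            b2 = b - E_j - y[i] * (new_i - a_i) * K[i][j] - y[j] * (new_j - a_j) * K[j][j]
            alpha[i], alpha[j] = new_i, new_j
            if 0.0 < new_i < C:
                b = b1
            elif 0.0 < new_j < C:
                b = b2
            else:
                b = 0.5 * (b1 + b2)
            changed += 1
        # Step 3: stop after max_passes consecutive sweeps without any update.
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

def predict(alpha, b, X, y, x):
    g = sum(a * yi * sum(p * q for p, q in zip(xi, x)) for a, yi, xi in zip(alpha, y, X)) + b
    return 1 if g >= 0 else -1

# Hypothetical, well-separated toy data.
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [+1, +1, +1, -1, -1, -1]
alpha, b = simplified_smo(X, y)
predictions = [predict(alpha, b, X, y, xi) for xi in X]
```

Each pair update moves both multipliers along the line fixed by the equality constraint, which is why the sum of $\alpha_i y_i$ stays at zero throughout. A production solver such as libsvm adds working-set selection, caching, and shrinking on top of this idea.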

40 SVM Applet Demo

41 Properties of Kernels Kernels are symmetric in their arguments: $K(x_1, x_2) = K(x_2, x_1)$. They are non-negative on the diagonal: $K(x, x) \ge 0$ (off-diagonal values can be negative, e.g. for the linear kernel). The Cauchy-Schwarz inequality holds: $K(x_1, x_2)^2 \le K(x_1, x_1) \, K(x_2, x_2)$. Technically, to use a function as a kernel, it must satisfy Mercer's conditions for a positive definite operator. The intuition is easy to grasp for finite spaces: discretize the space as densely as desired into buckets; between each two cells $i, j$, compute the kernel function $K(x_i, x_j)$ and write these values as a (symmetric) matrix $K_{ij}$; if the matrix is positive definite, the kernel is OK.
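The finite-space intuition can be tried out directly. The sketch below, in plain Python, builds the kernel matrix on a small grid and tests it with a Cholesky factorization; the helper names, the grid, and the choice of a strict positive-definiteness test (which would reject matrices that are only semi-definite) are assumptions of this example.

```python
import math

def gram_matrix(kernel, points):
    """Kernel values between every pair of grid cells."""
    return [[kernel(p, q) for q in points] for p in points]

def is_symmetric(K, tol=1e-12):
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol for i in range(n) for j in range(n))

def is_positive_definite(K):
    """Strict test: a Cholesky factorization exists iff K is positive definite."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # non-positive pivot: not positive definite
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

grid = [0.0, 1.0, 2.0, 3.0, 4.0]  # a coarse discretization of an interval

gaussian = lambda p, q: math.exp(-0.5 * (p - q) ** 2)
distance = lambda p, q: abs(p - q)  # symmetric, but not a valid kernel

gaussian_ok = is_positive_definite(gram_matrix(gaussian, grid))
distance_ok = is_positive_definite(gram_matrix(distance, grid))
```

The distance function fails immediately because its matrix has a zero diagonal, which already contradicts positive definiteness for non-zero matrices.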

42 Kernel Closure Rules Very useful for designing new kernels from existing kernels: the sum of any two kernels is a kernel; the product of any two kernels is a kernel; a kernel plus a non-negative constant is a kernel; a non-negative scalar times a kernel is a kernel.

43 Support Vector Machine Detector [Figure: detection pipeline with the labels: training set, descriptors, support vector machine training, test image, multi-scale search, descriptors, test, results.]

44 Video: Pedestrian Detection

45 Scalability Issues Although we circumvented infinite dimensionality: in training, the number of optimization variables equals the number of training examples $N$; in testing, $f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$, so we need to evaluate the kernel between the test point and each training example. Training and testing for millions of examples is infeasible, e.g. in ImageNet one needs to classify 9 million images.

46 Linear versus Kernel Methods
Model: linear $f(x) = w^\top x$; kernel $f(x) = \sum_i \alpha_i k(x, x_i)$
Number of optimization variables: linear, input dimensionality $d$; kernel, number of training examples $N$
Training time: linear $O(N d^2)$; kernel $O(N^2 d)$ ~ $O(N^3 d)$
Testing time: linear $O(d)$; kernel $O(N d)$
Caltech 101 accuracy (BOW feature): linear 49% (Vedaldi and Zisserman 2010); kernel 64% (Vedaldi and Zisserman 2010)
Caltech 101 accuracy (multiple kernels): linear N/A; kernel 82% (Gehler & Nowozin 2009; Li et al. 2010)
Good things are worth doing slowly.

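47 Random Fourier Approximations Bochner's theorem: a continuous, shift-invariant kernel $k(x, y) = f(x - y)$ on $\mathbb{R}^d$ is positive definite if and only if $k$ is the Fourier transform of a non-negative measure (let $p$ be the Fourier transform of $k$). Since both $k$ and $p$ are real: $k(x, y) = \int p(\omega) \, e^{j \omega^\top (x - y)} \, d\omega = \int p(\omega) \cos(\omega^\top (x - y)) \, d\omega$. Approximate the expectation using a Monte Carlo estimate, with $\omega$ sampled from $p$: $k(x, y) = E_\omega[\cos(\omega^\top (x - y))] = E_\omega[z_\omega(x)^\top z_\omega(y)]$, where $z_\omega(x) = [\cos(\omega^\top x), \sin(\omega^\top x)]^\top$ or $z_{\omega, b}(x) = \sqrt{2} \cos(\omega^\top x + b)$ with $b \sim U[0, 2\pi]$ (recall $\cos(x - y) = \cos x \cos y + \sin x \sin y$).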

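48 Derivation of Compact Feature Map $E_{\omega, b}[z_{\omega, b}(x) \, z_{\omega, b}(y)] = \iint \sqrt{2} \cos(\omega^\top x + b) \, \sqrt{2} \cos(\omega^\top y + b) \, p(b) \, p(\omega) \, db \, d\omega = \iint 2 \cos(\omega^\top x + b) \cos(\omega^\top y + b) \, p(b) \, p(\omega) \, db \, d\omega = \iint \left[ \cos(\omega^\top (x - y)) + \cos(\omega^\top (x + y) + 2b) \right] p(b) \, p(\omega) \, db \, d\omega = \int \cos(\omega^\top (x - y)) \, p(\omega) \, d\omega = k(x, y)$, since $\int \cos(\omega^\top (x + y) + 2b) \, p(b) \, db = 0$ for $b \sim U[0, 2\pi]$.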

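49 RF Algorithm Start: translation-invariant kernel $k(x, y) = k(x - y)$. Goal: $m$-dimensional vectors $z$ so that $k(x, y) \approx z(x)^\top z(y)$. 1. Compute the Fourier transform $p(\omega)$ of $k$. 2. Draw $m$ i.i.d. samples $\omega_1, \dots, \omega_m$ from $p(\omega)$, and draw $m$ i.i.d. samples $b_1, \dots, b_m \sim U[0, 2\pi]$. 3. Let $z(x) = \sqrt{2/m} \, [\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_m^\top x + b_m)]^\top$.
Kernel name; $k(x - y)$; $p(\omega)$
Gaussian; $e^{-\|x - y\|_2^2 / 2}$; $(2\pi)^{-d/2} e^{-\|\omega\|_2^2 / 2}$
Laplacian; $e^{-\|x - y\|_1}$; $\prod_d \frac{1}{\pi (1 + \omega_d^2)}$
Cauchy; $\prod_d \frac{2}{1 + (x_d - y_d)^2}$; $e^{-\|\omega\|_1}$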
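The three steps of the RF algorithm can be sketched for the Gaussian kernel $e^{-\|x - y\|^2 / 2}$, whose Fourier transform is a standard normal density, so step 1 is known in closed form. This plain-Python version is a minimal illustration; the function name `make_random_features`, the seed, the feature count, and the test points are assumptions, and the $\sqrt{2/m}$ scaling is the normalization that makes $z(x)^\top z(y)$ a Monte Carlo average.

```python
import math
import random

def make_random_features(d, m, seed=0):
    """Return a feature map z with z(x) . z(y) approximating exp(-||x - y||^2 / 2)."""
    rng = random.Random(seed)
    # Step 2: omega_k ~ N(0, I_d) (the Gaussian kernel's spectral density)
    # and b_k ~ U[0, 2*pi].
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(m)]
    scale = math.sqrt(2.0 / m)

    def z(x):
        # Step 3: one cosine feature per sampled (omega, b) pair.
        return [
            scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
            for w, b in zip(omegas, offsets)
        ]

    return z

def gaussian_kernel(x, y):
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))

x = (0.3, -0.2, 0.5)
y = (0.1, 0.4, -0.3)
z = make_random_features(d=3, m=4000)

exact = gaussian_kernel(x, y)
approx = sum(zx * zy for zx, zy in zip(z(x), z(y)))
```

With the features in hand, a linear SVM trained on $z(x)$ stands in for the kernel SVM, so training cost scales with the number of features $m$ rather than with the number of kernel evaluations between training pairs.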

50 Pros and Cons of RF Pros: Monte Carlo convergence rate $O(m^{-1/2})$, independent of the input dimension $d$! Cost of feature generation is linear: $O(N m d)$. State of the art for kernel methods in large problems. Cons: Only applicable to translation-invariant kernels $k(x, y) = k(x - y)$ with an analytic Fourier transform. A possibly large number of dimensions is required for difficult non-linear problems.

51 Readings B. Schölkopf, A. Smola: Learning with Kernels, MIT Press, 2002, Chapters 2, 7. Online options: Schölkopf and Smola: Learning with Kernels (Support Vector Machine introduction); Christopher J. C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition.
