Support Vector Novelty Detection


1 Support Vector Novelty Detection
Discussion of "Support Vector Method for Novelty Detection" (NIPS 2000) and "Estimating the Support of a High-Dimensional Distribution" (Neural Computation 13, 2001)
Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson
Eugene Weinstein
March 3rd, 2009
Machine Learning Seminar

2 One Does Not Belong
Mexico's outgoing ruling party is threatening to boycott the inauguration of the new President-elect Vicente Fox...
Leaders from around the world are arriving in Mexico for Friday's inauguration of President-elect Vicente Fox...
The curtain has come down on the Summer Games in Sydney but not before the U.S. men's basketball team...
Mexico begins a new era today when President-elect Vicente Fox takes the oath of office...
History was made in Mexico today when Vicente Fox was sworn in as President. This was no ordinary inauguration...

3 Novelty Detection
Intuitive definition: find the outliers in a group or a stream of data points
Some ideas in the (enormous) past literature:
Fit a Gaussian mixture, train an HMM, etc.; outliers are points with low likelihood
Apply k-means clustering, k-nearest neighbors, etc.; outliers are points far from clusters/other points
This work: ummm, Occam's Razor? Why solve all these hard problems?

4 Separating from Origin
Want an algorithm that returns +1 in a small region enclosing most of the points, -1 outside this region
Unsupervised setting: data points $x_1, \dots, x_\ell \in \mathcal{X}$
Linear classifier form: $f(x) = \operatorname{sgn}(w \cdot x - \rho)$
Figure labels: $w$, $\rho$, $w \cdot x = 0$, $w \cdot x = \rho$ (separable case here)

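5 Add Slack Variables, Kernel
Allow points to violate margin constraints: slack variables $\xi_i$
Number of points: $\ell$; $\nu \in (0, 1]$: learning parameter
Kernel version uses a mapping $\Phi : \mathcal{X} \to F$
Add kernel $k(x, y) = (\Phi(x) \cdot \Phi(y))$

$$\min_{w \in F,\ \xi \in \mathbb{R}^\ell,\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu\ell}\sum_i \xi_i - \rho \quad \text{subject to} \quad (w \cdot \Phi(x_i)) \ge \rho - \xi_i,\ \ \xi_i \ge 0.$$

Kernelized classifier: $f(x) = \operatorname{sgn}((w \cdot \Phi(x)) - \rho)$
Figure labels: $\xi_i$, $\xi_j$, $w \cdot x = \rho$, $w \cdot x = 0$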
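The identity $k(x, y) = (\Phi(x) \cdot \Phi(y))$ can be checked by hand for a kernel whose feature map is easy to write out. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs; that kernel and the names `phi`, `dot`, and `poly2_kernel` are illustrative assumptions and do not come from the slides, which use a Gaussian kernel in the experiments.

```python
import math


def dot(u, v):
    """Plain Euclidean inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without ever building phi."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
lhs = poly2_kernel(x, y)     # kernel evaluated in input space
rhs = dot(phi(x), phi(y))    # inner product evaluated in feature space
print(lhs, rhs)              # the two agree up to rounding
```

The point of the kernel version is that only `poly2_kernel` is ever needed; `phi` is shown here just to verify the identity.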

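6 Solving the Optimization
Lagrangian ($\alpha, \beta \ge 0$):

$$L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu\ell}\sum_i \xi_i - \rho - \sum_i \alpha_i\big((w \cdot \Phi(x_i)) - \rho + \xi_i\big) - \sum_i \beta_i \xi_i$$

Set derivatives w.r.t. $w$, $\xi$, $\rho$ equal to zero:

$$w = \sum_i \alpha_i \Phi(x_i), \qquad \alpha_i = \frac{1}{\nu\ell} - \beta_i \le \frac{1}{\nu\ell}, \qquad \sum_i \alpha_i = 1.$$

Plug back in to get the (kernelized) dual problem:

$$\min_\alpha \ \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu\ell},\ \ \sum_i \alpha_i = 1.$$

And the classifier form:

$$f(x) = \operatorname{sgn}\Big(\sum_i \alpha_i k(x_i, x) - \rho\Big).$$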
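The dual above is a small quadratic program, so it can be solved on toy data with very little code. What follows is a minimal sketch, not the authors' implementation: it moves weight between one pair of coefficients at a time (an SMO-style update chosen here for simplicity), which keeps the sum constraint satisfied automatically. The data set, the kernel width `c=0.5`, the sweep count, and all function names are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, y, c=0.5):
    """k(x, y) = exp(-||x - y||^2 / c)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)


def dual_objective(alpha, K):
    """0.5 * sum_ij alpha_i alpha_j K_ij."""
    n = len(alpha)
    return 0.5 * sum(alpha[i] * alpha[j] * K[i][j]
                     for i in range(n) for j in range(n))


def solve_dual(K, nu, sweeps=200):
    """Minimize the dual s.t. 0 <= alpha_i <= 1/(nu*l) and sum(alpha) = 1."""
    l = len(K)
    upper = 1.0 / (nu * l)
    alpha = [1.0 / l] * l  # uniform start is feasible because nu <= 1
    for _ in range(sweeps):
        for i in range(l):
            for j in range(i + 1, l):
                # Curvature of the objective along the direction e_i - e_j.
                curv = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if curv <= 1e-12:
                    continue
                g_i = sum(K[i][m] * alpha[m] for m in range(l))
                g_j = sum(K[j][m] * alpha[m] for m in range(l))
                t = -(g_i - g_j) / curv  # unconstrained minimizer along e_i - e_j
                # Clip so that alpha_i + t and alpha_j - t both stay in [0, upper].
                t = min(t, upper - alpha[i], alpha[j])
                t = max(t, -alpha[i], alpha[j] - upper)
                alpha[i] += t
                alpha[j] -= t
    return alpha


# Hypothetical toy data: a tight cluster near the origin plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (-0.1, 0.0), (0.0, -0.1), (0.05, 0.05), (1.5, 1.5)]
nu = 0.5
K = [[gaussian_kernel(p, q) for q in points] for p in points]
alpha = solve_dual(K, nu)
print([round(a, 3) for a in alpha])
```

Each pairwise step is an exact line minimization of a convex quadratic, clipped to the box, so the objective never increases. For anything beyond toy sizes a dedicated QP or SVM solver is the better choice.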

7 Recovering the Threshold
Two of the KKT conditions:

$$\alpha_i\big[(w \cdot \Phi(x_i)) - \rho + \xi_i\big] = 0; \qquad \beta_i \xi_i = 0$$

For any $i$ such that $\alpha_i, \beta_i > 0$ (SVs), we have

$$\rho = (w \cdot \Phi(x_i)) = \sum_j \alpha_j k(x_j, x_i).$$

So you get the threshold back as part of your solution, i.e., no hacking/tuning/guessing the threshold!
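A two-point case makes the threshold recovery easy to follow by hand. In the sketch below, the points, the kernel width, and $\nu = 0.5$ are assumptions chosen so that the dual solution is $\alpha = (0.5, 0.5)$ by symmetry; that value is written in directly rather than computed. Both coefficients lie strictly between 0 and $1/(\nu\ell) = 1$, so either point can be used to read off $\rho$.

```python
import math


def gaussian_kernel(x, y, c=0.5):
    """k(x, y) = exp(-||x - y||^2 / c)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)


points = [(0.0, 0.0), (1.0, 0.0)]   # hypothetical training set
nu = 0.5
upper = 1.0 / (nu * len(points))    # box bound 1/(nu*l), equal to 1.0 here
alpha = [0.5, 0.5]                  # assumed dual solution (symmetric case)


def score(x):
    """sum_j alpha_j k(x_j, x), the argument of sgn before subtracting rho."""
    return sum(a * gaussian_kernel(p, x) for a, p in zip(alpha, points))


# Support vectors with 0 < alpha_i < 1/(nu*l) sit exactly on the boundary.
non_bound = [i for i, a in enumerate(alpha) if 0.0 < a < upper]
rhos = [score(points[i]) for i in non_bound]
rho = rhos[0]


def f(x):
    """Decision function: +1 inside the estimated region, -1 outside."""
    return 1 if score(x) - rho >= 0 else -1


print(rho, f((0.5, 0.0)), f((5.0, 5.0)))
```

Every non-bound support vector yields the same $\rho$, which is why any one of them can be used; in practice, averaging over them is a common way to reduce rounding noise.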

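8 Another View
We can also do novelty detection by finding the smallest-radius sphere/ball enclosing most of the points (Tax and Duin, '99):

$$\min_{R \in \mathbb{R},\ \xi \in \mathbb{R}^\ell,\ c \in F} \ R^2 + \frac{1}{\nu\ell}\sum_i \xi_i \quad \text{subject to} \quad \|\Phi(x_i) - c\|^2 \le R^2 + \xi_i,\ \ \xi_i \ge 0 \ \text{ for } i \in [\ell].$$

This leads to the dual

$$\min_\alpha \ \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu\ell},\ \ \sum_i \alpha_i = 1$$

and the solution $c = \sum_i \alpha_i \Phi(x_i)$, corresponding to a decision function of the form

$$f(x) = \operatorname{sgn}\Big(R^2 - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) + 2\sum_i \alpha_i k(x_i, x) - k(x, x)\Big).$$

Figure labels: $c$, $R$, $\xi_i$, $\xi_j$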
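The decision function above is $R^2$ minus the squared feature-space distance from $\Phi(x)$ to the center $c$, and that distance can be evaluated with kernels alone. The sketch below assumes a symmetric two-point data set with $\nu = 0.5$, for which $\alpha = (0.5, 0.5)$; it also assumes, by the same KKT reasoning used for the threshold $\rho$, that $R^2$ equals the squared distance of a support vector with $0 < \alpha_i < 1/(\nu\ell)$. The data and all names are illustrative.

```python
import math


def gaussian_kernel(x, y, c=0.5):
    """k(x, y) = exp(-||x - y||^2 / c)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)


def sq_dist_to_center(x, points, alpha, kernel):
    """||Phi(x) - c||^2 for c = sum_i alpha_i Phi(x_i), via kernel evaluations."""
    quad = sum(ai * aj * kernel(pi, pj)
               for ai, pi in zip(alpha, points)
               for aj, pj in zip(alpha, points))
    cross = sum(ai * kernel(pi, x) for ai, pi in zip(alpha, points))
    return kernel(x, x) - 2.0 * cross + quad


points = [(0.0, 0.0), (1.0, 0.0)]   # hypothetical training set
alpha = [0.5, 0.5]                  # assumed dual solution (symmetric case)

# Assumption: a non-bound support vector lies exactly on the sphere.
R2 = sq_dist_to_center(points[0], points, alpha, gaussian_kernel)


def f(x):
    """+1 inside the ball of squared radius R2 around c, -1 outside."""
    d2 = sq_dist_to_center(x, points, alpha, gaussian_kernel)
    return 1 if R2 - d2 >= 0 else -1


print(R2, f((0.5, 0.0)), f((5.0, 5.0)))
```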

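9 Equivalence of Spheres
Sphere dual:

$$\min_\alpha \ \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu\ell},\ \ \sum_i \alpha_i = 1$$

If the kernel $k(x, y)$ depends only on $x - y$, then $k(x, x)$ is constant, so the equality constraint makes the linear term constant
e.g., Gaussian kernel $k(x, y) = e^{-\|x - y\|^2 / c}$
In these cases, spheres are equivalent to origin separation:

$$\min_\alpha \ \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu\ell},\ \ \sum_i \alpha_i = 1.$$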
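The equivalence can be checked numerically: with a Gaussian kernel, $k(x, x) = 1$, so for any feasible $\alpha$ the sphere objective equals twice the origin-separation objective minus 1, and both are minimized by the same $\alpha$. The sketch below evaluates both objectives on a few feasible vectors; the data points and the choice $\nu = 0.5$ (so each $\alpha_i \le 0.5$ for four points) are assumptions for illustration.

```python
import math


def gaussian_kernel(x, y, c=0.5):
    """k(x, y) = exp(-||x - y||^2 / c); note k(x, x) = 1 for every x."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)


def quad_term(alpha, K):
    """sum_ij alpha_i alpha_j K_ij."""
    n = len(alpha)
    return sum(alpha[i] * alpha[j] * K[i][j]
               for i in range(n) for j in range(n))


def origin_objective(alpha, K):
    """Dual objective of the origin-separation formulation."""
    return 0.5 * quad_term(alpha, K)


def sphere_objective(alpha, K):
    """Dual objective of the enclosing-sphere formulation."""
    linear = sum(a * K[i][i] for i, a in enumerate(alpha))
    return quad_term(alpha, K) - linear


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]  # hypothetical data
K = [[gaussian_kernel(p, q) for q in points] for p in points]

# Feasible vectors: entries in [0, 0.5] and summing to 1.
candidates = [[0.25, 0.25, 0.25, 0.25],
              [0.5, 0.5, 0.0, 0.0],
              [0.1, 0.2, 0.3, 0.4]]
gaps = [sphere_objective(a, K) - (2.0 * origin_objective(a, K) - 1.0)
        for a in candidates]
print(gaps)  # all zero up to rounding
```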

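10 Generalization Bound

3.4, 3.5: $$\min_{w \in F,\ \xi \in \mathbb{R}^\ell,\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu\ell}\sum_i \xi_i - \rho \quad \text{subject to} \quad (w \cdot \Phi(x_i)) \ge \rho - \xi_i,\ \ \xi_i \ge 0.$$

3.10: $$f(x) = \operatorname{sgn}\Big(\sum_i \alpha_i k(x_i, x) - \rho\Big).$$

3.12: $$\rho = (w \cdot \Phi(x_i)) = \sum_j \alpha_j k(x_j, x_i).$$

Definition 2. Let $f$ be a real-valued function on a space $\mathcal{X}$. Fix $\theta \in \mathbb{R}$. For $x \in \mathcal{X}$ let $d(x, f, \theta) = \max\{0, \theta - f(x)\}$. Similarly, for a training sequence $X := (x_1, \dots, x_\ell)$, we define $D(X, f, \theta) = \sum_{x \in X} d(x, f, \theta)$.

Theorem 1 (generalization error bound). Suppose we are given a set of $\ell$ examples $X \in \mathcal{X}^\ell$ generated i.i.d. from an unknown distribution $P$, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution $f_w$ given explicitly by equation (3.10). Let $R_{w,\rho} := \{x : f_w(x) \ge \rho\}$ denote the induced decision region. With probability $1 - \delta$ over the draw of the random sample $X \in \mathcal{X}^\ell$, for any $\gamma > 0$,

$$P\{x' : x' \notin R_{w,\rho-\gamma}\} \le \frac{2}{\ell}\Big(k + \log\frac{\ell^2}{2\delta}\Big), \qquad (5.7)$$

where

$$k = \frac{c_1 \log(c_2 \hat{\gamma}^2 \ell)}{\hat{\gamma}^2} + \frac{2D}{\hat{\gamma}} \log\Big(e\Big(\frac{(2\ell - 1)\hat{\gamma}}{2D} + 1\Big)\Big) + 2, \qquad (5.8)$$

$c_1 = 16c^2$, $c_2 = \ln(2)/(4c^2)$, $c = 103$, $\hat{\gamma} = \gamma/\|w\|$, $D = D(X, f_{w,0}, \rho) = D(X, f_{w,\rho}, 0)$, and $\rho$ is given by equation (3.12).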
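Definition 2 measures by how much each point falls short of a threshold and adds those shortfalls up. The sketch below implements it directly; the example function `f` (the identity on real numbers) and the sample values are assumptions chosen so the arithmetic can be checked by hand. It also illustrates the identity $D(X, f_{w,0}, \rho) = D(X, f_{w,\rho}, 0)$ used in Theorem 1, reading $f_{w,\rho}$ as $f_w$ shifted down by $\rho$, which is an interpretation rather than something stated on the slide.

```python
def d(x, f, theta):
    """Shortfall of f(x) below the threshold theta (zero if f(x) >= theta)."""
    return max(0.0, theta - f(x))


def D(X, f, theta):
    """Total shortfall over a training sequence X."""
    return sum(d(x, f, theta) for x in X)


def f(x):
    """Hypothetical real-valued function: the identity."""
    return x


X = [0.5, 1.0, 2.0]
rho = 1.0

total = D(X, f, rho)                           # only x = 0.5 falls short, by 0.5
shifted = D(X, lambda x: f(x) - rho, 0.0)      # same quantity, threshold folded into f
print(total, shifted)
```

Points with $f(x) \ge \theta$ contribute nothing, so $D$ is zero exactly when every training point is inside the estimated region.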

11 Generalization Comments
Covering number argument, strong connections to soft-margin binary classification bounds
Bound is loose and thus not directly applicable in practice
$c = 103$, too large by a factor of >50

12 Experiments: Toy Data
Synthetic 2-D data, Gaussian kernel $k(x, y) = e^{-\|x - y\|^2 / c}$

$\nu$, width $c$        0.5, 0.5      0.5, 0.5      0.1, 0.5      0.5, 0.1
frac. SVs/OLs         0.54, 0.43    0.59, 0.47    0.24, 0.03    0.65, 0.38
margin $\rho/\|w\|$     …             …             …             …

Figure 1: (First two pictures) A single-class SVM applied to two toy problems; $\nu = c = 0.5$, domain: $[-1, 1]^2$. In both cases, at least a fraction of $\nu$ of all examples is in the estimated region (cf. Table 1). The large value of $\nu$ causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of $\nu$, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3). In the fourth picture, using $c = 0.1$, $\nu = 0.5$, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.

13 Experiments: Digits
9298 digits, 16x16 = 256 dimensionality, last 2007 are test
(Figure legend, both panels: test, train, other, offset)
Figure 3: Experiments on the U.S. Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function. For $\nu = 50\%$ (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the other class. For $\nu = 5\%$ (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset $\rho$ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fraction of outliers in the training set appears slightly larger than it should be.

14 Experiments: Digits
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience in units of $10^{-5}$) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are difficult in that they are either atypical or even mislabeled.
