One-class Classification: ν-SVM

Qiang Ning

Dec. 10, 2015

Abstract

One-class classification is a special kind of classification problem in which the training set only consists of samples from one class. The conventional SVM fails to handle the one-class classification problem because of the lack of information about the other class. The ν-SVM addresses this issue by estimating the support of the probability density of the class for which we do have sufficient samples, and then treating new samples that fall outside of this support as outliers. The resulting optimization problem can be readily solved in a similar way as the conventional SVM, and its generalization error can also be theoretically upper bounded. Both simulated and real medical data are used in this report to demonstrate the performance of the ν-SVM, which should prove useful in various outlier/abnormality detection tasks.

1 Introduction

The goal of classification is to differentiate objects and to understand information. When the underlying probability distribution is readily available, classification tasks can be easily handled within the Bayesian framework. For instance, in binary classification/detection, given prior distributions π_y, y ∈ {±1}, and conditional distributions p_y(x), y ∈ {±1}, where x ∈ R^d is the observation and y is the class label, the optimal classifier that minimizes the 0-1 loss is a likelihood ratio test:

    δ_B(x) = 1 if L(x) ≥ η, and δ_B(x) = −1 otherwise,

where L(x) = p_1(x)/p_{−1}(x) is the likelihood ratio function and η = π_{−1}/π_1 is the test threshold [1].

In practice, however, the underlying probability distribution is usually unavailable due to the lack of knowledge about the physical and statistical laws governing the different classes of observations. On the other hand, observation data can often be collected easily. Therefore, it has been proposed to learn a classifier based on existing observations (i.e., the training dataset), with the hope/assumption that a classifier that separates the training dataset well can also classify future observations (i.e., the test dataset) well. Various classification methods have been proposed along this way: empirical risk minimization (ERM), support vector machines (SVM), logistic regression, neural networks, etc. [2]

Nevertheless, in some real-world applications, e.g., outlier detection, not only is the underlying probability distribution unavailable, but it is also very expensive or even impossible to collect data from both classes. As a result, the training set only consists of data from one class (or the data from the other class are insufficient). The classification problem in this scenario is often called the one-class classification problem. The so-called ν-SVM, which we explore in this report, is one of the popular methods for solving this problem [3]. Throughout this report, we refer to binary and multi-class classification problems as conventional classification problems.

2 Challenges

As implied by its name, the one-class classification problem is challenging because no (or insufficient) information about the outliers is available, and conventional classification methods cannot be used. To better illustrate this point, we take the conventional SVM (here we focus on the maximum-margin classifier) as an example. As in [2], the maximum-margin classifier constructs a classifier δ : R^d → {±1} such that δ(x) = sgn[g(x)], where the discriminant function is g(x) = w^T x + w_0.¹ Given a training set with n samples, {x_i, y_i}_{i=1}^n, where x_i ∈ R^d and y_i ∈ {±1} for all i, the weight vector w and the bias w_0 are obtained by solving the following optimization problem:

    min_{w,w_0}  (1/2)||w||_2^2
    s.t.  y_i [w^T x_i + w_0] ≥ 1,   i = 1, …, n.

If all the training samples are from class +1, i.e., y_i = 1 for all i, then obviously the solution is w* = 0 together with any w_0* ≥ 1, and the resulting classifier is δ(x) ≡ 1. Therefore, if we directly apply the conventional SVM to one-class classification, the resulting classifier has no power to identify outliers.

This failure of applying the conventional SVM (and other conventional classification methods as well) to one-class classification can be explained by the fact that conventional classification methods are designed to separate different classes. When no or few training samples are from class −1, separation can be trivially satisfied, and the generalizability of the trained classifier is thus poor. Conceptually speaking, in conventional classification methods, a description of one class is learnt via comparison to other classes rather than from the class itself. In one-class classification, the problem becomes challenging because we need to learn a description of the class itself; a minimal sketch of the degeneracy just described is given below.

¹We can also play the kernel trick here, i.e., replace x by Φ(x), where Φ : R^d → R^k is a mapping from input space to feature space.
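To make the degeneracy concrete, here is a minimal numpy illustration (the two-dimensional data and the numbers are made up for illustration only, not taken from this report): with all labels equal to +1, the trivial pair (w, w_0) = (0, 1) already satisfies every margin constraint with the smallest possible ||w||, yet the resulting classifier labels every conceivable test point +1.

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(50, 2))     # hypothetical one-class training data
    y_train = np.ones(50)                  # every label is +1

    # Trivial solution of the max-margin problem when only class +1 is present.
    w, w0 = np.zeros(2), 1.0

    # All margin constraints y_i (w^T x_i + w0) >= 1 hold, with the minimal ||w|| = 0 ...
    assert np.all(y_train * (X_train @ w + w0) >= 1)

    # ... yet nothing is ever flagged as an outlier, even points far from the data.
    X_test = np.array([[0.0, 0.0], [100.0, -100.0]])
    print(np.sign(X_test @ w + w0))        # [1. 1.]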

An extreme approach is to estimate the probability density itself from the training set, which would then allow us to solve whatever outlier detection problem we face. However, probability density estimation is still an open problem in learning theory. One of the major drawbacks of probability density methods is the requirement of a large training set, especially when dealing with high-dimensional features. To address this issue, the ν-SVM was proposed in [3]; it turns to an alternative problem: probability density support estimation. It learns a domain description of the one-class training set and then uses this domain description to detect outliers. The generalization error of the ν-SVM can also be bounded theoretically.

3 One-class Classification: ν-SVM

Following Vapnik's principle of never solving a problem that is more general than the one we actually need to solve, the ν-SVM estimates the support of the probability density (i.e., a smallest region), instead of estimating the probability density itself. Specifically, the ν-SVM separates the data from the origin with maximum margin (which is where the "SVM" in its name comes from). The strategy is to find a smallest region capturing most of the data points, so that within that region the classifier decides +1, and otherwise it decides −1 (outlier). Next we describe the ν-SVM method by its formulation and algorithm.

3.1 Formulation

Given a training set with n samples, {x_i}_{i=1}^n where x_i ∈ R^d, the ν-SVM solves the following problem:

    min_{w,ρ,ξ}  (1/2)||w||_2^2 + (1/(νn)) Σ_{i=1}^n ξ_i − ρ                    (1)
    s.t.  w^T Φ(x_i) ≥ ρ − ξ_i,  ξ_i ≥ 0,  i = 1, …, n,

where ν ∈ (0, 1] and Φ(·) is the transformation from input space to feature space. The decision function is

    δ(x) = sgn[g(x)],                                                            (2)

where the discriminant function is g(x) = w^T Φ(x) − ρ.

By formulating Eq. (1), we expect the discriminant function to be positive for most training samples, while keeping (1/2)||w||_2^2 − ρ small. The trade-off between these two goals is controlled by ν. Let us first assume that the slack variables ξ_i are zero, which would be the case in the limit ν → 0. One significant difference between the ν-SVM and the SVM is the introduction of ρ. To understand why the introduction of ρ leads to a desirable classifier, observe in Fig. 1 that the distance from the origin to the decision line is d = |ρ|/||w||.

The minimization of −ρ is equivalent to the maximization of ρ. If a data point lies above the line (e.g., point A in Fig. 1), then ρ > 0, and a larger ρ indicates a larger d; if a data point lies below the line (e.g., point B in Fig. 1), then ρ < 0, and a larger ρ indicates a smaller d. In both cases, the discriminant line moves toward the data point. Therefore, the introduction of ρ leads to a discriminant function that tightly bounds the training set. Additionally, to handle the case where there are outliers in the training set, slack variables ξ_i are introduced, similarly to what we did for the soft-margin SVM [2]. As stated earlier, the trade-off between data consistency and boundary tightness is controlled by ν, but ν is actually more than simply a regularization parameter, as will be shown later in this report.

Figure 1: The normal vector of the discriminant boundary g(x) = w^T x − ρ = 0 is w. The distance from the origin to the boundary is thus d = |ρ|/||w||. If point A lies in the region g(x) > 0, then the origin satisfies g(0) = −ρ < 0, and ρ is thus positive; if point B lies in the region g(x) > 0, then the origin satisfies g(0) = −ρ > 0, and ρ is thus negative.
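Before turning to the dual, it may help to see problem (1) in action. The following is a minimal, illustrative sketch (the toy data, kernel width, and value of ν are assumptions, not taken from this report) using scikit-learn's OneClassSVM, which implements this ν-parameterized one-class formulation on top of LIBSVM [6]; points with g(x) ≥ 0 are labeled +1 and all others −1.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(1)
    X_train = rng.normal(loc=2.0, scale=0.5, size=(200, 2))  # hypothetical one-class data

    # nu plays the role of the regularization parameter nu in Eq. (1).
    clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X_train)

    X_test = np.array([[2.0, 2.0],     # near the training cloud
                       [8.0, -3.0]])   # far away; expected to be an outlier
    print(clf.predict(X_test))            # +1 = inside the estimated support, -1 = outlier
    print(clf.decision_function(X_test))  # value of the discriminant g(x) = w^T Phi(x) - rho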

3.2 Dual Problem

Problem (1) is referred to as the primal optimization problem. As for the conventional SVM, it is usually preferable to work with its dual problem. First, we introduce a Lagrangian with α_i, β_i ≥ 0:

    L(w, ξ, ρ, α, β) = (1/2)||w||_2^2 + (1/(νn)) Σ_i ξ_i − ρ − Σ_i α_i (w^T Φ(x_i) − ρ + ξ_i) − Σ_i β_i ξ_i,   (3)

whose first derivatives w.r.t. the primal variables w, ξ_i and ρ are

    ∂L/∂w   = w − Σ_i α_i Φ(x_i),
    ∂L/∂ξ_i = 1/(νn) − α_i − β_i,   i = 1, …, n,
    ∂L/∂ρ   = −1 + Σ_i α_i.

Then, by setting these derivatives to zero, we have

    w = Σ_i α_i Φ(x_i),                                   (4)
    α_i = 1/(νn) − β_i ≤ 1/(νn),   i = 1, …, n,           (5)
    Σ_i α_i = 1.                                          (6)

Substituting Eq. (4), Eq. (5) and Eq. (6) into Eq. (3), we obtain the dual problem:

    min_α  (1/2) Σ_{i,j=1}^n α_i α_j k(x_i, x_j)                                 (7)
    s.t.  0 ≤ α_i ≤ 1/(νn),  i = 1, …, n,  and  Σ_i α_i = 1,

where k(x_i, x_j) = Φ(x_i)^T Φ(x_j) is the kernel function. Like the primal problem (1), Eq. (7) is a quadratic program. Fast iterative algorithms exist for the dual problem (7). An algorithm originally proposed for classification is the so-called sequential minimal optimization (SMO) algorithm [4]. A modified version of SMO tailored to Eq. (7) can be found in [3][5]. Once an optimizing α* is obtained by solving Eq. (7), we can recover w* using Eq. (4). As for ρ*, we notice that the constraints in Eq. (1) become equalities whenever both α_i and β_i are positive, i.e., whenever 0 < α_i < 1/(νn). Picking any such index i, we obtain

    ρ* = (w*)^T Φ(x_i) = Σ_{j=1}^n α_j* k(x_j, x_i).
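Since (7) is just a small quadratic program when n is modest, it can also be solved with a generic convex solver instead of SMO. The following is an illustrative, non-optimized sketch (the toy data, RBF kernel, and value of ν are assumptions): it solves (7) with CVXPY, recovers ρ* from a margin support vector, and evaluates g(x) = Σ_i α_i* k(x_i, x) − ρ*.

    import numpy as np
    import cvxpy as cp
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(2)
    X = rng.normal(size=(60, 2))                  # hypothetical training set
    n, nu, gamma = len(X), 0.2, 0.5               # assumed hyperparameters

    K = rbf_kernel(X, X, gamma=gamma)             # Gram matrix k(x_i, x_j)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))  # so that a^T K a = ||L^T a||^2

    a = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(L.T @ a))        # (1/2) a^T K a
    constraints = [a >= 0, a <= 1.0 / (nu * n), cp.sum(a) == 1]   # dual constraints of (7)
    cp.Problem(objective, constraints).solve()
    alpha = a.value

    # Recover rho from an index with 0 < alpha_i < 1/(nu*n) (a "margin" support vector).
    i = int(np.argmax((alpha > 1e-6) & (alpha < 1.0 / (nu * n) - 1e-6)))
    rho = K[:, i] @ alpha

    def g(X_new):
        # Discriminant g(x) = sum_i alpha_i k(x_i, x) - rho.
        return rbf_kernel(X_new, X, gamma=gamma) @ alpha - rho

    print(np.sign(g(X[:5])))    # most training points should land on the +1 side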

4 Theory

Very nice theoretical results have been proven in [3]. In this report, we focus on two of the theorems introduced in [3] and go through their proofs. For Theorem 1, an alternative proof is provided instead of the original one in [3]. For Theorem 2, we fill in the gaps that the authors left and correct typos.

Theorem 1 (ν-property). Assume the solution to Eq. (1) satisfies ρ ≠ 0. The following statements hold:

1. ν is a lower bound on the fraction of support vectors.

2. ν is an upper bound on the fraction of outliers.

Proof. To prove the two properties, the authors of [3] used a proposition that relates the w and ρ obtained in one-class classification with those obtained in a corresponding binary classification. Here, however, we can prove them alternatively as follows.

Let I = {i : α_i ≠ 0}. From Eqs. (5) and (6), we have

    1 = Σ_{i=1}^n α_i = Σ_{i∈I} α_i = |I|/(νn) − Σ_{i∈I} β_i ≤ |I|/(νn).

Therefore, |I| ≥ νn, i.e., the number of nonzero α_i's is lower bounded by νn. Note that nonzero α_i's correspond to support vectors, so property 1 holds.

Let J = {j : β_j = 0}. Again from Eqs. (5) and (6), we have

    1 = Σ_{i=1}^n α_i = Σ_{i=1}^n (1/(νn) − β_i) = |J|/(νn) + Σ_{j∉J} α_j ≥ |J|/(νn).

Therefore, |J| ≤ νn, i.e., the number of zero β_j's is upper bounded by νn. Note that every outlier (ξ_j > 0) must have β_j = 0 by complementary slackness, so property 2 holds.

Besides the ν-property, which reveals the underlying meaning of the regularization parameter ν, the learning generalizability of the ν-SVM in terms of probability density support estimation can also be characterized as follows.

Definition 1. Let f : X → R. For a fixed θ ∈ R and x ∈ X, let d(x, f, θ) = max{θ − f(x), 0}. Then, for a training set T = {x_i}_{i=1}^n, define

    D(T, f, θ) = Σ_{x∈T} d(x, f, θ).

Theorem 2 (Generalization Error Bound). Assume we are given a training set T = {x_i}_{i=1}^n generated i.i.d. from an underlying but unknown distribution P which does not contain discrete components. Suppose a function f_w(x) = w^T Φ(x) and a bias ρ are obtained by solving the optimization problem Eq. (1). Let R_{w,ρ} = {x : f_w(x) ≥ ρ} denote the decision region. Then, with probability 1 − δ over the draw of the random sample T from P, the following holds for any γ > 0:

    P{x : x ∉ R_{w,ρ−γ}} ≤ (2/n) (k + log₂(n²/(2δ))),                            (8)

where

    k = c₁ log₂(c₂ γ̂² n)/γ̂² + (2D/γ̂) log₂( e((2n − 1)γ̂/(2D) + 1) ) + 2,          (9)

c₁ = 16c², c₂ = ln 2/(4c²), c = 103, γ̂ = γ/||w||, and D = D(T, f_w, ρ).

A training set T determines a decision region R_{w,ρ}: if a new sample falls into R_{w,ρ}, we assert that it is generated from the distribution P; otherwise, we assert that it is an outlier. We make such assertions because we expect points generated according to P to indeed lie in R_{w,ρ}. Theorem 2 gives us the guarantee that, with a certain probability (i.e., 1 − δ), the probability that a new sample lies outside of the region R_{w,ρ−γ} is bounded from above. Moreover, Theorem 2 also serves as a characterization of the ν-SVM, from which we can gain the following insights.

1. The theorem suggests not using the offset ρ obtained by solving Eq. (1) directly, but a smaller value ρ − γ, which corresponds to a larger decision region R_{w,ρ−γ}.

2. If D = 0, then as n → ∞ the bound in Eq. (8) goes to zero, i.e., the complete support is obtained asymptotically. However, D is measured with respect to ρ, while the bound applies to the larger region R_{w,ρ−γ}. Any point in R_{w,ρ−γ} \ R_{w,ρ} will contribute to D. Therefore, D is strictly positive in general, and this bound does not imply asymptotic convergence to the true support.

3. The purpose of ν is to allow outliers in the training set and thus to improve robustness. Since a larger ν indicates a larger D, and hence a larger k, an unnecessarily large ν will lead to a looser bound. Therefore, prior knowledge about the percentage of outliers in the training set is desirable.

The proof of Theorem 2 requires the concepts of covering numbers and function spaces, and can be found in the Appendix.

5 Experiments

A comprehensive off-the-shelf package for SVMs is LIBSVM [6], in which the ν-SVM is also available.

5.1 ν-property

In this section, we wish to verify the ν-property of Theorem 1. A crescent-shaped two-dimensional simulated dataset from [7] is used; its 500 samples are shown in Fig. 2. An example of using the ν-SVM for one-class classification is shown in Fig. 3, where the Gaussian kernel k(x, y) = exp(−0.06 ||x − y||²) was used and ν was set to 0.05. We can see that a smooth, crescent-shaped decision boundary (blue curve) was learned, which tightly bounds a large portion of the training samples while allowing a certain portion of outliers (black stars). Using the same kernel function, the fractions of support vectors (SVs) and outliers (OLs) for different values of ν are summarized in Table 1 to verify Theorem 1.
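Experiments of this kind can be run with LIBSVM directly; the sketch below shows how a table like Table 1 could be produced with scikit-learn's OneClassSVM (a wrapper around LIBSVM). The crescent-shaped dataset of [7] is not bundled here, so a stand-in two-dimensional sample is generated; the kernel width matches the report's k(x, y) = exp(−0.06||x − y||²), but the resulting numbers are only illustrative.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 2))        # stand-in for the 500-sample dataset of [7]

    for nu in [0.05, 0.10, 0.30, 0.50, 0.70, 0.90]:
        clf = OneClassSVM(kernel="rbf", gamma=0.06, nu=nu).fit(X)
        frac_sv = len(clf.support_) / len(X)       # fraction of support vectors
        frac_ol = np.mean(clf.predict(X) == -1)    # fraction of training outliers
        # Theorem 1: nu <= frac_sv, and (approximately) frac_ol <= nu.
        print(f"nu={nu:.2f}  SVs={frac_sv:.3f}  OLs={frac_ol:.3f}")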

Figure 2: A simple 2-dimensional dataset with 500 samples from [7]. Blue circle: training sample.

Figure 3: An example of using the ν-SVM to learn a smallest region that captures most of the points. Blue curve: decision boundary obtained. Black star: outliers.

It can be seen from Table 1 that the fraction of SVs is lower bounded by ν. Moreover, the fraction of OLs is approximately upper bounded by ν, in spite of some small fluctuations (e.g., when ν = 30%, 70%), which can be explained by the fact that we are not in the asymptotic regime. Table 1 does indicate that ν can be used to approximate/control the fraction of SVs and OLs.

Table 1: The fractions of SVs and OLs for different values of ν.

    ν (%)    Fraction of SVs (%)    Fraction of OLs (%)
    5        6.2                    5.0
    10       11.0                   10.0
    30       31.6                   30.2
    50       50.2                   49.8
    70       70.2                   70.2
    90       90.2                   90.0

5.2 Breast Cancer Classification

A dataset retrieved from the Wisconsin Breast Cancer Databases at UCI is used to demonstrate the performance of the ν-SVM.² It contains 699 instances in total, collected between 1989 and 1991 by Dr. William H. Wolberg at the University of Wisconsin Hospitals [8]; 458 of the instances are benign and 241 malignant. The dimensionality of the feature space is 9: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses, all quantized from 1 to 10. Figure 4 is a scatter plot of the dataset after dimensionality reduction by PCA.

Figure 4: Scatter plot of the breast cancer dataset. The original data were projected onto the first two principal eigenvectors of the empirical covariance matrix. A natural clustering of benign and malignant instances can be observed.

²Link to data: http://homepage.tudelft.nl/n9d04/occ/505/oc_505.mat

Table 2 summarizes the performance of the ν-SVM compared with the conventional binary SVM. When the training set has an insufficient number of malignant instances, there are usually two options. One is to still train a conventional SVM using the whole training set; the other is to train a ν-SVM using only the benign instances in the training set. Comparing the first two rows, we can tell that training a ν-SVM on the benign instances only is better in terms of detecting malignant cases. Comparing the ν-SVM (row 2) with rows 3-5 of Table 2, we can also see that when the size of the training set remains the same, one may prefer one-class classification if the detection of outliers is more important, unless sufficient numbers of samples are available for both classes (e.g., row 6). Figure 5 provides a visual explanation of why the ν-SVM is a better choice when dealing with unbalanced learning tasks. Therefore, we can see the importance of using one-class classification when information from one class is insufficient. A sketch of this comparison protocol is given after Table 2.

Table 2: The performance of the ν-SVM. The left two columns give the number of benign/malignant instances used in the training set. The right two columns give the probability of detecting benign cases and the probability of detecting malignant cases, respectively. The second row (300 benign, 0 malignant) corresponds to the ν-SVM; the other rows are conventional soft-margin SVMs.

    # Benign    # Malignant    Detection of Benign (%)    Detection of Malignant (%)
    300         20             100.0                      87.2
    300         0              97.5                       96.5
    290         10             100.0                      45.4
    280         20             100.0                      87.9
    270         30             99.4                       96.5
    200         100            99.4                       97.9
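The sketch below outlines the comparison protocol described above. Since the .mat file linked earlier is not parsed here, synthetic stand-ins are generated in place of the real benign/malignant feature matrices, and the split sizes and hyperparameters are assumptions; the point is only to show the two training options side by side.

    import numpy as np
    from sklearn.svm import OneClassSVM, SVC

    rng = np.random.default_rng(4)
    # Synthetic stand-ins for the 458 benign and 241 malignant 9-dimensional instances.
    X_benign = np.clip(rng.normal(3, 2, size=(458, 9)).round(), 1, 10)
    X_malignant = np.clip(rng.normal(7, 2, size=(241, 9)).round(), 1, 10)

    def detection_rates(predict, X_b_test, X_m_test):
        # Fraction of benign test points labeled +1 and malignant test points labeled -1.
        return np.mean(predict(X_b_test) == 1), np.mean(predict(X_m_test) == -1)

    X_b_train, X_b_test = X_benign[:300], X_benign[300:]
    X_m_train, X_m_test = X_malignant[:20], X_malignant[20:]

    # Option 1: nu-SVM trained on benign instances only (one-class).
    oc = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X_b_train)
    print(detection_rates(oc.predict, X_b_test, X_m_test))

    # Option 2: conventional soft-margin SVM on the unbalanced two-class training set,
    # with labels +1 (benign) and -1 (malignant).
    X2 = np.vstack([X_b_train, X_m_train])
    y2 = np.hstack([np.ones(len(X_b_train)), -np.ones(len(X_m_train))])
    svc = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X2, y2)
    print(detection_rates(svc.predict, X_b_test, X_m_test))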

Figure 5: (a) A ν-SVM trained using 300 benign samples, and (b) its performance on the test set; (c)(e) soft-margin SVMs trained using 290 benign samples plus 10 malignant samples, and 200 benign samples plus 100 malignant samples, respectively, and (d)(f) their performance. Blue: benign samples. Red: malignant samples. Circle: training samples. Cross: test samples. Black curve: decision boundary obtained accordingly. Note that when only an insufficient number of malignant samples is available, the ν-SVM can find a decision boundary that tightly bounds the benign samples, as in (a). The conventional SVM, however, is significantly impaired unless a sufficient number of samples from both classes is available, as in (e).

6 Discussion

We have seen the importance of using one-class classification methods for learning tasks where the training set contains only one class. The ν-SVM, as one popular one-class classification method, can be proved to be equivalent to another method named SVDD: Support Vector Domain Description [9].

6.1 SVDD

Suppose a description of a data set T = {x_i}_{i=1}^n is required. While the ν-SVM bounds the data set using hyperplanes, SVDD uses spheres instead. Specifically, we wish to find a smallest ball into which most of the data points in T can be put. The resulting primal optimization problem is

    min_{R,ξ,c}  R² + C Σ_{i=1}^n ξ_i                                             (10)
    s.t.  ||Φ(x_i) − c||² ≤ R² + ξ_i,  ξ_i ≥ 0,  i = 1, …, n,

where c and R are the center and radius of the desired ball, ξ_i are the slack variables, and C is a regularization parameter balancing the trade-off between ball radius and data consistency. The dual problem is thus

    min_α  Σ_{i,j=1}^n α_i α_j k(x_i, x_j) − Σ_{i=1}^n α_i k(x_i, x_i)            (11)
    s.t.  0 ≤ α_i ≤ C,  i = 1, …, n,  and  Σ_{i=1}^n α_i = 1.

6.2 Relation to ν-SVM

SVDD addresses the same problem in a different way but, interestingly, is closely related to the ν-SVM. We describe its relation to the ν-SVM by the following theorem.

Theorem 3 (Connection between ν-SVM and SVDD). If k(x, y) only depends on x − y, then the solution of the ν-SVM is the same as that of SVDD, with ν = 1/(nC).

Proof. First, it is obvious that if k(x, y) only depends on x − y, then k(x, x) is a constant. If ν is further set to 1/(nC), then (7) and (11) have the same feasible set, and their objectives differ only by a positive scaling and an additive constant (Σ_i α_i k(x_i, x_i) is constant because k(x, x) is constant and Σ_i α_i = 1), so they are effectively the same problem. Therefore, the optimizing α* is the same for both methods. It then only remains to show that the decision functions of the ν-SVM and SVDD coincide given the same α*. We already know that

    δ_ν-SVM(x) = sgn[ Σ_i α_i* k(x_i, x) − ρ ],
    δ_SVDD(x)  = sgn[ R² − Σ_{i,j} α_i* α_j* k(x_i, x_j) + 2 Σ_i α_i* k(x_i, x) − k(x, x) ].

Let x_m be one of the points with 0 < α_m* < C = 1/(νn). Then we have

    ρ  = Σ_i α_i* k(x_i, x_m),
    R² = Σ_{i,j} α_i* α_j* k(x_i, x_j) − 2 Σ_i α_i* k(x_i, x_m) + k(x_m, x_m).

Therefore,

    δ_ν-SVM(x) = sgn[ Σ_i α_i* k(x_i, x) − Σ_i α_i* k(x_i, x_m) ],
    δ_SVDD(x)  = sgn[ 2 Σ_i α_i* k(x_i, x) − 2 Σ_i α_i* k(x_i, x_m) + k(x_m, x_m) − k(x, x) ].

Since k(x_m, x_m) = k(x, x) and sgn[g(x)] = sgn[2g(x)], we have δ_ν-SVM(x) ≡ δ_SVDD(x).
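As a quick numerical sanity check of Theorem 3 (not part of the original report), the sketch below solves the SVDD dual (11) with CVXPY on a toy RBF-kernel problem and compares its labels with those of scikit-learn's OneClassSVM run with the matched parameter ν = 1/(nC). The data and hyperparameters are arbitrary, and agreement is only expected up to numerical tolerance near the boundary.

    import numpy as np
    import cvxpy as cp
    from sklearn.svm import OneClassSVM
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(5)
    X = rng.normal(size=(80, 2))           # toy training set (assumed, not from the report)
    n, gamma, C = len(X), 0.5, 0.05        # assumed kernel width and SVDD penalty
    nu = 1.0 / (n * C)                     # Theorem 3 mapping; here nu = 0.25

    # SVDD dual (11): min a^T K a - sum_i a_i k(x_i, x_i), s.t. 0 <= a_i <= C, sum_i a_i = 1.
    K = rbf_kernel(X, X, gamma=gamma)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))   # a^T K a = ||L^T a||^2
    a = cp.Variable(n)
    obj = cp.Minimize(cp.sum_squares(L.T @ a) - np.diag(K) @ a)
    cp.Problem(obj, [a >= 0, a <= C, cp.sum(a) == 1]).solve()
    alpha = a.value

    # Radius from a margin support vector (0 < alpha_m < C), as in the proof above.
    m = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    R2 = alpha @ K @ alpha - 2 * (K[:, m] @ alpha) + K[m, m]

    def svdd_predict(X_new):
        # +1 inside the ball, -1 outside; uses k(x, x) = 1 for the RBF kernel.
        Kt = rbf_kernel(X_new, X, gamma=gamma)
        dist2 = 1.0 - 2.0 * (Kt @ alpha) + alpha @ K @ alpha
        return np.where(R2 - dist2 >= 0, 1, -1)

    # nu-SVM with the matched nu; the two label assignments should largely agree.
    oc = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X)
    X_new = rng.normal(size=(300, 2))
    print(np.mean(svdd_predict(X_new) == oc.predict(X_new)))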

Theorem 3 is consistent with our intuition: when k(x, y) only depends on x − y, all the mapped patterns lie on a sphere in the kernel space, so the smallest sphere found by SVDD can equivalently be cut off by a hyperplane (the ν-SVM). The theorem is rather important, not only because it theoretically relates two popular one-class classification methods, but also because it implies that the generalization error bound derived for the ν-SVM also applies to SVDD, and that some parameter selection methods for SVDD (e.g., [7]) can also be applied to the ν-SVM.

7 Conclusion

One-class classification, also known as data domain description, is not only a classification problem but also an important step towards learning information and understanding knowledge from training data. The ν-SVM method addresses the one-class classification problem by finding a smallest region (the support of the probability density) that bounds most of the training samples. The resulting optimization problem is similar to that of the conventional SVM, and fast iterative algorithms exist for solving it. Its generalization error has also been proved to be bounded from above, which is a very desirable property of a learning algorithm. In this report, we have provided our own proof of the so-called ν-property (Theorem 1) and verified it on a simulated data set. The ν-property gives the regularization parameter ν an operational meaning and can thus be leveraged to control the fraction of support vectors and outliers in practice. Real-world data were also used to demonstrate the usefulness of the ν-SVM when dealing with insufficient negative training samples. The results indicate that when there are insufficient negative samples in the training set, it is better to use only the positive samples and resort to one-class classification. We have also proved in Theorem 3 that the ν-SVM is equivalent to another popular one-class classification method, SVDD, under certain circumstances.

Appendix: Proof of Theorem 2

Before proving Theorem 2, some necessary definitions and lemmas are introduced, without proof, as follows.

Definition 2 (ε-covering Number). Let (X, d) be a metric space and A ⊆ X. For ε > 0, a set U ⊆ X is called an ε-cover for A if for every a ∈ A there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A is the minimal cardinality of an ε-cover for A, and is denoted by N(ε, A, d). Specifically, in this report, suppose X is a compact subset of R^d and F is a linear function space with the distance defined by the infinity norm, i.e., for f ∈ F, ||f||_{l_∞(T)} = max_{x∈T} |f(x)|. Then let

    N(ε, F, n) := max_{T ⊆ X, |T| = n} N(ε, F, l_∞(T)).

Definition 3. Let L(X) be the set of non-negative functions f on X with countable support. Define the 1-norm on L(X) by ||f||_1 := Σ_{x ∈ supp(f)} f(x). Then L_B(X) := {f ∈ L(X) : ||f||_1 ≤ B}.

Lemma 1 (Theorem 14 in [3]). Suppose we are given a training set T = {x_i}_{i=1}^n generated i.i.d. from an underlying but unknown distribution P which does not contain discrete components, where x_i ∈ X for all i. For any γ > 0 and f ∈ F, fix B ≥ D(T, f, θ); then with probability 1 − δ,

    P{x : f(x) < θ − 2γ} ≤ (2/n) (k + log₂(n/δ)),

where k = log₂ N(γ/2, F, 2n) + log₂ N(γ/2, L_B(X), 2n).

Lemma 2 (Lemma 7.14 of [10]). For all γ > 0,

    log₂ N(γ, L_B(X), n) ≤ b log₂( e(n + b − 1)/b ),

where b = B/(2γ).

Lemma 3 (Williamson et al. [11]). Let F be the class of linear classifiers with norm at most 1, confined to a unit ball centered at the origin. Then for ε ≥ c/√n, where c = 103,

    log₂ N(ε, F, n) ≤ (c²/ε²) log₂( (2 ln 2/c²) ε² n ).

Using Lemmas 1, 2, and 3 as tools, we are now ready to prove Theorem 2.

Proof of Theorem 2. Note that in Theorem 2, R_{w,ρ−γ} = {x : f_w(x) ≥ ρ − γ}, so we have

    {x : x ∉ R_{w,ρ−γ}} = {x : f_w(x) < ρ − γ}.

Therefore, the idea is to apply Lemma 1 (with 2γ replaced by γ) to prove Theorem 2.

First, notice that we can treat the offset ρ as 0 without loss of generality. Second, in order to invoke Lemma 3 while calculating k in Lemma 1, the linear class F is required to be confined to a unit ball centered at the origin. Hence, we rescale the function f_w to f̂ = f_w/||w||. The decision boundary remains the same if we also rescale γ̂ = γ/||w||.

In Lemma 1, B is fixed; in Theorem 2, however, B does not have to be fixed. Hence we apply Lemma 1 once for each relevant value of

    log₂ N(γ̂/4, L_B(X), 2n).                                                     (12)

For the error bound (2/n)(k + log₂(n/δ)) to be nontrivial, k has to be smaller than n/2, and so does the quantity (12). It therefore suffices to make at most n/2 applications of Lemma 1, using a confidence of 2δ/n for each application. Therefore, by Lemma 1, we have

    P{x : f̂(x) < −γ̂} ≤ (2/n) (k + log₂(n²/(2δ))),

where

    k = log₂ N(γ̂/4, F, 2n) + log₂ N(γ̂/4, L_B(X), 2n).

In addition, applying Lemmas 2 and 3 with γ = ε = γ̂/4, sample size 2n, and B = D (so that b = B/(2 · γ̂/4) = 2D/γ̂), we have

    k ≤ 16c² log₂( (ln 2/(4c²)) γ̂² n )/γ̂² + (2D/γ̂) log₂( e(2n + 2D/γ̂ − 1)/(2D/γ̂) ) + 2
      = 16c² log₂( (ln 2/(4c²)) γ̂² n )/γ̂² + (2D/γ̂) log₂( e((2n − 1)γ̂/(2D) + 1) ) + 2,

which is exactly the k of Eq. (9) after substituting c₁ = 16c² and c₂ = ln 2/(4c²). Theorem 2 is thus proved.

References

[1] P. Moulin and V. V. Veeravalli, Detection and estimation theory. ECE561 lecture notes, UIUC, 2015.

[2] P. Moulin, Topics in signal processing: Statistical learning and pattern recognition. ECE544NA lecture notes, UIUC, 2015.

[3] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2001.

[4] J. Platt, "Fast training of support vector machines using sequential minimal optimization," Advances in Kernel Methods: Support Vector Learning, vol. 3, 1999.

[5] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.

[6] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] D. M. Tax and R. P. Duin, "Uniform object generation for optimizing one-class classifiers," The Journal of Machine Learning Research, vol. 2, pp. 155-173, 2002.

[8] W. Wolberg and O. Mangasarian, "Multisurface method of pattern separation for medical diagnosis applied to breast cytology," in Proceedings of the National Academy of Sciences, pp. 9193-9196, Dec. 1990.

[9] D. M. Tax and R. P. Duin, "Support vector domain description," Pattern Recognition Letters, vol. 20, no. 11, pp. 1191-1199, 1999.

[10] J. Shawe-Taylor and N. Cristianini, "On the generalization of soft margin algorithms," IEEE Transactions on Information Theory, vol. 48, no. 10, pp. 2721-2735, 2002.

[11] R. C. Williamson, A. J. Smola, and B. Schölkopf, "Entropy numbers of linear function classes," in COLT, pp. 309-319, 2000.