One-class Classification: ν-SVM

Qiang Ning

Dec. 10, 2015

Abstract

One-class classification is a special kind of classification problem in which the training set only consists of samples from one class. The conventional SVM fails to handle the one-class classification problem because of the lack of information about the other class. The ν-SVM addresses this issue by estimating the support of the probability density of the class for which we do have sufficient samples, and then treating new samples that fall outside of this support as outliers. The resulting optimization problem can be readily solved in a similar way as the conventional SVM, and its generalization error can also be theoretically upper bounded. Both simulated and real medical data are used in this report to demonstrate the performance of the ν-SVM, which should prove useful in various outlier/abnormality detection tasks.

1 Introduction

The goal of classification is to differentiate objects and to understand information. When the underlying probability distribution is readily available, classification tasks can be easily handled within the Bayesian framework. For instance, in binary classification/detection, given prior distributions π_y, y ∈ {±1}, and conditional distributions p_y(x), y ∈ {±1}, where x ∈ R^d is the observation and y is the class label, the optimal classifier that minimizes the 0-1 loss is a likelihood ratio test:

    δ_B(x) = 1 if L(x) ≥ η, and δ_B(x) = −1 otherwise,

where L(x) = p_1(x)/p_{−1}(x) is the likelihood ratio function and η = π_{−1}/π_1 is the test threshold [1].

In practice, however, the underlying probability distribution is usually unavailable due to the lack of knowledge about the physical and statistical laws governing the different classes of observations. On the other hand, observation data can often be collected easily. Therefore, it has been proposed to learn a classifier based on existing observations (i.e., the training dataset), with the hope/assumption that a classifier that separates the training dataset well can also classify future observations (i.e., the test dataset) well. Various classification methods have been proposed along this way: empirical risk minimization (ERM), support vector machines (SVM), logistic regression, neural networks, etc. [2]

Nevertheless, in some real-world applications, e.g., outlier detection, not only is the underlying probability distribution unavailable, but it is also very expensive or even impossible to collect data from both classes. As a result, the training set only consists of data from one class (or the data from the other class are insufficient). The classification problem in this scenario is often called the one-class classification problem. The so-called ν-SVM, which we explore in this report, is one of the popular methods for solving this problem [3]. Throughout this report, we refer to binary and multi-class classification problems as conventional classification problems.

2 Challenges

As implied by its name, the one-class classification problem is challenging because no (or insufficient) information about the outliers is available, and conventional classification methods cannot be used. To better illustrate this point, we take the conventional SVM (here we focus on the maximum-margin classifier) as an example. As in [2], the maximum-margin classifier constructs a classifier δ : R^d → {±1} such that δ(x) = sgn[g(x)], where the discriminant function is g(x) = w^T x + w_0.¹ Given a training set with n samples, {x_i, y_i}_{i=1}^n, where x_i ∈ R^d and y_i ∈ {±1} for all i, the weight vector w and the bias w_0 are obtained by solving the following optimization problem:

    min_{w,w_0}  (1/2)||w||_2^2
    s.t.  y_i [w^T x_i + w_0] ≥ 1,   i = 1, …, n.

If all the training samples are from class +1, i.e., y_i = 1 for all i, then obviously the solution is w* = 0 together with any w_0* ≥ 1, and the resulting classifier is δ(x) ≡ 1. Therefore, if we directly apply the conventional SVM to one-class classification, the resulting classifier has no power to identify outliers.

This failure of applying the conventional SVM (and other conventional classification methods as well) to one-class classification can be explained by the fact that conventional classification methods are designed to separate different classes. When no or few training samples are from class −1, separation can be trivially satisfied, and the generalizability of the trained classifier is thus poor. Conceptually speaking, in conventional classification methods, a description of one class is learnt via comparison to other classes rather than from the class itself. In one-class classification, the problem becomes challenging because we need to learn a description of the class itself; a minimal sketch of the degeneracy just described is given below.

¹We can also play the kernel trick here, i.e., replace x by Φ(x), where Φ : R^d → R^k is a mapping from input space to feature space.
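To make the degeneracy concrete, here is a minimal numpy illustration (the two-dimensional data and the numbers are made up for illustration only, not taken from this report): with all labels equal to +1, the trivial pair (w, w_0) = (0, 1) already satisfies every margin constraint with the smallest possible ||w||, yet the resulting classifier labels every conceivable test point +1.

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(50, 2))     # hypothetical one-class training data
    y_train = np.ones(50)                  # every label is +1

    # Trivial solution of the max-margin problem when only class +1 is present.
    w, w0 = np.zeros(2), 1.0

    # All margin constraints y_i (w^T x_i + w0) >= 1 hold, with the minimal ||w|| = 0 ...
    assert np.all(y_train * (X_train @ w + w0) >= 1)

    # ... yet nothing is ever flagged as an outlier, even points far from the data.
    X_test = np.array([[0.0, 0.0], [100.0, -100.0]])
    print(np.sign(X_test @ w + w0))        # [1. 1.]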

An extreme approach is to estimate the probability density itself from the training set, which would then allow us to solve whatever outlier detection problem we face. However, probability density estimation is still an open problem in learning theory. One of the major drawbacks of probability density methods is the requirement of a large training set, especially when dealing with high-dimensional features. To address this issue, the ν-SVM was proposed in [3]; it turns to an alternative problem: probability density support estimation. It learns a domain description of the one-class training set and then uses this domain description to detect outliers. The generalization error of the ν-SVM can also be bounded theoretically.

3 One-class Classification: ν-SVM

Following Vapnik's principle of never solving a problem that is more general than the one we actually need to solve, the ν-SVM estimates the support of the probability density (i.e., a smallest region), instead of estimating the probability density itself. Specifically, the ν-SVM separates the data from the origin with maximum margin (which is where the "SVM" in its name comes from). The strategy is to find a smallest region capturing most of the data points, so that within that region the classifier decides +1, and otherwise it decides −1 (outlier). Next we describe the ν-SVM method by its formulation and algorithm.

3.1 Formulation

Given a training set with n samples, {x_i}_{i=1}^n where x_i ∈ R^d, the ν-SVM solves the following problem:

    min_{w,ρ,ξ}  (1/2)||w||_2^2 + (1/(νn)) Σ_{i=1}^n ξ_i − ρ                    (1)
    s.t.  w^T Φ(x_i) ≥ ρ − ξ_i,  ξ_i ≥ 0,  i = 1, …, n,

where ν ∈ (0, 1] and Φ(·) is the transformation from input space to feature space. The decision function is

    δ(x) = sgn[g(x)],                                                            (2)

where the discriminant function is g(x) = w^T Φ(x) − ρ.

By formulating Eq. (1), we expect the discriminant function to be positive for most training samples, while keeping (1/2)||w||_2^2 − ρ small. The trade-off between these two goals is controlled by ν. Let us first assume that the slack variables ξ_i are zero, which would be the case in the limit ν → 0. One significant difference between the ν-SVM and the SVM is the introduction of ρ. To understand why the introduction of ρ leads to a desirable classifier, observe in Fig. 1 that the distance from the origin to the decision line is d = |ρ|/||w||.

The minimization of −ρ is equivalent to the maximization of ρ. If a data point lies above the line (e.g., point A in Fig. 1), then ρ > 0, and a larger ρ indicates a larger d; if a data point lies below the line (e.g., point B in Fig. 1), then ρ < 0, and a larger ρ indicates a smaller d. In both cases, the discriminant line moves toward the data point. Therefore, the introduction of ρ leads to a discriminant function that tightly bounds the training set. Additionally, to handle the case where there are outliers in the training set, slack variables ξ_i are introduced, similarly to what we did for the soft-margin SVM [2]. As stated earlier, the trade-off between data consistency and boundary tightness is controlled by ν, but ν is actually more than simply a regularization parameter, as will be shown later in this report.

Figure 1: The normal vector of the discriminant boundary g(x) = w^T x − ρ = 0 is w. The distance from the origin to the boundary is thus d = |ρ|/||w||. If point A lies in the region g(x) > 0, then the origin satisfies g(0) = −ρ < 0, and ρ is thus positive; if point B lies in the region g(x) > 0, then the origin satisfies g(0) = −ρ > 0, and ρ is thus negative.
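Before turning to the dual, it may help to see problem (1) in action. The following is a minimal, illustrative sketch (the toy data, kernel width, and value of ν are assumptions, not taken from this report) using scikit-learn's OneClassSVM, which implements this ν-parameterized one-class formulation on top of LIBSVM [6]; points with g(x) ≥ 0 are labeled +1 and all others −1.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(1)
    X_train = rng.normal(loc=2.0, scale=0.5, size=(200, 2))  # hypothetical one-class data

    # nu plays the role of the regularization parameter nu in Eq. (1).
    clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X_train)

    X_test = np.array([[2.0, 2.0],     # near the training cloud
                       [8.0, -3.0]])   # far away; expected to be an outlier
    print(clf.predict(X_test))            # +1 = inside the estimated support, -1 = outlier
    print(clf.decision_function(X_test))  # value of the discriminant g(x) = w^T Phi(x) - rho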

3.2 Dual Problem

Problem (1) is referred to as the primal optimization problem. As for the conventional SVM, it is usually preferable to work with its dual problem. First, we introduce a Lagrangian with α_i, β_i ≥ 0:

    L(w, ξ, ρ, α, β) = (1/2)||w||_2^2 + (1/(νn)) Σ_i ξ_i − ρ − Σ_i α_i (w^T Φ(x_i) − ρ + ξ_i) − Σ_i β_i ξ_i,   (3)

whose first derivatives w.r.t. the primal variables w, ξ_i and ρ are

    ∂L/∂w   = w − Σ_i α_i Φ(x_i),
    ∂L/∂ξ_i = 1/(νn) − α_i − β_i,   i = 1, …, n,
    ∂L/∂ρ   = −1 + Σ_i α_i.

Then, by setting these derivatives to zero, we have

    w = Σ_i α_i Φ(x_i),                                   (4)
    α_i = 1/(νn) − β_i ≤ 1/(νn),   i = 1, …, n,           (5)
    Σ_i α_i = 1.                                          (6)

Substituting Eq. (4), Eq. (5) and Eq. (6) into Eq. (3), we obtain the dual problem:

    min_α  (1/2) Σ_{i,j=1}^n α_i α_j k(x_i, x_j)                                 (7)
    s.t.  0 ≤ α_i ≤ 1/(νn),  i = 1, …, n,  and  Σ_i α_i = 1,

where k(x_i, x_j) = Φ(x_i)^T Φ(x_j) is the kernel function. Like the primal problem (1), Eq. (7) is a quadratic program. Fast iterative algorithms exist for the dual problem (7). An algorithm originally proposed for classification is the so-called sequential minimal optimization (SMO) algorithm [4]. A modified version of SMO tailored to Eq. (7) can be found in [3][5]. Once an optimizing α* is obtained by solving Eq. (7), we can recover w* using Eq. (4). As for ρ*, we notice that the constraints in Eq. (1) become equalities whenever both α_i and β_i are positive, i.e., whenever 0 < α_i < 1/(νn). Picking any such index i, we obtain

    ρ* = (w*)^T Φ(x_i) = Σ_{j=1}^n α_j* k(x_j, x_i).
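Since (7) is just a small quadratic program when n is modest, it can also be solved with a generic convex solver instead of SMO. The following is an illustrative, non-optimized sketch (the toy data, RBF kernel, and value of ν are assumptions): it solves (7) with CVXPY, recovers ρ* from a margin support vector, and evaluates g(x) = Σ_i α_i* k(x_i, x) − ρ*.

    import numpy as np
    import cvxpy as cp
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(2)
    X = rng.normal(size=(60, 2))                  # hypothetical training set
    n, nu, gamma = len(X), 0.2, 0.5               # assumed hyperparameters

    K = rbf_kernel(X, X, gamma=gamma)             # Gram matrix k(x_i, x_j)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))  # so that a^T K a = ||L^T a||^2

    a = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(L.T @ a))        # (1/2) a^T K a
    constraints = [a >= 0, a <= 1.0 / (nu * n), cp.sum(a) == 1]   # dual constraints of (7)
    cp.Problem(objective, constraints).solve()
    alpha = a.value

    # Recover rho from an index with 0 < alpha_i < 1/(nu*n) (a "margin" support vector).
    i = int(np.argmax((alpha > 1e-6) & (alpha < 1.0 / (nu * n) - 1e-6)))
    rho = K[:, i] @ alpha

    def g(X_new):
        # Discriminant g(x) = sum_i alpha_i k(x_i, x) - rho.
        return rbf_kernel(X_new, X, gamma=gamma) @ alpha - rho

    print(np.sign(g(X[:5])))    # most training points should land on the +1 side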

4 Theory

Very nice theoretical results have been proven in [3]. In this report, we focus on two of the theorems introduced in [3] and go through their proofs. For Theorem 1, an alternative proof is provided instead of the original one in [3]. For Theorem 2, we fill in the gaps that the authors left and correct typos.

Theorem 1 (ν-property). Assume the solution to Eq. (1) satisfies ρ ≠ 0. The following statements hold:

1. ν is a lower bound on the fraction of support vectors.

2. ν is an upper bound on the fraction of outliers.

Proof. To prove the two properties, the authors of [3] used a proposition that relates the w and ρ obtained in one-class classification with those obtained in a corresponding binary classification. Here, however, we can prove them alternatively as follows.

Let I = {i : α_i ≠ 0}. From Eqs. (5) and (6), we have

    1 = Σ_{i=1}^n α_i = Σ_{i∈I} α_i = |I|/(νn) − Σ_{i∈I} β_i ≤ |I|/(νn).

Therefore, |I| ≥ νn, i.e., the number of nonzero α_i's is lower bounded by νn. Note that nonzero α_i's correspond to support vectors, so property 1 holds.

Let J = {j : β_j = 0}. Again from Eqs. (5) and (6), we have

    1 = Σ_{i=1}^n α_i = Σ_{i=1}^n (1/(νn) − β_i) = |J|/(νn) + Σ_{j∉J} α_j ≥ |J|/(νn).

Therefore, |J| ≤ νn, i.e., the number of zero β_j's is upper bounded by νn. Note that every outlier (ξ_j > 0) must have β_j = 0 by complementary slackness, so property 2 holds.

Besides the ν-property, which reveals the underlying meaning of the regularization parameter ν, the learning generalizability of the ν-SVM in terms of probability density support estimation can also be characterized as follows.

Definition 1. Let f : X → R. For a fixed θ ∈ R and x ∈ X, let d(x, f, θ) = max{θ − f(x), 0}. Then, for a training set T = {x_i}_{i=1}^n, define

    D(T, f, θ) = Σ_{x∈T} d(x, f, θ).

Theorem 2 (Generalization Error Bound). Assume we are given a training set T = {x_i}_{i=1}^n generated i.i.d. from an underlying but unknown distribution P which does not contain discrete components. Suppose a function f_w(x) = w^T Φ(x) and a bias ρ are obtained by solving the optimization problem Eq. (1). Let R_{w,ρ} = {x : f_w(x) ≥ ρ} denote the decision region. Then, with probability 1 − δ over the draw of the random sample T from P, the following holds for any γ > 0:

    P{x : x ∉ R_{w,ρ−γ}} ≤ (2/n) (k + log₂(n²/(2δ))),                            (8)

where

    k = c₁ log₂(c₂ γ̂² n)/γ̂² + (2D/γ̂) log₂( e((2n − 1)γ̂/(2D) + 1) ) + 2,          (9)

c₁ = 16c², c₂ = ln 2/(4c²), c = 103, γ̂ = γ/||w||, and D = D(T, f_w, ρ).

A training set T determines a decision region R_{w,ρ}: if a new sample falls into R_{w,ρ}, we assert that it is generated from the distribution P; otherwise, we assert that it is an outlier. We make such assertions because we expect points generated according to P to indeed lie in R_{w,ρ}. Theorem 2 gives us the guarantee that, with a certain probability (i.e., 1 − δ), the probability that a new sample lies outside of the region R_{w,ρ−γ} is bounded from above. Moreover, Theorem 2 also serves as a characterization of the ν-SVM, from which we can gain the following insights.

1. The theorem suggests not using the offset ρ obtained by solving Eq. (1) directly, but a smaller value ρ − γ, which corresponds to a larger decision region R_{w,ρ−γ}.

2. If D = 0, then as n → ∞ the bound in Eq. (8) goes to zero, i.e., the complete support is obtained asymptotically. However, D is measured with respect to ρ, while the bound applies to the larger region R_{w,ρ−γ}. Any point in R_{w,ρ−γ} \ R_{w,ρ} will contribute to D. Therefore, D is strictly positive in general, and this bound does not imply asymptotic convergence to the true support.

3. The purpose of ν is to allow outliers in the training set and thus to improve robustness. Since a larger ν indicates a larger D, and hence a larger k, an unnecessarily large ν will lead to a looser bound. Therefore, prior knowledge about the percentage of outliers in the training set is desirable.

The proof of Theorem 2 requires the concepts of covering numbers and function spaces, and can be found in the Appendix.

5 Experiments

A comprehensive off-the-shelf package for SVMs is LIBSVM [6], in which the ν-SVM is also available.

5.1 ν-property

In this section, we wish to verify the ν-property of Theorem 1. A crescent-shaped two-dimensional simulated dataset from [7] is used; its 500 samples are shown in Fig. 2. An example of using the ν-SVM for one-class classification is shown in Fig. 3, where the Gaussian kernel k(x, y) = exp(−0.06 ||x − y||²) was used and ν was set to 0.05. We can see that a smooth, crescent-shaped decision boundary (blue curve) was learned, which tightly bounds a large portion of the training samples while allowing a certain portion of outliers (black stars). Using the same kernel function, the fractions of support vectors (SVs) and outliers (OLs) for different values of ν are summarized in Table 1 to verify Theorem 1.
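Experiments of this kind can be run with LIBSVM directly; the sketch below shows how a table like Table 1 could be produced with scikit-learn's OneClassSVM (a wrapper around LIBSVM). The crescent-shaped dataset of [7] is not bundled here, so a stand-in two-dimensional sample is generated; the kernel width matches the report's k(x, y) = exp(−0.06||x − y||²), but the resulting numbers are only illustrative.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 2))        # stand-in for the 500-sample dataset of [7]

    for nu in [0.05, 0.10, 0.30, 0.50, 0.70, 0.90]:
        clf = OneClassSVM(kernel="rbf", gamma=0.06, nu=nu).fit(X)
        frac_sv = len(clf.support_) / len(X)       # fraction of support vectors
        frac_ol = np.mean(clf.predict(X) == -1)    # fraction of training outliers
        # Theorem 1: nu <= frac_sv, and (approximately) frac_ol <= nu.
        print(f"nu={nu:.2f}  SVs={frac_sv:.3f}  OLs={frac_ol:.3f}")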

Figure 2: A simple 2-dimensional dataset with 500 samples from [7]. Blue circle: training sample.

Figure 3: An example of using the ν-SVM to learn a smallest region that captures most of the points. Blue curve: decision boundary obtained. Black star: outliers.

It can be seen from Table 1 that the fraction of SVs is lower bounded by ν. Moreover, the fraction of OLs is approximately upper bounded by ν, in spite of some small fluctuations (e.g., when ν = 30%, 70%), which can be explained by the fact that we are not in the asymptotic regime. Table 1 does indicate that ν can be used to approximate/control the fraction of SVs and OLs.

Table 1: The fractions of SVs and OLs for different values of ν.

    ν (%)    Fraction of SVs (%)    Fraction of OLs (%)
    5        6.2                    5.0
    10       11.0                   10.0
    30       31.6                   30.2
    50       50.2                   49.8
    70       70.2                   70.2
    90       90.2                   90.0

5.2 Breast Cancer Classification

A dataset retrieved from the Wisconsin Breast Cancer Databases at UCI is used to demonstrate the performance of the ν-SVM.² It contains 699 instances in total, collected between 1989 and 1991 by Dr. William H. Wolberg at the University of Wisconsin Hospitals [8]; 458 of the instances are benign and 241 malignant. The dimensionality of the feature space is 9: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses, all quantized from 1 to 10. Figure 4 is a scatter plot of the dataset after dimensionality reduction by PCA.

Figure 4: Scatter plot of the breast cancer dataset. The original data were projected onto the first two principal eigenvectors of the empirical covariance matrix. A natural clustering of benign and malignant instances can be observed.

²Link to data: http://homepage.tudelft.nl/n9d04/occ/505/oc_505.mat

Table 2 summarizes the performance of the ν-SVM compared with the conventional binary SVM. When the training set has an insufficient number of malignant instances, there are usually two options. One is to still train a conventional SVM using the whole training set; the other is to train a ν-SVM using only the benign instances in the training set. Comparing the first two rows, we can tell that training a ν-SVM on the benign instances only is better in terms of detecting malignant cases. Comparing the ν-SVM (row 2) with rows 3-5 of Table 2, we can also see that when the size of the training set remains the same, one may prefer one-class classification if the detection of outliers is more important, unless sufficient numbers of samples are available for both classes (e.g., row 6). Figure 5 provides a visual explanation of why the ν-SVM is a better choice when dealing with unbalanced learning tasks. Therefore, we can see the importance of using one-class classification when information from one class is insufficient. A sketch of this comparison protocol is given after Table 2.

Table 2: The performance of the ν-SVM. The left two columns give the number of benign/malignant instances used in the training set. The right two columns give the probability of detecting benign cases and the probability of detecting malignant cases, respectively. The second row (300 benign, 0 malignant) corresponds to the ν-SVM; the other rows are conventional soft-margin SVMs.

    # Benign    # Malignant    Detection of Benign (%)    Detection of Malignant (%)
    300         20             100.0                      87.2
    300         0              97.5                       96.5
    290         10             100.0                      45.4
    280         20             100.0                      87.9
    270         30             99.4                       96.5
    200         100            99.4                       97.9
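The sketch below outlines the comparison protocol described above. Since the .mat file linked earlier is not parsed here, synthetic stand-ins are generated in place of the real benign/malignant feature matrices, and the split sizes and hyperparameters are assumptions; the point is only to show the two training options side by side.

    import numpy as np
    from sklearn.svm import OneClassSVM, SVC

    rng = np.random.default_rng(4)
    # Synthetic stand-ins for the 458 benign and 241 malignant 9-dimensional instances.
    X_benign = np.clip(rng.normal(3, 2, size=(458, 9)).round(), 1, 10)
    X_malignant = np.clip(rng.normal(7, 2, size=(241, 9)).round(), 1, 10)

    def detection_rates(predict, X_b_test, X_m_test):
        # Fraction of benign test points labeled +1 and malignant test points labeled -1.
        return np.mean(predict(X_b_test) == 1), np.mean(predict(X_m_test) == -1)

    X_b_train, X_b_test = X_benign[:300], X_benign[300:]
    X_m_train, X_m_test = X_malignant[:20], X_malignant[20:]

    # Option 1: nu-SVM trained on benign instances only (one-class).
    oc = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X_b_train)
    print(detection_rates(oc.predict, X_b_test, X_m_test))

    # Option 2: conventional soft-margin SVM on the unbalanced two-class training set,
    # with labels +1 (benign) and -1 (malignant).
    X2 = np.vstack([X_b_train, X_m_train])
    y2 = np.hstack([np.ones(len(X_b_train)), -np.ones(len(X_m_train))])
    svc = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X2, y2)
    print(detection_rates(svc.predict, X_b_test, X_m_test))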

Figure 5: (a) A ν-SVM trained using 300 benign samples, and (b) its performance on the test set; (c)(e) soft-margin SVMs trained using 290 benign samples plus 10 malignant samples, and 200 benign samples plus 100 malignant samples, respectively, and (d)(f) their performance. Blue: benign samples. Red: malignant samples. Circle: training samples. Cross: test samples. Black curve: decision boundary obtained accordingly. Note that when only an insufficient number of malignant samples is available, the ν-SVM can find a decision boundary that tightly bounds the benign samples, as in (a). The conventional SVM, however, is significantly impaired unless a sufficient number of samples from both classes is available, as in (e).

6 Discussion

We have seen the importance of using one-class classification methods for learning tasks where the training set contains only one class. The ν-SVM, as one popular one-class classification method, can be proved to be equivalent to another method named SVDD: Support Vector Domain Description [9].

6.1 SVDD

Suppose a description of a data set T = {x_i}_{i=1}^n is required. While the ν-SVM bounds the data set using hyperplanes, SVDD uses spheres instead. Specifically, we wish to find a smallest ball into which most of the data points in T can be put. The resulting primal optimization problem is

    min_{R,ξ,c}  R² + C Σ_{i=1}^n ξ_i                                             (10)
    s.t.  ||Φ(x_i) − c||² ≤ R² + ξ_i,  ξ_i ≥ 0,  i = 1, …, n,

where c and R are the center and radius of the desired ball, ξ_i are the slack variables, and C is a regularization parameter balancing the trade-off between ball radius and data consistency. The dual problem is thus

    min_α  Σ_{i,j=1}^n α_i α_j k(x_i, x_j) − Σ_{i=1}^n α_i k(x_i, x_i)            (11)
    s.t.  0 ≤ α_i ≤ C,  i = 1, …, n,  and  Σ_{i=1}^n α_i = 1.

6.2 Relation to ν-SVM

SVDD addresses the same problem in a different way but, interestingly, is closely related to the ν-SVM. We describe its relation to the ν-SVM by the following theorem.

Theorem 3 (Connection between ν-SVM and SVDD). If k(x, y) only depends on x − y, then the solution of the ν-SVM is the same as that of SVDD, with ν = 1/(nC).

Proof. First, it is obvious that if k(x, y) only depends on x − y, then k(x, x) is a constant. If ν is further set to 1/(nC), then (7) and (11) have the same feasible set, and their objectives differ only by a positive scaling and an additive constant (Σ_i α_i k(x_i, x_i) is constant because k(x, x) is constant and Σ_i α_i = 1), so they are effectively the same problem. Therefore, the optimizing α* is the same for both methods. It then only remains to show that the decision functions of the ν-SVM and SVDD coincide given the same α*. We already know that

    δ_ν-SVM(x) = sgn[ Σ_i α_i* k(x_i, x) − ρ ],
    δ_SVDD(x)  = sgn[ R² − Σ_{i,j} α_i* α_j* k(x_i, x_j) + 2 Σ_i α_i* k(x_i, x) − k(x, x) ].

Let x_m be one of the points with 0 < α_m* < C = 1/(νn). Then we have

    ρ  = Σ_i α_i* k(x_i, x_m),
    R² = Σ_{i,j} α_i* α_j* k(x_i, x_j) − 2 Σ_i α_i* k(x_i, x_m) + k(x_m, x_m).

Therefore,

    δ_ν-SVM(x) = sgn[ Σ_i α_i* k(x_i, x) − Σ_i α_i* k(x_i, x_m) ],
    δ_SVDD(x)  = sgn[ 2 Σ_i α_i* k(x_i, x) − 2 Σ_i α_i* k(x_i, x_m) + k(x_m, x_m) − k(x, x) ].

Since k(x_m, x_m) = k(x, x) and sgn[g(x)] = sgn[2g(x)], we have δ_ν-SVM(x) ≡ δ_SVDD(x).
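As a quick numerical sanity check of Theorem 3 (not part of the original report), the sketch below solves the SVDD dual (11) with CVXPY on a toy RBF-kernel problem and compares its labels with those of scikit-learn's OneClassSVM run with the matched parameter ν = 1/(nC). The data and hyperparameters are arbitrary, and agreement is only expected up to numerical tolerance near the boundary.

    import numpy as np
    import cvxpy as cp
    from sklearn.svm import OneClassSVM
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(5)
    X = rng.normal(size=(80, 2))           # toy training set (assumed, not from the report)
    n, gamma, C = len(X), 0.5, 0.05        # assumed kernel width and SVDD penalty
    nu = 1.0 / (n * C)                     # Theorem 3 mapping; here nu = 0.25

    # SVDD dual (11): min a^T K a - sum_i a_i k(x_i, x_i), s.t. 0 <= a_i <= C, sum_i a_i = 1.
    K = rbf_kernel(X, X, gamma=gamma)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))   # a^T K a = ||L^T a||^2
    a = cp.Variable(n)
    obj = cp.Minimize(cp.sum_squares(L.T @ a) - np.diag(K) @ a)
    cp.Problem(obj, [a >= 0, a <= C, cp.sum(a) == 1]).solve()
    alpha = a.value

    # Radius from a margin support vector (0 < alpha_m < C), as in the proof above.
    m = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    R2 = alpha @ K @ alpha - 2 * (K[:, m] @ alpha) + K[m, m]

    def svdd_predict(X_new):
        # +1 inside the ball, -1 outside; uses k(x, x) = 1 for the RBF kernel.
        Kt = rbf_kernel(X_new, X, gamma=gamma)
        dist2 = 1.0 - 2.0 * (Kt @ alpha) + alpha @ K @ alpha
        return np.where(R2 - dist2 >= 0, 1, -1)

    # nu-SVM with the matched nu; the two label assignments should largely agree.
    oc = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X)
    X_new = rng.normal(size=(300, 2))
    print(np.mean(svdd_predict(X_new) == oc.predict(X_new)))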

Theorem 3 is consistent with our intuition: when k(x, y) only depends on x − y, all the mapped patterns lie on a sphere in the kernel space, so the smallest sphere found by SVDD can equivalently be cut off by a hyperplane (the ν-SVM). The theorem is rather important, not only because it theoretically relates two popular one-class classification methods, but also because it implies that the generalization error bound derived for the ν-SVM also applies to SVDD, and that some parameter selection methods for SVDD (e.g., [7]) can also be applied to the ν-SVM.

7 Conclusion

One-class classification, also known as data domain description, is not only a classification problem but also an important step towards learning information and understanding knowledge from training data. The ν-SVM method addresses the one-class classification problem by finding a smallest region (the support of the probability density) that bounds most of the training samples. The resulting optimization problem is similar to that of the conventional SVM, and fast iterative algorithms exist for solving it. Its generalization error has also been proved to be bounded from above, which is a very desirable property of a learning algorithm. In this report, we have provided our own proof of the so-called ν-property (Theorem 1) and verified it on a simulated data set. The ν-property gives the regularization parameter ν an operational meaning and can thus be leveraged to control the fraction of support vectors and outliers in practice. Real-world data were also used to demonstrate the usefulness of the ν-SVM when dealing with insufficient negative training samples. The results indicate that when there are insufficient negative samples in the training set, it is better to use only the positive samples and resort to one-class classification. We have also proved in Theorem 3 that the ν-SVM is equivalent to another popular one-class classification method, SVDD, under certain circumstances.

Appendix: Proof of Theorem 2

Before proving Theorem 2, some necessary definitions and lemmas are introduced, without proof, as follows.

Definition 2 (ε-covering Number). Let (X, d) be a metric space and A ⊆ X. For ε > 0, a set U ⊆ X is called an ε-cover for A if for every a ∈ A there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A is the minimal cardinality of an ε-cover for A, and is denoted by N(ε, A, d). Specifically, in this report, suppose X is a compact subset of R^d and F is a linear function space with the distance defined by the infinity norm, i.e., for f ∈ F, ||f||_{l_∞(T)} = max_{x∈T} |f(x)|. Then let

    N(ε, F, n) := max_{T ⊆ X, |T| = n} N(ε, F, l_∞(T)).

Definition 3. Let L(X) be the set of non-negative functions f on X with countable support. Define the 1-norm on L(X) by ||f||_1 := Σ_{x ∈ supp(f)} f(x). Then L_B(X) := {f ∈ L(X) : ||f||_1 ≤ B}.

Lemma 1 (Theorem 14 in [3]). Suppose we are given a training set T = {x_i}_{i=1}^n generated i.i.d. from an underlying but unknown distribution P which does not contain discrete components, where x_i ∈ X for all i. For any γ > 0 and f ∈ F, fix B ≥ D(T, f, θ); then with probability 1 − δ,

    P{x : f(x) < θ − 2γ} ≤ (2/n) (k + log₂(n/δ)),

where k = log₂ N(γ/2, F, 2n) + log₂ N(γ/2, L_B(X), 2n).

Lemma 2 (Lemma 7.14 of [10]). For all γ > 0,

    log₂ N(γ, L_B(X), n) ≤ b log₂( e(n + b − 1)/b ),

where b = B/(2γ).

Lemma 3 (Williamson et al. [11]). Let F be the class of linear classifiers with norm at most 1, confined to a unit ball centered at the origin. Then for ε ≥ c/√n, where c = 103,

    log₂ N(ε, F, n) ≤ (c²/ε²) log₂( (2 ln 2/c²) ε² n ).

Using Lemmas 1, 2, and 3 as tools, we are now ready to prove Theorem 2.

Proof of Theorem 2. Note that in Theorem 2, R_{w,ρ−γ} = {x : f_w(x) ≥ ρ − γ}, so we have

    {x : x ∉ R_{w,ρ−γ}} = {x : f_w(x) < ρ − γ}.

Therefore, the idea is to apply Lemma 1 (with 2γ replaced by γ) to prove Theorem 2.

First, notice that we can treat the offset ρ as 0 without loss of generality. Second, in order to invoke Lemma 3 while calculating k in Lemma 1, the linear class F is required to be confined to a unit ball centered at the origin. Hence, we rescale the function f_w to f̂ = f_w/||w||. The decision boundary remains the same if we also rescale γ̂ = γ/||w||.

In Lemma 1, B is fixed; in Theorem 2, however, B does not have to be fixed. Hence we apply Lemma 1 once for each relevant value of

    log₂ N(γ̂/4, L_B(X), 2n).                                                     (12)

For the error bound (2/n)(k + log₂(n/δ)) to be nontrivial, k has to be smaller than n/2, and so does the quantity (12). It therefore suffices to make at most n/2 applications of Lemma 1, using a confidence of 2δ/n for each application. Therefore, by Lemma 1, we have

    P{x : f̂(x) < −γ̂} ≤ (2/n) (k + log₂(n²/(2δ))),

where

    k = log₂ N(γ̂/4, F, 2n) + log₂ N(γ̂/4, L_B(X), 2n).

In addition, applying Lemmas 2 and 3 with γ = ε = γ̂/4, sample size 2n, and B = D (so that b = B/(2 · γ̂/4) = 2D/γ̂), we have

    k ≤ 16c² log₂( (ln 2/(4c²)) γ̂² n )/γ̂² + (2D/γ̂) log₂( e(2n + 2D/γ̂ − 1)/(2D/γ̂) ) + 2
      = 16c² log₂( (ln 2/(4c²)) γ̂² n )/γ̂² + (2D/γ̂) log₂( e((2n − 1)γ̂/(2D) + 1) ) + 2,

which is exactly the k of Eq. (9) after substituting c₁ = 16c² and c₂ = ln 2/(4c²). Theorem 2 is thus proved.

References

[1] P. Moulin and V. V. Veeravalli, Detection and estimation theory. ECE561 lecture notes, UIUC, 2015.

[2] P. Moulin, Topics in signal processing: Statistical learning and pattern recognition. ECE544NA lecture notes, UIUC, 2015.

[3] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2001.

[4] J. Platt, "Fast training of support vector machines using sequential minimal optimization," Advances in Kernel Methods: Support Vector Learning, vol. 3, 1999.

[5] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.

[6] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] D. M. Tax and R. P. Duin, "Uniform object generation for optimizing one-class classifiers," The Journal of Machine Learning Research, vol. 2, pp. 155-173, 2002.

[8] W. Wolberg and O. Mangasarian, "Multisurface method of pattern separation for medical diagnosis applied to breast cytology," in Proceedings of the National Academy of Sciences, pp. 9193-9196, Dec. 1990.

[9] D. M. Tax and R. P. Duin, "Support vector domain description," Pattern Recognition Letters, vol. 20, no. 11, pp. 1191-1199, 1999.

[10] J. Shawe-Taylor and N. Cristianini, "On the generalization of soft margin algorithms," IEEE Transactions on Information Theory, vol. 48, no. 10, pp. 2721-2735, 2002.

[11] R. C. Williamson, A. J. Smola, and B. Schölkopf, "Entropy numbers of linear function classes," in COLT, pp. 309-319, 2000.