A fast iterative algorithm for support vector data description

https://doi.org/10.1007/s13042-018-0796-7

ORIGINAL ARTICLE

A fast iterative algorithm for support vector data description

Songfeng Zheng¹

Received: 9 February 2017 / Accepted: 26 February 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract
Support vector data description (SVDD) is a well-known model for pattern analysis when only positive examples are reliable. SVDD is usually trained by solving a quadratic programming problem, which is time consuming. This paper formulates the Lagrangian of a simply modified SVDD model as a differentiable convex function over the nonnegative orthant. The resulting minimization problem can be solved by a simple iterative algorithm. The proposed algorithm is easy to implement, without requiring any particular optimization toolbox. Theoretical and experimental analysis show that the algorithm converges r-linearly to the unique minimum point. Extensive experiments on pattern classification were conducted, and compared to the quadratic programming based SVDD (QP-SVDD), the proposed approach is much more computationally efficient (hundreds of times faster) and yields similar performance in terms of receiver operating characteristic curve. Furthermore, the proposed method and QP-SVDD extract almost the same set of support vectors.

Keywords Support vector data description · Quadratic programming · Penalty function method · Lagrangian dual function · Support vectors

* Songfeng Zheng, SongfengZheng@MissouriState.edu
¹ Department of Mathematics, Missouri State University, Springfield, MO 65897, USA

1 Introduction

There is a class of pattern recognition problems, such as novelty detection, where the task is to discriminate the pattern of interest from outliers. In such a situation, positive examples for training are relatively easier to obtain and more reliable. However, although negative examples are very abundant, it is usually difficult to sample enough useful negative examples for accurately modeling the outliers, since the outliers may belong to any class. In this case, it is reasonable to assume that the positive examples cluster in a certain way. As such, the goal is to accurately describe the class of positive examples, as opposed to the wide range of negative examples. For this purpose, Tax et al. [30, 31, 33] proposed the support vector data description (SVDD) method, which fits a tight hypersphere in the nonlinearly transformed feature space to include most of the positive examples. Thus, SVDD can be regarded as a description of the data distribution of interest. Extensive experiments [30, 31, 33] showed that SVDD is able to correctly identify negative examples in testing even though it has not seen any during training. Like the support vector machine (SVM) [34, Chap. 10], SVDD is a kernel based method, possessing all the related advantages of kernel machines. SVDD has been applied to various problems, including image classification [38], handwritten digit recognition [32], face recognition [18], remote sensing image analysis [22], medical image analysis [29], and multiclass problems [17, 37], to name a few. In addition, SVDD is a preliminary step for support vector clustering [4].

The formulation of SVDD leads to a quadratic programming problem. Although decomposition techniques [24, 25] or the sequential minimal optimization method [26] can be employed to solve the quadratic programming problem, the training of SVDD has time complexity roughly of order O(n³), where n is the training set size (see Sects. 4.2 and 4.3 for experimental verification). Thus, training an SVDD model can be very expensive for a large dataset. As such, given the wide application of SVDD, it is highly desirable to develop a time-efficient yet sufficiently accurate training algorithm for SVDD.
In this paper, we first slightly modify the formulation of the SVDD model, resulting in a more convex minimization problem with simpler constraints. We then apply the quadratic penalty function method [28, Sect. 6.2.2] from optimization theory to absorb an equality constraint in the dual problem, obtaining a differentiable convex function over the nonnegative orthant as the approximated Lagrangian function, which can be efficiently minimized by a simple iterative algorithm. We thus call the proposed model Lagrangian SVDD (L-SVDD).

The proposed L-SVDD algorithm is easy to implement, requiring no particular optimization toolbox besides basic matrix operations. Theoretical and experimental analysis show that the algorithm converges r-linearly to the global minimum point, and that its computational complexity is of order O(n²) multiplied by the iteration number. We test the proposed approach on face detection and handwritten digit recognition problems, and a detailed comparison of performance measures demonstrates that L-SVDD often yields testing accuracy very close to, or slightly better than, that of the quadratic programming based SVDD (QP-SVDD). More importantly, L-SVDD is much more computationally efficient than QP-SVDD (i.e., 200–400 times faster on the considered experiments). Furthermore, the two methods extract almost identically the same set of support vectors.

A few words about our notation. All scalars are represented by symbols (i.e., English or Greek letters) in normal font. All vectors are denoted by bold lower case symbols, and all are column vectors unless transposed to a row vector by a prime superscript. All matrices are denoted by bold upper case symbols. For a vector x in R^n, the plus function x₊ is defined as (x₊)ᵢ = max{0, xᵢ}, for i = 1, …, n. For two vectors a and b in R^n, a ≥ b means aᵢ ≥ bᵢ for each i = 1, …, n. The notation a ⊥ b means the two vectors a and b are perpendicular, that is, a₁b₁ + a₂b₂ + ⋯ + aₙbₙ = 0. For a vector x ∈ R^n, ‖x‖ stands for its 2-norm, that is, ‖x‖ = (x₁² + ⋯ + xₙ²)^{1/2}. For a square matrix A of size n × n, ‖A‖ represents the matrix norm, that is, ‖A‖ = sup_{x ≠ 0} ‖Ax‖/‖x‖; thus ‖Ax‖ ≤ ‖A‖‖x‖. If A is a positive definite matrix, ‖A‖ is just the largest eigenvalue of A.

The rest of this paper is organized as follows: Sect. 2 briefly reviews the formulation of SVDD and presents a simple modification to the original SVDD model; Sect. 3 formulates the approximated Lagrangian dual problem, proposes a simple iterative algorithm to solve it, investigates the convergence properties of the algorithm, and also discusses the feasibility of two alternative schemes; Sect. 4 compares the performance (in terms of receiver operating characteristic curve and training time) of the proposed L-SVDD algorithm to that of QP-SVDD on two publicly available real-world datasets, compares the support vectors extracted by the two methods, and experimentally verifies the role of each parameter in the convergence behavior of the algorithm as well as the computational complexity; finally, Sect. 5 summarizes this paper and discusses some possible future research directions.

2 Support vector data description

Given training data {xᵢ, i = 1, …, n} with feature vectors xᵢ ∈ R^p, let Φ(·) be a nonlinear transformation¹ which maps the original data vector into a high dimensional Hilbert feature space H. SVDD looks for a hypersphere in H, with radius R > 0 and center c, which has minimum volume while containing most of the data. Therefore, we have to minimize R² subject to ‖Φ(xᵢ) − c‖² ≤ R², for i = 1, …, n. In addition, since the training sample might contain outliers, we introduce a set of slack variables ξᵢ ≥ 0, as in the framework of the support vector machine (SVM) [34, Chap. 10].
The slack variable ξᵢ measures how much the squared distance from the training example xᵢ to the center c exceeds the squared radius. Therefore, the slack variables can be understood as a measure of errors. Taking all these considerations into account, the SVDD model is obtained by solving the following optimization problem

$$\min_{R,\,c,\,\xi} \; F(R, c, \xi) = R^2 + C\sum_{i=1}^{n}\xi_i \qquad (1)$$

with constraints

$$\|\Phi(x_i) - c\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad \text{for } i = 1, \ldots, n, \qquad (2)$$

where ξ = (ξ₁, …, ξₙ)′ is the vector of slack variables, and the parameter C > 0 controls the tradeoff between the volume of the hypersphere and the permitted errors. The Lagrangian dual of the above optimization problem is (refer to [30, 31, 33] for detailed derivations)

$$\min_{\alpha} \; L(\alpha) = \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j K(x_i, x_j) - \sum_{i=1}^{n}\alpha_i K(x_i, x_i) \qquad (3)$$

with constraints

$$\sum_{i=1}^{n}\alpha_i = 1, \quad 0 \le \alpha_i \le C \quad \text{for } i = 1, \ldots, n, \qquad (4)$$

where K(xᵢ, xⱼ) = Φ(xᵢ)′Φ(xⱼ) is the kernel function, which satisfies Mercer's condition [34, Chap. 10], and α = (α₁, …, αₙ)′ with αᵢ being the Lagrangian multiplier for the i-th constraint in Eq. (2).

¹ In the SVM and SVDD literature, the explicit form of the function Φ(·) is not important, and in fact, it is often difficult to write out Φ(·) explicitly. What is important is the kernel function, that is, the inner product of Φ(xᵢ) and Φ(xⱼ).
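For reference, the original dual (3)–(4) can be handed to a generic constrained solver. The following is a minimal sketch (ours, not the paper's code, which uses the decomposition/SMO solvers cited below) that makes the box-constrained quadratic program explicit; the function name is hypothetical and SciPy's SLSQP is only a stand-in for a proper QP solver on toy-sized problems.

    import numpy as np
    from scipy.optimize import minimize

    def qp_svdd_dual(K, C):
        """Solve the original SVDD dual, Eq. (3), under the constraints of Eq. (4)."""
        n = K.shape[0]
        objective = lambda a: a @ K @ a - a @ np.diag(K)          # Eq. (3)
        constraints = [{"type": "eq", "fun": lambda a: a.sum() - 1.0}]  # sum_i alpha_i = 1
        bounds = [(0.0, C)] * n                                    # 0 <= alpha_i <= C
        result = minimize(objective, np.full(n, 1.0 / n),
                          method="SLSQP", bounds=bounds, constraints=constraints)
        return result.x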

Similar to the works on SVM [21] and support vector regression [23], we consider the sum of squared errors in the objective function given in Eq. (1), that is, we modify Eq. (1) to

$$\min_{R,\,c,\,\xi} \; F(R, c, \xi) = R^2 + C\sum_{i=1}^{n}\xi_i^2. \qquad (5)$$

With this slight modification, the objective function becomes more convex because of the squared terms. Furthermore, the nonnegativity constraint in Eq. (2) can be removed, as proved in the following.

Proposition 1 If (R̄, c̄, ξ̄) is the minimum point of the function F(R, c, ξ) defined in Eq. (5), with the constraints ‖Φ(xᵢ) − c‖² ≤ R² + ξᵢ for i = 1, …, n, then all components of ξ̄ are nonnegative.

Proof Write ξ̄ = (ξ̄₁, ξ̄₂, …, ξ̄ₙ)′ and, without loss of generality, assume ξ̄₁ < 0. Let ξ̃ = (0, ξ̄₂, …, ξ̄ₙ)′, that is, we replace the first component of ξ̄ (which is negative) by 0 and keep the others unchanged. Since ξ̄₁ satisfies the constraint ‖Φ(x₁) − c̄‖² ≤ R̄² + ξ̄₁ and ξ̄₁ < 0, we must have ‖Φ(x₁) − c̄‖² ≤ R̄² + 0. By assumption, the constraints are satisfied at ξ̄₂, …, ξ̄ₙ. Hence, the constraints are satisfied at all components of ξ̃. However,

$$F(\bar R, \bar c, \tilde\xi) = \bar R^2 + C\sum_{i=2}^{n}\bar\xi_i^2 < \bar R^2 + C\sum_{i=1}^{n}\bar\xi_i^2 = F(\bar R, \bar c, \bar\xi)$$

since ξ̄₁ < 0 by assumption, and this contradicts the assumption that (R̄, c̄, ξ̄) is the minimum point. Thus, at the minimum point, there must be ξ̄₁ ≥ 0. In the same manner, it can be argued that all components of ξ̄ must be nonnegative. Consequently, the nonnegativity constraint on ξ is not necessary and can be removed.

The Lagrangian function of this new problem is

$$L(R, c, \xi, \alpha) = R^2 + C\sum_{i=1}^{n}\xi_i^2 + \sum_{i=1}^{n}\alpha_i\left[\|\Phi(x_i) - c\|^2 - R^2 - \xi_i\right],$$

with the αᵢ's being the nonnegative Lagrangian multipliers. At the optimal point, the partial derivatives with respect to the primal variables are zero. Thus,

$$\frac{\partial L}{\partial R} = 0 \;\Rightarrow\; 2R - 2R\sum_{i=1}^{n}\alpha_i = 0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i = 1, \qquad (6)$$

$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; 2C\xi_i - \alpha_i = 0 \;\Rightarrow\; \xi_i = \frac{\alpha_i}{2C},$$

and

$$\frac{\partial L}{\partial c} = 0 \;\Rightarrow\; 2\sum_{i=1}^{n}\alpha_i(\Phi(x_i) - c) = 0 \;\Rightarrow\; c = \sum_{i=1}^{n}\alpha_i\Phi(x_i).$$

Substituting these results into the Lagrangian function, we obtain the dual function

$$L(\alpha) = -\frac{1}{4C}\sum_{i=1}^{n}\alpha_i^2 - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j K(x_i, x_j) + \sum_{i=1}^{n}\alpha_i K(x_i, x_i).$$

The purpose now is to maximize L(α) or, equivalently, to minimize L(α) = −L(α),² with respect to the nonnegative αᵢ's, subject to the constraint in Eq. (6); that is,

$$\min_{\alpha} \; L(\alpha) = \frac{1}{4C}\sum_{i=1}^{n}\alpha_i^2 + \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j K(x_i, x_j) - \sum_{i=1}^{n}\alpha_i K(x_i, x_i) \qquad (7)$$

with constraints

$$\sum_{i=1}^{n}\alpha_i = 1, \quad \alpha_i \ge 0 \quad \text{for } i = 1, \ldots, n. \qquad (8)$$

With the sum of errors replaced by the sum of squared errors, the resulting dual problem in Eq. (7) has an extra quadratic term compared to the original dual problem in Eq. (3), and this improves the convexity of the objective function. Moreover, by comparing the constraints in Eqs. (4) and (8), it is clear that the new optimization problem has simpler constraints, without any upper bound on the dual variables.

As implemented in popular SVM toolboxes [7, 14, 15], the quadratic programming problems in Eqs. (3) and (7) can be solved by decomposition methods [24, 25] or the sequential minimal optimization method [26]. However, these algorithms are computationally expensive, with time complexity roughly O(n³). Thus, a fast training algorithm for SVDD which can achieve accuracy similar to the quadratic programming method is highly desirable.

² We slightly abuse the notation here because L(α) was used in Eq. (3). However, this will not cause any confusion because all of our following discussions are based on Eq. (7).
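To make the modified dual concrete, the following NumPy sketch (ours, not from the paper) evaluates the objective of Eq. (7) for a given kernel matrix and a feasible α; the function name and the toy linear kernel are illustrative assumptions.

    import numpy as np

    def modified_dual_objective(alpha, K, C):
        """Eq. (7): (1/(4C)) * sum_i alpha_i^2 + alpha' K alpha - sum_i alpha_i K(x_i, x_i)."""
        return (alpha @ alpha) / (4.0 * C) + alpha @ K @ alpha - alpha @ np.diag(K)

    # toy usage: a feasible alpha (nonnegative, summing to 1) on a small linear-kernel problem
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    K = X @ X.T                      # any Mercer kernel matrix would do here
    alpha = np.full(5, 1.0 / 5)      # satisfies the constraints in Eq. (8)
    print(modified_dual_objective(alpha, K, C=2.0))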

3 Lagrangian support vector data description

3.1 The algorithm

Let K be the n × n kernel matrix, that is, Kᵢⱼ = K(xᵢ, xⱼ); let the vector u be formed by the diagonal elements of the kernel matrix K; let Iₙ be the n × n identity matrix; and let 1ₙ (0ₙ) be the n-dimensional vector of all 1's (0's). The optimization problem in Eq. (7) can be written compactly in matrix form as

$$\min_{\alpha} \; L(\alpha) = \frac{1}{2}\alpha'\left(\frac{I_n}{2C} + 2K\right)\alpha - u'\alpha \qquad (9)$$

with constraints

$$1_n'\alpha = 1 \quad \text{and} \quad \alpha \ge 0_n. \qquad (10)$$

To deal with the equality constraint in Eq. (10), we consider the penalty function method [28, Sect. 6.2.2] from optimization theory. The basic idea is to augment the original objective function with a function which incorporates some of the constraints, in order to approximate a constrained optimization problem by an unconstrained problem or by one with simpler constraints. For our problem, we consider the following function

$$f(\alpha) = \frac{1}{2}\alpha'\left(\frac{I_n}{2C} + 2K\right)\alpha - u'\alpha + \rho(1_n'\alpha - 1)^2 = \frac{1}{2}\alpha'\left(\frac{I_n}{2C} + 2K + 2\rho J_n\right)\alpha - (u + 2\rho 1_n)'\alpha + \rho, \qquad (11)$$

where Jₙ is the n × n matrix with all elements equal to 1. As proved in [28], as the penalty parameter ρ → ∞, the minimum point of Eq. (11) with α ≥ 0ₙ converges to the solution of Eq. (9) with the constraints in Eq. (10). Let

$$Q = \frac{I_n}{2C} + 2K + 2\rho J_n \quad \text{and} \quad v = u + 2\rho 1_n. \qquad (12)$$

The matrix Q is positive definite because both K and Jₙ are positive semi-definite while Iₙ is positive definite. Ignoring the constant term in Eq. (11), we can formulate the approximated minimization problem as

$$\min_{\alpha \ge 0_n} \; \frac{1}{2}\alpha' Q\alpha - v'\alpha. \qquad (13)$$

The Kuhn–Tucker stationary-point problem [20, p. 94, KTP 7.2.4] for Eq. (13) is

$$Q\alpha - v - w = 0_n, \quad w'\alpha = 0, \quad \alpha \ge 0_n, \quad w \ge 0_n.$$

From these equations, we have w = Qα − v ≥ 0ₙ and (Qα − v)′α = 0, which can be summarized as solving the classical linear complementarity problem [10], that is, solving for α such that

$$0_n \le (Q\alpha - v) \perp \alpha \ge 0_n. \qquad (14)$$

Since the matrix Q is symmetric positive definite, the existence and uniqueness of the solution to Eq. (14) is guaranteed [9]. The optimality condition in Eq. (14) is satisfied if and only if, for any γ > 0, the relationship

$$Q\alpha - v = (Q\alpha - v - \gamma\alpha)_+ \qquad (15)$$

holds; see the Appendix for a proof. To obtain a solution to the above problem, we start from an initial point α⁰ and apply the following iterative scheme

$$\alpha^{k+1} = Q^{-1}\left(v + (Q\alpha^k - v - \gamma\alpha^k)_+\right). \qquad (16)$$

The initial point α⁰ could be any vector, but in our implementation we take α⁰ = Q⁻¹v. We summarize the L-SVDD algorithm as Algorithm 1 below; the convergence analysis will be given in Sect. 3.2.

Algorithm 1: Lagrangian Support Vector Data Description
0. Initialization: choose the starting point α⁰ = Q⁻¹v, find α¹ using Eq. (16), set k = 1, set the maximum iteration number M, and set the error tolerance ε.
1. while k < M and ‖αᵏ − αᵏ⁻¹‖ > ε do:
2.   Set k = k + 1.
3.   Find αᵏ using Eq. (16).
4. end while
5. Return the vector αᵏ.

Remark 1 In Algorithm 1, we terminate the program when the solution does not change much. Since the purpose is to find a solution to Eq. (14), we could also terminate the program when the absolute value of the inner product (αᵏ)′(Qαᵏ − v) falls below a certain level. However, since it requires a matrix–vector multiplication, this stopping criterion is more expensive to evaluate than the one in Algorithm 1. Thus, in our implementation, we use the stopping rule given in Algorithm 1. Our experimental results show that when the algorithm stops, the inner product is indeed very close to 0; see Sects. 4.2 and 4.3 for detailed results.

Remark 2 Each iteration of Algorithm 1 includes matrix–vector multiplication, vector addition/subtraction, and taking the positive part of a vector component-wise, among which the most expensive operation is the matrix–vector multiplication, which has computational complexity of order O(n²).
We thus expect the computational complexity of Algorithm 1 to be about the iteration number multiplied by O(n²). Sections 4.2 and 4.3 verify this analysis experimentally.
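The following is a minimal NumPy sketch of Algorithm 1, assuming the kernel matrix K has been precomputed; it is an illustrative implementation built directly from Eqs. (12) and (16), not the author's MATLAB code, and the function and parameter names are ours (the default γ = 0.95/C and ρ = 200 anticipate the settings used in Sect. 4).

    import numpy as np

    def lsvdd_train(K, C=2.0, rho=200.0, gamma=None, max_iter=3000, tol=1e-5):
        """Iterative L-SVDD training (sketch of Algorithm 1): solve Eq. (13) via Eq. (16)."""
        n = K.shape[0]
        if gamma is None:
            gamma = 0.95 / C                                   # large gamma for fast convergence
        Q = np.eye(n) / (2.0 * C) + 2.0 * K + 2.0 * rho * np.ones((n, n))   # Eq. (12)
        v = np.diag(K) + 2.0 * rho                             # v = u + 2*rho*1_n
        Q_inv = np.linalg.inv(Q)
        alpha = Q_inv @ v                                      # starting point alpha_0 = Q^{-1} v
        for _ in range(max_iter):
            residual = Q @ alpha - v
            alpha_new = Q_inv @ (v + np.maximum(residual - gamma * alpha, 0.0))   # Eq. (16)
            if np.linalg.norm(alpha_new - alpha) <= tol:       # stopping rule of Algorithm 1
                return alpha_new
            alpha = alpha_new
        return alpha

In use, the training examples whose returned αᵢ exceed a small numerical threshold would be taken as the support vectors, in line with the discussion that follows.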

Similar to SVM, we call the training examples whose corresponding αᵢ's are nonzero support vectors. Once the αᵢ's are obtained, the radius R can be computed from the set of support vectors [30, 31]. In the decision-making stage, if the distance from a new example x to the center is less than the radius R, it is classified as a positive example; otherwise, it is classified as a negative example. That is, the decision rule is

$$f(x) = \operatorname{sgn}\left(R^2 - \left\|\Phi(x) - \sum_{i=1}^{n}\alpha_i\Phi(x_i)\right\|^2\right) = \operatorname{sgn}\left(2\sum_{i=1}^{n}\alpha_i K(x_i, x) - K(x, x) + b\right), \qquad (17)$$

where b = R² − Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢαⱼK(xᵢ, xⱼ).

3.2 Convergence analysis

To analyze the convergence property of Algorithm 1, we need the following

Lemma 1 Let a and b be two points in R^p; then

$$\|a_+ - b_+\| \le \|a - b\|. \qquad (18)$$

Proof For two real numbers a and b, there are four situations: (1) a ≥ 0 and b ≥ 0, then |a₊ − b₊| = |a − b|; (2) a ≥ 0 and b ≤ 0, then |a₊ − b₊| = |a − 0| ≤ |a − b|; (3) a ≤ 0 and b ≥ 0, then |a₊ − b₊| = |0 − b| ≤ |a − b|; (4) a ≤ 0 and b ≤ 0, then |a₊ − b₊| = |0 − 0| ≤ |a − b|. In summary, for the one dimensional case, |a₊ − b₊|² ≤ |a − b|².

Assume that Eq. (18) is true for p-dimensional vectors aᵖ and bᵖ. Denote the (p+1)-dimensional vectors a and b as³ a = (aᵖ, a_{p+1}) and b = (bᵖ, b_{p+1}), where a_{p+1} and b_{p+1} are real numbers. Then,

$$\|a_+ - b_+\|^2 = \|((a^p)_+ - (b^p)_+,\; (a_{p+1})_+ - (b_{p+1})_+)\|^2 = \|(a^p)_+ - (b^p)_+\|^2 + ((a_{p+1})_+ - (b_{p+1})_+)^2 \le \|a^p - b^p\|^2 + (a_{p+1} - b_{p+1})^2 = \|a - b\|^2, \qquad (19)$$

where in Eq. (19) we used the induction assumption on the p-dimensional vectors, the special result for the one dimensional case, and the definition of the Euclidean norm. By induction, Eq. (18) is proved.

³ For notational convenience, in this proof we assume all the vectors are row vectors. Clearly, the result also applies to column vectors.

With the aid of Lemma 1, we are ready to study the convergence behavior of Algorithm 1, and we have the following conclusion.

Proposition 2 With 0 < γ < 1/C, the sequence {αᵏ} obtained by Algorithm 1 converges r-linearly [2] to the unique solution ᾱ of Eq. (13), that is,

$$\limsup_{k\to\infty} \|\alpha^k - \bar\alpha\|^{1/k} < 1.$$

Proof The convexity of the objective function in Eq. (13) and the convexity of the feasible region ensure the existence and uniqueness of the solution ᾱ of Eq. (13). Since ᾱ is the solution to Eq. (13), it must satisfy the optimality condition in Eq. (15), that is, for any γ > 0,

$$Q\bar\alpha = v + (Q\bar\alpha - v - \gamma\bar\alpha)_+. \qquad (20)$$

Multiplying Eq. (16) by Q, subtracting Eq. (20), and then taking norms gives

$$\|Q\alpha^{k+1} - Q\bar\alpha\| = \|(Q\alpha^k - v - \gamma\alpha^k)_+ - (Q\bar\alpha - v - \gamma\bar\alpha)_+\|. \qquad (21)$$

Applying Lemma 1 to the vectors Qαᵏ − v − γαᵏ and Qᾱ − v − γᾱ in Eq. (21), we have

$$\|Q\alpha^{k+1} - Q\bar\alpha\| \le \|(Q\alpha^k - v - \gamma\alpha^k) - (Q\bar\alpha - v - \gamma\bar\alpha)\| = \|(Q - \gamma I)(\alpha^k - \bar\alpha)\| = \|(I - \gamma Q^{-1})(Q\alpha^k - Q\bar\alpha)\| \le \|I - \gamma Q^{-1}\|\,\|Q\alpha^k - Q\bar\alpha\|. \qquad (22)$$

From the definition of Q in Eq. (12), it is clear that the matrix 2K + 2ρJₙ is positive semi-definite because both K and Jₙ are, and we denote its eigenvalues by λᵢ, i = 1, 2, …, n. Then the eigenvalues of Q are 1/(2C) + λᵢ and the eigenvalues of Q⁻¹ are (1/(2C) + λᵢ)⁻¹, i = 1, 2, …, n. To make the sequence ‖Qαᵏ − Qᾱ‖ converge, Eq. (22) indicates that we need ‖I − γQ⁻¹‖ < 1, that is, the eigenvalues of I − γQ⁻¹ must all lie between −1 and 1:

$$-1 < 1 - \gamma\left(\frac{1}{2C} + \lambda_i\right)^{-1} < 1 \quad \text{for } i = 1, 2, \ldots, n,$$

or

$$0 < \gamma < 2\left(\frac{1}{2C} + \lambda_i\right) = \frac{1}{C} + 2\lambda_i \quad \text{for } i = 1, 2, \ldots, n.$$

Thus, with the choice of 0 < γ < 1/C, we have

$$c = \|I - \gamma Q^{-1}\| < 1.$$

Recursively applying Eq. (22), we have that for any k, ‖Qαᵏ − Qᾱ‖ ≤ cᵏ‖Qα⁰ − Qᾱ‖. Consequently,

$$\|\alpha^k - \bar\alpha\| = \|Q^{-1}Q(\alpha^k - \bar\alpha)\| \le \|Q^{-1}\|\,\|Q\alpha^k - Q\bar\alpha\| \le c^k\|Q^{-1}\|\,\|Q\alpha^0 - Q\bar\alpha\| = Ac^k, \qquad (23)$$

where A = ‖Q⁻¹‖‖Qα⁰ − Qᾱ‖ > 0. Hence,

$$\limsup_{k\to\infty} \|\alpha^k - \bar\alpha\|^{1/k} \le \limsup_{k\to\infty} A^{1/k}c = c < 1.$$

This proves the proposition.

Remark 3 The proof of Proposition 2 enables us to estimate the iteration number M. If we require the accuracy of the solution to be ε, in the sense that ‖α^M − ᾱ‖ < ε, then from Eq. (23) it is sufficient to have ‖α^M − ᾱ‖ ≤ Ac^M = ε, where A = ‖Q⁻¹‖‖Qα⁰ − Qᾱ‖ and c = ‖I − γQ⁻¹‖. This allows us to solve for M as

$$M = \frac{\log\varepsilon - \log A}{\log c}.$$

However, A cannot be calculated because ᾱ is unknown. Hence, in the implementation, we set M to a large number and terminate the program according to the criterion in Algorithm 1.

From the proof of Proposition 2, the convergence rate of Algorithm 1 depends on c, the norm of the matrix I − γQ⁻¹. The analysis in the proof shows that a smaller value of c gives a faster convergence rate. By the definition of the matrix norm,

$$c = \|I - \gamma Q^{-1}\| = \max_i\left\{1 - \gamma\left(\frac{1}{2C} + \lambda_i\right)^{-1}\right\},$$

from which it is clear that a larger value of γ makes c smaller and consequently makes the algorithm converge faster. In accord with Proposition 2, let us assume that γ = a/C for some constant 0 < a < 1. We have

$$c = \max_i\left\{1 - \gamma\left(\frac{1}{2C} + \lambda_i\right)^{-1}\right\} = \max_i\left\{1 - \frac{a}{C}\left(\frac{1}{2C} + \lambda_i\right)^{-1}\right\} = \max_i\left\{1 - \frac{a}{\frac{1}{2} + C\lambda_i}\right\} = \max_i\left\{1 - \frac{2a}{1 + 2C\lambda_i}\right\}.$$

Thus, a small value of C and a small smallest eigenvalue make 2a/(1 + 2Cλᵢ) large, hence c small, and consequently the algorithm converges faster. However, to the best of our knowledge, there is no theoretical conclusion about the dependence between the eigenvalues of 2K + 2ρJₙ and ρ. Fortunately, our numerical tests revealed that as ρ changes, the largest eigenvalue of 2K + 2ρJₙ changes dramatically while the smallest eigenvalue does not change much. Since c depends on the smallest eigenvalue of 2K + 2ρJₙ, we conclude that the convergence rate of Algorithm 1 is not affected much by ρ. In summary, to achieve a faster convergence rate we should set γ large and C small (γ depends on C in our implementation), and ρ does not significantly impact the convergence behavior. We should mention that C also controls the error term of the model, so setting C small might make the resulting model perform poorly in classification. We will numerically verify these analyses in Sect. 4.2.

3.3 Discussion on two alternatives

We train the L-SVDD model by solving the linear complementarity problem in Eq. (14), and Algorithm 1 is based on the condition in Eq. (15) at the optimum point. Alternatively, the optimality condition can be written as

$$\alpha = (\alpha - \tilde\gamma(Q\alpha - v))_+, \qquad (24)$$

where γ̃ > 0. In principle, similar to Algorithm 1, we could design an algorithm based on the recursive relation

$$\alpha^{k+1} = (\alpha^k - \tilde\gamma(Q\alpha^k - v))_+, \qquad (25)$$

with some appropriately selected γ̃. Intuitively, the algorithm based on Eq. (25) should be more computationally efficient in each iteration than Algorithm 1, which is based on Eq. (16): Eq. (16) involves three vector additions/subtractions and two matrix–vector multiplications, while Eq. (25) only involves two vector additions/subtractions and one matrix–vector multiplication. To choose γ̃, as in the proof of Proposition 2, let the unique solution to Eq. (14) be ᾱ, which must satisfy

$$\bar\alpha = (\bar\alpha - \tilde\gamma(Q\bar\alpha - v))_+. \qquad (26)$$

Subtracting Eq. (26) from Eq. (25), taking norms, and applying Lemma 1, we have

$$\|\alpha^{k+1} - \bar\alpha\| = \|(\alpha^k - \tilde\gamma(Q\alpha^k - v))_+ - (\bar\alpha - \tilde\gamma(Q\bar\alpha - v))_+\| \le \|(I - \tilde\gamma Q)(\alpha^k - \bar\alpha)\| \le \|I - \tilde\gamma Q\|\,\|\alpha^k - \bar\alpha\|.$$

Thus, to make the potential algorithm converge, we must have ‖I − γ̃Q‖ < 1.

Denote the eigenvalues of the matrix 2K + 2ρJₙ by λᵢ, i = 1, 2, …, n; then the eigenvalues of I − γ̃Q are 1 − γ̃(1/(2C) + λᵢ). To ensure ‖I − γ̃Q‖ < 1, we need all the eigenvalues of I − γ̃Q to lie between −1 and 1, that is,

$$-1 < 1 - \tilde\gamma\left(\frac{1}{2C} + \lambda_i\right) < 1 \quad \text{for } i = 1, 2, \ldots, n,$$

or

$$0 < \tilde\gamma < \frac{2}{\frac{1}{2C} + \lambda_i} = \frac{4C}{1 + 2C\lambda_i} \quad \text{for } i = 1, 2, \ldots, n.$$

Thus, we should choose γ̃ such that

$$0 < \tilde\gamma < \frac{4C}{1 + 2C\lambda_{\max}},$$

where λ_max = max{λ₁, λ₂, …, λₙ}. However, our experimental results⁴ show that λ_max is large because ρ needs to be large, as required by the penalty function method. As a result, γ̃ must be very small and, consequently, the potential algorithm based on Eq. (25) converges slowly. We tested this alternative idea in our experiments, and the results showed that, under the same stopping criterion and compared to Algorithm 1, although each iteration is more time efficient, the algorithm based on Eq. (25) needs many more iterations to converge,⁵ hence spending more time overall.

The purpose of Algorithm 1 is to solve the quadratic minimization problem given in Eq. (13), and another alternative is to apply an interior point method (IPM) [2, 5, 28] to this problem. Let zᵏ be the k-th step solution to the problem of minimizing some convex function g(z) with some constraints using IPM, and denote the global minimum point by z_μ. In [2], it was proved that zᵏ converges to z_μ not only r-linearly, but also q-superlinearly, in the sense that ‖z^{k+1} − z_μ‖/‖zᵏ − z_μ‖ → 0. Proposition 2 shows that the proposed L-SVDD algorithm also has an r-linear convergence rate; however, the L-SVDD algorithm cannot achieve a q-superlinear convergence rate. This means that, theoretically, IPM should converge in fewer iterations than L-SVDD. In each iteration of the L-SVDD algorithm, the operations are quite simple, the most expensive computation being a matrix–vector multiplication. Each IPM iteration is much more complicated, because it includes evaluating the objective function and constraint values, calculating the gradients and Hessian, finding the search direction, and conducting a backtracking line search to update the solution. These operations involve several matrix inversions and more matrix multiplications. Thus, each iteration of IPM is much more expensive than one of L-SVDD. Hence, although IPM converges in fewer iterations, it might spend more computing time and resources (e.g., memory) than L-SVDD. We developed the interior point method based SVDD model (IPM-SVDD) by adapting the MATLAB code from https://pcarbo.github.io/convexprog.html. Section 4.2 presents the comparison between IPM-SVDD and L-SVDD in terms of iteration number and CPU time needed for convergence.

⁴ To the best of our knowledge, there is no theoretical result regarding the dependence between the largest eigenvalue of the matrix 2K + 2ρJₙ and the parameter ρ.
⁵ For instance, on the face detection problem in Sect. 4.2, to achieve an error tolerance of 10⁻⁵, the algorithm based on Eq. (25) needs more than 20,000 iterations to converge.

4 Experimental results and analysis

On a face dataset and the USPS handwritten digit dataset, we compared the performance of the proposed Lagrangian SVDD (L-SVDD) with that of the ordinary quadratic programming based SVDD (QP-SVDD), which is obtained by applying a quadratic programming solver to Eq. (7) with the constraints in Eq. (8).

4.1 Experiment setup and performance measures

The program for L-SVDD was developed in MATLAB, and we did not do any specific code optimization; QP-SVDD was implemented based on the MATLAB SVM toolbox [14] with the core quadratic programming solver written in C++. The source code of this work is available upon request.
All the experiments were conducted on a laptop computer with an Intel(R) Core(TM) i5-2450M CPU at 2.50 GHz and 4 GB memory, running the Windows 7 Professional operating system, with MATLAB R2007b as the platform. During all experiments that involved measurement of running time, one core was used solely for the experiments, and the number of other processes running on the system was minimized.

In our experiments, we adopted the Gaussian kernel

$$K(u, v) = \exp\left\{-\frac{\|u - v\|^2}{2\sigma^2}\right\}$$

with σ = 8, and we set the SVDD control parameter C = 2 in the algorithms. The parameter setting in our experiments might not be optimal for achieving the best testing performance. Nonetheless, our purpose is not to achieve the lowest testing error, but to compare the performance of L-SVDD and QP-SVDD; therefore, the comparison is fair as long as the parameter settings are the same for the two algorithms. In general, the optimal parameter setting (C, σ) can be selected by applying cross validation, generalized approximate cross validation [36], or other criteria mentioned in [8, 13], but since this is not the focus of this paper, we choose not to pursue the issue further.
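As a concrete illustration of this setup, the following NumPy sketch (ours, not the paper's MATLAB code) computes the Gaussian kernel matrix with σ = 8 and evaluates the decision rule of Eq. (17) for new examples, given a trained α; the bias b would be obtained from the support vectors as described in Sect. 3.1, and sweeping its value produces the ROC curves discussed next.

    import numpy as np

    def gaussian_kernel(A, B, sigma=8.0):
        """K(u, v) = exp(-||u - v||^2 / (2 sigma^2)) for all rows of A against rows of B."""
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dist / (2.0 * sigma ** 2))

    def svdd_decision(X_train, alpha, X_new, b, sigma=8.0):
        """Decision rule of Eq. (17): sign(2 * sum_i alpha_i K(x_i, x) - K(x, x) + b)."""
        K_new = gaussian_kernel(X_train, X_new, sigma)     # n_train x n_new
        self_term = np.ones(X_new.shape[0])                # K(x, x) = 1 for the Gaussian kernel
        return np.sign(2.0 * (alpha @ K_new) - self_term + b)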

[Figure 1: The ROC curves of (a) QP-SVDD and (b) L-SVDD on the CBCL testing dataset; both panels plot true positive rate against false positive rate.]

To make the L-SVDD algorithm converge quickly, following the conclusion at the end of Sect. 3.2, we should set γ to a large value. In all of our experiments, we chose γ = 0.95/C to ensure the convergence of Algorithm 1. According to the theoretical properties of the penalty function method [28, Sect. 6.2.2], the parameter ρ in the definition of Q in Eq. (12) should be as large as possible. In our experiments, we found that setting ρ = 200 is enough to ensure the closeness of 1ₙ′α to 1 (in fact, in all the experiments |1ₙ′α − 1| < 0.002 when the L-SVDD training algorithm terminates). Section 4.2 also studies the roles of the parameters C, γ, and ρ in the convergence behavior of Algorithm 1. The maximum iteration number M in Algorithm 1 was set to 3000, although in our experiments we found that most of the time the algorithm converges within 1000 iterations. The tolerance parameter in Algorithm 1 was set to 10⁻⁵.

In the decision rule given by Eq. (17), we varied the value of the parameter b, and for each value of b we calculated the true positive rate and false positive rate. We then plot the true positive rate vs. the false positive rate, resulting in the receiver operating characteristic (ROC) curve [11], which is used to illustrate the performance of the classifier. The classifier performs better if the corresponding ROC curve is higher. In order to numerically compare the ROC curves of different methods, we calculate the area under the curve (AUC) [11]. The larger the AUC, the better the overall performance of the classifier.

To compare the support vectors extracted by the two algorithms, we regard the support vectors given by QP-SVDD as the true support vectors and denote them by the set SV_Q, and we denote the support vectors from L-SVDD by SV_L. We define precision and recall [27] as

$$\text{precision} = \frac{|SV_Q \cap SV_L|}{|SV_L|} \quad \text{and} \quad \text{recall} = \frac{|SV_Q \cap SV_L|}{|SV_Q|},$$

where |S| represents the size of a set S. High precision means that L-SVDD finds substantially more correct support vectors than incorrect ones, while high recall means that L-SVDD extracts most of the correct support vectors.

4.2 Face detection

This experiment used the face dataset provided by the Center for Biological and Computational Learning (CBCL) at MIT. The training set consists of 6977 images, with 2429 face images and 4548 non-face images; the testing set consists of 24,045 images, including 472 face images and 23,573 non-face images. Each image is of size 19 × 19, with pixel values between 0 and 1. We did not extract any specific features for face detection (e.g., the features used in [35]); instead, we directly used the pixel values as input to L-SVDD and QP-SVDD.

Figure 1 shows the ROC curves of QP-SVDD and L-SVDD on the testing set. The ROC curves are so close that if we plotted them in the same figure they would be indistinguishable. The closeness of the ROC curves indicates that the two resulting classifiers should perform very similarly. Indeed, the AUC for QP-SVDD is 0.8506 while the AUC for L-SVDD is 0.8512, that is, the L-SVDD classifier performs even slightly better. On our computer, QP-SVDD spent 7301.8 s in training, while the L-SVDD training algorithm took 862 iterations to converge, consuming only 13.0729 s. In other words, the proposed L-SVDD is almost 560 times faster than QP-SVDD in training. Thus, we can conclude that L-SVDD is much more time-efficient in training than QP-SVDD, while still possessing similar classification performance.
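Before examining the support vectors in detail, a minimal sketch (ours, not the paper's code) of how the precision and recall defined in Sect. 4.1 can be computed from the two solution vectors; treating a training point as a support vector when its α exceeds a small threshold is our assumption.

    import numpy as np

    def sv_precision_recall(alpha_qp, alpha_l, threshold=1e-6):
        """Precision/recall of the L-SVDD support vector set against the QP-SVDD set."""
        sv_q = set(np.flatnonzero(alpha_qp > threshold))   # support vectors taken as "true"
        sv_l = set(np.flatnonzero(alpha_l > threshold))
        overlap = len(sv_q & sv_l)
        precision = overlap / len(sv_l) if sv_l else 0.0
        recall = overlap / len(sv_q) if sv_q else 0.0
        return precision, recall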

[Figure 2: (a) On the CBCL face training dataset, the convergence of the L-SVDD solution αᵏ to the QP-SVDD solution α_Q in terms of the norm of the difference vector (calculated curve and fitted exponential curve). (b) The evolution of the inner product (αᵏ)′(Qαᵏ − v) with the iteration number.]

Table 1 The training time (t, in seconds) of QP-SVDD with different training set sizes (n). For the purpose of complexity analysis, the values of t/n³ (×10⁻⁶) are also listed.
Training set size (n):  400     800     1200    1600    2000    2400
Training time (t):      21.8    218.6   850.1   2090.7  4033.0  6864.7
t/n³ (×10⁻⁶):           0.3405  0.4270  0.4920  0.5104  0.5041  0.4966

QP-SVDD found 79 support vectors, while L-SVDD extracted 85 support vectors, with a precision rate of 92.94% and a recall rate of 100%. These numbers indicate that L-SVDD and QP-SVDD obtain models of similar complexity. The precision and recall rates show that L-SVDD correctly finds all the support vectors while only mistakenly regarding a few training examples as support vectors. We should mention that, by mathematical and experimental analysis, the original SVDD papers [31, 33] conclude that, for the SVDD model with Gaussian kernel, the number of support vectors decreases as the kernel parameter σ or the control parameter C increases. This conclusion should also hold for L-SVDD because it is just another (although approximate) solution to the same problem. We tested this conjecture with different parameter settings, and found that the number of support vectors for L-SVDD indeed follows the conclusion presented in [31, 33]. Since this issue is not particular to L-SVDD, we choose not to present the results.

We denote the solution of QP-SVDD by α_Q, and calculate the norm of the difference between the k-th iteration solution of L-SVDD and the QP-SVDD solution, that is, ‖αᵏ − α_Q‖. The black curve in Fig. 2a represents the evolution of the norm ‖αᵏ − α_Q‖, which shows that the difference indeed decreases to zero exponentially. We further fit the obtained values of ‖αᵏ − α_Q‖ to an exponential function Acᵏ, and we plot the fitted curve in Fig. 2a as the red dashed curve. The calculated values and the fitted values are close to each other, and this numerically verifies our theoretical result in Eq. (23). The purpose of Algorithm 1 is to solve Eq. (14) with respect to α. To observe the evolution of the solution, Fig. 2b plots the inner product (αᵏ)′(Qαᵏ − v), which approaches 0 as the training program proceeds. In fact, when the algorithm terminated, the inner product value was 3.6883 × 10⁻⁴, very close to 0.

To numerically investigate the relationship between the training time of QP-SVDD and the training set size, we gradually increase the training set size and record the training time (in seconds) of QP-SVDD; the results are given in Table 1, in which we also list the ratio between the training time and the cube of the training set size. Table 1 shows that, as the training set size increases, the training time increases dramatically, and the values of the ratio t/n³ stay around 0.4618 × 10⁻⁶ (the average of the third row). This suggests that the training time complexity of QP-SVDD is around O(n³).
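The exponential fit used for Fig. 2a can be reproduced with a simple least-squares fit in log space; the sketch below is ours (the variable norms, holding the per-iteration values of ‖αᵏ − α_Q‖, is hypothetical), not the paper's code.

    import numpy as np

    def fit_exponential(norms):
        """Fit ||alpha_k - alpha_Q|| ~ A * c**k by linear regression on the log-values."""
        k = np.arange(len(norms))
        slope, intercept = np.polyfit(k, np.log(norms), 1)   # log(norm) ~ log(A) + k*log(c)
        return np.exp(intercept), np.exp(slope)              # returns (A, c)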

Table 2 The training time (t, in seconds) of L-SVDD with different training set sizes (n). For the purpose of time complexity analysis, the numbers of iterations needed to converge, the time per iteration (tp), and tp/n² are also listed.
Training set size (n):   400     800     1200    1600    2000    2400
Training time (t):       0.1092  0.5772  1.5912  3.5256  7.0512  11.8249
# Iterations:            1947    441     560     726     853
Time per iteration (tp): 0.0006  0.0017  0.0036  0.0063  0.0097  0.0139
tp/n² (×10⁻⁸):           0.3573  0.2599  0.2506  0.2459  0.2428  0.2407

Table 3 On a subset of the face data, the values of Σᵢ αᵢ with different parameter settings.
ρ:      100     150     200     300     400     500     600
C = 1:  0.9982  0.9988  0.9991  0.9994  0.9996  0.9996  0.9997
C = 2:  0.9983  0.9988  0.9991  0.9994  0.9996  0.9997  0.9997
C = 4:  0.9983  0.9989  0.9991  0.9994  0.9996  0.9997  0.9997
C = 8:  0.9983  0.9989  0.9991  0.9994  0.9996  0.9997  0.9997

Table 4 On a subset of the face data, the number of iterations needed for L-SVDD to converge with different values of γ.
γ:            0.9/C  0.92/C  0.94/C  0.95/C  0.97/C  0.98/C  0.99/C
# Iterations: 516    508     499     496     487     483     479

Table 5 On a subset of the face data, the number of iterations needed for L-SVDD to converge with different values of C.
C:            1    2    4    6    8    10    16
# Iterations: 145  272  495  685  852  1012  1425

Table 6 On a subset of the face data, the number of iterations needed for L-SVDD to converge with different values of ρ.
ρ:            100  150  200  300  400  500
# Iterations: 495  495  496  494  492  489

We conducted the same experiment with L-SVDD; Table 2 gives the training time of L-SVDD for different training set sizes, and the numbers of iterations needed are also shown. To verify our theoretical analysis in Remark 2, we calculated the time per iteration (tp) and the ratio between tp and n², which are also presented in Table 2. Table 2 shows that the values of tp/n² stay around a constant, i.e., 0.2662 × 10⁻⁸ (the average of the fifth row). This indicates that the time per iteration is of order O(n²); hence the total training time complexity of L-SVDD is about the iteration number multiplied by O(n²), which is consistent with the theoretical analysis in Remark 2.

In the derivation of Eq. (11), to absorb the constraint Σᵢ αᵢ = 1 in Eq. (10), we introduced a parameter ρ in the penalty function, and the penalty function method requires that ρ be large enough to ensure the validity of the constraint [28, Sect. 6.2.2]. To verify this, we randomly selected 600 face images as training data and ran the L-SVDD algorithm for different settings of ρ and C (since γ depends on C through γ = 0.95/C). Table 3 gives the values of Σᵢ αᵢ for different parameter settings. The results show that as long as ρ is large enough, Σᵢ αᵢ is reasonably close to 1. Moreover, for a fixed ρ, Σᵢ αᵢ is almost the same for a variety of C values. This paper sets ρ = 200, which is sufficient to make the constraint hold approximately.

To experimentally investigate the impact of the parameters C, ρ, and γ on the convergence behavior of L-SVDD, we randomly selected 600 face images and ran the L-SVDD training algorithm with different parameter settings. We first fix C = 4 and ρ = 200, and change the value of γ. Table 4 lists the number of iterations needed to converge for different values of γ. It is observed that as γ increases, the algorithm converges faster. Next, we fix ρ = 100, set γ = 0.95/C, and change the value of C; Table 5 gives the number of iterations needed to converge for different values of C.

Table 7 The iteration numbers needed and the training times (in seconds) of IPM-SVDD and L-SVDD with different parameter settings (γ was set to 0.95/C).
ρ:                 100                        200
C:                 2        4        8        2        4        8
# Iter IPM-SVDD:   28       29       29       28       29       29
Time IPM-SVDD:     37.2530  37.6430  38.0330  36.6758  37.6742  37.7366
# Iter L-SVDD:     272      495      852      272      496      846
Time L-SVDD:       0.2808   0.4836   0.6864   0.3432   0.4368   0.7176

We see that a smaller C results in faster convergence. Finally, we fix C = 4 and γ = 0.95/C, and change the parameter ρ. Table 6 shows the number of iterations needed to converge with different values of ρ. It is evident that ρ hardly has any significant influence on the convergence rate. All these numerical results are consistent with our theoretical analysis in Sect. 3.2.

To compare the interior point method based SVDD (IPM-SVDD) with the proposed L-SVDD, we randomly select 600 face images and train the two SVDD models with different parameter settings; the convergence measures are given in Table 7. Table 7 shows that, for all the parameter settings, although IPM-SVDD converges in fewer iterations, it spends considerably more training time than L-SVDD (i.e., it is 50–100 times slower). This is expected for the reasons stated in Sect. 3.3. We also notice that in all cases the difference between the IPM solution (α_I) and the L-SVDD solution (α_L) was within 10⁻⁴ in terms of the norm of the difference vector, i.e., ‖α_I − α_L‖ < 10⁻⁴, which means that the two methods produce very close models. In our experiments, it was also observed that if we set the training set size to n = 1200, the IPM-SVDD training program ran into memory issues on our computer, while L-SVDD did not have such problems even for n = 2400. This is because IPM needs variables for the gradient, Hessian, search directions, and other intermediate results, while L-SVDD has far fewer intermediate results to store in memory. From our analysis and experimental results, we may claim that the proposed L-SVDD is not only time efficient but also space efficient, compared to IPM-SVDD.

4.3 Handwritten digit recognition

The handwritten digits dataset consists of 7291 training examples and 2007 testing examples, and is available at https://web.stanford.edu/~hastie/ElemStatLearn/. The dataset consists of normalized handwritten digits ('0' to '9'), each a 16 × 16 gray-scale image. As in the experiment in Sect. 4.2, we simply use these 256 pixel values as inputs to the algorithms.

To compare the performance of L-SVDD with that of QP-SVDD, we created ten binary classification problems, with each digit as the positive class, respectively. The performance measures of QP-SVDD and L-SVDD are given in Table 8, which lists the number of support vectors, the AUC on the testing set, and the training time for each algorithm. We observe that for all ten problems the AUCs of the two algorithms are quite close: the difference in AUC is at most 0.001, and for most problems the AUC difference is even under 0.0004. The ROC curves of QP-SVDD and L-SVDD, similar to the results on the CBCL face data, are almost indistinguishable (as also indicated by the AUC values), so we choose not to present the ROC curves. Table 8 also lists the training times of QP-SVDD and L-SVDD on all ten problems, and the sizes of the considered problems are also given. It is observed that for the problems with size under 1000, L-SVDD terminates within 1 second; for the two problems with larger size, L-SVDD needs 2–3 s. To clearly illustrate the speed advantage of L-SVDD, Table 8 lists the ratio between the training times of QP-SVDD and L-SVDD for the different problems. It is clear that on most of the problems, L-SVDD is 200–400 times faster than its QP counterpart.
More importantly, we should mention that in our implementation the core quadratic programming code for QP-SVDD was developed in C++, which is much more computationally efficient than MATLAB, in which L-SVDD was developed. Taking this factor into account, L-SVDD would be even more time-efficient relative to QP-SVDD if the two were implemented in the same programming language and run on the same platform.

To gain further insight into the computational complexity of QP-SVDD, we compare the training time of QP-SVDD on each problem. Since the problem for digit '8' has the smallest size, we use it as the baseline. We calculate the ratio of the training time on each digit to that on digit '8', along with the cube of the problem size ratio, and the results are given in Table 9. Table 9 shows that the two numbers are close for each problem, which indicates that the training time of QP-SVDD grows roughly at the rate of O(n³).

Table 11 lists the number of iterations needed for the L-SVDD training algorithm to converge. We see that the algorithm usually terminates within 600 iterations, with only one exception. To analyze the computational complexity of L-SVDD, we first calculate the average time spent on one iteration for each problem, using the information given in Tables 8 and 11. We denote the results by tp₀ through tp₉ ('tp' stands for time per iteration).

Table 8 On the handwritten digit dataset, the performance comparison between QP-SVDD and L-SVDD. Listed are the training set size, the number of support vectors found by each method, the AUC on the testing set for each method, and the training times in seconds. The last column gives the ratio between the training times of the two algorithms on each problem.
Digit  # Train  Method    # SV  AUC     Training time  Time ratio
0      1194     QP-SVDD   109   0.9811  867.8648       409.0615
                L-SVDD    111   0.9813  2.1216
1      1005     QP-SVDD   18    0.9905  478.6579       158.9803
                L-SVDD    19    0.9905  3.0108
2      731      QP-SVDD   128   0.8977  165.7199       366.3128
                L-SVDD    128   0.8973  0.4524
3      658      QP-SVDD   82    0.9480  121.6184       236.2440
                L-SVDD    84    0.9485  0.5148
4      652      QP-SVDD   87    0.9396  117.4064       268.7875
                L-SVDD    88    0.9405  0.4368
5      556      QP-SVDD   94    0.8889  70.6685        283.1270
                L-SVDD    95    0.8891  0.2496
6      664      QP-SVDD   81    0.9785  124.8008       285.7161
                L-SVDD    82    0.9783  0.4056
7      645      QP-SVDD   62    0.9711  113.3035       258.0662
                L-SVDD    63    0.9709  0.4836
8      542      QP-SVDD   86    0.9165  66.8152        237.9459
                L-SVDD    87    0.9168  0.2808
9      644      QP-SVDD   65    0.9752  114.5671       253.2429
                L-SVDD    66    0.9753  0.4524

Table 9 For QP-SVDD, the cubes of the problem size ratios and the training time ratios on the handwritten digit dataset. The results show that QP-SVDD roughly has computational complexity O(n³).
(n₀/n₈)³ = 10.69   (n₁/n₈)³ = 6.38   (n₂/n₈)³ = 2.45   (n₃/n₈)³ = 1.79   (n₄/n₈)³ = 1.74
t₀/t₈ = 12.98      t₁/t₈ = 7.16      t₂/t₈ = 2.48      t₃/t₈ = 1.82      t₄/t₈ = 1.76
(n₅/n₈)³ = 1.08    (n₆/n₈)³ = 1.84   (n₇/n₈)³ = 1.68   (n₈/n₈)³ = 1      (n₉/n₈)³ = 1.67
t₅/t₈ = 1.06       t₆/t₈ = 1.87      t₇/t₈ = 1.69      t₈/t₈ = 1         t₉/t₈ = 1.71

Table 10 For L-SVDD, the squares of the problem size ratios and the ratios of the time per iteration in training on the handwritten digit dataset. The results show that L-SVDD roughly has computational complexity O(n²) multiplied by the iteration number.
(n₀/n₈)² = 4.85    (n₁/n₈)² = 3.44   (n₂/n₈)² = 1.82   (n₃/n₈)² = 1.47   (n₄/n₈)² = 1.45
tp₀/tp₈ = 4.54     tp₁/tp₈ = 2.76    tp₂/tp₈ = 2.00    tp₃/tp₈ = 1.43    tp₄/tp₈ = 1.58
(n₅/n₈)² = 1.05    (n₆/n₈)² = 1.50   (n₇/n₈)² = 1.42   (n₈/n₈)² = 1      (n₉/n₈)² = 1.41
tp₅/tp₈ = 0.99     tp₆/tp₈ = 1.39    tp₇/tp₈ = 1.43    tp₈/tp₈ = 1       tp₉/tp₈ = 1.43

As in the analysis for QP-SVDD, we use digit '8' as the baseline and calculate the ratio of the time per iteration in training on each digit to that on digit '8', along with the square of the problem size ratio. The results are presented in Table 10, which shows that these numbers are reasonably close for each problem, indicating that the time per iteration grows roughly with order O(n²). Consequently, the time complexity of training L-SVDD (i.e., Algorithm 1) is about O(n²) multiplied by the iteration number, which is consistent with the result in Sect. 4.2. These analyses verify our theoretical analysis in Remark 2.

In our implementations of QP-SVDD and L-SVDD, the kernel matrix was calculated beforehand, so the time spent on calculating the kernel matrix was not counted as training time. In fact, computing the kernel matrix took much more time than Algorithm 1 itself in our experiments.⁶

⁶ For example, for digit '0', our computer spent 47.0655 s on constructing the kernel matrix, while only 2.1216 s on Algorithm 1.

Table 11 Comparison of the support vectors extracted by QP-SVDD and those extracted by L-SVDD, in terms of precision and recall rates. This table also presents the iteration numbers needed for L-SVDD to converge.
Digit:        0       1       2       3       4
Precision:    98.20%  94.74%  100%    97.62%  98.86%
Recall:       100%    100%    100%    100%    100%
# Iterations: 604     1410    292     465     357
Digit:        5       6       7       8       9
Precision:    98.95%  98.78%  98.41%  98.85%  97.01%
Recall:       100%    100%    100%    100%    100%
# Iterations: 325     378     438     363     409

[Figure 3: (a) On digit '4', the convergence of the L-SVDD solution αᵏ to the QP-SVDD solution α_Q in terms of the norm of the difference vector (calculated curve and fitted exponential curve). (b) On digit '4', the evolution of the inner product (αᵏ)′(Qαᵏ − v) with the iteration number.]

Our theoretical analysis of the computational complexity of QP-SVDD and L-SVDD is consistent with the experimental results, but the orders of complexity alone seem not to explain the results presented in Table 8, which show that L-SVDD is much more time-efficient than QP-SVDD. To explain this apparent paradox, we write the training time of QP-SVDD as t_Q = k₁n³, and that of L-SVDD as t_L = k₂Mn², where M is the iteration number of L-SVDD, n is the training set size, and k₁ and k₂ are constants. Using the information from Table 8, we calculate that on average k₁ ≈ 4.3644 × 10⁻⁷, and from Tables 8 and 11 we calculate k₂ ≈ 2.5785 × 10⁻⁹. Note that these numbers are comparable to those obtained in Sect. 4.2 from Tables 1 and 2. Therefore, k₁ is much larger than k₂ on average (about 170 times), and this explains why L-SVDD is so much more time-efficient than its QP counterpart.

Table 8 shows that QP-SVDD and L-SVDD extract roughly the same number of support vectors. To investigate the overlap of the sets of support vectors, we calculate the precision and recall rates, given in Table 11. The results show that all the recall rates are 100%, which means that L-SVDD extracts all the support vectors found by QP-SVDD, and the very high precision rates demonstrate that L-SVDD only mistakenly regards a few training examples as support vectors.

Similar to the experiment on the face dataset, on the problem for digit '4' we calculate the norm of the difference between the vector αᵏ from the k-th iteration of L-SVDD and the vector α_Q from QP-SVDD, shown as the black curve in Fig. 3a, which also gives the fitted exponential function as the red dashed curve. We once again see that the calculated values and the fitted values are very close to each other, which numerically verifies our theoretical analysis presented in Eq. (23). Figure 3b presents the inner product (αᵏ)′(Qαᵏ − v) versus the iteration number on the problem of digit '4', which clearly shows that the inner product approaches 0 as the training algorithm proceeds. In fact, when the algorithm terminated, the inner product value was 1.6360 × 10⁻⁴. The curves for the other digits show the same pattern, so we do not show them all.

5 Conclusion and future works

Support vector data description (SVDD) is a well known tool for data description and pattern classification, with wide applications. In the literature, the SVDD model is often trained by solving the dual of a constrained optimization problem, resulting in a quadratic programming problem. However, the quadratic programming is computationally expensive to solve, with time complexity about O(n³), where n is the training set size. Using the sum of squared errors and the idea of the quadratic penalty function method, we formulate the Lagrangian dual problem of SVDD as minimizing a convex quadratic function over the nonnegative orthant, which can be solved efficiently by a simple iterative algorithm. The proposed Lagrangian SVDD (L-SVDD) algorithm is very easy to implement, requiring no particular optimization toolbox other than basic matrix operations.

Theoretical and experimental analysis show that the L-SVDD solution converges r-linearly to the quadratic programming based SVDD (QP-SVDD) solution. Extensive experiments were conducted on various pattern classification problems, and we compared the performance of L-SVDD with that of QP-SVDD in terms of ROC curve measures and training time. Our results show that L-SVDD has classification performance similar to its QP counterpart, and both L-SVDD and QP-SVDD extract almost the same set of support vectors; however, L-SVDD is a couple of hundred times faster than QP-SVDD in training. The experiments also verified the theoretical analysis of the convergence rate and the training time complexity of L-SVDD.

Due to the limit of available computing resources, we did not test L-SVDD on larger training sets. However, if the training set is very large, we conjecture that the performance of L-SVDD might deteriorate. One reason is that Algorithm 1 for L-SVDD needs to calculate the inverse of an n × n matrix Q, where n is the training set size, and it is well known that inverting a large matrix is numerically unreliable and time/memory consuming. Another reason is that Algorithm 1 involves matrix–vector multiplication, whose computing cost scales with n². This work verifies the speed advantage of L-SVDD when the training set size is on the order of several thousand; it would be insightful to investigate the performance of L-SVDD on larger scale problems.

This paper formulates the optimization problem for L-SVDD as the linear complementarity problem shown in Eq. (14), and we propose Algorithm 1 to solve it. In the optimization literature, there are many methods for solving linear complementarity problems, for example, Newton's method [1], pivoting methods [16], and the methods in [10]. Thus, as a next step, we plan to apply these methods to the L-SVDD model. Similar to the works on support vector machines for classification [12] and regression [3], we could also remove the constraint in Eq. (13) by adding extra penalty terms, obtaining an unconstrained optimization problem for an approximated SVDD model; gradient based optimization methods could then be applied to this approximation, for example, Newton's method [5], coordinate gradient descent [6], block-wise coordinate descent [19], or conjugate gradient methods [39, 40]. This is another possible extension of the current work.

Acknowledgements The author would like to thank the editors and four anonymous reviewers for their constructive suggestions which greatly helped improve the paper. This work was supported by a Summer Faculty Fellowship from Missouri State University.

Appendix: Orthogonality condition for two nonnegative vectors

We show that two nonnegative vectors a and b are perpendicular if and only if a = (a − γb)₊ for any real γ > 0.

If two nonnegative real numbers a and b satisfy ab = 0, then at least one of a and b is 0. If a = 0 and b ≥ 0, then for any γ > 0, a − γb ≤ 0, so that (a − γb)₊ = 0 = a; if a > 0, we must have b = 0, and then for any real γ > 0, (a − γb)₊ = (a)₊ = a. In both cases, a = (a − γb)₊ for any real number γ > 0. Conversely, assume that two nonnegative real numbers a and b satisfy a = (a − γb)₊ for any real number γ > 0. If a and b were both strictly positive, then a − γb < a since γ > 0; consequently (a − γb)₊ < a, which contradicts the assumption that a = (a − γb)₊. Thus at least one of a and b must be 0, i.e., ab = 0.

Now assume that the nonnegative vectors a and b in R^p are perpendicular, that is, Σᵢ₌₁ᵖ aᵢbᵢ = 0. Since both a and b are nonnegative, there must be aᵢbᵢ = 0 for i = 1, 2, …, p. By the argument above, this is equivalent to aᵢ = (aᵢ − γbᵢ)₊ for any γ > 0 and each i = 1, 2, …, p. In vector form, we have a = (a − γb)₊.

References

1. Aganagić M (1984) Newton's method for linear complementarity problems. Math Program 28(3):349–362
2. Armand P, Gilbert JC, Jan-Jégou S (2000) A feasible BFGS interior point algorithm for solving convex minimization problems. SIAM J Optim 11(1):199–222
If two nonnegatve real numbers a and b satsfy ab = 0, there s at least one of a and b s 0. If a = 0 and b 0, then for any γ >0, a γb 0, so that (a γb) + = 0 = a ; f a > 0, we must have b = 0, then for any real γ >0, (a γb) + =(a) + = a. In both cases, there s a =(a γb) + for any real number γ >0. Conversely, assume that two nonnegatve real numbers a and b can be wrtten as a =(a γb) + for any real number γ>0. If a and b are both strctly postve, then a γb < a snce γ >0. Consequently, (a γb) + < a, whch s contradct to the assumpton that a =(a γb) +. Thus at least one of a and b must be 0,.e., ab = 0. Now assume that nonnegatve vectors a and b n space R p are perpendcular, that s, p a b = 0. Snce both of a and b are nonnegatve, there must be a b = 0 for = 1, 2,, p. By the last argument, ths s equvalent to a =(a γb ) + for any γ >0 and any = 1, 2,, p. In vector form, we have a =(a γb) +. References 1. Aganagć M (1984) Newton s method for lnear complementarty problems. Math Program 28(3):349 362 2. Armand P, Glbert JC, Jan-Jégou S (2000) A feasble BFGS nteror pont algorthm for solvng convex mnmzaton problems. SIAM J Optm 11(1):199 222