Lecture 6: Minimum enclosing ball and Support Vector Data Description (SVDD) Stéphane Canu stephane.canu@litislab.eu Sao Paulo 2014 May 12, 2014
Plan

1 Support Vector Data Description (SVDD)
    SVDD, the smallest enclosing ball problem
    The minimum enclosing ball problem with errors
    The minimum enclosing ball problem in a RKHS
    The two class Support vector data description (SVDD)
The minimum enclosing ball problem [Tax and Duin, 2004]

Given n points {x_i, i = 1,…,n}, find the smallest ball (center c, radius R) enclosing them:

    min_{R ∈ IR, c ∈ IR^d} R²   with   ‖x_i - c‖² ≤ R²,  i = 1,…,n

Where does this problem sit in the convex programming hierarchy: LP, QP, QCQP, SOCP and SDP?

Stéphane Canu (INSA Rouen - LITIS) May 12, 2014 3 / 35
The convex programming hierarchy (part of)

    LP:    min_x f⊤x             with Ax ≤ d and 0 ≤ x
    QP:    min_x ½ x⊤Gx + f⊤x    with Ax ≤ d
    QCQP:  min_x ½ x⊤Gx + f⊤x    with x⊤B_i x + a_i⊤x ≤ d_i,  i = 1,…,n
    SOCP:  min_x f⊤x             with ‖x - a_i‖ ≤ b_i⊤x + d_i,  i = 1,…,n

The convex programming hierarchy? Model generality: LP < QP < QCQP < SOCP < SDP
MEB as a QP in the primal

Theorem (MEB as a QP). The two following problems are equivalent:

    min_{R ∈ IR, c ∈ IR^d} R²   with   ‖x_i - c‖² ≤ R²,  i = 1,…,n

    min_{w,ρ} ½‖w‖² - ρ   with   w⊤x_i ≥ ρ + ½‖x_i‖²,  i = 1,…,n

with ρ = ½(‖c‖² - R²) and w = c.

Proof:
    ‖x_i - c‖² ≤ R²  ⟺  ‖x_i‖² - 2 x_i⊤c + ‖c‖² ≤ R²
                     ⟺  -2 x_i⊤c ≤ R² - ‖x_i‖² - ‖c‖²
                     ⟺  2 x_i⊤c ≥ -R² + ‖x_i‖² + ‖c‖²
                     ⟺  x_i⊤c ≥ ½(‖c‖² - R²) + ½‖x_i‖² = ρ + ½‖x_i‖²
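The change of variables in the theorem can be sanity-checked numerically: on arbitrary data, a point satisfies the ball constraint if and only if it satisfies the half-space constraint. A quick sketch (the data, center and radius are our arbitrary choices):

```python
import numpy as np

# Check that ||x - c||^2 <= R^2  <=>  w'x >= rho + ||x||^2 / 2
# with w = c and rho = (||c||^2 - R^2) / 2, on random points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # 200 random points in IR^3
c = np.array([0.5, -0.3, 0.2])         # an arbitrary center
R = 2.0                                # an arbitrary radius
w, rho = c, 0.5 * (c @ c - R ** 2)     # the theorem's change of variables
in_ball = ((X - c) ** 2).sum(axis=1) <= R ** 2
above_hp = X @ w >= rho + 0.5 * (X ** 2).sum(axis=1)
assert (in_ball == above_hp).all()     # the two constraints coincide
```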
MEB and the one class SVM

SVDD:

    min_{w,ρ} ½‖w‖² - ρ   with   w⊤x_i ≥ ρ + ½‖x_i‖²

SVDD and linear OCSVM (supporting hyperplane): if ‖x_i‖² is constant for i = 1,…,n, this is the linear one class SVM (OC SVM).

The linear one class SVM [Schölkopf and Smola, 2002]:

    min_{w,ρ'} ½‖w‖² - ρ'   with   w⊤x_i ≥ ρ'

with ρ' = ρ + ½‖x_i‖². OC SVM is a particular case of SVDD.
SVDD and OCSVM

When ‖x_i‖² = 1 for i = 1,…,n:

    ‖x_i - c‖² ≤ R²   ⟺   w⊤x_i ≥ ρ,   with w = c and ρ = ½(‖c‖² - R² + 1)

"Belonging to the ball" is also "being above" a hyperplane.
MEB: KKT

L(c, R, α) = R² + Σ_i α_i (‖x_i - c‖² - R²)

KKT conditions:
    stationarity:     2c Σ_i α_i - 2 Σ_i α_i x_i = 0   and   1 - Σ_i α_i = 0
    primal admiss.:   ‖x_i - c‖² ≤ R²,  i = 1,…,n
    dual admiss.:     α_i ≥ 0,  i = 1,…,n
    complementarity:  α_i (‖x_i - c‖² - R²) = 0,  i = 1,…,n

The stationarity conditions give the representer theorem: c = Σ_i α_i x_i.

Complementarity tells us there are two groups of points: the support vectors, with ‖x_i - c‖² = R², and the insiders, with α_i = 0.
MEB: Dual

The representer theorem: c = (Σ_i α_i x_i) / (Σ_i α_i) = Σ_i α_i x_i   (since Σ_i α_i = 1)

L(α) = Σ_i α_i ‖x_i - Σ_j α_j x_j‖²
     = Σ_i α_i x_i⊤x_i - Σ_i Σ_j α_i α_j x_i⊤x_j
     = α⊤diag(G) - α⊤Gα

with G = XX⊤ the Gram matrix, G_ij = x_i⊤x_j. Maximizing L(α) gives the dual:

    min_{α ∈ IR^n} α⊤Gα - α⊤diag(G)   with   e⊤α = 1 and 0 ≤ α_i, i = 1,…,n
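This dual is exactly the problem solved by the classical Badoiu-Clarkson / Frank-Wolfe scheme for the minimum enclosing ball. Not in the slides, but a minimal numpy sketch (function name, step size and iteration count are our choices):

```python
import numpy as np

def meb_dual_frank_wolfe(X, n_iter=10000):
    """Solve min_a a'Ga - a'diag(G) over the simplex by Frank-Wolfe
    (the Badoiu-Clarkson scheme), with G = X X' the Gram matrix."""
    n = X.shape[0]
    G = X @ X.T
    g = np.diag(G)
    alpha = np.full(n, 1.0 / n)              # feasible start on the simplex
    for t in range(n_iter):
        grad = 2 * G @ alpha - g             # gradient of the dual objective
        i = np.argmin(grad)                  # best simplex vertex e_i
        step = 2.0 / (t + 3)                 # classical Frank-Wolfe step size
        alpha = (1 - step) * alpha           # convex update keeps e'alpha = 1
        alpha[i] += step
    c = X.T @ alpha                          # representer theorem: c = sum_i alpha_i x_i
    R2 = np.max(np.sum((X - c) ** 2, axis=1))  # enclosing radius from the farthest point
    return c, R2, alpha
```

On three points lying on a circle of radius 1 centered at (1, 0), the iterates converge towards c ≈ (1, 0) and R² ≈ 1.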
SVDD primal vs. dual

Primal:
    min_{R ∈ IR, c ∈ IR^d} R²   with   ‖x_i - c‖² ≤ R²,  i = 1,…,n
  d + 1 unknowns, n constraints
  can be recast as a QP
  perfect when d << n

Dual:
    min_α α⊤Gα - α⊤diag(G)   with   e⊤α = 1 and 0 ≤ α_i, i = 1,…,n
  n unknowns, with G the pairwise influence Gram matrix
  n box constraints, easy to solve
  to be used when d > n

But where is R²?
Looking for R²

    min_α α⊤Gα - α⊤diag(G)   with   e⊤α = 1, 0 ≤ α_i, i = 1,…,n

The Lagrangian: L(α, µ, β) = α⊤Gα - α⊤diag(G) + µ(e⊤α - 1) - β⊤α

Stationarity cond.: ∇_α L(α, µ, β) = 2Gα - diag(G) + µe - β = 0

The bidual:

    min_{α,µ} α⊤Gα + µ   with   2Gα + µe ≥ diag(G)

By identification, R² = µ + α⊤Gα = µ + ‖c‖², where µ is the Lagrange multiplier associated with the equality constraint Σ_i α_i = 1. Also, because of the complementarity condition, if x_i is a support vector (α_i > 0) then β_i = 0 and R² = ‖x_i - c‖².
The minimum enclosing ball problem with errors

The same road map: initial formulation, reformulation (as a QP), Lagrangian, KKT, dual formulation, bidual.

Initial formulation: for a given C,

    min_{R ∈ IR, c ∈ IR^d, ξ ∈ IR^n} R² + C Σ_i ξ_i
    with  ‖x_i - c‖² ≤ R² + ξ_i,  i = 1,…,n
    and   ξ_i ≥ 0,  i = 1,…,n

where the slack ξ_i measures by how much x_i is allowed to fall outside the ball.
The MEB with slack: QP, KKT, dual and R²

SVDD as a QP:

    min_{w,ρ,ξ} ½‖w‖² - ρ + (C/2) Σ_i ξ_i
    with  w⊤x_i ≥ ρ + ½‖x_i‖² - ½ξ_i  and  ξ_i ≥ 0,  i = 1,…,n

again with OC SVM as a particular case. With G = XX⊤, the dual SVDD is:

    min_α α⊤Gα - α⊤diag(G)   with   e⊤α = 1 and 0 ≤ α_i ≤ C, i = 1,…,n

for a given C ≤ 1. If C is larger than one it is useless (it is the no-slack case).

R² = µ + c⊤c, with µ denoting the Lagrange multiplier associated with the equality constraint Σ_i α_i = 1.
Variations over SVDD

Adaptive SVDD: the weighted error case, for given weights w_i, i = 1,…,n:

    min_{c ∈ IR^p, R ∈ IR, ξ ∈ IR^n} R + C Σ_i w_i ξ_i
    with  ‖x_i - c‖² ≤ R + ξ_i  and  ξ_i ≥ 0,  i = 1,…,n

(here R plays the role of the squared radius). The dual of this problem is a QP [see for instance Liu et al., 2013]:

    min_{α ∈ IR^n} α⊤XX⊤α - α⊤diag(XX⊤)   with   Σ_i α_i = 1 and 0 ≤ α_i ≤ C w_i, i = 1,…,n

Density induced SVDD (D-SVDD):

    min_{c ∈ IR^p, R ∈ IR, ξ ∈ IR^n} R + C Σ_i ξ_i
    with  w_i ‖x_i - c‖² ≤ R + ξ_i  and  ξ_i ≥ 0,  i = 1,…,n
SVDD in a RKHS

The feature map:  IR^p → H,   c → f(·),   x_i → k(x_i, ·)

    ‖x_i - c‖²_{IR^p} ≤ R²    becomes    ‖k(x_i, ·) - f(·)‖²_H ≤ R²

Kernelized SVDD (in a RKHS) is also a QP:

    min_{f ∈ H, R ∈ IR, ξ ∈ IR^n} R² + C Σ_i ξ_i
    with  ‖k(x_i, ·) - f(·)‖²_H ≤ R² + ξ_i  and  ξ_i ≥ 0,  i = 1,…,n
SVDD in a RKHS: KKT, dual and R²

L = R² + C Σ_i ξ_i + Σ_i α_i (‖k(x_i, ·) - f(·)‖²_H - R² - ξ_i) - Σ_i β_i ξ_i
  = R² + C Σ_i ξ_i + Σ_i α_i (k(x_i, x_i) - 2f(x_i) + ‖f‖²_H - R² - ξ_i) - Σ_i β_i ξ_i

KKT conditions:
    Stationarity:  2f(·) Σ_i α_i - 2 Σ_i α_i k(·, x_i) = 0   (the representer theorem),
                   1 - Σ_i α_i = 0,   and   C - α_i - β_i = 0
    Primal admissibility:  ‖k(x_i, ·) - f(·)‖²_H ≤ R² + ξ_i,  ξ_i ≥ 0
    Dual admissibility:    α_i ≥ 0,  β_i ≥ 0
    Complementarity:       α_i (‖k(x_i, ·) - f(·)‖²_H - R² - ξ_i) = 0   and   β_i ξ_i = 0
SVDD in a RKHS: dual and R²

With f(·) = Σ_j α_j k(·, x_j):

L(α) = Σ_i α_i k(x_i, x_i) - 2 Σ_i α_i f(x_i) + ‖f‖²_H
     = Σ_i α_i k(x_i, x_i) - Σ_i Σ_j α_i α_j k(x_i, x_j)
     = α⊤diag(G) - α⊤Gα,   with G_ij = k(x_i, x_j)

The dual:

    min_α α⊤Gα - α⊤diag(G)   with   e⊤α = 1 and 0 ≤ α_i ≤ C, i = 1,…,n

As in the linear case, R² = µ + ‖f‖²_H, with µ denoting the Lagrange multiplier associated with the equality constraint Σ_i α_i = 1.
SVDD train and val in a RKHS

Train using the dual form (in: G, C; out: α, µ):

    min_α α⊤Gα - α⊤diag(G)   with   e⊤α = 1 and 0 ≤ α_i ≤ C, i = 1,…,n

Val, with the center in the RKHS f(·) = Σ_i α_i k(·, x_i):

φ(x) = ‖k(x, ·) - f(·)‖²_H - R²
     = ‖k(x, ·)‖²_H - 2 ⟨k(x, ·), f(·)⟩_H + ‖f(·)‖²_H - R²
     = k(x, x) - 2f(x) + (R² - µ) - R²
     = k(x, x) - 2f(x) - µ
     = k(x, x) - 2 Σ_i α_i k(x, x_i) - µ

φ(x) = 0 is the decision border.
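Train and val can be put together for the Gaussian kernel, for which k(x, x) = 1. A sketch, where R² is recovered from an unbounded support vector (0 < α_i < C) rather than from the multiplier µ, since generic solvers do not always expose µ; the kernel, bandwidth, thresholds and function names are our choices:

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, sigma=1.0):
    """Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svdd_fit(X, C=1.0, sigma=1.0):
    """Train kernel SVDD on the dual; return the score function phi."""
    n = len(X)
    G = rbf(X, X, sigma)
    g = np.diag(G).copy()                      # g_i = k(x_i, x_i) = 1
    alpha = minimize(lambda a: a @ G @ a - a @ g,
                     np.full(n, 1.0 / n),
                     jac=lambda a: 2 * G @ a - g,
                     bounds=[(0.0, C)] * n,
                     constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0},
                     method="SLSQP").x
    fX = G @ alpha                             # f(x_i) on the training set
    f2 = alpha @ fX                            # ||f||^2_H
    # R^2 = ||k(x_sv,.) - f||^2_H for a support vector with 0 < alpha_sv < C
    sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))
    R2 = g[sv] - 2.0 * fX[sv] + f2

    def phi(x):
        """phi(x) = ||k(x,.) - f||^2_H - R^2; phi(x) = 0 is the border."""
        Kx = rbf(np.atleast_2d(np.asarray(x, dtype=float)), X, sigma)[0]
        return 1.0 - 2.0 * Kx @ alpha + f2 - R2
    return phi
```

Points inside the described region get φ(x) < 0, points far from the data get φ(x) > 0.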
An important theoretical result

For a well-calibrated bandwidth, the SVDD estimates the underlying distribution level set [Vert and Vert, 2006].

The level sets of a probability density function IP(x) are the sets C_p = {x ∈ IR^d | IP(x) ≥ p}. C_p is well estimated by the empirical minimum volume set

    V_p = {x ∈ IR^d | ‖k(x, ·) - f(·)‖²_H - R² ≤ 0}

The frontiers coincide.
SVDD: the generalization error

Let (x_1,…,x_n) be i.i.d. from some fixed but unknown IP(x). Then [Shawe-Taylor and Cristianini, 2004], with probability at least 1 - δ (δ ∈ ]0, 1[), for any margin m > 0:

    IP(‖k(x, ·) - f(·)‖²_H ≥ R² + m) ≤ (1/(mn)) Σ_i ξ_i + (6R²)/(m√n) + 3 √(ln(2/δ)/(2n))
Equivalence between SVDD and OCSVM for translation invariant kernels (diagonal constant kernels)

Theorem. Let H be a RKHS on some domain X endowed with kernel k. If there exists some constant c such that ∀x ∈ X, k(x, x) = c, then the two following problems are equivalent:

    min_{f,R,ξ} R + C Σ_i ξ_i
    with  ‖k(x_i, ·) - f(·)‖²_H ≤ R + ξ_i,  ξ_i ≥ 0,  i = 1,…,n

    min_{f,ρ,ε} ½‖f‖²_H - ρ + C Σ_i ε_i
    with  f(x_i) ≥ ρ - ε_i,  ε_i ≥ 0,  i = 1,…,n

with ρ = ½(c + ‖f‖²_H - R) and ε_i = ½ξ_i.
Proof of the equivalence between SVDD and OCSVM:

    min_{f ∈ H, R ∈ IR, ξ ∈ IR^n} R + C Σ_i ξ_i
    with  ‖k(x_i, ·) - f(·)‖²_H ≤ R + ξ_i,  ξ_i ≥ 0,  i = 1,…,n

Since ‖k(x_i, ·) - f(·)‖²_H = k(x_i, x_i) + ‖f‖²_H - 2f(x_i), this is

    min_{f ∈ H, R ∈ IR, ξ ∈ IR^n} R + C Σ_i ξ_i
    with  2f(x_i) ≥ k(x_i, x_i) + ‖f‖²_H - R - ξ_i,  ξ_i ≥ 0,  i = 1,…,n.

Introducing ρ = ½(c + ‖f‖²_H - R), that is R = c + ‖f‖²_H - 2ρ, and since k(x_i, x_i) is constant and equal to c, the SVDD problem becomes

    min_{f ∈ H, ρ ∈ IR, ξ ∈ IR^n} ½‖f‖²_H - ρ + (C/2) Σ_i ξ_i
    with  f(x_i) ≥ ρ - ½ξ_i,  ξ_i ≥ 0,  i = 1,…,n
leading to the classical one class SVM formulation (OCSVM)

    min_{f ∈ H, ρ ∈ IR, ε ∈ IR^n} ½‖f‖²_H - ρ + C Σ_i ε_i
    with  f(x_i) ≥ ρ - ε_i,  ε_i ≥ 0,  i = 1,…,n

with ε_i = ½ξ_i. Note that by putting ν = 1/(nC) we can get the so-called ν formulation of the OCSVM

    min_{f' ∈ H, ρ' ∈ IR, ξ' ∈ IR^n} ½‖f'‖²_H - nνρ' + Σ_i ξ'_i
    with  f'(x_i) ≥ ρ' - ξ'_i,  ξ'_i ≥ 0,  i = 1,…,n

with f' = Cf, ρ' = Cρ, and ξ' = Cξ.
Duality

Note that the dual of the SVDD is

    min_{α ∈ IR^n} α⊤Gα - α⊤g   with   Σ_i α_i = 1 and 0 ≤ α_i ≤ C, i = 1,…,n

where G is the kernel matrix of general term G_ij = k(x_i, x_j) and g the diagonal vector such that g_i = k(x_i, x_i) = c. The dual of the OCSVM is the following equivalent QP

    min_{α ∈ IR^n} ½ α⊤Gα   with   Σ_i α_i = 1 and 0 ≤ α_i ≤ C, i = 1,…,n

The two QPs are equivalent because g = c e, so that the linear term α⊤g = c is constant on the feasible set. Both dual forms provide the same solution α, but not the same Lagrange multipliers. ρ is the Lagrange multiplier of the equality constraint of the dual of the OCSVM, and R = c + α⊤Gα - 2ρ. Using the SVDD dual, it turns out that R = λ_eq + α⊤Gα, where λ_eq is the Lagrange multiplier of the equality constraint of the SVDD dual form.
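Since the linear term is constant whenever the kernel has a constant diagonal, both QPs must return the same α. A quick numerical check with a Gaussian kernel (the data, solver, bound C and helper name are our choices):

```python
import numpy as np
from scipy.optimize import minimize

def solve_simplex_qp(fun, jac, n, C):
    """min fun(a)  s.t.  sum(a) = 1, 0 <= a_i <= C (generic SLSQP sketch)."""
    return minimize(fun, np.full(n, 1.0 / n), jac=jac,
                    bounds=[(0.0, C)] * n,
                    constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0},
                    method="SLSQP").x

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 2))
G = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # Gaussian kernel: G_ii = 1
g = np.diag(G)
C = 0.3
a_svdd = solve_simplex_qp(lambda a: a @ G @ a - a @ g,
                          lambda a: 2 * G @ a - g, len(X), C)
a_ocsvm = solve_simplex_qp(lambda a: 0.5 * a @ G @ a,
                           lambda a: G @ a, len(X), C)
```

The two solutions agree up to solver tolerance, as predicted.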
The two class Support vector data description (SVDD)

[figure: two classes of points in the plane, with the ball enclosing the target class]

    min_{c, R, ξ⁺, ξ⁻} R² + C (Σ_{i: y_i = 1} ξ⁺_i + Σ_{i: y_i = -1} ξ⁻_i)
    with  ‖x_i - c‖² ≤ R² + ξ⁺_i,  ξ⁺_i ≥ 0,  for i such that y_i = 1
    and   ‖x_i - c‖² ≥ R² - ξ⁻_i,  ξ⁻_i ≥ 0,  for i such that y_i = -1
The two class SVDD as a QP

Rewriting the constraints of

    min_{c, R, ξ⁺, ξ⁻} R² + C (Σ_{y_i=1} ξ⁺_i + Σ_{y_i=-1} ξ⁻_i):

    ‖x_i‖² - 2 x_i⊤c + ‖c‖² ≤ R² + ξ⁺_i,  ξ⁺_i ≥ 0,  for y_i = 1
    ‖x_i‖² - 2 x_i⊤c + ‖c‖² ≥ R² - ξ⁻_i,  ξ⁻_i ≥ 0,  for y_i = -1

⟺   2 x_i⊤c ≥ ‖c‖² - R² + ‖x_i‖² - ξ⁺_i,  for y_i = 1
    2 x_i⊤c ≤ ‖c‖² - R² + ‖x_i‖² + ξ⁻_i,  for y_i = -1

⟺   2 y_i x_i⊤c ≥ y_i (‖c‖² - R² + ‖x_i‖²) - ξ_i,  ξ_i ≥ 0,  i = 1,…,n

Change of variable: ρ = ‖c‖² - R²

    min_{c, ρ, ξ} ‖c‖² - ρ + C Σ_i ξ_i
    with  2 y_i x_i⊤c ≥ y_i (ρ + ‖x_i‖²) - ξ_i  and  ξ_i ≥ 0,  i = 1,…,n
The dual of the two class SVDD

With G_ij = y_i y_j x_i⊤x_j, the dual formulation is:

    min_{α ∈ IR^n} α⊤Gα - Σ_i α_i y_i ‖x_i‖²
    with  Σ_i y_i α_i = 1 and 0 ≤ α_i ≤ C, i = 1,…,n
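A sketch of this dual with SLSQP on a toy geometry, one class clustered at the origin and the other on a surrounding ring (data, solver and function name are our choices): by symmetry, the recovered center c = Σ_i α_i y_i x_i should land near the origin.

```python
import numpy as np
from scipy.optimize import minimize

def two_class_svdd_dual(X, y, C=1.0):
    """min_a a'Ga - sum_i a_i y_i ||x_i||^2,  s.t.  y'a = 1, 0 <= a_i <= C,
    with G_ij = y_i y_j x_i'x_j; returns (alpha, center)."""
    n = len(X)
    G = (y[:, None] * y[None, :]) * (X @ X.T)
    f = y * (X ** 2).sum(axis=1)
    a0 = np.where(y > 0, 1.0 / (y > 0).sum(), 0.0)   # feasible start: y'a0 = 1
    res = minimize(lambda a: a @ G @ a - a @ f, a0,
                   jac=lambda a: 2 * G @ a - f,
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq", "fun": lambda a: a @ y - 1.0},
                   method="SLSQP")
    alpha = res.x
    return alpha, X.T @ (alpha * y)                  # c = sum_i alpha_i y_i x_i
```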
The two class SVDD vs. one class SVDD

[figure: the two class SVDD (left) vs. the one class SVDD (right)]
Small Sphere and Large Margin (SSLM) approach

Support vector data description with margin [Wu and Ye, 2009]:

    min_{c, R, ξ} R² + C (Σ_{y_i=1} ξ⁺_i + Σ_{y_i=-1} ξ⁻_i)
    with  ‖x_i - c‖² ≤ R² - 1 + ξ⁺_i,  ξ⁺_i ≥ 0,  for i such that y_i = 1
    and   ‖x_i - c‖² ≥ R² + 1 - ξ⁻_i,  ξ⁻_i ≥ 0,  for i such that y_i = -1

Both constraints can be written as y_i ‖x_i - c‖² ≤ y_i R² - 1 + ξ_i, so that

L(c, R, ξ, α, β) = R² + C Σ_i ξ_i + Σ_i α_i (y_i ‖x_i - c‖² - y_i R² + 1 - ξ_i) - Σ_i β_i ξ_i
SVDD with margin: dual formulation

L(c, R, ξ, α, β) = R² + C Σ_i ξ_i + Σ_i α_i (y_i ‖x_i - c‖² - y_i R² + 1 - ξ_i) - Σ_i β_i ξ_i

Optimality:  c = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 1,   0 ≤ α_i ≤ C

L(α) = Σ_i α_i y_i ‖x_i - Σ_j α_j y_j x_j‖² + Σ_i α_i
     = - Σ_i Σ_j α_i α_j y_i y_j x_j⊤x_i + Σ_i α_i y_i ‖x_i‖² + Σ_i α_i

The dual SSLM is also a quadratic program:

    min_{α ∈ IR^n} α⊤Gα - e⊤α - f⊤α
    with  y⊤α = 1 and 0 ≤ α_i ≤ C, i = 1,…,n

with G the symmetric n × n matrix such that G_ij = y_i y_j x_j⊤x_i and f_i = ‖x_i‖² y_i.
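The same solver sketch adapts to the SSLM dual, with the extra -e⊤α term coming from the unit margin (data, C, solver and function name are our choices). On the toy geometry of the previous slides, the symmetric optimum puts the center near the origin, with the + class inside squared radius R² - 1 and the - class outside R² + 1.

```python
import numpy as np
from scipy.optimize import minimize

def sslm_dual(X, y, C=1.0):
    """min_a a'Ga - e'a - f'a,  s.t.  y'a = 1, 0 <= a_i <= C,
    with G_ij = y_i y_j x_j'x_i and f_i = ||x_i||^2 y_i."""
    n = len(X)
    G = (y[:, None] * y[None, :]) * (X @ X.T)
    f = (X ** 2).sum(axis=1) * y
    a0 = np.where(y > 0, 1.0 / (y > 0).sum(), 0.0)       # feasible start
    res = minimize(lambda a: a @ G @ a - a.sum() - a @ f, a0,
                   jac=lambda a: 2 * G @ a - 1.0 - f,
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq", "fun": lambda a: a @ y - 1.0},
                   method="SLSQP")
    alpha = res.x
    return alpha, X.T @ (alpha * y)                      # c = sum_i alpha_i y_i x_i
```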
Conclusion

Applications: outlier detection, change detection, clustering, large number of classes, variable selection, …

A clear path: reformulation (to a standard problem), KKT, dual, bidual.

A lot of variations: L2-SVDD, two classes non symmetric, two classes in the symmetric case (SVM), the multi-class issue, practical problems with translation invariant kernels.
Bibliography

Bo Liu, Yanshan Xiao, Longbing Cao, Zhifeng Hao, and Feiqi Deng. SVDD-based outlier detection on uncertain data. Knowledge and Information Systems, 34(3):597-618, 2013.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

David M. J. Tax and Robert P. W. Duin. Support vector data description. Machine Learning, 54(1):45-66, 2004.

Régis Vert and Jean-Philippe Vert. Consistency and convergence rates of one-class SVMs and related algorithms. Journal of Machine Learning Research, 7:817-854, 2006.

Mingrui Wu and Jieping Ye. A small sphere and large margin approach for novelty detection using training data with outliers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):2088-2092, 2009.