Robust minimum encoding ball and Support vector data description (SVDD)


1 Robust minimum encoding ball and Support vector data description (SVDD)
Stéphane Canu, Meriem El Azamin, Carole Lartizien & Gaëlle Loosli
INRIA, September 2014 (September 24, 2014)

2 Plan
1. Support Vector Data Description (SVDD)
   - SVDD, the smallest enclosing ball problem
   - The minimum enclosing ball problem with errors
   - The minimum enclosing ball problem in a RKHS
   - The two class Support vector data description (SVDD)
2. Robust outlier detection with L0-SVDD
   - L0 SVDD


4 The minimum enclosing ball problem [Sylvester, 1857]
Original formulation: "It is required to find the least circle which shall contain a given system of points in a plane."
Given $n$ points $\{x_i,\ i = 1, \dots, n\}$, find the center $c$ and the radius $R$ solving
$$\min_{R \in \mathbb{R},\ c \in \mathbb{R}^d} R^2 \quad \text{with} \quad \|x_i - c\|^2 \le R^2, \quad i = 1, \dots, n.$$
Stéphane Canu (INSA Rouen - LITIS), September 24, 2014

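7 MEB as a QP in the primal [Elzinga and Hearn, 1972]
Theorem (MEB as a QP). The two following problems are equivalent:
$$\min_{R \in \mathbb{R},\ c \in \mathbb{R}^d} R^2 \quad \text{with} \quad \|x_i - c\|^2 \le R^2, \quad i = 1, \dots, n$$
$$\min_{c,\ \rho} \tfrac{1}{2}\|c\|^2 - \rho \quad \text{with} \quad c^\top x_i \ge \rho + \tfrac{1}{2}\|x_i\|^2, \quad i = 1, \dots, n$$
with $\rho = \tfrac{1}{2}\left(\|c\|^2 - R^2\right)$.
Proof:
$$\|x_i - c\|^2 \le R^2$$
$$\Leftrightarrow\ \|x_i\|^2 - 2 x_i^\top c + \|c\|^2 \le R^2$$
$$\Leftrightarrow\ -2 x_i^\top c \le R^2 - \|x_i\|^2 - \|c\|^2$$
$$\Leftrightarrow\ 2 x_i^\top c \ge -R^2 + \|x_i\|^2 + \|c\|^2$$
$$\Leftrightarrow\ x_i^\top c \ge \underbrace{\tfrac{1}{2}\left(\|c\|^2 - R^2\right)}_{\rho} + \tfrac{1}{2}\|x_i\|^2$$
Since $R^2 = \|c\|^2 - 2\rho$, minimizing $R^2$ amounts to minimizing $\tfrac{1}{2}\|c\|^2 - \rho$.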
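The equivalence of the two constraints can be checked numerically. The sketch below is only an illustration: the center, the squared radius and the random test points are arbitrary values chosen for the check, not data from the slides.

```python
import random

def in_ball(x, c, R2):
    """Ball constraint: ||x - c||^2 <= R^2."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c)) <= R2

def above_hyperplane(x, c, R2):
    """QP constraint: c.x >= rho + 0.5 * ||x||^2 with rho = 0.5 * (||c||^2 - R^2)."""
    rho = 0.5 * (sum(ci * ci for ci in c) - R2)
    lhs = sum(ci * xi for ci, xi in zip(c, x))
    return lhs >= rho + 0.5 * sum(xi * xi for xi in x)

random.seed(0)
c, R2 = [0.5, -1.0], 2.0          # arbitrary center and squared radius
points = [[random.uniform(-3, 3), random.uniform(-3, 3)] for _ in range(1000)]

# Both tests should give the same answer for every point.
agree = all(in_ball(x, c, R2) == above_hyperplane(x, c, R2) for x in points)
print(agree)
```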

8 MEB and the one class SVM
SVDD:
$$\min_{c,\ \rho} \tfrac{1}{2}\|c\|^2 - \rho \quad \text{with} \quad c^\top x_i \ge \rho + \tfrac{1}{2}\|x_i\|^2$$
SVDD and linear OCSVM (supporting hyperplane): if $\|x_i\|^2$ is constant for all $i = 1, \dots, n$, SVDD is the linear one class SVM (OC SVM).
The linear one class SVM [Schölkopf and Smola, 2002]:
$$\min_{c,\ \rho'} \tfrac{1}{2}\|c\|^2 - \rho' \quad \text{with} \quad c^\top x_i \ge \rho'$$
with $\rho' = \rho + \tfrac{1}{2}\|x_i\|^2$.
OC SVM is a particular case of SVDD.

9 The sphere and the hyperplane
When $\|x_i\|^2 = 1$ for all $i = 1, \dots, n$:
$$\|x_i - c\|^2 \le R^2 \ \Leftrightarrow\ c^\top x_i \ge \rho \quad \text{with} \quad \rho = \tfrac{1}{2}\left(\|c\|^2 - R^2 + 1\right)$$
SVDD and OCSVM: "belonging to the ball" is also "being above" a hyperplane.

10 MEB: Lagrangian & KKT
$$L(c, R, \alpha) = R^2 + \sum_{i=1}^n \alpha_i \left(\|x_i - c\|^2 - R^2\right)$$
KKT conditions:
- stationarity: $2c \sum_{i=1}^n \alpha_i - 2 \sum_{i=1}^n \alpha_i x_i = 0$ (the representer theorem) and $1 - \sum_{i=1}^n \alpha_i = 0$
- primal admissibility: $\|x_i - c\|^2 \le R^2$, $i = 1, \dots, n$
- dual admissibility: $\alpha_i \ge 0$, $i = 1, \dots, n$
- complementarity: $\alpha_i \left(\|x_i - c\|^2 - R^2\right) = 0$, $i = 1, \dots, n$
Complementarity tells us that there are two groups of points: the support vectors, with $\|x_i - c\|^2 = R^2$, and the insiders, with $\alpha_i = 0$.

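12 MEB: Dual
The representer theorem:
$$\nabla_c L(c, R, \alpha) = 0 \ \Rightarrow\ c = \sum_{i=1}^n \alpha_i x_i$$
The Lagrangian for the dual:
$$L(\alpha) = \sum_{i=1}^n \alpha_i \Big\| x_i - \sum_{j=1}^n \alpha_j x_j \Big\|^2 = \sum_{i=1}^n \alpha_i x_i^\top x_i - \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j x_i^\top x_j$$
With $G = XX^\top$ the Gram matrix, $G_{ij} = x_i^\top x_j$:
$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j x_i^\top x_j = \alpha^\top G \alpha \quad \text{and} \quad \sum_{i=1}^n \alpha_i x_i^\top x_i = \alpha^\top \mathrm{diag}(G)$$
The dual formulation of the MEB:
$$\min_{\alpha \in \mathbb{R}^n} \alpha^\top G \alpha - \alpha^\top \mathrm{diag}(G) \quad \text{with} \quad e^\top \alpha = 1 \ \text{and}\ 0 \le \alpha_i, \quad i = 1, \dots, n$$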
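As an illustration of the dual, here is a minimal sketch of a Frank-Wolfe iteration on the simplex. This solver is an assumption of this example, not the method used in the slides. For this objective, the coordinate with the smallest gradient is the point farthest from the current center, so each step moves a fraction of the weight onto that point. The function names and the toy data (four corners of a square plus one interior point) are hypothetical.

```python
def sqdist(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def center(X, alpha):
    """c = sum_i alpha_i x_i (the representer theorem)."""
    dim = len(X[0])
    return [sum(a * x[k] for a, x in zip(alpha, X)) for k in range(dim)]

def meb_frank_wolfe(X, n_iter=2000):
    """Approximate solution of min a'Ga - a'diag(G) over the simplex."""
    n = len(X)
    alpha = [1.0 / n] * n                      # feasible starting point
    for t in range(1, n_iter + 1):
        c = center(X, alpha)
        # Gradient coordinate i is ||c||^2 - ||x_i - c||^2: smallest for the farthest point.
        far = max(range(n), key=lambda i: sqdist(X[i], c))
        eta = 1.0 / (t + 1)                    # classical step size
        alpha = [(1.0 - eta) * a for a in alpha]
        alpha[far] += eta
    c = center(X, alpha)
    R2 = max(sqdist(x, c) for x in X)          # squared radius enclosing all points
    return alpha, c, R2

X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0], [0.2, 0.1]]
alpha, c, R2 = meb_frank_wolfe(X)
print(c, R2)
```

On this toy set the exact ball is centered at the origin with $R^2 = 2$; the iteration only approaches it, since Frank-Wolfe converges slowly. A dedicated QP solver would be the natural choice in practice.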

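13 SVDD primal vs. dual
Primal:
$$\min_{R \in \mathbb{R},\ c \in \mathbb{R}^d} R^2 \quad \text{with} \quad \|x_i - c\|^2 \le R^2, \quad i = 1, \dots, n$$
- $d + 1$ unknowns
- $n$ constraints
- can be recast as a QP
- perfect when $d \ll n$
Dual:
$$\min_{\alpha} \alpha^\top G \alpha - \alpha^\top \mathrm{diag}(G) \quad \text{with} \quad e^\top \alpha = 1 \ \text{and}\ 0 \le \alpha_i, \quad i = 1, \dots, n$$
- $n$ unknowns, with $G$ the pairwise influence Gram matrix
- $n$ box constraints
- easy to solve
- to be used when $d > n$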


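15 Dealing with outliers: a bi-criteria optimization problem
Modeling potential errors: introducing slack variables $\xi_i$ for all $x_i$:
- no error: $\|x_i - c\|^2 \le R^2 \ \Rightarrow\ \xi_i = 0$
- error: $\|x_i - c\|^2 > R^2 \ \Rightarrow\ \xi_i = \|x_i - c\|^2 - R^2$
The two criteria:
$$\min_{R, c, \xi} R^2 \quad \text{and} \quad \min_{R, c, \xi} \frac{1}{p} \sum_{i=1}^n \xi_i^p$$
$$\text{with} \quad \|x_i - c\|^2 \le R^2 + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
Our hope: almost all $\xi_i = 0$.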

16 The minimum enclosing ball problem with errors
The same road map:
- initial formulation
- reformulation (as a QP)
- Lagrangian, KKT
- dual formulation
- bi-dual
Initial formulation: for a given $\mu \ge 0$,
$$\min_{R, c, \xi} R^2 + \mu \sum_{i=1}^n \xi_i \quad \text{with} \quad \|x_i - c\|^2 \le R^2 + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$

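17 The MEB with slack: parametric QP, KKT, dual and $R^2$
SVDD as a QP:
$$\min_{c, \rho, \xi} \tfrac{1}{2}\|c\|^2 - \rho + \frac{\mu}{2} \sum_{i=1}^n \xi_i \quad \text{with} \quad c^\top x_i \ge \rho + \tfrac{1}{2}\|x_i\|^2 - \tfrac{1}{2}\xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
again with OC SVM as a particular case.
With $G = XX^\top$, the dual SVDD is
$$\min_{\alpha} \alpha^\top G \alpha - \alpha^\top \mathrm{diag}(G) \quad \text{with} \quad e^\top \alpha = 1 \ \text{and}\ 0 \le \alpha_i \le \mu, \quad i = 1, \dots, n$$
for a given $\mu \le 1$. If $\mu$ is larger than one, the upper bound is useless (it is the no-slack case).
$$R^2 = \nu + c^\top c$$
with $\nu$ denoting the Lagrange multiplier associated with the equality constraint $\sum_{i=1}^n \alpha_i = 1$.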
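The remark on the range of $\mu$ can be spelled out in a few lines. This is a sketch rather than part of the original slides; it assumes $\beta_i \ge 0$ denotes the multiplier of the constraint $\xi_i \ge 0$, so that stationarity in $\xi_i$ reads $\mu - \alpha_i - \beta_i = 0$.

```latex
\begin{align*}
e^\top \alpha = 1,\ \alpha_i \ge 0
  &\ \Longrightarrow\ \alpha_i \le 1
  && \text{(the bound } \alpha_i \le \mu \text{ is inactive when } \mu \ge 1\text{)} \\
e^\top \alpha = 1,\ \alpha_i \le \mu
  &\ \Longrightarrow\ 1 \le n\mu
  && \text{(the dual is feasible only if } \mu \ge 1/n\text{)} \\
\xi_i > 0 \ \Longrightarrow\ \beta_i = 0
  &\ \Longrightarrow\ \alpha_i = \mu
  && \text{(at most } 1/\mu \text{ points can lie outside the ball)}
\end{align*}
```

Under these assumptions the useful range is $1/n \le \mu \le 1$, and $1/\mu$ bounds the number of points with a nonzero slack.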

18 Parametric QP [Markowitz, 1952]
$$\big(c(\mu), \rho(\mu)\big) = \arg\min_{c, \rho, \xi} \tfrac{1}{2}\|c\|^2 - \rho + \frac{\mu}{2} \sum_{i=1}^n \xi_i \quad \text{with} \quad c^\top x_i \ge \rho + \tfrac{1}{2}\|x_i\|^2 - \tfrac{1}{2}\xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
$$\alpha(\mu) = \arg\min_{\alpha} \alpha^\top G \alpha - \alpha^\top d_G \quad \text{with} \quad e^\top \alpha = 1 \ \text{and}\ 0 \le \alpha_i \le \mu, \quad i = 1, \dots, n$$
Figure: solutions for $\mu$ = 0.05, 0.15, 0.25, 0.35 and 0.45.
The regularization path $\alpha(\mu)$ is piecewise linear:
$$\alpha(\mu') = \alpha(\mu) + (\mu' - \mu)\, v$$

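19 SVDD as a penalized hinge loss minimization
The slack variables $\xi_i$ for all $x_i$:
- no error: $\|x_i - c\|^2 \le R^2 \ \Rightarrow\ \xi_i = 0$
- error: $\|x_i - c\|^2 > R^2 \ \Rightarrow\ \xi_i = \|x_i - c\|^2 - R^2$
$$\big(c(\mu), \rho(\mu)\big) = \arg\min_{c, \rho} \tfrac{1}{2}\|c\|^2 - \rho + \mu \sum_{i=1}^n \max\left(0,\ \rho + \tfrac{1}{2}\|x_i\|^2 - c^\top x_i\right)$$
Generalize to
- other losses: exponential, logistic, $L_0$...
- other penalizations: $L_1$, elastic net...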
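A small numerical check may help to see that the hinge form and the slack form describe the same objective. The sketch below uses hypothetical toy values for the points, $c$, $\rho$ and $\mu$; it relies on $R^2 = \|c\|^2 - 2\rho$ and on the smallest feasible slacks $\xi_i = \max(0, \|x_i - c\|^2 - R^2)$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hinge_objective(X, c, rho, mu):
    """0.5*||c||^2 - rho + mu * sum_i max(0, rho + 0.5*||x_i||^2 - c.x_i)."""
    loss = sum(max(0.0, rho + 0.5 * dot(x, x) - dot(c, x)) for x in X)
    return 0.5 * dot(c, c) - rho + mu * loss

def slack_objective(X, c, rho, mu):
    """0.5*||c||^2 - rho + (mu/2) * sum_i xi_i with the smallest feasible slacks."""
    R2 = dot(c, c) - 2.0 * rho                     # from rho = 0.5 * (||c||^2 - R^2)
    xi = [max(0.0, sum((xk - ck) ** 2 for xk, ck in zip(x, c)) - R2) for x in X]
    return 0.5 * dot(c, c) - rho + 0.5 * mu * sum(xi)

X = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0], [5.0, 5.0]]   # toy points
c, rho, mu = [1.0, 1.0], 0.5, 0.25                      # arbitrary values

h = hinge_objective(X, c, rho, mu)
s = slack_objective(X, c, rho, mu)
print(h, s)
```

The two values coincide because $\|x_i - c\|^2 - R^2 = 2\left(\rho + \tfrac{1}{2}\|x_i\|^2 - c^\top x_i\right)$.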

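20 Variations over SVDD
In these variants $R$ stands for the squared radius.
Adaptive SVDD: the weighted error case, for given $w_i$, $i = 1, \dots, n$:
$$\min_{c \in \mathbb{R}^p,\ R \in \mathbb{R},\ \xi \in \mathbb{R}^n} R + \mu \sum_{i=1}^n w_i \xi_i \quad \text{with} \quad \|x_i - c\|^2 \le R + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
The dual of this problem is a QP [see for instance Liu et al., 2013]:
$$\min_{\alpha \in \mathbb{R}^n} \alpha^\top XX^\top \alpha - \alpha^\top \mathrm{diag}(XX^\top) \quad \text{with} \quad \sum_{i=1}^n \alpha_i = 1 \ \text{and}\ 0 \le \alpha_i \le \mu w_i, \quad i = 1, \dots, n$$
Density induced SVDD (D-SVDD) [Lee et al., 2007]:
$$\min_{c \in \mathbb{R}^p,\ R \in \mathbb{R},\ \xi \in \mathbb{R}^n} R + \mu \sum_{i=1}^n \xi_i \quad \text{with} \quad w_i \|x_i - c\|^2 \le R + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$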


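22 Kernels and RKHS
Definition (kernel on $x \in \mathbb{R}^p$): a function of two variables $k(x, x')$ from $\mathbb{R}^p \times \mathbb{R}^p$ to $\mathbb{R}$, symmetric and positive.
Examples:
- the linear kernel: $k(s, t) = s^\top t$
- the polynomial kernel: $k(s, t) = \left(s^\top t + b\right)^q$
- the Gaussian kernel: $k(s, t) = \exp\left(-\dfrac{\|s - t\|^2}{b}\right)$
From kernel to functions (RKHS):
$$H_0 = \Big\{ f \ \Big|\ m_f < \infty;\ f_j \in \mathbb{R};\ t_j \in \mathbb{R}^p,\ f(x) = \sum_{j=1}^{m_f} f_j\, k(x, t_j) \Big\}$$
Evaluation functional: for all $x \in \mathbb{R}^p$,
$$f(x) = \langle f(\cdot), k(x, \cdot) \rangle_{H_0}$$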
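The three example kernels translate directly into code. The sketch below is illustrative only: the function names, the default values of $b$ and $q$, and the toy points are assumptions of this example.

```python
import math

def linear_kernel(s, t):
    return sum(si * ti for si, ti in zip(s, t))

def polynomial_kernel(s, t, b=1.0, q=2):
    return (linear_kernel(s, t) + b) ** q

def gaussian_kernel(s, t, b=1.0):
    return math.exp(-sum((si - ti) ** 2 for si, ti in zip(s, t)) / b)

def gram(X, k):
    """Gram matrix G with G[i][j] = k(x_i, x_j)."""
    return [[k(xi, xj) for xj in X] for xi in X]

X = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]   # toy points
G = gram(X, gaussian_kernel)
print(G)
```

The Gaussian kernel satisfies $k(x, x) = 1$ for every $x$, so the diagonal of its Gram matrix is constant.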

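23 SVDD in a RKHS [Tax and Duin, 2004]
The same road map:
- initial formulation
- reformulation (as a QP)
- Lagrangian, KKT
- dual formulation
- bi-dual
The feature map, from $\mathbb{R}^p$ to $H$:
- $c \longrightarrow f(\cdot)$
- $x_i \longrightarrow k(x_i, \cdot)$
- $\|x_i - c\|^2_{\mathbb{R}^p} \le R^2 \longrightarrow \|k(x_i, \cdot) - f(\cdot)\|^2_H \le R^2$
Kernelized SVDD (in a RKHS) is also a QP:
$$\min_{f \in H,\ R \in \mathbb{R},\ \xi \in \mathbb{R}^n} R^2 + \mu \sum_{i=1}^n \xi_i \quad \text{with} \quad \|k(x_i, \cdot) - f(\cdot)\|^2_H \le R^2 + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$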

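24 Equivalence between SVDD and OCSVM for translation invariant kernels (diagonal constant kernels)
Theorem. Let $H$ be a RKHS on some domain of $\mathbb{R}^p$ endowed with kernel $k$. If there exists some constant $c$ such that $k(x, x) = c$ for all $x \in \mathbb{R}^p$, then the two following problems are equivalent:
$$\min_{f, R, \xi} R + \mu \sum_{i=1}^n \xi_i \quad \text{with} \quad \|k(x_i, \cdot) - f(\cdot)\|^2_H \le R + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
$$\min_{f, \rho, \varepsilon} \tfrac{1}{2}\|f\|^2_H - \rho + \mu \sum_{i=1}^n \varepsilon_i \quad \text{with} \quad f(x_i) \ge \rho - \varepsilon_i \ \text{and}\ \varepsilon_i \ge 0, \quad i = 1, \dots, n$$
with $\rho = \tfrac{1}{2}\left(c + \|f\|^2_H - R\right)$ and $\varepsilon_i = \tfrac{1}{2}\xi_i$.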
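The theorem is stated without proof on the slide. A possible derivation, sketched here with the same change of variables and the reproducing property $\langle k(x_i, \cdot), f \rangle_H = f(x_i)$, runs as follows.

```latex
\begin{align*}
\|k(x_i,\cdot) - f\|_H^2 \le R + \xi_i
  &\iff k(x_i, x_i) - 2 f(x_i) + \|f\|_H^2 \le R + \xi_i \\
  &\iff c - 2 f(x_i) + \|f\|_H^2 \le R + \xi_i \\
  &\iff f(x_i) \ge \tfrac{1}{2}\left(c + \|f\|_H^2 - R\right) - \tfrac{1}{2}\xi_i
   = \rho - \varepsilon_i , \\[4pt]
R + \mu \sum_{i=1}^n \xi_i
  &= c + \|f\|_H^2 - 2\rho + 2\mu \sum_{i=1}^n \varepsilon_i
   = 2\Big(\tfrac{1}{2}\|f\|_H^2 - \rho + \mu \sum_{i=1}^n \varepsilon_i\Big) + c .
\end{align*}
```

The two objectives differ only by the factor 2 and the additive constant $c$, so they share the same minimizers.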

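25 SVDD in a RKHS: KKT, Dual and $R^2$
$$L = R^2 + \mu \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(\|k(x_i, \cdot) - f(\cdot)\|^2_H - R^2 - \xi_i\right) - \sum_{i=1}^n \beta_i \xi_i$$
$$\phantom{L} = R^2 + \mu \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(k(x_i, x_i) - 2 f(x_i) + \|f\|^2_H - R^2 - \xi_i\right) - \sum_{i=1}^n \beta_i \xi_i$$
KKT conditions:
- stationarity: $2 f(\cdot) \sum_{i=1}^n \alpha_i - 2 \sum_{i=1}^n \alpha_i k(\cdot, x_i) = 0$ (the representer theorem), $1 - \sum_{i=1}^n \alpha_i = 0$ and $\mu - \alpha_i - \beta_i = 0$
- primal admissibility: $\|k(x_i, \cdot) - f(\cdot)\|^2 \le R^2 + \xi_i$, $\xi_i \ge 0$
- dual admissibility: $\alpha_i \ge 0$, $\beta_i \ge 0$
- complementarity: $\alpha_i \left(\|k(x_i, \cdot) - f(\cdot)\|^2 - R^2 - \xi_i\right) = 0$ and $\beta_i \xi_i = 0$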

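26 SVDD in a RKHS: Dual and $R^2$
$$L(\alpha) = \sum_{i=1}^n \alpha_i k(x_i, x_i) - 2 \sum_{i=1}^n \alpha_i f(x_i) + \|f\|^2_H \quad \text{with} \quad f(\cdot) = \sum_{j=1}^n \alpha_j k(\cdot, x_j)$$
$$\phantom{L(\alpha)} = \sum_{i=1}^n \alpha_i k(x_i, x_i) - \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \underbrace{k(x_i, x_j)}_{G_{ij}}$$
With $G_{ij} = k(x_i, x_j)$:
$$\min_{\alpha} \alpha^\top G \alpha - \alpha^\top \mathrm{diag}(G) \quad \text{with} \quad e^\top \alpha = 1 \ \text{and}\ 0 \le \alpha_i \le \mu, \quad i = 1, \dots, n$$
As in the linear case, $R^2 = \nu + \|f\|^2_H$, with $\nu$ denoting the Lagrange multiplier associated with the equality constraint $\sum_{i=1}^n \alpha_i = 1$.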

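27 Kernelized SVDD primal vs. dual
Primal:
$$\min_{f, R, \xi} R + \mu \sum_{i=1}^n \xi_i \quad \text{with} \quad \|k(x_i, \cdot) - f(\cdot)\|^2_H \le R + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
- $f \in H$ plus $n + 1$ unknowns
- $2n$ constraints
- can be recast as a QP
- intractable when $d = \infty$
Dual:
$$\min_{\alpha} \alpha^\top G \alpha - \alpha^\top \mathrm{diag}(G) \quad \text{with} \quad e^\top \alpha = 1 \ \text{and}\ 0 \le \alpha_i \le \mu, \quad i = 1, \dots, n$$
- $n$ unknowns, with $G$ the pairwise influence Gram matrix
- $2n$ box constraints
- QP, tractable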

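28 SVDD train and val in a RKHS
Train using the dual form (in: $G$, $\mu$; out: $\alpha$, $\nu$):
$$\min_{\alpha} \alpha^\top G \alpha - \alpha^\top \mathrm{diag}(G) \quad \text{with} \quad e^\top \alpha = 1 \ \text{and}\ 0 \le \alpha_i \le \mu, \quad i = 1, \dots, n$$
Val with the center in the RKHS, $f(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$:
$$\phi(x) = \|k(x, \cdot) - f(\cdot)\|^2_H - R^2$$
$$\phantom{\phi(x)} = \|k(x, \cdot)\|^2_H - 2 \langle k(x, \cdot), f(\cdot) \rangle_H + \|f(\cdot)\|^2_H - R^2$$
$$\phantom{\phi(x)} = k(x, x) - 2 f(x) + R^2 - \nu - R^2$$
$$\phantom{\phi(x)} = -2 f(x) + k(x, x) - \nu$$
$$\phantom{\phi(x)} = -2 \sum_{i=1}^n \alpha_i k(x, x_i) + k(x, x) - \nu$$
$\phi(x) = 0$ is the decision border.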
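The validation step can be sketched in a few lines. Everything specific below is an assumption made for the example: the two training points, the Gaussian bandwidth $b = 8$, and the weights $\alpha = (1/2, 1/2)$, which solve the dual for two points by symmetry. Instead of reading $\nu$ from a QP solver, the sketch recovers it from a support vector, where $\phi$ vanishes.

```python
import math

def gaussian_kernel(s, t, b=8.0):
    return math.exp(-sum((si - ti) ** 2 for si, ti in zip(s, t)) / b)

def f(x, X, alpha, k):
    """Center of the ball in the RKHS, evaluated at x: f(x) = sum_i alpha_i k(x, x_i)."""
    return sum(a * k(x, xi) for a, xi in zip(alpha, X))

def nu_from_support_vector(x_sv, X, alpha, k):
    """phi(x_sv) = 0 on the boundary, hence nu = k(x_sv, x_sv) - 2 f(x_sv)."""
    return k(x_sv, x_sv) - 2.0 * f(x_sv, X, alpha, k)

def phi(x, X, alpha, nu, k):
    """Decision function: negative inside the ball, zero on the border, positive outside."""
    return k(x, x) - 2.0 * f(x, X, alpha, k) - nu

X = [[0.0, 0.0], [2.0, 0.0]]       # toy training set
alpha = [0.5, 0.5]                 # dual solution for two points (by symmetry)
nu = nu_from_support_vector(X[0], X, alpha, gaussian_kernel)

inside = phi([1.0, 0.0], X, alpha, nu, gaussian_kernel)    # midpoint
outside = phi([10.0, 0.0], X, alpha, nu, gaussian_kernel)  # far away
print(inside, outside)
```

With a much smaller bandwidth the midpoint would fall outside the description, since the level set then splits into one blob per training point.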

29 An important theoretical result
For a well-calibrated bandwidth, the kernel SVDD estimates the underlying distribution level set [Vert and Vert, 2006].
The level sets of a probability density function $\mathbb{P}(x)$ are the sets
$$C_p = \{x \in \mathbb{R}^d \mid \mathbb{P}(x) \ge p\}$$
They are well estimated by the empirical minimum volume set
$$V_p = \{x \in \mathbb{R}^d \mid \|k(x, \cdot) - f(\cdot)\|^2_H - R^2 \le 0\}$$
The frontiers coincide.

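30 SVDD: the generalization error
For a well-calibrated bandwidth, let $(x_1, \dots, x_n)$ be i.i.d. from some fixed but unknown $\mathbb{P}(x)$. Then [Shawe-Taylor and Cristianini, 2004], with probability at least $1 - \delta$ (for all $\delta \in\ ]0, 1[$), for any margin $m > 0$:
$$\mathbb{P}\left(\|k(x, \cdot) - f(\cdot)\|^2_H \ge R^2 + m\right) \le \frac{1}{mn} \sum_{i=1}^n \xi_i + \frac{6 R^2}{m \sqrt{n}} + 3 \sqrt{\frac{\ln(2/\delta)}{2n}}$$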


32 The two class Support vector data description (SVDD)
$$\min_{c, R, \xi^+, \xi^-} R^2 + \mu \Big( \sum_{y_i = 1} \xi_i^+ + \sum_{y_i = -1} \xi_i^- \Big)$$
$$\text{with} \quad \|x_i - c\|^2 \le R^2 + \xi_i^+,\ \xi_i^+ \ge 0 \quad \text{for } i \text{ such that } y_i = 1$$
$$\text{and} \quad \|x_i - c\|^2 \ge R^2 - \xi_i^-,\ \xi_i^- \ge 0 \quad \text{for } i \text{ such that } y_i = -1$$

33 The two class SVDD as a QP
Expanding the norms:
$$\|x_i\|^2 - 2 x_i^\top c + \|c\|^2 \le R^2 + \xi_i^+,\ \xi_i^+ \ge 0 \quad (y_i = 1)$$
$$\|x_i\|^2 - 2 x_i^\top c + \|c\|^2 \ge R^2 - \xi_i^-,\ \xi_i^- \ge 0 \quad (y_i = -1)$$
that is
$$2 x_i^\top c \ge \|c\|^2 - R^2 + \|x_i\|^2 - \xi_i^+,\ \xi_i^+ \ge 0 \quad (y_i = 1)$$
$$-2 x_i^\top c \ge -\|c\|^2 + R^2 - \|x_i\|^2 - \xi_i^-,\ \xi_i^- \ge 0 \quad (y_i = -1)$$
or, in a single expression,
$$2 y_i x_i^\top c \ge y_i \left(\|c\|^2 - R^2 + \|x_i\|^2\right) - \xi_i,\ \xi_i \ge 0, \quad i = 1, \dots, n$$
Change of variable: $\rho = \|c\|^2 - R^2$
$$\min_{c, \rho, \xi} \|c\|^2 - \rho + \mu \sum_{i=1}^n \xi_i \quad \text{with} \quad 2 y_i x_i^\top c \ge y_i \left(\rho + \|x_i\|^2\right) - \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$

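34 The dual of the two class SVDD
With $G_{ij} = y_i y_j x_i^\top x_j$, the dual formulation is
$$\min_{\alpha \in \mathbb{R}^n} \alpha^\top G \alpha - \sum_{i=1}^n \alpha_i y_i \|x_i\|^2 \quad \text{with} \quad \sum_{i=1}^n y_i \alpha_i = 1 \ \text{and}\ 0 \le \alpha_i \le \mu, \quad i = 1, \dots, n$$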

35 The two class SVDD vs. one class SVDD
Figure: the two class SVDD (left) vs. the one class SVDD (right).

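36 Small Sphere and Large Margin (SSLM) approach
Support vector data description with margin [Wu and Ye, 2009]:
$$\min_{c, R, \xi \in \mathbb{R}^n} R^2 + \mu \Big( \sum_{y_i = 1} \xi_i^+ + \sum_{y_i = -1} \xi_i^- \Big)$$
$$\text{with} \quad \|x_i - c\|^2 \le R^2 - 1 + \xi_i^+,\ \xi_i^+ \ge 0 \quad \text{for } i \text{ such that } y_i = 1$$
$$\text{and} \quad \|x_i - c\|^2 \ge R^2 + 1 - \xi_i^-,\ \xi_i^- \ge 0 \quad \text{for } i \text{ such that } y_i = -1$$
Both constraints can be written as
$$y_i \|x_i - c\|^2 \le y_i R^2 - 1 + \xi_i$$
$$L(c, R, \xi, \alpha, \beta) = R^2 + \mu \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(y_i \|x_i - c\|^2 - y_i R^2 + 1 - \xi_i\right) - \sum_{i=1}^n \beta_i \xi_i$$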

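37 SVDD with margin: dual formulation
$$L(c, R, \xi, \alpha, \beta) = R^2 + \mu \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(y_i \|x_i - c\|^2 - y_i R^2 + 1 - \xi_i\right) - \sum_{i=1}^n \beta_i \xi_i$$
Optimality:
$$c = \sum_{i=1}^n \alpha_i y_i x_i; \qquad \sum_{i=1}^n \alpha_i y_i = 1; \qquad 0 \le \alpha_i \le \mu$$
$$L(\alpha) = \sum_{i=1}^n \alpha_i y_i \Big\| x_i - \sum_{j=1}^n \alpha_j y_j x_j \Big\|^2 + \sum_{i=1}^n \alpha_i = -\sum_{i=1}^n \sum_{j=1}^n \alpha_j \alpha_i y_i y_j x_j^\top x_i + \sum_{i=1}^n \|x_i\|^2 y_i \alpha_i + \sum_{i=1}^n \alpha_i$$
The dual SVDD is also a QP, very close to SVM (problem D):
$$\min_{\alpha \in \mathbb{R}^n} \alpha^\top G \alpha - e^\top \alpha - f^\top \alpha \quad \text{with} \quad y^\top \alpha = 1 \ \text{and}\ 0 \le \alpha_i \le \mu, \quad i = 1, \dots, n$$
with $G$ a symmetric $n \times n$ matrix such that $G_{ij} = y_i y_j x_j^\top x_i$, and $f_i = \|x_i\|^2 y_i$.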


39 SVDD + outlier: the problem
$$\min_{R, c, \xi} R + \mu \sum_{i=1}^n \xi_i \quad \text{with} \quad \|x_i - c\|^2 \le R + m + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n \qquad (1)$$
Figure: example of SVDD solutions with different $\mu$ values (panels: $\mu = 1/16$, $\mu = 1/8$, $\mu = 1/4$, $\mu = 1/2$), for $m = 0$ (red) and $m = 5$ (magenta). The circled data points represent support vectors for both $m$.

40 Chasing outliers with the $L_0$ (pseudo) norm
SVDD is sensitive to the presence of outliers in the data.
Allowing $t$ outliers (and no errors):
$$\|\xi\|_0 = \mathrm{card}\{i \mid \xi_i \ne 0\} \le t$$
$L_0$ SVDD:
$$\min_{c \in \mathbb{R}^p,\ R \in \mathbb{R},\ \xi \in \mathbb{R}^n} R + \mu \|\xi\|_0 \quad \text{with} \quad \|x_i - c\|^2 \le R + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
However, the $L_0$ pseudo-norm is non differentiable and combinatorially hard, and it does not lead to an effective algorithmic approach.

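41 $L_0$ relaxations
- $L_p$ norm: $\|\xi\|_p = \left(\sum_{i=1}^n \xi_i^p\right)^{1/p}$, $p \to 0$
- exponential: $\sum_{i=1}^n \left(1 - \exp(-\alpha \xi_i)\right)$, $\alpha \to \infty$
- piecewise linear: $\sum_{i=1}^n \min\left(1, \dfrac{\xi_i}{\alpha}\right)$, $\alpha \to 0$
- log: $\sum_{i=1}^n \log\left(1 + \dfrac{\xi_i}{\alpha}\right)$, $\alpha \to 0$
Figure: the exponential, piecewise linear and log relaxations.
$L_0$ log relaxation SVDD:
$$\min_{c \in \mathbb{R}^p,\ R \in \mathbb{R},\ \xi \in \mathbb{R}^n} R + \mu \sum_{i=1}^n \log(\gamma + \xi_i) \quad \text{with} \quad \|x_i - c\|^2 \le R + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
The $L_0$ log relaxation SVDD is differentiable; however, it is not convex.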

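42 DC programming [An and Tao, 2005, Gasso et al., 2009]
The DC (Difference of Convex functions) decomposition:
$$\log(\gamma + \xi) = f(\xi) - g(\xi) \quad \text{with} \quad f(\xi) = \xi \quad \text{and} \quad g(\xi) = \xi - \log(\gamma + \xi)$$
both functions $f$ and $g$ being convex.
The DC algorithm iteratively minimizes a convex surrogate, obtained by replacing $g$ with its linearization at the previous iterate. Up to an additive constant,
$$\log(\gamma + \xi) \approx f(\xi) - g'(\xi^{old})\, \xi = \xi - \Big(1 - \frac{1}{\gamma + \xi^{old}}\Big) \xi = \underbrace{\frac{1}{\gamma + \xi^{old}}}_{w}\, \xi \quad \text{with} \quad w = \frac{1}{\gamma + \xi^{old}}$$
where $\xi^{old}$ denotes the solution at the previous iteration.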
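Because the logarithm is concave, the linearized term stays above it and touches it at $\xi^{old}$. The sketch below checks this on a grid; the values of $\gamma$ and $\xi^{old}$ are arbitrary, and the function names are hypothetical.

```python
import math

def log_penalty(xi, gamma):
    return math.log(gamma + xi)

def dc_weight(xi_old, gamma):
    """Weight w = 1 / (gamma + xi_old) produced by linearizing at xi_old."""
    return 1.0 / (gamma + xi_old)

def linear_majorant(xi, xi_old, gamma):
    """Tangent of the log penalty at xi_old: log(gamma + xi_old) + w * (xi - xi_old)."""
    return log_penalty(xi_old, gamma) + dc_weight(xi_old, gamma) * (xi - xi_old)

gamma = 0.5
grid = [0.1 * k for k in range(101)]            # slack values from 0 to 10
gaps = [linear_majorant(xi, xi_old, gamma) - log_penalty(xi, gamma)
        for xi_old in (0.0, 0.5, 3.0) for xi in grid]
print(min(gaps))
```

Minimizing the tangent therefore never increases the log penalty, which is why each reweighted problem makes progress on the relaxed objective.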

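43 DC applied to the $L_0$ SVDD log relaxation
The $L_0$ SVDD
$$\min_{c, R, \xi} R + \mu \|\xi\|_0 \quad \text{with} \quad \|x_i - c\|^2 \le R + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
is relaxed into
$$\min_{c, R, \xi} R + \mu \sum_{i=1}^n \log(\gamma + \xi_i) \quad \text{with} \quad \|x_i - c\|^2 \le R + \xi_i \ \text{and}\ \xi_i \ge 0, \quad i = 1, \dots, n$$
The DC idea applied to our $L_0$ SVDD approximation consists in building a sequence of solutions of the following adaptive SVDD:
while not converged do
    $\xi_i^{old} = \xi_i$, $i = 1, \dots, n$
    solve $\min_{c \in \mathbb{R}^p,\ R \in \mathbb{R},\ \xi \in \mathbb{R}^n} R + \mu \sum_{i=1}^n w_i \xi_i$ with $\|x_i - c\|^2 \le R + \xi_i$ and $\xi_i \ge 0$, $i = 1, \dots, n$
with $w_i = \dfrac{1}{\gamma + \xi_i^{old}}$.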

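44 Dual formulation (to be used with kernels)
$$L(c, R, \xi, \alpha, \gamma) = R^2 + \mu \sum_{i=1}^n w_i \xi_i + \sum_{i=1}^n \alpha_i \left(\|x_i - c\|^2 - R^2 - \xi_i\right) - \sum_{i=1}^n \gamma_i \xi_i$$
KKT conditions:
- stationarity: $2c \sum_{i=1}^n \alpha_i - 2 \sum_{i=1}^n \alpha_i x_i = 0$ (the representer theorem), $1 - \sum_{i=1}^n \alpha_i = 0$ and $\mu w_i - \alpha_i - \gamma_i = 0$, $i = 1, \dots, n$
- primal admissibility: $\|x_i - c\|^2 \le R^2 + \xi_i$, $i = 1, \dots, n$
- dual admissibility: $\alpha_i \ge 0$, $\gamma_i \ge 0$, $i = 1, \dots, n$
- complementarity: $\alpha_i \left(\|x_i - c\|^2 - R^2 - \xi_i\right) = 0$, $i = 1, \dots, n$
Adaptive SVDD in the dual:
$$\min_{\alpha \in \mathbb{R}^n} \alpha^\top XX^\top \alpha - \alpha^\top \mathrm{diag}(XX^\top) \quad \text{with} \quad \sum_{i=1}^n \alpha_i = 1 \ \text{and}\ 0 \le \alpha_i \le \mu w_i, \quad i = 1, \dots, n \qquad (2)$$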

45 Robust SVDD pseudo code
Algorithm 1: Robust $L_0$ SVDD for the linear kernel
Data: X, µ, γ
Result: R, c, ξ, α
w_i = 1, i = 1, ..., n
while not converged do
    (α, λ) ← solve_qp(X, µ, w)        % solve problem (2)
    c ← X'α
    R ← λ + c'c
    ξ_i ← max(0, ‖x_i − c‖² − R), i = 1, ..., n
    w_i ← 1/(γ + ξ_i), i = 1, ..., n
end
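Algorithm 1 can be prototyped without an external QP library. The sketch below is not the authors' implementation and departs from the pseudo code in two labeled ways: problem (2) is solved approximately by a Frank-Wolfe loop (the name `solve_qp` is reused for it), and $R$ is obtained as the minimizer of the primal objective for the current center rather than from the multiplier $\lambda$. The toy data (eight points on the unit circle plus one outlier) and the values of $\mu$ and $\gamma$ are assumptions of this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fill_caps(order, upper):
    """Visit indices in `order`, giving each as much mass as its cap allows, until the total is 1."""
    s, remaining = [0.0] * len(upper), 1.0
    for i in order:
        s[i] = min(upper[i], remaining)
        remaining -= s[i]
    if remaining > 1e-12:
        raise ValueError("infeasible: the caps mu * w_i sum to less than 1")
    return s

def solve_qp(G, upper, n_iter=500):
    """Frank-Wolfe sketch for problem (2): min a'Ga - a'diag(G), sum(a) = 1, 0 <= a_i <= upper_i."""
    n = len(G)
    alpha = fill_caps(sorted(range(n), key=lambda i: -G[i][i]), upper)
    for _ in range(n_iter):
        grad = [2.0 * dot(G[i], alpha) - G[i][i] for i in range(n)]
        s = fill_caps(sorted(range(n), key=lambda i: grad[i]), upper)
        d = [si - ai for si, ai in zip(s, alpha)]
        slope = dot(grad, d)
        if slope > -1e-12:                      # no descent direction left
            break
        curv = dot(d, [dot(row, d) for row in G])
        eta = 1.0 if curv <= 0.0 else min(1.0, -slope / (2.0 * curv))  # exact line search
        alpha = [ai + eta * di for ai, di in zip(alpha, d)]
    return alpha

def radius_given_center(d2, upper):
    """Minimizer over R of R + sum_i upper_i * max(0, d2_i - R): a weighted quantile of d2."""
    total = 0.0
    for i in sorted(range(len(d2)), key=lambda i: -d2[i]):
        total += upper[i]
        if total >= 1.0:
            return d2[i]
    raise ValueError("infeasible: the caps mu * w_i sum to less than 1")

def robust_l0_svdd(X, mu, gamma, n_outer=5):
    n, dim = len(X), len(X[0])
    G = [[dot(xi, xj) for xj in X] for xi in X]
    w = [1.0] * n
    for _ in range(n_outer):                    # fixed number of reweighting rounds
        upper = [mu * wi for wi in w]
        alpha = solve_qp(G, upper)
        c = [sum(a * x[k] for a, x in zip(alpha, X)) for k in range(dim)]
        d2 = [sum((xk - ck) ** 2 for xk, ck in zip(x, c)) for x in X]
        R = radius_given_center(d2, upper)      # R is the squared radius here
        xi = [max(0.0, v - R) for v in d2]
        w = [1.0 / (gamma + v) for v in xi]
    return R, c, xi, alpha

X = [[math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)] for k in range(8)] + [[10.0, 0.0]]
R, c, xi, alpha = robust_l0_svdd(X, mu=0.3, gamma=1.0)
print(R, c, xi[-1])
```

In this toy run the first round is dragged toward the outlier, whose weight is capped at $\mu$; after reweighting, its cap $\mu w_i$ shrinks and the center moves back toward the eight inliers. A fixed number of outer rounds replaces the convergence test of the pseudo code.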

46 Robust kernel SVDD pseudo code
Algorithm 2: Robust $L_0$ SVDD for the kernel k
Data: X, k, µ, γ
Result: ξ, α
w_i = 1, i = 1, ..., n
K ← kernel(X, X, k)
while not converged do
    (α, λ) ← solve_qp(K, µ, w)        % solve problem (2)
    ξ_i ← max(0, −2(Kα)_i + K_ii − λ), i = 1, ..., n
    w_i ← 1/(γ + ξ_i), i = 1, ..., n
end
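The two update lines of Algorithm 2 only need the Gram matrix. The sketch below isolates them; the QP solve is left out, and the values of $\alpha$ and $\lambda$ are supplied by hand for a three-point example with the linear kernel (points $(0,0)$, $(2,0)$ and $(5,0)$, ball centered at $(1,0)$ with squared radius 1, hence $\lambda = R - \alpha^\top K \alpha = 0$). These values and the function names are assumptions of this example.

```python
def kernel_slacks(K, alpha, lam):
    """xi_i = max(0, K_ii - 2 (K alpha)_i - lambda)."""
    n = len(K)
    Ka = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    return [max(0.0, K[i][i] - 2.0 * Ka[i] - lam) for i in range(n)]

def reweight(xi, gamma):
    """w_i = 1 / (gamma + xi_i)."""
    return [1.0 / (gamma + v) for v in xi]

# Linear-kernel Gram matrix of the points (0,0), (2,0), (5,0).
K = [[0.0, 0.0, 0.0],
     [0.0, 4.0, 10.0],
     [0.0, 10.0, 25.0]]
alpha = [0.5, 0.5, 0.0]   # the first two points support the ball
lam = 0.0                 # lambda = R - alpha' K alpha = 1 - 1

xi = kernel_slacks(K, alpha, lam)
w = reweight(xi, gamma=1.0)
print(xi, w)
```

The third point sits at squared distance 16 from the center, so its slack is 15 and its weight drops to 1/16 for the next round.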

47 $L_0$ SVDD at work
Figures (four slides): $L_0$ SVDD at work; the images were not recovered.

51 Conclusion
Applications:
- outlier detection
- change detection
- clustering
- large number of classes
- variable selection...
A clear path:
- reformulation (to a standard problem)
- KKT
- dual
- bidual
A lot of variations:
- $L_2$ SVDD
- two classes, non symmetric
- two classes in the symmetric case (SVM)
- the multi classes issue
Practical problems with non translation invariant kernels.

52 Bibliography
- LeThiHoai An and PhamDinh Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1-4):23-46, 2005.
- D. Jack Elzinga and Donald W. Hearn. The minimum covering sphere problem. Management Science, 19(1):96-104, 1972.
- Gilles Gasso, Alain Rakotomamonjy, and Stéphane Canu. Recovering sparse signals with a certain family of nonconvex penalties and DC programming. Signal Processing, IEEE Transactions on, 57(12), 2009.
- KiYoung Lee, Dae-Won Kim, Kwang Hyung Lee, and Doheon Lee. Density-induced support vector data description. Neural Networks, IEEE Transactions on, 18(1), 2007.
- Bo Liu, Yanshan Xiao, Longbing Cao, Zhifeng Hao, and Feiqi Deng. SVDD-based outlier detection on uncertain data. Knowledge and Information Systems, 34(3), 2013.
- Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77-91, 1952.
- B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
- John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
- James Joseph Sylvester. A question in the geometry of situation. Quarterly Journal of Pure and Applied Mathematics, 1, 1857.
- David M. J. Tax and Robert P. W. Duin. Support vector data description. Machine Learning, 54(1):45-66, 2004.
- Régis Vert and Jean-Philippe Vert. Consistency and convergence rates of one-class SVMs and related algorithms. The Journal of Machine Learning Research, 7, 2006.
- Mingrui Wu and Jieping Ye. A small sphere and large margin approach for novelty detection using training data with outliers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(11), 2009.
