Efficient Maximum Margin Clustering


1 Dept. of Automation, Tsinghua University. MLA, Nov. 8, 2008, Nanjing, China

2 Outline

3 Outline

4 Support Vector Machine Given X = {x_1, ..., x_n} and y = (y_1, ..., y_n) ∈ {−1, +1}^n, SVM finds a hyperplane f(x) = w^T φ(x) + b by solving

min_{w,b,ξ} (1/2) w^T w + (C/n) Σ_{i=1}^n ξ_i    (1)
s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n
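As an aside (not from the talk): with a linear kernel, the soft-margin objective in (1) can be minimized by plain subgradient descent on its hinge-loss form. A minimal sketch, with toy data, step size, and iteration count chosen purely for illustration:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on (1/2)||w||^2 + (C/n) sum_i max(0, 1 - y_i(w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1          # points inside the margin contribute to the subgradient
        gw = w - (C / n) * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -(C / n) * y[viol].sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

# toy linearly separable data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y, C=10.0)
pred = np.sign(X @ w + b)
```

In practice one would use a dedicated QP or SMO solver; the sketch only makes the objective in (1) concrete.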

5 Maximum Margin Clustering [Xu et al. 2004] MMC aims to find not only the optimal hyperplane (w*, b*) but also the optimal labeling vector y*:

min_{y ∈ {−1,+1}^n} min_{w,b,ξ} (1/2) w^T w + (C/n) Σ_{i=1}^n ξ_i    (2)
s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n

6 Representative Works Semi-definite programming [Xu et al. (NIPS 2004)]. Several relaxations made; n^2 variables in the SDP; time complexity O(n^7).

7 Representative Works Semi-definite programming [Valizadegan and Jin (NIPS 2006)]. Reduces the number of variables from n^2 to n; time complexity O(n^4); handles only the 2-class scenario.

8 Representative Works Alternating optimization [Zhang et al. (ICML 2007)]. Involves a sequence of QPs; the number of iterations is not theoretically guaranteed; handles only the 2-class scenario.

9 Outline

10 Problem Reformulation Theorem. Maximum margin clustering is equivalent to

min_{w,b,ξ} (1/2) w^T w + (C/n) Σ_{i=1}^n ξ_i    (3)
s.t. |w^T φ(x_i) + b| ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n

where the labeling vector is recovered as y_i = sign(w^T φ(x_i) + b).

11 Problem Reformulation Theorem. Any solution (w*, b*) to problem (4) is also a solution to problem (3) (and vice versa), with ξ* = (1/n) Σ_{i=1}^n ξ_i*:

min_{w,b,ξ≥0} (1/2) w^T w + C ξ    (4)
s.t. ∀ c ∈ {0,1}^n:  (1/n) Σ_{i=1}^n c_i |w^T φ(x_i) + b| ≥ (1/n) Σ_{i=1}^n c_i − ξ

12 Problem Reformulation Number of variables reduced by 2n − 1. Number of constraints increased from n to 2^n. We can always find a polynomially sized subset of constraints with which the solution of the relaxed problem fulfills all constraints from problem (4) up to a precision of ɛ.

13 Cutting Plane Algorithm [J. E. Kelley 1960] Start with an empty constraint subset Ω. Compute the optimal solution to problem (4) subject to the constraints in Ω. Find the most violated constraint in problem (4) and add it to Ω. Stop when no constraint in (4) is violated by more than ɛ:

(1/n) Σ_{i=1}^n c_i |w^T φ(x_i) + b| ≥ (1/n) Σ_{i=1}^n c_i − (ξ + ɛ)    (5)

14 The Most Violated Constraint Theorem. The most violated constraint can be computed as

c_i = 1 if |w^T φ(x_i) + b| < 1,  c_i = 0 otherwise    (6)

The feasibility of a constraint is measured by the corresponding value of ξ:

(1/n) Σ_{i=1}^n c_i |w^T φ(x_i) + b| ≥ (1/n) Σ_{i=1}^n c_i − ξ    (7)
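The selection rule in (6) and the feasibility value in (7) vectorize directly; a sketch assuming a linear kernel and toy numbers (not from the talk):

```python
import numpy as np

def most_violated_constraint(X, w, b):
    """c_i = 1 iff sample i lies inside the margin: |w.x_i + b| < 1 (eq. 6)."""
    return (np.abs(X @ w + b) < 1).astype(int)

def constraint_slack(X, w, b, c):
    """Smallest xi making constraint (7) feasible for the given c."""
    n = len(c)
    lhs = (c * np.abs(X @ w + b)).sum() / n
    rhs = c.sum() / n
    return max(0.0, rhs - lhs)

X = np.array([[2.0, 0.0], [0.2, 0.0], [-1.5, 0.0]])
w, b = np.array([1.0, 0.0]), 0.0
c = most_violated_constraint(X, w, b)   # only the middle point is inside the margin
```

The same slack value drives the stopping test of the cutting-plane loop: terminate once the most violated constraint needs at most ξ + ɛ.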

15 Enforcing the Class Balance Constraint Enforce a class balance constraint to avoid trivially optimal solutions:

min_{w,b,ξ≥0} (1/2) w^T w + C ξ    (8)
s.t. ∀ c ∈ Ω:  (1/n) Σ_{i=1}^n c_i |w^T φ(x_i) + b| ≥ (1/n) Σ_{i=1}^n c_i − ξ
−l ≤ Σ_{i=1}^n (w^T φ(x_i) + b) ≤ l

16 The Constrained Concave-Convex Procedure [A. J. Smola et al. 2005] Solves non-convex optimization problems whose objective can be expressed as a difference of convex functions:

min_z f_0(z) − g_0(z)    (9)
s.t. f_i(z) − g_i(z) ≤ c_i,  i = 1, ..., n

where the f_i and g_i are real-valued convex functions on a vector space Z and c_i ∈ R for all i = 1, ..., n.

17 The Constrained Concave-Convex Procedure Given an initial point z_0, the CCCP computes z_{t+1} from z_t by replacing each g_i(z) with its first-order Taylor expansion at z_t, written T_1{g_i, z_t}(z):

min_z f_0(z) − T_1{g_0, z_t}(z)    (10)
s.t. f_i(z) − T_1{g_i, z_t}(z) ≤ c_i,  i = 1, ..., n
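A toy single-variable illustration of the CCCP iteration (not from the talk): minimize f_0(z) − g_0(z) with f_0(z) = z^2 and g_0(z) = 2|z|, both convex, whose local minimizers are z = ±1. Each step replaces g_0 by its tangent at z_t and solves the resulting convex surrogate in closed form:

```python
def cccp(z0, iters=20):
    """CCCP for min f0(z) - g0(z) with f0(z) = z^2, g0(z) = 2|z|.

    Each iteration linearizes g0 at z_t (subgradient s = sign(z_t)) and
    minimizes the convex surrogate z^2 - 2*s*z, whose minimizer is z = s.
    """
    z = z0
    for _ in range(iters):
        s = 1.0 if z >= 0 else -1.0   # subgradient of g0 at z_t
        z = s                          # argmin of the surrogate z^2 - 2*s*z
    return z

z_star = cccp(0.3)   # converges to the local minimizer z = 1
```

Because the surrogate upper-bounds the true objective and is tight at z_t, the sequence of objective values is non-increasing, which is the property the CPMMC analysis relies on.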

18 Optimization via the CCCP Substituting the first-order Taylor expansion into problem (8) yields the following quadratic programming (QP) problem:

min_{w,b,ξ} (1/2) w^T w + C ξ    (11)
s.t. ξ ≥ 0
−l ≤ Σ_{i=1}^n (w^T φ(x_i) + b) ≤ l
∀ c ∈ Ω:  (1/n) Σ_{i=1}^n c_i − ξ − (1/n) Σ_{i=1}^n c_i sign(w_t^T φ(x_i) + b_t) (w^T φ(x_i) + b) ≤ 0

19 Justification of CPMMC Theorem. For any dataset X = {x_1, ..., x_n} and any ɛ > 0, the CPMMC algorithm for maximum margin clustering returns a point (w, b, ξ) for which (w, b, ξ + ɛ) is feasible in problem (4).

20 Time Complexity Analysis Theorem. Each iteration of CPMMC takes time O(sn) for a constant working set size |Ω|. Theorem. For any ɛ > 0, C > 0, and any dataset X = {x_1, ..., x_n}, the CPMMC algorithm terminates after adding at most CR/ɛ^2 constraints, where R is a constant independent of n and s.

21 Time Complexity Analysis Theorem. For any dataset X = {x_1, ..., x_n} with n samples and sparsity s, and any fixed C > 0 and ɛ > 0, the CPMMC algorithm takes time O(sn).

22 Clustering Error Comparison [Table: clustering error of KM, NC, MMC, GMMC, IterSVR, and CPMMC on the Digits, Ionosphere, Letter, Satellite, and Text datasets; numeric values not recoverable from this transcription.] For the UCI digits and MNIST datasets, we give a thorough comparison by considering all 45 pairs of digits 0-9. For NC/MMC/GMMC/IterSVR, results on the digits and ionosphere data are copied from (Zhang et al., 2007).

23 Speed of CPMMC [Table: CPU time of KM, NC, GMMC, IterSVR, and CPMMC on the Digits, Ionosphere, Letter, Satellite, and Text datasets; numeric values not recoverable from this transcription.]

24 Dataset Size n vs. Speed [Figure: CPU time (seconds) vs. number of samples on Letter, Satellite, Text 1, Text 2, MNIST 1vs2, and MNIST 1vs7, with an O(n) reference line.]

25 ɛ vs. Accuracy & Speed [Figure: (a) clustering accuracy vs. ɛ and (b) CPU time (seconds) vs. ɛ, with an O(x^0.5) reference line, on UCI Digits 3vs8, 1vs7, 2vs7, 8vs9, Letter, Text 1, and Text 2.]

26 C vs. Accuracy & Speed [Figure: (a) clustering accuracy vs. C and (b) CPU time (seconds) vs. C, with an O(x) reference line, on UCI 3vs8, 1vs7, 2vs7, 8vs9, Letter, Satellite, and Ionosphere.]

27 Outline

28 Multi-Class Support Vector Machine [Crammer & Singer 2001] Given a point set X = {x_1, ..., x_n} and labels y = (y_1, ..., y_n) ∈ {1, ..., k}^n, the multi-class SVM defines a weight vector w_p for each class p ∈ {1, ..., k} and classifies a sample x by y* = arg max_{y ∈ {1,...,k}} w_y^T x, with the weight vectors obtained as

min_{w_1,...,w_k,ξ} (β/2) Σ_{p=1}^k ‖w_p‖^2 + (1/n) Σ_{i=1}^n ξ_i    (12)
s.t. w_{y_i}^T x_i + δ_{y_i,r} − w_r^T x_i ≥ 1 − ξ_i,  i = 1, ..., n,  r = 1, ..., k
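The decision rule y* = arg max_p w_p^T x can be sketched as follows (the weight matrix and samples are illustrative, not from the talk):

```python
import numpy as np

def predict_multiclass(W, X):
    """W: (k, d) matrix, one weight vector per class; returns argmax_p w_p.x per sample."""
    return np.argmax(X @ W.T, axis=1)

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # k = 3 classes, d = 2
X = np.array([[2.0, 0.5], [0.1, 3.0], [-1.0, -2.0]])
labels = predict_multiclass(W, X)
```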

29 Similar to the binary clustering scenario, multi-class MMC additionally minimizes over the labeling:

min_y min_{w_1,...,w_k,ξ} (β/2) Σ_{p=1}^k ‖w_p‖^2 + (1/n) Σ_{i=1}^n ξ_i    (13)
s.t. w_{y_i}^T x_i + δ_{y_i,r} − w_r^T x_i ≥ 1 − ξ_i,  i = 1, ..., n,  r = 1, ..., k

30 Problem Reformulation I Theorem.

min_{w_1,...,w_k,ξ} (β/2) Σ_{p=1}^k ‖w_p‖^2 + (1/n) Σ_{i=1}^n ξ_i    (14)
s.t. for i = 1, ..., n and r = 1, ..., k:
Σ_{p=1}^k w_p^T x_i Π_{q=1,q≠p}^k I(w_p^T x_i > w_q^T x_i) + Π_{q=1,q≠r}^k I(w_r^T x_i > w_q^T x_i) − w_r^T x_i ≥ 1 − ξ_i

where I(·) is the indicator function and the label for sample x_i is determined as y_i = Σ_{p=1}^k p Π_{q=1,q≠p}^k I(w_p^T x_i > w_q^T x_i)

31 Problem Reformulation II Theorem. Problem (14) can be equivalently formulated as problem (15), with ξ* = (1/n) Σ_{i=1}^n ξ_i*:

min_{w_1,...,w_k,ξ} (β/2) Σ_{p=1}^k ‖w_p‖^2 + ξ    (15)
s.t. ∀ c_i ∈ {e_0, e_1, ..., e_k}, i = 1, ..., n:
(1/n) Σ_{i=1}^n [ c_i^T e Σ_{p=1}^k w_p^T x_i z_ip + Σ_{p=1}^k c_ip (z_ip − w_p^T x_i) ] ≥ (1/n) Σ_{i=1}^n c_i^T e − ξ

where z_ip = Π_{q=1,q≠p}^k I(w_p^T x_i > w_q^T x_i) and each constraint c is represented as a k × n matrix c = (c_1, ..., c_n).

32 Problem Reformulation Number of variables reduced by 2n − 1. Number of constraints increased from nk to (k + 1)^n. The goal is to find a small subset of constraints with which the solution of the relaxed problem fulfills all constraints from problem (15) up to a precision of ɛ.

33 Cutting Plane Algorithm [J. E. Kelley 1960, T. Joachims 2006] Start with an empty constraint subset Ω. Compute the optimal solution to problem (15) subject to the constraints in Ω. Find the most violated constraint in problem (15) and add it to Ω. Stop when no constraint in (15) is violated by more than ɛ:

c_i ∈ {e_0, e_1, ..., e_k},  i = 1, ..., n    (16)
(1/n) Σ_{i=1}^n [ c_i^T e Σ_{p=1}^k w_p^T x_i z_ip + Σ_{p=1}^k c_ip (z_ip − w_p^T x_i) ] ≥ (1/n) Σ_{i=1}^n c_i^T e − ξ − ɛ

34 The Most Violated Constraint Theorem. Define p* = arg max_p (w_p^T x_i) and r* = arg max_{r≠p*} (w_r^T x_i) for i = 1, ..., n. The most violated constraint can be calculated as

c_i = e_{r*} if (w_{p*}^T x_i − w_{r*}^T x_i) < 1,  c_i = 0 otherwise,  i = 1, ..., n    (17)
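Rule (17) can be vectorized per sample: find the best class p*, the runner-up r*, and activate e_{r*} when the top-two margin falls below 1. A sketch with illustrative weights (not from the talk):

```python
import numpy as np

def most_violated_constraint_mc(W, X):
    """Per sample: p* = top-scoring class, r* = runner-up; pick c_i = e_{r*}
    when the margin w_{p*}.x_i - w_{r*}.x_i is below 1 (eq. 17), else c_i = 0."""
    n, k = X.shape[0], W.shape[0]
    scores = X @ W.T                      # (n, k) class scores
    order = np.argsort(-scores, axis=1)
    p, r = order[:, 0], order[:, 1]       # argmax and runner-up per sample
    margin = scores[np.arange(n), p] - scores[np.arange(n), r]
    C = np.zeros((n, k))
    hit = margin < 1
    C[np.where(hit)[0], r[hit]] = 1.0     # set c_i = e_{r*} for violated samples
    return C

W = np.array([[1.0, 0.0], [0.0, 1.0]])    # k = 2 classes
X = np.array([[3.0, 0.0], [0.6, 0.4]])    # second sample has top-two margin 0.2 < 1
C = most_violated_constraint_mc(W, X)
```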

35 Enforcing the Class Balance Constraint To avoid trivially optimal solutions:

min_{w_1,...,w_k,ξ≥0} (β/2) Σ_{p=1}^k ‖w_p‖^2 + ξ    (18)
s.t. ∀ [c_1, ..., c_n] ∈ Ω:  (1/n) Σ_{i=1}^n [ c_i^T e Σ_{p=1}^k w_p^T x_i z_ip + Σ_{p=1}^k c_ip (z_ip − w_p^T x_i) ] ≥ (1/n) Σ_{i=1}^n c_i^T e − ξ
−l ≤ Σ_{i=1}^n (w_p^T x_i − w_q^T x_i) ≤ l,  p, q = 1, ..., k

36 Optimization via the CCCP Calculate the subgradients

∇_{w_r} [ (1/n) Σ_{i=1}^n ( c_i^T e Σ_{p=1}^k w_p^T x_i z_ip + Σ_{p=1}^k c_ip z_ip ) ] |_{w=w^(t)} = (1/n) Σ_{i=1}^n c_i^T e z_ir^(t) x_i,  r = 1, ..., k    (19)

Substituting the first-order Taylor expansion into problem (18) yields a quadratic programming (QP) problem.

37 Justification of CPM3C Theorem. For any dataset X = {x_1, ..., x_n} and any ɛ > 0, the CPM3C algorithm returns a point (w_1, ..., w_k, ξ) for which (w_1, ..., w_k, ξ + ɛ) is feasible.

38 Clustering Accuracy Comparison [Table: clustering accuracy of KM, NC, MMC, and CPM3C on two Digits subsets, the Cora subsets (DS, HA, ML, OS, PL), the WebKB subsets (CL, TX, WT, WC), 20-newsgroups, and RCV1; numeric values not recoverable from this transcription.]

39 Rand Index Comparison [Table: Rand index of KM, NC, MMC, and CPM3C on the same datasets as the accuracy comparison; numeric values not recoverable from this transcription.]

40 Speed Comparison [Table: CPU time of KM vs. CPM3C on the Digits, Cora, WebKB, 20-newsgroups, and RCV1 datasets; numeric values not recoverable from this transcription.]

41 Dataset Size n vs. Speed [Figure: CPU time (seconds) vs. number of samples, with O(n) reference lines; left panel: Cora subsets and 20-newsgroups; right panel: WebKB subsets and RCV1.]

42 ɛ vs. Accuracy [Figure: clustering accuracy vs. ɛ; left panel: Cora subsets and 20-newsgroups; right panel: WebKB subsets and RCV1.]

43 ɛ vs. Speed [Figure: CPU time (seconds) vs. ɛ, with O(x^0.5) reference lines; left panel: Cora subsets and 20-newsgroups; right panel: WebKB subsets and RCV1.]

44 Outline

45 Semi-Supervised Support Vector Machine Given X = {x_1, ..., x_l, x_{l+1}, ..., x_n}, where the first l points are labeled y_i ∈ {−1, +1} and the remaining u = n − l points are unlabeled:

min_{y_{l+1},...,y_n} min_{w,b,ξ} (1/2) w^T w + (C_l/n) Σ_{i=1}^l ξ_i + (C_u/n) Σ_{j=l+1}^n ξ_j    (20)
s.t. y_i [w^T φ(x_i) + b] ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., l
y_j [w^T φ(x_j) + b] ≥ 1 − ξ_j,  ξ_j ≥ 0,  j = l + 1, ..., n

46 CutS3VM: Fast S3VM Algorithm Theorem. Problem (20) is equivalent to

min_{w,b,ξ} (1/2) w^T w + (C_l/n) Σ_{i=1}^l ξ_i + (C_u/n) Σ_{j=l+1}^n ξ_j    (21)
s.t. y_i [w^T φ(x_i) + b] ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., l
|w^T φ(x_j) + b| ≥ 1 − ξ_j,  ξ_j ≥ 0,  j = l + 1, ..., n

where the labels y_j, j = l + 1, ..., n, are calculated as y_j = sign(w^T φ(x_j) + b).

47 CutS3VM: Fast S3VM Algorithm Theorem. Problem (21) can be equivalently formulated as

min_{w,b,ξ≥0} (1/2) w^T w + ξ    (22)
s.t. ∀ c ∈ {0,1}^n:
(1/n) [ C_l Σ_{i=1}^l c_i y_i (w^T φ(x_i) + b) + C_u Σ_{j=l+1}^n c_j |w^T φ(x_j) + b| ] ≥ (1/n) [ C_l Σ_{i=1}^l c_i + C_u Σ_{j=l+1}^n c_j ] − ξ

and any solution w* to problem (22) is also a solution to problem (21) (and vice versa), with ξ* = (C_l/n) Σ_{i=1}^l ξ_i* + (C_u/n) Σ_{j=l+1}^n ξ_j*.
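Checking feasibility of a candidate c in (22) amounts to computing the smallest ξ that satisfies it, with labeled points contributing y_i(w^T φ(x_i) + b) and unlabeled points |w^T φ(x_j) + b|. A sketch with a linear kernel and illustrative numbers (not from the talk):

```python
import numpy as np

def min_slack(X, y, w, b, c, l, Cl, Cu):
    """Smallest xi satisfying constraint (22) for a given c in {0,1}^n.

    The first l rows of X are labeled (y used there only); the rest are unlabeled."""
    n = len(c)
    resp = X @ w + b
    lhs = (Cl * (c[:l] * y[:l] * resp[:l]).sum()
           + Cu * (c[l:] * np.abs(resp[l:])).sum()) / n
    rhs = (Cl * c[:l].sum() + Cu * c[l:].sum()) / n
    return max(0.0, rhs - lhs)

# toy example: one labeled point, two unlabeled, 1-D linear kernel
X = np.array([[2.0], [0.5], [-0.2]])
y = np.array([1.0, 0.0, 0.0])     # labels only meaningful for the first l entries
w, b, l = np.array([1.0]), 0.0, 1
xi = min_slack(X, y, w, b, np.ones(3), l, Cl=1.0, Cu=1.0)
```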

48 Maximum Margin Embedding Traditional embedding methods find the optimal subspace by minimizing some form of average loss or cost. MME directly finds the most discriminative subspace, in which the clusters are best separated. MME is insensitive to the actual probability distribution of patterns lying far from the separating hyperplanes.

49 Maximum Margin Embedding

min_{y,w,b,ξ} w^T w + (C/n) Σ_{i=1}^n ξ_i    (23)
s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  i = 1, ..., n
A^T w = 0
y ∈ {−1, +1}^n

where A = [w_1, ..., w_{r−1}] constrains w to be orthogonal to all previously calculated projection vectors.
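The orthogonality constraint A^T w = 0 can be enforced in practice by projecting any candidate direction onto the orthogonal complement of the previously found projection vectors; a deflation-style sketch (illustrative only, assuming the columns of A are orthonormal):

```python
import numpy as np

def project_orthogonal(w, A):
    """Project w onto the orthogonal complement of span(A's columns),
    so that A.T @ w = 0 (assumes the columns of A are orthonormal)."""
    if A.shape[1] == 0:
        return w
    return w - A @ (A.T @ w)

w1 = np.array([1.0, 0.0, 0.0])            # first projection direction (unit norm)
A = w1[:, None]                           # A = [w_1]
w2 = project_orthogonal(np.array([0.8, 0.6, 0.0]), A)
```

Repeating this step after each solve of (23) yields a sequence of mutually orthogonal maximum-margin directions.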

50 Outline

51 Improvements No loss in clustering accuracy. Major improvement in speed. Handles large real-world datasets efficiently.

52 Future works Automatically tune the parameters. Scale to even larger datasets.

53 References Bin Zhao, Fei Wang, Changshui Zhang. Maximum Margin Embedding. ICDM 2008. Bin Zhao, Fei Wang, Changshui Zhang. CutS3VM: A Fast Semi-Supervised SVM Algorithm. KDD 2008. Bin Zhao, Fei Wang, Changshui Zhang. Efficient Multiclass Maximum Margin Clustering. ICML 2008. Bin Zhao, Fei Wang, Changshui Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. SDM 2008.

54 Thanks for Listening


More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Model Selection for LS-SVM : Application to Handwriting Recognition

Model Selection for LS-SVM : Application to Handwriting Recognition Model Selection for LS-SVM : Application to Handwriting Recognition Mathias M. Adankon and Mohamed Cheriet Synchromedia Laboratory for Multimedia Communication in Telepresence, École de Technologie Supérieure,

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

1 Training and Approximation of a Primal Multiclass Support Vector Machine

1 Training and Approximation of a Primal Multiclass Support Vector Machine 1 Training and Approximation of a Primal Multiclass Support Vector Machine Alexander Zien 1,2 and Fabio De Bona 1 and Cheng Soon Ong 1,2 1 Friedrich Miescher Lab., Max Planck Soc., Spemannstr. 39, Tübingen,

More information

Support Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar

Support Vector Machines. Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Data Mining Support Vector Machines Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 Support Vector Machines Find a linear hyperplane

More information

Batch Mode Sparse Active Learning. Lixin Shi, Yuhang Zhao Tsinghua University

Batch Mode Sparse Active Learning. Lixin Shi, Yuhang Zhao Tsinghua University Batch Mode Sparse Active Learning Lixin Shi, Yuhang Zhao Tsinghua University Our work Propose an unified framework of batch mode active learning Instantiate the framework using classifiers based on sparse

More information

Machine Learning A Geometric Approach

Machine Learning A Geometric Approach Machine Learning A Geometric Approach CIML book Chap 7.7 Linear Classification: Support Vector Machines (SVM) Professor Liang Huang some slides from Alex Smola (CMU) Linear Separator Ham Spam From Perceptron

More information

Learning to Align Sequences: A Maximum-Margin Approach

Learning to Align Sequences: A Maximum-Margin Approach Learning to Align Sequences: A Maximum-Margin Approach Thorsten Joachims Tamara Galor Ron Elber Department of Computer Science Cornell University Ithaca, NY 14853 {tj,galor,ron}@cs.cornell.edu June 24,

More information

SVM May 2007 DOE-PI Dianne P. O Leary c 2007

SVM May 2007 DOE-PI Dianne P. O Leary c 2007 SVM May 2007 DOE-PI Dianne P. O Leary c 2007 1 Speeding the Training of Support Vector Machines and Solution of Quadratic Programs Dianne P. O Leary Computer Science Dept. and Institute for Advanced Computer

More information

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning. Lecture 6: Support Vector Machine. Feng Li. Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Support Vector and Kernel Methods

Support Vector and Kernel Methods SIGIR 2003 Tutorial Support Vector and Kernel Methods Thorsten Joachims Cornell University Computer Science Department tj@cs.cornell.edu http://www.joachims.org 0 Linear Classifiers Rules of the Form:

More information

Convex Optimization in Classification Problems

Convex Optimization in Classification Problems New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

Tensor Methods for Feature Learning

Tensor Methods for Feature Learning Tensor Methods for Feature Learning Anima Anandkumar U.C. Irvine Feature Learning For Efficient Classification Find good transformations of input for improved classification Figures used attributed to

More information

Support Vector Machine & Its Applications

Support Vector Machine & Its Applications Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia

More information

Support Vector Machines Explained

Support Vector Machines Explained December 23, 2008 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

Multiclass Classification with Multi-Prototype Support Vector Machines

Multiclass Classification with Multi-Prototype Support Vector Machines Journal of Machine Learning Research 6 (2005) 817 850 Submitted 2/05; Revised 5/05; Published 5/05 Multiclass Classification with Multi-Prototype Support Vector Machines Fabio Aiolli Alessandro Sperduti

More information

Back to the future: Radial Basis Function networks revisited

Back to the future: Radial Basis Function networks revisited Back to the future: Radial Basis Function networks revisited Qichao Que, Mikhail Belkin Department of Computer Science and Engineering Ohio State University Columbus, OH 4310 que, mbelkin@cse.ohio-state.edu

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 20: Expectation Maximization Algorithm EM for Mixture Models Many figures courtesy Kevin Murphy s

More information

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments

More information

MATH 829: Introduction to Data Mining and Analysis Support vector machines

MATH 829: Introduction to Data Mining and Analysis Support vector machines 1/10 MATH 829: Introduction to Data Mining and Analysis Support vector machines Dominique Guillot Departments of Mathematical Sciences University of Delaware March 11, 2016 Hyperplanes 2/10 Recall: A hyperplane

More information

How to learn from very few examples?

How to learn from very few examples? How to learn from very few examples? Dengyong Zhou Department of Empirical Inference Max Planck Institute for Biological Cybernetics Spemannstr. 38, 72076 Tuebingen, Germany Outline Introduction Part A

More information

You should be able to...

You should be able to... Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set

More information

A Randomized Algorithm for Large Scale Support Vector Learning

A Randomized Algorithm for Large Scale Support Vector Learning A Randomized Algorithm for Large Scale Support Vector Learning Krishnan S. Department of Computer Science and Automation Indian Institute of Science Bangalore-12 rishi@csa.iisc.ernet.in Chiranjib Bhattacharyya

More information

Support Vector Machine (continued)

Support Vector Machine (continued) Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This

More information

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel

More information

Brief Introduction to Machine Learning

Brief Introduction to Machine Learning Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector

More information

Learning by constraints and SVMs (2)

Learning by constraints and SVMs (2) Statistical Techniques in Robotics (16-831, F12) Lecture#14 (Wednesday ctober 17) Learning by constraints and SVMs (2) Lecturer: Drew Bagnell Scribe: Albert Wu 1 1 Support Vector Ranking Machine pening

More information