Cardinality-Constrained Clustering and Outlier Detection via Conic Optimization
Slide 1: Title

Computational Management Science 2017

Cardinality-Constrained Clustering and Outlier Detection via Conic Optimization

Presented by Kilian Schindler, École Polytechnique Fédérale de Lausanne.
Joint work with Napat Rujeerapaiboon and Daniel Kuhn (École Polytechnique Fédérale de Lausanne) and Wolfram Wiesemann (Imperial College London).

Bergamo, June 1, 2017
Slide 2: K-means Clustering

Standard K-means clustering formulation:

    \min \sum_{k=1}^{K} \sum_{i=1}^{N} \pi_i^k \, \|\xi_i - \mu_k\|^2
    \text{s.t.} \quad \mu_k \in \mathbb{R}^d, \quad \pi_i^k \in \{0,1\}, \quad \sum_{k=1}^{K} \pi_i^k = 1 \;\forall i.

Sequential K-means clustering approach (Lloyd, 1982):

Step 1: Fix $\{\mu_k\}$ and solve

    \min \sum_{k=1}^{K} \sum_{i=1}^{N} \pi_i^k \, \|\xi_i - \mu_k\|^2
    \text{s.t.} \quad \pi_i^k \in \{0,1\}, \quad \sum_{k=1}^{K} \pi_i^k = 1 \;\forall i.

(totally unimodular constraints)

Step 2: Fix $\{\pi_i^k\}$ and solve

    \min \sum_{k=1}^{K} \sum_{i=1}^{N} \pi_i^k \, \|\xi_i - \mu_k\|^2
    \text{s.t.} \quad \mu_k \in \mathbb{R}^d.

(the optimal $\mu_k$ is the average of cluster $k$)

Kilian Schindler (EPFL), CMS 2017
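The two alternating steps above can be sketched as a minimal NumPy implementation of Lloyd's algorithm (the function name, toy data, and initialization are illustrative, not taken from the slides):

```python
import numpy as np

def lloyd_kmeans(xi, mu, iters=50):
    """Alternate the two steps: (1) for fixed centroids the optimal binary
    assignment sends each point to its nearest centroid; (2) for a fixed
    assignment the optimal centroid is the average of its cluster."""
    for _ in range(iters):
        # Step 1: squared distances of every point to every centroid.
        dists = ((xi[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as its cluster average.
        mu = np.array([xi[labels == k].mean(axis=0) for k in range(len(mu))])
    return labels, mu

xi = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, mu = lloyd_kmeans(xi, mu=xi[[0, 2]].astype(float))
```

Note that, as the slides point out later, this heuristic yields no lower bound and may need many iterations; the sketch assumes no cluster ever becomes empty.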
Slide 3: Cardinality-Constrained K-means Clustering

Cardinality-constrained K-means clustering formulation:

    \min \sum_{k=1}^{K} \sum_{i=1}^{N} \pi_i^k \, \|\xi_i - \mu_k\|^2
    \text{s.t.} \quad \mu_k \in \mathbb{R}^d, \quad \pi_i^k \in \{0,1\}, \quad \sum_{k=1}^{K} \pi_i^k = 1 \;\forall i, \quad \sum_{i=1}^{N} \pi_i^k = n_k \;\forall k.

Sequential K-means clustering approach (Bennett et al., 2000):

Step 1: Fix $\{\mu_k\}$ and solve

    \min \sum_{k=1}^{K} \sum_{i=1}^{N} \pi_i^k \, \|\xi_i - \mu_k\|^2
    \text{s.t.} \quad \pi_i^k \in \{0,1\}, \quad \sum_{k=1}^{K} \pi_i^k = 1 \;\forall i, \quad \sum_{i=1}^{N} \pi_i^k = n_k \;\forall k.

(still totally unimodular constraints)

Step 2: Fix $\{\pi_i^k\}$ and solve

    \min \sum_{k=1}^{K} \sum_{i=1}^{N} \pi_i^k \, \|\xi_i - \mu_k\|^2
    \text{s.t.} \quad \mu_k \in \mathbb{R}^d.

(the optimal $\mu_k$ is the average of cluster $k$)
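Step 1 with prescribed cluster sizes is a transportation problem; total unimodularity means its LP relaxation has an integral optimum, so it can be solved exactly. A sketch of one way to do this (the helper name and toy data are mine): replicate each centroid once per available slot and solve a standard linear assignment problem with SciPy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_with_cardinalities(xi, mu, n):
    """Optimal point-to-centroid assignment when cluster k must receive
    exactly n[k] points: duplicate centroid k's cost column n[k] times,
    then solve a standard linear assignment problem."""
    cols = np.repeat(np.arange(len(mu)), n)  # one column per cluster slot
    cost = ((xi[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)[:, cols]
    rows, slots = linear_sum_assignment(cost)
    labels = np.empty(len(xi), dtype=int)
    labels[rows] = cols[slots]
    return labels

xi = np.array([[0.0], [0.2], [0.4], [10.0]])
labels = assign_with_cardinalities(xi, mu=np.array([[0.0], [10.0]]), n=[2, 2])
```

With sizes $n = (2, 2)$, the point at 0.4 is forced out of the nearer cluster because both of its slots are taken by cheaper points.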
Slide 4: Motivation for Balanced Clustering

- market segmentation
- distributed computing
- document clustering
Slide 5: Motivation for Outlier Detection

Suppose we wanted to find three (balanced) clusters in the following dataset...

- Standard K-means: objective = 25.21
- Balanced K-means: objective = 54.27

But if we could also specify a number of outliers to be removed:

- Balanced K-means with outlier detection: objective = 1.97
Slide 6: Outline

- MILP Reformulation, Conic Relaxations
- Rounding Algorithm and Recovery Guarantees
- Numerical Experiments
Slide 7: Auxiliary Lemma

Lemma (Zha et al., 2000). For vectors $\xi_1, \ldots, \xi_n \in \mathbb{R}^d$, we have that

    \sum_{i=1}^{n} \Big\| \xi_i - \frac{1}{n} \sum_{j=1}^{n} \xi_j \Big\|^2 = \frac{1}{2n} \sum_{i,j=1}^{n} \|\xi_i - \xi_j\|^2.
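A quick numerical check of the lemma on random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 7, 3
xi = rng.standard_normal((n, d))

# Left-hand side: total squared distance to the centroid.
lhs = ((xi - xi.mean(axis=0)) ** 2).sum()

# Right-hand side: all pairwise squared distances, scaled by 1/(2n).
pairwise = ((xi[:, None, :] - xi[None, :, :]) ** 2).sum(axis=2)
rhs = pairwise.sum() / (2 * n)
```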
Slide 8: MILP Reformulation

Starting from the cardinality-constrained formulation,

    \min \sum_{k=1}^{K} \sum_{i=1}^{N} \pi_i^k \, \|\xi_i - \mu_k\|^2
    \text{s.t.} \quad \mu_k \in \mathbb{R}^d, \quad \pi_i^k \in \{0,1\}, \quad \sum_{k=1}^{K} \pi_i^k = 1 \;\forall i, \quad \sum_{i=1}^{N} \pi_i^k = n_k \;\forall k,

(1) substitute the optimal $\mu_k$, the average of cluster $k$:

    = \min \sum_{k=1}^{K} \sum_{i=1}^{N} \pi_i^k \, \Big\| \xi_i - \frac{1}{n_k} \sum_{j=1}^{N} \pi_j^k \xi_j \Big\|^2

(2) apply the Lemma (Zha et al., 2000):

    = \min \frac{1}{2} \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j=1}^{N} \pi_i^k \pi_j^k \, \|\xi_i - \xi_j\|^2

(3) define $d_{ij} = \|\xi_i - \xi_j\|^2$ and introduce epigraphical variables $\eta_{ij}^k$:

    \min \frac{1}{2} \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j=1}^{N} \eta_{ij}^k \, d_{ij}
    \text{s.t.} \quad \eta_{ij}^k \in \mathbb{R}_+, \quad \pi_i^k \in \{0,1\}, \quad \sum_{k=1}^{K} \pi_i^k = 1 \;\forall i, \quad \sum_{i=1}^{N} \pi_i^k = n_k \;\forall k, \quad \eta_{ij}^k \ge \pi_i^k + \pi_j^k - 1 \;\forall i,j,k.     (P)
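The linearization in step (3) works because, for binary variables, the product $\pi_i^k \pi_j^k$ equals $\max(\pi_i^k + \pi_j^k - 1,\, 0)$, which is exactly the value the minimized epigraphical variable takes under the constraints $\eta_{ij}^k \ge \pi_i^k + \pi_j^k - 1$ and $\eta_{ij}^k \ge 0$. A small exhaustive check:

```python
import itertools

# At optimality eta = max(pi_i + pi_j - 1, 0); for binary pi this equals
# the product pi_i * pi_j, which is what linearizes the quadratic objective.
ok = all(max(pi_i + pi_j - 1, 0) == pi_i * pi_j
         for pi_i, pi_j in itertools.product([0, 1], repeat=2))
```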
Slide 9: Towards an SDP Relaxation (1/4)

As on the previous slide, rewrite the cardinality-constrained problem in pairwise form. Then apply the transformation $x_i^k = 2\pi_i^k - 1$ to obtain variables $x_i^k \in \{-1,+1\}$:

    \min \frac{1}{8} \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j=1}^{N} (1 + x_i^k)(1 + x_j^k) \, d_{ij}
    \text{s.t.} \quad x_i^k \in \{-1,+1\}, \quad \sum_{k=1}^{K} x_i^k = 2 - K \;\forall i, \quad \sum_{i=1}^{N} x_i^k = 2 n_k - N \;\forall k.

Define $m_{ij}^k = x_i^k x_j^k$ and notice that $(1 + x_i^k)(1 + x_j^k) = 1 + x_i^k + x_j^k + m_{ij}^k$.
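Since $\pi_i^k = (1 + x_i^k)/2$, the pairwise objective transforms with the leading factor $1/2$ becoming $1/8$. A numerical check on one cluster's indicator vector (random data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
pi = rng.integers(0, 2, size=8).astype(float)   # one cluster's 0/1 indicator
x = 2 * pi - 1                                  # the {-1,+1} transformation
d = rng.random((8, 8))
d = (d + d.T) / 2
np.fill_diagonal(d, 0)                          # symmetric, zero diagonal

# (1/2) sum_ij pi_i pi_j d_ij  ==  (1/8) sum_ij (1+x_i)(1+x_j) d_ij
orig = 0.5 * np.einsum("i,j,ij->", pi, pi, d)
transformed = 0.125 * np.einsum("i,j,ij->", 1 + x, 1 + x, d)
```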
Slide 10: Towards an SDP Relaxation (2/4)

    \min \frac{1}{8} \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j=1}^{N} (1 + x_i^k + x_j^k + m_{ij}^k) \, d_{ij}
    \text{s.t.} \quad x_i^k \in \{-1,+1\}, \quad m_{ij}^k \in \mathbb{R}, \quad m_{ij}^k = x_i^k x_j^k \;\forall i,j,k,
    \quad \sum_{k=1}^{K} x_i^k = 2 - K \;\forall i, \quad \sum_{i=1}^{N} x_i^k = 2 n_k - N \;\forall k.

Switch to matrix notation, $M^k_{ij} = m_{ij}^k$ and $(x^k)_i = x_i^k$:

    \min \frac{1}{8} \sum_{k=1}^{K} \frac{1}{n_k} \left\langle D, \; M^k + \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top \right\rangle
    \text{s.t.} \quad x^k \in \{-1,+1\}^N, \quad M^k \in \mathbb{S}^N, \quad M^k = x^k (x^k)^\top \;\forall k,
    \quad \sum_{k=1}^{K} x^k = (2-K)\mathbf{1}, \quad \mathbf{1}^\top x^k = 2 n_k - N \;\forall k.
Slide 11: Towards an SDP Relaxation (3/4)

Every feasible solution of the matrix formulation satisfies the following (redundant) constraints:

    M^k \mathbf{1} = x^k (x^k)^\top \mathbf{1} = (2 n_k - N) \, x^k \;\forall k
    \mathrm{diag}(M^k) = \mathbf{1} \;\forall k
    M^k + \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top = (\mathbf{1} + x^k)(\mathbf{1} + x^k)^\top \ge 0 \;\forall k
    M^k + \mathbf{1}\mathbf{1}^\top - x^k \mathbf{1}^\top - \mathbf{1}(x^k)^\top = (\mathbf{1} - x^k)(\mathbf{1} - x^k)^\top \ge 0 \;\forall k
    M^k - \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top - \mathbf{1}(x^k)^\top = -(\mathbf{1} - x^k)(\mathbf{1} + x^k)^\top \le 0 \;\forall k
    M^k - \mathbf{1}\mathbf{1}^\top - x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top = -(\mathbf{1} + x^k)(\mathbf{1} - x^k)^\top \le 0 \;\forall k

(all inequalities elementwise).
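These identities are easy to verify numerically for any $x \in \{-1,+1\}^N$ with $M = x x^\top$ (a sanity check, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
x = rng.choice([-1.0, 1.0], size=N)
one = np.ones(N)
M = np.outer(x, x)

# M 1 = (1^T x) x, and diag(M) = 1.
row_sums_ok = np.allclose(M @ one, x.sum() * x)
diag_ok = np.allclose(np.diag(M), 1.0)

# The four sign constraints are outer products of (1 +/- x) with itself,
# hence elementwise non-negative (first two) or non-positive (last two).
W_pp = M + np.outer(one, one) + np.outer(x, one) + np.outer(one, x)
W_mm = M + np.outer(one, one) - np.outer(x, one) - np.outer(one, x)
W_pm = M - np.outer(one, one) + np.outer(x, one) - np.outer(one, x)
W_mp = M - np.outer(one, one) - np.outer(x, one) + np.outer(one, x)
```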
Slide 12: Towards an SDP Relaxation (4/4)

SDP relaxation (Awasthi et al., 2015): replace the binary constraint $x^k \in \{-1,+1\}^N$ by $x^k \in [-1,+1]^N$, and the nonconvex constraint $M^k = x^k (x^k)^\top$ by the semidefinite constraint $M^k \succeq x^k (x^k)^\top$. The previously redundant constraints listed above may now play a role, so they are added to the relaxation.
Slide 13: SDP Relaxation

    \min \frac{1}{8} \sum_{k=1}^{K} \frac{1}{n_k} \left\langle D, \; M^k + \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top \right\rangle
    \text{s.t.} \quad x^k \in [-1,+1]^N, \quad M^k \in \mathbb{S}^N, \quad M^k \succeq x^k (x^k)^\top \;\forall k,
    \quad \sum_{k=1}^{K} x^k = (2-K)\mathbf{1}, \quad \mathbf{1}^\top x^k = 2 n_k - N \;\forall k,
    \quad M^k \mathbf{1} = (2 n_k - N) \, x^k \;\forall k,
    \quad \mathrm{diag}(M^k) = \mathbf{1} \;\forall k,
    \quad M^k + \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top \ge 0 \;\forall k,
    \quad M^k + \mathbf{1}\mathbf{1}^\top - x^k \mathbf{1}^\top - \mathbf{1}(x^k)^\top \ge 0 \;\forall k,
    \quad M^k - \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top - \mathbf{1}(x^k)^\top \le 0 \;\forall k,
    \quad M^k - \mathbf{1}\mathbf{1}^\top - x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top \le 0 \;\forall k.     (R_SDP)
Slide 14: LP Relaxation

The LP relaxation is obtained from (R_SDP) by dropping the semidefinite constraint $M^k \succeq x^k (x^k)^\top$; all remaining constraints are linear:

    \min \frac{1}{8} \sum_{k=1}^{K} \frac{1}{n_k} \left\langle D, \; M^k + \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top \right\rangle
    \text{s.t.} \quad x^k \in [-1,+1]^N, \quad M^k \in \mathbb{S}^N,
    \quad \sum_{k=1}^{K} x^k = (2-K)\mathbf{1}, \quad \mathbf{1}^\top x^k = 2 n_k - N \;\forall k,
    \quad M^k \mathbf{1} = (2 n_k - N) \, x^k \;\forall k,
    \quad \mathrm{diag}(M^k) = \mathbf{1} \;\forall k,
    \quad M^k + \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top \ge 0 \;\forall k,
    \quad M^k + \mathbf{1}\mathbf{1}^\top - x^k \mathbf{1}^\top - \mathbf{1}(x^k)^\top \ge 0 \;\forall k,
    \quad M^k - \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top - \mathbf{1}(x^k)^\top \le 0 \;\forall k,
    \quad M^k - \mathbf{1}\mathbf{1}^\top - x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top \le 0 \;\forall k.     (R_LP)
Slide 15: Relaxation Theorem

Theorem: $\min R_{LP} \le \min R_{SDP} \le \min P$.

We can thus obtain lower bounds on the objective of the cardinality-constrained K-means clustering problem in polynomial time. Lloyd's algorithm, in contrast, does not give lower bounds and is not guaranteed to terminate in polynomial time (Arthur and Vassilvitskii, 2006).

Can we recover a feasible solution (and thus an upper bound)?
Slide 16: Outline

- MILP Reformulation, Conic Relaxations (lower bound on objective)
- Rounding Algorithm and Recovery Guarantees
- Numerical Experiments
Slide 17: Rounding Algorithm

Step 1: Solve (R_SDP) or (R_LP) and record the optimal $x^1, \ldots, x^K \in \mathbb{R}^N$.

Step 2: Solve the (totally unimodular) linear assignment problem

    \max \sum_{k=1}^{K} \sum_{i=1}^{N} \pi_i^k \, x_i^k
    \text{s.t.} \quad \pi_i^k \in \{0,1\}, \quad \sum_{k=1}^{K} \pi_i^k = 1 \;\forall i, \quad \sum_{i=1}^{N} \pi_i^k = n_k \;\forall k

to obtain an assignment $\{\pi_i^k\}$ that is feasible in (P). (If $x_i^k = +1$, point $i$ is assigned to cluster $k$.)

Note: All of the above problems can be solved in polynomial time.
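Step 2 can again be solved exactly by replicating each cluster's column $n_k$ times and running a linear assignment solver. A sketch (the function name and the fractional $x$ values below are invented for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def round_relaxation(x, n):
    """Given relaxed scores x[k][i] in [-1, +1], find the assignment that
    maximizes sum_i x[label_i][i] while giving cluster k exactly n[k]
    points.  Replicating cluster k's column n[k] times reduces this to a
    standard linear assignment problem (minimize the negated profit)."""
    x = np.asarray(x, dtype=float)              # shape (K, N)
    cols = np.repeat(np.arange(x.shape[0]), n)  # one column per cluster slot
    rows, slots = linear_sum_assignment(-x.T[:, cols])
    labels = np.empty(x.shape[1], dtype=int)
    labels[rows] = cols[slots]
    return labels

# Point 2 mildly prefers cluster 0, but the cardinalities force it out.
x = [[0.9, 0.8, 0.1, -1.0],
     [-0.9, -0.8, -0.1, 1.0]]
labels = round_relaxation(x, n=[2, 2])
```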
Slide 18: Perfect Separation for Balanced Clustering

There exists a partition $\{I_k\}$ of the datapoints with $|I_k| = n$ for all $k$ such that

    \max_{k} \max_{i,j \in I_k} d_{ij} < \min_{k \neq \ell} \min_{i \in I_k, \, j \in I_\ell} d_{ij},

i.e., the maximum distance within clusters is smaller than the minimum distance between clusters. This condition is also used in Elhamifar et al. (2012) and in Nellore and Ward.
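The condition is straightforward to check for a given partition (a small utility sketch; the function name and toy data are mine):

```python
import numpy as np

def perfectly_separated(xi, partition):
    """Check perfect separation: every intra-cluster squared distance is
    strictly smaller than every inter-cluster squared distance."""
    d = ((xi[:, None, :] - xi[None, :, :]) ** 2).sum(axis=2)
    intra = max(d[np.ix_(I, I)].max() for I in partition)
    inter = min(d[np.ix_(I, J)].min()
                for a, I in enumerate(partition)
                for b, J in enumerate(partition) if a != b)
    return intra < inter

xi = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.5, 0.0]])
sep = perfectly_separated(xi, [[0, 1], [2, 3]])   # the natural partition
bad = perfectly_separated(xi, [[0, 2], [1, 3]])   # a scrambled partition
```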
Slide 19: Recovery Theorem for Balanced Clustering

Theorem: Under perfect separation, $\min R_{LP} = \min R_{SDP} = \min P$.

Proof idea: Derive a lower bound on $\min R_{LP}$ from its own constraints, and show that under perfect separation this bound is attained by a solution feasible in P.
Slide 20: Proof of Recovery Theorem for Balanced Clustering (1/2)

Define $W^k = M^k + \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top$ with entries $w_{ij}^k$. In the balanced case ($n_k = n$ for all $k$, $N = Kn$), the constraints of (R_LP) read

    (1) x^k \in [-1,+1]^N, \quad M^k \in \mathbb{S}^N \;\forall k
    (2) \sum_{k=1}^{K} x^k = (2-K)\mathbf{1}, \quad \mathbf{1}^\top x^k = 2n - N \;\forall k
    (3) M^k \mathbf{1} = (2n - N) \, x^k \;\forall k
    (4) \mathrm{diag}(M^k) = \mathbf{1} \;\forall k
    (5) M^k + \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top \ge 0 \;\forall k
    (6) M^k + \mathbf{1}\mathbf{1}^\top - x^k \mathbf{1}^\top - \mathbf{1}(x^k)^\top \ge 0 \;\forall k
    (7) M^k - \mathbf{1}\mathbf{1}^\top \pm \big( x^k \mathbf{1}^\top - \mathbf{1}(x^k)^\top \big) \le 0 \;\forall k

and the objective becomes, using (5) and $d_{ii} = 0$,

    \frac{1}{8n} \sum_{k=1}^{K} \left\langle D, \; M^k + \mathbf{1}\mathbf{1}^\top + x^k \mathbf{1}^\top + \mathbf{1}(x^k)^\top \right\rangle = \frac{1}{8n} \sum_{k=1}^{K} \langle D, W^k \rangle = \frac{1}{8n} \sum_{i \neq j} d_{ij} \sum_{k=1}^{K} w_{ij}^k,

a weighted sum of non-negative terms.

Bounds on the individual weights, using (2) together with (6) and (7) (which imply $m_{ij}^k \le 1$):

    0 \le \sum_{k=1}^{K} w_{ij}^k = \sum_{k=1}^{K} \big( m_{ij}^k + 1 + x_i^k + x_j^k \big) \le \sum_{k=1}^{K} (1 + 1) + (2 - K) + (2 - K) = 4.

Restriction on the total weight, using (2), (3), (4):

    \sum_{i \neq j} \sum_{k=1}^{K} w_{ij}^k = \sum_{k=1}^{K} \big[ (2n-N)^2 - N + N(N-1) + 2(N-1)(2n-N) \big] = 4Kn(n-1).
Slide 21: Proof of Recovery Theorem for Balanced Clustering (2/2)

Bounds on individual weights: $0 \le \sum_{k} w_{ij}^k \le 4$.
Restriction on total weight: $\sum_{i \neq j} \sum_{k} w_{ij}^k = 4Kn(n-1)$.

The non-negative weighted sum is therefore smallest when the maximal weight 4 is placed on the smallest distances, which gives the lower bound

    \frac{1}{8n} \sum_{i \neq j} d_{ij} \sum_{k=1}^{K} w_{ij}^k \;\ge\; \frac{1}{2n} \cdot \Big( \text{sum of the } Kn(n-1) \text{ smallest } d_{ij} \text{ with } i \neq j \Big).

Under perfect separation, this lower bound is attained by a solution feasible in (P): assigning each group $I_k$ of the separated partition to its own cluster activates exactly the $Kn(n-1)$ intra-cluster pairs, which are precisely the smallest distances.
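The two quantities can be compared directly on a small, perfectly separated instance: the closed-form lower bound from the weights coincides with the brute-force optimum of the balanced problem in its pairwise form (toy data and helper names invented for illustration):

```python
import itertools
import numpy as np

def pairwise_sq(xi):
    return ((xi[:, None, :] - xi[None, :, :]) ** 2).sum(axis=2)

def balanced_kmeans_opt(xi, K, n):
    """Brute-force optimum of balanced K-means in its pairwise form,
    sum_k 1/(2n) sum_{i,j in cluster k} d_ij."""
    d = pairwise_sq(xi)
    best = np.inf
    for labels in itertools.product(range(K), repeat=len(xi)):
        if any(labels.count(k) != n for k in range(K)):
            continue  # enforce the cardinality constraints
        idx = [[i for i, l in enumerate(labels) if l == k] for k in range(K)]
        best = min(best, sum(d[np.ix_(I, I)].sum() for I in idx) / (2 * n))
    return best

def weight_lower_bound(xi, K, n):
    """1/(2n) times the sum of the Kn(n-1) smallest off-diagonal d_ij."""
    d = pairwise_sq(xi)
    off = np.sort(d[~np.eye(len(xi), dtype=bool)])
    return off[: K * n * (n - 1)].sum() / (2 * n)

xi = np.array([[0.0], [0.3], [5.0], [5.3], [10.0], [10.3]])
lb = weight_lower_bound(xi, K=3, n=2)
opt = balanced_kmeans_opt(xi, K=3, n=2)
```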
Slide 22: Simultaneous Clustering and Outlier Detection

- Outliers can be dealt with by introducing an additional dummy cluster.
- This dummy cluster is not penalized in the objective function, but it has to fulfill appropriate constraints.
- The MILP reformulation, the SDP/LP relaxations, and the recovery guarantee are still available in this setting.
- A similar approach is taken in Chawla and Gionis (2013) and in Ott et al.
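The dummy-cluster idea can be grafted onto the cardinality-constrained assignment step: add one more "cluster" of size equal to the number of outliers, with zero cost for every point. A sketch under these assumptions (function name and toy data are mine, not from the slides):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_with_outliers(xi, mu, n, n_out):
    """Assignment step with a dummy outlier cluster of size n_out whose
    points incur zero cost, so the n_out most expensive-to-place points
    are discarded from the objective."""
    cost = ((xi[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    cost = np.hstack([cost, np.zeros((len(xi), 1))])  # dummy column, zero cost
    cols = np.repeat(np.arange(len(mu) + 1), list(n) + [n_out])
    rows, slots = linear_sum_assignment(cost[:, cols])
    labels = np.empty(len(xi), dtype=int)
    labels[rows] = cols[slots]
    return labels  # label == len(mu) marks an outlier

xi = np.array([[0.0], [0.2], [50.0], [10.0], [10.2]])
labels = assign_with_outliers(xi, mu=np.array([[0.0], [10.0]]),
                              n=[2, 2], n_out=1)
```

Here the far-away point at 50.0 is absorbed by the dummy cluster instead of inflating one of the real clusters.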
Slide 23: Outline

- MILP Reformulation, Conic Relaxations (lower bound on objective)
- Rounding Algorithm and Recovery Guarantees (feasible clustering that can be optimal)
- Numerical Experiments
Slide 24: Performance on Real-World Data

Consider classification datasets in the UCI Machine Learning Repository with datapoints having up to 200 attributes and no missing values. Perform classification by means of the following approaches: Rounded R_LP, Rounded R_SDP, and Best-of-10 Bennett et al.

    Dataset               | Rounded R_LP (UB) | R_LP (LB) | Rounded R_SDP (UB) | R_SDP (LB) | Bennett et al. (UB) | CV (%)
    Iris                  |                   |           |                    |            |                     |
    Seeds                 |                   |           |                    |            |                     |
    Planning Relax        |                   |           |                    |            |                     |
    Connectionist Bench   |                   |           |                    |            |                     |
    Urban Land Cover      | 3.61e9            | 3.17e9    | 3.54e9             | 3.44e9     | 3.64e9              | 9.2
    Parkinsons            | 1.36e6            | 1.36e6    | 1.36e6             | 1.36e6     | 1.36e               |
    Glass Identification  |                   |           |                    |            |                     |
Slide 25: Performance on Synthetic Data (1/3)

- Generate three clouds with 10, 20 and 70 datapoints, respectively.
- The datapoints of each cloud are contained within a unit ball.
- Vary the separation between the clouds.
- Apply Rounded R_LP, Rounded R_SDP, and Best-of-10 Bennett et al.
Slide 26: Performance on Synthetic Data (2/3)

[figure]
Slide 27: Performance on Synthetic Data (3/3)

[figure: Rounded R_LP vs. Best-of-10 Bennett et al.]
Slide 28: Performance on Outlier Detection

Consider the Breast Cancer Wisconsin (Diagnostic) dataset in the UCI Machine Learning Repository: 357 benign datapoints (considered to be the cluster) and 212 malignant datapoints (considered to be the outliers). Vary the number of outliers and apply Rounded R_LP. The optimality gap is always below 3%.
Slide 29: Outline

- MILP Reformulation, Conic Relaxations (lower bound on objective)
- Rounding Algorithm and Recovery Guarantees (feasible clustering that can be optimal)
- Numerical Experiments (proof of concept)
Slide 30: Thank you & References

- Arthur, D., S. Vassilvitskii. 2006. How slow is the k-means method? Symposium on Computational Geometry.
- Awasthi, P., A. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, R. Ward. 2015. Relax, no need to round: Integrality of clustering formulations. Conference on Innovations in Theoretical Computer Science.
- Bennett, K., P. Bradley, A. Demiriz. 2000. Constrained K-means clustering. Technical Report, Microsoft Research.
- Chawla, S., A. Gionis. 2013. k-means--: A unified approach to clustering and outlier detection. SIAM International Conference on Data Mining.
- Elhamifar, E., G. Sapiro, R. Vidal. 2012. Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery. Advances in Neural Information Processing Systems.
- Lloyd, S. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2).
- Nellore, A., R. Ward. Recovery guarantees for exemplar-based clustering. Information and Computation.
- Ott, L., L. Pang, F. Ramos, S. Chawla. On integrated clustering and outlier detection. Advances in Neural Information Processing Systems.
- Rujeerapaiboon, N., K. Schindler, D. Kuhn, W. Wiesemann. Size matters: Cardinality-constrained clustering and outlier detection via conic optimization. Optimization Online.
- Zha, H., X. He, C. Ding, H. Simon, M. Gu. 2000. Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems.
More informationIntroduction to Integer Linear Programming
Lecture 7/12/2006 p. 1/30 Introduction to Integer Linear Programming Leo Liberti, Ruslan Sadykov LIX, École Polytechnique liberti@lix.polytechnique.fr sadykov@lix.polytechnique.fr Lecture 7/12/2006 p.
More informationSCRIBERS: SOROOSH SHAFIEEZADEH-ABADEH, MICHAËL DEFFERRARD
EE-731: ADVANCED TOPICS IN DATA SCIENCES LABORATORY FOR INFORMATION AND INFERENCE SYSTEMS SPRING 2016 INSTRUCTOR: VOLKAN CEVHER SCRIBERS: SOROOSH SHAFIEEZADEH-ABADEH, MICHAËL DEFFERRARD STRUCTURED SPARSITY
More informationDegenerate Expectation-Maximization Algorithm for Local Dimension Reduction
Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Xiaodong Lin 1 and Yu Zhu 2 1 Statistical and Applied Mathematical Science Institute, RTP, NC, 27709 USA University of Cincinnati,
More informationOn the Number of Solutions Generated by the Simplex Method for LP
Workshop 1 on Large Scale Conic Optimization IMS (NUS) On the Number of Solutions Generated by the Simplex Method for LP Tomonari Kitahara and Shinji Mizuno Tokyo Institute of Technology November 19 23,
More informationSemidefinite Programming Basics and Applications
Semidefinite Programming Basics and Applications Ray Pörn, principal lecturer Åbo Akademi University Novia University of Applied Sciences Content What is semidefinite programming (SDP)? How to represent
More informationSupport vector machines
Support vector machines Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 SVM, kernel methods and multiclass 1/23 Outline 1 Constrained optimization, Lagrangian duality and KKT 2 Support
More informationAn Improved Approximation Algorithm for Knapsack Median Using Sparsification
An Improved Approximation Algorithm for Knapsack Median Using Sparsification Jaroslaw Byrka 1, Thomas Pensyl 2, Bartosz Rybicki 1, Joachim Spoerhase 1, Aravind Srinivasan 3, and Khoa Trinh 2 1 Institute
More informationMachine Learning. Probabilistic KNN.
Machine Learning. Mark Girolami girolami@dcs.gla.ac.uk Department of Computing Science University of Glasgow June 21, 2007 p. 1/3 KNN is a remarkably simple algorithm with proven error-rates June 21, 2007
More informationHomework 3. Convex Optimization /36-725
Homework 3 Convex Optimization 10-725/36-725 Due Friday October 14 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationMetric Embedding for Kernel Classification Rules
Metric Embedding for Kernel Classification Rules Bharath K. Sriperumbudur University of California, San Diego (Joint work with Omer Lang & Gert Lanckriet) Bharath K. Sriperumbudur (UCSD) Metric Embedding
More informationA Randomized Approach for Crowdsourcing in the Presence of Multiple Views
A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationAnalysis Based on SVM for Untrusted Mobile Crowd Sensing
Analysis Based on SVM for Untrusted Mobile Crowd Sensing * Ms. Yuga. R. Belkhode, Dr. S. W. Mohod *Student, Professor Computer Science and Engineering, Bapurao Deshmukh College of Engineering, India. *Email
More informationLinear Regression In God we trust, all others bring data. William Edwards Deming
Linear Regression ddebarr@uw.edu 2017-01-19 In God we trust, all others bring data. William Edwards Deming Course Outline 1. Introduction to Statistical Learning 2. Linear Regression 3. Classification
More informationConvex sets, conic matrix factorizations and conic rank lower bounds
Convex sets, conic matrix factorizations and conic rank lower bounds Pablo A. Parrilo Laboratory for Information and Decision Systems Electrical Engineering and Computer Science Massachusetts Institute
More informationMachine Learning. Support Vector Machines. Manfred Huber
Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data
More informationReducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning
More informationAn Efficient Sparse Metric Learning in High-Dimensional Space via l 1 -Penalized Log-Determinant Regularization
via l 1 -Penalized Log-Determinant Regularization Guo-Jun Qi qi4@illinois.edu Depart. ECE, University of Illinois at Urbana-Champaign, 405 North Mathews Avenue, Urbana, IL 61801 USA Jinhui Tang, Zheng-Jun
More informationThe Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences
The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences Yuxin Chen Emmanuel Candès Department of Statistics, Stanford University, Sep. 2016 Nonconvex optimization
More informationDATA MINING AND MACHINE LEARNING
DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems
More informationA New Estimate of Restricted Isometry Constants for Sparse Solutions
A New Estimate of Restricted Isometry Constants for Sparse Solutions Ming-Jun Lai and Louis Y. Liu January 12, 211 Abstract We show that as long as the restricted isometry constant δ 2k < 1/2, there exist
More information