Distributed K-means over Compressed Binary Data


Elsa DUPRAZ
Telecom Bretagne; UMR CNRS 6285 Lab-STICC, Brest, France
arXiv:1701.03403v1 [cs.IT] 12 Jan 2017

Abstract: We consider a network of binary-valued sensors with a fusion center. The fusion center has to perform K-means clustering on the binary data transmitted by the sensors. In order to reduce the amount of data transmitted within the network, the sensors compress their data with a source coding scheme based on LDPC codes. We propose to apply the K-means algorithm directly over the compressed data, without reconstructing the original sensor measurements, in order to avoid potentially complex decoding operations. We provide approximate expressions for the error probabilities of the K-means steps in the compressed domain. From these expressions, we show that applying the K-means algorithm in the compressed domain makes it possible to recover the clusters of the original domain. Monte Carlo simulations illustrate the accuracy of the obtained approximate error probabilities, and show that the coding rate needed to perform K-means clustering in the compressed domain is lower than the rate needed to reconstruct all the measurements.

I. INTRODUCTION

Networks of sensors have long been employed in various domains such as environmental monitoring, electrical energy management, and medicine [1]. In particular, inexpensive binary-valued sensors are successfully used in a wide range of applications, such as traffic control in telecommunication systems [2], self-testing in nanoelectronic devices [3], and activity recognition in home environments [4]. In this paper, we consider a network of J binary-valued sensors that transmit their data to a fusion center. In such a network, the sensors potentially collect a large amount of data. The fusion center may carry out complex data analysis tasks by aggregating the sensor measurements and by exploiting the diversity of the collected data.
Clustering is a particular data analysis task that consists of separating the data into a given number of classes with similar characteristics. One of the most popular clustering methods is the K-means algorithm [5], due to its simplicity and efficiency. The K-means algorithm groups the J measurement vectors into K clusters so as to minimize the average distance between vectors in a cluster and the cluster center. If the measurements are real-valued, K-means usually relies on the Euclidean distance [5], while for binary measurements, K-means relies on the Hamming distance [6]. In our context, the J sensors should send their measurements to the fusion center in a compressed form in order to greatly reduce the amount of data transmitted within the network. Low-Density Parity-Check (LDPC) codes, initially introduced in the context of channel coding [7], have been shown to be very efficient for distributed compression in a network of sensors [8]. However, the standard distributed compression framework [8] assumes that the fusion center has to reconstruct all the measurements from all the sensors. Here, in order to avoid potentially complex decoding operations, we would like to perform K-means directly over the compressed data. Distributed K-means over compressed data raises three questions: (i) How should the data be compressed so that the fusion center can perform K-means without having to reconstruct all the measurements? (ii) How good is clustering over compressed data compared to clustering over the original data? (iii) Is the rate needed to perform K-means lower than the rate needed to reconstruct all the data? Regarding the first two questions, [9], [10] considered real-valued measurement vectors compressed with Compressed Sensing (CS) techniques, and proposed to apply the K-means algorithm directly in the compressed domain. CS consists of computing M random linear combinations of the N components of the measurement vectors, with M ≪ N.
The results of [9] show that if M is large enough, the compression preserves the Euclidean distances with high probability, which makes it possible to perform K-means in the compressed domain. Beyond K-means, detection and parameter estimation can be applied over compressed real-valued data [11]-[13], but also over binary data compressed with LDPC codes [14], [15]. However, none of these works consider the K-means algorithm over compressed binary data. Regarding the third question, information theory has recently addressed data analysis tasks such as distributed hypothesis testing [16], [17] or similarity queries [18], [19] over compressed data. These works provide analytic expressions for the minimum rates that need to be transmitted in order to perform the considered tasks, although these general analytic expressions can be difficult to evaluate for particular source models. However, to the best of our knowledge, the K-means algorithm has not yet been studied from an information-theoretic perspective. In this paper, we consider binary measurement vectors and assume that the compression is realized with LDPC codes. We want to determine whether the K-means algorithm can be applied directly over the compressed data in order to recover the clusters of the original data. In the following, after describing the source coding system (Section II), we propose a formulation of the K-means algorithm over binary data in the compressed domain (Section III). We then carry out a theoretical analysis of the performance of the K-means algorithm in the compressed domain (Section IV). In particular, we derive approximate error probabilities for each of the two steps of the K-means algorithm. The theoretical analysis shows that applying the K-means algorithm in the compressed domain permits recovering the clusters of the original domain.
Monte Carlo simulations confirm the accuracy of the obtained approximate error probabilities, and show that the effective rate needed to perform K-means over compressed data is lower than the rate that would be needed to reconstruct all the sensor measurements (Section V).

II. SYSTEM DESCRIPTION

In this section, we first introduce our notation and assumptions for the binary measurement vectors collected by the sensors. We then present the source coding technique based on LDPC codes that is used in the system.

A. Source Model

The network is composed of J sensors and a fusion center. Each sensor j ∈ ⟦1, J⟧ performs N binary measurements x_{j,n} ∈ {0, 1} that are stored in a vector x_j of size N. Consider K different clusters C_k, where each cluster is associated with a centroid θ_k of length N. The binary components θ_{k,n} of θ_k are independent and identically distributed (i.i.d.) with P(θ_{k,n} = 1) = p_c. We assume that each measurement vector x_j belongs to one of the K clusters. The cluster assignment variables e_{j,k} are defined as e_{j,k} = 1 if x_j ∈ C_k, and e_{j,k} = 0 otherwise. Let Θ = {θ_1, …, θ_K} and E = {e_{1,1}, …, e_{J,K}} be the sets of centroids and of cluster assignment variables, respectively. Within cluster C_k, each vector x_j ∈ C_k is generated as

x_j = θ_k ⊕ b_j,   (1)

where ⊕ represents the componentwise XOR operation, and b_j is a vector of size N with binary i.i.d. components such that P(b_{j,n} = 1) = p. In the following, we assume that the cluster assignment variables e_{j,k}, the centroids θ_k, and the parameters p and p_c are unknown. This model is equivalent to the model presented in [20] for K-means clustering of binary data. Some instances of the K-means algorithm have been proposed to deal with an unknown number of clusters K [5]. However, here, as a first step, K is assumed to be known in order to focus on the compression aspects of the problem. Each sensor has to transmit its data to the fusion center, which should perform K-means on the received data in order to recover the cluster assignments E and the centroids Θ. We now describe the source coding technique that is used in our system in order to greatly reduce the amount of data transmitted to the fusion center.

B. Source Coding with Low-Density Parity-Check Codes

In [8], it is shown that LDPC codes are very efficient for distributed source coding in a network of sensors, and in [14], [15] it is shown that they allow parameter estimation over the compressed data. Denote by H the binary parity-check matrix of size N × M (M < N) of an LDPC code. Denote by d_v ≤ M the number of non-zero components in any row of H, and by d_c ≤ N the number of non-zero components in any column of H. In our system, each sensor j transmits to the fusion center a binary vector u_j of length M, obtained as

u_j = H^T x_j.   (2)

For each sensor, the coding rate is given by r = M/N = d_v/d_c. The set of all possible vectors x_j is called the original domain and is denoted X^N = {0, 1}^N. The set of all possible vectors u_j is called the compressed domain and is denoted U^M ⊆ {0, 1}^M. The compressed domain U^M depends on the considered code H. As in [8], we assume that the vectors u_j are transmitted reliably to the fusion center. We make this assumption in order to focus on the source coding aspects of the problem, and we do not describe the channel codes that should be used in the system to satisfy it. The source coding technique described by (2) was initially proposed in [8] in a context where the fusion center has to reconstruct all the measurement vectors x_j. However, reconstructing all the sensor measurements usually requires complex decoding operations and may need a higher coding rate r than simply applying K-means over the compressed data. Hence, in the following, we propose to apply the K-means algorithm directly over the compressed vectors u_j, without reconstructing the original vectors x_j.

III. K-MEANS ALGORITHM

The K-means algorithm for clustering binary vectors x_j in the original domain X^N was initially proposed in [6] and makes use of the Hamming distance. In this section, we restate the K-means algorithm of [6] in the compressed domain U^M.
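To make the setting of Section II concrete, the following sketch (ours, not the paper's code) draws data from the source model (1) and compresses it as in (2). A random sparse matrix with d_v ones per row stands in for the PEG-constructed LDPC parity-check matrix used in the paper's experiments; column regularity d_c is not enforced in this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Parameters taken from Section V of the paper.
N, M, d_v = 1000, 250, 2          # vector length, compressed length, r = M/N = 1/4
J, K, p_c, p = 200, 4, 0.1, 0.05  # sensors, clusters, P(theta=1), P(b=1)

# Stand-in parity-check matrix: N x M, exactly d_v ones per row.
H = np.zeros((N, M), dtype=np.int64)
for n in range(N):
    H[n, rng.choice(M, size=d_v, replace=False)] = 1

# Source model (1): x_j = theta_k XOR b_j for the cluster k of sensor j.
theta = (rng.random((K, N)) < p_c).astype(np.uint8)   # centroids theta_k
labels = rng.integers(0, K, size=J)                   # cluster of each sensor
b = (rng.random((J, N)) < p).astype(np.uint8)         # Bernoulli(p) noise
x = theta[labels] ^ b                                 # original domain X^N

# Compression (2): u_j = H^T x_j over GF(2), length M < N.
u = (x @ H) % 2
psi = (theta @ H) % 2   # compressed centroids psi_k = H^T theta_k
```

Because (2) is linear over GF(2), within cluster C_k we have u_j = ψ_k ⊕ H^T b_j; this identity is the starting point of the analysis in Section IV.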
The Hamming distance between two vectors a, b ∈ U^M in the compressed domain is defined as d(a, b) = Σ_{m=1}^M a_m ⊕ b_m. Denote by ψ_k = H^T θ_k and Ψ = {ψ_1, …, ψ_K} the compressed versions of the centroids θ_k. Applying the K-means algorithm in the compressed domain corresponds to minimizing the objective function

F(Ψ, E) = Σ_{j=1}^J Σ_{k=1}^K e_{j,k} d(u_j, ψ_k)   (3)

with respect to the compressed centroids ψ_k and to the cluster assignment variables e_{j,k}. We initialize the K-means algorithm with K compressed centroids ψ_k^(0) that may be either selected at random among the set of input vectors u_j, or obtained from the K-means++ procedure [21]. Denote by L the number of iterations of the K-means algorithm. In the following, the exponent l always refers to a quantity obtained at the l-th iteration of the algorithm. At iteration l ∈ ⟦1, L⟧, K-means proceeds in two steps. First, from the centroids ψ_k^(l-1) obtained at iteration l-1, it assigns each vector u_j to a cluster: ∀ j, k,

e_{j,k}^(l) = 1 if d(u_j, ψ_k^(l-1)) = min_{k' ∈ ⟦1,K⟧} d(u_j, ψ_{k'}^(l-1)), and e_{j,k}^(l) = 0 otherwise.   (4)

Second, the algorithm updates the centroids: ∀ k, n,

ψ_{k,n}^(l) = 1 if Σ_{j=1}^J e_{j,k}^(l) u_{j,n} ≥ J_k^(l)/2, and ψ_{k,n}^(l) = 0 otherwise,   (5)

where J_k^(l) is the number of vectors assigned to cluster k at iteration l. The cluster assignment step (4) assigns each vector u_j to the cluster with the closest compressed centroid ψ_k^(l). The centroid computation step (5) is a majority voting operation, which can be shown to minimize the average distance between the centroid ψ_k^(l) and all the vectors u_j assigned to cluster k at iteration l.

Following the same reasoning as for K-means in the original domain [6], it is easy to show that, when applying K-means in the compressed domain, the objective function F(Ψ^(l), E^(l)) decreases with l and converges at least to a local minimum. However, this property does not guarantee that the cluster assignment variables e_{j,k}^(l) obtained from the algorithm in the compressed domain correspond to the correct cluster assignments in the original domain. In order to justify that the K-means algorithm applied in the compressed domain can recover the correct clusters of the original domain, we now propose a theoretical analysis of the two steps of the algorithm.

IV. K-MEANS PERFORMANCE EVALUATION

In this section, in order to assess the performance of the K-means algorithm in the compressed domain, we evaluate each step of the algorithm individually. We provide an approximate expression for the error probability of the cluster assignment step in the compressed domain, assuming that the compressed cluster centroids ψ_k are perfectly known. In the same way, we provide an approximate expression for the error probability of the centroid estimation step in the compressed domain, assuming that the cluster assignment variables e_{j,k} are perfectly known. Although evaluated in the most favorable cases, these error probabilities will enable us to determine whether it is reasonable to apply K-means in the compressed domain in order to recover the clusters of the original domain. The expressions of the error probabilities we derive rely on two functions B_M and f, defined as

B_M(m, p) = (M choose m) p^m (1 - p)^(M-m),   (6)

f(d, p) = 1/2 - (1/2)(1 - 2p)^d.   (7)

A. Error Probability of the Cluster Assignment Step

The following proposition evaluates the error probability of the cluster assignment step (4) applied with the compressed centroids ψ_k.

Proposition 1.
Let ê_{j,k} be the cluster assignments obtained when applying the cluster assignment step (4) with the true compressed centroids ψ_k. The error probability P_a = P(ê_{j,k} = 0 | x_j ∈ C_k) can be approximated as

P_a ≈ (K - 1) Σ_{m_1=0}^{M} Σ_{m_2=m_1}^{M} B_M(m_1, q_2) B_M(m_2, q_1)   (8)

with q_1 = f(d_c, p) and q_2 = f(d_c, p + (1 - 2p) f(2, p_c)).

Proof: We first evaluate the pairwise error probabilities P_{a,k'} = P(ê_{j,k'} = 1 | x_j ∈ C_k), k' ≠ k. Let

a_k = u_j ⊕ ψ_k = H^T b_j,   a_{k'} = u_j ⊕ ψ_{k'} = H^T (θ_k ⊕ θ_{k'} ⊕ b_j),   (9)

and define A_k = Σ_{m=1}^M a_{k,m} and A_{k'} = Σ_{m=1}^M a_{k',m}. According to the cluster assignment step (4), the error probability P_{a,k'} can be expressed as P_{a,k'} = P(A_{k'} ≤ A_k), which can be approximated as

P_{a,k'} ≈ Σ_{u=0}^{M} P(A_{k'} = u) Σ_{v=u}^{M} P(A_k = v).   (10)

To obtain (10) we implicitly assume that the random variables A_k and A_{k'} are independent, and hence (10) is only an approximation of P_{a,k'}. Assuming further that the a_{k,m} and the a_{k',m} are all independent, we get P(A_{k'} = u) ≈ B_M(u, q_2) and P(A_k = v) ≈ B_M(v, q_1). Finally, the error probability of the cluster assignment step is given by P_a = Σ_{k' ≠ k} P_{a,k'} = (K - 1) P_{a,k'}, since P_{a,k'} does not depend on k'.

It can be seen from (8) that the approximate error probability P_a does not depend on the considered cluster. The expression (8) is only an approximation of the error probability of the cluster assignment step, since it assumes that the components of the vector H^T b_j are independent, which is not true in general. However, it is shown in [14], [22] that this assumption is reasonable for parameter estimation over LDPC codes. In Section V, we verify the accuracy of the approximation by comparing the values of (8) to the error probabilities measured from Monte Carlo simulations.

B. Error Probability of the Centroid Computation Step

The following proposition evaluates the error probability of the centroid computation step in the compressed domain.

Proposition 2. Let Ψ̂ be the estimated compressed centroids obtained after applying the centroid estimation step (5) to the true cluster assignment variables e_{j,k}.
The error probability P_{c,k} = P(ψ̂_{k,m} ≠ ψ_{k,m}) for cluster k can be approximated as

P_{c,k} ≈ Σ_{j=⌈J_k/2⌉}^{J_k} B_{J_k}(j, p_d)   (11)

where J_k is the number of vectors in cluster C_k, and p_d = f(d_c, p).

Proof: From the model defined in Section II, a codeword u_j (x_j ∈ C_k) can be expressed as

u_j = H^T (θ_k ⊕ b_j) = ψ_k ⊕ a_j,   (12)

where a_j = H^T b_j is such that P(a_{j,m} = 1) = p_d. Let A_m = Σ_{j : x_j ∈ C_k} a_{j,m}. The error probability of the centroid computation step can be evaluated as

P_{c,k} = P(A_m ≥ J_k/2) ≈ Σ_{j=⌈J_k/2⌉}^{J_k} B_{J_k}(j, p_d).   (13)

The approximation comes from the fact that (11) assumes that the a_{j,m} are independent.

It can be seen from (11) that the approximate error probability P_{c,k} depends on the considered cluster only through the number J_k of vectors in C_k. As for the cluster assignment step, the expression (11) is only an approximation of the error probability of the centroid computation step, for the same reasons. We will also verify the accuracy of this approximation in Section V.
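The approximations (8) and (11) only involve binomial tail sums, so they are cheap to evaluate numerically. A minimal sketch in plain Python (function names are ours; q_2 follows the expression given in Proposition 1):

```python
import math

def f(d, p):
    # f(d, p): probability that the XOR of d i.i.d. Bernoulli(p) bits is 1 (Eq. 7).
    return 0.5 - 0.5 * (1.0 - 2.0 * p) ** d

def binom_pmf(M, p):
    # B_M(m, p) for m = 0..M (Eq. 6).
    return [math.comb(M, m) * p**m * (1.0 - p) ** (M - m) for m in range(M + 1)]

def p_assign(M, K, d_c, p, p_c):
    # Approximate error probability of the cluster assignment step (Eq. 8).
    q1 = f(d_c, p)                                   # bit probability of H^T b_j
    q2 = f(d_c, p + (1.0 - 2.0 * p) * f(2, p_c))     # bit probability for a wrong centroid
    pmf1, pmf2 = binom_pmf(M, q1), binom_pmf(M, q2)
    # suffix[m] = P(A_k >= m) with A_k ~ Binomial(M, q1): inner sum of (8).
    suffix = [0.0] * (M + 2)
    for m in range(M, -1, -1):
        suffix[m] = suffix[m + 1] + pmf1[m]
    return (K - 1) * sum(pmf2[m] * suffix[m] for m in range(M + 1))

def p_centroid(J_k, d_c, p):
    # Approximate error probability of the centroid computation step (Eq. 11).
    p_d = f(d_c, p)
    pmf = binom_pmf(J_k, p_d)
    return sum(pmf[j] for j in range(math.ceil(J_k / 2), J_k + 1))
```

For instance, with the rate-1/4 code of Section V (M = 250, d_c = 8) and p_c = 0.1, `p_assign(250, 4, 8, 0.05, 0.1)` and `p_centroid(50, 8, 0.05)` land in the small-error regime shown in Fig. 1.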

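Before turning to the numerical results, the assignment step (4) and the majority-vote update (5) can be implemented directly on the compressed vectors. The sketch below is our illustration, not the authors' code; it initializes with random input vectors, whereas the paper's experiments use K-means++ seeding [21].

```python
import numpy as np

def kmeans_compressed(u, K, L=10, rng=None):
    """K-means over compressed binary vectors u (shape (J, M)) with the
    Hamming distance; returns cluster labels and compressed centroids."""
    rng = np.random.default_rng(rng)
    J, M = u.shape
    # Initialization: K centroids drawn at random among the input vectors.
    psi = u[rng.choice(J, size=K, replace=False)].copy()
    labels = np.zeros(J, dtype=int)
    for _ in range(L):
        # Assignment step (4): nearest compressed centroid in Hamming distance.
        dist = (u[:, None, :] ^ psi[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step (5): componentwise majority vote within each cluster.
        for k in range(K):
            members = u[labels == k]
            if len(members) > 0:
                psi[k] = (members.sum(axis=0) >= len(members) / 2).astype(u.dtype)
    return labels, psi
```

The only operations are XORs, popcounts, and majority votes, so the fusion center never leaves the compressed domain U^M.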
Fig. 1. (a) Comparison of the error probability predicted by the theory with the error probability measured in practice for the cluster assignment step. (b) The same comparison for the centroid estimation step. (c) Error probability of the K-means algorithm with L = 10 iterations. [Plots omitted; each panel shows curves for M = 250 and M = 500.]

V. SIMULATION RESULTS

In this section, we evaluate through simulations the performance of the K-means algorithm in the compressed domain. We first consider each step of the algorithm individually, and we verify the accuracy of the approximate error probabilities obtained in Section IV. We then assess the performance of the full algorithm and evaluate the rate needed to perform K-means over compressed data. Throughout the section, we set J = 200, K = 4, p_c = 0.1, and we consider two LDPC codes of length N = 1000 with d_v = 2. The two codes are constructed with the Progressive Edge Growth algorithm [23]; the first code has rate r = 1/4 with M = 250 and d_c = 8, and the second has rate r = 1/2 with M = 500 and d_c = 4. We set d_v = 2 for the two considered codes, since it can be shown from (8) and (11) that the error probabilities P_a and P_{c,k} increase with d_v (d_v is necessarily at least 2).

A. Accuracy of the Error Probability Approximations

We compare the approximate expressions P_a (8) and P_{c,k} (11) with the effective error probabilities measured from Monte Carlo simulations for each step of the algorithm, for the two constructed codes, over N_t = 10000 simulations. Figure 1(a) shows the obtained error probabilities for the cluster assignment step, while Figure 1(b) shows those of the centroid estimation step.

We see that for the two considered codes, the theoretical error probabilities P_a and P_{c,k} are close to the measured error probabilities for the two steps of the algorithm, which shows the accuracy of the proposed approximations. Figure 1(a) also illustrates that the cluster assignment step in the compressed domain can indeed recover the correct clusters of the original domain, since it is possible to reach error probabilities from 10^-3 down to 10^-7. The same conclusion holds for the centroid estimation step in Figure 1(b).

B. K-means Algorithm and Rate Evaluation

In addition to analyzing the performance of the K-means algorithm in the compressed domain, we also want to compare the coding rate r needed to perform K-means with the rate that would be needed to reconstruct all the sensor measurements. From [24], the minimum coding rate (per symbol, normalized by the number of sources J) the fusion center should receive to reconstruct all the x_j is given by R_d = (1/J) H(X_1, …, X_J), where H(X_1, …, X_J) is the joint entropy of the sources (X_1, …, X_J). Since no closed-form expression of R_d exists, we evaluate R_d from Monte Carlo simulations. We run the K-means algorithm in the compressed domain over N_t = 10000 simulations, initialized with the K-means++ procedure and with L = 10 iterations. In order to evaluate the performance of the algorithm, we measure the error probability of the cluster assignments decided by the algorithm in the compressed domain with respect to the correct clusters in the original domain. Figure 1(c) shows the error probabilities obtained as a function of p for the two considered codes. As expected, the error probability increases with p and decreases with the coding rate r. We then compare the rate needed to perform K-means over compressed data with the rate needed to reconstruct all the sensor measurements. For p_c = 0.1 and p = 0.1, we get R_d = 0.68 bits/symbol, and for p_c = 0.1 and p = 0.05, we obtain R_d = 0.43 bits/symbol.

The results of Figure 1(c) show that for p_c = 0.1 and p = 0.1, the code of rate r = 1/2 < 0.68 makes it possible to perform K-means with an error probability lower than 10^-6. For p_c = 0.1 and p = 0.05, the code of rate r = 1/4 < 0.43 also permits K-means with a low error probability P_e = 10^-5. This shows that the rate needed to perform K-means is lower than the rate needed to reconstruct all the sensor measurements, which justifies the use of the method presented in this paper.

VI. CONCLUSION

In this paper, we considered a network of sensors that transmit their compressed binary measurements to a fusion center. We proposed to apply the K-means algorithm directly over the compressed data, without reconstructing the sensor measurements. From a theoretical analysis and Monte Carlo simulations, we showed the efficiency of applying K-means in the compressed domain. We also showed that the rate needed to perform K-means on the compressed vectors is lower than the rate needed to reconstruct all the measurements.

REFERENCES

[1] J. Yick, B. Mukherjee, and D. Ghosal, "Wireless sensor network survey," Computer Networks, vol. 52, no. 12, pp. 2292-2330, 2008.
[2] L. Y. Wang, J.-F. Zhang, and G. G. Yin, "System identification using binary sensors," IEEE Transactions on Automatic Control, vol. 48, no. 11, pp. 1892-1907, 2003.
[3] E. Colinet and J. Juillard, "A weighted least-squares approach to parameter estimation problems based on binary measurements," IEEE Transactions on Automatic Control, vol. 55, no. 1, pp. 148-152, 2010.
[4] F. J. Ordóñez, P. de Toledo, and A. Sanchis, "Activity recognition using hybrid generative/discriminative models on home environments using binary sensors," Sensors, vol. 13, no. 5, pp. 5460-5477, 2013.
[5] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, 2010.
[6] Z. Huang, "Extensions to the K-means algorithm for clustering large data sets with categorical values," Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304, 1998.
[7] T. Richardson and R. Urbanke, "The capacity of low-density parity-check codes under message-passing decoding," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 599-618, 2001.
[8] Z. Xiong, A. Liveris, and S. Cheng, "Distributed source coding for sensor networks," IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 80-94, 2004.
[9] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for K-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045-1062, 2015.
[10] F. Pourkamali-Anaraki and S. Becker, "Preconditioned data sparsification for big data with applications to PCA and K-means," arXiv preprint arXiv:1511.00152, 2015.
[11] M. Davenport, P. T. Boufounos, M. B. Wakin, R. G. Baraniuk et al., "Signal processing with compressive measurements," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 445-460, 2010.
[12] A. Bourrier, R. Gribonval, and P. Pérez, "Compressive Gaussian mixture estimation," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6024-6028.
[13] A. Zebadua, P. O. Amblard, E. Moisan, and O. Michel, "Compressed and quantized correlation estimators," to appear in IEEE Transactions on Signal Processing, 2016.
[14] V. Toto-Zarasoa, A. Roumy, and C. Guillemot, "Maximum likelihood BSC parameter estimation for the Slepian-Wolf problem," IEEE Communications Letters, vol. 15, no. 2, pp. 232-234, 2011.
[15] S. Wang, L. Cui, L. Stankovic, V. Stankovic, and S. Cheng, "Adaptive correlation estimation with particle filtering for distributed video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 5, pp. 649-658, May 2012.
[16] M. S. Rahman and A. B. Wagner, "On the optimality of binning for distributed hypothesis testing," IEEE Transactions on Information Theory, vol. 58, no. 10, pp. 6282-6303, 2012.
[17] G. Katz, P. Piantanida, and M. Debbah, "Distributed binary detection with lossy data compression," arXiv preprint arXiv:1601.01152, 2016.
[18] E. Tuncel and D. Gunduz, "Identification and lossy reconstruction in noisy databases," IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 822-831, 2014.
[19] A. Ingber, T. Courtade, and T. Weissman, "Compression for quadratic similarity queries," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2729-2747, 2015.
[20] T. Li, "A general model for clustering binary data," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 188-197.
[21] D. Arthur and S. Vassilvitskii, "K-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2007, pp. 1027-1035.
[22] E. Dupraz, A. Roumy, and M. Kieffer, "Source coding with side information at the decoder and uncertain knowledge of the correlation," IEEE Transactions on Communications, vol. 62, no. 1, pp. 269-279, 2014.
[23] X.-Y. Hu, E. Eleftheriou, and D. M. Arnold, "Regular and irregular progressive edge-growth Tanner graphs," IEEE Transactions on Information Theory, vol. 51, no. 1, pp. 386-398, 2005.
[24] D. Slepian and J. K. Wolf, "Noiseless coding of correlated information sources," IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 471-480, 1973.