Understanding Big Data Spectral Clustering
Romain Couillet, Florent Benaych-Georges
To cite this version: Romain Couillet, Florent Benaych-Georges. Understanding Big Data Spectral Clustering. IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP 2015), Dec 2015, Cancun, Mexico. <10.1109/CAMSAP.2015.7383728>. HAL Id: hal-01242494 (https://hal.archives-ouvertes.fr/hal-01242494), submitted Dec 2015.
Understanding Big Data Spectral Clustering

Romain COUILLET, Florent BENAYCH-GEORGES

CentraleSupélec, LSS, Université Paris-Sud, Gif-sur-Yvette, France; MAP5, UMR CNRS 8145, Université Paris Descartes, Paris, France.

Abstract: This article introduces an original approach to understanding the behavior of standard kernel spectral clustering algorithms (such as the Ng-Jordan-Weiss method) for large dimensional datasets. Precisely, using advanced methods from the field of random matrix theory and assuming Gaussian data vectors, we show that the Laplacian of the kernel matrix can asymptotically be well approximated by an analytically tractable equivalent random matrix. The analysis of the latter allows one to understand deeply the mechanism at play, and in particular the impact of the choice of the kernel function and some theoretical limits of the method. Despite our Gaussian assumption, we also observe that the predicted theoretical behavior is a close match to that experienced on real datasets (taken from the MNIST database).

I. INTRODUCTION

Letting x_1, …, x_n ∈ R^p be n data vectors, kernel spectral clustering consists in a variety of algorithms designed to cluster these data in an unsupervised manner by retrieving information from the leading eigenvectors of (a possibly modified version of) the so-called kernel matrix K = {K_ij}_{i,j=1}^n with, e.g.,² K_ij = f(||x_i − x_j||²/p) for some f: R → R_+. There are multiple reasons (see e.g., [1]) to expect that the aforementioned eigenvectors contain information about the optimal data clustering. One of the most prominent of these was put forward by Ng, Jordan, and Weiss in [2], who notice that, if the data are ideally well split in k classes C_1, …, C_k that ensure f(||x_i − x_j||²/p) = 0 if and only if x_i and x_j belong to distinct classes, then the eigenvectors associated with the k smallest eigenvalues of I_n − D^{−1/2}KD^{−1/2}, D ≜ D(K 1_n), live in the span of 1_{C_1}, …, 1_{C_k}, the indicator vectors of the classes.
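For concreteness, the Ng-Jordan-Weiss construction just described can be sketched in a few lines of numpy (the function below and its toy Gaussian kernel are our own illustration, not code from the paper):

```python
import numpy as np

def njw_eigenvectors(X, k, f=lambda t: np.exp(-t / 2.0)):
    """Sketch of the Ng-Jordan-Weiss construction: build the kernel
    matrix K_ij = f(||x_i - x_j||^2 / p), then take the eigenvectors
    attached to the k smallest eigenvalues of I - D^{-1/2} K D^{-1/2}.
    (Function and variable names are ours, not from the paper.)"""
    n, p = X.shape
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = f(np.maximum(sq_dist, 0.0) / p)      # kernel matrix
    d = K.sum(axis=1)                        # degrees: D = D(K 1_n)
    Dm = 1.0 / np.sqrt(d)
    Lap = np.eye(n) - Dm[:, None] * K * Dm[None, :]
    vals, vecs = np.linalg.eigh(Lap)         # eigenvalues in ascending order
    U = vecs[:, :k]                          # k smallest carry class information
    # row-normalization, as in the NJW algorithm, before running k-means
    return U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

# Two well-separated classes: the rows of U are (nearly) constant per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(5.0, 0.1, (20, 5))])
U = njw_eigenvectors(X, 2)
```

On well-separated classes the rows of U are nearly constant within each class, so any flat clustering method (e.g., k-means) applied to the rows recovers the classes.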
In the non-trivial case where such a separating f does not exist, one would thus expect the leading eigenvectors to be instead perturbed versions of indicator vectors. We shall precisely study the matrix I_n − D^{−1/2}KD^{−1/2} in this article. Nonetheless, despite this conspicuous argument, very little is known about the actual performance of kernel spectral clustering in actual working conditions. In particular, to the authors' knowledge, there exists no contribution addressing the case of arbitrary p and n. In this article, we propose a new approach consisting in assuming that both p and n are large, and exploiting recent results from random matrix theory. Our method is inspired by [3], which studies the asymptotic distribution of the eigenvalues of K for i.i.d. vectors x_i. We generalize here [3] by assuming that the x_i's are drawn from a mixture of k Gaussian vectors having means μ_1, …, μ_k and covariances C_1, …, C_k. We then go further by studying the resulting model and showing that L = D^{−1/2}KD^{−1/2} can be approximated by a matrix of the so-called spiked model type [4], [5], that is, a matrix with clustered eigenvalues and a few isolated outliers. Among other results, our main findings are:
- in the large n, p regime, only a very local aspect of the kernel function really matters for clustering;
- there exists a critical growth regime (with p and n) of the μ_i's and C_i's for which spectral clustering leads to a non-trivial misclustering probability;
- we precisely analyze elementary toy models, in which the number of exploitable eigenvectors and the influence of the kernel function may vary significantly.
On top of these theoretical findings, we shall observe that, quite unexpectedly, the kernel spectral algorithms behave similarly to our theoretical predictions on real datasets.

¹ Couillet's work is supported by RMT4GRAPH (ANR-14-CE28-0006).
² As shall be seen below, the (non conventional) division by p here is the proper normalization in the large n, p regime.
We precisely see that clustering performed upon a subset of the MNIST (handwritten digits) database behaves as though the vectorized images were extracted from a Gaussian mixture.

Notations: The norm ||·|| stands for the Euclidean norm for vectors and the operator norm for matrices. The vector 1_m ∈ R^m stands for the vector filled with ones. The operator D(v) = D({v_a}_{a=1}^k) is the diagonal matrix having v_1, …, v_k as its diagonal elements. The Dirac mass at x is δ_x.

II. MODEL AND THEORETICAL RESULTS

Let x_1, …, x_n ∈ R^p be independent vectors with x_{n_1+…+n_{l−1}+1}, …, x_{n_1+…+n_l} ∈ C_l for each l ∈ {1, …, k}, where n_0 = 0 and n_1 + … + n_k = n. Class C_a encompasses data x_i = μ_a + w_i for some μ_a ∈ R^p and w_i ∼ N(0, C_a), with C_a ∈ R^{p×p} nonnegative definite. We shall consider the large dimensional regime where both n and p grow simultaneously large. In this regime, we shall require the μ_i's and C_i's to behave in a precise manner. As a matter of fact, we may state as a first result that the following set of assumptions forms the exact regime under which spectral clustering is a non-trivial problem.

Assumption 1 (Growth Rate): As n → ∞, p/n → c_0 > 0 and n_i/n → c_i > 0 (we will write c = [c_1, …, c_k]^T). Besides,
1) for μ° ≜ Σ_{i=1}^k (n_i/n) μ_i and μ°_i ≜ μ_i − μ°, ||μ°_i|| = O(1);
2) for C° ≜ Σ_{i=1}^k (n_i/n) C_i and C°_i ≜ C_i − C°, ||C_a|| = O(1) and tr C°_a = O(√n);
3) (2/p) tr C° converges, as n → ∞, to τ > 0.
The value τ is important since ||x_i − x_j||²/p → τ a.s., uniformly over i ≠ j in {1, …, n}.
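The concentration ||x_i − x_j||²/p → τ underlying Assumption 1 is easy to observe numerically. The following sketch (parameter values are our own choices) builds a two-class Gaussian mixture obeying the growth rates above and checks that all pairwise normalized distances gather around τ = (2/p) tr C°:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 800, 200
# Two classes: x_i = mu_a + w_i with w_i ~ N(0, C_a); here C_a = sigma2_a * I_p,
# and the means differ by an O(1) vector, as required by Assumption 1
sigma2 = np.array([1.0, 1.1])
mus = np.zeros((2, p)); mus[1, 0] = 2.0
labels = rng.integers(0, 2, size=n)
X = mus[labels] + np.sqrt(sigma2[labels])[:, None] * rng.standard_normal((n, p))

# tau = (2/p) tr C°, with C° the proportion-weighted average covariance
c = np.bincount(labels, minlength=2) / n
tau = 2.0 * (c * sigma2).sum()

sq_norms = (X ** 2).sum(axis=1)
sq_dist = (sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T) / p
off = sq_dist[~np.eye(n, dtype=bool)]      # all i != j entries
max_dev = np.abs(off - tau).max()          # uniformly small for large p
```

All n(n−1) normalized distances land within a small, p-dependent neighborhood of τ, which is what makes the entrywise Taylor expansion of K possible in the sequel.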
We now define the kernel function as follows.

Assumption 2 (Kernel function): Function f is three-times continuously differentiable around τ and f(τ) > 0.

Then we introduce the kernel matrix K ≜ {f(||x_i − x_j||²/p)}_{i,j=1}^n. From the previous remark on τ, note that all non-diagonal elements of K tend to f(τ), and thus K can be point-wise developed using a Taylor expansion. However, our interest is in (a slightly modified form of) the Laplacian matrix L ≜ nD^{−1/2}KD^{−1/2}, where D = D(K 1_n) is usually referred to as the degree matrix. Under Assumption 1, L is essentially a rank-one matrix with D^{1/2}1_n as leading eigenvector (with n as eigenvalue). To avoid this singularity, we shall instead study the matrix

L' ≜ nD^{−1/2}KD^{−1/2} − n D^{1/2}1_n 1_n^T D^{1/2} / (1_n^T D 1_n)   (1)

which we shall show to have all its eigenvalues of order O(1).³ Our main technical result shows that there is a matrix L̂ such that ||L' − L̂|| → 0 in probability, where L̂ follows a tractable random matrix model. Before introducing the latter, we need the following fundamental deterministic element notations⁴

M ≜ [μ°_1, …, μ°_k] ∈ R^{p×k}
t ≜ {(1/√p) tr C°_a}_{a=1}^k ∈ R^k
T ≜ {(1/p) tr C°_a C°_b}_{a,b=1}^k ∈ R^{k×k}
J ≜ [j_1, …, j_k] ∈ R^{n×k}
P ≜ I_n − (1/n) 1_n 1_n^T ∈ R^{n×n}

where j_a ∈ R^n is the canonical vector of class C_a, defined by (j_a)_i = δ_{x_i ∈ C_a}, and the random element notations

W ≜ [w_1, …, w_n] ∈ R^{p×n}
Φ ≜ (1/√p) W^T M ∈ R^{n×k}
ψ ≜ {(1/√p)(||w_i||² − E[||w_i||²])}_{i=1}^n ∈ R^n.

Theorem 1 (Random Matrix Equivalent): Let Assumptions 1 and 2 hold and L' be defined by (1). Then, as n → ∞, ||L' − L̂|| → 0 in probability, where L̂ is given by

L̂ ≜ −2 (f'(τ)/f(τ)) [ (1/p) PW^TWP + UBU^T ] + 2 (f'(τ)/f(τ)) F(τ) I_n

³ It is equivalent to study L or L', which have the same eigenvalue-eigenvector pairs, except that the pair (n, D^{1/2}1_n) of L is turned into (0, D^{1/2}1_n) for L'.
⁴ Capital M stands here for means, while t, T account for the vector and matrix of traces, and P for a projection matrix (onto the orthogonal complement of 1_n).
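The rank-one correction in (1) can be verified directly: since K 1_n = D 1_n, the vector D^{1/2}1_n satisfies L D^{1/2}1_n = n D^{1/2}1_n, and subtracting n D^{1/2}1_n 1_n^T D^{1/2}/(1_n^T D 1_n) maps this eigenvalue to 0. A minimal sketch (our own code; any dataset will do):

```python
import numpy as np

def modified_laplacian(X, f=lambda t: np.exp(-t / 2.0)):
    """Build L = n D^{-1/2} K D^{-1/2} and the corrected matrix
    L' = L - n D^{1/2} 1_n 1_n^T D^{1/2} / (1_n^T D 1_n) of equation (1)."""
    n, p = X.shape
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = f(np.maximum(sq_dist, 0.0) / p)
    d = K.sum(axis=1)                        # degree vector, D = D(d)
    Dm = 1.0 / np.sqrt(d)
    L = n * Dm[:, None] * K * Dm[None, :]
    v = np.sqrt(d)                           # v = D^{1/2} 1_n
    Lp = L - n * np.outer(v, v) / d.sum()    # 1_n^T D 1_n = sum(d)
    return L, Lp, d

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
L, Lp, d = modified_laplacian(X)
v = np.sqrt(d)
# L v = n v, while L' v = 0: the trivial pair (n, D^{1/2}1_n) becomes (0, D^{1/2}1_n)
```

All other eigenvalue-eigenvector pairs of L are left untouched by the subtraction, in line with footnote 3.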
with

F(τ) = (f(0) − f(τ) + τf'(τ)) / (2f'(τ))
U ≜ [ (1/√p) J, Φ, ψ ]
B ≜
[ B_11 ,  I_k − 1_k c^T ,  (5f'(τ)/(8f(τ)) − f''(τ)/(2f'(τ))) t ;
  I_k − c 1_k^T ,  0_{k×k} ,  0_k ;
  (5f'(τ)/(8f(τ)) − f''(τ)/(2f'(τ))) t^T ,  0_k^T ,  5f'(τ)/(8f(τ)) − f''(τ)/(2f'(τ)) ]

where B_11 = M^T M + (5f'(τ)/(8f(τ)) − f''(τ)/(2f'(τ))) tt^T − (f''(τ)/f'(τ)) T + (p/n) F(τ) 1_k 1_k^T, and the case f'(τ) = 0 is obtained by extension by continuity (the products f'(τ)F(τ) and f'(τ)B remaining well defined as f'(τ) → 0).

From a mathematical standpoint, excluding the identity matrix, when f'(τ) ≠ 0, L̂ follows a spiked random matrix model, that is, its eigenvalues congregate in bulks except for a few isolated eigenvalues, the eigenvectors of which align to some extent with the eigenvectors of UBU^T. When f'(τ) = 0, L̂ is an even simpler small rank matrix. In both cases, the isolated eigenvalue-eigenvector pairs of L̂ are amenable to analysis.

From a practical aspect, note that U is constituted by the vectors j_i, while B contains the information about the inter-class mean deviations through M, and about the inter-class covariance deviations through t and T. As such, the aforementioned isolated eigenvalue-eigenvector pairs are expected to correlate with the canonical class basis J, and all the more so that M, t, T have sufficiently strong norms. From the point of view of the kernel function f, note that, if f'(τ) = 0, then M vanishes from the expression of L̂, thus not allowing spectral clustering to rely on differences in means. Similarly, if f''(τ) = 0, then T vanishes, and thus differences in shape between the covariance matrices cannot be discriminated upon. Finally, if 5f'(τ)/(8f(τ)) = f''(τ)/(2f'(τ)), then differences in covariance traces are seemingly not exploitable.

Before introducing our main results, we need the following technical assumption, which ensures that (1/p)PW^TWP does not itself produce isolated eigenvalues (and thus, that the isolated eigenvalues of L̂ are solely due to UBU^T).

Assumption 3 (Spike control): Letting λ_1(C_a) ≤ … ≤ λ_p(C_a) be the eigenvalues of C_a, for each a, as n → ∞, (1/p) Σ_{i=1}^p δ_{λ_i(C_a)} → ν_a weakly, with support supp(ν_a), and max_i dist(λ_i(C_a), supp(ν_a)) → 0.
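The spiked-model phenomenon can be illustrated generically (this is not the exact L̂ of Theorem 1): perturbing the bulk matrix (1/p)PW^TWP by a small-rank term pushes an eigenvalue out of the bulk. In the toy example below, the direction u ∝ 1_n and the strength 5 are arbitrary choices of ours; since u lies in the kernel of P, the isolated eigenvalue sits exactly at 5, while the bulk's right edge stays near (1 + √(n/p))² ≈ 2.91:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 800, 400
W = rng.standard_normal((p, n))
P = np.eye(n) - np.ones((n, n)) / n
bulk = P @ (W.T @ W / p) @ P             # eigenvalues spread over a bulk
u = np.ones(n) / np.sqrt(n)              # rank-one direction (in the kernel of P)
spiked = bulk + 5.0 * np.outer(u, u)     # small-rank additive perturbation
vals = np.linalg.eigvalsh(spiked)        # ascending order
# right bulk edge near (1 + sqrt(n/p))^2 ~ 2.91; the spike sits at 5
top, second = vals[-1], vals[-2]
```

The isolated eigenvalue-eigenvector pair is exactly of the kind that Theorems 2 and 3 below characterize for the actual matrix L̂, where the low-rank term UBU^T is not orthogonal to the bulk and the spike location must be computed through the functions g_a and G_z.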
Theorem 2 (Isolated eigenvalues⁵): Let Assumptions 1-3 hold and define the k×k matrix

G_z = h(τ, z) I_k + h(τ, z) M^T ( I_p + Σ_{j=1}^k c_j g_j(z) C_j )^{−1} M + h(τ, z) [ (f''(τ)/f'(τ)) T + ( 5f'(τ)/(8f(τ)) − f''(τ)/(2f'(τ)) ) tt^T ] Γ(z)

⁵ Again, the case f'(τ) = 0 is obtained by extension by continuity.
where

h(τ, z) = 1 + ( 5f'(τ)/(8f(τ)) − f''(τ)/(2f'(τ)) ) Σ_{i=1}^k c_i g_i(z)² (2/p) tr C_i²
Γ(z) = D({c_a g_a(z)}_{a=1}^k) − { c_a g_a(z) c_b g_b(z) / Σ_{i=1}^k c_i g_i(z) }_{a,b=1}^k

and g_1(z), …, g_k(z) are the unique solutions to the system

g_a(z) = ( −z + (1/(c_0 p)) tr C_a ( I_p + Σ_{i=1}^k c_i g_i(z) C_i )^{−1} )^{−1}.

Let ρ, away from the eigenvalue support of (1/p)PW^TWP, be such that h(τ, ρ) ≠ 0 and G_ρ has a zero eigenvalue of multiplicity m_ρ. Then there exist m_ρ eigenvalues of L' asymptotically close to −2(f'(τ)/f(τ)) ρ + (f(0) − f(τ) + τf'(τ))/f(τ).

Let us now turn to the more interesting result concerning the eigenvectors. This result is divided in two subsequent formulas, concerning respectively the eigenvector D^{1/2}1_n associated with the eigenvalue n of L, and the remaining (more interesting) eigenvectors associated with the eigenvalues exhibited in Theorem 2.

Proposition 1 (Eigenvector D^{1/2}1_n): Let Assumptions 1-2 hold true. Then

D^{1/2}1_n / √(1_n^T D 1_n) = (1/√n) [ 1_n + (f'(τ)/(2f(τ)√(c_0 p))) ( {t_a 1_{n_a}}_{a=1}^k + D(v) φ ) ] + o_P(p^{−1/2})

where v ∈ R^n is given by v_i = √((2/p) tr C_a²) for x_i ∈ C_a, and φ ∼ N(0, I_n).

Theorem 3 (Eigenvector projections): Let Assumptions 1-3 hold. Let also λ_j, …, λ_{j+m_ρ} be isolated eigenvalues of L', all converging to ρ as per Theorem 2, and Π_ρ the projector on the eigenspace associated with these eigenvalues. Then, with the notations of Theorem 2,

(1/n) J^T Π_ρ J = h(τ, ρ) Γ(ρ) Σ_{i=1}^{m_ρ} (V_{r,ρ})_i (V_{l,ρ})_i^T / ( (V_{l,ρ})_i^T G'_ρ (V_{r,ρ})_i ) + o_P(1)

where V_{r,ρ} ∈ C^{k×m_ρ} and V_{l,ρ} ∈ C^{k×m_ρ} are sets of right and left eigenvectors of G_ρ associated with the eigenvalue zero, and G'_ρ is the derivative of G_z along z taken at z = ρ.

Proposition 1 provides an accurate characterization of the eigenvector D^{1/2}1_n, which conveys clustering information based mainly on the differences in covariance traces (through t). As for Theorem 3, it states that, as p, n grow large, the alignment between the isolated eigenvectors of L' and the canonical class basis j_1, …, j_k tends to be deterministic in a theoretically tractable manner.
In particular, the quantity (1/n) tr D(c)^{−1} J^T Π_ρ J ∈ [0, m_ρ] evaluates the alignment between Π_ρ and J, thus providing a first hint at the expected performance of spectral clustering. A second interest of Theorem 3 is that, for eigenvectors û of L' of multiplicity one (so Π_ρ = ûû^T), the diagonal elements of (1/n) D(c)^{−1/2} J^T Π_ρ J D(c)^{−1/2} provide the squared mean values of the successive first n_1, then next n_2, etc., elements of û. The off-diagonal elements of (1/n) D(c)^{−1/2} J^T Π_ρ J D(c)^{−1/2} then allow one to decide on the signs of û^T j_i for each i. These pieces of information are again crucial to estimate the expected performance of spectral clustering. However, the statements of Theorems 2 and 3 are difficult to interpret from the onset. They become more explicit when applied to simpler scenarios and allow one to draw interesting conclusions. This is the target of the next section.

III. SPECIAL CASES

In this section, we apply Theorems 2 and 3 to the cases where: (i) C_i = βI_p for all i, with β > 0; (ii) all μ_i's are equal and C_i = (1 + γ_i/√p)βI_p.

Assume first that C_i = βI_p for all i. Then, letting l be an isolated eigenvalue of βI_p + M D(c)M^T, if

l − β > β/√c_0   (2)

then the matrix L' has an eigenvalue (asymptotically) equal to

ρ = −(2f'(τ)/f(τ)) ( l + βl/(c_0(l − β)) ) + (f(0) − f(τ) + τf'(τ))/f(τ).

Besides, we find that

(1/n) J^T Π_ρ J = (1/l) ( 1 − β²/(c_0(l − β)²) ) D(c)M^T Υ_ρ Υ_ρ^T M D(c) + o_P(1)

where Υ_ρ ∈ R^{p×m_ρ} are the eigenvectors of βI_p + M D(c)M^T associated with the eigenvalue l. Aside from the very simple result in itself, note that the choice of f is (asymptotically) irrelevant here. Note also that M D(c)M^T plays an important role, as its eigenvectors rule the behavior of the eigenvectors of L' used for clustering.

Assume now instead that, for each i, μ_i = μ and C_i = (1 + γ_i/√p)βI_p for some γ_1, …, γ_k ∈ R fixed, and we shall denote γ = (γ_1, …, γ_k)^T.
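The case (i) discussion can be confronted with a small simulation. The sketch below (all parameter choices ours) draws two classes with C_1 = C_2 = βI_p and a mean separation well above the threshold in (2), builds L' as in (1), and measures the squared alignment between the leading eigenvector and the normalized class-indicator contrast, the quantity that (1/n)J^T Π_ρ J tracks:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, beta = 200, 400, 1.0
# Case (i): C_1 = C_2 = beta I_p, means separated by an O(1) vector
mu = np.zeros(p); mu[0] = 5.0
labels = np.repeat([0, 1], n // 2)
X = np.sqrt(beta) * rng.standard_normal((n, p)) + labels[:, None] * mu

f = lambda t: np.exp(-t / 2.0)
sq_norms = (X ** 2).sum(axis=1)
sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
K = f(np.maximum(sq_dist, 0.0) / p)
d = K.sum(axis=1); Dm = 1.0 / np.sqrt(d)
L = n * Dm[:, None] * K * Dm[None, :]
v = np.sqrt(d)
Lp = L - n * np.outer(v, v) / d.sum()        # L' of equation (1)

vals, vecs = np.linalg.eigh(Lp)
u_top = vecs[:, -1]                          # isolated (informative) eigenvector
contrast = (labels - labels.mean()) / np.linalg.norm(labels - labels.mean())
alignment = (u_top @ contrast) ** 2          # in [0, 1]; large in this regime
```

In this regime the alignment is well above zero, and thresholding (or k-means on) this single eigenvector recovers the two classes; near the threshold of condition (2), the alignment degrades to zero, as the theory predicts.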
Then, if the separability condition (2) is met, we now find after calculus that there exists at most one isolated eigenvalue in L' (besides n), again equal to

ρ = −(2f'(τ)/f(τ)) ( l + βl/(c_0(l − β)) ) + (f(0) − f(τ) + τf'(τ))/f(τ)

but for

l = β² ( 5f'(τ)/(8f(τ)) − f''(τ)/(2f'(τ)) ) ( 2 + Σ_{i=1}^k c_i γ_i² ).

Moreover,

(1/n) J^T Π_ρ J = ( 1 − β²/(c_0(l − β)²) ) D(c)γγ^T D(c) / (2 + Σ_{i=1}^k c_i γ_i²) + o_P(1).

If the separability condition is not met, then there is no isolated eigenvalue besides n. We note here the importance of an appropriate choice of f. Note also that (1/n) D(c)^{−1/2} J^T Π_ρ J D(c)^{−1/2} is proportional to D(c)^{1/2}γγ^T D(c)^{1/2}, and thus the eigenvector aligns strongly to
D(c)^{1/2}γ itself. Thus the entries of D(c)^{1/2}γ should be quite distinct to achieve good clustering performance.

Fig. 1. Samples from the MNIST database, without and with 10 dB noise.

IV. SIMULATIONS

We complete this article by demonstrating that our results, which apply in theory only to Gaussian x_i's, show a surprisingly similar behavior when applied to real datasets. Here we consider the clustering of n = 3 × 64 = 192 vectorized images of size p = 784 from the MNIST training set database (numbers 0, 1, and 2, as shown in Figure 1). Means and covariances are empirically obtained from the full set of 60 000 MNIST images. The matrix L' is constructed based on f(x) = exp(−x/2). Figure 2 shows that the eigenvalues of both L' and L̂, both in the main bulk and outside, are quite close to one another (precisely, ||L' − L̂||/||L'|| ≈ 0.1). As for the eigenvectors (displayed in decreasing eigenvalue order), they are in an almost perfect match, as shown in Figure 3. In the latter are also shown (in thick blue lines) the theoretical approximated (signed) diagonal values of (1/n) D(c)^{−1/2} J^T Π_ρ J D(c)^{−1/2}, which also show an extremely accurate match between theory and practice. Here, the k-means algorithm applied to the four displayed eigenvectors has a correct clustering rate of 86%. Introducing a 10 dB random additive noise to the same MNIST data brings the approximation error down to ||L' − L̂||/||L'|| ≈ 0.04, and the k-means correct clustering probability to 78%, with only two theoretically exploitable eigenvectors (instead of previously four).

Fig. 2. Eigenvalues of L' and L̂, MNIST data, p = 784, n = 192.

Fig. 3. Leading four eigenvectors of L' (red) versus L̂ (black) and theoretical class-wise means (blue); MNIST data.

V. CONCLUDING REMARKS

The random matrix analysis of kernel matrices constitutes a first step towards a precise understanding of the underlying mechanism of kernel spectral clustering.
Our first theoretical findings allow one to already have a partial understanding of the leading kernel matrix eigenvectors on which clustering is based. Notably, we precisely identified the (asymptotic) linear combinations of the class-basis canonical vectors around which the eigenvectors are centered. Currently on-going work aims at studying in addition the fluctuations of the eigenvectors around the identified means. With all this information, it shall then be possible to precisely evaluate the performance of algorithms such as k-means on the studied datasets. This innovative approach to spectral clustering analysis, we believe, will subsequently allow experimenters to get a clearer picture of the differences between the various classical spectral clustering algorithms (beyond the present Ng-Jordan-Weiss algorithm), and shall eventually allow for the development of finer and better performing techniques, in particular when dealing with high dimensional datasets.

REFERENCES

[1] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[2] A. Y. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Proceedings of Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, vol. 14, pp. 849-856, 2001.
[3] N. El Karoui, "The spectrum of kernel random matrices," The Annals of Statistics, vol. 38, no. 1, pp. 1-50, 2010.
[4] F. Benaych-Georges and R. R. Nadakuditi, "The singular values and vectors of low rank perturbations of large rectangular random matrices," Journal of Multivariate Analysis, vol. 111, pp. 120-135, 2012.
[5] F. Chapon, R. Couillet, W. Hachem, and X. Mestre, "The outliers among the singular values of large rectangular random matrices with additive fixed rank deformation," Markov Processes and Related Fields, vol. 20, pp. 183-228, 2014.