Understanding Big Data Spectral Clustering


Romain Couillet, Florent Benaych-Georges. Understanding Big Data Spectral Clustering. IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP 2015), Dec 2015, Cancun, Mexico. DOI: 10.1109/camsap.2015.7383728. HAL Id: hal-01242494, https://hal.archives-ouvertes.fr/hal-01242494, submitted on 12 Dec 2015.

Understanding Big Data Spectral Clustering

Romain Couillet, Florent Benaych-Georges
CentraleSupélec, LSS, Université Paris-Sud, Gif-sur-Yvette, France; MAP5, UMR CNRS 8145, Université Paris Descartes, Paris, France.

Abstract—This article introduces an original approach to understanding the behavior of standard kernel spectral clustering algorithms (such as the Ng–Jordan–Weiss method) for large dimensional datasets. Precisely, using advanced methods from the field of random matrix theory and assuming Gaussian data vectors, we show that the Laplacian of the kernel matrix can asymptotically be well approximated by an analytically tractable equivalent random matrix. The analysis of the latter allows one to understand deeply the mechanism at play, and in particular the impact of the choice of the kernel function and some theoretical limits of the method. Despite our Gaussian assumption, we also observe that the predicted theoretical behavior is a close match to that experienced on real datasets (taken from the MNIST database).

I. INTRODUCTION

Letting $x_1,\ldots,x_n\in\mathbb{R}^p$ be $n$ data vectors, kernel spectral clustering consists in a variety of algorithms designed to cluster these data in an unsupervised manner by retrieving information from the leading eigenvectors of (a possibly modified version of) the so-called kernel matrix $K=\{K_{ij}\}_{i,j=1}^n$ with, e.g., $K_{ij}=f(\|x_i-x_j\|^2/p)$ for some $f:\mathbb{R}_+\to\mathbb{R}_+$ (as shall be seen below, the non-conventional division by $p$ is the proper normalization in the large $n,p$ regime). There are multiple reasons (see e.g., [1]) to expect that the aforementioned eigenvectors contain information about the optimal data clustering. One of the most prominent of those was put forward by Ng, Jordan, and Weiss in [2], who notice that, if the data are ideally well split in $k$ classes $\mathcal{C}_1,\ldots,\mathcal{C}_k$ such that $f(\|x_i-x_j\|^2/p)=0$ if and only if $x_i$ and $x_j$ belong to distinct classes, then the eigenvectors associated with the $k$ smallest eigenvalues of $I_n-D^{-1/2}KD^{-1/2}$, $D\triangleq\mathcal{D}(K1_n)$, live in the span of $1_{\mathcal{C}_1},\ldots,1_{\mathcal{C}_k}$, the indicator vectors of the classes. In the non-trivial case where such a separating $f$ does not exist, one would thus expect the leading eigenvectors to be instead perturbed versions of these indicator vectors. We shall precisely study the matrix $I_n-D^{-1/2}KD^{-1/2}$ in this article.

Nonetheless, despite this conspicuous argument, very little is known about the actual performance of kernel spectral clustering in real working conditions. In particular, to the authors' knowledge, there exists no contribution addressing the case of arbitrary $p$ and $n$. In this article, we propose a new approach consisting in assuming that both $p$ and $n$ are large, and exploiting recent results from random matrix theory. Our method is inspired by [3], which studies the asymptotic distribution of the eigenvalues of $K$ for i.i.d. vectors $x_i$. We generalize [3] here by assuming that the $x_i$'s are drawn from a mixture of $k$ Gaussian vectors having means $\mu_1,\ldots,\mu_k$ and covariances $C_1,\ldots,C_k$. We then go further by studying the resulting model and showing that $L=D^{-1/2}KD^{-1/2}$ can be approximated by a matrix of the so-called spiked model type [4], [5], that is, a matrix with clustered eigenvalues and a few isolated outliers.

Couillet's work is supported by RMT4GRAPH (ANR-14-CE28-0006).
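To fix ideas, here is a minimal NumPy/SciPy sketch of the Ng–Jordan–Weiss pipeline just described; the helper name `njw_spectral_clustering`, the default Gaussian kernel, and the final k-means step are our illustrative choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.vq import kmeans2

def njw_spectral_clustering(X, k, f=lambda u: np.exp(-u / 2)):
    """Cluster the rows of X (n x p) into k classes, NJW style."""
    n, p = X.shape
    # Kernel matrix K_ij = f(||x_i - x_j||^2 / p); note the division by p.
    K = f(squareform(pdist(X, "sqeuclidean")) / p)
    d = K @ np.ones(n)                      # degree vector, D = D(K 1_n)
    d_isqrt = 1.0 / np.sqrt(d)
    L = np.eye(n) - d_isqrt[:, None] * K * d_isqrt[None, :]  # I - D^{-1/2} K D^{-1/2}
    # Eigenvectors attached to the k smallest eigenvalues carry the classes.
    _, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, :k]
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # NJW row normalization
    _, labels = kmeans2(V, k, minit="++", seed=0)
    return labels
```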
Among other results, our main findings are:
- in the large $n,p$ regime, only a very local aspect of the kernel function really matters for clustering;
- there exists a critical growth regime (with $p$ and $n$) of the $\mu_i$'s and $C_i$'s for which spectral clustering leads to non-trivial misclustering probability;
- we precisely analyze elementary toy models, in which the number of exploitable eigenvectors and the influence of the kernel function may vary significantly.

On top of these theoretical findings, we shall observe that, quite unexpectedly, the kernel spectral algorithms behave similarly to our theoretical findings on real datasets. We precisely see that clustering performed upon a subset of the MNIST (handwritten digits) database behaves as though the vectorized images were extracted from a Gaussian mixture.

Notations: The norm $\|\cdot\|$ stands for the Euclidean norm for vectors and the operator norm for matrices. The vector $1_m\in\mathbb{R}^m$ stands for the vector filled with ones. The operator $\mathcal{D}(v)=\mathcal{D}(\{v_a\}_{a=1}^k)$ is the diagonal matrix having $v_1,\ldots,v_k$ as its diagonal elements. The Dirac mass at $x$ is $\delta_x$.

II. MODEL AND THEORETICAL RESULTS

Let $x_1,\ldots,x_n\in\mathbb{R}^p$ be independent vectors with $x_{n_1+\cdots+n_{l-1}+1},\ldots,x_{n_1+\cdots+n_l}\in\mathcal{C}_l$ for each $l\in\{1,\ldots,k\}$, where $n_0=0$ and $n_1+\cdots+n_k=n$. Class $\mathcal{C}_a$ encompasses data $x_i=\mu_a+w_i$ for some $\mu_a\in\mathbb{R}^p$ and $w_i\sim\mathcal{N}(0,C_a)$, with $C_a\in\mathbb{R}^{p\times p}$ nonnegative definite. We shall consider the large dimensional regime where both $n$ and $p$ grow simultaneously large. In this regime, we shall require the $\mu_i$'s and $C_i$'s to behave in a precise manner. As a matter of fact, we may state as a first result that the following set of assumptions forms the exact regime under which spectral clustering is a non-trivial problem.

Assumption 1 (Growth Rate): As $n\to\infty$, $\frac{p}{n}\to c_0>0$ and $\frac{n_i}{n}\to c_i>0$ (we will write $c=[c_1,\ldots,c_k]^T$). Besides,
1) for $\mu\triangleq\sum_{i=1}^k\frac{n_i}{n}\mu_i$ and $\mu_i^\circ\triangleq\mu_i-\mu$, $\|\mu_i^\circ\|=O(1)$;
2) for $C\triangleq\sum_{i=1}^k\frac{n_i}{n}C_i$ and $C_a^\circ\triangleq C_a-C$, $\|C_a\|=O(1)$ and $\operatorname{tr}C_a^\circ=O(\sqrt n)$;
3) $\frac{2}{p}\operatorname{tr}C$ converges, as $n\to\infty$, to $\tau>0$.

The value $\tau$ is important since $\frac{1}{p}\|x_i-x_j\|^2\to\tau$ almost surely, uniformly over $i\neq j$ in $\{1,\ldots,n\}$.
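As a concrete illustration, the sketch below draws data from this Gaussian mixture with scalings sitting in the non-trivial regime of Assumption 1; the helper name and the toy parameters are our own assumptions.

```python
import numpy as np

def gaussian_mixture(p, sizes, mus, covs, rng):
    """Stack x_i = mu_a + w_i with w_i ~ N(0, C_a), class by class."""
    X = [m[None, :] + rng.multivariate_normal(np.zeros(p), C, size=na)
         for na, m, C in zip(sizes, mus, covs)]
    y = np.repeat(np.arange(len(sizes)), sizes)
    return np.vstack(X), y

rng = np.random.default_rng(0)
p, n1, n2 = 400, 150, 150
mu1, mu2 = np.zeros(p), np.zeros(p)
mu2[0] = 2.0                               # ||mu_1 - mu_2|| = O(1)
C1 = np.eye(p)
C2 = (1 + 2.0 / np.sqrt(p)) * np.eye(p)    # tr(C_2 - C_1) = 2 sqrt(p) = O(sqrt(n))
X, y = gaussian_mixture(p, [n1, n2], [mu1, mu2], [C1, C2], rng)
```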

We now define the kernel function as follows.

Assumption 2 (Kernel function): The function $f$ is three-times continuously differentiable around $\tau$, and $f(\tau)>0$.

We then introduce the kernel matrix $K\triangleq\{f(\|x_i-x_j\|^2/p)\}_{i,j=1}^n$. From the previous remark on $\tau$, note that all non-diagonal elements of $K$ tend to $f(\tau)$, so that $K$ can be point-wise developed through a Taylor expansion. However, our interest is in a (slightly modified form of the) Laplacian matrix $L\triangleq nD^{-1/2}KD^{-1/2}$, where $D=\mathcal{D}(K1_n)$ is usually referred to as the degree matrix. Under Assumption 1, $L$ is essentially a rank-one matrix with $D^{1/2}1_n$ as leading eigenvector (with $n$ as eigenvalue). To avoid this singularity, we shall instead study the matrix

$$L'\triangleq nD^{-1/2}KD^{-1/2}-n\frac{D^{1/2}1_n1_n^TD^{1/2}}{1_n^TD1_n}\tag{1}$$

which we shall show to have all its eigenvalues of order $O(1)$. (It is equivalent to study $L$ or $L'$, which have the same eigenvalue-eigenvector pairs, except that the pair $(n,D^{1/2}1_n)$ of $L$ is turned into $(0,D^{1/2}1_n)$ for $L'$.)

Our main technical result shows that there is a matrix $\hat{L}$ such that $\|L'-\hat{L}\|\to0$ in probability, where $\hat{L}$ follows a tractable random matrix model. Before introducing the latter, we need the fundamental deterministic element notations

$$M\triangleq[\mu_1^\circ,\ldots,\mu_k^\circ]\in\mathbb{R}^{p\times k},\qquad t\triangleq\left\{\frac{1}{\sqrt p}\operatorname{tr}C_a^\circ\right\}_{a=1}^k\in\mathbb{R}^k,\qquad T\triangleq\left\{\frac{1}{p}\operatorname{tr}C_a^\circ C_b^\circ\right\}_{a,b=1}^k\in\mathbb{R}^{k\times k},$$

$$J\triangleq[j_1,\ldots,j_k]\in\mathbb{R}^{n\times k},\qquad P\triangleq I_n-\frac{1}{n}1_n1_n^T\in\mathbb{R}^{n\times n},$$

where $j_a\in\mathbb{R}^n$ is the canonical vector of class $\mathcal{C}_a$, defined by $(j_a)_i=\delta_{x_i\in\mathcal{C}_a}$ (the capital $M$ stands for means, while $t$ and $T$ account for the vector and matrix of traces, and $P$ is the projection matrix onto the orthogonal complement of $1_n$), as well as the random element notations

$$W\triangleq[w_1,\ldots,w_n]\in\mathbb{R}^{p\times n},\qquad\Phi\triangleq W^TM\in\mathbb{R}^{n\times k},\qquad\psi\triangleq\left\{\frac{1}{\sqrt p}\left(\|w_i\|^2-E\|w_i\|^2\right)\right\}_{i=1}^n\in\mathbb{R}^n.$$

Theorem 1 (Random Matrix Equivalent): Let Assumptions 1 and 2 hold and let $L'$ be defined by (1). Then, as $n\to\infty$, $\|L'-\hat{L}\|\to0$ in probability, where $\hat{L}$ is given by

$$\hat{L}\triangleq-2\frac{f'(\tau)}{f(\tau)}\left[\frac{1}{p}PW^TWP+UBU^T\right]+2\frac{f'(\tau)}{f(\tau)}F(\tau)I_n,\qquad F(\tau)\triangleq\frac{f(0)-f(\tau)+\tau f'(\tau)}{2f'(\tau)},$$

$$U\triangleq\left[\frac{1}{\sqrt p}J,\ \Phi,\ \psi\right],\qquad B\triangleq\begin{bmatrix}B_{11}&I_k-1_kc^T&\left(\frac{5f'(\tau)}{8f(\tau)}-\frac{f''(\tau)}{2f'(\tau)}\right)t\\[2pt] I_k-c1_k^T&0_{k\times k}&0_k\\[2pt] \left(\frac{5f'(\tau)}{8f(\tau)}-\frac{f''(\tau)}{2f'(\tau)}\right)t^T&0_k^T&\frac{5f'(\tau)}{8f(\tau)}-\frac{f''(\tau)}{2f'(\tau)}\end{bmatrix},$$

$$B_{11}\triangleq M^TM+\left(\frac{5f'(\tau)}{8f(\tau)}-\frac{f''(\tau)}{2f'(\tau)}\right)tt^T-\frac{f''(\tau)}{f'(\tau)}T+\frac{n}{p}F(\tau)1_k1_k^T,$$

and the case $f'(\tau)=0$ is obtained by extension by continuity ($f'(\tau)B$ being well defined in the limit $f'(\tau)\to0$).

From a mathematical standpoint, excluding the identity matrix, when $f'(\tau)\neq0$, $\hat{L}$ follows a spiked random matrix model: its eigenvalues congregate in bulks, except for a few isolated eigenvalues, the eigenvectors of which align to some extent with the eigenvectors of $UBU^T$. When $f'(\tau)=0$, $\hat{L}$ is an even simpler small-rank matrix. In both cases, the isolated eigenvalue-eigenvector pairs of $\hat{L}$ are amenable to analysis.

From a practical aspect, note that $U$ is constituted by the vectors $j_i$, while $B$ contains the information about the inter-class mean deviations through $M$, and about the inter-class covariance deviations through $t$ and $T$. As such, the aforementioned isolated eigenvalue-eigenvector pairs are expected to correlate with the canonical class basis $J$, all the more so when $M$, $t$, and $T$ have sufficiently strong norms. From the point of view of the kernel function $f$, note that if $f'(\tau)=0$, then $M$ vanishes from the expression of $\hat{L}$, thus not allowing spectral clustering to rely on differences in means. Similarly, if $f''(\tau)=0$, then $T$ vanishes, and thus differences in shape between the covariance matrices cannot be discriminated upon. Finally, if $\frac{5f'(\tau)}{8f(\tau)}=\frac{f''(\tau)}{2f'(\tau)}$, then differences in covariance traces are seemingly not exploitable.
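In code, Eq. (1) translates directly; a minimal sketch (helper name ours) is given below. Subtracting the rank-one direction $D^{1/2}1_n$ is what brings the remaining spectrum back to order $O(1)$.

```python
import numpy as np

def modified_laplacian(K):
    """L' = n D^{-1/2} K D^{-1/2} - n D^{1/2} 1 1^T D^{1/2} / (1^T D 1), cf. Eq. (1)."""
    n = K.shape[0]
    d = K @ np.ones(n)                                   # degrees, D = diag(d)
    L = n * K / np.sqrt(np.outer(d, d))                  # n D^{-1/2} K D^{-1/2}
    L -= n * np.outer(np.sqrt(d), np.sqrt(d)) / d.sum()  # drop the rank-one spike
    return L
```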
Before introducing our main results, we need the following technical assumption, which ensures that $\frac{1}{p}PW^TWP$ does not itself produce isolated eigenvalues (and thus that the isolated eigenvalues of $\hat{L}$ are solely due to $UBU^T$).

Assumption 3 (Spike control): Letting $\lambda_1(C_a)\leq\cdots\leq\lambda_p(C_a)$ be the eigenvalues of $C_a$, for each $a$, as $n\to\infty$, $\frac{1}{p}\sum_{i=1}^p\delta_{\lambda_i(C_a)}\to\nu_a$ in distribution, with support $\operatorname{supp}(\nu_a)$, and $\max_i\operatorname{dist}\left(\lambda_i(C_a),\operatorname{supp}(\nu_a)\right)\to0$.

Theorem 2 (Isolated eigenvalues; again, the case $f'(\tau)=0$ is obtained by extension by continuity): Let Assumptions 1–3 hold and define the $k\times k$ matrix

$$G_z=h(\tau,z)I_k+h(\tau,z)M^T\left(I_p+\sum_{j=1}^kc_jg_j(z)C_j\right)^{-1}M-h(\tau,z)\left(\frac{f''(\tau)}{f'(\tau)}T+\left(\frac{5f'(\tau)}{8f(\tau)}-\frac{f''(\tau)}{2f'(\tau)}\right)tt^T\right)\Gamma(z)$$

where

$$h(\tau,z)=1+\left(\frac{5f'(\tau)}{8f(\tau)}-\frac{f''(\tau)}{2f'(\tau)}\right)\sum_{i=1}^kc_ig_i(z)^2\,\frac{2}{p}\operatorname{tr}C_i^2,$$

$$\Gamma(z)=\mathcal{D}\left(\{c_ag_a(z)\}_{a=1}^k\right)-\left\{\frac{c_ag_a(z)\,c_bg_b(z)}{\sum_{i=1}^kc_ig_i(z)}\right\}_{a,b=1}^k,$$

and $g_1(z),\ldots,g_k(z)$ are the unique solutions to the system

$$c_0\,g_a(z)=-\left(z+\frac{1}{p}\operatorname{tr}\left[C_a\left(I_p+\sum_{i=1}^kc_ig_i(z)C_i\right)^{-1}\right]\right)^{-1}.$$

Let $\rho$, away from the eigenvalue support of $\frac{1}{p}PW^TWP$, be such that $h(\tau,\rho)\neq0$ and $G_\rho$ has a zero eigenvalue of multiplicity $m_\rho$. Then there exist $m_\rho$ eigenvalues of $L'$ asymptotically close to

$$-2\frac{f'(\tau)}{f(\tau)}\rho+\frac{f(0)-f(\tau)+\tau f'(\tau)}{f(\tau)}.$$

Let us now turn to the more interesting results concerning the eigenvectors. These are divided into two subsequent statements, concerning respectively the eigenvector $D^{1/2}1_n$ associated with the eigenvalue $n$ of $L$, and the remaining (more interesting) eigenvectors associated with the eigenvalues exhibited in Theorem 2.

Proposition 1 (Eigenvector $D^{1/2}1_n$): Let Assumptions 1–2 hold. Then

$$\frac{D^{1/2}1_n}{\sqrt{1_n^TD1_n}}=\frac{1_n}{\sqrt n}+\frac{1}{\sqrt n}\frac{1}{\sqrt{c_0p}}\left[\frac{f'(\tau)}{2f(\tau)}\left\{t_a1_{n_a}\right\}_{a=1}^k+\mathcal{D}\left(\left\{\sqrt{\frac{2}{p}\operatorname{tr}C_a^2}\ 1_{n_a}\right\}_{a=1}^k\right)\varphi\right]+o_P\left(\frac{1}{\sqrt n}\right)$$

for some $\varphi\sim\mathcal{N}(0,I_n)$.

Theorem 3 (Eigenvector projections): Let Assumptions 1–3 hold. Let also $\lambda_{j+1},\ldots,\lambda_{j+m_\rho}$ be isolated eigenvalues of $L'$, all converging to $\rho$ as per Theorem 2, and let $\Pi_\rho$ be the projector on the eigenspace associated with these eigenvalues. Then, with the notations of Theorem 2,

$$\frac{1}{n}J^T\Pi_\rho J=h(\tau,\rho)\,\Gamma(\rho)\sum_{i=1}^{m_\rho}\frac{(V_{r,\rho})_i(V_{l,\rho})_i^T}{(V_{l,\rho})_i^TG'_\rho(V_{r,\rho})_i}+o_P(1),$$

where $V_{r,\rho}\in\mathbb{C}^{k\times m_\rho}$ and $V_{l,\rho}\in\mathbb{C}^{k\times m_\rho}$ are sets of right and left eigenvectors of $G_\rho$ associated with the eigenvalue zero, and $G'_\rho$ is the derivative of $G_z$ along $z$ taken at $z=\rho$.

Proposition 1 provides an accurate characterization of the eigenvector $D^{1/2}1_n$, which conveys clustering information based mainly on the differences in covariance traces (through $t$). As for Theorem 3, it states that, as $p,n$ grow large, the alignment between the isolated eigenvectors of $L'$ and the canonical class basis $j_1,\ldots,j_k$ tends to be deterministic in a theoretically tractable manner. In particular, the quantity $\frac{1}{n}\operatorname{tr}\mathcal{D}(c^{-1})J^T\Pi_\rho J\in[0,m_\rho]$ evaluates the alignment between $\Pi_\rho$ and $J$, thus providing a first hint at the expected performance of spectral clustering. A second interest of Theorem 3 is that, for an eigenvector $\hat{u}$ of $L'$ of multiplicity one (so that $\Pi_\rho=\hat{u}\hat{u}^T$), the diagonal elements of $\frac{1}{n}\mathcal{D}(c^{-\frac12})J^T\Pi_\rho J\mathcal{D}(c^{-\frac12})$ provide the squared mean values of $\hat{u}$ over the successive classes (the first $j_1$-indexed entries, then the next $j_2$-indexed ones, etc.). The off-diagonal elements of $\frac{1}{n}\mathcal{D}(c^{-\frac12})J^T\Pi_\rho J\mathcal{D}(c^{-\frac12})$ then allow one to decide on the signs of $\hat{u}^Tj_i$ for each $i$. These pieces of information are again crucial to estimate the expected performance of spectral clustering.

However, the statements of Theorems 2 and 3 are difficult to interpret from the onset. They become more explicit when applied to simpler scenarios, where they allow one to draw interesting conclusions. This is the target of the next section.

III. SPECIAL CASES

In this section, we apply Theorems 2 and 3 to the cases where: (i) $C_i=\beta I_p$ for all $i$, with $\beta>0$; (ii) all $\mu_i$'s are equal and $C_i=(1+\frac{\gamma_i}{\sqrt p})\beta I_p$.

Assume first that $C_i=\beta I_p$ for all $i$. Then, letting $l$ be an isolated eigenvalue of $\beta I_p+M\mathcal{D}(c)M^T$, if

$$l-\beta>\frac{\beta}{\sqrt{c_0}}\tag{2}$$

then the matrix $L'$ has an eigenvalue asymptotically equal to

$$\rho=-2\frac{f'(\tau)}{f(\tau)}\left(l+\frac{\beta l}{c_0(l-\beta)}\right)+\frac{f(0)-f(\tau)+\tau f'(\tau)}{f(\tau)}.$$

Besides, we find that

$$\frac{1}{n}J^T\Pi_\rho J=\frac{1}{l}\left(1-\frac{\beta^2}{c_0(l-\beta)^2}\right)\mathcal{D}(c)M^T\Upsilon_\rho\Upsilon_\rho^TM\mathcal{D}(c)+o_P(1),$$

where $\Upsilon_\rho\in\mathbb{R}^{p\times m_\rho}$ are the eigenvectors of $\beta I_p+M\mathcal{D}(c)M^T$ associated with the eigenvalue $l$. Aside from the very simple result in itself, note that the choice of $f$ is (asymptotically) irrelevant here. Note also that $M\mathcal{D}(c)M^T$ plays an important role, as its eigenvectors rule the behavior of the eigenvectors of $L'$ used for clustering.
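For simulated data with known labels, the alignment matrix of Theorem 3 can be estimated empirically from any computed eigenvector; the sketch below (function name ours) returns $\frac{1}{n}\mathcal{D}(c^{-\frac12})J^T\hat{u}\hat{u}^TJ\mathcal{D}(c^{-\frac12})$, whose diagonal and off-diagonal entries are read as described above.

```python
import numpy as np

def class_alignment(u, labels, k):
    """Empirical (1/n) D(c^{-1/2}) J^T (u u^T) J D(c^{-1/2}) for a unit vector u."""
    n = u.shape[0]
    J = np.stack([(labels == a).astype(float) for a in range(k)], axis=1)  # n x k
    c = J.sum(axis=0) / n                 # class proportions c_a = n_a / n
    A = (J.T @ np.outer(u, u) @ J) / n    # (1/n) J^T Pi J with Pi = u u^T
    Dc = np.diag(c ** -0.5)
    return Dc @ A @ Dc                    # diagonal: squared class-wise means of u
```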
Assume now instead that, for each $i$, $\mu_i=\mu$ and $C_i=(1+\frac{\gamma_i}{\sqrt p})\beta I_p$ for some fixed $\gamma_1,\ldots,\gamma_k\in\mathbb{R}$, and let us denote $\gamma=(\gamma_1,\ldots,\gamma_k)^T$. Then, if the separability condition (2) is met, we now find after calculus that there exists at most one isolated eigenvalue in $L'$ (besides $n$), again equal to

$$\rho=-2\frac{f'(\tau)}{f(\tau)}\left(l+\frac{\beta l}{c_0(l-\beta)}\right)+\frac{f(0)-f(\tau)+\tau f'(\tau)}{f(\tau)}$$

but now for $l=\beta^2\left(\frac{5f'(\tau)}{8f(\tau)}-\frac{f''(\tau)}{2f'(\tau)}\right)\left(2+\sum_{i=1}^kc_i\gamma_i^2\right)$. Moreover,

$$\frac{1}{n}J^T\Pi_\rho J=\left(1-\frac{\beta^2}{c_0(\beta-l)^2}\right)\frac{\mathcal{D}(c)\gamma\gamma^T\mathcal{D}(c)}{2+\sum_{i=1}^kc_i\gamma_i^2}+o_P(1).$$

If the separability condition is not met, then there is no isolated eigenvalue besides $n$. We note here the importance of an appropriate choice of $f$. Note also that $\frac{1}{n}\mathcal{D}(c^{-\frac12})J^T\Pi_\rho J\mathcal{D}(c^{-\frac12})$ is proportional to $\mathcal{D}(c^{\frac12})\gamma\gamma^T\mathcal{D}(c^{\frac12})$, and thus the eigenvector aligns strongly with $\mathcal{D}(c^{\frac12})\gamma$ itself. Hence the entries of $\mathcal{D}(c^{\frac12})\gamma$ should be quite distinct to achieve good clustering performance.
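The isotropic case (i) lends itself to a quick numerical sanity check. The sketch below compares the largest observed eigenvalue of $L'$ with the formula for $\rho$ as transcribed above; the toy parameters are our own choices, and only rough agreement should be expected at finite $n,p$ (or if our transcription of the constants is imperfect).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
p, n, beta = 800, 400, 1.0
c0 = p / n
mu = np.zeros(p); mu[0] = 3.0                    # two classes with means -mu/2, +mu/2
X = np.vstack([rng.normal(0.0, np.sqrt(beta), (n // 2, p)) - mu / 2,
               rng.normal(0.0, np.sqrt(beta), (n // 2, p)) + mu / 2])

f = lambda u: np.exp(-u / 2)                     # Gaussian kernel
fp = lambda u: -np.exp(-u / 2) / 2               # its derivative f'
tau = 2 * beta                                   # limit of (2/p) tr C for C = beta I_p
K = f(squareform(pdist(X, "sqeuclidean")) / p)
d = K @ np.ones(n)
Lp = n * K / np.sqrt(np.outer(d, d)) - n * np.outer(np.sqrt(d), np.sqrt(d)) / d.sum()

l = beta + (mu @ mu) / 4                         # spike of beta I + M D(c) M^T, k = 2
if l - beta > beta / np.sqrt(c0):                # separability condition (2)
    rho = -2 * fp(tau) / f(tau) * (l + beta * l / (c0 * (l - beta))) \
          + (f(0) - f(tau) + tau * fp(tau)) / f(tau)
    print(f"predicted {rho:.3f} vs observed {np.linalg.eigvalsh(Lp)[-1]:.3f}")
```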

Fig. 1. Samples from the MNIST database, without and with 10 dB noise.

IV. SIMULATIONS

We complete this article by demonstrating that our results, which in theory apply only to Gaussian $x_i$'s, show a surprisingly similar behavior when applied to real datasets. Here we consider the clustering of $n=3\times64=192$ vectorized images of size $p=784$ from the MNIST training set database (numbers 0, 1, and 2, as shown in Figure 1). Means and covariances are empirically obtained from the full set of 60 000 MNIST images. The matrix $L'$ is constructed based on $f(x)=\exp(-x/2)$. Figure 2 shows that the eigenvalues of both $L'$ and $\hat{L}$, in the main bulk as well as outside, are quite close to one another (precisely, $\|L'-\hat{L}\|/\|L'\|\approx0.1$). As for the eigenvectors (displayed in decreasing eigenvalue order), they are in an almost perfect match, as shown in Figure 3. The latter also displays (in thick blue lines) the theoretical approximate (signed) diagonal values of $\frac{1}{n}\mathcal{D}(c^{-\frac12})J^T\Pi_\rho J\mathcal{D}(c^{-\frac12})$, which show an extremely accurate match between theory and practice. Here, the k-means algorithm applied to the four displayed eigenvectors has a correct clustering rate of 86%. Introducing a 10 dB random additive noise on the same MNIST data brings the approximation error down to $\|L'-\hat{L}\|/\|L'\|\approx0.04$ and the k-means correct clustering probability to 78%, with only two theoretically exploitable eigenvectors instead of the previous four.

Fig. 2. Eigenvalues of $L'$ and $\hat{L}$, MNIST data, $p=784$, $n=192$.

Fig. 3. Leading four eigenvectors of $L'$ (red) versus $\hat{L}$ (black) and theoretical class-wise means (blue); MNIST data.

V. CONCLUDING REMARKS

The random matrix analysis of kernel matrices constitutes a first step towards a precise understanding of the underlying mechanism of kernel spectral clustering. Our first theoretical findings already allow for a partial understanding of the leading kernel matrix eigenvectors on which clustering is based. Notably, we precisely identified the (asymptotic) linear combination of the class-basis canonical vectors around which the eigenvectors are centered. Currently ongoing work additionally aims at studying the fluctuations of the eigenvectors around the identified means. With all this information, it shall then be possible to precisely evaluate the performance of algorithms such as k-means on the studied datasets. This innovative approach to spectral clustering analysis, we believe, will subsequently allow experimenters to get a clearer picture of the differences between the various classical spectral clustering algorithms (beyond the present Ng–Jordan–Weiss algorithm), and shall eventually allow for the development of finer and better performing techniques, in particular when dealing with high dimensional datasets.
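A rough reproduction sketch of this experiment follows. The data access through scikit-learn's `fetch_openml`, the pixel rescaling, and the exact image selection are our assumptions (the paper does not specify its preprocessing), so the obtained rate need not match the reported 86%.

```python
import itertools
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.vq import kmeans2
from sklearn.datasets import fetch_openml        # assumed available for MNIST access

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
keep = np.isin(mnist.target, ["0", "1", "2"])
X = mnist.data[keep][:192] / 255.0               # n = 3 x 64 = 192 images, p = 784
y = mnist.target[keep][:192].astype(int)
n, p = X.shape

K = np.exp(-squareform(pdist(X, "sqeuclidean")) / (2 * p))   # f(x) = exp(-x/2)
d = K @ np.ones(n)
Lp = n * K / np.sqrt(np.outer(d, d)) - n * np.outer(np.sqrt(d), np.sqrt(d)) / d.sum()
w, V = np.linalg.eigh(Lp)
U = V[:, np.argsort(w)[::-1][:4]]                # four leading eigenvectors of L'
_, labels = kmeans2(U, 3, minit="++", seed=0)
rate = max(np.mean(np.array(perm)[labels] == y)  # best label permutation
           for perm in itertools.permutations(range(3)))
print(f"correct clustering rate ~ {rate:.0%}")
```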
REFERENCES

[1] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[2] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems, vol. 14. Cambridge, MA: MIT Press, 2001, pp. 849–856.
[3] N. El Karoui, "The spectrum of kernel random matrices," The Annals of Statistics, vol. 38, no. 1, pp. 1–50, 2010.
[4] F. Benaych-Georges and R. R. Nadakuditi, "The singular values and vectors of low rank perturbations of large rectangular random matrices," Journal of Multivariate Analysis, vol. 111, pp. 120–135, 2012.
[5] F. Chapon, R. Couillet, W. Hachem, and X. Mestre, "The outliers among the singular values of large rectangular random matrices with additive fixed rank deformation," Markov Processes and Related Fields, vol. 20, pp. 183–228, 2014.