The spectrum of kernel random matrices

Noureddine El Karoui
Department of Statistics, University of California, Berkeley

Abstract

We place ourselves in the setting of high-dimensional statistical inference, where the number of variables $p$ in a dataset of interest is of the same order of magnitude as the number of observations $n$. We consider the spectrum of certain kernel random matrices, in particular $n \times n$ matrices whose $(i,j)$-th entry is $f(X_i'X_j/p)$ or $f(\|X_i - X_j\|^2/p)$, where $p$ is the dimension of the data and the $X_i$ are independent data vectors. Here $f$ is assumed to be a locally smooth function. The study is motivated by questions arising in statistics and computer science, where these matrices are used to perform, among other things, non-linear versions of principal component analysis. Surprisingly, we show that in high dimensions, and for the models we analyze, the problem becomes essentially linear, which is at odds with heuristics sometimes used to justify the usage of these methods. The analysis also highlights certain peculiarities of models widely studied in random matrix theory and raises some questions about their relevance as tools to model high-dimensional data encountered in practice.

1 Introduction

Recent years have seen newfound theoretical interest in the properties of large dimensional sample covariance matrices. With the increase in the size and dimensionality of datasets to be analyzed, questions have been raised about the practical relevance of information derived from classical asymptotic results concerning spectral properties of sample covariance matrices. To address these concerns, one line of analysis has been the consideration of asymptotics where both the sample size $n$ and the number of variables $p$ in the dataset go to infinity, jointly, while assuming for instance that $p/n$ has a limit.

Questions of this type concerning the spectral properties of large dimensional matrices have been and are being addressed in a variety of fields, from physics to various areas of mathematics. While the topic
is classical, with the seminal contribution of Wigner (1955) dating back to the 1950s, there has been renewed and vigorous interest in the study of large dimensional random matrices in the last decade or so. This has led to new insights and the appearance of new canonical distributions (Tracy and Widom (1994)), new tools (see Voiculescu (2000)) and, in Statistics, a sense that one needs to exert caution with familiar techniques of multivariate analysis when the dimension of the data gets large and the sample size is of the same order of magnitude as that dimension.

So far in Statistics, this line of work has been concerned mostly with the properties of sample covariance matrices. In a seminal paper, Marčenko and Pastur (1967) showed a result that, from a statistical standpoint, may be interpreted as saying, roughly, that asymptotically, the histogram of the eigenvalues of a sample (i.e. random) covariance matrix is a deterministic non-linear deformation of the histogram of the eigenvalues of the population covariance matrix. Remarkably, they managed to characterize this deformation for fairly general population covariances. Their result was shown in great generality, and introduced new tools to the field, including one that has become ubiquitous, the Stieltjes transform of a distribution. In its best known form, their result says that when the population covariance is identity, and hence all the population eigenvalues are equal to 1, in the limit the sample eigenvalues are split and, if $p \le n$, they are spread between $[(1 - \sqrt{p/n})^2, (1 + \sqrt{p/n})^2]$, according to a fully explicit density, known now as the density of the Marčenko-Pastur law. Their result was later re-discovered independently in Wachter (1978) (under slightly weaker conditions), and generalized to the case of non-diagonal covariance matrices in Silverstein (1995), under some particular distributional assumptions, which we discuss later in the paper.

[Footnote: I would like to thank Bin Yu for stimulating my interest in the questions considered in this paper and for interesting discussions on the topic. I would like to thank Elizabeth Purdom for discussions about kernel analysis and Peter Bickel for many stimulating discussions about random matrices and their relevance in statistics. I would also like to thank an anonymous referee for useful and constructive comments that resulted in an improved presentation of the paper. Support from NSF grant DMS is gratefully acknowledged. AMS 2000 SC: Primary: 62H10; Secondary: 60F99. Key words and phrases: covariance matrices, kernel matrices, eigenvalues of covariance matrices, multivariate statistical analysis, high-dimensional inference, random matrix theory, machine learning, Hadamard matrix functions, concentration of measure. Contact: nkaroui@stat.berkeley.edu]

On the other hand, recent developments have been concerned with fine properties of the largest eigenvalue of random matrices, which became amenable to analysis after mathematical breakthroughs which happened in the 1990s (see Tracy and Widom (1994), Tracy and Widom (1996) and Tracy and Widom (1998)). Classical statistical work on the joint distribution of eigenvalues of sample covariance matrices (see Anderson (2003) for a good reference) then became usable for analysis in high dimensions. In particular, in the case of Gaussian distributions, with Id covariance, it was shown in Johnstone (2001) and El Karoui (2003) that the largest eigenvalue of the sample covariance matrix is Tracy-Widom distributed. More recent progress (El Karoui (2007c)) managed to carry out the analysis for essentially general population covariance. On the other hand, models for which the population covariance has a few separated eigenvalues have also been of interest: see for instance Paul (2007) and Baik and Silverstein (2006). Besides the particulars of the different types of fluctuations that can be encountered (Tracy-Widom, Gaussian or other), researchers have been able to precisely localize these largest eigenvalues. One interesting aspect of those results is the fact that in the high-dimensional setting of
interest to us, the largest eigenvalues are always positively biased, with the bias being sometimes large. (We also note that in the case of i.i.d. data, which naturally is less interesting in statistics, results on the localization of the largest eigenvalue have been available for quite some time now, after the works of Geman (1980) and Yin et al. (1988), to cite a few.) This is naturally in sharp contrast to classical results of multivariate analysis, which show $\sqrt{n}$-consistency of all sample eigenvalues, though the possibility of bias is a simple consequence of Jensen's inequality.

On the other hand, there has been much less theoretical work on kernel random matrices. By this term, we mean matrices with $(i,j)$ entry of the form $M_{i,j} = k(X_i, X_j)$, where $M$ is an $n \times n$ matrix, $X_i$ is a $p$-dimensional data vector, and $k$ is a function of two variables, often called a kernel, that may depend on $n$. Common choices of kernels include, for instance, $k(X_i, X_j) = f(\|X_i - X_j\|_2^2/t)$, where $f$ is a function and $t$ is a scalar, or $k(X_i, X_j) = f(X_i'X_j/t)$. For the function $f$, common choices include $f(x) = \exp(-x)$, $f(x) = \exp(-x^a)$ for a certain scalar $a$, $f(x) = (1 + x)^a$, or $f(x) = \tanh(a + bx)$, where $b$ is a scalar. We refer the reader to Rasmussen and Williams (2006), Chapter 4, or Williams and Seeger (2000) for more examples.

In particular, we are not aware of any work in the setting of high-dimensional data analysis, where $p$ grows with $n$. However, given the practical success and flexibility of these methods (we refer to Schölkopf and Smola (2002) for an introduction), it is natural to try to investigate theoretically their properties. Further, as illustrated in the data analytic part of Williams and Seeger (2000), an $n/p$ boundedness assumption is not unrealistic as far as applications of kernel methods are concerned. One aim of the present paper is to shed some theoretical light on the properties of these kernel random matrices, and to do so in relatively wide generality. We note that the choice of renormalization that we make below is
motivated in part by the arguments of Williams and Seeger (2000) and their practical choices of kernels for data of varying dimensions.

Existing theory on kernel random matrices (see for instance the interesting Koltchinskii and Giné (2000)), for fixed dimensional input data, predicts that the eigenvalues of kernel random matrices behave, at least for the largest ones, like the eigenvalues of the corresponding operator on $L^2(dP)$, if the data is i.i.d. with probability distribution $P$. To be more precise, if $X_i$ is a sequence of i.i.d. random variables with distribution $P$, under regularity conditions on the kernel $k(x, y)$, it was shown in Koltchinskii and Giné (2000) that, for any index $l$, the $l$-th largest eigenvalue of the kernel matrix $M$, with entries $M_{i,j} = \frac{1}{n} k(X_i, X_j)$,

converges to the $l$-th largest eigenvalue of the operator $K$ defined as

$$Kf(x) = \int k(x, y) f(y)\, dP(y).$$

These insights have also been derived through more heuristic but nonetheless enlightening arguments in, for instance, Williams and Seeger (2000). Further, more precise fluctuation results are also given in Koltchinskii and Giné (2000). We also note interesting work on Laplacian eigenmaps (see e.g. Belkin and Niyogi (2008)) where, among other things, results have been obtained showing convergence of eigenvalues and eigenvectors of certain Laplacian random matrices (which are quite closely connected to kernel random matrices) computed from data sampled from a manifold, to corresponding quantities for the Laplace-Beltrami operator on the manifold.

These results are in turn used in the literature to explain the behavior of non-linear versions of standard procedures of multivariate statistics, such as Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA) or Independent Component Analysis (ICA). We refer the reader to Schölkopf et al. (2004) for an introduction to kernel-PCA, and to Bach and Jordan (2003) for an introduction to kernel-CCA and kernel-ICA. At the heart of these techniques are the spectral properties of kernel random matrices. Because these techniques are used in bioinformatics, a field where large datasets are common and becoming the norm, it is natural to ask what can be said about these spectral properties for high-dimensional data.

We show that for the models we analyze (ICA-type models and generalizations that go beyond the linear setting of ICA), kernel random matrices essentially behave like sample covariance matrices and hence their eigenvalues suffer from the same bias problems that affect sample covariance matrices in high dimensions. In particular, if one were to try to apply the heuristics of Williams and Seeger (2000), which were developed for low-dimensional problems, to the high-dimensional case, the predictions would be quite wildly wrong. (A simple example is provided by the Gaussian
kernel with i.i.d. Gaussian data, where the computations can be done completely explicitly, as explained in Williams and Seeger (2000).) We also note that the scaling we use is different from the one used in low dimensions, where the matrices are scaled by $1/n$. This is because the high-dimensional problem would be completely degenerate if we used this normalization in our setting. However, our results still give information about the problem when it is scaled by $1/n$.

From a random matrix point of view, our study is connected to the study of Euclidean random matrices and distance matrices, which is of some interest in, for instance, Physics. We refer to Bogomolny et al. (2003) and Bordenave (2006) for work in this direction in the low (or fixed) dimensional setting. We also note that at the level of generality we place ourselves in, the random matrices we study do not seem to be amenable to study through the classical tools of random matrix theory. Hence, besides their obvious statistical interest, they are also interesting on purely mathematical grounds.

We now turn to the gist of our paper, which will show that high-dimensional kernel random matrices behave spectrally essentially like matrices closely connected to sample covariance matrices. We will get two types of results: in Theorems 1 and 2, we get a strong approximation result (in operator norm) for standard models (ICA-like) studied in random matrix theory. In Theorems 3 and 4, we characterize the limiting spectral distribution of our kernel random matrices, for a wider class of data distributions. In Section 2, we also state clearly the consequences of our theorems and review the relevant theory of high-dimensional sample covariance matrices. From a technical standpoint, we adopt a point of view centered on the concentration of measure phenomenon, as exposed for instance in Ledoux (2001), as it provides a unified way to treat the two types of results we are interested in. Finally, we discuss in our (self-contained) conclusion (Section 3) the consequences of our
results, and in particular some possible limitations of standard random matrix models as tools to model data encountered in practice, focusing on geometric properties of datasets drawn according to those models. As explained in more detail there, vectors drawn according to these standard random matrix models essentially live close to spheres and are almost orthogonal to one another, a property that may or may not be present in datasets to be analyzed and can be seen as a key to many classical and less classical random matrix results (see also El Karoui (2007a)).
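
The geometric remark above (vectors living close to a sphere and being almost orthogonal to one another) can be illustrated with a small simulation. The following is only a sketch under assumptions that are not taken from the paper: identity covariance, standard Gaussian entries, and the hypothetical helper names `draw_vector` and `geometry_summary`. It reports the range of $\|X_i\|_2^2/p$ and the largest $|X_i'X_j|/p$ over distinct pairs.

```python
import random


def draw_vector(p, rng):
    """Return a p-dimensional vector with i.i.d. standard Gaussian entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(p)]


def geometry_summary(n, p, seed=0):
    """Summarize normalized squared norms and normalized inner products.

    Returns (min ||X_i||^2 / p, max ||X_i||^2 / p, max_{i != j} |X_i'X_j| / p)
    for n independent vectors of dimension p.
    """
    rng = random.Random(seed)
    vectors = [draw_vector(p, rng) for _ in range(n)]
    sq_norms = [sum(x * x for x in v) / p for v in vectors]
    inner = [
        abs(sum(a * b for a, b in zip(vectors[i], vectors[j]))) / p
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return min(sq_norms), max(sq_norms), max(inner)


# Illustrative sizes only: n = 20 vectors in dimension p = 2000.
lo, hi, worst = geometry_summary(n=20, p=2000)
print("normalized squared norms lie in [%.3f, %.3f]" % (lo, hi))
print("largest normalized inner product: %.3f" % worst)
```

With these sizes the normalized squared norms cluster near 1 and the normalized inner products stay near 0, which is the near-sphere, near-orthogonality picture described in the text; other entry distributions or covariances would change the constants.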

2 Spectrum of kernel random matrices

Kernel random matrices do not seem to be amenable to analysis through the usual tools of random matrix theory. In particular, for general $f$, it seems difficult to carry out either a method of moments proof, or a Stieltjes transform proof, or a proof that relies on knowing the density of the eigenvalues of the matrix. Hence, we take an indirect approach. Our strategy is to find approximations of the kernel random matrix that have two properties. First, the approximation matrix is analyzable or has already been analyzed in random matrix theory. Second, the quality of the approximation is good enough that spectral properties of the approximating matrix can be shown to carry over to the kernel matrix.

The strategy in the first two theorems is to derive an operator norm consistent approximation of our kernel matrix. In other words, if we call $M$ our kernel matrix, we will find $K$ such that $\|M - K\|_2 \to 0$ as $n$ and $p$ tend to $\infty$. Note that both $M$ and $K$ are real symmetric (and hence Hermitian) here. We explain after the statement of Theorem 1 why operator norm consistency is a desirable property. In a nutshell, it implies consistency for each individual eigenvalue as well as for eigenspaces corresponding to separated eigenvalues.

For the second set of theorems (Theorems 3 and 4), we will relax the distributional assumptions made on the data, but at the expense of the precision of the results we will obtain: we will characterize the limiting spectral distribution of our kernel random matrices.

Our theorems below show that kernel random matrices can be well approximated by matrices that are closely connected to large-dimensional covariance matrices. The spectral properties of those matrices have been the subject of a significant amount of work in recent and less recent years, and hence this knowledge, or at least part of it, can be transferred to kernel random matrices. In particular, we refer the reader to Marčenko and Pastur (1967), Wachter (1978), Geman (1980), Yin et al. (1988), Silverstein
(1995), Bai and Silverstein (1998), Johnstone (2001), Baik and Silverstein (2006), Paul (2007), El Karoui (2007c), Bai et al. (2007) and El Karoui (2007a) for some of the most statistically relevant results in this area. We review some of them now.

2.1 Some results on large dimensional sample covariance matrices

Since our main theorems are approximating theorems, we first wish to state some of the properties of the objects we will use to approximate kernel random matrices. In what follows, we consider an $n \times p$ data matrix, with, say, $p/n$ having a finite non-zero limit. Most of the results that have been obtained are of two types: either they are so-called bulk results and concern essentially the spectral distribution (or, loosely speaking, the histogram of eigenvalues) of the random matrices of interest, or they concern the localization and fluctuation behavior of extreme eigenvalues of these random matrices.

2.1.1 Spectral distribution results

An object of interest in random matrix theory is the spectral distribution of random matrices. Let us call $l_i$ the decreasingly ordered eigenvalues of our random matrix, and let us assume we are working with an $n \times n$ matrix, $M_n$. The empirical spectral distribution of an $n \times n$ matrix is the probability measure which puts mass $1/n$ at each of its eigenvalues. In other words, if we call $F_n$ this probability measure, we have

$$dF_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_{l_i}(x).$$

Note that the histogram of eigenvalues represents an integrated version of this measure. For random matrices, this measure $F_n$ is naturally a random measure. A key result in the area of covariance matrices is that if we observe i.i.d. data vectors $X_i$, with $X_i = \Sigma_p^{1/2} Y_i$, where $\Sigma_p$ is a positive semi-definite matrix and $Y_i$ is a vector with i.i.d. entries, under weak moment conditions on $Y_i$ and assuming that the spectral distribution of $\Sigma_p$ has a limit (in the sense of weak convergence of distributions), $F_n$ converges to a non-random measure, which we call $F$.
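
Because the empirical spectral distribution puts equal mass on each eigenvalue, its $k$-th moment is a normalized trace of the $k$-th matrix power, which can be computed without any eigenvalue routine. The sketch below uses this to look at the first two moments for the $p \times p$ matrix $X'X/n$ when $\Sigma_p = \mathrm{Id}$. Everything specific in it is an assumption made for illustration: the $\pm 1$ entries, the sizes, and the helper names `sample_covariance` and `spectral_moments`. The comparison value $1 + p/n$ for the second moment is the second moment of the Marčenko-Pastur law with identity covariance, a standard fact that is not stated in the text above.

```python
import random


def sample_covariance(n, p, rng):
    """Return the p x p matrix X'X/n for an n x p matrix X of i.i.d. +/-1 entries."""
    X = [[rng.choice((-1.0, 1.0)) for _ in range(p)] for _ in range(n)]
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    return [
        [sum(a * b for a, b in zip(cols[j], cols[k])) / n for k in range(p)]
        for j in range(p)
    ]


def spectral_moments(S):
    """First two moments of the empirical spectral distribution of S.

    Since the distribution puts mass 1/p at each eigenvalue, they equal
    trace(S)/p and trace(S^2)/p.
    """
    p = len(S)
    m1 = sum(S[j][j] for j in range(p)) / p
    m2 = sum(S[j][k] * S[k][j] for j in range(p) for k in range(p)) / p
    return m1, m2


# Illustrative sizes only: n = 120 observations, p = 60 variables, so p/n = 0.5.
n, p = 120, 60
m1, m2 = spectral_moments(sample_covariance(n, p, random.Random(0)))
print("first moment: %.3f (population value 1)" % m1)
print("second moment: %.3f (population value 1, Marchenko-Pastur value %.3f)"
      % (m2, 1 + p / n))
```

The second moment is visibly larger than the population value 1, a first hint that the sample eigenvalues are spread out even though all population eigenvalues are equal.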

We call the models $X_i = \Sigma_p^{1/2} Y_i$ the standard models of random matrix theory because most results have been derived under these assumptions. In particular, various results (Geman (1980), Bai and Silverstein (1998), Bai and Silverstein (1999)) show, among many other things, that when the entries of the vector $Y$ have 4 (absolute) moments, the largest eigenvalues of the sample covariance matrix $X'X/n$, where $X_i$ now occupies the $i$-th row of the $n \times p$ matrix $X$, stay close to the endpoint of the support of $F$.

A natural question is therefore to try to characterize $F$. Except in particular situations, it is difficult to do so explicitly. However, it is possible to characterize a certain transformation of $F$. The tool of choice in this context is the Stieltjes transform of a distribution. It is a function defined on $\mathbb{C}^+$ by the formula, if we call $St_F$ the Stieltjes transform of $F$,

$$St_F(z) = \int \frac{dF(\lambda)}{\lambda - z}, \quad \mathrm{Im}[z] > 0.$$

In particular for empirical spectral distributions, we see that, if $F_n$ is the spectral distribution of the matrix $M_n$,

$$St_{F_n}(z) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{l_i - z} = \frac{1}{n} \mathrm{trace}\left((M_n - z\,\mathrm{Id})^{-1}\right).$$

The importance of the Stieltjes transform in the context of random matrix theory stems from two facts. On the one hand, it is connected fairly explicitly to the matrices that are being analyzed. On the other hand, pointwise convergence of Stieltjes transforms implies weak convergence of distributions, if a certain mass preservation condition is satisfied. This is how a number of bulk results are proved. For a clear and self-contained introduction to the connection between Stieltjes transforms and weak convergence of probability measures, we refer the reader to Geronimo and Hill (2003).

The result of Marčenko and Pastur (1967), later generalized by Silverstein (1995) for standard random matrix models with non-diagonal covariance, and more recently by e.g. El Karoui (2007a) away from those standard models, is a functional characterization of the limit $F$. If one calls $w_n(z)$ the Stieltjes transform of the empirical spectral distribution
of $XX'/n$, $w_n(z)$ converges pointwise (and almost surely after Silverstein (1995)) to a non-random $w(z)$, which, as a function, is a Stieltjes transform. Moreover, $w$, the Stieltjes transform of $F$, satisfies the equation, if $p/n \to \rho$, $\rho > 0$:

$$-\frac{1}{w(z)} = z - \rho \int \frac{\lambda\, dH(\lambda)}{1 + \lambda w},$$

where $H$ is the limiting spectral distribution of $\Sigma_p$, assuming that such a distribution exists. We note that Silverstein (1995) proved the result under a second moment condition on the entries of $Y_i$.

From this result, Marčenko and Pastur (1967) derived that in the case where $\Sigma_p = \mathrm{Id}$, and hence $dH = \delta_1$, the empirical spectral distribution has a limit whose density is, if $\rho \le 1$,

$$f_\rho(x) = \frac{1}{2\pi\rho} \frac{\sqrt{(b - x)(x - a)}}{x},$$

where $a = (1 - \rho^{1/2})^2$ and $b = (1 + \rho^{1/2})^2$. The difference between the population spectral distribution (a point mass at 1, of mass 1) and the limit of the empirical spectral distribution is quite striking.

2.1.2 Largest eigenvalues results

Another line of work has been focused on the behavior of extreme eigenvalues of sample covariance matrices. In particular, Geman (1980) showed, under some moment conditions, that when $\Sigma_p = \mathrm{Id}$, $l_1(X'X/n) \to (1 + \sqrt{p/n})^2$ almost surely. In other words, the largest eigenvalue stays close to the endpoint of the limiting spectral distribution of $X'X/n$. This result was later generalized in Yin et al. (1988), and shown to be true under the assumption of finite 4th moment only, for data with mean 0. In recent years, fluctuation results have been obtained for this largest eigenvalue, which is of practical interest in Principal Components Analysis (PCA). Under Gaussian assumptions, Johnstone (2001) and El Karoui (2003) (see also Forrester (1993) and Johansson (2000)) showed that the fluctuations of the largest eigenvalue are Tracy-Widom distributed. For the general covariance case, similar results, as well as localization information, were

recently obtained in El Karoui (2007c). We note that the localization information (i.e. a formula) that was discovered in this latter paper was shown to hold for a wide variety of standard random matrix models, through appeal to Bai and Silverstein (1998). We refer the interested reader to Fact 2 in El Karoui (2007c) for more information.

Interesting work has also been done on so-called spiked models, where a few population eigenvalues are separated from the bulk of them. In particular, in the case where all population eigenvalues are equal, except for one that is significantly larger (see Baik et al. (2005) for the discovery of an interesting phase transition), Paul (2007) showed, in the Gaussian case, inconsistency of the largest sample eigenvalue, as well as the fact that the angle between the population and sample principal eigenvectors is bounded away from 0. Paul (2007) also obtained fluctuation information about the largest empirical eigenvalue. Finally, we note that the same inconsistency of eigenvalue result was also obtained in Baik and Silverstein (2006), beyond the Gaussian case.

2.1.3 Notations

Let us now define some notations and add some clarifications. We denote by $A'$ the transpose of $A$. The matrices we will be working with all have real entries. We remind the reader that if $A$ and $B$ are two rectangular matrices, $AB$ and $BA$ have the same eigenvalues, except for, possibly, a certain number of zeros. We will make repeated use of this fact, e.g. for matrices like $X'X$ and $XX'$. In the case where $A$ and $B$ are both square, $AB$ and $BA$ have exactly the same eigenvalues.

We will also need various norms on matrices. We will use the so-called operator norm, which we denote by $\|A\|_2$, which corresponds to the largest singular value of $A$, i.e. $\max_i \sqrt{l_i(A'A)}$. We occasionally denote the largest singular value of $A$ by $\sigma_1(A)$. Clearly, for positive semi-definite matrices, the largest singular value is equal to the largest eigenvalue. Finally, we will sometimes need to use the Frobenius (or Hilbert-Schmidt) norm of a matrix $A$. We denote
it by $\|A\|_F$. By definition, it is simply, because we are working with matrices with real entries,

$$\|A\|_F^2 = \sum_{i,j} A_{i,j}^2.$$

Further, we use $\circ$ to denote the Hadamard (i.e. entrywise) product of two matrices. We denote by $\mu_m$ the $m$-th moment of a random variable. Note that by a slight abuse of notation, we might also use the same notation to refer to the $m$-th absolute moment (i.e. $E|X|^m$) of a random variable, but if there is any ambiguity, we will naturally make precise which definition we are using.

Finally, in the discussion of standard random matrix models that follows, there will be arrays of random variables and a.s. convergence. We work with random variables defined on a common probability space. To each $\omega$ corresponds an infinite dimensional array of numbers. Unless otherwise noted, the $n \times p$ matrices we will use in what follows are the upper-left corner of this array.

We now turn to the study of kernel random matrices. We will show that we can approximate them by matrices that are closely connected to sample covariance matrices in high dimensions and, therefore, that a number of the results we just reviewed also apply to them.

2.2 Inner-product kernel matrices: $f(X_i'X_j/p)$

Theorem 1 (Spectrum of inner-product kernel random matrices). Let us assume that we observe $n$ i.i.d. random vectors, $X_i$ in $\mathbb{R}^p$. Let us consider the kernel matrix $M$ with entries

$$M_{i,j} = f\left(\frac{X_i'X_j}{p}\right).$$

We assume that:

(a) $n \asymp p$, i.e. $n/p$ and $p/n$ remain bounded as $p \to \infty$.

(b) $\Sigma_p$ is a positive semi-definite $p \times p$ matrix, and $\|\Sigma_p\|_2 = \sigma_1(\Sigma_p)$ remains bounded in $p$, i.e. there exists $K > 0$ such that $\sigma_1(\Sigma_p) \le K$ for all $p$.

(c) $\mathrm{trace}(\Sigma_p)/p$ has a finite limit, i.e. there exists $l \in \mathbb{R}$ such that $\lim_{p \to \infty} \mathrm{trace}(\Sigma_p)/p = l$.

(d) $X_i = \Sigma_p^{1/2} Y_i$.

(e) The entries of $Y_i$, a $p$-dimensional random vector, are i.i.d. Also, denoting by $Y_i(k)$ the $k$-th entry of $Y_i$, we assume that $E(Y_i(k)) = 0$, $\mathrm{var}(Y_i(k)) = 1$ and $E(|Y_i(k)|^{4+\epsilon}) < \infty$ for some $\epsilon > 0$. (We say that $Y_i$ has $4 + \epsilon$ absolute moments.)

(f) $f$ is a $C^1$ function in a neighborhood of $l = \lim_{p \to \infty} \mathrm{trace}(\Sigma_p)/p$ and a $C^3$ function in a neighborhood of 0.

Under these assumptions, the kernel matrix $M$ can (in probability) be approximated consistently in operator norm, when $p$ and $n$ tend to $\infty$, by the matrix $K$, where

$$K = \left(f(0) + f''(0) \frac{\mathrm{trace}(\Sigma_p^2)}{2p^2}\right) 11' + f'(0) \frac{XX'}{p} + \upsilon_p\, \mathrm{Id}_n,$$

$$\upsilon_p = f\left(\frac{\mathrm{trace}(\Sigma_p)}{p}\right) - f(0) - f'(0) \frac{\mathrm{trace}(\Sigma_p)}{p}.$$

In other words, $\|M - K\|_2 \to 0$, in probability, when $p \to \infty$.

The advantages of obtaining an operator norm consistent estimator are many. We list some here:

• Asymptotically, $M$ and $K$ have the same $j$-th largest eigenvalue, for any $j$: this is simply because for symmetric matrices, if $l_j$ is the $j$-th largest eigenvalue of a matrix, Weyl's inequality (see e.g. Corollary III.2.6 in Bhatia (1997)) implies that $|l_j(M) - l_j(K)| \le \|M - K\|_2$. Hence our result implies that $|l_j(M) - l_j(K)| \to 0$ in probability as $p$ and $n$ go to infinity.

• The limiting spectral distributions of $M$ and $K$ (if they exist) are the same. This is a consequence of Lemma 1, p. 21 below. So in particular, when $K$ has a limiting spectral distribution (in the sense of weak convergence of probability measures), the empirical spectral distribution of $M$ converges to that distribution (in the sense of weak convergence of distributions) in probability.

• We have subspace consistency for eigenspaces corresponding to separated eigenvalues. (For a proof, we refer to El Karoui (2007b), Corollary 3.) So, when $K$ has eigenvalues that stay separated from the bulk of this matrix's eigenvalues, then $M$ has in probability the same property, and the angle between the corresponding eigenspaces for $K$ and $M$ goes to 0 in probability.

(Note that the statements we just made assume that both $M$ and $K$ are symmetric, which is the case here.)

The strategy for the proof is the following. According to the results of Lemma A-3, the matrix with entries $X_i'X_j/p$
has small entries off the diagonal, whereas on the diagonal, the entries are essentially constant and equal to $\mathrm{trace}(\Sigma_p)/p$. Hence, it is natural to try to use the $\delta$-method (i.e. do a Taylor expansion) entry by entry. By contrast to standard problems in Statistics, the fact that we have to perform $n^2$ of those Taylor expansions means that the second order term is not negligible a priori. The proof shows that this approach can be carried out rigorously, and that, perhaps surprisingly, the second order term is not too complicated to approximate in operator norm. It is also shown that the third order term plays essentially no role.

Before we start the proof, we want to mention that we will drop the index $p$ in $\Sigma_p$ below to avoid cumbersome notations. Let us also note, more technically, that an important step of the proof is to show that, when the $Y_i$'s have enough moments, they can be treated without much error in spectral results as bounded random variables, the bound depending on the number of moments, $n$ and $p$. This then enables us to use concentration results for convex Lipschitz functions of independent bounded random variables at various important points of the proof and also in Lemma A-3, whose results underlie much of the approach taken here.
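
To make the statement of Theorem 1 concrete, here is a rough numerical sketch that builds $M$ and the approximating matrix $K$ and estimates $\|M - K\|_2$. All specifics are assumptions chosen for illustration and do not come from the paper: $f = \exp$ (so $f(0) = f'(0) = f''(0) = 1$), $\Sigma_p = \mathrm{Id}$ (so $\mathrm{trace}(\Sigma_p)/p = 1$ and $\mathrm{trace}(\Sigma_p^2)/p^2 = 1/p$), Gaussian entries, small sizes, and the hypothetical helpers `operator_norm` and `kernel_and_approximation`. The operator norm is estimated by plain power iteration, which gives a lower bound that improves with the number of iterations.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def operator_norm(A, iters=200, seed=1):
    """Estimate the operator norm of a symmetric matrix A by power iteration.

    The returned value ||A v|| / ||v|| never exceeds the true norm.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(A))]
    est = 0.0
    for _ in range(iters):
        w = [dot(row, v) for row in A]
        norm_w = math.sqrt(dot(w, w))
        if norm_w == 0.0:
            return 0.0
        est = norm_w / math.sqrt(dot(v, v))
        v = [x / norm_w for x in w]
    return est


def kernel_and_approximation(n, p, seed=0):
    """Return (M, K) for f = exp and Sigma = Id (illustrative choices)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    G = [[dot(X[i], X[j]) / p for j in range(n)] for i in range(n)]  # XX'/p
    M = [[math.exp(G[i][j]) for j in range(n)] for i in range(n)]
    # f(0) + f''(0) trace(Sigma^2) / (2 p^2), with trace(Sigma^2) = p here.
    const = 1.0 + 1.0 / (2.0 * p)
    # upsilon = f(tau) - f(0) - f'(0) tau, with tau = trace(Sigma)/p = 1 here.
    upsilon = math.e - 2.0
    K = [
        [const + G[i][j] + (upsilon if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return M, K


M, K = kernel_and_approximation(n=60, p=60)
diff = [[M[i][j] - K[i][j] for j in range(60)] for i in range(60)]
err = operator_norm(diff)
size = operator_norm(M)
print("estimated ||M - K||_2 = %.3f, estimated ||M||_2 = %.3f" % (err, size))
```

The theorem is asymptotic, so no particular value of the error should be expected at these sizes; the sketch only shows how the three pieces of $K$ (the constant matrix, the rescaled Gram matrix and the diagonal shift) are assembled.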

Proof. First, let us call $\tau \triangleq \mathrm{trace}(\Sigma)/p$. Using Taylor expansions, we can rewrite our kernel matrix as:

$$f(X_i'X_j/p) = f(0) + f'(0) \frac{X_i'X_j}{p} + \frac{f''(0)}{2} \left(\frac{X_i'X_j}{p}\right)^2 + \frac{f^{(3)}(\xi_{i,j})}{6} \left(\frac{X_i'X_j}{p}\right)^3, \quad \text{if } i \ne j,$$

$$f(\|X_i\|_2^2/p) = f(\tau) + f'(\xi_{i,i}) \left(\frac{\|X_i\|_2^2}{p} - \tau\right) \quad \text{on the diagonal}.$$

The proof can be separated into different steps. We will break the kernel matrix into a diagonal term and an off-diagonal term. The results of Lemma A-3, after they are shown, will allow us to take care of the diagonal matrix at relatively low cost. So we postpone that part of the analysis to the end of the proof and we first focus on the off-diagonal matrix.

In what follows, we call second order term the matrix $A$ with entries

$$A_{i,j} = \frac{f''(0)}{2} \left(\frac{X_i'X_j}{p}\right)^2 1_{i \ne j}.$$

We call third order term the matrix $B$ with entries

$$B_{i,j} = \frac{f^{(3)}(\xi_{i,j})}{6} \left(\frac{X_i'X_j}{p}\right)^3 1_{i \ne j}.$$

The off-diagonal matrix is the sum $A + B$.

A) Study of the off-diagonal matrix

Truncation and centralization. Following the arguments of Lemma 2.2 in Yin et al. (1988), we see that because we have assumed that we have $4 + \epsilon$ absolute moments, and $n \asymp p$, the array $Y = \{Y_{i,j}\}_{1 \le i \le n,\, 1 \le j \le p}$ is almost surely equal to the array $\tilde{Y}$ of same dimensions, with

$$\tilde{Y}_{i,j} = Y_{i,j} 1_{|Y_{i,j}| \le B_p}, \quad \text{where } B_p = p^{1/2 - \delta} \text{ and } \delta > 0.$$

We will therefore carry out the analysis on this $\tilde{Y}$ array. Note that most of the results we will rely on require vectors of i.i.d. entries with mean 0. Of course, $\tilde{Y}_{i,j}$ has in general a mean different from 0. In other words, if we call $\mu = E(\tilde{Y}_{i,j})$, we need to show that we do not lose anything in operator norm by replacing the $\tilde{Y}_i$'s by $U_i$'s with $U_i = \tilde{Y}_i - \mu 1$. Note that, as seen in Lemma A-3, by plugging in $t = 1/2 - \delta$ in the notation of this lemma, which corresponds to the $4 + \epsilon$ moment assumption here, we have $|\mu| \le p^{-3/2 - \delta}$.

Now let us call $S$ the matrix $XX'/p$, except that its diagonal is replaced by zeros. From Yin et al. (1988), and the fact that $n/p$ stays bounded, we know that $\|XX'/p\|_2 \le \sigma_1(\Sigma) \|YY'\|_2/p$ stays bounded. Using Lemma A-3, we see that the diagonal of $XX'/p$ stays bounded a.s. in operator norm. Therefore, $\|S\|_2$ is bounded a.s. Now, as in the proof of Lemma
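A-3, we have

$$S_{i,j} = \frac{U_i'\Sigma U_j}{p} + \mu \left(\frac{1'\Sigma U_j}{p} + \frac{1'\Sigma U_i}{p}\right) + \mu^2 \frac{1'\Sigma 1}{p} \triangleq \frac{U_i'\Sigma U_j}{p} + R_{i,j} \quad \text{a.s.}$$

Note that this equality is true a.s. only because it involves replacing $Y$ by $\tilde{Y}$. The proof of Lemma A-3 shows that

$$|R_{i,j}| \le |\mu|\, 2\sigma_1^{1/2}(\Sigma) \left(\sigma_1^{1/2}(\Sigma) + p^{-\delta/2}\right) + \mu^2 \sigma_1(\Sigma) \quad \text{a.s.}$$

We conclude that, for some constant $C$,

$$\|R\|_F^2 \le C n^2 \mu^2 \le C n^2 p^{-3 - 2\delta} \quad \text{a.s.}$$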

Therefore $\|R\|_2 \to 0$ a.s. In other words, if we call $S_U$ the matrix with $(i,j)$ entry $U_i'\Sigma U_j/p$ off the diagonal and 0 on the diagonal, $\|S - S_U\|_2 \to 0$ a.s.

Now it is a standard result on Hadamard products (see for instance Bhatia (1997), Problem I.6.13, or Horn and Johnson (1994), Theorems 5.5.1 and 5.5.15) that for two matrices $A$ and $B$, $\|A \circ B\|_2 \le \|A\|_2 \|B\|_2$. Since the Hadamard product is commutative, we have

$$S \circ S - S_U \circ S_U = (S + S_U) \circ (S - S_U).$$

We conclude that

$$\|S \circ S - S_U \circ S_U\|_2 \le \|S - S_U\|_2 \left(\|S\|_2 + \|S_U\|_2\right) \to 0 \quad \text{a.s.},$$

since $\|S - S_U\|_2 \to 0$ a.s., and $\|S\|_2$ and hence $\|S_U\|_2$ stay bounded, a.s.

The conclusion of this study is that to approximate the second order term in operator norm, it is enough to work with $S_U$ and not $S$, and hence, very importantly, with bounded random variables with zero mean. Further, the proof of Lemma A-3 makes clear that $\sigma_U^2$, the variance of the $U_{i,j}$'s, goes to 1, the variance of the $Y_{i,j}$'s, very fast. So if we can approximate the matrix with $(i,j)$-entry $U_i'\Sigma U_j/(p\sigma_U^2)$ consistently in operator norm by a matrix whose operator norm is bounded, this same matrix will constitute an operator norm approximation of $U_i'\Sigma U_j/p$. In other words, we can assume that, when working with matrices of dimension $n \times p$, the random variables we will be working with have variance 1 without loss of generality, and that they have mean 0 and are bounded by $B_p$, $B_p$ depending on $p$ and going to infinity.

Control of the second order term. We now focus on approximating in operator norm the matrix with $(i,j)$-th entry

$$\frac{f''(0)}{2} \left(\frac{X_i'X_j}{p}\right)^2 1_{i \ne j}.$$

As we just explained, we assume from now on in all the work concerning the second order term that the vectors $Y_i$ have mean 0, and that their entries have variance 1 and are bounded by $B_p = p^{1/2 - \delta}$. This is because we just saw that replacing $Y_i$ by $U_i/\sigma_U$ would not change (a.s. and asymptotically) the operator norm of the matrix to be studied. We note that to make clear that the truncation depends on $p$, we might have wanted to use the notation $Y_i^{(p)}$, but since there will be no ambiguity in the proof, we chose to use the less cumbersome
notation Y i The control of the second order term turns out to be the most delicate art of the analysis, and the only lace where we need the assumtion that X i = Σ 1/2 Y i Let us call W the matrix with entries Note that, when i j, W i,j = { (X i X j 2, if i j 2 0, if i = j E (W i,j = E ( trace ( X ix j X jx i / 2 = E ( trace ( X j X jx i X i / 2 = trace ( Σ 2 / 2 Because we assume that trace (Σ / has a finite limit, and n/ stays bounded away from 0, we see that the matrix E (W has a largest eigenvalue that, in general, does not go to 0 Note also that under our assumtions, E (W i,j = O(1/ Our aim is to show that W can be aroximated in oerator norm by this constant matrix So let us consider the matrix W with entries W i,j = { (X i X j 2 trace ( Σ 2 / 2, if i j 2 0, if i = j Simle comutations show that the exected Frobenius norm squared of this matrix does ( not go to 0 Hence more subtle arguments are needed to control its oerator norm We will show that E trace ( W 4 ( goes to zero, which imlies that E W 4 2 goes to zero, because W is real symmetric The elements contributing to trace ( W 4 are generally of the form W i,j Wj,k Wk,l Wl,i We are going to study these terms according to how many indices are equal to each other 9
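The contrast drawn above, between a Frobenius norm that does not vanish and an operator norm that should, can be explored numerically. The following is only a minimal sketch under assumptions that are not in the text: it takes $\Sigma = \mathrm{Id}$ (so that $\operatorname{trace}(\Sigma^2)/p^2 = 1/p$), Gaussian entries for $Y_i$, and $n = p/2$. The function name `second_order_matrices` is hypothetical.

```python
import numpy as np

def second_order_matrices(n, p, rng):
    """Build W and its centered version, assuming Sigma = Id (illustrative choice)."""
    Y = rng.standard_normal((n, p))   # i.i.d. entries with mean 0 and variance 1
    S = Y @ Y.T / p                   # inner products X_i'X_j / p (X_i = Y_i here)
    W = S * S                         # entrywise (Hadamard) square
    np.fill_diagonal(W, 0.0)          # W_{i,i} = 0 by definition
    W_tilde = W - 1.0 / p             # subtract trace(Sigma^2)/p^2 = 1/p off the diagonal
    np.fill_diagonal(W_tilde, 0.0)    # keep the diagonal at 0
    return W, W_tilde

rng = np.random.default_rng(0)
for p in (100, 400, 1600):
    n = p // 2                        # n/p held fixed, an assumption of this sketch
    W, W_tilde = second_order_matrices(n, p, rng)
    op_norm = np.linalg.norm(W_tilde, 2)       # largest singular value
    fro_norm = np.linalg.norm(W_tilde, "fro")  # Frobenius norm
    print(f"p={p:5d}  operator norm={op_norm:.4f}  Frobenius norm={fro_norm:.4f}")
```

If the argument of this section applies to this toy setting, the printed operator norm should decrease as $p$ grows while the Frobenius norm stays roughly constant; the sketch is an illustration, not a substitute for the moment computation.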

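i) Terms involving 4 different indices: $i \neq j \neq k \neq l$

We first focus on the case where all these indices $(i, j, k, l)$ are different. Recall that $X_i = \Sigma^{1/2}Y_i$, where $Y_i$ has i.i.d. entries. We want to compute $E(\widetilde{W}_{i,j}\widetilde{W}_{j,k}\widetilde{W}_{k,l}\widetilde{W}_{l,i})$, so it is natural to focus first on

$$E(\widetilde{W}_{i,j}\widetilde{W}_{j,k}\widetilde{W}_{k,l}\widetilde{W}_{l,i} \mid Y_i, Y_k).$$

Now, note that

$$\widetilde{W}_{i,j} = \frac{1}{p^2}\left\{Y_i'\Sigma Y_jY_j'\Sigma Y_i - \operatorname{trace}(\Sigma^2)\right\} = \frac{1}{p^2}\left\{Y_i'\Sigma(Y_jY_j' - \mathrm{Id})\Sigma Y_i + \operatorname{trace}\left(\Sigma^2(Y_iY_i' - \mathrm{Id})\right)\right\}.$$

Hence, calling

$$M_j \triangleq Y_jY_j' - \mathrm{Id},$$

we have

$$p^4\,\widetilde{W}_{i,j}\widetilde{W}_{j,k} = (Y_i'\Sigma M_j\Sigma Y_i)(Y_k'\Sigma M_j\Sigma Y_k) + (Y_i'\Sigma M_j\Sigma Y_i)\operatorname{trace}(\Sigma^2M_k) + (Y_k'\Sigma M_j\Sigma Y_k)\operatorname{trace}(\Sigma^2M_i) + \operatorname{trace}(\Sigma^2M_i)\operatorname{trace}(\Sigma^2M_k).$$

Now, of course, we have $E(M_j) = E(M_j \mid Y_i, Y_k) = 0$. Hence,

$$p^4\,E(\widetilde{W}_{i,j}\widetilde{W}_{j,k} \mid Y_i, Y_k) = Y_i'\Sigma\,E(M_j\Sigma Y_iY_k'\Sigma M_j \mid Y_i, Y_k)\,\Sigma Y_k + \operatorname{trace}(\Sigma^2M_i)\operatorname{trace}(\Sigma^2M_k).$$

If $M$ is a deterministic matrix, we have, since $E(Y_jY_j') = \mathrm{Id}$,

$$E(M_jMM_j) = E(Y_jY_j'MY_jY_j') - M.$$

If we now use Lemma A-1, and in particular Equation (A-1), page 28, we finally have, recalling that here $\sigma^2 = 1$,

$$E(M_jMM_j) = (M + M') + (\mu_4 - 3)\operatorname{diag}(M) + \operatorname{trace}(M)\,\mathrm{Id} - M = M' + (\mu_4 - 3)\operatorname{diag}(M) + \operatorname{trace}(M)\,\mathrm{Id}.$$

In the case of interest here, we have $M = \Sigma Y_iY_k'\Sigma$, and the expectation is to be understood conditionally on $Y_i, Y_k$, but because we have assumed that the indices are different, and the $Y_m$'s are independent, we can do the computation of the conditional expectation as if $M$ were deterministic. Therefore, we have

$$Y_i'\Sigma\,E(M_j\Sigma Y_iY_k'\Sigma M_j \mid Y_i, Y_k)\,\Sigma Y_k = Y_i'\Sigma\left[\Sigma Y_kY_i'\Sigma + (\mu_4 - 3)\operatorname{diag}(\Sigma Y_iY_k'\Sigma) + (Y_k'\Sigma^2Y_i)\,\mathrm{Id}\right]\Sigma Y_k$$
$$= (Y_i'\Sigma^2Y_k)^2 + (\mu_4 - 3)\,Y_i'\Sigma\operatorname{diag}(\Sigma Y_iY_k'\Sigma)\Sigma Y_k + (Y_i'\Sigma^2Y_k)^2.$$

Naturally, we have $E(\widetilde{W}_{i,j}\widetilde{W}_{j,k} \mid Y_i, Y_k) = E(\widetilde{W}_{k,l}\widetilde{W}_{l,i} \mid Y_i, Y_k)$, and therefore, by using properties of conditional expectation, since all the indices are different,

$$p^8\,E(\widetilde{W}_{i,j}\widetilde{W}_{j,k}\widetilde{W}_{k,l}\widetilde{W}_{l,i}) = E\left(\left[2(Y_i'\Sigma^2Y_k)^2 + (\mu_4 - 3)\,Y_i'\Sigma\operatorname{diag}(\Sigma Y_iY_k'\Sigma)\Sigma Y_k + \operatorname{trace}(\Sigma^2M_i)\operatorname{trace}(\Sigma^2M_k)\right]^2\right).$$

By convexity, we have $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$, so to control the above expression, we just need to control the square of each of the terms appearing in it. In other words, we need to understand the terms

$$T_1 = E\left((Y_i'\Sigma^2Y_k)^4\right),$$
$$T_2 = E\left(\left[Y_i'\Sigma\operatorname{diag}(\Sigma Y_iY_k'\Sigma)\Sigma Y_k\right]^2\right), \text{ and}$$
$$T_3 = E\left(\left[\operatorname{trace}(\Sigma^2M_i)\operatorname{trace}(\Sigma^2M_k)\right]^2\right).$$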

Study of $T_1$

Let us start with the term $T_1 = E((Y_i'\Sigma^2Y_k)^4)$. A simple re-writing shows that

$$(Y_i'\Sigma^2Y_k)^4 = Y_i'\Sigma^2Y_kY_k'\Sigma^2Y_iY_i'\Sigma^2Y_kY_k'\Sigma^2Y_i.$$

Using Equation (A-1) in Lemma A-1, we therefore have, using the fact that $\Sigma^2Y_iY_i'\Sigma^2$ is symmetric,

$$E\left((Y_i'\Sigma^2Y_k)^4 \mid Y_i\right) = Y_i'\Sigma^2\left[2\Sigma^2Y_iY_i'\Sigma^2 + (\mu_4 - 3)\operatorname{diag}(\Sigma^2Y_iY_i'\Sigma^2) + \operatorname{trace}(\Sigma^2Y_iY_i'\Sigma^2)\,\mathrm{Id}\right]\Sigma^2Y_i$$
$$= 3(Y_i'\Sigma^4Y_i)^2 + (\mu_4 - 3)\,Y_i'\Sigma^2\operatorname{diag}(\Sigma^2Y_iY_i'\Sigma^2)\Sigma^2Y_i.$$

Finally, we have, using Equation (A-2) in Lemma A-1,

$$E\left((Y_i'\Sigma^2Y_k)^4\right) = 3\left[2\operatorname{trace}(\Sigma^8) + (\operatorname{trace}(\Sigma^4))^2 + (\mu_4 - 3)\operatorname{trace}(\Sigma^4 \circ \Sigma^4)\right] + (\mu_4 - 3)\,E\left(Y_i'\Sigma^2\operatorname{diag}(\Sigma^2Y_iY_i'\Sigma^2)\Sigma^2Y_i\right).$$

Now, we have

$$Y_i'\Sigma^2\operatorname{diag}(\Sigma^2Y_iY_i'\Sigma^2)\Sigma^2Y_i = \operatorname{trace}\left(\Sigma^2Y_iY_i'\Sigma^2\operatorname{diag}(\Sigma^2Y_iY_i'\Sigma^2)\right) = \operatorname{trace}\left(\Sigma^2Y_iY_i'\Sigma^2 \circ \Sigma^2Y_iY_i'\Sigma^2\right).$$

Calling $v_i = \Sigma^2Y_i$, we note that the matrix whose trace is taken is $(v_iv_i') \circ (v_iv_i') = (v_i \circ v_i)(v_i \circ v_i)'$ (see Horn and Johnson (1990), p. 458 or Horn and Johnson (1994), p. 307). Hence,

$$Y_i'\Sigma^2\operatorname{diag}(\Sigma^2Y_iY_i'\Sigma^2)\Sigma^2Y_i = \|v_i \circ v_i\|_2^2.$$

Now let us call $m_k$ the $k$-th column of the matrix $\Sigma^2$. Using the fact that $\Sigma^2$ is symmetric, we see that the $k$-th entry of the vector $v_i$ is $v_i(k) = m_k'Y_i$. So $v_i(k)^4 = Y_i'm_km_k'Y_iY_i'm_km_k'Y_i$. Calling $\mathcal{M}_k = m_km_k'$, we see using Equation (A-2) in Lemma A-1 that

$$E(v_i(k)^4) = 2\operatorname{trace}(\mathcal{M}_k^2) + [\operatorname{trace}(\mathcal{M}_k)]^2 + (\mu_4 - 3)\operatorname{trace}(\mathcal{M}_k \circ \mathcal{M}_k).$$

Using the definition of $\mathcal{M}_k$, we finally get that

$$E(v_i(k)^4) = 3\|m_k\|_2^4 + (\mu_4 - 3)\|m_k \circ m_k\|_2^2.$$

Now, note that if $C$ is a generic matrix and $C_k$ is its $k$-th column, denoting by $e_k$ the $k$-th vector of the canonical basis, we have $C_k = Ce_k$ and hence $\|C_k\|_2^2 = e_k'C'Ce_k \le \sigma_1^2(C)$, where $\sigma_1(C)$ is the largest singular value of $C$. So in particular, if we call $\lambda_1(D)$ the largest eigenvalue of a positive semi-definite matrix $D$, we have $\|m_k\|_2^4 \le \lambda_1(\Sigma^4)\|m_k\|_2^2$. After recalling the definition of $m_k$, and using the fact that $\sum_k \|m_k \circ m_k\|_2^2 = \|\Sigma^2 \circ \Sigma^2\|_F^2$, we deduce that

$$E\left(\|v_i \circ v_i\|_2^2\right) = 3\sum_k \|m_k\|_2^4 + (\mu_4 - 3)\sum_k \|m_k \circ m_k\|_2^2.$$

Therefore, we can conclude that

$$E\left(\|v_i \circ v_i\|_2^2\right) \le 3\lambda_1(\Sigma^4)\operatorname{trace}(\Sigma^4) + (\mu_4 - 3)\operatorname{trace}\left([\Sigma^2 \circ \Sigma^2]^2\right).$$

Now recall that, according to Theorem 5.5.19 in Horn and Johnson (1994), if $C$ and $D$ are positive semidefinite matrices, $\lambda(C \circ D) \prec_w d(C) \circ \lambda(D)$, where $\lambda(D)$ is the vector of decreasingly ordered eigenvalues of $D$, and $d(C)$ denotes the vector of decreasingly ordered diagonal entries of $C$ (because all the matrices are positive semidefinite, their eigenvalues are their singular values). Here $\prec_w$ denotes weak (sub)majorization. In our case, of course, $C = D = \Sigma^2$. Using the results of Example II.3.5 (iii) in Bhatia (1997), with the function $\phi(x) = x^2$, we see that

$$\operatorname{trace}\left((\Sigma^2 \circ \Sigma^2)^2\right) = \sum_i \lambda_i^2(\Sigma^2 \circ \Sigma^2) \le \sum_i d_i^2(\Sigma^2)\lambda_i^2(\Sigma^2) \le \lambda_1(\Sigma^4)\operatorname{trace}(\Sigma^4).$$

Finally, we have

$$T_1 = E\left((Y_i'\Sigma^2Y_k)^4\right) \le (3 + |\mu_4 - 3|)\,\lambda_1(\Sigma^4)\operatorname{trace}(\Sigma^4). \tag{1}$$

This bounds the first term, $T_1$, in our upper bound.
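The second-moment identity invoked above as Equation (A-2) can be checked exactly on a small example. Lemma A-1 itself is not reproduced in this part of the paper, so the sketch below relies on the form in which the identity is applied in the text: for symmetric $M$ and i.i.d. entries with mean 0, variance 1 and fourth moment $\mu_4$, $E[(Y'MY)^2] = 2\operatorname{trace}(M^2) + [\operatorname{trace}(M)]^2 + (\mu_4 - 3)\operatorname{trace}(M \circ M)$. As an illustrative assumption, the entries are taken to be random signs (for which $\mu_4 = 1$), so the expectation can be computed by enumeration rather than simulation. Both function names are hypothetical.

```python
import itertools
import numpy as np

def quadratic_form_second_moment(M, mu4):
    """Closed form for E[(Y'MY)^2], M symmetric, entries with variance 1."""
    # M * M is the entrywise product, so its trace is the sum of squared diagonal entries.
    return 2 * np.trace(M @ M) + np.trace(M) ** 2 + (mu4 - 3) * np.trace(M * M)

def exact_second_moment_random_signs(M):
    """E[(Y'MY)^2] by enumerating every Y in {-1, +1}^d with equal weight."""
    d = M.shape[0]
    values = []
    for signs in itertools.product((-1.0, 1.0), repeat=d):
        y = np.array(signs)
        values.append((y @ M @ y) ** 2)
    return float(np.mean(values))

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
M = G @ G.T                                        # symmetric test matrix
lhs = exact_second_moment_random_signs(M)
rhs = quadratic_form_second_moment(M, mu4=1.0)     # random signs have fourth moment 1
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point error. Enumeration keeps the check free of Monte Carlo noise, at the cost of being limited to small dimensions.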

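Study of $T_3$

Let us now turn to the third term, $T_3 = E\left([\operatorname{trace}(\Sigma^2M_i)\operatorname{trace}(\Sigma^2M_k)]^2\right)$. We remind the reader that $M_i = Y_iY_i' - \mathrm{Id}$. By independence of $Y_i$ and $Y_k$, it is enough to understand $E\left([\operatorname{trace}(\Sigma^2M_i)]^2\right)$. Note that

$$E\left([\operatorname{trace}(\Sigma^2M_i)]^2\right) = E\left([Y_i'\Sigma^2Y_i - \operatorname{trace}(\Sigma^2)]^2\right) = E\left(Y_i'\Sigma^2Y_iY_i'\Sigma^2Y_i\right) - \left(\operatorname{trace}(\Sigma^2)\right)^2.$$

Using Equation (A-2) in Lemma A-1, we conclude that

$$E\left([\operatorname{trace}(\Sigma^2M_i)]^2\right) = 2\operatorname{trace}(\Sigma^4) + (\mu_4 - 3)\operatorname{trace}(\Sigma^2 \circ \Sigma^2).$$

Using the fact that we know the diagonal of $\Sigma^2 \circ \Sigma^2$, we conclude that

$$T_3 = E\left([\operatorname{trace}(\Sigma^2M_i)]^2\right)E\left([\operatorname{trace}(\Sigma^2M_k)]^2\right) \le \left\{2\operatorname{trace}(\Sigma^4) + |\mu_4 - 3|\,\lambda_1(\Sigma^2)\operatorname{trace}(\Sigma^2)\right\}^2. \tag{2}$$

So we have an upper bound on $T_3$.

Study of $T_2$

Finally, let us turn to the middle term, $T_2 = E\left([Y_i'\Sigma\operatorname{diag}(\Sigma Y_iY_k'\Sigma)\Sigma Y_k]^2\right)$. Before we square it, the argument of the expectation has the form $Y_i'\Sigma\operatorname{diag}(\Sigma Y_kY_i'\Sigma)\Sigma Y_k$. Call $u_k = \Sigma Y_k$. Making the same computations as above, we find that

$$Y_i'\Sigma\operatorname{diag}(\Sigma Y_kY_i'\Sigma)\Sigma Y_k = \operatorname{trace}\left(\operatorname{diag}(\Sigma Y_kY_i'\Sigma)\,\Sigma Y_kY_i'\Sigma\right) = \operatorname{trace}\left((\Sigma Y_kY_i'\Sigma) \circ (\Sigma Y_kY_i'\Sigma)\right)$$
$$= \operatorname{trace}\left((u_ku_i') \circ (u_ku_i')\right) = \operatorname{trace}\left((u_k \circ u_k)(u_i \circ u_i)'\right) = (u_i \circ u_i)'(u_k \circ u_k).$$

We deduce, using independence and elementary properties of inner products, that

$$E\left([Y_i'\Sigma\operatorname{diag}(\Sigma Y_kY_i'\Sigma)\Sigma Y_k]^2\right) \le E\left(\|u_i \circ u_i\|_2^2\right)E\left(\|u_k \circ u_k\|_2^2\right).$$

Note that to arrive at Equation (1), we studied expressions similar to $E(\|u_i \circ u_i\|_2^2)$. So we can similarly conclude that

$$T_2 = E\left([Y_i'\Sigma\operatorname{diag}(\Sigma Y_kY_i'\Sigma)\Sigma Y_k]^2\right) \le \left\{(3 + |\mu_4 - 3|)\,\lambda_1(\Sigma^2)\operatorname{trace}(\Sigma^2)\right\}^2. \tag{3}$$

With our assumptions, the terms (1), (2) and (3) are $O(p^2)$. Note that in the computation of the trace, there are $O(n^4)$ such terms. Finally, note that the expectation of interest to us corresponds to the sum of the three quadratic terms divided by $p^8$. So the total contribution of these terms is in expectation $O(p^{-2})$. This takes care of the contribution of the terms involving four different indices, as it shows that

$$0 \le E\left(\sum_{i \neq j \neq k \neq l} \widetilde{W}_{i,j}\widetilde{W}_{j,k}\widetilde{W}_{k,l}\widetilde{W}_{l,i}\right) = O(p^{-2}).$$

ii) Terms involving three different indices: $i \neq j \neq k$

Note that because $\widetilde{W}_{i,i} = 0$, terms involving 3 different indices with a non-zero contribution are necessarily of the form $(\widetilde{W}_{i,j})^2(\widetilde{W}_{i,k})^2$, since terms with a cycle of length 3 all involve a term of the form $\widetilde{W}_{i,i}$ and hence contribute 0. Let us now focus on those terms, assuming that $j \neq k$. Note that we have $O(n^3)$ such terms, and that it is enough to focus on the $W_{i,j}^2W_{i,k}^2$, since the contribution of the other terms is, in expectation, of order $1/p^4$ (with our assumptions, $\operatorname{trace}(\Sigma^2)/p^2 = O(1/p)$), and because we have only $n^3$ terms in the sum, this extra contribution is asymptotically zero. Now, we clearly have

$$E\left(W_{i,j}^2W_{i,k}^2 \mid Y_i\right) = \left[E\left(W_{i,j}^2 \mid Y_i\right)\right]^2$$

by conditional independence of the two terms. The computation of $E(W_{i,j}^2 \mid Y_i)$ is similar to the ones we have made above, and we have

$$p^4\,E\left(W_{i,j}^2 \mid Y_i\right) = 2(Y_i'\Sigma^2Y_i)^2 + (\mu_4 - 3)\,Y_i'\Sigma\operatorname{diag}(\Sigma Y_iY_i'\Sigma)\Sigma Y_i + \left(\operatorname{trace}(\Sigma Y_iY_i'\Sigma)\right)^2.$$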

Since $K_i = \Sigma Y_iY_i'\Sigma$ is positive semidefinite, and hence its diagonal entries are non-negative, we have $\operatorname{trace}(K_i \circ K_i) \le (\operatorname{trace}(K_i))^2$. We conclude that

$$p^4\,E\left(W_{i,j}^2 \mid Y_i\right) \le (3 + |\kappa_4 - 3|)(Y_i'\Sigma^2Y_i)^2 \le (3 + |\kappa_4 - 3|)\,\sigma_1(\Sigma)^4\|Y_i\|_2^4.$$

Hence,

$$E\left(W_{i,j}^2W_{i,k}^2\right) \le (3 + |\kappa_4 - 3|)^2\,\sigma_1(\Sigma)^8\,\frac{E\left(\|Y_i\|_2^8\right)}{p^8}.$$

Now, the application $F$ which takes a vector and returns its Euclidean norm is trivially a convex 1-Lipschitz function, with respect to Euclidean norm. Because the entries of $Y_i$ are bounded by $B_p$, we see that, according to Corollary 4.10 in Ledoux (2001), $F(Y_i) = \|Y_i\|_2$ satisfies a concentration inequality, namely, for $r > 0$,

$$P\left(\left|\|Y_i\|_2 - m_F\right| > r\right) \le 4\exp\left(-r^2/16B_p^2\right),$$

where $m_F$ is a median of $F(Y_i) = \|Y_i\|_2$ (hence $m_F$ is a deterministic quantity). A simple integration (see for instance the proof of Proposition 1.9 in Ledoux (2001), and change the power from 2 to 8) then shows that

$$E\left(\left|\|Y_i\|_2 - m_F\right|^8\right) = O(B_p^8).$$

Now, we know, according to Proposition 1.9 in Ledoux (2001), that if $\mu_F$ is the mean of $F(Y_i)$, i.e. $\mu_F = E(\|Y_i\|_2)$, $\mu_F$ exists and $|m_F - \mu_F| = O(B_p)$. Since $\mu_F^2 \le \mu_{F^2} = E(\|Y_i\|_2^2) = p$, we conclude that, if $C$ denotes a generic constant that may change from display to display,

$$E\left(\|Y_i\|_2^8\right) \le E\left(\left(\left|\|Y_i\|_2 - m_F\right| + m_F\right)^8\right) \le C\left(E\left(\left|\|Y_i\|_2 - m_F\right|^8\right) + m_F^8\right)$$
$$\le C\left(E\left(\left|\|Y_i\|_2 - m_F\right|^8\right) + |m_F - \mu_F|^8 + \mu_F^8\right) \le C\left(B_p^8 + p^4\right).$$

Now, our original assumption about the number of absolute moments of the random variables of interest implies that $B_p = O(p^{1/2-\delta})$. Consequently, $E(\|Y_i\|_2^8) = O(p^4)$. Therefore,

$$E\left(W_{i,j}^2W_{i,k}^2\right) = O(p^{-4}),$$

and

$$\sum_i \sum_{j \neq i,\, k \neq i,\, j \neq k} E\left(W_{i,j}^2W_{i,k}^2\right) = O(p^{-1}).$$

Hence, we also have

$$\sum_i \sum_{j \neq i,\, k \neq i,\, j \neq k} E\left(\widetilde{W}_{i,j}^2\widetilde{W}_{i,k}^2\right) = O(p^{-1}).$$

iii) Terms involving two different indices: $i \neq j$

The last terms we have to focus on to control $E(\operatorname{trace}(\widetilde{W}^4))$ are of the form $\widetilde{W}_{i,j}^4$. Note that we have $n^2$ terms like this. Since by convexity, $(a + b)^4 \le 8(a^4 + b^4)$, we see that it is enough to understand the contribution of $W_{i,j}^4$ to show that $\sum_{i,j} E(\widetilde{W}_{i,j}^4)$ tends to zero. Now, let us call for a moment $v = \Sigma Y_i$ and $u = Y_j$. The quantity of interest to us is basically of the form $E((u'v)^8)$. Let us do computations conditional on $v$. We note that since the entries of $u$ are independent and have mean 0, in the expansion of $(u'v)^8$, the only terms that will contribute a non-zero quantity to the expectation have every entry of $u$ that appears raised to a power of at least 2. We can decompose the sum representing $E((u'v)^8 \mid v)$ into subterms, according to what powers of the terms are involved. There are 7 terms: (2,2,2,2) (i.e. all terms are raised to the power 2), (3,3,2) (i.e. two terms are raised to the power 3, and one to the power 2), (4,2,2), (4,4), (5,3), (6,2) and (8). For instance the subterm corresponding to (2,2,2,2) is, before taking expectations,

$$\sum_{i_1 \neq i_2 \neq i_3 \neq i_4} u_{i_1}^2u_{i_2}^2u_{i_3}^2u_{i_4}^2\,(v_{i_1}v_{i_2}v_{i_3}v_{i_4})^2.$$

After taking expectations conditional on $v$, we see that it is obviously non-negative and contributes

$$(\sigma^2)^4\sum_{i_1 \neq i_2 \neq i_3 \neq i_4}(v_{i_1}v_{i_2}v_{i_3}v_{i_4})^2 \le \left(\sum_i v_i^2\right)^4 = (Y_i'\Sigma^2Y_i)^4 \le \sigma_1(\Sigma)^8\|Y_i\|_2^8.$$

Note that we just saw that $E(\|Y_i\|_2^8) = O(p^4)$ in our context. Similarly, the term (3,3,2) will contribute

$$\mu_3^2\sigma^2\sum_{i_1 \neq i_2 \neq i_3} v_{i_1}^3v_{i_2}^3v_{i_3}^2.$$

In absolute value, this term is less than

$$\mu_3^2\sigma^2\left(\sum_i |v_i|^3\right)^2\left(\sum_i v_i^2\right).$$

Now, note that if $z$ is such that $\|z\|_2 = 1$, we have, for any exponent $q \ge 2$, $\sum_i |z_i|^q \le \sum_i z_i^2 = 1$. Applied to $z = v/\|v\|_2$, we conclude that $\sum_i |v_i|^q \le \|v\|_2^q$. Consequently, the term (3,3,2) contributes in absolute value less than

$$\mu_3^2\sigma^2\|v\|_2^8.$$

The same analysis can be repeated for all the other terms, which are all found to be less than $\|v\|_2^8$ times the moments of $u$ involved. Because we have assumed that our original random variables had $4 + \epsilon$ absolute moments, the moments of order less than 4 cause no problem. The moments of order higher than 4, say $4 + k$, can be bounded by $\mu_4B_p^k$. Consequently, we see that

$$E\left(W_{i,j}^4\right) = E\left(E\left(W_{i,j}^4 \mid Y_i\right)\right) \le CB_p^4\,\frac{E\left(\|Y_i\|_2^8\right)}{p^8} = O(B_p^4/p^4) = O\left(p^{-(2+4\delta)}\right).$$

Since we have $n^2$ such terms, we see that

$$\sum_{i \neq j} E\left(W_{i,j}^4\right) \to 0 \text{ as } p \to \infty.$$

Using our earlier convexity remark, we finally conclude that

$$\sum_{i \neq j} E\left(\widetilde{W}_{i,j}^4\right) \to 0 \text{ as } p \to \infty.$$

iv) Second order term: combining all the elements

We have therefore established control of the second order term and seen that the largest singular value of $\widetilde{W}$ goes to 0 in probability, using Chebyshev's inequality. Note that we have also shown that the operator norm of $W$ is bounded in probability and that

$$\left\|W - \frac{\operatorname{trace}(\Sigma^2)}{p^2}\left(\mathbf{1}\mathbf{1}' - \mathrm{Id}\right)\right\|_2 \to 0 \text{ in probability.}$$

Control of the third order term

We note that the third order term is of the form

$$f^{(3)}(\xi_{i,j})\,\frac{X_i'X_j}{p}\,W_{i,j}.$$

According to Lemma B-1, if $M$ is a real symmetric matrix with non-negative entries, and $E$ is a symmetric matrix such that $\max_{i,j}|E_{i,j}| = \zeta$, then

$$\sigma_1(E \circ M) \le \zeta\,\sigma_1(M).$$

Note that $W$ is a real symmetric matrix with non-negative entries. So all we have to show to prove that the third order term goes to zero in operator norm is that $\max_{i \neq j}|X_i'X_j/p|$ goes to 0, because we have just established that $\|W\|_2$ remains bounded in probability. We are going to make use of Lemma A-3, p. 31 in the Appendix. In our setting, we have $B_p = p^{1/2-\delta}$, or $2/m = 1/2 - \delta$. The lemma implies, for instance, that

$$\max_{i \neq j}\left|X_i'X_j/p\right| \le p^{-\delta}\log(p) \quad \text{a.s.}$$
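The Hadamard-product bound attributed above to Lemma B-1 is easy to probe numerically. The following is a minimal sketch on arbitrary random matrices that have nothing to do with the kernel setting: `M` is symmetric with non-negative entries, `E` is symmetric with entries of either sign, and `zeta` is the largest absolute entry of `E`. All variable names and the matrix size are illustrative choices.

```python
import numpy as np

def top_singular_value(A):
    """Largest singular value (operator norm) of A."""
    return np.linalg.norm(A, 2)

rng = np.random.default_rng(0)
n = 60
G = rng.random((n, n))
M = (G + G.T) / 2                        # symmetric, entries in [0, 1)
H = rng.uniform(-1.0, 1.0, size=(n, n))
E = (H + H.T) / 2                        # symmetric, entries of either sign
zeta = np.abs(E).max()                   # max_{i,j} |E_{i,j}|

lhs = top_singular_value(E * M)          # E * M is the entrywise (Hadamard) product
rhs = zeta * top_singular_value(M)
print(lhs, rhs)
```

On such examples `lhs` should not exceed `rhs`. The non-negativity of the entries of `M` is what the bound relies on here, since it makes `zeta * M` dominate `E * M` entrywise in absolute value.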

So $\max_{i \neq j}|X_i'X_j/p| \to 0$ a.s. Note that this implies that $\max_{i \neq j}|\xi_{i,j}| \to 0$ a.s. Since we have assumed that $f^{(3)}$ exists and is continuous and hence bounded in a neighborhood of 0, we conclude that

$$\max_{i,j}\left|f^{(3)}(\xi_{i,j})\,X_i'X_j/p\right| = o(p^{-\delta/2}) \quad \text{a.s.}$$

If we call $E$ the matrix with entry $E_{i,j} = f^{(3)}(\xi_{i,j})X_i'X_j/p$ off the diagonal and 0 on the diagonal, we see that $E$ satisfies the conditions put forth in our discussion earlier in this section and we conclude that

$$\|E \circ W\|_2 \le \max_{i,j}|E_{i,j}|\,\|W\|_2 = o(p^{-\delta/2}) \quad \text{a.s.}$$

Hence, the operator norm of the third order term goes to 0 almost surely. (To maybe clarify our arguments, let us repeat that we analyzed the second order term by replacing the $Y_i$'s by, in the notation of the truncation and centralization discussion, $U_i$. Let us call $W_U = S_U \circ S_U$, again using notation introduced in the truncation and centralization discussion. As we saw, $\|W - W_U\|_2 \to 0$ a.s., so showing, as we did, that $\|W_U\|_2$ remains bounded (a.s.) implies that $\|W\|_2$ does, too, and this is the only thing we need in our argument showing the control of the third order term.)

B) Control of the diagonal term

The proof here is divided into two parts. First, we show that the error term coming from the first order expansion of the diagonal is easily controlled. Then we show that the terms added when replacing the off-diagonal matrix by $XX'/p + \operatorname{trace}(\Sigma^2)/p^2\,\mathbf{1}\mathbf{1}'$ can also be controlled. Recall the notation $\tau = \operatorname{trace}(\Sigma)/p$.

Errors induced by diagonal approximation

Note that Lemma A-3 guarantees that for all $i$, $|\xi_{i,i} - \tau| \le p^{-\delta/2}$, a.s. Because we have assumed that $f'$ is continuous and hence bounded in a neighborhood of $\tau$, we conclude that $f'(\xi_{i,i})$ is uniformly bounded in $p$. Now Lemma A-3 also guarantees that

$$\max_i\left|\frac{\|X_i\|_2^2}{p} - \tau\right| \le p^{-\delta} \quad \text{a.s.}$$

Hence, the diagonal matrix with entries $f(\|X_i\|_2^2/p)$ can be approximated consistently in operator norm by $f(\tau)\,\mathrm{Id}$ a.s.

Errors induced by off-diagonal approximation

When we replace the off-diagonal matrix by $f'(0)XX'/p + [f(0) + f''(0)\operatorname{trace}(\Sigma^2)/2p^2]\,\mathbf{1}\mathbf{1}'$, we add a diagonal matrix with $(i,i)$ entry $f(0) + f'(0)\|X_i\|_2^2/p + f''(0)\operatorname{trace}(\Sigma^2)/2p^2$, which we need to subtract eventually. We note that $0 \le \operatorname{trace}(\Sigma^2)/p^2 \le \sigma_1^2(\Sigma)/p \to 0$ when $\sigma_1(\Sigma)$ remains bounded in $p$. So this term does not create any problem. Now, we just saw that the diagonal matrix with entries $\|X_i\|_2^2/p$ can be consistently approximated in operator norm by $(\operatorname{trace}(\Sigma)/p)\,\mathrm{Id}$. So the diagonal matrix with $(i,i)$ entry $f(0) + f'(0)\|X_i\|_2^2/p + f''(0)\operatorname{trace}(\Sigma^2)/2p^2$ can be approximated consistently in operator norm by $(f(0) + f'(0)\operatorname{trace}(\Sigma)/p)\,\mathrm{Id}$ a.s. This finishes the proof.

2.3 Kernel random matrices of the type $f(\|X_i - X_j\|_2^2/p)$

As is to be expected, the properties of such matrices can be deduced from the study of inner product kernel matrices, with a little bit of extra work. We need to slightly modify the distributional assumptions under which we work, and consider the case where we have $5 + \epsilon$ absolute moments for the entries of $Y_i$. We also need to assume that $f$ is regular in the neighborhood of different points. Otherwise, the assumptions are the same as those of Theorem 1. We have the following theorem:

Theorem 2 (Spectrum of Euclidean distance kernel matrices). Consider the $n \times n$ kernel matrix $M$ with entries

$$M_{i,j} = f\left(\frac{\|X_i - X_j\|_2^2}{p}\right).$$

Let us call

$$\tau = 2\,\frac{\operatorname{trace}(\Sigma)}{p}.$$

Let us call $\psi$ the vector with $i$-th entry $\psi_i = \|X_i\|_2^2/p - \operatorname{trace}(\Sigma)/p$. Suppose that the assumptions of Theorem 1 hold, but that conditions e) and f) are replaced by

e') The entries of $Y_i$, a $p$-dimensional random vector, are i.i.d. Also, denoting by $Y_i(k)$ the $k$-th entry of $Y_i$, we assume that $E(Y_i(k)) = 0$, $\operatorname{var}(Y_i(k)) = 1$ and $E(|Y_i(k)|^{5+\epsilon}) < \infty$ for some $\epsilon > 0$. (We say that $Y_i$ has $5 + \epsilon$ absolute moments.)

f') $f$ is $C^3$ in a neighborhood of $\tau$.

Then $M$ can be approximated consistently in operator norm (and in probability) by the matrix $K$, defined by

$$K = f(\tau)\mathbf{1}\mathbf{1}' + f'(\tau)\left[\mathbf{1}\psi' + \psi\mathbf{1}' - 2\,\frac{XX'}{p}\right] + \frac{f''(\tau)}{2}\left[\mathbf{1}(\psi \circ \psi)' + (\psi \circ \psi)\mathbf{1}' + 2\psi\psi' + 4\,\frac{\operatorname{trace}(\Sigma^2)}{p^2}\,\mathbf{1}\mathbf{1}'\right] + \upsilon_p\,\mathrm{Id},$$
$$\upsilon_p = f(0) + \tau f'(\tau) - f(\tau).$$

In other words, $\|M - K\|_2 \to 0$ in probability.

Proof. Note that here the diagonal is just $f(0)\,\mathrm{Id}$ and it will cause no trouble. The work therefore focuses on the off-diagonal matrix. In what follows, we call $\tau = 2\operatorname{trace}(\Sigma)/p$. Let us define

$$A_{i,j} = \frac{\|X_i\|_2^2}{p} + \frac{\|X_j\|_2^2}{p} - \tau,$$

and

$$S_{i,j} = \frac{X_i'X_j}{p}.$$

With these notations, we have, off the diagonal, i.e. when $i \neq j$, by a Taylor expansion:

$$M_{i,j} = f(\tau) + [A_{i,j} - 2S_{i,j}]\,f'(\tau) + \frac{1}{2}[A_{i,j} - 2S_{i,j}]^2f''(\tau) + \frac{1}{6}f^{(3)}(\xi_{i,j})[A_{i,j} - 2S_{i,j}]^3.$$

We note that the matrix $A$ with entries $A_{i,j}$ is a rank 2 matrix. As a matter of fact, it can be written, if $\psi$ is the vector with entries $\psi_i = \|X_i\|_2^2/p - \tau/2$,

$$A = \mathbf{1}\psi' + \psi\mathbf{1}'.$$

Using the well-known identity (see e.g. Gohberg et al. (2000), Chapter 1, Theorem 3.2)

$$\det(I + uv' + vu') = \det\begin{pmatrix} 1 + u'v & \|u\|_2^2 \\ \|v\|_2^2 & 1 + u'v \end{pmatrix},$$

we see immediately that the non-zero eigenvalues of $A$ are

$$\mathbf{1}'\psi \pm \sqrt{n}\,\|\psi\|_2.$$

After these preliminary remarks, we are ready to start the proof per se.

Truncation and centralization

Since we assume $5 + \epsilon$ absolute moments, we see, using Lemma 2.2 in Yin et al. (1988), that we can truncate the $Y_i$'s at level $B_p = p^{2/5-\delta}$, with $\delta > 0$, and a.s. not change the data matrix. We then need to centralize the vectors truncated at $p^{2/5-\delta}$. Note that because we work with $X_i - X_j = \Sigma^{1/2}(Y_i - Y_j)$, centralization creates absolutely no problem here, since it is absorbed in the difference. So in what follows we can assume without loss of generality that we are working with vectors $X_i = \Sigma^{1/2}Y_i$, where the entries of $Y_i$ are bounded by $p^{2/5-\delta}$ and $E(Y_i) = 0$. The issue of variance 1 is addressed as before, so we can assume that the entries of $Y_i$ have variance 1.

Concentration of $\|X_i - X_j\|_2^2/p$

By plugging in the results of Corollary A-2, with $2/m = 2/5 - \delta$, we get that

$$\max_{i \neq j}\left|\frac{\|X_i - X_j\|_2^2}{p} - 2\,\frac{\operatorname{trace}(\Sigma)}{p}\right| \le \log(p)\,p^{-1/10-\delta}$$
