On information plus noise kernel random matrices


On information plus noise kernel random matrices

Noureddine El Karoui
Department of Statistics, UC Berkeley
First version: July 2009. This version: February 5, 2010.

Abstract

Kernel random matrices have attracted a lot of interest in recent years, from both practical and theoretical standpoints. Most of the theoretical work so far has focused on the case where the data is sampled from a low-dimensional structure. Very recently, the first results concerning kernel random matrices with high-dimensional input data were obtained, in a setting where the data was sampled from a genuinely high-dimensional structure, similar to standard assumptions in random matrix theory.

In this paper, we consider the case where the data is of the type "information + noise". In other words, each observation is the sum of two independent elements: one sampled from a low-dimensional structure, the signal part of the data, the other being high-dimensional noise, normalized to not overwhelm but still affect the signal. We consider two types of noise, spherical and elliptical.

In the spherical setting, we show that the spectral properties of kernel random matrices can be understood from a new kernel matrix, computed only from the signal part of the data, but using in general a slightly different kernel. The Gaussian kernel has some special properties in this setting. The elliptical setting, which is important from a robustness standpoint, is less prone to easy interpretation.

1 Introduction

Kernel techniques are now a standard tool of statistical practice, and kernel versions of many methods of classical multivariate statistics have now been created. A few important examples can be found in Schölkopf and Smola (2002) (see the description of kernel PCA) and in Bach and Jordan (2003) for kernel ICA, for instance. There are several ways to describe kernel methods, but one of them is to think of them as classical multivariate techniques using generalized notions of inner-product. A basic input in these techniques is a kernel matrix, i.e., an inner-product or Gram matrix, for generalized inner products. If our vectors of observations are X_1, ..., X_n, the kernel matrices studied in this paper have (i,j) entry f(‖X_i − X_j‖²) or f(X_i'X_j), for a certain f. Popular examples include the Gaussian kernel (entries exp(−‖X_i − X_j‖²/s)), the sigmoid kernel (entries tanh(κ X_i'X_j + θ)), and polynomial kernels (entries (X_i'X_j)^d). We refer to Rasmussen and Williams (2006) for more examples.

As explained in, for instance, Schölkopf and Smola (2002), kernel techniques allow practitioners to essentially do multivariate analysis in infinite-dimensional spaces, by embedding the data in an infinite-dimensional space through the use of the kernel. A nice numerical feature is that the embedding need not be specified, and all computations can be made using the finite-dimensional kernel matrix. Kernel techniques also allow users to do certain forms of non-linear data analysis and dimensionality reduction, which is naturally very desirable. Zwald et al. (2004) and von Luxburg et al. (2008) are two interesting, relatively recent papers concerned, broadly speaking, with the same types of inferential questions we have in mind and investigate in this paper, though the settings of those papers are quite different from the one we will work under.

I would like to thank Peter Bickel for suggesting that I consider the problem studied here, and in general for many enlightening discussions about topics in high-dimensional statistics. I would also like to thank an anonymous referee and associate editors for raising interesting questions which contributed to a strengthening of the results presented in the paper.
Support from NSF grants DMS and DMS (CAREER) and an Alfred P. Sloan research fellowship is gratefully acknowledged. AMS 2000 SC: Primary: 62H10. Secondary: 60F99. Key words and phrases: covariance matrices, kernel matrices, multivariate statistical analysis, high-dimensional inference, random matrix theory, machine learning, concentration of measure. Contact: nkaroui@stat.berkeley.edu
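To fix ideas, here is a short numerical sketch, not part of the original paper, of the three kernel matrices described in the Introduction; the scale parameters s, kappa, theta and the degree d are arbitrary placeholder choices.

```python
import numpy as np

def kernel_matrices(X, s=1.0, kappa=0.5, theta=0.0, d=2):
    """Gaussian, sigmoid and polynomial kernel matrices for data X of shape
    (n, p), whose rows are the observations X_1, ..., X_n."""
    G = X @ X.T                               # inner products X_i' X_j
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2.0 * G  # squared distances ||X_i - X_j||^2
    gaussian = np.exp(-D2 / s)                # entries exp(-||X_i - X_j||^2 / s)
    sigmoid = np.tanh(kappa * G + theta)      # entries tanh(kappa X_i'X_j + theta)
    polynomial = G ** d                       # entries (X_i'X_j)^d
    return gaussian, sigmoid, polynomial

# Example: n = 200 observations in dimension p = 50
rng = np.random.default_rng(0)
K_gauss, K_sig, K_poly = kernel_matrices(rng.standard_normal((200, 50)))
```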

Kernel matrices, and the closely related Laplacian matrices, also play a central role in manifold learning (see e.g. Belkin and Niyogi (2003), and Izenman (2008) for an overview of various techniques). In classical statistics, they have been a mainstay of spatial statistics, and geostatistics in particular (see Cressie (1993)). In geostatistical applications, it is clear that the dimension of the data is at most 3. Also, in applications of kernel techniques and manifold learning, it is often assumed that the data live on a low-dimensional manifold or structure, the kernel approach allowing us to somehow recover, at least partially, this information. Consequently, most theoretical analyses of kernel matrices and kernel or manifold learning techniques have focused on situations where the data is assumed to live on such a low-dimensional structure. In particular, it is often the case that asymptotics are studied under the assumption that the data is i.i.d. from a fixed distribution, independent of the number of points. Some remarkable results have been obtained in this setting (see Koltchinskii and Giné (2000) and also Belkin and Niyogi (2008)).

Let us give a brief overview of such results. In Koltchinskii and Giné (2000), the authors prove that if the X_i's are i.i.d. with distribution P, under regularity conditions on the kernel k(x,y), the k-th largest eigenvalue of the kernel matrix M, with entries M(i,j) = (1/n) k(X_i, X_j), converges to the k-th largest eigenvalue of the operator K defined as

K f(x) = ∫ k(x,y) f(y) dP(y).

In this important paper, the authors were also able to obtain fluctuation behavior for these eigenvalues, under certain technical conditions (see Theorem 5 in Koltchinskii and Giné (2000)). Similar first-order convergence results were obtained, at a heuristic level but through interesting arguments, in Williams and Seeger (2000). These results gave theoretical confirmation to practitioners' intuition and heuristics that the kernel matrix could be used as a good proxy for the operator K on L²(dP), and hence that kernel techniques could be explained and justified through the spectral properties of this operator.

To statisticians well-versed in the theory of random matrices, this set of results appears to be similar to results for low-dimensional covariance matrices stating that, when the dimension of the data is fixed and the number of observations goes to infinity, the sample covariance matrix is a spectrally consistent estimator of the population covariance matrix (see e.g. Anderson (2003)). However, it is well-known (see e.g. Marčenko and Pastur (1967), Bai (1999), Johnstone (2007)) that this is not the case when the dimension of the data, p, changes with n, the number of observations, and in particular when asymptotics are studied under the assumption that p/n has a finite limit. We refer to the asymptotic setting where p and n both tend to infinity as the high-dimensional setting. We note that, given that more and more datasets have observations that are high-dimensional, and kernel techniques are used on some of them (see Williams and Seeger (2000)), it is natural to study kernel random matrices in the high-dimensional setting. Another important reason to study this type of asymptotics is that, by keeping track of the effect of the dimension of the data, p, and of other parameters of the problem on the results, they might help us give more accurate predictions about the finite-dimensional behavior of certain statistics than the classical "small p, large n" asymptotics. An example of this phenomenon can be found in the paper Johnstone (2001), where it turned out in simulation that some of the doubly asymptotic results concerning fluctuation behavior of the largest eigenvalue of a Wishart matrix with identity covariance are quite accurate for p and n as small as 5 or 10, at least in the right tail of the distribution.
We refer the interested reader to Johnstone (2001) for more details on the specific example we just described. Hence, it is also potentially practically important to carry out these theoretical studies, for they can be informative even for finite-dimensional considerations.

The properties of kernel random matrices under classical random matrix assumptions have been studied by the author in the recent El Karoui (2007). It was shown there that when the data is high-dimensional, for instance X_i ~ N(0, Σ), and the operator norm of Σ is e.g. bounded, kernel random matrices essentially act like standard Gram/covariance matrices, up to re-centering and re-scaling, which depend only on f. Naturally, a certain scaling is needed to make the problem non-degenerate, and the results we just stated hold, for instance, when M(i,j) = f(‖X_i − X_j‖²/p), for otherwise the kernel matrix is in general degenerate. We refer to El Karoui (2007) for more details and discussions of the relevance of these results in practice. In limited simulations, we found that the theory agreed with the numerics even when p was of the order of several 10's and p/n was not too small.
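The following small simulation, ours and not the paper's, illustrates the first-order convergence of Koltchinskii and Giné (2000) described above, in a deliberately low-dimensional setting: data i.i.d. uniform on [0,1] and a Gaussian kernel, both illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def top_eigs(n, k=5, s=1.0):
    """Top-k eigenvalues of M(i,j) = k(X_i, X_j)/n for X_i i.i.d. uniform
    on [0, 1] and the Gaussian kernel k(x, y) = exp(-(x - y)^2 / s)."""
    x = rng.uniform(size=n)
    M = np.exp(-(x[:, None] - x[None, :]) ** 2 / s) / n
    return np.linalg.eigvalsh(M)[::-1][:k]

# The top eigenvalues stabilize as n grows: they approach the top eigenvalues
# of the integral operator K f(x) = int k(x, y) f(y) dP(y).
for n in (100, 400, 1600):
    print(n, np.round(top_eigs(n), 4))
```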

These results came as somewhat of a surprise and seemed to contradict the intuition, and the numerous positive practical results that have been obtained, since they suggested that the kernel matrices we considered were just a centered and scaled version of the matrix XX'. However, it should be noted that the assumptions implied that the data was truly high-dimensional. So an interesting middle ground, from modeling, theoretical and practical points of view, is the following: what happens if the data does not live exactly on a fixed-dimensional manifold, but lives nearby? In other words, the data is now sampled from a noisy version of the manifold. This is the question we study in this paper.

We assume now that the data points X_i ∈ R^p we observe are of the form X_i = Y_i + Z_i, where Y_i is the signal part of the observations and lives, for instance, on a low-dimensional manifold (e.g., a 3-dimensional sphere), and Z_i is the noise part of the observations and is e.g. multivariate Gaussian in dimension p, where p might be 100. We think this is interesting from a practical standpoint because the assumption that the data is exactly on a manifold is perhaps a bit optimistic, and the noisy manifold version is perhaps more in line with what statisticians expect to encounter in practice (there is a clear analogy with linear regression here). From a theoretical standpoint, such a model allows us to bridge the two extremes between truly low-dimensional data and fully high-dimensional data. From a modeling standpoint, we propose to scale the noise so that its norm stays bounded, or does not grow too fast in the asymptotics. That way, the signal part of the data is likely to be affected but not totally drowned by the noise. It is important to note, however, that the noise is not small in any sense of the word: it is of a size comparable with that of the signal.

In the case of spherical noise (see below for details, but note that the Gaussian distribution falls into this category), our results say that, to first order, the kernel matrix computed from information+noise data behaves like a kernel matrix computed from the signal part of the data, but we might have to use a different kernel than the one we started with. This other kernel is quite explicit. In the case of dot-product kernel matrices (i.e., M(i,j) = f(X_i'X_j)/n), the original kernel can be used under certain assumptions; so, to first order, the noise part has no effect on the spectral properties of the kernel matrix. The results are different when looking at Euclidean distance kernels (i.e., M(i,j) = f(‖X_i − X_j‖²)/n), where the effect of the noise is basically to change the kernel that is used. This is in any case a quite positive result, in that it says that the whole body of work concerning the behavior of kernel random matrices with low-dimensional input data can be used to also study the information+noise case, the only change being a change of kernels. The case of elliptical noise is more complicated. The dot-product kernel results still have the same interpretation, but the Euclidean distance kernel results are not as easy to interpret.

2 Results

Before we start, we set some notations. We use ‖M‖_F to denote the Frobenius norm of the matrix M (so ‖M‖_F² = Σ_{i,j} M(i,j)²), and ‖M‖₂ to denote its operator norm, i.e., its largest singular value. We also use ‖v‖ to denote the Euclidean norm of the vector v. a ∨ b is shorthand for max(a, b). Unless otherwise noted, functions that are said to be Lipschitz are Lipschitz with respect to Euclidean norm.

We split our results into two parts, according to distributional assumptions on the noise. One deals with the Gaussian-like case, which allows us to give a simple proof of the results.
The second part is about the case where the noise has a distribution that satisfies certain concentration and ellipticity properties. This is more general and brings the geometry of the problem forward. It also allows us to study the robustness, and lack thereof, of the results to the sphericity of the noise, an assumption that is implicit in the high-dimensional Gaussian and Gaussian-like case. We draw some practical conclusions from our results for the case of spherical noise in Subsection 2.3.

2.1 The case of Gaussian-like noise

We first study a setting where the noise is drawn according to a distribution that is similar to a Gaussian, but slightly more general.
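Before stating the theorem, the following sketch (ours; the circle signal and Σ = Id are illustrative choices) shows the geometric fact that drives it: for spherical noise scaled as below, all pairwise squared distances are shifted by nearly the same constant 2ν.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 50000

# Signal: points on a circle (a 1-dimensional manifold) embedded in R^p
t = rng.uniform(0, 2 * np.pi, n)
Y = np.zeros((n, p)); Y[:, 0], Y[:, 1] = np.cos(t), np.sin(t)

# Spherical noise: Z_i = Sigma^{1/2} U_i / sqrt(p) with Sigma = Id, so nu = 1
X = Y + rng.standard_normal((n, p)) / np.sqrt(p)

def sq_dists(A):
    g = A @ A.T; d = np.diag(g)
    return d[:, None] + d[None, :] - 2.0 * g

i, j = np.triu_indices(n, k=1)
err = sq_dists(X)[i, j] - (sq_dists(Y)[i, j] + 2.0)  # shift by 2*nu = 2
print(np.max(np.abs(err)))  # small, shrinking roughly like p**-0.5
```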

Theorem 1. Suppose we observe data X_1, ..., X_n in R^p, with

X_i = Y_i + Z_i,

where Z_i = Σ^{1/2} U_i / p^{1/2}, the p-dimensional vector U_i has i.i.d. entries with mean 0, variance 1 and fourth moment µ₄, and {Y_i}_{i=1}^n ~ P_n. We assume that there exists a deterministic vector a and a real C₁ > 0, possibly dependent on n, such that ∀i, E‖Y_i − a‖² < C₁. Also, µ₄ might change with n but is assumed to remain bounded. {Z_i}_{i=1}^n are i.i.d., and we also assume that {Y_i}_{i=1}^n and {Z_i}_{i=1}^n are independent. We consider the random matrices M_f with (i,j) entry

M_f(i,j) = (1/n) f(‖X_i − X_j‖²),

for functions f in F_{C₀(n)} = {f such that |f(x) − f(y)| ≤ C₀(n)|x − y| for all x, y}.

Let us call ν = trace(Σ)/p. Let M̃_f be the matrix with (i,j) entry

M̃_f(i,j) = (1/n) f(‖Y_i − Y_j‖² + 2ν) if i ≠ j, and (1/n) f(0) if i = j.

Assuming only that µ₄ is bounded uniformly in n, we have, for a constant C independent of n, p and Σ,

E*[ sup_{f ∈ F_{C₀(n)}} ‖M_f − M̃_f‖_F² ] ≤ C C₀²(n) [ trace(Σ²)/p² + C₁ ‖Σ‖₂/p ].   (1)

We place ourselves in the high-dimensional setting where n and p tend to infinity. We assume that trace(Σ²)/p² → 0 as p tends to infinity. Under these assumptions, for any fixed C₀ > 0 and C₁ > 0,

lim_{n,p→∞} sup_{f ∈ F_{C₀}} ‖M_f − M̃_f‖_F = 0 in probability.

If we further assume that ν remains, for instance, bounded, the same result holds if we replace the diagonal of M̃ by f(2ν)/n, because |f(2ν) − f(0)| ≤ 2νC₀, and therefore sup_{f∈F_{C₀}} |f(2ν) − f(0)| ≤ 2νC₀. The approximating matrix we then get is the matrix with (i,j) entry (1/n) f_{2ν}(‖Y_i − Y_j‖²), where f_{2ν}(x) = f(x + 2ν), i.e., a pure signal matrix involving a different kernel from the one we started with.

We note that there is a potential measurability issue, which we address in the proof. Our theorem really means that we can find a random variable that dominates the random element sup_{f∈F_{C₀(n)}} ‖M_f − M̃_f‖_F and goes to 0 in probability. A subcase of our result is the case of Gaussian noise: then U_i is N(0, Id) and our result naturally applies. We also note that P_n can change with n. The class of functions we consider is fixed in the last statement of the theorem, but if we were to look at a sequence of kernels we could pick a different function in the class F_{C₀} for each n. It should also be noted that the proof technique allows us to deal with classes of functions that vary with n: we could have a varying C₀(n). As Equation (1) makes clear, the approximation result will hold as soon as the right-hand side of Equation (1) goes to 0 asymptotically, i.e., C₀²(n) trace(Σ²)/p² → 0 and C₀²(n)‖Σ‖₂/p → 0. Finally, we work here with uniformly Lipschitz functions. The proof technique carries over to other classes, such as certain classes of Hölder functions, but the bounds would be different.

Proof. The strategy is to use the same entry-wise expansion approach that was used in El Karoui (2007). To do so, we remark that ‖Z_i − Z_j‖² remains essentially constant across pairs (i,j) in the setting we are considering; this is a consequence of the spherical nature of high-dimensional Gaussian distributions. We can therefore try to approximate M(i,j) by f(‖Y_i − Y_j‖² + 2ν)/n, and all we need to do is to show that the remainder is small. We also note that if, as we assume, trace(Σ²)/p² → 0, then ‖Σ‖₂/p → 0, since ‖Σ‖₂² ≤ trace(Σ²).

Work conditional on Y_n = {Y_i}_{i=1}^n, for i ≠ j. We clearly have

‖X_i − X_j‖² = ‖Y_i − Y_j‖² + 2(Z_i − Z_j)'(Y_i − Y_j) + ‖Z_i − Z_j‖².

Let us study the various parts of this expansion. Conditional on Y_n, if we call y = Y_i − Y_j, we see easily that Z_i − Z_j = Σ^{1/2}(U_i − U_j)/p^{1/2}, and (Z_i − Z_j)'(Y_i − Y_j) = (U_i − U_j)'Σ^{1/2}y/p^{1/2}.

Note that U_i − U_j, which we denote Γ, has i.i.d. entries, with mean 0, variance 2 and fourth moment 2(µ₄ + 3). We call

α = (Z_i − Z_j)'(Y_i − Y_j) and β = ‖Z_i − Z_j‖² − 2 trace(Σ)/p.

With these notations, we have

‖X_i − X_j‖² − (‖Y_i − Y_j‖² + 2ν) = 2α + β.

Therefore, for any function f in F_{C₀(n)},

|f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + 2ν)| ≤ C₀(n)(|β| + 2|α|),

and hence

[f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + 2ν)]² ≤ 2C₀²(n)[β² + 4α²].

We naturally also have

sup_{f∈F_{C₀(n)}} [f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + 2ν)]² ≤ 2C₀²(n)[β² + 4α²].

So we have found a random variable τ_n = 2C₀²(n)[β² + 4α²] that dominates the random element ζ_n = sup_{f∈F_{C₀}}[f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + 2ν)]². One might be concerned about the measurability of ζ_n, but by using outer expectations (see van der Vaart (1998)) we can completely bypass this potential problem. In what follows, we denote by E* an outer expectation. Though this technical point does not shed further light on the problem, it naturally needs to be addressed. Hence,

E*[ sup_{f∈F_{C₀(n)}} |f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + 2ν)|² | Y_n ] ≤ 2C₀²(n)( E(β²) + 4E(α² | Y_n) ).

Let us focus on E(β²) for a moment. We first note that ‖Z_i − Z_j‖² = Γ'ΣΓ/p. In particular, E‖Z_i − Z_j‖² = 2 trace(Σ)/p, so E(β) = 0. Therefore, E(β²) = var(‖Z_i − Z_j‖²). Now recall the results found, for instance, in Lemma A-1 in El Karoui (2007): if the vector γ has i.i.d. entries with mean 0, variance σ² and fourth moment κ₄, and if M is a symmetric matrix,

E[(γ'Mγ)²] = σ⁴(trace²(M) + 2 trace(M²)) + (κ₄ − 3σ⁴) trace(M ∘ M),

where M ∘ M is the Hadamard product of M with itself, i.e., the entrywise product of the matrix with itself. Applying this result in our setting (i.e., using the moments given above of Γ, which has i.i.d. entries) gives

var(‖Z_i − Z_j‖²) = var(Γ'ΣΓ)/p² = [8 trace(Σ²) + 2(µ₄ − 3) trace(Σ ∘ Σ)]/p².

It is easy to see that trace(Σ ∘ Σ) ≤ trace(Σ²), since trace(Σ²) = Σ_{i,j} σ²(i,j) ≥ Σ_i σ²(i,i) = trace(Σ ∘ Σ). Therefore,

E(β²) = var(‖Z_i − Z_j‖²) ≤ [8 + 2|µ₄ − 3|] trace(Σ²)/p² = O(trace(Σ²)/p²).

We note that under our assumptions on trace(Σ²)/p², and the fact that µ₄ remains bounded in n (and therefore in p), this term will go to 0. On the other hand, because α | Y_n = Γ'Σ^{1/2}y/p^{1/2}, and because E(Γ) = 0 and cov(Γ) = 2Id, we have

E(α² | Y_n) = 2y'Σy/p ≤ 2‖Σ‖₂‖y‖²/p ≤ 4(‖Σ‖₂/p)(‖Y_i − a‖² + ‖Y_j − a‖²).

Hence, we have, for C a constant independent of Σ, p and n,

E*[ sup_{f∈F_{C₀(n)}} |f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + 2ν)|² | Y_n ] ≤ C C₀²(n)[ trace(Σ²)/p² + (‖Σ‖₂/p)(‖Y_i − a‖² + ‖Y_j − a‖²) ].

This inequality allows us to conclude that, for another constant C,

E*[ sup_{f∈F_{C₀(n)}} ‖M_f − M̃_f‖_F² | Y_n ] ≤ C C₀²(n)[ trace(Σ²)/p² + (‖Σ‖₂/p)(1/n) Σ_{i=1}^n ‖Y_i − a‖² ],

since clearly

sup_{f∈F_{C₀(n)}} ‖M_f − M̃_f‖_F² ≤ (1/n²) Σ_{i≠j} sup_{f∈F_{C₀(n)}} |f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + 2ν)|².

Under the assumption that E‖Y_i − a‖² exists and is less than C₁, we finally conclude that

E*[ sup_{f∈F_{C₀}} ‖M_f − M̃_f‖_F² ] ≤ C C₀²(n)[ trace(Σ²)/p² + C₁‖Σ‖₂/p ],

and Equation (1) is shown. Therefore, under our assumptions,

E*[ sup_{f∈F_{C₀}} ‖M_f − M̃_f‖_F² ] = o(1).

Hence, when n and p tend to infinity, as announced in the theorem,

sup_{f∈F_{C₀}} ‖M_f − M̃_f‖_F → 0 in probability.
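A quick numerical check of the theorem (ours; signal on a circle, Σ = Id, and f(x) = exp(−x), which is 1-Lipschitz, are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, nu = 60, 50000, 1.0
t = rng.uniform(0, 2 * np.pi, n)
Y = np.zeros((n, p)); Y[:, 0], Y[:, 1] = np.cos(t), np.sin(t)
X = Y + rng.standard_normal((n, p)) / np.sqrt(p)   # Sigma = Id, nu = 1

def sq_dists(A):
    g = A @ A.T; d = np.diag(g)
    return d[:, None] + d[None, :] - 2.0 * g

f = lambda x: np.exp(-x)                   # 1-Lipschitz on [0, infinity)
M = f(sq_dists(X)) / n                     # kernel matrix, information+noise data
M_tilde = f(sq_dists(Y) + 2 * nu) / n      # pure signal, shifted kernel f(x + 2 nu)
np.fill_diagonal(M_tilde, f(0) / n)        # the theorem keeps f(0)/n on the diagonal
print(np.linalg.norm(M - M_tilde, 'fro'))  # small, as Theorem 1 predicts
```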

2.2 Case of noise drawn from a distribution satisfying concentration inequalities

The proof of Theorem 1 makes clear that the heart of our argument is geometric: we exploit the fact that ‖Z_i − Z_j‖² is essentially constant across pairs (i,j). It is therefore natural to try to extend the theorem to more general assumptions about the noise distribution than the Gaussian-like one we worked under previously. It is also important to understand the impact of the implicit geometric assumptions (i.e., sphericity of the noise) that are made, and in particular the robustness of our results against these geometric assumptions.

We extend the results in two directions. First, we investigate the generalization of our Gaussian-like results to the setting of Euclidean-distance kernel random matrices, when the noise is distributed according to a distribution satisfying a concentration inequality, multiplied by a random variable, i.e., a generalization of elliptical distributions. This allows us to show that the Gaussian-like results of Theorem 1 essentially hold under much weaker assumptions on the noise distribution, as long as the Gaussian geometry (i.e., a spherical geometry) is preserved (see Corollary 3). The results of Theorem 2 show that breaking the Gaussian geometry results in quite different approximation results. We also discuss in Theorem 4 the situation of inner-product kernel random matrices under the same generalized elliptical assumptions on the noise.

2.2.1 The case of Euclidean distance kernel random matrices

We have the following theorem.

Theorem 2 (Euclidean distance kernels). Suppose we observe data X_1, ..., X_n in R^p, with

X_i = Y_i + R_i Z_i/p^{1/2}.

We place ourselves in the high-dimensional setting where n and p tend to infinity. We assume that:
- {Y_i}_{i=1}^n ~ P_n;
- {Z_i}_{i=1}^n are i.i.d. with E(Z_i) = 0, and Y_n = {Y_i}_{i=1}^n and {Z_i}_{i=1}^n are independent;
- the R_i's are random variables independent of {Z_i}_{i=1}^n;
- the distribution of Z_i is such that, for any 1-Lipschitz function F, if µ_F = E(F(Z_i)),

P(|F(Z_i) − µ_F| > r) ≤ C exp(−c₀ r^b) = h(r),

where for simplicity we assume that c₀, C and b are independent of p. We call ν = E‖Z_i‖²/p and assume that ν stays bounded as p → ∞;
- ∀i, R_i ∈ [r_p, R_p], where r_p and R_p are deterministic sequences depending on p. We assume without loss of generality that R_p ≥ 1.

Calling M(Y_n) = max_{i≠j} ‖Y_i − Y_j‖², we assume that there exists M₀ such that P(M(Y_n) ≤ M₀) → 1, and ǫ > 0 such that

(M₀^{1/2} ∨ 1 ∨ R_p) R_p (log n + (log n)^{1+ǫ})^{1/b} / p^{1/2} → 0.

Then we have

max_{i≠j} | ‖X_i − X_j‖² − [‖Y_i − Y_j‖² + ν(R_i² + R_j²)] | → 0 in probability.   (2)

We call W(Y_n) = min_{i≠j} ‖Y_i − Y_j‖², and suppose we pick W₀ such that P(W(Y_n) ≥ W₀) → 1. (Note that W₀ = 0 is always a possibility.) We call, for η > 0 given,

I_η = [W₀ + 2νr_p² − η, M₀ + 2νR_p² + η],

and F_{C,I_η} = {f such that |f(x) − f(y)| ≤ C|x − y| for all x, y ∈ I_η}. We consider the random matrices M_f with (i,j) entry

M_f(i,j) = (1/n) f(‖X_i − X_j‖²), for f ∈ F_{C,I_η}.

Let us call M̃_f the matrix with (i,j) entry

M̃_f(i,j) = (1/n) f(‖Y_i − Y_j‖² + ν(R_i² + R_j²)) if i ≠ j, and (1/n) f(0) if i = j.

We have, for any given C > 0 and η > 0,

lim_{n,p→∞} sup_{f∈F_{C,I_η}} ‖M_f − M̃_f‖_F = 0 in probability.   (3)

We have the following corollary in the case of spherical noise, which is a generalization of the Gaussian-like case considered in Theorem 1.

Corollary 3 (Euclidean distance kernels with spherical noise). Suppose we observe data X_1, ..., X_n in R^p, with X_i = Y_i + Z_i/p^{1/2}, where Y_i and Z_i satisfy the same assumptions as in Theorem 2, with r_p = R_p = 1. Then the results of Theorem 2 apply with I_η = [W₀ + 2ν − η, M₀ + 2ν + η] and

M̃_f(i,j) = (1/n) f(‖Y_i − Y_j‖² + 2ν) if i ≠ j, and (1/n) f(0) if i = j.

As in Theorem 1, we deal with potential measurability issues concerning the sup in the proof. Our theorem really is that we can find a random variable that goes to 0 in probability and dominates the random element sup_{f∈F_{C,I_η}} ‖M_f − M̃_f‖_F (an outer-probability statement).

This theorem generalizes Theorem 1 in two ways. The spherical case, detailed in Corollary 3, is a more general version of Theorem 1 limited to Gaussian noise. This is because the Gaussian setting corresponds to b = 2 and c₀ = 1/(2‖Σ‖₂). However, assuming only concentration inequalities allows us to handle much more complicated structures for the noise distribution; some examples are given below.

We also note that if the Y_i's (i.e., the signal part of the X_i's) are sampled, for instance, from a fixed manifold of finite Euclidean diameter, the conditions on M₀ are automatically satisfied, with M₀ being the squared Euclidean diameter of the corresponding manifold.

Another generalization is geometric: by allowing R_i to vary with i, we move away from the spherical geometry of high-dimensional Gaussian vectors and their generalizations, to a more elliptical setting. Hence our results show clearly the potential limitations, and the structural assumptions that are made, when one assumes Gaussianity of the noise. Theorem 2 and Corollary 3 show that the Gaussian-like results of Theorem 1 are not robust against a change in the geometry of the noise. We note, however, that if R_i is independent of Z_i and E(R_i²) = 1, cov(R_iZ_i) = cov(Z_i), so all the noise models have the same covariance, but they may yield different approximating matrices and hence different spectral behavior for our information+noise models. However, the spherical results have the advantage of having simple interpretations. In the setting of Corollary 3, if we assume that f(0) and f(2ν) are uniformly bounded in n over the class of functions we consider, we can replace the diagonal of M̃ by f(2ν)/n and have the same approximation results. Then the new M̃ is a kernel matrix computed from the signal part of the data with the new kernel f_{2ν}(x) = f(x + 2ν).

To make our result more concrete, we give a few examples of distributions for which the concentration assumptions on Z_i are satisfied:
- Gaussian random variables Z_i ~ N(0, Σ), for which we have c₀ = 1/(2‖Σ‖₂). We refer to Ledoux (2001), Theorem 2.7, for a justification of this claim.
- Vectors of the type √p v, where v is uniformly distributed on the unit ℓ₂-sphere in dimension p. Theorem 2.3 in Ledoux (2001) shows that our assumptions are satisfied, with c₀ = 1/4, after noticing that a 1-Lipschitz function with respect to Euclidean norm is also 1-Lipschitz with respect to the geodesic distance on the sphere.
- Vectors √p Γv, with v uniformly distributed on the unit ℓ₂-sphere in R^p and with ΓΓ' = Σ having bounded operator norm.
- Vectors of the type p^{1/b} v, b ∈ [1, 2], where v is uniformly distributed in the unit ℓ_b ball or sphere in R^p. See Ledoux (2001), Theorem 4, which refers to Schechtman and Zinn (2000) as the source of the theorem. In this case, c₀ depends only on b.
- Vectors with log-concave density of the type e^{−U(x)}, with the Hessian of U satisfying, for all x, Hess(U) ≥ c₀ Id, where c₀ > 0 is the real that appears in our assumptions. See Ledoux (2001), Theorem 2.7, for a justification.
- Vectors v distributed according to a centered Gaussian copula, with corresponding correlation matrix Σ having ‖Σ‖₂ bounded. We refer to El Karoui (2009) for a justification of the fact that our assumptions are satisfied. If ṽ has a Gaussian copula distribution, then its i-th entry satisfies ṽ_i = Φ(N_i), where N is multivariate normal with covariance matrix Σ, Σ being a correlation matrix, i.e., its diagonal is 1. Here Φ is the cumulative distribution function of a standard normal distribution. Taking v = ṽ − 1/2 gives a centered Gaussian copula.

This last example is intended to show that the result can handle quite complicated and non-linear noise structures. We note that to justify that the assumptions of the theorem are satisfied, it is enough to be able to show concentration around the mean or the median, as Proposition 1.8 in Ledoux (2001) makes clear.
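Two of the examples above are easy to generate numerically; the sketch below is ours, with an AR(1)-type correlation matrix (which has bounded operator norm) and ρ = 0.3 as illustrative choices for the copula case.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
p = 500

def sphere_noise():
    """sqrt(p) times a uniform point on the unit l2-sphere of R^p; nu = 1."""
    v = rng.standard_normal(p)
    return np.sqrt(p) * v / np.linalg.norm(v)

def copula_noise(rho=0.3):
    """Centered Gaussian copula: v_i = Phi(N_i) - 1/2, where N is an AR(1)
    Gaussian vector with correlation rho^{|i-j|}; here nu = 1/12."""
    g = rng.standard_normal(p)
    N = np.empty(p)
    N[0] = g[0]
    for i in range(1, p):
        N[i] = rho * N[i - 1] + np.sqrt(1 - rho**2) * g[i]
    return norm.cdf(N) - 0.5

Z_sphere, Z_copula = sphere_noise(), copula_noise()
print(np.sum(Z_sphere**2) / p, np.sum(Z_copula**2) / p)  # nu: 1 and ~1/12
```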
The reader might feel that the assumptions concerning the boundedness of the R_i's will be limiting in practice. We note that the same proof essentially goes through if we just require that the R_i's belong to the interval [r_p, R_p] with probability going to 1, but this requires a little bit more conditioning, and we leave the details, which are not difficult, to the interested reader. So, for instance, if we had a tail condition on the R_i's, we could bound the R_i's with high probability to get a choice of R_p. This boundedness condition is here just to make the exposition simpler and is not particularly limiting in our opinion. On the other hand, we note that our conditions allow dependence in the R_i's and are therefore rather weak requirements. Finally, the theorem as stated is for a fixed C, though the class of functions we are considering might vary with n and p through the influence of I_η. The proof makes clear that C could also vary with n and p. We discuss the necessary adjustments in more detail after the proof.

Proof of Theorem 2. We use the notations Y_n = {Y_i}_{i=1}^n, and P_{Y_n} to denote probability conditional on Y_n. We call L = {Y_n : M(Y_n) ≤ M₀}. Let us also call YR_n = {{Y_i}_{i=1}^n, {R_i}_{i=1}^n}; similarly, P_{YR_n} denotes probability conditional on YR_n. We call LR = {YR_n : Y_n ∈ L}. We will start by working conditionally on YR_n and eventually decondition our results.

We assume from now on that the YR_n we work with is such that Y_n ∈ L. Note that P(Y_n ∈ L) → 1 by assumption, and also P(YR_n ∈ LR) → 1. The main idea now is that, in a strong sense,

‖X_i − X_j‖² ≈ ‖Y_i − Y_j‖² + (R_i² + R_j²)ν,

where ν = E‖Z_i‖²/p. To show this formally, we write, for i ≠ j,

‖X_i − X_j‖² − [‖Y_i − Y_j‖² + (R_i² + R_j²)ν] = α + β,

where

α = 2(R_iZ_i − R_jZ_j)'(Y_i − Y_j)/p^{1/2} and β = ‖R_iZ_i − R_jZ_j‖²/p − (R_i² E‖Z_i‖² + R_j² E‖Z_j‖²)/p.

Our aim is to show that, as n and p tend to infinity,

max_{i≠j} (|α| + |β|) → 0 in probability.

On max_{i≠j} |α|. Note that if i = j, α = 0. Clearly,

P_{YR_n}(|α| > r) ≤ P_{YR_n}(2|R_iZ_i'(Y_i − Y_j)|/p^{1/2} > r/2) + P_{YR_n}(2|R_jZ_j'(Y_i − Y_j)|/p^{1/2} > r/2).

Since we assumed that R_i ≤ R_p, we see that the function F(Z) = 2R_iZ'(Y_i − Y_j)/p^{1/2} is Lipschitz with respect to Euclidean norm, with Lipschitz constant smaller than 2M₀^{1/2}R_p/p^{1/2}, when Y_n is in L. Also, since E(Z_i) = 0, E(F(Z_i) | YR_n) = 0, where the expectation is conditional on YR_n. Hence, our concentration assumptions on Z_i imply that

P_{YR_n}(2|R_iZ_i'(Y_i − Y_j)|/p^{1/2} > r) ≤ C exp(−c₀(p^{1/2} r/[2M₀^{1/2}R_p])^b).

Therefore, if we use a simple union bound, we get

P_{YR_n}(max_{i≠j} |α| > r) ≤ 2Cn² exp(−c₀(p^{1/2} r/[4M₀^{1/2}R_p])^b).

In particular, if we pick, for ǫ > 0,

r₀ = 4M₀^{1/2}R_p p^{−1/2} ((2 log n + (log n)^{1+ǫ})/c₀)^{1/b},

we see that

P_{YR_n}(max_{i≠j} |α| > r₀) ≤ 2C exp(−(log n)^{1+ǫ}) → 0.

Since P(max|α| > t) ≤ P(max|α| > t and YR_n ∈ LR) + P(YR_n ∉ LR), and since the latter goes to 0, we have, unconditionally, P(max_{i≠j}|α| > r₀) → 0.

On max_{i≠j} |β|. We see that if A and B are vectors in R^p, the map N_{R_i,R_j}: (A, B) ↦ ‖R_iA − R_jB‖ is (R_i ∨ R_j)-Lipschitz on R^{2p} equipped with the norm (‖A‖² + ‖B‖²)^{1/2}, by the triangle inequality. Therefore, using Propositions 1.2 and 1.7 in Ledoux (2001), and using the fact that h(r) → 0 as r → ∞ and that h is continuous when using the latter, we conclude that

P_{YR_n}( | ‖R_iZ_i − R_jZ_j‖ − E(‖R_iZ_i − R_jZ_j‖) | > r ) ≤ 4h(r/(2R_p)).   (4)

If now γ = E(‖R_iZ_i − R_jZ_j‖ | YR_n), and if r₁ = 2R_p((2 log n + (log n)^{1+ǫ})/c₀)^{1/b}, a union bound gives

P_{YR_n}( max_{i≠j} | ‖R_iZ_i − R_jZ_j‖ − γ | > r₁ ) ≤ K exp(−(log n)^{1+ǫ}) → 0,

where K is a constant which does not depend on YR_n. So we conclude that, unconditionally, if Δ₀ = | ‖R_iZ_i − R_jZ_j‖ − γ |,

P(max_{i≠j} Δ₀ > r₁) → 0.

Note also that under our assumptions, r₁/p^{1/2} → 0. Recall that we aim to show that

|β| = | ‖R_iZ_i − R_jZ_j‖²/p − ν(R_i² + R_j²) | → 0 in probability, uniformly in i ≠ j.

Let us first work on Δ₁ = | ‖R_iZ_i − R_jZ_j‖² − γ² |/p. Using the fact that a² − b² = (a − b)(a + b), and therefore

|a² − b²| ≤ |a − b|(|a − b| + 2|b|),

we see that, if we choose a = ‖R_iZ_i − R_jZ_j‖/p^{1/2} and b = γ/p^{1/2}, the previous equation becomes

Δ₁ ≤ Δ₀(Δ₀ + 2γ)/p.

Therefore, if we can show that Δ₀(Δ₀ + 2γ)/p goes to 0 in probability, we will have Δ₁ → 0 in probability. Using the concentration result given in (4), in connection with Proposition 1.9 in Ledoux (2001) and a slight modification explained in El Karoui (2007), we have

| (R_i² + R_j²)pν − γ² | = var(‖R_iZ_i − R_jZ_j‖ | YR_n) ≤ κ_b R_p²,   (5)

where κ_b is an explicit constant depending only on b, C and c₀. Using our assumption that ν remains bounded, we see that γ/(p^{1/2}R_p) remains bounded. Therefore, for some K independent of p,

Δ₀(Δ₀ + 2γ)/p ≤ K R_p r₁/p^{1/2}, with probability going to 1.

Our assumptions also guarantee that R_p r₁/p^{1/2} → 0, so we conclude that, for a constant K independent of p,

| ‖R_iZ_i − R_jZ_j‖² − γ² |/p ≤ K r₁ R_p/p^{1/2} → 0, with probability going to 1.

Using Equation (5), we have the deterministic inequality

| ν(R_i² + R_j²) − γ²/p | ≤ κ_b R_p²/p ≤ K r₁ R_p/p^{1/2} for n large.

So we can finally conclude that, with high probability,

|β| = | ‖R_iZ_i − R_jZ_j‖²/p − ν(R_i² + R_j²) | ≤ K r₁ R_p/p^{1/2} → 0.

Putting all these elements together, we see that if

u_p = (M₀^{1/2} ∨ 1 ∨ R_p) R_p (log n + (log n)^{1+ǫ})^{1/b}/p^{1/2},

we can find a constant K such that

P( max_{i≠j} (|α| + |β|) > K u_p ) → 0.

In other words,

P( max_{i≠j} | ‖X_i − X_j‖² − [‖Y_i − Y_j‖² + ν(R_i² + R_j²)] | > K u_p ) → 0.   (6)

This establishes a strong form of the first part of the theorem, i.e., Equation (2).

Second part of the theorem: Equation (3). To get to the second part, we recall that, assuming that f is C-Lipschitz on an interval containing ‖X_i − X_j‖² and ‖Y_i − Y_j‖² + ν(R_i² + R_j²), we have

| f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + ν(R_i² + R_j²)) | ≤ C | ‖X_i − X_j‖² − [‖Y_i − Y_j‖² + ν(R_i² + R_j²)] |.

Let us define, for η > 0 given, the event

E = { ∀i ≠ j: ‖X_i − X_j‖² ∈ I_η and ‖Y_i − Y_j‖² ∈ [W₀, M₀] },

and the random element

ζ_n = sup_{f∈F_{C,I_η}} max_{i≠j} | f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + ν(R_i² + R_j²)) |.

When E is true, all the pairs (‖X_i − X_j‖², ‖Y_i − Y_j‖² + ν(R_i² + R_j²)) are in I_η: the part concerning ‖Y_i − Y_j‖² + ν(R_i² + R_j²) is obvious, and the one concerning ‖X_i − X_j‖² comes from the definition of E. So when E is true, we also have

∀i ≠ j, | f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + ν(R_i² + R_j²)) | ≤ C(|α| + |β|).

Let us now consider the random variable τ_n such that τ_n = C on E and ∞ otherwise, so τ_n = C 1_E + ∞ 1_{E^c}. Our remark above shows that

ζ_n ≤ τ_n max_{i≠j}(|α| + |β|).

Now, we see from our assumptions about {Y_i}_{i=1}^n, Equation (6), and the fact that u_p → 0, that for any η > 0, P(E) → 1. So we have P(τ_n = C) → 1. Also, max_{i≠j}(|α| + |β|) ≤ K u_p with probability tending to 1, so we can conclude that

P( τ_n max_{i≠j}(|α| + |β|) ≤ C K u_p ) → 1.

Hence, we also have

P*( ζ_n ≤ C K u_p ) → 1,

where this statement might have to be understood in terms of outer probabilities, hence the P* instead of P (see van der Vaart (1998)). In plain English, we have found a random variable, τ_n max_{i≠j}(|α| + |β|), bounded by C K u_p with probability going to 1, which is larger than the random element ζ_n. In other respects, we have, for all f ∈ F_{C,I_η},

‖M_f − M̃_f‖_F ≤ ζ_n,

since

|M_f(i,j) − M̃_f(i,j)| ≤ (1/n) | f(‖X_i − X_j‖²) − f(‖Y_i − Y_j‖² + ν(R_i² + R_j²)) | ≤ ζ_n/n.

Therefore,

sup_{f∈F_{C,I_η}} ‖M_f − M̃_f‖_F ≤ ζ_n → 0 in probability,   (7)

where once again this statement may have to be understood in terms of outer probabilities. The result stated in Equation (3) is proved.

We mentioned before the proof the possibility that we might let C vary with n and p and still get a good approximation result. This can be done by looking at Equation (7) above: ζ_n is less than K C u_p with high probability, so when u_p C(n) → 0, the main approximation result of Theorem 2 holds for a C, and therefore a class of functions, that vary with n and p.
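To illustrate Theorem 2 beyond the spherical case, the following sketch (ours; R_i drawn from {1, 2} and Σ = Id are illustrative choices) checks the approximation with the elliptical shift ν(R_i² + R_j²):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, nu = 60, 50000, 1.0
t = rng.uniform(0, 2 * np.pi, n)
Y = np.zeros((n, p)); Y[:, 0], Y[:, 1] = np.cos(t), np.sin(t)
R = rng.choice([1.0, 2.0], size=n)          # elliptical: R_i is not constant
X = Y + R[:, None] * rng.standard_normal((n, p)) / np.sqrt(p)

def sq_dists(A):
    g = A @ A.T; d = np.diag(g)
    return d[:, None] + d[None, :] - 2.0 * g

f = lambda x: np.exp(-x)
shift = nu * (R[:, None] ** 2 + R[None, :] ** 2)  # nu (R_i^2 + R_j^2)
M = f(sq_dists(X)) / n
M_tilde = f(sq_dists(Y) + shift) / n
np.fill_diagonal(M_tilde, f(0) / n)
print(np.linalg.norm(M - M_tilde, 'fro'))   # small: random, non-constant shift
```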

2.2.2 The case of inner-product kernel random matrices

We now turn our attention to kernel matrices of the form M(i,j) = f(X_i'X_j)/n, which are also of interest in practice. In that setting, we are able to obtain results similar in flavor to Theorem 2, with slight modifications of the assumptions we make about f.

Theorem 4 (Scalar product kernels). Suppose we observe data X_1, ..., X_n in R^p, with

X_i = Y_i + R_i Z_i/p^{1/2}.

We place ourselves in the high-dimensional setting where n and p tend to infinity. We assume that:
- {Y_i}_{i=1}^n ~ P_n;
- {Z_i}_{i=1}^n are i.i.d. with E(Z_i) = 0, and we also assume that {Y_i}_{i=1}^n and {Z_i}_{i=1}^n are independent;
- {R_i}_{i=1}^n are assumed to be independent of {Z_i}_{i=1}^n. We also assume that we can find a deterministic sequence R_p such that ∀i, |R_i| ≤ R_p and, without loss of generality, R_p ≥ 1;
- the distribution of Z_i is such that, for any 1-Lipschitz (with respect to Euclidean norm) function F, if µ_F = E(F(Z_i)),

P(|F(Z_i) − µ_F| > r) ≤ C exp(−c₀ r^b) = h(r),

where for simplicity we assume that c₀, C and b are independent of p. We call ν = E‖Z_i‖²/p and assume that ν stays bounded as p → ∞.

We call M₁(Y_n) = max_{i,j} |Y_i'Y_j|, and M₁ a real such that P(M₁(Y_n) ≤ M₁) → 1. We assume that there exists ǫ > 0 such that

(M₁^{1/2} ∨ 1 ∨ R_p) R_p (log n + (log n)^{1+ǫ})^{1/b} / p^{1/2} → 0.

We then have

max_{i,j} | X_i'X_j − (Y_i'Y_j + δ_{ij} νR_i²) | → 0 in probability,   (8)

where δ_{ij} is Kronecker's delta. We call J_η = [−M₁ − η − R_p²ν, M₁ + η + R_p²ν] and F_{C,J_η} = {f such that |f(x) − f(y)| ≤ C|x − y| for all x, y ∈ J_η}. We then consider the random matrices M_f with (i,j) entry

M_f(i,j) = (1/n) f(X_i'X_j), for f ∈ F_{C,J_η}.

Let us call M̃_f the matrix with (i,j) entry

M̃_f(i,j) = (1/n) f(Y_i'Y_j) if i ≠ j, and (1/n) f(‖Y_i‖² + νR_i²) if i = j.

We have, for any C > 0 and η > 0,

lim_{n,p→∞} sup_{f∈F_{C,J_η}} ‖M_f − M̃_f‖_F = 0 in probability.

We note that under our assumptions, we also have |f(‖Y_i‖² + νR_i²) − f(‖Y_i‖²)| ≤ νC R_p², with high probability, and uniformly in f ∈ F_{C,J_η}. Therefore, when R_p⁴/n → 0, the result is also valid if we replace the diagonal of M̃_f by {f(‖Y_i‖²)}_{i=1}^n/n, in which case the new approximating matrix is the kernel matrix computed from the signal part of the data. Furthermore, the same argument shows that we get a valid operator norm approximation of M_f by this pure signal matrix as soon as R_p²/n tends to 0.

The same measurability issues as in the previous theorems might arise here, and the statement should be understood as before: we can find a random variable going to 0 in probability that is larger than the random element sup_{f∈F_{C,J_η}} ‖M_f − M̃_f‖_F. Finally, let us note that once again the theorem is stated for a fixed C, and hence for an essentially fixed (with n) class of functions, though some changes in this class might come from varying J_η; but the proof allows us to deal with a varying C(n). The adjustments are very similar to the ones we discussed after the proof of Theorem 2, and we leave them to the interested reader.
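A numerical illustration of Theorem 4 (ours; f = tanh, which is 1-Lipschitz, a circle signal and R_i = 1 are illustrative choices): off the diagonal, the approximating matrix uses the original kernel on the pure signal.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, nu = 60, 50000, 1.0
t = rng.uniform(0, 2 * np.pi, n)
Y = np.zeros((n, p)); Y[:, 0], Y[:, 1] = np.cos(t), np.sin(t)
X = Y + rng.standard_normal((n, p)) / np.sqrt(p)       # R_i = 1

f = np.tanh                                            # 1-Lipschitz
M = f(X @ X.T) / n                                     # dot-product kernel, noisy data
M_tilde = f(Y @ Y.T) / n                               # same kernel, pure signal
idx = np.arange(n)
M_tilde[idx, idx] = f(np.sum(Y**2, axis=1) + nu) / n   # diagonal: f(||Y_i||^2 + nu R_i^2)/n
print(np.linalg.norm(M - M_tilde, 'fro'))              # small: the noise drops out
```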

Proof. The proof is quite similar to that of Theorem 2, so we mostly outline the differences, and we use the same notations as before. We now have to focus on

X_i'X_j = Y_i'Y_j + R_iZ_i'Y_j/p^{1/2} + R_jZ_j'Y_i/p^{1/2} + R_iR_jZ_i'Z_j/p.

The analysis of R_iZ_i'Y_j/p^{1/2} is entirely similar to our analysis of α in the proof of Theorem 2. The key remark now is that, as a function of Z_i, when YR_n ∈ LR, it is, with the new definition of M₁, (R_pM₁^{1/2}/p^{1/2})-Lipschitz with respect to Euclidean norm. So we immediately have, with the new definition of M₁: if

r₀ = R_pM₁^{1/2} p^{−1/2} ((2 log n + (log n)^{1+ǫ})/c₀)^{1/b},

and YR_n ∈ LR, for some K > 0 which does not depend on YR_n,

P_{YR_n}( max_{i,j} |R_iZ_i'Y_j|/p^{1/2} > r₀ ) ≤ K exp(−(log n)^{1+ǫ}).

Now, since P(YR_n ∉ LR) → 0, we conclude as before that

P( max_{i,j} |R_iZ_i'Y_j|/p^{1/2} > r₀ ) → 0.

On the other hand, using the fact that 4R_iR_jZ_i'Z_j = ‖R_iZ_i + R_jZ_j‖² − ‖R_iZ_i − R_jZ_j‖², and analyzing the concentration properties of ‖R_iZ_i + R_jZ_j‖ in the same way as we did those of ‖R_iZ_i − R_jZ_j‖, we conclude that if

u_p = R_p² ((2 log n + (log n)^{1+ǫ})/c₀)^{1/b} / p^{1/2},

we can find a constant K such that

P( max_{i≠j} | ‖R_iZ_i − R_jZ_j‖²/p − ν(R_i² + R_j²) | > K u_p ) → 0, and

P( max_{i≠j} | ‖R_iZ_i + R_jZ_j‖²/p − ν(R_i² + R_j²) | > K u_p ) → 0.

Similar arguments, relying on the fact that Z ↦ ‖Z‖ is obviously 1-Lipschitz with respect to Euclidean norm, also lead to the fact that

P( max_i | R_i²‖Z_i‖²/p − νR_i² | > K u_p ) → 0.

Therefore, we can find K, greater than 1 without loss of generality, such that

P( max_{i,j} | R_iR_jZ_i'Z_j/p − δ_{ij}νR_i² | > K u_p ) → 0.

We can therefore conclude that

P( max_{i,j} | X_i'X_j − (Y_i'Y_j + δ_{ij}νR_i²) | > K u_p + 2r₀ ) → 0.

If (M₁^{1/2} ∨ 1 ∨ R_p) R_p (log n + (log n)^{1+ǫ})^{1/b}/p^{1/2} → 0, then both r₀ and u_p tend to 0. Therefore, under our assumptions,

max_{i,j} | X_i'X_j − (Y_i'Y_j + δ_{ij}νR_i²) | → 0 in probability.

So we have shown the first assertion of the theorem. The final step of the proof is now clear: we have, for all i, j,

| f(X_i'X_j) − f(Y_i'Y_j + δ_{ij}νR_i²) | ≤ C | X_i'X_j − (Y_i'Y_j + δ_{ij}νR_i²) |,

when, for all i, j, X_i'X_j and Y_i'Y_j + δ_{ij}νR_i² are in J_η. This event happens with probability going to 1 under our assumptions. So, following the same approach as before, and dealing with measurability in the same way, we have, with probability going to 1,

sup_{f∈F_{C,J_η}} max_{i,j} | f(X_i'X_j) − f(Y_i'Y_j + δ_{ij}νR_i²) | ≤ C max_{i,j} | X_i'X_j − (Y_i'Y_j + δ_{ij}νR_i²) |.

So we conclude that

sup_{f∈F_{C,J_η}} max_{i,j} | f(X_i'X_j) − f(Y_i'Y_j + δ_{ij}νR_i²) | → 0 in probability.

From this statement, we get, in the same manner as before,

sup_{f∈F_{C,J_η}} ‖M_f − M̃_f‖_F → 0 in probability.

As before, the equations above show that if C(n)(u_p + r₀) → 0, the same approximation result holds, now with a varying C(n).

2.3 Practical consequences of the results: case of spherical noise

Our aim in giving approximation results is naturally to use existing knowledge concerning the approximating matrix to reach conclusions concerning the information+noise kernel matrices that are of interest here. In particular, we have in mind situations where the signal part of the data (i.e., what we called {Y_i}_{i=1}^n in the theorems) and f, or f(· + 2ν) with ν as defined in Theorems 1 or 2, are such that the assumptions of Theorems 3 or 5 in Koltchinskii and Giné (2000) are satisfied, in which case we can approximate the eigenvalues of M̃ by those of the corresponding operator in L²(dP). In this setting, the matrix M̃, which is normalized so its entries are of order 1/n, has a non-degenerate limit, which is why we considered for our kernel matrices the normalization f(‖X_i − X_j‖²)/n. This normalization by 1/n makes our proofs considerably simpler than the ones given in El Karoui (2007). Another potentially interesting application is the case where the signal part of the data is sampled i.i.d. from a manifold with bounded Euclidean diameter, in which case our results are clearly applicable.

2.3.1 Spectral properties of information+noise kernel random matrices from pure signal kernel random matrices

The practical interest of the theorems we obtained above lies in the fact that the Frobenius norm is larger than the operator norm, and therefore all of our results also hold in operator norm. Now we recall the discussion in El Karoui (2008), Section 3.3, where we explained that consistency in operator norm implies consistency of eigenvalues, and consistency of eigenspaces corresponding to separated eigenvalues, as consequences of Weyl's inequality and the Davis-Kahan sin(θ) theorem (see Bhatia (1997) and Stewart and Sun (1990)). Theorems 1, 2 and 4 therefore imply that, under the assumptions stated there, the spectral properties of the matrix M can be deduced from those of the matrix M̃. In particular, for techniques such as kernel PCA, we expect, when it is a reasonable idea to use that technique, that M will have some separated eigenvalues, i.e., a few will be large and there will be a gap in the spectrum. In that setting, it is enough to understand M̃, which corresponds, if ∀i, R_i = 1, to a pure signal matrix with a possibly slightly different kernel, to have a theoretical understanding of the properties of the technique. For instance, if ∀i, R_i = 1, and if the assumptions underlying the first-order results of Koltchinskii and Giné (2000) are satisfied for M̃, the first-order spectral properties of M are the same as those of M̃, and hence of the corresponding operator in L²(dP).
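The two perturbation results just invoked are easy to visualize numerically. In the sketch below (ours; the diagonal test matrix and the perturbation size are arbitrary), a symmetric matrix with a separated top eigenvalue is perturbed by a matrix of small operator norm, and both the top eigenvalue (Weyl) and the top eigenvector (Davis-Kahan) barely move.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
A = np.diag(np.r_[5.0, np.full(n - 1, 0.5)])   # separated top eigenvalue
E = rng.standard_normal((n, n)); E = E + E.T
E *= 1e-2 / np.linalg.norm(E, 2)               # operator norm ||E||_2 = 0.01

wA, vA = np.linalg.eigh(A)
wB, vB = np.linalg.eigh(A + E)
print(abs(wA[-1] - wB[-1]))        # <= ||E||_2 (Weyl's inequality)
print(abs(vA[:, -1] @ vB[:, -1]))  # ~1 (Davis-Kahan: separated eigenvalue)
```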
2.3.2 On the Gaussian kernel

Our analysis reveals a very interesting feature of the Gaussian kernel, i.e., the case where M(i,j) = exp(−s‖X_i − X_j‖²)/n, for some s > 0: when Theorem 1 or Corollary 3 (i.e., Theorem 2 with ∀i, R_i = 1) applies, the eigenspaces corresponding to separated eigenvalues of the signal+noise kernel matrix converge to those of the pure signal matrix. This is simply due to the fact that, in that setting, if S is the matrix such that

S(i,j) = exp(−2νs) (1/n) exp(−s‖Y_i − Y_j‖²),

a rescaled version of the pure signal matrix M_Y with (i,j) entry (1/n) exp(−s‖Y_i − Y_j‖²), we have ‖S − M̃‖₂ → 0. This latter statement is a simple consequence of the fact that S − M̃ is a diagonal matrix with entries (exp(−2νs) − 1)/n on the diagonal, and therefore its operator norm goes to 0. On the other hand, S clearly has the same eigenvectors as the pure signal matrix M_Y. Hence, because the eigenspaces of M are consistent for the eigenspaces of S corresponding to separated eigenvalues, they are also consistent for those of M_Y.
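Numerically (our sketch; circle signal, Σ = Id, s = 1, so ν = 1, all illustrative), both the eigenvector alignment and the exp(−2νs) shrinkage of the top eigenvalue are visible:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, s, nu = 60, 50000, 1.0, 1.0
t = rng.uniform(0, 2 * np.pi, n)
Y = np.zeros((n, p)); Y[:, 0], Y[:, 1] = np.cos(t), np.sin(t)
X = Y + rng.standard_normal((n, p)) / np.sqrt(p)

def sq_dists(A):
    g = A @ A.T; d = np.diag(g)
    return d[:, None] + d[None, :] - 2.0 * g

M = np.exp(-s * sq_dists(X)) / n      # Gaussian kernel, information+noise data
M_Y = np.exp(-s * sq_dists(Y)) / n    # Gaussian kernel, pure signal

wM, vM = np.linalg.eigh(M)
wY, vY = np.linalg.eigh(M_Y)
print(abs(vM[:, -1] @ vY[:, -1]))     # ~1: top eigenvectors aligned
print(wM[-1] / wY[-1])                # close to exp(-2 nu s), up to a small
                                      # diagonal correction
```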

We note that our results are actually stronger and allow us to deal with a collection of matrices with varying s, and not a single s as we just discussed. This is because we can deal with approximations over a collection of functions in all our theorems. Because of the practical importance of eigenspaces in techniques such as kernel PCA, these remarks can be seen as giving a theoretical justification for the use of the Gaussian kernel over other kernels in situations where we think we might be in an information + noise setting and the noise is spherical.

On the other hand, S underestimates the large eigenvalues of M_Y, because S = exp(−2νs)M_Y, and obviously exp(−2νs) < 1. Using Weyl's inequality (see Bhatia (1997)), we have, if we denote by λ_i(M) the i-th eigenvalue of the symmetric matrix M,

∀i, 1 ≤ i ≤ n, |λ_i(M) − λ_i(S)| ≤ ‖M − S‖₂.

Since the right-hand side goes to 0 asymptotically, the eigenvalues of M_Y (the pure signal matrix) that stay asymptotically bounded away from 0 are underestimated by the corresponding eigenvalues of M.

When the noise is elliptical, i.e., the R_i's are not all equal to 1, the new matrix S we have to deal with has entries

S(i,j) = exp(−sνR_i²) exp(−sνR_j²) (1/n) exp(−s‖Y_i − Y_j‖²),

so it can be written in matrix form S = DM_YD, where D is a diagonal matrix with D(i,i) = exp(−sνR_i²). By the same arguments as above, ‖S − M‖₂ → 0 in probability, but now S does not have the same eigenvectors as the pure signal matrix M_Y. So, in this elliptical setting, if we were to do kernel analysis on M, we would not be recovering the eigenspaces of the pure signal matrix M_Y.

2.3.3 Variants of kernel matrices: Laplacian matrices and the issue of centering

In various parts of statistics and machine learning, it has been argued that Laplacian matrices should be used instead of kernel matrices. See for instance the very interesting Belkin and Niyogi (2008), where various spectral properties of Laplacian matrices have been studied, under a pure signal assumption in our terminology: for instance, it is assumed that the data is sampled from a fixed-dimensional manifold. In light of the theoretical and practical success of these methods, it is natural to ask what happens in the information+noise case.

There are several definitions of Laplacian matrices. A popular one (see e.g. the work of Belkin and Niyogi, among other publications Belkin and Niyogi (2008)) is derived from kernel matrices: given a kernel matrix M, the Laplacian matrix is defined as

L(i,j) = −M(i,j) if i ≠ j, and L(i,i) = Σ_{j≠i} M(i,j) otherwise.

When our Theorems 2 or 4 apply, we have seen that, for the relevant classes of functions F,

sup_{f∈F} max_{i,j} n |M_f(i,j) − M̃_f(i,j)| → 0 in probability.

Let us now focus on the case of a single function f. If we call L̃ the Laplacian matrix corresponding to M̃, we have

max_{i≠j} n |L(i,j) − L̃(i,j)| → 0 in probability, and max_i |L(i,i) − L̃(i,i)| → 0 in probability.

We conclude that ‖L − L̃‖₂ → 0 in probability; we can therefore deduce the spectral properties of the Laplacian matrix L from those of L̃, which, when ∀i, R_i = 1, is a pure signal matrix, where we have slightly adjusted the kernel. Here again, the Gaussian kernel plays a special role, since when we use a Gaussian kernel, L̃ is a scaled version of the Laplacian matrix computed from the signal part of the data.

Finally, other versions of the Laplacian are also used in practice. In particular, a normalized version is sometimes advocated, and computed as

N_L = D_L^{−1/2} L D_L^{−1/2}, where D_L is the diagonal of the matrix L defined above.

We have just seen that ‖D_L − D_L̃‖₂ → 0 in probability and ‖L − L̃‖₂ → 0 in probability. Therefore, if the entries of D_L̃ are bounded away from 0 with probability going to 1, we conclude that ‖D_L^{−1/2}‖₂ stays bounded with high probability, and ‖N_L − Ñ_L‖₂ → 0 in probability.
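For concreteness, a small implementation (ours) of the two Laplacian constructions just defined:

```python
import numpy as np

def laplacian(M):
    """L(i,j) = -M(i,j) for i != j, L(i,i) = sum_{j != i} M(i,j),
    following the definition above."""
    L = -M.copy()
    np.fill_diagonal(L, 0.0)
    np.fill_diagonal(L, -L.sum(axis=1))  # row sums of M off the diagonal
    return L

def normalized_laplacian(L):
    """N_L = D_L^{-1/2} L D_L^{-1/2}, with D_L the diagonal of L."""
    d = np.sqrt(np.diag(L))
    return L / np.outer(d, d)
```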

So, once again, understanding the spectral properties of N_L essentially boils down to understanding those of Ñ_L, which is, in the spherical setting where ∀i, R_i = 1, a pure signal matrix. In the case of the Gaussian kernel, Ñ_L is equal to the normalized Laplacian matrix computed from the pure signal data {Y_i}_{i=1}^n.

The question of centering. In practice, it is often the case that one works with centered versions of kernel matrices: either the row sums, the column sums, or both are made to be equal to zero. These centering operations amount to multiplying our original kernel matrix, respectively on the right, left, or both sides, by the matrix H = Id − 𝟙𝟙'/n, where 𝟙 is the n-dimensional vector whose entries are all equal to 1. This matrix has operator norm 1, so when M is such that ‖M − M̃‖₂ → 0, the same is true for H^aMH^b and H^aM̃H^b, where a and b are either 0 or 1. This shows that our approximations are therefore also informative when working with centered kernel matrices.
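A minimal sketch (ours) of the centering operation; having operator norm 1 is the only property of H the argument uses.

```python
import numpy as np

def center_kernel(M, rows=True, cols=True):
    """Multiply by H = I - (1/n) 11' on the chosen sides. Since ||H||_2 = 1,
    ||H M H - H M~ H||_2 <= ||M - M~||_2 for any approximating matrix M~."""
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    if rows:
        M = H @ M
    if cols:
        M = M @ H
    return M
```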
3 Conclusions

Our results aim to bridge the gap in the existing literature between the study of kernel random matrices in the presence of pure low-dimensional signal data (see e.g. Koltchinskii and Giné (2000)) and the case of truly high-dimensional data (see El Karoui (2007)). Our study of information+noise kernel random matrices shows that, to first order, kernel random matrices are somewhat spectrally robust to the corruption of signal by additive, high-dimensional and spherical noise whose norm is controlled. In particular, they tend to behave much more like a kernel matrix computed from a low-dimensional signal than one computed from high-dimensional data.

Some noteworthy results include the fact that dot-product kernel random matrices are, under reasonable assumptions on the kernel and the signal distribution, spectrally robust for both eigenvalues and eigenvectors. The Gaussian kernel also yields spectrally robust matrices at the level of eigenvectors, when the noise is spherical. However, it will underestimate separated eigenvalues of the Gaussian kernel matrix corresponding to the signal part of the data. On the other hand, Euclidean distance kernel random matrices are not, in general, robust to the presence of additive noise. As our results show, under reasonably minimal assumptions on the noise, the kernel and the signal distribution, a Euclidean distance kernel random matrix computed from additively corrupted data behaves like another Euclidean kernel matrix computed with another kernel: in the case of spherical noise, it is a shifted version of f, the shift being twice the squared norm of the noise. For spherical noise, this is bound to create, except for the Gaussian kernel, potentially serious inconsistencies in both estimators of eigenvalues and eigenvectors, because the eigenproperties of the kernel matrix corresponding to the function f_{2ν} = f(· + 2ν) are in general different from those of the kernel matrix corresponding to the function f. The same remarks also apply to the case of elliptical noise, where the change of kernel is not deterministic and is even more complicated to describe.

Our study also highlights the importance of the implicit geometric assumptions that are made about the noise. In particular, the results are qualitatively different if the noise is spherical (e.g., multivariate Gaussian) or elliptical (e.g., multivariate t). Interpretation is more complicated in the elliptical case, and a number of nice properties (e.g., robustness or consistency) which hold for spherical noise do not hold for elliptical noise. Our results can therefore be seen as highlighting, from a theoretical point of view, the strengths and limitations of techniques which rely on kernel random matrices as a primary element in a data analysis. We hope they shed light on an interesting issue and will help refine our understanding of the behavior of kernel techniques and related methodologies for high-dimensional input data.

References

Anderson, T. W. (2003). An introduction to multivariate statistical analysis. Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, NJ, third edition.

Bach, F. R. and Jordan, M. I. (2003). Kernel independent component analysis. J. Mach. Learn. Res. 3, 1-48.

Bai, Z. D. (1999). Methodologies in spectral analysis of large-dimensional random matrices, a review. Statist. Sinica 9, 611-677. With comments by G. J. Rodgers and Jack W. Silverstein, and a rejoinder by the author.

Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15, 1373-1396.

Belkin, M. and Niyogi, P. (2008). Convergence of Laplacian eigenmaps. Preprint.

Bhatia, R. (1997). Matrix analysis, volume 169 of Graduate Texts in Mathematics. Springer-Verlag, New York.

Cressie, N. A. C. (1993). Statistics for spatial data. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York. Revised reprint of the 1991 edition, a Wiley-Interscience Publication.

El Karoui, N. (2007). The spectrum of kernel random matrices. Technical Report 748, Department of Statistics, UC Berkeley. To appear in The Annals of Statistics.

El Karoui, N. (2008). Operator norm consistent estimation of large dimensional sparse covariance matrices. The Annals of Statistics 36, 2717-2756.

El Karoui, N. (2009). Concentration of measure and spectra of random matrices: with applications to correlation matrices, elliptical distributions and beyond. The Annals of Applied Probability. To appear.

Izenman, A. J. (2008). Modern multivariate statistical techniques. Springer Texts in Statistics. Springer, New York. Regression, classification, and manifold learning.

Johnstone, I. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29, 295-327.

Johnstone, I. M. (2007). High dimensional statistical inference and random matrices. In International Congress of Mathematicians, Vol. I. Eur. Math. Soc., Zürich.

Koltchinskii, V. and Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli 6, 113-167.

Ledoux, M. (2001). The concentration of measure phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI.

Marčenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues in certain sets of random matrices. Mat. Sb. (N.S.) 72 (114), 507-536.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Schechtman, G. and Zinn, J. (2000). Concentration on the ℓ_p^n ball. In Geometric aspects of functional analysis, volume 1745 of Lecture Notes in Math. Springer, Berlin.

Schölkopf, B. and Smola, A. J. (2002). Learning with kernels. The MIT Press, Cambridge, MA.

Stewart, G. W. and Sun, J. G. (1990). Matrix perturbation theory. Computer Science and Scientific Computing. Academic Press Inc., Boston, MA.

van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.

von Luxburg, U., Belkin, M., and Bousquet, O. (2008). Consistency of spectral clustering. Ann. Statist. 36, 555-586.

Williams, C. and Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proceedings of the 17th International Conference on Machine Learning.

Zwald, L., Bousquet, O., and Blanchard, G. (2004). Statistical properties of kernel principal component analysis. In Learning theory, volume 3120 of Lecture Notes in Comput. Sci. Springer, Berlin.


Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation Uniformly best wavenumber aroximations by satial central difference oerators: An initial investigation Vitor Linders and Jan Nordström Abstract A characterisation theorem for best uniform wavenumber aroximations

More information

Khinchine inequality for slightly dependent random variables

Khinchine inequality for slightly dependent random variables arxiv:170808095v1 [mathpr] 7 Aug 017 Khinchine inequality for slightly deendent random variables Susanna Sektor Abstract We rove a Khintchine tye inequality under the assumtion that the sum of Rademacher

More information

Commutators on l. D. Dosev and W. B. Johnson

Commutators on l. D. Dosev and W. B. Johnson Submitted exclusively to the London Mathematical Society doi:10.1112/0000/000000 Commutators on l D. Dosev and W. B. Johnson Abstract The oerators on l which are commutators are those not of the form λi

More information

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction GOOD MODELS FOR CUBIC SURFACES ANDREAS-STEPHAN ELSENHANS Abstract. This article describes an algorithm for finding a model of a hyersurface with small coefficients. It is shown that the aroach works in

More information

Sums of independent random variables

Sums of independent random variables 3 Sums of indeendent random variables This lecture collects a number of estimates for sums of indeendent random variables with values in a Banach sace E. We concentrate on sums of the form N γ nx n, where

More information

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables Imroved Bounds on Bell Numbers and on Moments of Sums of Random Variables Daniel Berend Tamir Tassa Abstract We rovide bounds for moments of sums of sequences of indeendent random variables. Concentrating

More information

1 Riesz Potential and Enbeddings Theorems

1 Riesz Potential and Enbeddings Theorems Riesz Potential and Enbeddings Theorems Given 0 < < and a function u L loc R, the Riesz otential of u is defined by u y I u x := R x y dy, x R We begin by finding an exonent such that I u L R c u L R for

More information

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales Lecture 6 Classification of states We have shown that all states of an irreducible countable state Markov chain must of the same tye. This gives rise to the following classification. Definition. [Classification

More information

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS CASEY BRUCK 1. Abstract The goal of this aer is to rovide a concise way for undergraduate mathematics students to learn about how rime numbers behave

More information

On Z p -norms of random vectors

On Z p -norms of random vectors On Z -norms of random vectors Rafa l Lata la Abstract To any n-dimensional random vector X we may associate its L -centroid body Z X and the corresonding norm. We formulate a conjecture concerning the

More information

Applications to stochastic PDE

Applications to stochastic PDE 15 Alications to stochastic PE In this final lecture we resent some alications of the theory develoed in this course to stochastic artial differential equations. We concentrate on two secific examles:

More information

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)]

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)] LECTURE 7 NOTES 1. Convergence of random variables. Before delving into the large samle roerties of the MLE, we review some concets from large samle theory. 1. Convergence in robability: x n x if, for

More information

Understanding Big Data Spectral Clustering

Understanding Big Data Spectral Clustering Understanding Big Data Sectral Clustering Romain Couillet, Florent Benaych-Georges To cite this version: Romain Couillet, Florent Benaych-Georges. Understanding Big Data Sectral Clustering. IEEE 6th International

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

L p -CONVERGENCE OF THE LAPLACE BELTRAMI EIGENFUNCTION EXPANSIONS

L p -CONVERGENCE OF THE LAPLACE BELTRAMI EIGENFUNCTION EXPANSIONS L -CONVERGENCE OF THE LAPLACE BELTRAI EIGENFUNCTION EXPANSIONS ATSUSHI KANAZAWA Abstract. We rovide a simle sufficient condition for the L - convergence of the Lalace Beltrami eigenfunction exansions of

More information

Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various Families of R n Norms and Some Open Problems

Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various Families of R n Norms and Some Open Problems Int. J. Oen Problems Comt. Math., Vol. 3, No. 2, June 2010 ISSN 1998-6262; Coyright c ICSRS Publication, 2010 www.i-csrs.org Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various

More information

The inverse Goldbach problem

The inverse Goldbach problem 1 The inverse Goldbach roblem by Christian Elsholtz Submission Setember 7, 2000 (this version includes galley corrections). Aeared in Mathematika 2001. Abstract We imrove the uer and lower bounds of the

More information

Journal of Mathematical Analysis and Applications

Journal of Mathematical Analysis and Applications J. Math. Anal. Al. 44 (3) 3 38 Contents lists available at SciVerse ScienceDirect Journal of Mathematical Analysis and Alications journal homeage: www.elsevier.com/locate/jmaa Maximal surface area of a

More information

Elementary theory of L p spaces

Elementary theory of L p spaces CHAPTER 3 Elementary theory of L saces 3.1 Convexity. Jensen, Hölder, Minkowski inequality. We begin with two definitions. A set A R d is said to be convex if, for any x 0, x 1 2 A x = x 0 + (x 1 x 0 )

More information

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS #A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS Ramy F. Taki ElDin Physics and Engineering Mathematics Deartment, Faculty of Engineering, Ain Shams University, Cairo, Egyt

More information

GENERALIZED NORMS INEQUALITIES FOR ABSOLUTE VALUE OPERATORS

GENERALIZED NORMS INEQUALITIES FOR ABSOLUTE VALUE OPERATORS International Journal of Analysis Alications ISSN 9-8639 Volume 5, Number (04), -9 htt://www.etamaths.com GENERALIZED NORMS INEQUALITIES FOR ABSOLUTE VALUE OPERATORS ILYAS ALI, HU YANG, ABDUL SHAKOOR Abstract.

More information

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK Comuter Modelling and ew Technologies, 5, Vol.9, o., 3-39 Transort and Telecommunication Institute, Lomonosov, LV-9, Riga, Latvia MATHEMATICAL MODELLIG OF THE WIRELESS COMMUICATIO ETWORK M. KOPEETSK Deartment

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

HASSE INVARIANTS FOR THE CLAUSEN ELLIPTIC CURVES

HASSE INVARIANTS FOR THE CLAUSEN ELLIPTIC CURVES HASSE INVARIANTS FOR THE CLAUSEN ELLIPTIC CURVES AHMAD EL-GUINDY AND KEN ONO Astract. Gauss s F x hyergeometric function gives eriods of ellitic curves in Legendre normal form. Certain truncations of this

More information

Cryptanalysis of Pseudorandom Generators

Cryptanalysis of Pseudorandom Generators CSE 206A: Lattice Algorithms and Alications Fall 2017 Crytanalysis of Pseudorandom Generators Instructor: Daniele Micciancio UCSD CSE As a motivating alication for the study of lattice in crytograhy we

More information

On Isoperimetric Functions of Probability Measures Having Log-Concave Densities with Respect to the Standard Normal Law

On Isoperimetric Functions of Probability Measures Having Log-Concave Densities with Respect to the Standard Normal Law On Isoerimetric Functions of Probability Measures Having Log-Concave Densities with Resect to the Standard Normal Law Sergey G. Bobkov Abstract Isoerimetric inequalities are discussed for one-dimensional

More information

E-companion to A risk- and ambiguity-averse extension of the max-min newsvendor order formula

E-companion to A risk- and ambiguity-averse extension of the max-min newsvendor order formula e-comanion to Han Du and Zuluaga: Etension of Scarf s ma-min order formula ec E-comanion to A risk- and ambiguity-averse etension of the ma-min newsvendor order formula Qiaoming Han School of Mathematics

More information

Sobolev Spaces with Weights in Domains and Boundary Value Problems for Degenerate Elliptic Equations

Sobolev Spaces with Weights in Domains and Boundary Value Problems for Degenerate Elliptic Equations Sobolev Saces with Weights in Domains and Boundary Value Problems for Degenerate Ellitic Equations S. V. Lototsky Deartment of Mathematics, M.I.T., Room 2-267, 77 Massachusetts Avenue, Cambridge, MA 02139-4307,

More information

On a class of Rellich inequalities

On a class of Rellich inequalities On a class of Rellich inequalities G. Barbatis A. Tertikas Dedicated to Professor E.B. Davies on the occasion of his 60th birthday Abstract We rove Rellich and imroved Rellich inequalities that involve

More information

Sets of Real Numbers

Sets of Real Numbers Chater 4 Sets of Real Numbers 4. The Integers Z and their Proerties In our revious discussions about sets and functions the set of integers Z served as a key examle. Its ubiquitousness comes from the fact

More information

Stochastic integration II: the Itô integral

Stochastic integration II: the Itô integral 13 Stochastic integration II: the Itô integral We have seen in Lecture 6 how to integrate functions Φ : (, ) L (H, E) with resect to an H-cylindrical Brownian motion W H. In this lecture we address the

More information

Lecture: Condorcet s Theorem

Lecture: Condorcet s Theorem Social Networs and Social Choice Lecture Date: August 3, 00 Lecture: Condorcet s Theorem Lecturer: Elchanan Mossel Scribes: J. Neeman, N. Truong, and S. Troxler Condorcet s theorem, the most basic jury

More information

ON MINKOWSKI MEASURABILITY

ON MINKOWSKI MEASURABILITY ON MINKOWSKI MEASURABILITY F. MENDIVIL AND J. C. SAUNDERS DEPARTMENT OF MATHEMATICS AND STATISTICS ACADIA UNIVERSITY WOLFVILLE, NS CANADA B4P 2R6 Abstract. Two athological roerties of Minkowski content

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

GENERICITY OF INFINITE-ORDER ELEMENTS IN HYPERBOLIC GROUPS

GENERICITY OF INFINITE-ORDER ELEMENTS IN HYPERBOLIC GROUPS GENERICITY OF INFINITE-ORDER ELEMENTS IN HYPERBOLIC GROUPS PALLAVI DANI 1. Introduction Let Γ be a finitely generated grou and let S be a finite set of generators for Γ. This determines a word metric on

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

Best approximation by linear combinations of characteristic functions of half-spaces

Best approximation by linear combinations of characteristic functions of half-spaces Best aroximation by linear combinations of characteristic functions of half-saces Paul C. Kainen Deartment of Mathematics Georgetown University Washington, D.C. 20057-1233, USA Věra Kůrková Institute of

More information

Collaborative Place Models Supplement 1

Collaborative Place Models Supplement 1 Collaborative Place Models Sulement Ber Kaicioglu Foursquare Labs ber.aicioglu@gmail.com Robert E. Schaire Princeton University schaire@cs.rinceton.edu David S. Rosenberg P Mobile Labs david.davidr@gmail.com

More information

Notes on Instrumental Variables Methods

Notes on Instrumental Variables Methods Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of

More information

216 S. Chandrasearan and I.C.F. Isen Our results dier from those of Sun [14] in two asects: we assume that comuted eigenvalues or singular values are

216 S. Chandrasearan and I.C.F. Isen Our results dier from those of Sun [14] in two asects: we assume that comuted eigenvalues or singular values are Numer. Math. 68: 215{223 (1994) Numerische Mathemati c Sringer-Verlag 1994 Electronic Edition Bacward errors for eigenvalue and singular value decomositions? S. Chandrasearan??, I.C.F. Isen??? Deartment

More information

Inference for Empirical Wasserstein Distances on Finite Spaces: Supplementary Material

Inference for Empirical Wasserstein Distances on Finite Spaces: Supplementary Material Inference for Emirical Wasserstein Distances on Finite Saces: Sulementary Material Max Sommerfeld Axel Munk Keywords: otimal transort, Wasserstein distance, central limit theorem, directional Hadamard

More information

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM JOHN BINDER Abstract. In this aer, we rove Dirichlet s theorem that, given any air h, k with h, k) =, there are infinitely many rime numbers congruent to

More information

Real Analysis 1 Fall Homework 3. a n.

Real Analysis 1 Fall Homework 3. a n. eal Analysis Fall 06 Homework 3. Let and consider the measure sace N, P, µ, where µ is counting measure. That is, if N, then µ equals the number of elements in if is finite; µ = otherwise. One usually

More information

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at A Scaling Result for Exlosive Processes M. Mitzenmacher Λ J. Sencer We consider the following balls and bins model, as described in [, 4]. Balls are sequentially thrown into bins so that the robability

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

2 Asymptotic density and Dirichlet density

2 Asymptotic density and Dirichlet density 8.785: Analytic Number Theory, MIT, sring 2007 (K.S. Kedlaya) Primes in arithmetic rogressions In this unit, we first rove Dirichlet s theorem on rimes in arithmetic rogressions. We then rove the rime

More information

arxiv:cond-mat/ v2 25 Sep 2002

arxiv:cond-mat/ v2 25 Sep 2002 Energy fluctuations at the multicritical oint in two-dimensional sin glasses arxiv:cond-mat/0207694 v2 25 Se 2002 1. Introduction Hidetoshi Nishimori, Cyril Falvo and Yukiyasu Ozeki Deartment of Physics,

More information

Extension of Minimax to Infinite Matrices

Extension of Minimax to Infinite Matrices Extension of Minimax to Infinite Matrices Chris Calabro June 21, 2004 Abstract Von Neumann s minimax theorem is tyically alied to a finite ayoff matrix A R m n. Here we show that (i) if m, n are both inite,

More information

RECIPROCITY LAWS JEREMY BOOHER

RECIPROCITY LAWS JEREMY BOOHER RECIPROCITY LAWS JEREMY BOOHER 1 Introduction The law of uadratic recirocity gives a beautiful descrition of which rimes are suares modulo Secial cases of this law going back to Fermat, and Euler and Legendre

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE

ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE J Jaan Statist Soc Vol 34 No 2004 9 26 ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE Yasunori Fujikoshi*, Tetsuto Himeno

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

TRACES OF SCHUR AND KRONECKER PRODUCTS FOR BLOCK MATRICES

TRACES OF SCHUR AND KRONECKER PRODUCTS FOR BLOCK MATRICES Khayyam J. Math. DOI:10.22034/kjm.2019.84207 TRACES OF SCHUR AND KRONECKER PRODUCTS FOR BLOCK MATRICES ISMAEL GARCÍA-BAYONA Communicated by A.M. Peralta Abstract. In this aer, we define two new Schur and

More information

Convex Optimization methods for Computing Channel Capacity

Convex Optimization methods for Computing Channel Capacity Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem

More information

Positivity, local smoothing and Harnack inequalities for very fast diffusion equations

Positivity, local smoothing and Harnack inequalities for very fast diffusion equations Positivity, local smoothing and Harnack inequalities for very fast diffusion equations Dedicated to Luis Caffarelli for his ucoming 60 th birthday Matteo Bonforte a, b and Juan Luis Vázquez a, c Abstract

More information

The Longest Run of Heads

The Longest Run of Heads The Longest Run of Heads Review by Amarioarei Alexandru This aer is a review of older and recent results concerning the distribution of the longest head run in a coin tossing sequence, roblem that arise

More information

The Fekete Szegő theorem with splitting conditions: Part I

The Fekete Szegő theorem with splitting conditions: Part I ACTA ARITHMETICA XCIII.2 (2000) The Fekete Szegő theorem with slitting conditions: Part I by Robert Rumely (Athens, GA) A classical theorem of Fekete and Szegő [4] says that if E is a comact set in the

More information

On split sample and randomized confidence intervals for binomial proportions

On split sample and randomized confidence intervals for binomial proportions On slit samle and randomized confidence intervals for binomial roortions Måns Thulin Deartment of Mathematics, Usala University arxiv:1402.6536v1 [stat.me] 26 Feb 2014 Abstract Slit samle methods have

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.

More information

Combinatorics of topmost discs of multi-peg Tower of Hanoi problem

Combinatorics of topmost discs of multi-peg Tower of Hanoi problem Combinatorics of tomost discs of multi-eg Tower of Hanoi roblem Sandi Klavžar Deartment of Mathematics, PEF, Unversity of Maribor Koroška cesta 160, 000 Maribor, Slovenia Uroš Milutinović Deartment of

More information

REFINED DIMENSIONS OF CUSP FORMS, AND EQUIDISTRIBUTION AND BIAS OF SIGNS

REFINED DIMENSIONS OF CUSP FORMS, AND EQUIDISTRIBUTION AND BIAS OF SIGNS REFINED DIMENSIONS OF CUSP FORMS, AND EQUIDISTRIBUTION AND BIAS OF SIGNS KIMBALL MARTIN Abstract. We refine nown dimension formulas for saces of cus forms of squarefree level, determining the dimension

More information

BEST CONSTANT IN POINCARÉ INEQUALITIES WITH TRACES: A FREE DISCONTINUITY APPROACH

BEST CONSTANT IN POINCARÉ INEQUALITIES WITH TRACES: A FREE DISCONTINUITY APPROACH BEST CONSTANT IN POINCARÉ INEQUALITIES WITH TRACES: A FREE DISCONTINUITY APPROACH DORIN BUCUR, ALESSANDRO GIACOMINI, AND PAOLA TREBESCHI Abstract For Ω R N oen bounded and with a Lischitz boundary, and

More information

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP Submitted to the Annals of Statistics arxiv: arxiv:1706.07237 CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP By Johannes Tewes, Dimitris N. Politis and Daniel J. Nordman Ruhr-Universität

More information

Asymptotically Optimal Simulation Allocation under Dependent Sampling

Asymptotically Optimal Simulation Allocation under Dependent Sampling Asymtotically Otimal Simulation Allocation under Deendent Samling Xiaoing Xiong The Robert H. Smith School of Business, University of Maryland, College Park, MD 20742-1815, USA, xiaoingx@yahoo.com Sandee

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

LORENZO BRANDOLESE AND MARIA E. SCHONBEK

LORENZO BRANDOLESE AND MARIA E. SCHONBEK LARGE TIME DECAY AND GROWTH FOR SOLUTIONS OF A VISCOUS BOUSSINESQ SYSTEM LORENZO BRANDOLESE AND MARIA E. SCHONBEK Abstract. In this aer we analyze the decay and the growth for large time of weak and strong

More information

Location of solutions for quasi-linear elliptic equations with general gradient dependence

Location of solutions for quasi-linear elliptic equations with general gradient dependence Electronic Journal of Qualitative Theory of Differential Equations 217, No. 87, 1 1; htts://doi.org/1.14232/ejqtde.217.1.87 www.math.u-szeged.hu/ejqtde/ Location of solutions for quasi-linear ellitic equations

More information

#A37 INTEGERS 15 (2015) NOTE ON A RESULT OF CHUNG ON WEIL TYPE SUMS

#A37 INTEGERS 15 (2015) NOTE ON A RESULT OF CHUNG ON WEIL TYPE SUMS #A37 INTEGERS 15 (2015) NOTE ON A RESULT OF CHUNG ON WEIL TYPE SUMS Norbert Hegyvári ELTE TTK, Eötvös University, Institute of Mathematics, Budaest, Hungary hegyvari@elte.hu François Hennecart Université

More information

CSE 599d - Quantum Computing When Quantum Computers Fall Apart

CSE 599d - Quantum Computing When Quantum Computers Fall Apart CSE 599d - Quantum Comuting When Quantum Comuters Fall Aart Dave Bacon Deartment of Comuter Science & Engineering, University of Washington In this lecture we are going to begin discussing what haens to

More information

THE SET CHROMATIC NUMBER OF RANDOM GRAPHS

THE SET CHROMATIC NUMBER OF RANDOM GRAPHS THE SET CHROMATIC NUMBER OF RANDOM GRAPHS ANDRZEJ DUDEK, DIETER MITSCHE, AND PAWE L PRA LAT Abstract. In this aer we study the set chromatic number of a random grah G(n, ) for a wide range of = (n). We

More information

Arithmetic and Metric Properties of p-adic Alternating Engel Series Expansions

Arithmetic and Metric Properties of p-adic Alternating Engel Series Expansions International Journal of Algebra, Vol 2, 2008, no 8, 383-393 Arithmetic and Metric Proerties of -Adic Alternating Engel Series Exansions Yue-Hua Liu and Lu-Ming Shen Science College of Hunan Agriculture

More information

Probability Estimates for Multi-class Classification by Pairwise Coupling

Probability Estimates for Multi-class Classification by Pairwise Coupling Probability Estimates for Multi-class Classification by Pairwise Couling Ting-Fan Wu Chih-Jen Lin Deartment of Comuter Science National Taiwan University Taiei 06, Taiwan Ruby C. Weng Deartment of Statistics

More information

CMSC 425: Lecture 4 Geometry and Geometric Programming

CMSC 425: Lecture 4 Geometry and Geometric Programming CMSC 425: Lecture 4 Geometry and Geometric Programming Geometry for Game Programming and Grahics: For the next few lectures, we will discuss some of the basic elements of geometry. There are many areas

More information

On the Interplay of Regularity and Decay in Case of Radial Functions I. Inhomogeneous spaces

On the Interplay of Regularity and Decay in Case of Radial Functions I. Inhomogeneous spaces On the Interlay of Regularity and Decay in Case of Radial Functions I. Inhomogeneous saces Winfried Sickel, Leszek Skrzyczak and Jan Vybiral July 29, 2010 Abstract We deal with decay and boundedness roerties

More information

Lecture 4: Law of Large Number and Central Limit Theorem

Lecture 4: Law of Large Number and Central Limit Theorem ECE 645: Estimation Theory Sring 2015 Instructor: Prof. Stanley H. Chan Lecture 4: Law of Large Number and Central Limit Theorem (LaTeX reared by Jing Li) March 31, 2015 This lecture note is based on ECE

More information

Intrinsic Approximation on Cantor-like Sets, a Problem of Mahler

Intrinsic Approximation on Cantor-like Sets, a Problem of Mahler Intrinsic Aroximation on Cantor-like Sets, a Problem of Mahler Ryan Broderick, Lior Fishman, Asaf Reich and Barak Weiss July 200 Abstract In 984, Kurt Mahler osed the following fundamental question: How

More information