Geometry on Probability Spaces

Steve Smale
Toyota Technological Institute at Chicago
1427 East 60th Street, Chicago, IL 60637, USA
E-mail: smale@math.berkeley.edu

Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong, CHINA
E-mail: mazhou@cityu.edu.hk

December 4, 2007. Preliminary Report

1 Introduction

Partial differential equations and the Laplacian operator on domains in Euclidean spaces have played a central role in understanding natural phenomena. However, this avenue has been limited in many areas where calculus is obstructed, as in singular spaces, and in spaces of functions on a space $X$ where $X$ itself is a function space. Examples of the latter occur in vision and quantum field theory. In vision it would be useful to do analysis on the space of images, where an image is a function on a patch. Moreover, in analysis and geometry the Lebesgue measure and its counterpart on manifolds are central. These measures are unavailable in the vision example, and in learning theory in general.

There is one situation where, in the last several decades, the problem has been studied with some success: when the underlying space is finite (or even discrete). The introduction of the graph Laplacian has been a major development in algorithm research
and is certainly useful for unsupervised learning theory.

The point of view taken here is to benefit from both the classical research and the newer graph-theoretic ideas to develop geometry on probability spaces. This starts with a space $X$ equipped with a kernel (like a Mercer kernel), which gives a topology and geometry; $X$ is to be equipped as well with a probability measure. The main focus is on a construction of a (normalized) Laplacian, an associated heat equation, a diffusion distance, etc. In this setting the point estimates of calculus are replaced by integral quantities. One thinks of secants rather than tangents. Our main result, Theorem 1 below, bounds the error of an empirical approximation to this Laplacian on $X$.

2 Kernels and Metric Spaces

Let $X$ be a set and $K : X \times X \to \mathbb{R}$ be a reproducing kernel, that is, $K$ is symmetric and for any finite set of points $\{x_1, \ldots, x_\ell\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite. The Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined to be the completion of the linear span of the set of functions $\{K_x = K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_y \rangle_K = K(x, y)$. See [1, 10, 11]. We assume that the feature map $F : X \to \mathcal{H}_K$ mapping $x \in X$ to $K_x \in \mathcal{H}_K$ is injective.

Definition 1. Define a metric $d = d_K$ induced by $K$ on $X$ as $d(x, y) = \|K_x - K_y\|_K$. We assume that $(X, d)$ is complete and separable (if it is not complete, we may complete it).

The feature map is an isometry. Observe that
$$K(x, y) = \langle K_x, K_y \rangle_K = \frac{1}{2}\left\{\langle K_x, K_x \rangle_K + \langle K_y, K_y \rangle_K - (d(x, y))^2\right\}.$$
Then $K$ is continuous on $X \times X$, and hence is a Mercer kernel. Let $\rho_X$ be a Borel probability measure on the metric space $X$.

Example 1. Let $X = \mathbb{R}^n$ and $K$ be the reproducing kernel $K(x, y) = \langle x, y \rangle_{\mathbb{R}^n}$. Then $d_K(x, y) = \|x - y\|_{\mathbb{R}^n}$ and $\mathcal{H}_K$ is the space of linear functions.

Example 2. Let $X$ be a closed subset of $\mathbb{R}^n$ and $K$ be the kernel given in the above example restricted to $X \times X$. Then $X$ is complete and separable, and $K$ is a Mercer kernel.
Moreover, one may take $\rho_X$ to be any Borel probability measure on $X$.
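Since $d_K$ depends only on kernel values, it can be computed without constructing $\mathcal{H}_K$ explicitly, using the identity displayed above. The following small numerical sketch (ours, not part of the paper) checks the claim of Examples 1 and 2 that the linear kernel induces the Euclidean distance:

```python
import numpy as np

def kernel_metric(K, x, y):
    """d_K(x, y) = ||K_x - K_y||_K, computed from kernel values via
    d^2 = K(x,x) + K(y,y) - 2 K(x,y)."""
    return np.sqrt(K(x, x) + K(y, y) - 2 * K(x, y))

# Linear kernel on R^n (Examples 1 and 2): d_K is the Euclidean distance.
lin = lambda x, y: float(np.dot(x, y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
print(kernel_metric(lin, x, y))   # equals ||x - y|| = 5.0
```

Any other Mercer kernel (e.g. a Gaussian) can be substituted for `lin` to obtain the corresponding metric on $X$.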
Remark. In Examples and 2, one can replace R n by a separable Hilbert space. Example 3. In Example 2, let x = {x i } m i= be a finite subset of X drawn from ρ X. Then the restriction K x is a Mercer kernel on x and it is natural to take ρ x to be the uniform measure on x. 3 Approximation of Integral Operators The integral operator L K : L 2 ρ X H K associated with the pair (K, ρ X ) is defined by L K (f)(x) := K(x, y)f(y)dρ X (y), x X. (3.) X It may be considered as a self-adjoint operator on L 2 ρ X or H K. See [0, ]. We shall use the same notion L K for these operators defined on different domains. Assume that x := {x i } m i= is a sample independently drawn according to ρ X. Recall the sampling operator S x : H K l 2 (x) associated with x defined [7, 8] by S x (f) = ( f(x i ) ) m i=. By the reproducing property of H K taking the form K x, f K = f(x) with x X, f H K, the adjoint of the sampling operator, S T x : l 2 (x) H K, is given by S T x c = m c i K xi, i= c l 2 (x). As the sample size m tends to infinity, the operator m ST x S x converges to the integral operator L K. We prove the following convergence rate. Proposition. Assume κ := sup x X K(x, x) <. (3.2) Let x be a sample independently drawn from (X, ρ X ). With confidence δ, we have m ST x S x L K 4κ2 log ( 2/δ ). (3.3) HK H K m 3
Proof. The proof follows [18]. We need a probability inequality [15, 18] for a random variable $\xi$ on $(X, \rho_X)$ with values in a Hilbert space $(H, \|\cdot\|)$. It says that if $\|\xi\| \le M < \infty$ almost surely, then with confidence $1 - \delta$,
$$\left\|\frac{1}{m}\sum_{i=1}^m \left[\xi(x_i) - E(\xi)\right]\right\| \le \frac{4M \log(2/\delta)}{\sqrt{m}}. \tag{3.4}$$
We apply this probability inequality to the separable Hilbert space $H$ of Hilbert-Schmidt operators on $\mathcal{H}_K$, denoted as $HS(\mathcal{H}_K)$. The inner product for $A_1, A_2 \in HS(\mathcal{H}_K)$ is defined as $\langle A_1, A_2 \rangle_{HS} = \sum_j \langle A_1 e_j, A_2 e_j \rangle_K$, where $\{e_j\}_j$ is an orthonormal basis of $\mathcal{H}_K$. The space $HS(\mathcal{H}_K)$ is the subspace of the space of bounded linear operators on $\mathcal{H}_K$ consisting of those $A$ with $\|A\|_{HS} < \infty$.

The random variable $\xi$ on $(X, \rho_X)$ in (3.4) is given by $\xi(x) = \langle \cdot, K_x \rangle_K K_x$, $x \in X$. The values of $\xi$ are rank-one operators in $HS(\mathcal{H}_K)$. For each $x \in X$, if we choose an orthonormal basis $\{e_j\}_j$ of $\mathcal{H}_K$ with $e_1 = K_x / \sqrt{K(x, x)}$, we see that
$$\|\xi(x)\|_{HS}^2 = \sum_j \|\xi(x) e_j\|_K^2 = (K(x, x))^2 \le \kappa^4.$$
Observe that $\frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}} = \frac{1}{m}\sum_{i=1}^m \langle \cdot, K_{x_i} \rangle_K K_{x_i} = \frac{1}{m}\sum_{i=1}^m \xi(x_i)$ and $E(\xi) = L_K$. Then the desired bound (3.3) follows from (3.4) with $M = \kappa^2$ and the norm relation
$$\|A\|_{\mathcal{H}_K \to \mathcal{H}_K} \le \|A\|_{HS}. \tag{3.5}$$
This proves the proposition.

Some analysis related to Proposition 1 can be found in [19, 14, 5, 12].

Definition 2. Denote $p = \int_X K_x \, d\rho_X \in \mathcal{H}_K$. We assume that $p$ is positive on $X$. The normalized Mercer kernel on $X$ associated with the pair $(K, \rho_X)$ is defined by
$$\widetilde{K}(x, y) = \frac{K(x, y)}{\sqrt{p(x)}\sqrt{p(y)}}, \qquad x, y \in X. \tag{3.6}$$
This construction is in [19], except that we have replaced their positive weighting function by a Mercer kernel. In the following, $\widetilde{\ }$ means expressions with respect to the normalized kernel $\widetilde K$, in particular $\widetilde K_{\mathbf{x}} := (\widetilde K(x_i, x_j))_{i,j=1}^m$.
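The convergence quantified in Proposition 1 can be observed numerically. The sketch below (our illustration, not from the paper) applies the empirical operator $\frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}}$ to one fixed function at one point and compares it with $L_K f$ computed by quadrature; the choices of $\rho_X$ uniform on $[0,1]$, a Gaussian kernel, and the test function are assumptions made only for this example.

```python
import numpy as np

# Monte Carlo convergence of (1/m) S_x^T S_x f toward L_K f, checked at
# a single point x0.  All concrete choices here are illustrative.
rng = np.random.default_rng(0)

K = lambda x, y: np.exp(-(x - y) ** 2)   # a Mercer kernel on [0, 1]
f = lambda y: np.sin(2 * np.pi * y)      # a fixed test function
x0 = 0.3                                 # evaluation point

# Reference value of L_K f(x0) = ∫_0^1 K(x0, y) f(y) dy by trapezoid rule.
grid = np.linspace(0.0, 1.0, 200001)
vals = K(x0, grid) * f(grid)
exact = (vals[:-1] + vals[1:]).sum() * (grid[1] - grid[0]) / 2

def empirical(m):
    """((1/m) S_x^T S_x f)(x0) = (1/m) Σ_i K(x0, x_i) f(x_i)."""
    sample = rng.uniform(0.0, 1.0, m)
    return np.mean(K(x0, sample) * f(sample))

err = {m: abs(empirical(m) - exact) for m in (100, 10000)}
print(err)
```

The error behaves like the $O(1/\sqrt m)$ rate of (3.3), up to the randomness of a single draw.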
Consider the RKHS $\mathcal{H}_{\widetilde K}$ and the integral operator $L_{\widetilde K} : \mathcal{H}_{\widetilde K} \to \mathcal{H}_{\widetilde K}$ associated with the normalized Mercer kernel $\widetilde K$. Denote the sampling operator $\mathcal{H}_{\widetilde K} \to \ell^2(\mathbf{x})$ associated with the pair $(\widetilde K, \mathbf{x})$ as $\widetilde S_{\mathbf{x}}$. If we denote $\mathcal{H}_{\widetilde K, \mathbf{x}} = \mathrm{span}\{\widetilde K_{x_i}\}_{i=1}^m$ in $\mathcal{H}_{\widetilde K}$ and its orthogonal complement as $\mathcal{H}_{\widetilde K, \mathbf{x}}^{\perp}$, then we see that $\widetilde S_{\mathbf{x}}(f) = 0$ for any $f \in \mathcal{H}_{\widetilde K, \mathbf{x}}^{\perp}$. Hence $\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}|_{\mathcal{H}_{\widetilde K, \mathbf{x}}^{\perp}} = 0$. Moreover, $\widetilde K_{\mathbf{x}}$ is the matrix representation of the operator $\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}|_{\mathcal{H}_{\widetilde K, \mathbf{x}}}$.

Proposition 1 yields the following bound, with $\widetilde K$ replacing $K$.

Theorem 1. Denote $p_0 = \min_{x \in X} p(x)$. For any $0 < \delta < 1$, with confidence $1 - \delta$,
$$\left\| L_{\widetilde K} - \frac{1}{m} \widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}} \right\|_{\mathcal{H}_{\widetilde K} \to \mathcal{H}_{\widetilde K}} \le \frac{4\kappa^2 \log(2/\delta)}{p_0 \sqrt{m}}. \tag{3.7}$$

Note that $\frac{1}{m}\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}$ is given from the data $\mathbf{x}$ by a linear algorithm [6]. So Theorem 1 is an error estimate for that algorithm.

The analysis in [19] deeply involves the constant $\min_{x, y \in X} K(x, y)$. In the case of the Gaussian kernel $K_\sigma(x, y) = \exp\{-\frac{\|x - y\|^2}{2\sigma^2}\}$, this constant decays exponentially fast as the variance $\sigma$ of the Gaussian becomes small, while in Theorem 1 the constant $p_0$ decays only polynomially fast, at least in the manifold setting [20].

4 Tame Heat Equation

Define a tame version of the Laplacian as $\Delta_{\widetilde K} = I - L_{\widetilde K}$. The tame heat equation takes the form
$$\frac{\partial u}{\partial t} = -\Delta_{\widetilde K}\, u. \tag{4.1}$$
One can see that the solution to (4.1) with the initial condition $u(0) = u_0$ is given by
$$u(t) = \exp\{-t \Delta_{\widetilde K}\}\, u_0. \tag{4.2}$$

5 Diffusion Distances

Let $1 = \lambda_1 \ge \lambda_2 \ge \ldots \ge 0$ be the eigenvalues of $L_{\widetilde K} : L^2_{\rho_X} \to L^2_{\rho_X}$ and $\{\varphi^{(i)}\}_i$ be corresponding normalized eigenfunctions, which form an orthonormal basis of $L^2_{\rho_X}$. The
Mercer Theorem asserts that
$$\widetilde K(x, y) = \sum_{i=1}^{\infty} \lambda_i \varphi^{(i)}(x) \varphi^{(i)}(y),$$
where the convergence holds in $L^2_{\rho_X}$ and uniformly.

Definition 3. For $t > 0$ we define a reproducing kernel on $X$ as
$$\widetilde K^t(x, y) = \sum_{i=1}^{\infty} \lambda_i^t\, \varphi^{(i)}(x) \varphi^{(i)}(y). \tag{5.1}$$
Then the tame diffusion distance $D_t$ on $X$ is defined as $d_{\widetilde K^t}$ by
$$D_t(x, y) = \|\widetilde K^t_x - \widetilde K^t_y\|_{\widetilde K^t} = \left\{\sum_{i=1}^{\infty} \lambda_i^t \left[\varphi^{(i)}(x) - \varphi^{(i)}(y)\right]^2\right\}^{1/2}. \tag{5.2}$$

The kernel $\widetilde K^t$ may be interpreted as a tame version of the heat kernel on a manifold. The functions $(\widetilde K^t)_x$ are in $\mathcal{H}_{\widetilde K}$ for $t \ge 1$, but only in $L^2_{\rho_X}$ as $t \to 0$. The quantity corresponding to $D_t$ for the classical heat kernel and other varieties has been developed by Coifman et al. [8, 9], where it is used in data analysis and more. Note that $\widetilde K^t$ is continuous for $t \ge 1$, in contrast to the classical kernel (Green's function), which is of course singular on the diagonal of $X \times X$.

The integral operator $L_{\widetilde K^t}$ associated with the reproducing kernel $\widetilde K^t$ is the $t$-th power of the positive operator $L_{\widetilde K}$.

Note that for $t = 1$, the distance $D_1$ is the one induced by the kernel $\widetilde K$. When $t$ becomes large, the effect of the eigenfunctions $\varphi^{(i)}$ with small eigenvalues $\lambda_i$ is little. In particular, when $\lambda_1$ has multiplicity $1$, since $\varphi^{(1)} = \sqrt{p}$ we have
$$\lim_{t \to \infty} D_t(x, y) = \left|\sqrt{p(x)} - \sqrt{p(y)}\right|.$$
In the same way, since the tame Laplacian $\Delta_{\widetilde K}$ has eigenpairs $(1 - \lambda_i, \varphi^{(i)})$, we know that the solution (4.2) to the tame heat equation (4.1) with the initial condition $u(0) = u_0 \in L^2_{\rho_X}$ satisfies
$$\lim_{t \to \infty} u(t) = \langle u_0, \sqrt{p} \rangle_{L^2_{\rho_X}} \sqrt{p}.$$
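The diffusion distance (5.2) is computable in the finite setting of Example 3, where $\rho_X$ is the uniform measure on a sample and the eigenpairs come from the normalized kernel matrix. The sketch below (our illustration; the sample, the Gaussian kernel, and its bandwidth are assumptions) computes $D_t$ from the eigendecomposition and checks that at $t = 1$ it reduces to the distance induced by $\widetilde K$ itself:

```python
import numpy as np

# Empirical diffusion distance: rho_X uniform on a sample of m points.
rng = np.random.default_rng(1)
m = 40
pts = rng.uniform(0, 1, m)
K = np.exp(-(pts[:, None] - pts[None, :]) ** 2 / 0.1)   # Gaussian Mercer kernel

p = K.mean(axis=1)                    # p(x_i) = (1/m) Σ_l K(x_i, x_l)
Ktil = K / np.sqrt(np.outer(p, p))    # normalized kernel matrix (tilde K)

lam, V = np.linalg.eigh(Ktil / m)     # eigenpairs of (1/m) * tilde K
phi = np.sqrt(m) * V                  # eigenfunctions, normalized in L^2(uniform)

def D(t, i, j):
    """Diffusion distance D_t(x_i, x_j) from (5.2)."""
    lam_t = np.clip(lam, 0, None) ** t    # clip tiny negative round-off
    return np.sqrt(np.sum(lam_t * (phi[i] - phi[j]) ** 2))

# At t = 1, Mercer's expansion gives D_1(x,y)^2 = Ktil(x,x)+Ktil(y,y)-2Ktil(x,y).
direct = np.sqrt(Ktil[0, 0] + Ktil[3, 3] - 2 * Ktil[0, 3])
print(D(1, 0, 3), direct)
```

Since all eigenvalues lie in $[0, 1]$, $D_t$ is nonincreasing in $t$, matching the remark that eigenfunctions with small $\lambda_i$ contribute little for large $t$.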
6 Graph Laplacian

Consider the reproducing kernel $K$ and the finite subset $\mathbf{x} = \{x_i\}_{i=1}^m$ of $X$.

Definition 4. Let $K_{\mathbf{x}}$ be the symmetric matrix $K_{\mathbf{x}} = (K(x_i, x_j))_{i,j=1}^m$ and $D_{\mathbf{x}}$ be the diagonal matrix with $(D_{\mathbf{x}})_{i,i} = \sum_{j=1}^m (K_{\mathbf{x}})_{i,j} = \sum_{j=1}^m K(x_i, x_j)$. Define a discrete Laplacian matrix $L = L_{\mathbf{x}} = D_{\mathbf{x}} - K_{\mathbf{x}}$. The normalized discrete Laplacian is defined by
$$\Delta_{\widetilde K, \mathbf{x}} = I - D_{\mathbf{x}}^{-1/2} K_{\mathbf{x}} D_{\mathbf{x}}^{-1/2} = I - \frac{1}{m}\widehat K_{\mathbf{x}}, \tag{6.1}$$
where $\widehat K_{\mathbf{x}}$ is the special case of the matrix $\widetilde K_{\mathbf{x}}$ when $\rho_X$ is the uniform distribution on the finite set $\mathbf{x}$, that is,
$$\widehat K_{\mathbf{x}} := \left(\frac{K(x_i, x_j)}{\sqrt{\frac{1}{m}\sum_{l=1}^m K(x_i, x_l)}\sqrt{\frac{1}{m}\sum_{l=1}^m K(x_j, x_l)}}\right)_{i,j=1}^m. \tag{6.2}$$
The above construction is the same as that for graph Laplacians [7]. See [2] for a discussion in learning theory. But the setting here is different: the entries $K(x_i, x_j)$ are induced by a reproducing kernel, so they might take negative values while satisfying a positive definiteness condition, and are thus different from the weights of a graph or the entries of an adjacency matrix.

In the following, $\widehat{\ }$ means expressions with respect to an implicit sample $\mathbf{x}$.

7 Approximation of Eigenfunctions

In this section we study the approximation of eigenfunctions from the bound for the approximation of linear operators. We need the following result, which follows from the general discussion on perturbation of eigenvalues and eigenvectors; see e.g. [4]. For completeness, we give a detailed proof for the approximation of eigenvectors.

Proposition 2. Let $A$ and $\widehat A$ be two compact positive self-adjoint operators on a Hilbert space $H$, with nonincreasing eigenvalue sequences $\{\lambda_j\}$ and $\{\widehat\lambda_j\}$, listed with multiplicity. Then there holds
$$\max_j |\lambda_j - \widehat\lambda_j| \le \|A - \widehat A\|. \tag{7.1}$$
Let $w_k$ be a normalized eigenvector of $A$ associated with the eigenvalue $\lambda_k$. If $r > 0$ satisfies
$$\lambda_{k-1} - \lambda_k \ge r, \qquad \lambda_k - \lambda_{k+1} \ge r, \qquad \|A - \widehat A\| \le \frac{r}{2}, \tag{7.2}$$
then
$$\|w_k - \widehat w_k\| \le \frac{4\|A - \widehat A\|}{r},$$
where $\widehat w_k$ is a normalized eigenvector of $\widehat A$ associated with the eigenvalue $\widehat\lambda_k$.

Proof. The eigenvalue estimate (7.1) follows from Weyl's Perturbation Theorem (see e.g. [4]).

Let $\{w_j\}_j$ be an orthonormal basis of $H$ consisting of eigenvectors of $A$ associated with the eigenvalues $\{\lambda_j\}$. Then $\widehat w_k = \sum_j \alpha_j w_j$, where $\alpha_j = \langle \widehat w_k, w_j \rangle$. Consider $\widehat A \widehat w_k - A \widehat w_k$. It can be expressed as
$$\widehat A \widehat w_k - A \widehat w_k = \widehat\lambda_k \widehat w_k - \sum_j \alpha_j A w_j = \sum_j \alpha_j \left(\widehat\lambda_k - \lambda_j\right) w_j.$$
When $j \ne k$, we see from (7.2) and (7.1) that
$$|\widehat\lambda_k - \lambda_j| \ge |\lambda_k - \lambda_j| - |\widehat\lambda_k - \lambda_k| \ge \min\{\lambda_{k-1} - \lambda_k,\ \lambda_k - \lambda_{k+1}\} - \|A - \widehat A\| \ge \frac{r}{2}.$$
Hence
$$\|\widehat A \widehat w_k - A \widehat w_k\|^2 = \sum_j \alpha_j^2 \left(\widehat\lambda_k - \lambda_j\right)^2 \ge \frac{r^2}{4} \sum_{j \ne k} \alpha_j^2.$$
But $\|\widehat A \widehat w_k - A \widehat w_k\| \le \|\widehat A - A\|$. It follows that
$$\left\{\sum_{j \ne k} \alpha_j^2\right\}^{1/2} \le \frac{2}{r}\|\widehat A - A\|.$$
From $\|\widehat w_k\|^2 = \sum_j \alpha_j^2 = 1$, we also have (choosing the sign of $w_k$ so that $\alpha_k \ge 0$)
$$\alpha_k = \left\{1 - \sum_{j \ne k} \alpha_j^2\right\}^{1/2} \ge 1 - \left\{\sum_{j \ne k} \alpha_j^2\right\}^{1/2}.$$
Therefore,
$$\|w_k - \widehat w_k\| = \left\|(1 - \alpha_k) w_k - \sum_{j \ne k} \alpha_j w_j\right\| \le (1 - \alpha_k) + \left\|\sum_{j \ne k} \alpha_j w_j\right\| \le 2\left\{\sum_{j \ne k} \alpha_j^2\right\}^{1/2} \le \frac{4}{r}\|\widehat A - A\|.$$
This proves the desired bound.

An immediate corollary of Proposition 2 and Theorem 1 concerns the approximation of eigenfunctions of $L_{\widetilde K}$ by those of $\frac{1}{m}\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}$.
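Both estimates of Proposition 2 are easy to check numerically in the finite-dimensional case. The sketch below (ours; the matrix sizes, perturbation scale, and random seed are arbitrary choices) verifies Weyl's bound and the eigenvector bound for the top eigenpair of a random symmetric positive semidefinite matrix under a small symmetric perturbation:

```python
import numpy as np

# Numerical check of Proposition 2: Weyl's bound max_j |λ_j - λ̂_j| <= ||A - Â||
# and the eigenvector bound ||w_k - ŵ_k|| <= (4/r) ||A - Â||, for k = 1 (largest).
rng = np.random.default_rng(2)
n = 8
B = rng.standard_normal((n, n))
A = B @ B.T / n                        # symmetric positive semidefinite
E = rng.standard_normal((n, n))
E = 1e-3 * (E + E.T) / 2               # small symmetric perturbation
Ahat = A + E

lam = np.linalg.eigvalsh(A)[::-1]      # nonincreasing eigenvalues
lamhat = np.linalg.eigvalsh(Ahat)[::-1]
gap = np.linalg.norm(A - Ahat, 2)      # operator norm ||A - Â||
print(np.max(np.abs(lam - lamhat)), gap)   # Weyl: first <= second

r = lam[0] - lam[1]                    # spectral gap at the top eigenvalue
w = np.linalg.eigh(A)[1][:, -1]        # eigenvector for the largest eigenvalue
what = np.linalg.eigh(Ahat)[1][:, -1]
if w @ what < 0:                       # fix the sign ambiguity of eigenvectors
    what = -what
print(np.linalg.norm(w - what), 4 * gap / r)   # first <= second
```

The $r/2$ condition in (7.2) is comfortably satisfied here because the perturbation is three orders of magnitude smaller than a typical spectral gap.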
Recall the eigenpairs $\{\lambda_i, \varphi^{(i)}\}_i$ of the compact self-adjoint positive operator $L_{\widetilde K}$ on $L^2_{\rho_X}$. Observe that $\|\varphi^{(k)}\|_{L^2_{\rho_X}} = 1$, but
$$\|\varphi^{(k)}\|_{\widetilde K}^2 = \frac{1}{\lambda_k}\langle L_{\widetilde K}\varphi^{(k)}, \varphi^{(k)}\rangle_{\widetilde K} = \frac{1}{\lambda_k}\|\varphi^{(k)}\|_{L^2_{\rho_X}}^2 = \frac{1}{\lambda_k}.$$
So $\sqrt{\lambda_k}\,\varphi^{(k)}$ is a normalized eigenfunction of $L_{\widetilde K}$ on $\mathcal{H}_{\widetilde K}$ associated with the eigenvalue $\lambda_k$. Also, $\frac{1}{m}\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}|_{\mathcal{H}_{\widetilde K, \mathbf{x}}^{\perp}} = 0$ and $\frac{1}{m}\widetilde K_{\mathbf{x}}$ is the matrix representation of $\frac{1}{m}\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}|_{\mathcal{H}_{\widetilde K, \mathbf{x}}}$.

Corollary 1. Let $k \in \{1, \ldots, m\}$ be such that $r := \min\{\lambda_{k-1} - \lambda_k,\ \lambda_k - \lambda_{k+1}\} > 0$. If $0 < \delta < 1$ and $m \in \mathbb{N}$ satisfy
$$\frac{4\kappa^2 \log(4/\delta)}{p_0\sqrt{m}} \le \frac{r}{2},$$
then with confidence $1 - \delta$, we have
$$\left\|\sqrt{\lambda_k}\,\varphi^{(k)} - \Phi^{(k)}\right\|_{\widetilde K} \le \frac{16\kappa^2 \log(4/\delta)}{r\, p_0 \sqrt{m}},$$
where $\Phi^{(k)}$ is a normalized eigenfunction of $\frac{1}{m}\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}$ associated with its $k$th largest eigenvalue $\lambda_{k,\mathbf{x}}$. Moreover, $\lambda_{k,\mathbf{x}}$ is the $k$th largest eigenvalue of the matrix $\frac{1}{m}\widetilde K_{\mathbf{x}}$, and with an associated normalized eigenvector $v$, we have $\Phi^{(k)} = \frac{1}{\sqrt{m\lambda_{k,\mathbf{x}}}}\,\widetilde S_{\mathbf{x}}^T v$.

8 Algorithmic Issue for Approximating Eigenfunctions

Corollary 1 provides a quantitative understanding of the approximation of eigenfunctions of $L_{\widetilde K}$ by those of the operator $\frac{1}{m}\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}$. The matrix representation $\frac{1}{m}\widetilde K_{\mathbf{x}}$ is given by the kernel $\widetilde K$, which involves $K$ and the function $p$ defined by $\rho_X$. Since $\rho_X$ is unknown algorithmically, we want to study the approximation of eigenfunctions by methods involving only $K$ and the sample $\mathbf{x}$. This is done here.

We shall apply Proposition 2 to the operator $A = L_{\widetilde K}$ and an operator $\widehat A$ defined by means of the matrix $\widehat K_{\mathbf{x}}$ given by $(K, \mathbf{x})$ in (6.2), and derive estimates for the approximation of eigenfunctions. Let $\lambda_{1,\mathbf{x}} \ge \lambda_{2,\mathbf{x}} \ge \ldots \ge \lambda_{m,\mathbf{x}} \ge 0$ be the eigenvalues of the matrix $\frac{1}{m}\widehat K_{\mathbf{x}}$.

Theorem 2. Let $k \in \{1, \ldots, m\}$ be such that $r := \min\{\lambda_{k-1} - \lambda_k,\ \lambda_k - \lambda_{k+1}\} > 0$. If $0 < \delta < 1$ and $m \in \mathbb{N}$ satisfy
$$\frac{4\kappa^2 \log(4/\delta)}{p_0\sqrt{m}} \le \frac{r}{8}, \tag{8.1}$$
then with confidence $1 - \delta$, we have
$$\left\|\sqrt{\lambda_k}\,\varphi^{(k)} - \frac{1}{\sqrt{m\lambda_k}}\,\widetilde S_{\mathbf{x}}^T v\right\|_{\widetilde K} \le \frac{72\kappa^2 \log(4/\delta)}{r\, p_0\sqrt{m}}, \tag{8.2}$$
where $v \in \ell^2(\mathbf{x})$ is a normalized eigenvector of the matrix $\frac{1}{m}\widehat K_{\mathbf{x}}$ with eigenvalue $\lambda_{k,\mathbf{x}}$.

Remark 2. Since $\Delta_{\widetilde K} = I - L_{\widetilde K}$ and $\Delta_{\widetilde K, \mathbf{x}} = I - D_{\mathbf{x}}^{-1/2} K_{\mathbf{x}} D_{\mathbf{x}}^{-1/2} = I - \frac{1}{m}\widehat K_{\mathbf{x}}$, Theorem 2 provides error bounds for the approximation of eigenfunctions of the tame Laplacian $\Delta_{\widetilde K}$ by those of the normalized discrete Laplacian $\Delta_{\widetilde K, \mathbf{x}}$. Note that $\widetilde S_{\mathbf{x}}^T v = \sum_{i=1}^m v_i \widetilde K_{x_i}$.

We need some preliminary estimates for the approximation of linear operators. Applying the probability inequality (3.4) to the random variable $\xi$ on $(X, \rho_X)$ with values in the Hilbert space $\mathcal{H}_K$ given by $\xi(x) = K_x$, we have

Lemma 1. Denote $\widehat p_{\mathbf{x}} = \frac{1}{m}\sum_{j=1}^m K_{x_j} \in \mathcal{H}_K$. With confidence $1 - \delta/2$, we have
$$\|\widehat p_{\mathbf{x}} - p\|_K \le \frac{4\kappa \log(4/\delta)}{\sqrt{m}}. \tag{8.3}$$

The error bound (8.3) implies
$$\max_{i=1,\ldots,m} |\widehat p_{\mathbf{x}}(x_i) - p(x_i)| \le \frac{4\kappa^2 \log(4/\delta)}{\sqrt{m}}. \tag{8.4}$$
This yields the following error bound between the matrices $\widehat K_{\mathbf{x}}$ and $\widetilde K_{\mathbf{x}}$.

Lemma 2. Let $\mathbf{x} \in X^m$. When (8.3) holds, we have
$$\left\|\frac{1}{m}\widehat K_{\mathbf{x}} - \frac{1}{m}\widetilde K_{\mathbf{x}}\right\| \le \left(2 + \frac{4\kappa^2 \log(4/\delta)}{p_0\sqrt{m}}\right) \frac{4\kappa^2 \log(4/\delta)}{p_0\sqrt{m}}. \tag{8.5}$$

Proof. Let $\Sigma = \mathrm{diag}\{\Sigma_i\}_{i=1}^m$ with $\Sigma_i := \sqrt{\widehat p_{\mathbf{x}}(x_i)}/\sqrt{p(x_i)}$. Then $\frac{1}{m}\widetilde K_{\mathbf{x}} = \Sigma\, \frac{1}{m}\widehat K_{\mathbf{x}}\, \Sigma$. Decompose
$$\frac{1}{m}\widetilde K_{\mathbf{x}} - \frac{1}{m}\widehat K_{\mathbf{x}} = \Sigma\, \frac{1}{m}\widehat K_{\mathbf{x}} (\Sigma - I) + (\Sigma - I)\, \frac{1}{m}\widehat K_{\mathbf{x}}.$$
We see that
$$\left\|\frac{1}{m}\widetilde K_{\mathbf{x}} - \frac{1}{m}\widehat K_{\mathbf{x}}\right\| \le \|\Sigma\| \left\|\frac{1}{m}\widehat K_{\mathbf{x}}\right\| \|\Sigma - I\| + \|\Sigma - I\| \left\|\frac{1}{m}\widehat K_{\mathbf{x}}\right\|.$$
Since $0 \preceq \frac{1}{m}\widehat K_{\mathbf{x}} \preceq I$, we have $\|\frac{1}{m}\widehat K_{\mathbf{x}}\| \le 1$. Also, for each $i$,
$$|\Sigma_i - 1| = \frac{\left|\sqrt{\widehat p_{\mathbf{x}}(x_i)} - \sqrt{p(x_i)}\right|}{\sqrt{p(x_i)}} \le \frac{|\widehat p_{\mathbf{x}}(x_i) - p(x_i)|}{p(x_i)}.$$
Hence
$$\|\Sigma - I\| \le \max_{i=1,\ldots,m} \frac{|\widehat p_{\mathbf{x}}(x_i) - p(x_i)|}{p(x_i)}.$$
This in connection with (8.4) proves the statement.
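The identity in Remark 2 connecting the normalized discrete Laplacian of Definition 4 with the empirically normalized kernel matrix of (6.2) is purely algebraic and easy to verify. A sketch (ours; the point cloud and Gaussian kernel are illustrative assumptions):

```python
import numpy as np

# Check of the identity I - D^{-1/2} K_x D^{-1/2} = I - (1/m) \hat K_x
# from Remark 2 / Definition 4, on a random point cloud.
rng = np.random.default_rng(3)
m = 30
pts = rng.standard_normal((m, 2))
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
Kx = np.exp(-sq / 2)                    # Gaussian Mercer kernel matrix K_x

d = Kx.sum(axis=1)                      # degrees (D_x)_{ii}
Lap1 = np.eye(m) - Kx / np.sqrt(np.outer(d, d))    # I - D^{-1/2} K_x D^{-1/2}

p_emp = d / m                           # empirical p(x_i) = (1/m) Σ_l K(x_i, x_l)
Khat = Kx / np.sqrt(np.outer(p_emp, p_emp))        # matrix \hat K_x of (6.2)
Lap2 = np.eye(m) - Khat / m                        # I - (1/m) \hat K_x

print(np.abs(Lap1 - Lap2).max())        # the two constructions agree
# D^{1/2} 1 is a null vector of the normalized Laplacian: the discrete
# analogue of the eigenfunction sqrt(p) with eigenvalue 0.
print(np.linalg.norm(Lap1 @ np.sqrt(d)))
```

The null vector $D_{\mathbf{x}}^{1/2}\mathbf{1}$ mirrors the role of $\varphi^{(1)} = \sqrt p$ for the tame Laplacian $\Delta_{\widetilde K}$.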
Now we can prove the estimates for the approximation of eigenfunctions.

Proof of Theorem 2. First, we define a linear operator which plays the role of $\widehat A$ in Proposition 2. This is an operator on $\mathcal{H}_{\widetilde K}$, denoted as $\widehat L_{\widetilde K, \mathbf{x}}$, such that $\widehat L_{\widetilde K, \mathbf{x}}(f) = 0$ for any $f \in \mathcal{H}_{\widetilde K, \mathbf{x}}^{\perp}$, while $\mathcal{H}_{\widetilde K, \mathbf{x}}$ is an invariant subspace of $\widehat L_{\widetilde K, \mathbf{x}}$ with matrix representation equal to the matrix $\frac{1}{m}\widehat K_{\mathbf{x}}$ given by (6.2). That is, on $\mathcal{H}_{\widetilde K, \mathbf{x}} \oplus \mathcal{H}_{\widetilde K, \mathbf{x}}^{\perp}$ we have
$$\widehat L_{\widetilde K, \mathbf{x}}\left[\widetilde K_{x_1}, \ldots, \widetilde K_{x_m}\right] = \left[\widetilde K_{x_1}, \ldots, \widetilde K_{x_m}\right] \frac{1}{m}\widehat K_{\mathbf{x}}, \qquad \widehat L_{\widetilde K, \mathbf{x}}|_{\mathcal{H}_{\widetilde K, \mathbf{x}}^{\perp}} = 0.$$
Second, we consider the difference between the operators $L_{\widetilde K}$ and $\widehat L_{\widetilde K, \mathbf{x}}$. By Theorem 1, there exists a subset $X_1$ of $X^m$ of measure at least $1 - \delta/2$ such that (3.7) holds true for $\mathbf{x} \in X_1$. By Lemmas 1 and 2, we know that there exists another subset $X_2$ of $X^m$ of measure at least $1 - \delta/2$ such that (8.3) and (8.5) hold for $\mathbf{x} \in X_2$. Recall that $\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}|_{\mathcal{H}_{\widetilde K, \mathbf{x}}^{\perp}} = 0$ and $\widetilde K_{\mathbf{x}}$ is the matrix representation of the operator $\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}}|_{\mathcal{H}_{\widetilde K, \mathbf{x}}}$. This in connection with (8.5) and the definition of $\widehat L_{\widetilde K, \mathbf{x}}$ tells us that for $\mathbf{x} \in X_2$,
$$\left\|\frac{1}{m}\widetilde S_{\mathbf{x}}^T \widetilde S_{\mathbf{x}} - \widehat L_{\widetilde K, \mathbf{x}}\right\| = \left\|\frac{1}{m}\widetilde K_{\mathbf{x}} - \frac{1}{m}\widehat K_{\mathbf{x}}\right\| \le \left(2 + \frac{4\kappa^2 \log(4/\delta)}{p_0\sqrt{m}}\right)\frac{4\kappa^2 \log(4/\delta)}{p_0\sqrt{m}}.$$
Together with (3.7), we have
$$\left\|L_{\widetilde K} - \widehat L_{\widetilde K, \mathbf{x}}\right\| \le \left(3 + \frac{4\kappa^2 \log(4/\delta)}{p_0\sqrt{m}}\right)\frac{4\kappa^2 \log(4/\delta)}{p_0\sqrt{m}}, \qquad \mathbf{x} \in X_1 \cap X_2. \tag{8.6}$$
Third, we apply Proposition 2 to the operators $A = L_{\widetilde K}$ and $\widehat A = \widehat L_{\widetilde K, \mathbf{x}}$ on the Hilbert space $\mathcal{H}_{\widetilde K}$. Let $\mathbf{x}$ be in the subset $X_1 \cap X_2$, which has measure at least $1 - \delta$. Since $r \le 1$, the estimate (8.6) together with the assumption (8.1) tells us that (7.2) holds. Then by Proposition 2, we have
$$\left\|\sqrt{\lambda_k}\,\varphi^{(k)} - \widehat\varphi^{(k)}\right\|_{\widetilde K} \le \frac{4}{r}\left\|L_{\widetilde K} - \widehat L_{\widetilde K, \mathbf{x}}\right\| \le \frac{50\kappa^2 \log(4/\delta)}{r\, p_0\sqrt{m}}, \tag{8.7}$$
where $\widehat\varphi^{(k)}$ is a normalized eigenfunction of $\widehat L_{\widetilde K, \mathbf{x}}$ associated with the eigenvalue $\lambda_{k,\mathbf{x}}$.

Finally, we specify $\widehat\varphi^{(k)}$. From the definition of the operator $\widehat L_{\widetilde K, \mathbf{x}}$, we see that for $u \in \ell^2(\mathbf{x})$, we have
$$\widehat L_{\widetilde K, \mathbf{x}}\left(\sum_{i=1}^m u_i \widetilde K_{x_i}\right) = \sum_{i=1}^m \left(\frac{1}{m}\widehat K_{\mathbf{x}} u\right)_i \widetilde K_{x_i}.$$
Since $\left\|\sum_{i=1}^m u_i \widetilde K_{x_i}\right\|_{\widetilde K}^2 = u^T \widetilde K_{\mathbf{x}} u$, we know that the normalized eigenfunction $\widehat\varphi^{(k)}$ of $\widehat L_{\widetilde K, \mathbf{x}}$ associated with the eigenvalue $\lambda_{k,\mathbf{x}}$ can be taken as
$$\widehat\varphi^{(k)} = \left(v^T \widetilde K_{\mathbf{x}} v\right)^{-1/2} \sum_{i=1}^m v_i \widetilde K_{x_i} = \left(v^T \widetilde K_{\mathbf{x}} v\right)^{-1/2} \widetilde S_{\mathbf{x}}^T v,$$
where $v \in \ell^2(\mathbf{x})$ is a normalized eigenvector of the matrix $\frac{1}{m}\widehat K_{\mathbf{x}}$ with eigenvalue $\lambda_{k,\mathbf{x}}$. Since (8.5) holds, we see that
$$\left|\frac{1}{m}v^T \widetilde K_{\mathbf{x}} v - \lambda_{k,\mathbf{x}}\right| = \left|v^T\left(\frac{1}{m}\widetilde K_{\mathbf{x}} - \frac{1}{m}\widehat K_{\mathbf{x}}\right)v\right| \le \frac{9\kappa^2 \log(4/\delta)}{p_0\sqrt{m}}.$$
Hence
$$\left|\frac{1}{m}v^T \widetilde K_{\mathbf{x}} v - \lambda_k\right| \le \left|\frac{1}{m}v^T \widetilde K_{\mathbf{x}} v - \lambda_{k,\mathbf{x}}\right| + |\lambda_{k,\mathbf{x}} - \lambda_k| \le \frac{22\kappa^2 \log(4/\delta)}{p_0\sqrt{m}}.$$
Since $\left\|\left(v^T \widetilde K_{\mathbf{x}} v\right)^{-1/2} \widetilde S_{\mathbf{x}}^T v\right\|_{\widetilde K} = 1$, we have
$$\left\|\frac{1}{\sqrt{m\lambda_k}}\,\widetilde S_{\mathbf{x}}^T v - \widehat\varphi^{(k)}\right\|_{\widetilde K} = \left|\frac{\sqrt{\frac{1}{m}v^T \widetilde K_{\mathbf{x}} v}}{\sqrt{\lambda_k}} - 1\right| \le \frac{\left|\frac{1}{m}v^T \widetilde K_{\mathbf{x}} v - \lambda_k\right|}{\lambda_k} \le \frac{22\kappa^2 \log(4/\delta)}{\lambda_k\, p_0\sqrt{m}}.$$
This in connection with (8.7) tells us that for $\mathbf{x} \in X_1 \cap X_2$ we have
$$\left\|\sqrt{\lambda_k}\,\varphi^{(k)} - \frac{1}{\sqrt{m\lambda_k}}\,\widetilde S_{\mathbf{x}}^T v\right\|_{\widetilde K} \le \frac{50\kappa^2 \log(4/\delta)}{r\, p_0\sqrt{m}} + \frac{22\kappa^2 \log(4/\delta)}{\lambda_k\, p_0\sqrt{m}}.$$
Since $r \le \lambda_k - \lambda_{k+1} \le \lambda_k$, this proves the stated error bound in Theorem 2.

References

[1] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950), 337-404.

[2] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (2003), 1373-1396.
[3] M. Belkin and P. Niyogi, Convergence of Laplacian eigenmaps, preprint, 2007.

[4] R. Bhatia, Matrix Analysis, Graduate Texts in Mathematics 169, Springer-Verlag, New York, 1997.

[5] G. Blanchard, O. Bousquet, and L. Zwald, Statistical properties of kernel principal component analysis, Mach. Learn. 66 (2007), 259-294.

[6] S. Bougleux, A. Elmoataz, and M. Melkemi, Discrete regularization on weighted graphs for image and mesh filtering, Lecture Notes in Computer Science 4485 (2007), pp. 128-139.

[7] F. R. K. Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics 92, AMS, Providence, RI, 1997.

[8] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps, Proc. Natl. Acad. Sci. USA 102 (2005), 7426-7431.

[9] R. Coifman and M. Maggioni, Diffusion wavelets, Appl. Comput. Harmonic Anal. 21 (2006), 53-94.

[10] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2002), 1-49.

[11] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, 2007.

[12] E. De Vito, A. Caponnetto, and L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5 (2005), 59-85.

[13] G. Gilboa and S. Osher, Nonlocal operators with applications to image processing, UCLA CAM Report 07-23, July 2007.

[14] V. Koltchinskii and E. Giné, Random matrix approximation of spectra of integral operators, Bernoulli 6 (2000), 113-167.

[15] I. Pinelis, Optimum bounds for the distributions of martingales in Banach spaces, Ann. Probab. 22 (1994), 1679-1706.
[16] S. Smale and D. X. Zhou, Shannon sampling and function reconstruction from point values, Bull. Amer. Math. Soc. 41 (2004), 279-305.

[17] S. Smale and D. X. Zhou, Shannon sampling II. Connections to learning theory, Appl. Comput. Harmonic Anal. 19 (2005), 285-302.

[18] S. Smale and D. X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007), 153-172.

[19] U. von Luxburg, M. Belkin, and O. Bousquet, Consistency of spectral clustering, Ann. Stat., to appear.

[20] G. B. Ye and D. X. Zhou, Learning and approximation by Gaussians on Riemannian manifolds, Adv. Comput. Math., in press. DOI 10.1007/s10444-007-9049-0

[21] D. Zhou and B. Schölkopf, Regularization on discrete spaces, in Pattern Recognition, Proc. 27th DAGM Symposium, Berlin, 2005, pp. 361-368.