
Geometry on Probability Spaces

Steve Smale
Toyota Technological Institute at Chicago
427 East 60th Street, Chicago, IL 60637, USA
E-mail: smale@math.berkeley.edu

Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong, CHINA
E-mail: mazhou@cityu.edu.hk

December 4, 2007
Preliminary Report

1 Introduction

Partial differential equations and the Laplacian operator on domains in Euclidean spaces have played a central role in understanding natural phenomena. However, this avenue has been limited in many areas where calculus is obstructed, as in singular spaces, and in spaces of functions on a space $X$ where $X$ itself is a function space. Examples of the latter occur in vision and in quantum field theory. In vision it would be useful to do analysis on the space of images, where an image is a function on a patch. Moreover, in analysis and geometry the Lebesgue measure and its counterpart on manifolds are central. These measures are unavailable in the vision example, and indeed in learning theory in general.

There is one situation where, in the last several decades, the problem has been studied with some success: when the underlying space is finite (or even discrete). The introduction of the graph Laplacian has been a major development in algorithm research

and is certainly useful for unsupervised learning theory.

The point of view taken here is to benefit from both the classical research and the newer graph-theoretic ideas in order to develop geometry on probability spaces. This starts with a space $X$ equipped with a kernel (like a Mercer kernel), which gives a topology and geometry; $X$ is to be equipped as well with a probability measure. The main focus is on the construction of a (normalized) Laplacian, an associated heat equation, a diffusion distance, etc. In this setting the point estimates of calculus are replaced by integral quantities. One thinks of secants rather than tangents. Our main result, Theorem 1 below, bounds the error of an empirical approximation to this Laplacian on $X$.

2 Kernels and Metric Spaces

Let $X$ be a set and $K : X \times X \to \mathbb{R}$ be a reproducing kernel, that is, $K$ is symmetric and for any finite set of points $\{x_1, \ldots, x_l\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{l}$ is positive semidefinite. The Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is defined to be the completion of the linear span of the set of functions $\{K_x = K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_y \rangle_K = K(x, y)$. See [1, 10, 11]. We assume that the feature map $F : X \to \mathcal{H}_K$ mapping $x \in X$ to $K_x \in \mathcal{H}_K$ is injective.

Definition 1. Define a metric $d = d_K$ induced by $K$ on $X$ by
$$ d(x, y) = \|K_x - K_y\|_K. $$
We assume that $(X, d)$ is complete and separable (if it is not complete, we may complete it). The feature map is then an isometry. Observe that
$$ K(x, y) = \langle K_x, K_y \rangle_K = \frac{1}{2}\Big\{ \langle K_x, K_x \rangle_K + \langle K_y, K_y \rangle_K - (d(x, y))^2 \Big\}. $$
Then $K$ is continuous on $X \times X$, and hence is a Mercer kernel. Let $\rho_X$ be a Borel probability measure on the metric space $X$.

Example 1. Let $X = \mathbb{R}^n$ and $K$ be the reproducing kernel $K(x, y) = \langle x, y \rangle_{\mathbb{R}^n}$. Then $d_K(x, y) = |x - y|_{\mathbb{R}^n}$ and $\mathcal{H}_K$ is the space of linear functions.

Example 2. Let $X$ be a closed subset of $\mathbb{R}^n$ and $K$ be the kernel of the above example restricted to $X \times X$. Then $X$ is complete and separable, and $K$ is a Mercer kernel. Moreover, one may take $\rho_X$ to be any Borel probability measure on $X$.
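As a small illustration added here (not part of the original text), the identity $d_K(x, y)^2 = K(x, x) + K(y, y) - 2K(x, y)$ from Definition 1 can be evaluated directly. The Python sketch below uses the linear kernel of Example 1 and, purely as an assumed comparison, a Gaussian kernel; it checks that the linear kernel reproduces the Euclidean distance.

    import numpy as np

    def kernel_metric(K, x, y):
        # d_K(x, y) = ||K_x - K_y||_K = sqrt(K(x,x) + K(y,y) - 2 K(x,y))
        return np.sqrt(max(K(x, x) + K(y, y) - 2.0 * K(x, y), 0.0))

    linear = lambda s, t: float(np.dot(s, t))                                      # kernel of Example 1
    gaussian = lambda s, t: float(np.exp(-np.sum((s - t) ** 2) / (2 * 0.5 ** 2)))  # assumed kernel, sigma = 0.5

    x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(kernel_metric(linear, x, y), np.linalg.norm(x - y))   # agree: d_K = |x - y| for the linear kernel
    print(kernel_metric(gaussian, x, y))                        # a bounded metric: d_K < sqrt(2)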

Remark 1. In Examples 1 and 2, one can replace $\mathbb{R}^n$ by a separable Hilbert space.

Example 3. In Example 2, let $\mathbf{x} = \{x_i\}_{i=1}^{m}$ be a finite subset of $X$ drawn from $\rho_X$. Then the restriction $K|_{\mathbf{x}}$ is a Mercer kernel on $\mathbf{x}$ and it is natural to take $\rho_{\mathbf{x}}$ to be the uniform measure on $\mathbf{x}$.

3 Approximation of Integral Operators

The integral operator $L_K : L^2_{\rho_X} \to \mathcal{H}_K$ associated with the pair $(K, \rho_X)$ is defined by
$$ L_K(f)(x) := \int_X K(x, y) f(y)\, d\rho_X(y), \qquad x \in X. \tag{3.1} $$
It may be considered as a self-adjoint operator on $L^2_{\rho_X}$ or on $\mathcal{H}_K$. See [10, 11]. We shall use the same notation $L_K$ for these operators defined on different domains.

Assume that $\mathbf{x} := \{x_i\}_{i=1}^{m}$ is a sample independently drawn according to $\rho_X$. Recall the sampling operator $S_{\mathbf{x}} : \mathcal{H}_K \to \ell^2(\mathbf{x})$ associated with $\mathbf{x}$, defined [17, 18] by $S_{\mathbf{x}}(f) = (f(x_i))_{i=1}^{m}$. By the reproducing property of $\mathcal{H}_K$, which takes the form $\langle K_x, f \rangle_K = f(x)$ for $x \in X$, $f \in \mathcal{H}_K$, the adjoint of the sampling operator, $S_{\mathbf{x}}^T : \ell^2(\mathbf{x}) \to \mathcal{H}_K$, is given by
$$ S_{\mathbf{x}}^T c = \sum_{i=1}^{m} c_i K_{x_i}, \qquad c \in \ell^2(\mathbf{x}). $$
As the sample size $m$ tends to infinity, the operator $\frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}}$ converges to the integral operator $L_K$. We prove the following convergence rate.

Proposition 1. Assume
$$ \kappa := \sup_{x \in X} \sqrt{K(x, x)} < \infty. \tag{3.2} $$
Let $\mathbf{x}$ be a sample independently drawn from $(X, \rho_X)$. With confidence $1 - \delta$, we have
$$ \Big\| \frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}} - L_K \Big\|_{\mathcal{H}_K \to \mathcal{H}_K} \le \frac{4 \kappa^2 \log(2/\delta)}{\sqrt{m}}. \tag{3.3} $$
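Before turning to the proof, here is a small numerical sketch (added for illustration, not from the paper) of the operators involved: for $f \in \mathcal{H}_K$, $(\frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}} f)(t) = \frac{1}{m} \sum_{i=1}^m f(x_i) K(t, x_i)$ is a Monte Carlo estimate of $(L_K f)(t) = \int_X K(t, y) f(y)\, d\rho_X(y)$. The Gaussian kernel, $\rho_X =$ Uniform$[0, 1]$ and the test function are assumptions made only for this illustration.

    import numpy as np

    K = lambda s, t: np.exp(-(s - t) ** 2 / (2 * 0.3 ** 2))    # assumed Gaussian kernel
    f = lambda y: K(y, 0.3) - 0.5 * K(y, 0.8)                  # f in H_K: a finite combination of kernel sections

    y_grid = np.linspace(0.0, 1.0, 200001)                     # rho_X = Uniform[0,1]; dense grid for the integral
    L_K_f = lambda t: np.mean(K(t, y_grid) * f(y_grid))        # (L_K f)(t) = int_0^1 K(t,y) f(y) dy

    emp_f = lambda t, x: np.mean(f(x) * K(t, x))               # ((1/m) S_x^T S_x f)(t) = (1/m) sum_i f(x_i) K(t, x_i)

    rng = np.random.default_rng(0)
    t = 0.4
    for m in (100, 1000, 10000):
        x = rng.uniform(0.0, 1.0, size=m)
        print(m, abs(emp_f(t, x) - L_K_f(t)))                  # the error decays roughly like 1/sqrt(m)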

Proof. The proof follows [18]. We need a probability inequality [15, 18] for a random variable $\xi$ on $(X, \rho_X)$ with values in a Hilbert space $(H, \|\cdot\|)$. It says that if $\|\xi\| \le \tilde{M} < \infty$ almost surely, then with confidence $1 - \delta$,
$$ \Big\| \frac{1}{m} \sum_{i=1}^{m} \big[ \xi(x_i) - E(\xi) \big] \Big\| \le \frac{4 \tilde{M} \log(2/\delta)}{\sqrt{m}}. \tag{3.4} $$

We apply this probability inequality to the separable Hilbert space $H$ of Hilbert-Schmidt operators on $\mathcal{H}_K$, denoted $HS(\mathcal{H}_K)$. The inner product of $A_1, A_2 \in HS(\mathcal{H}_K)$ is defined as $\langle A_1, A_2 \rangle_{HS} = \sum_j \langle A_1 e_j, A_2 e_j \rangle_K$, where $\{e_j\}_j$ is an orthonormal basis of $\mathcal{H}_K$. The space $HS(\mathcal{H}_K)$ is the subspace of the space of bounded linear operators on $\mathcal{H}_K$ consisting of those $A$ with $\|A\|_{HS} < \infty$.

The random variable $\xi$ on $(X, \rho_X)$ in (3.4) is given by $\xi(x) = \langle \cdot, K_x \rangle_K K_x$, $x \in X$. The values of $\xi$ are rank-one operators in $HS(\mathcal{H}_K)$. For each $x \in X$, if we choose an orthonormal basis $\{e_j\}_j$ of $\mathcal{H}_K$ with $e_1 = K_x / \sqrt{K(x, x)}$, we see that
$$ \|\xi(x)\|_{HS}^2 = \sum_j \|\xi(x) e_j\|_K^2 = (K(x, x))^2 \le \kappa^4. $$
Observe that $\frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}} = \frac{1}{m} \sum_{i=1}^{m} \langle \cdot, K_{x_i} \rangle_K K_{x_i} = \frac{1}{m} \sum_{i=1}^{m} \xi(x_i)$ and $E(\xi) = L_K$. Then the desired bound (3.3) follows from (3.4) with $\tilde{M} = \kappa^2$ and the norm relation
$$ \|A\|_{\mathcal{H}_K \to \mathcal{H}_K} \le \|A\|_{HS}. \tag{3.5} $$
This proves the proposition.

Some analysis related to Proposition 1 can be found in [19, 14, 5, 12].

Definition 2. Denote $p = \int_X K_x\, d\rho_X \in \mathcal{H}_K$. We assume that $p$ is positive on $X$. The normalized Mercer kernel on $X$ associated with the pair $(K, \rho_X)$ is defined by
$$ \tilde{K}(x, y) = \frac{K(x, y)}{\sqrt{p(x)}\, \sqrt{p(y)}}, \qquad x, y \in X. \tag{3.6} $$
This construction is in [19], except that we have replaced their positive weighting function by a Mercer kernel. In the following, $\tilde{\ }$ marks expressions with respect to the normalized kernel $\tilde{K}$; in particular $\tilde{K}_{\mathbf{x}} := (\tilde{K}(x_i, x_j))_{i,j=1}^{m}$.
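As an added illustration (not from the paper), the normalized kernel (3.6) can be computed numerically once $p(x) = \int_X K(x, y)\, d\rho_X(y)$ is available. Here $\rho_X$ is taken to be the uniform measure on $[0, 1]$ and the Gaussian kernel is an assumed choice.

    import numpy as np

    K = lambda s, t: np.exp(-np.subtract.outer(s, t) ** 2 / (2 * 0.3 ** 2))   # assumed Gaussian kernel

    y_grid = np.linspace(0.0, 1.0, 5000)                  # rho_X = Uniform[0,1]
    p = lambda s: K(s, y_grid).mean(axis=-1)              # p(s) = int_0^1 K(s, y) dy (Riemann sum)

    def K_tilde(s, t):
        # normalized Mercer kernel (3.6): K(s,t) / (sqrt(p(s)) sqrt(p(t)))
        return K(s, t) / np.sqrt(np.outer(p(s), p(t)))

    s = np.linspace(0.0, 1.0, 5)
    print(np.round(K_tilde(s, s), 3))                     # the matrix (K~(s_i, s_j)): symmetric, positive semidefinite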

Consider the RKHS $\mathcal{H}_{\tilde{K}}$ and the integral operator $L_{\tilde{K}} : \mathcal{H}_{\tilde{K}} \to \mathcal{H}_{\tilde{K}}$ associated with the normalized Mercer kernel $\tilde{K}$. Denote the sampling operator $\mathcal{H}_{\tilde{K}} \to \ell^2(\mathbf{x})$ associated with the pair $(\tilde{K}, \mathbf{x})$ by $\tilde{S}_{\mathbf{x}}$. If we denote $\mathcal{H}_{\tilde{K}, \mathbf{x}} = \mathrm{span}\{\tilde{K}_{x_i}\}_{i=1}^{m}$ in $\mathcal{H}_{\tilde{K}}$ and its orthogonal complement by $\mathcal{H}_{\tilde{K}, \mathbf{x}}^{\perp}$, then we see that $\tilde{S}_{\mathbf{x}}(f) = 0$ for any $f \in \mathcal{H}_{\tilde{K}, \mathbf{x}}^{\perp}$. Hence $\tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}|_{\mathcal{H}_{\tilde{K}, \mathbf{x}}^{\perp}} = 0$. Moreover, $\tilde{K}_{\mathbf{x}}$ is the matrix representation of the operator $\tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}|_{\mathcal{H}_{\tilde{K}, \mathbf{x}}}$. Proposition 1, with $\tilde{K}$ replacing $K$, yields the following bound.

Theorem 1. Denote $p_0 = \min_{x \in X} p(x)$. For any $0 < \delta < 1$, with confidence $1 - \delta$,
$$ \Big\| L_{\tilde{K}} - \frac{1}{m} \tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}} \Big\|_{\mathcal{H}_{\tilde{K}} \to \mathcal{H}_{\tilde{K}}} \le \frac{4 \kappa^2 \log(2/\delta)}{p_0 \sqrt{m}}. \tag{3.7} $$

Note that $\frac{1}{m} \tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}$ is given from the data $\mathbf{x}$ by a linear algorithm [16]. So Theorem 1 is an error estimate for that algorithm.

The analysis in [19] deeply involves the constant $\min_{x, y \in X} K(x, y)$. In the case of the Gaussian kernel $K_\sigma(x, y) = \exp\{-\frac{|x - y|^2}{2\sigma^2}\}$, this constant decays exponentially fast as the variance $\sigma$ of the Gaussian becomes small, while in Theorem 1 the constant $p_0$ decays only polynomially fast, at least in the manifold setting [20].

4 Tame Heat Equation

Define a tame version of the Laplacian as $\Delta_{\tilde{K}} = I - L_{\tilde{K}}$. The tame heat equation takes the form
$$ \frac{du}{dt} = -\Delta_{\tilde{K}}\, u. \tag{4.1} $$
One can see that the solution to (4.1) with the initial condition $u(0) = u_0$ is given by
$$ u(t) = \exp\{-t \Delta_{\tilde{K}}\}\, u_0. \tag{4.2} $$
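To visualize the tame heat equation (4.1)-(4.2), the following sketch (added here; the Gaussian kernel, the uniform sample and the initial condition are illustrative assumptions) uses an empirical normalized kernel matrix as a finite-dimensional stand-in for $L_{\tilde{K}}$ and evaluates $u(t) = \exp\{-t\Delta\} u_0$ through a spectral decomposition.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 1.0, 200))
    Kx = np.exp(-np.subtract.outer(x, x) ** 2 / (2 * 0.1 ** 2))   # assumed Gaussian kernel matrix
    p_hat = Kx.mean(axis=1)
    L = Kx / np.sqrt(np.outer(p_hat, p_hat)) / len(x)             # empirical stand-in for L_K~; eigenvalues in [0,1] here

    lam, U = np.linalg.eigh(L)
    u0 = np.sin(4 * np.pi * x)                                    # initial condition u(0) = u_0

    def u(t):
        # u(t) = exp(-t * Delta) u_0 with Delta = I - L, i.e. sum_i e^{-t(1 - lam_i)} <u_0, v_i> v_i
        return U @ (np.exp(-t * (1.0 - lam)) * (U.T @ u0))

    for t in (0.0, 1.0, 10.0, 100.0):
        print(t, np.round(np.linalg.norm(u(t)), 4))               # components outside the top eigenspace decay as t grows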

5 Diffusion Distances

Let $1 = \tilde{\lambda}_1 \ge \tilde{\lambda}_2 \ge \cdots \ge 0$ be the eigenvalues of $L_{\tilde{K}} : L^2_{\rho_X} \to L^2_{\rho_X}$ and $\{\tilde{\varphi}^{(i)}\}_i$ be corresponding normalized eigenfunctions, which form an orthonormal basis of $L^2_{\rho_X}$. The Mercer Theorem asserts that
$$ \tilde{K}(x, y) = \sum_{i=1}^{\infty} \tilde{\lambda}_i\, \tilde{\varphi}^{(i)}(x)\, \tilde{\varphi}^{(i)}(y), $$
where the convergence holds in $L^2_{\rho_X}$ and uniformly.

Definition 3. For $t > 0$ we define a reproducing kernel on $X$ by
$$ \tilde{K}^t(x, y) = \sum_{i=1}^{\infty} \tilde{\lambda}_i^t\, \tilde{\varphi}^{(i)}(x)\, \tilde{\varphi}^{(i)}(y). \tag{5.1} $$
Then the tame diffusion distance $\tilde{D}_t$ on $X$ is defined as $d_{\tilde{K}^t}$, that is,
$$ \tilde{D}_t(x, y) = \|\tilde{K}^t_x - \tilde{K}^t_y\|_{\tilde{K}^t} = \Big\{ \sum_{i=1}^{\infty} \tilde{\lambda}_i^t \big[ \tilde{\varphi}^{(i)}(x) - \tilde{\varphi}^{(i)}(y) \big]^2 \Big\}^{1/2}. \tag{5.2} $$

The kernel $\tilde{K}^t$ may be interpreted as a tame version of the heat kernel on a manifold. The functions $(\tilde{K}^t)_x$ are in $\mathcal{H}_{\tilde{K}}$ for $t \ge 1$, but in general only in $L^2_{\rho_X}$ for smaller $t > 0$. The quantity corresponding to $\tilde{D}_t$ for the classical heat kernel and other varieties has been developed by Coifman et al. [8, 9], where it is used in data analysis and more. Note that $\tilde{K}^t$ is continuous for $t \ge 1$, in contrast to the classical heat kernel (Green's function), which is of course singular on the diagonal of $X \times X$. The integral operator $L_{\tilde{K}^t}$ associated with the reproducing kernel $\tilde{K}^t$ is the $t$-th power of the positive operator $L_{\tilde{K}}$.

Note that for $t = 1$, the distance $\tilde{D}_1$ is the one induced by the kernel $\tilde{K}$. When $t$ becomes large, the effect of the eigenfunctions $\tilde{\varphi}^{(i)}$ with small eigenvalues $\tilde{\lambda}_i$ is small. In particular, when $\tilde{\lambda}_1 = 1$ has multiplicity $1$, since $\tilde{\varphi}^{(1)} = \sqrt{p}$ we have
$$ \lim_{t \to \infty} \tilde{D}_t(x, y) = \big| \sqrt{p(x)} - \sqrt{p(y)} \big|. $$
In the same way, since the tame Laplacian $\Delta_{\tilde{K}}$ has eigenpairs $(1 - \tilde{\lambda}_i, \tilde{\varphi}^{(i)})$, we know that the solution (4.2) of the tame heat equation (4.1) with initial condition $u(0) = u_0 \in L^2_{\rho_X}$ satisfies
$$ \lim_{t \to \infty} u(t) = \langle u_0, \sqrt{p} \rangle_{L^2_{\rho_X}}\, \sqrt{p}. $$
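The diffusion distance (5.2) is easy to compute from an eigendecomposition. The sketch below (an added illustration; the kernel, the sample and the empirical normalization are assumptions) uses eigenvectors of a normalized kernel matrix in place of the eigenfunctions $\tilde{\varphi}^{(i)}$.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0.0, 1.0, 300))
    Kx = np.exp(-np.subtract.outer(x, x) ** 2 / (2 * 0.1 ** 2))   # assumed Gaussian kernel
    p_hat = Kx.mean(axis=1)
    Lmat = Kx / np.sqrt(np.outer(p_hat, p_hat)) / len(x)          # empirical stand-in for L_K~

    lam, U = np.linalg.eigh(Lmat)
    lam = np.clip(lam, 0.0, None)                                 # guard against tiny negative round-off

    def D_t(i, j, t):
        # D_t(x_i, x_j) = { sum_k lam_k^t (phi_k(x_i) - phi_k(x_j))^2 }^{1/2}, cf. (5.2)
        diff = U[i] - U[j]
        return np.sqrt(np.sum(lam ** t * diff ** 2))

    i, j = 0, 150
    for t in (1, 2, 8, 32):
        print(t, np.round(D_t(i, j, t), 4))                       # contributions of small eigenvalues fade as t grows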

6 Graph Laplacian

Consider the reproducing kernel $K$ and the finite subset $\mathbf{x} = \{x_i\}_{i=1}^{m}$ of $X$.

Definition 4. Let $K_{\mathbf{x}}$ be the symmetric matrix $K_{\mathbf{x}} = (K(x_i, x_j))_{i,j=1}^{m}$ and $D_{\mathbf{x}}$ be the diagonal matrix with $(D_{\mathbf{x}})_{i,i} = \sum_{j=1}^{m} (K_{\mathbf{x}})_{i,j} = \sum_{j=1}^{m} K(x_i, x_j)$. Define a discrete Laplacian matrix $L = L_{\mathbf{x}} = D_{\mathbf{x}} - K_{\mathbf{x}}$. The normalized discrete Laplacian is defined by
$$ \Delta_{\tilde{K}, \mathbf{x}} = I - D_{\mathbf{x}}^{-1/2} K_{\mathbf{x}} D_{\mathbf{x}}^{-1/2} = I - \frac{1}{m} \hat{K}_{\mathbf{x}}, \tag{6.1} $$
where $\hat{K}_{\mathbf{x}}$ is the special case of the matrix $\tilde{K}_{\mathbf{x}}$ obtained when $\rho_X$ is the uniform distribution on the finite set $\mathbf{x}$, that is,
$$ \hat{K}_{\mathbf{x}} := \left( \frac{K(x_i, x_j)}{\sqrt{\frac{1}{m} \sum_{l=1}^{m} K(x_i, x_l)}\, \sqrt{\frac{1}{m} \sum_{l=1}^{m} K(x_j, x_l)}} \right)_{i,j=1}^{m}. \tag{6.2} $$

The above construction is the same as that for graph Laplacians [7]. See [2] for a discussion in learning theory. But the setting here is different: the entries $K(x_i, x_j)$ are induced by a reproducing kernel, they might take negative values while satisfying a positive definiteness condition, and they are different from the weights of a graph or the entries of an adjacency matrix.

In the following, $\hat{\ }$ marks expressions with respect to the (implicit) sample $\mathbf{x}$.
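The two expressions in (6.1) can be checked numerically. In the sketch below (added here; the Gaussian kernel and the random sample are assumptions) the normalized discrete Laplacian built from $D_{\mathbf{x}}$ and $K_{\mathbf{x}}$ coincides with $I - \frac{1}{m}\hat{K}_{\mathbf{x}}$, where $\hat{K}_{\mathbf{x}}$ is the matrix of (6.2).

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_normal((50, 2))
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    Kx = np.exp(-sq / 2.0)                                    # assumed Gaussian kernel matrix (K(x_i, x_j))

    D = np.diag(Kx.sum(axis=1))                               # (D_x)_{ii} = sum_j K(x_i, x_j)
    L = D - Kx                                                # discrete Laplacian of Definition 4

    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    Delta_a = np.eye(len(x)) - D_inv_sqrt @ Kx @ D_inv_sqrt   # first form in (6.1)

    p_hat = Kx.mean(axis=1)                                   # p for the uniform measure on the sample
    K_hat = Kx / np.sqrt(np.outer(p_hat, p_hat))              # the matrix of (6.2)
    Delta_b = np.eye(len(x)) - K_hat / len(x)                 # second form in (6.1)

    print(np.allclose(Delta_a, Delta_b))                      # True: the two forms agree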

7 Approximation of Eigenfunctions

In this section we study the approximation of eigenfunctions from the bound for the approximation of linear operators. We need the following result, which follows from the general discussion of perturbation of eigenvalues and eigenvectors; see e.g. [4]. For completeness, we give a detailed proof for the approximation of eigenvectors.

Proposition 2. Let $A$ and $\hat{A}$ be two compact positive self-adjoint operators on a Hilbert space $H$, with nonincreasing eigenvalues $\{\lambda_j\}$ and $\{\hat{\lambda}_j\}$ counted with multiplicity. Then
$$ \max_j |\lambda_j - \hat{\lambda}_j| \le \|A - \hat{A}\|. \tag{7.1} $$
Let $w_k$ be a normalized eigenvector of $A$ associated with the eigenvalue $\lambda_k$. If $r > 0$ satisfies
$$ \lambda_{k-1} - \lambda_k \ge r, \qquad \lambda_k - \lambda_{k+1} \ge r, \qquad \|A - \hat{A}\| \le \frac{r}{2}, \tag{7.2} $$
then
$$ \|w_k - \hat{w}_k\| \le \frac{4}{r} \|A - \hat{A}\|, $$
where $\hat{w}_k$ is a normalized eigenvector of $\hat{A}$ associated with the eigenvalue $\hat{\lambda}_k$.

Proof. The eigenvalue estimate (7.1) follows from Weyl's Perturbation Theorem (see e.g. [4]).

Let $\{w_j\}_j$ be an orthonormal basis of $H$ consisting of eigenvectors of $A$ associated with the eigenvalues $\{\lambda_j\}$. Then $\hat{w}_k = \sum_j \alpha_j w_j$ where $\alpha_j = \langle \hat{w}_k, w_j \rangle$; we may choose the sign of $\hat{w}_k$ so that $\alpha_k \ge 0$. Consider $\hat{A} \hat{w}_k - A \hat{w}_k$. It can be expressed as
$$ \hat{A} \hat{w}_k - A \hat{w}_k = \hat{\lambda}_k \hat{w}_k - \sum_j \alpha_j A w_j = \sum_j \alpha_j \big( \hat{\lambda}_k - \lambda_j \big) w_j. $$
When $j \ne k$, we see from (7.2) and (7.1) that
$$ |\hat{\lambda}_k - \lambda_j| \ge |\lambda_k - \lambda_j| - |\lambda_k - \hat{\lambda}_k| \ge \min\{\lambda_{k-1} - \lambda_k,\ \lambda_k - \lambda_{k+1}\} - \|A - \hat{A}\| \ge \frac{r}{2}. $$
Hence
$$ \|\hat{A} \hat{w}_k - A \hat{w}_k\|^2 = \sum_j \alpha_j^2 \big( \hat{\lambda}_k - \lambda_j \big)^2 \ge \sum_{j \ne k} \alpha_j^2 \big( \hat{\lambda}_k - \lambda_j \big)^2 \ge \frac{r^2}{4} \sum_{j \ne k} \alpha_j^2. $$
But $\|\hat{A} \hat{w}_k - A \hat{w}_k\| \le \|\hat{A} - A\|$. It follows that
$$ \Big\{ \sum_{j \ne k} \alpha_j^2 \Big\}^{1/2} \le \frac{2}{r} \|\hat{A} - A\|. $$
From $\|\hat{w}_k\|^2 = \sum_j \alpha_j^2 = 1$, we also have
$$ 1 - \alpha_k \le 1 - \alpha_k^2 = \sum_{j \ne k} \alpha_j^2 \le \Big\{ \sum_{j \ne k} \alpha_j^2 \Big\}^{1/2}. $$
Therefore,
$$ w_k - \hat{w}_k = (1 - \alpha_k) w_k + \alpha_k w_k - \hat{w}_k = (1 - \alpha_k) w_k - \sum_{j \ne k} \alpha_j w_j $$
and
$$ \|w_k - \hat{w}_k\| \le (1 - \alpha_k) + \Big\| \sum_{j \ne k} \alpha_j w_j \Big\| \le 2 \Big\{ \sum_{j \ne k} \alpha_j^2 \Big\}^{1/2} \le \frac{4}{r} \|\hat{A} - A\|. $$
This proves the desired bound.
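The bounds of Proposition 2 can be tested numerically on finite-dimensional positive self-adjoint operators. This is an added illustration; the random matrices, the size of the perturbation and the chosen index are assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 40
    B = rng.standard_normal((n, n))
    A = B @ B.T / n                                            # a positive self-adjoint operator (finite dimensional)
    E = rng.standard_normal((n, n)); E = (E + E.T) / 2
    A_hat = A + 1e-3 * E / np.linalg.norm(E, 2)                # a small self-adjoint perturbation

    lam, W = np.linalg.eigh(A)                                 # eigh returns eigenvalues in increasing order
    lam_hat, W_hat = np.linalg.eigh(A_hat)
    k = n - 3                                                  # an index with a visible gap to both neighbours
    r = min(lam[k + 1] - lam[k], lam[k] - lam[k - 1])          # spectral gap around lambda_k
    dist = np.linalg.norm(A - A_hat, 2)

    w, w_hat = W[:, k], W_hat[:, k]
    if np.dot(w, w_hat) < 0:                                   # resolve the sign ambiguity of eigenvectors
        w_hat = -w_hat

    print(np.max(np.abs(lam - lam_hat)) <= dist)               # (7.1), Weyl's bound
    if dist <= r / 2:                                          # hypothesis (7.2)
        print(np.linalg.norm(w - w_hat) <= 4 * dist / r)       # eigenvector bound of Proposition 2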

An immediate corollary of Proposition 2 and Theorem 1 concerns the approximation of eigenfunctions of $L_{\tilde{K}}$ by those of $\frac{1}{m} \tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}$.

Recall the eigenpairs $\{\tilde{\lambda}_i, \tilde{\varphi}^{(i)}\}_i$ of the compact self-adjoint positive operator $L_{\tilde{K}}$ on $L^2_{\rho_X}$. Observe that $\|\tilde{\varphi}^{(k)}\|_{L^2_{\rho_X}} = 1$, but
$$ \|\tilde{\varphi}^{(k)}\|_{\tilde{K}}^2 = \frac{1}{\tilde{\lambda}_k} \langle L_{\tilde{K}} \tilde{\varphi}^{(k)}, \tilde{\varphi}^{(k)} \rangle_{\tilde{K}} = \frac{1}{\tilde{\lambda}_k} \|\tilde{\varphi}^{(k)}\|_{L^2_{\rho_X}}^2 = \frac{1}{\tilde{\lambda}_k}. $$
So $\sqrt{\tilde{\lambda}_k}\, \tilde{\varphi}^{(k)}$ is a normalized eigenfunction of $L_{\tilde{K}}$ on $\mathcal{H}_{\tilde{K}}$ associated with the eigenvalue $\tilde{\lambda}_k$. Also, $\frac{1}{m} \tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}|_{\mathcal{H}_{\tilde{K}, \mathbf{x}}^{\perp}} = 0$ and $\frac{1}{m} \tilde{K}_{\mathbf{x}}$ is the matrix representation of $\frac{1}{m} \tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}|_{\mathcal{H}_{\tilde{K}, \mathbf{x}}}$.

Corollary 1. Let $k \in \{1, \ldots, m\}$ be such that $r := \min\{\tilde{\lambda}_{k-1} - \tilde{\lambda}_k,\ \tilde{\lambda}_k - \tilde{\lambda}_{k+1}\} > 0$. If $0 < \delta < 1$ and $m \in \mathbb{N}$ satisfy
$$ \frac{4 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}} \le \frac{r}{2}, $$
then with confidence $1 - \delta$ we have
$$ \Big\| \sqrt{\tilde{\lambda}_k}\, \tilde{\varphi}^{(k)} - \Phi^{(k)} \Big\|_{\tilde{K}} \le \frac{16 \kappa^2 \log(4/\delta)}{r\, p_0 \sqrt{m}}, $$
where $\Phi^{(k)}$ is a normalized eigenfunction of $\frac{1}{m} \tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}$ associated with its $k$-th largest eigenvalue $\lambda_{k, \mathbf{x}}$. Moreover, $\lambda_{k, \mathbf{x}}$ is the $k$-th largest eigenvalue of the matrix $\frac{1}{m} \tilde{K}_{\mathbf{x}}$, and with an associated normalized eigenvector $v$, we have $\Phi^{(k)} = \frac{1}{\sqrt{m \lambda_{k, \mathbf{x}}}} \tilde{S}_{\mathbf{x}}^T v$.

8 Algorithmic Issues for Approximating Eigenfunctions

Corollary 1 provides a quantitative understanding of the approximation of eigenfunctions of $L_{\tilde{K}}$ by those of the operator $\frac{1}{m} \tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}$. The matrix representation $\frac{1}{m} \tilde{K}_{\mathbf{x}}$ is given by the kernel $\tilde{K}$, which involves $K$ and the function $p$ defined by $\rho_X$. Since $\rho_X$ is unknown algorithmically, we want to study the approximation of eigenfunctions by methods involving only $K$ and the sample $\mathbf{x}$. This is done here.

We shall apply Proposition 2 to the operator $A = L_{\tilde{K}}$ and an operator $\hat{A}$ defined by means of the matrix $\hat{K}_{\mathbf{x}}$ given by $(K, \mathbf{x})$ in (6.2), and derive estimates for the approximation of eigenfunctions. Let $\hat{\lambda}_{1, \mathbf{x}} \ge \hat{\lambda}_{2, \mathbf{x}} \ge \cdots \ge \hat{\lambda}_{m, \mathbf{x}} \ge 0$ be the eigenvalues of the matrix $\frac{1}{m} \hat{K}_{\mathbf{x}}$.

Theorem 2. Let $k \in \{1, \ldots, m\}$ be such that $r := \min\{\tilde{\lambda}_{k-1} - \tilde{\lambda}_k,\ \tilde{\lambda}_k - \tilde{\lambda}_{k+1}\} > 0$. If $0 < \delta < 1$ and $m \in \mathbb{N}$ satisfy
$$ \frac{4 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}} \le \frac{r}{8}, \tag{8.1} $$
then with confidence $1 - \delta$ we have
$$ \Big\| \tilde{\varphi}^{(k)} - \frac{1}{\tilde{\lambda}_k \sqrt{m}} \tilde{S}_{\mathbf{x}}^T v \Big\|_{\tilde{K}} \le \frac{72}{\sqrt{\tilde{\lambda}_k}} \cdot \frac{\kappa^2 \log(4/\delta)}{r\, p_0 \sqrt{m}}, \tag{8.2} $$
where $v \in \ell^2(\mathbf{x})$ is a normalized eigenvector of the matrix $\frac{1}{m} \hat{K}_{\mathbf{x}}$ with eigenvalue $\hat{\lambda}_{k, \mathbf{x}}$.
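The formula $\Phi^{(k)} = (m \lambda_{k, \mathbf{x}})^{-1/2} \tilde{S}_{\mathbf{x}}^T v$ gives a function defined on all of $X$, not only on the sample. The sketch below (added here) evaluates such an extension using the empirical normalization of Section 8, since $p$ itself is not available; the Gaussian kernel and the sampling distribution are assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    m = 400
    x = rng.uniform(0.0, 1.0, m)
    K = lambda s, t: np.exp(-np.subtract.outer(s, t) ** 2 / (2 * 0.1 ** 2))   # assumed Gaussian kernel

    Kx = K(x, x)
    p_hat = Kx.mean(axis=1)                                   # empirical p (uniform measure on the sample)
    K_hat = Kx / np.sqrt(np.outer(p_hat, p_hat))              # normalized kernel matrix, cf. (6.2)

    lam, V = np.linalg.eigh(K_hat / m)                        # eigenpairs of (1/m) K^_x
    k = 3                                                     # use the k-th largest eigenvalue
    lam_k, v = lam[-k], V[:, -k]

    def Phi_k(t):
        # Phi(t) = (m lam_k)^{-1/2} sum_i v_i K^(t, x_i), with K^(t, x_i) = K(t, x_i) / sqrt(p^(t) p^(x_i))
        p_t = K(t, x).mean(axis=1)
        K_t = K(t, x) / np.sqrt(np.outer(p_t, p_hat))
        return (K_t @ v) / np.sqrt(m * lam_k)

    t = np.linspace(0.0, 1.0, 9)
    print(np.round(Phi_k(t), 3))                              # the approximate eigenfunction on a grid of new points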

Remark 2. Since $\Delta_{\tilde{K}} = I - L_{\tilde{K}}$ and $\Delta_{\tilde{K}, \mathbf{x}} = I - D_{\mathbf{x}}^{-1/2} K_{\mathbf{x}} D_{\mathbf{x}}^{-1/2} = I - \frac{1}{m} \hat{K}_{\mathbf{x}}$, Theorem 2 provides error bounds for the approximation of eigenfunctions of the tame Laplacian $\Delta_{\tilde{K}}$ by those of the normalized discrete Laplacian $\Delta_{\tilde{K}, \mathbf{x}}$. Note that $\frac{1}{\sqrt{m}} \tilde{S}_{\mathbf{x}}^T v = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} v_i \tilde{K}_{x_i}$.

We need some preliminary estimates for the approximation of linear operators. Applying the probability inequality (3.4) to the random variable $\xi$ on $(X, \rho_X)$ with values in the Hilbert space $\mathcal{H}_K$ given by $\xi(x) = K_x$, we obtain the following lemma.

Lemma 1. Denote $p_{\mathbf{x}} = \frac{1}{m} \sum_{j=1}^{m} K_{x_j} \in \mathcal{H}_K$. With confidence $1 - \delta/2$, we have
$$ \|p_{\mathbf{x}} - p\|_K \le \frac{4 \kappa \log(4/\delta)}{\sqrt{m}}. \tag{8.3} $$

The error bound (8.3) implies
$$ \max_{i=1, \ldots, m} |p_{\mathbf{x}}(x_i) - p(x_i)| \le \frac{4 \kappa^2 \log(4/\delta)}{\sqrt{m}}. \tag{8.4} $$
This yields the following error bound between the matrices $\hat{K}_{\mathbf{x}}$ and $\tilde{K}_{\mathbf{x}}$.

Lemma 2. Let $\mathbf{x} \in X^m$. When (8.3) holds, we have
$$ \Big\| \frac{1}{m} \hat{K}_{\mathbf{x}} - \frac{1}{m} \tilde{K}_{\mathbf{x}} \Big\| \le \Big( 2 + \frac{4 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}} \Big) \frac{4 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}}. \tag{8.5} $$

Proof. Let $\Sigma = \mathrm{diag}\{\Sigma_i\}_{i=1}^{m}$ with $\Sigma_i := \sqrt{p_{\mathbf{x}}(x_i)} / \sqrt{p(x_i)}$. Then $\frac{1}{m} \tilde{K}_{\mathbf{x}} = \Sigma\, \frac{1}{m} \hat{K}_{\mathbf{x}}\, \Sigma$. Decompose
$$ \frac{1}{m} \tilde{K}_{\mathbf{x}} - \frac{1}{m} \hat{K}_{\mathbf{x}} = \Sigma\, \frac{1}{m} \hat{K}_{\mathbf{x}}\, (\Sigma - I) + (\Sigma - I)\, \frac{1}{m} \hat{K}_{\mathbf{x}}. $$
We see that
$$ \Big\| \frac{1}{m} \tilde{K}_{\mathbf{x}} - \frac{1}{m} \hat{K}_{\mathbf{x}} \Big\| \le \|\Sigma\|\, \Big\| \frac{1}{m} \hat{K}_{\mathbf{x}} \Big\|\, \|\Sigma - I\| + \|\Sigma - I\|\, \Big\| \frac{1}{m} \hat{K}_{\mathbf{x}} \Big\|. $$
Since $0 \le \frac{1}{m} \hat{K}_{\mathbf{x}} \le I$, we have $\| \frac{1}{m} \hat{K}_{\mathbf{x}} \| \le 1$. Also, for each $i$,
$$ \big| \sqrt{p_{\mathbf{x}}(x_i)} - \sqrt{p(x_i)} \big| \le \frac{|p_{\mathbf{x}}(x_i) - p(x_i)|}{\sqrt{p(x_i)}}. $$
Hence
$$ \|\Sigma - I\| \le \max_{i=1, \ldots, m} \frac{|p_{\mathbf{x}}(x_i) - p(x_i)|}{p(x_i)} \qquad \text{and} \qquad \|\Sigma\| \le 1 + \|\Sigma - I\|. $$
This in connection with (8.4) proves the statement.
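A quick numerical look at the quantities bounded in (8.4) and (8.5) follows; this is an added illustration, with the kernel, $\rho_X =$ Uniform$[0, 1]$ and the sample sizes chosen only for the example.

    import numpy as np

    rng = np.random.default_rng(5)
    K = lambda s, t: np.exp(-np.subtract.outer(s, t) ** 2 / (2 * 0.2 ** 2))   # assumed Gaussian kernel

    y_grid = np.linspace(0.0, 1.0, 4000)                # rho_X = Uniform[0,1]
    p = lambda s: K(s, y_grid).mean(axis=1)             # p(s) = int K(s, y) d rho_X(y) (Riemann sum)

    for m in (50, 200, 800):
        x = rng.uniform(0.0, 1.0, m)
        Kx = K(x, x)
        p_x = Kx.mean(axis=1)                           # p_x(x_i) = (1/m) sum_j K(x_i, x_j)
        err_p = np.max(np.abs(p_x - p(x)))              # quantity bounded in (8.4)

        K_tilde = Kx / np.sqrt(np.outer(p(x), p(x)))    # normalization by the true p
        K_hat = Kx / np.sqrt(np.outer(p_x, p_x))        # normalization by the empirical p_x, cf. (6.2)
        err_K = np.linalg.norm((K_hat - K_tilde) / m, 2)   # quantity bounded in (8.5)
        print(m, err_p, err_K)                          # both decay roughly like 1/sqrt(m)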

Now we can prove the estimates for the approximation of eigenfunctions.

Proof of Theorem 2. First, we define a linear operator which plays the role of $\hat{A}$ in Proposition 2. This is an operator, denoted $L_{\hat{K}, \mathbf{x}}$, on $\mathcal{H}_{\tilde{K}}$ such that $L_{\hat{K}, \mathbf{x}}(f) = 0$ for any $f \in \mathcal{H}_{\tilde{K}, \mathbf{x}}^{\perp}$, and $\mathcal{H}_{\tilde{K}, \mathbf{x}}$ is an invariant subspace of $L_{\hat{K}, \mathbf{x}}$ with matrix representation equal to the matrix $\frac{1}{m} \hat{K}_{\mathbf{x}}$ given by (6.2). That is, on $\mathcal{H}_{\tilde{K}, \mathbf{x}} \oplus \mathcal{H}_{\tilde{K}, \mathbf{x}}^{\perp}$ we have
$$ L_{\hat{K}, \mathbf{x}} \big[ \tilde{K}_{x_1}, \ldots, \tilde{K}_{x_m} \big] = \big[ \tilde{K}_{x_1}, \ldots, \tilde{K}_{x_m} \big]\, \frac{1}{m} \hat{K}_{\mathbf{x}}, \qquad L_{\hat{K}, \mathbf{x}} \big|_{\mathcal{H}_{\tilde{K}, \mathbf{x}}^{\perp}} = 0. $$

Second, we consider the difference between the operators $L_{\tilde{K}}$ and $L_{\hat{K}, \mathbf{x}}$. By Theorem 1 (applied with $\delta$ replaced by $\delta/2$), there exists a subset $X_1$ of $X^m$ of measure at least $1 - \delta/2$ such that (3.7) holds true, with $\log(4/\delta)$ in place of $\log(2/\delta)$, for $\mathbf{x} \in X_1$. By Lemmas 1 and 2, we know that there exists another subset $X_2$ of $X^m$ of measure at least $1 - \delta/2$ such that (8.3) and (8.5) hold for $\mathbf{x} \in X_2$. Recall that $\tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}|_{\mathcal{H}_{\tilde{K}, \mathbf{x}}^{\perp}} = 0$ and $\tilde{K}_{\mathbf{x}}$ is the matrix representation of the operator $\tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}}|_{\mathcal{H}_{\tilde{K}, \mathbf{x}}}$. This in connection with (8.5) and the definition of $L_{\hat{K}, \mathbf{x}}$ tells us that for $\mathbf{x} \in X_2$,
$$ \Big\| \frac{1}{m} \tilde{S}_{\mathbf{x}}^T \tilde{S}_{\mathbf{x}} - L_{\hat{K}, \mathbf{x}} \Big\| = \Big\| \frac{1}{m} \tilde{K}_{\mathbf{x}} - \frac{1}{m} \hat{K}_{\mathbf{x}} \Big\| \le \Big( 2 + \frac{4 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}} \Big) \frac{4 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}}. $$
Together with (3.7), we have
$$ \big\| L_{\tilde{K}} - L_{\hat{K}, \mathbf{x}} \big\| \le \Big( 3 + \frac{4 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}} \Big) \frac{4 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}}, \qquad \mathbf{x} \in X_1 \cap X_2. \tag{8.6} $$

Third, we apply Proposition 2 to the operators $A = L_{\tilde{K}}$ and $\hat{A} = L_{\hat{K}, \mathbf{x}}$ on the Hilbert space $\mathcal{H}_{\tilde{K}}$. Let $\mathbf{x}$ be in the subset $X_1 \cap X_2$, which has measure at least $1 - \delta$. Since $r \le 1$, the estimate (8.6) together with the assumption (8.1) tells us that (7.2) holds. Then by Proposition 2 we have
$$ \Big\| \sqrt{\tilde{\lambda}_k}\, \tilde{\varphi}^{(k)} - \hat{\varphi}^{(k)} \Big\|_{\tilde{K}} \le \frac{4}{r} \big\| L_{\tilde{K}} - L_{\hat{K}, \mathbf{x}} \big\| \le \frac{50 \kappa^2 \log(4/\delta)}{r\, p_0 \sqrt{m}}, \tag{8.7} $$
where $\hat{\varphi}^{(k)}$ is a normalized eigenfunction of $L_{\hat{K}, \mathbf{x}}$ associated with the eigenvalue $\hat{\lambda}_{k, \mathbf{x}}$.

Finally, we specify $\hat{\varphi}^{(k)}$. From the definition of the operator $L_{\hat{K}, \mathbf{x}}$, we see that for $u \in \ell^2(\mathbf{x})$,
$$ L_{\hat{K}, \mathbf{x}} \Big( \sum_{i=1}^{m} u_i \tilde{K}_{x_i} \Big) = \sum_{i=1}^{m} \Big( \frac{1}{m} \hat{K}_{\mathbf{x}} u \Big)_i \tilde{K}_{x_i}. $$
Since $\| \sum_{i=1}^{m} u_i \tilde{K}_{x_i} \|_{\tilde{K}}^2 = u^T \tilde{K}_{\mathbf{x}} u$, we know that the normalized eigenfunction $\hat{\varphi}^{(k)}$ of $L_{\hat{K}, \mathbf{x}}$ associated with the eigenvalue $\hat{\lambda}_{k, \mathbf{x}}$ can be taken as
$$ \hat{\varphi}^{(k)} = \big( v^T \tilde{K}_{\mathbf{x}} v \big)^{-1/2} \sum_{i=1}^{m} v_i \tilde{K}_{x_i} = \Big( \frac{1}{m} v^T \tilde{K}_{\mathbf{x}} v \Big)^{-1/2} \frac{1}{\sqrt{m}} \tilde{S}_{\mathbf{x}}^T v, $$
where $v \in \ell^2(\mathbf{x})$ is a normalized eigenvector of the matrix $\frac{1}{m} \hat{K}_{\mathbf{x}}$ with eigenvalue $\hat{\lambda}_{k, \mathbf{x}}$.

Since (8.5) holds, we see that
$$ \Big| \frac{1}{m} v^T \tilde{K}_{\mathbf{x}} v - \hat{\lambda}_{k, \mathbf{x}} \Big| = \Big| \frac{1}{m} v^T \tilde{K}_{\mathbf{x}} v - \frac{1}{m} v^T \hat{K}_{\mathbf{x}} v \Big| = \Big| v^T \Big( \frac{1}{m} \tilde{K}_{\mathbf{x}} - \frac{1}{m} \hat{K}_{\mathbf{x}} \Big) v \Big| \le \frac{9 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}}. $$
Hence
$$ \Big| \frac{1}{m} v^T \tilde{K}_{\mathbf{x}} v - \tilde{\lambda}_k \Big| \le \Big| \frac{1}{m} v^T \tilde{K}_{\mathbf{x}} v - \hat{\lambda}_{k, \mathbf{x}} \Big| + \big| \hat{\lambda}_{k, \mathbf{x}} - \tilde{\lambda}_k \big| \le \frac{22 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}}. $$
Since $\big\| \big( \frac{1}{m} v^T \tilde{K}_{\mathbf{x}} v \big)^{-1/2} \frac{1}{\sqrt{m}} \tilde{S}_{\mathbf{x}}^T v \big\|_{\tilde{K}} = 1$, we have
$$ \Big\| \frac{1}{\sqrt{\tilde{\lambda}_k m}} \tilde{S}_{\mathbf{x}}^T v - \hat{\varphi}^{(k)} \Big\|_{\tilde{K}} = \Big| \frac{1}{\sqrt{\tilde{\lambda}_k}} - \frac{1}{\sqrt{\frac{1}{m} v^T \tilde{K}_{\mathbf{x}} v}} \Big| \cdot \Big\| \frac{1}{\sqrt{m}} \tilde{S}_{\mathbf{x}}^T v \Big\|_{\tilde{K}} = \frac{\big| \sqrt{\frac{1}{m} v^T \tilde{K}_{\mathbf{x}} v} - \sqrt{\tilde{\lambda}_k} \big|}{\sqrt{\tilde{\lambda}_k}} \le \frac{22 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}\, \tilde{\lambda}_k}. $$
This in connection with (8.7) tells us that for $\mathbf{x} \in X_1 \cap X_2$ we have
$$ \Big\| \sqrt{\tilde{\lambda}_k}\, \tilde{\varphi}^{(k)} - \frac{1}{\sqrt{\tilde{\lambda}_k m}} \tilde{S}_{\mathbf{x}}^T v \Big\|_{\tilde{K}} \le \frac{50 \kappa^2 \log(4/\delta)}{r\, p_0 \sqrt{m}} + \frac{22 \kappa^2 \log(4/\delta)}{p_0 \sqrt{m}\, \tilde{\lambda}_k}. $$
Since $r \le \tilde{\lambda}_k - \tilde{\lambda}_{k+1} \le \tilde{\lambda}_k$, this proves the stated error bound in Theorem 2.

Since m i= u i K xi 2 K = u T Kx u, we know that the normalized eigenfunction ϕ (k) of L K,x associated with eigenvalue λ k,x can be taken as ( ) /2 m ϕ (k) = v T Kx v v i Kxi = i= ( ) /2 m vt Kx v ST m x v where v l 2 (x) is a normalized eigenvector of the matrix m K x with eigenvalue λ k,x. Since (8.5) holds, we see that m vt Kx v λ k,x = m vt Kx v m vt Kx v = v T ( m K x m K x )v 9κ2 log ( 4/δ ). mp0 Hence m vt Kx v λ k m vt Kx v λ k,x + λ k,x λ k 22κ2 log ( 4/δ ). mp0 ( ) /2 Since m vt Kx v ST m x v =, we have K ST λk m x v ϕ (k) = K λk m vt K ST x v m x v K m vt Kx v λ k = λk + m vt K x v λk m vt K ST x v m x v 22κ2 log ( 4/δ ). K mp0 λ k This in connection with (8.7) tells us that for x X X 2 we have ϕ (k) ST λk λk m x v 50κ2 log ( 4/δ ) ( ) r + 22κ2 log 4/δ. mp 0 mp0 λ k Since r λ k λ k+ λ k, this proves the stated error bound in Theorem 2. K References [] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (950), 337 404. [2] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 5 (2003), 373 396. 2

[3] M. Belkin and P. Niyogi, Convergence of Laplacian eigenmaps, preprint, 2007.

[4] R. Bhatia, Matrix Analysis, Graduate Texts in Mathematics 169, Springer-Verlag, New York, 1997.

[5] G. Blanchard, O. Bousquet, and L. Zwald, Statistical properties of kernel principal component analysis, Mach. Learn. 66 (2007), 259-294.

[6] S. Bougleux, A. Elmoataz, and M. Melkemi, Discrete regularization on weighted graphs for image and mesh filtering, Lecture Notes in Computer Science 4485 (2007), 128-139.

[7] F. R. K. Chung, Spectral Graph Theory, Regional Conference Series in Mathematics 92, SIAM, Philadelphia, 1997.

[8] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps, Proc. Natl. Acad. Sci. USA 102 (2005), 7426-7431.

[9] R. Coifman and M. Maggioni, Diffusion wavelets, Appl. Comput. Harmonic Anal. 21 (2006), 53-94.

[10] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2002), 1-49.

[11] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, 2007.

[12] E. De Vito, A. Caponnetto, and L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5 (2005), 59-85.

[13] G. Gilboa and S. Osher, Nonlocal operators with applications to image processing, UCLA CAM Report 07-23, July 2007.

[14] V. Koltchinskii and E. Giné, Random matrix approximation of spectra of integral operators, Bernoulli 6 (2000), 113-167.

[15] I. Pinelis, Optimum bounds for the distributions of martingales in Banach spaces, Ann. Probab. 22 (1994), 1679-1706.

[16] S. Smale and D. X. Zhou, Shannon sampling and function reconstruction from point values, Bull. Amer. Math. Soc. 41 (2004), 279-305.

[17] S. Smale and D. X. Zhou, Shannon sampling II. Connections to learning theory, Appl. Comput. Harmonic Anal. 19 (2005), 285-302.

[18] S. Smale and D. X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007), 153-172.

[19] U. von Luxburg, M. Belkin, and O. Bousquet, Consistency of spectral clustering, Ann. Stat., to appear.

[20] G. B. Ye and D. X. Zhou, Learning and approximation by Gaussians on Riemannian manifolds, Adv. Comput. Math., in press. DOI 10.1007/s10444-007-9049-0.

[21] D. Zhou and B. Schölkopf, Regularization on discrete spaces, in Pattern Recognition, Proc. 27th DAGM Symposium, Berlin, 2005, pp. 361-368.