From graph to manifold Laplacian: The convergence rate


Appl. Comput. Harmon. Anal. 21 (2006) 128–134
www.elsevier.com/locate/acha

Letter to the Editor

A. Singer
Department of Mathematics, Yale University, 10 Hillhouse Ave., P.O. Box 208283, New Haven, CT 06520-8283, USA
E-mail address: amit.singer@yale.edu
Communicated by Ronald R. Coifman on 12 March 2006; available online 26 May 2006
doi:10.1016/j.acha.2006.03.004

Abstract

The convergence of the discrete graph Laplacian to the continuous manifold Laplacian in the limit of sample size $N \to \infty$, while the kernel bandwidth $\varepsilon \to 0$, is the justification for the success of Laplacian-based algorithms in machine learning, such as dimensionality reduction, semi-supervised learning and spectral clustering. In this paper we improve the convergence rate of the variance term recently obtained by Hein et al. [From graphs to manifolds – Weak and strong pointwise consistency of graph Laplacians, in: P. Auer, R. Meir (Eds.), Proc. 18th Conf. Learning Theory (COLT), Lecture Notes Comput. Sci., vol. 3559, Springer-Verlag, Berlin, 2005, pp. 470–485], improve the bias term error, and find an optimal criterion for determining the parameter $\varepsilon$ given $N$.
© 2006 Elsevier Inc. All rights reserved.

1. Introduction

Graph Laplacians are widely used in machine learning for dimensionality reduction, semi-supervised learning and spectral clustering ([3–9] and references therein). In these setups one is usually given a set of $N$ data points $x_1, x_2, \ldots, x_N \in \mathcal{M}$, where $\mathcal{M} \subset \mathbb{R}^m$ is a Riemannian manifold with $\dim \mathcal{M} = d < m$. The points are given as vectors in the ambient space $\mathbb{R}^m$, and the task is to find the unknown underlying manifold, its geometry and its low-dimensional representation. The starting point of spectral methods is to extract an $N \times N$ weight matrix $W$ from a suitable positive semi-definite kernel $k$ as follows:

    W_{ij} = k\big(\|x_i - x_j\|^2 / 2\varepsilon\big),   (1.1)

where $\|\cdot\|$ is the Euclidean distance in the ambient space $\mathbb{R}^m$ and $\varepsilon > 0$ is the bandwidth of the kernel. A popular choice is the exponential kernel $k(x) = e^{-x}$, though other choices are also possible. The weight matrix $W$ is then normalized to be row stochastic by dividing it by the diagonal matrix $D$ whose elements are the row sums of $W$,

    D_{ii} = \sum_{j=1}^{N} W_{ij},   (1.2)

and the (negative semi-definite) graph Laplacian $L$ is given by

    L = D^{-1} W - I,   (1.3)

where $I$ is the $N \times N$ identity matrix.
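The construction (1.1)–(1.3) is easy to mirror numerically. Below is a minimal NumPy sketch (my own illustration, not code from the paper; the dense $O(N^2)$ implementation and the function name are assumptions made for clarity):

```python
import numpy as np

def graph_laplacian(X, eps):
    """Row-stochastic graph Laplacian L = D^{-1} W - I of Eqs. (1.1)-(1.3).

    X   : (N, m) array of points in the ambient space R^m.
    eps : kernel bandwidth epsilon > 0.
    """
    # Pairwise squared Euclidean distances ||x_i - x_j||^2.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian weights W_ij = exp(-||x_i - x_j||^2 / (2 eps)), Eq. (1.1).
    W = np.exp(-sq_dists / (2.0 * eps))
    # Row sums D_ii, Eq. (1.2), and the Laplacian L = D^{-1} W - I, Eq. (1.3).
    D = W.sum(axis=1)
    return W / D[:, None] - np.eye(len(X))
```

In the uniform-sampling setting analyzed below, applying this $L$ to the samples of a smooth $f$ and dividing by $\varepsilon$ approximates $\frac{1}{2}\Delta_{\mathcal{M}} f$, as quantified in (1.4)–(1.7).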

In the case where the data points $\{x_i\}_{i=1}^{N}$ are independently and uniformly distributed over the manifold $\mathcal{M}$, the graph Laplacian converges to the continuous Laplace–Beltrami operator $\Delta_{\mathcal{M}}$ of the manifold. This statement has two manifestations. First, if $f:\mathcal{M}\to\mathbb{R}$ is a smooth function, then the following result concerning the bias term has been established by various authors [1,5,6] (among others):

    \lim_{N\to\infty} \frac{1}{\varepsilon}\sum_{j=1}^{N} L_{ij} f(x_j) = \frac{1}{2}\Delta_{\mathcal{M}} f(x_i) + O\big(\varepsilon^{1/2}\big).   (1.4)

Second, Hein et al. [1] established a uniform estimate of the error in both $N$ and $\varepsilon$ (the variance term), which decreases as $1/\sqrt{N}$ as $N\to\infty$ but increases as $1/\varepsilon^{1+d/4}$ as $\varepsilon\to 0$:

    \frac{1}{\varepsilon}\sum_{j=1}^{N} L_{ij} f(x_j) = \frac{1}{2}\Delta_{\mathcal{M}} f(x_i) + O\Big(\frac{1}{N^{1/2}\varepsilon^{1+d/4}},\ \varepsilon^{1/2}\Big).   (1.5)

In practice, one cannot choose $\varepsilon$ too small, even though smaller $\varepsilon$ decreases the bias error (1.4), because $\frac{1}{N^{1/2}\varepsilon^{1+d/4}}$ diverges as $\varepsilon\to 0$. It was reasoned in [1] that the rate $\frac{1}{N^{1/2}\varepsilon^{1+d/4}}$ is unlikely to be improved, since the rates for estimating second derivatives in nonparametric regression are the same.

In this paper we show that the variance error is $O\big(\frac{1}{N^{1/2}\varepsilon^{1/2+d/4}}\big)$, which improves the convergence rate (1.5) by an asymptotic factor $\varepsilon^{1/2}$. This improvement stems from the observation that the noise terms of $Wf$ and $Df$ are highly correlated, with a correlation coefficient of $1 - O(\varepsilon)$. At points where $\nabla f = 0$ the convergence rate is even faster. Moreover, we refine the bias error to be $O(\varepsilon)$ instead of the $O(\varepsilon^{1/2})$ in (1.4), by a symmetry argument. Finally, balancing the two error terms leads to a natural optimal choice of $\varepsilon$,

    \varepsilon = C(\mathcal{M})\, N^{-1/(3+d/2)},   (1.6)

which gives a minimal error between the continuous Laplace–Beltrami operator and its discrete graph Laplacian approximation. The constant $C(\mathcal{M})$ is a function of the manifold $\mathcal{M}$ that depends on its geometry, for example its dimension, curvature and volume, but is independent of the number of sample points $N$. The variance error depends on the intrinsic dimension of the manifold and its volume, which can be estimated [11] with $O(1/N)$ accuracy, rather than $O(1/\sqrt{N})$. However, the bias error depends on the local curvature of the manifold, which is unknown in advance. Therefore, in real-life applications one still needs some trial-and-error experimentation to find the optimal $\varepsilon$ for a given $N$, because $C(\mathcal{M})$ is unknown. When a different $N$ is introduced, the parameter $\varepsilon$ can then be rescaled according to (1.6). Our results are summarized as follows:

    \frac{1}{\varepsilon}\sum_{j=1}^{N} L_{ij} f(x_j) = \frac{1}{2}\Delta_{\mathcal{M}} f(x_i) + O\Big(\frac{1}{N^{1/2}\varepsilon^{1/2+d/4}},\ \varepsilon\Big).   (1.7)

We remark that if the points are not uniformly distributed, then Lafon et al. [6,7] showed that the limiting operator contains an additional drift term, and suggested a different kernel normalization that separates the manifold geometry from the distribution of points on it and recovers the Laplace–Beltrami operator. Our analysis therefore assumes that the data set is already uniformly distributed. Finally, we illustrate the improved convergence rates by computer simulations with spherical harmonics.

The notation $O(\cdot,\cdot)$ means that there exist positive constants $C_1, C_2$ (independent of $N$ and $\varepsilon$) such that, for example, $O\big(\frac{1}{N^{1/2}\varepsilon^{1+d/4}},\ \varepsilon^{1/2}\big) \le C_1 \frac{1}{N^{1/2}\varepsilon^{1+d/4}} + C_2\,\varepsilon^{1/2}$ for large $N$ and small $\varepsilon$.
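The exponent in (1.6) is exactly the one that balances the two error terms of (1.7). As a quick check (not an additional result of the paper), setting the bias $O(\varepsilon)$ equal to the variance $O\big(N^{-1/2}\varepsilon^{-(1/2+d/4)}\big)$ gives

    \varepsilon \;\sim\; \frac{1}{N^{1/2}\,\varepsilon^{1/2+d/4}}
    \quad\Longleftrightarrow\quad
    \varepsilon^{\,3/2+d/4} \;\sim\; N^{-1/2}
    \quad\Longleftrightarrow\quad
    \varepsilon \;\sim\; N^{-1/(3+d/2)},

which is (1.6) up to the manifold-dependent constant $C(\mathcal{M})$.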

2. The bias term

Let $x_1,\ldots,x_N$ be $N$ independent uniformly distributed points over a $d$-dimensional compact Riemannian manifold $\mathcal{M} \subset \mathbb{R}^m$ of volume $V$. Suppose $f:\mathcal{M}\to\mathbb{R}$ is a smooth function. In this section we investigate the limiting properties of the graph Laplacian (1.1)–(1.3) at a given fixed point $x = x_i \in \mathcal{M}$:

    (Lf)(x) = \sum_{j=1}^{N} L_{ij} f(x_j)
            = \frac{\sum_{j=1}^{N} \exp\{-\|x - x_j\|^2/2\varepsilon\}\, f(x_j)}{\sum_{j=1}^{N} \exp\{-\|x - x_j\|^2/2\varepsilon\}} - f(x).   (2.1)

We rewrite the graph Laplacian (2.1) as

    (Lf)(x) = \frac{\sum_{j=1}^{N} F(x_j)}{\sum_{j=1}^{N} G(x_j)} - f(x),   (2.2)

where

    F(x_j) = \exp\{-\|x - x_j\|^2/2\varepsilon\}\, f(x_j),   (2.3)
    G(x_j) = \exp\{-\|x - x_j\|^2/2\varepsilon\}.   (2.4)

We shall see that excluding the diagonal term $j = i$ from the sums in (2.2) results in an $O\big(\frac{1}{N\varepsilon^{d/2}}\big)$ error, which is even smaller than the variance error squared (see Eq. (1.7)); it is therefore negligible:

    (Lf)(x) = \Big[\frac{\sum_{j\neq i} F(x_j)}{\sum_{j\neq i} G(x_j)} - f(x)\Big]\Big(1 + O\Big(\frac{1}{N\varepsilon^{d/2}}\Big)\Big).   (2.5)

The points $x_j$ are independent identically distributed (i.i.d.); therefore $F(x_j)$, $G(x_j)$ ($j\neq i$) are also i.i.d., and by the law of large numbers one should expect

    \frac{\sum_{j\neq i} F(x_j)}{\sum_{j\neq i} G(x_j)} \longrightarrow \frac{EF}{EG},   (2.6)

where

    EF = \frac{1}{V}\int_{\mathcal{M}} \exp\{-\|x - y\|^2/2\varepsilon\}\, f(y)\, dy,   (2.7)
    EG = \frac{1}{V}\int_{\mathcal{M}} \exp\{-\|x - y\|^2/2\varepsilon\}\, dy,   (2.8)

are the expected values of $F$ and $G$, respectively. Our goal is to make Eq. (2.6) precise by obtaining the error estimate as a function of $N$ and $\varepsilon$.

The expansion of the integral (2.7) in a Taylor series in $\varepsilon$ is given by [2,6,10]

    \frac{1}{(2\pi\varepsilon)^{d/2}}\int_{\mathcal{M}} \exp\{-\|x - y\|^2/2\varepsilon\}\, f(y)\, dy
        = f(x) + \frac{\varepsilon}{2}\big[E(x) f(x) + \Delta_{\mathcal{M}} f(x)\big] + O\big(\varepsilon^{3/2}\big),   (2.9)

where $E(x)$ is a scalar function of the curvature of $\mathcal{M}$ at $x$. Hence,

    \frac{EF}{EG} - f(x) = \frac{\varepsilon}{2}\Delta_{\mathcal{M}} f(x) + O\big(\varepsilon^{3/2}\big).   (2.10)

In other words, the finite sums in (2.5) are Monte Carlo estimators of the kernel integrals in (2.7) and (2.8), whose deterministic expansion in $\varepsilon$ gives the Laplace–Beltrami operator as the leading order term. Moreover, Eqs. (2.7) and (2.9) show that the error made by excluding the diagonal term $i = j$ is indeed $O\big(\frac{1}{N\varepsilon^{d/2}}\big)$.
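The Monte Carlo interpretation of (2.5)–(2.10) can be checked directly on the unit circle ($d = 1$). The following sketch is my own illustration (NumPy, parameter values chosen only for the demonstration): for $f(\theta) = \cos\theta$ evaluated at $\theta = 0$ we have $\Delta_{\mathcal{M}} f = -\cos\theta$ and $\nabla f(0) = 0$, so $\frac{1}{\varepsilon}(Lf)$ should approach $\frac{1}{2}\Delta_{\mathcal{M}} f = -\tfrac{1}{2}$ for small $\varepsilon$ and large $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps = 50000, 2e-3
theta = rng.uniform(0.0, 2.0 * np.pi, N)               # uniform samples on S^1
X = np.column_stack((np.cos(theta), np.sin(theta)))    # embed the circle in R^2

x = np.array([1.0, 0.0])                               # evaluation point (theta = 0)
f = np.cos(theta)                                      # f = cos(theta), Delta f = -cos(theta)
w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * eps))  # kernel weights G(x_j)

# (1/eps) * [ sum_j w_j f(x_j) / sum_j w_j - f(x) ]  as in Eqs. (2.1)-(2.5)
estimate = (w @ f / w.sum() - 1.0) / eps
print(estimate)   # approximately -0.5, up to the O(eps) bias and sampling noise
```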

Our first observation is that for smooth manifolds and smooth functions, Taylor expansions of the form (2.9) contain only integer powers of $\varepsilon$, so fractional powers such as $\varepsilon^{3/2}$ must vanish. The kernel integral (2.9) is evaluated by changing the integration variables to local coordinates, and the fractional powers of $\varepsilon$ that appear in the asymptotic expansion are due to odd monomials. However, the integrals of odd monomials vanish, because the kernel is symmetric. We conclude that a finer error estimate of (2.9) is

    \frac{1}{(2\pi\varepsilon)^{d/2}}\int_{\mathcal{M}} \exp\{-\|x - y\|^2/2\varepsilon\}\, f(y)\, dy
        = f(x) + \frac{\varepsilon}{2}\big[E(x) f(x) + \Delta_{\mathcal{M}} f(x)\big] + O\big(\varepsilon^{2}\big),   (2.11)

where the $O(\varepsilon^2)$ term depends on the curvature of the manifold, while the $O(\varepsilon^{3/2})$ term vanishes as long as the manifold is smooth.

3. The variance error

The error term in Monte Carlo integration stems from the variance (noise) term, which we evaluate below. Interestingly, the noise terms in the integrals of $F$ and $G$ are highly correlated, so we obtain a faster convergence rate. In what follows we show that the estimator $\frac{\sum_{j\neq i} F(x_j)}{\sum_{j\neq i} G(x_j)}$ converges to $\frac{EF}{EG}$ in probability faster than could be expected [1], because $F$ and $G$ are highly correlated random variables. To this end, we use the Chernoff inequality to establish an upper bound for the probability $p(N,\alpha)$ of having an $\alpha$-error,

    p(N,\alpha) \equiv \Pr\Big\{\frac{\sum_{j\neq i} F(x_j)}{\sum_{j\neq i} G(x_j)} - \frac{EF}{EG} > \alpha\Big\},   (3.1)

while an upper bound for $\Pr\big\{\frac{\sum_{j\neq i} F(x_j)}{\sum_{j\neq i} G(x_j)} - \frac{EF}{EG} < -\alpha\big\}$ can be obtained similarly. The random variables $G(x_j)$ are positive, thus

    p(N,\alpha) = \Pr\Big\{\sum_{j\neq i}\big[(EG)F(x_j) - (EF + \alpha EG)G(x_j)\big] > 0\Big\},   (3.2)

which we rewrite as

    p(N,\alpha) = \Pr\Big\{\sum_{j\neq i} Y_j > (N-1)\,\alpha\,(EG)^2\Big\},   (3.3)

where

    Y_j = (EG)F(x_j) - (EF)G(x_j) + \alpha(EG)\big(EG - G(x_j)\big)   (3.4)

are zero-mean ($EY_j = 0$) i.i.d. random variables. To ease the notation we write $F$ and $G$ for $F(x_j)$ and $G(x_j)$, since these are identically distributed random variables. The variance of $Y_j$ depends on the second moments $E(F^2)$, $E(G^2)$ and $E(FG)$,

    EY_j^2 = (EG)^2 E(F^2) - 2(EG)(EF)E(FG) + (EF)^2 E(G^2)
             + 2\alpha(EG)\big[(EF)E(G^2) - (EG)E(FG)\big]
             + \alpha^2 (EG)^2\big[E(G^2) - (EG)^2\big],   (3.5)

which are calculated using Eq. (2.11):

    EF     = \frac{(2\pi\varepsilon)^{d/2}}{V}\Big\{f(x) + \frac{\varepsilon}{2}\big[E(x)f(x) + \Delta_{\mathcal{M}} f(x)\big] + O(\varepsilon^2)\Big\},   (3.6)
    EG     = \frac{(2\pi\varepsilon)^{d/2}}{V}\Big\{1 + \frac{\varepsilon}{2} E(x) + O(\varepsilon^2)\Big\},   (3.7)
    E(F^2) = \frac{(\pi\varepsilon)^{d/2}}{V}\Big\{f^2(x) + \frac{\varepsilon}{4}\big[E(x)f^2(x) + \Delta_{\mathcal{M}} f^2(x)\big] + O(\varepsilon^2)\Big\},   (3.8)
    E(G^2) = \frac{(\pi\varepsilon)^{d/2}}{V}\Big\{1 + \frac{\varepsilon}{4} E(x) + O(\varepsilon^2)\Big\},   (3.9)
    E(FG)  = \frac{(\pi\varepsilon)^{d/2}}{V}\Big\{f(x) + \frac{\varepsilon}{4}\big[E(x)f(x) + \Delta_{\mathcal{M}} f(x)\big] + O(\varepsilon^2)\Big\}.   (3.10)

Therefore, the variance of $Y_j$ is

    EY_j^2 = \frac{2^d(\pi\varepsilon)^{3d/2}}{V^3}\,\frac{\varepsilon}{4}\big[\Delta_{\mathcal{M}} f^2(x) - 2f(x)\Delta_{\mathcal{M}} f(x) + O(\varepsilon)\big]
             + 2\alpha\,\frac{2^d(\pi\varepsilon)^{3d/2}}{V^3}\,\frac{\varepsilon}{4}\big[\Delta_{\mathcal{M}} f(x) + O(\varepsilon)\big]
             + \alpha^2\,\frac{2^d(\pi\varepsilon)^{3d/2}}{4V^3}\big[1 + O\big(\varepsilon, \varepsilon^{d/2}\big)\big].   (3.11)

Clearly, we focus on the regime $\alpha \ll \varepsilon$, because we are estimating an $O(\varepsilon)$ quantity, namely $\frac{\varepsilon}{2}\Delta_{\mathcal{M}} f(x)$, so there is no point in having an error $\alpha$ that is larger than the estimate itself. Therefore, the $\alpha$ and $\alpha^2$ terms of the variance (3.11) are negligible, and we obtain

    EY_j^2 = \frac{2^d(\pi\varepsilon)^{3d/2}}{V^3}\,\frac{\varepsilon}{4}\big[\Delta_{\mathcal{M}} f^2(x) - 2f(x)\Delta_{\mathcal{M}} f(x) + O(\varepsilon)\big].   (3.12)

Furthermore, $\Delta_{\mathcal{M}} f^2 - 2f\Delta_{\mathcal{M}} f = 2\|\nabla f\|^2$, hence

    EY_j^2 = \frac{2^d(\pi\varepsilon)^{3d/2}}{V^3}\,\frac{\varepsilon}{2}\big[\|\nabla f\|^2 + O(\varepsilon)\big].   (3.13)

Note that the $Y_j$ are bounded random variables, because $F$ and $G$ are bounded (see (2.3)–(2.4)) and by the definition (3.4) of $Y_j$. Therefore, the Chernoff inequality gives

    p(N,\alpha) \le 2\exp\Big\{-\frac{(N-1)\,\alpha^2\, 2^d(\pi\varepsilon)^{d/2}}{V\varepsilon\,\big[\|\nabla f\|^2 + O(\varepsilon)\big]}\Big\}.   (3.14)

The practical meaning of the inequality (3.14) is that the noise error in the estimation of $\frac{\varepsilon}{2}\Delta_{\mathcal{M}} f(x)$ is of the order of

    \alpha \approx \frac{\sqrt{V\varepsilon}\,\sqrt{\|\nabla f\|^2 + O(\varepsilon)}}{\sqrt{N}\, 2^{d/2}(\pi\varepsilon)^{d/4}}.   (3.15)

To extract the Laplacian itself we further divide by $\varepsilon$ (see Eq. (1.4)); therefore the noise error is

    \frac{\alpha}{\varepsilon} \approx \frac{\sqrt{V}\,\sqrt{\|\nabla f\|^2 + O(\varepsilon)}}{\sqrt{N}\, 2^{d/2}\pi^{d/4}\,\varepsilon^{1/2+d/4}} = O\Big(\frac{1}{N^{1/2}\varepsilon^{1/2+d/4}}\Big).   (3.16)

This improves the convergence rate of [1] by the asymptotic factor $\varepsilon^{1/2}$. Moreover, if

    \nabla f(x) = 0,   (3.17)

then Eq. (3.15) implies that the convergence rate is even faster by another asymptotic factor of $\varepsilon^{1/2}$, and is given by $O\big(\frac{1}{N^{1/2}\varepsilon^{d/4}}\big)$.
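The driving observation of this section, that the fluctuations of $\sum_j F(x_j)$ and $\sum_j G(x_j)$ are almost perfectly correlated, is easy to verify empirically. The sketch below (my own illustration on the unit circle, with parameter values chosen arbitrarily) estimates the correlation coefficient of the two sums across independent samples; it comes out close to $1$, consistent with the $1 - O(\varepsilon)$ correlation mentioned in the Introduction.

```python
import numpy as np

rng = np.random.default_rng(1)
N, eps, trials = 5000, 1e-2, 400
sumF, sumG = [], []
for _ in range(trials):
    theta = rng.uniform(0.0, 2.0 * np.pi, N)
    X = np.column_stack((np.cos(theta), np.sin(theta)))             # uniform points on S^1
    w = np.exp(-np.sum((X - [1.0, 0.0]) ** 2, axis=1) / (2.0 * eps))  # G(x_j), Eq. (2.4)
    sumF.append(np.sum(w * np.cos(theta)))                          # sum_j F(x_j), f = cos(theta)
    sumG.append(np.sum(w))                                          # sum_j G(x_j)

print(np.corrcoef(sumF, sumG)[0, 1])   # close to 1: the noise in F and G fluctuates together
```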

4. Example: spherical harmonics

In this section we illustrate the convergence rate (1.7) for the case of the unit sphere $\mathcal{M} = S^2 \subset \mathbb{R}^3$, a two-dimensional compact manifold ($d = 2$). The eigenfunctions of the Laplacian on the unit sphere are the spherical harmonics. The first nontrivial spherical harmonic is $f(x,y,z) = z$, which satisfies $\Delta_{\mathcal{M}} f = -2f$. In particular, at the north pole $\Delta_{\mathcal{M}} f(0,0,1) = -2$. Furthermore, the tangent plane at the north pole is the $xy$-plane, so $\nabla f(0,0,1) = 0$. The gradient condition (3.17) implies that the convergence rate should be $\varepsilon^{-1/2}$, and indeed the computer simulation yielded $\varepsilon^{-0.52}$ (see Figs. 1 and 2).

Next, we examined the function $g = z + x$ at the north pole. Since $x$ is also a spherical harmonic, and $x = 0$ at the north pole, it follows that $\Delta_{\mathcal{M}} g(0,0,1) = -2$ as well. However, $\nabla g \neq 0$, because $g$ changes in the $x$ direction. Therefore, by Eq. (1.7) we expect the convergence rate to be $\varepsilon^{-1}$, and indeed a linear fit gives a slope of $-0.998$ (see Fig. 3).
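The experiment of this section can be repeated with a short simulation. The sketch below is my own reimplementation under the paper's setup (uniform points on $S^2$, estimate at the north pole); $N$, the $\varepsilon$ grid and the number of trials are chosen for speed rather than to reproduce the figures exactly, so the fitted slopes will only be roughly $-1/2$ for $f = z$ and roughly $-1$ for $g = z + x$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 3000, 100
north = np.array([0.0, 0.0, 1.0])
eps_grid = np.logspace(-2.6, -1.6, 6)   # small-eps (variance-dominated) regime

def laplacian_estimate(fvals, X, f_at_pole, eps):
    # (1/eps) * [ sum_j w_j f(x_j) / sum_j w_j - f(x) ],  w_j = exp(-||x - x_j||^2 / (2 eps))
    w = np.exp(-np.sum((X - north) ** 2, axis=1) / (2.0 * eps))
    return (w @ fvals / w.sum() - f_at_pole) / eps

errors = {"f = z": [], "g = z + x": []}
for eps in eps_grid:
    err_f, err_g = [], []
    for _ in range(trials):
        X = rng.normal(size=(N, 3))
        X /= np.linalg.norm(X, axis=1, keepdims=True)   # uniform points on S^2
        z, x1 = X[:, 2], X[:, 0]
        # Delta z = -2 z on S^2, so the limit of (1/eps) L f at the north pole is 0.5 * (-2) = -1,
        # and the same limit holds for g = z + x (since x vanishes at the pole).
        err_f.append(laplacian_estimate(z, X, 1.0, eps) + 1.0)
        err_g.append(laplacian_estimate(z + x1, X, 1.0, eps) + 1.0)
    errors["f = z"].append(np.sqrt(np.mean(np.square(err_f))))
    errors["g = z + x"].append(np.sqrt(np.mean(np.square(err_g))))

for name, errs in errors.items():
    slope = np.polyfit(np.log(eps_grid), np.log(errs), 1)[0]
    print(name, "fitted slope:", slope)   # expect roughly -0.5 for f = z, roughly -1 for g = z + x
```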

Fig. 1. The error between the discrete and the continuous Laplace–Beltrami operator of the unit sphere $S^2$ for the spherical harmonic $f = z$ at the north pole, as a function of $\varepsilon$. $N = 3000$ points were uniformly distributed over the sphere, and the error was averaged over $m = 1000$ independent trials (for each $\varepsilon$) as $\mathrm{error} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\big(\frac{1}{\varepsilon}(Lf)_i - \frac{1}{2}\Delta_{\mathcal{M}} f\big)^2}$. For $\varepsilon \approx 0.3$ a minimal error is obtained, where the sum of the bias and the variance errors is minimal.

Fig. 2. A logarithmic plot of Fig. 1. A linear fit in the small-$\varepsilon$ region, where the variance error is dominant, gives a slope of $-0.52$, in agreement with the gradient condition (3.17), which predicts an $\varepsilon^{-1/2}$ asymptotic.

Fig. 3. A logarithmic plot of the graph Laplacian error against $\varepsilon$, for $g = z + x$ at the north pole. A linear fit in the variance-dominant regime gives a slope of $-0.998$, in accordance with the expected $\varepsilon^{-1}$ (Eq. (1.7)).

References

[1] M. Hein, J.-Y. Audibert, U. von Luxburg, From graphs to manifolds – Weak and strong pointwise consistency of graph Laplacians, in: P. Auer, R. Meir (Eds.), Proc. 18th Conf. Learning Theory (COLT), Lecture Notes Comput. Sci., vol. 3559, Springer-Verlag, Berlin, 2005, pp. 470–485.
[2] M. Belkin, Problems of learning on manifolds, Ph.D. dissertation, University of Chicago, 2003.
[3] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Adv. Neural Inform. Process. Syst., vol. 14, MIT Press, Cambridge, MA, 2002.
[4] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[5] M. Belkin, P. Niyogi, Towards a theoretical foundation for Laplacian-based manifold methods, in: P. Auer, R. Meir (Eds.), Proc. 18th Conf. Learning Theory (COLT), Lecture Notes Comput. Sci., vol. 3559, Springer-Verlag, Berlin, 2005, pp. 486–500.
[6] S. Lafon, Diffusion maps and geometric harmonics, Ph.D. dissertation, Yale University, 2004.
[7] B. Nadler, S. Lafon, R.R. Coifman, I.G. Kevrekidis, Diffusion maps, spectral clustering and eigenfunctions of Fokker–Planck operators, in: Y. Weiss, B. Schölkopf, J. Platt (Eds.), Adv. Neural Inform. Process. Syst. (NIPS), vol. 18, MIT Press, Cambridge, MA, 2006.
[8] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, S.W. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, Proc. Natl. Acad. Sci. 102 (21) (2005) 7426–7431.
[9] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, S.W. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Multiscale methods, Proc. Natl. Acad. Sci. 102 (21) (2005) 7432–7437.
[10] O.G. Smolyanov, H. von Weizsäcker, O. Wittich, Brownian motion on a manifold as limit of stepwise conditioned standard Brownian motions, in: F. Gesztesy (Ed.), Stochastic Processes, Physics and Geometry: New Interplays, II, CMS Conf. Proc., vol. 29, Amer. Math. Soc., Providence, RI, 2000.
[11] M. Hein, J.-Y. Audibert, Intrinsic dimensionality estimation of submanifolds in R^d, in: L. De Raedt, S. Wrobel (Eds.), Proc. 22nd Int. Conf. Machine Learning, ACM, 2005, pp. 289–296.