Diffusion Geometries, Diffusion Wavelets and Harmonic Analysis of Large Data Sets. R.R. Coifman, S. Lafon, MM. Mathematics Department, Program of Applied Mathematics, Yale University.
Motivations. The main problem is to analyze large amounts of data in high dimensions. Paradigm: we have a large number of documents (e.g. web pages, gene array data, (hyper)spectral data, molecular dynamics data, etc.) and a way of measuring similarity between pairs. Model: a graph (G,E,W). In some cases the vertices are points in high-dimensional Euclidean space and the weights are a function of Euclidean distance. Problems: (1) understand data sets in high dimensions, and classes of functions on them; (2) approximation and learning of such functions; (3) parametrize low-dimensional data sets embedded in high dimension; (4) fast algorithms.
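The graph model above can be sketched in a few lines. The slide only says that the weights are "a function of Euclidean distance"; the Gaussian kernel, the scale parameter `sigma`, and the function name `gaussian_weights` used here are assumptions chosen for illustration, not the authors' stated choice.

```python
import math

def gaussian_weights(points, sigma):
    """Dense weight matrix W[i][j] = exp(-|x_i - x_j|^2 / sigma^2).

    The Gaussian form and the scale sigma are illustrative assumptions.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between vertices i and j.
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            W[i][j] = math.exp(-d2 / sigma ** 2)
    return W

# Three vertices in the plane: two close together, one far away.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
W = gaussian_weights(points, sigma=1.0)
```

Nearby vertices receive weights close to 1 and distant vertices receive weights close to 0, so the graph encodes local similarity only.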
High-dimensional data: examples. Biotech data (gene arrays, proteomic data); customer databases (companies collect and process information on (potential) customers); financial data; web searching; satellite imagery. However, in many situations constraints force the data to lie on sets which have a very small intrinsic dimensionality compared to that of the ambient space. In the case of graphs, or arbitrary metric spaces, there are notions of intrinsic complexity, or of embeddability in low-dimensional Hilbert spaces.
Curse of dimensionality. The high dimension is an obstacle to the processing of the data. Approximation of functions: to represent $C^1$ functions on a grid in dimension $n$ with accuracy $\epsilon$, one needs on the order of $\epsilon^{-n}$ grid points. Density estimation is difficult: one needs a lot of data points, otherwise most bins are empty. The computational cost of many algorithms grows exponentially with the dimension (e.g. nearest neighbor search, Fast Multipole Method).
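A small numerical illustration of the two counting arguments above, as a sketch under stated assumptions: the data are taken to be uniform samples in the unit cube, and the helper names `grid_points_needed` and `empty_bin_fraction` are hypothetical.

```python
import random

def grid_points_needed(eps, n):
    """Number of points in a regular grid of spacing eps in dimension n."""
    return round(1.0 / eps) ** n

def empty_bin_fraction(num_samples, bins_per_axis, n, seed=0):
    """Fraction of histogram bins left empty by uniform samples in [0, 1)^n."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(num_samples):
        # Index of the bin containing one uniform random sample.
        cell = tuple(int(rng.random() * bins_per_axis) for _ in range(n))
        occupied.add(cell)
    return 1.0 - len(occupied) / bins_per_axis ** n

for n in (1, 2, 3, 6):
    print(n, grid_points_needed(0.1, n), empty_bin_fraction(1000, 10, n))
```

With a fixed budget of 1000 samples and 10 bins per axis, at most 1000 of the 10^6 bins in dimension 6 can be occupied, so at least 99.9% of them stay empty.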
Diffusion Geometries (RR Coifman & S. Lafon). Geodesic distance ---> diffusion distance. Diffusion distance is more stable: it uses a preponderance of evidence, aggregating all paths between two points rather than only the shortest one.
On the graph of documents with similarities there is a natural random walk: we get a Markov chain represented by a matrix P(x,y). If P is symmetric and positive semidefinite, with eigenvalues $\lambda_j$ and eigenfunctions $\phi_j$, and $p_m(x,y)$ denotes the kernel of $P^m$, we can define the diffusion distance by
$$D_m^2(x,y) = p_m(x,x) + p_m(y,y) - 2\,p_m(x,y) = \| p_{m/2}(x,\cdot) - p_{m/2}(y,\cdot) \|^2 = \sum_j \lambda_j^{m} \left( \phi_j(x) - \phi_j(y) \right)^2 .$$
Geometric diffusion map:
$$x \mapsto X^{(m)}(x) = \{ \lambda_i^{m/2} \phi_i(x) \}_i \in l^2 .$$
This embeds the graph in Euclidean space, up to a prescribed precision, via the eigenfunctions, mapping diffusion distance into Euclidean distance. For a set of points in Euclidean space, sampled from a Riemannian manifold, one can build a discretized Laplace-Beltrami operator (associated to the canonical Brownian motion constrained on the manifold) and map the manifold with diffusion distance isometrically into Euclidean space.
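The claim that the diffusion map turns diffusion distance into Euclidean distance can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a Gaussian kernel with scale `sigma`, and it uses the symmetric normalization D^{-1/2} K D^{-1/2} of the kernel matrix K (with D the diagonal matrix of row sums) as one convenient way to obtain a symmetric positive semidefinite P. Here the kernel of P^m plays the role of p_m, and the map sends x to the vector of lambda_j^(m/2) * phi_j(x). All function names are hypothetical.

```python
import numpy as np

def symmetric_diffusion_operator(points, sigma):
    """Symmetric PSD operator D^{-1/2} K D^{-1/2} from a Gaussian kernel K (assumed)."""
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / sigma ** 2)
    d = K.sum(axis=1)
    return K / np.sqrt(np.outer(d, d))

def diffusion_map(P, m):
    """Row x holds the coordinates lambda_j^(m/2) * phi_j(x)."""
    lam, phi = np.linalg.eigh(P)
    lam = np.clip(lam, 0.0, None)  # discard tiny negative round-off
    return phi * lam ** (m / 2.0)

def diffusion_distance_sq(P, m, x, y):
    """p_m(x,x) + p_m(y,y) - 2 p_m(x,y), with p_m the kernel of P^m."""
    Pm = np.linalg.matrix_power(P, m)
    return Pm[x, x] + Pm[y, y] - 2.0 * Pm[x, y]

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 3))
P = symmetric_diffusion_operator(points, sigma=1.0)
m = 4
X = diffusion_map(P, m)

# Squared Euclidean distance in the embedding vs. the direct definition.
embedded = float(np.sum((X[0] - X[5]) ** 2))
direct = float(diffusion_distance_sq(P, m, 0, 5))
print(embedded, direct)
```

The two printed numbers agree up to floating-point round-off. In practice one would keep only the coordinates whose eigenvalue power exceeds the desired precision, which is what makes the embedding low-dimensional.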
[Figure: original points and their embeddings.]
[Figure panels: Phi1, Phi2, Phi3.]
Diffusion Wavelets (RR Coifman & MM). Eigenfunctions are like global Fourier analysis on the data set: they live in different frequency bands but are not localized. We would like to have elements localized both in frequency and in space (compatibly with Heisenberg principles), and critically sampled at the rate corresponding to the frequency band. Where are the frequencies? [Figure: plot with curves labeled (T^2), (T^4), (T^8), (T^16) and subspaces labeled V with indices 3, 2, 1.]
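One way to see where the frequencies are is to count how many eigenvalues of the dyadic powers T^2, T^4, T^8, T^16 remain above a precision threshold. The sketch below does only that; it is not the diffusion wavelet construction itself. The operator T (a symmetrically normalized Gaussian kernel on points sampled from a circle), the scale `sigma`, the threshold `eps`, and the function names are all assumptions made for illustration.

```python
import numpy as np

def symmetric_diffusion_operator(points, sigma):
    """Symmetric PSD operator D^{-1/2} K D^{-1/2} from a Gaussian kernel K (assumed)."""
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / sigma ** 2)
    d = K.sum(axis=1)
    return K / np.sqrt(np.outer(d, d))

def numerical_ranks(T, eps, num_levels):
    """Number of eigenvalues of T^(2^j) above eps, for j = 1..num_levels."""
    lam = np.clip(np.linalg.eigvalsh(T), 0.0, None)
    # The eigenvalues of T^(2^j) are the eigenvalues of T raised to 2^j.
    return [int(np.sum(lam ** (2 ** j) > eps)) for j in range(1, num_levels + 1)]

# 200 points sampled on the unit circle.
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
points = np.stack([np.cos(t), np.sin(t)], axis=1)
T = symmetric_diffusion_operator(points, sigma=0.3)

ranks = numerical_ranks(T, eps=1e-6, num_levels=4)
print(ranks)  # numerical ranks of T^2, T^4, T^8, T^16
```

The ranks cannot increase from one level to the next, because the eigenvalues lie in [0, 1]: higher powers of the operator act on progressively smaller subspaces, which is what permits critical sampling at each scale.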
Multiresolution diffusion wavelet construction of orthonormal diffusion scaling functions.
All this can be done in O(n log(n)) operations, where n is the cardinality of the space!
Fast multipole method for generalized potentials
[Figures: plot panels; one panel is labeled by a triple of functions of x with index 16 (components 16, (16,2), (16,3)), whose symbols were lost in extraction.]
Comments, Applications, etc. This is wavelet analysis on manifolds (and more, e.g. fractals), graphs, and Markov chains, while Laplacian eigenfunctions do Fourier analysis on manifolds (and fractals, etc.). We are compressing powers of the operator, functions of the operator, subspaces of the function spaces on which its powers act (Heisenberg principle...), and the space itself (sampling theorems, quadrature formulas...). We are constructing a biorthogonal version of the transform (better adapted to studying Markov chains) and wavelet packets: this will allow efficient denoising, compression, and discrimination on all the spaces mentioned above. The multiscale spaces are a natural scale of complexity spaces for learning empirical functions on the data set. Diffusion wavelets extend outside the set, in a natural multiscale fashion. This is to be tied with the measure-geometric considerations used to embed metric spaces in Euclidean spaces with small distortion. Further application: study and compression of dynamical systems.